Skip to content

Implement DWG converter endpoint#1

Open
willy1989cv wants to merge 5 commits into
mainfrom
feat/dwg-converter
Open

Implement DWG converter endpoint#1
willy1989cv wants to merge 5 commits into
mainfrom
feat/dwg-converter

Conversation

@willy1989cv
Copy link
Copy Markdown
Member

@willy1989cv willy1989cv commented May 31, 2026

Summary by CodeRabbit

  • New Features

    • Added DWG file preview capability with PDF export functionality
    • New API endpoint for converting DWG resources to PDF previews
  • Documentation

    • Expanded README with detailed configuration options, Docker setup instructions, supported formats, and installation guidelines
  • Chores

    • Updated project dependencies and ignored build artifacts

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 31, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds a DWG-to-PDF preview feature to the Malmö CKAN extension. It introduces a complete conversion pipeline that validates DWG resources, stages them locally, converts them to DXF using an external ODA executable, loads and selects optimal layouts, and renders previews to PDF using ezdxf and PyMuPDF. The feature is exposed as a CKAN action and HTTP endpoint with configurable timeouts, size limits, and rendering parameters.

Changes

DWG Preview Conversion

Layer / File(s) Summary
Dependencies and configuration
.gitignore, requirements.txt, setup.py
Python cache patterns added to .gitignore. New runtime dependencies (requests, ezdxf, PyMuPDF) added to both requirements.txt and setup.py install_requires.
DWG preview builder implementation
ckanext/malmo/dwg_preview.py
Core 682-line module implements the full preview pipeline: validates DWG resources, stages uploaded files via uploader backend (with optional remote signed-URL fallback) or downloads from HTTP(S) URLs with size enforcement, converts DWG→DXF via ODA executable (with optional xvfb-run wrapping), loads DXF with ezdxf and recovery retry, selects optimal layouts with entity-count complexity guard, and renders to PDF using ezdxf rendering + PyMuPDF backend. Includes helpers for subprocess execution, filename sanitization, and robust configuration parsing.
CKAN action wrapper
ckanext/malmo/logic/action.py
Exposes dwg_preview_convert as a side-effect-free CKAN action that delegates to the preview builder and returns the preview payload dictionary.
Plugin blueprint and action registration
ckanext/malmo/plugin.py
Imports new modules, declares IBlueprint interface, registers dwg_preview_convert action in get_actions(), and adds get_blueprint() method returning blueprint from malmo_views.
HTTP endpoint and request handling
ckanext/malmo/views.py
New Flask blueprint with /api/3/action/dwg_preview_convert endpoint (GET/POST): builds CKAN action context from current user, calls dwg_preview_convert action, returns preview bytes with Content-Disposition and cache-control headers, maps ValidationError to HTTP 409 with JSON error payload, and maps unexpected exceptions to HTTP 500.
Feature documentation
README.md
Expanded with DWG preview requirements (ODA and rendering dependencies on PATH), configuration sections (timeouts, download limits, ODA executable/version, page margins, xvfb settings), Docker setup for ODA AppImage override via ckan/vendor/oda/, and supported asset formats.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A rabbit's ode to previews now in sight,
DWG files transform to PDF bright!
ODA dances, ezdxf renders true,
From upload or download, the pipeline flew.
Configuration tweaks and error guards along,
This feature builds a conversion song! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 8.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Implement DWG converter endpoint' clearly summarizes the main change—adding a new DWG preview conversion endpoint. It accurately reflects the core functionality introduced across multiple files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/dwg-converter

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@willy1989cv willy1989cv requested a review from mpolidori May 31, 2026 20:41
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (1)
ckanext/malmo/views.py (1)

49-50: ⚡ Quick win

Escape the preview filename before putting it in Content-Disposition.

payload["filename"] comes from resource metadata, so quotes, CR/LF, or non-ASCII characters can generate a broken header here. Normalizing the fallback filename and adding an RFC 5987 filename* value avoids malformed responses for perfectly valid resource names.

Suggested fix
+from urllib.parse import quote
+
@@
 def dwg_preview_convert() -> flask.Response:
@@
     response = flask.Response(payload["content"], mimetype=payload["mimetype"])
-    response.headers["Content-Disposition"] = f'inline; filename="{payload["filename"]}"'
+    response.headers["Content-Disposition"] = _build_content_disposition(payload["filename"])
     response.headers["Cache-Control"] = "private, no-store, max-age=0"
     return response
@@
 def _validation_error_response(error: ValidationError) -> tuple[flask.Response, int]:
@@
     )
+
+
+def _build_content_disposition(filename: str) -> str:
+    fallback = (
+        filename.replace("\\", "_")
+        .replace('"', "_")
+        .replace("\r", "_")
+        .replace("\n", "_")
+    )
+    return f'inline; filename="{fallback}"; filename*=UTF-8\'\'{quote(filename)}'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ckanext/malmo/views.py` around lines 49 - 50, The Content-Disposition header
uses raw payload["filename"] which may contain quotes, CR/LF or non-ASCII chars;
sanitize and normalize it before use: strip/control-character and quote chars
from payload["filename"], produce an ASCII fallback via
unicodedata.normalize('NFKD') and encoding errors='ignore' (or a safe fallback
like "preview"), percent-encode the original UTF-8 name per RFC 5987 for
filename* (e.g. "UTF-8''%..." form), and set
response.headers["Content-Disposition"] to include both a safe quoted filename
fallback and the RFC5987 filename* alternative (while keeping the existing
response creation code that uses payload["content"] and payload["mimetype"]).
Ensure these changes are applied where response is created and header is set
(the response variable and Content-Disposition assignment).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ckanext/malmo/dwg_preview.py`:
- Around line 138-148: resource["url"] is passed directly into _download_to_path
which allows SSRF via redirects or hostnames resolving to
loopback/RFC1918/metadata addresses; add server-side destination validation
before calling _download_to_path and enforce it for every redirect hop by either
(a) resolving the URL hostname(s) with socket.getaddrinfo/ipaddress and
rejecting addresses in loopback/private/link-local/multicast/reserved ranges, or
(b) disabling automatic redirects and validating the Location header host for
each redirect then resolving and checking its IPs before following; implement
the check as a reusable helper (e.g., validate_download_target(url) used where
resource_url is read) and update _download_to_path or its requests usage to
ensure per-redirect validation is applied.
- Line 141: The current log call logs raw external URLs (log.info("Downloading
external DWG resource=%s url=%s", resource.get("id"), resource_url)) which can
leak presigned query strings or credentials; change it to avoid logging the full
resource_url by either logging only resource.get("id") or a redacted URL (strip
query string and any userinfo/credentials and optionally keep only host/path)
before passing to log.info; implement a small helper (e.g.,
redact_url(resource_url) or redacted_resource_url variable) that uses
urllib.parse to remove .query and .username/.password and replace the existing
log.info call to use the redacted value or just the resource id.

In `@README.md`:
- Line 3: Replace the garbled "Malm?" string in the README text ("Customizations
for the City of Malm? CKAN instance.") with the correct "Malmö" and ensure the
README.md is saved with UTF-8 encoding (or re-encode the file to UTF-8) so the
"ö" character is preserved; update the literal in README.md and verify the
repo/file encoding settings to prevent future character corruption.
- Around line 43-47: Update the README entries for
ckanext.malmo.dwg_preview_download_timeout and
ckanext.malmo.dwg_preview_max_download_bytes to include their default values
from the implementation: DEFAULT_DOWNLOAD_TIMEOUT_SECONDS (30 seconds) and
DEFAULT_MAX_DOWNLOAD_BYTES (100 * 1024 * 1024 bytes). Locate the two bullet
items describing these config keys and append the default values and units
(e.g., "default: 30s" and "default: 100 MB") so the docs directly reflect the
constants DEFAULT_DOWNLOAD_TIMEOUT_SECONDS and DEFAULT_MAX_DOWNLOAD_BYTES.

In `@requirements.txt`:
- Around line 1-4: Update requirements.txt to pin package versions for security
and reproducibility: change the requests entry to "requests>=2.33.0,<3" and
PyMuPDF to "PyMuPDF>=1.26.7,<2", and also pin html2text and ezdxf to specific
stable ranges (for example a recent compatible release like
"html2text>=2020.1.16,<2024" and "ezdxf>=0.18.3,<1") so installs are
deterministic and avoid known vulnerabilities; ensure you replace the unpinned
package lines (html2text, requests, ezdxf, PyMuPDF) with these pinned specifiers
in requirements.txt.

In `@setup.py`:
- Around line 18-21: The install_requires list in setup.py currently lists
unpinned packages ('html2text', 'requests', 'ezdxf', 'PyMuPDF'); update the
install_requires entry to pin each dependency to a safe, reproducible version
range (e.g., using exact or range pins like >= and <) for 'html2text',
'requests', 'ezdxf', and 'PyMuPDF' to prevent pulling insecure or breaking
releases—choose vetted versions consistent with your CKAN compatibility and
mirror the locks used in requirements.txt or the verification script.

---

Nitpick comments:
In `@ckanext/malmo/views.py`:
- Around line 49-50: The Content-Disposition header uses raw payload["filename"]
which may contain quotes, CR/LF or non-ASCII chars; sanitize and normalize it
before use: strip/control-character and quote chars from payload["filename"],
produce an ASCII fallback via unicodedata.normalize('NFKD') and encoding
errors='ignore' (or a safe fallback like "preview"), percent-encode the original
UTF-8 name per RFC 5987 for filename* (e.g. "UTF-8''%..." form), and set
response.headers["Content-Disposition"] to include both a safe quoted filename
fallback and the RFC5987 filename* alternative (while keeping the existing
response creation code that uses payload["content"] and payload["mimetype"]).
Ensure these changes are applied where response is created and header is set
(the response variable and Content-Disposition assignment).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d8be10b1-157c-4052-905c-b0f2b6288c40

📥 Commits

Reviewing files that changed from the base of the PR and between f4c6695 and 94fdcf2.

📒 Files selected for processing (8)
  • .gitignore
  • README.md
  • ckanext/malmo/dwg_preview.py
  • ckanext/malmo/logic/action.py
  • ckanext/malmo/plugin.py
  • ckanext/malmo/views.py
  • requirements.txt
  • setup.py

Comment on lines +138 to +148
resource_url = str(resource.get("url") or "").strip()
if not resource_url:
raise ValidationError({"resource_id": ["Resource does not have a downloadable URL"]})
log.info("Downloading external DWG resource=%s url=%s", resource.get("id"), resource_url)
_download_to_path(
resource_url,
source_path,
max_download_bytes=max_download_bytes,
download_timeout=download_timeout,
source_label="external DWG resource",
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Block SSRF targets before fetching resource URLs.

resource["url"] flows straight into requests.get() and redirects are followed automatically. Any user who can create or edit a DWG resource can point it at localhost, RFC1918 space, or cloud metadata endpoints and make CKAN fetch them server-side. Validate the resolved destination before the first request and on every redirect hop, and reject loopback/private/link-local/reserved addresses.

Suggested hardening
+import ipaddress
+import socket
+from urllib.parse import urljoin, urlparse
+
+def _validate_remote_url(url: str, source_label: str) -> str:
+    parsed = urlparse(url)
+    if parsed.scheme.lower() not in {"http", "https"}:
+        raise ValidationError({"resource_id": [f"Unsupported URL scheme for {source_label}"]})
+    if not parsed.hostname:
+        raise ValidationError({"resource_id": [f"Invalid URL for {source_label}"]})
+
+    port = parsed.port or (443 if parsed.scheme.lower() == "https" else 80)
+    for _family, _type, _proto, _canonname, sockaddr in socket.getaddrinfo(parsed.hostname, port, type=socket.SOCK_STREAM):
+        ip = ipaddress.ip_address(sockaddr[0])
+        if (
+            ip.is_private
+            or ip.is_loopback
+            or ip.is_link_local
+            or ip.is_reserved
+            or ip.is_multicast
+            or ip.is_unspecified
+        ):
+            raise ValidationError({"resource_id": [f"Refusing to download {source_label} from a private address"]})
+    return url
+
 def _download_to_path(
     url: str,
     destination_path: str,
@@
-) -> None:
-    parsed_url = urlparse(url)
-    if parsed_url.scheme.lower() not in {"http", "https"}:
-        raise ValidationError({"resource_id": [f"Unsupported URL scheme for {source_label}"]})
+) -> None:
+    current_url = _validate_remote_url(url, source_label)

     bytes_downloaded = 0
     try:
-        with requests.get(
-            url,
-            stream=True,
-            timeout=(10, download_timeout),
-            allow_redirects=True,
-        ) as response:
-            response.raise_for_status()
-            with open(destination_path, "wb") as destination_file:
-                for chunk in response.iter_content(chunk_size=DOWNLOAD_CHUNK_SIZE):
-                    if not chunk:
-                        continue
-                    bytes_downloaded += len(chunk)
-                    if bytes_downloaded > max_download_bytes:
-                        raise ValidationError(
-                            {
-                                "resource_id": [
-                                    f"{source_label.capitalize()} exceeds the maximum allowed size of {max_download_bytes} bytes"
-                                ]
-                            }
-                        )
-                    destination_file.write(chunk)
+        for _ in range(5):
+            with requests.get(
+                current_url,
+                stream=True,
+                timeout=(10, download_timeout),
+                allow_redirects=False,
+            ) as response:
+                if 300 <= response.status_code < 400 and "Location" in response.headers:
+                    current_url = _validate_remote_url(urljoin(current_url, response.headers["Location"]), source_label)
+                    continue
+                response.raise_for_status()
+                with open(destination_path, "wb") as destination_file:
+                    for chunk in response.iter_content(chunk_size=DOWNLOAD_CHUNK_SIZE):
+                        if not chunk:
+                            continue
+                        bytes_downloaded += len(chunk)
+                        if bytes_downloaded > max_download_bytes:
+                            raise ValidationError(
+                                {
+                                    "resource_id": [
+                                        f"{source_label.capitalize()} exceeds the maximum allowed size of {max_download_bytes} bytes"
+                                    ]
+                                }
+                            )
+                        destination_file.write(chunk)
+                break
+        else:
+            raise ValidationError({"resource_id": [f"Too many redirects for {source_label}"]})

Also applies to: 267-278

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ckanext/malmo/dwg_preview.py` around lines 138 - 148, resource["url"] is
passed directly into _download_to_path which allows SSRF via redirects or
hostnames resolving to loopback/RFC1918/metadata addresses; add server-side
destination validation before calling _download_to_path and enforce it for every
redirect hop by either (a) resolving the URL hostname(s) with
socket.getaddrinfo/ipaddress and rejecting addresses in
loopback/private/link-local/multicast/reserved ranges, or (b) disabling
automatic redirects and validating the Location header host for each redirect
then resolving and checking its IPs before following; implement the check as a
reusable helper (e.g., validate_download_target(url) used where resource_url is
read) and update _download_to_path or its requests usage to ensure per-redirect
validation is applied.

resource_url = str(resource.get("url") or "").strip()
if not resource_url:
raise ValidationError({"resource_id": ["Resource does not have a downloadable URL"]})
log.info("Downloading external DWG resource=%s url=%s", resource.get("id"), resource_url)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't write raw external URLs to application logs.

Logging the full resource_url will persist signed query strings or embedded credentials if the resource points at a pre-signed object store URL. Log a redacted host/path or just the resource id instead.

Suggested fix
-        log.info("Downloading external DWG resource=%s url=%s", resource.get("id"), resource_url)
+        parsed_url = urlparse(resource_url)
+        log.info(
+            "Downloading external DWG resource=%s host=%s path=%s",
+            resource.get("id"),
+            parsed_url.netloc,
+            parsed_url.path,
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ckanext/malmo/dwg_preview.py` at line 141, The current log call logs raw
external URLs (log.info("Downloading external DWG resource=%s url=%s",
resource.get("id"), resource_url)) which can leak presigned query strings or
credentials; change it to avoid logging the full resource_url by either logging
only resource.get("id") or a redacted URL (strip query string and any
userinfo/credentials and optionally keep only host/path) before passing to
log.info; implement a small helper (e.g., redact_url(resource_url) or
redacted_resource_url variable) that uses urllib.parse to remove .query and
.username/.password and replace the existing log.info call to use the redacted
value or just the resource id.

Comment thread README.md
# ckanext-malmo

Customizations for the City of Malmö CKAN instance.
Customizations for the City of Malm? CKAN instance.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix character encoding issue.

The text displays "Malm?" instead of "Malmö". This appears to be a character encoding issue.

📝 Proposed fix
-Customizations for the City of Malm? CKAN instance.
+Customizations for the City of Malmö CKAN instance.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Customizations for the City of Malm? CKAN instance.
Customizations for the City of Malmö CKAN instance.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` at line 3, Replace the garbled "Malm?" string in the README text
("Customizations for the City of Malm? CKAN instance.") with the correct "Malmö"
and ensure the README.md is saved with UTF-8 encoding (or re-encode the file to
UTF-8) so the "ö" character is preserved; update the literal in README.md and
verify the repo/file encoding settings to prevent future character corruption.

Comment thread README.md
Comment on lines +43 to +47
- `ckanext.malmo.dwg_preview_download_timeout`
Download timeout in seconds for remote DWG resources.

- `ckanext.malmo.dwg_preview_max_download_bytes`
Maximum DWG download size in bytes.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Document default values for download configuration.

The defaults for dwg_preview_download_timeout and dwg_preview_max_download_bytes are missing from the documentation. According to the implementation, DEFAULT_DOWNLOAD_TIMEOUT_SECONDS = 30 and DEFAULT_MAX_DOWNLOAD_BYTES = 100 * 1024 * 1024.

📝 Proposed fix
 - `ckanext.malmo.dwg_preview_download_timeout`
-  Download timeout in seconds for remote DWG resources.
+  Download timeout in seconds for remote DWG resources. Default: `30`.
 
 - `ckanext.malmo.dwg_preview_max_download_bytes`
-  Maximum DWG download size in bytes.
+  Maximum DWG download size in bytes. Default: `104857600` (100 MB).
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- `ckanext.malmo.dwg_preview_download_timeout`
Download timeout in seconds for remote DWG resources.
- `ckanext.malmo.dwg_preview_max_download_bytes`
Maximum DWG download size in bytes.
- `ckanext.malmo.dwg_preview_download_timeout`
Download timeout in seconds for remote DWG resources. Default: `30`.
- `ckanext.malmo.dwg_preview_max_download_bytes`
Maximum DWG download size in bytes. Default: `104857600` (100 MB).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 43 - 47, Update the README entries for
ckanext.malmo.dwg_preview_download_timeout and
ckanext.malmo.dwg_preview_max_download_bytes to include their default values
from the implementation: DEFAULT_DOWNLOAD_TIMEOUT_SECONDS (30 seconds) and
DEFAULT_MAX_DOWNLOAD_BYTES (100 * 1024 * 1024 bytes). Locate the two bullet
items describing these config keys and append the default values and units
(e.g., "default: 30s" and "default: 100 MB") so the docs directly reflect the
constants DEFAULT_DOWNLOAD_TIMEOUT_SECONDS and DEFAULT_MAX_DOWNLOAD_BYTES.

Comment thread requirements.txt
Comment on lines +1 to +4
html2text
requests
ezdxf
PyMuPDF
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check latest versions from PyPI and scan for vulnerabilities

echo "=== Checking latest PyPI versions ==="
for pkg in html2text requests ezdxf PyMuPDF; do
  echo -n "$pkg: "
  curl -s "https://pypi.org/pypi/$pkg/json" | jq -r '.info.version'
done

echo -e "\n=== Scanning for security advisories ==="
for pkg in html2text requests ezdxf PyMuPDF; do
  echo "--- $pkg ---"
  gh api graphql -f query="
  {
    securityVulnerabilities(first: 5, ecosystem: PIP, package: \"$pkg\") {
      nodes {
        advisory {
          summary
          severity
          publishedAt
        }
        vulnerableVersionRange
        firstPatchedVersion {
          identifier
        }
      }
    }
  }" | jq '.data.securityVulnerabilities.nodes[] | select(.vulnerableVersionRange != null)'
done

Repository: datopian/ckanext-malmo

Length of output: 1955


Pin dependency versions in requirements.txt (security + reproducibility).

requirements.txt lines 1-4 are unpinned, so installs can pull in vulnerable/non-reproducible versions.

  • requests: security advisories are fixed starting at 2.33.0 → use requests>=2.33.0,<3
  • PyMuPDF: vulnerability fixed starting at 1.26.7 (range >=1.26.5,<1.26.7) → use PyMuPDF>=1.26.7,<2
  • Also pin html2text and ezdxf (no advisories returned by the scan, but unpinned versions still break reproducibility)
🧰 Tools
🪛 OSV Scanner (2.3.8)

[HIGH] 2-2: requests 2.9.2: undefined

(PYSEC-2018-28)


[HIGH] 2-2: requests 2.9.2: undefined

(PYSEC-2023-74)


[HIGH] 2-2: requests 2.9.2: Requests vulnerable to .netrc credentials leak via malicious URLs

(GHSA-9hjg-9r4m-mvj7)


[HIGH] 2-2: requests 2.9.2: Requests Session object does not verify requests after making first request with verify=False

(GHSA-9wx4-h78v-vm56)


[HIGH] 2-2: requests 2.9.2: Requests has Insecure Temp File Reuse in its extract_zipped_paths() utility function

(GHSA-gc5v-m9x4-r6x2)


[HIGH] 2-2: requests 2.9.2: Unintended leak of Proxy-Authorization header in requests

(GHSA-j8r2-6x86-q33q)


[HIGH] 2-2: requests 2.9.2: Insufficiently Protected Credentials in Requests

(GHSA-x84v-xcm2-53pg)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@requirements.txt` around lines 1 - 4, Update requirements.txt to pin package
versions for security and reproducibility: change the requests entry to
"requests>=2.33.0,<3" and PyMuPDF to "PyMuPDF>=1.26.7,<2", and also pin
html2text and ezdxf to specific stable ranges (for example a recent compatible
release like "html2text>=2020.1.16,<2024" and "ezdxf>=0.18.3,<1") so installs
are deterministic and avoid known vulnerabilities; ensure you replace the
unpinned package lines (html2text, requests, ezdxf, PyMuPDF) with these pinned
specifiers in requirements.txt.

Comment thread setup.py
Comment on lines +18 to +21
'html2text',
'requests',
'ezdxf',
'PyMuPDF',
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pin dependency versions in install_requires to ensure secure and reproducible installs.

The install_requires list lacks version constraints for all four dependencies, creating the same security and reproducibility risks flagged in requirements.txt:

  • Security: Unpinned requests can install versions with known CVEs (credentials leaks, session handling flaws).
  • Reproducibility: Different environments may resolve different versions.
  • Stability: Breaking changes in dependencies can silently break the extension.

Since setup.py is the authoritative source for package installation, version pins here are critical.

🔒 Recommended: Add version constraints
     install_requires=[
-        'html2text',
-        'requests',
-        'ezdxf',
-        'PyMuPDF',
+        'html2text>=2023.10.0,<2024',
+        'requests>=2.32.0,<3',
+        'ezdxf>=1.1.0,<2',
+        'PyMuPDF>=1.23.0,<2',
     ],

These are example constraints based on typical versions; verify the latest secure releases and compatibility with your CKAN environment before applying.

Note: The verification script in the requirements.txt review will help identify the latest secure versions for all four dependencies.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
'html2text',
'requests',
'ezdxf',
'PyMuPDF',
'html2text>=2023.10.0,<2024',
'requests>=2.32.0,<3',
'ezdxf>=1.1.0,<2',
'PyMuPDF>=1.23.0,<2',
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@setup.py` around lines 18 - 21, The install_requires list in setup.py
currently lists unpinned packages ('html2text', 'requests', 'ezdxf', 'PyMuPDF');
update the install_requires entry to pin each dependency to a safe, reproducible
version range (e.g., using exact or range pins like >= and <) for 'html2text',
'requests', 'ezdxf', and 'PyMuPDF' to prevent pulling insecure or breaking
releases—choose vetted versions consistent with your CKAN compatibility and
mirror the locks used in requirements.txt or the verification script.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants