Implement DWG converter endpoint#1
Conversation
📝 WalkthroughWalkthroughThis PR adds a DWG-to-PDF preview feature to the Malmö CKAN extension. It introduces a complete conversion pipeline that validates DWG resources, stages them locally, converts them to DXF using an external ODA executable, loads and selects optimal layouts, and renders previews to PDF using ezdxf and PyMuPDF. The feature is exposed as a CKAN action and HTTP endpoint with configurable timeouts, size limits, and rendering parameters. ChangesDWG Preview Conversion
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (1)
ckanext/malmo/views.py (1)
49-50: ⚡ Quick winEscape the preview filename before putting it in
Content-Disposition.
payload["filename"]comes from resource metadata, so quotes, CR/LF, or non-ASCII characters can generate a broken header here. Normalizing the fallback filename and adding an RFC 5987filename*value avoids malformed responses for perfectly valid resource names.Suggested fix
+from urllib.parse import quote + @@ def dwg_preview_convert() -> flask.Response: @@ response = flask.Response(payload["content"], mimetype=payload["mimetype"]) - response.headers["Content-Disposition"] = f'inline; filename="{payload["filename"]}"' + response.headers["Content-Disposition"] = _build_content_disposition(payload["filename"]) response.headers["Cache-Control"] = "private, no-store, max-age=0" return response @@ def _validation_error_response(error: ValidationError) -> tuple[flask.Response, int]: @@ ) + + +def _build_content_disposition(filename: str) -> str: + fallback = ( + filename.replace("\\", "_") + .replace('"', "_") + .replace("\r", "_") + .replace("\n", "_") + ) + return f'inline; filename="{fallback}"; filename*=UTF-8\'\'{quote(filename)}'🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ckanext/malmo/views.py` around lines 49 - 50, The Content-Disposition header uses raw payload["filename"] which may contain quotes, CR/LF or non-ASCII chars; sanitize and normalize it before use: strip/control-character and quote chars from payload["filename"], produce an ASCII fallback via unicodedata.normalize('NFKD') and encoding errors='ignore' (or a safe fallback like "preview"), percent-encode the original UTF-8 name per RFC 5987 for filename* (e.g. "UTF-8''%..." form), and set response.headers["Content-Disposition"] to include both a safe quoted filename fallback and the RFC5987 filename* alternative (while keeping the existing response creation code that uses payload["content"] and payload["mimetype"]). Ensure these changes are applied where response is created and header is set (the response variable and Content-Disposition assignment).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ckanext/malmo/dwg_preview.py`:
- Around line 138-148: resource["url"] is passed directly into _download_to_path
which allows SSRF via redirects or hostnames resolving to
loopback/RFC1918/metadata addresses; add server-side destination validation
before calling _download_to_path and enforce it for every redirect hop by either
(a) resolving the URL hostname(s) with socket.getaddrinfo/ipaddress and
rejecting addresses in loopback/private/link-local/multicast/reserved ranges, or
(b) disabling automatic redirects and validating the Location header host for
each redirect then resolving and checking its IPs before following; implement
the check as a reusable helper (e.g., validate_download_target(url) used where
resource_url is read) and update _download_to_path or its requests usage to
ensure per-redirect validation is applied.
- Line 141: The current log call logs raw external URLs (log.info("Downloading
external DWG resource=%s url=%s", resource.get("id"), resource_url)) which can
leak presigned query strings or credentials; change it to avoid logging the full
resource_url by either logging only resource.get("id") or a redacted URL (strip
query string and any userinfo/credentials and optionally keep only host/path)
before passing to log.info; implement a small helper (e.g.,
redact_url(resource_url) or redacted_resource_url variable) that uses
urllib.parse to remove .query and .username/.password and replace the existing
log.info call to use the redacted value or just the resource id.
In `@README.md`:
- Line 3: Replace the garbled "Malm?" string in the README text ("Customizations
for the City of Malm? CKAN instance.") with the correct "Malmö" and ensure the
README.md is saved with UTF-8 encoding (or re-encode the file to UTF-8) so the
"ö" character is preserved; update the literal in README.md and verify the
repo/file encoding settings to prevent future character corruption.
- Around line 43-47: Update the README entries for
ckanext.malmo.dwg_preview_download_timeout and
ckanext.malmo.dwg_preview_max_download_bytes to include their default values
from the implementation: DEFAULT_DOWNLOAD_TIMEOUT_SECONDS (30 seconds) and
DEFAULT_MAX_DOWNLOAD_BYTES (100 * 1024 * 1024 bytes). Locate the two bullet
items describing these config keys and append the default values and units
(e.g., "default: 30s" and "default: 100 MB") so the docs directly reflect the
constants DEFAULT_DOWNLOAD_TIMEOUT_SECONDS and DEFAULT_MAX_DOWNLOAD_BYTES.
In `@requirements.txt`:
- Around line 1-4: Update requirements.txt to pin package versions for security
and reproducibility: change the requests entry to "requests>=2.33.0,<3" and
PyMuPDF to "PyMuPDF>=1.26.7,<2", and also pin html2text and ezdxf to specific
stable ranges (for example a recent compatible release like
"html2text>=2020.1.16,<2024" and "ezdxf>=0.18.3,<1") so installs are
deterministic and avoid known vulnerabilities; ensure you replace the unpinned
package lines (html2text, requests, ezdxf, PyMuPDF) with these pinned specifiers
in requirements.txt.
In `@setup.py`:
- Around line 18-21: The install_requires list in setup.py currently lists
unpinned packages ('html2text', 'requests', 'ezdxf', 'PyMuPDF'); update the
install_requires entry to pin each dependency to a safe, reproducible version
range (e.g., using exact or range pins like >= and <) for 'html2text',
'requests', 'ezdxf', and 'PyMuPDF' to prevent pulling insecure or breaking
releases—choose vetted versions consistent with your CKAN compatibility and
mirror the locks used in requirements.txt or the verification script.
---
Nitpick comments:
In `@ckanext/malmo/views.py`:
- Around line 49-50: The Content-Disposition header uses raw payload["filename"]
which may contain quotes, CR/LF or non-ASCII chars; sanitize and normalize it
before use: strip/control-character and quote chars from payload["filename"],
produce an ASCII fallback via unicodedata.normalize('NFKD') and encoding
errors='ignore' (or a safe fallback like "preview"), percent-encode the original
UTF-8 name per RFC 5987 for filename* (e.g. "UTF-8''%..." form), and set
response.headers["Content-Disposition"] to include both a safe quoted filename
fallback and the RFC5987 filename* alternative (while keeping the existing
response creation code that uses payload["content"] and payload["mimetype"]).
Ensure these changes are applied where response is created and header is set
(the response variable and Content-Disposition assignment).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d8be10b1-157c-4052-905c-b0f2b6288c40
📒 Files selected for processing (8)
.gitignoreREADME.mdckanext/malmo/dwg_preview.pyckanext/malmo/logic/action.pyckanext/malmo/plugin.pyckanext/malmo/views.pyrequirements.txtsetup.py
| resource_url = str(resource.get("url") or "").strip() | ||
| if not resource_url: | ||
| raise ValidationError({"resource_id": ["Resource does not have a downloadable URL"]}) | ||
| log.info("Downloading external DWG resource=%s url=%s", resource.get("id"), resource_url) | ||
| _download_to_path( | ||
| resource_url, | ||
| source_path, | ||
| max_download_bytes=max_download_bytes, | ||
| download_timeout=download_timeout, | ||
| source_label="external DWG resource", | ||
| ) |
There was a problem hiding this comment.
Block SSRF targets before fetching resource URLs.
resource["url"] flows straight into requests.get() and redirects are followed automatically. Any user who can create or edit a DWG resource can point it at localhost, RFC1918 space, or cloud metadata endpoints and make CKAN fetch them server-side. Validate the resolved destination before the first request and on every redirect hop, and reject loopback/private/link-local/reserved addresses.
Suggested hardening
+import ipaddress
+import socket
+from urllib.parse import urljoin, urlparse
+
+def _validate_remote_url(url: str, source_label: str) -> str:
+ parsed = urlparse(url)
+ if parsed.scheme.lower() not in {"http", "https"}:
+ raise ValidationError({"resource_id": [f"Unsupported URL scheme for {source_label}"]})
+ if not parsed.hostname:
+ raise ValidationError({"resource_id": [f"Invalid URL for {source_label}"]})
+
+ port = parsed.port or (443 if parsed.scheme.lower() == "https" else 80)
+ for _family, _type, _proto, _canonname, sockaddr in socket.getaddrinfo(parsed.hostname, port, type=socket.SOCK_STREAM):
+ ip = ipaddress.ip_address(sockaddr[0])
+ if (
+ ip.is_private
+ or ip.is_loopback
+ or ip.is_link_local
+ or ip.is_reserved
+ or ip.is_multicast
+ or ip.is_unspecified
+ ):
+ raise ValidationError({"resource_id": [f"Refusing to download {source_label} from a private address"]})
+ return url
+
def _download_to_path(
url: str,
destination_path: str,
@@
-) -> None:
- parsed_url = urlparse(url)
- if parsed_url.scheme.lower() not in {"http", "https"}:
- raise ValidationError({"resource_id": [f"Unsupported URL scheme for {source_label}"]})
+) -> None:
+ current_url = _validate_remote_url(url, source_label)
bytes_downloaded = 0
try:
- with requests.get(
- url,
- stream=True,
- timeout=(10, download_timeout),
- allow_redirects=True,
- ) as response:
- response.raise_for_status()
- with open(destination_path, "wb") as destination_file:
- for chunk in response.iter_content(chunk_size=DOWNLOAD_CHUNK_SIZE):
- if not chunk:
- continue
- bytes_downloaded += len(chunk)
- if bytes_downloaded > max_download_bytes:
- raise ValidationError(
- {
- "resource_id": [
- f"{source_label.capitalize()} exceeds the maximum allowed size of {max_download_bytes} bytes"
- ]
- }
- )
- destination_file.write(chunk)
+ for _ in range(5):
+ with requests.get(
+ current_url,
+ stream=True,
+ timeout=(10, download_timeout),
+ allow_redirects=False,
+ ) as response:
+ if 300 <= response.status_code < 400 and "Location" in response.headers:
+ current_url = _validate_remote_url(urljoin(current_url, response.headers["Location"]), source_label)
+ continue
+ response.raise_for_status()
+ with open(destination_path, "wb") as destination_file:
+ for chunk in response.iter_content(chunk_size=DOWNLOAD_CHUNK_SIZE):
+ if not chunk:
+ continue
+ bytes_downloaded += len(chunk)
+ if bytes_downloaded > max_download_bytes:
+ raise ValidationError(
+ {
+ "resource_id": [
+ f"{source_label.capitalize()} exceeds the maximum allowed size of {max_download_bytes} bytes"
+ ]
+ }
+ )
+ destination_file.write(chunk)
+ break
+ else:
+ raise ValidationError({"resource_id": [f"Too many redirects for {source_label}"]})Also applies to: 267-278
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ckanext/malmo/dwg_preview.py` around lines 138 - 148, resource["url"] is
passed directly into _download_to_path which allows SSRF via redirects or
hostnames resolving to loopback/RFC1918/metadata addresses; add server-side
destination validation before calling _download_to_path and enforce it for every
redirect hop by either (a) resolving the URL hostname(s) with
socket.getaddrinfo/ipaddress and rejecting addresses in
loopback/private/link-local/multicast/reserved ranges, or (b) disabling
automatic redirects and validating the Location header host for each redirect
then resolving and checking its IPs before following; implement the check as a
reusable helper (e.g., validate_download_target(url) used where resource_url is
read) and update _download_to_path or its requests usage to ensure per-redirect
validation is applied.
| resource_url = str(resource.get("url") or "").strip() | ||
| if not resource_url: | ||
| raise ValidationError({"resource_id": ["Resource does not have a downloadable URL"]}) | ||
| log.info("Downloading external DWG resource=%s url=%s", resource.get("id"), resource_url) |
There was a problem hiding this comment.
Don't write raw external URLs to application logs.
Logging the full resource_url will persist signed query strings or embedded credentials if the resource points at a pre-signed object store URL. Log a redacted host/path or just the resource id instead.
Suggested fix
- log.info("Downloading external DWG resource=%s url=%s", resource.get("id"), resource_url)
+ parsed_url = urlparse(resource_url)
+ log.info(
+ "Downloading external DWG resource=%s host=%s path=%s",
+ resource.get("id"),
+ parsed_url.netloc,
+ parsed_url.path,
+ )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ckanext/malmo/dwg_preview.py` at line 141, The current log call logs raw
external URLs (log.info("Downloading external DWG resource=%s url=%s",
resource.get("id"), resource_url)) which can leak presigned query strings or
credentials; change it to avoid logging the full resource_url by either logging
only resource.get("id") or a redacted URL (strip query string and any
userinfo/credentials and optionally keep only host/path) before passing to
log.info; implement a small helper (e.g., redact_url(resource_url) or
redacted_resource_url variable) that uses urllib.parse to remove .query and
.username/.password and replace the existing log.info call to use the redacted
value or just the resource id.
| # ckanext-malmo | ||
|
|
||
| Customizations for the City of Malmö CKAN instance. | ||
| Customizations for the City of Malm? CKAN instance. |
There was a problem hiding this comment.
Fix character encoding issue.
The text displays "Malm?" instead of "Malmö". This appears to be a character encoding issue.
📝 Proposed fix
-Customizations for the City of Malm? CKAN instance.
+Customizations for the City of Malmö CKAN instance.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Customizations for the City of Malm? CKAN instance. | |
| Customizations for the City of Malmö CKAN instance. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@README.md` at line 3, Replace the garbled "Malm?" string in the README text
("Customizations for the City of Malm? CKAN instance.") with the correct "Malmö"
and ensure the README.md is saved with UTF-8 encoding (or re-encode the file to
UTF-8) so the "ö" character is preserved; update the literal in README.md and
verify the repo/file encoding settings to prevent future character corruption.
| - `ckanext.malmo.dwg_preview_download_timeout` | ||
| Download timeout in seconds for remote DWG resources. | ||
|
|
||
| - `ckanext.malmo.dwg_preview_max_download_bytes` | ||
| Maximum DWG download size in bytes. |
There was a problem hiding this comment.
Document default values for download configuration.
The defaults for dwg_preview_download_timeout and dwg_preview_max_download_bytes are missing from the documentation. According to the implementation, DEFAULT_DOWNLOAD_TIMEOUT_SECONDS = 30 and DEFAULT_MAX_DOWNLOAD_BYTES = 100 * 1024 * 1024.
📝 Proposed fix
- `ckanext.malmo.dwg_preview_download_timeout`
- Download timeout in seconds for remote DWG resources.
+ Download timeout in seconds for remote DWG resources. Default: `30`.
- `ckanext.malmo.dwg_preview_max_download_bytes`
- Maximum DWG download size in bytes.
+ Maximum DWG download size in bytes. Default: `104857600` (100 MB).📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - `ckanext.malmo.dwg_preview_download_timeout` | |
| Download timeout in seconds for remote DWG resources. | |
| - `ckanext.malmo.dwg_preview_max_download_bytes` | |
| Maximum DWG download size in bytes. | |
| - `ckanext.malmo.dwg_preview_download_timeout` | |
| Download timeout in seconds for remote DWG resources. Default: `30`. | |
| - `ckanext.malmo.dwg_preview_max_download_bytes` | |
| Maximum DWG download size in bytes. Default: `104857600` (100 MB). |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@README.md` around lines 43 - 47, Update the README entries for
ckanext.malmo.dwg_preview_download_timeout and
ckanext.malmo.dwg_preview_max_download_bytes to include their default values
from the implementation: DEFAULT_DOWNLOAD_TIMEOUT_SECONDS (30 seconds) and
DEFAULT_MAX_DOWNLOAD_BYTES (100 * 1024 * 1024 bytes). Locate the two bullet
items describing these config keys and append the default values and units
(e.g., "default: 30s" and "default: 100 MB") so the docs directly reflect the
constants DEFAULT_DOWNLOAD_TIMEOUT_SECONDS and DEFAULT_MAX_DOWNLOAD_BYTES.
| html2text | ||
| requests | ||
| ezdxf | ||
| PyMuPDF |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check latest versions from PyPI and scan for vulnerabilities
echo "=== Checking latest PyPI versions ==="
for pkg in html2text requests ezdxf PyMuPDF; do
echo -n "$pkg: "
curl -s "https://pypi.org/pypi/$pkg/json" | jq -r '.info.version'
done
echo -e "\n=== Scanning for security advisories ==="
for pkg in html2text requests ezdxf PyMuPDF; do
echo "--- $pkg ---"
gh api graphql -f query="
{
securityVulnerabilities(first: 5, ecosystem: PIP, package: \"$pkg\") {
nodes {
advisory {
summary
severity
publishedAt
}
vulnerableVersionRange
firstPatchedVersion {
identifier
}
}
}
}" | jq '.data.securityVulnerabilities.nodes[] | select(.vulnerableVersionRange != null)'
doneRepository: datopian/ckanext-malmo
Length of output: 1955
Pin dependency versions in requirements.txt (security + reproducibility).
requirements.txt lines 1-4 are unpinned, so installs can pull in vulnerable/non-reproducible versions.
requests: security advisories are fixed starting at 2.33.0 → userequests>=2.33.0,<3PyMuPDF: vulnerability fixed starting at 1.26.7 (range>=1.26.5,<1.26.7) → usePyMuPDF>=1.26.7,<2- Also pin
html2textandezdxf(no advisories returned by the scan, but unpinned versions still break reproducibility)
🧰 Tools
🪛 OSV Scanner (2.3.8)
[HIGH] 2-2: requests 2.9.2: undefined
(PYSEC-2018-28)
[HIGH] 2-2: requests 2.9.2: undefined
(PYSEC-2023-74)
[HIGH] 2-2: requests 2.9.2: Requests vulnerable to .netrc credentials leak via malicious URLs
[HIGH] 2-2: requests 2.9.2: Requests Session object does not verify requests after making first request with verify=False
[HIGH] 2-2: requests 2.9.2: Requests has Insecure Temp File Reuse in its extract_zipped_paths() utility function
[HIGH] 2-2: requests 2.9.2: Unintended leak of Proxy-Authorization header in requests
[HIGH] 2-2: requests 2.9.2: Insufficiently Protected Credentials in Requests
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@requirements.txt` around lines 1 - 4, Update requirements.txt to pin package
versions for security and reproducibility: change the requests entry to
"requests>=2.33.0,<3" and PyMuPDF to "PyMuPDF>=1.26.7,<2", and also pin
html2text and ezdxf to specific stable ranges (for example a recent compatible
release like "html2text>=2020.1.16,<2024" and "ezdxf>=0.18.3,<1") so installs
are deterministic and avoid known vulnerabilities; ensure you replace the
unpinned package lines (html2text, requests, ezdxf, PyMuPDF) with these pinned
specifiers in requirements.txt.
| 'html2text', | ||
| 'requests', | ||
| 'ezdxf', | ||
| 'PyMuPDF', |
There was a problem hiding this comment.
Pin dependency versions in install_requires to ensure secure and reproducible installs.
The install_requires list lacks version constraints for all four dependencies, creating the same security and reproducibility risks flagged in requirements.txt:
- Security: Unpinned
requestscan install versions with known CVEs (credentials leaks, session handling flaws). - Reproducibility: Different environments may resolve different versions.
- Stability: Breaking changes in dependencies can silently break the extension.
Since setup.py is the authoritative source for package installation, version pins here are critical.
🔒 Recommended: Add version constraints
install_requires=[
- 'html2text',
- 'requests',
- 'ezdxf',
- 'PyMuPDF',
+ 'html2text>=2023.10.0,<2024',
+ 'requests>=2.32.0,<3',
+ 'ezdxf>=1.1.0,<2',
+ 'PyMuPDF>=1.23.0,<2',
],These are example constraints based on typical versions; verify the latest secure releases and compatibility with your CKAN environment before applying.
Note: The verification script in the requirements.txt review will help identify the latest secure versions for all four dependencies.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| 'html2text', | |
| 'requests', | |
| 'ezdxf', | |
| 'PyMuPDF', | |
| 'html2text>=2023.10.0,<2024', | |
| 'requests>=2.32.0,<3', | |
| 'ezdxf>=1.1.0,<2', | |
| 'PyMuPDF>=1.23.0,<2', |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@setup.py` around lines 18 - 21, The install_requires list in setup.py
currently lists unpinned packages ('html2text', 'requests', 'ezdxf', 'PyMuPDF');
update the install_requires entry to pin each dependency to a safe, reproducible
version range (e.g., using exact or range pins like >= and <) for 'html2text',
'requests', 'ezdxf', and 'PyMuPDF' to prevent pulling insecure or breaking
releases—choose vetted versions consistent with your CKAN compatibility and
mirror the locks used in requirements.txt or the verification script.
Summary by CodeRabbit
New Features
Documentation
Chores