[DCP - Ingestion] DCP Bridge Mode improvements #519
Code Review
This pull request introduces JsonLdStreamDb to stream triples and observations directly to JSON-LD shards on GCS or local disk, bypassing SQLite. It also adds thread safety to the Nodes and ImportReporter classes and enables parallel ingestion of data files in DCP_BRIDGE mode. The review feedback highlights several critical issues in simple/stats/db.py, including potential AttributeError crashes in _uri_ref when handling non-string values, O(N^2) performance bottlenecks when slicing lists in chunk generators, and type overwriting when a subject has multiple typeOf predicates.
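To illustrate the O(N^2) slicing concern, here is a minimal sketch of a chunk generator that avoids re-slicing the remaining list on every iteration. The name `iter_chunks` is hypothetical and not taken from simple/stats/db.py. It assumes the quadratic cost comes from a pattern like `remaining = remaining[chunk_size:]`, which copies the tail each time.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(items: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most chunk_size items.

    Hypothetical helper: it advances a single iterator, so each item is
    visited once and the total work stays linear in the input size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(items)
    while True:
        # islice pulls the next chunk_size items without copying the rest.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

Because the generator only holds one chunk at a time, it also works on inputs that are themselves generators.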
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
/gemini review
Code Review
This pull request introduces JsonLdStreamDb to stream JSON-LD shards directly to GCS or local disk, optimizing data ingestion in DCP_BRIDGE mode. It also adds parallel file ingestion, thread-safety decorators, and a bug fix for reporting failures. The review feedback highlights three key improvement opportunities: deduplicating non-@type predicate values in the fast node sharding path to ensure parity with rdflib, catching specific json.JSONDecodeError exceptions instead of a broad Exception when parsing custom properties, and removing redundant, risky private API access on the GCS client.
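Two of these suggestions can be sketched briefly. The function names below are hypothetical and do not come from the PR. The sketch assumes custom properties arrive as a JSON string that should decode to an object, and that predicate values are hashable (for example, strings).

```python
import json
import logging
from typing import Any, Dict, Hashable, Iterable, List, Optional


def parse_custom_properties(raw: str) -> Optional[Dict[str, Any]]:
    """Parse a JSON object, returning None if the text is not valid JSON.

    Only json.JSONDecodeError is caught, so unrelated failures (such as a
    TypeError from passing a non-string) still surface to the caller.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as err:
        logging.warning("Skipping malformed custom properties: %s", err)
        return None
    # Assumption: anything other than a JSON object is treated as unusable.
    return parsed if isinstance(parsed, dict) else None


def dedupe_preserving_order(values: Iterable[Hashable]) -> List[Hashable]:
    """Drop repeated values while keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(values))
```

Narrowing the `except` clause keeps genuine bugs visible, and deduplicating values mirrors the set-like behavior of a triple store, where repeating the same triple has no effect.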
Not up to standards ⛔

🔴 Issues

| Category | Results |
|---|---|
| UnusedCode | 2 medium, 1 minor |
| Documentation | 8 minor |
| ErrorProne | 3 high |
| CodeStyle | 29 minor |
| Complexity | 3 medium |

🟢 Metrics: 107 complexity · 8 duplication

    if not observations_df.empty:
        records = observations_df.to_records(index=False).tolist()
        with self.lock:
            self._obs_records.extend(records)

This dataframe looks like it could get pretty big and use a lot of memory. Can it write to disk as it goes?
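One possible shape for this, as a rough sketch only: buffer records under the lock and append them to a JSON Lines file once a threshold is reached. The class `SpillingRecordBuffer`, the `max_buffered` parameter, and the JSON Lines format are all assumptions made for illustration. None of them is taken from the PR.

```python
import json
import threading
from typing import Iterable, List, Sequence


class SpillingRecordBuffer:
    """Thread-safe buffer that appends records to disk in batches.

    Hypothetical sketch: memory use is bounded by roughly max_buffered
    records plus the size of one incoming batch.
    """

    def __init__(self, path: str, max_buffered: int = 10_000):
        self._path = path
        self._max_buffered = max_buffered
        self._buffer: List[Sequence] = []
        self._lock = threading.Lock()
        self.rows_written = 0

    def extend(self, records: Iterable[Sequence]) -> None:
        with self._lock:
            self._buffer.extend(records)
            if len(self._buffer) >= self._max_buffered:
                self._flush_locked()

    def flush(self) -> None:
        """Write any remaining records; call once after ingestion ends."""
        with self._lock:
            self._flush_locked()

    def _flush_locked(self) -> None:
        # Caller must hold self._lock.
        if not self._buffer:
            return
        with open(self._path, "a", encoding="utf-8") as out:
            for record in self._buffer:
                # One JSON array per line; values must be JSON-serializable.
                out.write(json.dumps(list(record)) + "\n")
        self.rows_written += len(self._buffer)
        self._buffer.clear()
```

With something like this, the snippet above would call `buffer.extend(records)` instead of growing `self._obs_records`. The trade-off is that file writes happen while the lock is held, so a very small `max_buffered` could serialize the worker threads.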
Refactor Simple Importer Parallel Ingestion, GCS Exporter & Decouple JSON-LD Streaming
Summary
This PR optimizes the parallel ingestion pipeline, decouples JSON-LD streaming, and resolves critical connection pooling and CPU-blocking bottlenecks to speed up the GCS bulk upload phase.
Key Changes
Code Health & Thread Safety:
- Wrapped child progress reporter updates with the parent reentrant lock.
- Fixed a critical logging bug where failed files incorrectly reported Status
- Moved all inline imports (threading, requests, tempfile) to top-level.
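The reentrant-lock change in the first item can be pictured with a small sketch. `ParentReporter` and `ChildReporter` are hypothetical stand-ins, not the PR's ImportReporter classes, and the status strings are placeholders.

```python
import threading
from typing import Dict


class ParentReporter:
    """Hypothetical parent reporter that owns a shared reentrant lock."""

    def __init__(self) -> None:
        self.lock = threading.RLock()
        self.statuses: Dict[str, str] = {}

    def record(self, name: str, status: str) -> None:
        # Acquires the lock itself, so it is safe to call directly too.
        with self.lock:
            self.statuses[name] = status


class ChildReporter:
    """Hypothetical per-file reporter that reports through its parent."""

    def __init__(self, parent: ParentReporter, name: str) -> None:
        self._parent = parent
        self._name = name

    def update(self, status: str) -> None:
        # The child takes the parent's lock, then record() takes it again.
        # With RLock the same thread may re-acquire; a plain Lock would
        # deadlock here.
        with self._parent.lock:
            self._parent.record(self._name, status)


def report_all(parent: ParentReporter, names: list) -> None:
    """Update one child per name from its own thread."""
    threads = [
        threading.Thread(target=ChildReporter(parent, n).update, args=("done",))
        for n in names
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A reentrant lock matters because the child update and the parent method both guard the same state; nesting those acquisitions on one thread is only safe with `threading.RLock`.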