[DCP - Ingestion] DCP Bridge Mode improvements by gmechali · Pull Request #519 · datacommonsorg/import

gmechali · 2026-06-04T19:50:20Z

Refactor Simple Importer Parallel Ingestion, GCS Exporter & Decouple JSON-LD Streaming

Summary

This PR optimizes the parallel ingestion pipeline, decouples JSON-LD streaming, and resolves critical connection pooling and CPU-blocking bottlenecks to speed up the GCS bulk upload phase.

Key Changes

GCS Upload Speedup: Replaced the PyFilesystem GCSFS client (serialized by instance locks) with the native google-cloud-storage client. Configured the internal HTTP session pool size to match upload thread concurrency (32). This reduces GCS upload time by more than half.
Decoupled JSON-LD Exporter: Extracted JsonLdStreamDb into its own module (stats/jsonld_stream_db.py) to eliminate circular import dependencies between relational DB logic and GCS streaming exporters.
CPU Bottleneck Fixes: Moved Pandas to_records() conversions out of the sequential generator into worker threads, and removed redundant blocking gc.collect() calls from chunk generators.
Orchestration: Deferred workflow auto-trigger execution to the very end of Runner.run() to guarantee that Cloud Workflows/Dataflow starts only after all JSON-LD files are fully uploaded to GCS.
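
The ordering guarantee described under Orchestration, combined with per-shard work running in worker threads, can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `upload_shard`, `run_export`, and `trigger_workflow` are hypothetical names, and the upload is simulated with an in-memory dict instead of a GCS call. Only the worker count (32) is taken from the description above.

```python
from concurrent.futures import ThreadPoolExecutor

UPLOAD_THREADS = 32  # upload concurrency stated in the PR description


def upload_shard(name: str, rows: list[tuple], store: dict) -> str:
    # Hypothetical stand-in for a GCS upload. The per-shard conversion
    # (here: rows -> text) runs inside the worker thread, not in the
    # sequential generator that feeds the pool.
    store[name] = "\n".join(",".join(map(str, r)) for r in rows)
    return name


def run_export(shards: dict[str, list[tuple]], trigger_workflow) -> list[str]:
    store: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=UPLOAD_THREADS) as pool:
        futures = [
            pool.submit(upload_shard, name, rows, store)
            for name, rows in shards.items()
        ]
        # result() re-raises any worker exception, so a failed upload
        # prevents the trigger below from running.
        uploaded = [f.result() for f in futures]
    # Deferred to the very end: every shard has been uploaded by now.
    trigger_workflow(sorted(uploaded))
    return uploaded
```

Calling `run_export({"a": [(1, 2)]}, print)` would invoke the trigger exactly once, after the pool has drained.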

Code Health & Thread Safety:

Wrapped child progress reporter updates with the parent reentrant lock.
Fixed a critical logging bug where failed files incorrectly reported Status
Moved all inline imports (threading, requests, tempfile) to top-level.
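
As a rough illustration of the first item, a child reporter can take its parent's reentrant lock around each update. The class and method names below are hypothetical (the PR's reporter classes are not shown here); the sketch only demonstrates why an `RLock` is needed when the locked child update calls back into a parent method that locks again.

```python
import threading


class ParentReporter:
    def __init__(self) -> None:
        # RLock lets the same thread re-acquire the lock, e.g. when a child
        # update calls back into a parent method that also locks.
        self.lock = threading.RLock()
        self.status: dict[str, str] = {}

    def child(self, name: str) -> "ChildReporter":
        return ChildReporter(self, name)

    def _on_child_update(self, name: str, status: str) -> None:
        with self.lock:  # re-entrant acquire from the same thread
            self.status[name] = status


class ChildReporter:
    def __init__(self, parent: ParentReporter, name: str) -> None:
        self.parent = parent
        self.name = name

    def report(self, status: str) -> None:
        # Hold the parent's lock around the whole update.
        with self.parent.lock:
            self.parent._on_child_update(self.name, status)
```

With a plain `threading.Lock`, the nested acquire in `_on_child_update` would deadlock on the same thread.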

gemini-code-assist

Code Review

This pull request introduces JsonLdStreamDb to stream triples and observations directly to JSON-LD shards on GCS or local disk, bypassing SQLite. It also adds thread safety to the Nodes and ImportReporter classes and enables parallel ingestion of data files in DCP_BRIDGE mode. The review feedback highlights several critical issues in simple/stats/db.py, including potential AttributeError crashes in _uri_ref when handling non-string values, O(N^2) performance bottlenecks when slicing lists in chunk generators, and type overwriting when a subject has multiple typeOf predicates.
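
For context on the O(N^2) point, one common form of that problem is re-slicing the remaining tail of a list on every iteration. The reviewed code in simple/stats/db.py is not shown here, so the functions below are a hypothetical sketch of the pattern and its usual fix, not the PR's implementation.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunks_quadratic(items: list, size: int) -> Iterator[list]:
    # Anti-pattern: each `items[size:]` copies the whole remaining tail,
    # so total copying grows quadratically with the list length.
    while items:
        yield items[:size]
        items = items[size:]


def chunks_linear(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    # Index-based slicing copies each element once: linear overall.
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Both generators yield the same chunks; only the amount of copying differs.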

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gmechali · 2026-06-04T21:59:47Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces JsonLdStreamDb to stream JSON-LD shards directly to GCS or local disk, optimizing data ingestion in DCP_BRIDGE mode. It also adds parallel file ingestion, thread-safety decorators, and a bug fix for reporting failures. The review feedback highlights three key improvement opportunities: deduplicating non-@type predicate values in the fast node sharding path to ensure parity with rdflib, catching specific json.JSONDecodeError exceptions instead of a broad Exception when parsing custom properties, and removing redundant, risky private API access on the GCS client.
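
Two of these suggestions can be sketched briefly. The helper names are hypothetical and the PR's actual parsing code is not shown here; the sketch assumes custom properties arrive as a JSON string and that predicate values are plain strings. Because an RDF graph is a set of triples, repeated values collapse there, which is the parity the deduplication aims for.

```python
import json


def parse_custom_properties(raw: str) -> dict:
    # Hypothetical helper: catch only malformed JSON. Any other error
    # (for example a TypeError from a non-string input) still propagates.
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def dedupe_values(values: list[str]) -> list[str]:
    # dict.fromkeys keeps first-seen order while dropping repeats.
    return list(dict.fromkeys(values))
```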

codacy-production · 2026-06-04T23:33:26Z

Not up to standards ⛔

🔴 Issues 3 high · 5 medium · 38 minor

Alerts:
⚠ 46 issues (≤ 0 issues of at least minor severity)

Results:
46 new issues

Category Results

UnusedCode: 2 medium, 1 minor
Documentation: 8 minor
ErrorProne: 3 high
CodeStyle: 29 minor
Complexity: 3 medium

🟢 Metrics 107 complexity · 8 duplication

Metric Results

Complexity 107

Duplication 8


dwnoble · 2026-06-05T22:48:00Z

+    if not observations_df.empty:
+      records = observations_df.to_records(index=False).tolist()
+      with self.lock:
+        self._obs_records.extend(records)


This dataframe looks like it could get pretty big and use a lot of memory. Can it write to disk as it goes?
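
One way to picture the "write to disk as it goes" idea is an append-only spool that replaces the in-memory list. This is a hypothetical sketch, not something proposed in the PR: `DiskSpool` is an invented name, rows are assumed to be JSON-serializable tuples, and the spool is read back only after all writers have finished.

```python
import json
import tempfile
import threading
from typing import Iterable, Iterator


class DiskSpool:
    """Hypothetical append-only spool: rows go to a temp file, not a list."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._file = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
        self.count = 0

    def extend(self, records: Iterable[tuple]) -> None:
        # Serialize outside the lock; only the file append is serialized.
        lines = [json.dumps(list(r)) + "\n" for r in records]
        with self._lock:
            self._file.writelines(lines)
            self.count += len(lines)

    def rows(self) -> Iterator[list]:
        # Call after all writers have finished. Streams one row at a time;
        # tuples come back as lists after the JSON round trip.
        self._file.flush()
        self._file.seek(0)
        for line in self._file:
            yield json.loads(line)
```

In the snippet quoted above, `self._obs_records.extend(records)` would become a call to `extend` on such a spool, trading memory for disk I/O.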

gmechali added 16 commits June 3, 2026 16:31

First try at sqlite-less

d38bf46

Multi threaded

87a2509

More speedup

e7436d8

Logging buffering

db0877d

Threadit

bd4865e

more parallelization

4a7bfad

fix timing for workflow trigger

fc01fc8

Export faster

6a1dbec

Adds a global counter

97b2e81

Adds garbage collection to help performance, and prevent OOMs

3bae59d

Bypass rdflib for speed.

8044e94

Remove gccollect

15eff19

Use native GCS

a6eb2a5

More performance fix

6013345

Minor cleanup

85a305e

Lint

e420a5a

gemini-code-assist Bot reviewed Jun 4, 2026

Comment thread simple/stats/db.py Outdated

Comment thread simple/stats/db.py Outdated

Comment thread simple/stats/db.py Outdated

Comment thread simple/stats/db.py Outdated

Comment thread simple/stats/db.py Outdated

gmechali added 5 commits June 4, 2026 16:24

Major cleanup on the code

40ba5a4

Cleanup the JSONLD parsing code.

7a45722

Minor cleanup

37aab53

Lint

cd8dc1f

More cleanups and comments

bae4f61

gmechali requested a review from dwnoble June 4, 2026 21:56

gmechali and others added 2 commits June 4, 2026 17:58

Merge branch 'master' into newv

1f67f4b

Gemini comments round 1

569eac5

gemini-code-assist Bot reviewed Jun 4, 2026

Comment thread simple/stats/jsonld_stream_db.py

Comment thread simple/stats/jsonld_stream_db.py Outdated

Comment thread simple/stats/jsonld_stream_db.py

gmechali added 2 commits June 4, 2026 18:04

Dedup the type

bf38fee

Lint

314507d

Merge branch 'master' into newv

ddc69df

dwnoble approved these changes Jun 5, 2026

gmechali wants to merge 26 commits into datacommonsorg:master from gmechali:newv
