This solution provides a comprehensive, agentic framework for analyzing healthcare documents—specifically Clinical Study Summary Reports (CSSR) and Patient Profiles—using a "Zero-Copy RAG" architecture on Google Cloud. It leverages BigQuery as the central data and AI orchestration engine, integrating Document AI, Vertex AI (Gemini), and BigQuery Graph to transform unstructured PDF reports into a structured, searchable, and relational Knowledge Graph.
This architecture demonstrates a streamlined, in-warehouse approach to unstructured data analysis:
- Unstructured Data (TXT and PDF files) resides securely in Cloud Storage.
- Object Tables in BigQuery provide direct access to these files without duplicating the data.
- Multimodal LLMs (Vertex AI) connect directly to BigQuery, empowering the data warehouse to process the unstructured content natively.
- Vector Embeddings are generated for semantic searches, while LLMs extract structured entities into Denormalized Tables.
- Finally, the structured entities and relationships are modeled into a Graph within BigQuery for complex graph traversal searches.
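As a rough illustration of the Object Table step above, the following sketch builds the DDL for an object table over files in Cloud Storage. The project, dataset, connection, and bucket names are hypothetical placeholders, not values from this repository.

```python
def object_table_ddl(table: str, connection: str, uri_pattern: str) -> str:
    """Return DDL for a BigQuery object table over files in Cloud Storage."""
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{table}`\n"
        f"WITH CONNECTION `{connection}`\n"
        "OPTIONS (\n"
        "  object_metadata = 'SIMPLE',\n"
        f"  uris = ['{uri_pattern}']\n"
        ")"
    )


# Hypothetical identifiers, for illustration only.
ddl = object_table_ddl(
    table="my-project.clinical.cssr_objects",
    connection="my-project.us.gcs_connection",
    uri_pattern="gs://my-bucket/cssr/*.pdf",
)
print(ddl)
```

The table holds only object metadata; the files themselves remain in the bucket.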
Traditional RAG (Retrieval-Augmented Generation) often requires moving data between silos (extracting text, sending to an external vector DB, then to an LLM). This solution implements Zero-Copy RAG, where:
- Data Stays in Place: PDF documents and text files remain in Google Cloud Storage and are exposed to BigQuery via Object Tables, so nothing is copied into a separate store.
- In-Warehouse Processing:
  - Document AI Integration: BigQuery uses the `AI.PARSE_DOCUMENT` function to extract structured chunks and parse complex PDFs without moving the data.
  - Generative Extraction: `AI.GENERATE` (utilizing Gemini 2.5 models via AI Remote Connections) extracts structured entities directly from the document content.
- Seamless Joins: Extracted data is immediately joined with existing structured clinical datasets in BigQuery, creating a "Golden Record".
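The generative extraction step can be sketched as a query that calls `AI.GENERATE` once per parsed chunk. This is a minimal sketch, not the repository's actual SQL: the chunk table, its `chunk_id` and `chunk_text` columns, the connection ID, and the prompt wording are all assumptions.

```python
def extraction_query(chunks_table: str, connection_id: str,
                     endpoint: str, instruction: str) -> str:
    """Build a query that asks Gemini to extract entities from each chunk."""
    # Escape backslashes and single quotes so the instruction is a valid
    # single-quoted BigQuery string literal.
    escaped = instruction.replace("\\", "\\\\").replace("'", "\\'")
    return (
        "SELECT\n"
        "  chunk_id,\n"
        "  AI.GENERATE(\n"
        f"    CONCAT('{escaped}', chunk_text),\n"
        f"    connection_id => '{connection_id}',\n"
        f"    endpoint => '{endpoint}'\n"
        "  ).result AS extracted\n"
        f"FROM `{chunks_table}`"
    )


# Hypothetical table, connection, and prompt, for illustration only.
sql = extraction_query(
    chunks_table="my-project.clinical.cssr_chunks",
    connection_id="us.vertex_connection",
    endpoint="gemini-2.5-flash",
    instruction="List the drug names mentioned in this text as JSON: ",
)
print(sql)
```

Because the result is an ordinary column, it can be joined to existing clinical tables in the same statement.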
To make deploying unstructured ingestion pipelines simple and scalable, we have adopted a Two-Tier Parameterization Architecture:
- Tier 1 (Data Analysts): Users only need to modify `config.yaml` to specify their Google Cloud project ID, dataset, and GCS bucket locations. No SQL or Python knowledge is required.
- Tier 2 (Data Engineers): The `scripts/parameterize.py` script automatically ingests the YAML, performs robust escaping of BigQuery AI prompts, and generates deployable Stored Procedures (`Parameterized_Patient_Profiles.sql` and `Parameterized_Clinical_Trials.sql`). These procedures can then be easily orchestrated via Cloud Run, Cloud Composer, or Airflow using `scripts/orchestrate_ingestion.py`.
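The two tiers can be pictured as a config dictionary feeding a SQL template. The sketch below is not the actual `scripts/parameterize.py`; the config keys, the procedure name, and the template body are hypothetical, and a plain dict stands in for the parsed `config.yaml`.

```python
from string import Template


def escape_sql_string(text: str) -> str:
    """Escape text for use inside a single-quoted BigQuery string literal."""
    return (
        text.replace("\\", "\\\\")  # backslashes first, so later escapes survive
        .replace("'", "\\'")
        .replace("\n", "\\n")
    )


# Hypothetical procedure template; the real generated SQL will differ.
PROCEDURE_TEMPLATE = Template(
    "CREATE OR REPLACE PROCEDURE "
    "`${project_id}.${dataset}.ingest_patient_profiles`()\n"
    "BEGIN\n"
    "  DECLARE extraction_prompt STRING DEFAULT '${prompt}';\n"
    "  DECLARE source_uri STRING DEFAULT '${gcs_uri}';\n"
    "  SELECT extraction_prompt, source_uri;\n"
    "END;"
)


def render_procedure(config: dict) -> str:
    """Fill the template from config values, escaping the prompt text."""
    return PROCEDURE_TEMPLATE.substitute(
        project_id=config["project_id"],
        dataset=config["dataset"],
        gcs_uri=config["gcs_uri"],
        prompt=escape_sql_string(config["prompt"]),
    )


# Stand-in for values a Tier 1 user would put in config.yaml (assumed keys).
config = {
    "project_id": "my-project",
    "dataset": "clinical",
    "gcs_uri": "gs://my-bucket/profiles/*.pdf",
    "prompt": "Extract the patient's age.\nReturn JSON.",
}
procedure_sql = render_procedure(config)
print(procedure_sql)
```

Keeping the escaping in one helper means analysts can write prompts with apostrophes and line breaks without producing invalid SQL.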
This platform relies on native BigQuery AI functions to perform ML tasks directly within standard SQL queries, without a separate processing pipeline:
- `AI.PARSE_DOCUMENT`: Performs OCR, layout analysis, and entity extraction on unstructured PDFs natively in the warehouse.
- `AI.EMBED`: Automatically generates high-dimensional (768-dim) vector representations of medical text (e.g., trial titles, condition descriptions).
- `AI.GENERATE`: Prompts Gemini 2.5 Pro and Flash directly inside `SELECT` statements to generate patient-friendly summaries, cross-trial insights, and extract structured JSON directly from text.
- `AI.SEARCH`: Performs both semantic (cosine vector similarity) and hybrid (vector + lexical) searches across the embedded datasets.
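To make the semantic search idea concrete, the sketch below ranks documents by cosine similarity in plain Python. It illustrates the scoring only and is not how BigQuery implements search; the three-dimensional vectors and trial names are toy values standing in for 768-dimensional embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: list[float], docs: dict[str, list[float]],
         top_k: int = 2) -> list[tuple[str, float]]:
    """Return the top_k documents, most similar to the query first."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy embeddings, for illustration only.
docs = {
    "trial_a": [1.0, 0.0, 0.0],
    "trial_b": [0.6, 0.8, 0.0],
    "trial_c": [0.0, 0.0, 1.0],
}
results = rank([1.0, 0.0, 0.0], docs)
print(results)
```

A hybrid search would combine a score like this with a lexical match score; the weighting between the two is a design choice this sketch leaves out.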
- External Object Tables: Unified access to PDF reports in GCS.
- Dynamic Parameterization Engine: `config.yaml` and `parameterize.py` generate reusable BigQuery Stored Procedures for ingestion.
- Gemini-Powered Extraction: High-fidelity extraction of complex clinical fields using Gemini 2.5 Pro/Flash.
The solution constructs a BigQuery Property Graph to map the complex ecosystem of clinical research:
- Nodes: Trial, Drug, Disorder, Mechanism of Action (MOA), Company, Phase, Status, Criteria.
- Relationships:
  - `Drug->MayTreat->Disorder`
  - `Trial->Uses->Drug`
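The two relationships chain into a two-hop path from a trial to the disorders its drugs may treat. The sketch below walks that path over an in-memory edge list; in the deployed solution the same traversal would be a GQL query against the Property Graph. All node names are synthetic placeholders.

```python
# Each edge is (source node, relationship label, destination node).
# Synthetic data, for illustration only.
EDGES = [
    ("Trial:T-001", "Uses", "Drug:A"),
    ("Trial:T-002", "Uses", "Drug:B"),
    ("Drug:A", "MayTreat", "Disorder:X"),
    ("Drug:A", "MayTreat", "Disorder:Y"),
    ("Drug:B", "MayTreat", "Disorder:Z"),
]


def neighbors(node: str, label: str) -> list[str]:
    """Follow outgoing edges with the given label from a node."""
    return [dst for src, lbl, dst in EDGES if src == node and lbl == label]


def disorders_for_trial(trial: str) -> list[str]:
    """Traverse Trial -> Uses -> Drug -> MayTreat -> Disorder."""
    found = set()
    for drug in neighbors(trial, "Uses"):
        found.update(neighbors(drug, "MayTreat"))
    return sorted(found)


print(disorders_for_trial("Trial:T-001"))
```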
To support development and testing, the solution includes reusable Gemini CLI Plans for:
- Synthetic CSSR Generation: Creating multi-page clinical protocols.
- Patient Profile Generation: Creating synthetic medical records.
Interactive Jupyter notebooks demonstrate querying the Property Graph using GQL (Graph Query Language) and performing semantic search.
- Google Cloud Project with BigQuery, Vertex AI, and Document AI enabled.
- BigQuery Remote Connections configured for LLM and Embedding models.
- Python 3.10+ and the `google-cloud-bigquery` library (if using the orchestration scripts).
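With these prerequisites in place, running the generated procedures from Python amounts to issuing `CALL` statements through a BigQuery client. The sketch below is not the actual `scripts/orchestrate_ingestion.py`; the procedure names are hypothetical, and the client is passed in so the function can be exercised without credentials.

```python
def run_procedures(client, project: str, dataset: str,
                   procedures: list[str]) -> list[str]:
    """Call each stored procedure in order and wait for it to finish.

    `client` is expected to behave like google.cloud.bigquery.Client:
    client.query(sql) returns a job whose .result() blocks until done.
    """
    completed = []
    for name in procedures:
        sql = f"CALL `{project}.{dataset}.{name}`()"
        client.query(sql).result()  # raises if the procedure fails
        completed.append(name)
    return completed


# Usage (requires google-cloud-bigquery and application default credentials;
# project, dataset, and procedure names below are hypothetical):
#
#   from google.cloud import bigquery
#   run_procedures(
#       bigquery.Client(project="my-project"),
#       "my-project",
#       "clinical",
#       ["ingest_patient_profiles", "ingest_clinical_trials"],
#   )
```

Running the calls sequentially keeps failures easy to attribute; a scheduler such as Cloud Composer could instead run each procedure as its own task.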
See INSTALLATION.md for step-by-step deployment instructions, or run ./quick-install.sh.
This is not an officially supported Google product.
