An AI-powered email-to-Drive filing system. Documents arriving in a shared inbox are automatically read by a large language model, tagged with structured metadata, renamed per a configurable template, and filed into the correct folder of a shared Drive. Anything the model isn't confident about goes to a human review queue instead of being filed.
The system is document-type and team-agnostic — the same code runs for any team. What varies between teams is configuration: which fields to extract, how to name the file, where to put it. This README is the build guide for a Google Cloud + Gemini reference implementation.
Built by SEAD in collaboration with PLCS.
- What it does
- Why this approach
- Architecture
- How a document gets filed
- The AI extractor
- Confidence thresholds and the exception queue
- Filename engine
- Folder routing
- Deduplication
- Format handling
- Auth model
- Audit log spreadsheet
- Reporting and monitoring
- Build guide
- Configuration changes after launch
- Operational runbook (light)
- Cost
- Security and compliance
- What you need to customize for your team
A new email arrives in your team's shared intake inbox. The bot extracts each attachment, asks an LLM to read the document and return a structured set of metadata fields (whichever fields your team needs), checks whether the model is confident enough, and then either:
- Files it. The bot assembles a filename per your template, picks the target folder using your routing rules, and moves the document into your shared Drive — logging every action.
- Routes it for review. The document goes to a Staging folder with a `[REVIEW]` prefix, and a row is added to a human-facing review queue.
Everything is configurable: extraction fields, filename template, folder routing, confidence thresholds, what counts as a trusted source, what to skip entirely.
Three things drove the architecture choice over a no-code automation platform:
- Security. Documents and AI inference both stay inside your own Google Cloud organization. Nothing routes through a third-party automation platform.
- Cost. Estimated ongoing infrastructure cost is under $10/month at typical small-team volume (tens to low hundreds of documents per month). A no-code alternative is roughly an order of magnitude more expensive at the same volume.
- Control. All code lives in a repo your team owns. No vendor lock-in. Full audit trail in Cloud Logging.
The trade-off is that a no-code platform is faster to stand up. This implementation is meant to be the long-term home, not a prototype.
```
Shared intake inbox (Google Group)
            |
            v
Bot user mailbox --(Gmail watch)--> Pub/Sub topic
                                        |
                                        v
                              Cloud Function: filing
                                        |
        +-------------------------------+-------------------------------+
        v                               v                               v
   Gmail API                   Vertex AI (Gemini)                  Drive API
(read message,                (extract structured            (move + rename file)
 attachments)                      metadata)                            |
        |                                                               v
        v                                                       Audit Log Sheet
Confidence >= threshold?                                        (Filing Log,
  yes -> file to Drive                                           Exception Queue,
  no  -> Staging + queue                                         Dashboard,
                                                                 live config)

Cloud Scheduler (cron) --> Cloud Function: Gmail watch renewer (every 6 days)
Cloud Scheduler (cron) --> Cloud Function: monthly spot-check audit
```
| Component | Purpose |
|---|---|
| Cloud Function `document-filing` | Receives Pub/Sub events, extracts metadata, files |
| Cloud Function `watch-renewer` | Renews the Gmail push-notification watch every 6 days |
| Cloud Scheduler `gmail-watch-renewer` | Cron (`0 6 */6 * *`) that hits the watch renewer |
| Cloud Scheduler `spot-check-audit` | Monthly cron that pulls a random sample of auto-filed docs for QA |
| Pub/Sub topic | Receives Gmail watch notifications |
| Service account | Runs the Cloud Functions with domain-wide delegation |
| Audit log spreadsheet | Filing records, exception queue, dashboard, routing config |
| Filed-documents index | SHA-256 lookup so duplicate emails don't double-file (Sheet or Firestore) |
Google Cloud Functions, 2nd gen, Python 3.11+. Serverless, scales to zero, pay-per-invocation. Move to Cloud Run if you ever need request timeouts longer than 9 minutes.
End-to-end:
- New email arrives at the team's intake group address (e.g., `intake@yourcompany.com`, a Google Group).
- The Group delivers it to a dedicated bot user's mailbox.
- A Gmail watch on that mailbox fires a Pub/Sub notification with `(emailAddress, historyId)`.
- The Cloud Function calls `users.history.list(startHistoryId=last_known)` to enumerate new messages since the last checkpoint.
- For each new message:
  - Extract attachments.
  - Compute a SHA-256 hash of each attachment and check it against the filed-documents index. If already filed, log and skip.
  - Normalize formats the LLM can't read natively (see Format handling).
  - Send the document content + email metadata to the LLM for extraction.
  - Apply the confidence threshold for the matched intake path.
  - If above threshold: assemble the filename per the template, move to the resolved folder, log to the Filing Log.
  - If below threshold: move to a Staging folder with a `[REVIEW]` filename prefix and add a row to the Exception Queue.
- Advance the Gmail history checkpoint after successful enumeration.
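The checkpoint-driven enumeration step can be sketched as follows. This is a minimal illustration assuming the standard `google-api-python-client` Gmail service object; the function name and return shape are illustrative, not the reference implementation's actual API:

```python
def enumerate_new_messages(gmail, last_history_id: str) -> tuple[list[str], str]:
    """Return (new message IDs, advanced checkpoint) since last_history_id."""
    message_ids: list[str] = []
    page_token = None
    new_checkpoint = last_history_id
    while True:
        resp = gmail.users().history().list(
            userId="me",
            startHistoryId=last_history_id,
            historyTypes=["messageAdded"],
            pageToken=page_token,
        ).execute()
        for record in resp.get("history", []):
            for added in record.get("messagesAdded", []):
                message_ids.append(added["message"]["id"])
        # Gmail returns the latest historyId with each page; only persist it
        # back to the Gmail State tab after the whole enumeration succeeds.
        new_checkpoint = resp.get("historyId", new_checkpoint)
        page_token = resp.get("nextPageToken")
        if not page_token:
            break
    return message_ids, new_checkpoint
```

Advancing the checkpoint only after a successful pass is what makes processing idempotent: a crash mid-batch means the same messages are enumerated again, and the SHA-256 dedup index prevents double-filing.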
The implementation uses Gemini 2.5 Pro via Vertex AI. Earlier versions used 2.5 Flash; Pro materially improved accuracy on complex documents at the cost of a few cents per extraction.
The model is prompted to read the document end to end and return a structured JSON object. The exact fields are whatever your team needs to track and route by — they're defined in the prompt, not hard-coded in the system. Common fields across most use cases:
| Field | Purpose |
|---|---|
| `document_type` | Your team's taxonomy of document types (whatever categories matter to you) |
| `document_date` | The relevant date on the document, `YYYY-MM-DD` |
| `primary_entity` | Which of your entities the document belongs to (subsidiary, branded entity, division, etc.) — if applicable |
| `vendor` | The other party named in the document (vendor, customer, employee, etc.) — name as appropriate to your domain |
| `summary` | One- to two-sentence plain-English description |
| `confidence_score` | 0.00 – 1.00 |
| `folder_path` | The model's suggested destination folder |
| `flags` | Array of edge-case indicators (vague subject, unclear entity, non-standard format, etc.) |
Add or remove fields freely. A finance team might extract invoice number and amount; an HR team might extract employee ID and document category; a procurement team might extract PO number and supplier. The system passes whatever fields you define through to the filename template, the routing logic, and the audit log columns.
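For concreteness, here is the kind of JSON object a finance team's prompt might get back; the `invoice_number` and `amount` fields are illustrative additions, not part of the common set:

```json
{
  "document_type": "Invoice",
  "document_date": "2024-03-18",
  "primary_entity": "Acme Holdings LLC",
  "vendor": "Initech Supply Co.",
  "invoice_number": "INV-10482",
  "amount": "1,250.00 USD",
  "summary": "Invoice from Initech Supply Co. for Q1 office equipment.",
  "confidence_score": 0.97,
  "folder_path": "Finance/Invoices",
  "requires_multi_filing": false,
  "flags": []
}
```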
Skip rules. Some attachments are intentionally dropped — no Drive upload, no log entry — because they aren't records your team wants in the filing system. Configure these per team. Common patterns: drafts of in-flight documents, automated bounce notifications, abuse-report forwards, marketing emails. Anything matching the configured skip rules is logged in Cloud Logging only and never reaches the audit Sheet.
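A minimal sketch of what a skip-rule matcher could look like. The rule shape, field names, and example patterns below are assumptions for illustration, not the reference implementation's config format:

```python
import fnmatch
import re

# Hypothetical skip-rule shape: each rule matches on a sender glob,
# a subject regex, or an attachment-name glob.
SKIP_RULES = [
    {"sender": "*@bounces.example.com"},   # automated bounce notifications
    {"subject": r"(?i)\bdraft\b"},         # drafts of in-flight documents
    {"filename": "*.ics"},                 # calendar invites, not records
]

def should_skip(sender: str, subject: str, filename: str) -> bool:
    """Return True when any configured rule matches; matched docs are
    logged in Cloud Logging only and never reach the audit Sheet."""
    for rule in SKIP_RULES:
        if "sender" in rule and fnmatch.fnmatch(sender.lower(), rule["sender"]):
            return True
        if "subject" in rule and re.search(rule["subject"], subject):
            return True
        if "filename" in rule and fnmatch.fnmatch(filename.lower(), rule["filename"]):
            return True
    return False
```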
The confidence threshold is a gate, not an accuracy target. The model only auto-files when it's at least N% confident. Below that, the document goes to a human. This keeps a human in the loop for anything ambiguous.
Thresholds vary by intake path because some sources are more structured and trustworthy than others:
| Intake path | Threshold | Triggered when |
|---|---|---|
| `trusted_source` | 95% | Email comes from a recognized sender pattern your team has whitelisted (e.g., an e-signature platform, a known internal forwarding alias, a specific vendor portal) |
| `external` | 99% | A configured "external" tag appears in the subject line |
| `unknown` | 99% | Anything else |
The trusted_source path is where you plug in your team's tooling. Define one entry per source you want to trust (sender pattern + threshold + any subject-line parser); add or remove sources as your team's stack changes.
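A sketch of how intake-path detection might resolve a sender and subject to a path and threshold. The registry shape, example patterns, and `[EXT]` tag are all illustrative assumptions:

```python
import re

# Illustrative trusted-source registry: one (sender regex, threshold) pair
# per source your team whitelists.
TRUSTED_SOURCES = [
    (re.compile(r".*@mail\.esignature-platform\.example$"), 0.95),
    (re.compile(r"^records-forwarder@yourcompany\.com$"), 0.95),
]
EXTERNAL_TAG = "[EXT]"  # assumed subject-line tag for the external path

def intake_path(sender: str, subject: str) -> tuple[str, float]:
    """Return (intake path name, auto-file threshold) for a message."""
    for pattern, threshold in TRUSTED_SOURCES:
        if pattern.match(sender.lower()):
            return "trusted_source", threshold
    if EXTERNAL_TAG in subject:
        return "external", 0.99
    return "unknown", 0.99
```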
Three confidence-boosting mechanisms run before the threshold check:
- Intake-path thresholds. Trusted sources auto-file at 95% instead of 99% because the surrounding metadata is highly structured.
- Known-vendor validation. The model's extraction is cross-referenced against a maintained list of known vendors/parties. A match boosts confidence by 3–5%.
- Email-subject parsing. Many automated completion emails contain structured metadata in the subject line. For each trusted sender pattern you configure, you can also configure a subject-line parser. When the model's extracted fields match two or more subject-parsed fields, confidence is boosted.
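The boosting logic is deterministic and can be sketched in a few lines (exact boost sizes within the 3–5% band, and the function shape, are assumptions):

```python
def boosted_confidence(base: float, vendor: str, known_vendors: set[str],
                       extracted: dict, subject_fields: dict) -> float:
    """Apply deterministic confidence boosts before the threshold check."""
    conf = base
    # Known-vendor validation: cross-reference against the maintained list.
    if vendor and vendor.lower() in known_vendors:
        conf += 0.04  # assumed value within the 3-5% band
    # Email-subject parsing: two or more subject-parsed fields agreeing
    # with the model's extraction earns a boost.
    matches = sum(1 for k, v in subject_fields.items() if extracted.get(k) == v)
    if matches >= 2:
        conf += 0.03  # assumed value
    return min(conf, 1.0)  # boosts are additive, capped at 100%
```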
Documents that don't clear the threshold land in:
- A row in the Exception Queue tab of the audit log spreadsheet
- A copy of the file in a `0. Staging` folder of the Drive, with a `[REVIEW]` prefix
- A Drive link in the queue row so a reviewer can open it in one click
The bot enforces whatever filename convention your team already uses — it doesn't impose one. The engine takes the LLM's extracted fields and assembles a filename per a template you define in `src/shared/naming.py`.
The template is a simple format string referencing the extraction fields:
```python
NAME_TEMPLATE = "{document_date} {primary_entity} - {document_type} ({vendor}).{ext}"
```

That's the only place the format lives. Change the template, redeploy, done.
Practical notes the engine handles for you:
- Field normalization. Dates are normalized to whatever format the template expects. Strings are trimmed and case-normalized.
- Missing-field fallbacks. When an extracted field is empty, the engine substitutes a configurable fallback (e.g., `"Unknown"`) rather than producing a malformed filename.
- Filesystem safety. Slashes, colons, and other reserved characters in extracted strings are stripped or replaced.
- Conflict resolution. When a target filename already exists in the destination folder, the engine appends a disambiguator (`-2`, `-3`, etc.).
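Put together, a minimal version of the engine might look like this. The fallback value and sanitization rules mirror the notes above; everything else (function shape, field list) is illustrative:

```python
import re

NAME_TEMPLATE = "{document_date} {primary_entity} - {document_type} ({vendor}).{ext}"
FALLBACK = "Unknown"
_RESERVED = re.compile(r'[\\/:*?"<>|]')  # filesystem-unsafe characters

def build_filename(fields: dict, existing: set[str]) -> str:
    """Assemble a filename from extracted fields: normalize, substitute
    fallbacks, sanitize, then resolve conflicts against the folder."""
    safe = {}
    for key in ("document_date", "primary_entity", "document_type", "vendor", "ext"):
        value = (fields.get(key) or FALLBACK).strip()
        safe[key] = _RESERVED.sub("-", value)
    name = NAME_TEMPLATE.format(**safe)
    # Conflict resolution: append -2, -3, ... until unique in the folder.
    candidate, n = name, 1
    stem, _, ext = name.rpartition(".")
    while candidate in existing:
        n += 1
        candidate = f"{stem}-{n}.{ext}"
    return candidate
```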
Field-extraction reliability, in order:
- Document content (most reliable for entity, document type, and the other party)
- Email metadata (subject, sender, recipients, date — useful as fallback and cross-reference)
When document and email dates conflict, the engine prefers the date from the document content.
Folder routing is split between sheet-driven config (your team's domain experts edit a tab) and code-driven config (an engineer ships a code change and redeploys).
Lives in tabs of the audit log spreadsheet. Changes go live within ~5 minutes — the bot reads the relevant tab on every extraction.
| Change | How |
|---|---|
| Add or edit a routing rule (which folder a doc type lands in) | Edit the Routing Tree tab. Add a row with node ID, category, folder key, triggers, subject matter, multi-folder rules, scope, and notes. |
| Add a known vendor (boosts confidence on match) | Add a row to the Known Vendors tab. |
| Create a new subfolder | Create the folder in Drive directly. The bot fuzzy-matches names to existing subfolders. |
Subfolder rules. The Routing Tree supports a per-row Subfolder? flag. When set, the bot auto-creates a subfolder named after one of the extracted fields (e.g., per-vendor or per-entity subfolders) so related documents group together rather than flat-piling in the parent folder.
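The fuzzy name match mentioned above could be as simple as a `difflib` comparison; the cutoff and strategy here are assumptions, and the real matcher may differ:

```python
import difflib

def resolve_subfolder(wanted: str, existing: list[str], cutoff: float = 0.85):
    """Match a desired subfolder name against existing subfolders so
    'Initech Supply' and 'Initech Supply Co.' don't spawn duplicates.
    Returns the existing name to reuse, or None to create a new folder."""
    matches = difflib.get_close_matches(
        wanted.lower(), [e.lower() for e in existing], n=1, cutoff=cutoff
    )
    if not matches:
        return None
    # Map the lowercase match back to the original-cased folder name.
    for e in existing:
        if e.lower() == matches[0]:
            return e
    return None
```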
| Change | What's involved |
|---|---|
| Add a new entity to the entity list | Edit `ENTITIES` in `src/shared/config.py`, redeploy. |
| Add a new top-level folder category | New entry in `DRIVE_FOLDERS`, possibly a new routing-tree node, redeploy. |
| Change a confidence threshold | Edit `INTAKE_THRESHOLDS` in `src/shared/confidence.py`, redeploy. |
| Add a new file format the LLM can't read natively | Add a handler in `src/shared/format_handlers.py`, redeploy. |
| Change the extraction prompt or filename template | Edit `classify.py` or `naming.py`, redeploy. |
| Change the AI model | Edit `LLM_MODEL` in `config.py`, redeploy. |
Some documents legitimately belong in more than one folder. A small rules engine handles these:
- The extractor returns `requires_multi_filing: true` and a list of candidate folders.
- The rules engine consults a config table of known multi-folder document types.
- The bot creates a copy in each target folder and logs both filings.
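A sketch of that rules engine (the table contents and function shape are illustrative):

```python
# Hypothetical config table: document types that always file to extra folders.
MULTI_FOLDER_RULES = {
    "Signed Contract": ["Legal/Contracts", "Finance/Vendor Records"],
}

def target_folders(doc_type: str, primary: str, extraction: dict) -> list[str]:
    """Return every folder this document should be copied into,
    primary destination first."""
    targets = [primary]
    if extraction.get("requires_multi_filing"):
        for extra in MULTI_FOLDER_RULES.get(doc_type, []):
            if extra not in targets:
                targets.append(extra)
    return targets
```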
Some destination folders have permissions the service account intentionally doesn't hold (e.g., HR-only, executive-only, sensitive program folders). When the routing logic sends a document to one of these, the bot does not attempt to write — it routes to Staging with a [REVIEW: limited-access folder] tag so a member of the owning team can move it manually.
The bot maintains a filed-documents index keyed by SHA-256 of the attachment content. The index stores: hash, original message ID, filed Drive path, and filing timestamp.
This handles the common duplicate patterns:
- BCC back to the intake inbox after forwarding internally
- Auto-forwards from internal aliases to the records inbox
- Platforms that auto-deposit and send completion notifications to the same inbox
Storage: Google Sheet for simplicity, or Cloud Firestore for slightly better latency at scale. Both are within the free tier at typical volume.
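The index logic itself is small. A sketch, with an in-memory dict standing in for the Sheet/Firestore backend (function names are illustrative):

```python
import hashlib

def attachment_fingerprint(content: bytes) -> str:
    """SHA-256 over raw attachment bytes: stable across re-sends,
    renames, and different senders."""
    return hashlib.sha256(content).hexdigest()

def check_and_record(index: dict, content: bytes, message_id: str, path: str) -> bool:
    """Return True if the attachment is new (and record it);
    False if it was already filed."""
    key = attachment_fingerprint(content)
    if key in index:
        return False
    index[key] = {"message_id": message_id, "drive_path": path}
    return True
```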
The LLM only reads PDF, images, and plain text natively. Other formats are normalized first:
| Source format | Normalization |
|---|---|
| `.docx` / `.doc` / `.rtf` / `.odt` | Uploaded to Drive as a Google Doc, exported as PDF |
| `.eml` | Parsed; inner attachments extracted and recursed; if no attachments, the plain-text body is sent |
| `application/octet-stream` | Magic-byte sniff (catches PDFs that senders mislabeled as generic binary) |
| Anything else | Routed to Staging with an `[UNSUPPORTED-FORMAT]` tag |
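The magic-byte sniff for `application/octet-stream` can be sketched like this (the signature list is a small subset; extend it per the traffic you actually see):

```python
def sniff_format(content: bytes, declared_mime: str) -> str:
    """Resolve a usable MIME type for generically-labeled attachments
    by checking well-known leading bytes."""
    if declared_mime != "application/octet-stream":
        return declared_mime
    if content.startswith(b"%PDF-"):
        return "application/pdf"
    if content.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if content.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if content[:2] == b"PK":            # ZIP container: .docx/.xlsx/etc.
        return "application/zip"
    return "application/octet-stream"   # still unknown -> staging
```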
The bot runs as a dedicated service account with domain-wide delegation (DWD) in Google Workspace. DWD lets the service account impersonate specific users for specific scopes.
In practice, the service account impersonates two users:
- Drive + Sheets → an admin user that holds the folder/sheet shares (e.g., `bot-admin@yourcompany.com`)
- Gmail → the bot user that's a member of the intake Google Group (e.g., `filing-bot@yourcompany.com`)
This split keeps the impersonation scope minimal — the bot user only ever reads its own mailbox.
Authorized OAuth scopes:
```
https://www.googleapis.com/auth/drive
https://www.googleapis.com/auth/gmail.modify
https://www.googleapis.com/auth/spreadsheets
```
Configured in: Google Admin Console → Security → API Controls → Domain-wide Delegation.
Credentials handling. Service account keys are stored in Google Secret Manager, not in the function's environment variables, and rotated on a schedule.
The audit log is a single Google Sheet that does triple duty: it's the write-side log of everything the bot does, the read-side config the bot consults on every extraction, and the human-facing review surface. Reviewers, ops people, and the bot all share one source of truth.
Eight tabs total, in three groups:
Audit / review (what the bot writes)
- Filing Log
- Exception Queue
- Spot-Check Audit
- Dashboard
Live config (what the bot reads on every extraction)
- Routing Tree
- Known Vendors
Documentation / state
- Routing Logic
- Gmail State
A bootstrap script (`shared/dashboard.py::bootstrap`) creates all eight tabs with the correct headers if they're missing — point it at a fresh sheet and it stands up the whole structure.
The exact column lists below reflect a generic schema. Add or rename columns to match the extraction fields your team configures — the audit log is meant to mirror your team's metadata.
Append-only history of every filing action. One row per document. Columns include the bot's metadata plus whatever extraction fields you configured.
| Column | Notes |
|---|---|
| Timestamp | ISO-8601 UTC |
| Status | `auto-filed`, `exception-queue`, `skipped` |
| Generated Filename | The renamed filename per the template |
| (Your extraction fields) | One column per field in your extraction schema (e.g., document type, primary entity, vendor, document date, summary) |
| Confidence | 0.00–1.00 |
| Target Folder(s) | Resolved folder path(s) — supports multi-folder filing |
| Drive File ID(s) | Comma-separated Drive file IDs |
| Drive Link | Clickable hyperlink to the filed file |
| Email Subject | From the source message |
| Email Sender | From the source message |
| Intake Path | `trusted_source`, `external`, `unknown` |
| Original Filename | The attachment's name as it arrived |
| Processing Time (s) | Receipt to filing complete |
| Notes | Errors, warnings, routing-decision rationale |
The reviewer's worklist. One row per document that didn't clear the threshold. The Action Taken column is the only field the reviewer fills in.
| Column | Notes |
|---|---|
| Timestamp, Original Filename, Suggested Filename | What it was, what the model thought it should be |
| (Your extraction fields) + Confidence | Model output (may be partial/wrong) |
| Email Subject, Email Sender, Intake Path | Source context |
| Notes | Why it was flagged (e.g., "UNCLASSIFIABLE (confidence 0%)", "missing date") |
| Review Link | Clickable hyperlink to the staged file in Drive |
| Action Taken | Reviewer fills in: approved, corrected, rejected, or duplicate |
The queue auto-refreshes after each new exception is logged.
Monthly QA sample for already-auto-filed documents. A Cloud Scheduler cron picks 15+ random rows from that month's Filing Log on the 1st of each month and lands them here.
| Column | Notes |
|---|---|
| Audit Month | YYYY-MM-01 |
| Timestamp, Status, Generated Filename, (your extraction fields), Confidence, Target Folder(s) | Copied from the original Filing Log row |
| Drive Link | Clickable for verification |
| Verified Correct? | Reviewer fills in: True / False |
| Reviewer Notes | Free text |
A simple metric/value/details rollup that auto-refreshes. Sectioned with `--- HEADER ---` divider rows for readability:

```
--- VOLUME ---
Total Documents Processed
Auto-Filed
Exception Queue

--- RATES ---
Auto-File Rate        (auto-filed / (auto-filed + exceptions))
Exception Rate        (Target: <= 5%)
Filing Accuracy       (from spot-check; Target: >= 95%)
Avg Processing Time   (Target: <= 120s)

--- THIS MONTH (YYYY-MM) ---
Month Auto-Filed
Month Exceptions

--- TOP DOC TYPES ---   (top 10 by frequency, from Filing Log)
--- TOP ENTITIES ---    (top entities by filing volume)
```
The authoritative routing table — the bot reads this on every extraction. Adding/editing a row is a no-deploy change. See Folder routing.
| Column | Purpose |
|---|---|
| Node ID | Stable internal key |
| Category | Top-level grouping |
| Folder Key | Maps to a Drive folder ID in config.py |
| Folder Path (Display) | Human-readable path for the sheet |
| Triggers / Doc Types | Comma-separated document types that route here |
| Subject Matter Description | Plain-language description fed to the model as context |
| Multi-Folder Copy | Optional — secondary folder to also drop a copy into |
| Subfolder? | YES / YES (entity name) / blank — controls auto-creation of subfolders |
| Scope | active / inactive — flip to disable a row without deleting it |
| Notes | Free text |
Two columns: Vendor Name, Notes (optional). Matches against this list boost extraction confidence by 3–5%.
A tab whose only purpose is to document the extraction-and-filing pipeline so reviewers and new admins can understand the system without reading code. The bot does not read this tab — it's purely human-facing. Each row is one step; columns are: Step, Name, What It Does, Owner (which file/system implements it), Fail-Safe (what happens if it fails).
Reference content (your team's pipeline may differ slightly):
| Step | Name | Owner | Fail-Safe |
|---|---|---|---|
| 0 | Intake Detection | `shared/confidence.py` | Unknown path uses 99% threshold |
| 1 | Fact Extraction | LLM prompt in `classify.py` | Missing required fields cap confidence at 50% |
| 2 | Category Classification | LLM with Routing Tree context | Below threshold → staging with `[REVIEW]` prefix |
| 3 | Confidence Boosting | `shared/confidence.py` (deterministic, no AI) | Boosts additive, capped at 100% |
| 4 | Folder Resolution | `shared/routing_tree.py` | Unknown folder key falls back to staging |
| 5 | Filename Construction | `shared/naming.py` | Missing fields filled with configured fallbacks |
| 6 | Subfolder Creation | `shared/file.py` | Reuses existing subfolder if present |
| 7 | Multi-Folder Copy | `shared/file.py` | Non-blocking — primary filing still succeeds |
| 8 | Deduplication | `shared/file.py` | Prevents staging from growing unbounded on reprocessing |
| 9 | Audit Logging | `shared/audit_log.py` | Sheets failure falls back to Cloud Logging |
| 10 | Dashboard Refresh | `shared/dashboard.py` | Non-blocking — refresh failure doesn't block filing |
Two columns, one row per watched mailbox: `emailAddress`, `lastHistoryId`. This is the Gmail watch checkpoint — the cursor that tells the bot how far through Gmail's history it has read. The Cloud Function reads this row on every invocation, calls `users.history.list(startHistoryId=lastHistoryId)` to enumerate new messages, and writes the new `historyId` back after successful enumeration.
This is the most operationally critical tab in the spreadsheet. If you lose this value, the bot doesn't know where it left off and either re-processes everything from scratch or skips messages. Two implications when rebuilding:
- Don't mix it into a tab with other data — keep it isolated so an accidental edit can't corrupt it.
- If you'd rather use Cloud Firestore for this single piece of state, that's a reasonable swap — Firestore gives you better durability guarantees and stays out of human view.
| Metric | Target | Definition |
|---|---|---|
| Filing accuracy | ≥ 95% | Spot-check rows verified correct / rows sampled |
| Exception rate | ≤ 5% | Exceptions / total processed |
| Avg processing time | ≤ 120 s | Seconds from email receipt to filing complete |
| Monthly filing count | (set per team) | Sanity check against historical volume |
Each month a Cloud Scheduler job picks 15+ random documents from that month's auto-filed items and lands them on the Spot-Check Audit tab. A human verifies each one is in the correct folder, correctly named, and that content matches the extraction. If accuracy drops below 95%, the system administrator is paged to look at model tuning or threshold adjustment.
- Cloud Logging — every extraction and filing action emits structured logs (correlation ID, hash, intake path, confidence, decision)
- Cloud Audit Logs — every API call against Gmail/Drive/Sheets is logged automatically
- A defensive `AUDIT|` log line is emitted in addition to writing to the Sheet, so audit history survives even if the Sheets API call fails
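The defensive line might be emitted like this (the field set and formatting are illustrative, not the reference implementation's exact schema):

```python
import json
import logging
import time

def emit_audit_line(record: dict) -> str:
    """Emit the AUDIT| line to Cloud Logging before (and regardless of)
    the Sheets write, so history survives a Sheets API failure."""
    line = "AUDIT|" + json.dumps({
        "ts": record.get("ts", time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())),
        "sha256": record.get("sha256"),
        "intake_path": record.get("intake_path"),
        "confidence": record.get("confidence"),
        "decision": record.get("decision"),
    }, sort_keys=True)
    logging.info(line)
    return line
```

The stable `AUDIT|` prefix makes these lines trivial to filter in Cloud Logging when reconstructing history after a Sheets outage.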
Rough scope for a single engineer who already has Google Workspace + Google Cloud admin access: about two weeks end-to-end, including a few days of integration testing.
- A Google Workspace organization where you can create service accounts and grant domain-wide delegation
- A Google Cloud project (separate from your main org workloads is recommended — easier to scope IAM)
- A shared Drive your team owns (you'll add the service account as Content Manager later)
- A Google Group that will be your team's intake inbox
- A bot user account that's a member of that Group
- An LLM API enabled — Vertex AI Gemini for this implementation
- Python 3.11+ on the build machine, plus the `gcloud` CLI
- Create the Google Cloud project. Enable: Cloud Functions, Cloud Build, Cloud Run, Cloud Scheduler, Pub/Sub, Vertex AI, Gmail API, Drive API, Sheets API, Secret Manager, Cloud Logging.
- Create a service account (e.g., `filing-bot@<project>.iam.gserviceaccount.com`).
- Generate a JSON key and store it in Secret Manager (do not commit it; the function will read it at startup).
- In the Google Workspace Admin Console, register the service account's client ID for Domain-wide Delegation with the three OAuth scopes listed in Auth model.
- Add the service account as Content Manager to the shared Drive that holds your filing structure.
- Add the bot user (the one the service account will impersonate for Gmail) to your team's intake Google Group.
```
src/
  filing/                 # Pub/Sub-triggered filing
    main.py               # Cloud Function entry point
    pipeline.py           # extract -> dedup -> classify -> file
  shared/
    config.py             # ENTITIES, DRIVE_FOLDERS, intake-path detection, etc.
    confidence.py         # INTAKE_THRESHOLDS and boosting logic
    classify.py           # the LLM call and extraction prompt
    naming.py             # filename template engine
    format_handlers.py    # docx -> PDF, eml unpack, octet-stream sniff
    routing.py            # routing-tree lookup against the spreadsheet
    drive.py              # Drive API helpers
    sheets.py             # Sheets API helpers (audit log)
    dedup.py              # SHA-256 index
    dashboard.py          # generate_spot_check, refresh_exception_queue, refresh_dashboard
  watch_renewer/
    main.py               # renews the Gmail watch
deploy/
  cloudbuild.yaml
  scheduler.yaml          # cron jobs: watch-renewer, spot-check
```
- Create a Pub/Sub topic (e.g., `filing-email-events`).
- Grant Gmail's push service account (`gmail-api-push@system.gserviceaccount.com`) publish permission on the topic.
- From a script running as the bot user, call `users.watch` to register the watch:

```python
service.users().watch(userId='me', body={
    'topicName': 'projects/<PROJECT>/topics/filing-email-events',
    'labelIds': ['INBOX'],
    'labelFilterAction': 'include',
}).execute()
```

- The watch expires after 7 days, so deploy the watch renewer Cloud Function and the Cloud Scheduler job that hits it every 6 days (`0 6 */6 * *`).
```shell
gcloud functions deploy document-filing \
  --gen2 \
  --runtime=python311 \
  --region=us-central1 \
  --source=src/filing \
  --entry-point=on_pubsub \
  --trigger-topic=filing-email-events \
  --service-account=filing-bot@<PROJECT>.iam.gserviceaccount.com \
  --set-secrets=SA_KEY=filing-bot-key:latest \
  --memory=1Gi --timeout=540s
```

Create a Google Sheet and run `shared/dashboard.py::bootstrap` to lay out the eight tabs described in Audit log spreadsheet with the correct headers. Share the sheet with the service account's impersonation user. Put the spreadsheet ID into `config.py`.
Strongly recommended: stand the system up against a test Drive that mirrors your production folder structure before pointing it at the real one. The reference implementation has a single config flag (`DRIVE_FOLDERS_PROD` vs `DRIVE_FOLDERS_TEST`) that flips between them.
A reasonable test plan:
- Send a known structured email from a trusted source (e-signature platform, internal alias, etc.) with a real attachment — verify it auto-files correctly.
- Send an email with the configured "external" tag — verify the higher threshold and routing.
- Send a deliberately ambiguous attachment — verify it lands in the Exception Queue with a reasonable suggested filename.
- Forward a `.docx` and an `.eml` — verify both normalize.
- Send the same attachment twice from different addresses — verify dedup.
Three-step cutover:
- Add the service account as Content Manager on the production shared Drive.
- Add the bot user to your team's production intake Group so it receives mail.
- Flip the config toggle from test to production folders and redeploy.
Roll back by flipping the toggle back. Drive's version history reverses any moves/renames.
| Frequency | Change | Owner |
|---|---|---|
| Most common | Add a routing rule, known vendor, or subfolder | Sheet edit by domain expert; live in ~5 min |
| Occasional | Add a new entity, top-level folder category, or document type | Code change + redeploy by engineer |
| Rare | Change the LLM model, prompt, or thresholds | Code change + redeploy by engineer |
Open the Audit Log spreadsheet → Exception Queue tab. Each row is a doc that needs human review.
For each row:
- Click the Review Link to open the file in Drive.
- Verify the suggested filename — is the extracted metadata correct?
- If correct: approve and let the bot finish filing.
- If wrong: rename and file manually.
- Fill in the Action Taken column: `approved`, `corrected`, `rejected`, or `duplicate`.
Common reasons a document lands in the queue:
- Scanned/image-only PDFs the model can't reliably read
- Unusual document types not represented in the routing tree
- Missing required fields in the document
- Multiple candidates for the same field (e.g., several entity names mentioned)
- Foreign-language documents
| Band | Likely cause | What to do |
|---|---|---|
| 80–98% | Model mostly sure but below the threshold | Usually verify and approve — model is often right at this range |
| 50–79% | Uncertain about one or more fields | Check each field carefully; one of the extracted values may be wrong |
| < 50% | Couldn't extract reliably | Manual review; may be an unusual doc type to add to config |
| Symptom | First thing to check |
|---|---|
| `unauthorized_client` errors | DWD scopes in Admin Console; client ID may need re-authorizing |
| Documents not being processed | Cloud Function logs; Vertex AI API enabled; service account key valid |
| Files not appearing in Drive | Service account has Content Manager on the shared Drive; target folder ID exists; not silently routed to Staging |
| Audit log not updating | Sheets API scope present in DWD; spreadsheet ID matches `config.py`; check fallback `AUDIT\|` lines in Cloud Logging |
| No new docs being processed | Gmail watch likely expired — Cloud Scheduler should auto-renew every 6 days; trigger the renewer manually if cron failed |
Order-of-magnitude monthly cost at ~50–100 documents/month:
| Item | Estimate |
|---|---|
| Cloud Functions | ~$0 (free tier covers 2M invocations/mo) |
| Vertex AI (Gemini Pro) | ~$3–10 depending on document length and volume |
| Cloud Scheduler | ~$0.10 (2 cron jobs) |
| Cloud Pub/Sub | ~$0 (free tier) |
| Cloud Logging | ~$0 (free tier) |
| Filed-documents index (Sheet or Firestore) | ~$0 (free tier) |
| Total | ~$3–10/month |
- Data residency. All document content and AI inference stay inside your Google Cloud organization. Vertex AI inference happens in the GCP region you select. Customer data is not used to train Google's foundation models (per Google Cloud's data processing terms).
- Compliance. Vertex AI and the rest of GCP carry SOC 2 and ISO 27001. If your team has additional regulatory requirements (HIPAA, etc.), check Google's coverage matrix per service.
- Access control. A dedicated service account with least-privilege IAM. No domain-wide admin access required. Scope the service account to specific Google Groups, Drive folders, and Sheets. Access fully revocable at any time.
- Audit trail. Cloud Audit Logs (every API call), processing logs in the audit Sheet (every document), Cloud Logging (structured application logs).
- Rollback. Drive version history reverses any individual move/rename. Cloud Function versioning gives you instant restore. Processing logs let you do targeted bulk rollback.
- Secrets. Service account keys live in Secret Manager and rotate on a schedule. Cloud Functions read them at cold start, never log them.
Most of this is config you fill in once. The reference implementation is built so the same code runs for any team — you just point it at different folders, schemas, and naming rules.
| Customization | Where it lives |
|---|---|
| Your team's extraction schema (the JSON fields the LLM returns) | The classifier prompt in `src/shared/classify.py` |
| Your team's document taxonomy (the categories the model classifies into) | The classifier prompt + the Routing Tree tab |
| Your team's folder structure | `DRIVE_FOLDERS` in `src/shared/config.py` and the Routing Tree tab |
| Your team's filename template | `NAME_TEMPLATE` in `src/shared/naming.py` |
| Your team's entities (subsidiaries, branded entities, divisions) | `ENTITIES` in `src/shared/config.py` |
| Intake paths specific to your inbox (the senders / subject patterns that warrant lower-threshold trusted handling) | `src/shared/intake.py` |
| Skip rules (what your team explicitly does not want filed) | `SKIP_RULES` in `src/shared/config.py` |
| Confidence thresholds | `INTAKE_THRESHOLDS` in `src/shared/confidence.py` |
| Known vendors / parties (boosts confidence on match) | The Known Vendors tab |
| Limited-access folders (folders the bot must not write into) | `LIMITED_ACCESS_FOLDERS` in `src/shared/config.py` |
A reasonable rollout sequence for a new team:
- Document your team's existing filing structure and filename convention. The bot enforces them — it doesn't invent them.
- Define your extraction schema: which fields the LLM should return for each document. These become your filename pieces, your routing inputs, and your audit log columns.
- Identify your top 5–10 most common document types and the folders they should land in. That's enough to seed the Routing Tree.
- Build against a test Drive that mirrors your production structure.
- Run a 30-day shadow period: bot files into the test Drive, humans continue manual filing, compare results.
- Cut over to production; expect to spend the first month tuning the Routing Tree and Known Vendors as edge cases surface.
Reference implementation: Python on Google Cloud Functions (2nd gen) + Vertex AI (Gemini 2.5 Pro) + Google Workspace (Gmail, Drive, Sheets).