protocol/document-filing-automation

Document Filing Automation Bot

An AI-powered email-to-Drive filing system. Documents arriving in a shared inbox are automatically read by a large language model, tagged with structured metadata, renamed per a configurable template, and filed into the correct folder of a shared Drive. Anything the model isn't confident about goes to a human review queue instead of being filed.

The system is document-type and team-agnostic — the same code runs for any team. What varies between teams is configuration: which fields to extract, how to name the file, where to put it. This README is the build guide for a Google Cloud + Gemini reference implementation.

Built by SEAD in collaboration with PLCS.


Table of Contents

  1. What it does
  2. Why this approach
  3. Architecture
  4. How a document gets filed
  5. The AI extractor
  6. Confidence thresholds and the exception queue
  7. Filename engine
  8. Folder routing
  9. Deduplication
  10. Format handling
  11. Auth model
  12. Audit log spreadsheet
  13. Reporting and monitoring
  14. Build guide
  15. Configuration changes after launch
  16. Operational runbook (light)
  17. Cost
  18. Security and compliance
  19. What you need to customize for your team

1. What it does

A new email arrives in your team's shared intake inbox. The bot extracts each attachment, asks an LLM to read the document and return a structured set of metadata fields (whichever fields your team needs), checks whether the model is confident enough, and then either:

  • Files it. The bot assembles a filename per your template, picks the target folder using your routing rules, and moves the document into your shared Drive — logging every action.
  • Routes it for review. The document goes to a Staging folder with a [REVIEW] prefix, and a row is added to a human-facing review queue.

Everything is configurable: extraction fields, filename template, folder routing, confidence thresholds, what counts as a trusted source, what to skip entirely.

2. Why this approach

Three things drove the architecture choice over a no-code automation platform:

  • Security. Documents and AI inference both stay inside your own Google Cloud organization. Nothing routes through a third-party automation platform.
  • Cost. Estimated ongoing infrastructure cost is under $10/month at typical small-team volume (tens to low hundreds of documents per month). A no-code alternative is roughly an order of magnitude more expensive at the same volume.
  • Control. All code lives in a repo your team owns. No vendor lock-in. Full audit trail in Cloud Logging.

The trade-off is that a no-code platform is faster to stand up. This implementation is meant to be the long-term home, not a prototype.

3. Architecture

                Shared intake inbox (Google Group)
                              |
                              v
          Bot user mailbox  --(Gmail watch)-->  Pub/Sub topic
                                                     |
                                                     v
                                          Cloud Function: filing
                                                     |
                       +------------------------------+------------------------------+
                       v                              v                              v
                  Gmail API                    Vertex AI (Gemini)                 Drive API
              (read message,                (extract structured              (move + rename file)
               attachments)                  metadata)                              |
                                                     |                              v
                                                     v                       Audit Log Sheet
                                          Confidence >= threshold?           (Filing log,
                                            yes -> file to Drive              Exception Queue,
                                            no  -> Staging + queue            Dashboard,
                                                                              live config)
                              ^
                              |
   Cloud Scheduler (cron) --> Cloud Function: Gmail watch renewer (every 6 days)
   Cloud Scheduler (cron) --> Cloud Function: monthly spot-check audit

Components

| Component | Purpose |
| --- | --- |
| Cloud Function document-filing | Receives Pub/Sub events, extracts metadata, files |
| Cloud Function watch-renewer | Renews the Gmail push-notification watch every 6 days |
| Cloud Scheduler gmail-watch-renewer | Cron (0 6 */6 * *) that hits the watch renewer |
| Cloud Scheduler spot-check-audit | Monthly cron that pulls a random sample of auto-filed docs for QA |
| Pub/Sub topic | Receives Gmail watch notifications |
| Service account | Runs the Cloud Function with domain-wide delegation |
| Audit log spreadsheet | Filing records, exception queue, dashboard, routing config |
| Filed-documents index | SHA-256 lookup so duplicate emails don't double-file (Sheet or Firestore) |

Compute layer

Google Cloud Functions, 2nd gen, Python 3.11+. Serverless, scales to zero, pay-per-invocation. Move to Cloud Run if you ever need request timeouts longer than 9 minutes.

4. How a document gets filed

End-to-end:

  1. New email arrives at the team's intake group address (e.g., intake@yourcompany.com, a Google Group).
  2. The Group delivers it to a dedicated bot user's mailbox.
  3. A Gmail watch on that mailbox fires a Pub/Sub notification with (emailAddress, historyId).
  4. The Cloud Function calls users.history.list(startHistoryId=last_known) to enumerate new messages since the last checkpoint.
  5. For each new message:
    1. Extract attachments.
    2. Compute a SHA-256 hash of each attachment and check it against the filed-documents index. If already filed, log and skip.
    3. Normalize formats the LLM can't read natively (see Format handling).
    4. Send the document content + email metadata to the LLM for extraction.
    5. Apply the confidence threshold for the matched intake path.
    6. If above threshold: assemble the filename per the template, move to the resolved folder, log to the Filing Log.
    7. If below threshold: move to a Staging folder with a [REVIEW] filename prefix and add a row to the Exception Queue.
  6. Advance the Gmail history checkpoint after successful enumeration.
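
The decision core of steps 5.2 and 5.5–5.7 can be sketched as pure logic. This is an illustration only; the function and field names are hypothetical, not the repo's actual API:

```python
import hashlib

def process_attachment(content: bytes, extracted: dict, threshold: float,
                       filed_index: set) -> str:
    """Illustrative per-attachment decision: dedup first, then the
    confidence gate. Names here are hypothetical stand-ins."""
    # Step 5.2: skip anything whose content hash is already filed
    digest = hashlib.sha256(content).hexdigest()
    if digest in filed_index:
        return "skipped-duplicate"
    filed_index.add(digest)
    # Steps 5.5-5.7: file automatically only above the threshold
    if extracted["confidence_score"] >= threshold:
        return "auto-filed"
    return "exception-queue"
```

For example, a 0.97-confidence document on a 0.95-threshold path auto-files, while the same bytes arriving a second time are skipped regardless of confidence.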

5. The AI extractor

The implementation uses Gemini 2.5 Pro via Vertex AI. Earlier versions used 2.5 Flash; Pro materially improved accuracy on complex documents at the cost of a few cents per extraction.

The model is prompted to read the document end to end and return a structured JSON object. The exact fields are whatever your team needs to track and route by — they're defined in the prompt, not hard-coded in the system. Common fields across most use cases:

| Field | Purpose |
| --- | --- |
| document_type | Your team's taxonomy of document types (whatever categories matter to you) |
| document_date | The relevant date on the document, YYYY-MM-DD |
| primary_entity | Which of your entities the document belongs to (subsidiary, branded entity, division, etc.), if applicable |
| vendor | The other party named in the document (vendor, customer, employee, etc.), named as appropriate to your domain |
| summary | One- to two-sentence plain-English description |
| confidence_score | 0.00–1.00 |
| folder_path | The model's suggested destination folder |
| flags | Array of edge-case indicators (vague subject, unclear entity, non-standard format, etc.) |

Add or remove fields freely. A finance team might extract invoice number and amount; an HR team might extract employee ID and document category; a procurement team might extract PO number and supplier. The system passes whatever fields you define through to the filename template, the routing logic, and the audit log columns.
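
As a concrete, hypothetical example of what the extractor returns for the common fields above, plus a required-field check of the kind the pipeline's fail-safes imply (field names mirror the table; the sample values and the 0.50 cap are illustrative):

```python
# A plausible extraction result for an invoice (values are made up)
extraction = {
    "document_type": "invoice",
    "document_date": "2024-03-15",
    "primary_entity": "Acme Subsidiary LLC",
    "vendor": "Globex Corp",
    "summary": "March 2024 invoice from Globex Corp for consulting services.",
    "confidence_score": 0.97,
    "folder_path": "Finance/Invoices",
    "flags": [],
}

def cap_if_incomplete(result: dict) -> dict:
    """If a required field is missing or empty, cap confidence at 0.50 so
    the document routes to human review rather than auto-filing."""
    missing = [f for f in ("document_type", "document_date") if not result.get(f)]
    if missing:
        result["confidence_score"] = min(result.get("confidence_score", 0.0), 0.50)
    return result
```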

Skip rules. Some attachments are intentionally dropped — no Drive upload, no log entry — because they aren't records your team wants in the filing system. Configure these per team. Common patterns: drafts of in-flight documents, automated bounce notifications, abuse-report forwards, marketing emails. Anything matching the configured skip rules is logged in Cloud Logging only and never reaches the audit Sheet.

6. Confidence thresholds and the exception queue

The confidence threshold is a gate, not an accuracy target. The model only auto-files when it's at least N% confident. Below that, the document goes to a human. This keeps a human in the loop for anything ambiguous.

Thresholds vary by intake path because some sources are more structured and trustworthy than others:

| Intake path | Threshold | Triggered when |
| --- | --- | --- |
| trusted_source | 95% | Email comes from a recognized sender pattern your team has whitelisted (e.g., an e-signature platform, a known internal forwarding alias, a specific vendor portal) |
| external | 99% | A configured "external" tag appears in the subject line |
| unknown | 99% | Anything else |

The trusted_source path is where you plug in your team's tooling. Define one entry per source you want to trust (sender pattern + threshold + any subject-line parser); add or remove sources as your team's stack changes.

Three confidence-boosting mechanisms run before the threshold check:

  1. Intake-path thresholds. Trusted sources auto-file at 95% instead of 99% because the surrounding metadata is highly structured.
  2. Known-vendor validation. The model's extraction is cross-referenced against a maintained list of known vendors/parties. A match boosts confidence by 3–5%.
  3. Email-subject parsing. Many automated completion emails contain structured metadata in the subject line. For each trusted sender pattern you configure, you can also configure a subject-line parser. When the model's extracted fields match two or more subject-parsed fields, confidence is boosted.
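
Put together, the threshold lookup and the two deterministic boosts amount to something like the following sketch. The boost sizes are illustrative midpoints; the real values live in shared/confidence.py:

```python
THRESHOLDS = {"trusted_source": 0.95, "external": 0.99, "unknown": 0.99}

def boosted_confidence(base: float, vendor_known: bool,
                       subject_matches: int) -> float:
    """Apply the deterministic boosts described above. Boost sizes here
    are illustrative (+4% vendor, +3% subject agreement)."""
    score = base
    if vendor_known:
        score += 0.04          # known-vendor match: spec says 3-5%
    if subject_matches >= 2:   # two or more subject-parsed fields agree
        score += 0.03
    return min(score, 1.0)     # boosts are additive, capped at 100%

def should_autofile(base: float, intake_path: str, vendor_known: bool,
                    subject_matches: int) -> bool:
    threshold = THRESHOLDS.get(intake_path, THRESHOLDS["unknown"])
    return boosted_confidence(base, vendor_known, subject_matches) >= threshold
```

Note the boosting is deterministic (no AI involved), so the same inputs always produce the same gate decision.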

Documents that don't clear the threshold land in:

  • A row in the Exception Queue tab of the audit log spreadsheet
  • A copy of the file in the Drive's "0. Staging" folder, with a [REVIEW] prefix
  • A Drive link in the queue row so a reviewer can open it in one click

7. Filename engine

The bot enforces whatever filename convention your team already uses — it doesn't impose one. The engine takes the LLM's extracted fields and assembles a filename per a template you define in src/shared/naming.py.

The template is a simple format string referencing the extraction fields:

NAME_TEMPLATE = "{document_date} {primary_entity} - {document_type} ({vendor}).{ext}"

That's the only place the format lives. Change the template, redeploy, done.

Practical notes the engine handles for you:

  • Field normalization. Dates are normalized to whatever format the template expects. Strings are trimmed and case-normalized.
  • Missing-field fallbacks. When an extracted field is empty, the engine substitutes a configurable fallback (e.g., "Unknown") rather than producing a malformed filename.
  • Filesystem safety. Slashes, colons, and other reserved characters in extracted strings are stripped or replaced.
  • Conflict resolution. When a target filename already exists in the destination folder, the engine appends a disambiguator (-2, -3, etc.).
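
A minimal sketch of that assembly logic, covering fallbacks, reserved-character stripping, and conflict suffixes. This illustrates the behavior; it is not the code in src/shared/naming.py:

```python
import re

NAME_TEMPLATE = "{document_date} {primary_entity} - {document_type} ({vendor}).{ext}"
FALLBACK = "Unknown"
_RESERVED = re.compile(r'[\\/:*?"<>|]')

def build_filename(fields: dict, ext: str, existing: set) -> str:
    """Assemble a filename from extracted fields: substitute a fallback
    for empty fields, strip filesystem-reserved characters, and append
    -2, -3, ... when the name already exists in the target folder."""
    safe = {k: _RESERVED.sub("", str(v)).strip() or FALLBACK
            for k, v in fields.items()}
    name = NAME_TEMPLATE.format(**safe, ext=ext)
    candidate, n = name, 1
    while candidate in existing:
        n += 1
        stem, _, suffix = name.rpartition(".")
        candidate = f"{stem}-{n}.{suffix}"
    return candidate
```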

Field-extraction reliability, in order:

  1. Document content (most reliable for entity, document type, and the other party)
  2. Email metadata (subject, sender, recipients, date — useful as fallback and cross-reference)

When document and email dates conflict, the engine prefers the date from the document content.

8. Folder routing

Folder routing is split between sheet-driven config (your team's domain experts edit a tab) and code-driven config (an engineer ships a code change and redeploys).

Sheet-driven (no code, no deploy)

Lives in tabs of the audit log spreadsheet. Changes go live within ~5 minutes — the bot reads the relevant tab on every extraction.

| Change | How |
| --- | --- |
| Add or edit a routing rule (which folder a doc type lands in) | Edit the Routing Tree tab. Add a row with node ID, category, folder key, triggers, subject matter, multi-folder rules, scope, and notes. |
| Add a known vendor (boosts confidence on match) | Add a row to the Known Vendors tab. |
| Create a new subfolder | Create the folder in Drive directly. The bot fuzzy-matches names to existing subfolders. |

Subfolder rules. The Routing Tree supports a per-row Subfolder? flag. When set, the bot auto-creates a subfolder named after one of the extracted fields (e.g., per-vendor or per-entity subfolders) so related documents group together rather than flat-piling in the parent folder.
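
The fuzzy matching against existing subfolders can be approximated with Python's difflib. This is a sketch of the idea, not the repo's implementation, and the 0.85 cutoff is an illustrative choice:

```python
import difflib

def resolve_subfolder(name: str, existing: list) -> str:
    """Reuse an existing subfolder when its name is close enough to the
    extracted field value; otherwise return the value so the caller can
    create a new subfolder with it."""
    lowered = [e.lower() for e in existing]
    match = difflib.get_close_matches(name.lower(), lowered, n=1, cutoff=0.85)
    if match:
        # return the existing folder's original casing
        return existing[lowered.index(match[0])]
    return name
```

For example, an extracted vendor "Globex Corp" would reuse an existing "Globex Corp." subfolder rather than creating a near-duplicate.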

Code-driven (engineer ships it)

| Change | What's involved |
| --- | --- |
| Add a new entity to the entity list | Edit ENTITIES in src/shared/config.py, redeploy. |
| Add a new top-level folder category | New entry in DRIVE_FOLDERS, possibly a new routing-tree node, redeploy. |
| Change a confidence threshold | Edit INTAKE_THRESHOLDS in src/shared/confidence.py, redeploy. |
| Add a new file format the LLM can't read natively | Add a handler in src/shared/format_handlers.py, redeploy. |
| Change the extraction prompt or filename template | Edit classify.py or naming.py, redeploy. |
| Change the AI model | Edit LLM_MODEL in config.py, redeploy. |

Multi-folder routing

Some documents legitimately belong in more than one folder. A small rules engine handles these:

  • The extractor returns requires_multi_filing: true and a list of candidate folders.
  • The rules engine consults a config table of known multi-folder document types.
  • The bot creates a copy in each target folder and logs both filings.
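
A sketch of that rules engine. The config entry and folder names below are made up for illustration; the real table lives in configuration:

```python
# Hypothetical config table: doc type -> folders that each get a copy
MULTI_FOLDER_RULES = {
    "insurance-certificate": ["Insurance", "Contracts/Active"],
}

def target_folders(doc_type: str, primary: str, wants_multi: bool) -> list:
    """The primary folder always gets the filing; extra copies happen only
    when the model flags requires_multi_filing AND the config table knows
    the document type."""
    folders = [primary]
    if wants_multi:
        folders += [f for f in MULTI_FOLDER_RULES.get(doc_type, ())
                    if f != primary]
    return folders
```

Requiring both the model flag and a config-table match keeps a hallucinated multi-filing suggestion from duplicating documents.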

Limited-access folders

Some destination folders have permissions the service account intentionally doesn't hold (e.g., HR-only, executive-only, sensitive program folders). When the routing logic sends a document to one of these, the bot does not attempt to write — it routes to Staging with a [REVIEW: limited-access folder] tag so a member of the owning team can move it manually.

9. Deduplication

The bot maintains a filed-documents index keyed by SHA-256 of the attachment content. The index stores: hash, original message ID, filed Drive path, and filing timestamp.

This handles the common duplicate patterns:

  • BCC back to the intake inbox after forwarding internally
  • Auto-forwards from internal aliases to the records inbox
  • Platforms that auto-deposit and send completion notifications to the same inbox

Storage: Google Sheet for simplicity, or Cloud Firestore for slightly better latency at scale. Both are within the free tier at typical volume.
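
A sketch of the index record and lookup (field names are illustrative):

```python
import hashlib
from datetime import datetime, timezone

def index_record(content: bytes, message_id: str, drive_path: str) -> dict:
    """One row of the filed-documents index, keyed by content hash."""
    return {
        "sha256": hashlib.sha256(content).hexdigest(),
        "message_id": message_id,
        "drive_path": drive_path,
        "filed_at": datetime.now(timezone.utc).isoformat(),
    }

def is_duplicate(content: bytes, index: dict) -> bool:
    """Identical bytes hash identically regardless of filename or sender,
    which is exactly what catches BCC-backs and auto-forwards."""
    return hashlib.sha256(content).hexdigest() in index
```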

10. Format handling

The LLM only reads PDF, images, and plain text natively. Other formats are normalized first:

| Source format | Normalization |
| --- | --- |
| .docx / .doc / .rtf / .odt | Uploaded to Drive as a Google Doc, exported as PDF |
| .eml | Parsed; inner attachments extracted and recursed; if no attachments, the plain-text body is sent |
| application/octet-stream | Magic-byte sniff (catches PDFs that senders mislabeled as generic binary) |
| Anything else | Routed to Staging with an [UNSUPPORTED-FORMAT] tag |
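
The octet-stream sniff comes down to a prefix check on well-known magic bytes. A minimal sketch; the repo's src/shared/format_handlers.py handler may check more signatures:

```python
def sniff_mime(data: bytes) -> str:
    """Magic-byte sniff for attachments labeled application/octet-stream.
    The signatures shown are the standard ones for each format."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    return "application/octet-stream"  # still unknown -> route to Staging
```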

11. Auth model

The bot runs as a dedicated service account with domain-wide delegation (DWD) in Google Workspace. DWD lets the service account impersonate specific users for specific scopes.

In practice, the service account impersonates two users:

  • Drive + Sheets → an admin user that holds the folder/sheet shares (e.g., bot-admin@yourcompany.com)
  • Gmail → the bot user that's a member of the intake Google Group (e.g., filing-bot@yourcompany.com)

This split keeps the impersonation scope minimal — the bot user only ever reads its own mailbox.

Authorized OAuth scopes:

https://www.googleapis.com/auth/drive
https://www.googleapis.com/auth/gmail.modify
https://www.googleapis.com/auth/spreadsheets

Configured in: Google Admin Console → Security → API Controls → Domain-wide Delegation.

Credentials handling. Service account keys are stored in Google Secret Manager, not in the function's environment variables, and rotated on a schedule.

12. Audit log spreadsheet

The audit log is a single Google Sheet that does triple duty: it's the write-side log of everything the bot does, the read-side config the bot consults on every extraction, and the human-facing review surface. Reviewers, ops people, and the bot all share one source of truth.

Eight tabs total, in three groups:

Audit / review (what the bot writes)

  1. Filing Log
  2. Exception Queue
  3. Spot-Check Audit
  4. Dashboard

Live config (what the bot reads on every extraction)

  1. Routing Tree
  2. Known Vendors

Documentation / state

  1. Routing Logic
  2. Gmail State

A bootstrap script (shared/dashboard.py::bootstrap) creates all eight tabs with the correct headers if they're missing — point it at a fresh sheet and it stands up the whole structure.

The exact column lists below reflect a generic schema. Add or rename columns to match the extraction fields your team configures — the audit log is meant to mirror your team's metadata.

12.1 Filing Log

Append-only history of every filing action. One row per document. Columns include the bot's metadata plus whatever extraction fields you configured.

| Column | Notes |
| --- | --- |
| Timestamp | ISO-8601 UTC |
| Status | auto-filed, exception-queue, skipped |
| Generated Filename | The renamed filename per the template |
| (Your extraction fields) | One column per field in your extraction schema (e.g., document type, primary entity, vendor, document date, summary) |
| Confidence | 0.00–1.00 |
| Target Folder(s) | Resolved folder path(s); supports multi-folder filing |
| Drive File ID(s) | Comma-separated Drive file IDs |
| Drive Link | Clickable hyperlink to the filed file |
| Email Subject | From the source message |
| Email Sender | From the source message |
| Intake Path | trusted_source, external, unknown |
| Original Filename | The attachment's name as it arrived |
| Processing Time (s) | Receipt to filing complete |
| Notes | Errors, warnings, routing-decision rationale |

12.2 Exception Queue

The reviewer's worklist. One row per document that didn't clear the threshold. The Action Taken column is the only field the reviewer fills in.

| Column | Notes |
| --- | --- |
| Timestamp, Original Filename, Suggested Filename | What it was, and what the model thought it should be |
| (Your extraction fields) + Confidence | Model output (may be partial or wrong) |
| Email Subject, Email Sender, Intake Path | Source context |
| Notes | Why it was flagged (e.g., "UNCLASSIFIABLE (confidence 0%)", "missing date") |
| Review Link | Clickable hyperlink to the staged file in Drive |
| Action Taken | Reviewer fills in: approved, corrected, rejected, or duplicate |

The queue auto-refreshes after each new exception is logged.

12.3 Spot-Check Audit

Monthly QA sample for already-auto-filed documents. A Cloud Scheduler cron picks 15+ random rows from that month's Filing Log on the 1st of each month and lands them here.

| Column | Notes |
| --- | --- |
| Audit Month | YYYY-MM-01 |
| Timestamp, Status, Generated Filename, (your extraction fields), Confidence, Target Folder(s) | Copied from the original Filing Log row |
| Drive Link | Clickable for verification |
| Verified Correct? | Reviewer fills in: True / False |
| Reviewer Notes | Free text |

12.4 Dashboard

A simple metric/value/details rollup that auto-refreshes. Sectioned with --- HEADER --- divider rows for readability:

--- VOLUME ---
  Total Documents Processed
  Auto-Filed
  Exception Queue

--- RATES ---
  Auto-File Rate            (Auto-filed / (auto-filed + exceptions))
  Exception Rate            (Target: <= 5%)
  Filing Accuracy           (from spot-check; Target: >= 95%)
  Avg Processing Time       (Target: <= 120s)

--- THIS MONTH (YYYY-MM) ---
  Month Auto-Filed
  Month Exceptions

--- TOP DOC TYPES ---       (top 10 by frequency, from Filing Log)
--- TOP ENTITIES ---        (top entities by filing volume)

12.5 Routing Tree (live config)

The authoritative routing table — the bot reads this on every extraction. Adding/editing a row is a no-deploy change. See Folder routing.

| Column | Purpose |
| --- | --- |
| Node ID | Stable internal key |
| Category | Top-level grouping |
| Folder Key | Maps to a Drive folder ID in config.py |
| Folder Path (Display) | Human-readable path for the sheet |
| Triggers / Doc Types | Comma-separated document types that route here |
| Subject Matter Description | Plain-language description fed to the model as context |
| Multi-Folder Copy | Optional secondary folder to also drop a copy into |
| Subfolder? | YES / YES (entity name) / blank; controls auto-creation of subfolders |
| Scope | active / inactive; flip to disable a row without deleting it |
| Notes | Free text |

12.6 Known Vendors (live config)

Two columns: Vendor Name, Notes (optional). A match against this list boosts extraction confidence by 3–5%.

12.7 Routing Logic (documentation)

A tab whose only purpose is to document the extraction-and-filing pipeline so reviewers and new admins can understand the system without reading code. The bot does not read this tab — it's purely human-facing. Each row is one step; columns are: Step, Name, What It Does, Owner (which file/system implements it), Fail-Safe (what happens if it fails).

Reference content (your team's pipeline may differ slightly):

| Step | Name | Owner | Fail-Safe |
| --- | --- | --- | --- |
| 0 | Intake Detection | shared/confidence.py | Unknown path uses 99% threshold |
| 1 | Fact Extraction | LLM prompt in classify.py | Missing required fields cap confidence at 50% |
| 2 | Category Classification | LLM with Routing Tree context | Below threshold → staging with [REVIEW] prefix |
| 3 | Confidence Boosting | shared/confidence.py (deterministic, no AI) | Boosts additive, capped at 100% |
| 4 | Folder Resolution | shared/routing_tree.py | Unknown folder key falls back to staging |
| 5 | Filename Construction | shared/naming.py | Missing fields filled with configured fallbacks |
| 6 | Subfolder Creation | shared/file.py | Reuses existing subfolder if present |
| 7 | Multi-Folder Copy | shared/file.py | Non-blocking; primary filing still succeeds |
| 8 | Deduplication | shared/file.py | Prevents staging from growing unbounded on reprocessing |
| 9 | Audit Logging | shared/audit_log.py | Sheets failure falls back to Cloud Logging |
| 10 | Dashboard Refresh | shared/dashboard.py | Non-blocking; refresh failure doesn't block filing |

12.8 Gmail State (operational state)

Two columns, one row per watched mailbox: emailAddress, lastHistoryId. This is the Gmail watch checkpoint — the cursor that tells the bot how far through Gmail's history it has read. The Cloud Function reads this row on every invocation, calls users.history.list(startHistoryId=lastHistoryId) to enumerate new messages, and writes the new historyId back after successful enumeration.

This is the most operationally critical tab in the spreadsheet. If you lose this value, the bot doesn't know where it left off and either re-processes everything from scratch or skips messages. Two implications when rebuilding:

  • Don't mix it into a tab with other data — keep it isolated so an accidental edit can't corrupt it.
  • If you'd rather use Cloud Firestore for this single piece of state, that's a reasonable swap — Firestore gives you better durability guarantees and stays out of human view.
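
The checkpoint discipline (read the cursor, enumerate the batch, advance only after the whole batch is enumerated) can be sketched with injected callables standing in for the Sheets/Firestore reads and writes and for users.history.list:

```python
def drain_history(read_checkpoint, write_checkpoint, list_history):
    """Yield new messages since the stored historyId, advancing the cursor
    only after the full batch was enumerated. A crash mid-batch therefore
    replays the batch on the next invocation, and dedup absorbs the
    repeats. The three callables are injected; in production they would
    wrap the Gmail State tab (or Firestore) and users.history.list."""
    start = read_checkpoint()
    messages, new_history_id = list_history(start)
    for msg in messages:
        yield msg                       # caller processes each message
    write_checkpoint(new_history_id)    # advance only after full enumeration
```

Because it is a generator, abandoning iteration partway through leaves the checkpoint untouched, which is the safe failure mode.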

13. Reporting and monitoring

Dashboard metrics

| Metric | Target | Definition |
| --- | --- | --- |
| Filing accuracy | ≥ 95% | Verified-correct share of the monthly spot-check sample |
| Exception rate | ≤ 5% | Exceptions / total processed |
| Avg processing time | ≤ 120 s | Seconds from email receipt to filing complete |
| Monthly filing count | (set per team) | Sanity check against historical volume |

Spot-check audit

Each month a Cloud Scheduler job picks 15+ random documents from that month's auto-filed items and lands them on the Spot-Check Audit tab. A human verifies each one is in the correct folder, correctly named, and that content matches the extraction. If accuracy drops below 95%, the system administrator is paged to look at model tuning or threshold adjustment.

Logs

  • Cloud Logging — every extraction and filing action emits structured logs (correlation ID, hash, intake path, confidence, decision)
  • Cloud Audit Logs — every API call against Gmail/Drive/Sheets is logged automatically
  • A defensive AUDIT| log line is emitted in addition to writing to the Sheet, so audit history survives even if the Sheets API call fails

14. Build guide

Rough scope for a single engineer who already has Google Workspace + Google Cloud admin access: about two weeks end-to-end, including a few days of integration testing.

14.1 Prerequisites

  • A Google Workspace organization where you can create service accounts and grant domain-wide delegation
  • A Google Cloud project (separate from your main org workloads is recommended — easier to scope IAM)
  • A shared Drive your team owns (you'll add the service account as Content Manager later)
  • A Google Group that will be your team's intake inbox
  • A bot user account that's a member of that Group
  • An LLM API enabled — Vertex AI Gemini for this implementation
  • Python 3.11+ on the build machine, plus gcloud CLI

14.2 Project setup

  1. Create the Google Cloud project. Enable: Cloud Functions, Cloud Build, Cloud Run, Cloud Scheduler, Pub/Sub, Vertex AI, Gmail API, Drive API, Sheets API, Secret Manager, Cloud Logging.
  2. Create a service account (e.g., filing-bot@<project>.iam.gserviceaccount.com).
  3. Generate a JSON key and store it in Secret Manager (do not commit it; the function will read it at startup).
  4. In the Google Workspace Admin Console, register the service account's client ID for Domain-wide Delegation with the three OAuth scopes listed in Auth model.
  5. Add the service account as Content Manager to the shared Drive that holds your filing structure.
  6. Add the bot user (the one the service account will impersonate for Gmail) to your team's intake Google Group.

14.3 Repo layout

src/
  filing/            # Pub/Sub-triggered filing
    main.py          # Cloud Function entry point
    pipeline.py      # extract -> dedup -> classify -> file
  shared/
    config.py        # ENTITIES, DRIVE_FOLDERS, intake-path detection, etc.
    confidence.py    # INTAKE_THRESHOLDS and boosting logic
    classify.py      # the LLM call and extraction prompt
    naming.py        # filename template engine
    format_handlers.py  # docx -> PDF, eml unpack, octet-stream sniff
    routing.py       # routing-tree lookup against the spreadsheet
    drive.py         # Drive API helpers
    sheets.py        # Sheets API helpers (audit log)
    dedup.py         # SHA-256 index
    dashboard.py     # generate_spot_check, refresh_exception_queue, refresh_dashboard
  watch_renewer/
    main.py          # renews the Gmail watch
deploy/
  cloudbuild.yaml
  scheduler.yaml     # cron jobs: watch-renewer, spot-check

14.4 Wire up the Gmail push notification

  1. Create a Pub/Sub topic (e.g., filing-email-events).
  2. Grant the Gmail service account publish permissions on the topic.
  3. From a script running as the bot user, call users.watch to register the watch:
    service.users().watch(userId='me', body={
        'topicName': 'projects/<PROJECT>/topics/filing-email-events',
        'labelIds': ['INBOX'],
        'labelFilterAction': 'include',
    }).execute()
  4. The watch expires after 7 days, so deploy the watch renewer Cloud Function and the Cloud Scheduler job that hits it every 6 days (0 6 */6 * *).

14.5 Deploy the filing function

gcloud functions deploy document-filing \
  --gen2 \
  --runtime=python311 \
  --region=us-central1 \
  --source=src/filing \
  --entry-point=on_pubsub \
  --trigger-topic=filing-email-events \
  --service-account=filing-bot@<PROJECT>.iam.gserviceaccount.com \
  --set-secrets=SA_KEY=filing-bot-key:latest \
  --memory=1Gi --timeout=540s

14.6 Stand up the audit log spreadsheet

Create a Google Sheet and run shared/dashboard.py::bootstrap to lay out the eight tabs described in Audit log spreadsheet with the correct headers. Share the sheet with the service account's impersonation user. Put the spreadsheet ID into config.py.

14.7 Test on a test Drive first

Strongly recommended: stand the system up against a test Drive that mirrors your production folder structure before pointing it at the real one. The reference implementation has a single config flag (DRIVE_FOLDERS_PROD vs DRIVE_FOLDERS_TEST) that flips between them.

A reasonable test plan:

  • Send a known structured email from a trusted source (e-signature platform, internal alias, etc.) with a real attachment — verify it auto-files correctly.
  • Send an email with the configured "external" tag — verify the higher threshold and routing.
  • Send a deliberately ambiguous attachment — verify it lands in the Exception Queue with a reasonable suggested filename.
  • Forward a .docx and an .eml — verify both normalize.
  • Send the same attachment twice from different addresses — verify dedup.

14.8 Production switchover

Three-step cutover:

  1. Add the service account as Content Manager on the production shared Drive.
  2. Add the bot user to your team's production intake Group so it receives mail.
  3. Flip the config toggle from test to production folders and redeploy.

Roll back by flipping the toggle back. Drive's version history reverses any moves/renames.

15. Configuration changes after launch

| Frequency | Change | Owner |
| --- | --- | --- |
| Most common | Add a routing rule, known vendor, or subfolder | Sheet edit by domain expert; live in ~5 min |
| Occasional | Add a new entity, top-level folder category, or document type | Code change + redeploy by engineer |
| Rare | Change the LLM model, prompt, or thresholds | Code change + redeploy by engineer |

16. Operational runbook (light)

Reviewing the exception queue

Open the Audit Log spreadsheet → Exception Queue tab. Each row is a doc that needs human review.

For each row:

  1. Click the Review Link to open the file in Drive.
  2. Verify the suggested filename — is the extracted metadata correct?
  3. If correct: approve and let the bot finish filing.
  4. If wrong: rename and file manually.
  5. Fill in the Action Taken column: approved, corrected, rejected, or duplicate.

Common low-confidence reasons

  • Scanned/image-only PDFs the model can't reliably read
  • Unusual document types not represented in the routing tree
  • Missing required fields in the document
  • Multiple candidates for the same field (e.g., several entity names mentioned)
  • Foreign-language documents

Triage by confidence band

| Band | Likely cause | What to do |
| --- | --- | --- |
| 80–98% | Model mostly sure but below the threshold | Usually verify and approve; the model is often right in this range |
| 50–79% | Uncertain about one or more fields | Check each field carefully; one of the extracted values may be wrong |
| < 50% | Couldn't extract reliably | Manual review; may be an unusual doc type to add to config |

Common operational issues

| Symptom | First thing to check |
| --- | --- |
| unauthorized_client errors | DWD scopes in Admin Console; client ID may need re-authorizing |
| Documents not being processed | Cloud Function logs; Vertex AI API enabled; service account key valid |
| Files not appearing in Drive | Service account has Content Manager on the shared Drive; target folder ID exists; not silently routed to Staging |
| Audit log not updating | Sheets API scope present in DWD; spreadsheet ID matches config.py; check fallback AUDIT\| lines in Cloud Logging |
| No new docs being processed | Gmail watch likely expired; Cloud Scheduler should auto-renew every 6 days, so trigger the renewer manually if the cron failed |

17. Cost

Order-of-magnitude monthly cost at ~50–100 documents/month:

| Item | Estimate |
| --- | --- |
| Cloud Functions | ~$0 (free tier covers 2M invocations/mo) |
| Vertex AI (Gemini Pro) | ~$3–10 depending on document length and volume |
| Cloud Scheduler | ~$0.10 (2 cron jobs) |
| Cloud Pub/Sub | ~$0 (free tier) |
| Cloud Logging | ~$0 (free tier) |
| Filed-documents index (Sheet or Firestore) | ~$0 (free tier) |
| Total | ~$3–10/month |

18. Security and compliance

  • Data residency. All document content and AI inference stay inside your Google Cloud organization. Vertex AI inference happens in the GCP region you select. Customer data is not used to train Google's foundation models (per Google Cloud's data processing terms).
  • Compliance. Vertex AI and the rest of GCP carry SOC 2 and ISO 27001. If your team has additional regulatory requirements (HIPAA, etc.), check Google's coverage matrix per service.
  • Access control. A dedicated service account with least-privilege IAM. No domain-wide admin access required. Scope the service account to specific Google Groups, Drive folders, and Sheets. Access fully revocable at any time.
  • Audit trail. Cloud Audit Logs (every API call), processing logs in the audit Sheet (every document), Cloud Logging (structured application logs).
  • Rollback. Drive version history reverses any individual move/rename. Cloud Function versioning gives you instant restore. Processing logs let you do targeted bulk rollback.
  • Secrets. Service account keys live in Secret Manager and rotate on a schedule. Cloud Functions read them at cold start, never log them.

19. What you need to customize for your team

Most of this is config you fill in once. The reference implementation is built so the same code runs for any team — you just point it at different folders, schemas, and naming rules.

| Customization | Where it lives |
| --- | --- |
| Your team's extraction schema (the JSON fields the LLM returns) | The classifier prompt in src/shared/classify.py |
| Your team's document taxonomy (the categories the model classifies into) | The classifier prompt + the Routing Tree tab |
| Your team's folder structure | DRIVE_FOLDERS in src/shared/config.py and the Routing Tree tab |
| Your team's filename template | NAME_TEMPLATE in src/shared/naming.py |
| Your team's entities (subsidiaries, branded entities, divisions) | ENTITIES in src/shared/config.py |
| Intake paths specific to your inbox (senders / subject patterns that warrant lower-threshold trusted handling) | src/shared/intake.py |
| Skip rules (what your team explicitly does not want filed) | SKIP_RULES in src/shared/config.py |
| Confidence thresholds | INTAKE_THRESHOLDS in src/shared/confidence.py |
| Known vendors / parties (boosts confidence on match) | The Known Vendors tab |
| Limited-access folders (folders the bot must not write into) | LIMITED_ACCESS_FOLDERS in src/shared/config.py |

A reasonable rollout sequence for a new team:

  1. Document your team's existing filing structure and filename convention. The bot enforces them — it doesn't invent them.
  2. Define your extraction schema: which fields the LLM should return for each document. These become your filename pieces, your routing inputs, and your audit log columns.
  3. Identify your top 5–10 most common document types and the folders they should land in. That's enough to seed the Routing Tree.
  4. Build against a test Drive that mirrors your production structure.
  5. Run a 30-day shadow period: bot files into the test Drive, humans continue manual filing, compare results.
  6. Cut over to production; expect to spend the first month tuning the Routing Tree and Known Vendors as edge cases surface.

Reference implementation: Python on Google Cloud Functions (2nd gen) + Vertex AI (Gemini 2.5 Pro) + Google Workspace (Gmail, Drive, Sheets).
