Skip to content

dylanpicart/google_sheet_survey_pipeline

Repository files navigation

Google Sheet Survey ETL Pipeline

CI

This project provides a robust, DevSecOps-ready reproducible ETL (Extract, Transform, Load) pipeline for end-to-end analysis of multi-year, multi-language Google Sheet survey data, including raw extraction, cleaning, canonical mapping, advanced summaries, and executive-ready Excel outputs.


Pipeline Overview

  • Input: Raw survey CSVs (downloaded from Google Drive or other source folders).
  • Process: Clean, normalize, translate, audit, map, summarize, and consolidate responses and questions.
  • Output: Master Excel files for analysis; fully harmonized survey data for reporting or modeling.

Project Structure

root/
│
├── Makefile                 # Run extract, transform, or load scripts
├── .venv/                   # Python virtual environment 
├── creds/                   # Google API credentials
│
├── data/
│   ├── configs/             # Config files (YAML, mappings, links)
│   ├── raw/                 # All raw survey data (CSV, pre-clean)
│   └── processed/           # All cleaned, processed, and summary data
│
├── notebook/                # Jupyter notebooks and experiments
├── tests/                   # Unit tests and pipeline QA scripts
├── utils/                   # Shared Python utilities/helpers
│
├── scripts/
│   ├── extract/             # Scripts for data download, cleaning, auditing
│   ├── transform/           # Scripts for mapping, summaries, consolidation
│   └── load/                # Scripts for Excel/output/reporting
│
├── logs/                    # Logging data and issues throughout the ETL pipeline
├── .env                     # Environment variables
├── .gitignore               # Files and folders excluded from git
├── README.md                # Project overview and documentation
├── SECURITY.md              # Security and responsible disclosure policy
└── requirements.txt         # Python dependencies


ETL Process

1. Extract

  • Download all raw survey CSVs (English, Spanish, etc.) for all years/cohorts.

Google API Extraction

Overview

This pipeline’s first step (scrape_drive_links.py) automates the downloading of raw survey CSV files from Google Drive (or optionally Google Sheets, converted to CSV). It uses the Google Drive and/or Google Sheets API for authenticated access to your organization’s files and ensures you always work with up-to-date, original data.


Requirements

  • Google Cloud Platform Project with Drive and/or Sheets API enabled

  • Service Account credentials (or OAuth2 credentials) with access to the relevant files/folders

  • The following Python packages:

    • google-api-python-client
    • google-auth-httplib2
    • google-auth-oauthlib
    • gspread (for direct Google Sheets to CSV, optional)
    • pandas
    • requests
    • tqdm (optional, for progress bars)

Install them with:

pip install google-api-python-client google-auth-httplib2 google-auth-oauthlib gspread pandas requests tqdm

Google API Setup

  1. Create a Google Cloud project at https://console.cloud.google.com/

  2. Enable the APIs:

    • Google Drive API
    • (Optionally) Google Sheets API
  3. Create and download a Service Account key (JSON) or OAuth client secret.

  4. Share your survey folder/files with the Service Account email (if using a Service Account).

  5. Save your credentials (usually as service_account.json or credentials.json) in a secure location.


Script Configuration

  • Place your credentials file in a known location (e.g., secrets/service_account.json).

  • Set an environment variable or .env entry:

    GOOGLE_APPLICATION_CREDENTIALS=secrets/service_account.json
    
  • Update scrape_drive_links.py to read your folder IDs or search queries as needed.


How the Extraction Script Works

  • Authenticates with Google Drive using your credentials.
  • Lists all files in a given folder or matching a search query.
  • Downloads each file as CSV (either as a raw file or converts a Google Sheet to CSV).
  • Saves each file into a local directory (e.g., raw/ or data/raw/).

Common Usage Example

python -m scripts.extract.scrape_drive_links

This will:

  • Authenticate using your Google credentials
  • Download all target survey files (for each year/cohort) to your local raw/ directory
  • Ensure that your pipeline always starts with the latest official data

Possible Extraction Script Features

  • File type filtering (only download .csv or .gsheet)
  • Date filtering (only new/modified files)
  • Automatic conversion of Google Sheets to CSV
  • Logging of all downloaded file names, IDs, and timestamps
  • Parallel downloads for large batches

Error Handling

  • Missing credentials: Script halts and prints a clear message
  • API quota exceeded: Script sleeps and retries (or halts with a warning)
  • No files found: Script exits and prints which query/folder was empty
  • Partial download: Script can be resumed/restarted safely

Security

  • Never commit credentials (JSON) to git or public repos
  • Restrict service account to “read-only” where possible

Pro Tips

  • Use Google Groups to give bulk file access to your service account.
  • For one-off manual downloads, use Google Drive web UI, but for reproducibility, always prefer API automation.
  • Keep your service account credentials and folder IDs in .env or as command-line args for flexibility.

2. Transform

  • Standardize headers, clean rows, handle missing or malformed data.
  • Translate Spanish CSVs to English.
  • Audit all unique questions/options; map variations to canonical/overarching questions.
  • Generate summary tables (Likert, frequency, Yes/No, etc.) per year and group.
  • Consolidate all responses and question summaries across years and groups.
  • Summarize totals for cross-year and cross-group analysis.

3. Load

  • Export all final and intermediate outputs to Excel, with tabs for every summary and year/group.
  • Provide “master summary” and “full data” workbooks for stakeholders.

Requirements

  • Python 3.8+

  • Required Python packages:

    • pandas
    • xlsxwriter
    • (optionally: openpyxl, numpy, unicodedata, regex)
  • bash (for pipeline orchestration)

Install requirements (example):

pip install pandas xlsxwriter openpyxl

How to Run the Pipeline

Installation and Requirements

  1. Clone this repository:

    git clone <your-repo-url>
    cd <your-repo-folder>
  2. Set up your Python virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # (On Windows: venv\Scripts\activate)
  3. Install all Python dependencies:

    pip install -r requirements.txt

  1. Run the pipeline with:

    ./run_etl.sh

    (Ensure it’s executable: chmod +x run_etl.sh)

This will:

  • Scrape and standardize raw data
  • Translate (if needed)
  • Audit, map, and consolidate all survey questions/responses
  • Output results as SF_Master_Summary.xlsx and SF_Master_Data.xlsx in the data/processed/ directory

Using the Makefile for Your ETL Pipeline

This project includes a Makefile for easy automation and orchestration of the ETL workflow.

Common Makefile Commands

  • Run the full ETL pipeline (skipping Google Drive scraping and confirming before the final Excel output):

    make pipeline
  • Run only the extract step (data loading, cleaning, translation, audit):

    make extract
  • Run only the transform step (mapping, summary tables, consolidation):

    make transform
  • Manually run the scraping step, if you need to update the survey links:

    make scrape
  • Run only the final Excel export step:

    make load
  • Clean processed data outputs:

    make clean
  • Run all unit tests:

    make test

How it works

  • The Makefile allows you to run any ETL stage individually or in sequence.
  • The default pipeline target skips the Google Drive scraping step and asks for confirmation before creating the final Excel output, so you won’t overwrite results by accident.
  • You can always run the original run_etl.sh script for fully automated execution.

Pro tip:

Always run make commands from the project root directory. The Makefile expects all scripts, data, and outputs to use the standard project structure.


Logging

All steps of the ETL pipeline are fully logged for transparency, debugging, and auditability. Logs are stored in the logs/ directory, with separate subfolders for each ETL phase:

logs/
├── extract/
│   ├── extract.log           # All extraction progress and info
│   └── extract_error.log     # Extraction warnings and errors only
├── transform/
│   ├── transform.log         # Data cleaning, mapping, consolidation steps
│   └── transform_error.log   # Warnings and errors during transformation
├── load/
│   ├── load.log              # Final Excel/report generation steps
│   └── load_error.log        # Any errors or warnings during loading/output
  • Every script in the pipeline logs to its respective ETL phase.
  • Info, warning, and error messages are captured, including which files were processed, skipped, or had issues.
  • Error logs make it easy to spot and debug failed data loads, missing columns, or API problems.
  • Logs rotate automatically to prevent disk overflows.

🚦 Continuous Integration & Quality Assurance

This project uses GitHub Actions for full CI/CD, automated linting, and security checks:

  • On every push or pull request:
    • All code is linted with Ruff and Flake8.
    • Security is checked with Bandit and pip-audit.
    • All tests are run with pytest, including code coverage reports.
    • Logs and coverage reports are uploaded as build artifacts on failure.

You can view build status, logs, and artifacts on the Actions tab.


🧹 Linting & Formatting

Code is autoformatted and style-enforced with:

  • Black (for code style, line length, and blank lines)
  • Ruff (for code hygiene and unused code removal)
  • Flake8 (for PEP8 and custom style rules)

To lint and auto-fix your code locally:

black . --line-length 120
ruff check scripts/ tests/ utils/ --fix
flake8 scripts/ tests/ utils/
bandit -r scripts/ utils/

Possible Approaches to Data Cleaning

  • Column normalization: Consistently rename and strip all headers (lowercase, no spaces, uniform naming).
  • String cleaning: Remove leading/trailing whitespace, standardize punctuation/quotes, handle Unicode artifacts.
  • Missing value handling: Remove or fill empty rows/cells, standardize “N/A”, handle outliers.
  • Canonical mapping: Use QCON_MAP and audit mapping to ensure all question variations are consolidated.
  • Translation: Ensure all responses/options are in English for cross-year consistency.

Next Steps (Pivot & Advanced Analysis)

  • Multi-select/multi-pivot: Use canonical mappings to group, count, and analyze all subquestions/options as part of overarching categories.
  • Percentages and rates: Compute %Yes/No/Maybe per group/year/option.
  • Longitudinal tracking: Analyze trends in responses across years and cohorts.
  • Visualization: Use Excel or Python/Streamlit dashboards for charts, pivots, and interactive summaries.

Error Handling & Quality Assurance

  • Each ETL step is checked: Pipeline halts if a key script or file is missing or if any step fails.
  • Inter-step file checks: Verifies output files from each phase before moving to the next.
  • Logging: All errors and successes are printed to console (or optionally, a log file).
  • Custom checks: Can be extended to email, Slack, or alert on failure for production.

How to debug:

  • Inspect the output/error message to see where the process halted.
  • Review intermediate outputs (CSV/Excel) to verify data at each stage.
  • Re-run just the failed script for rapid iteration.

License

This project is licensed under the MIT License. See the LICENSE file for details.


Author

Developed by Dylan Picart at Partnership With Children.

Open an issue or submit a pull request! For help running or customizing the pipeline, contact dylanpicart@mail.adelphi.edu.

About

Lightweight Python ETL using Google Cloud Platform to Extract, Transform, and Load Google Surveys for Exploratory Data Analysis on Streamlit Dashboard and for Excel viewing with modern DevSecOps practices.

Topics

Resources

License

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors