This project provides a robust, DevSecOps-ready, reproducible ETL (Extract, Transform, Load) pipeline for end-to-end analysis of multi-year, multi-language Google Sheets survey data: raw extraction, cleaning, canonical mapping, advanced summaries, and executive-ready Excel outputs.
- Input: Raw survey CSVs (downloaded from Google Drive or other source folders).
- Process: Clean, normalize, translate, audit, map, summarize, and consolidate responses and questions.
- Output: Master Excel files for analysis; fully harmonized survey data for reporting or modeling.
root/
│
├── Makefile # Run extract, transform, or load scripts
├── .venv/ # Python virtual environment
├── creds/ # Google API credentials
│
├── data/
│ ├── configs/ # Config files (YAML, mappings, links)
│ ├── raw/ # All raw survey data (CSV, pre-clean)
│ └── processed/ # All cleaned, processed, and summary data
│
├── notebook/ # Jupyter notebooks and experiments
├── tests/ # Unit tests and pipeline QA scripts
├── utils/ # Shared Python utilities/helpers
│
├── scripts/
│ ├── extract/ # Scripts for data download, cleaning, auditing
│ ├── transform/ # Scripts for mapping, summaries, consolidation
│ └── load/ # Scripts for Excel/output/reporting
│
├── logs/ # Logging data and issues throughout the ETL pipeline
├── .env # Environment variables
├── .gitignore # Files and folders excluded from git
├── README.md # Project overview and documentation
├── SECURITY.md # Security and responsible disclosure policy
└── requirements.txt # Python dependencies
1. Extract
- Download all raw survey CSVs (English, Spanish, etc.) for all years/cohorts.
This pipeline’s first step (scrape_drive_links.py) automates the downloading of raw survey CSV files from Google Drive (or optionally Google Sheets, converted to CSV).
It uses the Google Drive and/or Google Sheets API for authenticated access to your organization’s files and ensures you always work with up-to-date, original data.
- Google Cloud Platform project with the Drive and/or Sheets API enabled
- Service Account credentials (or OAuth2 credentials) with access to the relevant files/folders
- The following Python packages: google-api-python-client, google-auth-httplib2, google-auth-oauthlib, gspread (for direct Google Sheets to CSV, optional), pandas, requests, tqdm (optional, for progress bars)

Install them with:

pip install google-api-python-client google-auth-httplib2 google-auth-oauthlib gspread pandas requests tqdm
- Create a Google Cloud project at https://console.cloud.google.com/
- Enable the APIs:
  - Google Drive API
  - (Optionally) Google Sheets API
- Create and download a Service Account key (JSON) or OAuth client secret.
- Share your survey folder/files with the Service Account email (if using a Service Account).
- Save your credentials (usually as service_account.json or credentials.json) in a secure location.
- Place your credentials file in a known location (e.g., secrets/service_account.json).
- Set an environment variable or .env entry: GOOGLE_APPLICATION_CREDENTIALS=secrets/service_account.json
- Update scrape_drive_links.py to read your folder IDs or search queries as needed.
When run, scrape_drive_links.py:

- Authenticates with Google Drive using your credentials.
- Lists all files in a given folder or matching a search query.
- Downloads each file as CSV (either as a raw file or converts a Google Sheet to CSV).
- Saves each file into a local directory (e.g., raw/ or data/raw/).
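The real scrape_drive_links.py handles configuration and logging beyond this, but a minimal sketch of that flow might look like the following (the folder ID, scope, and output directory are illustrative assumptions, not the script's actual values):

```python
import io
import os
from pathlib import Path

from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

# Credential path set during the setup above; read-only access is enough for downloads.
KEY_PATH = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]


def download_folder_as_csv(folder_id: str, out_dir: str = "data/raw") -> None:
    """List every file in a Drive folder and save each one locally as a CSV."""
    creds = service_account.Credentials.from_service_account_file(KEY_PATH, scopes=SCOPES)
    drive = build("drive", "v3", credentials=creds)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    files = drive.files().list(
        q=f"'{folder_id}' in parents and trashed = false",
        fields="files(id, name, mimeType)",
    ).execute().get("files", [])

    for f in files:
        if f["mimeType"] == "application/vnd.google-apps.spreadsheet":
            # Native Google Sheets have no binary content; export them as CSV.
            request = drive.files().export_media(fileId=f["id"], mimeType="text/csv")
        else:
            # Uploaded files (e.g. .csv) are downloaded as-is.
            request = drive.files().get_media(fileId=f["id"])

        buffer = io.BytesIO()
        downloader = MediaIoBaseDownload(buffer, request)
        done = False
        while not done:
            _, done = downloader.next_chunk()

        (out / f"{Path(f['name']).stem}.csv").write_bytes(buffer.getvalue())
```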
Run it with:

python -m scripts.extract.scrape_drive_links

This will:
- Authenticate using your Google credentials
- Download all target survey files (for each year/cohort) to your local raw/ directory
- Ensure that your pipeline always starts with the latest official data
- File type filtering (only download .csv or .gsheet files)
- Date filtering (only new/modified files)
- Automatic conversion of Google Sheets to CSV
- Logging of all downloaded file names, IDs, and timestamps
- Parallel downloads for large batches
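The file-type and date filters, for example, can be expressed directly in the Drive API search query passed to files().list(); the clauses below are illustrative, not the script's exact configuration:

```python
# Illustrative Drive v3 search clauses for the optional filters above.
folder_id = "YOUR_FOLDER_ID"  # placeholder

only_csv_or_sheets = (
    "(mimeType = 'text/csv' or mimeType = 'application/vnd.google-apps.spreadsheet')"
)
only_recent = "modifiedTime > '2024-09-01T00:00:00'"  # RFC 3339 timestamp

query = f"'{folder_id}' in parents and {only_csv_or_sheets} and {only_recent}"
```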
- Missing credentials: Script halts and prints a clear message
- API quota exceeded: Script sleeps and retries (or halts with a warning)
- No files found: Script exits and prints which query/folder was empty
- Partial download: Script can be resumed/restarted safely
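For the quota case, a simple retry wrapper along these lines could be used (a sketch under assumed limits, not the script's exact behavior):

```python
import time

from googleapiclient.errors import HttpError


def execute_with_retry(request, max_attempts: int = 5, base_delay: float = 5.0):
    """Execute a Drive API request, sleeping and retrying when the quota is exceeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request.execute()
        except HttpError as err:
            # 403 (rate limit) and 429 (too many requests) are worth retrying.
            if err.resp.status in (403, 429) and attempt < max_attempts:
                time.sleep(base_delay * attempt)  # simple linear backoff
            else:
                raise
```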
- Never commit credentials (JSON) to git or public repos
- Restrict service account to “read-only” where possible
- Use Google Groups to give bulk file access to your service account.
- For one-off manual downloads, use Google Drive web UI, but for reproducibility, always prefer API automation.
- Keep your service account credentials and folder IDs in .env or as command-line args for flexibility.
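One way to keep those values out of the code is to read them from the environment with optional command-line overrides; a hedged sketch (the variable and flag names here are examples, not the script's actual interface):

```python
import argparse
import os

# Folder IDs and the credentials path come from the environment (.env) by default,
# but can be overridden on the command line.
parser = argparse.ArgumentParser(description="Download raw survey files from Google Drive")
parser.add_argument("--folder-id", default=os.environ.get("SURVEY_FOLDER_ID"))
parser.add_argument("--credentials", default=os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"))
args = parser.parse_args()
```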
2. Transform
- Standardize headers, clean rows, handle missing or malformed data.
- Translate Spanish CSVs to English.
- Audit all unique questions/options; map variations to canonical/overarching questions.
- Generate summary tables (Likert, frequency, Yes/No, etc.) per year and group.
- Consolidate all responses and question summaries across years and groups.
- Summarize totals for cross-year and cross-group analysis.
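As an illustration of the cleaning and canonical mapping described above, a condensed pandas sketch (the column handling and the QCON_MAP excerpt are invented for the example; the real map lives in data/configs/):

```python
import pandas as pd

# Hypothetical excerpt of the canonical question map.
QCON_MAP = {
    "how satisfied are you with the program?": "program_satisfaction",
    "¿qué tan satisfecho está con el programa?": "program_satisfaction",
}


def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize headers, map question variants to canonical names, and clean cells."""
    # Lowercase and strip every header, then collapse known variants onto canonical names.
    df = df.rename(columns=lambda c: str(c).strip().lower())
    df = df.rename(columns=lambda c: QCON_MAP.get(c, c.replace(" ", "_")))
    # Trim whitespace in string cells and standardize common missing-value markers.
    df = df.apply(lambda s: s.str.strip() if s.dtype == "object" else s)
    return df.replace({"": pd.NA, "n/a": pd.NA, "N/A": pd.NA})
```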
3. Load
- Export all final and intermediate outputs to Excel, with tabs for every summary and year/group.
- Provide “master summary” and “full data” workbooks for stakeholders.
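Conceptually, the multi-tab export can be done with pandas and xlsxwriter; a minimal sketch (sheet names and paths are illustrative):

```python
from typing import Dict

import pandas as pd


def write_master_workbook(summaries: Dict[str, pd.DataFrame], path: str) -> None:
    """Write one worksheet per summary table into a single Excel workbook."""
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        for sheet_name, table in summaries.items():
            # Excel limits sheet names to 31 characters.
            table.to_excel(writer, sheet_name=sheet_name[:31], index=False)


# e.g. write_master_workbook(all_summaries, "data/processed/SF_Master_Summary.xlsx")
```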
- Python 3.8+
- Required Python packages: pandas, xlsxwriter (optionally: openpyxl, numpy, regex; unicodedata ships with the standard library)
- bash (for pipeline orchestration)

Install requirements (example):

pip install pandas xlsxwriter openpyxl
- Clone this repository:

  git clone <your-repo-url>
  cd <your-repo-folder>

- Set up your Python virtual environment (recommended):

  python -m venv venv
  source venv/bin/activate   # (On Windows: venv\Scripts\activate)

- Install all Python dependencies:

  pip install -r requirements.txt

- Run the pipeline with:

  ./run_etl.sh

  (Ensure it's executable: chmod +x run_etl.sh)
This will:
- Scrape and standardize raw data
- Translate (if needed)
- Audit, map, and consolidate all survey questions/responses
- Output results as SF_Master_Summary.xlsx and SF_Master_Data.xlsx in the data/processed/ directory
This project includes a Makefile for easy automation and orchestration of the ETL workflow.
- Run the full ETL pipeline (skipping Google Drive scraping and confirming before the final Excel output):

  make pipeline

- Run only the extract step (data loading, cleaning, translation, audit):

  make extract

- Run only the transform step (mapping, summary tables, consolidation):

  make transform

- Manually run the scraping step if you need to update the survey links:

  make scrape

- Run only the final Excel export step:

  make load

- Clean processed data outputs:

  make clean

- Run all unit tests:

  make test
- The Makefile allows you to run any ETL stage individually or in sequence.
- The default pipeline target skips the Google Drive scraping step and asks for confirmation before creating the final Excel output, so you won't overwrite results by accident.
- You can always run the original run_etl.sh script for fully automated execution.
Always run make commands from the project root directory.
The Makefile expects all scripts, data, and outputs to use the standard project structure.
All steps of the ETL pipeline are fully logged for transparency, debugging, and auditability.
Logs are stored in the logs/ directory, with separate subfolders for each ETL phase:
logs/
├── extract/
│ ├── extract.log # All extraction progress and info
│ └── extract_error.log # Extraction warnings and errors only
├── transform/
│ ├── transform.log # Data cleaning, mapping, consolidation steps
│ └── transform_error.log # Warnings and errors during transformation
├── load/
│ ├── load.log # Final Excel/report generation steps
│ └── load_error.log # Any errors or warnings during loading/output
- Every script in the pipeline logs to its respective ETL phase.
- Info, warning, and error messages are captured, including which files were processed, skipped, or had issues.
- Error logs make it easy to spot and debug failed data loads, missing columns, or API problems.
- Logs rotate automatically to prevent disk overflows.
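A sketch of how a per-phase logger with rotation and a separate error-only log could be configured (handler sizes and format are assumptions, not the scripts' exact settings):

```python
import logging
from logging.handlers import RotatingFileHandler


def get_phase_logger(phase: str) -> logging.Logger:
    """Return a logger that writes a full log and an error-only log for one ETL phase."""
    logger = logging.getLogger(phase)
    logger.setLevel(logging.INFO)

    info_handler = RotatingFileHandler(f"logs/{phase}/{phase}.log",
                                       maxBytes=5_000_000, backupCount=3)
    error_handler = RotatingFileHandler(f"logs/{phase}/{phase}_error.log",
                                        maxBytes=5_000_000, backupCount=3)
    error_handler.setLevel(logging.WARNING)  # warnings and errors only

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    for handler in (info_handler, error_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


# log = get_phase_logger("extract")
# log.info("Processed %s", "survey_2024_en.csv")
```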
This project uses GitHub Actions for full CI/CD, automated linting, and security checks:
- On every push or pull request, the linting, test, and security checks described below run automatically.
You can view build status, logs, and artifacts on the Actions tab.
Code is autoformatted and style-enforced with:
- Black (for code style, line length, and blank lines)
- Ruff (for code hygiene and unused code removal)
- Flake8 (for PEP8 and custom style rules)
To lint and auto-fix your code locally:
black . --line-length 120
ruff check scripts/ tests/ utils/ --fix
flake8 scripts/ tests/ utils/
bandit -r scripts/ utils/

- Column normalization: Consistently rename and strip all headers (lowercase, no spaces, uniform naming).
- String cleaning: Remove leading/trailing whitespace, standardize punctuation/quotes, handle Unicode artifacts.
- Missing value handling: Remove or fill empty rows/cells, standardize “N/A”, handle outliers.
- Canonical mapping: Use QCON_MAP and audit mapping to ensure all question variations are consolidated.
- Translation: Ensure all responses/options are in English for cross-year consistency.
- Multi-select/multi-pivot: Use canonical mappings to group, count, and analyze all subquestions/options as part of overarching categories.
- Percentages and rates: Compute %Yes/No/Maybe per group/year/option.
- Longitudinal tracking: Analyze trends in responses across years and cohorts.
- Visualization: Use Excel or Python/Streamlit dashboards for charts, pivots, and interactive summaries.
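To make the percentage computation concrete, here is a small pandas sketch for one canonical question (the 'year', 'group', and 'answer' column names are assumed for the example; the real summary scripts operate on the mapped canonical questions):

```python
import pandas as pd


def yes_no_rates(responses: pd.DataFrame) -> pd.DataFrame:
    """Percent of each answer option per year and group for one canonical question."""
    counts = (
        responses.groupby(["year", "group"])["answer"]
        .value_counts(normalize=True)   # share of each answer within a year/group
        .mul(100)
        .round(1)
        .rename("percent")
        .reset_index()
    )
    # One column per answer option (e.g. Yes / No / Maybe), indexed by year and group.
    return counts.pivot(index=["year", "group"], columns="answer", values="percent").fillna(0)
```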
- Each ETL step is checked: Pipeline halts if a key script or file is missing or if any step fails.
- Inter-step file checks: Verifies output files from each phase before moving to the next.
- Logging: All errors and successes are printed to console (or optionally, a log file).
- Custom checks: Can be extended to email, Slack, or alert on failure for production.
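An inter-step check of this kind can be as simple as verifying expected output paths before the next phase starts; a sketch with example paths (the real pipeline may verify different files):

```python
import sys
from pathlib import Path

# Example expected outputs per phase.
EXPECTED = {
    "extract": ["data/raw"],
    "transform": ["data/processed"],
    "load": ["data/processed/SF_Master_Summary.xlsx", "data/processed/SF_Master_Data.xlsx"],
}


def check_phase_outputs(phase: str) -> None:
    """Halt with a clear message if a phase did not produce its expected outputs."""
    missing = [p for p in EXPECTED[phase] if not Path(p).exists()]
    if missing:
        print(f"[{phase}] missing expected outputs: {', '.join(missing)}", file=sys.stderr)
        sys.exit(1)
```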
How to debug:
- Inspect the output/error message to see where the process halted.
- Review intermediate outputs (CSV/Excel) to verify data at each stage.
- Re-run just the failed script for rapid iteration.
This project is licensed under the MIT License. See the LICENSE file for details.
Developed by Dylan Picart at Partnership With Children.
Open an issue or submit a pull request! For help running or customizing the pipeline, contact dylanpicart@mail.adelphi.edu.