This file provides guidance to LLMs for working with code in this repository.
Richmond Sunlight Video Processor is a standalone pipeline for collecting, processing, and archiving Virginia General Assembly legislative video. It scrapes video metadata from the House and Senate, downloads videos to S3, generates screenshots, extracts transcripts, detects bills being discussed via OCR, identifies speakers, and uploads finalized assets to the Internet Archive.
This repository is part of the richmondsunlight.com collection:
- richmondsunlight.com - Main website (PHP front-end, source of the shared includes/ directory)
- rs-api - Public JSON API
- rs-machine - Data ingestion and processing (scrapers, parsers, cron jobs)
- rs-video-processor (this repo) - Video ingestion and analysis
Start the local Docker environment:
./docker-run.sh
This script builds and starts the container with PHP, ffmpeg, and other required tooling. The first run copies deploy/settings-docker.inc.php to includes/settings.inc.php if that file doesn't already exist.
Stop the Docker environment:
./docker-stop.sh
Run all tests:
./docker-tests.sh
This runs PHPUnit inside the Docker container. The test suite includes 40+ unit and integration tests.
Run tests on native host (requires ffmpeg):
./includes/vendor/bin/phpunit
Tests self-skip if prerequisites (ffmpeg, MP4 fixtures, API keys) are missing.
Fetch test video fixtures:
php bin/fetch_test_fixtures.php
This downloads sample MP4 files from video.richmondsunlight.com/fixtures for integration testing.
Install dependencies:
composer install
Note: Composer is configured to install dependencies to includes/vendor/ rather than the default vendor/ directory (matching the richmondsunlight.com pattern).
All scripts bootstrap via bin/bootstrap.php. Key entry points:
# Scrape metadata from House/Senate sources
php bin/scrape.php
# Download videos to S3
php bin/fetch_videos.php --limit=N
# Screenshot generation (enqueue mode for control plane)
php bin/generate_screenshots.php --enqueue
# Screenshot generation (worker mode for analysis box)
php bin/generate_screenshots.php --limit=N
# Transcript extraction
php bin/generate_transcripts.php --enqueue|--limit=N
# Bill detection via OCR
php bin/detect_bills.php --enqueue|--limit=N
# Speaker identification
php bin/detect_speakers.php --enqueue|--limit=N
# Upload to Internet Archive
php bin/upload_archive.php --limit=N
# Full pipeline orchestration
php bin/pipeline.php
The system runs across two server configurations:
- Lightweight Control Plane (EC2 Micro instance, shared with rs-machine)
  - Scraping and metadata collection
  - Video downloading to S3
  - Job enqueueing for analysis tasks
- Heavy Analysis Workers (GPU-capable EC2 instance)
  - Screenshot generation (ffmpeg)
  - Transcript processing (OpenAI Whisper)
  - Bill detection (Tesseract OCR)
  - Speaker detection (diarization)
Gating: Analysis only runs if /home/ubuntu/video-processor.txt exists on the instance.
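The gating check can be pictured as a simple file-existence test. The sketch below is a minimal illustration, assuming a helper named analysisEnabled (a hypothetical name, not taken from the repository); only the file path comes from the description above.

```php
<?php
// Hypothetical helper: the repository may perform this check differently.
function analysisEnabled(string $gateFile = '/home/ubuntu/video-processor.txt'): bool
{
    // Analysis is allowed only when the gate file is present.
    return is_file($gateFile);
}

echo analysisEnabled() ? "Analysis enabled\n" : "Gate file missing; analysis skipped\n";
```

Passing the path as a parameter keeps the check testable on machines that are not the analysis instance.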
Production uses an AWS SQS FIFO queue (rs-video-harvester.fifo). In Docker and in tests, the system falls back to an in-memory queue automatically.
// Queue selection is automatic via QueueFactory
$queue = $factory->build($queueUrl, $config);
rs-video-processor/
├── bin/ # CLI entry points
│ ├── bootstrap.php # Application initialization
│ ├── scrape.php # Metadata collection
│ ├── fetch_videos.php # Download to S3
│ ├── generate_screenshots.php # 1 FPS frame extraction
│ ├── generate_transcripts.php # Audio transcription
│ ├── detect_bills.php # OCR-based bill detection
│ ├── detect_speakers.php # Speaker identification
│ ├── upload_archive.php # Internet Archive uploads
│ └── pipeline.php # End-to-end orchestrator
├── src/ # Core application code (PSR-4)
│ ├── Bootstrap/ # App initialization
│ ├── Scraper/ # House/Senate metadata collection
│ │ ├── House/ # Sliq platform scraper
│ │ └── Senate/ # Granicus platform scraper
│ ├── Sync/ # Database synchronization
│ ├── Fetcher/ # Video downloading
│ ├── Screenshots/ # Frame extraction
│ ├── Transcripts/ # Audio transcription
│ ├── Analysis/ # Content analysis
│ │ ├── Bills/ # Bill detection via OCR
│ │ └── Speakers/ # Speaker identification
│ ├── Archive/ # Internet Archive uploads
│ └── Queue/ # Job orchestration (SQS/in-memory)
├── includes/ # Shared helpers (from richmondsunlight.com)
│ ├── settings.inc.php # Configuration constants
│ ├── class.Log.php # Logging with Slack integration
│ ├── class.Database.php # PDO connection wrapper
│ └── vendor/ # Composer dependencies
├── tests/ # PHPUnit test suite
│ └── fixtures/ # Test data (HTML, MP4s, JSON)
├── storage/ # Local file storage
│ ├── scraper/ # Scraped metadata snapshots
│ ├── pipeline/ # Pipeline output files
│ └── screenshots/ # Local screenshot cache
└── deploy/ # Deployment configuration
flowchart TD
subgraph Sources
House[House Video\nSliq Platform]
Senate[Senate Video\nGranicus Platform]
end
subgraph Pipeline
Scraper[Scraper]
JSON[JSON Snapshots]
DB[Database Sync]
S3[Download to S3]
Screenshots[Screenshot Generation]
Transcripts[Transcript Extraction]
Bills[Bill Detection\nOCR]
Speakers[Speaker Detection]
Archive[Internet Archive]
end
House --> Scraper
Senate --> Scraper
Scraper --> JSON
JSON --> DB
DB --> S3
S3 --> Screenshots
Screenshots --> Transcripts
Screenshots --> Bills
Screenshots --> Speakers
Transcripts --> Archive
Bills --> Archive
Speakers --> Archive
Pluggable Interfaces:
- HttpClientInterface - Swap HTTP implementations
- OcrEngineInterface - Replace Tesseract with OpenAI Vision or AWS Rekognition
- DiarizerInterface - Swap speaker detection providers
- QueueInterface - SQS or in-memory
Processor Pattern: Each analysis step has a dedicated processor class that handles job orchestration, execution, and result persistence.
Idempotency:
- Enqueue mode is safe to run repeatedly
- Worker mode acknowledges jobs only after successful processing
- Database prevents duplicate inserts
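The idempotency rules above can be sketched with a toy queue. Everything here is an assumption made for illustration: SketchQueue, its method names, and runWorker are hypothetical and do not mirror the repository's QueueInterface. The point is only the ordering: a job is acknowledged after processing succeeds, and a repeated enqueue of the same job ID does not create a duplicate.

```php
<?php
// Hypothetical stand-in for an in-memory queue; method names are assumptions.
final class SketchQueue
{
    /** @var array<string, array> pending jobs keyed by job ID */
    private array $jobs = [];

    public function enqueue(string $id, array $payload): void
    {
        // Re-enqueueing the same ID overwrites instead of duplicating.
        $this->jobs[$id] = $payload;
    }

    /** @return array<string, array> */
    public function receive(): array
    {
        return $this->jobs;
    }

    public function acknowledge(string $id): void
    {
        unset($this->jobs[$id]);
    }

    public function count(): int
    {
        return count($this->jobs);
    }
}

// Processes every pending job and returns how many succeeded.
function runWorker(SketchQueue $queue, callable $process): int
{
    $done = 0;
    foreach ($queue->receive() as $id => $payload) {
        try {
            $process($payload);
        } catch (Throwable $e) {
            // Leave the job unacknowledged so a later run retries it.
            continue;
        }
        $queue->acknowledge($id);
        $done++;
    }
    return $done;
}
```

A failed job stays in the queue, so running the worker again after fixing the cause picks it up without any extra bookkeeping.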
Uses the same MariaDB schema as richmondsunlight.com. Key tables:
- files - Video metadata (id, chamber, committee_id, title, path, dimensions, date)
- video_index - Per-second analysis results (file_id, time, type, linked_id)
- video_transcript - Transcript segments (file_id, start_time, end_time, text)
- people - Legislator directory for speaker matching
- committees - Committee metadata
No schema migrations required - works with existing richmondsunlight.com definitions.
Configuration is in includes/settings.inc.php. Key constants:
// Database
PDO_DSN, PDO_USERNAME, PDO_PASSWORD
// AWS
AWS_ACCESS_KEY, AWS_SECRET_KEY, AWS_REGION
VIDEO_SQS_URL // rs-video-harvester.fifo queue URL
// APIs
OPENAI_KEY // For transcription fallback
IA_ACCESS_KEY, IA_SECRET_KEY // Internet Archive
// Optional
SLACK_WEBHOOK // Logging integration
For Docker development, use deploy/settings-docker.inc.php as a template.
Videos and screenshots are stored in video.richmondsunlight.com:
video.richmondsunlight.com/
├── house/
│   ├── floor/
│   │   └── YYYYMMDD/
│   │       ├── video.mp4
│   │       └── screenshots/
│   └── {committee}/
│       └── YYYYMMDD/
└── senate/
    ├── floor/
    └── {committee}/
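The layout above implies a predictable key prefix per video. The following is a minimal sketch of how such a prefix might be built; videoKeyPrefix is a hypothetical helper, and the committee slug format ('finance' in the test data, for instance) is an assumption, since the layout only shows a {committee} placeholder.

```php
<?php
// Hypothetical helper; the function name and validation rules are assumptions.
function videoKeyPrefix(string $chamber, ?string $committee, DateTimeInterface $date): string
{
    if (!in_array($chamber, ['house', 'senate'], true)) {
        throw new InvalidArgumentException("Unknown chamber: {$chamber}");
    }
    // Floor sessions have no committee, so they live under "floor".
    $group = $committee ?? 'floor';
    return sprintf('%s/%s/%s/', $chamber, $group, $date->format('Ymd'));
}

$prefix = videoKeyPrefix('house', null, new DateTimeImmutable('2025-01-15'));
echo $prefix . "video.mp4\n";     // house/floor/20250115/video.mp4
echo $prefix . "screenshots/\n";  // house/floor/20250115/screenshots/
```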
Follow PSR-12 coding standards. The repository uses modern PHP 8.1+ patterns:
- Typed properties and return types
- Constructor property promotion
- Named arguments where clarity helps
- Variables: $camelCase in new code, $snake_case in legacy includes
- Methods: camelCase() in src/, snake_case() in includes/
- Classes: PascalCase
- Constants: UPPER_SNAKE_CASE
PSR-4 autoloader maps:
RichmondSunlight\VideoProcessor\ → src/
Legacy code in includes/ uses require/include.
House scraper (Sliq platform):
- Base URL: https://sg001-harmony.sliq.net/
- Listing endpoint: /en/api/Data/GetListViewData
- Detail pages contain JavaScript objects: mediaStartTime, AgendaTree, downloadMediaUrls, Speakers
- Rate limited: 1 request/second
Senate scraper (Granicus platform):
- Base URL: https://virginia-senate.granicus.com/
- Listing: /ViewPublisher.php?view_id=3
- Detail pages parsed from HTML tables
- Rate limited: 1 request/second
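Both scrapers are limited to one request per second. A minimal sketch of such a throttle follows; RequestThrottle and its methods are hypothetical names chosen for illustration, not the repository's implementation. The delay calculation is kept separate from the blocking call so it can be checked without sleeping.

```php
<?php
// Hypothetical throttle; class and method names are not from the repository.
final class RequestThrottle
{
    private ?float $lastRequestAt = null;

    public function __construct(private float $minIntervalSeconds = 1.0)
    {
    }

    /** Seconds the caller must still wait before a request made at time $now. */
    public function delayNeeded(float $now): float
    {
        if ($this->lastRequestAt === null) {
            return 0.0;
        }
        return max(0.0, $this->minIntervalSeconds - ($now - $this->lastRequestAt));
    }

    /** Records that a request was made at the given timestamp. */
    public function markRequest(float $at): void
    {
        $this->lastRequestAt = $at;
    }

    /** Blocks until the interval has elapsed, then records the request time. */
    public function wait(): void
    {
        $delay = $this->delayNeeded(microtime(true));
        if ($delay > 0) {
            usleep((int) round($delay * 1_000_000));
        }
        $this->markRequest(microtime(true));
    }
}
```

A scraper would call wait() immediately before each HTTP request to the same host.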
- Extracts 1 frame per second via ffmpeg
- Generates two sizes per frame (full resolution + thumbnail)
- Uploads to S3 with predictable naming for downstream OCR
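As a rough illustration of the extraction step, the sketch below assembles ffmpeg argument lists for one frame per second at full resolution and at thumbnail size. The function name, the 320-pixel thumbnail width, the JPEG quality setting, and the zero-padded file naming are all assumptions; only the 1 FPS rate and the two sizes come from the description above.

```php
<?php
// Hypothetical command builder; thumbnail width and file naming are assumptions.
function screenshotCommand(string $video, string $outDir, ?int $width = null): array
{
    $filter = 'fps=1';                      // one frame per second
    if ($width !== null) {
        $filter .= ",scale={$width}:-2";    // keep aspect ratio, force an even height
    }
    return ['ffmpeg', '-i', $video, '-vf', $filter, '-q:v', '2', "{$outDir}/%08d.jpg"];
}

$full  = screenshotCommand('video.mp4', 'screenshots/full');
$thumb = screenshotCommand('video.mp4', 'screenshots/thumb', 320);

// Print a shell-safe version of the thumbnail command for inspection.
echo implode(' ', array_map('escapeshellarg', $thumb)), "\n";
```

Sequential, zero-padded frame names map directly to seconds of video, which is what makes the naming predictable for the OCR step.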
Priority order:
- Existing WebVTT/SRT captions (preferred - reduces API costs)
- OpenAI Whisper API fallback
Audio is resampled to MP3 (mono, 32 kbps, 16 kHz) and split into chunks smaller than 25 MB before API submission.
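The audio settings determine how long each chunk can be. The sketch below works through that arithmetic and shows one way the resampling arguments might look; both function names are hypothetical, the limit is taken conservatively as 25,000,000 bytes, and container overhead is ignored.

```php
<?php
// Hypothetical helper: seconds of constant-bitrate audio that fit under a byte limit.
function maxChunkSeconds(int $bitrateKbps, int $limitBytes): int
{
    $bytesPerSecond = intdiv($bitrateKbps * 1000, 8);   // 32 kbps -> 4000 bytes/s
    return intdiv($limitBytes, $bytesPerSecond);
}

// Hypothetical ffmpeg arguments for mono, 16 kHz, 32 kbps MP3 output.
function resampleCommand(string $input, string $output): array
{
    return ['ffmpeg', '-i', $input, '-vn', '-ac', '1', '-ar', '16000', '-b:a', '32k', $output];
}

$seconds = maxChunkSeconds(32, 25 * 1000 * 1000);
printf("%d s (about %d minutes) per chunk\n", $seconds, intdiv($seconds, 60));
// prints "6250 s (about 104 minutes) per chunk"
```

At this bitrate a single chunk covers well over an hour, so only long sessions need to be split.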
- OCR-based chyron detection from screenshots
- Chamber-specific crop regions:
  - Senate: Bill number in top-right (dark red background)
  - House: Bill info in top banner (translucent gray background)
- Falls back to House agenda data when OCR fails
- Results stored in video_index table (type='bill')
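Once OCR has produced text from a cropped chyron, the bill number still has to be pulled out of it. The following sketch shows one possible approach with a regular expression. The function name, the set of prefixes (HB, SB, HJ, SJ, HR, SR), and the normalized output form are assumptions for illustration and may not match the repository's parser.

```php
<?php
// Hypothetical parser; the prefix list and normalization are assumptions.
function extractBillNumbers(string $ocrText): array
{
    // Tolerate optional dots and a space, e.g. "HB 1234", "S.B. 12", "hj5".
    $pattern = '/\b([HS])\.?\s?([BJR])\.?\s?(\d{1,4})\b/i';
    preg_match_all($pattern, $ocrText, $matches, PREG_SET_ORDER);

    $bills = [];
    foreach ($matches as $m) {
        // Normalize to an uppercase prefix and a number without leading zeros.
        $bills[] = strtoupper($m[1] . $m[2]) . (int) $m[3];
    }
    return array_values(array_unique($bills));
}

print_r(extractBillNumbers("HB 1234 Income tax; S.B. 12 and hb1234"));
// yields ['HB1234', 'SB12']
```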
- House: Extracts from cached Speakers metadata object
- Senate: OpenAI diarization for speaker boundaries
- Fuzzy-matches names to people table for legislator IDs
- Results stored in video_index (type='legislator')
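The fuzzy-matching step could look something like the sketch below, which uses PHP's built-in similar_text. This names a swapped-in technique rather than the repository's actual matcher: the function names, the honorific list, the 80 percent threshold, and the shape of the $people array (legislator ID => name) are all assumptions, and the names in the test data are fictional.

```php
<?php
// Hypothetical matcher; threshold and input shapes are assumptions.
function normalizeName(string $name): string
{
    $name = strtolower($name);
    // Drop a few honorifics (assumed list), then punctuation and extra spaces.
    $name = preg_replace('/\b(delegate|del|senator|sen)\b\.?/', '', $name);
    $name = preg_replace('/[^a-z ]/', '', $name);
    return trim(preg_replace('/\s+/', ' ', $name));
}

/** Returns the ID of the best match at or above the threshold, or null. */
function matchLegislator(string $spoken, array $people, float $minPercent = 80.0): ?int
{
    $bestId = null;
    $bestScore = 0.0;
    foreach ($people as $id => $name) {
        similar_text(normalizeName($spoken), normalizeName($name), $percent);
        if ($percent > $bestScore) {
            $bestScore = $percent;
            $bestId = $id;
        }
    }
    return $bestScore >= $minPercent ? $bestId : null;
}
```

Returning null below the threshold lets the caller skip an uncertain match instead of recording the wrong legislator.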
- Scraping/enqueueing runs on rs-machine EC2 instance (always-on, low-cost)
- Analysis workers run on separate GPU-capable instance (started on-demand)
- Deployed via GitHub Actions → AWS CodeDeploy
- Gated by presence of /home/ubuntu/video-processor.txt
The video processor EC2 instance is expensive and only started when there's work to do. The rs-machine repository contains a cron script (cron/start_video_processor.php) that:
- Checks for videos needing processing (download, screenshots, transcripts, bill detection, speaker detection, archive)
- Queries the instance state via AWS EC2 API
- Starts the instance only if work is pending and it's not already running
This runs every 30 minutes during legislative hours (noon-9 PM) via crontab.
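The start decision reduces to a small predicate. The sketch below is illustrative only: shouldStartInstance is a hypothetical name, the pending-count array shape is an assumption, and the check is stricter than "not already running" because it starts only from the EC2 stopped state, so an instance that is still stopping is left alone until the next run.

```php
<?php
// Hypothetical decision helper; the real script queries the database and the EC2 API.
function shouldStartInstance(array $pendingCounts, string $instanceState): bool
{
    $hasWork = array_sum($pendingCounts) > 0;
    // "stopped" is the EC2 state from which an instance can be started.
    return $hasWork && $instanceState === 'stopped';
}

$pending = ['download' => 0, 'screenshots' => 3, 'transcripts' => 1];
var_dump(shouldStartInstance($pending, 'stopped'));  // bool(true)
var_dump(shouldStartInstance($pending, 'running'));  // bool(false)
```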
- PHP syntax linting
- Composer install
- PHPUnit test suite
- Deploy to CodeDeploy (on master branch)
- Always use Docker for testing - ensures ffmpeg and other dependencies are available
- Queue fallback is automatic - in-memory queue used when SQS unavailable
- Test fixtures are large - run bin/fetch_test_fixtures.php to download sample videos
- Check the gating file - analysis won't run without /home/ubuntu/video-processor.txt
- Scraper output is JSON - check storage/scraper/ for debugging metadata collection
| Repository | Relationship |
|---|---|
| richmondsunlight.com | Source of includes/ directory (shared classes, settings) |
| rs-api | Consumes video_index and video_transcript data |
| rs-machine | Triggers video processor EC2 instance; provides committee/legislator data; runs cron/start_video_processor.php |
Videos are deduplicated using a composite key of chamber|date|duration_seconds. This is handled by:
- ExistingFilesRepository - reads existing videos from database, builds keys from length column
- MissingVideoFilter - compares scraped records against existing keys
The duration (in seconds) is used rather than video URL because URLs can change between scrapes while the video content remains the same.
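A minimal sketch of the composite-key comparison follows. The function names and record field names are hypothetical, and the HH:MM:SS format for the length column is an assumption; only the chamber|date|duration_seconds key shape comes from the description above.

```php
<?php
// Hypothetical sketch; function and field names are assumptions.
function lengthToSeconds(string $length): int
{
    // Assumes the length column holds HH:MM:SS.
    [$h, $m, $s] = array_map('intval', explode(':', $length));
    return $h * 3600 + $m * 60 + $s;
}

function dedupKey(string $chamber, string $date, int $durationSeconds): string
{
    return strtolower($chamber) . '|' . $date . '|' . $durationSeconds;
}

/** Returns only the scraped records whose key is not already known. */
function filterMissing(array $scraped, array $existingKeys): array
{
    $known = array_flip($existingKeys);
    return array_values(array_filter(
        $scraped,
        fn (array $v): bool => !isset(
            $known[dedupKey($v['chamber'], $v['date'], $v['duration_seconds'])]
        )
    ));
}
```

Because the URL is not part of the key, a record whose URL changed between scrapes still matches its existing row.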
The SpeakerMetadataExtractor supports two formats:
- Raw Sliq format: Speakers array with text and startTime fields
- Normalized format: speakers array with name and start_time fields
This flexibility allows processing both raw scraped data and normalized pipeline output.
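Handling both shapes might look like the sketch below. normalizeSpeakers is a hypothetical function, not the SpeakerMetadataExtractor API; only the four field names come from the description above, and start times are passed through untouched because their format is not specified here.

```php
<?php
// Hypothetical normalizer; only the field names come from the description above.
function normalizeSpeakers(array $metadata): array
{
    if (isset($metadata['Speakers'])) {
        // Raw Sliq format: text / startTime
        return array_map(
            fn (array $s): array => ['name' => trim($s['text']), 'start_time' => $s['startTime']],
            $metadata['Speakers']
        );
    }
    // Normalized format: name / start_time (or an empty list if neither key exists)
    return array_map(
        fn (array $s): array => ['name' => trim($s['name']), 'start_time' => $s['start_time']],
        $metadata['speakers'] ?? []
    );
}
```

Mapping both inputs to a single output shape means the downstream matching code only has to understand one format.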
Subcommittees use the same crop configuration as committees for bill detection OCR, since their video layouts are typically identical.
This is a production-critical system that processes legislative video for public transparency. Changes should be tested thoroughly before deployment.