This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
See AGENTS.md for full architecture and design documentation. This file records Claude-specific guidance and session learnings.
# Run all tests (recommended — ensures ffmpeg available)
./docker-tests.sh
# Run tests natively
./includes/vendor/bin/phpunit
# Run a single test file
./includes/vendor/bin/phpunit tests/Scraper/SenateYouTubeScraperTest.php --display-warnings
# Install Composer dependencies (installs to includes/vendor/, not vendor/)
composer installVideoImporter stores video_index_cache with JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES, which writes "key": "value" (space after colon). Any SQL LIKE pattern matching against video_index_cache must include the space:
// CORRECT
'%"youtube_id": "' . $youtubeId . '"%'
// WRONG — matches nothing
'%"youtube_id":"' . $youtubeId . '"%'video_index.ignored is 'y'/'n'. Job queues insert placeholder rows with ignored='y' before work begins. Every query that selects from video_index for processing should add AND ignored != 'y' — otherwise placeholders are processed as real data.
S3KeyBuilder::build() returns chamber/floor/YYYYMMDD.mp4 when committeeShortname is null. If a video has committee_id = NULL due to classification failure, it will collide with the floor video path. When uploading, always re-derive the committee from video_index_cache fields (committee_name, event_type) rather than trusting the DB's committee_id.
The webvtt column (and others) uses utf8mb3, which rejects 4-byte Unicode sequences (emoji, supplementary characters) and null bytes. Strip them before inserting:
$captionContents = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $captionContents);Real Senate YouTube titles use "Senate of Virginia: {body} on YYYY-MM-DD [Status]" format with colon separators — not the dash-separated format that extractCommitteeFromTitle() currently handles. This is a known bug: the method needs to be updated to handle the colon-prefix format. See the plan file for details.
When the server's yt-dlp cookie authentication fails for Senate videos:
- Server generates
s3://video.richmondsunlight.com/uploads/manifest.jsonviabin/generate_upload_manifest.php(runs automatically in pipeline) - Local machine downloads and uploads via
scripts/fetch_youtube_uploads.sh(requiresyt-dlp,awscli,jq) - Server processes staged files via
bin/process_uploads.php(runs automatically in pipeline)
The shell script is safe to interrupt and resume — it skips already-uploaded files by checking aws s3 ls.
If downloads are throttled to ~2-3 MiB/s, YouTube's nsig extraction has failed. Fix with yt-dlp -U.
RawTextResolver::getFilesWithUnresolvedEntries() uses probabilistic age-based sampling to avoid wasting time on chronically unresolvable old entries:
AND (
DATEDIFF(NOW(), f.date) <= 30
OR RAND() < 30.0 / DATEDIFF(NOW(), f.date)
)Videos from the last 30 days are always attempted; older videos are sampled with decreasing probability.