Download pre-generated tutorial data instead of generating it #1057

Open
bendichter wants to merge 4 commits into main from cache-tutorial-data

@bendichter
Collaborator

Problem

Tutorial test data generation runs on every E2E CI run and spends ~2-3 minutes fitting PCA across 50 units × 385 channels. This is unnecessary because the data is deterministic (seeded), so every run produces the same output.

Solution

Pre-generated tutorial data is now hosted as GitHub release assets and downloaded instead of generated:

  • Single-session data (SpikeGLX + Phy): 36MB compressed
  • Multi-session dataset (2 subjects × 2 sessions): 113MB compressed

Changes

  1. Backend: New download_test_data() and download_test_dataset() functions + /data/download and /data/download/dataset API endpoints
  2. Frontend: App tries downloading first, falls back to generation if offline or download fails
  3. CI: ExampleDataCache workflow caches tutorial data from GitHub release; E2E workflow restores it before tests
  4. Release: tutorial-test-data-v1 hosts the compressed archives
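
The download-first behavior in items 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ensure_tutorial_data` is a hypothetical name, and the `download` and `generate` callables stand in for whatever the backend actually exposes (for example `download_test_data()` and the existing generation code).

```python
from pathlib import Path
from typing import Callable


def ensure_tutorial_data(
    data_dir: Path,
    download: Callable[[Path], None],
    generate: Callable[[Path], None],
) -> str:
    """Make sure data_dir holds tutorial data and report how it got there.

    Hypothetical helper: the name and return values are assumptions.
    """
    data_dir = Path(data_dir)
    # Data already present (e.g. restored from a CI cache): skip all work.
    if data_dir.is_dir() and any(data_dir.iterdir()):
        return "present"
    data_dir.mkdir(parents=True, exist_ok=True)
    try:
        download(data_dir)
        return "downloaded"
    except OSError:
        # urllib's URLError is an OSError subclass, so offline failures
        # land here and fall back to local generation.
        generate(data_dir)
        return "generated"
```

A real implementation would probably also clean up a partially downloaded archive before falling back; that step is left out here.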

Benefits

  • E2E tests skip data generation entirely (data already present)
  • App users get tutorial data in seconds instead of minutes
  • Generation code still works as fallback (no breaking change)
  • Data is versioned via release tags: bump the tag whenever the generation code changes
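
One way the tag-based versioning could work, as a hedged sketch: the tag is the only value that changes, and both the download URL and the CI cache key are derived from it. The URL pattern is GitHub's standard release-asset download path; the owner, repository, asset name, and the `cache_key` format below are placeholders, not values taken from this PR.

```python
from urllib.parse import quote

# Tag named in this PR; bump it when the generation code changes.
DATA_TAG = "tutorial-test-data-v1"


def release_asset_url(owner: str, repo: str, tag: str, asset: str) -> str:
    """Build the download URL for a GitHub release asset."""
    return (
        f"https://github.com/{owner}/{repo}/releases/download/"
        f"{quote(tag)}/{quote(asset)}"
    )


def cache_key(tag: str) -> str:
    """Hypothetical CI cache key: a new tag yields a new key, forcing a refresh."""
    return f"tutorial-data-{tag}"
```

For example, `release_asset_url("example-org", "example-repo", DATA_TAG, "single-session.zip")` points at the `single-session.zip` asset of that release, where every argument except the tag is illustrative.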

bendichter force-pushed the cache-tutorial-data branch 2 times, most recently from 43f9cdf to dbb7ebe (February 13, 2026 13:06)

SpikeGLX recording data is generated locally (fast, just binary writes).
Phy sorting data is downloaded from a GitHub release asset (17MB),
avoiding ~2 min of PCA fitting per CI run.

- Split generate_test_data into _generate_spikeglx_data (fast) + Phy (slow)
- Add download_test_data: generates SpikeGLX + downloads pre-built Phy
- App tries download first, falls back to full generation if offline
- CI caches Phy data; E2E restores it before tests

Instead of downloading everything or generating everything:
- SpikeGLX recording data is generated locally (fast, ~10s)
- Phy sorting data is downloaded from GitHub release (17MB, avoids ~2min PCA)
- Falls back to full generation if download fails (offline support)
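
To illustrate why the recording side is cheap to generate locally, here is a rough sketch of "just binary writes": seeded samples packed as little-endian int16 and written in one call. It is not the PR's `_generate_spikeglx_data`; the function name, sample count, and value range are assumptions, and a real SpikeGLX recording also needs its companion `.meta` file, which this omits.

```python
import random
import struct
from pathlib import Path


def write_int16_recording(
    path: Path, n_channels: int = 385, n_samples: int = 1000, seed: int = 0
) -> int:
    """Write seeded int16 samples to path and return the byte count.

    Hypothetical helper; the defaults are illustrative only.
    """
    rng = random.Random(seed)  # fixed seed makes the output reproducible
    n_values = n_channels * n_samples
    values = [rng.randint(-512, 511) for _ in range(n_values)]
    # "<" forces little-endian regardless of the host platform.
    data = struct.pack(f"<{n_values}h", *values)
    Path(path).write_bytes(data)
    return len(data)
```

With the defaults this writes 385 × 1000 samples at 2 bytes each, and the same seed always yields the same bytes, which is the property that makes caching the slower Phy output safe.
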
@rly
Collaborator

rly commented Mar 6, 2026

Tutorial generation now takes 13 seconds on my Mac M1. Do we still want to cache the tutorial data and use that in tests?
