docs: improve the find data page to include information about queries, cache tables, and MCP by dbirman · Pull Request #69 · AllenNeuralDynamics/aind-software-docs

dbirman · 2026-03-20T16:07:07Z

No description provided.

saskiad · 2026-03-20T20:46:19Z

docs/source/explore_analyze/find_data.md

 # Find data

-The [data portal](https://data.allenneuraldynamics.org/assets) is a tool for finding and exploring data assets. Currently, you can search all assets that have V2 metadata and easily click links to go to the Code Ocean data asset, metadata, and QC report.
+Each raw asset uploaded from a platform at AIND produces a group of derived assets, one per modality. You can find these assets easily by performing a query on our metadata database using your project name and other fields unique to your project. **All analyses at AIND should begin with a query that returns a group of data assets, filtered by passing quality control**.


This organization is true for phys/behavior, but not for other modalities.
For some spim, I think it's just one derived asset, which is fine because it's one modality.
But for other spim, I think there are many different derived assets that have more to do with clustering results in time.

Maybe we can just get rid of the first sentence and start with "You can find data assets by performing a query".

saskiad · 2026-03-20T20:50:03Z

docs/source/explore_analyze/find_data.md

+
+## Query DocDB
+
+DocDB queries are dictionaries (key-value pairs) that return a set of data assets. Analysis pipelines are required to use a query as the first step in gathering data for analysis **and** to filter assets according to passing quality control criteria. We recommend using the MCP server to gain familiarity with the patterns used for creating queries. 


I actually think we need more information here.
It's a MongoDB query, written in MongoDB's query syntax. These can be run in Python.
Helen can probably point us to some resources to direct people to, but I do think the last line, about using the MCP to develop the queries, is important.

oh, how is this meant to be different from the aind-data-access-api? I think I'm conflating the two - where/how would one do DocDB queries separate from the aind-data-access-api?

The query I anticipate for analysis workflows is using the aind-data-access-api, is that not true?

I cleaned it up, hopefully it makes more sense now
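
For readers of this thread, here is a minimal sketch of what "a MongoDB query as a dictionary" looks like in Python. It is not the aind-data-access-api: the field paths, record contents, and the `get_path` / `matches` helpers are hypothetical, and the in-memory matcher only stands in for the database so the example runs offline. It handles plain equality and the `$in` operator, nothing else.

```python
from typing import Any

# A MongoDB-style filter is a plain dict: dotted field paths map either to a
# value (equality) or to an operator document such as {"$in": [...]}.
# The field names and values below are illustrative assumptions.
query = {
    "data_description.project_name": "Example Project",
    "subject.subject_id": {"$in": ["000001", "000002"]},
}


def get_path(doc: dict, path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if it is missing."""
    cur: Any = doc
    for key in path.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return None
        cur = cur[key]
    return cur


def matches(doc: dict, flt: dict) -> bool:
    """Hypothetical stand-in for the database: supports equality and $in only."""
    for path, cond in flt.items():
        value = get_path(doc, path)
        if isinstance(cond, dict) and "$in" in cond:
            if value not in cond["$in"]:
                return False
        elif value != cond:
            return False
    return True


# Made-up records shaped like nested metadata documents.
records = [
    {"name": "asset_1",
     "data_description": {"project_name": "Example Project"},
     "subject": {"subject_id": "000001"}},
    {"name": "asset_2",
     "data_description": {"project_name": "Other Project"},
     "subject": {"subject_id": "000001"}},
    {"name": "asset_3",
     "data_description": {"project_name": "Example Project"},
     "subject": {"subject_id": "000003"}},
]

selected = [r["name"] for r in records if matches(r, query)]
print(selected)
```

In a real workflow the same `query` dict would be handed to a database client rather than to a local matcher; which client and method to use is covered by the linked docs, not by this sketch.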

saskiad · 2026-03-20T20:50:24Z

docs/source/explore_analyze/find_data.md

+
+DocDB queries are dictionaries (key-value pairs) that return a set of data assets. Analysis pipelines are required to use a query as the first step in gathering data for analysis **and** to filter assets according to passing quality control criteria. We recommend using the MCP server to gain familiarity with the patterns used for creating queries. 
+
+### AI (MCP Server)


I'd title this as "MCP Server (AI)"

saskiad · 2026-03-20T20:51:40Z

docs/source/explore_analyze/find_data.md

+
+### Fast queries through the cache
+
+Metadata queries to the database can be very slow. The [`zombie-squirrel`](https://github.com/AllenNeuralDynamics/zombie-squirrel/) package exposes a cache of some fields in the V2 metadata making them available with much lower latency. The metadata cache is updated at midnight, do not use it if you need immediate access to assets.


I'd put the last sentence as a parenthetical note.

Is it possible to list the fields that it caches so people know what they can use this for?

The tables are listed in the readme, I linked there. I'll also update the readme so it has more information about what fields are cached.

saskiad · 2026-03-25T17:58:27Z

docs/source/explore_analyze/find_data.md

+    qc_df = qc(subject_id=subject_id)
+    if qc_df.empty or "status" not in qc_df.columns:
+        continue
+    for _, row in subject_assets.iterrows():


I feel like there's an easier way to just ask if all metrics are status==Pass?

I'm fine with this example, but it feels complicated in a way that might overwhelm people. But not a deal breaker for me.
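
One possible shape for the simpler "are all metrics status==Pass?" check, as a hedged sketch with pandas. Only the `status` column comes from the diff above; the `asset_name` and `metric` columns and the table contents are assumptions made up for illustration, and the real QC table may be laid out differently.

```python
import pandas as pd

# Hypothetical QC table: one row per metric, per asset.
# "status" appears in the diff; "asset_name" and "metric" are assumed columns.
qc_df = pd.DataFrame(
    {
        "asset_name": ["asset_a", "asset_a", "asset_b", "asset_b"],
        "metric": ["drift", "noise", "drift", "noise"],
        "status": ["Pass", "Pass", "Pass", "Fail"],
    }
)

# For each asset, True only if every one of its metrics has status == "Pass".
all_pass = qc_df["status"].eq("Pass").groupby(qc_df["asset_name"]).all()

# Keep the names of assets where the per-asset flag is True.
passing_assets = all_pass[all_pass].index.tolist()
print(passing_assets)
```

This replaces the row-by-row loop with one grouped reduction, at the cost of assuming that every metric for an asset is present as a row in the table.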

saskiad

two small comments - one is a typo, the other a small suggestion that you are free to ignore.

dbirman added 3 commits March 19, 2026 16:26

docs: added content to the find_data page

8767dd4

docs: fix qc filter example with zs

e605b25

fix: link out to adap docs

e852b51

dbirman linked an issue Mar 20, 2026 that may be closed by this pull request

find data should include information on using MCP #49

Open

dbirman requested a review from saskiad March 20, 2026 19:54

saskiad reviewed Mar 20, 2026

saskiad reviewed Mar 20, 2026

saskiad requested changes Mar 20, 2026

docs: changes from review

221f134

dbirman requested a review from saskiad March 24, 2026 03:15

saskiad reviewed Mar 25, 2026

saskiad approved these changes Mar 25, 2026

docs: fix typo

4a8bd80

