docs: improve the find data page to include information about queries, cache tables, and MCP #69

Open
dbirman wants to merge 5 commits into main from 49-find-data-should-include-information-on-using-mcp
Member

@dbirman dbirman commented Mar 20, 2026

No description provided.

@dbirman dbirman linked an issue Mar 20, 2026 that may be closed by this pull request
@dbirman dbirman requested a review from saskiad March 20, 2026 19:54
# Find data

The [data portal](https://data.allenneuraldynamics.org/assets) is a tool for finding and exploring data assets. Currently, you can search all assets that have V2 metadata and easily click links to go to the Code Ocean data asset, metadata, and QC report.
Each raw asset uploaded from a platform at AIND produces a group of derived assets, one per modality. You can find these assets easily by performing a query on our metadata database using your project name and other fields unique to your project. **All analyses at AIND should begin with a query that returns a group of data assets, filtered by passing quality control**.
Contributor

This organization is true for phys/behavior. Not for other modalities.
For some spim, I think it's just one derived asset which is fine because it's one modality
But for other spim, I think there are many different derived assets that have more to do with clustering results in time.

We maybe can just get rid of the first sentence and start with "you can find data assets by performing a query"

Member Author

fixed


## Query DocDB

DocDB queries are dictionaries (key-value pairs) that return a set of data assets. Analysis pipelines are required to use a query as the first step in gathering data for analysis **and** to filter assets according to passing quality control criteria. We recommend using the MCP server to gain familiarity with the patterns used for creating queries.
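
As a rough illustration of what "a query is a dictionary" means, the sketch below builds a MongoDB-style filter and applies it to a few in-memory records with a tiny local matcher. The field names (`data_description.project_name`, `data_description.data_level`, `subject.subject_id`), the sample records, and the helpers `get_path` and `matches` are all assumptions made for this example. They are not the DocDB schema or the access API, and the matcher only handles equality and `$in`.

```python
from typing import Any

# A query is a plain dictionary. Keys are dotted paths into the metadata
# record; values are either a literal (equality) or an operator document.
# The field names here are hypothetical; check them against real records.
query = {
    "data_description.project_name": "Example Project",
    "data_description.data_level": "derived",
    "subject.subject_id": {"$in": ["000001", "000002"]},
}


def get_path(record: dict, dotted_key: str) -> Any:
    """Follow a dotted key through nested dictionaries; None if missing."""
    value: Any = record
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def matches(record: dict, query: dict) -> bool:
    """Local stand-in for the database: supports equality and $in only."""
    for key, condition in query.items():
        value = get_path(record, key)
        if isinstance(condition, dict) and "$in" in condition:
            if value not in condition["$in"]:
                return False
        elif value != condition:
            return False
    return True


# Made-up records standing in for data asset metadata.
records = [
    {"name": "asset-a",
     "data_description": {"project_name": "Example Project", "data_level": "derived"},
     "subject": {"subject_id": "000001"}},
    {"name": "asset-b",
     "data_description": {"project_name": "Example Project", "data_level": "raw"},
     "subject": {"subject_id": "000001"}},
    {"name": "asset-c",
     "data_description": {"project_name": "Other Project", "data_level": "derived"},
     "subject": {"subject_id": "000002"}},
    {"name": "asset-d",
     "data_description": {"project_name": "Example Project", "data_level": "derived"},
     "subject": {"subject_id": "000003"}},
]

matched_names = [r["name"] for r in records if matches(r, query)]
print(matched_names)
```

In practice the same dictionary would be handed to the database client rather than evaluated locally; the local matcher is only there so the query semantics can be read and tested without a connection.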
Contributor

I actually think we need more information here.
It's a MongoDB query that uses a particular language/organization. These can be run in Python.
Helen can probably point us to some resources to direct people to, but I do think the last line about using the MCP to develop the queries is important.

Contributor

oh, how is this meant to be different from the aind-data-access-api? I think I'm conflating the two - where/how would one do DocDB queries separate from the aind-data-access-api?

The query I anticipate for analysis workflows is using the aind-data-access-api, is that not true?

Member Author

I cleaned it up, hopefully it makes more sense now


### AI (MCP Server)
Contributor

I'd title this as "MCP Server (AI)"

Member Author

Sure


### Fast queries through the cache

Metadata queries to the database can be very slow. The [`zombie-squirrel`](https://github.com/AllenNeuralDynamics/zombie-squirrel/) package exposes a cache of some fields in the V2 metadata, making them available with much lower latency. The metadata cache is updated at midnight; do not use it if you need immediate access to assets.
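
To make the midnight-refresh caveat concrete, here is a minimal sketch of a check for whether an asset could already be in the cache. It assumes the refresh happens once, at midnight, in the same timezone as the timestamps being compared, and that it completes instantly. The helper names `last_midnight` and `cache_should_contain` are hypothetical and are not part of `zombie-squirrel`.

```python
from datetime import datetime, time, timezone


def last_midnight(now: datetime) -> datetime:
    """Most recent midnight at or before `now`, in the timezone of `now`."""
    return datetime.combine(now.date(), time.min, tzinfo=now.tzinfo)


def cache_should_contain(asset_created: datetime, now: datetime) -> bool:
    """True if the asset existed before the most recent (assumed) refresh."""
    return asset_created < last_midnight(now)


# Example timestamps, chosen only for illustration.
now = datetime(2026, 3, 24, 9, 0, tzinfo=timezone.utc)
created_yesterday = datetime(2026, 3, 23, 23, 0, tzinfo=timezone.utc)
created_today = datetime(2026, 3, 24, 1, 0, tzinfo=timezone.utc)

print(cache_should_contain(created_yesterday, now))
print(cache_should_contain(created_today, now))
```

An asset created after the last refresh would need a direct database query instead of the cache.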
Contributor

I'd put the last sentence as a parenthetical note.

Contributor

Is it possible to list the fields that it caches so people know what they can use this for?

Member Author

The tables are listed in the readme, I linked there. I'll also update the readme so it has more information about what fields are cached.

@dbirman dbirman requested a review from saskiad March 24, 2026 03:15
    qc_df = qc(subject_id=subject_id)
    if qc_df.empty or "status" not in qc_df.columns:
        continue
    for _, row in subject_assets.iterrows():
Contributor

I feel like there's an easier way to just ask if all metrics are status==Pass?

I'm fine with this example, but it feels complicated in a way that might overwhelm people. But not a deal breaker for me.
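
A sketch of the simpler "all metrics pass" check, written in plain Python so it runs without a database or the cache. Only the `status` column comes from the snippet under review; the `asset_name` and `metric` keys, the sample rows, and the helper `assets_passing_all_metrics` are assumptions for illustration. With a pandas DataFrame holding one asset's metrics, the equivalent test would be `(qc_df["status"] == "Pass").all()`.

```python
from collections import defaultdict

# Hypothetical QC rows, one per metric. Only "status" appears in the
# original snippet; the other keys are made up for this example.
qc_rows = [
    {"asset_name": "asset-a", "metric": "drift", "status": "Pass"},
    {"asset_name": "asset-a", "metric": "noise", "status": "Pass"},
    {"asset_name": "asset-b", "metric": "drift", "status": "Pass"},
    {"asset_name": "asset-b", "metric": "noise", "status": "Fail"},
    {"asset_name": "asset-c", "metric": "drift", "status": "Pending"},
]


def assets_passing_all_metrics(rows: list[dict]) -> list[str]:
    """Return the assets whose every recorded metric has status 'Pass'."""
    statuses = defaultdict(list)
    for row in rows:
        statuses[row["asset_name"]].append(row["status"])
    # Every asset in `statuses` has at least one row, so all() is never
    # evaluated on an empty list here.
    return sorted(
        name
        for name, values in statuses.items()
        if all(value == "Pass" for value in values)
    )


passing = assets_passing_all_metrics(qc_rows)
print(passing)
```

Assets with no QC rows at all never appear in the result, which matches the original snippet's choice to skip subjects whose QC table is empty.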

Contributor

@saskiad saskiad left a comment

two small comments - one is a typo, the other a small suggestion that you are free to ignore.

Development

Successfully merging this pull request may close these issues.

find data should include information on using MCP

2 participants