feat: add semantic search marimo notebook #581
Converts docs/semantic-search.ipynb to a marimo notebook with:
- Pinecone SDK 9.0.1 API (pc.indexes.*, pc.index(), new search signature)
- Refactored dataset preparation into prepare_sentences/to_records functions
- Keyword filtering with deduplication
- mo.ui.table for dataset inspection
- mo.status.progress_bar replacing tqdm
- mo.ui.run_button for safe index deletion
- Improved prose structure with explanations interspersed between code cells

Also pins notebook dependencies in pyproject.toml.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Replace print-based result output with mo.vstack containing a bold query header and a mo.ui.table, and fix three cells that were incorrectly configured as markdown cells. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Switch embedding model to multilingual-e5-large
- Refactor data prep: filter_pairs + extract_sentences(lang) to embed both English and Spanish sentences with prefixed IDs
- Upsert both languages to a single namespace
- Add lang column to search results table
- Add cross-lingual queries (English + Spanish) and a no-keyword query to demonstrate meaning-over-keywords retrieval
- Add language filtering section with lang= parameter on search()
- Update How It Works to explain model selection's role in vector space
- Improve prose throughout querying and filtering sections

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
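The data-prep refactor described in this commit can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the function names come from the commit message, but their signatures, the flattened `{"en": ..., "es": ...}` pair shape, and the `_id`/`text`/`lang` record fields are all assumptions made for the example.

```python
def filter_pairs(pairs, keywords):
    """Keep pairs whose English side mentions a keyword, skipping duplicate English sentences.

    Assumed input shape: a list of dicts like {"en": "...", "es": "..."}.
    """
    seen = set()
    kept = []
    for pair in pairs:
        english = pair["en"]
        if english in seen:
            continue  # deduplicate on the English sentence
        if any(keyword in english.lower() for keyword in keywords):
            seen.add(english)
            kept.append(pair)
    return kept


def extract_sentences(pairs, lang):
    """Turn one language column into records with language-prefixed IDs.

    The field names here are illustrative assumptions, not the notebook's schema.
    """
    return [
        {"_id": f"{lang}-{i}", "text": pair[lang], "lang": lang}
        for i, pair in enumerate(pairs)
    ]


pairs = [
    {"en": "The dog barks.", "es": "El perro ladra."},
    {"en": "The dog barks.", "es": "El perro ladra fuerte."},
    {"en": "I like tea.", "es": "Me gusta el té."},
]
kept = filter_pairs(pairs, ["dog"])
records = extract_sentences(kept, "en") + extract_sentences(kept, "es")
```

Prefixing the IDs with the language code keeps the English and Spanish versions of the same pair from colliding when both are upserted to a single namespace.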
- Add Try It Yourself section with mo.ui.text and mo.ui.radio for language filter, results update reactively on input change
- Fix empty cell and duplicate filter query cells
- Correct second language filter query to Spanish

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
        top_k=top_k,
        inputs={"text": query},
        filter={"lang": {"$eq": lang}} if lang else None,
    )
Incorrect index.search call shape
High Severity
index.search passes top_k, inputs, and filter as top-level keyword arguments without a query dict. For integrated text search, Pinecone v9 still expects query={"inputs": {"text": ...}, "top_k": ..., "filter": ...}, so these calls will raise a TypeError or fail validation and break all query cells.
Reviewed by Cursor Bugbot for commit 72b422c. Configure here.
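To make the two call shapes in the review comment above concrete, here is a small sketch of a helper that builds the nested `query` dict the reviewer describes. `build_query` is a hypothetical name, not part of the notebook or the Pinecone SDK, and whether the installed SDK version expects this nested form or the top-level keyword form used in the diff should be checked against its documentation.

```python
def build_query(text, top_k=5, lang=None):
    """Build the nested query dict described in the review comment.

    Hypothetical helper: it only assembles a plain dict and does not call Pinecone.
    """
    query = {"inputs": {"text": text}, "top_k": top_k}
    if lang:
        # Metadata filter: only return hits whose `lang` field equals the requested language.
        query["filter"] = {"lang": {"$eq": lang}}
    return query


spanish_only = build_query("where is the train station?", top_k=3, lang="es")
unfiltered = build_query("where is the train station?")

# Assumed usage, per the review comment (not verified against SDK 9.0.1):
# results = index.search(namespace="...", query=spanish_only)
```

Omitting the `filter` key entirely when no language is chosen avoids passing `filter=None`, which is one of the differences between this shape and the call in the diff.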
numpy and tqdm were both added during development but are no longer used in the notebook; tqdm was replaced by mo.status.progress_bar. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
datasets and pinecone were added to [project.dependencies] by marimo's package manager during development. Notebook-specific deps belong in the notebook's inline PEP 723 metadata (# /// script block), not the root project config. Run with --sandbox to use the inline deps. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
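For reference, a PEP 723 inline metadata block of the kind this commit refers to looks roughly like the header below. The two pinned versions are the ones mentioned in this PR; the `requires-python` bound and the bare `marimo` entry are assumptions for illustration. The small parser is a hypothetical helper included only to show that the block is ordinary comment text a tool can read.

```python
import re

# Example notebook header. Only datasets==3.5.1 and pinecone==9.0.1 come from the PR;
# the other lines are illustrative assumptions.
NOTEBOOK_HEADER = '''\
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "datasets==3.5.1",
#     "marimo",
#     "pinecone==9.0.1",
# ]
# ///
'''


def inline_dependencies(source):
    """Return the quoted dependency strings found in a `# /// script` block."""
    block = re.search(r"(?ms)^# /// script$(.*?)^# ///$", source)
    if block is None:
        return []
    # Dependency entries are comment lines holding a single quoted string.
    return re.findall(r'^#\s+"([^"]+)",?$', block.group(1), flags=re.M)


deps = inline_dependencies(NOTEBOOK_HEADER)
```

With the dependencies declared in the notebook itself, running in sandbox mode (`uvx marimo edit --sandbox`, as in the test plan) resolves them without touching the root project config.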
No pyproject.toml changes on this branch that would affect the lock file. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit fc1dc00. Configure here.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
## Summary

Follow-up improvements to the semantic search marimo notebook merged in #581.

## Changes

**Multilingual support**
- Switch embedding model to `multilingual-e5-large` for cross-lingual retrieval
- Embed both English and Spanish sentences from Tatoeba using `filter_pairs` + `extract_sentences(lang)`
- Add cross-lingual query examples and a language filtering section using Pinecone metadata filters

**Interactivity**
- Interactive query input with `mo.ui.text` and `mo.ui.radio` language selector
- Interactive API key input: reads `PINECONE_API_KEY` from env/`.env` with a `mo.ui.text(kind="password")` fallback for molab users; uses `mo.callout` admonitions for each state

**Display**
- Search results rendered as `mo.ui.table` with a `lang` column showing which language each hit came from
- Progress bar via `mo.status.progress_bar` (replacing tqdm)

**Correctness / hygiene**
- Pin `datasets==3.5.1`: `datasets>=4` dropped support for custom loading scripts used by `Helsinki-NLP/tatoeba`
- Use keyword argument names in all Pinecone API calls
- Remove unused `numpy` and `tqdm` dependencies
- Remove notebook-specific deps from root `pyproject.toml` (they belong in the notebook's `# /// script` inline metadata)

## Test Plan

- [ ] Notebook runs end-to-end in sandbox mode (`uvx marimo edit --sandbox`) with a valid `PINECONE_API_KEY`
- [ ] Password input appears when env var is unset; success callout appears when set
- [ ] Cross-lingual queries return results in both English and Spanish
- [ ] Language filter correctly scopes results to `en` or `es`
- [ ] Interactive query input updates results on change

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Low Risk**
> Low risk documentation/notebook-only changes that adjust dependency pinning and interactive API key handling; no production code paths affected.
>
> **Overview**
> Improves the `docs/semantic-search.py` semantic search marimo notebook setup experience by **pinning `datasets==3.5.1`** and replacing the env-only Pinecone key requirement with an **interactive API key input** (env/`.env` auto-detect + password field fallback with callouts).
>
> Adds a guard (`mo.stop`) to halt execution until a key is provided and introduces a brief section clarifying client instantiation (including the example-only `source_tag`).
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 6da4180. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

Co-authored-by: Claude Code <claude@anthropic.com>
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
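The API key handling described in this commit (environment variable first, password field as fallback, halt until a key exists) can be sketched as a small decision function. `resolve_api_key` and its state labels are hypothetical names for illustration; the marimo calls in the trailing comment show how such a function might be wired to `mo.stop` and `mo.callout`, and are an assumption about the notebook rather than a copy of it.

```python
import os


def resolve_api_key(typed_value="", environ=None):
    """Pick the Pinecone API key and report where it came from.

    Returns (key, state) where state is "env", "input", or "missing".
    Hypothetical helper; the state names are illustrative.
    """
    environ = os.environ if environ is None else environ
    env_key = environ.get("PINECONE_API_KEY", "").strip()
    if env_key:
        return env_key, "env"
    typed = (typed_value or "").strip()
    if typed:
        return typed, "input"
    return None, "missing"


key, state = resolve_api_key(typed_value="", environ={})

# Assumed wiring inside a marimo cell (not taken from the notebook):
# key, state = resolve_api_key(api_key_input.value)
# mo.stop(key is None, mo.callout("Enter a Pinecone API key to continue.", kind="warn"))
```

Keeping the decision in a plain function makes each state easy to check without a running notebook, while the UI cell only has to map the state to a callout.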


Summary
Adds a new marimo notebook demonstrating semantic search with Pinecone, converted and significantly expanded from the existing docs/semantic-search.ipynb. The notebook uses Pinecone's Integrated Inference with the multilingual-e5-large model to demonstrate cross-lingual semantic search across English and Spanish sentences.

Changes
- docs/semantic-search.py (marimo format) with:
  - Pinecone SDK 9.0.1 API (pc.indexes.*, pc.index(), updated search signature)
  - multilingual-e5-large embedding model for cross-lingual retrieval
  - filter_pairs + extract_sentences(lang) to produce both English and Spanish records from Tatoeba
  - to_records parameterized on column name with ID prefixes for multi-language upsert
  - mo.ui.table for dataset inspection, mo.status.progress_bar replacing tqdm, mo.ui.run_button for safe index deletion
  - mo.ui.text and mo.ui.radio for language filter
  - […] en/es
- pyproject.toml: pins notebook dependencies (datasets==3.5.1, pinecone==9.0.1, numpy, tqdm)

Test Plan
- […] PINECONE_API_KEY
- […] en or es

🤖 Generated with Claude Code
Note
Low Risk
Low risk: adds a new documentation notebook only, with no changes to production code paths; the main impact is on users running the example (it creates/deletes a Pinecone index).
Overview
Adds a new docs/semantic-search.py Marimo notebook that walks through building a semantic search demo with Pinecone Integrated Inference, including index creation for multilingual-e5-large, dataset filtering/record preparation for English+Spanish, batched upsert_records, and index.search with optional lang metadata filtering.

The notebook also adds interactive UI elements for querying and a run-button gated cleanup step to delete the created index.
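The batched upsert mentioned in the overview can be illustrated with a short sketch. The helper names and the default batch size are assumptions for the example, and the upsert step is passed in as a callable so that nothing here depends on a particular Pinecone client signature.

```python
def batched(records, batch_size):
    """Yield consecutive slices of `records` with at most `batch_size` items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def upsert_in_batches(upsert, records, batch_size=50):
    """Call `upsert(batch)` once per batch and return the number of records sent.

    `upsert` is any callable taking a list of records. In the notebook it would
    presumably wrap the index's upsert_records call (an assumption), with the
    loop wrapped in mo.status.progress_bar for progress reporting.
    """
    sent = 0
    for batch in batched(records, batch_size):
        upsert(batch)
        sent += len(batch)
    return sent


calls = []
total = upsert_in_batches(calls.append, [{"_id": str(i)} for i in range(5)], batch_size=2)
```

Sending records in fixed-size slices keeps each request small and gives the progress bar a natural unit to count.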
Reviewed by Cursor Bugbot for commit a233d4b. Bugbot is set up for automated code reviews on this repo. Configure here.