feat: add semantic search marimo notebook #581

Merged
jhamon merged 9 commits into main from semantic-search-marimo
May 21, 2026
jhamon (Collaborator) commented May 20, 2026

Summary

Adds a new marimo notebook demonstrating semantic search with Pinecone, converted and significantly expanded from the existing docs/semantic-search.ipynb. The notebook uses Pinecone's Integrated Inference with the multilingual-e5-large model to demonstrate cross-lingual semantic search across English and Spanish sentences.

Changes

  • New notebook docs/semantic-search.py (marimo format) with:
    • Pinecone SDK 9.0.1 API (pc.indexes.*, pc.index(), updated search signature)
    • multilingual-e5-large embedding model for cross-lingual retrieval
    • Refactored dataset prep: filter_pairs + extract_sentences(lang) to produce both English and Spanish records from Tatoeba
    • to_records parameterized on column name with ID prefixes for multi-language upsert
    • mo.ui.table for dataset inspection, mo.status.progress_bar replacing tqdm, mo.ui.run_button for safe index deletion
    • Interactive query section with mo.ui.text and mo.ui.radio for language filter
    • Language filtering section demonstrating metadata filters scoped to en/es
    • Prose interspersed between code cells narrating the process
    • "Meaning Over Keywords" and "How It Works" sections explaining model selection and cross-lingual retrieval
  • pyproject.toml: pins notebook dependencies (datasets==3.5.1, pinecone==9.0.1, numpy, tqdm)
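
As a rough illustration of the "to_records parameterized on column name with ID prefixes" item above, here is a minimal sketch. The notebook's real signature is not shown in this thread, so the parameter names and the `_id` and `text` field names are assumptions; only the `lang` metadata field is named in the PR description.

```python
from typing import Iterable


def to_records(rows: Iterable[dict], column: str, id_prefix: str) -> list[dict]:
    """Turn sentence-pair rows into upsert-ready records for one language.

    Hypothetical helper: the field names below are illustrative assumptions.
    """
    records = []
    for i, row in enumerate(rows):
        records.append(
            {
                "_id": f"{id_prefix}-{i}",  # prefix keeps en/es IDs from colliding
                "text": row[column],  # assumed name of the field that gets embedded
                "lang": column,  # metadata used later for language filtering
            }
        )
    return records


# Two made-up sentence pairs standing in for filtered Tatoeba rows.
pairs = [
    {"en": "Where is the library?", "es": "¿Dónde está la biblioteca?"},
    {"en": "I like coffee.", "es": "Me gusta el café."},
]
records = to_records(pairs, "en", "en") + to_records(pairs, "es", "es")
```

Because each language gets its own prefix, both sets of records can share one namespace without overwriting each other.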

Test Plan

  • Notebook runs end-to-end with a valid PINECONE_API_KEY
  • Index creation, upsert, and query cells execute without errors
  • Cross-lingual queries return results in both languages
  • Language filter correctly scopes results to en or es
  • Interactive query input updates results on change
  • Delete button safely removes the index

🤖 Generated with Claude Code


Note

Low Risk
Low risk: adds a new documentation notebook only, with no changes to production code paths; the main impact is on users running the example (it creates/deletes a Pinecone index).

Overview
Adds a new docs/semantic-search.py marimo notebook that walks through building a semantic search demo with Pinecone Integrated Inference. It covers index creation for multilingual-e5-large, dataset filtering and record preparation for English and Spanish, batched upsert_records, and index.search with optional lang metadata filtering.

The notebook also adds interactive UI elements for querying and a run-button gated cleanup step to delete the created index.
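
The batched upsert mentioned in the overview can be sketched with a small chunking helper. This is not the notebook's code: `batches` is a hypothetical name and the batch size of 4 is chosen only to keep the example short. The actual `upsert_records` call is left as a comment because its arguments are not shown in this thread.

```python
from typing import Iterator, Sequence


def batches(records: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` records."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(records), size):
        yield records[start:start + size]


# Ten placeholder records; field names are illustrative assumptions.
records = [{"_id": f"en-{i}", "text": f"sentence {i}", "lang": "en"} for i in range(10)]

sizes = []
for batch in batches(records, size=4):
    # In the notebook, this is where one upsert_records call per batch would go.
    sizes.append(len(batch))
```

The last batch is simply shorter than the others, so no padding or special-casing is needed.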

Reviewed by Cursor Bugbot for commit a233d4b. Bugbot is set up for automated code reviews on this repo. Configure here.

claude and others added 4 commits May 20, 2026 12:46
Converts docs/semantic-search.ipynb to a marimo notebook with:
- Pinecone SDK 9.0.1 API (pc.indexes.*, pc.index(), new search signature)
- Refactored dataset preparation into prepare_sentences/to_records functions
- Keyword filtering with deduplication
- mo.ui.table for dataset inspection
- mo.status.progress_bar replacing tqdm
- mo.ui.run_button for safe index deletion
- Improved prose structure with explanations interspersed between code cells

Also pins notebook dependencies in pyproject.toml.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Replace print-based result output with mo.vstack containing a bold
query header and a mo.ui.table, and fix three cells that were
incorrectly configured as markdown cells.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Switch embedding model to multilingual-e5-large
- Refactor data prep: filter_pairs + extract_sentences(lang) to embed
  both English and Spanish sentences with prefixed IDs
- Upsert both languages to a single namespace
- Add lang column to search results table
- Add cross-lingual queries (English + Spanish) and a no-keyword query
  to demonstrate meaning-over-keywords retrieval
- Add language filtering section with lang= parameter on search()
- Update How It Works to explain model selection's role in vector space
- Improve prose throughout querying and filtering sections

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Add Try It Yourself section with mo.ui.text and mo.ui.radio for
  language filter, results update reactively on input change
- Fix empty cell and duplicate filter query cells
- Correct second language filter query to Spanish

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Comment thread docs/semantic-search.py
top_k=top_k,
inputs={"text": query},
filter={"lang": {"$eq": lang}} if lang else None,
)

Incorrect index.search call shape

High Severity

index.search passes top_k, inputs, and filter as top-level keyword arguments without a query dict. For integrated text search, Pinecone v9 still expects query={"inputs": {"text": ...}, "top_k": ..., "filter": ...}, so these calls will raise a TypeError or fail validation and break all query cells.
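
For comparison, the nested shape this comment describes can be sketched as a plain dictionary builder. `build_search_query` is a hypothetical helper, not part of the notebook or the Pinecone SDK, and whether SDK 9.0.1 actually requires this shape is the reviewer's claim rather than something this thread confirms.

```python
from typing import Optional


def build_search_query(text: str, top_k: int = 5, lang: Optional[str] = None) -> dict:
    """Build the nested query dict described in the review comment.

    Hypothetical helper for illustration only.
    """
    query = {"inputs": {"text": text}, "top_k": top_k}
    if lang:
        # Only attach a filter when a language was requested, so an
        # unfiltered search does not carry a None-valued filter key.
        query["filter"] = {"lang": {"$eq": lang}}
    return query


spanish_only = build_search_query("¿Dónde está la biblioteca?", top_k=3, lang="es")
unfiltered = build_search_query("Where is the library?")
```

Under the reviewer's reading, the resulting dict would be passed as the `query` argument to `index.search` instead of spreading `top_k`, `inputs`, and `filter` across top-level keyword arguments.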


Reviewed by Cursor Bugbot for commit 72b422c. Configure here.

claude and others added 4 commits May 20, 2026 15:30
Both were added during development but are no longer used in the
notebook — tqdm was replaced by mo.status.progress_bar.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

datasets and pinecone were added to [project.dependencies] by marimo's
package manager during development. Notebook-specific deps belong in
the notebook's inline PEP 723 metadata (# /// script block), not the
root project config. Run with --sandbox to use the inline deps.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

No pyproject.toml changes on this branch that would affect the lock file.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit fc1dc00. Configure here.

Comment thread docs/semantic-search.py Outdated
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
jhamon merged commit c63f91e into main May 21, 2026
12 checks passed
jhamon deleted the semantic-search-marimo branch May 21, 2026 14:13
jhamon added a commit that referenced this pull request May 21, 2026
## Summary

Follow-up improvements to the semantic search marimo notebook merged in
#581.

## Changes

**Multilingual support**
- Switch embedding model to `multilingual-e5-large` for cross-lingual
retrieval
- Embed both English and Spanish sentences from Tatoeba using
`filter_pairs` + `extract_sentences(lang)`
- Add cross-lingual query examples and a language filtering section
using Pinecone metadata filters

**Interactivity**
- Interactive query input with `mo.ui.text` and `mo.ui.radio` language
selector
- Interactive API key input: reads `PINECONE_API_KEY` from env/`.env`
with a `mo.ui.text(kind="password")` fallback for molab users; uses
`mo.callout` admonitions for each state
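
The key-resolution order described above (environment first, typed value as fallback) can be sketched without any marimo dependency. `resolve_api_key` is a hypothetical helper, not the notebook's code; loading a `.env` file and the marimo widgets are deliberately left out.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    typed_value: str = "",
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Prefer PINECONE_API_KEY from the environment, else the typed value.

    Returns None when neither source provides a non-empty key.
    """
    env = os.environ if env is None else env
    from_env = env.get("PINECONE_API_KEY", "").strip()
    if from_env:
        return from_env
    typed = typed_value.strip()
    return typed or None


# The env mapping is passed explicitly here so the example is deterministic.
from_environment = resolve_api_key("typed-key", env={"PINECONE_API_KEY": "env-key"})
from_widget = resolve_api_key("typed-key", env={})
missing = resolve_api_key("   ", env={})
```

A `None` result corresponds to the state in which the notebook would halt downstream cells until a key is provided.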

**Display**
- Search results rendered as `mo.ui.table` with a `lang` column showing
which language each hit came from
- Progress bar via `mo.status.progress_bar` (replacing tqdm)

**Correctness / hygiene**
- Pin `datasets==3.5.1` — `datasets>=4` dropped support for custom
loading scripts used by `Helsinki-NLP/tatoeba`
- Use keyword argument names in all Pinecone API calls
- Remove unused `numpy` and `tqdm` dependencies
- Remove notebook-specific deps from root `pyproject.toml` (they belong
in the notebook's `# /// script` inline metadata)

## Test Plan

- [ ] Notebook runs end-to-end in sandbox mode (`uvx marimo edit
--sandbox`) with a valid `PINECONE_API_KEY`
- [ ] Password input appears when env var is unset; success callout
appears when set
- [ ] Cross-lingual queries return results in both English and Spanish
- [ ] Language filter correctly scopes results to `en` or `es`
- [ ] Interactive query input updates results on change

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk documentation/notebook-only changes that adjust dependency
pinning and interactive API key handling; no production code paths
affected.
> 
> **Overview**
> Improves the `docs/semantic-search.py` semantic search marimo notebook
setup experience by **pinning `datasets==3.5.1`** and replacing the
env-only Pinecone key requirement with an **interactive API key input**
(env/`.env` auto-detect + password field fallback with callouts).
> 
> Adds a guard (`mo.stop`) to halt execution until a key is provided and
introduces a brief section clarifying client instantiation (including
the example-only `source_tag`).
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
6da4180. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

Co-authored-by: Claude Code <claude@anthropic.com>
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>