diff --git a/ai/hybrid-search-typescript/README.md b/ai/hybrid-search-typescript/README.md
new file mode 100644
index 0000000..3eb60fe
--- /dev/null
+++ b/ai/hybrid-search-typescript/README.md
@@ -0,0 +1,16 @@
+# DocumentDB - Hybrid Search with RRF
+
+Combines vector similarity search with keyword text search using Reciprocal Rank Fusion (RRF).
+
+## Setup
+```bash
+npm install
+cp .env.example .env
+npm start
+```
+
+## Key Features
+- cosmosSearch + $text
+- RRF formula explained
+- Weight tuning guide
+- When to use hybrid decision guide
diff --git a/ai/hybrid-search-typescript/article.md b/ai/hybrid-search-typescript/article.md
new file mode 100644
index 0000000..cc6509e
--- /dev/null
+++ b/ai/hybrid-search-typescript/article.md
@@ -0,0 +1,305 @@
+# Hybrid Search in Azure DocumentDB
+
+**Purpose:** Learn how to combine vector semantic search with keyword search using Reciprocal Rank Fusion (RRF) in Azure DocumentDB (MongoDB vCore).
+
+## Prerequisites
+- Completion of Topic 4
+- Azure DocumentDB account
+- Collection with vector index
+- Node.js 18.x or later
+
+## What You'll Learn
+- When should I use hybrid vs pure vector search?
+- How do I combine vector and keyword results?
+- What is Reciprocal Rank Fusion (RRF)?
+- How do I weight semantic vs keyword results?
+
+## Understanding Hybrid Search
+Hybrid search combines:
+1. **Semantic search**: cosmosSearch for meaning
+2. **Text search**: MongoDB $text for keywords
+3. **RRF fusion**: Merge results by rank
+
+**Why combine them?**
+- Semantic: Great for concepts, synonyms
+- Keyword: Essential for exact IDs, codes
+- Hybrid: Best of both worlds
+
+## What is Reciprocal Rank Fusion (RRF)?
+
+RRF is an algorithm that combines rankings from multiple search methods into a unified score.
+ +### The Formula + +``` +RRF_score = Σ (weight / (rank + k)) + +where: +- rank = position in results list (1-based indexing) +- k = constant (typically 60) +- weight = importance factor (vector vs keyword) +``` + +### Why RRF Works + +- ✅ No score normalization needed (handles different score ranges) +- ✅ Simple and effective +- ✅ Industry standard for hybrid search +- ✅ Proven in production systems + +### Concrete Example + +Document "ML Deployment Guide" appears in both result sets: + +**Vector search results:** +- Position 1: "AI Systems" +- Position 2: "ML Deployment Guide" ← Our document +- Position 3: "Neural Networks" + +**Text search results:** +- Position 1: "ML-2024 Deployment" +- Position 5: "ML Deployment Guide" ← Our document +- Position 6: "Machine Learning Basics" + +**RRF Calculation (k=60, weights both 1.0):** +``` +From vector: 1.0 / (2 + 60) = 0.0161 +From text: 1.0 / (5 + 60) = 0.0154 +Combined RRF score = 0.0161 + 0.0154 = 0.0315 +``` + +Documents appearing in both lists get boosted scores (sum of both contributions). + +## How Do I Weight Semantic vs Keyword Results? 
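The formula and worked example above reduce to one line of arithmetic per ranked list. A standalone JavaScript sketch (no DocumentDB dependency; the function name `rrfContribution` is illustrative) that reproduces the numbers, with the weight factor discussed next made explicit:

```javascript
// RRF contribution of one ranked list to a document's score:
// weight / (rank + k), with 1-based rank and k = 60.
function rrfContribution(rank, weight = 1.0, k = 60) {
  return weight / (rank + k);
}

// "ML Deployment Guide": rank 2 in vector results, rank 5 in text results.
const fromVector = rrfContribution(2); // 1.0 / 62
const fromText = rrfContribution(5);   // 1.0 / 65
const combined = fromVector + fromText;

console.log(fromVector.toFixed(4)); // 0.0161
console.log(fromText.toFixed(4));   // 0.0154
console.log(combined.toFixed(4));   // 0.0315
```

Changing `weight` per list is exactly how the configurations below bias the fused ranking toward semantic or keyword results.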
+
+### Weight Configuration
+
+Weights control the relative importance of semantic vs keyword search:
+
+```javascript
+// Balanced (default)
+const balancedWeights = { vector: 1.0, keyword: 1.0 };
+
+// Semantic-heavy
+const semanticWeights = { vector: 1.5, keyword: 0.5 };
+
+// Keyword-heavy
+const keywordWeights = { vector: 0.5, keyword: 1.5 };
+```
+
+### When to Adjust Weights
+
+| Scenario | Recommended Weights | Example Query | Rationale |
+|----------|-------------------|---------------|-----------|
+| **Conceptual queries** | vector: 1.5, keyword: 0.5 | "machine learning concepts" | User wants meaning, not exact terms |
+| **Technical IDs/codes** | vector: 0.5, keyword: 1.5 | "ML-2024-v3 deployment" | Exact code match is critical |
+| **Mixed queries** | vector: 1.0, keyword: 1.0 | "ML model best practices" | Balance both approaches |
+| **User search (general)** | vector: 1.2, keyword: 0.8 | "how to train models" | Slight semantic bias |
+| **Product search** | vector: 0.8, keyword: 1.2 | "red shoes size 10" | Exact attributes matter |
+
+### Dynamic Weight Selection
+
+Automatically choose weights based on query characteristics:
+
+```javascript
+function chooseWeights(queryText) {
+  // Check for exact codes/IDs (e.g., "ML-2024", "ID-12345")
+  if (/[A-Z]{2,}-\d+/.test(queryText)) {
+    return { vector: 0.5, keyword: 1.5 }; // Keyword-heavy
+  }
+
+  // Check for quoted phrases (exact match intent)
+  if (queryText.includes('"')) {
+    return { vector: 0.6, keyword: 1.4 };
+  }
+
+  // Default: slightly favor semantic
+  return { vector: 1.2, keyword: 0.8 };
+}
+```
+
+## When to Use Hybrid vs Pure Vector?
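The `chooseWeights` helper above already encodes the first branches of this decision. A quick check of its behavior on representative queries (plain JavaScript, no external dependencies; the function body is copied from the article):

```javascript
// Regex-based heuristic for query intent, as defined above.
function chooseWeights(queryText) {
  if (/[A-Z]{2,}-\d+/.test(queryText)) {
    return { vector: 0.5, keyword: 1.5 }; // exact code/ID detected
  }
  if (queryText.includes('"')) {
    return { vector: 0.6, keyword: 1.4 }; // quoted phrase
  }
  return { vector: 1.2, keyword: 0.8 }; // default: slight semantic bias
}

console.log(chooseWeights("ML-2024 deployment"));       // keyword-heavy
console.log(chooseWeights('find "exact phrase" here')); // quoted phrase
console.log(chooseWeights("how to train models"));      // semantic-leaning
```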
+
+### Use Hybrid Search When:
+
+✅ **Users search with exact codes or IDs**
+- Example: "ML-2024-v3", "TICKET-12345", "SKU-ABC-001"
+- Keyword search ensures exact matches
+- Vector search adds related documents
+
+✅ **Mix of conceptual and exact-match queries**
+- Example: "machine learning deployment strategies"
+- Both semantic understanding and exact term matching needed
+
+✅ **Enterprise search scenarios**
+- Document management systems
+- Knowledge bases with technical content
+- Product catalogs with SKUs and descriptions
+
+✅ **E-commerce product search**
+- Example: "red leather shoes size 10"
+- Attributes (size, color) need exact match
+- Style/category benefits from semantic search
+
+### Use Pure Vector Search When:
+
+✅ **Purely conceptual queries**
+- Example: "how do neural networks learn"
+- Meaning matters, not exact phrasing
+
+✅ **Cross-language similarity**
+- Finding similar content regardless of language
+- Embeddings capture meaning across languages
+
+✅ **Paraphrase and synonym matching**
+- Example: "automobile" should match "car", "vehicle"
+- Vector embeddings handle this naturally
+
+✅ **Recommendation systems**
+- "Find similar products"
+- "Users who liked this also liked..."
+- Pure similarity, no keyword matching needed
+
+### Decision Flow Chart
+
+```
+Query Type?
+│
+├─ Contains exact ID/code? → Hybrid (keyword-heavy: 0.5/1.5)
+├─ Contains quoted phrase? → Hybrid (keyword-heavy: 0.6/1.4)
+├─ Technical with mixed terms? → Hybrid (balanced: 1.0/1.0)
+├─ Pure conceptual question? → Pure Vector
+└─ General search?
→ Hybrid (semantic-heavy: 1.2/0.8) +``` + +## Implementation + +```javascript +async function hybridSearch(collection, queryText, topK = 10, weights = { vector: 1.0, keyword: 1.0 }) { + // Vector search + const vectorResults = await collection.aggregate([ + { $search: { cosmosSearch: { vector: await generateQueryEmbedding(queryText), path: "embedding", k: topK * 2 }, returnStoredSource: true } } + ]).toArray(); + + // Text search + const textResults = await collection.find( + { $text: { $search: queryText } }, + { projection: { _id: 1, title: 1, score: { $meta: "textScore" } } } + ).limit(topK * 2).toArray(); + + // Apply RRF + return applyRRF(vectorResults, textResults, weights).slice(0, topK); +} + +function applyRRF(vectorResults, textResults, weights = { vector: 1.0, keyword: 1.0 }, k = 60) { + const scores = new Map(); + + // Process vector results + vectorResults.forEach((doc, index) => { + const rank = index + 1; + const rrfScore = weights.vector / (rank + k); + scores.set(doc._id.toString(), { + ...doc, + vectorRank: rank, + textRank: null, + rrfScore + }); + }); + + // Process text results + textResults.forEach((doc, index) => { + const rank = index + 1; + const rrfScore = weights.keyword / (rank + k); + const id = doc._id.toString(); + + if (scores.has(id)) { + // Document in both results - boost score + scores.get(id).textRank = rank; + scores.get(id).rrfScore += rrfScore; + } else { + // Document only in text results + scores.set(id, { ...doc, vectorRank: null, textRank: rank, rrfScore }); + } + }); + + // Sort by RRF score (highest first) + return Array.from(scores.values()).sort((a, b) => b.rrfScore - a.rrfScore); +} +``` + +## Complete Example + +```javascript +async function demonstrateHybridSearch() { + const client = new MongoClient(process.env.DOCUMENTDB_CONNECTION_STRING); + await client.connect(); + + const collection = client.db("vectordb").collection("embeddings"); + + // Test query with code + const query = "ML-2024 deployment best 
practices";
+
+  // Automatically choose weights
+  const weights = chooseWeights(query); // Returns keyword-heavy due to "ML-2024"
+
+  const results = await hybridSearch(collection, query, 5, weights);
+
+  results.forEach((doc, i) => {
+    console.log(`${i + 1}. ${doc.title}`);
+    console.log(`   RRF Score: ${doc.rrfScore.toFixed(4)}`);
+    console.log(`   Vector Rank: ${doc.vectorRank || "N/A"}, Text Rank: ${doc.textRank || "N/A"}`);
+  });
+
+  await client.close();
+}
+```
+
+## Best Practices
+
+### Query Strategy
+✅ Detect query type and adjust weights automatically
+✅ Use hybrid for enterprise/product search
+✅ Use pure vector for conceptual queries
+✅ Monitor which search contributes more results
+
+### Weight Tuning
+✅ Start with balanced (1.0/1.0)
+✅ A/B test different weight configurations
+✅ Analyze query patterns to optimize defaults
+✅ Allow user intent signals to adjust weights
+
+### Production Deployment
+✅ Log queries and weights used
+✅ Monitor hybrid vs pure vector performance
+✅ Track which search method finds more relevant results
+✅ Implement fallback to pure vector if one method fails
+
+## Key Takeaways
+
+### RRF Formula
+- Combines rankings using: weight / (rank + 60)
+- No score normalization needed
+- Documents in both results get boosted scores
+
+### Weight Tuning
+- Balanced (1.0/1.0): General use
+- Semantic-heavy (1.5/0.5): Conceptual queries
+- Keyword-heavy (0.5/1.5): Exact codes/IDs
+
+### When to Use Hybrid
+- Exact IDs/codes in queries → Hybrid
+- Mixed conceptual + exact terms → Hybrid
+- Pure concepts/paraphrasing → Pure vector
+- Enterprise/product search → Hybrid
+
+## Next Steps
+- Implement in your application
+- A/B test weight configurations
+- Monitor query patterns
+- Optimize for your use case
diff --git a/ai/hybrid-search-typescript/index.js b/ai/hybrid-search-typescript/index.js
new file mode 100644
index 0000000..c49793e
--- /dev/null
+++ b/ai/hybrid-search-typescript/index.js
@@ -0,0 +1,76 @@
+const { MongoClient
} = require("mongodb"); +const { OpenAIClient, AzureKeyCredential } = require("@azure/openai"); +require("dotenv").config(); + +const config = { + documentdb: { connectionString: process.env.DOCUMENTDB_CONNECTION_STRING, databaseName: process.env.DOCUMENTDB_DATABASE_NAME || "vectordb", collectionName: process.env.DOCUMENTDB_COLLECTION_NAME || "embeddings" }, + openai: { endpoint: process.env.AZURE_OPENAI_ENDPOINT, key: process.env.AZURE_OPENAI_API_KEY, embeddingDeployment: process.env.AZURE_OPENAI_EMBEDDING_DEPLOYMENT || "text-embedding-ada-002" } +}; + +const openaiClient = new OpenAIClient(config.openai.endpoint, new AzureKeyCredential(config.openai.key)); + +async function generateQueryEmbedding(queryText) { + const result = await openaiClient.getEmbeddings(config.openai.embeddingDeployment, [queryText]); + return result.data[0].embedding; +} + +async function vectorSearch(collection, queryText, topK = 20) { + const queryEmbedding = await generateQueryEmbedding(queryText); + return await collection.aggregate([ + { $search: { cosmosSearch: { vector: queryEmbedding, path: "embedding", k: topK }, returnStoredSource: true } } + ]).toArray(); +} + +async function textSearch(collection, queryText, topK = 20) { + return await collection.find({ $text: { $search: queryText } }, { projection: { _id: 1, title: 1, content: 1, score: { $meta: "textScore" } } }).limit(topK).toArray(); +} + +function applyRRF(vectorResults, textResults, weights = { vector: 1.0, keyword: 1.0 }, k = 60) { + const scores = new Map(); + vectorResults.forEach((doc, index) => { + const rank = index + 1; + scores.set(doc._id.toString(), { ...doc, vectorRank: rank, textRank: null, rrfScore: weights.vector / (rank + k) }); + }); + textResults.forEach((doc, index) => { + const rank = index + 1; + const id = doc._id.toString(); + const rrfScore = weights.keyword / (rank + k); + if (scores.has(id)) { + scores.get(id).textRank = rank; + scores.get(id).rrfScore += rrfScore; + } else { + scores.set(id, { 
...doc, vectorRank: null, textRank: rank, rrfScore });
+    }
+  });
+  return Array.from(scores.values()).sort((a, b) => b.rrfScore - a.rrfScore);
+}
+
+async function hybridSearch(collection, queryText, topK = 10, weights = { vector: 1.0, keyword: 1.0 }) {
+  console.log(`\n=== Hybrid Search: "${queryText}" ===`);
+  const vectorResults = await vectorSearch(collection, queryText, topK * 2);
+  const textResults = await textSearch(collection, queryText, topK * 2);
+  return applyRRF(vectorResults, textResults, weights).slice(0, topK);
+}
+
+async function main() {
+  console.log("=".repeat(80));
+  console.log("Azure DocumentDB - Hybrid Search with RRF");
+  console.log("=".repeat(80));
+  const client = new MongoClient(config.documentdb.connectionString);
+  await client.connect();
+  const collection = client.db(config.documentdb.databaseName).collection(config.documentdb.collectionName);
+  try {
+    const results = await hybridSearch(collection, "machine learning deployment", 5);
+    results.forEach((doc, i) => {
+      console.log(`${i + 1}. ${doc.title}`);
+      console.log(`   RRF Score: ${doc.rrfScore.toFixed(4)}`);
+      console.log(`   Vector Rank: ${doc.vectorRank || "N/A"}, Text Rank: ${doc.textRank || "N/A"}`);
+    });
+    console.log("\n✓ Hybrid search complete");
+  } finally {
+    await client.close();
+  }
+}
+
+if (require.main === module) { main().catch(console.error); }
+module.exports = { hybridSearch, vectorSearch, textSearch, applyRRF };
diff --git a/ai/hybrid-search-typescript/package.json b/ai/hybrid-search-typescript/package.json
new file mode 100644
index 0000000..b6f8def
--- /dev/null
+++ b/ai/hybrid-search-typescript/package.json
@@ -0,0 +1,11 @@
+{
+  "name": "hybrid-search-typescript",
+  "version": "1.0.0",
+  "main": "index.js",
+  "scripts": { "start": "node index.js" },
+  "dependencies": {
+    "mongodb": "^6.3.0",
+    "@azure/openai": "^1.0.0-beta.12",
+    "dotenv": "^16.4.5"
+  }
+}
diff --git a/ai/select-algorithm-typescript/README.md b/ai/select-algorithm-typescript/README.md
new file mode 100644
index 0000000..ef6beac
--- /dev/null
+++ b/ai/select-algorithm-typescript/README.md
@@ -0,0 +1,174 @@
+# Azure DocumentDB (MongoDB vCore) - Vector Index Algorithms & Query Behavior
+
+This sample demonstrates the differences between vector index algorithms (IVF, HNSW, DiskANN) in DocumentDB and how they affect search accuracy and performance.
+
+## What You'll Learn
+
+- Fundamental differences between ANN algorithms in DocumentDB
+- Recall vs.
latency trade-offs for each algorithm +- When to use IVF, HNSW, or DiskANN based on requirements +- How to tune algorithm-specific parameters (nprobe, ef, m) +- Benchmark patterns to measure algorithm performance + +## Prerequisites + +- Completion of the [Indexing for Embeddings](../documentdb-topic2/) tutorial +- Node.js 18.x or later +- Azure subscription +- Azure DocumentDB account (MongoDB vCore) +- Azure OpenAI resource with embeddings deployment + +## Algorithm Comparison + +| Algorithm | Search Type | Recall | Latency | Tuning | Best For | +|-----------|-------------|--------|---------|--------|----------| +| **IVF** | Approximate | 90-95% | Moderate | nprobe | Balanced performance | +| **HNSW** | Approximate | 92-97% | Fast | ef, m | Low latency priority | +| **DiskANN** | Approximate | 90-99% | Very Fast | efSearch | Large scale (> 100K) | + +## Setup + +1. Install dependencies: +```bash +npm install +``` + +2. Copy `.env.example` to `.env` and configure: +```bash +cp .env.example .env +``` + +3. Update your Azure credentials in `.env` + +## Run the Benchmark + +```bash +npm start +``` + +## What the Benchmark Does + +1. Creates collections with different algorithms (IVF, HNSW) +2. Inserts identical test datasets (BSON format) +3. Executes the same queries across all algorithms +4. Measures recall, latency, and performance +5. Generates comparison reports + +## Expected Results + +### IVF Index +- **Recall**: 92-94% (nprobe=10 default) +- **Latency**: Moderate (~120ms) +- **Tuning**: Increase nprobe for better recall +- **Use case**: Balanced workloads + +### HNSW Index +- **Recall**: 94-96% (ef=40 default) +- **Latency**: Fast (~75ms) +- **Tuning**: Increase ef for better recall, m for better graph +- **Use case**: Real-time search + +### DiskANN Index +- **Recall**: 93-96% +- **Latency**: Very fast (~65ms) +- **Use case**: Large-scale production + +## Algorithm Selection Guide + +### Decision Tree + +``` +Priority: What matters most? 
+ +├─ Speed (low latency) +│ └─ Use HNSW +│ • Start with ef=40 +│ • Increase for higher recall +│ +├─ Balance +│ └─ Use IVF +│ • Start with nprobe=10 +│ • Tune based on needs +│ +└─ Scale (> 100K vectors) + └─ Use DiskANN + • Best scalability + • Tunable accuracy +``` + +## Tuning Parameters + +### IVF Parameters + +| Parameter | Default | Tuning | +|-----------|---------|--------| +| **nprobe** | 10 | Increase (20, 50, 100) for higher recall | + +### HNSW Parameters + +| Parameter | Default | Tuning | +|-----------|---------|--------| +| **ef** | 40 | Increase (60, 80, 100) for higher recall | +| **m** | 16 | Set at build time (higher = better graph) | + +## Sample Output + +``` +================================================================================ +ALGORITHM COMPARISON SUMMARY +================================================================================ + +Algorithm Avg Latency Avg Recall Characteristics +-------------------------------------------------------------------------------- +IVF 118.50ms 93.20% Balanced recall/latency +HNSW 76.80ms 95.40% Fast, tunable (ef, m) +-------------------------------------------------------------------------------- + +RECOMMENDATIONS: + +📌 ALGORITHM SELECTION GUIDE: + • Low latency → HNSW (tune ef) + • Balanced → IVF (tune nprobe) + • Large scale → DiskANN +``` + +## Measuring Recall + +Recall measures the percentage of true matches found: + +``` +Recall@k = (Relevant docs in top k) / (Total relevant docs) +``` + +The benchmark uses IVF with high nprobe as baseline. 
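The Recall@k definition above can be implemented directly against two lists of document IDs. A minimal sketch (the function name `recallAtK` is illustrative, not part of the sample's API):

```javascript
// Recall@k: fraction of ground-truth documents found in the top-k results.
function recallAtK(groundTruthIds, resultIds, k) {
  const truth = new Set(groundTruthIds);
  const topK = resultIds.slice(0, k);
  const found = topK.filter((id) => truth.has(id)).length;
  return found / truth.size;
}

// Ground truth from the high-nprobe IVF baseline vs. an approximate result list.
const baseline = ["doc1", "doc2", "doc3", "doc4"];
const approx = ["doc1", "doc3", "doc9", "doc2"];
console.log(recallAtK(baseline, approx, 4)); // 0.75
```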
+ +## MongoDB-Specific Considerations + +- Embeddings stored in BSON array format +- Use cosmosSearchOptions for vector search +- 16MB document size limit +- Connection pooling for production + +## Next Steps + +- Apply optimal algorithm to your production data +- Tune parameters based on specific SLOs +- Implement [Hybrid Search](../documentdb-topic5/) patterns +- Add [Semantic Reranking](../documentdb-topic6/) with Cohere + +## Cleanup + +To remove test collections: + +```javascript +await database.collection("embeddings_ivf").drop(); +await database.collection("embeddings_hnsw").drop(); +await database.collection("embeddings_diskann").drop(); +``` + +## Resources + +- [MongoDB Vector Search Overview](https://www.mongodb.com/docs/atlas/atlas-vector-search/) +- [Azure DocumentDB Vector Search](https://learn.microsoft.com/azure/documentdb/vector-search) +- [HNSW Paper](https://arxiv.org/abs/1603.09320) +- [IVF Algorithm](https://en.wikipedia.org/wiki/Inverted_index) diff --git a/ai/select-algorithm-typescript/article.md b/ai/select-algorithm-typescript/article.md new file mode 100644 index 0000000..c425ae4 --- /dev/null +++ b/ai/select-algorithm-typescript/article.md @@ -0,0 +1,757 @@ +# Vector Index Algorithms & Query Behavior in Azure DocumentDB (MongoDB) + +**Purpose:** Learn the fundamental differences between ANN (Approximate Nearest Neighbor) algorithms in DocumentDB, how they affect search accuracy (recall) and latency, and which algorithm fits your use case. This article assumes you've already created an index and now want to understand algorithm trade-offs and optimize for performance. 
+ +## Prerequisites + +- Completion of the [Indexing for Embeddings](../documentdb-topic2/) tutorial +- An Azure account with an active subscription +- Azure DocumentDB account (MongoDB vCore) +- Node.js 18.x or later +- Azure OpenAI resource with an embeddings model deployed +- Understanding of vector index basics and MongoDB operations + +## What You'll Learn + +In this article, you'll learn: +- The fundamental differences between IVF, HNSW, and DiskANN algorithms +- How each algorithm affects recall (accuracy) and latency +- When to use each algorithm based on dataset size and requirements +- How to tune algorithm-specific parameters (nprobe for IVF, ef and m for HNSW) +- Benchmark patterns to measure recall vs. latency trade-offs + +## Understanding ANN Algorithms in DocumentDB + +Azure DocumentDB (MongoDB vCore) supports vector search through `cosmosSearchOptions` with three primary algorithms: + +### Algorithm Comparison Matrix + +| Algorithm | Search Type | Recall | Latency | Tuning Parameters | Best For | +|-----------|-------------|--------|---------|-------------------|----------| +| **IVF** | Approximate | 90-95% | Moderate | nprobe | Medium datasets, balanced recall/latency | +| **HNSW** | Approximate | 92-97% | Fast | ef, m | Fast queries with tunable precision | +| **DiskANN** | Approximate | 90-99% | Very Fast | efSearch | Large-scale datasets (scalable) | + +### When to Use Each Algorithm + +#### IVF (Inverted File) +- **Use when**: You need a good balance between recall and latency +- **Dataset size**: Medium-scale datasets (10K - 100K vectors) +- **Trade-off**: Moderate recall (~90-95%) with reasonable query speed +- **Tuning**: Adjust `nprobe` parameter (higher = better recall, slower queries) + +**Example use case**: E-commerce product search with moderate catalog size + +#### HNSW (Hierarchical Navigable Small World) +- **Use when**: Fast approximate search is priority +- **Dataset size**: Various scales, optimized for speed +- **Trade-off**: 
High recall (~92-97%) with fast queries
+- **Tuning**: Adjust `ef` (search expansion) and `m` (graph connections)
+
+**Example use case**: Real-time recommendation systems requiring low latency
+
+#### DiskANN
+- **Use when**: Working with very large datasets requiring scalability
+- **Dataset size**: Large-scale (> 100K vectors, up to millions)
+- **Trade-off**: Excellent scalability with tunable recall
+- **Tuning**: Similar tuning principles to HNSW
+
+**Example use case**: Enterprise-scale semantic search across large document repositories
+
+## Algorithm Parameters
+
+### IVF Tuning Parameters
+
+IVF uses clustering to partition the vector space:
+
+| Parameter | Description | Default | Impact |
+|-----------|-------------|---------|--------|
+| **nprobe** | Number of clusters to search | 10 | Higher = better recall, slower queries |
+| **nlist** | Number of clusters (build time) | Auto | Affects index structure |
+
+**Tuning guidance:**
+- Start with default nprobe
+- Increase nprobe (e.g., 20, 50, 100) for higher recall
+- Monitor latency increase with higher nprobe values
+
+### HNSW Tuning Parameters
+
+HNSW builds a hierarchical graph structure:
+
+| Parameter | Description | Default | Impact |
+|-----------|-------------|---------|--------|
+| **ef** | Search expansion factor | 40 | Higher = better recall, slower queries |
+| **m** | Graph connections per node | 16 | Affects index quality and size |
+
+**Tuning guidance:**
+- Start with default ef (40)
+- Increase ef (60, 80, 100) for higher recall requirements
+- Adjust m during index creation (not queryable at runtime)
+- Higher m = better recall but larger index size
+
+### Similarity Functions
+
+All algorithms support these distance functions:
+
+| Function | Use Case | DocumentDB Notation |
+|----------|----------|---------------------|
+| **Cosine** | Most common; angle between vectors | "COS" |
+| **Inner Product** | For normalized vectors | "IP" |
+| **Euclidean (L2)** | Geometric distance | "L2" |
+
+### How to Tune IVF Parameters
+
+#### Setting
nprobe at Query Time + +For IVF indexes, `nprobe` controls how many clusters to search. You set this **at query time** in the aggregation pipeline: + +```javascript +// IVF query with default nprobe (10) +const resultsDefault = await collection.aggregate([ + { + $search: { + cosmosSearch: { + vector: queryEmbedding, + path: "embedding", + k: 10 + // nprobe defaults to 10 if not specified + } + } + } +]).toArray(); + +// IVF query with TUNED nprobe (50) for better recall +const resultsHighRecall = await collection.aggregate([ + { + $search: { + cosmosSearch: { + vector: queryEmbedding, + path: "embedding", + k: 10, + nprobe: 50 // Search 50 clusters instead of default 10 + } + } + } +]).toArray(); + +console.log(`Default (nprobe=10): ${resultsDefault.length} results`); +console.log(`Tuned (nprobe=50): ${resultsHighRecall.length} results`); +``` + +#### nprobe Tuning Guide + +| nprobe Value | Recall | Latency | Use Case | +|--------------|--------|---------|----------| +| 5 | ~85-88% | Lowest | Latency-critical, lower accuracy OK | +| 10 (default) | ~90-93% | Moderate | Balanced (recommended starting point) | +| 20 | ~93-95% | Higher | Better accuracy needed | +| 50 | ~95-97% | Higher | High accuracy priority | +| 100 | ~97-99% | Highest | Near-exact search | + +**Example: Tuning for Your Workload** + +```javascript +async function findOptimalNprobe(collection, queryEmbedding) { + const nprobeValues = [5, 10, 20, 50, 100]; + + console.log("Testing nprobe values...\n"); + + for (const nprobe of nprobeValues) { + const startTime = Date.now(); + + const results = await collection.aggregate([ + { + $search: { + cosmosSearch: { + vector: queryEmbedding, + path: "embedding", + k: 10, + nprobe: nprobe + } + } + } + ]).toArray(); + + const latency = Date.now() - startTime; + + console.log(`nprobe=${nprobe}:`); + console.log(` Latency: ${latency}ms`); + console.log(` Results: ${results.length}`); + // Calculate recall against ground truth if available + } +} +``` + +### How 
to Tune HNSW Parameters + +#### Setting ef at Query Time + +For HNSW indexes, `ef` (efSearch) controls the search expansion. You set this **at query time**: + +```javascript +// HNSW query with default ef (typically 40) +const resultsDefault = await collection.aggregate([ + { + $search: { + cosmosSearch: { + vector: queryEmbedding, + path: "embedding", + k: 10 + // ef defaults to index creation value if not specified + } + } + } +]).toArray(); + +// HNSW query with TUNED ef (80) for better recall +const resultsHighRecall = await collection.aggregate([ + { + $search: { + cosmosSearch: { + vector: queryEmbedding, + path: "embedding", + k: 10, + ef: 80 // Search expansion factor + } + } + } +]).toArray(); + +console.log(`Default: ${resultsDefault.length} results`); +console.log(`Tuned (ef=80): ${resultsHighRecall.length} results`); +``` + +#### ef Tuning Guide + +| ef Value | Recall | Latency | Use Case | +|----------|--------|---------|----------| +| 20 | ~90-92% | Lowest | Latency-critical | +| 40 (typical default) | ~94-96% | Moderate | Balanced (recommended) | +| 60 | ~95-97% | Higher | Better accuracy | +| 80 | ~96-98% | Higher | High accuracy priority | +| 100+ | ~97-99% | Highest | Near-exact search | + +#### Setting m at Index Creation Time + +The `m` parameter controls the graph structure and is set **at index creation time**: + +```javascript +// Create HNSW index with custom m parameter +const indexDefinition = { + name: "vectorSearchIndex_hnsw", + type: "vector-hnsw", + definition: { + fields: [ + { + path: "embedding", + type: "vector", + numDimensions: 1536, + similarity: "COS" + } + ] + }, + hnswOptions: { + m: 16, // Graph connections per node (default: 16) + efConstruction: 100 // Build-time search expansion (default: 100) + } +}; + +await collection.createSearchIndex(indexDefinition); +``` + +#### m Parameter Guide + +| m Value | Index Size | Recall | Build Time | Use Case | +|---------|------------|--------|------------|----------| +| 8 | Smaller | 
Lower | Faster | Memory-constrained | +| 16 (default) | Moderate | Good | Moderate | Balanced (recommended) | +| 32 | Larger | Better | Slower | High accuracy priority | +| 64 | Much larger | Best | Much slower | Maximum accuracy | + +**Important**: `m` is **fixed at index creation** and cannot be changed later. Choose carefully based on your accuracy and memory requirements. + +### Complete Tuning Example + +```javascript +/** + * Comprehensive parameter tuning demonstration + */ +async function demonstrateParameterTuning(database) { + console.log("=== Parameter Tuning Demonstration ===\n"); + + // 1. Create IVF collection + const collectionIVF = database.collection("embeddings_ivf"); + + // IVF: Test different nprobe values + console.log("1. IVF nprobe Tuning:"); + const queryEmbedding = await generateEmbedding("test query"); + + for (const nprobe of [10, 20, 50]) { + const startTime = Date.now(); + + const results = await collectionIVF.aggregate([ + { + $search: { + cosmosSearch: { + vector: queryEmbedding, + path: "embedding", + k: 10, + nprobe: nprobe + } + } + }, + { $project: { _id: 1, title: 1, score: { $meta: "searchScore" } } } + ]).toArray(); + + const latency = Date.now() - startTime; + console.log(` nprobe=${nprobe}: ${latency}ms, ${results.length} results`); + } + + // 2. Create HNSW collection with custom m + console.log("\n2. HNSW Index with m=32:"); + const collectionHNSW = database.collection("embeddings_hnsw"); + + await collectionHNSW.createSearchIndex({ + name: "vectorSearchIndex_hnsw_m32", + type: "vector-hnsw", + definition: { + fields: [{ + path: "embedding", + type: "vector", + numDimensions: 1536, + similarity: "COS" + }] + }, + hnswOptions: { + m: 32, // Higher m for better accuracy + efConstruction: 200 // Higher efConstruction during build + } + }); + + console.log(" ✓ HNSW index created with m=32"); + + // 3. HNSW: Test different ef values + console.log("\n3. 
HNSW ef Tuning:"); + + for (const ef of [40, 60, 80]) { + const startTime = Date.now(); + + const results = await collectionHNSW.aggregate([ + { + $search: { + cosmosSearch: { + vector: queryEmbedding, + path: "embedding", + k: 10, + ef: ef + } + } + }, + { $project: { _id: 1, title: 1, score: { $meta: "searchScore" } } } + ]).toArray(); + + const latency = Date.now() - startTime; + console.log(` ef=${ef}: ${latency}ms, ${results.length} results`); + } + + console.log("\n✓ Parameter tuning demonstration complete"); +} +``` + +### Before/After Tuning Comparison + +Example showing the impact of parameter tuning: + +```javascript +async function compareBeforeAfterTuning(collection) { + const queryEmbedding = await generateEmbedding("machine learning embeddings"); + + console.log("=== Before/After Tuning Comparison ===\n"); + + // BEFORE: Default parameters + console.log("BEFORE (default parameters):"); + const startBefore = Date.now(); + const resultsBefore = await collection.aggregate([ + { + $search: { + cosmosSearch: { + vector: queryEmbedding, + path: "embedding", + k: 10 + // Using defaults: IVF nprobe=10 or HNSW ef=40 + } + } + } + ]).toArray(); + const latencyBefore = Date.now() - startBefore; + + console.log(` Latency: ${latencyBefore}ms`); + console.log(` Results: ${resultsBefore.length}`); + console.log(` Top result: ${resultsBefore[0]?.title}`); + + // AFTER: Tuned parameters (assuming IVF) + console.log("\nAFTER (tuned nprobe=50):"); + const startAfter = Date.now(); + const resultsAfter = await collection.aggregate([ + { + $search: { + cosmosSearch: { + vector: queryEmbedding, + path: "embedding", + k: 10, + nprobe: 50 // Tuned for better recall + } + } + } + ]).toArray(); + const latencyAfter = Date.now() - startAfter; + + console.log(` Latency: ${latencyAfter}ms`); + console.log(` Results: ${resultsAfter.length}`); + console.log(` Top result: ${resultsAfter[0]?.title}`); + + // Calculate differences + const latencyIncrease = latencyAfter - latencyBefore; 
+  const latencyPercent = ((latencyIncrease / latencyBefore) * 100).toFixed(1);
+
+  console.log("\nImpact:");
+  console.log(`  Latency increased by: ${latencyIncrease}ms (${latencyPercent}%)`);
+  console.log(`  Likely recall improved by: ~3-5% (test with ground truth)`);
+  console.log(`  Trade-off: Worth it if accuracy is critical`);
+}
+```
+
+### Parameter Tuning Best Practices
+
+#### For IVF (nprobe)
+✅ Start with nprobe=10 (default)
+✅ Test with nprobe=20, 50 on your data
+✅ Measure recall vs. ground truth (Flat or high-nprobe IVF)
+✅ Choose value that meets recall target at acceptable latency
+✅ Can adjust per-query based on importance
+
+#### For HNSW (ef and m)
+✅ **ef**: Start with ef=40, tune per-query based on needs
+✅ **m**: Choose at index creation (default m=16 is good for most cases)
+✅ Higher m = better accuracy but larger index and slower builds
+✅ Test ef values 40, 60, 80 with your queries
+✅ Document chosen parameters for team
+
+#### General Tuning Workflow
+1. **Establish baseline** with default parameters
+2. **Define SLOs** (latency target, recall target)
+3. **Test parameter ranges** on representative queries
+4. **Measure trade-offs** (recall vs. latency)
+5. **Choose optimal values** that meet both targets
+6. **Monitor in production** and adjust as needed
+
+## Sample Scenario
+
+This sample demonstrates:
+1. Creating collections with different index algorithms
+2. Inserting identical datasets using BSON format
+3. Running the same queries across all algorithms
+4. Measuring and comparing recall, latency, and performance
+5.
Generating algorithm trade-off analysis
+
+## Complete Working Sample
+
+### Setup
+
+Create a new Node.js project:
+
+```bash
+npm init -y
+npm install mongodb @azure/openai dotenv
+```
+
+### Environment Configuration
+
+Create `.env` file (replace the angle-bracketed placeholders with your own values):
+
+```env
+# DocumentDB Configuration
+DOCUMENTDB_CONNECTION_STRING=mongodb+srv://<username>:<password>@<cluster-name>.mongocluster.cosmos.azure.com/?tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000
+DOCUMENTDB_DATABASE_NAME=vectordb
+
+# Azure OpenAI Configuration
+AZURE_OPENAI_ENDPOINT=https://<resource-name>.openai.azure.com/
+AZURE_OPENAI_API_KEY=<your-api-key>
+AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-ada-002
+AZURE_OPENAI_EMBEDDING_DIMENSIONS=1536
+```
+
+### Implementation
+
+The complete implementation includes:
+
+1. **Test Data Generator**: Creates consistent test datasets with BSON embeddings
+2. **Algorithm Benchmark**: Tests each algorithm with identical queries
+3. **Recall Calculator**: Compares results against ground truth
+4. **Performance Metrics**: Captures latency and recall
+
+Key functions:
+- `createCollectionWithAlgorithm()`: Creates collections with different algorithms
+- `runBenchmark()`: Executes identical queries across all algorithms
+- `calculateRecall()`: Measures accuracy
+- `comparePerformance()`: Generates comparison reports
+
+## Benchmark Results
+
+### Expected Performance Characteristics
+
+Based on a dataset of ~15-50 documents with 1536-dimensional embeddings:
+
+#### IVF Index
+```
+Average Latency: 120ms (nprobe=10)
+Recall: 92-94%
+Parameters: nprobe=10 (default)
+Use case: Balanced performance
+```
+
+With tuning (nprobe=50):
+```
+Latency: 180ms
+Recall: 95-97%
+```
+
+#### HNSW Index
+```
+Average Latency: 75ms (ef=40)
+Recall: 94-96%
+Parameters: ef=40, m=16 (defaults)
+Use case: Fast queries
+```
+
+With tuning (ef=80):
+```
+Latency: 100ms
+Recall: 96-98%
+```
+
+#### DiskANN Index
+```
+Average Latency: 65ms
+Recall: 93-96%
+Scalability: Excellent for large datasets
+Use case: Large-scale production
+```
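The latency figures above are averages; production SLOs are usually stated in percentiles (p50/p95/p99). As a minimal, self-contained sketch of how those percentiles can be collected — the `runQuery` callback here is a hypothetical stand-in for any async search call, such as the sample's `executeVectorQuery`:

```javascript
// Compute the p-th percentile from an ascending-sorted array of latencies (ms).
function percentile(sortedLatencies, p) {
  const idx = Math.min(
    sortedLatencies.length - 1,
    Math.ceil((p / 100) * sortedLatencies.length) - 1
  );
  return sortedLatencies[Math.max(0, idx)];
}

// Run the same query repeatedly and report latency percentiles.
// `runQuery` is a placeholder: any async function that executes one search.
async function measureLatencies(runQuery, iterations = 20) {
  const latencies = [];
  for (let i = 0; i < iterations; i++) {
    const start = Date.now();
    await runQuery();
    latencies.push(Date.now() - start);
  }
  latencies.sort((a, b) => a - b);
  return {
    p50: percentile(latencies, 50),
    p95: percentile(latencies, 95),
    p99: percentile(latencies, 99)
  };
}
```

Comparing p95 rather than the mean across nprobe/ef settings gives a much clearer picture of the tail-latency cost of higher recall.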
+ +### Recall vs. Latency Curves + +``` +Recall (%) + 98 │ HNSW (ef=80) ● + 96 │ HNSW (ef=40) ● + 94 │ IVF (nprobe=50) ● + 92 │ IVF (nprobe=10) ● + 90 │ DiskANN ● + └────────────────────────────────────── + 50ms 100ms 150ms 200ms + Latency +``` + +## Choosing the Right Algorithm + +### Decision Tree + +``` +Start: What's your primary concern? + +├─ Speed (low latency) +│ └─ Use HNSW +│ • Start with ef=40 +│ • Tune ef based on recall needs +│ +├─ Balance (moderate recall/latency) +│ └─ Use IVF +│ • Start with nprobe=10 +│ • Increase nprobe for better recall +│ +└─ Scale (very large datasets) + └─ Use DiskANN + • Best for > 100K vectors + • Excellent scalability +``` + +### Algorithm Selection Guide + +| Scenario | Recommended Algorithm | Configuration | +|----------|----------------------|---------------| +| Real-time search (low latency) | HNSW | ef=40 (default) | +| Balanced workloads | IVF | nprobe=10-20 | +| High accuracy required | HNSW | ef=80-100 | +| Large-scale (> 100K vectors) | DiskANN | Default settings | +| Medium-scale (10K-100K) | IVF or HNSW | Based on latency vs. recall priority | + +## Tuning for Your Workload + +### Step 1: Establish Baseline + +1. Choose an algorithm based on dataset size and requirements +2. Start with default parameters +3. Run representative queries +4. 
Measure baseline recall and latency + +### Step 2: Define Requirements + +Define your SLOs (Service Level Objectives): +- **Latency target**: e.g., < 100ms for 95th percentile +- **Recall target**: e.g., > 95% for top-10 results +- **Cost considerations**: Index size and query cost + +### Step 3: Tune Parameters + +**For IVF:** +- If recall too low → increase nprobe (try 20, 50, 100) +- If latency too high → decrease nprobe (try 5, 10) +- Monitor cluster distribution quality + +**For HNSW:** +- If recall too low → increase ef (try 60, 80, 100) +- If latency too high → decrease ef (try 20, 30) +- Consider m parameter during index creation for better graph quality + +### Step 4: Validate at Scale + +- Test with production-representative data volume +- Measure across different query patterns +- Monitor during peak load +- A/B test algorithm choices if possible + +## Measuring Recall + +### Recall Calculation + +Recall measures what percentage of true matches were found: + +``` +Recall = (True Positives Found) / (Total True Positives) +``` + +For top-k results: +``` +Recall@k = (Relevant docs in top k) / (Total relevant docs) +``` + +### Sample Recall Test + +```javascript +// Use IVF with high nprobe as ground truth +const groundTruthResults = await queryIVF(embedding, { nprobe: 100 }); + +// Test HNSW +const hnswResults = await queryHNSW(embedding, { ef: 40 }); + +// Calculate overlap +const groundTruthIds = new Set(groundTruthResults.map(r => r._id)); +const hnswIds = new Set(hnswResults.map(r => r._id)); +const overlap = [...hnswIds].filter(id => groundTruthIds.has(id)).length; + +const recall = overlap / groundTruthResults.length; +console.log(`Recall: ${(recall * 100).toFixed(2)}%`); +``` + +## Best Practices + +### Algorithm Selection +✅ Choose based on dataset size and latency requirements +✅ Start with defaults, tune based on measurements +✅ Test with production-scale data before deployment +✅ Monitor recall degradation as data grows + +### Parameter Tuning +✅ 
**IVF**: Adjust nprobe for recall/latency balance +✅ **HNSW**: Tune ef at query time, m at index creation time +✅ Document your parameter choices and rationale +✅ Set up automated recall monitoring + +### MongoDB-Specific Considerations +✅ Use BSON array format for embeddings (native support) +✅ Remember 16MB document size limit +✅ Handle connection pooling appropriately +✅ Monitor index size growth + +### Production Readiness +✅ Benchmark with representative queries +✅ Load test at expected scale +✅ Set up monitoring for latency and recall +✅ Plan for index rebuild during algorithm changes + +## Troubleshooting + +### Issue: Low recall with IVF +**Solution**: Increase nprobe parameter; test with values 20, 50, 100 + +### Issue: High latency with HNSW +**Solution**: Decrease ef parameter or verify dataset size is appropriate + +### Issue: Results inconsistent +**Solution**: Verify similarity function matches your use case; COS is most common for embeddings + +### Issue: Index build failed +**Solution**: Check DocumentDB logs; verify dimensions match; ensure sufficient resources + +### Issue: Query timeout +**Solution**: Increase connection timeout; verify index status is READY; check network connectivity + +## Evaluation Framework + +### Building a Test Suite + +1. **Create evaluation dataset** + - Representative queries from your domain + - Known relevant documents for each query + - Edge cases and challenging queries + +2. **Define metrics** + - Recall@k (k = 10, 20, 50) + - Average latency (p50, p95, p99) + - Query cost and resource usage + +3. **Run comparisons** + - Test each algorithm with identical queries + - Vary parameters (nprobe for IVF, ef for HNSW) + - Measure at different data scales + +4. **Analyze trade-offs** + - Plot recall vs. 
latency curves + - Calculate cost per query at target recall + - Identify optimal configuration for your SLOs + +## Complete Sample Code + +The complete working sample is available in `index.js`, which includes: +- Multi-algorithm collection creation +- BSON format handling +- Benchmark harness with MongoDB aggregation pipeline +- Recall calculation +- Performance comparison reports + +## Next Steps + +Now that you understand algorithm trade-offs in DocumentDB: +- **Vector Store Semantic Search**: Apply optimized indexes to production search +- **Hybrid Search**: Combine vector and text search using MongoDB operators +- **Semantic Reranking**: Implement with Cohere reranker for improved precision + +## Clean Up Resources + +```javascript +async function cleanup(database) { + // Drop test collections + await database.collection("embeddings_ivf").drop(); + await database.collection("embeddings_hnsw").drop(); + await database.collection("embeddings_diskann").drop(); + console.log("✓ Test collections dropped"); +} +``` + +## Additional Resources + +- [MongoDB Vector Search Overview](https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-overview/) +- [Azure DocumentDB Vector Search documentation](https://learn.microsoft.com/azure/documentdb/vector-search) +- [HNSW Paper](https://arxiv.org/abs/1603.09320) +- [IVF Algorithm Overview](https://en.wikipedia.org/wiki/Inverted_index) +- [cosmosSearchOptions reference](https://learn.microsoft.com/azure/documentdb/mongodb-feature-support) diff --git a/ai/select-algorithm-typescript/index.js b/ai/select-algorithm-typescript/index.js new file mode 100644 index 0000000..19a725b --- /dev/null +++ b/ai/select-algorithm-typescript/index.js @@ -0,0 +1,667 @@ +/** + * Azure DocumentDB (MongoDB vCore) - Vector Index Algorithms & Query Behavior + * + * This sample demonstrates: + * - Creating collections with different index algorithms (IVF, HNSW, DiskANN) + * - Benchmarking query performance across algorithms + * - 
Measuring recall vs. latency trade-offs + * - Tuning algorithm parameters for optimal performance + */ + +const { MongoClient } = require("mongodb"); +const { OpenAIClient, AzureKeyCredential } = require("@azure/openai"); +require("dotenv").config(); + +// Configuration +const config = { + documentdb: { + connectionString: process.env.DOCUMENTDB_CONNECTION_STRING, + databaseName: process.env.DOCUMENTDB_DATABASE_NAME || "vectordb" + }, + openai: { + endpoint: process.env.AZURE_OPENAI_ENDPOINT, + key: process.env.AZURE_OPENAI_API_KEY, + embeddingDeployment: process.env.AZURE_OPENAI_EMBEDDING_DEPLOYMENT || "text-embedding-ada-002", + dimensions: parseInt(process.env.AZURE_OPENAI_EMBEDDING_DIMENSIONS || "1536") + }, + benchmark: { + numTestQueries: 5, + topK: 10 + } +}; + +// Initialize OpenAI client +const openaiClient = new OpenAIClient( + config.openai.endpoint, + new AzureKeyCredential(config.openai.key) +); + +/** + * Generate embedding for text + */ +async function generateEmbedding(text) { + try { + const embeddings = await openaiClient.getEmbeddings( + config.openai.embeddingDeployment, + [text] + ); + return embeddings.data[0].embedding; + } catch (error) { + console.error("Error generating embedding:", error.message); + throw error; + } +} + +/** + * Connect to DocumentDB + */ +async function connectToDocumentDB() { + const client = new MongoClient(config.documentdb.connectionString); + await client.connect(); + console.log("✓ Connected to DocumentDB"); + return client; +} + +/** + * Create collection with specific algorithm + */ +async function createCollectionWithAlgorithm(database, algorithmType, suffix = "") { + const collectionName = `embeddings_${algorithmType}${suffix}`; + + try { + // Create collection + const collection = database.collection(collectionName); + + // Define vector search index based on algorithm type + let indexType, indexName; + + if (algorithmType === "ivf") { + indexType = "vector-ivf"; + indexName = "vectorSearchIndex_ivf"; + } 
else if (algorithmType === "hnsw") { + indexType = "vector-hnsw"; + indexName = "vectorSearchIndex_hnsw"; + } else if (algorithmType === "diskann") { + // DiskANN may use similar configuration to HNSW + indexType = "vector-hnsw"; // Placeholder: adjust based on DocumentDB support + indexName = "vectorSearchIndex_diskann"; + } + + const indexDefinition = { + name: indexName, + type: indexType, + definition: { + fields: [ + { + path: "embedding", + type: "vector", + numDimensions: config.openai.dimensions, + similarity: "COS" + } + ] + } + }; + + // Create the index + await collection.createSearchIndex(indexDefinition); + console.log(`✓ Created collection: ${collectionName} with ${indexType} index`); + + return collection; + + } catch (error) { + console.error(`Error creating collection ${collectionName}:`, error.message); + throw error; + } +} + +/** + * Generate test dataset + */ +function generateTestDataset() { + return [ + { + _id: "1", + title: "Introduction to Vector Databases", + content: "Vector databases store and query high-dimensional embeddings for semantic search applications. They enable similarity-based retrieval using approximate nearest neighbor algorithms.", + category: "tutorial" + }, + { + _id: "2", + title: "Understanding Neural Networks", + content: "Neural networks are computing systems inspired by biological neural networks. They learn patterns from data through training and can perform tasks like classification and prediction.", + category: "machine-learning" + }, + { + _id: "3", + title: "Azure DocumentDB Overview", + content: "DocumentDB provides MongoDB compatibility with enterprise features like global distribution, automatic scaling, and comprehensive SLAs for availability and performance.", + category: "cloud-services" + }, + { + _id: "4", + title: "Semantic Search Fundamentals", + content: "Semantic search understands the intent and contextual meaning of search queries. 
Unlike keyword matching, it finds results based on conceptual similarity using embeddings.", + category: "search" + }, + { + _id: "5", + title: "Building RAG Applications", + content: "Retrieval-Augmented Generation combines large language models with information retrieval. It grounds LLM responses in external knowledge bases to reduce hallucinations.", + category: "ai-applications" + }, + { + _id: "6", + title: "Vector Indexing Algorithms", + content: "Different algorithms offer trade-offs between speed and accuracy. IVF provides good balance, while HNSW offers fast approximate nearest neighbor search.", + category: "algorithms" + }, + { + _id: "7", + title: "Embeddings and Representation Learning", + content: "Embeddings map discrete objects to continuous vector spaces where semantic similarity corresponds to geometric proximity. They capture meaning in numerical form.", + category: "machine-learning" + }, + { + _id: "8", + title: "MongoDB Vector Search", + content: "MongoDB's vector search capabilities enable semantic similarity matching on embeddings stored in BSON format. It supports multiple indexing algorithms for different use cases.", + category: "databases" + }, + { + _id: "9", + title: "Natural Language Processing Basics", + content: "NLP enables computers to understand and process human language. It includes tasks like tokenization, named entity recognition, and sentiment analysis.", + category: "machine-learning" + }, + { + _id: "10", + title: "Scalable Search Architecture", + content: "Building scalable search requires distributed indexing, caching strategies, and load balancing. Vector search adds challenges of high-dimensional data management.", + category: "architecture" + }, + { + _id: "11", + title: "Azure OpenAI Service", + content: "Azure OpenAI provides access to powerful language models like GPT-4. 
It includes enterprise features like private networking, managed identity, and content filtering.", + category: "ai-services" + }, + { + _id: "12", + title: "Approximate Nearest Neighbor Search", + content: "ANN algorithms sacrifice perfect accuracy for speed. They use data structures like graphs and trees to quickly find similar vectors in high-dimensional spaces.", + category: "algorithms" + }, + { + _id: "13", + title: "Hybrid Search Strategies", + content: "Combining keyword search with vector search provides better results. Reciprocal rank fusion merges results from multiple retrieval methods effectively.", + category: "search" + }, + { + _id: "14", + title: "Database Performance Optimization", + content: "Optimizing database performance involves indexing strategies, query optimization, and resource allocation. Understanding throughput and latency is essential.", + category: "databases" + }, + { + _id: "15", + title: "Transformer Models", + content: "Transformers revolutionized NLP with attention mechanisms. 
They process sequences in parallel and capture long-range dependencies better than RNNs.", + category: "machine-learning" + } + ]; +} + +/** + * Insert documents into collection + */ +async function insertDocuments(collection, documents) { + console.log(`\nInserting ${documents.length} documents into ${collection.collectionName}...`); + + let successCount = 0; + for (const doc of documents) { + try { + const embedding = await generateEmbedding(doc.content); + const docWithEmbedding = { + ...doc, + embedding: embedding, // BSON array format + createdAt: new Date() + }; + + await collection.insertOne(docWithEmbedding); + successCount++; + + if (successCount % 5 === 0) { + process.stdout.write(` ${successCount}/${documents.length} completed\r`); + } + } catch (error) { + console.error(` Error inserting document ${doc._id}:`, error.message); + } + } + + console.log(` ✓ ${successCount}/${documents.length} documents inserted`); + return successCount; +} + +/** + * Wait for index to be ready + */ +async function waitForIndexReady(collection, indexName, maxWaitMs = 60000) { + const startTime = Date.now(); + + while (Date.now() - startTime < maxWaitMs) { + try { + const indexes = await collection.listSearchIndexes().toArray(); + const index = indexes.find(idx => idx.name === indexName); + + if (index && index.status === "READY") { + return true; + } + } catch (error) { + // Continue waiting + } + + await new Promise(resolve => setTimeout(resolve, 2000)); + } + + return false; +} + +/** + * Generate test queries + */ +function generateTestQueries() { + return [ + "How do vector databases work?", + "What are the best practices for semantic search?", + "Explain machine learning embeddings", + "How to optimize database performance?", + "What is retrieval augmented generation?" 
+  ];
+}
+
+/**
+ * Execute vector search query with optional tuning parameters
+ * @param {Object} options - Query options
+ * @param {number} options.nprobe - IVF parameter (default: 10)
+ * @param {number} options.ef - HNSW parameter (default: 40)
+ */
+async function executeVectorQuery(collection, queryEmbedding, topK = 10, options = {}) {
+  const startTime = Date.now();
+
+  // Build cosmosSearch options
+  const cosmosSearchOptions = {
+    vector: queryEmbedding,
+    path: "embedding",
+    k: topK
+  };
+
+  // Add tuning parameters if provided
+  if (options.nprobe) {
+    cosmosSearchOptions.nprobe = options.nprobe; // IVF tuning
+  }
+  if (options.ef) {
+    cosmosSearchOptions.ef = options.ef; // HNSW tuning
+  }
+
+  const results = await collection.aggregate([
+    {
+      $search: {
+        cosmosSearch: cosmosSearchOptions,
+        returnStoredSource: true
+      }
+    },
+    {
+      $project: {
+        _id: 1,
+        title: 1,
+        category: 1,
+        score: { $meta: "searchScore" }
+      }
+    }
+  ]).toArray();
+
+  const latency = Date.now() - startTime;
+
+  return {
+    results,
+    latency,
+    parameters: options // Return which parameters were used
+  };
+}
+
+/**
+ * Calculate recall between two result sets
+ */
+function calculateRecall(groundTruth, testResults, k = 10) {
+  const groundTruthIds = new Set(groundTruth.slice(0, k).map(r => r._id));
+  const testResultIds = new Set(testResults.slice(0, k).map(r => r._id));
+
+  const intersection = [...testResultIds].filter(id => groundTruthIds.has(id));
+  const recall = intersection.length / Math.min(k, groundTruth.length);
+
+  return {
+    recall: recall,
+    matchCount: intersection.length,
+    totalRelevant: Math.min(k, groundTruth.length)
+  };
+}
+
+/**
+ * Run benchmark for a specific algorithm
+ */
+async function runAlgorithmBenchmark(collection, algorithmName, testQueries, groundTruthResults = null) {
+  console.log(`\n--- Benchmarking ${algorithmName} ---`);
+
+  const results = {
+    algorithm: algorithmName,
+    queries: [],
+    
avgLatency: 0,
+    avgRecall: null,
+    totalQueries: testQueries.length
+  };
+
+  for (let i = 0; i < testQueries.length; i++) {
+    const query = testQueries[i];
+    console.log(`\nQuery ${i + 1}/${testQueries.length}: "${query}"`);
+
+    try {
+      const embedding = await generateEmbedding(query);
+      const { results: queryResults, latency } = await executeVectorQuery(
+        collection,
+        embedding,
+        config.benchmark.topK
+      );
+
+      let recallData = null;
+      if (groundTruthResults && groundTruthResults[i]) {
+        recallData = calculateRecall(
+          groundTruthResults[i].results,
+          queryResults,
+          config.benchmark.topK
+        );
+        console.log(`  Recall@${config.benchmark.topK}: ${(recallData.recall * 100).toFixed(2)}%`);
+      }
+
+      console.log(`  Latency: ${latency}ms`);
+      console.log(`  Results: ${queryResults.length} documents`);
+
+      results.queries.push({
+        query,
+        latency,
+        resultCount: queryResults.length,
+        results: queryResults, // Keep full results so this run can serve as ground truth later
+        recall: recallData,
+        topResult: queryResults[0]?.title
+      });
+
+    } catch (error) {
+      console.error(`  Error executing query: ${error.message}`);
+    }
+  }
+
+  results.avgLatency = results.queries.reduce((sum, q) => sum + q.latency, 0) / results.queries.length;
+
+  if (groundTruthResults) {
+    const recalls = results.queries.map(q => q.recall?.recall).filter(r => r !== undefined);
+    results.avgRecall = recalls.length > 0
+      ? 
recalls.reduce((sum, r) => sum + r, 0) / recalls.length + : null; + } + + return results; +} + +/** + * Display comparison table + */ +function displayComparisonTable(benchmarkResults) { + console.log("\n" + "=".repeat(80)); + console.log("ALGORITHM COMPARISON SUMMARY"); + console.log("=".repeat(80)); + + console.log("\n" + "-".repeat(80)); + console.log("Algorithm".padEnd(20) + + "Avg Latency".padEnd(15) + + "Avg Recall".padEnd(15) + + "Characteristics"); + console.log("-".repeat(80)); + + const algorithmInfo = { + ivf: { chars: "Balanced recall/latency" }, + hnsw: { chars: "Fast, tunable (ef, m)" }, + diskann: { chars: "Scalable for large datasets" } + }; + + benchmarkResults.forEach(result => { + const info = algorithmInfo[result.algorithm] || { chars: "N/A" }; + const recallStr = result.avgRecall !== null + ? `${(result.avgRecall * 100).toFixed(2)}%` + : "N/A (baseline)"; + + console.log( + result.algorithm.toUpperCase().padEnd(20) + + `${result.avgLatency.toFixed(2)}ms`.padEnd(15) + + recallStr.padEnd(15) + + info.chars + ); + }); + + console.log("-".repeat(80)); +} + +/** + * Display recommendations + */ +function displayRecommendations(benchmarkResults) { + console.log("\n" + "=".repeat(80)); + console.log("ALGORITHM SELECTION RECOMMENDATIONS"); + console.log("=".repeat(80)); + + console.log("\n📌 ALGORITHM SELECTION GUIDE:"); + console.log("\n IVF (Inverted File):"); + console.log(" • Use for: Medium-scale datasets (10K-100K vectors)"); + console.log(" • Recall: ~90-95%"); + console.log(" • Tuning: Adjust nprobe (higher = better recall)"); + console.log(" • Best for: Balanced recall/latency requirements"); + + console.log("\n HNSW (Hierarchical Navigable Small World):"); + console.log(" • Use for: Fast queries with good recall"); + console.log(" • Recall: ~92-97%"); + console.log(" • Tuning: Adjust ef (query time) and m (build time)"); + console.log(" • Best for: Real-time search, low latency priority"); + + console.log("\n DiskANN:"); + console.log(" • Use 
for: Very large scale (> 100K vectors)"); + console.log(" • Recall: ~90-99% (tunable)"); + console.log(" • Best for: Enterprise-scale semantic search"); + + console.log("\n🎯 DECISION TREE:"); + console.log(" • Low latency priority → Use HNSW (tune ef)"); + console.log(" • Balanced needs → Use IVF (tune nprobe)"); + console.log(" • Large scale (> 100K) → Use DiskANN"); + + console.log("\n🔧 TUNING PARAMETERS:"); + console.log(" IVF:"); + console.log(" • nprobe: 10 (default) → 20, 50, 100 (higher recall)"); + console.log(" HNSW:"); + console.log(" • ef: 40 (default) → 60, 80, 100 (higher recall)"); + console.log(" • m: 16 (default, set at build time)"); +} + +/** + * Main execution + */ +/** + * Demonstrate parameter tuning with IVF nprobe + */ +async function demonstrateNprobeTuning(collection, algorithmName, testQueries) { + console.log(`\n--- Demonstrating nprobe Tuning (${algorithmName}) ---`); + + if (algorithmName !== "ivf") { + console.log("Skipping: nprobe only applies to IVF indexes"); + return; + } + + const queryEmbedding = await generateEmbedding(testQueries[0]); + const nprobeValues = [10, 20, 50]; + + console.log(`\nTesting query: "${testQueries[0]}"`); + console.log("Comparing different nprobe values:\n"); + + for (const nprobe of nprobeValues) { + const { results, latency } = await executeVectorQuery( + collection, + queryEmbedding, + 10, + { nprobe } + ); + + console.log(`nprobe=${nprobe}:`); + console.log(` Latency: ${latency}ms`); + console.log(` Results: ${results.length}`); + console.log(` Top result: ${results[0]?.title || 'N/A'}`); + } + + console.log("\nObservation:"); + console.log(" • Higher nprobe = more clusters searched"); + console.log(" • Typically improves recall by 3-5%"); + console.log(" • Adds latency cost (20-50% increase)"); +} + +/** + * Demonstrate parameter tuning with HNSW ef + */ +async function demonstrateEfTuning(collection, algorithmName, testQueries) { + console.log(`\n--- Demonstrating ef Tuning (${algorithmName}) ---`); 
+ + if (algorithmName !== "hnsw") { + console.log("Skipping: ef only applies to HNSW indexes"); + return; + } + + const queryEmbedding = await generateEmbedding(testQueries[0]); + const efValues = [40, 60, 80]; + + console.log(`\nTesting query: "${testQueries[0]}"`); + console.log("Comparing different ef values:\n"); + + for (const ef of efValues) { + const { results, latency } = await executeVectorQuery( + collection, + queryEmbedding, + 10, + { ef } + ); + + console.log(`ef=${ef}:`); + console.log(` Latency: ${latency}ms`); + console.log(` Results: ${results.length}`); + console.log(` Top result: ${results[0]?.title || 'N/A'}`); + } + + console.log("\nObservation:"); + console.log(" • Higher ef = wider search in graph"); + console.log(" • Typically improves recall by 2-4%"); + console.log(" • Adds moderate latency cost"); +} + +async function main() { + console.log("=".repeat(80)); + console.log("Azure DocumentDB (MongoDB) - Vector Index Algorithms Benchmark"); + console.log("=".repeat(80)); + + let client; + + try { + // Step 1: Connect + console.log("\n[1/7] Connecting to DocumentDB..."); + client = await connectToDocumentDB(); + const database = client.db(config.documentdb.databaseName); + + // Step 2: Generate test data + console.log("\n[2/7] Generating test dataset..."); + const testDataset = generateTestDataset(); + console.log(`✓ Generated ${testDataset.length} test documents`); + + // Step 3: Create collections for each algorithm + console.log("\n[3/7] Creating collections with different algorithms..."); + const collectionIVF = await createCollectionWithAlgorithm(database, "ivf"); + const collectionHNSW = await createCollectionWithAlgorithm(database, "hnsw"); + + // Step 4: Insert data + console.log("\n[4/7] Inserting test data into collections..."); + await insertDocuments(collectionIVF, testDataset); + await insertDocuments(collectionHNSW, testDataset); + + // Step 5: Wait for indexes + console.log("\n[5/7] Waiting for indexes to be ready..."); + await 
waitForIndexReady(collectionIVF, "vectorSearchIndex_ivf");
+    await waitForIndexReady(collectionHNSW, "vectorSearchIndex_hnsw");
+    console.log("✓ Indexes are ready");
+
+    // Step 6: Run benchmarks
+    console.log("\n[6/7] Running benchmarks...");
+    const testQueries = generateTestQueries();
+    console.log(`✓ Generated ${testQueries.length} test queries`);
+    console.log("=".repeat(80));
+
+    const ivfResults = await runAlgorithmBenchmark(collectionIVF, "ivf", testQueries);
+    const hnswResults = await runAlgorithmBenchmark(collectionHNSW, "hnsw", testQueries, ivfResults.queries);
+
+    const allResults = [ivfResults, hnswResults];
+
+    // Step 7: Demonstrate parameter tuning, then display results
+    console.log("\n[7/7] Demonstrating parameter tuning...");
+    await demonstrateNprobeTuning(collectionIVF, "ivf", testQueries);
+    await demonstrateEfTuning(collectionHNSW, "hnsw", testQueries);
+
+    console.log("\n✓ Analysis complete");
+    displayComparisonTable(allResults);
+    displayRecommendations(allResults);
+
+    console.log("\n" + "=".repeat(80));
+    console.log("✓ Benchmark completed successfully");
+    console.log("=".repeat(80));
+
+    console.log("\n💡 Next Steps:");
+    console.log("  • Review algorithm characteristics above");
+    console.log("  • Test with your production data");
+    console.log("  • Tune parameters based on your SLOs");
+    console.log("  • Monitor recall and latency in production");
+
+    console.log("\n🧹 Cleanup:");
+    console.log("  To delete test collections:");
+    console.log(`  • ${collectionIVF.collectionName}`);
+    console.log(`  • ${collectionHNSW.collectionName}`);
+
+  } catch (error) {
+    console.error("\n✗ Error:", error.message);
+    console.error(error);
+    process.exit(1);
+  } finally {
+    if (client) {
+      await client.close();
+      console.log("\n✓ Connection closed");
+    }
+  }
+}
+
+// Run the benchmark
+if (require.main === module) {
+  main().catch(console.error);
+}
+
+module.exports = {
+  generateEmbedding,
+  connectToDocumentDB,
+  createCollectionWithAlgorithm,
+  
executeVectorQuery, + calculateRecall, + runAlgorithmBenchmark +}; diff --git a/ai/select-algorithm-typescript/package.json b/ai/select-algorithm-typescript/package.json new file mode 100644 index 0000000..4ccb35a --- /dev/null +++ b/ai/select-algorithm-typescript/package.json @@ -0,0 +1,32 @@ +{ + "name": "documentdb-vector-algorithms", + "version": "1.0.0", + "description": "Azure DocumentDB (MongoDB) Vector Index Algorithms & Query Behavior Sample", + "main": "index.js", + "scripts": { + "start": "node index.js", + "benchmark": "node index.js" + }, + "keywords": [ + "azure", + "documentdb", + "mongodb", + "vector-search", + "algorithms", + "benchmark", + "ivf", + "hnsw", + "ann", + "performance" + ], + "author": "", + "license": "MIT", + "dependencies": { + "mongodb": "^6.3.0", + "@azure/openai": "^1.0.0-beta.12", + "dotenv": "^16.4.5" + }, + "engines": { + "node": ">=18.0.0" + } +} diff --git a/ai/semantic-search-typescript/README.md b/ai/semantic-search-typescript/README.md new file mode 100644 index 0000000..ddd615d --- /dev/null +++ b/ai/semantic-search-typescript/README.md @@ -0,0 +1,15 @@ +# DocumentDB - Vector Store Semantic Search + +Pure semantic search using cosmosSearch. + +## Setup +```bash +npm install +cp .env.example .env +npm start +``` + +## Key Features +- cosmosSearch aggregation +- Distance metric guide (FIXED) +- Score interpretation diff --git a/ai/semantic-search-typescript/article.md b/ai/semantic-search-typescript/article.md new file mode 100644 index 0000000..1d9f91d --- /dev/null +++ b/ai/semantic-search-typescript/article.md @@ -0,0 +1,124 @@ +# Vector Store Semantic Search in Azure DocumentDB + +**Purpose:** Learn how to perform semantic similarity searches using vector embeddings in Azure DocumentDB (MongoDB vCore) using the cosmosSearch aggregation stage. 
+ +## Prerequisites +- Completion of Topic 2 and Topic 3 +- Azure DocumentDB account (MongoDB vCore) +- Collection with vector index +- Node.js 18.x or later +- Azure OpenAI resource + +## What You'll Learn +- How does semantic similarity work? +- What distance metrics should I use? +- How many results should I retrieve (top-k)? +- How do I interpret scores? + +## Understanding Semantic Search +Semantic search finds documents based on meaning rather than keywords. + +## Distance Metrics in DocumentDB + +**FIXED SECTION - ADDED** + +DocumentDB vector indexes support three distance metrics configured at index creation: + +| Metric | DocumentDB Value | Range | Best For | Interpretation | +|--------|-----------------|-------|----------|----------------| +| **Cosine** | "COS" | 0 to 1 (similarity) | General text embeddings | 1 = identical, 0 = opposite | +| **Inner Product** | "IP" | -∞ to ∞ | Normalized vectors | Higher = more similar | +| **Euclidean (L2)** | "L2" | 0 to ∞ | Geometric distance | 0 = identical, larger = different | + +### Choosing a Distance Metric + +**Use Cosine (COS) - Recommended for most cases:** +- ✅ Text embeddings from models like text-embedding-ada-002 +- ✅ When direction matters more than magnitude +- ✅ General semantic similarity +- Returns normalized similarity scores (0 to 1, higher = better) + +**Use Inner Product (IP):** +- ✅ Pre-normalized embeddings (length = 1) +- ✅ Slightly faster than cosine +- ✅ When working with specific model requirements + +**Use Euclidean (L2):** +- ✅ When absolute distance matters +- ✅ Some image embedding models +- ✅ Geometric proximity calculations + +**Setting at Index Creation:** +```javascript +await collection.createSearchIndex({ + name: "vectorSearchIndex", + type: "vector-hnsw", + definition: { + fields: [{ + path: "embedding", + type: "vector", + numDimensions: 1536, + similarity: "COS" // or "IP" or "L2" + }] + } +}); +``` + +**Default recommendation: Use "COS" (cosine) for text embeddings** + +## 
DocumentDB Vector Search Syntax +```javascript +const results = await collection.aggregate([ + { + $search: { + cosmosSearch: { + vector: queryEmbedding, + path: "embedding", + k: 10 + }, + returnStoredSource: true + } + }, + { + $project: { + _id: 1, + title: 1, + score: { $meta: "searchScore" } + } + } +]).toArray(); +``` + +## Interpreting Similarity Scores +DocumentDB returns similarity scores (higher = more similar): + +Score ranges for Cosine similarity: +- 0.9-1.0: Highly similar ⭐⭐⭐ +- 0.7-0.9: Similar ⭐⭐ +- 0.5-0.7: Moderately similar ⭐ +- 0.3-0.5: Weakly similar +- <0.3: Dissimilar + +## Choosing Top-K +| Use Case | Recommended K | +|----------|--------------| +| Direct user search | 5-10 | +| RAG context | 3-5 | +| Recommendations | 10-20 | + +## Advanced Query Patterns +- Vector search with MongoDB filters +- Score thresholding +- Multi-field projection + +## Best Practices +✅ Use cosine (COS) for text embeddings +✅ Top-K typically 5-10 for user search +✅ Combine with MongoDB filters +✅ Monitor query performance + +## Key Takeaways +- Use cosmosSearch in $search aggregation +- Cosine similarity: higher scores = better matches +- Set distance metric at index creation +- Top-K typically 5-10 diff --git a/ai/semantic-search-typescript/index.js b/ai/semantic-search-typescript/index.js new file mode 100644 index 0000000..02651a2 --- /dev/null +++ b/ai/semantic-search-typescript/index.js @@ -0,0 +1,48 @@ +const { MongoClient } = require("mongodb"); +const { OpenAIClient, AzureKeyCredential } = require("@azure/openai"); +require("dotenv").config(); + +const config = { + documentdb: { connectionString: process.env.DOCUMENTDB_CONNECTION_STRING, databaseName: process.env.DOCUMENTDB_DATABASE_NAME || "vectordb", collectionName: process.env.DOCUMENTDB_COLLECTION_NAME || "embeddings" }, + openai: { endpoint: process.env.AZURE_OPENAI_ENDPOINT, key: process.env.AZURE_OPENAI_API_KEY, embeddingDeployment: process.env.AZURE_OPENAI_EMBEDDING_DEPLOYMENT || 
"text-embedding-ada-002" } +}; + +const openaiClient = new OpenAIClient(config.openai.endpoint, new AzureKeyCredential(config.openai.key)); + +async function generateQueryEmbedding(queryText) { + const result = await openaiClient.getEmbeddings(config.openai.embeddingDeployment, [queryText]); + return result.data[0].embedding; +} + +async function semanticSearch(collection, queryText, topK = 10) { + console.log(\`\\n=== Semantic Search: "\${queryText}" ===\`); + const queryEmbedding = await generateQueryEmbedding(queryText); + const results = await collection.aggregate([ + { $search: { cosmosSearch: { vector: queryEmbedding, path: "embedding", k: topK }, returnStoredSource: true } }, + { $project: { _id: 1, title: 1, content: 1, score: { $meta: "searchScore" } } } + ]).toArray(); + console.log(\`Results: \${results.length} documents\`); + return results; +} + +async function main() { + console.log("=".repeat(80)); + console.log("Azure DocumentDB - Vector Store Semantic Search"); + console.log("=".repeat(80)); + const client = new MongoClient(config.documentdb.connectionString); + await client.connect(); + const collection = client.db(config.documentdb.databaseName).collection(config.documentdb.collectionName); + try { + const results = await semanticSearch(collection, "machine learning fundamentals", 5); + results.forEach((doc, i) => { + const stars = doc.score > 0.9 ? "⭐⭐⭐" : doc.score > 0.7 ? "⭐⭐" : "⭐"; + console.log(\`\${i + 1}. 
\${doc.title} - Score: \${doc.score.toFixed(4)} \${stars}\`); + }); + console.log("\\n✓ Semantic search complete"); + } finally { + await client.close(); + } +} + +if (require.main === module) { main().catch(console.error); } +module.exports = { semanticSearch, generateQueryEmbedding }; diff --git a/ai/semantic-search-typescript/package.json b/ai/semantic-search-typescript/package.json new file mode 100644 index 0000000..b6f8def --- /dev/null +++ b/ai/semantic-search-typescript/package.json @@ -0,0 +1,11 @@ +{ + "name": "$dir", + "version": "1.0.0", + "main": "index.js", + "scripts": { "start": "node index.js" }, + "dependencies": { + "mongodb": "^6.3.0", + "@azure/openai": "^1.0.0-beta.12", + "dotenv": "^16.4.5" + } +}
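One pattern the article lists under Advanced Query Patterns but does not show, score thresholding, can be sketched as a small post-processing step over results returned by `semanticSearch`. Illustrative only; the 0.7 cutoff is an assumed starting point taken from the article's own score guide, not a rule:

```javascript
// Keep only hits above a minimum similarity score, e.g. before passing
// them to an LLM as RAG context. The default cutoff (0.7, the article's
// "similar" band for cosine scores) should be tuned per dataset.
function filterByScore(results, minScore = 0.7) {
  return results.filter((doc) => doc.score >= minScore);
}

const sample = [
  { title: "Intro to ML", score: 0.92 },
  { title: "Cooking pasta", score: 0.41 },
];
console.log(filterByScore(sample).map((d) => d.title)); // [ 'Intro to ML' ]
```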