Azure-Samples · diberry · May 20, 2026 · May 11, 2026 · May 15, 2026 · May 20, 2026
diff --git a/.devcontainer/devcontainer.json b/.devcontainer/devcontainer.json
@@ -11,9 +11,6 @@
 	"customizations": {
 		"vscode": {
 			"extensions": [
-				"ms-azuretools.vscode-cosmosdb",
-				"buildwithlayer.mongodb-integration-expert-qS6DB",
-				"mongodb.mongodb-vscode",
 				"ms-azuretools.vscode-documentdb"
 			]
 		}

diff --git a/.devcontainer/typescript/devcontainer.json b/.devcontainer/typescript/devcontainer.json
@@ -11,10 +11,8 @@
 	"customizations": {
 		"vscode": {
 			"extensions": [
-				"ms-azuretools.vscode-cosmosdb",
-				"buildwithlayer.mongodb-integration-expert-qS6DB",
-				"mongodb.mongodb-vscode",
-				"ms-azuretools.vscode-documentdb"
+				"ms-azuretools.vscode-documentdb",
+				"mongodb.mongodb-vscode"
 			]
 		}
 	}

diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
@@ -105,13 +105,14 @@ All samples MUST use these environment variable names and defaults:
 - efSearch: 40
 
 ### DiskANN
-- maxDegree: 20
-- lBuild: 10
+- vector-search samples: maxDegree: 20, lBuild: 10
+- select-algorithm compare-all samples: maxDegree: 32, lBuild: 50
 - lSearch: 40
+- Select-algorithm samples use higher values for meaningful comparison results.
 
 ## Rules
 
-1. **No Cosmos DB references.**Never use "Cosmos DB", "cosmosdb", "MongoDB vCore", or "mongo.cosmos.azure.com". Always use "Azure DocumentDB" and "documentdb.azure.com". Exception: `mongocluster.cosmos.azure.com` (hostname), `cosmosSearch` (API command), and `ms-azuretools.vscode-cosmosdb` (VS Code extension) are valid and NOT Cosmos references.
+1. **No Cosmos DB references.** Never use "Cosmos DB", "cosmosdb", "MongoDB vCore", or "mongo.cosmos.azure.com". Always use "Azure DocumentDB" and "documentdb.azure.com". Exception: `mongocluster.cosmos.azure.com` (hostname) and `cosmosSearch` (API command) are valid and NOT Cosmos references.
 2. **Vector field name is DescriptionVector.** Never default to "contentVector".
 3. **Data file path from env var.** Code reads `DATA_FILE_WITH_VECTORS`. The default depends on the sample category: vector-search samples use `../data/Hotels_Vector.json` (shared data directory one level up), while select-algorithm samples use `data/Hotels_Vector.json` (local copy in each sample). .NET copies data locally to `data/Hotels_Vector.json` in the build output.
 4. **Batch size is LOAD_SIZE_BATCH=100.** Do not use BATCH_SIZE or other variants.
@@ -121,6 +122,37 @@ All samples MUST use these environment variable names and defaults:
 8. **Output files are committed.** Each sample has an `output/` directory with expected output for each algorithm (`ivf.txt`, `hnsw.txt`, `diskann.txt`). Update these when output format changes.
 9. **DocumentDB supports all index types at any dataset size.** IVF, HNSW, and DiskANN are all available — do not imply tier restrictions limit algorithm availability.
 10. **No dotenv libraries.** Do NOT use `python-dotenv`, `godotenv`, `dotenv` (npm), or any `.env` file-loading library. Environment variables must be passed via the CLI invocation, not loaded from `.env` files at runtime. This keeps samples explicit and avoids hidden configuration.
-11. **Collection naming:** `hotels_{algorithm}` (e.g., `hotels_ivf`, `hotels_hnsw`, `hotels_diskann`). Index naming: `vectorIndex_{algorithm}`.
+11. **Collection naming:** Standard per-algorithm samples use `hotels_{algorithm}` (e.g., `hotels_ivf`, `hotels_hnsw`, `hotels_diskann`). Standard index naming is `vectorIndex_{algorithm}`. Compare-all samples that drop and recreate a single collection use collection `hotels` and index naming `vector_{algorithm}_{metric}` (for example, `vector_ivf_cos`). TypeScript `select-algorithm.ts` remains a separate per-collection mode.
 12. **Vector search uses k=5.** All samples return top 5 results. Do not parameterize k unless explicitly required.
 13. **Use the Global read-write hostname.** All samples MUST use the Global read-write connection string format: `<clusterName>.global.mongocluster.cosmos.azure.com`. The `.global.` form auto-follows the active write region after a replica promotion. The non-`.global.` form pins to one cluster and silently becomes read-only after failover — reserve that for read-scale-out scenarios only. (Confirmed by Khelan Modi, DocumentDB PM.)
+14. **VS Code extension is DocumentDB for VS Code.** Always reference [DocumentDB for VS Code](https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-documentdb) (`ms-azuretools.vscode-documentdb`). Never reference the Azure Databases extension (`ms-azuretools.vscode-cosmosdb`).
+
+## Sample Review Checklist
+
+Use this checklist when creating new samples or reviewing existing ones. Derived from PM (Khelan Modi) feedback.
+
+### Branding & Naming
+- [ ] Environment variables use `DOCUMENTDB_CLUSTER_NAME` (not `MONGO_CLUSTER_NAME`) for select-algorithm samples
+- [ ] All references say "Azure DocumentDB" — no "Cosmos DB" or "MongoDB vCore"
+- [ ] Connection hostname uses `.global.mongocluster.cosmos.azure.com` format
+
+### Tooling References
+- [ ] VS Code extension references point to [DocumentDB for VS Code](https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-documentdb) (`ms-azuretools.vscode-documentdb`)
+- [ ] No references to Data Explorer for data browsing — use the VS Code extension instead
+- [ ] No references to the old Azure Databases extension (`ms-azuretools.vscode-cosmosdb`)
+
+### Index Selection Guidance
+- [ ] IVF is positioned for dev/test, demos, and small clusters (works on any tier)
+- [ ] DiskANN is the default recommendation for production (M30+ clusters)
+- [ ] HNSW is positioned for production when maximum recall is the top priority (M30+)
+- [ ] Decision table or clear guidance helps readers pick the right algorithm quickly
+
+### DiskANN as Default
+- [ ] DiskANN recommendation is prominent (not buried in a footnote)
+- [ ] Higher dimension support called out (up to 16,000 vs HNSW's 8,000)
+- [ ] Memory efficiency explained (index on disk, frees RAM for read/write ops)
+- [ ] Operational benefits mentioned (lighter updates, easier backups, faster recovery)
+- [ ] Future-proofing noted (less likely to need index redesign as models evolve)
+
+### Optional Enhancements
+- [ ] Consider mentioning the DocumentDB agent kit (`npx skills add Azure/documentdb-agent-kit`) where appropriate (currently beta and optional)
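
Reviewer note (not part of the patch): the naming rules in this file (rules 11 and 13) can be illustrated with a minimal Python sketch. The function names below are hypothetical and do not appear in the repository. The only metric suffix confirmed by the instructions is `cos` (from `vector_ivf_cos`); any other suffix is an assumption.

```python
# Hypothetical helpers illustrating the naming rules; not code from the repo.

def standard_names(algorithm: str) -> tuple[str, str]:
    """Per-algorithm samples: one collection and one index per algorithm."""
    return f"hotels_{algorithm}", f"vectorIndex_{algorithm}"


def compare_all_names(algorithm: str, metric: str) -> tuple[str, str]:
    """Compare-all samples: a single 'hotels' collection that is dropped and
    recreated, with the index named after the algorithm and metric."""
    return "hotels", f"vector_{algorithm}_{metric}"


def global_hostname(cluster_name: str) -> str:
    """Global read-write hostname form required by rule 13."""
    return f"{cluster_name}.global.mongocluster.cosmos.azure.com"


if __name__ == "__main__":
    print(standard_names("diskann"))
    print(compare_all_names("ivf", "cos"))
    print(global_hostname("my-cluster"))  # "my-cluster" is a placeholder name
```

Keeping the two naming schemes in separate functions mirrors the rule's distinction between per-algorithm samples and compare-all samples, so a reviewer can check which mode a sample follows.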
diff --git a/.github/workflows/validate-samples.yml b/.github/workflows/validate-samples.yml
@@ -31,6 +31,7 @@ jobs:
         sample:
           - vector-search-typescript
           - vector-search-agent-typescript
+          - select-algorithm-typescript
 
     steps:
       - name: Checkout code
@@ -52,10 +53,16 @@ jobs:
         run: npm run build
 
   validate-dotnet:
-    name: .NET
+    name: .NET - ${{ matrix.sample }}
     runs-on: ubuntu-latest
     timeout-minutes: 10
     continue-on-error: false
+    strategy:
+      fail-fast: false
+      matrix:
+        sample:
+          - documentdb-samples.sln
+          - ai/select-algorithm-dotnet/SelectAlgorithm.csproj
 
     steps:
       - name: Checkout code
@@ -66,8 +73,8 @@ jobs:
         with:
           dotnet-version: '8.0.x'
 
-      - name: Build solution
-        run: dotnet build documentdb-samples.sln
+      - name: Build
+        run: dotnet build ${{ matrix.sample }}
 
   validate-go:
     name: Go - ${{ matrix.sample }}
@@ -80,6 +87,7 @@ jobs:
         sample:
           - vector-search-go
           - vector-search-agent-go
+          - select-algorithm-go
 
     steps:
       - name: Checkout code
@@ -102,14 +110,20 @@ jobs:
               go build -o /dev/null "$f" utils.go
             done
           else
-            go build ./...
+            go build -o /dev/null ./...
           fi
 
   validate-python:
-    name: Python
+    name: Python - ${{ matrix.sample }}
     runs-on: ubuntu-latest
     timeout-minutes: 10
     continue-on-error: false
+    strategy:
+      fail-fast: false
+      matrix:
+        sample:
+          - vector-search-python
+          - select-algorithm-python
 
     steps:
       - name: Checkout code
@@ -121,19 +135,25 @@ jobs:
           python-version: '3.11'
 
       - name: Install dependencies
-        working-directory: ai/vector-search-python
+        working-directory: ai/${{ matrix.sample }}
         run: pip install -r requirements.txt
 
       - name: Validate Python syntax
-        working-directory: ai/vector-search-python
+        working-directory: ai/${{ matrix.sample }}
         run: |
           find . -name "*.py" -exec python -m py_compile {} +
 
   validate-java:
-    name: Java
+    name: Java - ${{ matrix.sample }}
     runs-on: ubuntu-latest
     timeout-minutes: 10
     continue-on-error: false
+    strategy:
+      fail-fast: false
+      matrix:
+        sample:
+          - vector-search-java
+          - select-algorithm-java
 
     steps:
       - name: Checkout code
@@ -147,5 +167,5 @@ jobs:
           cache: 'maven'
 
       - name: Compile Java
-        working-directory: ai/vector-search-java
+        working-directory: ai/${{ matrix.sample }}
         run: mvn compile -DskipTests
diff --git a/ai/includes/choosing-algorithm.md b/ai/includes/choosing-algorithm.md
@@ -0,0 +1,41 @@
+### Choosing the right algorithm
+
+> [!IMPORTANT]
+> For production workloads, start with **DiskANN** on an M30+ cluster. DiskANN supports higher embedding dimensions, uses less cluster memory, and is less likely to require an index redesign as your models evolve.
+
+Use this quick-reference table to select the right algorithm for your workload:
+
+| Scenario | Algorithm | Cluster tier | Max dimensions |
+|----------|-----------|--------------|----------------|
+| Dev/test, demos, small datasets | **IVF** | Any (free tier OK) | 2,000 |
+| Production (default) | **DiskANN** | M30+ | 16,000 |
+| Production (max recall priority) | **HNSW** | M30+ | 8,000 |
+
+**IVF** (inverted file index):
+- Best for: Test environments, demos, and small clusters
+- Pros: Fast to build, low resource requirements, works on any cluster tier
+- Cons: Lower recall compared to graph-based algorithms at scale
+- Tune: Increase `numLists` for larger datasets, increase `nProbes` for better recall
+
+**DiskANN** (disk-based approximate nearest neighbor) — *recommended for production*:
+- Best for: Production workloads on M30+ clusters
+- Pros: Supports embeddings up to 16,000 dimensions; keeps most index data on disk, freeing cluster memory for reads and writes; lighter index updates, easier backups, and faster recovery
+- Cons: Requires M30+ cluster tier
+- Tune: Increase `maxDegree` and `lBuild` for better accuracy, increase `lSearch` for better recall
+- Why default: As embedding models evolve (some already exceed 8,000 dimensions), DiskANN avoids costly index redesigns. Its disk-based architecture also means your cluster memory stays available for operational workloads rather than index storage.
+
+**HNSW** (hierarchical navigable small world):
+- Best for: Production workloads on M30+ clusters where maximum recall is the top priority
+- Pros: Excellent recall, fast queries
+- Cons: Requires M30+ cluster tier; supports embeddings only up to 8,000 dimensions (vs 16,000 for DiskANN); higher memory usage because the full graph lives in RAM
+- Tune: Increase `m` and `efConstruction` for better index quality, increase `efSearch` for better recall
+
+### Choosing the right similarity function
+
+| Function | Score meaning | Best for |
+|----------|-------------|----------|
+| **COS (Cosine)** | Higher = more similar (range −1 to 1; typically 0–1 for text embeddings) | Text embeddings (normalized vectors) |
+| **L2 (Euclidean)** | Lower = more similar (distance) | When magnitude matters |
+| **IP (Inner Product)** | Higher = more similar | Equivalent to COS for normalized vectors |
+
+For the `text-embedding-3-small` model used in this quickstart, **COS (cosine similarity) is recommended** because OpenAI embeddings are normalized and optimized for cosine similarity.
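
Reviewer note (not part of the patch): the claim in the similarity table that IP is equivalent to COS for normalized vectors can be checked with a small, dependency-free Python sketch. The vectors are made-up two-dimensional examples, not real embeddings, and the helper names are illustrative only.

```python
import math


def dot(a, b):
    """Inner product (IP) of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


def cosine_similarity(a, b):
    """COS: inner product divided by the product of the lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def l2_distance(a, b):
    """L2: straight-line distance; lower means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy unit vectors standing in for normalized embeddings.
query = normalize([3.0, 4.0])
doc_close = normalize([4.0, 3.0])
doc_far = normalize([-4.0, 3.0])

if __name__ == "__main__":
    for name, doc in (("close", doc_close), ("far", doc_far)):
        print(name,
              "COS:", round(cosine_similarity(query, doc), 4),
              "IP:", round(dot(query, doc), 4),
              "L2:", round(l2_distance(query, doc), 4))
```

For unit-length vectors the denominator in the cosine formula is 1, so COS and IP produce the same score, and L2 distance equals `sqrt(2 - 2 * COS)`, which means all three functions rank results in the same order. The functions diverge only when vector magnitudes differ.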
diff --git a/ai/select-algorithm-dotnet/.devcontainer/devcontainer.json b/ai/select-algorithm-dotnet/.devcontainer/devcontainer.json
@@ -0,0 +1,48 @@
+{
+  "name": "Azure DocumentDB Select Algorithm - .NET 8",
+  "image": "mcr.microsoft.com/devcontainers/dotnet:1-8.0-bookworm",
+
+  "features": {
+    "ghcr.io/devcontainers/features/azure-cli:1": {},
+    "ghcr.io/devcontainers/features/github-cli:1": {},
+    "ghcr.io/devcontainers/features/common-utils:2": {
+      "installZsh": true,
+      "configureZshAsDefaultShell": true,
+      "installOhMyZsh": true
+    }
+  },
+
+  "customizations": {
+    "vscode": {
+      "extensions": [
+        "ms-dotnettools.csdevkit",
+        "ms-dotnettools.vscodeintellicode-csharp",
+        "ms-azuretools.vscode-azureresourcegroups",
+        "ms-azuretools.vscode-documentdb",
+        "mongodb.mongodb-vscode"
+      ],
+      "settings": {
+        "dotnet.completion.showCompletionItemsFromUnimportedNamespaces": true,
+        "files.exclude": {
+          "**/bin": true,
+          "**/obj": true
+        }
+      }
+    }
+  },
+
+  "postCreateCommand": "dotnet restore && dotnet build",
+  "remoteUser": "vscode",
+
+  "containerEnv": {
+    "DOTNET_CLI_TELEMETRY_OPTOUT": "1",
+    "DOTNET_NOLOGO": "1"
+  },
+
+  "mounts": [
+    "source=${localEnv:HOME}${localEnv:USERPROFILE}/.azure,target=/home/vscode/.azure,type=bind,consistency=cached"
+  ],
+
+  "capAdd": ["SYS_PTRACE"],
+  "securityOpt": ["seccomp:unconfined"]
+}
diff --git a/ai/select-algorithm-dotnet/.gitignore b/ai/select-algorithm-dotnet/.gitignore
@@ -0,0 +1,7 @@
+bin/
+obj/
+.env
+
+# Local data copy (user copies from ai/data/)
+data/*.json
+!data/README.md