Merged
3 changes: 0 additions & 3 deletions .devcontainer/devcontainer.json
@@ -11,9 +11,6 @@
"customizations": {
"vscode": {
"extensions": [
"ms-azuretools.vscode-cosmosdb",
"buildwithlayer.mongodb-integration-expert-qS6DB",
"mongodb.mongodb-vscode",
"ms-azuretools.vscode-documentdb"
]
}
6 changes: 2 additions & 4 deletions .devcontainer/typescript/devcontainer.json
@@ -11,10 +11,8 @@
"customizations": {
"vscode": {
"extensions": [
"ms-azuretools.vscode-cosmosdb",
"buildwithlayer.mongodb-integration-expert-qS6DB",
"mongodb.mongodb-vscode",
"ms-azuretools.vscode-documentdb"
"ms-azuretools.vscode-documentdb",
"mongodb.mongodb-vscode"
]
}
}
40 changes: 36 additions & 4 deletions .github/copilot-instructions.md
@@ -105,13 +105,14 @@ All samples MUST use these environment variable names and defaults:
- efSearch: 40

### DiskANN
- maxDegree: 20
- lBuild: 10
- vector-search samples: maxDegree: 20, lBuild: 10
- select-algorithm compare-all samples: maxDegree: 32, lBuild: 50
- lSearch: 40
- Select-algorithm samples use higher values for meaningful comparison results.
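As a rough illustration of how these DiskANN defaults might feed an index definition, the sketch below builds a `createIndexes` command document as a plain Python dict. The `cosmosSearchOptions` shape, the `vector-diskann` kind, and the 1536-dimension default are assumptions drawn from the public vector-search documentation rather than from this diff, and the helper name `diskann_index_command` is hypothetical.

```python
# Build-time defaults from the instructions above, keyed by sample category.
DISKANN_BUILD_DEFAULTS = {
    "vector-search": {"maxDegree": 20, "lBuild": 10},
    "select-algorithm": {"maxDegree": 32, "lBuild": 50},
}

# Query-time default; the same for both sample categories.
DISKANN_L_SEARCH = 40


def diskann_index_command(collection, index_name, category, dimensions=1536):
    """Return a createIndexes command document (assumed shape) for DiskANN.

    `dimensions=1536` is an assumption matching text-embedding-3-small.
    """
    build_params = DISKANN_BUILD_DEFAULTS[category]
    return {
        "createIndexes": collection,
        "indexes": [
            {
                "name": index_name,
                "key": {"DescriptionVector": "cosmosSearch"},
                "cosmosSearchOptions": {
                    "kind": "vector-diskann",  # assumed kind string
                    "dimensions": dimensions,
                    "similarity": "COS",
                    **build_params,
                },
            }
        ],
    }
```

A sample would typically pass the resulting dict to its driver's generic command call; `lSearch` is kept separate here because it is applied at query time, not when the index is built.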

## Rules

1. **No Cosmos DB references.**Never use "Cosmos DB", "cosmosdb", "MongoDB vCore", or "mongo.cosmos.azure.com". Always use "Azure DocumentDB" and "documentdb.azure.com". Exception: `mongocluster.cosmos.azure.com` (hostname), `cosmosSearch` (API command), and `ms-azuretools.vscode-cosmosdb` (VS Code extension) are valid and NOT Cosmos references.
1. **No Cosmos DB references.** Never use "Cosmos DB", "cosmosdb", "MongoDB vCore", or "mongo.cosmos.azure.com". Always use "Azure DocumentDB" and "documentdb.azure.com". Exception: `mongocluster.cosmos.azure.com` (hostname) and `cosmosSearch` (API command) are valid and NOT Cosmos references.
2. **Vector field name is DescriptionVector.** Never default to "contentVector".
3. **Data file path from env var.** Code reads `DATA_FILE_WITH_VECTORS`. The default depends on the sample category: vector-search samples use `../data/Hotels_Vector.json` (shared data directory one level up), while select-algorithm samples use `data/Hotels_Vector.json` (local copy in each sample). .NET copies data locally to `data/Hotels_Vector.json` in the build output.
4. **Batch size is LOAD_SIZE_BATCH=100.** Do not use BATCH_SIZE or other variants.
@@ -121,6 +122,37 @@ All samples MUST use these environment variable names and defaults:
8. **Output files are committed.** Each sample has an `output/` directory with expected output for each algorithm (`ivf.txt`, `hnsw.txt`, `diskann.txt`). Update these when output format changes.
9. **DocumentDB supports all index types at any dataset size.** IVF, HNSW, and DiskANN are all available — do not imply tier restrictions limit algorithm availability.
10. **No dotenv libraries.** Do NOT use `python-dotenv`, `godotenv`, `dotenv` (npm), or any `.env` file-loading library. Environment variables must be passed via the CLI invocation, not loaded from `.env` files at runtime. This keeps samples explicit and avoids hidden configuration.
11. **Collection naming:** `hotels_{algorithm}` (e.g., `hotels_ivf`, `hotels_hnsw`, `hotels_diskann`). Index naming: `vectorIndex_{algorithm}`.
11. **Collection naming:** Standard per-algorithm samples use `hotels_{algorithm}` (e.g., `hotels_ivf`, `hotels_hnsw`, `hotels_diskann`). Standard index naming is `vectorIndex_{algorithm}`. Compare-all samples that drop and recreate a single collection use collection `hotels` and index naming `vector_{algorithm}_{metric}` (for example, `vector_ivf_cos`). TypeScript `select-algorithm.ts` remains a separate per-collection mode.
12. **Vector search uses k=5.** All samples return top 5 results. Do not parameterize k unless explicitly required.
13. **Use the Global read-write hostname.** All samples MUST use the Global read-write connection string format: `<clusterName>.global.mongocluster.cosmos.azure.com`. The `.global.` form auto-follows the active write region after a replica promotion. The non-`.global.` form pins to one cluster and silently becomes read-only after failover — reserve that for read-scale-out scenarios only. (Confirmed by Khelan Modi, DocumentDB PM.)
14. **VS Code extension is DocumentDB for VS Code.** Always reference [DocumentDB for VS Code](https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-documentdb) (`ms-azuretools.vscode-documentdb`). Never reference the Azure Databases extension (`ms-azuretools.vscode-cosmosdb`).
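The naming, hostname, and environment-variable rules above could be captured in a few small helpers. This is a minimal sketch, not code from the samples: the function names are hypothetical, and it assumes the cluster name arrives in `DOCUMENTDB_CLUSTER_NAME` (the variable the select-algorithm samples are expected to use).

```python
import os


def global_hostname(cluster_name):
    """Global read-write hostname (rule 13)."""
    return f"{cluster_name}.global.mongocluster.cosmos.azure.com"


def resource_names(algorithm, metric=None):
    """Return (collection, index) names per rule 11.

    Without a metric: standard per-algorithm naming.
    With a metric: compare-all naming on the single `hotels` collection.
    """
    if metric is None:
        return f"hotels_{algorithm}", f"vectorIndex_{algorithm}"
    return "hotels", f"vector_{algorithm}_{metric}"


def load_settings(category, env=None):
    """Read settings from the process environment only (rule 10: no dotenv)."""
    env = os.environ if env is None else env
    # Rule 3: the default data path depends on the sample category.
    if category == "vector-search":
        default_data = "../data/Hotels_Vector.json"
    else:
        default_data = "data/Hotels_Vector.json"
    return {
        "host": global_hostname(env["DOCUMENTDB_CLUSTER_NAME"]),
        "data_file": env.get("DATA_FILE_WITH_VECTORS", default_data),
        "batch_size": int(env.get("LOAD_SIZE_BATCH", "100")),  # rule 4
        "k": 5,  # rule 12: top 5 results, not parameterized
    }
```

Passing `env` explicitly keeps the helper easy to test while still defaulting to variables supplied on the CLI invocation.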

## Sample Review Checklist

Use this checklist when creating new samples or reviewing existing ones. Derived from PM (Khelan Modi) feedback.

### Branding & Naming
- [ ] Environment variables use `DOCUMENTDB_CLUSTER_NAME` (not `MONGO_CLUSTER_NAME`) for select-algorithm samples
- [ ] All references say "Azure DocumentDB" — no "Cosmos DB" or "MongoDB vCore"
- [ ] Connection hostname uses `.global.mongocluster.cosmos.azure.com` format

### Tooling References
- [ ] VS Code extension references point to [DocumentDB for VS Code](https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-documentdb) (`ms-azuretools.vscode-documentdb`)
- [ ] No references to Data Explorer for data browsing — use the VS Code extension instead
- [ ] No references to the old Azure Databases extension (`ms-azuretools.vscode-cosmosdb`)

### Index Selection Guidance
- [ ] IVF is positioned for dev/test, demos, and small clusters (works on any tier)
- [ ] DiskANN is the default recommendation for production (M30+ clusters)
- [ ] HNSW is positioned for production when maximum recall is the top priority (M30+)
- [ ] Decision table or clear guidance helps readers pick the right algorithm quickly

### DiskANN as Default
- [ ] DiskANN recommendation is prominent (not buried in a footnote)
- [ ] Higher dimension support called out (up to 16,000 vs HNSW's 8,000)
- [ ] Memory efficiency explained (index on disk, frees RAM for read/write ops)
- [ ] Operational benefits mentioned (lighter updates, easier backups, faster recovery)
- [ ] Future-proofing noted (less likely to need index redesign as models evolve)

### Optional Enhancements
- [ ] Consider mentioning DocumentDB agent kit (`npx skills add Azure/documentdb-agent-kit`) where appropriate — currently beta/optional
38 changes: 29 additions & 9 deletions .github/workflows/validate-samples.yml
@@ -31,6 +31,7 @@ jobs:
sample:
- vector-search-typescript
- vector-search-agent-typescript
- select-algorithm-typescript

steps:
- name: Checkout code
@@ -52,10 +53,16 @@ jobs:
run: npm run build

validate-dotnet:
name: .NET
name: .NET - ${{ matrix.sample }}
runs-on: ubuntu-latest
timeout-minutes: 10
continue-on-error: false
strategy:
fail-fast: false
matrix:
sample:
- documentdb-samples.sln
- ai/select-algorithm-dotnet/SelectAlgorithm.csproj

steps:
- name: Checkout code
@@ -66,8 +73,8 @@ jobs:
with:
dotnet-version: '8.0.x'

- name: Build solution
run: dotnet build documentdb-samples.sln
- name: Build
run: dotnet build ${{ matrix.sample }}

validate-go:
name: Go - ${{ matrix.sample }}
@@ -80,6 +87,7 @@ jobs:
sample:
- vector-search-go
- vector-search-agent-go
- select-algorithm-go

steps:
- name: Checkout code
@@ -102,14 +110,20 @@ jobs:
go build -o /dev/null "$f" utils.go
done
else
go build ./...
go build -o /dev/null ./...
fi

validate-python:
name: Python
name: Python - ${{ matrix.sample }}
runs-on: ubuntu-latest
timeout-minutes: 10
continue-on-error: false
strategy:
fail-fast: false
matrix:
sample:
- vector-search-python
- select-algorithm-python

steps:
- name: Checkout code
@@ -121,19 +135,25 @@ jobs:
python-version: '3.11'

- name: Install dependencies
working-directory: ai/vector-search-python
working-directory: ai/${{ matrix.sample }}
run: pip install -r requirements.txt

- name: Validate Python syntax
working-directory: ai/vector-search-python
working-directory: ai/${{ matrix.sample }}
run: |
find . -name "*.py" -exec python -m py_compile {} +

validate-java:
name: Java
name: Java - ${{ matrix.sample }}
runs-on: ubuntu-latest
timeout-minutes: 10
continue-on-error: false
strategy:
fail-fast: false
matrix:
sample:
- vector-search-java
- select-algorithm-java

steps:
- name: Checkout code
@@ -147,5 +167,5 @@ jobs:
cache: 'maven'

- name: Compile Java
working-directory: ai/vector-search-java
working-directory: ai/${{ matrix.sample }}
run: mvn compile -DskipTests
41 changes: 41 additions & 0 deletions ai/includes/choosing-algorithm.md
@@ -0,0 +1,41 @@
### Choosing the right algorithm

> [!IMPORTANT]
> For production workloads, start with **DiskANN** on an M30+ cluster. DiskANN supports higher embedding dimensions, uses less cluster memory, and is less likely to require an index redesign as your models evolve.

Use this quick-reference table to select the right algorithm for your workload:

| Scenario | Algorithm | Cluster tier | Max dimensions |
|----------|-----------|--------------|----------------|
| Dev/test, demos, small datasets | **IVF** | Any (free tier OK) | 2,000 |
| Production (default) | **DiskANN** | M30+ | 16,000 |
| Production (max recall priority) | **HNSW** | M30+ | 8,000 |

**IVF** (inverted file index):
- Best for: Test environments, demos, and small clusters
- Pros: Fast to build, low resource requirements, works on any cluster tier
- Cons: Lower recall compared to graph-based algorithms at scale
- Tune: Increase `numLists` for larger datasets, increase `nProbes` for better recall

**DiskANN** (disk-based approximate nearest neighbor) — *recommended for production*:
- Best for: Production workloads on M30+ clusters
- Pros: Supports embeddings up to 16,000 dimensions, keeps most index data on disk freeing cluster memory for reads and writes, lighter index updates, easier backups, faster recovery
- Cons: Requires M30+ cluster tier
- Tune: Increase `maxDegree` and `lBuild` for better accuracy, increase `lSearch` for better recall
- Why default: As embedding models evolve (some already exceed 8,000 dimensions), DiskANN avoids costly index redesigns. Its disk-based architecture also means your cluster memory stays available for operational workloads rather than index storage.

**HNSW** (hierarchical navigable small world):
- Best for: Production workloads on M30+ clusters where maximum recall is the top priority
- Pros: Excellent recall, fast queries
- Cons: Requires M30+ cluster tier, supports embeddings up to 8,000 dimensions (vs 16,000 for DiskANN), higher memory usage since the full graph lives in RAM
- Tune: Increase `m` and `efConstruction` for better index quality, increase `efSearch` for better recall
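
The quick-reference table above can be read as a small decision procedure. The sketch below is one possible encoding of it, under the assumption that the only inputs are whether the workload is production, whether maximum recall is the top priority, and the embedding dimension; the `AlgorithmInfo` type and `choose_algorithm` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlgorithmInfo:
    name: str
    cluster_tier: str
    max_dimensions: int


# Values copied from the quick-reference table.
IVF = AlgorithmInfo("IVF", "Any", 2000)
DISKANN = AlgorithmInfo("DiskANN", "M30+", 16000)
HNSW = AlgorithmInfo("HNSW", "M30+", 8000)


def choose_algorithm(production, dimensions, max_recall_priority=False):
    """Pick an algorithm following the table; raise if the dimension limit is exceeded."""
    if not production:
        candidate = IVF
    elif max_recall_priority:
        candidate = HNSW
    else:
        candidate = DISKANN  # production default
    if dimensions > candidate.max_dimensions:
        raise ValueError(
            f"{candidate.name} supports at most {candidate.max_dimensions} "
            f"dimensions, got {dimensions}"
        )
    return candidate
```

Raising on a dimension overflow, rather than silently switching algorithms, keeps the caller's choice explicit.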

### Choosing the right similarity function

| Function | Score meaning | Best for |
|----------|-------------|----------|
| **COS (Cosine)** | Higher = more similar (0–1) | Text embeddings (normalized vectors) |
| **L2 (Euclidean)** | Lower = more similar (distance) | When magnitude matters |
| **IP (Inner Product)** | Higher = more similar | Equivalent to COS for normalized vectors |

For the `text-embedding-3-small` model used in this quickstart, **COS (cosine similarity) is recommended** because OpenAI embeddings are normalized and optimized for cosine similarity.
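
To see why COS and IP agree on normalized vectors, the following sketch computes all three functions on two small unit vectors. It uses only the standard library; the two-dimensional vectors are made-up illustration data, not real embeddings.

```python
import math


def dot(a, b):
    """Inner product (IP)."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    """Cosine similarity (COS)."""
    return dot(a, b) / (norm(a) * norm(b))


def l2(a, b):
    """Euclidean distance (L2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a):
    n = norm(a)
    return [x / n for x in a]


# Illustrative unit vectors (hypothetical data).
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

cos_score = cosine(a, b)
ip_score = dot(a, b)
l2_distance = l2(a, b)

# For unit vectors the norms are 1, so COS and IP coincide,
# and L2 is determined by them: l2**2 == 2 - 2 * cos.
print(cos_score, ip_score, l2_distance)
```

Because the three scores are tied together for unit vectors, they produce the same ranking on normalized embeddings; they diverge only when vector magnitudes differ.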
48 changes: 48 additions & 0 deletions ai/select-algorithm-dotnet/.devcontainer/devcontainer.json
@@ -0,0 +1,48 @@
{
"name": "Azure DocumentDB Select Algorithm - .NET 8",
"image": "mcr.microsoft.com/devcontainers/dotnet:1-8.0-bookworm",

"features": {
"ghcr.io/devcontainers/features/azure-cli:1": {},
"ghcr.io/devcontainers/features/github-cli:1": {},
"ghcr.io/devcontainers/features/common-utils:2": {
"installZsh": true,
"configureZshAsDefaultShell": true,
"installOhMyZsh": true
}
},

"customizations": {
"vscode": {
"extensions": [
"ms-dotnettools.csdevkit",
"ms-dotnettools.vscodeintellicode-csharp",
"ms-azuretools.vscode-azureresourcegroups",
"ms-azuretools.vscode-documentdb",
"mongodb.mongodb-vscode"
],
"settings": {
"dotnet.completion.showCompletionItemsFromUnimportedNamespaces": true,
"files.exclude": {
"**/bin": true,
"**/obj": true
}
}
}
},

"postCreateCommand": "dotnet restore && dotnet build",
"remoteUser": "vscode",

"containerEnv": {
"DOTNET_CLI_TELEMETRY_OPTOUT": "1",
"DOTNET_NOLOGO": "1"
},

"mounts": [
"source=${localEnv:HOME}${localEnv:USERPROFILE}/.azure,target=/home/vscode/.azure,type=bind,consistency=cached"
],

"capAdd": ["SYS_PTRACE"],
"securityOpt": ["seccomp:unconfined"]
}
7 changes: 7 additions & 0 deletions ai/select-algorithm-dotnet/.gitignore
@@ -0,0 +1,7 @@
bin/
obj/
.env

# Local data copy (user copies from ai/data/)
data/*.json
!data/README.md