100% Local RAG pipeline for document ingestion. Your data never leaves your machine.
This project is built on top of Microsoft.Extensions.DataIngestion, the new official .NET library from Microsoft that provides a standardized, extensible pipeline for data ingestion in RAG (Retrieval-Augmented Generation) scenarios.
Microsoft.Extensions.DataIngestion offers out-of-the-box abstractions for the complete ingestion workflow:
- Document loading: read documents from multiple sources via pluggable readers (including the MarkItDown MCP integration).
- Semantic chunking: split documents into meaningful chunks using token-aware strategies.
- Enrichment: augment chunks with metadata such as AI-generated summaries.
- Embedding generation: produce vector embeddings through any compatible provider (e.g., Ollama, OpenAI).
- Vector storage: persist embeddings into vector stores like SQLite + sqlite-vec via Semantic Kernel connectors.
By leveraging this library, DataIngest avoids reinventing the wheel and focuses on composing a fully local, privacy-first pipeline where every step, from parsing to search, runs on your machine.
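To illustrate how those abstractions compose, here is a minimal sketch. Every interface and method name below is an illustrative stand-in, not the actual Microsoft.Extensions.DataIngestion API; consult the library's documentation for the real types.

```csharp
// Illustrative stand-ins for the abstractions described above -- NOT the real
// Microsoft.Extensions.DataIngestion types.
public interface IDocumentReader { IAsyncEnumerable<string> ReadAsync(string path); }
public interface IChunker        { IEnumerable<string> Chunk(string document); }
public interface IEnricher       { Task<string> SummarizeAsync(string chunk); }
public interface IEmbedder       { Task<float[]> EmbedAsync(string text); }
public interface IVectorWriter   { Task WriteAsync(string chunk, string summary, float[] vector); }

// The pipeline is just the five steps composed in order.
public sealed class IngestionPipeline(
    IDocumentReader reader, IChunker chunker, IEnricher enricher,
    IEmbedder embedder, IVectorWriter writer)
{
    public async Task RunAsync(string path)
    {
        await foreach (var document in reader.ReadAsync(path))      // 1. load
        foreach (var chunk in chunker.Chunk(document))              // 2. chunk
        {
            var summary = await enricher.SummarizeAsync(chunk);     // 3. enrich
            var vector  = await embedder.EmbedAsync(chunk);         // 4. embed
            await writer.WriteAsync(chunk, summary, vector);        // 5. store
        }
    }
}
```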
Transform your documents into a searchable semantic knowledge base using Ollama for AI processing, SQLite for vector storage, and MarkItDown MCP for document conversion.
| Feature | Description |
|---|---|
| 🔒 Privacy First | All processing runs locally with Ollama |
| 🧠 Semantic Chunking | Intelligent document splitting based on meaning |
| 📝 Auto-summarization | AI-generated summaries for each chunk |
| 🔍 Vector Search | Fast semantic search with SQLite + sqlite-vec |
| 💻 Interactive CLI | Real-time search with relevance visualization (sketched below) |
| 🏗️ Clean Architecture | SOLID principles throughout |
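The relevance visualization in the interactive CLI can be as simple as mapping each result's similarity score to a text bar. A self-contained sketch (the [0, 1] score scale is an assumption; raw sqlite-vec distances would need converting first):

```csharp
// Renders a similarity score in [0, 1] as a fixed-width bar, e.g. "[#######---] 0.71".
static string RelevanceBar(double score, int width = 10)
{
    int filled = (int)Math.Round(Math.Clamp(score, 0, 1) * width);
    return $"[{new string('#', filled)}{new string('-', width - filled)}] {score:F2}";
}
```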
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    .md Files    │────▶│  MarkItDown MCP │────▶│    Semantic     │
│    (./data)     │     │  (Docker:3001)  │     │     Chunker     │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Semantic     │◀────│  SQLite Vector  │◀────│     Ollama      │
│     Search      │     │      Store      │     │   Embeddings    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
| Dependency | Version | Purpose |
|---|---|---|
| .NET SDK | 10.0+ | Runtime |
| Ollama | Latest | Local LLM inference |
| Docker | Latest | MarkItDown MCP server |
Pull the required models:

```bash
ollama pull qwen3:1.7b         # Chat & summarization (structured output support)
ollama pull nomic-embed-text   # Embeddings (768 dimensions)
```

Then start the services and run the pipeline:

```bash
# 1. Start Ollama
ollama serve
# 2. Start MarkItDown MCP Server
docker run -p 3001:3001 mcp/markitdown --http --host 0.0.0.0 --port 3001
# 3. Add your documents to ./data/
# 4. Run
dotnet run
```

**Alternative: MarkItDown via pip**

```bash
pip install markitdown-mcp-server
markitdown-mcp --http --host 0.0.0.0 --port 3001
```

Key NuGet package references:

```xml
<!-- Core pipeline -->
<PackageReference Include="Microsoft.Extensions.DataIngestion" Version="10.0.1-preview.1.25571.5" />
<PackageReference Include="Microsoft.Extensions.DataIngestion.MarkItDown" Version="10.0.1-preview.1.25571.5" />
<!-- Vector storage -->
<PackageReference Include="Microsoft.SemanticKernel.Connectors.SqliteVec" Version="1.67.1-preview" />
<!-- LLM client -->
<PackageReference Include="OllamaSharp" Version="5.4.16" />
<!-- Tokenization -->
<PackageReference Include="Microsoft.ML.Tokenizers.Data.Cl100kBase" Version="2.0.0" />dataingest/
βββ src/
β βββ Program.cs # Entry point & orchestration
β βββ Configuration/
β β βββ PipelineConfig.cs # Centralized settings
β βββ Services/
β β βββ PipelineFactory.cs # Component factory (DI)
β βββ UI/
β βββ ConsoleUI.cs # Console interactions
βββ data/ # Input documents (.md files)
βββ dataingest.csproj
βββ README.md
| Component | Principle | Responsibility |
|---|---|---|
| `PipelineConfig` | SRP | Centralized configuration |
| `ConsoleUI` | SRP | User interface / console output |
| `PipelineFactory` | DIP, OCP | Component creation, dependency decoupling |
| `Program` | Composition Root | Orchestration only |
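A hypothetical sketch of the wiring this table describes; the method names (`CreatePipeline`, `CreateSearch`, `RunSearchLoopAsync`) are illustrative, not the project's actual API:

```csharp
// Composition root: Program only wires components together and delegates.
var config  = new PipelineConfig();          // SRP: all settings in one place
var factory = new PipelineFactory(config);   // DIP/OCP: builds components behind abstractions
var ui      = new ConsoleUI();               // SRP: owns console I/O

var pipeline = factory.CreatePipeline();     // hypothetical factory method
await pipeline.RunAsync("./data");           // ingest documents
await ui.RunSearchLoopAsync(factory.CreateSearch()); // hypothetical search loop
```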
All settings live in `src/Configuration/PipelineConfig.cs`:

```csharp
public record PipelineConfig
{
public string OllamaEndpoint { get; init; } = "http://localhost:11434";
public string ChatModel { get; init; } = "qwen3:1.7b";
public string EmbeddingModel { get; init; } = "nomic-embed-text";
public int EmbeddingDimensions { get; init; } = 768;
public int MaxTokensPerChunk { get; init; } = 2000;
public int OverlapTokens { get; init; } = 200;
public TimeSpan HttpTimeout { get; init; } = TimeSpan.FromMinutes(5);
public int TopResults { get; init; } = 5;
}
```

The pipeline then runs these steps in order:

- Clean: Any existing `vectors.db` is deleted automatically
- Read: Documents are loaded via the MarkItDown MCP server
- Chunk: Semantic splitting using embedding similarity (see the sketch below)
- Enrich: Auto-generate summaries with the LLM
- Store: Embeddings saved to SQLite with sqlite-vec
- Search: Interactive semantic search loop
Each run performs a fresh ingestion to ensure data consistency.
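To make the Chunk step concrete, here is a minimal sketch of embedding-similarity splitting: start a new chunk wherever adjacent sentence embeddings diverge. The `embed` delegate is a placeholder (e.g. backed by nomic-embed-text) and 0.75 is an arbitrary threshold; the real chunker is also token-aware and adds overlap.

```csharp
static class SemanticChunker
{
    // `embed` stands in for a real embedding call; 0.75 is an arbitrary
    // topic-shift threshold chosen for the sketch.
    public static List<string> Chunk(
        IReadOnlyList<string> sentences, Func<string, float[]> embed,
        double minSimilarity = 0.75)
    {
        var chunks = new List<string>();
        var current = new List<string>();
        float[]? prev = null;

        foreach (var sentence in sentences)
        {
            var vec = embed(sentence);
            // Low cosine similarity to the previous sentence = topic shift.
            if (prev is not null && Cosine(prev, vec) < minSimilarity && current.Count > 0)
            {
                chunks.Add(string.Join(" ", current));
                current.Clear();
            }
            current.Add(sentence);
            prev = vec;
        }
        if (current.Count > 0) chunks.Add(string.Join(" ", current));
        return chunks;
    }

    static double Cosine(float[] a, float[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb) + 1e-12);
    }
}
```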
| Issue | Solution |
|---|---|
| Timeout errors | Increase `HttpTimeout` in the config |
| MarkItDown connection refused | Check Docker: `docker ps \| grep markitdown` |
| First query slow | Normal; the model is loading into memory |
- SummaryEnricher batch size: `BatchSize` is set to 1 to ensure Ollama returns the correct number of summaries per chunk
- Cold start latency: The first embedding takes longer while the model loads
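The cold start above is simply the first call into the embedding model. For reference, generating an embedding through OllamaSharp's Microsoft.Extensions.AI surface looks roughly like this; treat it as a sketch against OllamaSharp 5.x, as exact shapes vary between versions:

```csharp
using Microsoft.Extensions.AI;
using OllamaSharp;

// OllamaApiClient implements IEmbeddingGenerator<string, Embedding<float>>.
// Endpoint and model name match the PipelineConfig defaults shown earlier.
IEmbeddingGenerator<string, Embedding<float>> generator =
    new OllamaApiClient(new Uri("http://localhost:11434"), "nomic-embed-text");

// The first call is the slow one: Ollama loads the model into memory.
var embeddings = await generator.GenerateAsync(new[] { "What does the pipeline do?" });
ReadOnlyMemory<float> vector = embeddings[0].Vector; // 768 dimensions for nomic-embed-text
Console.WriteLine($"Vector length: {vector.Length}");
```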
| Technology | Purpose |
|---|---|
| .NET 10 | Runtime & framework |
| Ollama | Local LLM inference |
| Semantic Kernel | AI orchestration |
| sqlite-vec | Vector search |
| MarkItDown | Document conversion |
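To show what the sqlite-vec layer boils down to, here is a sketch of raw vec0 usage via Microsoft.Data.Sqlite. The Semantic Kernel connector handles all of this itself, so the table and column names here are purely illustrative:

```csharp
using Microsoft.Data.Sqlite;

using var db = new SqliteConnection("Data Source=vectors.db");
db.Open();
db.LoadExtension("vec0"); // the sqlite-vec native library must be on the load path

using (var create = db.CreateCommand())
{
    create.CommandText =
        "CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING vec0(embedding float[768])";
    create.ExecuteNonQuery();
}

float[] queryVector = new float[768]; // stand-in for a real query embedding

using var knn = db.CreateCommand();
knn.CommandText = """
    SELECT rowid, distance FROM chunks
    WHERE embedding MATCH $q ORDER BY distance LIMIT 5
    """;
knn.Parameters.AddWithValue("$q", ToBlob(queryVector));
using var results = knn.ExecuteReader();
while (results.Read())
    Console.WriteLine($"chunk {results.GetInt64(0)}: distance {results.GetDouble(1)}");

// sqlite-vec expects vectors as little-endian float32 BLOBs.
static byte[] ToBlob(float[] v)
{
    var bytes = new byte[v.Length * sizeof(float)];
    Buffer.BlockCopy(v, 0, bytes, 0, bytes.Length);
    return bytes;
}
```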
MIT - See LICENSE file for details.
Built with ❤️ for local-first AI


