improvement(kb): optimize processes, add more robust fallbacks for large file ops #2684
Merged
waleedlatif1 merged 9 commits into staging on Jan 6, 2026
Contributor
Greptile Summary
This PR optimizes knowledge base document processing with a focus on handling large files more robustly. Key improvements include:
The changes address concerns raised in earlier review threads about transaction safety and improve overall robustness for large document operations.
Confidence Score: 4/5
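To make the transaction-safety point concrete, here is a minimal sketch of the pattern the sequence diagram describes: delete the old embeddings, insert the new ones in batches, and flip the document status, all inside one transaction. The `Tx` interface, `RunInTransaction` type, `replaceEmbeddings` name, and the default batch size are hypothetical and chosen for illustration; they are not taken from the PR's code or from any particular ORM.

```typescript
// Hypothetical row shape; the real schema is not shown in this PR page.
interface EmbeddingRow {
  documentId: string;
  chunkIndex: number;
  vector: number[];
}

// Hypothetical transaction handle exposing only the writes the diagram lists.
interface Tx {
  deleteEmbeddings(documentId: string): Promise<void>;
  insertEmbeddings(rows: EmbeddingRow[]): Promise<void>;
  setDocumentStatus(documentId: string, status: string): Promise<void>;
}

// Assumed contract: commits if `fn` resolves, rolls back if it throws.
type RunInTransaction = <T>(fn: (tx: Tx) => Promise<T>) => Promise<T>;

async function replaceEmbeddings(
  runInTransaction: RunInTransaction,
  documentId: string,
  rows: EmbeddingRow[],
  batchSize = 500 // assumed value, not from the PR
): Promise<void> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  await runInTransaction(async (tx) => {
    // Remove stale embeddings first so a re-run never leaves duplicates.
    await tx.deleteEmbeddings(documentId);
    // Insert in batches to keep each statement a manageable size.
    for (let i = 0; i < rows.length; i += batchSize) {
      await tx.insertEmbeddings(rows.slice(i, i + batchSize));
    }
    // The status update sits inside the same transaction, so the document
    // is never marked 'completed' unless every batch was written.
    await tx.setDocumentStatus(documentId, 'completed');
  });
}
```

Keeping the status update inside the transaction means a failure in any insert batch leaves both the previous embeddings and the previous status untouched, provided the injected runner rolls back on error.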
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Client as Frontend (base.tsx)
participant API as Document Service
participant Processor as Document Processor
participant OCR as Mistral OCR API
participant Parser as File Parser (DOC/DOCX)
participant S3 as S3 Storage
participant DB as Database
Client->>API: processDocumentAsync(documentId)
activate API
API->>DB: Update status to 'processing'
API->>Processor: processDocument(fileUrl, mimeType)
activate Processor
alt PDF with OCR enabled
Processor->>Processor: getPdfPageCount(buffer)
alt Page count > 1000
Processor->>Processor: splitPdfIntoChunks(buffer, 1000)
loop For each chunk batch (MAX_CONCURRENT_CHUNKS)
Processor->>S3: Upload chunk PDF
S3-->>Processor: presigned URL
Processor->>OCR: Process chunk via Mistral OCR
OCR-->>Processor: Extracted text
Processor->>S3: Delete chunk PDF
end
Processor->>Processor: Combine all chunk results
else Page count <= 1000
Processor->>S3: Upload full PDF
S3-->>Processor: presigned URL
Processor->>OCR: Process full PDF
OCR-->>Processor: Extracted text
end
else DOC/DOCX file
alt Primary parser (mammoth/officeparser)
Processor->>Parser: Parse with primary parser
Parser-->>Processor: Extracted text
else Primary fails
Processor->>Parser: Fallback to secondary parser
Parser-->>Processor: Extracted text
end
end
Processor->>Processor: Chunk content (TextChunker)
Processor-->>API: {chunks, metadata}
deactivate Processor
API->>API: generateEmbeddings(chunks) in batches
API->>DB: BEGIN TRANSACTION
activate DB
API->>DB: DELETE embeddings for documentId
loop For each batch of embeddings
API->>DB: INSERT embedding batch
end
API->>DB: UPDATE document status to 'completed'
DB-->>API: COMMIT TRANSACTION
deactivate DB
API-->>Client: Success
deactivate API
Note over Client: React Query refetchInterval (3s)<br/>polls while processing=true
Client->>API: Fetch documents (auto-refresh)
API-->>Client: Updated document status
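The large-PDF branch of the diagram (split at 1000 pages, process chunks in batches of `MAX_CONCURRENT_CHUNKS`, clean up each uploaded chunk) can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the `ChunkIo` interface, `planPageRanges`, `ocrInBatches`, the concurrency value of 3, and the blank-line separator used to combine results are all hypothetical, and the storage and OCR calls are injected rather than tied to any real SDK.

```typescript
const MAX_PAGES_PER_CHUNK = 1000; // threshold shown in the diagram
const MAX_CONCURRENT_CHUNKS = 3; // name from the diagram; value assumed

// Zero-based page range, end exclusive.
interface PageRange {
  start: number;
  end: number;
}

// Hypothetical I/O boundary standing in for storage and OCR clients.
interface ChunkIo {
  upload(range: PageRange): Promise<string>; // returns a URL for the chunk
  ocr(url: string): Promise<string>; // returns extracted text
  remove(url: string): Promise<void>; // deletes the uploaded chunk
}

function planPageRanges(pageCount: number, pagesPerChunk: number): PageRange[] {
  if (!Number.isInteger(pagesPerChunk) || pagesPerChunk < 1) {
    throw new RangeError('pagesPerChunk must be a positive integer');
  }
  const ranges: PageRange[] = [];
  for (let start = 0; start < pageCount; start += pagesPerChunk) {
    ranges.push({ start, end: Math.min(start + pagesPerChunk, pageCount) });
  }
  return ranges;
}

async function ocrInBatches(
  pageCount: number,
  io: ChunkIo,
  pagesPerChunk = MAX_PAGES_PER_CHUNK,
  concurrency = MAX_CONCURRENT_CHUNKS
): Promise<string> {
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new RangeError('concurrency must be a positive integer');
  }
  const ranges = planPageRanges(pageCount, pagesPerChunk);
  const texts: string[] = [];
  // Process one batch at a time so at most `concurrency` chunks are in flight.
  for (let i = 0; i < ranges.length; i += concurrency) {
    const batch = ranges.slice(i, i + concurrency);
    const results = await Promise.all(
      batch.map(async (range) => {
        const url = await io.upload(range);
        try {
          return await io.ocr(url);
        } finally {
          // Delete the temporary chunk even when OCR fails.
          await io.remove(url);
        }
      })
    );
    // Promise.all preserves input order, so page order is kept.
    texts.push(...results);
  }
  return texts.join('\n\n');
}
```

One simplification to note: a document at or under the threshold yields a single range here and goes through the same upload, OCR, and delete path, whereas the diagram shows the full-PDF branch without a delete step.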
Collaborator, Author: @greptile
Collaborator, Author: @greptile
c62a575 to c065eb7
Collaborator, Author: @greptile
Collaborator, Author: @greptile
waleedlatif1 added a commit that referenced this pull request Jan 8, 2026
…rge file ops (#2684)
* improvement(kb): optimize processes, add more robust fallbacks for large file ops
* stronger typing
* comments cleanup
* ack PR comments
* upgraded turborepo
* ack more PR comments
* fix failing test
* moved doc update inside tx for embeddings chunks upload
* ack more PR comments
Summary
Type of Change
Testing
Tested manually
Checklist