A modular Spring Boot application for ingesting, processing, and searching PDF documents. Uploads are persisted securely, run through OCR and text-cleaning, enriched via an LLM adapter, and prepared for downstream indexing and search.
- pdf-api – REST edge for upload/download and folder-scanning hooks (see `DocumentController`).
- pdf-service – Orchestrates ingestion, OCR, and enrichment, and emits index events to search.
- pdf-database – JPA entities/repositories for users, documents, and per-page text storage.
- pdf-common – Shared DTOs, mappers, events, and utility interfaces used across modules.
- pdf-security – Spring Security configuration and models for authentication.
- pdf-llm – Adapters for LLM-backed enrichment of OCR text.
- pdf-search – Placeholder for Elasticsearch integration fed by index events.
- Upload – `DocumentServiceImpl.processUpload` validates the authenticated user and PDF, persists the file, and publishes an `OcrEvent` with the raw bytes for downstream processing.
- OCR – `OcrEventListener` invokes `DocumentOcrProcessorImpl` to extract page text, saves per-page results, and emits an `EnrichmentEvent` containing the OCR output.
- Enrichment – `DocumentEnrichmentProcessorImpl` cleans the first page, calls the enrichment service asynchronously, and persists title/date/tags back onto the document; a follow-up indexing step is triggered after enrichment completes.
- Download – `DocumentServiceImpl.downloadDocument` enforces ownership and streams the stored PDF bytes back to the caller.
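
The upload, OCR, and enrichment steps above form an event chain in which each stage publishes the event that triggers the next. The following is a minimal plain-Java sketch of that hand-off, not the project's actual code: `PipelineSketch`, `EventBus`, `runPipeline`, and the record fields are hypothetical, the in-memory bus stands in for whatever event mechanism the project uses, and everything runs synchronously here even though the real enrichment call is asynchronous.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Illustrative stand-in for the upload -> OCR -> enrichment chain.
// All types below are hypothetical simplifications of the project's classes.
final class PipelineSketch {

    // Simplified payloads; the real events may carry different fields.
    record OcrEvent(long documentId, byte[] pdfBytes) {}
    record EnrichmentEvent(long documentId, List<String> pageTexts) {}

    // Minimal synchronous event bus keyed by the event's class.
    static final class EventBus {
        private final Map<Class<?>, List<Consumer<Object>>> listeners = new HashMap<>();

        <T> void subscribe(Class<T> type, Consumer<T> listener) {
            listeners.computeIfAbsent(type, k -> new ArrayList<>())
                     .add(event -> listener.accept(type.cast(event)));
        }

        void publish(Object event) {
            for (Consumer<Object> l : listeners.getOrDefault(event.getClass(), List.of())) {
                l.accept(event);
            }
        }
    }

    // Wires the stages together and returns the order in which they ran.
    static List<String> runPipeline(long documentId, byte[] pdfBytes) {
        List<String> trace = new ArrayList<>();
        EventBus bus = new EventBus();

        // OCR stage: consumes the raw bytes and emits per-page text.
        bus.subscribe(OcrEvent.class, e -> {
            trace.add("ocr:" + e.documentId());
            List<String> pages = List.of("page 1 text"); // placeholder for real OCR output
            bus.publish(new EnrichmentEvent(e.documentId(), pages));
        });

        // Enrichment stage: would clean the first page and call the LLM adapter.
        bus.subscribe(EnrichmentEvent.class, e ->
            trace.add("enrich:" + e.documentId() + ":pages=" + e.pageTexts().size()));

        // Upload stage: after validation and persistence, publish the OCR event.
        trace.add("upload:" + documentId);
        bus.publish(new OcrEvent(documentId, pdfBytes));
        return trace;
    }

    public static void main(String[] args) {
        // prints [upload:42, ocr:42, enrich:42:pages=1]
        System.out.println(runPipeline(42L, new byte[] {1, 2, 3}));
    }
}
```

Keeping each stage behind its own event means a stage only needs to know the event it consumes and the one it emits, which is what lets OCR and enrichment run after the upload request has already returned.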
- Prerequisites – Java 21, Maven 3.9+, and Docker if you want optional infrastructure (databases/search) via `docker-compose-infra.yml`.
- Build – From the repo root, run `mvn clean package` to compile all modules. (If Maven Central access is restricted, configure a mirror or local cache.)
- Start services – Launch dependencies with `docker compose up -d` and then run pdf-api with `mvn spring-boot:run -pl pdf-api` to expose REST endpoints on port 8080.
- Upload & process – POST to `/api/documents/upload` with a PDF file; OCR and enrichment run asynchronously once the upload is accepted.
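
The upload step above can also be driven from Java using the JDK's built-in `java.net.http` types. The sketch below only builds the multipart request and rests on several assumptions: the form field name `file` is a guess at what the controller expects, `UploadRequestSketch` and its methods are hypothetical helpers, and authentication is omitted because the scheme is not described here.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that builds a multipart PDF upload request.
final class UploadRequestSketch {

    // Assembles a single-part multipart/form-data body around the PDF bytes.
    static byte[] multipartBody(String boundary, String fieldName, String fileName, byte[] pdf) {
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        String tail = "\r\n--" + boundary + "--\r\n";

        byte[] h = head.getBytes(StandardCharsets.UTF_8);
        byte[] t = tail.getBytes(StandardCharsets.UTF_8);
        byte[] body = new byte[h.length + pdf.length + t.length];
        System.arraycopy(h, 0, body, 0, h.length);
        System.arraycopy(pdf, 0, body, h.length, pdf.length);
        System.arraycopy(t, 0, body, h.length + pdf.length, t.length);
        return body;
    }

    // Builds a POST to /api/documents/upload; "file" is an assumed field name.
    static HttpRequest uploadRequest(String baseUrl, String fileName, byte[] pdf) {
        String boundary = "----pdf-upload-" + System.nanoTime();
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/documents/upload"))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(
                        multipartBody(boundary, "file", fileName, pdf)))
                .build();
    }
}
```

With pdf-api running locally, the request could be sent with `HttpClient.newHttpClient().send(UploadRequestSketch.uploadRequest("http://localhost:8080", "sample.pdf", bytes), HttpResponse.BodyHandlers.ofString())`. Because processing is asynchronous, a successful response means only that the upload was accepted, not that OCR or enrichment has finished.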
The project follows a layered testing strategy: fast Mockito unit tests for processors/adapters, Spring @DataJpaTest slices for persistence and event publication, and targeted container-based integration tests for search. See TESTING_STRATEGY.md for the detailed testing backlog and conventions.