Quick reference for AI assistants working on the Fess Crawler project.
Fess Crawler is a Java-based web crawling framework for enterprise content extraction.
- Language: Java 21+
- Build: Maven 3.x
- License: Apache 2.0
- DI: LastaFlute DI
- Repo: https://github.com/codelibs/fess-crawler
- HTTP: Apache HttpComponents 4.5+ and 5.x (switchable)
- Extraction: Apache Tika, POI, PDFBox
- Testing: JUnit 4, UTFlute, Mockito, Testcontainers
- Storage: In-memory (default), OpenSearch (optional)
- Cloud: AWS SDK v2 (S3), Google Cloud Storage
- Protocols: HTTP/HTTPS, File, FTP/FTPS, SMB/CIFS (SMB1/SMB2+), Storage (MinIO via storage://), S3 (s3://), GCS (gcs://)
- Formats: Office (Word, Excel, PowerPoint), PDF, Archives (ZIP, TAR, GZ, LHA), HTML, XML, JSON, Markdown, Media metadata, Images (EXIF/IPTC/XMP), Email (EML)
fess-crawler-parent/
├── fess-crawler/ # Core framework
├── fess-crawler-lasta/ # LastaFlute DI integration
└── fess-crawler-opensearch/ # OpenSearch backend
- Factory: `CrawlerClientFactory`, `ExtractorFactory` - protocol/format-specific component selection
- Strategy: `CrawlerClient`, `Extractor`, `Transformer` - pluggable implementations
- Builder: `RequestDataBuilder`, `ExtractorBuilder` - fluent construction
- Template Method: `AbstractCrawlerClient`, `AbstractExtractor` - common logic with overrides
- DI: LastaFlute container with `@Resource` and XML config
Thread Safety: AtomicLong for counters, volatile for status flags, synchronized blocks, thread-local storage via CrawlingParameterUtil
Resource Management: AutoCloseable throughout, DeferredFileOutputStream for large responses, connection pooling, background temp file deletion via FileUtil.deleteInBackground()
Fault Tolerance: FaultTolerantClient wrapper (retry, circuit breaker), SwitchableHttpClient for HTTP client fallback
- Crawler (`Crawler.java`): Main orchestrator - `execute()`, `addUrl()`, `cleanup()`, `stop()` (usage sketch after this list)
- CrawlerContext (`CrawlerContext.java`): Execution context - `sessionId`, `status`, `accessCount`, `numOfThread`, `maxDepth`, `maxAccessCount`
- CrawlerThread (`CrawlerThread.java`): Worker thread - Poll URL → Validate → Execute → Process → Queue children
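A minimal usage sketch of the orchestrator; the container lookup name (`crawler`) and the `CrawlerContext` setter names are assumptions for illustration, not verified API:

```java
// Sketch only: "crawler" component name and CrawlerContext setters are assumed.
Crawler crawler = crawlerContainer.getComponent("crawler");

crawler.addUrl("https://example.com/");              // seed URL
crawler.getCrawlerContext().setMaxAccessCount(100);  // stop after 100 pages
crawler.getCrawlerContext().setNumOfThread(5);       // worker threads
crawler.getCrawlerContext().setMaxDepth(3);          // link depth limit

final String sessionId = crawler.execute();          // run the crawl, returns the session id
try {
    // read results for this sessionId via DataService ...
} finally {
    crawler.cleanup(sessionId);                      // drop queue/data for the session
}
```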
SwitchableHttpClient (extends FaultTolerantClient)
├── Hc5HttpClient (default) - Apache HttpComponents 5.x
└── Hc4HttpClient (fallback) - Apache HttpComponents 4.x
HcHttpClient (abstract base class)
├── Hc4HttpClient
└── Hc5HttpClient
Switch via system property: -Dfess.crawler.http.client=hc4 or hc5 (default)
Key Properties: connectionTimeout, soTimeout, proxyHost, proxyPort, userAgent, robotsTxtEnabled, ignoreSslCertificate, maxTotalConnection, defaultMaxConnectionPerRoute
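A sketch of passing these properties programmatically; `setInitParameterMap(...)` as the entry point is an assumption, and only the keys listed above are used:

```java
// Sketch only: setInitParameterMap(...) and value types are assumed, keys follow the list above.
final Map<String, Object> params = new HashMap<>();
params.put("connectionTimeout", 15000);   // ms
params.put("soTimeout", 30000);           // ms
params.put("userAgent", "MyCrawler/1.0");
params.put("robotsTxtEnabled", true);
params.put("ignoreSslCertificate", false);
httpClient.setInitParameterMap(params);   // read when the client initializes
```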
Pattern-based client selection (from crawler/client.xml):
- `http:.*`, `https:.*` → SwitchableHttpClient
- `file:.*` → FileSystemClient
- `smb:.*` → SmbClient (SMB2+), `smb1:.*` → SmbClient (SMB1)
- `ftp:.*`, `ftps:.*` → FtpClient
- `storage:.*` → StorageClient, `s3:.*` → S3Client, `gcs:.*` → GcsClient
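The normal path is `crawler/client.xml`; the sketch below shows the same pattern-to-client mapping done in code, where the `addClient`/`getClient` signatures are assumptions based on the factory described above:

```java
// Sketch only: addClient/getClient signatures assumed; client.xml is the usual configuration path.
final CrawlerClientFactory clientFactory = new CrawlerClientFactory();
clientFactory.addClient("http:.*", httpClient);
clientFactory.addClient("https:.*", httpClient);
clientFactory.addClient("file:.*", fileSystemClient);

final CrawlerClient client = clientFactory.getClient("https://example.com/docs/");
```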
- S3Client: AWS SDK v2, `s3://bucket/path`, properties: `endpoint`, `accessKey`, `secretKey`, `region`
- GcsClient: Google Cloud SDK, `gcs://bucket/path`, properties: `projectId`, `credentialsFile`, `endpoint`
- StorageClient: MinIO SDK, `storage://bucket/path`
- UrlQueueService: URL queue management (FIFO), duplicate detection
- DataService: Access result persistence, iteration
- Implementations: `UrlQueueServiceImpl`, `DataServiceImpl` (in-memory), `OpenSearchDataService` (persistent)
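A rough sketch of the two services in use; method names are assumptions inferred from the roles above, and generic parameters are omitted:

```java
// Sketch only: add/poll/iterate names and result accessors are assumed.
urlQueueService.add(sessionId, "https://example.com/");    // enqueue; duplicates are filtered
final UrlQueue<?> next = urlQueueService.poll(sessionId);  // FIFO poll by a worker thread

dataService.iterate(sessionId, accessResult ->             // iterate persisted access results
        logger.info("url={}, status={}", accessResult.getUrl(), accessResult.getHttpStatusCode()));
```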
CrawlerThread → Client → ResponseProcessor → Transformer → Extractor → ExtractData
      ↑                                                                     │
      ├── UrlQueueService ←──────────────────────────────────────────────────┤
      └── DataService ←──────────────────────────────────────────────────────┘
- Rule: Pattern-based response routing (`RegexRule`, `SitemapsRule`)
- ResponseProcessor: `DefaultResponseProcessor`, `SitemapsResponseProcessor`, `NullResponseProcessor`
- Transformer: `HtmlTransformer`, `XmlTransformer`, `FileTransformer`, etc.
- Extractor: Weight-based selection (tries in descending weight order; usage sketch after this list):
  `TikaExtractor`, `PdfExtractor`, `MsWordExtractor`, `MsExcelExtractor`, `MsPowerPointExtractor`, `ZipExtractor`, `HtmlExtractor`, `MarkdownExtractor`, `EmlExtractor`
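A sketch of resolving an extractor by MIME type and pulling text; the `getExtractor`/`getText` signatures and the `resourceName` hint are assumptions:

```java
// Sketch only: factory lookup by MIME type and getText(InputStream, Map) are assumed.
final Extractor extractor = extractorFactory.getExtractor("application/pdf");
try (InputStream in = Files.newInputStream(Path.of("sample.pdf"))) {
    final ExtractData extractData = extractor.getText(in, Map.of("resourceName", "sample.pdf"));
    final String text = extractData.getContent();   // extracted plain text
}
```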
- RobotsTxtHelper: RFC 9309 parsing, user-agent matching, crawl-delay, sitemaps
- SitemapsHelper: Sitemap XML parsing, index handling
- MimeTypeHelper: MIME detection via Tika
- EncodingHelper: Charset detection with BOM
- UrlConvertHelper: URL normalization
- ContentLengthHelper: Content length limits per MIME type
mvn clean install # Build all
mvn clean install -DskipTests # Skip tests
mvn test # Run tests
mvn formatter:format # Format code
mvn license:format # Update license headers
- 4 spaces (no tabs), opening brace on same line, max line length 120
- JavaDoc required for public APIs
- License headers required (Apache 2.0)
- Structure: `src/test/java/org/codelibs/fess/crawler/`
- Base class: Extend `PlainTestCase` from UTFlute
- Test Resources: `src/test/resources/`
- Coverage Goal: >80% line coverage
- Fork repo, create feature branch
- Make focused commits with tests
- Format code (`mvn formatter:format && mvn license:format`)
- Run tests (`mvn test`)
- Open Pull Request
Core: fess-crawler/src/main/java/org/codelibs/fess/crawler/
- `Crawler.java`, `CrawlerContext.java`, `CrawlerThread.java`
Clients: fess-crawler/src/main/java/org/codelibs/fess/crawler/client/
- `http/` - `HcHttpClient.java`, `Hc4HttpClient.java`, `Hc5HttpClient.java`, `SwitchableHttpClient.java`
- `fs/FileSystemClient.java`, `ftp/FtpClient.java`
- `smb/SmbClient.java`, `smb1/SmbClient.java`
- `storage/StorageClient.java`, `s3/S3Client.java`, `gcs/GcsClient.java`
DI Config: fess-crawler-lasta/src/main/resources/
- `crawler.xml` (root), `crawler/client.xml`, `crawler/extractor.xml`, `crawler/rule.xml`, `crawler/transformer.xml`, `crawler/transformer_basic.xml`
- `crawler/mimetype.xml`, `crawler/encoding.xml`, `crawler/robotstxt.xml`, `crawler/sitemaps.xml`
- `crawler/filter.xml`, `crawler/interval.xml`, `crawler/contentlength.xml`, `crawler/urlconverter.xml`, `crawler/container.xml`, `crawler/log.xml`
All exceptions are unchecked (they extend RuntimeException via CrawlerSystemException); a handling sketch follows the hierarchy below.
CrawlerSystemException (RuntimeException)
├─ CrawlingAccessException
│ └─ MaxLengthExceededException
├─ ChildUrlsException
├─ ExtractException
│ └─ UnsupportedExtractException
├─ CrawlerLoginFailureException
├─ ExecutionTimeoutException
├─ MimeTypeException
├─ MultipleCrawlingAccessException
├─ RobotsTxtException
└─ SitemapsException
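A sketch of catching the hierarchy around a client call; `getChildUrlList()` is an assumption about how ChildUrlsException exposes the discovered child URLs:

```java
// Sketch only: variable setup omitted; getChildUrlList() is assumed.
try (ResponseData responseData = client.execute(requestData)) {
    // ... process responseData ...
} catch (final ChildUrlsException e) {
    e.getChildUrlList().forEach(child -> { /* enqueue discovered child URLs */ });
} catch (final CrawlingAccessException e) {
    // access-level failure (includes MaxLengthExceededException); log and continue
} catch (final CrawlerSystemException e) {
    // framework-level failure; usually abort the session
}
```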
Use CrawlingParameterUtil to set/get the CrawlerContext and UrlQueue in worker threads. Always clear them in a finally block with CrawlingParameterUtil.clearAll().
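A sketch of that worker-thread pattern; the setter names are assumptions, while `clearAll()` comes from the guideline itself:

```java
// Sketch only: setCrawlerContext/setUrlQueue names are assumed.
CrawlingParameterUtil.setCrawlerContext(crawlerContext);
CrawlingParameterUtil.setUrlQueue(urlQueue);
try {
    // ... crawl one URL; downstream components read the thread-local values ...
} finally {
    CrawlingParameterUtil.clearAll();   // never leak thread-local state between URLs
}
```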
Always use try-with-resources for ResponseData - temp files are auto-deleted on close.
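A sketch of that pattern; the `RequestDataBuilder` chain and `getResponseBody()` accessor are assumptions, and closing ResponseData is what releases the body and its backing temp file:

```java
// Sketch only: builder chain and body accessor assumed.
final RequestData requestData = RequestDataBuilder.newRequestData()
        .get()
        .url("https://example.com/")
        .build();
try (ResponseData responseData = client.execute(requestData);
        InputStream body = responseData.getResponseBody()) {
    // ... read the body; temp file is deleted when responseData closes ...
}
```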
- Format parameters as `key=value` (e.g., `sessionId={}`, `url={}`)
- Prefix with `[name]` when context identification is needed
- Use full words, not abbreviations
- Log only identifying fields, not entire objects
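A sketch of these conventions together, assuming a `{}`-placeholder logger (SLF4J/Log4j2 style):

```java
// Sketch only: logger/field setup omitted; shows [name] prefix and key=value placeholders.
if (logger.isDebugEnabled()) {
    logger.debug("[CrawlerThread] Processing url. sessionId={}, url={}, depth={}",
            sessionId, url, depth);
}
```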