---
title: Files
sidebar_position: 6
---
import ApiSchema from '@theme/ApiSchema';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
In Open Register, Files are binary data attachments that can be associated with objects. They extend the system beyond structured data to include documents, images, videos, and other file types that are essential for many applications.
Files in Open Register are:
- Securely stored and managed
- Associated with specific objects
- Versioned alongside their parent objects
- Accessible through a consistent API
- Integrated with Nextcloud's file management capabilities
Files can be attached to objects in several ways:
- Integrated Uploads: Files can be uploaded directly within object POST/PUT operations using multipart/form-data, base64-encoded content, or URL references
- Schema-defined file properties: When a schema includes properties of type 'file', these are automatically handled during object creation or updates
- Direct API attachment: Files can be added to an object after creation using the file attachment API endpoints
- Base64 encoded content: Files can be included in object data as base64-encoded strings
- URL references: External files can be referenced by URL and will be downloaded and stored locally
OpenRegister supports integrated file uploads directly within object POST/PUT operations, providing a unified approach to handling structured data (objects) and unstructured data (files) together.
Use Case: Uploading files from web forms or file inputs
Example:
```http
POST /index.php/apps/openregister/api/registers/documents/schemas/document/objects
Content-Type: multipart/form-data

title=Annual Report 2024
attachment=@report.pdf
thumbnail=@cover.jpg
```

JavaScript Example:
```javascript
const formData = new FormData();
formData.append('title', 'Annual Report 2024');
formData.append('attachment', fileInput.files[0]);
formData.append('thumbnail', thumbnailInput.files[0]);

fetch('/index.php/apps/openregister/api/registers/documents/schemas/document/objects', {
  method: 'POST',
  body: formData,
  headers: {
    'Authorization': 'Bearer YOUR_TOKEN'
  }
})
  .then(response => response.json())
  .then(data => console.log('Created:', data));
```

Why this is recommended:
- ✅ Most efficient: No encoding overhead, files transferred directly
- ✅ Preserves metadata: Original filename and MIME type are maintained
- ✅ No guessing: Extension and filename are exactly as uploaded
- ✅ Best file quality: No conversion or inference errors
- ✅ Low memory footprint: Can stream directly from disk to disk
- ✅ Fastest method: Direct transfer without intermediate conversions
Use Case: Embedding files in JSON payloads, API integrations
Data URI Format:
```json
{
  "title": "Screenshot",
  "image": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA..."
}
```

Plain Base64 Format:

```json
{
  "title": "Document",
  "attachment": "JVBERi0xLjQKJeLjz9MKMyAwIG9iago8PC9MZW5ndGggMj..."
}
```

Note: Base64 encoding increases file size by approximately 33% and original filenames are lost. Use only for small files (< 100 KB) or when multipart is not possible.
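The ~33% overhead follows directly from how base64 works: every 3 input bytes become 4 output characters. A small Node.js sketch (the helper name and dummy content are illustrative) makes this concrete:

```javascript
// Build a data URI from raw file bytes and compare sizes.
// Base64 maps every 3 input bytes to 4 output characters (~33% growth).
function toDataUri(bytes, mimeType) {
  const base64 = Buffer.from(bytes).toString('base64');
  return `data:${mimeType};base64,${base64}`;
}

const raw = Buffer.alloc(30000, 0xab); // 30 KB of dummy "file" content
const uri = toDataUri(raw, 'application/octet-stream');
const encodedLength = uri.split(',')[1].length;

console.log(`raw: ${raw.length} bytes, base64: ${encodedLength} chars`);
// 30000 raw bytes become 40000 base64 characters, i.e. ~33.3% overhead
```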
Use Case: Referencing remote files, importing from external sources
Example:
```json
{
  "title": "External Document",
  "attachment": "https://example.com/files/document.pdf",
  "logo": "https://cdn.example.com/images/logo.png"
}
```

Note: URL references are slower as the server must download the file from the external URL. Use only for trusted sources or migration scenarios.
You can combine all three methods in a single request:
```http
POST /index.php/apps/openregister/api/registers/documents/schemas/document/objects
Content-Type: multipart/form-data

title=Complete Package
mainDocument=@contract.pdf
signature=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA...
reference=https://example.com/terms.pdf
```

Files can be uploaded as arrays:
Schema:
```json
{
  "properties": {
    "attachments": {
      "type": "array",
      "items": {
        "type": "file"
      }
    }
  }
}
```

Upload:

```json
{
  "title": "Multi-File Document",
  "attachments": [
    "data:application/pdf;base64,JVBERi0xLjQKJeL...",
    "https://example.com/file2.pdf",
    "data:image/png;base64,iVBORw0KGgo..."
  ]
}
```

File properties work the same way with PUT/PATCH operations:
```http
PUT /index.php/apps/openregister/api/registers/documents/schemas/document/objects/abc-123
Content-Type: multipart/form-data

title=Updated Document
attachment=@new-version.pdf
```

Note: Updating a file property replaces the previous file.
```json
{
  "error": "File at attachment has invalid type 'application/zip'. Allowed types: application/pdf, application/msword"
}
```

```json
{
  "error": "File at attachment exceeds maximum size (10485760 bytes). File size: 15728640 bytes"
}
```

```json
{
  "error": "Failed to read uploaded file for field 'attachment'"
}
```

```json
{
  "error": "Unable to fetch file from URL: https://example.com/missing.pdf"
}
```

✅ Existing file endpoints remain unchanged:

- `POST /api/objects/{register}/{schema}/{id}/files`
- `GET /api/objects/{register}/{schema}/{id}/files`
- `DELETE /api/objects/{register}/{schema}/{id}/files/{fileId}`
Both approaches work and can be used interchangeably.
| Method | Speed | File Size | Metadata | Use Case |
|---|---|---|---|---|
| Multipart | Fastest | Original | Preserved | ✅ Recommended for all uploads |
| Base64 | Medium | +33% larger | Lost | ⚠️ API integrations only |
| URL | Slowest | Original | Preserved | 🐌 External imports only |
1. ✅ ALWAYS use Multipart for user uploads
   - Users expect filenames to be preserved
   - Prevents confusion about generic filenames
2. ⚠️ Base64 only for APIs
   - When the API client doesn't support multipart
   - Document that filenames will be lost
   - Always use the data URI format with a MIME type
3. 🐌 URLs only for trusted sources
   - Use timeout limits (max 30 seconds)
   - Validate Content-Length headers upfront
   - Implement retry logic
4. 📝 Document your choice
   - If using base64 or URL, explain why
   - Make users aware of trade-offs
5. 🧪 Test performance
   - Measure upload times in production
   - Monitor failure rates for URL downloads
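Validating the Content-Length header before downloading a URL-referenced file can be sketched as follows. This is a hypothetical client-side helper, not OpenRegister's actual implementation; the 10 MB limit is an example value:

```javascript
// Hypothetical pre-download check for URL-referenced files: inspect the
// response headers (e.g. from a HEAD request) before fetching the body.
const MAX_FILE_BYTES = 10 * 1024 * 1024; // example limit: 10 MB

function validateRemoteFileHeaders(headers, maxBytes = MAX_FILE_BYTES) {
  const length = Number(headers['content-length']);
  if (!Number.isFinite(length)) {
    return { ok: false, reason: 'missing or invalid Content-Length' };
  }
  if (length > maxBytes) {
    return { ok: false, reason: `file too large (${length} bytes)` };
  }
  return { ok: true, reason: null };
}

console.log(validateRemoteFileHeaders({ 'content-length': '2048' }));
// the actual download would then run with a timeout (e.g. an AbortController)
```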
Each file attachment includes rich metadata:
- Basic properties (name, size, type, extension)
- Creation and modification timestamps
- Access and download URLs
- Checksum for integrity verification
- Custom tags for categorization
Files can be tagged with both simple labels and key-value pairs:
- Tags with a colon (':') are treated as key-value pairs and can be used for advanced filtering and organization
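The colon convention can be applied with a small parsing step. This sketch (the function name is illustrative) splits a tag list into simple labels and key-value pairs:

```javascript
// Split a tag list into simple labels and key-value pairs.
// Tags containing ':' are treated as key-value pairs per the convention above.
function parseTags(tags) {
  const labels = [];
  const pairs = {};
  for (const tag of tags) {
    const idx = tag.indexOf(':');
    if (idx > 0) {
      pairs[tag.slice(0, idx)] = tag.slice(idx + 1);
    } else {
      labels.push(tag);
    }
  }
  return { labels, pairs };
}

console.log(parseTags(['invoice', 'department:legal', 'year:2024']));
// → { labels: ['invoice'], pairs: { department: 'legal', year: '2024' } }
```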
The system maintains file versions by:
- Tracking file modifications with timestamps
- Preserving checksums to detect changes
- Integrating with the object audit trail system
- Supporting file restoration from previous versions
File attachments inherit the security model of their parent objects:
- Files are stored in NextCloud with appropriate permissions
- Share links can be generated for controlled external access
- Access is managed through the OpenRegister user and group system
- Files are associated with the OpenRegister application user for consistent permissions
The system supports the following operations on file attachments:
- Retrieving Files
- Updating Files
- Deleting Files
The system leverages NextCloud's preview capabilities for supported file types:
- Images are displayed as thumbnails
- PDFs can be previewed in-browser
- Office documents can be viewed with compatible apps
- Preview URLs are generated for easy embedding
File attachments are fully integrated with the object lifecycle:
- When objects are created, their file folders are automatically provisioned
- When objects are updated, file references are maintained
- When objects are deleted, associated files can be optionally preserved or removed
- File operations are recorded in the object's audit trail
The file attachment system is implemented through two main service classes:
- FileService: Handles low-level file operations, folder management, and NextCloud integration
- ObjectService: Provides high-level methods for attaching, retrieving, and managing files in the context of objects
These services work together to provide a seamless file management experience within the OpenRegister application.
Open Register provides flexible storage options for files:
By default, files are stored in Nextcloud's file system, leveraging its robust file management capabilities, including:
- Access control
- Versioning
- Encryption
- Collaborative editing
For larger deployments or specialized needs, files can be stored in:
- Object storage systems (S3, MinIO)
- Content delivery networks
- Specialized document management systems
Small files can be stored directly in the database for simplicity and performance.
Files maintain version history, allowing you to:
- Track changes over time
- Revert to previous versions
- Compare different versions
Files inherit access control from their parent objects, ensuring consistent security:
- Users who can access an object can access its files
- Additional file-specific permissions can be applied
- Permissions can be audited
Files support rich metadata to provide context and improve searchability:
- Standard metadata (creation date, size, type)
- Custom metadata specific to your application
- Extracted metadata (e.g., EXIF data from images)
Open Register can generate previews for common file types:
- Thumbnails for images
- PDF previews
- Document previews
For supported file types, content can be extracted for indexing and search:
- Text extraction from documents
- OCR for scanned documents and images
- Metadata extraction
:::tip Enhanced Text Extraction
OpenRegister now includes enhanced text extraction with entity tracking (GDPR), language detection, and language level assessment. See Enhanced Text Extraction & GDPR Entity Tracking for details.
:::
Asynchronous Processing: Text extraction happens in the background after file upload, ensuring:
- Fast uploads: Your file uploads complete instantly without waiting
- Non-blocking: Users don't experience delays during file operations
- Reliable: Background jobs automatically handle retries for failed extractions
- Resource-efficient: Processing happens when resources are available
Text Extraction Options:
OpenRegister supports two text extraction engines:
1. LLPhant (Default) - PHP-based extraction:
   - ✓ Native support: TXT, MD, HTML, JSON, XML, CSV
   - ○ Library support: PDF, DOCX, DOC, XLSX, XLS (requires PhpOffice, PdfParser)
   - ⚠️ Limited: PPTX, ODT, RTF
   - ✗ No support: Image files (JPG, PNG, GIF, WebP)
   - Best for: Privacy-conscious environments, regular documents
   - Cost: Free (included)
2. Dolphin AI - Advanced AI-powered extraction:
   - ✓ All document formats with superior quality
   - ✓ OCR for scanned documents and images (JPG, PNG, GIF, WebP)
   - ✓ Advanced table extraction
   - ✓ Formula recognition
   - ✓ Multi-language OCR
   - Best for: Complex documents, scanned materials, images with text
   - Cost: API subscription required
Extraction Scope Options:
- None: Text extraction disabled
- All files: Extract from all uploaded files
- Files in folders: Extract only from files in specific folders
- Files attached to objects: Extract only from files linked to objects (recommended)
Typical Processing Times:
- Text files: < 1 second
- PDFs (LLPhant): 2-10 seconds
- PDFs (Dolphin): 3-15 seconds
- Large documents or OCR: 10-60 seconds
- Images with OCR (Dolphin): 5-20 seconds
You can configure text extraction in Settings → File Configuration. Check extraction status in the file's metadata after upload.
Background Job Processing:
Text extraction uses Nextcloud's background job system for reliable, async processing:
1. File Upload - User uploads a file
2. Job Queuing - 'FileChangeListener' automatically queues 'FileTextExtractionJob'
3. Job Execution - Background job system processes the file when resources are available
4. Text Extraction - Selected extractor (LLPhant or Dolphin) processes the file
5. Chunking - Text is automatically split into chunks with overlap (1000 chars per chunk, 200 char overlap)
6. Storage - Extracted text and chunks stored in 'FileText' entity for reuse
7. Completion - Status updated to 'completed' or 'failed'
Note: Text extraction is now fully independent of SOLR. Chunks are generated during extraction and stored in the database, making them reusable for SOLR indexing, vector embeddings, AI processing, or any other service that needs chunked text.
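The chunking step described above (1000-character chunks with 200-character overlap, stored with offsets as in 'chunks_json') can be sketched as follows; the function name is illustrative, not the actual implementation:

```javascript
// Split text into overlapping chunks, recording start/end offsets so each
// chunk can be located in the original document.
function chunkText(text, chunkSize = 1000, overlap = 200) {
  const chunks = [];
  const step = chunkSize - overlap; // 800 chars of new text per chunk
  for (let start = 0; start < text.length; start += step) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push({ text: text.slice(start, end), start, end });
    if (end === text.length) break; // last chunk reached
  }
  return chunks;
}

const chunks = chunkText('x'.repeat(2500));
console.log(chunks.map(c => [c.start, c.end]));
// → [ [ 0, 1000 ], [ 800, 1800 ], [ 1600, 2500 ] ]
```

The overlap means each chunk repeats the last 200 characters of its predecessor, so a sentence falling on a chunk boundary still appears whole in at least one chunk.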
File Type Compatibility Matrix:
LLPhant Support:
- ✓ Native (TXT, MD, HTML, JSON, XML, CSV) - Perfect quality, very fast
- ○ Library (PDF, DOCX, DOC, XLSX, XLS) - Good quality, medium speed
- ⚠️ Limited (PPTX, ODT, RTF) - Basic text only, use Dolphin for better results
- ✗ No Support (JPG, PNG, GIF, WebP) - Requires Dolphin with OCR
Dolphin AI Support:
- ✓ All formats with superior quality
- ✓ OCR for scanned documents and images
- ✓ Table extraction with structure preserved
- ✓ Formula recognition (LaTeX format)
- ✓ Multi-language support
- ✓ Layout understanding (multi-column, etc.)
OCR-Specific Use Cases (Dolphin only):
- Document Digitization - Scanning paper archives into searchable text
- Receipt Processing - Photo receipts from mobile devices
- Screenshot Analysis - Extract text from application screenshots
- Infographic Text - Extract text from images with embedded text
- Historical Documents - Digitize old scanned materials
Quality Requirements for OCR:
- Minimum: 150 DPI resolution
- Recommended: 300+ DPI
- Clear, high-contrast images
- Minimal blur or distortion
- Properly oriented (not rotated)
Extraction Configuration Options:
Configure in Settings → File Configuration:
1. Text Extractor Selection:
   - LLPhant (default) - Local, free, privacy-friendly
   - Dolphin - Advanced AI, requires API key
2. Extraction Scope:
   - None - Disabled
   - All files - Every uploaded file
   - Files in folders - Specific folders only
   - Files attached to objects - Only object attachments (recommended)
3. Extraction Mode:
   - Background (default) - Async via background jobs
   - Immediate - Synchronous during upload (slower)
   - Manual - Triggered by admin action only
4. Enabled File Types:
   - Select which file extensions to process
   - Different for LLPhant vs Dolphin
   - Enable OCR formats (images) only if using Dolphin
Integration Tests:
The file text extraction system includes comprehensive integration tests:
```bash
# Run file extraction tests
vendor/bin/phpunit tests/Integration/FileTextExtractionIntegrationTest.php

# Test cases covered:
# - File upload queues background job
# - Background job execution completes
# - Text extraction end-to-end with content verification
# - Multiple file format support (TXT, MD, JSON)
# - Extraction metadata recording (status, method, timestamps)
```

Monitoring Extraction:
Check extraction status via logs:
```bash
# Watch extraction progress
docker logs -f nextcloud-container | grep FileTextExtractionJob

# Check for errors
docker logs nextcloud-container | grep 'extraction failed'

# View extraction statistics
# Settings → File Configuration → Statistics section
```

The Files page provides a centralized view of all files tracked in the text extraction system.
Accessing the Files Page:
Navigate to Files in the main menu to view all files with their extraction status.
Features:
1. File List Table:
   - File name and path
   - File type and size
   - Extraction status (Pending, Processing, Completed, Failed)
   - Number of text chunks created
   - Last extraction timestamp
2. Status Indicators:
   - 🟠 Pending: File discovered but not yet extracted
   - 🔵 Processing: Extraction in progress
   - 🟢 Completed: Successfully extracted
   - 🔴 Failed: Extraction error occurred
3. File Actions:
   - Retry: Re-extract failed files
   - View Error: See detailed error message for failed extractions
4. Pagination:
   - Browse through large file lists (50 files per page)
   - Navigate between pages
5. Refresh:
   - Update the list to see the latest extraction status
Use Cases:
- Monitor extraction progress across all files
- Identify and retry failed extractions
- View error details for troubleshooting
- Verify which files have been processed
Core File Extraction API:
OpenRegister provides dedicated API endpoints for file text extraction (moved from settings to core functionality):
- `GET /api/files` - List all tracked files with extraction status
- `GET /api/files/{id}` - Get single file extraction information
- `POST /api/files/{id}/extract` - Extract text from a specific file
- `POST /api/files/extract` - Extract all pending files (batch processing)
- `POST /api/files/retry-failed` - Retry all failed extractions
- `GET /api/files/stats` - Get extraction statistics
Smart Re-Extraction:
The system automatically detects when files need re-extraction by comparing:
- File modification time ('mtime' from Nextcloud's 'oc_filecache')
- Last extraction time ('extractedAt' from 'oc_openregister_file_texts')
If 'mtime > extractedAt', the file is re-extracted to ensure content is up-to-date.
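The staleness check described above boils down to a single timestamp comparison. A sketch (the function name is illustrative):

```javascript
// A file needs re-extraction when it changed after its last extraction,
// i.e. when mtime (from oc_filecache) is newer than extractedAt
// (from oc_openregister_file_texts).
function needsReExtraction(file) {
  if (!file.extractedAt) return true; // never extracted
  return file.mtime > file.extractedAt;
}

console.log(needsReExtraction({ mtime: 1700001000, extractedAt: 1700000000 }));
// → true (file modified after last extraction)
```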
File Tracking Table:
Extracted text and metadata are stored in 'oc_openregister_file_texts' with:
- 'file_id' - Links to Nextcloud's 'oc_filecache' table
- 'extraction_status' - pending, processing, completed, failed
- 'extractedAt' - Timestamp of last extraction
- 'text_content' - Full extracted text
- 'text_length' - Character count
- 'chunked' - Whether text has been chunked
- 'chunk_count' - Number of chunks created
- 'chunks_json' - JSON array of text chunks with offsets (new in v0.2.7)
- 'extraction_method' - LLPhant or Dolphin
- Plus SOLR indexing and vectorization tracking
Chunking Details: Each chunk in 'chunks_json' contains the chunk text, start offset, and end offset. This allows for precise text retrieval and consistent chunking across all services.
Files can be uploaded and attached to objects:
```http
POST /api/objects/{id}/files
Content-Type: multipart/form-data

file: [binary data]
metadata: {"author": "Legal Department", "securityLevel": "confidential"}
```
You can download a file:

```http
GET /api/files/{id}
```

Or get file metadata:

```http
GET /api/files/{id}/metadata
```

You can retrieve all files associated with an object:

```http
GET /api/objects/files/{objectId}
```
Files can be updated in two ways:
Upload a new version of the file:
```http
PUT /api/objects/{register}/{schema}/{objectId}/files/{fileId}
Content-Type: application/json

{
  "content": "[base64 encoded content or raw content]",
  "tags": ["tag1", "tag2"]
}
```
Update only the file metadata (tags) without changing content:
```http
PUT /api/objects/{register}/{schema}/{objectId}/files/{fileId}
Content-Type: application/json

{
  "tags": ["updated-tag1", "updated-tag2"]
}
```
Note: The 'content' parameter is optional. If omitted, only the metadata will be updated without modifying the file content itself.
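A client can exploit the optional 'content' parameter by building the request body conditionally. This is a hypothetical client-side helper, shown only to illustrate the two update modes:

```javascript
// Build the JSON body for a file update: include 'content' only when the
// file itself changes; a tags-only body updates metadata without touching
// the stored file content.
function buildFileUpdatePayload({ content, tags }) {
  const payload = {};
  if (content !== undefined) payload.content = content;
  if (tags !== undefined) payload.tags = tags;
  return payload;
}

console.log(buildFileUpdatePayload({ tags: ['updated-tag1'] }));
// → { tags: [ 'updated-tag1' ] }  (metadata-only update)
```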
Files can be deleted when no longer needed:
```http
DELETE /api/files/{id}
```
Files have important relationships with other core concepts:
- Files are attached to objects
- An object can have multiple files
- Files inherit permissions from their parent object
- Files are versioned alongside their parent object
- Schemas can define expectations for file attachments
- File validation can be specified in schemas (allowed types, max size)
- Schemas can define required file attachments
- Registers can be configured with different file storage options
- File storage policies can be defined at the register level
- Registers can have quotas for file storage
Attach important documents to business objects:
- Contracts to customer records
- Invoices to order records
- Specifications to product records
Store and manage media assets:
- Product images
- Marketing materials
- Training videos
Maintain evidence for regulatory or legal purposes:
- Compliance documentation
- Audit evidence
- Legal case files
Manage technical documents:
- User manuals
- Technical specifications
- Installation guides
File properties can be configured to automatically share uploaded files publicly. This is useful for assets that need to be accessible without authentication, such as product images or public documents.
When editing a schema in the OpenRegister UI:
1. Select a property with type 'file' or 'array' with items type 'file'
2. In the property actions menu, expand the 'File Configuration' section
3. Check the 'Auto-Share Files' checkbox
4. Save the schema
Files uploaded to this property will now be automatically publicly shared.
In your schema definition, add the 'autoPublish' option to file properties:
```json
{
  "properties": {
    "productImage": {
      "type": "file",
      "autoPublish": true,
      "allowedTypes": ["image/jpeg", "image/png"],
      "maxSize": 5242880
    }
  }
}
```

When 'autoPublish' is set to 'true', files uploaded to this property will automatically:
- Create a public share link
- Set the 'published' timestamp
- Generate a public 'accessUrl' and 'downloadUrl'
1. Property-Level autoPublish (this section):
```jsonc
{
  "properties": {
    "productImage": {
      "type": "file",
      "autoPublish": true // ← Controls if FILES are published
    }
  }
}
```

Controls whether files uploaded to this specific property are automatically shared publicly.
2. Schema-Level autoPublish (different setting):
```jsonc
{
  "configuration": {
    "autoPublish": true // ← Controls if OBJECTS are published
  }
}
```

Controls whether the object entity itself is published (it has nothing to do with file sharing).
These are completely separate settings with different purposes. Setting one does NOT affect the other.
```json
{
  "id": "12345",
  "title": "Product A",
  "productImage": {
    "id": 789,
    "title": "product-a.jpg",
    "accessUrl": "https://your-domain.com/index.php/s/AbCdEfG123",
    "downloadUrl": "https://your-domain.com/index.php/s/AbCdEfG123/download",
    "published": "2024-01-15T10:30:00+00:00",
    "size": 245678,
    "type": "image/jpeg"
  }
}
```

Files that are not publicly shared still have 'accessUrl' and 'downloadUrl' properties, but these URLs require authentication. This allows frontend applications to:
- Display file previews for logged-in users
- Provide download links that work within authenticated sessions
- Maintain security while offering convenient access
Non-shared files return URLs with the following format:

- Access URL: `/index.php/core/preview?fileId={fileId}&x=1920&y=1080&a=1`
- Download URL: `/index.php/apps/openregister/api/files/{fileId}/download`
These URLs require the user to be authenticated to Nextcloud.
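The authenticated URL patterns above can be composed client-side. A sketch (the helper name is illustrative; the query parameters are the documented defaults):

```javascript
// Build the authenticated URLs for a non-shared file, following the
// documented preview and download URL patterns.
function authenticatedUrls(fileId, { width = 1920, height = 1080 } = {}) {
  return {
    accessUrl: `/index.php/core/preview?fileId=${fileId}&x=${width}&y=${height}&a=1`,
    downloadUrl: `/index.php/apps/openregister/api/files/${fileId}/download`,
  };
}

console.log(authenticatedUrls(456).downloadUrl);
// → /index.php/apps/openregister/api/files/456/download
```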
```json
{
  "attachment": {
    "id": 456,
    "title": "confidential-report.pdf",
    "accessUrl": "https://your-domain.com/index.php/core/preview?fileId=456&x=1920&y=1080&a=1",
    "downloadUrl": "https://your-domain.com/index.php/apps/openregister/api/files/456/download",
    "published": null,
    "size": 1234567,
    "type": "application/pdf"
  }
}
```

When a schema is configured to extract metadata fields like 'image' or 'logo' from file properties, the system automatically extracts the public share URL (or the authenticated URL if not shared) and stores it in the object metadata.
```json
{
  "properties": {
    "logo": {
      "type": "file",
      "allowedTypes": ["image/png", "image/jpeg"],
      "autoPublish": true
    }
  },
  "configuration": {
    "objectImageField": "logo"
  }
}
```

The object's '@self.image' field will contain the share URL:
```json
{
  "id": "12345",
  "title": "Company A",
  "logo": {
    "id": 789,
    "accessUrl": "https://your-domain.com/index.php/s/XyZ789",
    "type": "image/png"
  },
  "@self": {
    "name": "Company A",
    "image": "https://your-domain.com/index.php/s/XyZ789"
  }
}
```

This makes it easy to display company logos, product images, or other visual metadata in listings and search results.
Files can be deleted by setting the file property to 'null' (for single file properties) or an empty array (for array file properties).
```http
PUT /api/objects/{register}/{schema}/{id}
Content-Type: application/json

{
  "title": "Updated Title",
  "attachment": null
}
```

This will:
- Delete the file from Nextcloud storage
- Remove the file record from the database
- Set the 'attachment' property to 'null' in the object data
```http
PUT /api/objects/{register}/{schema}/{id}
Content-Type: application/json

{
  "title": "Updated Gallery",
  "images": []
}
```

This will:
- Delete all files in the array from Nextcloud storage
- Remove all file records from the database
- Set the 'images' property to an empty array in the object data
- Privacy Compliance: Remove sensitive files upon user request
- Storage Management: Clean up unused files
- Data Lifecycle: Remove temporary or expired files
- Error Correction: Remove incorrectly uploaded files
OpenRegister automatically blocks executable files from being uploaded for security reasons. This prevents malicious code execution and protects your Nextcloud instance.
Windows Executables
`.exe`, `.bat`, `.cmd`, `.com`, `.msi`, `.scr`, `.vbs`, `.vbe`, `.js`, `.jse`, `.wsf`, `.wsh`, `.ps1` (PowerShell), `.dll`
Unix/Linux Executables
`.sh`, `.bash`, `.csh`, `.ksh`, `.zsh`, `.run`, `.bin`, `.app`, `.deb`, `.rpm` (package files)
Scripts & Code
`.php`, `.phtml`, `.php3`, `.php4`, `.php5`, `.phps`, `.phar`, `.py`, `.pyc`, `.pyo`, `.pyw` (Python), `.pl`, `.pm`, `.cgi` (Perl), `.rb`, `.rbw` (Ruby), `.jar`, `.war`, `.ear`, `.class` (Java)
Containers & Packages
`.appimage`, `.snap`, `.flatpak`, `.dmg`, `.pkg`, `.command` (macOS), `.apk` (Android)
Binary Formats
`.elf`, `.out`, `.o`, `.so`, `.dylib`
OpenRegister uses multiple layers of detection:
Checks the file extension against a blacklist of dangerous extensions.
Checks the first bytes of the file content for executable signatures:
- `MZ` - Windows PE/EXE files
- `\x7FELF` - Linux/Unix ELF executables
- `#!/bin/sh` - Shell scripts
- `#!/bin/bash` - Bash scripts
- `<?php` - PHP scripts
- `\xCA\xFE\xBA\xBE` - Java class files
Blocks dangerous MIME types:
- `application/x-executable`
- `application/x-dosexec`
- `application/x-msdownload`
- `application/x-sh`
- `application/x-php`
- `text/x-shellscript`
- And more...
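The first two detection layers can be sketched as follows. This is a simplified illustration with deliberately abbreviated lists, not the actual OpenRegister implementation:

```javascript
// Simplified sketch of the first two detection layers: extension blacklist
// and magic-byte inspection of the first bytes (lists abbreviated).
const BLOCKED_EXTENSIONS = new Set(['exe', 'bat', 'sh', 'php', 'js', 'jar']);
const MAGIC_SIGNATURES = [
  { bytes: Buffer.from('MZ'), name: 'Windows PE' },
  { bytes: Buffer.from([0x7f, 0x45, 0x4c, 0x46]), name: 'ELF' },
  { bytes: Buffer.from('#!'), name: 'script shebang' },
  { bytes: Buffer.from('<?php'), name: 'PHP script' },
];

function isBlockedUpload(filename, firstBytes) {
  const ext = filename.split('.').pop().toLowerCase();
  if (BLOCKED_EXTENSIONS.has(ext)) return `blocked extension .${ext}`;
  for (const sig of MAGIC_SIGNATURES) {
    if (firstBytes.subarray(0, sig.bytes.length).equals(sig.bytes)) {
      return `blocked signature: ${sig.name}`;
    }
  }
  return null; // allowed by these two layers
}

// A renamed executable is still caught by its magic bytes:
console.log(isBlockedUpload('document.txt', Buffer.from('MZ\x90\x00')));
// → blocked signature: Windows PE
```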
By default, ALL executable files are blocked.
```http
POST /api/registers/docs/schemas/document/objects

{
  "title": "My Document",
  "attachment": "script.sh"  // ❌ BLOCKED!
}
```

Response:

```json
{
  "error": "File at attachment is an executable file (.sh). Executable files are blocked for security reasons. Allowed formats: documents, images, archives, data files."
}
```

If you absolutely need to allow executables (e.g., for a software repository), you can set `allowExecutables: true` in your schema:
```jsonc
{
  "properties": {
    "softwarePackage": {
      "type": "file",
      "allowExecutables": true, // ⚠️ DANGEROUS!
      "allowedTypes": ["application/x-deb"] // Still enforce MIME type
    }
  }
}
```

Only set `allowExecutables: true` if:

- You absolutely trust the source of the files
- Users are administrators only
- You have other security measures in place (virus scanning, sandboxing)
- You understand the security risks
```bash
# Documents
curl -X POST '/api/registers/docs/schemas/document/objects' \
  -F 'title=Report' \
  -F 'attachment=@report.pdf' # ✅ OK

# Images
curl -X POST '/api/registers/docs/schemas/document/objects' \
  -F 'title=Photo' \
  -F 'image=@photo.jpg' # ✅ OK

# Archives
curl -X POST '/api/registers/docs/schemas/document/objects' \
  -F 'title=Data' \
  -F 'data=@archive.zip' # ✅ OK (ZIPs are allowed unless they're JARs)
```

```bash
# Windows executable
curl -X POST '/api/registers/docs/schemas/document/objects' \
  -F 'title=Software' \
  -F 'file=@program.exe' # ❌ BLOCKED

# Shell script
curl -X POST '/api/registers/docs/schemas/document/objects' \
  -F 'title=Script' \
  -F 'file=@setup.sh' # ❌ BLOCKED

# PHP script
curl -X POST '/api/registers/docs/schemas/document/objects' \
  -F 'title=Code' \
  -F 'file=@index.php' # ❌ BLOCKED
```

OpenRegister detects renamed executables:

```bash
# Renamed EXE to TXT - still blocked by magic bytes!
mv malware.exe document.txt
curl -X POST '/api/.../' -F 'file=@document.txt' # ❌ BLOCKED (magic bytes: MZ)

# PHP file renamed to JPG - still blocked!
mv shell.php image.jpg
curl -X POST '/api/.../' -F 'file=@image.jpg' # ❌ BLOCKED (detects <?php)
```

A safe schema configuration keeps the default protection in place:

```jsonc
{
  "slug": "document",
  "properties": {
    "title": {
      "type": "string"
    },
    "attachment": {
      "type": "file",
      "allowedTypes": ["application/pdf", "application/msword"],
      "maxSize": 10485760 // 10MB
      // allowExecutables defaults to false
    }
  }
}
```

DO:
- ✅ Use the default behavior (block executables)
- ✅ Only allow documents, images, archives
- ✅ Combine with virus scanning (ClamAV)
DON'T:
- ❌ Set `allowExecutables: true` unless absolutely necessary
- ❌ Allow untrusted users to upload files to executable-allowed schemas
- ❌ Assume file extensions are safe
Even with executable blocking, use additional security:
```mermaid
graph TD
    A[File Upload] --> B[1. Extension Check]
    B --> C[2. Magic Bytes Detection]
    C --> D[3. MIME Type Validation]
    D --> E[4. Size Validation]
    E --> F[5. Virus Scan - ClamAV]
    F --> G[6. Store in Nextcloud]
    B --> X[❌ Block Executable]
    C --> X
    D --> X
    F --> Y[❌ Virus Detected]
    style X fill:#f44
    style Y fill:#f44
    style G fill:#4f4
```
All blocked uploads are logged:
```bash
# Check logs for blocked attempts
docker logs master-nextcloud-1 | grep "Executable file upload blocked"
```

The performance impact is minimal:
- Extension check: < 0.1ms
- Magic bytes check: < 1ms (only checks first 1KB)
- MIME type check: < 0.1ms
Total overhead: ~1-2ms per file upload
Q: Can I upload ZIP files?
A: ✅ Yes! ZIP files (.zip) are allowed by default. Only executable ZIPs like JARs are blocked.

Q: What about JavaScript files (.js)?
A: ❌ Blocked by default (can be executed in browsers). Use JSON or TXT for data.

Q: Can I upload Python notebooks (.ipynb)?
A: ✅ Yes! .ipynb is JSON format, not an executable. Allowed by default.

Q: What if I need to share code files?
A: Use:

- Text files (`.txt`) with code inside
- Archives (`.zip` or `.tar.gz`) containing code
- Git repositories
- Dedicated code hosting (GitHub, GitLab)

Q: Does this protect against all malware?
A: No! This blocks known executable formats. Malicious documents (PDF with exploits, Office macros) need virus scanning. Use ClamAV for complete protection.
- Define File Types: Establish clear guidelines for what file types are allowed
- Set Size Limits: Define appropriate size limits for different file types
- Use Metadata: Add relevant metadata to improve searchability and context
- Consider Storage: Choose appropriate storage backends based on file types and access patterns
- Implement Retention Policies: Define how long files should be kept
- Plan for Backup: Ensure files are included in backup strategies
- Consider Performance: Optimize file storage for your access patterns
- Use Auto-Publish Wisely: Only enable property-level 'autoPublish' for files that should be publicly accessible. Remember: property 'autoPublish' (file sharing) is different from schema 'autoPublish' (object publishing)
- Document File Deletion: Maintain audit trails when files are deleted for compliance
- Handle Authentication: Use authenticated URLs for sensitive files
- Keep Executables Blocked: Use the default executable blocking behavior unless absolutely necessary
- Layer Security: Combine executable blocking with virus scanning for complete protection
Files in Open Register bridge the gap between structured data and unstructured content, providing a comprehensive solution for managing all types of information in your application. With advanced features like auto-sharing, authenticated access, metadata extraction, and flexible deletion options, Open Register creates a unified system where all your data—structured and unstructured—works together seamlessly.
This section provides detailed visualization of the file handling system's architecture and data flow.
```mermaid
sequenceDiagram
    participant Client
    participant API
    participant ObjectService
    participant SaveObject
    participant FileService
    participant Nextcloud
    participant DB
    participant BgJob
    participant Extractor
    participant Solr

    Client->>API: POST /objects with file data
    API->>ObjectService: saveObject(data)
    ObjectService->>SaveObject: handle(data, uploadedFiles)

    Note over SaveObject: 1. Detect file properties
    SaveObject->>SaveObject: detectFileProperties(schema)

    Note over SaveObject: 2. Process file data
    SaveObject->>SaveObject: processFileProperty(fileData)
    alt Base64 data
        SaveObject->>SaveObject: decodeBase64()
    else URL
        SaveObject->>SaveObject: fetchFromURL()
    else File object
        SaveObject->>SaveObject: validateFileObject()
    end

    Note over SaveObject: 3. Create file
    SaveObject->>FileService: createFile(objectEntity, data)
    FileService->>FileService: determineFolder()
    FileService->>Nextcloud: createFolder(path)
    Nextcloud-->>FileService: Folder created
    FileService->>Nextcloud: writeFile(path, content)
    Nextcloud-->>FileService: File ID
    FileService->>FileService: applyAutoTags()
    FileService->>FileService: createShareLink()
    FileService-->>SaveObject: File metadata

    Note over SaveObject: 4. Update object data
    SaveObject->>SaveObject: replaceFileDataWithIds()
    SaveObject->>DB: INSERT/UPDATE object
    DB-->>SaveObject: Object saved
    SaveObject-->>ObjectService: ObjectEntity
    ObjectService-->>API: Response
    API-->>Client: Object with file IDs

    Note over BgJob: Background Processing
    Nextcloud->>BgJob: FileChangeListener
    BgJob->>Extractor: extractText(fileId)
    alt LLPhant
        Extractor->>Extractor: extractWithLLPhant()
    else Dolphin AI
        Extractor->>Extractor: extractWithDolphin()
    end
    Extractor-->>BgJob: Extracted text
    BgJob->>DB: Store extracted text
    BgJob->>Solr: Index file chunks
    Solr-->>BgJob: Indexed
```
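From the client side, the flow above starts with a single POST that carries both object data and file content. A minimal sketch, assuming the endpoint pattern shown earlier in this page; `buildObjectRequest` and its parameters are illustrative names, not part of the OpenRegister API:

```javascript
// Hypothetical helper: build the URL and JSON body for an object POST that
// embeds a file as a base64 data URI (one of the accepted file formats).
function buildObjectRequest(baseUrl, register, schema, data, fileBytes, mimeType, property) {
  // Encode the raw file bytes as a data URI; SaveObject detects this,
  // decodes it, and hands the content to FileService.
  const base64 = Buffer.from(fileBytes).toString('base64');
  return {
    url: `${baseUrl}/index.php/apps/openregister/api/registers/${register}/schemas/${schema}/objects`,
    body: { ...data, [property]: `data:${mimeType};base64,${base64}` },
  };
}
```

The returned `url` and `body` can then be passed to `fetch` with `Content-Type: application/json`, mirroring the multipart example shown earlier.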
```mermaid
graph TD
    A[Object with File Property] --> B{Property Type}
    B -->|type=file| C[Single File]
    B -->|type=array items.type=file| D[Multiple Files]

    C --> E{Data Type?}
    D --> E
    E -->|Base64 String| F[Decode Base64]
    E -->|URL String| G[Fetch from URL]
    E -->|File Object| H[Validate File Object]
    E -->|File ID| I[Load Existing File]

    F --> J[Validate MIME Type]
    G --> J
    H --> J
    I --> J

    J --> K{Size Check?}
    K -->|Too Large| L[Reject Upload]
    K -->|OK| M[Create File Entity]

    M --> N[Determine Folder Path]
    N --> O[/register/schema/object_uuid/]
    O --> P[Write to Nextcloud]
    P --> Q[Generate Filename]
    Q --> R[Apply Auto Tags]
    R --> S{Auto Publish?}
    S -->|Yes| T[Create Share Link]
    S -->|No| U[Skip Sharing]
    T --> V[Store File Metadata]
    U --> V
    V --> W[Return File ID]
    W --> X[Update Object Data]
    L --> Y[Return Error]

    style J fill:#e1f5ff
    style P fill:#ffe1e1
    style T fill:#fff4e1
```
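The first branch in the graph above, deciding what kind of file data a property holds, can be sketched as a small classifier. This is a hedged illustration; the function name and return labels are assumptions, not the actual OpenRegister implementation:

```javascript
// Hypothetical sketch: classify an incoming file-property value the way
// the decision graph does (base64 data URI, URL, file object, or file ID).
function classifyFileData(value) {
  // Numeric values (or digit-only strings) refer to an existing file.
  if (typeof value === 'number' || /^\d+$/.test(String(value))) return 'file-id';
  // Base64 content arrives as a data URI string.
  if (typeof value === 'string' && value.startsWith('data:')) return 'base64';
  // External files are referenced by URL and fetched server-side.
  if (typeof value === 'string' && /^https?:\/\//.test(value)) return 'url';
  // Structured file objects carry explicit content plus metadata.
  if (value && typeof value === 'object' && 'content' in value) return 'file-object';
  return 'unknown';
}
```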
```mermaid
sequenceDiagram
    participant NC as Nextcloud
    participant Listener as FileChangeListener
    participant Job as FileTextExtractionJob
    participant Service as FileTextExtractionService
    participant LLPhant as LLPhant Extractor
    participant Dolphin as Dolphin AI API
    participant DB as Database
    participant Solr as Solr Index

    Note over NC: File uploaded/modified
    NC->>Listener: post_create / post_update event
    Listener->>Listener: Check extraction scope
    alt Scope matches
        Listener->>Job: Queue FileTextExtractionJob
        Job-->>Listener: Job queued
    else Scope doesn't match
        Listener->>Listener: Skip extraction
    end

    Note over Job: Background job execution
    Job->>Service: extractText(fileId)
    Service->>Service: Check extraction mode
    alt Mode=background
        Service->>Service: Process immediately
    else Mode=immediate
        Service->>Service: Process in request
    else Mode=manual
        Service->>Service: Skip until manual trigger
    end

    Service->>DB: Get file info
    DB-->>Service: File metadata
    Service->>Service: Validate file type

    alt LLPhant Selected
        Service->>LLPhant: extract(filePath)
        alt Native format (TXT, MD, HTML)
            LLPhant->>LLPhant: Read directly
        else PDF/DOCX
            LLPhant->>LLPhant: Use library parser
        else Image (JPG, PNG)
            LLPhant->>LLPhant: Not supported
        end
        LLPhant-->>Service: Extracted text
    else Dolphin AI Selected
        Service->>Dolphin: POST /extract
        Dolphin->>Dolphin: AI processing
        alt Document
            Dolphin->>Dolphin: Text + layout extraction
        else Image
            Dolphin->>Dolphin: OCR processing
        end
        Dolphin-->>Service: Extracted text + metadata
    end

    Service->>Service: Chunk document
    Service->>DB: Store FileText entity with chunks
    DB-->>Service: Stored
    Note over Service: Chunks stored in DB for reuse
    Note over Service: SOLR indexing is separate
    Service->>DB: Update status=completed
    DB-->>Service: Success
    Service-->>Job: Extraction complete
```
```mermaid
graph TB
    A[File Upload] --> B[FileService]
    B --> C{Storage Backend?}
    C -->|Nextcloud| D[Nextcloud Files API]
    C -->|S3| E[S3 Compatible Storage]
    C -->|Database| F[Direct DB Storage]

    D --> G[Folder Structure]
    G --> H[/openregister/]
    H --> I[/register_id/]
    I --> J[/schema_id/]
    J --> K[/object_uuid/]
    K --> L[file_timestamp.ext]

    E --> M[Bucket Structure]
    M --> N[openregister/register/schema/object/file]

    F --> O[file_data BLOB]

    L --> P[File Metadata]
    N --> P
    O --> P
    P --> Q[(File Registry)]
    Q --> R[file_id]
    Q --> S[file_path]
    Q --> T[share_link]
    Q --> U[checksum]
    Q --> V[tags]

    style B fill:#e1f5ff
    style Q fill:#ffe1e1
```
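The Nextcloud folder layout in the diagram can be composed with a small path builder. A minimal sketch, assuming the `/openregister/register_id/schema_id/object_uuid/` layout and timestamped filename shown above; the function name and fallback extension are illustrative:

```javascript
// Hypothetical sketch: build the storage path for an uploaded file
// following the /openregister/<register>/<schema>/<object_uuid>/ layout.
function buildFilePath(registerId, schemaId, objectUuid, originalName, timestamp) {
  // Split the original filename into stem and extension;
  // fall back to a generic extension when there is none (assumption).
  const dot = originalName.lastIndexOf('.');
  const ext = dot >= 0 ? originalName.slice(dot + 1) : 'bin';
  const stem = dot >= 0 ? originalName.slice(0, dot) : originalName;
  return `/openregister/${registerId}/${schemaId}/${objectUuid}/${stem}_${timestamp}.${ext}`;
}
```

Keeping the timestamp in the filename avoids collisions when the same file is uploaded twice to one object.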
```mermaid
graph TD
    A[File Type] --> B{Extraction Engine}
    B -->|LLPhant| C[LLPhant Support]
    B -->|Dolphin AI| D[Dolphin Support]

    C --> E[Native: TXT, MD, HTML, JSON, XML, CSV]
    C --> F[Library: PDF, DOCX, DOC, XLSX, XLS]
    C --> G[Limited: PPTX, ODT, RTF]
    C --> H[Not Supported: Images JPG, PNG, GIF, WebP]

    D --> I[Full Support: All Documents]
    D --> J[OCR: Images JPG, PNG, GIF, WebP]
    D --> K[Advanced: Tables, Formulas]
    D --> L[Multi-language OCR]

    E --> M[✅ Instant]
    F --> N[✅ 2-10s]
    G --> O[⚠️ May Fail]
    H --> P[❌ No Support]
    I --> Q[✅ 3-15s]
    J --> R[✅ 5-20s OCR]
    K --> S[✅ Superior Quality]
    L --> T[✅ 100+ Languages]

    style C fill:#fff4e1
    style D fill:#e1ffe1
```
```mermaid
graph LR
    A[Settings] --> B{Extraction Engine}
    B -->|llphant| C[LLPhant Extractor]
    B -->|dolphin| D[Dolphin AI API]

    A --> E{Extraction Scope}
    E -->|none| F[Disabled]
    E -->|all| G[All Files]
    E -->|folders| H[Specific Folders]
    E -->|objects| I[Object Files Only]

    A --> J{Extraction Mode}
    J -->|background| K[Async Processing]
    J -->|immediate| L[Sync Processing]
    J -->|manual| M[Manual Trigger]

    C --> N{File Type}
    D --> N
    N -->|Supported| O[Extract Text]
    N -->|Unsupported| P[Skip Extraction]
    O --> Q[Store in FileText]
    Q --> R[Index in Solr]

    style B fill:#e1f5ff
    style E fill:#fff4e1
    style J fill:#ffe1e1
```
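The interaction of the three settings above (scope decides *whether* to extract, mode decides *when*) can be sketched as a small dispatcher. The setting values mirror the diagram; the function names, file flags, and return labels are illustrative assumptions:

```javascript
// Hypothetical sketch of how a listener might apply the extraction settings.
// The scope setting decides whether a file qualifies at all.
function shouldExtract(file, settings) {
  switch (settings.scope) {
    case 'none':    return false;                 // extraction disabled
    case 'all':     return true;                  // every uploaded file
    case 'folders': return !!file.inWatchedFolder;  // specific folders only (assumed flag)
    case 'objects': return !!file.attachedToObject; // object files only (assumed flag)
    default:        return false;
  }
}

// The mode setting decides when extraction runs for a qualifying file.
function schedule(file, settings) {
  if (!shouldExtract(file, settings)) return 'skipped';
  if (settings.mode === 'immediate') return 'extract-now';  // sync, within the request
  if (settings.mode === 'manual') return 'await-trigger';   // user-initiated later
  return 'queued';                                          // background job (default)
}
```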
Note: As of v0.2.7, chunking happens during text extraction, not during SOLR indexing. Chunks are stored in the database and reused.
```mermaid
graph TD
    A[Text Extraction] --> B[Chunk Document]
    B --> C{Chunking Strategy}
    C -->|Recursive Character| D[Smart splitting by paragraphs/sentences]
    C -->|Fixed Size| E[1000 chars per chunk + 200 overlap]
    D --> F[Create Chunks]
    E --> F
    F --> G[Store in Database]
    G --> H[chunks_json column]
    H --> I[Chunk 1: 0-1000]
    H --> J[Chunk 2: 900-1900]
    H --> K[Chunk 3: 1800-2800]
    H --> L[...]

    subgraph SOLR_Indexing[SOLR Indexing - Reads Pre-Chunked Data]
        M[Read chunks_json from DB]
        M --> N[Parse JSON chunks]
        N --> O[Solr Documents]
    end

    I --> M
    J --> M
    K --> M
    L --> M

    O --> P[Index Fields]
    P --> Q[file_id]
    P --> R[chunk_index]
    P --> S[chunk_text]
    P --> T[chunk_start_offset]
    P --> U[chunk_end_offset]

    Q --> V[(Solr Index)]
    R --> V
    S --> V
    T --> V
    U --> V

    style G fill:#e1f5ff
    style H fill:#ffe1e1
    style V fill:#e1ffe1
```
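The fixed-size strategy above can be sketched in a few lines: 1000-character chunks with a 200-character overlap so no sentence is cut off at a hard boundary (the diagram's example offsets use a slightly different overlap; both parameters are configurable here). The function name and chunk shape are illustrative:

```javascript
// Hypothetical sketch of fixed-size chunking with overlap, as described
// in the chunking diagram. Each chunk records its offsets for later indexing.
function chunkText(text, size = 1000, overlap = 200) {
  const chunks = [];
  const step = size - overlap; // advance by size minus overlap each iteration
  for (let start = 0; start < text.length; start += step) {
    const end = Math.min(start + size, text.length);
    chunks.push({ index: chunks.length, start, end, text: text.slice(start, end) });
    if (end === text.length) break; // last chunk reached the end of the text
  }
  return chunks;
}
```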
File Upload Performance:
- Small files (<1MB): ~100-200ms
- Medium files (1-10MB): ~500ms-2s
- Large files (>10MB): ~2-10s
- Very large files (>100MB): ~10-60s

Text Extraction Performance:
- LLPhant:
  - TXT/MD/HTML: <1s (instant)
  - PDF (10 pages): 2-5s (library parsing)
  - DOCX: 3-8s (library parsing)
  - Images: N/A (not supported)
- Dolphin AI:
  - TXT/MD/HTML: 1-2s (API latency)
  - PDF (10 pages): 5-10s (AI processing)
  - DOCX: 4-8s (AI processing)
  - Images (OCR): 5-15s (OCR + AI)

Chunking and Indexing:
- Text chunking: <100ms for 100KB of text (now part of extraction)
- Solr indexing: ~50-200ms per document (reads pre-chunked data)
- Batch indexing: ~500ms for 100 chunks (faster with pre-chunked data)
Note: Since v0.2.7, chunking is performed once during text extraction and stored in the database. This makes SOLR indexing faster and allows chunks to be reused for vector embeddings, AI processing, or any other service that needs chunked text.
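Because chunks are persisted at extraction time, indexing reduces to mapping the stored `chunks_json` rows onto the Solr field names listed in the chunking diagram. A hedged sketch; the stored chunk shape (`text`, `start`, `end`) and the function name are assumptions:

```javascript
// Hypothetical sketch: turn the pre-chunked chunks_json data into Solr
// documents carrying the index fields from the chunking diagram.
function toSolrDocuments(fileId, chunksJson) {
  return JSON.parse(chunksJson).map((chunk, i) => ({
    file_id: fileId,
    chunk_index: i,
    chunk_text: chunk.text,
    chunk_start_offset: chunk.start,
    chunk_end_offset: chunk.end,
  }));
}
```

The same parsed chunks can feed vector embedding or AI pipelines without re-chunking the source text.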
```php
use OCA\OpenRegister\Service\FileService;

// Create file from base64
$fileMetadata = $fileService->createFile(
    objectEntity: $object,
    fileData: [
        'content' => 'data:image/jpeg;base64,/9j/4AAQ...',
        'tags' => ['profile', 'avatar']
    ]
);

// Create file from URL
$fileMetadata = $fileService->createFile(
    objectEntity: $object,
    fileData: [
        'url' => 'https://example.com/document.pdf',
        'tags' => ['imported', 'external']
    ]
);

// Access file metadata
$fileId = $fileMetadata['id'];
$shareLinkUrl = $fileMetadata['accessUrl'];
$downloadUrl = $fileMetadata['downloadUrl'];
```

```php
use OCA\OpenRegister\Service\FileTextExtractionService;

// Extract text from file
$extractionService->extractText($fileId);

// Get extraction status
$fileText = $fileTextMapper->findByFileId($fileId);
$status = $fileText->getExtractionStatus(); // 'pending', 'processing', 'completed', 'failed'
$text = $fileText->getTextContent();

// Manually trigger extraction
$extractionService->queueExtraction($fileId);
```

```php
// Search across file content in Solr
$results = $solrService->searchFiles([
    '_search' => 'contract terms',
    'mime_type' => 'application/pdf',
    '_limit' => 20
]);

// Access chunk results
foreach ($results['hits'] as $hit) {
    $fileId = $hit['file_id'];
    $chunkIndex = $hit['chunk_index'];
    $text = $hit['chunk_text'];
    $highlighted = $hit['highlighted_text'];
}
```

```bash
# Run file handling tests
vendor/bin/phpunit tests/Service/FileServiceTest.php

# Test text extraction
vendor/bin/phpunit tests/Service/FileTextExtractionServiceTest.php

# Test specific scenarios
vendor/bin/phpunit --filter testBase64FileUpload
vendor/bin/phpunit --filter testTextExtraction
vendor/bin/phpunit --filter testFileChunking

# Integration tests
vendor/bin/phpunit tests/Integration/FileIntegrationTest.php
```

Test Coverage:
- File upload (base64, URL, file object)
- File property processing
- Text extraction (LLPhant, Dolphin)
- Chunking and Solr indexing
- File deletion
- Share link generation
- Auto-tagging