Convert legacy log and data formats to PFC — the high-performance cold storage format for JSONL data.
Apache CLF · nginx · CSV/TSV · NDJSON → JSONL → .pfc
gzip · zstd · bzip2 · lz4 · xz → auto-decompressed transparently
Part of the PFC Ecosystem.
For pure compression-format migration (gzip/zstd → .pfc, content unchanged) see pfc-migrate.
pfc-convert is a schema converter: it reads old log/data formats line by line, rewrites them as structured JSONL with a proper timestamp field, and compresses the result as .pfc — including a block-level timestamp index for fast time-range queries.
| Input | What happens |
|---|---|
| `access.log.gz` (Apache CLF) | decompress → parse fields → JSONL → `.pfc` |
| `data.csv` | detect delimiter → header as keys → JSONL → `.pfc` |
| `events.jsonl.gz` | decompress → compress as `.pfc` |
| `s3://bucket/logs/` | in-region conversion, no data egress |
```bash
pip install pfc-convert
```

Requires the `pfc_jsonl` binary on your system. Set `PFC_JSONL_BINARY=/path/to/pfc_jsonl` if it is not in `$PATH`.
Install the `pfc_jsonl` binary:

```bash
# Linux x64:
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-linux-x64 \
  -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl

# macOS (Apple Silicon M1–M4):
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-macos-arm64 \
  -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl
```

License note: `pfc_jsonl` is free for personal and open-source use. Commercial use requires a written license — see pfc-jsonl.
Optional extras:

```bash
pip install "pfc-convert[s3]"    # S3 support
pip install "pfc-convert[all]"   # all cloud backends + compression formats
```

```bash
# Single file — auto-detect schema and compression
pfc-convert convert access.log.gz

# Explicit schema
pfc-convert convert access.log.gz --schema apache --out archive.pfc

# CSV with specific timestamp column
pfc-convert convert events.csv --schema csv --timestamp-field event_time

# Output JSONL instead of .pfc (inspect before compressing)
pfc-convert convert access.log --schema apache --output-format jsonl

# Convert all logs in a directory
pfc-convert convert --dir /var/log/apache/ --schema apache --out-dir /archive/pfc/

# S3 in-region conversion
pfc-convert s3 --bucket my-logs --prefix apache/2024/ --schema apache
```

| Schema flag | Format | Auto-detected |
|---|---|---|
| `apache` | Apache Common / Combined Log Format | ✓ |
| `nginx` | nginx CLF (structurally identical to Apache) | ✓ |
| `nginx-json` | nginx JSON log mode (already JSONL) | ✓ |
Example Apache CLF line and its JSONL output:

```
192.168.1.1 - alice [29/Apr/2026:10:30:00 +0200] "GET /api HTTP/1.1" 200 4523 "https://ref.example.com" "curl/7.88"
```

```json
{"timestamp": "2026-04-29T10:30:00+02:00", "ip": "192.168.1.1", "user": "alice", "method": "GET", "path": "/api", "protocol": "HTTP/1.1", "status": 200, "bytes": 4523, "referer": "https://ref.example.com", "user_agent": "curl/7.88"}
```
| Schema flag | Format | Notes |
|---|---|---|
| `csv` | CSV / TSV | Header row → JSON keys, delimiter auto-detected |
| `ndjson` | NDJSON / JSONL | No schema change, compress only |
| Format | Magic bytes |
|---|---|
| gzip | `\x1f\x8b` |
| zstd | `\x28\xb5\x2f\xfd` |
| bzip2 | `BZh` |
| lz4 | `\x04\x22\x4d\x18` |
| xz | `\xfd7zXZ\x00` |
Magic bytes are checked first — the file extension is ignored. A file with a `.log` extension that is actually gzip-compressed will still be decompressed correctly.
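A minimal sketch of this kind of detection, built from the magic bytes in the table above (pfc-convert's internal detector is not shown here; `detect_compression` and its `"none"` fallback are illustrative names):

```python
# Magic-byte prefixes from the table above (none of them overlap).
MAGIC = {
    b"\x1f\x8b": "gzip",
    b"\x28\xb5\x2f\xfd": "zstd",
    b"BZh": "bzip2",
    b"\x04\x22\x4d\x18": "lz4",
    b"\xfd7zXZ\x00": "xz",
}

def detect_compression(path: str) -> str:
    """Identify compression by file content, ignoring the extension."""
    with open(path, "rb") as f:
        head = f.read(6)  # the longest magic (xz) is 6 bytes
    for magic, name in MAGIC.items():
        if head.startswith(magic):
            return name
    return "none"  # treat as plain text
```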
```bash
pfc-convert convert server_logs.gz --schema apache --out archive.pfc
# gzip detected → decompress → CLF → JSONL → .pfc (no temp files)
```

```bash
pfc-convert convert server_logs.gz --schema apache --output-format jsonl --out archive.jsonl
# → inspect archive.jsonl, verify it looks correct
pfc-migrate convert archive.jsonl --out archive.pfc
# → compress once verified
```

```bash
pfc-convert convert server_logs.gz --schema apache --stdout \
  | pfc-migrate convert --stdin --out archive.pfc
# streams directly, no intermediate file
```

- Delimiter: auto-detected (comma, semicolon, tab, pipe; see the sketch after this list)
- Header: first row becomes JSON keys; BOM (`U+FEFF`) stripped automatically
- Timestamp: auto-detected from a column named `timestamp`, `time`, `ts`, `@timestamp`, `datetime`, `date`, or `event_time`; override with `--timestamp-field`
- Encoding: UTF-8 with optional BOM; legacy Latin-1 files handled gracefully via `errors='replace'`
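A rough sketch of how this delimiter and header handling can be done with the standard library; whether pfc-convert uses `csv.Sniffer` internally is an assumption:

```python
import csv

def csv_rows_to_records(path: str):
    """Yield one dict per CSV row, with the header row as keys."""
    # utf-8-sig transparently strips a leading BOM; errors="replace"
    # degrades gracefully on stray legacy (e.g. Latin-1) bytes.
    with open(path, newline="", encoding="utf-8-sig", errors="replace") as f:
        sample = f.read(4096)
        f.seek(0)
        try:
            dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
        except csv.Error:
            dialect = csv.excel  # fall back to plain comma-separated
        for row in csv.DictReader(f, dialect=dialect):
            yield dict(row)
```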
Production logs are never clean. Use `--on-error` to control behaviour:

| Flag | Behaviour |
|---|---|
| `--on-error skip` | Skip unparseable lines, continue (default) |
| `--on-error fail` | Stop on first error |
| `--on-error log` | Skip + write errors to `<input>.convert_errors.log` |
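The three modes reduce to a simple dispatch. The sketch below shows the control flow implied by the table; the function and its signature are hypothetical, not pfc-convert internals:

```python
def convert_lines(lines, parse, mode="skip", err_path="input.convert_errors.log"):
    """mode is one of "skip", "fail", "log", matching --on-error."""
    records, errors = [], 0
    with open(err_path, "w") as err_log:
        for lineno, line in enumerate(lines, start=1):
            try:
                records.append(parse(line))
            except ValueError as exc:
                if mode == "fail":
                    raise                 # stop on first error
                errors += 1               # "skip" and "log" both continue
                if mode == "log":
                    err_log.write(f"line {lineno}: {exc}\n")
    return records, errors
```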
The `.pfc` block index (`.bidx` sidecar file) stores the timestamp range per block.
This enables pfc-gateway and DuckDB to answer time-range queries without decompressing any data.
pfc-convert sets timestamps automatically:
| Schema | Source |
|---|---|
| Apache / nginx CLF | `[29/Apr/2026:10:30:00 +0200]` → ISO 8601 |
| nginx-json | `time_local` or `time_iso8601` field |
| NDJSON | `timestamp`, `ts`, `time`, or `@timestamp` field |
| CSV | `--timestamp-field` or auto-detected column |
| Fallback | Processing time (with warning — reduces query efficiency) |
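To see why this matters, here is how a reader can use such an index to prune blocks. The `(offset, min_ts, max_ts)` tuple shape is assumed for illustration; the actual `.bidx` layout is internal to the PFC tools:

```python
def blocks_to_read(bidx_entries, query_start, query_end):
    """Keep only blocks whose timestamp range overlaps the query window."""
    return [
        offset
        for offset, min_ts, max_ts in bidx_entries
        # A block is skipped entirely, without decompression, when its
        # [min_ts, max_ts] range lies outside [query_start, query_end].
        if max_ts >= query_start and min_ts <= query_end
    ]
```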
Every converted file can be recorded to a JSONL audit log:
```bash
pfc-convert convert --dir /var/log/ --schema apache --audit-log migration.log
```

Each entry contains: `logged_at`, `input`, `output`, `schema`, `compression`, `rows_ok`, `rows_err`, `input_mb`, `output_mb`, `ratio_pct`, `duration_s`.
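Because the audit log is itself JSONL, it is easy to post-process. A small example that totals rows across a migration, using only the fields listed above:

```python
import json

total_ok = total_err = 0
with open("migration.log") as f:
    for line in f:
        entry = json.loads(line)       # one JSON object per converted file
        total_ok += entry["rows_ok"]
        total_err += entry["rows_err"]

print(f"{total_ok:,} rows converted, {total_err:,} lines skipped")
```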
```bash
pfc-convert s3 \
  --bucket my-logs \
  --prefix apache-logs/2024/ \
  --out-bucket my-pfc-archive \
  --out-prefix pfc/2024/ \
  --schema apache \
  --region eu-central-1
```

Data stays in-region — no egress costs. Both `.pfc` and `.bidx` files are uploaded.

AWS credentials are read from environment variables, IAM roles, or `~/.aws/credentials` (the standard boto3 chain). Explicit credentials can be passed via `--access-key` / `--secret-key`.
For programmatic use and integration with pfc-ingest-watchdog:
```python
from pfc_convert import ConvertPipeline

pipeline = ConvertPipeline(
    source="/var/log/apache/access.log.gz",
    destination="/archive/access.pfc",
    schema="auto",
    on_error="log",
)
result = pipeline.run()
# result: {'rows_ok': 84231, 'rows_err': 3, 'output_mb': 2.14, 'ratio_pct': 8.7, 'duration_s': 1.2, ...}
```
```
Legacy data                 pfc-convert                  PFC Ecosystem
─────────────────────────────────────────────────────────────────────
access.log.gz ───────►  schema convert         ►  .pfc archive
data.csv      ───────►  + compress             ►  pfc-gateway (query)
events.jsonl  ───────►                         ►  pfc-duckdb (SQL)
s3://legacy/  ───────►  in-region, no egress   ►  s3://archive/
```
After conversion, use:
- pfc-duckdb — `SELECT * FROM read_pfc_jsonl('archive.pfc') WHERE status = 500`
- pfc-gateway — REST API with time-range filter
- pfc-grafana — Grafana data source plugin
| Input | Output |
|---|---|
| `access.log.gz` | `access.pfc` |
| `data.csv` | `data.pfc` |
| `events.jsonl.gz` | `events.pfc` |
Use `--out` to specify an explicit output path.
→ View all PFC tools & integrations
| Direct integration | Why |
|---|---|
| pfc-migrate | Pipe partner — `pfc-convert --stdout \| pfc-migrate --stdin` for streaming compression |
| pfc-ingest-watchdog | Calls pfc-convert automatically when new files arrive in a folder or S3 bucket |
pfc-convert (this repository) is released under the MIT License — see LICENSE.
The PFC-JSONL binary (pfc_jsonl) is proprietary software — free for personal and open-source use. Commercial use requires a license: info@impossibleforge.com