diff --git a/Lsof.8 b/Lsof.8
index 1e5fd037..cc81314c 100644
--- a/Lsof.8
+++ b/Lsof.8
@@ -924,7 +924,10 @@ is mutually exclusive with
 .B \-j
 and
 .BR \-t .
-Warnings and errors are sent to stderr; stdout is always valid JSON.
+Warnings and errors are sent to stderr; stdout is always valid JSON
+(see
+.B "CHARACTER ENCODING NOTE"
+below).
 .TP \w'names'u+4
 .B \-j
 selects JSON Lines output mode. Each open file produces one JSON
@@ -940,6 +943,27 @@ is mutually exclusive with
 .B \-J
 and
 .BR \-t .
+.IP
+.B "Character encoding note:"
+JSON (RFC\ 8259) mandates that strings be valid UTF\-8.
+However, file names on Unix\-like systems are arbitrary byte sequences
+and may contain bytes that are not valid UTF\-8.
+When such bytes appear,
+.B lsof
+passes them through to the output unchanged.
+This means the output is not strictly conformant JSON, but the
+original file name can be recovered.
+This is consistent with the behaviour of
+.BR lsfd (1),
+.BR ip (8)
+.RB ( \-j ),
+and other Linux utilities that produce JSON output.
+Consumers that require strict RFC\ 8259 conformance should
+filter or re\-encode such values (e.g.\& using
+.BR iconv (1)
+or Python's
+.B surrogateescape
+error handler).
 .TP \w'names'u+4
 .BI \-i " [i]"
 selects the listing of files any of whose Internet address
diff --git a/docs/options.md b/docs/options.md
index 010010f8..993471ae 100644
--- a/docs/options.md
+++ b/docs/options.md
@@ -76,6 +76,23 @@ Lsof has these options to control its output format:
 - -F produce output that can be parsed by a subsequent
   program.
 
+- -J produce nested JSON output. Instead of tabular or
+  field output, lsof emits a single JSON object with a
+  `processes` array. Field selection follows -F rules.
+  Mutually exclusive with -j and -t.
+
+- -j produce JSON Lines output. Each open file produces
+  one JSON object per line (denormalized with process
+  fields). Suitable for streaming pipelines and log
+  ingestion tools. Mutually exclusive with -J and -t.
+
+  **Note:** Unix file names are arbitrary byte sequences and may
+  contain bytes that are not valid UTF-8. When this occurs, lsof
+  passes the raw bytes through unchanged, producing output that is
+  not strictly conformant with RFC 8259. This matches the behavior
+  of `lsfd(1)`, `ip -j`, `systemctl --output=json`, and other Linux
+  tools.
+
 - -g print process group (PGID) IDs.
 
 - -l list UID numbers instead of login names.
diff --git a/docs/tutorial.md b/docs/tutorial.md
index 14b0644c..300b7ab9 100644
--- a/docs/tutorial.md
+++ b/docs/tutorial.md
@@ -602,7 +602,30 @@ homogeneous across Unix dialects. Thus, if you write a script to
 post-process field output for AIX, it probably will work for HP-UX,
 Solaris, and Ultrix as well.
 
-Support for other formats e.g. JSON is planned.
+### JSON Output
+
+Lsof supports two JSON output modes:
+
+- **`-J`** (nested JSON) — produces a single JSON object containing a
+  `processes` array, where each process has a `files` array of open-file
+  entries. Suitable for tools that consume a complete document (e.g.
+  `python3 -m json.tool`, `jq`).
+
+- **`-j`** (JSON Lines) — produces one JSON object per line, combining
+  process and file fields in a single denormalized record. Suitable for
+  streaming pipelines, log ingestion (Splunk, Datadog, Elastic Stack),
+  and line-oriented tools.
+
+Both modes reuse the `-F` field-selection mechanism. For example,
+`lsof -J -Fpcfn` limits output to PID, command, fd, and name fields.
+
+**Encoding caveat:** JSON (RFC 8259) requires strings to be valid UTF-8,
+but Unix file names are arbitrary byte sequences. When file names
+contain non-UTF-8 bytes, lsof passes them through unchanged — the output
+is technically not valid JSON, but preserves the original file name.
+This is the same approach taken by `lsfd`, `ip -j`, and most Linux tools
+that produce JSON. If your consumer requires strict UTF-8, use a filter
+such as `iconv` or Python's `surrogateescape` codec error handler.
 
 ## The Lsof Exit Code and Shell Scripts
 
diff --git a/src/print.c b/src/print.c
index f750f3a8..625ee42f 100644
--- a/src/print.c
+++ b/src/print.c
@@ -99,6 +99,16 @@ static int human_readable_size(SZOFFTYPE sz, int print, int col);
  * JSON output helpers
  */
 
+/*
+ * json_puts_escaped() - write a C string as a JSON string value (without
+ * the surrounding quotes).
+ *
+ * Control characters (< 0x20) are escaped as \uXXXX. Bytes >= 0x80 are
+ * passed through unchanged. This means non-UTF-8 file names produce
+ * output that is not strictly RFC 8259 conformant, but preserves the
+ * original byte sequence. This is the same trade-off made by lsfd(1),
+ * ip(8) -j, and other Linux JSON-producing tools. See issue #354.
+ */
 static void json_puts_escaped(const char *s) {
     const unsigned char *p = (const unsigned char *)s;
     while (*p) {
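
Consumer-side note on the encoding caveat documented above: the following is a minimal Python sketch of the `surrogateescape` approach the tutorial text mentions. It is not part of the patch. The sample record and its key names (`pid`, `command`, `name`) are assumptions made for illustration, as are the helper names `parse_lsof_json_line` and `original_bytes`; check the keys against real `lsof -j` output before relying on them.

```python
import json


def parse_lsof_json_line(raw: bytes) -> dict:
    """Parse one JSON Lines record that may contain non-UTF-8 bytes."""
    # Bytes that are not valid UTF-8 are mapped to lone surrogates
    # (U+DC80..U+DCFF) instead of raising UnicodeDecodeError.
    text = raw.decode("utf-8", errors="surrogateescape")
    return json.loads(text)


def original_bytes(value: str) -> bytes:
    """Recover the exact bytes behind a surrogate-escaped string value."""
    return value.encode("utf-8", errors="surrogateescape")


# Hypothetical record: the key names are assumed, not taken from lsof.
# 0xE9 is Latin-1 "e acute" and is not valid UTF-8 in this position.
sample = b'{"pid": 1234, "command": "cat", "name": "/tmp/caf\xe9.txt"}'

record = parse_lsof_json_line(sample)
name_bytes = original_bytes(record["name"])

# json.dumps defaults to ensure_ascii=True, so the re-serialized line is
# pure ASCII; the stray byte is written as a \udcXX escape sequence.
strict_line = json.dumps(record)
```

Passing the raw bytes straight to `json.loads` would raise `UnicodeDecodeError` on this record, which is why the sketch decodes explicitly first. The re-serialized line is byte-wise valid UTF-8, but it carries a lone-surrogate escape that some stricter JSON parsers may still reject, so whether this is acceptable depends on the downstream consumer.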