False positive .notdef glyph violation (6.2.11.8) for symbolic TrueType fonts with no /Encoding when char code 0 maps to a real glyph #1575

@ituzlukov

Description

veraPDF incorrectly reports a PDF/A-3u (and PDF/A-2u) violation of clause 6.2.11.8 (.notdef glyph reference) for documents containing symbolic TrueType fonts with no explicit /Encoding entry, where character code 0x00 is legitimately mapped to a real glyph (e.g. space, GID 61) via the font's built-in cmap.

The document is valid: the font's internal encoding maps char code 0 to a real glyph, not .notdef. veraPDF reports a false positive.

Steps to reproduce

  1. Take a PDF/A-3u document with an embedded symbolic TrueType font (Flags bit 3 set, bit 6 clear) that has no /Encoding entry in the font dictionary.
  2. The font's cmap maps character code 0x00 to a real glyph (e.g. space, GID 61 - not GID 0).
  3. The content stream contains a text-showing operator (e.g. TJ) that uses character code 0x00.
  4. Run veraPDF validation against PDF/A-3u (or PDF/A-2u).

Result: veraPDF reports violation of clause 6.2.11.8:

"The document contains a reference to the .notdef glyph"

Expected: No violation. Char code 0 maps to a real glyph via the font's built-in encoding; no .notdef reference exists.

Test case

Attached file: 2039171-page01-text-only.pdfa-3u.pdf

The document contains three symbolic TrueType fonts, all without an explicit /Encoding entry:

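+---------------+-------------------------------------+-------------------+
| Font resource | BaseFont                            | Flags             |
+---------------+-------------------------------------+-------------------+
| /F0           | IAEBHR+TimesNewRomanPS-ItalicMT     | 6 (Symbolic)      |
| /F1           | WGHOTS+TimesNewRomanPS-BoldItalicMT | 262150 (Symbolic) |
| /F2           | YLQPBY+TimesNewRomanPSMT            | 6 (Symbolic)      |
+---------------+-------------------------------------+-------------------+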
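
For reference, the Flags values above can be decoded with a few bit tests. This is a minimal sketch, assuming the 1-based bit numbering of the font descriptor flags in ISO 32000-1 (Table 123); FontFlags and its constants are hypothetical names for illustration, not veraPDF API:

```java
// Illustrative only: FontFlags is a hypothetical helper, not a veraPDF class.
final class FontFlags {
    // ISO 32000-1 numbers flag bits from 1, so bit n has the value 1 << (n - 1).
    static final int SERIF = 1 << 1;        // bit 2
    static final int SYMBOLIC = 1 << 2;     // bit 3
    static final int NONSYMBOLIC = 1 << 5;  // bit 6
    static final int FORCE_BOLD = 1 << 18;  // bit 19

    static boolean isSymbolic(int flags) {
        return (flags & SYMBOLIC) != 0;
    }

    static boolean isNonsymbolic(int flags) {
        return (flags & NONSYMBOLIC) != 0;
    }

    public static void main(String[] args) {
        // The two distinct /Flags values found in the test file.
        for (int flags : new int[] {6, 262150}) {
            System.out.printf("%d: symbolic=%b, nonsymbolic=%b, forceBold=%b%n",
                    flags, isSymbolic(flags), isNonsymbolic(flags),
                    (flags & FORCE_BOLD) != 0);
        }
    }
}
```

Both values have bit 3 (Symbolic) set and bit 6 (Nonsymbolic) clear; 262150 is simply 6 plus the ForceBold bit (262144).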

Output of veraPDF v1.28.2 and v1.31.5:

2039171-page01-text-only.pdfa... ... NOT COMPLIANT
  - Clause 6.2.11.8  (ISO 19005-3:2012) : Glyph
    A PDF/A-3 compliant document shall not contain a reference to the .notdef glyph from any of the text showing operators, regardless of text rendering mode, in any content stream
        test:    name != ".notdef"
        error:   The document contains a reference to the .notdef glyph
        context: root/document[0]/pages[0](7 0 obj PDPage)/contentStream[0](15 0 obj PDContentStream)/operators[6]/usedGlyphs[55](IAEBHR+TimesNewRomanPS-ItalicMT IAEBHR+TimesNewRomanPS-ItalicMT 0 0  0 false)
        context: root/document[0]/pages[0](7 0 obj PDPage)/contentStream[0](15 0 obj PDContentStream)/operators[362]/usedGlyphs[0](WGHOTS+TimesNewRomanPS-BoldItalicMT WGHOTS+TimesNewRomanPS-BoldItalicMT 0 0  0 false)
        context: root/document[0]/pages[0](7 0 obj PDPage)/contentStream[0](15 0 obj PDContentStream)/operators[506]/usedGlyphs[0](YLQPBY+TimesNewRomanPSMT YLQPBY+TimesNewRomanPSMT 0 0  0 false)

In every flagged context the character code is 0x00.
Analysis of the font programs (via pikepdf + fontTools) confirms that in each font char code 0x00 maps to the glyph named space (GID 61, 47, and 67 for /F0, /F1, and /F2 respectively), not to GID 0 (.notdef):

/F0

operators[6]:
  [ (\033) 10 (\() -87 (\000) 82 (6) 10 (7) 47 (,) 10 (;) 10 (,) 10 (4) 10 (>) 10 (0) 10 (5) 10 (4) 10 (,) -86 (\000) 82 (@) -86 (\000) 82 (0) 10 (4) -87 (\000) 82 (6) 10 (7) 10 (0) 10 (3) 10 (5) -87 (\000) 82 (2) 10 (:) 10 (5) 10 (.) 10 (5) -87 (\000) 82 (*) 10 (:) 10 (2) 10 (9) 10 (:) 10 (7) 10 (\() -87 (\000) 82 (+) 10 (,) 10 (2) 10 (2) 10 (\() -87 (\000) 82 (8) 10 (0) 10 (*) 10 (:) 10 (7) 47 (,) 10 (>) 10 (>) 10 (\() -87 (\000) ] TJ
  
operators[6] charcodes stats:
+------+-----+-----+------------+-----+---------+
| code | dec | hex | glyph name | GID | Unicode |
+------+-----+-----+------------+-----+---------+
| \000 | 00  | 00  | space      |  61 |         |   <- NOT .notdef
| \033 | 27  | 1b  | L          |  42 | L       |
| \(   | 40  | 28  | a          |  57 | a       |
...
+------+-----+-----+------------+-----+---------+

/F1

operators[362]:
  (\000) Tj
  
operators[362] charcodes stats:
  +------+-----+-----+------------+-----+---------+
  | code | dec | hex | glyph name | GID | Unicode |
  +------+-----+-----+------------+-----+---------+
  | \000 | 00  | 00  | space      |  47 |         |   <- NOT .notdef
  +------+-----+-----+------------+-----+---------+

/F2

operators[506]:
  (\000) Tj
  
operators[506] charcodes stats:
  +------+-----+-----+------------+-----+---------+
  | code | dec | hex | glyph name | GID | Unicode |
  +------+-----+-----+------------+-----+---------+
  | \000 | 00  | 00  | space      |  67 |         |   <- NOT .notdef
  +------+-----+-----+------------+-----+---------+

Root cause

File: validation-model/src/main/java/org/verapdf/gf/model/impl/operator/textshow/GFGlyph.java
Lines: 91–95

if (font instanceof PDSimpleFont) {
    Encoding encoding = font.getEncodingMapping();
    this.name = encoding == null ? null : encoding.getName(glyphCode);
    if (this.name == null && glyphCode == 0 && font instanceof PDTrueTypeFont) {
        this.name = ".notdef";  //  ¯\_(ツ)_/¯
    }
}

Chain of events for a symbolic TrueType font with no /Encoding

Step 1. font.getEncodingMapping() calls PDFont.getEncodingMappingFromCOSObject().
Since there is no /Encoding key in the font dictionary, cosEncoding.getDirectBase() is null, so it returns Encoding.empty().

Step 2. Encoding.empty().getName(0) is called.
Encoding.empty() has predefinedEncoding = new String[0] and differences = null.
Inside getName():

// Encoding.java:105
return (predefinedEncoding.length != 0) ? NOTDEF : null;
// predefinedEncoding.length == 0  ->  returns null

The comment on this very line reads:
"if no predefined encoding, the null result for using font encoding"
-> null is the intended signal to fall back to the font program's own encoding.

Step 3. Back in GFGlyph, all three conditions hold: this.name == null, glyphCode == 0, and font instanceof PDTrueTypeFont.
-> The code hardcodes .notdef without ever asking the font program which glyph code 0 selects.
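
The three steps can be reproduced in isolation. The sketch below is a stand-alone model and does not use veraPDF classes: encodingGetName() imitates only the quoted return statement of Encoding.getName(), and the in-range lookup before it is my assumption about the rest of that method.

```java
// Stand-alone model of steps 1-3; none of these names are veraPDF API.
final class CurrentBehaviourSketch {
    static final String NOTDEF = ".notdef";

    // Models Encoding.getName(). Only the final return is quoted from the
    // veraPDF source; the in-range lookup is an assumption.
    static String encodingGetName(String[] predefinedEncoding, int code) {
        if (code >= 0 && code < predefinedEncoding.length) {
            return predefinedEncoding[code];
        }
        return (predefinedEncoding.length != 0) ? NOTDEF : null;
    }

    // Models the quoted GFGlyph branch for a simple font.
    static String resolveName(String[] predefinedEncoding, int glyphCode, boolean isTrueType) {
        String name = encodingGetName(predefinedEncoding, glyphCode);
        if (name == null && glyphCode == 0 && isTrueType) {
            name = NOTDEF; // hardcoded: the font program is never consulted
        }
        return name;
    }

    public static void main(String[] args) {
        String[] empty = new String[0]; // what Encoding.empty() holds
        System.out.println(resolveName(empty, 0x00, true)); // .notdef
        System.out.println(resolveName(empty, 0x1B, true)); // null
    }
}
```

Code 0x00 is the only code that ends up as .notdef; every other code keeps the null "ask the font program" signal, which is why only the \000 entries are flagged.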

Why the hardcode is wrong for symbolic TrueType fonts

ISO 32000-1:2008 9.6.6.4:

"When the font has no Encoding entry, or the font descriptor's Symbolic flag is set (in which case the Encoding entry is ignored), this shall occur:

  • If the font contains a (3, 0) subtable, the range of character codes shall be one of these: 0x0000 – 0x00FF, 0xF000 – 0xF0FF, 0xF100 – 0xF1FF, or 0xF200 – 0xF2FF. Depending on the range of codes, each byte from the string shall be prepended with the high byte of the range, to form a two-byte character, which shall be used to select the associated glyph description from the subtable.
  • Otherwise, if the font contains a (1, 0) subtable, single bytes from the string shall be used to look up the associated glyph descriptions from the subtable."

Per the spec, char code 0 is a valid single byte that shall be looked up in the font's cmap. If the cmap maps it to a real glyph (as it does here - space, GID 61), there is no .notdef reference. Assigning .notdef to char code 0 without consulting the font program contradicts the spec.
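
To make the lookup concrete, here is a minimal sketch of the 9.6.6.4 rule with each cmap subtable modelled as a plain Map from character code to GID. The class and method names are hypothetical, the sample mappings are illustrative (the GIDs are taken from the /F0 table above; which subtable the test fonts actually carry is not established here), and trying the four (3, 0) ranges in turn is one possible implementation strategy rather than something the spec prescribes.

```java
import java.util.Map;

// Hypothetical model of ISO 32000-1, 9.6.6.4 for fonts with no /Encoding
// or with the Symbolic flag set. Not veraPDF code.
final class SymbolicCmapLookup {
    // High bytes of the four permitted (3, 0) code ranges.
    static final int[] HIGH_BYTES = {0x0000, 0xF000, 0xF100, 0xF200};

    /**
     * Returns the GID selected by a one-byte code, or 0 (.notdef) when the
     * code is unmapped. Pass null for a subtable the font does not contain.
     */
    static int glyphId(int code, Map<Integer, Integer> cmap30, Map<Integer, Integer> cmap10) {
        if (cmap30 != null) {
            // Prepend the high byte of each range to form a two-byte code.
            for (int high : HIGH_BYTES) {
                Integer gid = cmap30.get(high | code);
                if (gid != null) {
                    return gid;
                }
            }
            return 0;
        }
        if (cmap10 != null) {
            // (1, 0) subtable: the single byte is the lookup key.
            return cmap10.getOrDefault(code, 0);
        }
        return 0;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> macRoman = Map.of(0x00, 61, 0x1B, 42, 0x28, 57);
        Map<Integer, Integer> symbol = Map.of(0xF000, 61, 0xF01B, 42);
        System.out.println(glyphId(0x00, null, macRoman)); // 61
        System.out.println(glyphId(0x00, symbol, null));   // 61
        System.out.println(glyphId(0x41, symbol, null));   // 0
    }
}
```

In both cases code 0x00 is an ordinary lookup key; whether it refers to .notdef depends only on the GID the subtable returns.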

Note also the phrase "in which case the Encoding entry is ignored": for a symbolic font, the /Encoding entry is irrelevant regardless of whether it is present or absent. veraPDF does the opposite - it derives the glyph name from Encoding.getName() and, when that returns null for code 0, falls back to hardcoding .notdef instead of consulting the font's cmap.

Note that initForNotType3() already has an analogous workaround for the glyphPresent field:

// GFGlyph.java:181-183
// every font contains notdef glyph. But if we call method
// of font program we can't distinguish case of code 0
// and glyph that is not present indeed.
glyphPresent = glyphCode == 0 || font.glyphIsPresent(glyphCode);

This workaround correctly prevents a false "glyph not present" error for code 0. However, it does not affect the name field, which is what clause 6.2.11.8 actually checks.

Suggested fix

File: validation-model/src/main/java/org/verapdf/gf/model/impl/operator/textshow/GFGlyph.java

Replace the entire if (font instanceof PDSimpleFont) block with:

if (font instanceof PDSimpleFont) {
    Encoding encoding = (font instanceof PDTrueTypeFont && ((PDTrueTypeFont) font).isSymbolic())
            ? Encoding.empty() // ISO 32000, 9.6.6.4: Symbolic flag -> Encoding entry is ignored
            : font.getEncodingMapping();
    this.name = encoding == null ? null : encoding.getName(glyphCode);
    if (this.name == null && font instanceof PDTrueTypeFont) {
        // ISO 32000, 9.6.6.4: no Encoding or Symbolic -> consult font program (cmap)
        FontProgram fp = font.getFontProgram();
        if (fp != null) {
            String programName = fp.getGlyphName(glyphCode);
            this.name = (programName != null) ? programName
                    : (fp.containsCode(glyphCode) ? null : ".notdef");
        } else if (glyphCode == 0) {
            // conservative fallback: font program unavailable, assume .notdef for code 0
            this.name = ".notdef";
        }
    }
}

Key changes:

  1. For symbolic TrueType fonts, force Encoding.empty() so the /Encoding entry is ignored per spec. Encoding.empty().getName() returns null for all codes, which is then resolved by the font program.
  2. When getName() returns null for a TrueType font, consult the font program: getGlyphName() returns " " (non-.notdef sentinel) for symbolic fonts, or the actual glyph name for non-symbolic fonts. If getGlyphName() returns null, fall back to containsCode() (cmap lookup). This is intentionally broader than just code 0: it correctly handles any code for which the PDF-level encoding returns null but the font's cmap resolves to a real glyph.
  3. Conservative fallback: if the font program is unavailable entirely (fp == null), assume .notdef for code 0.
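
The resolution order described in these three points can be exercised without veraPDF. In the sketch below, GlyphSource is a hypothetical stand-in for the two font-program calls used in the suggested fix, backed by a simple map; it only illustrates the intended outcomes and says nothing about how the real FontProgram implementations behave.

```java
import java.util.Map;

// Stand-alone model of the proposed name resolution for a TrueType font.
final class ProposedResolutionSketch {
    static final String NOTDEF = ".notdef";

    /** Hypothetical stand-in for the font-program calls in the suggested fix. */
    interface GlyphSource {
        String getGlyphName(int code);
        boolean containsCode(int code);
    }

    /** Builds a GlyphSource from a code-to-name map (illustrative data only). */
    static GlyphSource fromCmap(Map<Integer, String> names) {
        return new GlyphSource() {
            public String getGlyphName(int code) { return names.get(code); }
            public boolean containsCode(int code) { return names.containsKey(code); }
        };
    }

    // encodingName is the PDF-level result: null when Encoding.empty() is used.
    static String resolve(String encodingName, int glyphCode, GlyphSource fp) {
        if (encodingName != null) {
            return encodingName;
        }
        if (fp != null) {
            String programName = fp.getGlyphName(glyphCode);
            if (programName != null) {
                return programName;
            }
            return fp.containsCode(glyphCode) ? null : NOTDEF;
        }
        // Font program unavailable: keep the old assumption for code 0 only.
        return glyphCode == 0 ? NOTDEF : null;
    }

    public static void main(String[] args) {
        GlyphSource f0 = fromCmap(Map.of(0x00, "space", 0x1B, "L", 0x28, "a"));
        System.out.println(resolve(null, 0x00, f0));   // space
        System.out.println(resolve(null, 0x05, f0));   // .notdef
        System.out.println(resolve(null, 0x00, null)); // .notdef
    }
}
```

With the mapping from the /F0 table, code 0x00 resolves to space and no longer trips the name != ".notdef" test, while a code the font program does not map is still reported.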

Note on the suggested fix

I am not a veraPDF developer and my reading of the internals may be incomplete.
If the root cause analysis above is wrong, I hope the attached test case is sufficient to reproduce the issue and help you locate the real problem.
Either way, happy to provide any additional information.
Thank you for building and maintaining veraPDF.
