
File tokenizer obscures encoding-related exceptions #151461

@johnslavik


Bug report

Bug description:

When a source file declares an invalid or incompatible encoding, direct execution currently reports a generic SyntaxError: encoding problem: <encoding>. Importing the same file, or executing it via -m, reports the underlying cause, such as unknown encoding: ... or the concrete codec decode error. This makes direct execution less informative and inconsistent with the import/runpy path.

Unknown encoding: current vs expected

❯ cat t.py
# coding: dict-unpacking-at-home
print('Hi') 

❯ ./python.exe t.py
-SyntaxError: encoding problem: dict-unpacking-at-home
+  File "/Users/bartosz.slawecki/Python/cpython/t.py", line 0
+SyntaxError: unknown encoding: dict-unpacking-at-home

❯ cat t2.py
import t

❯ ./python.exe t2.py
Traceback (most recent call last):
  File "/Users/bartosz.slawecki/Python/cpython/t2.py", line 1, in <module>
    import t
  File "/Users/bartosz.slawecki/Python/cpython/t.py", line 0
SyntaxError: unknown encoding: dict-unpacking-at-home

❯ ./python.exe -m t
Traceback (most recent call last):
  File "/Users/bartosz.slawecki/Python/cpython/Lib/runpy.py", line 192, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
                               ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/Users/bartosz.slawecki/Python/cpython/Lib/runpy.py", line 163, in _get_module_details
    code = loader.get_code(mod_name)
  File "<frozen importlib._bootstrap_external>", line 877, in get_code
  File "<frozen importlib._bootstrap_external>", line 806, in source_to_code
  File "<frozen importlib._bootstrap>", line 543, in _call_with_frames_removed
  File "/Users/bartosz.slawecki/Python/cpython/t.py", line 0
SyntaxError: unknown encoding: dict-unpacking-at-home
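The import-path behavior above can be reproduced without touching the filesystem. The following is a minimal sketch, assuming that compiling source bytes in memory takes the same string-tokenizer path as importlib; the helper names `lookup_error_message` and `compile_error` are hypothetical and exist only for this illustration.

```python
import codecs

# Same content as t.py above, kept in memory as bytes.
SOURCE = b"# coding: dict-unpacking-at-home\nprint('Hi')\n"


def lookup_error_message(name):
    """Return the message of the LookupError raised for an unknown codec name."""
    try:
        codecs.lookup(name)
    except LookupError as exc:
        return str(exc)
    return None


def compile_error(source, filename="t.py"):
    """Compile source bytes in memory and return the SyntaxError raised, if any."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as exc:
        return exc
    return None


if __name__ == "__main__":
    # The underlying cause is a codec lookup failure...
    print(lookup_error_message("dict-unpacking-at-home"))
    # ...and the in-memory compile reports it as a SyntaxError.
    print(repr(compile_error(SOURCE)))
```

The point of the comparison is that the codec lookup failure already carries a precise message, which the in-memory path passes along and the direct-execution path currently replaces.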

Encoding error: current vs expected

❯ cat t.py
# coding: shift-jis

lubię Pythona i jego team

❯ ./python.exe t.py
-SyntaxError: encoding problem: shift-jis
+  File "/Users/bartosz.slawecki/Python/cpython/t.py", line 0
+    # coding: shift-jis
+SyntaxError: 'shift_jis' codec can't decode byte 0x99 in position 8: illegal multibyte sequence

❯ cat t2.py
import t

❯ ./python.exe t2.py
Traceback (most recent call last):
  File "/Users/bartosz.slawecki/Python/cpython/t2.py", line 1, in <module>
    import t
  File "/Users/bartosz.slawecki/Python/cpython/t.py", line 0
SyntaxError: 'shift_jis' codec can't decode byte 0x99 in position 26: illegal multibyte sequence

❯ ./python.exe -m t
Traceback (most recent call last):
  File "/Users/bartosz.slawecki/Python/cpython/Lib/runpy.py", line 192, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
                               ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/Users/bartosz.slawecki/Python/cpython/Lib/runpy.py", line 163, in _get_module_details
    code = loader.get_code(mod_name)
  File "<frozen importlib._bootstrap_external>", line 877, in get_code
  File "<frozen importlib._bootstrap_external>", line 806, in source_to_code
  File "<frozen importlib._bootstrap>", line 543, in _call_with_frames_removed
  File "/Users/bartosz.slawecki/Python/cpython/t.py", line 0
SyntaxError: 'shift_jis' codec can't decode byte 0x99 in position 26: illegal multibyte sequence
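Position 26 in the import and -m output is an offset into the whole file, which can be checked by decoding the same bytes by hand. This is a rough sketch, assuming t.py is saved as UTF-8 with the content shown above; `decode_failure` and `compile_error` are hypothetical helper names.

```python
# Same content as the shift-jis t.py above; \u0119 is the letter "ę",
# which UTF-8 encodes as the two bytes 0xC4 0x99.
SOURCE = "# coding: shift-jis\n\nlubi\u0119 Pythona i jego team\n".encode("utf-8")


def decode_failure(data, encoding):
    """Decode data in one go and return the UnicodeDecodeError raised, if any."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc
    return None


def compile_error(source, filename="t.py"):
    """Compile source bytes in memory and return the SyntaxError raised, if any."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as exc:
        return exc
    return None


if __name__ == "__main__":
    err = decode_failure(SOURCE, "shift-jis")
    # Offset of the offending byte within the whole file, and the byte itself.
    print(err.start, hex(SOURCE[err.start]), err.reason)
    print(repr(compile_error(SOURCE)))
```

Decoding the full buffer at once is what makes the offset file-relative; an incremental reader can report a different position for the same byte.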

Investigation

Direct execution uses the file tokenizer, which reads and decodes the file incrementally. Importing uses importlib to read the source into memory and then compiles the full string, so it goes through the string tokenizer.

The two tokenizers handle the declared source encoding differently when set_readline() is invoked in this line of _PyTokenizer_check_coding_spec:

if (strcmp(cs, "utf-8") != 0 && !set_readline(tok, cs)) {
    _PyTokenizer_error_ret(tok);
    PyErr_Format(PyExc_SyntaxError, "encoding problem: %s", cs);
    PyMem_Free(cs);
    return 0;
}
tok->encoding = cs;

  • If set_readline is fp_setreadl (file tokenizer), it creates a file reader with builtins.open and uses that new reader to re-read the encoding declaration line and the remaining lines of the file.

    Only fp_setreadl can fail here and let SyntaxError: encoding problem from line 422 bubble up. It does not fail when the new reader has not yet translated the erroneous bytes, either because (1) the file has no trailing newline, which defers the translation in the incremental newline decoder, or (2) the erroneous bytes are in a later chunk that has not yet been read from disk (around 8 kB after the encoding spec line).

  • If set_readline is buf_setreadl (string tokenizer), it only sets tok->enc and always succeeds. The encoding error surfaces later, in _PyTokenizer_translate_into_utf8, which bails out before the parser is created; the error then goes through _PyPegen_raise_tokenizer_init_error in _PyPegen_run_parser_from_string, which ensures that it is a SyntaxError. This explains why the vague exception never occurred when the code was imported or run via runpy.
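The chunking effect in case (2) can be approximated from Python. This is only an analogy, assuming a text reader created with the built-in open() decodes the file one chunk at a time; the padding is deliberately much larger than 8 kB so the result does not depend on the exact chunk size, and `first_line_or_error` is a hypothetical helper.

```python
import os
import tempfile

HEADER = b"# coding: shift-jis\n"
# \u0119 is "ę"; its UTF-8 bytes are not valid Shift JIS in this position.
BAD_LINE = "lubi\u0119 Pythona i jego team\n".encode("utf-8")


def first_line_or_error(source, encoding="shift_jis"):
    """Write source to a temp file, open it as text, and read only the first line.

    Returns the decoded first line, or the UnicodeDecodeError raised while
    reading it.
    """
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(source)
        with open(path, encoding=encoding) as reader:
            try:
                return reader.readline()
            except UnicodeDecodeError as exc:
                return exc
    finally:
        os.unlink(path)


# Bad bytes right after the encoding declaration: inside the first chunk.
NEAR = HEADER + b"\n" + BAD_LINE
# Bad bytes about 500 kB into the file: far beyond the first chunk.
FAR = HEADER + b"# padding\n" * 50_000 + BAD_LINE

if __name__ == "__main__":
    print(repr(first_line_or_error(NEAR)))
    print(repr(first_line_or_error(FAR)))
```

With the bad bytes close to the declaration, reading the very first line already fails; with the bad bytes far away, the first line decodes cleanly and the failure is deferred to a later read.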

In the file tokenizer path, _PyTokenizer_check_coding_spec() overwrites the concrete exception raised by set_readline() with the generic SyntaxError: encoding problem: <encoding>. The user can no longer tell whether the failure was a codec lookup error or a file decoding error.

The linked PR fixes this by dropping the encoding problem exception entirely and replacing it with the _PyPegen_raise_tokenizer_init_error(tok->filename) call since all set_readline failure paths provide an active exception. This way we have a consistent experience in both tokenizers.
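The difference between overwriting and normalizing the active exception can be sketched in Python. This is an analogy only, not the C implementation: `check_generic` and `check_preserving` are hypothetical functions, and the `(filename, 0, 0, None)` details tuple is an assumption chosen to mimic the "line 0" shown in the tracebacks above.

```python
def check_generic(data, encoding, filename):
    """Current behavior, sketched: every failure gets the same generic message."""
    try:
        return data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # The concrete cause is discarded here.
        raise SyntaxError(f"encoding problem: {encoding}") from None


def check_preserving(data, encoding, filename):
    """Proposed behavior, sketched: keep the concrete message, as a SyntaxError."""
    try:
        return data.decode(encoding)
    except (LookupError, UnicodeDecodeError) as exc:
        # Reuse the message of the active exception instead of replacing it.
        raise SyntaxError(str(exc), (filename, 0, 0, None)) from None


if __name__ == "__main__":
    bad = "lubi\u0119\n".encode("utf-8")  # \u0119 is "ę"
    for check in (check_generic, check_preserving):
        for data, enc in ((b"x = 1\n", "dict-unpacking-at-home"), (bad, "shift-jis")):
            try:
                check(data, enc, "t.py")
            except SyntaxError as exc:
                print(check.__name__, "->", exc.msg)
```

In the second function, a lookup failure and a decode failure stay distinguishable while the caller still only has to handle SyntaxError.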

EDIT: I moved the tokenizer-init error wrapper from pegen to tokenizer helpers. This used to live in pegen because only parser entry points needed it. The file tokenizer now needs the same normalization at the point where set_readline() fails, so the helper was moved to tokenizer helpers rather than making tokenizer code depend on pegen internals. I'm happy to adjust the layering if there is a better home for this helper.

cc @lysnikolaou @pablogsal @serhiy-storchaka

CPython versions tested on:

3.16, main

Operating systems tested on:

macOS


Labels

3.14: bugs and security fixes
3.15: pre-release feature fixes, bugs and security fixes
3.16: new features, bugs and security fixes
interpreter-core: (Objects, Python, Grammar, and Parser dirs)
topic-parser
type-bug: An unexpected behavior, bug, or error