File tokenizer obscures encoding-related exceptions

# Bug report

### Bug description:

When a source file declares an unknown or incompatible encoding, direct execution (`python t.py`) reports only a generic `SyntaxError: encoding problem: <encoding>`. Importing the same file, or executing it via `-m`, reports the underlying cause, such as `unknown encoding: <encoding>` or the concrete codec decode error. Direct execution is therefore less informative than the import/runpy path, and inconsistent with it.
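
The discrepancy can be reproduced without a source checkout. The following is a minimal sketch, assuming that the current interpreter (`sys.executable`) is the build under test and that the working directory is on `sys.path` for `-m`. The helper name `last_stderr_line` is made up for this example, and the exact direct-execution message depends on whether the fix is applied.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Source whose coding declaration names a codec that does not exist.
SOURCE = b"# coding: dict-unpacking-at-home\nprint('Hi')\n"


def last_stderr_line(args, cwd):
    """Run the current interpreter with `args` and return the last stderr line."""
    proc = subprocess.run(
        [sys.executable, *args],
        cwd=cwd,
        capture_output=True,
        text=True,
    )
    lines = proc.stderr.strip().splitlines()
    return lines[-1] if lines else ""


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "t.py").write_bytes(SOURCE)
    # Direct execution: goes through the file tokenizer.
    direct = last_stderr_line(["t.py"], cwd=tmp)
    # Execution via -m: runpy/importlib read the bytes and compile them.
    as_module = last_stderr_line(["-m", "t"], cwd=tmp)

print("direct:", direct)
print("-m:    ", as_module)
```

Comparing the two printed lines shows whether direct execution still hides the cause that the `-m` path reports.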


### Unknown encoding: current vs expected

```diff
❯ cat t.py
# coding: dict-unpacking-at-home
print('Hi') 

❯ ./python.exe t.py
-SyntaxError: encoding problem: dict-unpacking-at-home
+  File "/Users/bartosz.slawecki/Python/cpython/t.py", line 0
+SyntaxError: unknown encoding: dict-unpacking-at-home

❯ cat t2.py
import t

❯ ./python.exe t2.py
Traceback (most recent call last):
  File "/Users/bartosz.slawecki/Python/cpython/t2.py", line 1, in <module>
    import t
  File "/Users/bartosz.slawecki/Python/cpython/t.py", line 0
SyntaxError: unknown encoding: dict-unpacking-at-home

❯ ./python.exe -m t
Traceback (most recent call last):
  File "/Users/bartosz.slawecki/Python/cpython/Lib/runpy.py", line 192, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
                               ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/Users/bartosz.slawecki/Python/cpython/Lib/runpy.py", line 163, in _get_module_details
    code = loader.get_code(mod_name)
  File "<frozen importlib._bootstrap_external>", line 877, in get_code
  File "<frozen importlib._bootstrap_external>", line 806, in source_to_code
  File "<frozen importlib._bootstrap>", line 543, in _call_with_frames_removed
  File "/Users/bartosz.slawecki/Python/cpython/t.py", line 0
SyntaxError: unknown encoding: dict-unpacking-at-home
```

### Encoding error: current vs expected

```diff
❯ cat t.py
# coding: shift-jis

lubię Pythona i jego team

❯ ./python.exe t.py
-SyntaxError: encoding problem: shift-jis
+  File "/Users/bartosz.slawecki/Python/cpython/t.py", line 0
+    # coding: shift-jis
+SyntaxError: 'shift_jis' codec can't decode byte 0x99 in position 8: illegal multibyte sequence

❯ cat t2.py
import t

❯ ./python.exe t2.py
Traceback (most recent call last):
  File "/Users/bartosz.slawecki/Python/cpython/t2.py", line 1, in <module>
    import t
  File "/Users/bartosz.slawecki/Python/cpython/t.py", line 0
SyntaxError: 'shift_jis' codec can't decode byte 0x99 in position 26: illegal multibyte sequence

❯ ./python.exe -m t
Traceback (most recent call last):
  File "/Users/bartosz.slawecki/Python/cpython/Lib/runpy.py", line 192, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
                               ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/Users/bartosz.slawecki/Python/cpython/Lib/runpy.py", line 163, in _get_module_details
    code = loader.get_code(mod_name)
  File "<frozen importlib._bootstrap_external>", line 877, in get_code
  File "<frozen importlib._bootstrap_external>", line 806, in source_to_code
  File "<frozen importlib._bootstrap>", line 543, in _call_with_frames_removed
  File "/Users/bartosz.slawecki/Python/cpython/t.py", line 0
SyntaxError: 'shift_jis' codec can't decode byte 0x99 in position 26: illegal multibyte sequence
```

## Investigation

Direct execution uses the file tokenizer, which reads and decodes the file incrementally. Importing uses importlib to read the source into memory and then compiles the full string, so it goes through the string tokenizer.
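
As a rough illustration of the import path, calling `compile()` on raw source bytes should exercise the same in-memory route that `source_to_code` takes. This is a sketch rather than a description of importlib internals; `compile_error` is a hypothetical helper, and the second source assumes the file was saved as UTF-8, as in the example above.

```python
def compile_error(source: bytes) -> str:
    """Compile in-memory source bytes and return the SyntaxError message, if any."""
    try:
        compile(source, "t.py", "exec")
    except SyntaxError as exc:
        return str(exc.msg)
    return ""


# Declared codec does not exist: the codec lookup fails.
unknown = compile_error(b"# coding: dict-unpacking-at-home\nprint('Hi')\n")

# Declared codec exists, but the UTF-8 bytes of "ę" are not valid Shift JIS.
undecodable = compile_error(
    "# coding: shift-jis\n\nlubię Pythona i jego team\n".encode("utf-8")
)

print(unknown)
print(undecodable)
```

Both messages name the underlying cause, which is what the issue asks direct execution to report as well.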

The two tokenizers handle [the declared source code encoding](https://docs.python.org/3/tutorial/interpreter.html#source-code-encoding) differently. The difference comes from the `set_readline()` call in `_PyTokenizer_check_coding_spec`:

https://github.com/python/cpython/blob/6679ac07d881f6e0ce30b7cc28b5671eafa20d9d/Parser/tokenizer/helpers.c#L420-L426

- If `set_readline` is `fp_setreadl` (file tokenizer), it creates a file reader with `builtins.open` and uses that new reader to re-read the encoding declaration line and the remaining lines of the file.

  Only the `fp_setreadl` path can let `SyntaxError: encoding problem` from line 422 bubble up. It does so when the new reader fails while re-reading, *unless* the erroneous bytes have not been decoded by the new reader yet, either because (1) the file has no trailing newline, which defers the translation in the incremental newline decoder, or (2) the erroneous bytes sit in a later chunk that has not been read from disk yet (around 8 kB after the encoding declaration line).

- If `set_readline` is `buf_setreadl` (string tokenizer), it just sets `tok->enc` and always succeeds. The encoding error only happens later, in `_PyTokenizer_translate_into_utf8`, which bails out _before_ the parser is created; the error then goes through `_PyPegen_raise_tokenizer_init_error` in `_PyPegen_run_parser_from_string`, which ensures that it is a `SyntaxError`. This explains why the vague exception never occurred when the code was imported or run via runpy.
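
Condition (2) above can be approximated from Python. This sketch assumes that a text-mode reader returned by `open()` decodes the file chunk by chunk in a way similar to the reader that `fp_setreadl` creates; it is not the tokenizer's actual code path. `first_readline` and the padding size are illustrative choices.

```python
import os
import tempfile

CODING_LINE = b"# coding: shift-jis\n"
# UTF-8 bytes for "ę" are C4 99; 0x99 followed by a space is not valid Shift JIS.
BAD_LINE = "lubię Pythona i jego team\n".encode("utf-8")
# ASCII-only filler, chosen to be far larger than one text-layer read chunk.
PADDING = b"# padding\n" * 50_000


def first_readline(source: bytes):
    """Return ('ok', line) or ('error', message) for the first readline() call."""
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(source)
        with open(path, encoding="shift-jis") as reader:
            try:
                return "ok", reader.readline()
            except UnicodeDecodeError as exc:
                return "error", str(exc)
    finally:
        os.remove(path)


# Bad bytes right after the declaration: decoded together with the first line.
near = first_readline(CODING_LINE + BAD_LINE)
# Bad bytes far into the file: the first readline() never sees them.
far = first_readline(CODING_LINE + PADDING + BAD_LINE)

print(near[0])
print(far[0])
```

In the `far` case the decode error would only surface on a later read, which is consistent with the failure escaping the `set_readline()` check described above.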

In the file tokenizer path, `_PyTokenizer_check_coding_spec()` overwrites the concrete exception raised by `set_readline()` with the generic `SyntaxError: encoding problem: <encoding>`. The user can no longer tell whether there was a codec lookup error or a file decoding error.

The linked PR fixes this by dropping the `encoding problem` exception entirely and calling `_PyPegen_raise_tokenizer_init_error(tok->filename)` instead, since all `set_readline` failure paths leave an active exception. This gives a consistent experience across both tokenizers.
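
In Python terms, the difference between the old and the proposed behavior looks roughly like this. It is only an analogy for the C code: `decode_source`, `overwrite_error`, and `preserve_error` are hypothetical names, and the real helper builds the `SyntaxError` through the C API.

```python
def decode_source(data: bytes, encoding: str) -> str:
    """Stand-in for set_readline(): raises LookupError or UnicodeDecodeError."""
    return data.decode(encoding)


def overwrite_error(data: bytes, encoding: str) -> str:
    """Old behavior: discard the cause and raise a generic message."""
    try:
        return decode_source(data, encoding)
    except (LookupError, UnicodeError):
        raise SyntaxError(f"encoding problem: {encoding}") from None


def preserve_error(data: bytes, encoding: str, filename: str) -> str:
    """Proposed behavior: carry the active exception's message into a SyntaxError."""
    try:
        return decode_source(data, encoding)
    except SyntaxError:
        raise  # already the right type, leave it untouched
    except (LookupError, UnicodeError) as exc:
        # (filename, lineno, offset, text); line 0 mirrors the tracebacks above.
        raise SyntaxError(str(exc), (filename, 0, None, None)) from None


for handler in (
    lambda: overwrite_error(b"x", "dict-unpacking-at-home"),
    lambda: preserve_error(b"x", "dict-unpacking-at-home", "t.py"),
):
    try:
        handler()
    except SyntaxError as exc:
        print(exc.msg)
```

The second handler keeps the codec's own wording, so the caller can distinguish a lookup failure from a decode failure.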

EDIT: I moved the tokenizer-init error wrapper from pegen to tokenizer helpers. This used to live in pegen because only parser entry points needed it. The file tokenizer now needs the same normalization at the point where `set_readline()` fails, so the helper was moved to tokenizer helpers rather than making tokenizer code depend on pegen internals. I'm happy to adjust the layering if there is a better home for this helper.

cc @lysnikolaou @pablogsal @serhiy-storchaka

### CPython versions tested on:

3.16, main

### Operating systems tested on:

macOS


### Linked PRs
* gh-151462

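Code at the permalink referenced in the Investigation section (`Parser/tokenizer/helpers.c`, lines 420-426):

    if (strcmp(cs, "utf-8") != 0 && !set_readline(tok, cs)) {
        _PyTokenizer_error_ret(tok);
        PyErr_Format(PyExc_SyntaxError, "encoding problem: %s", cs);
        PyMem_Free(cs);
        return 0;
    }
    tok->encoding = cs;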
