Hopefully settle line endings / position mapping#2879
Hopefully settle line endings / position mapping#2879jakebailey wants to merge 9 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR refactors and corrects position/column calculations across the compiler and tooling to disambiguate byte offsets vs UTF-16 code unit offsets (notably for source maps and diagnostic locations), and updates affected baselines accordingly.
Changes:
- Introduces UTF-16-aware line/column APIs (and helpers) and updates call sites to use UTF-16 vs byte offsets appropriately.
- Fixes source map position decoding/recording to convert UTF-16 “character” offsets back into byte positions.
- Updates printer/formatter/diagnostic writer behavior and refreshes affected compiler baselines.
Reviewed changes
Copilot reviewed 35 out of 37 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/core/core.go | Renames PositionToLineAndCharacter to byte-offset naming and adds UTF16Len helper. |
| internal/scanner/scanner.go | Splits ECMAScript line/column APIs into UTF-16 character vs byte-offset variants; adds UTF-16 → byte position conversion. |
| internal/sourcemap/source_mapper.go | Updates source map mapping decode to use UTF-16 character offsets when computing positions. |
| internal/testutil/harnessutil/sourcemap_recorder.go | Fixes source-map span text extraction by converting UTF-16 offsets to byte positions. |
| internal/printer/utilities.go | Updates source-map line/character cache to count UTF-16 code units. |
| internal/printer/textwriter.go | Changes writer column calculation to UTF-16 code units for generated source-map columns. |
| internal/diagnosticwriter/diagnosticwriter.go | Updates diagnostic location formatting to use UTF-16 character offsets; adjusts line-start calculations to byte offsets. |
| internal/transformers/jsxtransforms/jsx.go | Uses UTF-16 columns for jsxDEV __source metadata. |
| internal/format/span.go | Switches formatting logic that needs byte offsets to use the byte-offset line/column API. |
| internal/format/indent.go | Uses byte-offset APIs for indentation computations and refines non-whitespace scanning. |
| internal/compiler/emit_test.go | Updates test comment to reference UTF-16-based position API. |
| internal/astnav/tokens_test.go | Updates test to use renamed byte-offset API. |
| testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).symbols.diff | Baseline diff removed/updated due to corrected UTF-16-based positions. |
| testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).symbols | Updated symbol baseline positions. |
| testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).sourcemap.txt.diff | Baseline diff updated due to corrected source map position mapping. |
| testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).sourcemap.txt | Updated sourcemap baseline output. |
| testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).js.map.diff | Baseline diff updated due to corrected generated mappings/columns. |
| testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).js.map | Updated .js.map baseline content. |
| testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).errors.txt.diff | Baseline diff updated due to corrected UTF-16-based diagnostic positions. |
| testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).errors.txt | Updated diagnostic baseline positions. |
| testdata/baselines/reference/submodule/compiler/sourceMap-LineBreaks(target=es2015).sourcemap.txt.diff | Baseline diff updated for corrected line-break/source map mapping. |
| testdata/baselines/reference/submodule/compiler/sourceMap-LineBreaks(target=es2015).sourcemap.txt | Updated sourcemap baseline output. |
| testdata/baselines/reference/submodule/compiler/sourceMap-LineBreaks(target=es2015).js.map.diff | Baseline diff removed/updated due to corrected source map output. |
| testdata/baselines/reference/submodule/compiler/sourceMap-LineBreaks(target=es2015).js.map | Updated .js.map baseline content. |
| testdata/baselines/reference/submodule/compiler/extendedUnicodePlaneIdentifiersJSDoc.symbols.diff | Baseline diff removed/updated due to corrected UTF-16-based positions. |
| testdata/baselines/reference/submodule/compiler/extendedUnicodePlaneIdentifiersJSDoc.symbols | Updated symbol baseline positions. |
| testdata/baselines/reference/submodule/compiler/extendedUnicodePlaneIdentifiers.symbols.diff | Baseline diff updated due to corrected UTF-16-based positions. |
| testdata/baselines/reference/submodule/compiler/extendedUnicodePlaneIdentifiers.symbols | Updated symbol baseline positions. |
| testdata/baselines/reference/submodule/compiler/declarationEmitLateBoundAssignments2.symbols.diff | Baseline diff updated due to corrected UTF-16-based positions. |
| testdata/baselines/reference/submodule/compiler/declarationEmitLateBoundAssignments2.symbols | Updated symbol baseline positions. |
| // GetColumn returns the column position measured in UTF-16 code units | ||
| // for source map compatibility. | ||
| func (w *textWriter) GetColumn() core.UTF16Offset { | ||
| if w.lineStart { | ||
| return w.indent * w.indentSize | ||
| return core.UTF16Offset(w.indent * w.indentSize) | ||
| } | ||
| return w.builder.Len() - w.linePos | ||
| // Count UTF-16 code units from the last line start. | ||
| // For ASCII-only output (the common case), this equals the byte count. | ||
| return core.UTF16Len(w.builder.String()[w.linePos:]) |
There was a problem hiding this comment.
GetColumn() now calls w.builder.String() and then scans the entire current line to compute a UTF-16 column. During emit/source-map generation, GetColumn is called very frequently, so this can reintroduce O(n²) behavior on long lines and may also trigger costly builder-to-string conversions. A more efficient approach is to track the current UTF-16 column incrementally as text is written (resetting on newline, and accounting for indentation when lineStart is true), so GetColumn stays O(1) without calling builder.String().
There was a problem hiding this comment.
.String() is free, so we're really just talking about how long it takes to calculate, and I think it's better to calculate the len on request rather than repeatedly len-ing on every single write.
There was a problem hiding this comment.
What do you mean by ".String() is free"? Doesn't it have to reallocate on any new content additions?
There was a problem hiding this comment.
No, it's internally an "unsafe" direct access so long as you don't escape it
| fmt.Fprint(writer, strings.Repeat(" ", int(firstLineChar))) | ||
| fmt.Fprint(writer, strings.Repeat("~", lastCharForLine-int(firstLineChar))) | ||
| case lastLine: | ||
| // Squiggle until the final character. | ||
| fmt.Fprint(writer, strings.Repeat("~", lastLineChar)) | ||
| fmt.Fprint(writer, strings.Repeat("~", int(lastLineChar))) | ||
| default: | ||
| // Squiggle the entire line. | ||
| fmt.Fprint(writer, strings.Repeat("~", len(lineContent))) | ||
| fmt.Fprint(writer, strings.Repeat("~", int(core.UTF16Len(lineContent)))) |
There was a problem hiding this comment.
This is very odd, and arguably should actually be rune counts, since these are visible characters.
There was a problem hiding this comment.
Are you just holding off on this to avoid a massive baseline avalanche?
There was a problem hiding this comment.
No matter what we do, things will be just a little weird, so I just didn't change it
...ata/baselines/reference/submodule/compiler/sourceMap-LineBreaks(target=es2015).sourcemap.txt
Show resolved
Hide resolved
DanielRosenwasser
left a comment
There was a problem hiding this comment.
Looks good, just wanted to get some clarifications before approval.
| fmt.Fprint(writer, strings.Repeat(" ", int(firstLineChar))) | ||
| fmt.Fprint(writer, strings.Repeat("~", lastCharForLine-int(firstLineChar))) | ||
| case lastLine: | ||
| // Squiggle until the final character. | ||
| fmt.Fprint(writer, strings.Repeat("~", lastLineChar)) | ||
| fmt.Fprint(writer, strings.Repeat("~", int(lastLineChar))) | ||
| default: | ||
| // Squiggle the entire line. | ||
| fmt.Fprint(writer, strings.Repeat("~", len(lineContent))) | ||
| fmt.Fprint(writer, strings.Repeat("~", int(core.UTF16Len(lineContent)))) |
There was a problem hiding this comment.
Are you just holding off on this to avoid a massive baseline avalanche?
| // raw UTF-8 byte offset from the start of that line for the given byte position. | ||
| // Uses ECMAScript line separators (LF, CR, CRLF, LS, PS). | ||
| // Unlike GetECMALineAndUTF16CharacterOfPosition, the offset is in bytes, not UTF-16 code units. | ||
| func GetECMALineAndByteOffsetOfPosition(sourceFile ast.SourceFileLike, pos int) (line int, byteOffset int) { |
There was a problem hiding this comment.
Minor - do you think From reads better here than Of?
There was a problem hiding this comment.
I'm indifferent, I was just trying to not over-rename if someone goes looking
|
|
||
| func GetIndentationForNode(n *ast.Node, ignoreActualIndentationRange *core.TextRange, sourceFile *ast.SourceFile, options *lsutil.FormatCodeSettings) int { | ||
| startline, startpos := scanner.GetECMALineAndCharacterOfPosition(sourceFile, scanner.GetTokenPosOfNode(n, sourceFile, false)) | ||
| startline, startpos := scanner.GetECMALineAndByteOffsetOfPosition(sourceFile, scanner.GetTokenPosOfNode(n, sourceFile, false)) |
There was a problem hiding this comment.
Is it just me, or is it surprising that the formatter uses ECMAScript-based lines instead of LSP-based lines?
There was a problem hiding this comment.
I totally agree, and we could plumb it differently, but I'm not sure it actually matters for these.
| // GetColumn returns the column position measured in UTF-16 code units | ||
| // for source map compatibility. | ||
| func (w *textWriter) GetColumn() core.UTF16Offset { | ||
| if w.lineStart { | ||
| return w.indent * w.indentSize | ||
| return core.UTF16Offset(w.indent * w.indentSize) | ||
| } | ||
| return w.builder.Len() - w.linePos | ||
| // Count UTF-16 code units from the last line start. | ||
| // For ASCII-only output (the common case), this equals the byte count. | ||
| return core.UTF16Len(w.builder.String()[w.linePos:]) |
There was a problem hiding this comment.
What do you mean by ".String() is free"? Doesn't it have to reallocate on any new content additions?
Fixes #1578
There's a lot of fuzting here, but in essensce, I've renamed some functions, introduced a new
UTF16Offsettype, and then fixed things. This should achieve the goal in #2872 (comment) and #1578.PositionToLineAndCharacter→PositionToLineAndByteOffsetGetECMALineAndCharacterOfPosition→GetECMALineAndUTF16CharacterOfPosition/GetECMALineAndByteOffsetOfPositionGetECMAPositionOfLineAndCharacter→GetECMAPositionOfLineAndUTF16Character/GetECMAPositionOfLineAndByteOffsetComputePositionOfLineAndCharacter→ removed (replaced byComputePositionOfLineAndByteOffset)ComputePositionOfLineAndCharacterEx→ComputePositionOfLineAndUTF16CharacterGetColumn() int→GetColumn() core.UTF16OffsetThis fixes a few baselines, and critically some of the sourcemap ones I expected to find were broken.