Skip to content

Comments

Hopefully settle line endings / position mapping#2879

Open
jakebailey wants to merge 9 commits intomainfrom
jabaile/deal-with-position-helpers
Open

Hopefully settle line endings / position mapping#2879
jakebailey wants to merge 9 commits intomainfrom
jabaile/deal-with-position-helpers

Conversation

@jakebailey
Copy link
Member

Fixes #1578

There's a lot of fuzting here, but in essensce, I've renamed some functions, introduced a new UTF16Offset type, and then fixed things. This should achieve the goal in #2872 (comment) and #1578.

  • PositionToLineAndCharacterPositionToLineAndByteOffset
  • GetECMALineAndCharacterOfPositionGetECMALineAndUTF16CharacterOfPosition / GetECMALineAndByteOffsetOfPosition
  • GetECMAPositionOfLineAndCharacterGetECMAPositionOfLineAndUTF16Character / GetECMAPositionOfLineAndByteOffset
  • ComputePositionOfLineAndCharacter → removed (replaced by ComputePositionOfLineAndByteOffset)
  • ComputePositionOfLineAndCharacterExComputePositionOfLineAndUTF16Character
  • GetColumn() intGetColumn() core.UTF16Offset

This fixes a few baselines, and critically some of the sourcemap ones I expected to find were broken.

Copilot AI review requested due to automatic review settings February 23, 2026 19:58
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors and corrects position/column calculations across the compiler and tooling to disambiguate byte offsets vs UTF-16 code unit offsets (notably for source maps and diagnostic locations), and updates affected baselines accordingly.

Changes:

  • Introduces UTF-16-aware line/column APIs (and helpers) and updates call sites to use UTF-16 vs byte offsets appropriately.
  • Fixes source map position decoding/recording to convert UTF-16 “character” offsets back into byte positions.
  • Updates printer/formatter/diagnostic writer behavior and refreshes affected compiler baselines.

Reviewed changes

Copilot reviewed 35 out of 37 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
internal/core/core.go Renames PositionToLineAndCharacter to byte-offset naming and adds UTF16Len helper.
internal/scanner/scanner.go Splits ECMAScript line/column APIs into UTF-16 character vs byte-offset variants; adds UTF-16 → byte position conversion.
internal/sourcemap/source_mapper.go Updates source map mapping decode to use UTF-16 character offsets when computing positions.
internal/testutil/harnessutil/sourcemap_recorder.go Fixes source-map span text extraction by converting UTF-16 offsets to byte positions.
internal/printer/utilities.go Updates source-map line/character cache to count UTF-16 code units.
internal/printer/textwriter.go Changes writer column calculation to UTF-16 code units for generated source-map columns.
internal/diagnosticwriter/diagnosticwriter.go Updates diagnostic location formatting to use UTF-16 character offsets; adjusts line-start calculations to byte offsets.
internal/transformers/jsxtransforms/jsx.go Uses UTF-16 columns for jsxDEV __source metadata.
internal/format/span.go Switches formatting logic that needs byte offsets to use the byte-offset line/column API.
internal/format/indent.go Uses byte-offset APIs for indentation computations and refines non-whitespace scanning.
internal/compiler/emit_test.go Updates test comment to reference UTF-16-based position API.
internal/astnav/tokens_test.go Updates test to use renamed byte-offset API.
testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).symbols.diff Baseline diff removed/updated due to corrected UTF-16-based positions.
testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).symbols Updated symbol baseline positions.
testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).sourcemap.txt.diff Baseline diff updated due to corrected source map position mapping.
testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).sourcemap.txt Updated sourcemap baseline output.
testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).js.map.diff Baseline diff updated due to corrected generated mappings/columns.
testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).js.map Updated .js.map baseline content.
testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).errors.txt.diff Baseline diff updated due to corrected UTF-16-based diagnostic positions.
testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).errors.txt Updated diagnostic baseline positions.
testdata/baselines/reference/submodule/compiler/sourceMap-LineBreaks(target=es2015).sourcemap.txt.diff Baseline diff updated for corrected line-break/source map mapping.
testdata/baselines/reference/submodule/compiler/sourceMap-LineBreaks(target=es2015).sourcemap.txt Updated sourcemap baseline output.
testdata/baselines/reference/submodule/compiler/sourceMap-LineBreaks(target=es2015).js.map.diff Baseline diff removed/updated due to corrected source map output.
testdata/baselines/reference/submodule/compiler/sourceMap-LineBreaks(target=es2015).js.map Updated .js.map baseline content.
testdata/baselines/reference/submodule/compiler/extendedUnicodePlaneIdentifiersJSDoc.symbols.diff Baseline diff removed/updated due to corrected UTF-16-based positions.
testdata/baselines/reference/submodule/compiler/extendedUnicodePlaneIdentifiersJSDoc.symbols Updated symbol baseline positions.
testdata/baselines/reference/submodule/compiler/extendedUnicodePlaneIdentifiers.symbols.diff Baseline diff updated due to corrected UTF-16-based positions.
testdata/baselines/reference/submodule/compiler/extendedUnicodePlaneIdentifiers.symbols Updated symbol baseline positions.
testdata/baselines/reference/submodule/compiler/declarationEmitLateBoundAssignments2.symbols.diff Baseline diff updated due to corrected UTF-16-based positions.
testdata/baselines/reference/submodule/compiler/declarationEmitLateBoundAssignments2.symbols Updated symbol baseline positions.

Comment on lines +34 to +42
// GetColumn returns the column position measured in UTF-16 code units
// for source map compatibility.
func (w *textWriter) GetColumn() core.UTF16Offset {
if w.lineStart {
return w.indent * w.indentSize
return core.UTF16Offset(w.indent * w.indentSize)
}
return w.builder.Len() - w.linePos
// Count UTF-16 code units from the last line start.
// For ASCII-only output (the common case), this equals the byte count.
return core.UTF16Len(w.builder.String()[w.linePos:])
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetColumn() now calls w.builder.String() and then scans the entire current line to compute a UTF-16 column. During emit/source-map generation, GetColumn is called very frequently, so this can reintroduce O(n²) behavior on long lines and may also trigger costly builder-to-string conversions. A more efficient approach is to track the current UTF-16 column incrementally as text is written (resetting on newline, and accounting for indentation when lineStart is true), so GetColumn stays O(1) without calling builder.String().

Copilot uses AI. Check for mistakes.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.String() is free, so we're really just talking about how long it takes to calculate, and I think it's better to calculate the len on request rather than repeatedly len-ing on every single write.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by ".String() is free"? Doesn't it have to reallocate on any new content additions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's internally an "unsafe" direct access so long as you don't escape it

Comment on lines +239 to +246
fmt.Fprint(writer, strings.Repeat(" ", int(firstLineChar)))
fmt.Fprint(writer, strings.Repeat("~", lastCharForLine-int(firstLineChar)))
case lastLine:
// Squiggle until the final character.
fmt.Fprint(writer, strings.Repeat("~", lastLineChar))
fmt.Fprint(writer, strings.Repeat("~", int(lastLineChar)))
default:
// Squiggle the entire line.
fmt.Fprint(writer, strings.Repeat("~", len(lineContent)))
fmt.Fprint(writer, strings.Repeat("~", int(core.UTF16Len(lineContent))))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very odd, and arguably should actually be rune counts, since these are visible characters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you just holding off on this to avoid a massive baseline avalanche?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No matter what we do, things will be just a little weird, so I just didn't change it

Copy link
Member

@DanielRosenwasser DanielRosenwasser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just wanted to get some clarifications before approval.

Comment on lines +239 to +246
fmt.Fprint(writer, strings.Repeat(" ", int(firstLineChar)))
fmt.Fprint(writer, strings.Repeat("~", lastCharForLine-int(firstLineChar)))
case lastLine:
// Squiggle until the final character.
fmt.Fprint(writer, strings.Repeat("~", lastLineChar))
fmt.Fprint(writer, strings.Repeat("~", int(lastLineChar)))
default:
// Squiggle the entire line.
fmt.Fprint(writer, strings.Repeat("~", len(lineContent)))
fmt.Fprint(writer, strings.Repeat("~", int(core.UTF16Len(lineContent))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you just holding off on this to avoid a massive baseline avalanche?

// raw UTF-8 byte offset from the start of that line for the given byte position.
// Uses ECMAScript line separators (LF, CR, CRLF, LS, PS).
// Unlike GetECMALineAndUTF16CharacterOfPosition, the offset is in bytes, not UTF-16 code units.
func GetECMALineAndByteOffsetOfPosition(sourceFile ast.SourceFileLike, pos int) (line int, byteOffset int) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor - do you think From reads better here than Of?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm indifferent, I was just trying to not over-rename if someone goes looking


func GetIndentationForNode(n *ast.Node, ignoreActualIndentationRange *core.TextRange, sourceFile *ast.SourceFile, options *lsutil.FormatCodeSettings) int {
startline, startpos := scanner.GetECMALineAndCharacterOfPosition(sourceFile, scanner.GetTokenPosOfNode(n, sourceFile, false))
startline, startpos := scanner.GetECMALineAndByteOffsetOfPosition(sourceFile, scanner.GetTokenPosOfNode(n, sourceFile, false))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it just me, or is it surprising that the formatter uses ECMAScript-based lines instead of LSP-based lines?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree, and we could plumb it differently, but I'm not sure it actually matters for these.

Comment on lines +34 to +42
// GetColumn returns the column position measured in UTF-16 code units
// for source map compatibility.
func (w *textWriter) GetColumn() core.UTF16Offset {
if w.lineStart {
return w.indent * w.indentSize
return core.UTF16Offset(w.indent * w.indentSize)
}
return w.builder.Len() - w.linePos
// Count UTF-16 code units from the last line start.
// For ASCII-only output (the common case), this equals the byte count.
return core.UTF16Len(w.builder.String()[w.linePos:])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by ".String() is free"? Doesn't it have to reallocate on any new content additions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix GetLineAndCharacterOfPosition and audit all users

2 participants