Hopefully settle line endings / position mapping by jakebailey · Pull Request #2879 · microsoft/typescript-go

jakebailey · 2026-02-23T19:58:28Z

There's a lot of fuzting here, but in essensce, I've renamed some functions, introduced a new UTF16Offset type, and then fixed things. This should achieve the goal in #2872 (comment) and #1578.

PositionToLineAndCharacter → PositionToLineAndByteOffset
GetECMALineAndCharacterOfPosition → GetECMALineAndUTF16CharacterOfPosition / GetECMALineAndByteOffsetOfPosition
GetECMAPositionOfLineAndCharacter → GetECMAPositionOfLineAndUTF16Character / GetECMAPositionOfLineAndByteOffset
ComputePositionOfLineAndCharacter → removed (replaced by ComputePositionOfLineAndByteOffset)
ComputePositionOfLineAndCharacterEx → ComputePositionOfLineAndUTF16Character
GetColumn() int → GetColumn() core.UTF16Offset

This fixes a few baselines, and critically some of the sourcemap ones I expected to find were broken.

Copilot

Pull request overview

This PR refactors and corrects position/column calculations across the compiler and tooling to disambiguate byte offsets vs UTF-16 code unit offsets (notably for source maps and diagnostic locations), and updates affected baselines accordingly.

Changes:

Introduces UTF-16-aware line/column APIs (and helpers) and updates call sites to use UTF-16 vs byte offsets appropriately.
Fixes source map position decoding/recording to convert UTF-16 “character” offsets back into byte positions.
Updates printer/formatter/diagnostic writer behavior and refreshes affected compiler baselines.

Reviewed changes

Copilot reviewed 35 out of 37 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
internal/core/core.go	Renames `PositionToLineAndCharacter` to byte-offset naming and adds `UTF16Len` helper.
internal/scanner/scanner.go	Splits ECMAScript line/column APIs into UTF-16 character vs byte-offset variants; adds UTF-16 → byte position conversion.
internal/sourcemap/source_mapper.go	Updates source map mapping decode to use UTF-16 character offsets when computing positions.
internal/testutil/harnessutil/sourcemap_recorder.go	Fixes source-map span text extraction by converting UTF-16 offsets to byte positions.
internal/printer/utilities.go	Updates source-map line/character cache to count UTF-16 code units.
internal/printer/textwriter.go	Changes writer column calculation to UTF-16 code units for generated source-map columns.
internal/diagnosticwriter/diagnosticwriter.go	Updates diagnostic location formatting to use UTF-16 character offsets; adjusts line-start calculations to byte offsets.
internal/transformers/jsxtransforms/jsx.go	Uses UTF-16 columns for `jsxDEV` `__source` metadata.
internal/format/span.go	Switches formatting logic that needs byte offsets to use the byte-offset line/column API.
internal/format/indent.go	Uses byte-offset APIs for indentation computations and refines non-whitespace scanning.
internal/compiler/emit_test.go	Updates test comment to reference UTF-16-based position API.
internal/astnav/tokens_test.go	Updates test to use renamed byte-offset API.
testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).symbols.diff	Baseline diff removed/updated due to corrected UTF-16-based positions.
testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).symbols	Updated symbol baseline positions.
testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).sourcemap.txt.diff	Baseline diff updated due to corrected source map position mapping.
testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).sourcemap.txt	Updated sourcemap baseline output.
testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).js.map.diff	Baseline diff updated due to corrected generated mappings/columns.
testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).js.map	Updated `.js.map` baseline content.
testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).errors.txt.diff	Baseline diff updated due to corrected UTF-16-based diagnostic positions.
testdata/baselines/reference/submodule/compiler/unicodeEscapesInNames02(target=es2015).errors.txt	Updated diagnostic baseline positions.
testdata/baselines/reference/submodule/compiler/sourceMap-LineBreaks(target=es2015).sourcemap.txt.diff	Baseline diff updated for corrected line-break/source map mapping.
testdata/baselines/reference/submodule/compiler/sourceMap-LineBreaks(target=es2015).sourcemap.txt	Updated sourcemap baseline output.
testdata/baselines/reference/submodule/compiler/sourceMap-LineBreaks(target=es2015).js.map.diff	Baseline diff removed/updated due to corrected source map output.
testdata/baselines/reference/submodule/compiler/sourceMap-LineBreaks(target=es2015).js.map	Updated `.js.map` baseline content.
testdata/baselines/reference/submodule/compiler/extendedUnicodePlaneIdentifiersJSDoc.symbols.diff	Baseline diff removed/updated due to corrected UTF-16-based positions.
testdata/baselines/reference/submodule/compiler/extendedUnicodePlaneIdentifiersJSDoc.symbols	Updated symbol baseline positions.
testdata/baselines/reference/submodule/compiler/extendedUnicodePlaneIdentifiers.symbols.diff	Baseline diff updated due to corrected UTF-16-based positions.
testdata/baselines/reference/submodule/compiler/extendedUnicodePlaneIdentifiers.symbols	Updated symbol baseline positions.
testdata/baselines/reference/submodule/compiler/declarationEmitLateBoundAssignments2.symbols.diff	Baseline diff updated due to corrected UTF-16-based positions.
testdata/baselines/reference/submodule/compiler/declarationEmitLateBoundAssignments2.symbols	Updated symbol baseline positions.

internal/printer/textwriter.go

internal/scanner/scanner.go

Copilot · 2026-02-23T20:06:35Z

internal/printer/textwriter.go

+// GetColumn returns the column position measured in UTF-16 code units
+// for source map compatibility.
+func (w *textWriter) GetColumn() core.UTF16Offset {
 	if w.lineStart {
-		return w.indent * w.indentSize
+		return core.UTF16Offset(w.indent * w.indentSize)
 	}
-	return w.builder.Len() - w.linePos
+	// Count UTF-16 code units from the last line start.
+	// For ASCII-only output (the common case), this equals the byte count.
+	return core.UTF16Len(w.builder.String()[w.linePos:])


GetColumn() now calls w.builder.String() and then scans the entire current line to compute a UTF-16 column. During emit/source-map generation, GetColumn is called very frequently, so this can reintroduce O(n²) behavior on long lines and may also trigger costly builder-to-string conversions. A more efficient approach is to track the current UTF-16 column incrementally as text is written (resetting on newline, and accounting for indentation when lineStart is true), so GetColumn stays O(1) without calling builder.String().

.String() is free, so we're really just talking about how long it takes to calculate, and I think it's better to calculate the len on request rather than repeatedly len-ing on every single write.

What do you mean by ".String() is free"? Doesn't it have to reallocate on any new content additions?

No, it's internally an "unsafe" direct access so long as you don't escape it

internal/diagnosticwriter/diagnosticwriter.go

jakebailey · 2026-02-23T20:29:02Z

internal/diagnosticwriter/diagnosticwriter.go

+			fmt.Fprint(writer, strings.Repeat(" ", int(firstLineChar)))
+			fmt.Fprint(writer, strings.Repeat("~", lastCharForLine-int(firstLineChar)))
 		case lastLine:
 			// Squiggle until the final character.
-			fmt.Fprint(writer, strings.Repeat("~", lastLineChar))
+			fmt.Fprint(writer, strings.Repeat("~", int(lastLineChar)))
 		default:
 			// Squiggle the entire line.
-			fmt.Fprint(writer, strings.Repeat("~", len(lineContent)))
+			fmt.Fprint(writer, strings.Repeat("~", int(core.UTF16Len(lineContent))))


This is very odd, and arguably should actually be rune counts, since these are visible characters.

Are you just holding off on this to avoid a massive baseline avalanche?

No matter what we do, things will be just a little weird, so I just didn't change it

...ata/baselines/reference/submodule/compiler/sourceMap-LineBreaks(target=es2015).sourcemap.txt

DanielRosenwasser

Looks good, just wanted to get some clarifications before approval.

DanielRosenwasser · 2026-02-24T06:54:18Z

internal/diagnosticwriter/diagnosticwriter.go

+			fmt.Fprint(writer, strings.Repeat(" ", int(firstLineChar)))
+			fmt.Fprint(writer, strings.Repeat("~", lastCharForLine-int(firstLineChar)))
 		case lastLine:
 			// Squiggle until the final character.
-			fmt.Fprint(writer, strings.Repeat("~", lastLineChar))
+			fmt.Fprint(writer, strings.Repeat("~", int(lastLineChar)))
 		default:
 			// Squiggle the entire line.
-			fmt.Fprint(writer, strings.Repeat("~", len(lineContent)))
+			fmt.Fprint(writer, strings.Repeat("~", int(core.UTF16Len(lineContent))))


Are you just holding off on this to avoid a massive baseline avalanche?

DanielRosenwasser · 2026-02-24T06:55:36Z

internal/scanner/scanner.go

+// raw UTF-8 byte offset from the start of that line for the given byte position.
+// Uses ECMAScript line separators (LF, CR, CRLF, LS, PS).
+// Unlike GetECMALineAndUTF16CharacterOfPosition, the offset is in bytes, not UTF-16 code units.
+func GetECMALineAndByteOffsetOfPosition(sourceFile ast.SourceFileLike, pos int) (line int, byteOffset int) {


Minor - do you think From reads better here than Of?

I'm indifferent, I was just trying to not over-rename if someone goes looking

DanielRosenwasser · 2026-02-24T07:01:44Z

internal/format/indent.go


 func GetIndentationForNode(n *ast.Node, ignoreActualIndentationRange *core.TextRange, sourceFile *ast.SourceFile, options *lsutil.FormatCodeSettings) int {
-	startline, startpos := scanner.GetECMALineAndCharacterOfPosition(sourceFile, scanner.GetTokenPosOfNode(n, sourceFile, false))
+	startline, startpos := scanner.GetECMALineAndByteOffsetOfPosition(sourceFile, scanner.GetTokenPosOfNode(n, sourceFile, false))


Is it just me, or is it surprising that the formatter uses ECMAScript-based lines instead of LSP-based lines?

I totally agree, and we could plumb it differently, but I'm not sure it actually matters for these.

DanielRosenwasser · 2026-02-24T07:13:40Z

internal/printer/textwriter.go

+// GetColumn returns the column position measured in UTF-16 code units
+// for source map compatibility.
+func (w *textWriter) GetColumn() core.UTF16Offset {
 	if w.lineStart {
-		return w.indent * w.indentSize
+		return core.UTF16Offset(w.indent * w.indentSize)
 	}
-	return w.builder.Len() - w.linePos
+	// Count UTF-16 code units from the last line start.
+	// For ASCII-only output (the common case), this equals the byte count.
+	return core.UTF16Len(w.builder.String()[w.linePos:])


What do you mean by ".String() is free"? Doesn't it have to reallocate on any new content additions?

internal/testutil/harnessutil/sourcemap_recorder.go

jakebailey added 6 commits February 23, 2026 09:52

Starting

5124913

Fix formatting it seems

e095fbb

annoying copilot

110e328

more dedupe

a29d7c0

Rename functions

b9ee006

Introduce UTF16Offset

b15a5e3

Copilot AI review requested due to automatic review settings February 23, 2026 19:58

Copilot started reviewing on behalf of jakebailey February 23, 2026 20:00 View session

Speed up UTF16Len

6281436

Copilot AI reviewed Feb 23, 2026

View reviewed changes

jakebailey added 2 commits February 23, 2026 12:11

lint fix

8d80876

Copilot feedback

8c90fe7

jakebailey commented Feb 23, 2026

View reviewed changes

...ata/baselines/reference/submodule/compiler/sourceMap-LineBreaks(target=es2015).sourcemap.txt Show resolved Hide resolved

DanielRosenwasser reviewed Feb 24, 2026

View reviewed changes

Comments

Conversation

jakebailey commented Feb 23, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

DanielRosenwasser left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants