Commit 90bbbd1

matt-aitken and claude authored
fix(webapp): recover from ClickHouse JSON parse failures on out-of-range integers (#3759)
## Summary

Second class of poisoned-row failure in the runs replication path. PR #3708 plugged lone UTF-16 surrogates; this one handles bare JSON integer literals outside ClickHouse's `Int64`..`UInt64` range. Recovery stays purely reactive: the existing `sanitizeRows` walker just gains an extra branch, so the hot replication path pays nothing on healthy rows.

Fixes the still-firing customer-facing symptom from [TRI-9755](https://linear.app/triggerdotdev/issue/TRI-9755): `scan-social-profiles` runs continued to be stranded in `EXECUTING` on the Tasks page after #3708 deployed. CloudWatch showed `Dropped batch — ClickHouse JSON parse error but sanitizer found nothing to fix` firing **8/8 times** since the previous deploy (zero successful sanitizations).

Root cause: upstream JS Number precision loss on a 21-digit Google Plus ID (`117039831458782873093` → `117039831458782870000`). The precision-lossy value still serialises as a bare integer that exceeds `UInt64.MAX`, which ClickHouse rejects with `INCORRECT_DATA`.

## How the bug ships

The customer task emits an output containing a Poshmark profile's `spec_format`:

```json
{"key":"gp_id","proper_key":"Gp Id","value":117039831458782870000,"type":"int"}
```

That value is `1.17e20`, comfortably above `UInt64.MAX` (`1.84e19`) but comfortably below `1e21`. `Number.prototype.toString` only switches to exponential form at `|value| >= 1e21`, so `JSON.stringify` emits the bare token `117039831458782870000` and the ClickHouse `JSON(max_dynamic_paths)` column fails with:

```
Code: 117. DB::Exception: Cannot parse JSON object here: {…}: (while reading the value of key output): (at row 1) : While executing ParallelParsingBlockInputFormat. (INCORRECT_DATA) (version 25.12.x)
```

Same error verbatim as prod. The same number quoted (`"117039831458782870000"`) inserts fine, because ClickHouse's dynamic JSON column accepts a `String` subtype on the same path.
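The serialisation behaviour described above can be checked without ClickHouse. The following is a minimal sketch that only exercises `JSON.stringify`; the variable names are illustrative and not taken from the repository.

```typescript
// The value from the commit message, after upstream float64 rounding.
const gpId = 117039831458782870000;

// Integer-valued and below 1e21, so JSON.stringify emits a bare integer token.
const bare: string = JSON.stringify({ value: gpId });

// At or above 1e21 the serialiser switches to exponent form instead.
const exponent: string = JSON.stringify({ value: 1e21 });

// Converting to a string first yields a quoted JSON string token.
const quoted: string = JSON.stringify({ value: String(gpId) });

console.log(bare);     // {"value":117039831458782870000}
console.log(exponent); // {"value":1e+21}
console.log(quoted);   // {"value":"117039831458782870000"}
```

Only the first form is the one the commit message reports ClickHouse rejecting; the other two are the shapes it is reported to accept.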
## What changed

`apps/webapp/app/v3/eventRepository/sanitizeRowsOnParseError.server.ts`:

- New private `isUnsafeJsonInteger(value)` helper: true iff `value` is a finite integer-valued JS Number where `|value| < 1e21` (so `JSON.stringify` emits integer form, not exponent) **and** `value` falls outside `[Int64.MIN, UInt64.MAX]`.
- `sanitizeUnknownInPlace` gains a number-branch: when the predicate holds, replace the Number with `String(value)`. The downstream JSON column dynamic-types the path as String for that row; that is fine, since the value was already precision-lossy upstream (no JS Number above 2^53 is numerically meaningful anyway).
- Float-valued numbers, large floats (>= 1e21), NaN and Infinity are left alone: `JSON.stringify` emits them with exponents or as `null`, both of which ClickHouse accepts.

`apps/webapp/test/sanitizeRowsOnParseError.test.ts`: four new unit tests + an extension to `sanitizeRows` covering surrogate + integer fixes counted together across rows. The unit suite now covers:

- Positive value above `UInt64.MAX` (`117039831458782870000`, the actual prod value)
- Negative value below `Int64.MIN`
- Boundary values pass through (`42`, `Number.MAX_SAFE_INTEGER`, `2^63`)
- Non-integer numbers untouched (floats, `1e25`, NaN, Infinity)
- The actual `scan-social-profiles` nested shape: finds the offending `gp_id` deep inside `output.data.profiles[].spec_format[].platform_variables[].value`

`.server-changes/runs-replication-bigint-recovery.md`: release notes entry.

## Why reactive, not pre-flight

`#prepareJson` runs millions of times per day on the replication hot path. Walking every JSON tree to look for oversized integers would add bounded-but-real CPU on every healthy row. `sanitizeRows` only fires after a ClickHouse parse-error rejection, which is a few times a day platform-wide. Extending it costs effectively zero on healthy traffic and gains us recovery on the rare poisoned row.
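The reactive shape argued for above can be sketched as a try, sanitize, retry wrapper. This is only an illustration of the control flow, not the replication service's code: `insertWithRecovery`, `isJsonParseError`, `fakeInsert` and `fakeSanitize` are hypothetical names, the error is assumed to be recognisable from its message, and everything is synchronous for brevity where a real insert would be asynchronous.

```typescript
type Row = Record<string, unknown>;
type Outcome = "inserted" | "recovered" | "dropped";

// Assumption: a parse rejection can be recognised from the error message.
function isJsonParseError(err: unknown): boolean {
  return err instanceof Error && err.message.includes("INCORRECT_DATA");
}

// Hypothetical wrapper: the sanitizer only runs after a rejected insert.
function insertWithRecovery(
  rows: Row[],
  insert: (rows: Row[]) => void,
  sanitize: (rows: Row[]) => number
): Outcome {
  try {
    insert(rows); // healthy path: no tree walk at all
    return "inserted";
  } catch (err) {
    if (!isJsonParseError(err)) throw err;
    // Nothing to fix means a retry would fail the same way.
    if (sanitize(rows) === 0) return "dropped";
    insert(rows); // single retry with the sanitized rows
    return "recovered";
  }
}

// Stand-ins for the database and the sanitizer, for demonstration only.
let attempts = 0;
const fakeInsert = (rows: Row[]): void => {
  attempts++;
  for (const row of rows) {
    const v = row.value;
    if (typeof v === "number" && v > 2 ** 64) {
      throw new Error("Cannot parse JSON object here (INCORRECT_DATA)");
    }
  }
};
const fakeSanitize = (rows: Row[]): number => {
  let fixed = 0;
  for (const row of rows) {
    const v = row.value;
    if (typeof v === "number" && v > 2 ** 64) {
      row.value = String(v);
      fixed++;
    }
  }
  return fixed;
};

const healthy = insertWithRecovery([{ value: 42 }], fakeInsert, fakeSanitize);
const poisoned = insertWithRecovery(
  [{ value: 117039831458782870000 }],
  fakeInsert,
  fakeSanitize
);
console.log(healthy, poisoned, attempts); // inserted recovered 3
```

The healthy batch costs one insert and no sanitizer call; only the poisoned batch pays for the walk and the second insert.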
## Verification

- Reproduced 1:1 in a throwaway Docker `clickhouse/clickhouse-server:25.12.11.4` (closest available to the prod `25.12.1.1579` build). Pre-sanitize JSON fails with the exact prod error; post-sanitize JSON inserts cleanly and the row is readable with `gp_id` stored as a String subtype.
- `pnpm --filter webapp exec vitest run test/sanitizeRowsOnParseError.test.ts`: 22/22 passing (18 existing + 4 new).
- `pnpm run typecheck --filter webapp`: clean.

## Test plan

- [x] `pnpm run typecheck --filter webapp`
- [x] Unit tests pass against new + existing cases
- [x] End-to-end Docker ClickHouse repro confirms recovery
- [ ] Post-deploy: confirm `Sanitizing batch after ClickHouse JSON parse error` warns fire instead of `Dropped batch …` errors when `scan-social-profiles` outputs trip CH again
- [ ] Post-deploy: confirm `permanentlyDroppedBatches` counter stops climbing in `/stp/trigger-app-prod/ecs/replication/service-container/process-logs`

## What this does NOT do

- Doesn't backfill the ~120k+ existing stranded `EXECUTING` rows in production. Same as #3708: that needs a reconciliation/backfill sweep (separate ticket: TRI-9755 fix #3).
- Doesn't address the upstream root cause (the customer task emitting a JS-Number-precision-lossy big int). That's a customer-task concern; our replication path needs to be robust to whatever shape arrives.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 5ba1b32 commit 90bbbd1

3 files changed

Lines changed: 197 additions & 0 deletions

.server-changes/runs-replication-bigint-recovery.md

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+---
+area: webapp
+type: fix
+---
+
+Extend the runs-replication sanitizer (`sanitizeUnknownInPlace`) to detect
+JS Numbers that JSON-serialise as bare integer tokens outside the
+Int64..UInt64 range and replace them with their string form, so a
+following retry insert no longer trips ClickHouse's
+`INCORRECT_DATA` parser failure on `JSON(max_dynamic_paths)` columns.
+
+This is the second class of poisoned-row failure that was stranding
+`scan-social-profiles` runs in `EXECUTING` on the Tasks page even after
+the UTF-16 surrogate fix (#3708 / TRI-9755). Root cause: upstream JS
+Number precision loss on a 21-digit Google Plus ID
+(`117039831458782873093` → `117039831458782870000`) — the precision-lossy
+value still serialises as a bare integer that exceeds UInt64.MAX,
+which CH's JSON column rejects with `Cannot parse JSON object here`.
+
+Recovery stays purely reactive (no extra cost on the hot replication
+path); the sanitizer only runs after a ClickHouse parse-error rejection.

apps/webapp/app/v3/eventRepository/sanitizeRowsOnParseError.server.ts

Lines changed: 46 additions & 0 deletions
@@ -7,6 +7,48 @@ import { detectBadJsonStrings } from "~/utils/detectBadJsonStrings";
  */
 export const INVALID_UTF16_SENTINEL = "[invalid-utf16]";

+/**
+ * ClickHouse's `JSON(max_dynamic_paths)` column fits each bare-integer
+ * JSON token into Int64 (signed) or UInt64 (unsigned). Bare integers
+ * outside `[-2^63, 2^64 - 1]` are rejected with `INCORRECT_DATA` (no
+ * silent fallback to Float64). `JSON.stringify` emits any integer-valued
+ * Number with `|value| < 1e21` as a bare integer (no exponent), so any
+ * JS Number above ~9.2e18 that *happens* to be integer-valued lands on
+ * the wire as a token CH cannot accept.
+ *
+ * The fix: replace such Numbers with their string form. CH's dynamic
+ * JSON column accepts a `String` subtype on the same path, so the row
+ * inserts cleanly on retry. The numeric value was already
+ * precision-lossy upstream (JS Number can't represent integers above
+ * 2^53 faithfully), so type-flipping to string is information-preserving
+ * relative to what arrived.
+ *
+ * Float-valued numbers (including very large ones like `1e25`) serialise
+ * with an exponent and are accepted by CH at any magnitude, so they're
+ * left alone.
+ */
+const UINT64_MAX = 18446744073709551615n;
+const INT64_MIN = -9223372036854775808n;
+
+function isUnsafeJsonInteger(value: number): boolean {
+  if (!Number.isFinite(value)) return false;
+  if (!Number.isInteger(value)) return false;
+  // JSON.stringify emits integer-valued Numbers as bare integer tokens
+  // (no exponent) only while `|value| < 1e21`; at or above that
+  // threshold `Number.prototype.toString` switches to exponential form,
+  // which CH accepts as Float64 at any magnitude. So the dangerous band
+  // is strictly between the Int64/UInt64 boundary and 1e21.
+  if (Math.abs(value) >= 1e21) return false;
+  // Compare via BigInt for exactness. The Number literal 18446744073709551615
+  // is rounded to 2**64 in float64 (the float spacing near 2^64 is 2048), so a
+  // direct `value > 18446744073709551615` would miss a Number whose float64
+  // value is exactly 2**64 — `JSON.stringify` of that emits
+  // "18446744073709552000", which exceeds UInt64.MAX and ClickHouse rejects.
+  // `BigInt(value)` is safe here because we already gated on Number.isInteger.
+  const asBigInt = BigInt(value);
+  return asBigInt > UINT64_MAX || asBigInt < INT64_MIN;
+}
+
 export type SanitizeResult = {
   /** How many rows had at least one string field replaced. */
   rowsTouched: number;
@@ -62,6 +104,10 @@ export function sanitizeUnknownInPlace(value: unknown): { value: unknown; fixed:
     return { value, fixed: 0 };
   }

+  if (typeof value === "number" && isUnsafeJsonInteger(value)) {
+    return { value: String(value), fixed: 1 };
+  }
+
   if (Array.isArray(value)) {
     let fixed = 0;
     for (let i = 0; i < value.length; i++) {
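The comment in `isUnsafeJsonInteger` explains why the range check goes through BigInt. A small sketch, assuming nothing beyond standard JavaScript number semantics, shows the naive comparison and the exact one disagreeing at `2 ** 64`. The variable names are illustrative, and the bound is built with `BigInt("...")` rather than a literal purely so the snippet compiles under older TypeScript targets.

```typescript
// Built from a string so no precision is lost before BigInt sees the digits.
const UINT64_MAX = BigInt("18446744073709551615");

const twoPow64: number = 2 ** 64;

// The Number literal below is not representable in float64 and rounds to 2^64,
// so this comparison is effectively `2^64 > 2^64`.
const naiveSaysUnsafe: boolean = twoPow64 > 18446744073709551615;

// BigInt(twoPow64) is exactly 18446744073709551616, one past UInt64.MAX.
const exactSaysUnsafe: boolean = BigInt(twoPow64) > UINT64_MAX;

// What JSON.stringify would put on the wire for this Number.
const wireToken: string = JSON.stringify(twoPow64);

console.log(naiveSaysUnsafe); // false
console.log(exactSaysUnsafe); // true
console.log(wireToken);       // 18446744073709552000
```

The wire token is larger than `UInt64.MAX`, so the naive check would let through a value that the exact check flags.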

apps/webapp/test/sanitizeRowsOnParseError.test.ts

Lines changed: 130 additions & 0 deletions
@@ -105,6 +105,104 @@ describe("sanitizeUnknownInPlace", () => {
     expect(sanitizeUnknownInPlace(null)).toEqual({ value: null, fixed: 0 });
     expect(sanitizeUnknownInPlace(undefined)).toEqual({ value: undefined, fixed: 0 });
   });
+
+  // ─── Out-of-range integers (TRI-9755) ──────────────────────────────────────
+  // ClickHouse's JSON(max_dynamic_paths) column rejects bare integer tokens
+  // outside [Int64.MIN, UInt64.MAX]. Such Numbers serialise as bare integer
+  // form via JSON.stringify (no exponent, since |value| < 1e21) so they reach
+  // ClickHouse as unquoted oversized ints. Sanitizer replaces them with the
+  // string form, which ClickHouse's dynamic JSON column accepts as a String
+  // subtype on that path.
+
+  it("replaces an integer-valued Number above UInt64.MAX with its string form", () => {
+    // 117039831458782870000 is the actual prod value (Google Plus ID after
+    // upstream JS-Number precision loss from 117039831458782873093).
+    const result = sanitizeUnknownInPlace(117039831458782870000);
+    expect(result.value).toBe("117039831458782870000");
+    expect(result.fixed).toBe(1);
+  });
+
+  it("catches the float64 boundary at exactly 2**64 (UInt64.MAX + 1)", () => {
+    // float64 cannot represent UInt64.MAX (2^64 - 1) exactly — the literal
+    // 18446744073709551615 in JS source rounds to 2^64. JSON.stringify
+    // emits this Number as "18446744073709552000", which exceeds UInt64.MAX
+    // and trips ClickHouse. Regression for the BigInt-based comparison;
+    // a naïve `value > 18446744073709551615` would let this pass.
+    const result = sanitizeUnknownInPlace(2 ** 64);
+    expect(result.value).toBe("18446744073709552000");
+    expect(result.fixed).toBe(1);
+  });
+
+  it("replaces an integer-valued Number below Int64.MIN with its string form", () => {
+    // -9223372036854775809 is the first failing negative; in float64 it
+    // rounds to the same representation as Int64.MIN (-9223372036854775808),
+    // but for completeness we check a clearly-out-of-range negative.
+    const result = sanitizeUnknownInPlace(-1e20);
+    expect(result.value).toBe("-100000000000000000000");
+    expect(result.fixed).toBe(1);
+  });
+
+  it("leaves safe integers and boundary values untouched", () => {
+    // 42 — safe integer
+    expect(sanitizeUnknownInPlace(42)).toEqual({ value: 42, fixed: 0 });
+    // Number.MAX_SAFE_INTEGER (2^53 - 1) — JSON.stringify still emits as integer
+    expect(sanitizeUnknownInPlace(Number.MAX_SAFE_INTEGER)).toEqual({
+      value: Number.MAX_SAFE_INTEGER,
+      fixed: 0,
+    });
+    // 2^63 (Int64.MAX + 1) — still fits in UInt64, CH accepts it
+    expect(sanitizeUnknownInPlace(2 ** 63)).toEqual({ value: 2 ** 63, fixed: 0 });
+  });
+
+  it("leaves non-integer numbers untouched (floats, NaN, Infinity)", () => {
+    // Numbers with a fractional part — emitted with `.` in JSON
+    expect(sanitizeUnknownInPlace(3.14)).toEqual({ value: 3.14, fixed: 0 });
+    // Very large float-form (>= 1e21) — JSON.stringify uses exponent form,
+    // CH parses as Float64 successfully
+    expect(sanitizeUnknownInPlace(1e25)).toEqual({ value: 1e25, fixed: 0 });
+    // NaN / Infinity — JSON.stringify emits `null`, so harmless on the wire
+    expect(sanitizeUnknownInPlace(Number.NaN)).toEqual({ value: Number.NaN, fixed: 0 });
+    expect(sanitizeUnknownInPlace(Number.POSITIVE_INFINITY)).toEqual({
+      value: Number.POSITIVE_INFINITY,
+      fixed: 0,
+    });
+  });
+
+  it("finds an oversized integer nested deep inside the actual scan-social-profiles shape", () => {
+    const row = {
+      output: {
+        data: {
+          profiles: [
+            { module: "linktree", query: "x@example.com" },
+            {
+              module: "poshmark",
+              spec_format: [
+                {
+                  platform_variables: [
+                    {
+                      key: "gp_id",
+                      proper_key: "Gp Id",
+                      // The actual prod value — bare JSON integer > UInt64.MAX
+                      value: 117039831458782870000,
+                      type: "int",
+                    },
+                  ],
+                },
+              ],
+            },
+          ],
+        },
+      },
+    };
+    const result = sanitizeUnknownInPlace(row);
+    expect(result.fixed).toBe(1);
+    expect(
+      (row.output.data.profiles[1].spec_format![0].platform_variables[0] as any).value
+    ).toBe("117039831458782870000");
+    // Untouched neighbours
+    expect(row.output.data.profiles[0].module).toBe("linktree");
+    expect(row.output.data.profiles[1].spec_format![0].platform_variables[0].type).toBe("int");
+  });
 });

 describe("sanitizeRows", () => {
@@ -158,4 +256,36 @@ describe("sanitizeRows", () => {
     expect(result.rowsTouched).toBe(1);
     expect(result.fieldsSanitized).toBe(2);
   });
+
+  it("counts surrogate fixes and out-of-range integer fixes together (TRI-9755)", () => {
+    const rows = [
+      {
+        id: "r0",
+        attributes: {
+          surrogate: `bad ${HIGH_SURROGATE}`,
+          bigint: 117039831458782870000,
+          clean: "fine",
+          safe: 42,
+        },
+      },
+      {
+        id: "r1",
+        attributes: {
+          bigint: -1e20,
+          clean: "still fine",
+        },
+      },
+      {
+        id: "r2",
+        attributes: { clean: "no fixes needed" },
+      },
+    ];
+    const result = sanitizeRows(rows);
+    expect(result.rowsTouched).toBe(2);
+    expect(result.fieldsSanitized).toBe(3);
+    expect(rows[0].attributes.surrogate).toBe(INVALID_UTF16_SENTINEL);
+    expect(rows[0].attributes.bigint).toBe("117039831458782870000");
+    expect(rows[0].attributes.safe).toBe(42);
+    expect(rows[1].attributes.bigint).toBe("-100000000000000000000");
+  });
 });
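The nested-shape test relies on the walker recursing through arrays and objects, mutating in place, and summing a `fixed` count. The full body of `sanitizeUnknownInPlace` is not shown in this diff, so the following is a simplified stand-in rather than the repository's implementation: `walkInPlace` and `outOfRangeInteger` are hypothetical names, and the surrogate handling is omitted.

```typescript
type WalkResult = { value: unknown; fixed: number };

// Hypothetical stand-in for the predicate added in this commit.
function outOfRangeInteger(n: number): boolean {
  // Number.isInteger is false for NaN, Infinity and fractional values.
  if (!Number.isInteger(n) || Math.abs(n) >= 1e21) return false;
  const exact = BigInt(n);
  return (
    exact > BigInt("18446744073709551615") ||
    exact < BigInt("-9223372036854775808")
  );
}

// Recursive walk: leaves are replaced via the returned value, containers
// are mutated in place, and fix counts bubble up to the caller.
function walkInPlace(value: unknown): WalkResult {
  if (typeof value === "number") {
    return outOfRangeInteger(value)
      ? { value: String(value), fixed: 1 }
      : { value, fixed: 0 };
  }
  if (Array.isArray(value)) {
    let fixed = 0;
    for (let i = 0; i < value.length; i++) {
      const child = walkInPlace(value[i]);
      value[i] = child.value;
      fixed += child.fixed;
    }
    return { value, fixed };
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    let fixed = 0;
    for (const key of Object.keys(obj)) {
      const child = walkInPlace(obj[key]);
      obj[key] = child.value;
      fixed += child.fixed;
    }
    return { value, fixed };
  }
  return { value, fixed: 0 };
}

const ids: unknown[] = [42, 117039831458782870000, -1e20];
const row = { output: { ids, note: "ok" } };
const walked = walkInPlace(row);

console.log(walked.fixed); // 2
console.log(ids); // [ 42, '117039831458782870000', '-100000000000000000000' ]
```

Returning the replacement from the leaf and assigning it in the parent is what lets a primitive be swapped for a string without the leaf needing a reference to its container.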
