feat: Add "Convert to SQL" option for streaming supervisors#19547
feat: Add "Convert to SQL" option for streaming supervisors#19547beenhead wants to merge 4 commits into
Conversation
FrankChen021
left a comment
There was a problem hiding this comment.
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 2 |
| P2 | 4 |
| P3 | 0 |
| Total | 6 |
Reviewed 9 of 9 changed files.
This is an automated review by Codex GPT-5.5
| } | ||
|
|
||
| // Build EXTERN expression with proper escaping | ||
| const columnSchema = uniqueExternColumns.map(col => ({name: col, type: 'string'})); |
There was a problem hiding this comment.
[P1] EXTERN declares numeric inputs as strings
The generated EXTERN signature marks every raw column as string, including metric fields used by SUM, MIN, MAX, sketches, and typed dimensions. A common longSum or doubleSum supervisor will therefore generate numeric aggregations over VARCHAR inputs, which can fail planning or ingest with wrong types. Preserve metric and dimension native types as the existing spec converter does.
There was a problem hiding this comment.
Good catch. The EXTERN signature now declares each column with its native type instead of string. Dimensions map via dimensionTypeToSignatureType (long/float/double preserved, json → COMPLEX), and metric fieldNames map via metricFieldSignatureType, mirroring the ingestion-spec converter (longSum → LONG, doubleSum → DOUBLE, sketches read from a STRING input, etc.). A dimension's type wins when a name is both a dimension and a metric field, so SUM/MIN/MAX/sketches now aggregate over correctly-typed inputs. Added a test asserting the resulting signature types.
| sqlString = `INSERT INTO ${C(datasource)}\n${sqlString}`; | ||
|
|
||
| // Append PARTITIONED BY | ||
| sqlString += `\nPARTITIONED BY DAY`; |
There was a problem hiding this comment.
[P1] Supervisor granularity is ignored
The conversion always appends PARTITIONED BY DAY and never applies dataSchema.granularitySpec.queryGranularity. Supervisors commonly use HOUR segment granularity or non-none query granularity, so the generated batch SQL can create different segment layout and different rollup grouping from the streaming supervisor.
There was a problem hiding this comment.
Fixed in aaa3143. Both halves are now handled:
- Segment granularity →
PARTITIONED BY:instead of the hardcodedPARTITIONED BY DAY, we readdataSchema.granularitySpec.segmentGranularity— all → ALL TIME, otherwise uppercased (e.g. HOUR →PARTITIONED BY HOUR), falling back to DAY only when the spec omits it. - Query granularity → rollup grouping: the parsed timestamp is wrapped in
TIME_FLOOR(..., '<period>')based ongranularitySpec.queryGranularity (hour → PT1H, etc.; none/unset → no floor). The same floored expression is used in bothSELECT ... AS "__time" and GROUP BY, so the rollup buckets match the streaming supervisor.
So a segmentGranularity: HOUR / queryGranularity: hour supervisor now generates TIME_FLOOR(TIME_PARSE("timestamp", 'iso'), 'PT1H') AS "__time" … GROUP BY TIME_FLOOR(...) … PARTITIONED BY HOUR. Added tests covering the HOUR/hour case and the queryGranularity: none` (no TIME_FLOOR) case.
Note: this currently maps the named granularities; a custom { "type": "period", "period": "..." } query granularity falls through to no floor. Happy to extend it to pass the period through if that's a case you'd like covered.
| const columnSchema = uniqueExternColumns.map(col => ({name: col, type: 'string'})); | ||
| const columnSchemaJson = JSON.stringify(columnSchema); | ||
| const inputSourceJson = JSON.stringify(inputSource); | ||
| const inputFormatJson = JSON.stringify({type: inputFormatType}); |
There was a problem hiding this comment.
[P2] Input format details are discarded
The input format is reduced to only {type: fileType}, dropping the supervisor's ioConfig.inputFormat settings such as CSV columns/header handling, JSON flattenSpec, Kafka metadata formats, Avro/Protobuf schemas, and other parser options. Those supervisors convert to SQL that reads different columns or cannot parse the backfill files.
There was a problem hiding this comment.
Addressed. When the supervisor has an ioConfig.inputFormat, its settings (CSV columns/header, JSON flattenSpec, etc.) are now preserved and only type is overridden with the file type chosen in the dialog. When no inputFormat is present we still fall back to { type: fileType }. Added a test confirming flattenSpec is retained while the type is replaced.
| return ( | ||
| <SupervisorToSqlDialog | ||
| onConvert={(sql: string) => { | ||
| this.handleQueryStringChange(sql); |
There was a problem hiding this comment.
[P2] Conversion overwrites the active tab
Supervisor conversion writes the generated SQL into the current workbench tab, unlike the existing ingestion-spec conversion that opens a new tab. This can replace an unsaved query and also preserves the current tab's explicit engine/context, so an INSERT/EXTERN query can remain on sql-native or another unsuitable engine.
There was a problem hiding this comment.
Fixed to match the ingestion-spec flow. Conversion now opens a new workbench tab via handleNewTab(WorkbenchQuery.blank()…) titled Convert <datasource>, rather than calling handleQueryStringChange on the current tab. Using a blank query also means the engine/context are derived fresh for the INSERT/EXTERN instead of inheriting the active tab's, so an unsaved query is no longer replaced.
| } | ||
|
|
||
| function parsePastedSupervisor() { | ||
| if (!pastedSupervisor.trim()) { |
There was a problem hiding this comment.
[P2] Paste mode can submit stale specs
When the pasted JSON is empty, parsePastedSupervisor returns without clearing supervisorSpec. Switching from selected-supervisor mode to paste mode, or clearing a previously valid paste, can leave Generate SQL enabled and convert a hidden stale supervisor instead of the visible input.
There was a problem hiding this comment.
Fixed. parsePastedSupervisor now clears supervisorSpec (and the error) when the textarea is empty, and the paste effect always reparses — so switching from select to paste mode, or clearing a valid paste, drops the previously loaded spec and disables Generate SQL. The select effect likewise clears the spec when nothing is selected.
| const timeParseExpression = | ||
| timestampFormat === 'auto' | ||
| ? F('TIME_PARSE', C(timestampColumn)) | ||
| : F('TIME_PARSE', C(timestampColumn), SqlExpression.parse(`'${timestampFormat}'`)); |
There was a problem hiding this comment.
[P2] Timestamp formats are not SQL-escaped
Custom timestamp formats are interpolated directly into a quoted SQL literal. Valid Druid/Joda formats often contain single quotes, for example to quote a literal T, and those produce invalid SQL here. Build this argument with the query toolkit literal helper or otherwise escape single quotes.
There was a problem hiding this comment.
Fixed. Custom timestamp formats are now emitted through the query-toolkit L() literal helper, which doubles embedded single quotes — so a format like yyyy-MM-dd'T'HH:mm:ss produces valid SQL. (I also switched the EXTERN JSON arguments to L() for the same escaping guarantee.) Added a test covering the quoted-T case.
- Preserve native column types in the EXTERN signature so numeric metric and typed dimension fields are not declared as strings - Apply the supervisor's segment granularity to PARTITIONED BY and query granularity via TIME_FLOOR (also used in GROUP BY) - Preserve the supervisor's inputFormat settings, overriding only the type - Escape custom timestamp formats with the query-toolkit literal helper - Clear stale specs in the paste-mode dialog so Generate SQL can't submit a hidden supervisor - Open the converted query in a new workbench tab instead of overwriting the active tab - Add tests for the new behaviors and update snapshots Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
|
||
| import { C, F, L, SqlExpression, SqlQuery } from 'druid-query-toolkit'; | ||
|
|
||
| interface MetricSpec { |
There was a problem hiding this comment.
This interface already exists in src/druid-models/metric-spec/metric-spec.tsx it should be used here (and extended if needed)
There was a problem hiding this comment.
Resolved by deleting it entirely. Rather than reuse just the interface, the converter now rewrites the supervisor into an index_parallel spec and delegates to the existing convertSpecToSql, so there's no metric handling left in this file at all.
| ); | ||
| } | ||
|
|
||
| function metricSpecToSqlExpression(metricSpec: MetricSpec): SqlExpression | undefined { |
There was a problem hiding this comment.
This function is a metric spec utility and should live in src/druid-models/metric-spec/metric-spec.tsx
There was a problem hiding this comment.
Also gone — metricSpecToSqlExpression/extraArgs were duplicates of what convertSpecToSql already does. Delegating to it removed ~400 lines of parallel logic, so there's nothing left to relocate.
| } | ||
| } | ||
|
|
||
| export interface SupervisorSpec { |
There was a problem hiding this comment.
Can we reuse the IngestionSpec interface here?
There was a problem hiding this comment.
Done. convertSupervisorToSql now takes IngestionSpec (a kafka/kinesis supervisor already fits it), and the dialog uses IngestionSpec too. The bespoke SupervisorSpec interface is deleted. The only supervisor-specific work left is swapping the streaming ioConfig for a file inputSource/inputFormat before handing off to convertSpecToSql.
| color: #ff7373; | ||
| } | ||
|
|
||
| .bp4-form-group { |
There was a problem hiding this comment.
are these selectors doing anything? We use blueprint 5 not 4 so at the very least they should be .bp5-... really they should use .#{$bp-ns} (search for that usage in the code to see some examples) or possibly assign actual classes to these things
There was a problem hiding this comment.
Fixed — the scss now @imports ../../variables and uses .#{$bp-ns}-*. I also removed the raw bp4-select/bp4-fill markup in the TSX, replacing the native <select> with a Blueprint Button + Menu dropdown that matches the engine-selector styling.
| * limitations under the License. | ||
| */ | ||
|
|
||
| export * from './supervisor-to-sql-dialog'; No newline at end of file |
There was a problem hiding this comment.
In other components we avoid adding an index file with a single export
There was a problem hiding this comment.
Removed the folder index.ts; dialogs/index.ts now exports ./supervisor-to-sql-dialog/supervisor-to-sql-dialog directly, matching spec-dialog and the others.
| <div className={Classes.DIALOG_BODY}> | ||
| <p> | ||
| Convert a streaming supervisor specification to an MSQ (Multi-Stage Query) ingestion SQL | ||
| statement.{' '} |
There was a problem hiding this comment.
I feel like this text should make it very clear that the resulting SQL statement is a one shot thing and will not be streaming.
There was a problem hiding this comment.
Reworded the intro: it now states this "generates a one-time batch ingestion that reads the supplied files — it does not start a streaming ingestion and will not continuously ingest new data."
| fileType: 'json', | ||
| }); | ||
|
|
||
| expect(converted.queryString).toMatchSnapshot(); |
There was a problem hiding this comment.
I think it would be cleaner to use toMatchInlineSnapshot in these files
There was a problem hiding this comment.
Done — the full-output cases use toMatchInlineSnapshot and the external .snap file is deleted.
Reuse the existing ingestion-spec converter rather than reimplementing it:
- convertSupervisorToSql now rewrites the supervisor into an index_parallel
spec (file inputSource/inputFormat, default segment granularity, leading
dimension clustering) and delegates to convertSpecToSql, so column types,
granularity, timestamp parsing, and metric aggregation are all shared
- Reuse the IngestionSpec interface instead of a bespoke SupervisorSpec; drop
the duplicated MetricSpec interface and metric-to-SQL helpers (~400 lines)
- Convert the conversion tests to inline snapshots
- Dialog: use IngestionSpec, fix Blueprint 5 scss namespace (.#{$bp-ns}),
replace the native select with a Button + Menu dropdown matching the
console style, and clarify the SQL is a one-time batch (not streaming)
- Don't auto-select the first supervisor; the button shows "Select
supervisor" until one is chosen
- Remove the single-export index.ts barrel
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks @vogievetsky — addressed all of these in |
FrankChen021
left a comment
There was a problem hiding this comment.
Reviewed 7 of 7 changed files for this reply follow-up. The latest revision resolves the previously raised issues: supervisor conversion now delegates to the ingestion-spec converter for shared type, timestamp, granularity, input-format, and rollup behavior; workbench conversion opens a new tab; and paste/select mode no longer keeps stale specs. No further inline reply is useful.
This is an automated review by Codex GPT-5.5
Have you ever stood up a streaming supervisor, only to later wish you could replay that same data as a batch (SQL-based) ingestion? Well, be prepared to have your experience taken to the next level!
This PR adds a
Convert supervisor to SQLflow to the web console query view. It complements the existing Convert ingestion spec to SQL tool: where that one migrates native batch and Hadoop specs, this one takes a streaming supervisor (Kafka/Kinesis) and generates an equivalent Multi-Stage Query INSERT statement that reads from files instead of a stream — handy for backfills, reprocessing, and one-off migrations.Convert supervisor to SQL dialog
A new Convert supervisor to SQL item is available from the ... menu in the query view.
Clicking it opens a dialog that walks you through the conversion.
Pick your supervisor
You can either select an existing supervisor (the dialog fetches the list from
/druid/indexer/v1/supervisorand loads the spec on selection) or paste a supervisor JSON spec directly. Pasting is validated as you type, so malformed JSON surfaces an inline error rather than failing silently.Point it at your data
Because a streaming supervisor has no batch input source, the dialog asks where the equivalent files live. It pre-populates the file location from the supervisor's
ioConfig.inputSourcewhen one is present, and you can pick the file type (JSON, CSV, Parquet, or ORC). The location scheme is used to build the right input source:s3://… → an s3 input source (with an objectGlob added automatically for directory locations)
gs://… → a google input source
http://… / https://… → an http input source
anything else (including file://…) → a local input source
Generate the SQL
Clicking Generate SQL converts the spec and drops the resulting query into a new tab for you to review and edit before running. The conversion:
Tests
Added
supervisor-conversion.spec.tscovering the conversion helper: rollup vs. non-rollup queries, the full set of supported metric aggregations and their non-default arguments, dropping of unsupported metrics, input-source detection for each scheme, timestamp handling, partitioning/clustering, and the error paths.The existing
supervisor-to-sql-dialog.spec.tsxcovers basic rendering and the close action.