🦋 Changeset detected. Latest commit: 6fed8ca. The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
Code Review

Critical Issues
Important Issues
Minor Issues

Recommendation: Fix the SQL injection and feature flag logic before merge.
E2E Test Results: ✅ All tests passed • 46 passed • 3 skipped • 816s. Tests ran across 4 shards in parallel.
- exports[`renderChartConfig HAVING clause should render HAVING clause with SQL language 1`] = `"SELECT count(),severity FROM default.logs WHERE (timestamp >= fromUnixTimestamp64Milli(1739318400000) AND timestamp <= fromUnixTimestamp64Milli(1739491200000)) GROUP BY severity HAVING count(*) > 100"`;
+ exports[`renderChartConfig HAVING clause should render HAVING clause with SQL language 1`] = `"SELECT count(),severity FROM default.logs WHERE (timestamp >= fromUnixTimestamp64Milli(1739318400000) AND timestamp <= fromUnixTimestamp64Milli(1739491200000)) GROUP BY severity HAVING COUNT(*) > 100"`;

- exports[`renderChartConfig HAVING clause should render HAVING clause with granularity and groupBy 1`] = `"SELECT count(),event_type,toStartOfInterval(toDateTime(timestamp), INTERVAL 5 minute) AS \`__hdx_time_bucket\` FROM default.events WHERE (timestamp >= fromUnixTimestamp64Milli(1739318400000) AND timestamp <= fromUnixTimestamp64Milli(1739491200000)) GROUP BY event_type,toStartOfInterval(toDateTime(timestamp), INTERVAL 5 minute) AS \`__hdx_time_bucket\` HAVING count(*) > 50 ORDER BY toStartOfInterval(toDateTime(timestamp), INTERVAL 5 minute) AS \`__hdx_time_bucket\`"`;
+ exports[`renderChartConfig HAVING clause should render HAVING clause with granularity and groupBy 1`] = `"SELECT count(),event_type,toStartOfInterval(toDateTime(timestamp), INTERVAL 5 minute) AS \`__hdx_time_bucket\` FROM default.events WHERE (timestamp >= fromUnixTimestamp64Milli(1739318400000) AND timestamp <= fromUnixTimestamp64Milli(1739491200000)) GROUP BY event_type,toStartOfInterval(toDateTime(timestamp), INTERVAL 5 minute) AS \`__hdx_time_bucket\` HAVING COUNT(*) > 50 ORDER BY toStartOfInterval(toDateTime(timestamp), INTERVAL 5 minute) AS \`__hdx_time_bucket\`"`;
These cases never ran through the SQL parser previously, so the parser now just capitalizes a few things. Otherwise the tests didn't change.
expect(actual.toLowerCase()).toContain(
  'avg(response_time) > 500 and count(*) > 10',
);
const optimizeMapAccessWhere = ({
Builds the AST, traverses it extracting each ident, adds mapContains to the AST if the proper conditions are met, and builds it back into SQL.
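The pass described above can be sketched roughly like this. The AST node shapes, `MapIdent`, and the bracket-syntax recognizer below are hypothetical stand-ins for the PR's actual types; the real code parses and rebuilds SQL with a SQL parser rather than hand-built nodes.

```typescript
// Minimal stand-in for a SQL parser's expression AST (hypothetical shapes).
type AstNode =
  | { type: 'binary_expr'; operator: string; left: AstNode; right: AstNode }
  | { type: 'column_ref'; table: string | null; column: string }
  | { type: 'string'; value: string };

// A map access like ResourceAttributes['service.name'] split into parts.
interface MapIdent {
  map: string;
  key: string;
}

// Recognize map accesses written as map['key'] inside a column_ref's text.
function extractMapIdent(column: string): MapIdent | null {
  const m = column.match(/^(\w+)\['([^']+)'\]$/);
  return m ? { map: m[1], key: m[2] } : null;
}

// Traverse the WHERE AST, collecting every map-style column_ref.
function collectMapIdents(node: AstNode, out: MapIdent[]): void {
  if (node.type === 'binary_expr') {
    collectMapIdents(node.left, out);
    collectMapIdents(node.right, out);
  } else if (node.type === 'column_ref') {
    const ident = extractMapIdent(node.column);
    if (ident) out.push(ident);
  }
}

// Example WHERE: ResourceAttributes['service.name'] = 'api'
const where: AstNode = {
  type: 'binary_expr',
  operator: '=',
  left: {
    type: 'column_ref',
    table: null,
    column: "ResourceAttributes['service.name']",
  },
  right: { type: 'string', value: 'api' },
};
const found: MapIdent[] = [];
collectMapIdents(where, found);
// In the real pass, each found ident becomes a mapContains(map, 'key')
// AND-ed into the WHERE AST before the tree is serialized back to SQL.
```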
case 'column_ref': {
  const ident = extractIdent(node as ColumnRef);
  if (ident) {
    idents.push({ doesContain: true, ident });
Either a column or a map; we want both, since a materialized column could be a map-key optimization in disguise.
// replace materialized idents with map ident
for (const curIdent of idents) {
  if (curIdent.ident.type === 'column') {
    const materializedMapIdent = materializedColumnToMapIdent.get(
      curIdent.ident.name,
    );
    if (materializedMapIdent) {
      curIdent.ident = materializedMapIdent;
    }
  }
}
This allows us to add mapContains even if that map entry is materialized, so we can still use the map-key index.
I'm not sure the materialized field would benefit from this optimization, but good to know.
import { CustomSchemaSQLSerializerV2, SearchQueryBuilder } from '@/queryParser';
const MAP_CONTAINS_OPTIMIZATION_ENABLED =
  process.env.NEXT_PUBLIC_MAP_CONTAINS_OPTIMIZATION_ENABLED ||
style: better to follow the pattern (process.env.NEXT_PUBLIC_MAP_CONTAINS_OPTIMIZATION_ENABLED ?? 'false') === 'true', and move it to config.ts.
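A sketch of that suggestion (the helper name is illustrative): `||` treats any non-empty string as truthy, so even setting the variable to the string 'false' would enable the flag, whereas the explicit comparison enables it only for the exact string 'true'.

```typescript
// Hypothetical config.ts helper: only the exact string 'true' enables a flag.
// Missing/undefined env vars default safely to disabled.
const isFlagEnabled = (raw: string | undefined): boolean =>
  (raw ?? 'false') === 'true';

const MAP_CONTAINS_OPTIMIZATION_ENABLED = isFlagEnabled(
  process.env.NEXT_PUBLIC_MAP_CONTAINS_OPTIMIZATION_ENABLED,
);
```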
const maps = new Set(
  columns.filter(v => v.type.startsWith('Map')).map(v => v.name),
);
perf: We can move this out of the method, and we don't need to traverse the AST if the set is empty. Also, maps is a bit ambiguous; maybe mapFieldNames?
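A sketch of that hoisting suggestion (the Column shape and function names are illustrative; the actual rewrite step is elided):

```typescript
// Hypothetical column metadata shape.
interface Column {
  name: string;
  type: string; // ClickHouse type string, e.g. "Map(String, String)"
}

// Computed once per table schema instead of on every optimization call.
const getMapFieldNames = (columns: Column[]): Set<string> =>
  new Set(columns.filter(c => c.type.startsWith('Map')).map(c => c.name));

function optimizeWhere(sql: string, mapFieldNames: Set<string>): string {
  // Fast path: no Map columns means nothing to rewrite, so skip
  // parsing and traversing the AST entirely.
  if (mapFieldNames.size === 0) return sql;
  // ... parse, add mapContains guards, serialize back to SQL (elided) ...
  return sql;
}
```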
  return parser.sqlify(ast);
} catch {
  // ignore
Should we log errors during development for debugging purposes?
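One way to act on that suggestion, sketched below (the function and its `parseAndOptimize` parameter are hypothetical stand-ins for the PR's parse-and-sqlify path): log in development, but always fall back to the original SQL.

```typescript
// Sketch: surface parse failures during development instead of silently
// swallowing them, while never letting the optimization break the query.
function optimizeOrFallback(
  sql: string,
  parseAndOptimize: (sql: string) => string,
): string {
  try {
    return parseAndOptimize(sql);
  } catch (e) {
    if (process.env.NODE_ENV !== 'production') {
      // Dev-only diagnostics; stripped of noise in production.
      console.warn('map access optimization failed, using original SQL', e);
    }
    return sql; // fall back to the unoptimized query
  }
}
```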
// add map idents to AST
const addIdentToAst = (ident: SQLMapValueIdent, doesContain: boolean) => {
style: We can probably move this function out instead of manipulating the AST directly
const MAP_CONTAINS_OPTIMIZATION_ENABLED =
  process.env.NEXT_PUBLIC_MAP_CONTAINS_OPTIMIZATION_ENABLED ||
  process.env.NODE_ENV !== 'production';
I'd suggest removing this (in case somehow NODE_ENV is not set properly in prod) and add NEXT_PUBLIC_MAP_CONTAINS_OPTIMIZATION_ENABLED to your .env.local instead
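For example, a hypothetical developer-local entry (not committed) would look like:

```
# .env.local
NEXT_PUBLIC_MAP_CONTAINS_OPTIMIZATION_ENABLED=true
```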
|
Closing this. I don't think it'll be easy to cover all possible SQL cases; I'm going to just focus on modifying the Lucene queries for now.
Closes HDX-3070
Adding mapContains allows a bloom filter index to be used to skip searching a granule when a key for a given map is not present in that granule. In some testing I've done, it yielded 40% fewer rows scanned.
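The effect can be illustrated with a hedged ClickHouse sketch. The table, column, and index names are hypothetical; mapContains and bloom_filter skip indexes over mapKeys are standard ClickHouse features.

```sql
-- Hypothetical table with a bloom filter skip index over the map's keys.
CREATE TABLE default.logs_example
(
    Timestamp DateTime,
    LogAttributes Map(String, String),
    INDEX idx_attr_keys mapKeys(LogAttributes) TYPE bloom_filter(0.01) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY Timestamp;

-- With the mapContains guard, granules whose bloom filter proves the key
-- is absent can be skipped instead of being scanned for the value match.
SELECT count()
FROM default.logs_example
WHERE mapContains(LogAttributes, 'service.name')
  AND LogAttributes['service.name'] = 'api';
```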