## Fix SQL query validation: replace strictValidation with SqlValidation enum#1881
## Fix SQL query validation: replace strictValidation with SqlValidation enum#1881zaleslaw wants to merge 4 commits into
strictValidation with SqlValidation enum#1881Conversation
Refactored SQL query/table validation to replace the `strictValidation` parameter with a more flexible `SqlValidation` enum introducing `ReadOnly` and `None` modes. Enhanced validation to block multi-statement queries and DDL/DML keywords in `ReadOnly` mode while allowing comments and literals inside queries. Updated tests and documentation accordingly.
|
@zaleslaw please be a bit more brief in describing all changes in the PR description (like in "Test changes"). We can see the changes when we review the files, you know ;P |
| * [SqlValidation.ReadOnly] blocks DDL/DML; [SqlValidation.None] (default) skips validation. | ||
| * Table name validation is always strict regardless of this parameter. | ||
| * @param [configureStatement] optional lambda to configure the [PreparedStatement] before execution. | ||
| * This allows for custom tuning of fetch size, query timeout, and other JDBC parameters. |
There was a problem hiding this comment.
inconsistent aligning of param text. You use 3 styles in each KDoc:
@param [limit] |the maximum...
|`null` (def...
|or positive...
|@param [dbType] the type of...
|in that case the [dbType] w...
@param [configureStatement]| optional lambda
|This allows for c
In the rest of the library we mostly just use 2 spaces:
@param [dbType] the type of...
in that case the [dbType] w...
There was a problem hiding this comment.
I cannot leave comments on this file... Something may have gone wrong creating/saving it.
There was a problem hiding this comment.
I see, it is saved as CRLF (Windows) instead of LF (Linux). I'll see if I can fix it for now, but check your settings :)
There was a problem hiding this comment.
Anyways. I cannot see the descriptions of ReadOnly and None when selecting them in the IDE. Please add the KDoc to these enum entries itself.
|
I think the rest looks good :) |
Closes #1671
Problem
The existing SQL validation rejected valid queries that any database would happily accept:
Root cause: validation ran regex patterns against the raw SQL string, so a semicolon inside
'a;b'triggered the same rule asSELECT 1; DROP TABLE t. Comments were blocked entirely, preventing legitimate annotated queries.Additionally,
strictValidation: Boolean = truewas the wrong abstraction — a boolean gives no hint about what is being validated or why, and the default-on behaviour meant every CTE or function call with a semicolon failed immediately.Solution
1.
SqlValidationenum replacesstrictValidation: BooleanDefault changed to
SqlValidation.None.readSqlQueryis a developer API; the database is the authoritative SQL validator.ReadOnlyis opt-in for LLM-generated SQL or other less-trusted query sources.2. Tokenizer-based validation in
ReadOnlymodestripSqlNoise()does a character-by-character pass stripping single-quoted strings ('...'with''escape), double-quoted identifiers, line comments (--), and block comments (/* */) before any keyword matching. No external parser dependency.Three checks on the stripped SQL:
{SELECT, WITH, VALUES, TABLE, EXPLAIN};followed by non-whitespace (multiple statements)DROP,DELETE,INSERT,UPDATE,CREATE,ALTER,TRUNCATE,GRANT,REVOKE,MERGE,EXEC,EXECUTE) as a whole word3. Table name validation stays always strict
readSqlTablegenerates SQL from the table name (SELECT * FROM <name>). That makes it a different threat model from user-provided SQL. Thevalidationparameter was removed from allreadSqlTableoverloads — table names are always validated by regex (TABLE_NAME_VALID_PATTERN).hasForbiddenPatterns()removed entirely.Code changes
validationUtil.ktSqlValidationenum;stripSqlNoise()tokenizer; rewroteisValidSqlQuery()on top of tokenizer; simplifiedisValidTableName()to regex-only; removedFORBIDDEN_PATTERNS_REGEX/hasForbiddenPatterns()readJdbc.ktstrictValidation: Boolean = true→validation: SqlValidation = SqlValidation.None;readSqlTableoverloads: validation parameter removed, table name always validated; fixed privatereadTableAsDataFrameleftover callreadDataFrameSchema.ktstrictValidation = trueargumentsTest changes
readSqlQuery should reject malicious SQL queries→ split into:readSqlQuery with ReadOnly validation should block multi-statement and DDL queries— only queries that should actually be blockedreadSqlQuery with ReadOnly validation should allow comments and semicolons inside literals— explicitly verifies the false positives are gonereadSqlQuery with default SqlValidation None does not throw for any query— verifies default passes everything throughreadFromTable should work with non-standard table names when strictValidation is disabled→ replaced byreadSqlTable should always reject non-identifier table names(table validation is now unconditional)readSqlQuery should execute DROP TABLE when validation is disabled→ renamed to reflect thatSqlValidation.Noneis simply the default, not a special moderead from table with name from reserved SQL keywords— removed falseshouldThrowblock (double-quoted identifiers like"ALTER"are stripped before matching, soSELECT * FROM "ALTER"correctly passesReadOnlyvalidation)readSqlTable should reject pre-quoted table names and suggest unquoted formreadSqlTable should accept unquoted schema-qualified names— verifiesPUBLIC.Customerworks, quoting added byDbType.quoteIdentifier()"DROP TABLE users"and"users; DROP TABLE users"to the existing table name injection test (confirms regex protection is sufficient withouthasForbiddenPatterns)Tradeoffs and decisions
Noneas default, notReadOnlyreadSqlQueryis called by developers writing known SQL, not handling user input. Blocking their CTEs and comments by default is hostile. If someone is passing LLM-generated or user-controlled SQL, they can explicitly opt intoReadOnly. The database always provides the real last line of defence.Table name validation always strict, no bypass
The library builds SQL strings from the table name. Providing a bypass would make the library itself the source of injection. Users with non-standard names have
readSqlQueryas the explicit escape hatch. Pre-quoted names like"schema"."table"are rejected with an error message directing users to pass unquotedschema.table— the library adds DB-specific quoting viaDbType.quoteIdentifier()automatically.No SQL parser dependency
JSqlParser or similar would be more accurate but brings dialect coverage risk and a heavy Java dependency. The tokenizer approach is dialect-neutral: stripping literals and comments is the same in every SQL dialect, so the subsequent keyword checks are reliable enough for the
ReadOnlyguard use case.ReadOnlyallowlist is conservative{SELECT, WITH, VALUES, TABLE, EXPLAIN}— if a dialect has other read-only statement types not covered here, the user can useSqlValidation.Noneand rely on DB-level permissions instead.