fix(pretty-print): quote reserved-keyword and mixed-case identifiers across all emitters#755
Conversation
…across all emitters Several emitters printed identifiers as bare tokens instead of using the quote-aware helper, so names that require quoting (reserved keywords, multi-word, mixed-case) lost their qutes and produced invalid or semantically different SQL. Similar to (supabase-community#749), routed them through `emit_identifier_maybe_quoted` instead of using bare `TokenKind::IDENT` tokens, and added test cases covering one statement per fixed emitter.
|
thanks for the contribution! it looks good to me, but can you check why the AST comparison that is embedded into the formatter entry point didn't catch these cases? |
|
Hey, the |
|
Coverting to draft again, I ran a fuzzer on my corpus which found a few more emitters with the same bug |
|
No worries! Thanks for hardening the formatter, would love to stabilise and properly release it but my time is limited at the moment. Greatly appreciate the support! |
|
Think I found a bug where the original/formatted AST is the same, even though the meaning is not the same:
#!/usr/bin/env bash
set -euo pipefail
cd "$(dirname "$0")"
WORK="$(mktemp -d)"
trap 'rm -rf "$WORK"' EXIT
ORIG="$WORK/orig.sql"
FMT="$WORK/fmt.sql"
# The ORIGINAL input: case-sensitive identifiers, correctly quoted.
cat > "$ORIG" <<'SQL'
SET search_path = "MySchema";
SET ROLE "MyRole";
SET search_path = "MySchema", public;
SET xmloption = DOCUMENT;
SQL
cp "$ORIG" "$FMT"
cargo build -q -p pgls_cli
cargo build -q -p pgls_pretty_print --example dump_ast
CLI=./target/debug/postgres-language-server
DUMP() { cargo run -q -p pgls_pretty_print --example dump_ast -- "$1" "$2"; }
echo
echo "### Running CLI: format --write on the copy"
"$CLI" format --write "$FMT" || true
echo
echo "===================== INPUT (orig.sql) ====================="
cat "$ORIG"
echo "================== CLI OUTPUT (fmt.sql) ===================="
cat "$FMT"
echo
echo "############################################################"
echo "# 1) RAW ASTs -- orig vs CLI-formatted"
echo "# A non-empty diff here proves the CLI changed the AST."
echo "############################################################"
if diff -u <(DUMP "$ORIG" raw) <(DUMP "$FMT" raw); then
echo ">>> RAW ASTs are IDENTICAL (no bug)"
else
echo ">>> RAW ASTs DIFFER ==> the CLI silently changed the parsed meaning."
fi
echo
echo "############################################################"
echo "# 2) NORMALIZED ASTs -- orig vs CLI-formatted"
echo "# An EMPTY diff here is why the guard returned Ok and"
echo "# let the corrupting rewrite through."
echo "############################################################"
if diff -u <(DUMP "$ORIG" normalized) <(DUMP "$FMT" normalized); then
echo ">>> NORMALIZED ASTs are IDENTICAL ==> guard sees no difference ==> Ok."
else
echo ">>> NORMALIZED ASTs DIFFER (guard would have caught it)"
fiOutput: I think the issue is in NodeMut::VariableSetStmt(n) => {
// SET statement values are case-insensitive in PostgreSQL
// e.g., SET xmloption = DOCUMENT vs SET xmloption = document
for arg in &mut (*n).args {
if let Some(NodeEnum::AConst(aconst)) = arg.node.as_mut() {
aconst.location = 0;
if let Some(pgls_query::protobuf::a_const::Val::Sval(s)) =
aconst.val.as_mut()
{
s.sval = s.sval.to_lowercase(); // "MySchema" and "myschema" become myschema
}
}
}
}This bug is a bit different, so I wasn't sure if it fits in this branch/PR, @psteinroe how would you handle it? |
|
@JohannesKnopp thanks for the repro! a separate pr would be great for the normalisation fix. :) |
|
@psteinroe Sounds good, in that case this PR would be ready from my side! |
Several emitters printed identifiers as bare tokens instead of using the quote-aware helper, so names that require quoting (reserved keywords, multi-word, mixed-case) lost their qutes and produced invalid or semantically different SQL.
Similar to (#749), routed them through
emit_identifier_maybe_quotedinstead of using bareTokenKind::IDENTtokens, and added test cases covering one statement per fixed emitter.What kind of change does this PR introduce?
Bug fix
What is the current behavior?
Many emitters wrote a dynamic identifier (a user-supplied name read from the AST) straight into a raw
TokenKind::IDENTtoken, bypassing the quote aware helper, leading to the quotes always being dropped, which can produce invalid SQL in case of reserved keywords or multi-token identifiers, or change the semantics if the quoted identifier is case-sensitive (e.g. quoted schema names are case-sensitive).1.
parse-fail: reserved keyword emitted bare => invalid SQL2.
ast-mismatch: mixed-case name folded => different meaningFull
cargo testoutput against un-fixed codeCorpus sweep failures
What is the new behavior?
Routed emitters with dynamic identifiers through the
emit_identifier_maybe_quotedhelper.Additional context
I have a small
corpus_sweephelper, with which I compared the ASTs of the SQL files I downloaded against the produced SQL of the pretty-print. I wasn't sure whether to include it in the PR, but it allowed me to find the first instance of the bug, from which I looked for similar issues!