diff --git a/dev-guide/src/grammar.md b/dev-guide/src/grammar.md
index b1979df673..2f8d41f822 100644
--- a/dev-guide/src/grammar.md
+++ b/dev-guide/src/grammar.md
@@ -35,7 +35,9 @@ Name -> +

 Expression -> Sequence (` `* `|` ` `* Sequence)*

-Sequence -> (` `* AdornedExpr)+
+Sequence ->
+      (` `* AdornedExpr)* ` `* Cut
+    | (` `* AdornedExpr)+

 AdornedExpr -> ExprRepeat Suffix? Footnote?

@@ -92,6 +94,8 @@ Prose -> `<` ~[`>` LF]+ `>`
 Group -> `(` ` `* Expression ` `* `)`

 NegativeExpression -> `~` ( Charset | Terminal | NonTerminal )
+
+Cut -> `^` Sequence
 ```

 The general format is a series of productions separated by blank lines. The expressions are as follows:
@@ -110,6 +114,7 @@ The general format is a series of productions separated by blank lines. The expr
 | Prose | \ | An English description of what should be matched, surrounded in angle brackets. |
 | Group | (\`,\` Parameter)+ | Groups an expression for the purpose of precedence, such as applying a repetition operator to a sequence of other expressions. |
 | NegativeExpression | ~\[\` \` LF\] | Matches anything except the given Charset, Terminal, or Nonterminal. |
+| Cut | Expr1 ^ Expr2 \| Expr3 | The hard cut operator. Once the expressions preceding `^` in the sequence match, the rest of the sequence must match or parsing fails unconditionally --- no enclosing expression can backtrack past the cut point. |
 | Sequence | \`fn\` Name Parameters | A sequence of expressions that must match in order. |
 | Alternation | Expr1 \| Expr2 | Matches only one of the given expressions, separated by the vertical pipe character. |
 | Suffix | \_except \[LazyBooleanExpression\]\_ | Adds a suffix to the previous expression to provide an additional English description, rendered in subscript. This can contain limited Markdown, but try to avoid anything except basics like links.
|
diff --git a/src/notation.md b/src/notation.md
index e88679220c..cda298a734 100644
--- a/src/notation.md
+++ b/src/notation.md
@@ -24,6 +24,7 @@ The following notations are used by the *Lexer* and *Syntax* grammar snippets:
 | ~\[ ] | ~\[`b` `B`] | Any characters, except those listed |
 | ~`string` | ~`\n`, ~`*/` | Any characters, except this sequence |
 | ( ) | (`,` _Parameter_)? | Groups items |
+| ^ | `b'` ^ ASCII_FOR_CHAR | The rest of the sequence must match or parsing fails unconditionally ([hard cut operator]) |
 | U+xxxx | U+0060 | A single unicode character |
 | \ | \ | An English description of what should be matched |
 | Rule suffix | IDENTIFIER_OR_KEYWORD _except `crate`_ | A modification to the previous rule |
@@ -31,6 +32,15 @@ The following notations are used by the *Lexer* and *Syntax* grammar snippets:

 Sequences have a higher precedence than `|` alternation.

+r[notation.grammar.cut]
+### The hard cut operator
+
+The grammar uses ordered alternation: the parser tries alternatives left to right and takes the first that matches. If an alternative fails partway through a sequence, the parser normally backtracks and tries the next alternative. The cut operator (`^`) prevents this. Once every expression to the left of `^` in a sequence has matched, the rest of the sequence must match or parsing fails unconditionally.
+
+Mizushima et al. introduced [cut operators][cut operator paper] to parsing expression grammars. In the PEG literature, a *soft cut* prevents backtracking only within the immediately enclosing ordered choice --- outer choices can still recover. A *hard cut* prevents all backtracking past the cut point; failure is definitive. The `^` used in this grammar is a hard cut.
+
+The hard cut operator is necessary because some tokens in Rust begin with a prefix that is itself a valid token. For example, `c"` begins a C string literal, but `c` alone is a valid identifier.
Without the cut, if `c"\0"` failed to lex as a C string literal (because null bytes are not allowed in C strings), the parser could backtrack and lex it as two tokens: the identifier `c` and the string literal `"\0"`. The [cut after `c"`] prevents this --- once the opening delimiter is recognized, the parser cannot go back. The same reasoning applies to [byte literals], [byte string literals], [raw string literals], and other literals with prefixes that are themselves valid tokens.
+
 r[notation.grammar.string-tables]
 ### String table productions

@@ -52,7 +62,13 @@ r[notation.grammar.visualizations]
 Below each grammar block is a button to toggle the display of a [syntax diagram]. A square element is a non-terminal rule, and a rounded rectangle is a terminal.

 [binary operators]: expressions/operator-expr.md#arithmetic-and-logical-binary-operators
+[byte literals]: tokens.md#r-lex.token.byte.syntax
+[byte string literals]: tokens.md#r-lex.token.str-byte.syntax
+[cut after `c"`]: tokens.md#r-lex.token.str-c.syntax
+[cut operator paper]: https://kmizu.github.io/papers/paste513-mizushima.pdf
+[hard cut operator]: notation.md#the-hard-cut-operator
 [keywords]: keywords.md
+[raw string literals]: tokens.md#r-lex.token.literal.str-raw.syntax
 [syntax diagram]: https://en.wikipedia.org/wiki/Syntax_diagram
 [tokens]: tokens.md
 [unary operators]: expressions/operator-expr.md#borrow-operators
diff --git a/src/tokens.md b/src/tokens.md
index 571fa9849d..b6a0124320 100644
--- a/src/tokens.md
+++ b/src/tokens.md
@@ -217,7 +217,7 @@ r[lex.token.literal.str-raw.syntax]
 RAW_STRING_LITERAL -> `r` RAW_STRING_CONTENT SUFFIX?

 RAW_STRING_CONTENT ->
-      `"` ( ~CR )*? `"`
+      `"` ^ ( ~CR )*? `"`
     | `#` RAW_STRING_CONTENT `#`
 ```

@@ -251,7 +251,7 @@ r[lex.token.byte]
 r[lex.token.byte.syntax]
 ```grammar,lexer
 BYTE_LITERAL ->
-    `b'` ( ASCII_FOR_CHAR | BYTE_ESCAPE ) `'` SUFFIX?
+    `b'` ^ ( ASCII_FOR_CHAR | BYTE_ESCAPE ) `'` SUFFIX?
ASCII_FOR_CHAR ->
@@ -270,7 +270,7 @@ r[lex.token.str-byte]
 r[lex.token.str-byte.syntax]
 ```grammar,lexer
 BYTE_STRING_LITERAL ->
-    `b"` ( ASCII_FOR_STRING | BYTE_ESCAPE | STRING_CONTINUE )* `"` SUFFIX?
+    `b"` ^ ( ASCII_FOR_STRING | BYTE_ESCAPE | STRING_CONTINUE )* `"` SUFFIX?

 ASCII_FOR_STRING ->
@@ -306,7 +306,7 @@ RAW_BYTE_STRING_LITERAL -> `br` RAW_BYTE_STRING_CONTENT SUFFIX?

 RAW_BYTE_STRING_CONTENT ->
-      `"` ASCII_FOR_RAW*? `"`
+      `"` ^ ASCII_FOR_RAW*? `"`
     | `#` RAW_BYTE_STRING_CONTENT `#`

 ASCII_FOR_RAW ->
@@ -343,13 +343,12 @@ r[lex.token.str-c]
 r[lex.token.str-c.syntax]
 ```grammar,lexer
 C_STRING_LITERAL ->
-    `c"` (
+    `c"` ^ (
         ~[`"` `\` CR NUL]
       | BYTE_ESCAPE _except `\0` or `\x00`_
       | UNICODE_ESCAPE _except `\u{0}`, `\u{00}`, …, `\u{000000}`_
       | STRING_CONTINUE
     )* `"` SUFFIX?
-
 ```

 r[lex.token.str-c.intro]
@@ -402,7 +401,7 @@ RAW_C_STRING_LITERAL -> `cr` RAW_C_STRING_CONTENT SUFFIX?

 RAW_C_STRING_CONTENT ->
-      `"` ( ~[CR NUL] )*? `"`
+      `"` ^ ( ~[CR NUL] )*? `"`
     | `#` RAW_C_STRING_CONTENT `#`
 ```

diff --git a/tools/grammar/src/lib.rs b/tools/grammar/src/lib.rs
index 197fd2f5cf..70e1a8f9a8 100644
--- a/tools/grammar/src/lib.rs
+++ b/tools/grammar/src/lib.rs
@@ -76,6 +76,8 @@ pub enum ExpressionKind {
     Charset(Vec<Characters>),
     /// ``~[` ` LF]``
     NegExpression(Box<Expression>),
+    /// `^ A B C`
+    Cut(Box<Expression>),
     /// `U+0060`
     Unicode(String),
 }
@@ -116,7 +118,8 @@ impl Expression {
             | ExpressionKind::RepeatPlus(e)
             | ExpressionKind::RepeatPlusNonGreedy(e)
             | ExpressionKind::RepeatRange(e, _, _)
-            | ExpressionKind::NegExpression(e) => {
+            | ExpressionKind::NegExpression(e)
+            | ExpressionKind::Cut(e) => {
                 e.visit_nt(callback);
             }
             ExpressionKind::Alt(es) | ExpressionKind::Sequence(es) => {
diff --git a/tools/grammar/src/parser.rs b/tools/grammar/src/parser.rs
index 39bba771e3..d4240ae4d7 100644
--- a/tools/grammar/src/parser.rs
+++ b/tools/grammar/src/parser.rs
@@ -173,11 +173,7 @@ impl Parser<'_> {
         match es.len() {
             0 => Ok(None),
             1 => Ok(Some(es.pop().unwrap())),
-            _ => Ok(Some(Expression {
-                kind:
ExpressionKind::Alt(es),
-                suffix: None,
-                footnote: None,
-            })),
+            _ => Ok(Some(Expression::new_kind(ExpressionKind::Alt(es)))),
         }
     }

@@ -185,6 +181,11 @@ impl Parser<'_> {
         let mut es = Vec::new();
         loop {
             self.space0();
+            if self.peek() == Some(b'^') {
+                let cut = self.parse_cut()?;
+                es.push(cut);
+                break;
+            }
             let Some(e) = self.parse_expr1()? else {
                 break;
             };
@@ -201,6 +202,19 @@ impl Parser<'_> {
         }
     }

+    /// Parse cut (`^`) operator.
+    fn parse_cut(&mut self) -> Result<Expression> {
+        self.expect("^", "expected `^`")?;
+        let Some(rhs) = self.parse_seq()? else {
+            bail!(self, "expected expression after cut operator");
+        };
+        Ok(Expression {
+            kind: ExpressionKind::Cut(Box::new(rhs)),
+            suffix: None,
+            footnote: None,
+        })
+    }
+
     fn parse_expr1(&mut self) -> Result<Option<Expression>> {
         let Some(next) = self.peek() else {
             return Ok(None);
@@ -506,13 +520,71 @@ fn translate_position(input: &str, index: usize) -> (&str, usize, usize) {
     ("", line_number + 1, 0)
 }

-#[test]
-fn translate_tests() {
-    assert_eq!(translate_position("", 0), ("", 0, 0));
-    assert_eq!(translate_position("test", 0), ("test", 1, 1));
-    assert_eq!(translate_position("test", 3), ("test", 1, 4));
-    assert_eq!(translate_position("test", 4), ("test", 1, 5));
-    assert_eq!(translate_position("test\ntest2", 4), ("test", 1, 5));
-    assert_eq!(translate_position("test\ntest2", 5), ("test2", 2, 1));
-    assert_eq!(translate_position("test\ntest2\n", 11), ("", 3, 0));
+#[cfg(test)]
+mod tests {
+    use crate::parser::{parse_grammar, translate_position};
+    use crate::{ExpressionKind, Grammar};
+    use std::path::Path;
+
+    #[test]
+    fn test_translate() {
+        assert_eq!(translate_position("", 0), ("", 0, 0));
+        assert_eq!(translate_position("test", 0), ("test", 1, 1));
+        assert_eq!(translate_position("test", 3), ("test", 1, 4));
+        assert_eq!(translate_position("test", 4), ("test", 1, 5));
+        assert_eq!(translate_position("test\ntest2", 4), ("test", 1, 5));
+        assert_eq!(translate_position("test\ntest2", 5), ("test2", 2, 1));
+
assert_eq!(translate_position("test\ntest2\n", 11), ("", 3, 0));
+    }
+
+    fn parse(input: &str) -> Result<Grammar, String> {
+        let mut grammar = Grammar::default();
+        parse_grammar(input, &mut grammar, "test", Path::new("test.md"))
+            .map_err(|e| e.to_string())?;
+        Ok(grammar)
+    }
+
+    #[test]
+    fn test_cut() {
+        let input = "Rule -> A ^ B | C";
+        let grammar = parse(input).unwrap();
+        grammar.productions.get("Rule").unwrap();
+    }
+
+    #[test]
+    fn test_cut_captures() {
+        let input = "Rule -> A ^ B C | D";
+        let grammar = parse(input).unwrap();
+        let rule = grammar.productions.get("Rule").unwrap();
+        // The top-level expression is an alternation: (A ^ B C) | D.
+        let ExpressionKind::Alt(alts) = &rule.expression.kind else {
+            panic!("expected Alt, got {:?}", rule.expression.kind);
+        };
+        assert_eq!(alts.len(), 2);
+        // First alternative is a sequence: A, Cut(Sequence(B, C)).
+        let ExpressionKind::Sequence(seq) = &alts[0].kind else {
+            panic!("expected Sequence, got {:?}", alts[0].kind);
+        };
+        assert_eq!(seq.len(), 2);
+        assert!(matches!(&seq[0].kind, ExpressionKind::Nt(n) if n == "A"));
+        // The cut captures the rest of the sequence (B and C).
+        let ExpressionKind::Cut(cut_inner) = &seq[1].kind else {
+            panic!("expected Cut, got {:?}", seq[1].kind);
+        };
+        let ExpressionKind::Sequence(cut_seq) = &cut_inner.kind else {
+            panic!("expected Sequence inside Cut, got {:?}", cut_inner.kind);
+        };
+        assert_eq!(cut_seq.len(), 2);
+        assert!(matches!(&cut_seq[0].kind, ExpressionKind::Nt(n) if n == "B"));
+        assert!(matches!(&cut_seq[1].kind, ExpressionKind::Nt(n) if n == "C"));
+        // Second alternative is just D.
+        assert!(matches!(&alts[1].kind, ExpressionKind::Nt(n) if n == "D"));
+    }
+
+    #[test]
+    fn test_cut_fail_trailing() {
+        let input = "Rule -> A ^";
+        let err = parse(input).unwrap_err();
+        assert!(err.contains("expected expression after cut operator"));
+    }
 }
diff --git a/tools/mdbook-spec/src/grammar/render_markdown.rs b/tools/mdbook-spec/src/grammar/render_markdown.rs
index 5584b4641a..a5540b4169 100644
--- a/tools/mdbook-spec/src/grammar/render_markdown.rs
+++ b/tools/mdbook-spec/src/grammar/render_markdown.rs
@@ -79,6 +79,7 @@ fn last_expr(expr: &Expression) -> &ExpressionKind {
         | ExpressionKind::Comment(_)
         | ExpressionKind::Charset(_)
         | ExpressionKind::NegExpression(_)
+        | ExpressionKind::Cut(_)
         | ExpressionKind::Unicode(_) => &expr.kind,
     }
 }
@@ -171,6 +172,10 @@ fn render_expression(expr: &Expression, cx: &RenderCtx, output: &mut String) {
             output.push('~');
             render_expression(e, cx, output);
         }
+        ExpressionKind::Cut(e) => {
+            output.push_str("^ ");
+            render_expression(e, cx, output);
+        }
         ExpressionKind::Unicode(s) => {
             output.push_str("U+");
             output.push_str(s);
diff --git a/tools/mdbook-spec/src/grammar/render_railroad.rs b/tools/mdbook-spec/src/grammar/render_railroad.rs
index f16cadf557..6efb065a34 100644
--- a/tools/mdbook-spec/src/grammar/render_railroad.rs
+++ b/tools/mdbook-spec/src/grammar/render_railroad.rs
@@ -214,6 +214,11 @@ fn render_expression(expr: &Expression, cx: &RenderCtx, stack: bool) -> Option<Box<dyn Node>> {
+            ExpressionKind::Cut(e) => {
+                let rhs = render_expression(e, cx, stack)?;
+                let lbox = LabeledBox::new(rhs, Comment::new("no backtracking".to_string()));
+                Box::new(lbox)
+            }
             ExpressionKind::Unicode(s) => Box::new(Terminal::new(format!("U+{}", s))),
         };
     }
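
Reviewer note: the hard-cut behavior described in the new `notation.md` section can be sketched with a tiny hand-written matcher. This is only an illustration under stated assumptions, not code from this patch or from any Rust lexer. `Outcome`, `c_string`, `identifier`, and `token` are hypothetical names, and the C string rule is heavily simplified (it rejects only the `\0` escape and accepts every other escaped character).

```rust
/// Result of trying one alternative of an ordered choice (hypothetical type).
#[derive(Debug, PartialEq)]
enum Outcome {
    /// The alternative matched and consumed this many bytes.
    Matched(usize),
    /// Failed before reaching a cut; the next alternative may be tried.
    Backtrack,
    /// Failed after a cut; no enclosing choice may recover.
    HardFail,
}

/// Simplified stand-in for: `c"` ^ ( escape | ~[`"` `\`] )* `"`
/// Not the full C_STRING_LITERAL rule; only the `\0` escape is rejected.
fn c_string(input: &str) -> Outcome {
    let Some(rest) = input.strip_prefix("c\"") else {
        // The opening delimiter did not match, so the cut was never reached.
        return Outcome::Backtrack;
    };
    // Past the cut: every failure from here on is definitive.
    let mut chars = rest.char_indices();
    while let Some((i, ch)) = chars.next() {
        match ch {
            // Closing quote: prefix (2 bytes) + body + the quote itself.
            '"' => return Outcome::Matched(2 + i + 1),
            '\\' => match chars.next() {
                // `\0` is not allowed; a trailing backslash is unterminated.
                Some((_, '0')) | None => return Outcome::HardFail,
                // This sketch accepts any other escaped character.
                Some(_) => {}
            },
            _ => {}
        }
    }
    // Ran out of input without a closing quote.
    Outcome::HardFail
}

/// ASCII-only identifier, enough for the example.
fn identifier(input: &str) -> Outcome {
    let len = input
        .bytes()
        .take_while(|b| b.is_ascii_alphanumeric() || *b == b'_')
        .count();
    if len == 0 {
        Outcome::Backtrack
    } else {
        Outcome::Matched(len)
    }
}

/// Ordered choice: C string first, then identifier.
fn token(input: &str) -> Outcome {
    let alternatives: [fn(&str) -> Outcome; 2] = [c_string, identifier];
    for alt in alternatives {
        match alt(input) {
            // Only a pre-cut failure lets the next alternative run.
            Outcome::Backtrack => continue,
            other => return other,
        }
    }
    Outcome::Backtrack
}
```

In this sketch, `token` applied to the source text `c"\0"` returns `Outcome::HardFail`, whereas `identifier` on the same input would match the single byte `c`. That second result is the re-lexing as `c` followed by `"\0"` that the cut is meant to rule out.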