Commit aee21d8

Add cut operator (^) to grammar
The cut operator (`^`) is a backtracking fence. Once the expression to its left succeeds, we become committed to the alternative; the remainder of the expression must parse successfully or parsing will fail. See *Packrat Parsers Can Handle Practical Grammars in Mostly Constant Space*, Mizushima et al., <https://kmizu.github.io/papers/paste513-mizushima.pdf>.

This operator solves a problem for us with C string literals. These literals cannot contain a null escape. But if we simply fail to lex the literal (e.g., `c"\0"`), we may instead lex it successfully as two separate tokens (`c` `"\0"`), and that would be incorrect.

As long as we only use cut to express constraints that can be expressed in a regular language and we keep our alternations disjoint, the grammar can still be mechanically converted to a CFG.

Let's add the cut operator to our grammar and use it for C string literals and some similar constructs.

In the railroad diagrams, we'll render the cut as a "no backtracking" box around the expression or sequence of expressions after the cut. The idea is that once you enter the box the only way out is forward.
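To make the `c"\0"` problem concrete, here is a minimal sketch of an ordered-choice lexer with and without a hard cut. It is not the code from this commit, which only adds `^` to the grammar notation and its tooling. The names `Outcome`, `c_string`, `string`, `identifier`, `next_token`, and `lex` are hypothetical, and the token rules are heavily simplified (for example, only the `\0` escape is rejected, not `\x00` or `\u{0}`).

```rust
/// Result of trying one alternative of an ordered choice.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    /// The alternative matched and consumed this many bytes.
    Match(usize),
    /// The alternative failed before any cut; later alternatives may be tried.
    Fail,
    /// The alternative failed after a cut; no other alternative may be tried.
    Abort,
}

/// Simplified C string literal: `c"` ^ ( ordinary byte | `\` escape except `\0` )* `"`.
fn c_string(input: &[u8], use_cut: bool) -> Outcome {
    if !input.starts_with(b"c\"") {
        return Outcome::Fail;
    }
    // The prefix matched. With the cut, any later failure is final.
    let failure = if use_cut { Outcome::Abort } else { Outcome::Fail };
    let mut i = 2;
    while i < input.len() {
        match input[i] {
            b'"' => return Outcome::Match(i + 1),
            0 => return failure, // raw NUL byte
            b'\\' => match input.get(i + 1) {
                Some(b'0') | None => return failure, // null escape or dangling backslash
                Some(_) => i += 2,
            },
            _ => i += 1,
        }
    }
    failure // unterminated literal
}

/// Simplified ordinary string literal; any escape, including `\0`, is accepted.
fn string(input: &[u8]) -> Outcome {
    if input.first() != Some(&b'"') {
        return Outcome::Fail;
    }
    let mut i = 1;
    while i < input.len() {
        match input[i] {
            b'"' => return Outcome::Match(i + 1),
            b'\\' => i += 2,
            _ => i += 1,
        }
    }
    Outcome::Fail
}

/// Simplified identifier: one or more ASCII alphanumerics or `_`.
fn identifier(input: &[u8]) -> Outcome {
    let n = input
        .iter()
        .take_while(|b| b.is_ascii_alphanumeric() || **b == b'_')
        .count();
    if n == 0 { Outcome::Fail } else { Outcome::Match(n) }
}

/// Ordered choice: C string, then string, then identifier.
fn next_token(input: &[u8], use_cut: bool) -> Option<(&'static str, usize)> {
    match c_string(input, use_cut) {
        Outcome::Match(n) => return Some(("c-string", n)),
        Outcome::Abort => return None, // hard cut: do not try the other alternatives
        Outcome::Fail => {}
    }
    if let Outcome::Match(n) = string(input) {
        return Some(("string", n));
    }
    if let Outcome::Match(n) = identifier(input) {
        return Some(("identifier", n));
    }
    None
}

/// Lexes the whole input into a list of token kinds, or reports an error.
fn lex(mut input: &[u8], use_cut: bool) -> Result<Vec<&'static str>, String> {
    let mut tokens = Vec::new();
    while !input.is_empty() {
        match next_token(input, use_cut) {
            Some((kind, n)) => {
                tokens.push(kind);
                input = &input[n..];
            }
            None => return Err(format!("lex error with {} bytes remaining", input.len())),
        }
    }
    Ok(tokens)
}
```

Under these assumptions, `lex(br#"c"\0""#, false)` yields an identifier followed by a string literal, which is the incorrect result the commit message describes, while passing `true` turns the same input into an error. A well-formed input such as `c"hi"` lexes as a single C string token either way, and a bare `c` is still an identifier because the cut only takes effect after the `c"` prefix has matched.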
1 parent faee4d3 commit aee21d8

7 files changed

Lines changed: 128 additions & 23 deletions

dev-guide/src/grammar.md

Lines changed: 6 additions & 1 deletion
@@ -35,7 +35,9 @@ Name -> <Alphanumeric or `_`>+
 
 Expression -> Sequence (` `* `|` ` `* Sequence)*
 
-Sequence -> (` `* AdornedExpr)+
+Sequence ->
+      (` `* AdornedExpr)* ` `* Cut
+    | (` `* AdornedExpr)+
 
 AdornedExpr -> ExprRepeat Suffix? Footnote?
 
@@ -92,6 +94,8 @@ Prose -> `<` ~[`>` LF]+ `>`
 Group -> `(` ` `* Expression ` `* `)`
 
 NegativeExpression -> `~` ( Charset | Terminal | NonTerminal )
+
+Cut -> `^` Sequence
 ```
 
 The general format is a series of productions separated by blank lines. The expressions are as follows:
@@ -110,6 +114,7 @@ The general format is a series of productions separated by blank lines. The expr
 | Prose | \<any ASCII character except CR\> | An English description of what should be matched, surrounded in angle brackets. |
 | Group | (\`,\` Parameter)+ | Groups an expression for the purpose of precedence, such as applying a repetition operator to a sequence of other expressions. |
 | NegativeExpression | ~\[\` \` LF\] | Matches anything except the given Charset, Terminal, or Nonterminal. |
+| Cut | Expr1 ^ Expr2 \| Expr3 | The hard cut operator. Once the expressions preceding `^` in the sequence match, the rest of the sequence must match or parsing fails unconditionally --- no enclosing expression can backtrack past the cut point. |
 | Sequence | \`fn\` Name Parameters | A sequence of expressions that must match in order. |
 | Alternation | Expr1 \| Expr2 | Matches only one of the given expressions, separated by the vertical pipe character. |
 | Suffix | \_except \[LazyBooleanExpression\]\_ | Adds a suffix to the previous expression to provide an additional English description, rendered in subscript. This can contain limited Markdown, but try to avoid anything except basics like links. |

src/notation.md

Lines changed: 16 additions & 0 deletions
@@ -24,13 +24,23 @@ The following notations are used by the *Lexer* and *Syntax* grammar snippets:
 | ~\[ ] | ~\[`b` `B`] | Any characters, except those listed |
 | ~`string` | ~`\n`, ~`*/` | Any characters, except this sequence |
 | ( ) | (`,` _Parameter_)<sup>?</sup> | Groups items |
+| ^ | `b'` ^ ASCII_FOR_CHAR | The rest of the sequence must match or parsing fails unconditionally ([hard cut operator]) |
 | U+xxxx | U+0060 | A single unicode character |
 | \<text\> | \<any ASCII char except CR\> | An English description of what should be matched |
 | Rule <sub>suffix</sub> | IDENTIFIER_OR_KEYWORD <sub>_except `crate`_</sub> | A modification to the previous rule |
 | // Comment. | // Single line comment. | A comment extending to the end of the line. |
 
 Sequences have a higher precedence than `|` alternation.
 
+r[notation.grammar.cut]
+### The hard cut operator
+
+The grammar uses ordered alternation: the parser tries alternatives left to right and takes the first that matches. If an alternative fails partway through a sequence, the parser normally backtracks and tries the next alternative. The cut operator (`^`) prevents this. Once every expression to the left of `^` in a sequence has matched, the rest of the sequence must match or parsing fails unconditionally.
+
+Mizushima et al. introduced [cut operators][cut operator paper] to parsing expression grammars. In the PEG literature, a *soft cut* prevents backtracking only within the immediately enclosing ordered choice --- outer choices can still recover. A *hard cut* prevents all backtracking past the cut point; failure is definitive. The `^` used in this grammar is a hard cut.
+
+The hard cut operator is necessary because some tokens in Rust begin with a prefix that is itself a valid token. For example, `c"` begins a C string literal, but `c` alone is a valid identifier. Without the cut, if `c"\0"` failed to lex as a C string literal (because null bytes are not allowed in C strings), the parser could backtrack and lex it as two tokens: the identifier `c` and the string literal `"\0"`. The [cut after `c"`] prevents this --- once the opening delimiter is recognized, the parser cannot go back. The same reasoning applies to [byte literals], [byte string literals], [raw string literals], and other literals with prefixes that are themselves valid tokens.
+
 r[notation.grammar.string-tables]
 ### String table productions
 
@@ -52,7 +62,13 @@ r[notation.grammar.visualizations]
 Below each grammar block is a button to toggle the display of a [syntax diagram]. A square element is a non-terminal rule, and a rounded rectangle is a terminal.
 
 [binary operators]: expressions/operator-expr.md#arithmetic-and-logical-binary-operators
+[byte literals]: tokens.md#r-lex.token.byte.syntax
+[byte string literals]: tokens.md#r-lex.token.str-byte.syntax
+[cut after `c"`]: tokens.md#r-lex.token.str-c.syntax
+[cut operator paper]: https://kmizu.github.io/papers/paste513-mizushima.pdf
+[hard cut operator]: notation.md#the-hard-cut-operator
 [keywords]: keywords.md
+[raw string literals]: tokens.md#r-lex.token.literal.str-raw.syntax
 [syntax diagram]: https://en.wikipedia.org/wiki/Syntax_diagram
 [tokens]: tokens.md
 [unary operators]: expressions/operator-expr.md#borrow-operators

src/tokens.md

Lines changed: 6 additions & 7 deletions
@@ -217,7 +217,7 @@ r[lex.token.literal.str-raw.syntax]
 RAW_STRING_LITERAL -> `r` RAW_STRING_CONTENT SUFFIX?
 
 RAW_STRING_CONTENT ->
-      `"` ( ~CR )*? `"`
+      `"` ^ ( ~CR )*? `"`
     | `#` RAW_STRING_CONTENT `#`
 ```
 
@@ -251,7 +251,7 @@ r[lex.token.byte]
 r[lex.token.byte.syntax]
 ```grammar,lexer
 BYTE_LITERAL ->
-    `b'` ( ASCII_FOR_CHAR | BYTE_ESCAPE ) `'` SUFFIX?
+    `b'` ^ ( ASCII_FOR_CHAR | BYTE_ESCAPE ) `'` SUFFIX?
 
 ASCII_FOR_CHAR ->
     <any ASCII (i.e. 0x00 to 0x7F) except `'`, `\`, LF, CR, or TAB>
@@ -270,7 +270,7 @@ r[lex.token.str-byte]
 r[lex.token.str-byte.syntax]
 ```grammar,lexer
 BYTE_STRING_LITERAL ->
-    `b"` ( ASCII_FOR_STRING | BYTE_ESCAPE | STRING_CONTINUE )* `"` SUFFIX?
+    `b"` ^ ( ASCII_FOR_STRING | BYTE_ESCAPE | STRING_CONTINUE )* `"` SUFFIX?
 
 ASCII_FOR_STRING ->
     <any ASCII (i.e 0x00 to 0x7F) except `"`, `\`, or CR>
@@ -306,7 +306,7 @@ RAW_BYTE_STRING_LITERAL ->
     `br` RAW_BYTE_STRING_CONTENT SUFFIX?
 
 RAW_BYTE_STRING_CONTENT ->
-      `"` ASCII_FOR_RAW*? `"`
+      `"` ^ ASCII_FOR_RAW*? `"`
     | `#` RAW_BYTE_STRING_CONTENT `#`
 
 ASCII_FOR_RAW ->
@@ -343,13 +343,12 @@ r[lex.token.str-c]
 r[lex.token.str-c.syntax]
 ```grammar,lexer
 C_STRING_LITERAL ->
-    `c"` (
+    `c"` ^ (
         ~[`"` `\` CR NUL]
       | BYTE_ESCAPE _except `\0` or `\x00`_
       | UNICODE_ESCAPE _except `\u{0}`, `\u{00}`, …, `\u{000000}`_
       | STRING_CONTINUE
     )* `"` SUFFIX?
-
 ```
 
 r[lex.token.str-c.intro]
@@ -402,7 +401,7 @@ RAW_C_STRING_LITERAL ->
     `cr` RAW_C_STRING_CONTENT SUFFIX?
 
 RAW_C_STRING_CONTENT ->
-      `"` ( ~[CR NUL] )*? `"`
+      `"` ^ ( ~[CR NUL] )*? `"`
     | `#` RAW_C_STRING_CONTENT `#`
 ```

tools/grammar/src/lib.rs

Lines changed: 4 additions & 1 deletion
@@ -76,6 +76,8 @@ pub enum ExpressionKind {
     Charset(Vec<Characters>),
     /// ``~[` ` LF]``
     NegExpression(Box<Expression>),
+    /// `^ A B C`
+    Cut(Box<Expression>),
     /// `U+0060`
     Unicode(String),
 }
@@ -116,7 +118,8 @@ impl Expression {
             | ExpressionKind::RepeatPlus(e)
             | ExpressionKind::RepeatPlusNonGreedy(e)
             | ExpressionKind::RepeatRange(e, _, _)
-            | ExpressionKind::NegExpression(e) => {
+            | ExpressionKind::NegExpression(e)
+            | ExpressionKind::Cut(e) => {
                 e.visit_nt(callback);
             }
             ExpressionKind::Alt(es) | ExpressionKind::Sequence(es) => {

tools/grammar/src/parser.rs

Lines changed: 86 additions & 14 deletions
@@ -173,18 +173,19 @@ impl Parser<'_> {
         match es.len() {
             0 => Ok(None),
             1 => Ok(Some(es.pop().unwrap())),
-            _ => Ok(Some(Expression {
-                kind: ExpressionKind::Alt(es),
-                suffix: None,
-                footnote: None,
-            })),
+            _ => Ok(Some(Expression::new_kind(ExpressionKind::Alt(es)))),
         }
     }
 
     fn parse_seq(&mut self) -> Result<Option<Expression>> {
         let mut es = Vec::new();
         loop {
             self.space0();
+            if self.peek() == Some(b'^') {
+                let cut = self.parse_cut()?;
+                es.push(cut);
+                break;
+            }
             let Some(e) = self.parse_expr1()? else {
                 break;
             };
@@ -201,6 +202,19 @@ impl Parser<'_> {
         }
     }
 
+    /// Parse cut (`^`) operator.
+    fn parse_cut(&mut self) -> Result<Expression> {
+        self.expect("^", "expected `^`")?;
+        let Some(rhs) = self.parse_seq()? else {
+            bail!(self, "expected expression after cut operator");
+        };
+        Ok(Expression {
+            kind: ExpressionKind::Cut(Box::new(rhs)),
+            suffix: None,
+            footnote: None,
+        })
+    }
+
     fn parse_expr1(&mut self) -> Result<Option<Expression>> {
         let Some(next) = self.peek() else {
             return Ok(None);
@@ -506,13 +520,71 @@ fn translate_position(input: &str, index: usize) -> (&str, usize, usize) {
     ("", line_number + 1, 0)
 }
 
-#[test]
-fn translate_tests() {
-    assert_eq!(translate_position("", 0), ("", 0, 0));
-    assert_eq!(translate_position("test", 0), ("test", 1, 1));
-    assert_eq!(translate_position("test", 3), ("test", 1, 4));
-    assert_eq!(translate_position("test", 4), ("test", 1, 5));
-    assert_eq!(translate_position("test\ntest2", 4), ("test", 1, 5));
-    assert_eq!(translate_position("test\ntest2", 5), ("test2", 2, 1));
-    assert_eq!(translate_position("test\ntest2\n", 11), ("", 3, 0));
+#[cfg(test)]
+mod tests {
+    use crate::parser::{parse_grammar, translate_position};
+    use crate::{ExpressionKind, Grammar};
+    use std::path::Path;
+
+    #[test]
+    fn test_translate() {
+        assert_eq!(translate_position("", 0), ("", 0, 0));
+        assert_eq!(translate_position("test", 0), ("test", 1, 1));
+        assert_eq!(translate_position("test", 3), ("test", 1, 4));
+        assert_eq!(translate_position("test", 4), ("test", 1, 5));
+        assert_eq!(translate_position("test\ntest2", 4), ("test", 1, 5));
+        assert_eq!(translate_position("test\ntest2", 5), ("test2", 2, 1));
+        assert_eq!(translate_position("test\ntest2\n", 11), ("", 3, 0));
+    }
+
+    fn parse(input: &str) -> Result<Grammar, String> {
+        let mut grammar = Grammar::default();
+        parse_grammar(input, &mut grammar, "test", Path::new("test.md"))
+            .map_err(|e| e.to_string())?;
+        Ok(grammar)
+    }
+
+    #[test]
+    fn test_cut() {
+        let input = "Rule -> A ^ B | C";
+        let grammar = parse(input).unwrap();
+        grammar.productions.get("Rule").unwrap();
+    }
+
+    #[test]
+    fn test_cut_captures() {
+        let input = "Rule -> A ^ B C | D";
+        let grammar = parse(input).unwrap();
+        let rule = grammar.productions.get("Rule").unwrap();
+        // The top-level expression is an alternation: (A ^ B C) | D.
+        let ExpressionKind::Alt(alts) = &rule.expression.kind else {
+            panic!("expected Alt, got {:?}", rule.expression.kind);
+        };
+        assert_eq!(alts.len(), 2);
+        // First alternative is a sequence: A, Cut(Sequence(B, C)).
+        let ExpressionKind::Sequence(seq) = &alts[0].kind else {
+            panic!("expected Sequence, got {:?}", alts[0].kind);
+        };
+        assert_eq!(seq.len(), 2);
+        assert!(matches!(&seq[0].kind, ExpressionKind::Nt(n) if n == "A"));
+        // The cut captures the rest of the sequence (B and C).
+        let ExpressionKind::Cut(cut_inner) = &seq[1].kind else {
+            panic!("expected Cut, got {:?}", seq[1].kind);
+        };
+        let ExpressionKind::Sequence(cut_seq) = &cut_inner.kind else {
+            panic!("expected Sequence inside Cut, got {:?}", cut_inner.kind);
+        };
+        assert_eq!(cut_seq.len(), 2);
+        assert!(matches!(&cut_seq[0].kind, ExpressionKind::Nt(n) if n == "B"));
+        assert!(matches!(&cut_seq[1].kind, ExpressionKind::Nt(n) if n == "C"));
+        // Second alternative is just D.
+        assert!(matches!(&alts[1].kind, ExpressionKind::Nt(n) if n == "D"));
+    }
+
+    #[test]
+    fn test_cut_fail_trailing() {
+        let input = "Rule -> A ^";
+        let err = parse(input).unwrap_err();
+        assert!(err.contains("expected expression after cut operator"));
+    }
 }

tools/mdbook-spec/src/grammar/render_markdown.rs

Lines changed: 5 additions & 0 deletions
@@ -79,6 +79,7 @@ fn last_expr(expr: &Expression) -> &ExpressionKind {
         | ExpressionKind::Comment(_)
         | ExpressionKind::Charset(_)
         | ExpressionKind::NegExpression(_)
+        | ExpressionKind::Cut(_)
         | ExpressionKind::Unicode(_) => &expr.kind,
     }
 }
@@ -171,6 +172,10 @@ fn render_expression(expr: &Expression, cx: &RenderCtx, output: &mut String) {
             output.push('~');
             render_expression(e, cx, output);
         }
+        ExpressionKind::Cut(e) => {
+            output.push_str("^ ");
+            render_expression(e, cx, output);
+        }
         ExpressionKind::Unicode(s) => {
             output.push_str("U+");
             output.push_str(s);

tools/mdbook-spec/src/grammar/render_railroad.rs

Lines changed: 5 additions & 0 deletions
@@ -214,6 +214,11 @@ fn render_expression(expr: &Expression, cx: &RenderCtx, stack: bool) -> Option<B
                 let ch = node_for_nt(cx, "CHAR");
                 Box::new(Except::new(Box::new(ch), n))
             }
+            ExpressionKind::Cut(e) => {
+                let rhs = render_expression(e, cx, stack)?;
+                let lbox = LabeledBox::new(rhs, Comment::new("no backtracking".to_string()));
+                Box::new(lbox)
+            }
             ExpressionKind::Unicode(s) => Box::new(Terminal::new(format!("U+{}", s))),
         };
     }
