
Commit 9330955

traviscross authored and ehuss committed
Add cut operator (^) to grammar
The cut operator (`^`) is a backtracking fence. Once the expression to its left succeeds, we become committed to the alternative; the remainder of the expression must parse successfully or parsing will fail. See *Packrat Parsers Can Handle Practical Grammars in Mostly Constant Space*, Mizushima et al., <https://kmizu.github.io/papers/paste513-mizushima.pdf>.

This operator solves a problem for us with C string literals. These literals cannot contain a null escape. But if we simply fail to lex the literal (e.g. `c"\0"`), we may instead lex it successfully as two separate tokens (`c "\0"`), and that would be incorrect.

As long as we only use cut to express constraints that can be expressed in a regular language and we keep our alternations disjoint, the grammar can still be mechanically converted to a CFG.

Let's add the cut operator to our grammar and use it for C string literals and some similar constructs. In the railroad diagrams, we'll render the cut as a "no backtracking" box around the expression or sequence of expressions after the cut. The idea is that once you enter the box the only way out is forward.
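To make the commit behavior concrete, here is a minimal sketch of a lexer alternative with a cut after the `c"` prefix. It is not the reference's lexer: the `Lex` enum, `lex_c_string`, and `lex_token` are hypothetical names, the fallback "identifier" is a single ASCII letter, and only the `\0` escape is modeled (not `\x00` or `\u{0}`).

```rust
/// Outcome of trying one alternative at the start of the input.
#[derive(Debug, PartialEq)]
enum Lex {
    /// The alternative matched this many bytes.
    Matched(usize),
    /// The alternative did not match; another alternative may be tried.
    NoMatch,
    /// The alternative failed after its cut; no backtracking is allowed.
    Committed(&'static str),
}

/// Hypothetical, simplified C string rule with a cut after the `c"` prefix.
fn lex_c_string(input: &str) -> Lex {
    let bytes = input.as_bytes();
    if !input.starts_with("c\"") {
        // Before the cut: an ordinary, recoverable failure.
        return Lex::NoMatch;
    }
    // Past the cut: every failure from here on is a hard error.
    let mut i = 2;
    while i < bytes.len() {
        match bytes[i] {
            b'"' => return Lex::Matched(i + 1),
            0 => return Lex::Committed("NUL in C string literal"),
            b'\\' => {
                if bytes.get(i + 1) == Some(&b'0') {
                    return Lex::Committed("null escape in C string literal");
                }
                // Skip the backslash and the escaped byte.
                i += 2;
            }
            _ => i += 1,
        }
    }
    Lex::Committed("unterminated C string literal")
}

/// Tries the C string rule first, then falls back to a one-letter identifier.
fn lex_token(input: &str) -> Result<usize, &'static str> {
    match lex_c_string(input) {
        Lex::Matched(n) => Ok(n),
        Lex::Committed(msg) => Err(msg),
        Lex::NoMatch => match input.bytes().next() {
            Some(b) if b.is_ascii_alphabetic() => Ok(1),
            _ => Err("no token"),
        },
    }
}
```

With this shape, `lex_token("c\"\\0\"")` reports an error instead of quietly lexing `c` as an identifier followed by a separate string token, which is the failure mode described above.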
1 parent faee4d3 commit 9330955

7 files changed

Lines changed: 85 additions & 23 deletions


dev-guide/src/grammar.md

Lines changed: 7 additions & 1 deletion
@@ -35,7 +35,9 @@ Name -> <Alphanumeric or `_`>+
 
 Expression -> Sequence (` `* `|` ` `* Sequence)*
 
-Sequence -> (` `* AdornedExpr)+
+Sequence ->
+      (` `* AdornedExpr)* ` `* Cut
+    | (` `* AdornedExpr)+
 
 AdornedExpr -> ExprRepeat Suffix? Footnote?
 
@@ -63,6 +65,7 @@ Expr1 ->
     | Prose
     | Group
     | NegativeExpression
+    | Cut
 
 Unicode -> `U+` [`A`-`Z` `0`-`9`]4..4
 
@@ -92,6 +95,8 @@ Prose -> `<` ~[`>` LF]+ `>`
 Group -> `(` ` `* Expression ` `* `)`
 
 NegativeExpression -> `~` ( Charset | Terminal | NonTerminal )
+
+Cut -> `^` Sequence
 ```
 
 The general format is a series of productions separated by blank lines. The expressions are as follows:
@@ -110,6 +115,7 @@ The general format is a series of productions separated by blank lines. The expr
 | Prose | \<any ASCII character except CR\> | An English description of what should be matched, surrounded in angle brackets. |
 | Group | (\`,\` Parameter)+ | Groups an expression for the purpose of precedence, such as applying a repetition operator to a sequence of other expressions. |
 | NegativeExpression | ~\[\` \` LF\] | Matches anything except the given Charset, Terminal, or Nonterminal. |
+| Cut | Expr1 ^ Expr2 \| Expr3 | The cut operator. Commits to the current alternative if parsing has reached the cut operator. It is a syntax error if the remaining expressions in the sequence do not match. |
 | Sequence | \`fn\` Name Parameters | A sequence of expressions that must match in order. |
 | Alternation | Expr1 \| Expr2 | Matches only one of the given expressions, separated by the vertical pipe character. |
 | Suffix | \_except \[LazyBooleanExpression\]\_ | Adds a suffix to the previous expression to provide an additional English description, rendered in subscript. This can contain limited Markdown, but try to avoid anything except basics like links. |

src/notation.md

Lines changed: 2 additions & 0 deletions
@@ -24,6 +24,7 @@ The following notations are used by the *Lexer* and *Syntax* grammar snippets:
 | ~\[ ] | ~\[`b` `B`] | Any characters, except those listed |
 | ~`string` | ~`\n`, ~`*/` | Any characters, except this sequence |
 | ( ) | (`,` _Parameter_)<sup>?</sup> | Groups items |
+| ^ | `c"` ^ _CStringRest_ | Commit to an alternative ([cut operator]) |
 | U+xxxx | U+0060 | A single unicode character |
 | \<text\> | \<any ASCII char except CR\> | An English description of what should be matched |
 | Rule <sub>suffix</sub> | IDENTIFIER_OR_KEYWORD <sub>_except `crate`_</sub> | A modification to the previous rule |
@@ -52,6 +53,7 @@ r[notation.grammar.visualizations]
 Below each grammar block is a button to toggle the display of a [syntax diagram]. A square element is a non-terminal rule, and a rounded rectangle is a terminal.
 
 [binary operators]: expressions/operator-expr.md#arithmetic-and-logical-binary-operators
+[cut operator]: https://kmizu.github.io/papers/paste513-mizushima.pdf
 [keywords]: keywords.md
 [syntax diagram]: https://en.wikipedia.org/wiki/Syntax_diagram
 [tokens]: tokens.md

src/tokens.md

Lines changed: 6 additions & 7 deletions
@@ -217,7 +217,7 @@ r[lex.token.literal.str-raw.syntax]
 RAW_STRING_LITERAL -> `r` RAW_STRING_CONTENT SUFFIX?
 
 RAW_STRING_CONTENT ->
-      `"` ( ~CR )*? `"`
+      `"` ^ ( ~CR )*? `"`
     | `#` RAW_STRING_CONTENT `#`
 ```
 
@@ -251,7 +251,7 @@ r[lex.token.byte]
 r[lex.token.byte.syntax]
 ```grammar,lexer
 BYTE_LITERAL ->
-    `b'` ( ASCII_FOR_CHAR | BYTE_ESCAPE ) `'` SUFFIX?
+    `b'` ^ ( ASCII_FOR_CHAR | BYTE_ESCAPE ) `'` SUFFIX?
 
 ASCII_FOR_CHAR ->
     <any ASCII (i.e. 0x00 to 0x7F) except `'`, `\`, LF, CR, or TAB>
@@ -270,7 +270,7 @@ r[lex.token.str-byte]
 r[lex.token.str-byte.syntax]
 ```grammar,lexer
 BYTE_STRING_LITERAL ->
-    `b"` ( ASCII_FOR_STRING | BYTE_ESCAPE | STRING_CONTINUE )* `"` SUFFIX?
+    `b"` ^ ( ASCII_FOR_STRING | BYTE_ESCAPE | STRING_CONTINUE )* `"` SUFFIX?
 
 ASCII_FOR_STRING ->
     <any ASCII (i.e 0x00 to 0x7F) except `"`, `\`, or CR>
@@ -306,7 +306,7 @@ RAW_BYTE_STRING_LITERAL ->
     `br` RAW_BYTE_STRING_CONTENT SUFFIX?
 
 RAW_BYTE_STRING_CONTENT ->
-      `"` ASCII_FOR_RAW*? `"`
+      `"` ^ ASCII_FOR_RAW*? `"`
     | `#` RAW_BYTE_STRING_CONTENT `#`
 
 ASCII_FOR_RAW ->
@@ -343,13 +343,12 @@ r[lex.token.str-c]
 r[lex.token.str-c.syntax]
 ```grammar,lexer
 C_STRING_LITERAL ->
-    `c"` (
+    `c"` ^ (
         ~[`"` `\` CR NUL]
       | BYTE_ESCAPE _except `\0` or `\x00`_
       | UNICODE_ESCAPE _except `\u{0}`, `\u{00}`, …, `\u{000000}`_
       | STRING_CONTINUE
     )* `"` SUFFIX?
-
 ```
 
 r[lex.token.str-c.intro]
@@ -402,7 +401,7 @@ RAW_C_STRING_LITERAL ->
     `cr` RAW_C_STRING_CONTENT SUFFIX?
 
 RAW_C_STRING_CONTENT ->
-      `"` ( ~[CR NUL] )*? `"`
+      `"` ^ ( ~[CR NUL] )*? `"`
     | `#` RAW_C_STRING_CONTENT `#`
 ```
 
tools/grammar/src/lib.rs

Lines changed: 4 additions & 1 deletion
@@ -76,6 +76,8 @@ pub enum ExpressionKind {
     Charset(Vec<Characters>),
     /// ``~[` ` LF]``
     NegExpression(Box<Expression>),
+    /// `^ A`
+    Cut(Box<Expression>),
     /// `U+0060`
     Unicode(String),
 }
@@ -116,7 +118,8 @@ impl Expression {
             | ExpressionKind::RepeatPlus(e)
             | ExpressionKind::RepeatPlusNonGreedy(e)
             | ExpressionKind::RepeatRange(e, _, _)
-            | ExpressionKind::NegExpression(e) => {
+            | ExpressionKind::NegExpression(e)
+            | ExpressionKind::Cut(e) => {
                 e.visit_nt(callback);
             }
             ExpressionKind::Alt(es) | ExpressionKind::Sequence(es) => {
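
The hunk above treats `Cut` like the other single-operand wrappers when visiting nonterminals. A minimal sketch of that idea, assuming a trimmed-down stand-in for the real types (`Kind`, this free-standing `visit_nt`, and `nonterminals` are hypothetical and only loosely mirror `ExpressionKind`):

```rust
/// Hypothetical, trimmed-down stand-in for `ExpressionKind`.
enum Kind {
    Nt(String),
    Terminal(String),
    Cut(Box<Kind>),
    Sequence(Vec<Kind>),
}

/// Calls `callback` with the name of every nonterminal reachable from `kind`.
fn visit_nt(kind: &Kind, callback: &mut dyn FnMut(&str)) {
    match kind {
        Kind::Nt(name) => callback(name.as_str()),
        Kind::Terminal(_) => {}
        // A cut wraps exactly one operand, so recurse into it.
        Kind::Cut(e) => visit_nt(e, callback),
        Kind::Sequence(es) => {
            for e in es {
                visit_nt(e, callback);
            }
        }
    }
}

/// Collects the nonterminal names in visit order.
fn nonterminals(kind: &Kind) -> Vec<String> {
    let mut names = Vec::new();
    visit_nt(kind, &mut |name: &str| names.push(name.to_string()));
    names
}
```

If the `Cut` arm were missing, nonterminals that appear only after a `^` would be skipped by anything built on the visitor.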

tools/grammar/src/parser.rs

Lines changed: 56 additions & 14 deletions
@@ -173,18 +173,19 @@ impl Parser<'_> {
         match es.len() {
             0 => Ok(None),
             1 => Ok(Some(es.pop().unwrap())),
-            _ => Ok(Some(Expression {
-                kind: ExpressionKind::Alt(es),
-                suffix: None,
-                footnote: None,
-            })),
+            _ => Ok(Some(Expression::new_kind(ExpressionKind::Alt(es)))),
         }
     }
 
     fn parse_seq(&mut self) -> Result<Option<Expression>> {
         let mut es = Vec::new();
         loop {
             self.space0();
+            if self.peek() == Some(b'^') {
+                let cut = self.parse_cut()?;
+                es.push(cut);
+                break;
+            }
             let Some(e) = self.parse_expr1()? else {
                 break;
             };
@@ -201,6 +202,19 @@ impl Parser<'_> {
         }
     }
 
+    /// Parse cut (`^`) operator.
+    fn parse_cut(&mut self) -> Result<Expression> {
+        self.expect("^", "expected `^`")?;
+        let Some(rhs) = self.parse_seq()? else {
+            bail!(self, "expected expression after cut operator");
+        };
+        Ok(Expression {
+            kind: ExpressionKind::Cut(Box::new(rhs)),
+            suffix: None,
+            footnote: None,
+        })
+    }
+
     fn parse_expr1(&mut self) -> Result<Option<Expression>> {
         let Some(next) = self.peek() else {
             return Ok(None);
@@ -506,13 +520,41 @@ fn translate_position(input: &str, index: usize) -> (&str, usize, usize) {
     ("", line_number + 1, 0)
 }
 
-#[test]
-fn translate_tests() {
-    assert_eq!(translate_position("", 0), ("", 0, 0));
-    assert_eq!(translate_position("test", 0), ("test", 1, 1));
-    assert_eq!(translate_position("test", 3), ("test", 1, 4));
-    assert_eq!(translate_position("test", 4), ("test", 1, 5));
-    assert_eq!(translate_position("test\ntest2", 4), ("test", 1, 5));
-    assert_eq!(translate_position("test\ntest2", 5), ("test2", 2, 1));
-    assert_eq!(translate_position("test\ntest2\n", 11), ("", 3, 0));
+#[cfg(test)]
+mod tests {
+    use crate::grammar::Grammar;
+    use crate::grammar::parser::{parse_grammar, translate_position};
+    use std::path::Path;
+
+    #[test]
+    fn test_translate() {
+        assert_eq!(translate_position("", 0), ("", 0, 0));
+        assert_eq!(translate_position("test", 0), ("test", 1, 1));
+        assert_eq!(translate_position("test", 3), ("test", 1, 4));
+        assert_eq!(translate_position("test", 4), ("test", 1, 5));
+        assert_eq!(translate_position("test\ntest2", 4), ("test", 1, 5));
+        assert_eq!(translate_position("test\ntest2", 5), ("test2", 2, 1));
+        assert_eq!(translate_position("test\ntest2\n", 11), ("", 3, 0));
+    }
+
+    fn parse(input: &str) -> Result<Grammar, String> {
+        let mut grammar = Grammar::default();
+        parse_grammar(input, &mut grammar, "test", Path::new("test.md"))
+            .map_err(|e| e.to_string())?;
+        Ok(grammar)
+    }
+
+    #[test]
+    fn test_cut() {
+        let input = "Rule -> A ^ B | C";
+        let grammar = parse(input).unwrap();
+        grammar.productions.get("Rule").unwrap();
+    }
+
+    #[test]
+    fn test_cut_fail_trailing() {
+        let input = "Rule -> A ^";
+        let err = parse(input).unwrap_err();
+        assert!(err.contains("expected expression after cut operator"));
+    }
 }
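
Because `parse_cut` calls back into `parse_seq`, the cut takes the whole remainder of the sequence as its operand and stops at the next `|`. A minimal sketch of that shape, assuming whitespace-separated tokens and a hypothetical `Expr` type (none of these names are the tool's actual API, and collapsing a one-element sequence to its element is an assumption here):

```rust
use std::iter::Peekable;

/// Hypothetical, simplified expression tree.
#[derive(Debug, PartialEq)]
enum Expr {
    Nt(String),
    Seq(Vec<Expr>),
    Cut(Box<Expr>),
}

/// Parses names up to `|` or the end of input; `^` swallows the rest.
fn parse_seq<'a, I>(tokens: &mut Peekable<I>) -> Result<Option<Expr>, String>
where
    I: Iterator<Item = &'a str>,
{
    let mut es = Vec::new();
    while let Some(&tok) = tokens.peek() {
        if tok == "|" {
            break;
        }
        tokens.next();
        if tok == "^" {
            // The operand of the cut is the remainder of this sequence.
            let Some(rhs) = parse_seq(tokens)? else {
                return Err("expected expression after cut operator".to_string());
            };
            es.push(Expr::Cut(Box::new(rhs)));
            break;
        }
        es.push(Expr::Nt(tok.to_string()));
    }
    Ok(match es.len() {
        0 => None,
        1 => es.pop(),
        _ => Some(Expr::Seq(es)),
    })
}

/// Convenience wrapper: parses the first sequence of `input`.
fn parse(input: &str) -> Result<Option<Expr>, String> {
    parse_seq(&mut input.split_whitespace().peekable())
}
```

Under these assumptions, `A ^ B C` yields a sequence of `A` followed by a cut whose operand is the sequence `B C`, and a trailing `^` is rejected with the same message the new test checks for.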

tools/mdbook-spec/src/grammar/render_markdown.rs

Lines changed: 5 additions & 0 deletions
@@ -79,6 +79,7 @@ fn last_expr(expr: &Expression) -> &ExpressionKind {
         | ExpressionKind::Comment(_)
         | ExpressionKind::Charset(_)
         | ExpressionKind::NegExpression(_)
+        | ExpressionKind::Cut(_)
         | ExpressionKind::Unicode(_) => &expr.kind,
     }
 }
@@ -171,6 +172,10 @@ fn render_expression(expr: &Expression, cx: &RenderCtx, output: &mut String) {
             output.push('~');
             render_expression(e, cx, output);
         }
+        ExpressionKind::Cut(e) => {
+            output.push_str("^ ");
+            render_expression(e, cx, output);
+        }
         ExpressionKind::Unicode(s) => {
             output.push_str("U+");
             output.push_str(s);

tools/mdbook-spec/src/grammar/render_railroad.rs

Lines changed: 5 additions & 0 deletions
@@ -214,6 +214,11 @@ fn render_expression(expr: &Expression, cx: &RenderCtx, stack: bool) -> Option<B
                 let ch = node_for_nt(cx, "CHAR");
                 Box::new(Except::new(Box::new(ch), n))
             }
+            ExpressionKind::Cut(e) => {
+                let rhs = render_expression(e, cx, stack)?;
+                let lbox = LabeledBox::new(rhs, Comment::new("no backtracking".to_string()));
+                Box::new(lbox)
+            }
             ExpressionKind::Unicode(s) => Box::new(Terminal::new(format!("U+{}", s))),
         };
     }
