Rewrite lexer and parser by Schamper · Pull Request #146 · fox-it/dissect.cstruct

Schamper · 2026-03-03T16:19:09Z

Closes #85, partially #142, and will make #86 and #138 a lot easier to implement. Fixes #149.

This PR will (finally) replace the shoddy C syntax parser I originally wrote many moons ago, when I discovered the existence of re.Scanner and ran with it. This PR aims to add a somewhat decent lexer and separate parser. I'm still not a compsci 1337coder, so this is just what I came up with (with some help) and definitely not a textbook implementation. All feedback is welcome.

New lexer
New C syntax parser that utilizes the new lexer
Expression parser re-uses the new lexer
Reworked how sizeof works in the expression parser, and added offsetof
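
To illustrate the lexer/parser split listed above, here is a minimal sketch of a hand-written lexer that does not depend on re.Scanner. The `Token` dataclass, the token kinds, the `PUNCTUATION` set, and `tokenize` are hypothetical names chosen for illustration; they are not taken from this PR, and the `LexerError` class here is only a stand-in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str   # "IDENT", "NUMBER" or "PUNCT" (hypothetical kinds)
    value: str
    pos: int    # offset in the source, handy for error messages


class LexerError(Exception):
    """Stand-in error type for this sketch."""


PUNCTUATION = set("{}[]();,*=")


def tokenize(source: str) -> list[Token]:
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1
        elif ch.isalpha() or ch == "_":
            # Identifier or keyword: letters, digits and underscores
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            tokens.append(Token("IDENT", source[start:i], start))
        elif ch.isdigit():
            # Number: digits plus letters, so 0x10 stays one token
            start = i
            while i < len(source) and source[i].isalnum():
                i += 1
            tokens.append(Token("NUMBER", source[start:i], start))
        elif ch in PUNCTUATION:
            tokens.append(Token("PUNCT", ch, i))
            i += 1
        else:
            raise LexerError(f"unexpected character {ch!r} at offset {i}")
    return tokens
```

A parser can then consume the token list without caring about whitespace or character-level details, which is the main benefit of keeping the two stages separate.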

The new parser has made changing parsing behavior a lot easier. As such, this PR already makes the following changes:

The new parser is slightly stricter, requiring proper semicolon endings for example. We'll need to fix this in any dissect code that has this.
An important semantic change is how named nested structures are handled. In my infinite wisdom, I originally figured that named nested structures do not "exist" in the top level scope. That's not true, so now named nested structures get properly registered with the cstruct instance:

struct a {
    struct b {
        ...
    };
};

// Will register both `a` and `b`
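
The registration behavior can be sketched as a recursive walk over the parsed definition. `StructDef` and `register` are hypothetical names for illustration only and do not reflect how the cstruct instance stores types internally.

```python
from dataclasses import dataclass, field


@dataclass
class StructDef:
    name: str
    nested: list["StructDef"] = field(default_factory=list)


def register(struct: StructDef, typedefs: dict[str, StructDef]) -> None:
    """Register a struct and every named struct nested inside it."""
    typedefs[struct.name] = struct
    for child in struct.nested:
        register(child, typedefs)


typedefs: dict[str, StructDef] = {}
register(StructDef("a", nested=[StructDef("b")]), typedefs)
print(sorted(typedefs))  # ['a', 'b']
```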

Another important change is how we deal with struct { ... } name;. We used to parse this first as an anonymous struct, then capture name as the structure type name. That's not strictly correct: name is a variable of an anonymous struct, so we now treat it as such. We don't error on this, but rather silently ignore name and skip until we reach a ;.
typedef enum ... is now allowed
Probably some other things I'm forgetting
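
As a rough sketch of the "ignore name and skip until ;" handling for struct { ... } name;, assuming the input has already been split into a flat list of token strings. `skip_declarator` is a hypothetical helper, not the parser's actual method.

```python
def skip_declarator(tokens: list[str], pos: int) -> int:
    """Return the index just past the next ';', ignoring everything before it."""
    while pos < len(tokens) and tokens[pos] != ";":
        pos += 1
    if pos == len(tokens):
        # A stricter parser reports the missing semicolon instead of guessing
        raise ValueError("expected ';' after struct definition")
    return pos + 1


# Tokens for: struct { int x; } name; int z;
tokens = ["struct", "{", "int", "x", ";", "}", "name", ";", "int", "z", ";"]
after_brace = tokens.index("}") + 1
next_pos = skip_declarator(tokens, after_brace)
print(tokens[next_pos])  # int
```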

This probably warrants a major version bump, so maybe good to pair this with #114, #144 and what we discussed in #142.

codecov · 2026-03-03T16:25:39Z

Codecov Report

❌ Patch coverage is 0% with 761 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (9f12b1e) to head (c75df78).

Files with missing lines	Patch %	Lines
dissect/cstruct/lexer.py	0.00%	341 Missing ⚠️
dissect/cstruct/parser.py	0.00%	293 Missing ⚠️
dissect/cstruct/expression.py	0.00%	111 Missing ⚠️
dissect/cstruct/utils.py	0.00%	12 Missing ⚠️
dissect/cstruct/cstruct.py	0.00%	2 Missing ⚠️
dissect/cstruct/exceptions.py	0.00%	2 Missing ⚠️

Additional details and impacted files

@@          Coverage Diff           @@
##            main    #146    +/-   ##
======================================
  Coverage   0.00%   0.00%            
======================================
  Files         21      22     +1     
  Lines       2482    2594   +112     
======================================
- Misses      2482    2594   +112

Flag	Coverage Δ
unittests	`0.00% <0.00%> (ø)`

codspeed-hq · 2026-03-03T16:27:44Z

Merging this PR will not alter performance

⚠️

Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 2 improved benchmarks
❌ 1 regressed benchmark
✅ 9 untouched benchmarks
🆕 2 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Benchmark	`BASE`	`HEAD`	Efficiency
❌	`test_benchmark_expression_evaluate`	78.2 µs	127.8 µs	-38.83%
⚡	`test_benchmark_expression_parse`	344.4 µs	263.8 µs	+30.54%
⚡	`test_benchmark_lexer_and_parser`	15.7 ms	12.9 ms	+21.29%
🆕	`test_benchmark_parser`	N/A	10.5 ms	N/A
🆕	`test_benchmark_lexer`	N/A	2.4 ms	N/A
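
The Efficiency column appears consistent with base / head - 1. That formula is inferred from the reported figures, not stated in the report, and small differences are expected because the displayed timings are rounded. A quick check:

```python
def efficiency(base: float, head: float) -> float:
    """Relative change in percent, assuming efficiency = base / head - 1."""
    return (base / head - 1) * 100


# (BASE, HEAD) timings copied from the table above
rows = {
    "test_benchmark_expression_evaluate": (78.2, 127.8),  # µs
    "test_benchmark_expression_parse": (344.4, 263.8),    # µs
    "test_benchmark_lexer_and_parser": (15.7, 12.9),      # ms
}
for name, (base, head) in rows.items():
    print(f"{name}: {efficiency(base, head):+.1f}%")
```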

Comparing rewrite-parser (c75df78) with main (9f12b1e)

Schamper · 2026-03-03T16:30:24Z

@sMezaOrellana would be interested in your thoughts on these changes!

twiggler

Architecture looks ok, found 2 possible issues TBC

Migrated-from: fox-it/dissect.cstruct#146

twiggler · 2026-04-16T14:40:28Z

This PR has been migrated to the dissect monorepo: twiggler/dissect-monorepo-test#5

The original diff and commit history have been preserved on the migrate/dissect.cstruct/pr-146 branch.

Schamper · 2026-04-16T21:08:57Z

@twiggler I was a bit sad that your message was just this test, as I thought maybe you left more comments 😉.

Migrated-from: fox-it/dissect.cstruct#146

twiggler

LGTM

Since it is an impactful rewrite I asked @Miauwkeru to QA

Miauwkeru · 2026-04-28T10:50:39Z

I checked our other projects to see what breaks, and found some regressions where it crashed unexpectedly.

It cannot evaluate a ternary operation, which occurs in dissect.vmfs: evaluating it raises a LexerError because the lexer doesn't understand the ? token. While I am aware that those definitions don't function yet, it probably shouldn't crash when trying to evaluate them.

#define func(x) ( x ? 1 : 0 )

Without a ternary it makes it into a string:

# define func(x)    ( x == 0)
>>> c_struct.func
' ( x ) ( x = = 0)'

Another issue I found was when it was trying to create a define using a flag in dissect.apfs. Here it couldn't evaluate the . after INODE in the definition.

flag INODE {
  ...
};

#define APFS_INODE_PINNED_MASK            INODE.PINNED_TO_MAIN | INODE.PINNED_TO_TIER2

I'll look through the code now to see whether I can find some additional issues.
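
For reference, the intended meaning of the APFS define can be sketched with Python's `enum.IntFlag`. The member values below are hypothetical (the real flag body is elided above), and `evaluate_define` is an illustrative helper, not the expression parser from this PR.

```python
from enum import IntFlag


class INODE(IntFlag):
    # Hypothetical values; the real members are elided in the report above
    PINNED_TO_MAIN = 0x1
    PINNED_TO_TIER2 = 0x2


def evaluate_define(expression: str, names: dict[str, object]) -> int:
    """Resolve `NAME.MEMBER | NAME.MEMBER` style expressions (illustrative only)."""
    result = 0
    for term in expression.split("|"):
        type_name, _, member = term.strip().partition(".")
        result |= int(getattr(names[type_name], member))
    return result


mask = evaluate_define("INODE.PINNED_TO_MAIN | INODE.PINNED_TO_TIER2", {"INODE": INODE})
print(hex(mask))  # 0x3
```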

Schamper · 2026-04-29T12:51:47Z

@Miauwkeru fixed.

Miauwkeru · 2026-05-07T14:48:12Z

Only found one more thing that might be an oddity:

#define ADCRYPT_MAGIC ADCRYPT\00

This now fails to parse due to the \00 at the end, which is fixed by using quotes. I don't know if this was an intended change, though.

Schamper · 2026-05-07T15:46:37Z

Not necessarily intended. What do you think would be reasonable behavior in this case? (also ping @JSCU-CNI)

Miauwkeru · 2026-05-11T09:09:28Z

Not necessarily intended. What do you think would be reasonable behavior in this case? (also ping @JSCU-CNI)

I think it makes most sense to go the C route. In the example I gave:

#define ADCRYPT_MAGIC ADCRYPT\00

C wouldn't compile it as it would expect ADCRYPT to be another definition or something else it can resolve. (Besides not knowing what to do with the \0). So I think it would be better if we require strings to be explicitly quoted.

What do you think? @Schamper @JSCU-CNI

Schamper · 2026-05-11T09:43:04Z

I was leaning in that direction too.

JSCU-CNI · 2026-05-13T12:11:03Z

Agreed. Something like #define ADCRYPT_MAGIC "ADCRYPT\00" or #define ADCRYPT_MAGIC b"ADCRYPT\x00" would make sense.

Schamper · 2026-05-15T13:39:01Z

I changed a bit how #define values are handled with 3a37ab0. Feedback is welcome.

Both ways now actually work.

Miauwkeru · 2026-05-18T09:28:21Z

+    #define RAW somevalue
+    #define STR "hello"
+    #define BYTES b"world"
+    #define NULLRAW ADCRYPT\00


Don't we want it to be explicitly quoted so that it fails on this kind of definition?

Schamper · 2026-05-18

My "new approach" (which is basically, don't tokenize anything after #define NAME, but just take its raw value until the end of the line) allows this to work again. I think as long as it's unit tested, it should be fine to keep in.

The reason why I slightly prefer this new approach is so that in the parser, we get a more "true" representation of what the define value actually is, including spacing and such. The downside being that the parser now has to deal a little bit with some basic string parsing.
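
A minimal sketch of that idea, assuming the lexer hands over the raw text after `#define NAME` and the parser classifies it afterwards. `parse_define_value` is a hypothetical helper rather than the code from 3a37ab0, and it deliberately does not interpret escape sequences or escaped quotes.

```python
def parse_define_value(raw: str):
    """Classify the raw text after `#define NAME` (hypothetical helper)."""
    value = raw.strip()
    is_bytes = value.startswith('b"')
    body = value[1:] if is_bytes else value
    if body.startswith('"'):
        if len(body) < 2 or not body.endswith('"'):
            raise ValueError("unterminated string literal")
        text = body[1:-1]
        return text.encode("latin-1") if is_bytes else text
    # Anything else is kept verbatim, internal spacing included
    return value


print(parse_define_value(' "hello"'))      # hello
print(parse_define_value(' b"world"'))     # b'world'
print(parse_define_value(" ADCRYPT\\00"))  # ADCRYPT\00
```

Keeping the unquoted case as plain text is what lets a define such as ADCRYPT\00 parse again, while quoted values still become real strings or bytes.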

Miauwkeru · 2026-06-01T13:43:42Z

+                    raise self._error("unterminated string literal", token=token)
+
+                # Remove the surrounding and any duplicate quotes
+                value = "".join(ch for ch in value if ch != quote)


It is pretty much an edge case, but with for example the following define:
#define test "\'\"a'b\""

Dissect evaluates it to "'a'b", whereas a C compiler turns it into "'\"a'b\"" (checked with godbolt).

Another note: adding a \n to the string also raises LexerError: unexpected end of input.

Schamper · 2026-06-03

This is moreso a limitation of Python than cstruct I think:

In [10]: """abc\"\'\n"""[3]
Out[10]: '"'
In [11]: """abc\"\'\n"""[4]
Out[11]: "'"
In [12]: """abc\"\'\n"""[5]
Out[12]: '\n'

There's no way for us to see the escape characters unless you make it a raw string (r""").

The proper way to do it would be to double escape. I've added a test for that and a fix to make sure it works.
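
The difference between the three spellings can be checked in plain Python, independent of cstruct. The define text below is only an example; the variable names are illustrative.

```python
cooked = """#define test "a\nb" """    # Python turns \n into a real newline
raw = r"""#define test "a\nb" """      # backslash and 'n' both survive
doubled = """#define test "a\\nb" """  # double escaping: same text as the raw string

print(len(cooked.splitlines()))    # 2
print(raw == doubled)              # True
print("\\n" in raw, "\n" in raw)   # True False
```

With the cooked string the definition text handed to the parser contains an actual line break inside the string literal, which is presumably why the lexer reports an unexpected end of input.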

Co-authored-by: Copilot <copilot@github.com>

Miauwkeru

LGTM, want to release it as 5.0? Then we don't have to update every project immediately.
Otherwise we might need to already have the patch sets ready before it gets merged in.

Schamper · 2026-06-08T13:34:56Z

Yes, see the OP:

This probably warrants a major version bump, so maybe good to pair this with #114, #144 and what we discussed in #142.

How would we merge this though? Merging now would result in a 4.7.dev and all CI failing.

Miauwkeru · 2026-06-08T14:32:41Z

How would we merge this though? Merging now would result in a 4.7.dev and all CI failing.

We can yank/delete the 4.7.dev# it creates on PyPI. Ofc not ideal, but it will keep the CI from failing.
In some cases we might also need to invalidate some caches on projects.

There is also the option to update the publish step to only run when you tag it, or to disable publishing entirely.

Schamper · 2026-06-10T17:09:38Z

Temporarily disabling the publish step is probably the easiest.

Schamper force-pushed the rewrite-parser branch from 1bb3aac to 5f43faa Compare March 3, 2026 16:24

Schamper mentioned this pull request Mar 17, 2026

ValueError: Cannot use capturing groups in re.Scanner on Python 3.15 #149

Open

Schamper force-pushed the rewrite-parser branch from 5f43faa to 16e6b8b Compare March 24, 2026 13:21

twiggler requested changes Apr 1, 2026

Comment thread dissect/cstruct/expression.py

Comment thread dissect/cstruct/parser.py Outdated

Schamper requested a review from twiggler April 13, 2026 16:01

twiggler pushed a commit to twiggler/dissect-monorepo-test that referenced this pull request Apr 16, 2026

Rewrite lexer and parser

82b6163

Migrated-from: fox-it/dissect.cstruct#146

twiggler pushed a commit to twiggler/dissect-monorepo-test that referenced this pull request Apr 16, 2026

Process review feedback

054f448

Migrated-from: fox-it/dissect.cstruct#146

twiggler mentioned this pull request Apr 16, 2026

[migrated] Rewrite lexer and parser twiggler/dissect-monorepo-test#5

Closed

twiggler pushed a commit to twiggler/dissect-monorepo-test that referenced this pull request Apr 20, 2026

Rewrite lexer and parser

db65c16

Migrated-from: fox-it/dissect.cstruct#146

twiggler pushed a commit to twiggler/dissect-monorepo-test that referenced this pull request Apr 20, 2026

Process review feedback

90c6fe4

Migrated-from: fox-it/dissect.cstruct#146

twiggler mentioned this pull request Apr 20, 2026

[migrated] Rewrite lexer and parser twiggler/dissect-monorepo-test#6

Closed

twiggler pushed a commit to twiggler/dissect-monorepo-test that referenced this pull request Apr 20, 2026

Rewrite lexer and parser

065d903

Migrated-from: fox-it/dissect.cstruct#146

twiggler pushed a commit to twiggler/dissect-monorepo-test that referenced this pull request Apr 20, 2026

Process review feedback

14d3ab1

Migrated-from: fox-it/dissect.cstruct#146

twiggler mentioned this pull request Apr 20, 2026

[migrated] Rewrite lexer and parser twiggler/dissect-monorepo-test#7

Draft

twiggler requested changes Apr 22, 2026

Comment thread dissect/cstruct/utils.py Outdated

Comment thread dissect/cstruct/parser.py Outdated

Comment thread dissect/cstruct/cstruct.py Outdated

Comment thread dissect/cstruct/parser.py Outdated

Comment thread dissect/cstruct/lexer.py Outdated

Schamper requested a review from twiggler April 22, 2026 12:59

twiggler requested a review from Miauwkeru April 23, 2026 11:26

twiggler previously approved these changes Apr 23, 2026

Miauwkeru requested changes Apr 28, 2026

Comment thread dissect/cstruct/lexer.py Outdated

Comment thread tests/test_lexer.py

Comment thread dissect/cstruct/lexer.py

Comment thread tests/test_lexer.py Outdated

Schamper dismissed twiggler’s stale review via f95b499 April 29, 2026 09:29

Schamper requested a review from Miauwkeru April 29, 2026 12:51

Miauwkeru reviewed May 7, 2026

Comment thread tests/test_parser.py

Schamper force-pushed the rewrite-parser branch from f989504 to 0d4ec4d Compare May 7, 2026 11:23

Miauwkeru reviewed May 18, 2026

Miauwkeru reviewed Jun 1, 2026

Schamper force-pushed the rewrite-parser branch from ece2581 to 6b61425 Compare June 3, 2026 22:33

Schamper and others added 12 commits June 4, 2026 00:36

Rewrite lexer and parser

4777318

Process review feedback

1c9793a

Process review feedback

b213cad

Process review feedback

c5b1eb8

Co-authored-by: Copilot <copilot@github.com>

Process feedback

91b0299

Co-authored-by: Copilot <copilot@github.com>

Address review feedback

c0f5a4b

Merge _read_while and _read_until

89736f3

Fix docs error

c2bdfcd

Different approach for conditional reading

6bae80a

Fix linter

5b2a382

Change how #define values are handled

86e3651

Small tweak in string define parsing

5eb5424

Schamper force-pushed the rewrite-parser branch from 6b61425 to 5eb5424 Compare June 3, 2026 22:36

Miauwkeru previously approved these changes Jun 8, 2026

Temporarily disable publishing on push to main

c75df78

Schamper dismissed Miauwkeru’s stale review via c75df78 June 10, 2026 17:10

Schamper requested a review from Miauwkeru June 10, 2026 17:10

Miauwkeru approved these changes Jun 11, 2026
