Skip to content

Rewrite lexer and parser#146

Open
Schamper wants to merge 13 commits into
mainfrom
rewrite-parser
Open

Rewrite lexer and parser#146
Schamper wants to merge 13 commits into
mainfrom
rewrite-parser

Conversation

@Schamper

@Schamper Schamper commented Mar 3, 2026

Copy link
Copy Markdown
Member

Closes #85, partially #142, and will make #86 and #138 a lot easier to implement. Fixes #149.

This PR will (finally) replace the shoddy C syntax parser I originally wrote many moons ago, when I discovered the existence of re.Scanner and ran with it. This PR aims to add a somewhat decent lexer and separate parser. I'm still not a compsci 1337coder, so this is just what I came up with (with some help) and definitely not a textbook implementation. All feedback is welcome.

  • New lexer
  • New C syntax parser that utilizes the new lexer
  • Expression parser re-uses the new lexer
  • Reworked how sizeof works in the expression parser, and added offsetof

The new parser has made changing parsing behavior a lot easier. As such, this PR already makes the following changes:

  • The new parser is slightly stricter, requiring proper semicolon endings for example. We'll need to fix this in any dissect code that has this.

  • An important semantic change is how named nested structures are handled. In my infinite wisdom, I originally figured that named nested structures do not "exist" in the top level scope. That's not true, so now named nested structures get properly registered with the cstruct instance:

struct a {
    struct b {
        ...
    };
};

// Will register both `a` and `b`
  • Another important change is how we deal with struct { ... } name;. We used to parse this first as an anonymous struct, then capture name as the structure type name. That's not strictly correct, name is a variable of an anonymous unnamed struct, so we now treat it as such. We don't error on this, but rather we silently ignore name and skip until we reach a ;
  • typedef enum ... is now allowed
  • Probably some other things I'm forgetting

This probably warrants a major version bump, so maybe good to pair this with #114, #144 and what we discussed in #142.

@codecov

codecov Bot commented Mar 3, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 761 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (9f12b1e) to head (c75df78).

Files with missing lines Patch % Lines
dissect/cstruct/lexer.py 0.00% 341 Missing ⚠️
dissect/cstruct/parser.py 0.00% 293 Missing ⚠️
dissect/cstruct/expression.py 0.00% 111 Missing ⚠️
dissect/cstruct/utils.py 0.00% 12 Missing ⚠️
dissect/cstruct/cstruct.py 0.00% 2 Missing ⚠️
dissect/cstruct/exceptions.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##            main    #146    +/-   ##
======================================
  Coverage   0.00%   0.00%            
======================================
  Files         21      22     +1     
  Lines       2482    2594   +112     
======================================
- Misses      2482    2594   +112     
Flag Coverage Δ
unittests 0.00% <0.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@codspeed-hq

codspeed-hq Bot commented Mar 3, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 2 improved benchmarks
❌ 1 regressed benchmark
✅ 9 untouched benchmarks
🆕 2 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Benchmark BASE HEAD Efficiency
test_benchmark_expression_evaluate 78.2 µs 127.8 µs -38.83%
test_benchmark_expression_parse 344.4 µs 263.8 µs +30.54%
test_benchmark_lexer_and_parser 15.7 ms 12.9 ms +21.29%
🆕 test_benchmark_parser N/A 10.5 ms N/A
🆕 test_benchmark_lexer N/A 2.4 ms N/A

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing rewrite-parser (c75df78) with main (9f12b1e)

Open in CodSpeed

@Schamper

Schamper commented Mar 3, 2026

Copy link
Copy Markdown
Member Author

@sMezaOrellana would be interested in your thoughts on these changes!

@twiggler twiggler left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Architecure looks ok, found 2 possible issues TBC

Comment thread dissect/cstruct/expression.py
Comment thread dissect/cstruct/parser.py Outdated
@Schamper Schamper requested a review from twiggler April 13, 2026 16:01
twiggler pushed a commit to twiggler/dissect-monorepo-test that referenced this pull request Apr 16, 2026
twiggler pushed a commit to twiggler/dissect-monorepo-test that referenced this pull request Apr 16, 2026
@twiggler

Copy link
Copy Markdown
Contributor

This PR has been migrated to the dissect monorepo: twiggler/dissect-monorepo-test#5

The original diff and commit history have been preserved on the migrate/dissect.cstruct/pr-146 branch.

@Schamper

Copy link
Copy Markdown
Member Author

@twiggler I was a bit sad that your message was just this test, as I thought maybe you left more comments 😉.

twiggler pushed a commit to twiggler/dissect-monorepo-test that referenced this pull request Apr 20, 2026
twiggler pushed a commit to twiggler/dissect-monorepo-test that referenced this pull request Apr 20, 2026
twiggler pushed a commit to twiggler/dissect-monorepo-test that referenced this pull request Apr 20, 2026
twiggler pushed a commit to twiggler/dissect-monorepo-test that referenced this pull request Apr 20, 2026
Comment thread dissect/cstruct/utils.py Outdated
Comment thread dissect/cstruct/parser.py Outdated
Comment thread dissect/cstruct/cstruct.py Outdated
Comment thread dissect/cstruct/parser.py Outdated
Comment thread dissect/cstruct/lexer.py Outdated
@Schamper Schamper requested a review from twiggler April 22, 2026 12:59
@twiggler twiggler requested a review from Miauwkeru April 23, 2026 11:26
twiggler
twiggler previously approved these changes Apr 23, 2026

@twiggler twiggler left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Since it is an impactful rewrite I asked @Miauwkeru to QA

@Miauwkeru

Copy link
Copy Markdown
Contributor

I did some checking on our other projects to see what goes wrong, and found some regressions regarding to where it crashed unexpectedly.

  1. it cannot evaluate a ternary operation, which occurred in dissect.vmfs and gives a LexerError when trying to evaluate it as it doesn't understand the ? token. While I am aware that those definitions don't function yet, it probably shouldn't crash when trying to evaluate it
#define func(x) ( x ? 1 : 0 )

Without a ternary it makes it into a string:

# define func(x)    ( x == 0)
>>> c_struct.func
' ( x ) ( x = = 0)`
  1. Another issue I found was when it was trying to create define a flag in dissect.apfs. Here it couldn't evaluate the . after the INODE in the definition.
flag INODE {
  ...
};

#define APFS_INODE_PINNED_MASK            INODE.PINNED_TO_MAIN | INODE.PINNED_TO_TIER2

I'll look through the code now to see whether I can find some additional issues.

Comment thread dissect/cstruct/lexer.py Outdated
Comment thread tests/test_lexer.py
Comment thread dissect/cstruct/lexer.py
Comment thread tests/test_lexer.py Outdated
@Schamper

Copy link
Copy Markdown
Member Author

@Miauwkeru fixed.

Comment thread tests/test_parser.py
@Miauwkeru

Copy link
Copy Markdown
Contributor

Only found one more thing that might be an oddity:

#define ADCRYPT_MAGIC ADCRYPT\00

This now fails to parse due to the \00 at the end. Which is fixed by using quotes. I don't know if this was an intended change tho

@Schamper

Schamper commented May 7, 2026

Copy link
Copy Markdown
Member Author

Not necessarily intended. What do you think would be reasonable behavior in this case? (also ping @JSCU-CNI)

@Miauwkeru

Copy link
Copy Markdown
Contributor

Not necessarily intended. What do you think would be reasonable behavior in this case? (also ping @JSCU-CNI)

I think it makes most sense to go the C route. In the example I gave:

#define ADCRYPT_MAGIC ADCRYPT\00

C wouldn't compile it as it would expect ADCRYPT to be another definition or something else it can resolve. (Besides not knowing what to do with the \0). So I think it would be better if we require strings to be explicitly quoted.

What do you think? @Schamper @JSCU-CNI

@Schamper

Copy link
Copy Markdown
Member Author

I was leaning in that direction too.

@JSCU-CNI

Copy link
Copy Markdown
Contributor

Agreed. Something like #define ADCRYPT_MAGIC "ADCRYPT\00" or #define ADCRYPT_MAGIC b"ADCRYPT\x00" would make sense.

@Schamper

Copy link
Copy Markdown
Member Author

I changed a bit how #define values are handled with 3a37ab0. Feedback is welcome.

Both ways now actually work.

Comment thread tests/test_parser.py
#define RAW somevalue
#define STR "hello"
#define BYTES b"world"
#define NULLRAW ADCRYPT\00

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want it to be explicitly quoted so that it fails on this kind of definition?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My "new approach" (which is basically, don't tokenize anything after #define NAME, but just take its raw value until the end of the line) allows this to work again. I think as long as it's unit tested, it should be fine to keep in.

The reason why I slightly prefer this new approach is so that in the parser, we get a more "true" representation of that the define value actually is, including spacing and such. The downside being that the parser now has to deal a little bit with some basic string parsing.

Comment thread dissect/cstruct/parser.py Outdated
raise self._error("unterminated string literal", token=token)

# Remove the surrounding and any duplicate quotes
value = "".join(ch for ch in value if ch != quote)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is pretty much an edge case, but with for example the following define:
#define test "\'\"a'b\""

Dissect evaluates it to "'a'b" where a c compiler turns it into "'\"a'b\"" (checked with godbolt)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and another note, adding a \n to the string will also create LexerError: unexpected end of input

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moreso a limitation of Python than cstruct I think:

In [10]: """abc\"\'\n"""[3]
Out[10]: '"'

In [11]: """abc\"\'\n"""[4]
Out[11]: "'"

In [12]: """abc\"\'\n"""[5]
Out[12]: '\n'

There's no way for us to see the escape characters unless you make it a raw string (r""").

The proper way to do it would be to double escape. I've added a test for that and a fix to make sure it works.

Miauwkeru
Miauwkeru previously approved these changes Jun 8, 2026

@Miauwkeru Miauwkeru left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, want to release it as 5.0? Then we don't have to update every project immediately.
Otherwise we might need to already have the patch sets ready before it gets merged in.

@Schamper

Schamper commented Jun 8, 2026

Copy link
Copy Markdown
Member Author

Yes, see the OP:

This probably warrants a major version bump, so maybe good to pair this with #114, #144 and what we discussed in #142.

How would we merge this though? Merging now would result in a 4.7.dev and all CI failing.

@Miauwkeru

Miauwkeru commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

How would we merge this though? Merging now would result in a 4.7.dev and all CI failing.

We can yank/delete the 4.7.dev# it creates on pypi. Ofc not ideal, but it will avoid the CI from falling.
In some cases we might also need to invalidate some caches on projects.

There is also the option to update the publish step to only work when you tag it or disable publishing it in its entirety

@Schamper

Copy link
Copy Markdown
Member Author

Temporarily disabling the publish step is probably the easiest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ValueError: Cannot use capturing groups in re.Scanner on Python 3.15 Incorrect parsing of multiple declarators in a single declaration

4 participants