Commit d82d97f

refactor(compiler): refactor fory compiler into hierarchical architecture (#3179)
## Why?

The current Fory compiler mixes FDL-native and protobuf-compatible syntax handling in a single parser, making it difficult to add support for new IDL formats like .proto and .fbs files. The validation logic is scattered across parsing and Schema.validate(), and there's no clear separation between parsing, semantic analysis, and code generation.

## What does this PR do?

This PR refactors the Fory compiler into a hierarchical, multi-frontend architecture that establishes the Fory IDL AST as the **canonical intermediate representation (IR)**, with separate frontend parsers for different IDL formats.

**Key changes:**

1. **New directory structure** with clear separation of concerns:
   - `ir/` - Intermediate Representation (canonical Fory AST)
     - `ast.py` - Core AST node definitions with `SourceLocation` tracking
     - `types.py` - Extended type system (primitives including varint, tagged types, etc.)
     - `validator.py` - Centralized semantic validation
     - `emitter.py` - FDL text emitter for debugging translated schemas
   - `frontend/` - IDL Frontends
     - `base.py` - Base frontend interface
     - `fdl/` - FDL Frontend (lexer + parser)
     - `proto/` - Protobuf Frontend (lexer + parser + translator to Fory IR)
     - `fbs/` - FlatBuffers Frontend (placeholder)
2. **Proto3 frontend** - Full support for parsing `.proto` files and translating to Fory IR:
   - Proto3 syntax parsing (messages, enums, nested types, maps, repeated fields)
   - Type mapping (int32→var_uint32, sint32→varint32, fixed32→uint32, etc.)
   - Fory extension options (`(fory).id`, `(fory).ref`, `(fory).nullable`, etc.)
   - Well-known types support (google.protobuf.Timestamp, Duration)
3. **Simplified FDL syntax** - Removed protobuf-style `(fory)` prefix from options:
   - File options: `option use_record_for_java_message = true;`
   - Type options: `message Foo [id=100] { ... }`
   - Field options: `MyType data = 1 [ref=true, nullable=true];`
4. **Extended type system** with new primitive kinds:
   - Signed/unsigned variants: `int8`-`int64`, `uint8`-`uint64`
   - Variable-length encoding: `varint32`, `varint64`, `var_uint32`, `var_uint64`
   - Tagged types: `tagged_int64`, `tagged_uint64`
   - Additional types: `float16`, `duration`, `decimal`
5. **Improved code generators** for all target languages with better type mapping
6. **CLI enhancements**:
   - Auto-detect input format by file extension (`.fdl`, `.proto`)
   - New `--emit-fdl` flag to output translated FDL for debugging
7. **Cross-language integration tests** for proto-based schemas

## Related issues

Closes #3178

## Does this PR introduce any user-facing change?

- CLI now accepts `.proto` files directly (in addition to `.fdl`)
- FDL syntax simplified: `option (fory).xxx` → `option xxx`
- New primitive types available in FDL

- [x] Does this PR introduce any public API change?
- [ ] Does this PR introduce any binary protocol compatibility change?

## Benchmark

N/A - This is a compiler refactoring that doesn't affect runtime performance.
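
The multi-frontend layout described in the commit message can be pictured with a minimal Python sketch. It assumes each frontend exposes a single `parse` method that returns the canonical schema, and that the CLI picks a frontend from the file extension. `SchemaStub`, `Frontend`, `StubFDLFrontend`, `StubProtoFrontend`, `FRONTENDS`, and `pick_frontend` are hypothetical names used only for illustration; they are not taken from the Fory code base, and the real interfaces may differ.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Protocol


@dataclass
class SchemaStub:
    """Stand-in for the canonical IR root (hypothetical, not the real ir/ast.py node)."""

    source_format: str
    messages: List[str] = field(default_factory=list)


class Frontend(Protocol):
    """Hypothetical frontend contract: source text in, canonical IR out."""

    def parse(self, source: str, filename: str) -> SchemaStub: ...


class StubFDLFrontend:
    """Placeholder for an FDL lexer + parser."""

    def parse(self, source: str, filename: str) -> SchemaStub:
        return SchemaStub(source_format="fdl")


class StubProtoFrontend:
    """Placeholder for a proto3 lexer + parser + translator to the IR."""

    def parse(self, source: str, filename: str) -> SchemaStub:
        return SchemaStub(source_format="proto")


# Extension-to-frontend table (assumed shape of the auto-detection step).
FRONTENDS: Dict[str, Frontend] = {
    ".fdl": StubFDLFrontend(),
    ".proto": StubProtoFrontend(),
}


def pick_frontend(filename: str) -> Frontend:
    """Select a frontend by file extension, case-insensitively."""
    suffix = Path(filename).suffix.lower()
    try:
        return FRONTENDS[suffix]
    except KeyError:
        raise ValueError(f"unsupported schema format: {suffix!r}") from None


if __name__ == "__main__":
    schema = pick_frontend("example.proto").parse("", "example.proto")
    print(schema.source_format)
```

Because every frontend returns the same IR type, validation and code generation only need to understand one data model, which is the separation the refactor is aiming for.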
1 parent 99585af commit d82d97f

51 files changed

Lines changed: 3475 additions & 1072 deletions


.github/sync.yml

Lines changed: 2 additions & 0 deletions
@@ -20,5 +20,7 @@ apache/fory-site@main:
   dest: docs/guide/
 - source: docs/specification/
   dest: docs/specification/
+- source: docs/compiler/
+  dest: docs/compiler/
 - source: docs/benchmarks/
   dest: static/img/benchmarks/

benchmarks/cpp_benchmark/README.md

Lines changed: 34 additions & 6 deletions
@@ -31,12 +31,12 @@ Note: Protobuf is fetched automatically via CMake FetchContent, so no manual ins

 | Datatype     | Operation   | Fory TPS   | Protobuf TPS | Faster      |
 | ------------ | ----------- | ---------- | ------------ | ----------- |
-| Mediacontent | Serialize   | 2,430,924  | 484,368      | Fory (5.0x) |
-| Mediacontent | Deserialize | 740,074    | 387,522      | Fory (1.9x) |
-| Sample       | Serialize   | 4,813,270  | 3,021,968    | Fory (1.6x) |
-| Sample       | Deserialize | 915,554    | 684,675      | Fory (1.3x) |
-| Struct       | Serialize   | 18,105,957 | 5,788,186    | Fory (3.1x) |
-| Struct       | Deserialize | 7,495,726  | 5,932,982    | Fory (1.3x) |
+| Mediacontent | Serialize   | 11,319,876 | 1,181,595    | Fory (9.6x) |
+| Mediacontent | Deserialize | 2,729,388  | 835,956      | Fory (3.3x) |
+| Sample       | Serialize   | 16,899,403 | 10,575,760   | Fory (1.6x) |
+| Sample       | Deserialize | 3,079,241  | 1,450,789    | Fory (2.1x) |
+| Struct       | Serialize   | 43,184,198 | 29,359,454   | Fory (1.5x) |
+| Struct       | Deserialize | 54,599,691 | 38,796,674   | Fory (1.4x) |

 ## Quick Start
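
The `Faster` column in the benchmark table above is consistent with dividing the larger throughput by the smaller one and rounding to one decimal place. The sketch below reproduces that arithmetic for the updated rows. `speedup_label` is a hypothetical helper, and the `Protobuf (...)` branch is an assumption, since every row in the table favours Fory.

```python
def speedup_label(fory_tps: int, protobuf_tps: int) -> str:
    """Format the faster serializer and its throughput ratio, e.g. 'Fory (9.6x)'."""
    if fory_tps >= protobuf_tps:
        return f"Fory ({fory_tps / protobuf_tps:.1f}x)"
    # Assumed labelling for the opposite case; it does not occur in the table.
    return f"Protobuf ({protobuf_tps / fory_tps:.1f}x)"


# Updated figures copied from the table: (datatype, operation, Fory TPS, Protobuf TPS).
ROWS = [
    ("Mediacontent", "Serialize", 11_319_876, 1_181_595),
    ("Mediacontent", "Deserialize", 2_729_388, 835_956),
    ("Sample", "Serialize", 16_899_403, 10_575_760),
    ("Sample", "Deserialize", 3_079_241, 1_450_789),
    ("Struct", "Serialize", 43_184_198, 29_359_454),
    ("Struct", "Deserialize", 54_599_691, 38_796_674),
]

if __name__ == "__main__":
    for datatype, operation, fory_tps, protobuf_tps in ROWS:
        print(f"{datatype:<12} {operation:<11} {speedup_label(fory_tps, protobuf_tps)}")
```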

@@ -47,6 +47,34 @@ cd benchmarks/cpp_benchmark
 ./run.sh
 ```

+### Run Options
+
+```bash
+./run.sh --help
+
+Options:
+  --data <struct|sample> Filter benchmark by data type
+  --serializer <fory|protobuf> Filter benchmark by serializer
+  --duration <seconds> Minimum time to run each benchmark (e.g., 10, 30)
+  --debug Build with debug symbols for profiling
+```
+
+Examples:
+
+```bash
+# Run only Struct benchmarks
+./run.sh --data struct
+
+# Run only Fory benchmarks
+./run.sh --serializer fory
+
+# Run each benchmark for at least 10 seconds (for more stable results)
+./run.sh --duration 10
+
+# Combine options
+./run.sh --data struct --serializer fory --duration 5
+```
+
 ## Building

 ```bash

benchmarks/cpp_benchmark/run.sh

Lines changed: 11 additions & 0 deletions
@@ -32,6 +32,7 @@ JOBS=16
 DATA=""
 SERIALIZER=""
 DEBUG_BUILD=false
+DURATION=""

 # Parse arguments
 usage() {
@@ -42,6 +43,7 @@ usage() {
   echo "Options:"
   echo " --data <struct|sample> Filter benchmark by data type"
   echo " --serializer <fory|protobuf> Filter benchmark by serializer"
+  echo " --duration <seconds> Minimum time to run each benchmark (e.g., 10, 30)"
   echo " --debug Build with debug symbols and low optimization for profiling"
   echo " --help Show this help message"
   echo ""
@@ -50,6 +52,7 @@ usage() {
   echo " $0 --data struct # Run only Struct benchmarks"
   echo " $0 --serializer fory # Run only Fory benchmarks"
   echo " $0 --data struct --serializer fory"
+  echo " $0 --duration 10 # Run each benchmark for at least 10 seconds"
   echo " $0 --debug # Build for profiling (visible function names in flamegraph)"
   echo ""
   echo "For profiling/flamegraph, use: ./profile.sh"
@@ -66,6 +69,10 @@ while [[ $# -gt 0 ]]; do
       SERIALIZER="$2"
       shift 2
       ;;
+    --duration)
+      DURATION="$2"
+      shift 2
+      ;;
     --debug)
       DEBUG_BUILD=true
       shift
@@ -125,6 +132,10 @@ echo ""
 # Step 2: Run benchmark
 echo -e "${YELLOW}[2/3] Running benchmark...${NC}"
 BENCH_ARGS="--benchmark_format=json --benchmark_out=benchmark_results.json"
+if [[ -n "$DURATION" ]]; then
+  BENCH_ARGS="$BENCH_ARGS --benchmark_min_time=${DURATION}s"
+  echo -e "Duration: ${DURATION}s per benchmark"
+fi
 if [[ -n "$FILTER" ]]; then
   BENCH_ARGS="$BENCH_ARGS --benchmark_filter=$FILTER"
   echo -e "Filter: ${FILTER}"
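
For readers who drive the benchmark binary from Python rather than through `run.sh`, the argument assembly in the last hunk can be mirrored roughly as follows. This is a sketch, not part of the commit: `build_bench_args` is a hypothetical helper, and it takes the filter expression as given because the part of the script that derives `FILTER` is not shown in this diff. The four `--benchmark_*` flags are the ones the script itself passes.

```python
from typing import List, Optional


def build_bench_args(
    duration: Optional[str] = None,
    bench_filter: Optional[str] = None,
) -> List[str]:
    """Build the benchmark argument list the same way the shell hunk does."""
    args = [
        "--benchmark_format=json",
        "--benchmark_out=benchmark_results.json",
    ]
    if duration:
        # Mirrors: BENCH_ARGS="$BENCH_ARGS --benchmark_min_time=${DURATION}s"
        args.append(f"--benchmark_min_time={duration}s")
    if bench_filter:
        # Mirrors: BENCH_ARGS="$BENCH_ARGS --benchmark_filter=$FILTER"
        args.append(f"--benchmark_filter={bench_filter}")
    return args


if __name__ == "__main__":
    print(" ".join(build_bench_args(duration="10")))
```

Returning a list instead of one concatenated string avoids word-splitting surprises when the arguments are later handed to `subprocess.run`.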

compiler/README.md

Lines changed: 26 additions & 24 deletions
@@ -220,29 +220,27 @@ message Example {
 }
 ```

-### Fory Extension Options
+### Fory Options

-FDL supports protobuf-style extension options using the `(fory)` prefix:
+FDL uses plain option keys without a `(fory)` prefix:

 **File-level options:**

 ```fdl
-option (fory).use_record_for_java_message = true;
-option (fory).polymorphism = true;
+option use_record_for_java_message = true;
+option polymorphism = true;
 ```

 **Message/Enum options:**

 ```fdl
-message MyMessage {
-    option (fory).id = 100;
-    option (fory).evolving = false;
-    option (fory).use_record_for_java = true;
+message MyMessage [id=100] {
+    option evolving = false;
+    option use_record_for_java = true;
     string name = 1;
 }

-enum Status {
-    option (fory).id = 101;
+enum Status [id=101] {
     UNKNOWN = 0;
     ACTIVE = 1;
 }
@@ -252,25 +250,29 @@ enum Status {

 ```fdl
 message Example {
-    MyType friend = 1 [(fory).ref = true];
-    string nickname = 2 [(fory).nullable = true];
-    MyType data = 3 [(fory).ref = true, (fory).nullable = true];
+    MyType friend = 1 [ref=true];
+    string nickname = 2 [nullable=true];
+    MyType data = 3 [ref=true, nullable=true];
 }
 ```

-See `extension/fory_options.proto` for the complete list of available options.
-
 ## Architecture

 ```
 fory_compiler/
 ├── __init__.py           # Package exports
 ├── __main__.py           # Module entry point
 ├── cli.py                # Command-line interface
-├── parser/
-│   ├── ast.py            # AST node definitions
-│   ├── lexer.py          # Hand-written tokenizer
-│   └── parser.py         # Recursive descent parser
+├── frontend/
+│   └── fdl/
+│       ├── __init__.py
+│       ├── lexer.py      # Hand-written tokenizer
+│       └── parser.py     # Recursive descent parser
+├── ir/
+│   ├── __init__.py
+│   ├── ast.py            # Canonical Fory IDL AST
+│   ├── validator.py      # Schema validation
+│   └── emitter.py        # Optional FDL emitter
 └── generators/
     ├── base.py           # Base generator class
     ├── java.py           # Java POJO generator
@@ -280,13 +282,13 @@ fory_compiler/
     └── cpp.py            # C++ struct generator
 ```

-### Parser
+### FDL Frontend

-The parser is a hand-written recursive descent parser that produces an AST:
+The FDL frontend is a hand-written lexer/parser that produces the Fory IDL AST:

-- **Lexer** (`lexer.py`): Tokenizes FDL source into tokens (keywords, identifiers, punctuation)
-- **AST** (`ast.py`): Defines node types - `Schema`, `Message`, `Enum`, `Field`, `FieldType`
-- **Parser** (`parser.py`): Builds AST from token stream with validation
+- **Lexer** (`frontend/fdl/lexer.py`): Tokenizes FDL source into tokens
+- **Parser** (`frontend/fdl/parser.py`): Builds the AST from the token stream
+- **AST** (`ir/ast.py`): Canonical node types - `Schema`, `Message`, `Enum`, `Field`, `FieldType`

 ### Generators

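
The bracketed option lists introduced in this README change (`[id=100]`, `[ref=true, nullable=true]`) are simple enough to illustrate with a small standalone parser. The sketch below is not the Fory lexer or parser. `parse_inline_options` and `_coerce` are hypothetical helpers that assume options are comma-separated `key=value` pairs whose values are booleans, integers, or bare words.

```python
import re
from typing import Dict, Union

OptionValue = Union[bool, int, str]

# Matches a trailing "[...]" option list, optionally followed by ";" or "{".
_OPTION_LIST = re.compile(r"\[(?P<body>[^\[\]]*)\]\s*[;{]?\s*$")


def _coerce(raw: str) -> OptionValue:
    """Convert an option value to bool or int where possible, else keep the text."""
    if raw in ("true", "false"):
        return raw == "true"
    if re.fullmatch(r"-?\d+", raw):
        return int(raw)
    return raw


def parse_inline_options(line: str) -> Dict[str, OptionValue]:
    """Extract `[key=value, ...]` options from a single FDL declaration line."""
    match = _OPTION_LIST.search(line.strip())
    if match is None:
        return {}
    options: Dict[str, OptionValue] = {}
    for item in match.group("body").split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        options[key.strip()] = _coerce(raw.strip())
    return options


if __name__ == "__main__":
    print(parse_inline_options("MyType data = 3 [ref=true, nullable=true];"))
    print(parse_inline_options("message MyMessage [id=100] {"))
```

A real frontend would work on a token stream and record source locations for error messages; the regular expression here only keeps the example short.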

compiler/fory_compiler/__init__.py

Lines changed: 8 additions & 6 deletions
@@ -15,13 +15,14 @@
 # specific language governing permissions and limitations
 # under the License.

-"""FDL (Fory Definition Language) compiler for Apache Fory."""
+"""Fory IDL compiler for Apache Fory."""

 __version__ = "0.1.0"

-from fory_compiler.parser.ast import Schema, Message, Enum, Field, EnumValue, Import
-from fory_compiler.parser.parser import Parser
-from fory_compiler.parser.lexer import Lexer
+from fory_compiler.ir.ast import Schema, Message, Enum, Field, EnumValue, Import
+from fory_compiler.frontend.fdl import FDLFrontend
+from fory_compiler.frontend.fbs import FBSFrontend
+from fory_compiler.frontend.proto import ProtoFrontend

 __all__ = [
     "Schema",
@@ -30,6 +31,7 @@
     "Field",
     "EnumValue",
     "Import",
-    "Parser",
-    "Lexer",
+    "FDLFrontend",
+    "FBSFrontend",
+    "ProtoFrontend",
 ]
