This note documents an important design criticism: literal keyword substitution does not produce natural-reading code in every language.
- The project keeps one shared grammar and semantic structure.
- Human languages vary in word order, inflection, and how imperative statements are naturally expressed.
- Result: valid localized code can still feel linguistically unnatural.
multilingual uses:
- one parser grammar,
- concept-level keyword mapping,
- localized surface forms for those concepts.
This optimizes implementation consistency and cross-language semantic equivalence.
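As a rough illustration of concept-level keyword mapping, the sketch below resolves localized keywords to shared concept IDs. The table layout, the function names `resolve_concept` and `to_concepts`, and the keyword spellings are assumptions made for this example; they are not taken from the project's actual keyword data.

```python
# Hypothetical concept table: one surface keyword per language for each concept.
# The real project stores this data elsewhere and in a different shape.
KEYWORDS = {
    "LOOP_FOR": {"en": "for", "fr": "pour", "es": "para"},
    "IN": {"en": "in", "fr": "dans", "es": "en"},
}


def resolve_concept(word, lang):
    """Return the concept ID for a localized keyword, or None for non-keywords."""
    for concept, forms in KEYWORDS.items():
        if forms.get(lang) == word:
            return concept
    return None


def to_concepts(source, lang):
    """Replace each whitespace-separated keyword with its concept ID."""
    return [resolve_concept(word, lang) or word for word in source.split()]


# Different surface keywords, same concept sequence.
english = to_concepts("for x in items", "en")
french = to_concepts("pour x dans items", "fr")
```

Because both inputs reduce to the same concept sequence, a single grammar can parse either one. The mapping changes only the words, not their order, which is the limitation this note discusses.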
As of the current implementation, the parser also runs a small data-driven surface normalization pass before canonical parsing. This allows selected alternate word orders to map to the same core AST without forking the parser grammar per language.
A shared positional structure favors technical consistency over fully native phrasing. This is an explicit tradeoff in the current architecture.
- Keeps parser and analyzer complexity manageable.
- Preserves deterministic forward compilation to a shared core/Python output.
- Avoids committing to impossible or brittle source round-trip guarantees.
- Avoids language-specific grammar forks too early.
Surface forms are defined declaratively in:
multilingualprogramming/resources/usm/surface_patterns.json
The linkage with lexing is token-based:
- Lexer still performs all tokenization and concept resolution.
- Surface normalization consumes those lexer tokens (it does not re-lex text).
- Rewrites produce canonical keyword-concept tokens consumed by Parser.
Each rule is language-scoped but follows one generic pipeline:
- match a surface token pattern,
- capture slots (for example target, iterable),
- rewrite to canonical concept order (for example LOOP_FOR target IN iterable),
- parse normally.
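The match, capture, and rewrite steps can be sketched on a token list as follows. This is a minimal sketch under stated assumptions: the `Token` class, the rule dictionary layout, and the `normalize` function are hypothetical and do not reflect the project's real classes or the actual schema of surface_patterns.json.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str   # "KEYWORD" or "IDENT" (hypothetical token kinds)
    value: str  # concept ID for keywords, raw text for identifiers


# Hypothetical iterable-first rule: "<iterable> IN <target> LOOP_FOR".
RULE = {
    "language": "xx",  # placeholder language code
    "pattern": [("slot", "iterable"), ("concept", "IN"),
                ("slot", "target"), ("concept", "LOOP_FOR")],
    "rewrite": [("concept", "LOOP_FOR"), ("slot", "target"),
                ("concept", "IN"), ("slot", "iterable")],
}


def normalize(tokens, rule):
    """Rewrite tokens to canonical order if they match the rule, else return them unchanged."""
    pattern = rule["pattern"]
    if len(tokens) != len(pattern):
        return tokens
    slots = {}
    for tok, (kind, name) in zip(tokens, pattern):
        if kind == "concept":
            # A concept element must line up with the matching keyword token.
            if tok.kind != "KEYWORD" or tok.value != name:
                return tokens
        else:
            # A slot element captures one identifier token.
            if tok.kind != "IDENT":
                return tokens
            slots[name] = tok
    return [Token("KEYWORD", name) if kind == "concept" else slots[name]
            for kind, name in rule["rewrite"]]


surface = [Token("IDENT", "items"), Token("KEYWORD", "IN"),
           Token("IDENT", "x"), Token("KEYWORD", "LOOP_FOR")]
canonical = normalize(surface, RULE)
```

The function operates only on tokens that a lexer has already produced, so no text is re-lexed, and input that is already in canonical order falls through unchanged.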
To reduce repetition, canonical rewrites can be shared through named
templates in the same JSON file, and rules can reference a template.
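One way template sharing might look is sketched below. The JSON keys (`templates`, `rules`, `template`, `rewrite`), the `$slot` notation, and the helper names are assumptions for illustration only; the real surface_patterns.json schema may differ.

```python
import json

# Hypothetical file content: two language-scoped rules share one canonical rewrite.
CONFIG = json.loads("""
{
  "templates": {
    "for_canonical": ["LOOP_FOR", "$target", "IN", "$iterable"]
  },
  "rules": [
    {"language": "ja", "name": "for_iterable_first", "template": "for_canonical"},
    {"language": "es", "name": "for_iterable_first", "template": "for_canonical"}
  ]
}
""")


def resolve_rewrite(rule, templates):
    """Return the rule's rewrite, following a template reference if present."""
    if "template" in rule:
        return templates[rule["template"]]
    return rule["rewrite"]


def expand(rewrite, slots):
    """Fill $slot placeholders with captured values; other items are concept IDs."""
    return [slots[item[1:]] if item.startswith("$") else item for item in rewrite]


rewrites = [resolve_rewrite(rule, CONFIG["templates"]) for rule in CONFIG["rules"]]
expanded = expand(rewrites[0], {"target": "x", "iterable": "items"})
```

With this layout, changing the canonical order means editing one template rather than every language's rule.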
This keeps semantics centralized while allowing incremental syntax naturalness improvements for any language, including RTL scripts.
Current pilot rules include iterable-first for-loop headers for Japanese,
Arabic, Spanish, and Portuguese.
- Add more syntax profiles per language family where needed.
- Expand alternate surface forms that normalize to the same concept.
- Explore IDE display transforms (render localized forms while storing canonical forms).
This remains an open design area. Contributions should prefer additive experiments (feature flags, language profiles, or transform layers) over immediate grammar fragmentation.