jRegTab is an open-source Java library implementing RegTab — a method for pattern-driven data extraction from editable document tables with regular structure.
Tabular data in spreadsheets, text documents, and web pages are among the most common sources for data analysis. Extracting structured records from such tables is a critical but labour-intensive step in data wrangling. Source tables are typically designed for human readability and lack explicit semantics: cell meaning may be independent of position, cells may be compound, headers may be hierarchical, and relevant context may appear outside the table itself.
RegTab addresses this by matching editable document tables against patterns that capture their regular structure and interpretive logic. A successful match enriches the table with semantic information and yields a structured recordset.
The method is built around two formal models:
- Interpretable Table Model (ITM): represents the syntactic and semantic structure of a table. The syntactic layer describes cells (their positions, formatting, and text content) together with a row-oriented substructure hierarchy: subtables → rows → subrows → cells. The semantic layer consists of items (value-associated, attribute-associated, and auxiliary) derived from cell content or supplied from external context, along with interpretation actions that establish how items form attribute–value pairs and record item sequences.
- Abstract Table Pattern (ATP): specifies a class of tables and the rules for deriving structured records from them. An ATP instance mirrors the ITM hierarchy and contains cell patterns with cell match conditions, content specifications, and interpretation action specifications. Matching an ATP against an ITM instance populates the semantic layer automatically.
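The row-oriented hierarchy of the syntactic layer (subtables → rows → subrows → cells) can be pictured with a few plain Java records. This is a minimal sketch for illustration only: the type names `Cell`, `Subrow`, `Row`, `Subtable` and the helpers `cellTexts` and `sample` are hypothetical and are not jRegTab's own classes.

```java
import java.util.List;

// Hypothetical illustration of the ITM syntactic hierarchy; NOT jRegTab's API.
public class ItmHierarchySketch {
    record Cell(int row, int col, String text) {}
    record Subrow(List<Cell> cells) {}
    record Row(List<Subrow> subrows) {}
    record Subtable(List<Row> rows) {}

    // Walks subtable -> rows -> subrows -> cells and collects cell texts in that order.
    static List<String> cellTexts(Subtable subtable) {
        return subtable.rows().stream()
                .flatMap(row -> row.subrows().stream())
                .flatMap(subrow -> subrow.cells().stream())
                .map(Cell::text)
                .toList();
    }

    // A two-row subtable in which every row consists of a single subrow.
    static Subtable sample() {
        Row header = new Row(List.of(new Subrow(List.of(
                new Cell(0, 0, "Name"), new Cell(0, 1, "Score")))));
        Row first = new Row(List.of(new Subrow(List.of(
                new Cell(1, 0, "Alice"), new Cell(1, 1, "95")))));
        return new Subtable(List.of(header, first));
    }

    public static void main(String[] args) {
        System.out.println(cellTexts(sample()));
    }
}
```

The sketch covers only the nesting of the syntactic layer; formatting, items, and interpretation actions are left out.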
Patterns can be written directly with the Java fluent API or, more compactly, in RTL (Regular Table Language) — a textual DSL that compiles to ATP.
```java
import ru.icc.regtab.atp.AtpMatcher;
import ru.icc.regtab.atp.spec.TablePattern;
import ru.icc.regtab.interpret.TableInterpreter;
import ru.icc.regtab.itm.syntax.TableSyntax;
import ru.icc.regtab.recordset.Recordset;
import ru.icc.regtab.rtl.RtlCompiler;

// Table: Name  | Score
//        Alice | 95
//        Bob   | 87
TableSyntax syntax = new TableSyntax(3, 2);
syntax.getCell(0, 0).setText("Name");  syntax.getCell(0, 1).setText("Score");
syntax.getCell(1, 0).setText("Alice"); syntax.getCell(1, 1).setText("95");
syntax.getCell(2, 0).setText("Bob");   syntax.getCell(2, 1).setText("87");

TablePattern pattern = RtlCompiler.compile("""
    [ [ATTR]{2} ]
    [ [VAL : (^COL)->AVP, (SR)->REC]{2} ]+
    """);

Recordset rs = AtpMatcher.match(pattern, syntax)
    .map(itm -> new TableInterpreter().interpret(itm))
    .orElseThrow();

// rs.schema().attributes() → [Name, Score]
// rs.records().get(0) → {Name=Alice, Score=95}
```

A step-by-step walkthrough of this example (including the equivalent Java fluent API and a low-level ITM construction) is in the Getting started guide.
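To make the expected result concrete, the following plain-Java sketch builds the same records by hand, without jRegTab. It assumes the pattern is read as follows: the first row holds attribute names, each value cell pairs with the attribute in its column, and the values of one row form one record. That reading is inferred from the expected-output comments in the example, not taken from the RTL reference, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hand-written illustration of the expected records; does NOT use jRegTab.
public class RecordsetSketch {

    // Treats grid[0] as the attribute row and every later row as one record.
    static List<Map<String, String>> toRecords(String[][] grid) {
        String[] attributes = grid[0];
        List<Map<String, String>> records = new ArrayList<>();
        for (int r = 1; r < grid.length; r++) {
            // LinkedHashMap keeps the attributes in column order.
            Map<String, String> record = new LinkedHashMap<>();
            for (int c = 0; c < attributes.length; c++) {
                record.put(attributes[c], grid[r][c]);
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String[][] grid = {
            {"Name", "Score"},
            {"Alice", "95"},
            {"Bob", "87"}
        };
        System.out.println(toRecords(grid).get(0)); // prints {Name=Alice, Score=95}
    }
}
```

Unlike this fixed loop, a pattern states the pairing rules declaratively, so the same matching machinery can presumably serve tables with other layouts.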
The full documentation site is published at https://regtab.github.io/jregtab/.
- Getting started — installation, first example, full pipeline walkthrough
- ITM — syntactic and semantic layers, items, providers, working state, interpretation
- ATP — pattern hierarchy, content specs, action specs, matching
- RTL reference — complete RTL syntax with tables and examples
- Examples — worked examples with ATP and RTL patterns side by side
- Architecture — package map, data flow, RTL compilation pipeline
- API reference — public classes, factories, and methods (full Javadoc on javadoc.io)
- Benchmark — Foofah, RegTab, and Baikal task collections
- Testing — test suite layout, fixtures, and how to run tasks
For a local preview, run serve.bat on Windows, or run the following on any OS:
```shell
pip install -r requirements.txt
mkdocs serve
```

Then open http://127.0.0.1:8000. Publishing is automated by the Deploy docs GitHub Actions workflow on every push to main.
Add to your pom.xml:
```xml
<dependency>
    <groupId>ru.icc.regtab</groupId>
    <artifactId>regtab</artifactId>
    <version>0.1.0</version>
</dependency>
```

Requires Java 21+ and Maven 3.9+.
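If an application should fail early on an older runtime, the Java 21+ requirement can be checked at startup. The sketch below is not part of jRegTab; the class `JavaVersionCheck` and its method `meetsMinimum` are hypothetical, and only the standard `Runtime.version().feature()` call is used to read the running Java version.

```java
// Hypothetical startup guard; NOT part of jRegTab.
public class JavaVersionCheck {

    // Returns true when the running feature release is at least the required one.
    static boolean meetsMinimum(int actualFeature, int requiredFeature) {
        return actualFeature >= requiredFeature;
    }

    public static void main(String[] args) {
        int feature = Runtime.version().feature(); // e.g. 21 for Java 21
        if (!meetsMinimum(feature, 21)) {
            System.err.println("Java 21 or newer is required, found Java " + feature);
            System.exit(1);
        }
        System.out.println("Java " + feature + " meets the requirement");
    }
}
```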
```shell
mvn compile   # compile
mvn test      # compile and run the full test suite
```

The library is evaluated on 150 benchmark tasks (Foofah, RegTab, and Baikal collections), covering 1 500 test variants (750 ATP + 750 RTL) with 100 % accuracy. See Benchmark and Testing for details.
jRegTab builds on and supersedes TabbyXL (https://github.com/tabbydoc/tabbyxl), an earlier platform for tabular-data understanding based on the CRL domain-specific language.
This project is distributed under the terms of the MIT License. See LICENSE for details.