Dolphin Math Data Generator
This project generates synthetic math problems covering various arithmetic, algebra, geometry, and statistics topics. Crucially, it also generates detailed, step-by-step solutions intended to mimic the process a human would follow when solving the problem manually (like a "visible scratchpad").
The output is designed for training language models to perform multi-step mathematical reasoning.
The data works for both supervised fine-tuning (SFT) and reinforcement learning (RL), but you should generate a separate dataset for each. SFT teaches the model the step syntax; RL teaches it to actually get good at the math.
Elementary (Grades 3-5) — 34 Problem Types
Long Division — with remainder, showing divide/multiply/subtract/bring-down cycle
Multi-digit Addition — standard column algorithm with carries
Multi-digit Subtraction — standard column algorithm with borrows
Multi-digit Multiplication — partial products method
Mixed Number Operations — all four operations (+, -, *, /) with LCD, simplification
Fraction Comparison — find common denominator and compare
Fraction/Decimal/Percent Conversions — bidirectional conversions
Decimal Addition/Subtraction — column alignment with decimal points
Decimal Multiplication — integer multiplication then decimal placement
Decimal Division — shift decimals, long division, place decimal in quotient
Fraction Operations — add, subtract, multiply, divide with LCD and simplification
Finding All Factors — trial division with factor pairs
Prime Factorization — factor tree method
GCF (Greatest Common Factor) — Euclidean algorithm
LCM (Least Common Multiple) — via GCD formula
PEMDAS Problems — with rewrite steps showing work
Perimeter/Area of Rectangles, Squares, Triangles, Parallelograms, Trapezoids
Perimeter of General Polygons — sum of all sides
Volume of Rectangular Prisms
Place Value and Rounding — whole numbers and decimals
Comparing/Ordering Numbers — whole numbers and decimals
Divisibility Rules — prime/composite classification
Unit Conversions — length, weight, capacity, time, money
Mean, Median, Mode — for small datasets
Simple Probability — single event with uniform outcomes
Graph Interpretation — bar charts, line graphs, pictographs
Abacus-style Addition — column-by-column with carries
Middle School (Grades 6-8) — 41 Problem Types
Unit Rate Calculations — find rate per unit
Unit Rate from Tables — extract rate from data tables
Scaling Problems — maps, blueprints, models
Similar Figures — find missing sides using scale factors
Proportional Relationships — solve proportions
Adding/Subtracting Integers — with number line reasoning
Multiplying/Dividing Integers — sign rules
One-step Equations — all operations (x+a=b, ax=b, etc.)
Two-step Equations — (ax+b=c, a(x+b)=c, etc.)
One-step Inequalities — with inequality flip for negative coefficients
Two-step Inequalities — with proper sign handling
Simple Linear Equations — (ax + b = c)
Complex Linear Equations — variables on both sides (ax + b = cx + d)
Simplifying Expressions — distribution and combining like terms
Evaluating Expressions — variable substitution
Exponent Evaluation — compute powers like 2^5, (-3)^4
Exponent Rules — product, quotient, power, negative, zero exponent
Scientific Notation — convert to/from, operations
Square Roots — perfect squares
Cube Roots — perfect cubes
Simplifying Radicals — √72 → 6√2
Angle Relationships — complementary, supplementary, vertical (numeric and algebraic)
Angles with Parallel Lines — corresponding, alternate interior/exterior, co-interior
Triangle Angle Sum — find missing angle
Exterior Angle Theorem
Circle Area and Circumference — with π symbol or decimal
Volume of Prisms — rectangular and triangular
Volume of Cylinders
Surface Area of Prisms
Surface Area of Cylinders
Pythagorean Theorem — Find Hypotenuse
Pythagorean Theorem — Find Leg
Pythagorean Word Problems — ladders, distances, etc.
Mean (Average) — sum and divide with steps
Median — sort and find middle
Mode — frequency counting (unimodal, bimodal, no mode)
Range — max minus min
Mean Absolute Deviation (MAD)
Simple Probability — P = favorable/total
Compound Probability — Independent Events — coin flips, dice
Compound Probability — Dependent Events — drawing without replacement
High School — 6 Problem Types (more coming)
Quadratic Equations — using quadratic formula with discriminant
Percentage Problems — find part, percent, or whole
To see one sample output from each generator type:
python dolphin_math_datagen.py --sample
You can optionally specify a random seed using -s or --seed.
Limit to specific generators (comma-separated class names):
python dolphin_math_datagen.py --sample --generators MultiDigitAdditionGenerator,LongDivisionGenerator
To generate a full dataset file in JSON Lines format:
python dolphin_math_datagen.py -n <number_of_examples> -o <output_file.jsonl>
You can restrict generation to a subset of generators:
python dolphin_math_datagen.py -n 5000 -o subset.jsonl --generators MultiDigitAdditionGenerator,DecimalMultGenerator
Example: Generate 50,000 examples with seed 123:
# Specify output file explicitly:
python dolphin_math_datagen.py -n 50000 -o my_dataset.jsonl -s 123
# Use default output filename (dolphin_math_50000.jsonl):
python dolphin_math_datagen.py -n 50000 -s 123
By default, the script generates 10,000 examples and, if -o is omitted, writes them to dolphin_math_10000.jsonl. Omit -s/--seed for non-deterministic data; provide a seed to make runs reproducible.
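Because the output is JSON Lines, each line of the file is one standalone JSON object and a dataset can be streamed record by record. A minimal sketch using only the standard library; the function name `iter_examples` is illustrative and not part of this project:

```python
import json
from pathlib import Path
from typing import Iterator


def iter_examples(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

For example, `sum(1 for _ in iter_examples("dolphin_math_10000.jsonl"))` counts the records in the default output file without loading it all into memory.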
Unit tests are provided for each generator. To run all tests:
python -m unittest discover tests
The steps field in the output JSON contains a list of strings, each representing a step in the solution. Steps are formatted as OP_CODE|arg1|arg2|....
Short codes (1-2 chars): Core arithmetic operations used across many generators
Prefixed codes: Domain-specific operations grouped by prefix (e.g., STAT_, EQ_, PYTHAG_)
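A step string can be split back into its op code and arguments with a single `split`. This is a minimal sketch: `parse_step` is a hypothetical helper, not a function from this project, and it assumes no argument contains a literal `|`.

```python
def parse_step(step: str) -> tuple[str, list[str]]:
    """Split 'OP_CODE|arg1|arg2|...' into (op_code, [arg1, arg2, ...])."""
    op_code, *args = step.split("|")
    return op_code, args
```

Arguments are kept as strings here, since their types differ per op code (integers, fractions, expressions, free text).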
Core Arithmetic (used across generators)
| Code | Description | Arguments |
|------|-------------|-----------|
| A | Add | addend1, addend2, sum |
| S | Subtract | minuend, subtrahend, difference |
| M | Multiply | factor1, factor2, product |
| D | Divide | dividend, divisor, quotient |
| B | Bring down (long division) | remainder_before, digit_down, new_number |
| R | Remainder | final_remainder |
| E | Exponent/Power | base, exponent, result |
| Z | Final answer | answer_string |

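To illustrate how the core codes combine, here is a sketch of the long division cycle (divide, multiply, subtract, bring down) expressed with D, M, S, B, R, and Z. It is not the project's generator: the function name, the exact step order, and the `"<quotient> R <remainder>"` answer string are assumptions made for the example.

```python
def long_division_steps(dividend: int, divisor: int) -> list[str]:
    """Emit one divide/multiply/subtract/bring-down cycle per dividend digit."""
    if dividend < 0 or divisor <= 0:
        raise ValueError("sketch handles dividend >= 0 and divisor > 0 only")
    digits = str(dividend)
    steps: list[str] = []
    quotient = ""
    current = int(digits[0])
    for i in range(len(digits)):
        q = current // divisor
        steps.append(f"D|{current}|{divisor}|{q}")            # divide
        product = q * divisor
        steps.append(f"M|{q}|{divisor}|{product}")            # multiply
        remainder = current - product
        steps.append(f"S|{current}|{product}|{remainder}")    # subtract
        quotient += str(q)
        if i + 1 < len(digits):
            next_digit = int(digits[i + 1])
            new_number = remainder * 10 + next_digit
            steps.append(f"B|{remainder}|{next_digit}|{new_number}")  # bring down
            current = new_number
        else:
            steps.append(f"R|{remainder}")                    # final remainder
            steps.append(f"Z|{int(quotient)} R {remainder}")  # assumed answer format
    return steps
```

For 47 ÷ 3 this yields two cycles (`D|4|3|1` ... `B|1|7|17`, then `D|17|3|5` ...) and ends with `R|2` and `Z|15 R 2`.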
| Code | Description | Arguments |
|------|-------------|-----------|
| L | Find LCD | denominator1, denominator2, lcd |
| C | Convert to LCD | original_fraction, lcd, converted_fraction |
| I | Invert fraction | original, inverted |
| F | Simplify fraction | unsimplified, simplified |
| CMP | Compare fractions | frac1, frac2, relation (<, >, =) |

| Code | Description | Arguments |
|------|-------------|-----------|
| MIX_IMPROPER | Convert mixed to improper | mixed_str, improper_str |
| IMPROPER_TO_MIX | Convert improper to mixed | improper_str, mixed_str |

Integer Column Arithmetic
| Code | Description | Arguments |
|------|-------------|-----------|
| INT_ALIGN | Align numbers for column math | num1_padded, num2_padded |
| ADD_COL | Add column | col_name, calculation, result_with_carry |
| SUB_COL | Subtract column | col_name, calculation, result_with_borrow |
| BORROW | Borrow from next column | col_name, from_left, 1 |
| CARRY_FINAL | Final carry digit | carry_value |

| Code | Description | Arguments |
|------|-------------|-----------|
| DEC_ALIGN | Align by decimal point | num1_aligned, num2_aligned |
| DEC_ADD_COL | Add decimal column | col_name, calculation, result |
| DEC_SUB_COL | Subtract decimal column | col_name, calculation, result |
| DEC_CARRY_FINAL | Final decimal carry | carry_value |
| DEC_SHIFT | Shift decimal for division | original_expr, shifted_expr, places |
| MUL_SETUP | Setup multiplication | int1, int2 |
| MUL_PARTIAL | Partial product | digit, multiplicand, partial_product |
| ADD_PARTIALS | Sum partial products | expression, result |
| COUNT_DP | Count decimal places | dp1, dp2, total |
| PLACE_DP | Place decimal in result | integer_result, places, final_result |
| DIV_SETUP | Setup division | dividend, divisor |
| PLACE_DP_Q | Place decimal in quotient | quotient_digits, position |

| Code | Description | Arguments |
|------|-------------|-----------|
| FACT_CHECK | Check divisibility | n, divisor, remainder |
| FACT_PAIR | Record factor pair | factor1, factor2 |
| PF_STEP | Prime factorization step | n, prime, quotient |
| PF_PRIME | Mark as prime | n |
| GCD_START | Start Euclidean algorithm | a, b |
| GCD_STEP | Euclidean step | a, b, remainder |
| GCD_RESULT | Final GCD | gcd |
| LCM_FROM_GCD | Compute LCM | product_expr, gcd, lcm |

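The GCD and LCM codes map directly onto the Euclidean algorithm and the identity lcm(a, b) = a·b / gcd(a, b). The sketch below shows one plausible step trace; the function name and the `a*b` rendering of `product_expr` are assumptions, not the project's actual output format.

```python
def gcd_lcm_steps(a: int, b: int) -> list[str]:
    """Euclidean algorithm followed by LCM = a*b / gcd, for positive integers."""
    if a <= 0 or b <= 0:
        raise ValueError("sketch handles positive integers only")
    steps = [f"GCD_START|{a}|{b}"]
    x, y = a, b
    while y != 0:
        r = x % y
        steps.append(f"GCD_STEP|{x}|{y}|{r}")  # x = q*y + r
        x, y = y, r
    gcd = x  # last nonzero remainder
    steps.append(f"GCD_RESULT|{gcd}")
    lcm = a * b // gcd
    steps.append(f"LCM_FROM_GCD|{a}*{b}|{gcd}|{lcm}")  # product_expr format assumed
    return steps
```

For 48 and 18 the trace runs through remainders 12, 6, and 0, giving a GCD of 6 and an LCM of 144.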
| Code | Description | Arguments |
|------|-------------|-----------|
| EQ_SETUP | Show equation | equation_string |
| EQ_OP_BOTH | Apply operation to both sides | operation, value, result_expr, result_value |
| EQ_SIMPLIFY | Simplify equation | simplified_equation |
| EQ_RESULT | Final result | variable, value |
| INEQ_SETUP | Show inequality | inequality_string |
| INEQ_OP_BOTH | Apply operation to both sides | operation, value, result_expr, result_value |
| INEQ_SIMPLIFY | Simplify inequality | simplified_inequality |
| INEQ_FLIP | Flip inequality sign | reason |
| INEQ_RESULT | Final result | variable, relation, value |

| Code | Description | Arguments |
|------|-------------|-----------|
| REWRITE | Rewrite expression | new_form |
| DIST | Distribute | factor, expression, result |
| COMB_X | Combine x terms | term1, term2, result |
| COMB_CONST | Combine constants | const1, const2, result |
| SUBST | Substitute value | variable, value, result_expression |
| MOVE_TERM | Move term across equals | term, target_side, result_equation |
| DIV_COEFF | Divide by coefficient | numerator, denominator, result |
| DISC | Discriminant | b_squared, four_ac, discriminant |
| ROOT | Square root | radicand, result |
| Q1, Q2 | Quadratic roots | neg_b, sqrt_disc, two_a, root_value |
| PROP_SETUP | Setup proportion | proportion_string |

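As an illustration of the DISC, ROOT, Q1, and Q2 codes, the following sketch traces the quadratic formula for integer coefficients. It is deliberately limited to perfect-square discriminants so every value stays exact; the function name and that restriction are assumptions of the example, not documented behavior of the project's generator.

```python
from fractions import Fraction
from math import isqrt


def quadratic_steps(a: int, b: int, c: int) -> list[str]:
    """Trace x = (-b ± sqrt(b^2 - 4ac)) / (2a) for perfect-square discriminants."""
    if a == 0:
        raise ValueError("not a quadratic: a must be nonzero")
    b_squared, four_ac = b * b, 4 * a * c
    disc = b_squared - four_ac
    steps = [f"DISC|{b_squared}|{four_ac}|{disc}"]
    root = isqrt(disc) if disc >= 0 else -1
    if root * root != disc:
        raise ValueError("sketch handles perfect-square discriminants only")
    steps.append(f"ROOT|{disc}|{root}")
    neg_b, two_a = -b, 2 * a
    x1 = Fraction(neg_b + root, two_a)  # root with the plus sign
    x2 = Fraction(neg_b - root, two_a)  # root with the minus sign
    steps.append(f"Q1|{neg_b}|{root}|{two_a}|{x1}")
    steps.append(f"Q2|{neg_b}|{root}|{two_a}|{x2}")
    return steps
```

For x² - 5x + 6 = 0 the discriminant is 25 - 24 = 1, and the two root steps report 3 and 2.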
| Code | Description | Arguments |
|------|-------------|-----------|
| EXP_SETUP | Setup exponent | base, exponent |
| EXP_EXPAND | Expand multiplication | expanded_form |
| EXP_PARTIAL | Partial multiplication | value1, value2, result |
| EXP_RULE_SETUP | Setup exponent rule | expression |
| EXP_RULE_IDENTIFY | Identify rule | rule_name, rule_formula |
| EXP_RULE_APPLY | Apply rule | operation, exp1, exp2, result |
| EXP_RULE_SIMPLIFY | Simplify result | simplified |
| SCI_SETUP | Setup scientific notation | number |
| SCI_IDENTIFY | Identify coefficient/exponent | coefficient, exponent |
| SCI_MOVE_DECIMAL | Move decimal | direction, places |
| ROOT_SETUP | Setup root | expression |
| ROOT_IDENTIFY | Identify root type | radicand, type, result |
| ROOT_EXTRACT | Extract root | result |

| Code | Description | Arguments |
|------|-------------|-----------|
| PERIM | Perimeter result | value |
| AREA | Area result | value |
| VOLUME | Volume result | value |
| CIRCLE_SETUP | Setup circle problem | value, type (radius/diameter) |
| CIRCLE_FORMULA | Show formula | formula |
| CIRCLE_SUBSTITUTE | Substitute values | substituted_formula |
| CIRCLE_CALCULATE | Calculate | calculation, result |
| VOL_SETUP | Setup volume | shape, dimensions |
| VOL_FORMULA | Volume formula | formula |
| VOL_BASE_AREA | Calculate base area | calculation, result |
| VOL_CALCULATE | Calculate volume | calculation, result |
| SA_SETUP | Setup surface area | shape, dimensions |
| SA_FORMULA | Surface area formula | formula |
| SA_FACES | Calculate face areas | face_type, calculation, result |
| SA_BASES | Calculate base areas | calculation, result |
| SA_LATERAL | Calculate lateral area | calculation, result |
| SA_TOTAL | Total surface area | calculation, result |

| Code | Description | Arguments |
|------|-------------|-----------|
| PYTHAG_SETUP | Setup problem | c=hyp, a=leg, b=? |
| PYTHAG_FORMULA | Show formula | formula |
| PYTHAG_SUBSTITUTE | Substitute values | substituted_formula |
| PYTHAG_SQUARE | Square a value | value, result |
| PYTHAG_SOLVE | Solve for unknown | equation, result |
| PYTHAG_ROOT | Take square root | radicand, result |
| PYTHAG_CONTEXT | Word problem context | context_type, values |
| PYTHAG_MODEL | Model the problem | leg1, leg2, unknown |
| PYTHAG_CALCULATE | Intermediate calculation | calculation, result |

| Code | Description | Arguments |
|------|-------------|-----------|
| ANGLE_SETUP | Setup angle problem | relationship, equation |
| ANGLE_RELATION | Simplify relationship | simplified_equation |
| ANGLE_SOLVE | Solve for variable | equation, solution |
| PARALLEL_SETUP | Setup parallel lines | angle_type, relationship |
| PARALLEL_RELATION | Show equation | equation |
| PARALLEL_SOLVE | Solve | equation, solution |
| TRI_ANGLE_SETUP | Setup triangle angles | angle1, angle2, angle3 |
| TRI_ANGLE_SUM | Show sum equation | equation |
| TRI_ANGLE_SOLVE | Solve for angle | equation, result |

| Code | Description | Arguments |
|------|-------------|-----------|
| SCALE_SETUP | Setup scale | scale_unit, actual_unit, factor |
| SCALE_IDENTIFY | Identify given/find | given_value, find_type |
| SCALE_MULT | Multiply by scale | value, factor, result |
| SCALE_DIV | Divide by scale | value, factor, result |
| SIMILAR_SETUP | Setup similar figures | figure_type, sides_a, sides_b |
| SIMILAR_SCALE | Find scale factor | side_a, side_b, factor |
| SIMILAR_APPLY | Apply scale factor | known_side, factor, result |
| UNIT_RATE_SETUP | Setup unit rate | quantity, unit, total |
| UNIT_RATE_DIV | Calculate rate | total, quantity, rate |
| UNIT_RATE_TABLE | Show table data | x_values, y_values |
| UNIT_RATE_PICK | Pick values from table | x, y |

| Code | Description | Arguments |
|------|-------------|-----------|
| STAT_SETUP | Setup dataset | values |
| STAT_SUM | Sum values | expression, result |
| STAT_COUNT | Count values | n |
| STAT_DIVIDE | Divide for mean | expression, result |
| STAT_ORDER | Order values | ordered_values |
| STAT_MIDDLE | Find middle | position(s), value(s) |
| STAT_AVERAGE | Average middle values | calculation, result |
| STAT_FREQUENCY | Count frequency | value, count |
| STAT_MODE | Identify mode | mode_value(s), frequency |
| STAT_MIN | Find minimum | value |
| STAT_MAX | Find maximum | value |
| STAT_RANGE | Calculate range | calculation, result |
| STAT_MEAN | Calculate mean | calculation, result |
| STAT_DEVIATION | Calculate deviation | value, mean, deviation |
| STAT_ABS_DEV | Absolute deviation | deviation, abs_deviation |
| STAT_MAD | Mean absolute deviation | sum, count, result |
| SORT | Sort values | unsorted, sorted |
| MEAN_DIV | Divide for mean | sum, count, result |
| MODE_COUNT | Count for mode | value, count |
| MODE | Mode result | max_count, mode_values |
| MEDIAN_PAIR | Middle pair for even count | value1, value2 |

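A mean calculation might be traced with the STAT_ codes roughly as follows. This is a sketch under assumptions: the function name, the comma-separated rendering of `values`, the `+` and `/` expression strings, and the use of an exact fraction for a non-integer mean are all illustrative choices rather than the project's confirmed format.

```python
from fractions import Fraction


def mean_steps(values: list[int]) -> list[str]:
    """Trace a mean as setup, sum, count, and divide steps."""
    if not values:
        raise ValueError("mean of an empty dataset is undefined")
    total, n = sum(values), len(values)
    mean = Fraction(total, n)  # exact; prints as '6' or '7/2'
    return [
        "STAT_SETUP|" + ", ".join(map(str, values)),
        "STAT_SUM|" + " + ".join(map(str, values)) + f"|{total}",
        f"STAT_COUNT|{n}",
        f"STAT_DIVIDE|{total} / {n}|{mean}",
        f"Z|{mean}",
    ]
```

For the dataset 4, 8, 6 the sum is 18, the count is 3, and the final step is `Z|6`.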
| Code | Description | Arguments |
|------|-------------|-----------|
| PROB_SETUP | Setup probability | description or favorable, total |
| PROB_IDENTIFY | Identify probability | event, probability |
| PROB_INDEPENDENT | Note independence | explanation |
| PROB_DEPENDENT | Note dependence | explanation |
| PROB_CONDITIONAL | Conditional probability | event, probability |
| PROB_MULTIPLY | Multiply probabilities | prob1, prob2, result |

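For dependent events, the second probability is conditioned on the first draw, and the two are multiplied. The sketch below traces drawing two favorable items in a row without replacement; the function name, the event labels, and the explanation text are assumptions chosen for the example.

```python
from fractions import Fraction


def two_draws_without_replacement(favorable: int, total: int) -> list[str]:
    """Trace P(both draws favorable) when the first item is not replaced."""
    if not 2 <= favorable <= total:
        raise ValueError("need at least two favorable items and favorable <= total")
    p1 = Fraction(favorable, total)          # first draw
    p2 = Fraction(favorable - 1, total - 1)  # second draw, one favorable item gone
    result = p1 * p2
    return [
        f"PROB_SETUP|{favorable}|{total}",
        "PROB_DEPENDENT|first draw is not replaced",
        f"PROB_IDENTIFY|first draw|{p1}",
        f"PROB_CONDITIONAL|second draw given first|{p2}",
        f"PROB_MULTIPLY|{p1}|{p2}|{result}",
        f"Z|{result}",
    ]
```

With 3 favorable items out of 10, the probabilities are 3/10 and 2/9, and their product reduces to 1/15.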
| Code | Description | Arguments |
|------|-------------|-----------|
| PERCENT_TO_DEC | Convert percent to decimal | percent, decimal |
| SETUP_PERCENT_EQ | Setup equation | equation |
| REARRANGE_EQ | Rearrange equation | rearranged |
| PERCENT_CALC_PART | Calculate part | percent_dec, whole, result |
| DEC_TO_PERCENT | Convert decimal to percent | decimal, percent |
| FRAC_TO_DEC | Convert fraction to decimal | fraction, decimal |
| DEC_TO_FRAC | Convert decimal to fraction | decimal, fraction |

| Code | Description | Arguments |
|------|-------------|-----------|
| CONV_FACTOR | Conversion factor | from_unit, to_unit |
| CONV_RESULT | Conversion result | from_value, to_value |

| Code | Description | Arguments |
|------|-------------|-----------|
| ROUND_CHECK | Check rounding digit | value, place, comparison |
| ROUND_RESULT | Rounding result | original, rounded |
| ALIGN_NUM | Align for comparison | num1, num2 |
| CMP_NUM | Compare numbers | num1, num2, relation |

| Code | Description | Arguments |
|------|-------------|-----------|
| DIV_CHECK | Check divisibility | n, divisor, remainder |
| PRIME | Mark as prime | n |
| COMPOSITE_FACTOR | Show factor | factor, cofactor |

| Code | Description | Arguments |
|------|-------------|-----------|
| GRAPH_DATA | Graph type and data | graph_type, data_string |
| GRAPH_READ | Read value | category/time, value |
| GRAPH_MIN | Minimum value | category, value |
| GRAPH_MAX | Maximum value | category, value |
| GRAPH_CHANGE | Change between points | from, to, change |
| GRAPH_MAX_CHANGE | Largest change | from, to, change |
| PICTO_KEY | Pictograph key | symbol, value_per_symbol |
| PICTO_COUNT | Count symbols | category, count |

| Code | Description | Arguments |
|------|-------------|-----------|
| AB_SET | Set initial number | number |
| AB_INFO | Informational text | text |
| AB_ADD_DGT | Add digits in column | col_name, calculation, sum |
| AB_CARRY | Carry to next column | from_col, carry, to_col |
| AB_CARRY_FINAL | Final carry | carry_value |

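| Category | Implemented | Remaining |
|----------|-------------|-----------|
| Elementary (3-5) | 34 | 0 |
| Middle School (6-8) | 41 | 0 |
| Algebra 1 | 4 | 48 |
| Geometry | 1 | 28 |
| Algebra 2 | 1 | 40 |
| Precalculus | 0 | 38 |
| AP Statistics | 0 | 26 |
| AP Calculus AB | 0 | 38 |
| AP Calculus BC | 0 | 24 |
| Total | 81 | ~243 |
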
See TODO.md for the complete curriculum roadmap.
Python 3.9+ (uses only standard library)
No external packages required
dolphin-math/
├── dolphin_math_datagen.py # Main CLI and generator orchestration
├── base_generator.py # Abstract base class for generators
├── helpers.py # Utility functions (step formatter, UUID)
├── generators/ # All generator implementations
│ ├── __init__.py
│ ├── long_division_generator.py
│ ├── fraction_op_generator.py
│ └── ... (51 generator files)
├── tests/ # Unit tests for all generators
│ ├── __init__.py
│ ├── test_long_division_generator.py
│ └── ... (51 test files)
├── README.md # This file
├── AGENTS.md # Guidelines for AI coding agents
├── TODO.md # Curriculum roadmap
└── pyproject.toml # Package configuration
When adding a new generator:
Create generators/my_new_generator.py extending ProblemGenerator
Create tests/test_my_new_generator.py with unit tests
IMPORTANT: Add import and instance to ALL_GENERATORS in dolphin_math_datagen.py
Update TODO.md to mark the item as complete
Run python dolphin_math_datagen.py --sample --generators MyNewGenerator to verify output
Run python -m unittest discover tests to ensure all tests pass
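The checklist above might look roughly like this for a toy generator and its test. Everything here is a hedged sketch: the real `ProblemGenerator` interface lives in base_generator.py and is not shown in this README, so the stand-in base class, the `generate` method name, the `problem` key, and `DoubleNumberGenerator` itself are all hypothetical. Only the `steps` field and the `M` and `Z` codes come from the documentation above.

```python
import random
import unittest
from abc import ABC, abstractmethod
from typing import Optional


class ProblemGenerator(ABC):
    """Stand-in for the base class in base_generator.py (interface assumed)."""

    @abstractmethod
    def generate(self) -> dict:
        ...


class DoubleNumberGenerator(ProblemGenerator):
    """Hypothetical generator: double a random integer."""

    def __init__(self, rng: Optional[random.Random] = None):
        # An injected RNG keeps tests reproducible.
        self.rng = rng if rng is not None else random.Random()

    def generate(self) -> dict:
        n = self.rng.randint(1, 99)
        return {
            "problem": f"Double {n}.",  # key name is an assumption
            "steps": [f"M|{n}|2|{2 * n}", f"Z|{2 * n}"],
        }


class TestDoubleNumberGenerator(unittest.TestCase):
    def test_final_step_matches_multiplication(self):
        example = DoubleNumberGenerator(random.Random(0)).generate()
        multiply, final = example["steps"]
        _, n, _, product = multiply.split("|")
        self.assertEqual(int(n) * 2, int(product))
        self.assertEqual(final, f"Z|{product}")
```

Recomputing the arithmetic from the step arguments, as the test does, is one way to catch a generator whose steps and final answer drift apart.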