Beta v0.0.1 — experimental; interfaces and behavior may change without notice.
StataAgent is an AI-powered agent for interactive data exploration and analysis directly within Stata. Built on pydantic-ai and LiteLLM, it interprets natural-language questions and Stata commands, converts them into executable Stata code, and returns results in your terminal.
| Component | Requirement |
|---|---|
| Stata | 17 or newer (requires pystata, which ships with Stata 17+) |
| Python | 3.10 or newer |
| pydantic-ai | ≥ 1.87.0 |
| pydantic-ai-litellm | latest (installed alongside pydantic-ai) |
Stata 16 and earlier are not supported.
pystata— the Python API used to execute Stata commands — was introduced in Stata 17 and is not available for earlier versions.
Not all inference providers work reliably with this tool.
StataAgent requires a model that supports structured tool-calling (function calling). Many free-tier and serverless inference endpoints either do not implement the tool-calling protocol correctly or apply aggressive rate limits that break multi-step agentic loops.
The only provider found to be consistently usable so far is Nscale via an OpenAI-compatible endpoint. Other providers (HuggingFace Serverless, SambaNova free tier, etc.) have caused tool-call parsing failures or silent response truncation during testing.
If you use a different provider, confirm it supports OpenAI-style function calling and has sufficient token limits (≥ 4 096 output tokens). Commercial providers like Together AI, Fireworks, and Groq are more likely to work than free-tier serverless endpoints, but have not been systematically tested.
# config.yaml
provider: openai_compatible
model: <nscale-model-id> # e.g. meta-llama/Llama-3.3-70B-Instruct
base_url: https://inference.nscale.com/v1
api_key_env: NSCALE_API_KEY
temperature: 0.3
max_tokens: 4096
commentary: falseexport NSCALE_API_KEY=<your-key>- Natural language queries — ask questions in plain English; StataAgent maps them to Stata commands.
- Structured tools — dedicated wrappers for
summarize,tabulate,regress, anddescribeproduce clean, predictable output. - Escape hatch —
run_stataexecutes arbitrary Stata code for commands not covered by the structured tools (xtreg,margins,xtset, etc.). - Metadata awareness — variable names and labels are loaded at startup so the model can reference them without hallucinating names.
- Multi-turn history — conversation context is preserved across queries within a session.
- Commentary toggle —
--commentary/--no-commentarycontrols whether the model adds plain-language interpretation after raw Stata output.
- Stata 17+ (with
pystataavailable — it ships with the Stata installation) - Python 3.10+
- An API key for your chosen inference provider
git clone https://github.com/ColZoel/StataAgent.git
cd StataAgent
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txtCopy and edit the config:
cp config.yaml my_config.yaml # optional — edit in place if you preferSet your API key:
export NSCALE_API_KEY=<your-key> # or whichever variable api_key_env points topython main.py
# or, with overrides:
python hf_agent.py --config my_config.yaml
python hf_agent.py --provider openai_compatible --base-url https://inference.nscale.com/v1 --model meta-llama/Llama-3.3-70B-Instruct
python hf_agent.py --no-commentary
python hf_agent.py --temperature 0.1 --max-tokens 2048Once running, type queries at the > prompt. Type quit or exit to end the session.
> load /path/to/data.dta and describe it
StataAgent calls load_data then describe_dataset and returns the variable list with types and labels.
> Summarize the wage and education variables
Internally executes:
summarize wage education> What is the effect of education on wages, controlling for experience and gender?
Internally executes:
regress wage education experience gender> Show the distribution of homeownership by year
Internally executes:
tabulate homeown year> Run a fixed-effects regression of wage on education, with individual fixed effects (id) and year fixed effects
Internally executes (via run_stata):
xtset id year
xtreg wage education i.year, fe robust# Start with commentary disabled (raw Stata output only)
python hf_agent.py --no-commentary
# Override model without editing config.yaml
python hf_agent.py --provider openai_compatible \
--base-url https://inference.nscale.com/v1 \
--model meta-llama/Llama-3.3-70B-Instruct
# Reset saved Stata installation config and re-run discovery
python hf_agent.py --reset- Each session re-loads the dataset from disk when
load_datais called. There is no in-memory persistence across sessions, so StataAgent is not recommended for datasets above roughly 100 000 observations. pystatamust be importable from your Python environment. If it is not found, verify that yourPYTHONPATHincludes theutilities/pythondirectory inside your Stata installation.
- Fork this repository.
- Create a feature branch (
git checkout -b feature/your-feature). - Commit your changes (
git commit -am 'Add feature'). - Push and open a Pull Request.
- pydantic-ai — typed AI agent framework
- LiteLLM — unified LLM provider backend
- StataCorp for
pystataand the Stata platform