StataAgent

Beta v0.0.1 — experimental; interfaces and behavior may change without notice.

StataAgent is an AI-powered agent for interactive data exploration and analysis directly within Stata. Built on pydantic-ai and LiteLLM, it interprets natural-language questions and Stata commands, converts them into executable Stata code, and returns results in your terminal.

Compatibility

Component	Requirement
Stata	17 or newer (requires `pystata`, which ships with Stata 17+)
Python	3.10 or newer
pydantic-ai	≥ 1.87.0
pydantic-ai-litellm	latest (installed alongside pydantic-ai)

Stata 16 and earlier are not supported. pystata — the Python API used to execute Stata commands — was introduced in Stata 17 and is not available for earlier versions.

Model / Provider Caution

Not all inference providers work reliably with this tool.

StataAgent requires a model that supports structured tool-calling (function calling). Many free-tier and serverless inference endpoints either do not implement the tool-calling protocol correctly or apply aggressive rate limits that break multi-step agentic loops.

The only provider found to be consistently usable so far is Nscale via an OpenAI-compatible endpoint. Other providers (HuggingFace Serverless, SambaNova free tier, etc.) have caused tool-call parsing failures or silent response truncation during testing.

If you use a different provider, confirm it supports OpenAI-style function calling and has sufficient token limits (≥ 4 096 output tokens). Commercial providers like Together AI, Fireworks, and Groq are more likely to work than free-tier serverless endpoints, but have not been systematically tested.

Recommended config (Nscale / OpenAI-compatible)

# config.yaml
provider: openai_compatible
model: <nscale-model-id>          # e.g. meta-llama/Llama-3.3-70B-Instruct
base_url: https://inference.nscale.com/v1
api_key_env: NSCALE_API_KEY
temperature: 0.3
max_tokens: 4096
commentary: false

export NSCALE_API_KEY=<your-key>

Features

Natural language queries — ask questions in plain English; StataAgent maps them to Stata commands.
Structured tools — dedicated wrappers for summarize, tabulate, regress, and describe produce clean, predictable output.
Escape hatch — run_stata executes arbitrary Stata code for commands not covered by the structured tools (xtreg, margins, xtset, etc.).
Metadata awareness — variable names and labels are loaded at startup so the model can reference them without hallucinating names.
Multi-turn history — conversation context is preserved across queries within a session.
Commentary toggle — --commentary / --no-commentary controls whether the model adds plain-language interpretation after raw Stata output.

Installation

Prerequisites

Stata 17+ (with pystata available — it ships with the Stata installation)
Python 3.10+
An API key for your chosen inference provider

Setup

git clone https://github.com/ColZoel/StataAgent.git
cd StataAgent

python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate

pip install -r requirements.txt

Copy and edit the config:

cp config.yaml my_config.yaml   # optional — edit in place if you prefer

Set your API key:

export NSCALE_API_KEY=<your-key>   # or whichever variable api_key_env points to

Usage

python main.py
# or, with overrides:
python hf_agent.py --config my_config.yaml
python hf_agent.py --provider openai_compatible --base-url https://inference.nscale.com/v1 --model meta-llama/Llama-3.3-70B-Instruct
python hf_agent.py --no-commentary
python hf_agent.py --temperature 0.1 --max-tokens 2048

Once running, type queries at the > prompt. Type quit or exit to end the session.

Examples

Load a dataset and describe it

> load /path/to/data.dta and describe it

StataAgent calls load_data then describe_dataset and returns the variable list with types and labels.

Descriptive statistics

> Summarize the wage and education variables

Internally executes:

summarize wage education

Regression

> What is the effect of education on wages, controlling for experience and gender?

Internally executes:

regress wage education experience gender

Cross-tabulation

> Show the distribution of homeownership by year

Internally executes:

tabulate homeown year

Arbitrary Stata command

> Run a fixed-effects regression of wage on education, with individual fixed effects (id) and year fixed effects

Internally executes (via run_stata):

xtset id year
xtreg wage education i.year, fe robust

CLI flag examples

# Start with commentary disabled (raw Stata output only)
python hf_agent.py --no-commentary

# Override model without editing config.yaml
python hf_agent.py --provider openai_compatible \
  --base-url https://inference.nscale.com/v1 \
  --model meta-llama/Llama-3.3-70B-Instruct

# Reset saved Stata installation config and re-run discovery
python hf_agent.py --reset

Notes

Each session re-loads the dataset from disk when load_data is called. There is no in-memory persistence across sessions, so StataAgent is not recommended for datasets above roughly 100 000 observations.
pystata must be importable from your Python environment. If it is not found, verify that your PYTHONPATH includes the utilities/python directory inside your Stata installation.

Contributing

Fork this repository.
Create a feature branch (git checkout -b feature/your-feature).
Commit your changes (git commit -am 'Add feature').
Push and open a Pull Request.

Acknowledgments

pydantic-ai — typed AI agent framework
LiteLLM — unified LLM provider backend
StataCorp for pystata and the Stata platform

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
__init__.py		__init__.py
agent.py		agent.py
config.py		config.py
config.yaml		config.yaml
error_handler.py		error_handler.py
hf_agent.py		hf_agent.py
main.py		main.py
requirements.txt		requirements.txt
stata_tools.py		stata_tools.py
ui.py		ui.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

StataAgent

Compatibility

Model / Provider Caution

Recommended config (Nscale / OpenAI-compatible)

Features

Installation

Prerequisites

Setup

Usage

Examples

Load a dataset and describe it

Descriptive statistics

Regression

Cross-tabulation

Arbitrary Stata command

CLI flag examples

Notes

Contributing

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

StataAgent

Compatibility

Model / Provider Caution

Recommended config (Nscale / OpenAI-compatible)

Features

Installation

Prerequisites

Setup

Usage

Examples

Load a dataset and describe it

Descriptive statistics

Regression

Cross-tabulation

Arbitrary Stata command

CLI flag examples

Notes

Contributing

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages