DataAnalystBot is an interactive, AI-powered assistant designed to help users with all things data analysis. It leverages advanced retrieval-augmented generation (RAG) techniques, a custom vector database, and a conversational interface to provide expert guidance on data cleaning, visualization, statistics, machine learning, and popular tools like Python, SQL, Excel, and more.
- Conversational AI: Chat with an LLM (Llama 3/4 via Groq) about any data analysis topic.
- Multi-File Upload & Analysis: Upload and analyze images (charts, screenshots), CSV/Excel files, and PDFs simultaneously. The bot uses all provided files as context for your question via the `/multi-upload` endpoint.
- Data Cleaning & Analysis Endpoints: Use `/analyze-data` for full AI-powered analysis (cleaning, stats, insights, visualizations) and `/clean-data` for fast, quota-free cleaning and summary.
- SQL Query Generator: Describe what you want in plain English, provide an optional schema or upload a CSV to auto-detect columns, and get a ready-to-run SQL query with a full explanation and optimization suggestions. Supports PostgreSQL, MySQL, SQLite, SQL Server, Oracle, BigQuery, and Snowflake.
- Auto Data Augmentation: Upload a CSV and let the bot automatically diagnose data quality issues (missing values, outliers, duplicates, skewed distributions, and class imbalance), then apply fixes in one click. Download the cleaned dataset or feed it directly into analysis.
- Modern GUI: Redesigned Streamlit interface with tabs for chat, data upload, SQL generation, and data augmentation, plus sidebar controls, recent chat management, and raw data preview.
- Image Understanding: Upload images and ask questions about them. The bot uses a multimodal LLM to analyze and respond, then grounds the answer using your chat history and knowledge base.
- CSV Data Analysis: Upload a CSV file and ask questions about its content. The bot uses the CSV content as context for the LLM, providing data-aware answers.
- PDF Data Analysis: Upload a PDF file and ask questions about its content. The bot extracts text from the PDF and uses it as context for the LLM, enabling document-aware responses.
- File Caching: Uploaded CSV, image, and PDF data are cached for each session, enabling fast, context-aware follow-up questions without re-uploading or re-processing files.
- Image Upload Rate Limiting: Each user can upload up to 3 images every 6 hours. If the limit is reached, only text, CSV, or PDF questions are allowed until the window resets.
- Image Display in Chat: Uploaded images are shown inline with your messages for easy reference.
- Retrieval-Augmented Generation (RAG): Answers are grounded in a curated, chunked knowledge base from top data science sources.
- Session Memory: Each user session maintains its own chat history for context-aware conversations.
- Recent Chats: All conversations are saved and can be resumed from the sidebar.
- Custom Vector Database: Fast, semantic search over chunked documents using FAISS and HuggingFace embeddings.
- Modern UI: Built with Streamlit for a clean, interactive chat experience.
- Extensible Scrapers: Easily add new data sources with modular web scrapers.
Chat with DataAnalystBot about Power BI for data analysis!
```mermaid
flowchart TD
    subgraph "User Interface"
        A[User] -->|Uploads Files & Asks Questions| B[Streamlit Web App]
    end
    subgraph "Processing Layer"
        B -->|Sends Request| C[FastAPI Server]
        C -->|Stores Uploads| J[File Storage]
        C -->|Retrieves Context| E[Vector Database]
        C -->|Generates Answer| D[AI Model - Groq]
        C -->|Generates SQL| L[SQL Generator]
        C -->|Cleans & Enriches| M[Data Augmentor]
    end
    subgraph "Data Storage"
        E[FAISS Vector Database]
        F[HuggingFace Embeddings]
        G[Session Memory]
        I[Cache Storage]
        K[Chat History]
        H[Web Scrapers]
    end
    %% Data Flow
    E --> F
    H -->|Adds Scraped Data| E
    C -->|Saves Session| G
    C -->|Caches Results| I
    C -->|Stores Chats| K
    L -->|SQL + Explanation| C
    M -->|Augmented CSV + Log| C
    %% Response Flow
    D -->|AI Response| C
    C -->|Final Answer| B
    B -->|Shows Result| A
    class A,B userStyle
    class C,D,J,L,M processStyle
    class E,F,G,H,I,K storageStyle
```
The augmentation pipeline runs in three stages; nothing is applied silently without the user reviewing and approving the plan first.
Stage 1: Diagnose
Scans the uploaded CSV and reports all detected issues with severity ratings:
| Issue | Detection Method |
|---|---|
| Missing values | Per-column null count + percentage |
| Duplicate rows | Exact row matching |
| Outliers | IQR (1.5× fence) per numeric column |
| Skewed distributions | Skewness > 1.0 on numeric columns with all-positive values |
| Class imbalance | Majority/minority ratio > 3:1 on categorical columns |
| Low row count | Dataset smaller than 100 rows |
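As an illustrative sketch (not the project's actual code in `utils/data_augmentor.py`), the IQR outlier check and per-column missing-value report from the table above might look like:

```python
import statistics

# Simplified stand-ins for two Stage 1 diagnosis checks.
def iqr_outliers(values: list[float]) -> list[float]:
    """Flag values outside the 1.5x IQR fence."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

def missing_report(column: list) -> dict:
    """Per-column null count and percentage, as in the diagnosis table."""
    nulls = sum(1 for v in column if v is None)
    return {"nulls": nulls, "pct": round(100 * nulls / len(column), 1)}
```

Each check would feed a severity rating in the real issues report.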
Stage 2: Augment (user-controlled)
Each fix can be toggled on or off before applying:
| Option | Technique |
|---|---|
| Impute Missing Values | KNN imputation (≤10 numeric cols) or median; mode or "Unknown" for categorical |
| Treat Outliers | Winsorization: caps values at the 1st/99th percentile |
| Remove Duplicates | Exact deduplication with reset index |
| Fix Skewed Distributions | log1p transform on skewed numeric columns |
| Generate Synthetic Rows | Gaussian noise (numeric) + frequency sampling (categorical) targeting 2× row count |
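Two of the fixes above, winsorization and the log1p skew transform, can be sketched in a few lines. These are illustrative stand-ins; the project's actual implementation may differ:

```python
import math
import statistics

# Illustrative versions of two Stage 2 fixes.
def winsorize(values: list[float]) -> list[float]:
    """Cap values at the 1st/99th percentiles."""
    qs = statistics.quantiles(values, n=100)  # 99 percentile cut points
    lo, hi = qs[0], qs[98]                    # 1st and 99th percentiles
    return [min(max(v, lo), hi) for v in values]

def fix_skew(values: list[float]) -> list[float]:
    """log1p transform for right-skewed, all-positive columns."""
    return [math.log1p(v) for v in values]
```

Capping (rather than dropping) outliers preserves row count, and log1p handles zeros safely, which is why the pipeline restricts the transform to all-positive columns.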
Stage 3: Review & Export
After augmentation the user sees before/after row counts, a full change log, a side-by-side data preview, a Download Augmented CSV button, and a one-click Run Analysis on Augmented Data option that feeds directly into the existing /analyze-data pipeline.
The SQL Query Generator lets you describe what you want in plain English and instantly receive a production-ready SQL query; no SQL expertise required.
How it works in three steps:
Step 1: Describe your query
Type a plain English description of what data you want. For example:
"Show me the top 10 customers by total revenue from completed orders in the last 90 days, only include customers with at least 2 orders."
Step 2: Provide context (optional but recommended)
Paste your table schema as DDL, or upload a CSV file to let the bot auto-detect column names and data types. Also select your database dialect and the type of query you need.
Step 3: Get your query
The bot returns three things:
- A ready-to-run SQL query formatted for your chosen database
- A plain English explanation of what the query does and why
- Optimization suggestions such as index recommendations or alternative approaches
You can download the result as a .sql file directly from the interface.
Supported databases:
| Database | Dialect Support |
|---|---|
| PostgreSQL | ✅ Full |
| MySQL | ✅ Full |
| SQLite | ✅ Full |
| Microsoft SQL Server | ✅ Full |
| Oracle | ✅ Full |
| Google BigQuery | ✅ Full |
| Snowflake | ✅ Full |
Supported query types:
- SELECT / Fetch Data
- INSERT / Add Data
- UPDATE / Modify Data
- DELETE / Remove Data
- JOIN / Combine Tables
- Aggregation / GROUP BY
- Subquery / CTE
- Other / Custom
Example input → output:
Input description:

```
Show customers with more than 3 completed orders in the last 30 days, with their order count and total spend, sorted by total spend descending.
```
Generated query:
```sql
SELECT
c.id,
c.name,
c.country,
COUNT(o.id) AS order_count,
SUM(o.total) AS total_spent
FROM customers c
JOIN orders o
ON o.customer_id = c.id
WHERE
o.status = 'completed'
AND o.created_at >= NOW() - INTERVAL '30 days'
GROUP BY
c.id, c.name, c.country
HAVING
COUNT(o.id) > 3
ORDER BY
    total_spent DESC;
```
Note: Generated queries are never executed server-side. The bot returns query text only; your database stays safe.
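A client could call the `/generate-sql` endpoint with a payload along these lines. The field names below are illustrative assumptions, not the documented request schema; check `api/main.py` for the actual contract:

```python
import json

# Hypothetical request payload for the /generate-sql endpoint.
# Field names are illustrative assumptions, not the documented schema.
payload = {
    "description": "Top 10 customers by total revenue from completed orders",
    "dialect": "postgresql",    # one of the supported databases
    "query_type": "SELECT",     # see "Supported query types" above
    "schema_ddl": "CREATE TABLE customers (id INT, name TEXT, country TEXT);",
}

body = json.dumps(payload)

# To actually send it (requires the `requests` package and a running backend):
# import requests
# resp = requests.post("http://localhost:8000/generate-sql", json=payload)
# print(resp.json())
```

Because queries are never executed server-side, the response contains only text: the query, its explanation, and optimization suggestions.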
All articles are scraped, chunked (500 characters), and stored in `data/data.jsonl` for efficient retrieval.
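The chunking step can be sketched with `textwrap.wrap`, matching the `width=500` setting the scrapers use. The JSONL record shape shown here is an assumption for illustration:

```python
import json
import textwrap

def chunk_article(title: str, text: str, width: int = 500) -> list[str]:
    """Split an article into <=500-character chunks as JSONL lines."""
    return [
        json.dumps({"title": title, "chunk": chunk})
        for chunk in textwrap.wrap(text, width=width)
    ]

# Each returned line would be appended to data/data.jsonl:
lines = chunk_article("Demo", "word " * 300)
```

Fixed-width chunks keep each vector-database entry small enough for precise retrieval while preserving enough context to answer questions.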
- Frontend: Streamlit
- Backend: FastAPI
- LLM: Groq Llama 3 & Multimodal Llama 4
- Vector DB: FAISS
- Embeddings: HuggingFace Transformers
- Web Scraping: Selenium
- Data Augmentation: scikit-learn + SciPy
- Session Memory: In-memory per-session chat history
- Caching: DiskCache and Streamlit cache for fast file and context retrieval
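The retrieval step in the stack above reduces to nearest-neighbor search over chunk embeddings, which FAISS accelerates at scale. A dependency-free sketch with toy vectors (real chunks are embedded with HuggingFace models; the vectors and names here are fake):

```python
import math

# Toy stand-in for FAISS nearest-neighbor search over chunk embeddings.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query: list[float], index: dict[str, list[float]], k: int = 1) -> list[str]:
    """Return the k chunk ids most similar to the query vector."""
    return sorted(index, key=lambda cid: cosine(query, index[cid]), reverse=True)[:k]

# Fake 2-d embeddings for three knowledge-base chunks.
index = {"chunk_a": [1.0, 0.0], "chunk_b": [0.0, 1.0], "chunk_c": [0.7, 0.7]}
```

FAISS replaces this brute-force scan with an optimized index, but the retrieval contract is the same: embed the question, return the closest chunks as context for the LLM.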
1. Clone the repository:

```shell
git clone https://github.com/Lokesh-DataScience/Data-Analyst-Expert-Bot.git
cd Data-Analyst-Expert-Bot
```

2. Create and activate a virtual environment, then install dependencies:

```shell
python -m venv .venv
.venv\Scripts\activate  # On Windows (use `source .venv/bin/activate` on macOS/Linux)
pip install -r requirements.txt
```

3. Create a `.env` file in the root directory:

```
GROQ_API_KEY=your_groq_api_key
LANGSMITH_API_KEY=your_langsmith_api_key
```

4. Run the scrapers in the `scrapers/` folder to populate `data/data.jsonl` with chunked content:

```shell
python scrapers/gfg_scraper.py
python scrapers/pointtech_scraper.py
python scrapers/towardsdatascience_scrapper.py
```

5. Build the vector database:

```shell
python vector_db/faiss_db.py
```

6. Start the FastAPI backend:

```shell
uvicorn api.main:app --reload
```

7. Launch the Streamlit app:

```shell
streamlit run streamlit_app/app.py
```

- Open http://localhost:8501 in your browser.
- Ask questions about data analysis, tools, or techniques.
- To analyze an image: Upload a jpg, jpeg, or png file and enter your question. The bot will analyze the image and respond.
- To analyze a CSV: Upload a CSV file and ask a question about its content. The bot will use the CSV data as context for its answer.
- To analyze a PDF: Upload a PDF file and ask a question about its content. The bot will use the PDF text as context for its answer.
- To generate a SQL query: Go to the SQL Query Generator tab. Paste your schema (or upload a CSV to auto-detect columns), select your database type and query type, describe what you want in plain English, and click Generate SQL Query. The bot returns a ready-to-run query, a plain English explanation, and optimization suggestions, which you can download as a `.sql` file.
- To augment a dataset: Go to the Data Augmentation tab. Upload a CSV, click Diagnose Data to see a full issues report and recommended fixes, toggle which steps to apply, then click Apply Augmentation. Review the change log and before/after preview, download the cleaned CSV, or click Run Analysis on Augmented Data to analyze it immediately.
- Note: You can upload up to 3 images every 6 hours. If you reach the limit, you can still ask text questions.
- Resume conversations: Select any recent chat from the sidebar to continue where you left off.
```
DataAnalystBot/
│
├── api/                  # FastAPI backend
│   └── main.py
├── chains/               # RAG chain construction
│   └── rag_chain.py
├── data/                 # Chunked knowledge base (JSONL)
│   └── data.jsonl
├── loaders/              # Data loading utilities
│   ├── load_data.py
│   ├── load_csv.py
│   └── load_pdf.py
├── memory/               # Session memory management
│   └── session_memory.py
├── scrapers/             # Web scrapers for sources
│   ├── gfg_scraper.py
│   ├── pointtech_scraper.py
│   └── towardsdatascience_scrapper.py
├── streamlit_app/        # Streamlit UI
│   ├── components/
│   ├── config/
│   ├── styles/
│   ├── utils/
│   └── app.py
├── utils/                # Backend utilities
│   ├── data_analyzer.py
│   └── data_augmentor.py
├── vector_db/            # Vector DB creation/loading
│   └── faiss_db.py
├── requirements.txt
└── README.md
```
- Add new sources: Write a new scraper in `scrapers/`, chunk the content, and append to `data/data.jsonl`.
- Change chunk size: Adjust the `textwrap.wrap(..., width=500)` call in the scrapers.
- Swap LLM or embeddings: Update model names in `chains/rag_chain.py` or `vector_db/faiss_db.py`.
- Switch between full analysis and fast cleaning: Use `/analyze-data` for AI-powered insights, or `/clean-data` for quick cleaning and stats.
- Extend SQL generation: The `/generate-sql` endpoint accepts any schema DDL and supports all major SQL dialects. Add dialect-specific prompt templates in `api/main.py` to further tailor output.
- Extend augmentation: Add new augmentation steps in `utils/data_augmentor.py` by adding a method and registering it in the `augment()` dispatcher. SMOTE-based oversampling can be enabled by installing `imbalanced-learn` and extending `_generate_synthetic_rows()`.
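The register-and-dispatch pattern described above might look like this in miniature. The class and method names here are hypothetical; see `utils/data_augmentor.py` for the real structure:

```python
# Miniature illustration of an augment() dispatcher; names are hypothetical.
class DataAugmentor:
    def __init__(self, rows: list[dict]):
        self.rows = rows
        self.log: list[str] = []
        # Register each step name against its method; new steps are
        # added by writing a method and listing it here.
        self._steps = {
            "remove_duplicates": self._remove_duplicates,
        }

    def augment(self, enabled: list[str]) -> list[dict]:
        for step in enabled:
            self._steps[step]()  # dispatch to the registered fix
        return self.rows

    def _remove_duplicates(self) -> None:
        seen, unique = set(), []
        for row in self.rows:
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                unique.append(row)
        self.log.append(f"removed {len(self.rows) - len(unique)} duplicate rows")
        self.rows = unique
```

Keeping each fix behind a registered name is what lets the UI expose per-step toggles and build the change log shown in Stage 3.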
- All chat history is stored in memory per session and is not persisted between server restarts.
- API keys are loaded from `.env` and never exposed to the frontend.
- Generated SQL queries are not executed server-side; the bot only returns query text, keeping your database safe.
- Data augmentation is performed entirely server-side: uploaded CSVs are written to a temporary file, processed, and immediately deleted.
Pull requests, issues, and feature suggestions are welcome!
Please open an issue or submit a PR.
MIT License. See LICENSE for details.
Happy Analyzing!