DataAnalystBot 🤖

DataAnalystBot is an interactive, AI-powered assistant designed to help users with all things data analysis. It leverages advanced retrieval-augmented generation (RAG) techniques, a custom vector database, and a conversational interface to provide expert guidance on data cleaning, visualization, statistics, machine learning, and popular tools like Python, SQL, Excel, and more.


🚀 Features

  • Conversational AI: Chat with an LLM (Llama 3/4 via Groq) about any data analysis topic.
  • Multi-File Upload & Analysis: Upload and analyze images (charts, screenshots), CSV/Excel files, and PDFs simultaneously. The bot uses all provided files as context for your question via the /multi-upload endpoint.
  • Data Cleaning & Analysis Endpoints: Use /analyze-data for full AI-powered analysis (cleaning, stats, insights, visualizations) and /clean-data for fast, quota-free cleaning and summary.
  • SQL Query Generator: Describe what you want in plain English, provide an optional schema or upload a CSV to auto-detect columns, and get a ready-to-run SQL query with a full explanation and optimization suggestions. Supports PostgreSQL, MySQL, SQLite, SQL Server, Oracle, BigQuery, and Snowflake.
  • Auto Data Augmentation: Upload a CSV and let the bot automatically diagnose data quality issues (missing values, outliers, duplicates, skewed distributions, and class imbalance), then apply fixes in one click. Download the cleaned dataset or feed it directly into analysis.
  • Modern GUI: Redesigned Streamlit interface with tabs for chat, data upload, SQL generation, and data augmentation, plus sidebar controls, recent chat management, and raw data preview.
  • Image Understanding: Upload images and ask questions about them. The bot uses a multimodal LLM to analyze and respond, then grounds the answer using your chat history and knowledge base.
  • CSV Data Analysis: Upload a CSV file and ask questions about its content. The bot uses the CSV content as context for the LLM, providing data-aware answers.
  • PDF Data Analysis: Upload a PDF file and ask questions about its content. The bot extracts text from the PDF and uses it as context for the LLM, enabling document-aware responses.
  • File Caching: Uploaded CSV, image, and PDF data are cached for each session, enabling fast, context-aware follow-up questions without re-uploading or re-processing files.
  • Image Upload Rate Limiting: Each user can upload up to 3 images every 6 hours. If the limit is reached, only text, CSV, or PDF questions are allowed until the window resets.
  • Image Display in Chat: Uploaded images are shown inline with your messages for easy reference.
  • Retrieval-Augmented Generation (RAG): Answers are grounded in a curated, chunked knowledge base from top data science sources.
  • Session Memory: Each user session maintains its own chat history for context-aware conversations.
  • Recent Chats: All conversations are saved and can be resumed from the sidebar.
  • Custom Vector Database: Fast, semantic search over chunked documents using FAISS and HuggingFace embeddings.
  • Modern UI: Built with Streamlit for a clean, interactive chat experience.
  • Extensible Scrapers: Easily add new data sources with modular web scrapers.
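The image upload limit described above (3 images per rolling 6-hour window) can be sketched as a per-user sliding-window counter. This is an illustrative sketch, not the repository's actual implementation; the class and method names are invented for the example, and the clock is injectable so the behavior can be tested without waiting.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 6 * 60 * 60  # 6-hour rolling window
MAX_UPLOADS = 3               # images allowed per window

class ImageRateLimiter:
    """Per-user sliding-window rate limiter (illustrative sketch)."""

    def __init__(self, max_uploads=MAX_UPLOADS, window=WINDOW_SECONDS, clock=time.time):
        self.max_uploads = max_uploads
        self.window = window
        self.clock = clock                  # injectable for testing
        self.uploads = defaultdict(deque)   # user_id -> upload timestamps

    def allow(self, user_id):
        """Record the upload and return True if the user is under the limit."""
        now = self.clock()
        q = self.uploads[user_id]
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.max_uploads:
            return False
        q.append(now)
        return True
```

Once a user is rejected, text/CSV/PDF questions would still be served; only the image path consults the limiter.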

📸 Screenshots

Chat UI example: chatting with DataAnalystBot about Power BI for data analysis.


πŸ—οΈ Architecture Overview

```mermaid
flowchart TD
    subgraph "👤 User Interface"
        A[👤 User] -->|📤 Uploads Files & Asks Questions| B[🖥️ Streamlit Web App]
    end

    subgraph "🔄 Processing Layer"
        B -->|📡 Sends Request| C[⚡ FastAPI Server]
        C -->|💾 Stores Uploads| J[📁 File Storage]
        C -->|🔍 Retrieves Context| E[🗄️ Vector Database]
        C -->|🧠 Generates Answer| D[🤖 AI Model - Groq]
        C -->|🛠️ Generates SQL| L[📝 SQL Generator]
        C -->|🔧 Cleans & Enriches| M[🧬 Data Augmentor]
    end

    subgraph "💾 Data Storage"
        E[🗄️ FAISS Vector Database]
        F[🔤 HuggingFace Embeddings]
        G[💭 Session Memory]
        I[⚡ Cache Storage]
        K[💬 Chat History]
        H[🕷️ Web Scrapers]
    end

    %% Data Flow
    E --> F
    H -->|📊 Adds Scraped Data| E
    C -->|💾 Saves Session| G
    C -->|⚡ Caches Results| I
    C -->|💬 Stores Chats| K
    L -->|✅ SQL + Explanation| C
    M -->|✅ Augmented CSV + Log| C

    %% Response Flow
    D -->|✅ AI Response| C
    C -->|📋 Final Answer| B
    B -->|📺 Shows Result| A

    class A,B userStyle
    class C,D,J,L,M processStyle
    class E,F,G,H,I,K storageStyle
```

🔧 Auto Data Augmentation

The augmentation pipeline runs in three stages: nothing is applied silently, and every fix requires the user to review and approve the plan first.

Stage 1: Diagnose
Scans the uploaded CSV and reports all detected issues with severity ratings:

| Issue | Detection Method |
|-------|------------------|
| Missing values | Per-column null count + percentage |
| Duplicate rows | Exact row matching |
| Outliers | IQR (1.5× fence) per numeric column |
| Skewed distributions | Skewness > 1.0 on numeric columns with all-positive values |
| Class imbalance | Majority/minority ratio > 3:1 on categorical columns |
| Low row count | Dataset smaller than 100 rows |

Stage 2: Augment (user-controlled)
Each fix can be toggled on or off before applying:

| Option | Technique |
|--------|-----------|
| Impute Missing Values | KNN imputation (≤10 numeric cols) or median; mode or "Unknown" for categorical |
| Treat Outliers | Winsorization: caps values at the 1st/99th percentile |
| Remove Duplicates | Exact deduplication with reset index |
| Fix Skewed Distributions | log1p transform on skewed numeric columns |
| Generate Synthetic Rows | Gaussian noise (numeric) + frequency sampling (categorical) targeting 2× row count |

Stage 3: Review & Export
After augmentation the user sees before/after row counts, a full change log, a side-by-side data preview, a Download Augmented CSV button, and a one-click Run Analysis on Augmented Data option that feeds directly into the existing /analyze-data pipeline.


πŸ› οΈ Natural Language to SQL Query Generator

The SQL Query Generator lets you describe what you want in plain English and instantly receive a production-ready SQL query, with no SQL expertise required.

How it works in three steps:

Step 1: Describe your query
Type a plain English description of what data you want. For example:

"Show me the top 10 customers by total revenue from completed orders in the last 90 days, only include customers with at least 2 orders."

Step 2: Provide context (optional but recommended)
Paste your table schema as DDL, or upload a CSV file to let the bot auto-detect column names and data types. Also select your database dialect and the type of query you need.

Step 3: Get your query
The bot returns three things:

  • ✅ A ready-to-run SQL query formatted for your chosen database
  • 📖 A plain English explanation of what the query does and why
  • 💡 Optimization suggestions such as index recommendations or alternative approaches

You can download the result as a .sql file directly from the interface.
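The CSV schema auto-detection in Step 2 can be approximated by sampling each column and trying progressively looser type casts. This is a rough stdlib sketch under assumed behavior; the function names are hypothetical and the real endpoint likely uses pandas dtype inference instead:

```python
import csv
import io

def infer_sql_type(values):
    """Guess a SQL type from sample string values (very rough heuristic)."""
    def all_cast(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False
    if all_cast(int):
        return "INTEGER"
    if all_cast(float):
        return "REAL"
    return "TEXT"

def csv_to_ddl(text, table="uploaded_csv"):
    """Build a rough CREATE TABLE statement from a CSV header plus sampled rows."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    cols = [f"{name} {infer_sql_type([r[i] for r in body])}"
            for i, name in enumerate(header)]
    return f"CREATE TABLE {table} ({', '.join(cols)});"
```

The resulting DDL would then be passed to the LLM as schema context, exactly as if the user had pasted it by hand.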


Supported databases:

| Database | Dialect Support |
|----------|-----------------|
| PostgreSQL | ✅ Full |
| MySQL | ✅ Full |
| SQLite | ✅ Full |
| Microsoft SQL Server | ✅ Full |
| Oracle | ✅ Full |
| Google BigQuery | ✅ Full |
| Snowflake | ✅ Full |

Supported query types:

  • SELECT / Fetch Data
  • INSERT / Add Data
  • UPDATE / Modify Data
  • DELETE / Remove Data
  • JOIN / Combine Tables
  • Aggregation / GROUP BY
  • Subquery / CTE
  • Other / Custom

Example input → output:

Input description:

```
List customers with more than 3 completed orders in the last 30 days, with their order count and total spend, sorted by total spend descending.
```

Generated query:
```sql
SELECT
    c.id,
    c.name,
    c.country,
    COUNT(o.id)   AS order_count,
    SUM(o.total)  AS total_spent
FROM customers c
JOIN orders o
    ON o.customer_id = c.id
WHERE
    o.status      = 'completed'
    AND o.created_at >= NOW() - INTERVAL '30 days'
GROUP BY
    c.id, c.name, c.country
HAVING
    COUNT(o.id) > 3
ORDER BY
    total_spent DESC;
```

Note: Generated queries are never executed server-side. The bot returns query text only, so your database stays safe.


📚 Data Sources

All articles are scraped, chunked (500 chars), and stored in data/data.jsonl for efficient retrieval.
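The scrape-and-chunk step can be sketched with the same textwrap.wrap(..., width=500) call the Customization section mentions. The JSONL record fields ("text", "source") are illustrative; the actual schema of data/data.jsonl may differ:

```python
import json
import textwrap

def chunk_to_jsonl(article_text, source_url, width=500):
    """Split an article into <=500-char chunks and emit one JSONL record per chunk."""
    lines = []
    for chunk in textwrap.wrap(article_text, width=width):
        lines.append(json.dumps({"text": chunk, "source": source_url}))
    return "\n".join(lines)
```

Each scraper would append its records to data/data.jsonl, and the FAISS build step embeds every chunk for semantic retrieval.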


πŸ› οΈ Tech Stack


⚡ Quickstart

1. Clone the Repository

git clone https://github.com/Lokesh-DataScience/Data-Analyst-Expert-Bot.git
cd DataAnalystBot

2. Install Dependencies

python -m venv .venv
.venv\Scripts\activate      # Windows
source .venv/bin/activate   # macOS/Linux
pip install -r requirements.txt

3. Set Up Environment Variables

Create a .env file in the root directory:

GROQ_API_KEY=your_groq_api_key
LANGSMITH_API_KEY=your_langsmith_api_key

4. Scrape and Prepare Data

Run the scrapers in the scrapers/ folder to populate data/data.jsonl with chunked content:

python scrapers/gfg_scraper.py
python scrapers/pointtech_scraper.py
python scrapers/towardsdatascience_scrapper.py

5. Build the Vector Database

python vector_db/faiss_db.py

6. Start the Backend API

uvicorn api.main:app --reload

7. Launch the Streamlit Frontend

streamlit run streamlit_app/app.py

💬 Usage

  • Open http://localhost:8501 in your browser.
  • Ask questions about data analysis, tools, or techniques.
  • To analyze an image: Upload a jpg, jpeg, or png file and enter your question. The bot will analyze the image and respond.
  • To analyze a CSV: Upload a CSV file and ask a question about its content. The bot will use the CSV data as context for its answer.
  • To analyze a PDF: Upload a PDF file and ask a question about its content. The bot will use the PDF text as context for its answer.
  • To generate a SQL query: Go to the 🛠️ SQL Query Generator tab. Paste your schema (or upload a CSV to auto-detect columns), select your database type and query type, describe what you want in plain English, and click ⚡ Generate SQL Query. The bot returns a ready-to-run query, a plain English explanation, and optimization suggestions; the query can be downloaded as a .sql file.
  • To augment a dataset: Go to the 🔧 Data Augmentation tab. Upload a CSV, click 🔍 Diagnose Data to see a full issues report and recommended fixes, toggle which steps to apply, then click ⚡ Apply Augmentation. Review the change log and before/after preview, download the cleaned CSV, or click 📊 Run Analysis on Augmented Data to analyze it immediately.
  • Note: You can upload up to 3 images every 6 hours. If you reach the limit, you can still ask text questions.
  • Resume conversations: Select any recent chat from the sidebar to continue where you left off.

🧩 Project Structure

```
DataAnalystBot/
│
├── api/                  # FastAPI backend
│   └── main.py
├── chains/               # RAG chain construction
│   └── rag_chain.py
├── data/                 # Chunked knowledge base (JSONL)
│   └── data.jsonl
├── loaders/              # Data loading utilities
│   ├── load_data.py
│   ├── load_csv.py
│   └── load_pdf.py
├── memory/               # Session memory management
│   └── session_memory.py
├── scrapers/             # Web scrapers for sources
│   ├── gfg_scraper.py
│   ├── pointtech_scraper.py
│   └── towardsdatascience_scrapper.py
├── streamlit_app/        # Streamlit UI
│   ├── components/
│   ├── config/
│   ├── styles/
│   ├── utils/
│   └── app.py
├── utils/                # Backend utilities
│   ├── data_analyzer.py
│   └── data_augmentor.py
├── vector_db/            # Vector DB creation/loading
│   └── faiss_db.py
├── requirements.txt
└── README.md
```

πŸ“ Customization

  • Add new sources: Write a new scraper in scrapers/, chunk the content, and append to data/data.jsonl.
  • Change chunk size: Adjust the textwrap.wrap(..., width=500) in scrapers.
  • Swap LLM or embeddings: Update model names in chains/rag_chain.py or vector_db/faiss_db.py.
  • Switch between full analysis and fast cleaning: Use /analyze-data for AI-powered insights, or /clean-data for quick cleaning and stats.
  • Extend SQL generation: The /generate-sql endpoint accepts any schema DDL and supports all major SQL dialects. Add dialect-specific prompt templates in api/main.py to further tailor output.
  • Extend augmentation: Add new augmentation steps in utils/data_augmentor.py by adding a method and registering it in the augment() dispatcher. SMOTE-based oversampling can be enabled by installing imbalanced-learn and extending _generate_synthetic_rows().
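The dispatcher pattern mentioned in the last bullet might look like the sketch below. All names here are hypothetical and do not mirror the actual utils/data_augmentor.py code; the point is only the shape: each step is a method, registered under a string key, and augment() applies whichever keys the user enabled.

```python
import math

class DataAugmentor:
    """Illustrative dispatcher: each enabled step name maps to a method."""

    def __init__(self):
        # Register new augmentation steps here: name -> bound method.
        self.steps = {
            "remove_duplicates": self.remove_duplicates,
            "fix_skew": self.fix_skew,
        }

    def remove_duplicates(self, rows):
        """Exact deduplication over list-of-dict rows, preserving order."""
        seen, out = set(), []
        for row in rows:
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                out.append(row)
        return out

    def fix_skew(self, rows):
        """log1p-transform every non-negative numeric field (sketch only)."""
        return [{k: math.log1p(v) if isinstance(v, (int, float)) and v >= 0 else v
                 for k, v in row.items()} for row in rows]

    def augment(self, rows, enabled):
        """Apply the enabled steps in order and return (rows, change log)."""
        log = []
        for name in enabled:
            rows = self.steps[name](rows)  # unknown names raise KeyError
            log.append(name)
        return rows, log
```

Adding a step is then two lines: define the method, add it to self.steps.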

πŸ›‘οΈ Security & Privacy

  • All chat history is stored in memory per session and is not persisted between server restarts.
  • API keys are loaded from .env and never exposed to the frontend.
  • Generated SQL queries are not executed server-side; the bot only returns query text, keeping your database safe.
  • Data augmentation happens entirely server-side: uploaded CSVs are written to a temporary file, processed, and the file is deleted immediately afterwards.

🀝 Contributing

Pull requests, issues, and feature suggestions are welcome!
Please open an issue or submit a PR.


📄 License

MIT License. See LICENSE for details.


πŸ™ Acknowledgements


Happy Analyzing! 🚀
