AI Web Scraper

Overview

AI Web Scraper is a web scraping application built with Streamlit, Selenium, and Ollama. It scrapes a website's content and then extracts specific information from it based on a natural language prompt.

Features

🌐 Web Scraping: Extract content from a website given its URL
🤖 AI-Powered Parsing: Use natural language to extract precise information
🖥️ User-Friendly Interface: Simple Streamlit web application
🔍 CAPTCHA Handling: Integrated CAPTCHA solving mechanism

Prerequisites

Python 3.10+
Anaconda or Miniconda
Ollama (with Mistral model)
Chrome WebDriver

Installation

1. Clone the Repository

git clone https://github.com/sunilmakkar/ai-web-scraper.git
cd ai-web-scraper

2. Create Conda Environment

conda env create -f environment.yml
conda activate ai_web_scraper_env

3. Install Dependencies

pip install -r requirements.txt

4. Set Up Ollama

Ensure Ollama is installed and the Mistral model is available:

ollama pull mistral
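To confirm that the model is available before launching the app, a check along these lines may help. It is a minimal sketch, assuming Ollama is running locally on its default address (http://localhost:11434) and that its /api/tags endpoint lists installed models; the helper names fetch_tags and has_model are hypothetical and not part of this project.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local address.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def fetch_tags(url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> dict:
    """Fetch the list of locally installed models from a running Ollama server."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def has_model(tags: dict, name: str) -> bool:
    """Return True if a model with the given base name appears in the tags payload."""
    for model in tags.get("models", []):
        # Installed names look like "mistral:latest"; compare the part before the tag.
        if model.get("name", "").split(":", 1)[0] == name:
            return True
    return False
```

With Ollama running, `has_model(fetch_tags(), "mistral")` should return True once `ollama pull mistral` has completed. Running `ollama list` in a terminal gives the same information.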

5. Configure Bright Data Proxy (Optional)

In scrape.py, replace the value of SBR_WEBDRIVER with the connection string that contains your Bright Data credentials.
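To keep credentials out of source control, the value could be assembled from environment variables instead of being hard-coded. The following is a sketch only: it assumes SBR_WEBDRIVER is a URL string of the form https://USERNAME:PASSWORD@HOST, and the function names and the environment variable names (BRIGHTDATA_USERNAME, BRIGHTDATA_PASSWORD, BRIGHTDATA_HOST) are hypothetical. Check your Bright Data dashboard for the exact connection string.

```python
import os
from urllib.parse import quote


def build_sbr_webdriver(username: str, password: str, host: str) -> str:
    """Build a credentialed URL, percent-encoding characters such as '@' or '/'."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"https://{user}:{pwd}@{host}"


def sbr_webdriver_from_env(env=os.environ) -> str:
    """Read the (hypothetical) variables and raise KeyError if any is missing."""
    return build_sbr_webdriver(
        env["BRIGHTDATA_USERNAME"],
        env["BRIGHTDATA_PASSWORD"],
        env["BRIGHTDATA_HOST"],
    )
```

In scrape.py, the assignment would then read `SBR_WEBDRIVER = sbr_webdriver_from_env()`, with the three variables exported in the shell before running the app.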

Usage

Run the Streamlit application:

streamlit run main.py

How to Use

Enter a website URL
Click "Scrape Site" to extract content
Describe what information you want to parse
Click "Parse Content" to extract specific details
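Conceptually, the steps above reduce the page to readable text and then combine that text with your description into a prompt for the model. The sketch below illustrates the idea using only the Python standard library as a stand-in for BeautifulSoup and LangChain. The names clean_text and build_prompt and the prompt wording are hypothetical; the project's own functions in scrape.py and parse.py may differ.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def clean_text(html: str) -> str:
    """Return the page's visible text, one fragment per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def build_prompt(content: str, description: str) -> str:
    """Combine scraped text and the user's description into one prompt string."""
    return (
        "Extract only the information matching this description: "
        f"{description}\n\n"
        f"Content:\n{content}"
    )
```

The string returned by build_prompt is what would be sent to the Mistral model. Long pages may need to be split into smaller pieces first so that each prompt fits the model's context window.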

Dependencies

Streamlit
Selenium
BeautifulSoup
Langchain
Ollama
Mistral LLM

Limitations

Requires Bright Data proxy for advanced scraping
CAPTCHA solving may not work for all websites
Parsing accuracy depends on the Mistral model's performance

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Future Improvements

The current implementation is a general-purpose scraper. The most significant enhancement would be to specialize it for a specific industry or use case, such as automated competitive analysis for e-commerce, real estate market research, or job market intelligence, so that it provides more targeted and valuable insights. Specialization would involve custom parsing logic, industry-specific data extraction rules, and potentially domain-specific AI models to improve the accuracy and relevance of the extracted information.

License

MIT License

Acknowledgements

Inspired by Tech With Tim's tutorial
Powered by Ollama and Mistral LLM
