This is an AI-powered web scraping application built with Streamlit, Selenium, and Ollama, allowing users to scrape website content and extract specific information using natural language prompts.
- 🌐 Web Scraping: Extract content from any website
- 🤖 AI-Powered Parsing: Use natural language to extract precise information
- 🖥️ User-Friendly Interface: Simple Streamlit web application
- 🔍 CAPTCHA Handling: Integrated CAPTCHA solving mechanism

Prerequisites:
- Python 3.10+
- Anaconda or Miniconda
- Ollama (with Mistral model)
- Chrome WebDriver

Clone the repository:

    git clone https://github.com/sunilmakkar/ai-web-scraper.git
    cd ai-web-scraper

Create and activate the conda environment:

    conda env create -f environment.yaml
    conda activate ai_web_scraper_env

Install the Python dependencies:

    pip install -r requirements.txt

Ensure Ollama is installed and the Mistral model is available:

    ollama pull mistral

Replace the SBR_WEBDRIVER in scrape.py with your Bright Data credentials.
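
One way to keep credentials out of the repository is to have scrape.py read the endpoint from the environment instead of a hard-coded string. The sketch below is an assumption, not the project's code: only the name SBR_WEBDRIVER comes from scrape.py, while the helper `load_webdriver_url` and the use of an environment variable are hypothetical. The exact URL format is whatever your Bright Data dashboard provides.

```python
import os
from urllib.parse import urlparse


def load_webdriver_url(env_var: str = "SBR_WEBDRIVER") -> str:
    """Return the remote WebDriver URL from the environment (hypothetical helper)."""
    url = os.environ.get(env_var, "").strip()
    if not url:
        raise RuntimeError(f"Set {env_var} to your Bright Data WebDriver URL")
    parsed = urlparse(url)
    # Basic sanity checks only; the real URL format is defined by Bright Data.
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"{env_var} does not look like an http(s) URL")
    return url
```

With something like this in place, scrape.py could set `SBR_WEBDRIVER = load_webdriver_url()` and the value would be exported in the shell before launching the app.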
Run the Streamlit application:

    streamlit run main.py

- Enter a website URL
- Click "Scrape Site" to extract content
- Describe what information you want to parse
- Click "Parse Content" to extract specific details
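
The scrape-then-parse flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the application's implementation: it uses the standard library's `HTMLParser` in place of BeautifulSoup so it runs without extra packages, `ask_model` is a hypothetical stand-in for the call to Mistral through Ollama, and the 6000-character chunk size is an arbitrary choice.

```python
from html.parser import HTMLParser
from typing import Callable, List


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def clean_html(html: str) -> str:
    """Reduce raw HTML to newline-separated visible text."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "\n".join(extractor.parts)


def split_text(text: str, max_chars: int = 6000) -> List[str]:
    """Split text into fixed-size chunks so each fits in one prompt."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def parse_chunks(chunks: List[str], description: str,
                 ask_model: Callable[[str], str]) -> str:
    """Ask the model about each chunk and join the non-empty answers."""
    results = []
    for chunk in chunks:
        prompt = (
            f"Extract only the information matching: {description}\n"
            "Return an empty string if nothing matches.\n\n"
            f"Content:\n{chunk}"
        )
        answer = ask_model(prompt).strip()
        if answer:
            results.append(answer)
    return "\n".join(results)
```

Passing the model as a callable keeps the sketch testable without a running Ollama server. In the real app that callable would wrap whatever Langchain or Ollama invocation the project uses.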

Built with:
- Streamlit
- Selenium
- BeautifulSoup
- Langchain
- Ollama
- Mistral LLM

Limitations:
- Requires a Bright Data proxy for advanced scraping
- CAPTCHA solving may not work for all websites
- Parsing accuracy depends on the Mistral model's performance
Contributions are welcome! Please feel free to submit a Pull Request.
The current implementation is a general-purpose scraper. The most significant enhancement would be to specialize it for a specific industry or use case, such as automated competitive analysis for e-commerce, real estate market research, or job market intelligence, so that it delivers more targeted and valuable insights. Specialization would involve custom parsing logic, industry-specific data extraction rules, and potentially domain-specific AI models to improve the accuracy and relevance of the extracted information.
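
As a rough idea of what an industry-specific extraction rule might look like, the sketch below pulls prices out of scraped text for an e-commerce use case. Everything in it is hypothetical: the `Price` type, the `extract_prices` function, and the regular expression are illustrative assumptions, and the pattern handles only a few currency symbols with comma thousands separators.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Price:
    currency: str
    amount: float


# Currency symbol, optional space, digits with optional commas and decimals.
_PRICE_RE = re.compile(r"([$€£])\s?(\d[\d,]*(?:\.\d+)?)")


def extract_prices(text: str) -> List[Price]:
    """Return every price found in the text, in order of appearance."""
    prices = []
    for symbol, number in _PRICE_RE.findall(text):
        # Strip thousands separators before converting to a number.
        prices.append(Price(symbol, float(number.replace(",", ""))))
    return prices
```

A deterministic rule like this could run alongside the LLM parsing step, for example to cross-check the values the model returns.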
MIT License
- Inspired by Tech With Tim's tutorial
- Powered by Ollama and Mistral LLM