A real-time news crawler built with Python and Scrapy, designed to fetch the latest news from major Taiwan news websites.
- Real-time Crawling: Fetches the latest news updates efficiently.
- Multiple Sources: Supports various major Taiwan news outlets.
- Flexible Storage: Supports saving data to Cassandra, MySQL, or JSON (via pipelines).
- Extensible: Easy to add new spiders for additional news sites (see the sketch after this list).
- Docker Support: Ready-to-use Docker configuration for easy deployment.
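To illustrate the extensibility point above: a new spider is a standard Scrapy spider class dropped into `crawler_news/spiders/`. The sketch below is hypothetical; the domain, CSS selectors, and field names are assumptions, not code from this repository.

```python
# Hypothetical spider sketch; the domain, selectors, and field names
# are illustrative assumptions, not this repository's actual code.
import scrapy


class ExampleNewsSpider(scrapy.Spider):
    name = "examplenews"  # run with: scrapy crawl examplenews
    allowed_domains = ["news.example.com"]
    start_urls = ["https://news.example.com/latest"]

    def parse(self, response):
        # Follow links from the listing page to individual articles.
        for href in response.css("a.article-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Yield a plain dict; the project's pipelines handle storage.
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "publish_date": response.css("time::attr(datetime)").get(),
            "content": " ".join(response.css("article p::text").getall()),
        }
```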
| News Site | Status |
|---|---|
| 自由時報 (Liberty Times) | ✅ Active |
| 東森新聞 (EBC) | ✅ Active |
| 聯合新聞網 (UDN) | ✅ Active |
| 今日新聞 (NOWnews) | ✅ Active |
| ETtoday | ✅ Active |
| 中時電子報 (China Times) | ✅ Active |
| TVBS | ✅ Active |
| 三立新聞網 (SETN) | ✅ Active |
| 中央通訊社 (CNA) | ✅ Active |
| 巴哈姆特 (Gamer) | 🚧 TODO |
| 風傳媒 (Storm) | 🚧 TODO |
- Python 3.9+
- Redis (for deduplication and queue management; see the sketch after this list)
- Database (Optional): Cassandra or MySQL
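How the project uses Redis internally is not spelled out here, so the following is only a rough sketch of the common set-based URL deduplication idea; the key name and connection settings are assumptions.

```python
# Hypothetical sketch of Redis-backed URL deduplication; the key name
# and connection settings are assumptions, not this project's actual code.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)


def is_new_url(url: str) -> bool:
    # SADD returns 1 if the member was newly added and 0 if it already
    # existed, so a single round trip both checks and records the URL.
    return r.sadd("crawler_news:seen_urls", url) == 1
```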
1. Clone the repository

   ```bash
   git clone https://github.com/SecondDim/crawler-news.git
   cd crawler-news
   ```

2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

3. Configuration

   Copy the example settings file and configure your environment:

   ```bash
   cp crawler_news/settings.py.example crawler_news/settings.py
   ```

   Edit `crawler_news/settings.py` to set up your database connections (Redis, MySQL, Cassandra) and other preferences.
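As a hedged illustration of what that configuration might contain (apart from Scrapy's standard `BOT_NAME` and `ITEM_PIPELINES` keys, every name and value below is an assumption):

```python
# Hypothetical settings sketch; apart from Scrapy's standard BOT_NAME and
# ITEM_PIPELINES keys, all option names and values here are assumptions.
BOT_NAME = "crawler_news"

# Redis, used for deduplication and queue management.
REDIS_HOST = "localhost"
REDIS_PORT = 6379

# Optional relational storage.
MYSQL_HOST = "localhost"
MYSQL_DATABASE = "crawler_news"
MYSQL_USER = "crawler"
MYSQL_PASSWORD = "change-me"

# Enable the pipelines you want; lower numbers run first.
ITEM_PIPELINES = {
    "crawler_news.pipelines.SaveJsonPipeline": 300,  # assumed class path
}
```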
To run all spiders sequentially:

```bash
python app.py
# or
./run.sh
```

To run a specific spider (e.g., ettoday):

```bash
scrapy crawl ettoday
```

Available spiders: `chinatimes`, `cna`, `ebc`, `ettoday`, `libertytimes`, `nownews`, `setn`, `tvbs`, `udn`.
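For reference, a script that runs several Scrapy spiders strictly one after another typically chains them through Twisted deferreds, as in the Scrapy documentation's pattern below. This is a sketch of how `app.py` might work, not a copy of it.

```python
# Hypothetical sketch of sequential crawling; the real app.py may differ.
from twisted.internet import defer, reactor

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())


@defer.inlineCallbacks
def crawl_all():
    # Spider names as registered in crawler_news/spiders/.
    for name in ["chinatimes", "cna", "ebc", "ettoday", "libertytimes",
                 "nownews", "setn", "tvbs", "udn"]:
        yield runner.crawl(name)  # wait for each crawl to finish
    reactor.stop()


crawl_all()
reactor.run()
```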
You can run the crawler using Docker to avoid environment setup issues.

```bash
# Build the image
docker build . -t crawler_news

# Start services (crawler + Redis + DBs if configured)
docker-compose up -d

# Stop services
docker-compose down

# Run without a database (using local tmp/log folders)
docker run --rm -it -v $(pwd)/tmp:/src/tmp -v $(pwd)/log:/src/log crawler_news
```

Project structure:

```
crawler-news/
├── crawler_news/        # Main Scrapy project folder
│   ├── spiders/         # Spider definitions (one per news site)
│   ├── pipelines/       # Data processing pipelines (DB, JSON)
│   ├── extensions/      # Custom extensions (DB connectors)
│   ├── items.py         # Data models
│   └── settings.py      # Project configuration
├── docker/              # Docker-related files
├── app.py               # Script to run all spiders
├── run.sh               # Shell script entry point
├── requirements.txt     # Python dependencies
└── scrapy.cfg           # Scrapy deployment config
```
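To show how `items.py` and the pipelines fit together, here is a hedged sketch; the `NewsItem` fields and the `SaveJsonPipeline` class are illustrative assumptions, not the repository's actual definitions.

```python
# Hypothetical item and JSON pipeline; field names, class names, and the
# tmp/ output path are assumptions, not this repository's actual code.
import json

import scrapy


class NewsItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    publish_date = scrapy.Field()
    content = scrapy.Field()


class SaveJsonPipeline:
    def open_spider(self, spider):
        # One JSON Lines file per spider run, written under tmp/.
        self.file = open(f"tmp/{spider.name}.jsonl", "a", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```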
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the project
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.