
Python Crawler for News


A real-time news crawler built with Python and Scrapy, designed to fetch the latest news from major Taiwan news websites.

Features

  • Real-time Crawling: Fetches the latest news updates efficiently.
  • Multiple Sources: Supports various major Taiwan news outlets.
  • Flexible Storage: Supports saving data to Cassandra, MySQL, or JSON (via pipelines).
  • Extensible: Easy to add new spiders for additional news sites (see the spider sketch below this list).
  • Docker Support: Ready-to-use Docker configuration for easy deployment.
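
For reference, a new spider is a standard Scrapy spider class placed in crawler_news/spiders/. The skeleton below is only an illustrative sketch assuming a typical listing-page layout; the class name, selectors, and yielded fields are hypothetical and not taken from this repository.

import scrapy

class ExampleNewsSpider(scrapy.Spider):
    # Hypothetical skeleton for a new news-site spider; adjust the selectors per site.
    name = "examplenews"  # run with: scrapy crawl examplenews
    start_urls = ["https://news.example.com/realtime"]

    def parse(self, response):
        # Follow every article link found on the listing page.
        for href in response.css("a.article::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Yield a plain dict here; the project itself defines its fields in items.py.
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
        }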

Supported News Sites

News Site | Status
自由時報 (Liberty Times) | ✅ Active
東森新聞 (EBC) | ✅ Active
聯合新聞網 (UDN) | ✅ Active
今日新聞 (NOWnews) | ✅ Active
ETtoday | ✅ Active
中時電子報 (China Times) | ✅ Active
TVBS | ✅ Active
三立新聞網 (SETN) | ✅ Active
中央通訊社 (CNA) | ✅ Active
巴哈姆特 (Gamer) | 🚧 TODO
風傳媒 (Storm) | 🚧 TODO

Requirements

  • Python 3.9+
  • Redis (for deduplication and queue management)
  • Database (Optional): Cassandra or MySQL

Installation

  1. Clone the repository

    git clone https://github.com/SecondDim/crawler-news.git
    cd crawler-news
  2. Install dependencies

    pip install -r requirements.txt
  3. Configuration: Copy the example settings file and configure your environment:

    cp crawler_news/settings.py.example crawler_news/settings.py

    Edit crawler_news/settings.py to set up your database connections (Redis, MySQL, Cassandra) and other preferences.
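
As a rough sketch, the values you typically need to fill in look like the following. The exact setting names come from settings.py.example, so the Redis/MySQL keys and pipeline class paths shown here are assumptions, not the project's confirmed names.

# crawler_news/settings.py : illustrative values only; check settings.py.example
# for the real setting names used by this project.
BOT_NAME = "crawler_news"

# Redis, used for deduplication and queue management (key names assumed)
REDIS_HOST = "localhost"
REDIS_PORT = 6379

# Optional database backends (key names assumed)
MYSQL_HOST = "localhost"
MYSQL_DATABASE = "crawler_news"

# Enable the storage pipelines you want (class paths are placeholders)
ITEM_PIPELINES = {
    "crawler_news.pipelines.JsonPipeline": 300,
}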

Usage

Run All Spiders

To run all spiders sequentially:

python app.py
# or
./run.sh
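
app.py runs the spiders one after another. A minimal sketch of that idea, assuming nothing about the project's actual implementation, could shell out to Scrapy for each spider in turn:

import subprocess

# Spider names as listed under "Run a Single Spider" below.
SPIDERS = [
    "chinatimes", "cna", "ebc", "ettoday", "libertytimes",
    "nownews", "setn", "tvbs", "udn",
]

def run_all():
    # Run each spider sequentially; a failing spider is reported but does not
    # stop the remaining ones.
    for name in SPIDERS:
        result = subprocess.run(["scrapy", "crawl", name])
        if result.returncode != 0:
            print(f"spider {name} exited with code {result.returncode}")

if __name__ == "__main__":
    run_all()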

Run a Single Spider

To run a specific spider (e.g., ettoday):

scrapy crawl ettoday

Available spiders: chinatimes, cna, ebc, ettoday, libertytimes, nownews, setn, tvbs, udn.
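
Independently of the configured pipelines, Scrapy's built-in feed export can dump a single spider's items straight to a file, for example (the tmp/ path is just an illustration):

scrapy crawl ettoday -o tmp/ettoday.json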

Docker Deployment

You can run the crawler using Docker to avoid environment setup issues.

Build Image

docker build . -t crawler_news

Run with Docker Compose

# Start services (Crawler + Redis + DBs if configured)
docker-compose up -d

# Stop services
docker-compose down

Run Manually

# Run without database (using local tmp/log folders)
docker run --rm -it -v $(pwd)/tmp:/src/tmp -v $(pwd)/log:/src/log crawler_news
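
If the image's default command can be overridden in the usual Docker way (an assumption about how this image is built, with scrapy available on its PATH), a single spider can also be run inside the container:

# Assumes the image lets you append an arbitrary scrapy command
docker run --rm -it -v $(pwd)/tmp:/src/tmp -v $(pwd)/log:/src/log crawler_news scrapy crawl ettoday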

Project Structure

crawler-news/
├── crawler_news/           # Main Scrapy project folder
│   ├── spiders/            # Spider definitions (one per news site)
│   ├── pipelines/          # Data processing pipelines (DB, JSON)
│   ├── extensions/         # Custom extensions (DB connectors)
│   ├── items.py            # Data models
│   └── settings.py         # Project configuration
├── docker/                 # Docker related files
├── app.py                  # Script to run all spiders
├── run.sh                  # Shell script entry point
├── requirements.txt        # Python dependencies
└── scrapy.cfg              # Scrapy deployment config
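
For orientation, items.py holds the data model each spider yields. The sketch below is a guess at a typical news item, assuming common field names; it is not copied from this repository and the real fields may differ.

import scrapy

class NewsItem(scrapy.Item):
    # Hypothetical field set for a crawled article; see items.py for the real one.
    url = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    publish_time = scrapy.Field()
    source = scrapy.Field()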

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the project
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.
