This document describes how to scrape a news source, "http://www.dailymail.co.uk", and store the data in a Google BigQuery table.
- A Google account
- Python 3
- Pip3
- Google Services authentication
- Chrome-based browser (because Firefox stalls)
The components we use, and what we do with them, are described below:
- Local PC (green)
- We run a Python scraper to collect URLs and articles from the news site.
- We provision a local server which invokes the scraper.
- A Google Cloud project (blue)
- We define a topic, a subscription and a data warehouse to store articles.
- We use the Pub/Sub model described here: https://cloud.google.com/pubsub/docs/overview.
- The publisher application in our case is the scraper. It extracts the content of articles and encapsulates them into messages.
- The subscriber application is a Google Pub/Sub subscription.
- Messages are published onto a topic queue and removed once the subscriber sends an ACK.
- BigQuery is the data warehouse used to store the message content long term.
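The publish/ACK cycle above can be sketched with a small in-memory model. This is an illustration of the semantics only, not the real client (the scraper uses the google-cloud-pubsub library against the topic we create below):

```python
from collections import deque

class Topic:
    """In-memory stand-in for a Pub/Sub topic with a single subscription."""
    def __init__(self):
        self.queue = deque()  # messages awaiting acknowledgement

    def publish(self, message):
        self.queue.append(message)

    def pull(self):
        # Deliver the oldest message WITHOUT removing it;
        # Pub/Sub only drops a message after the subscriber ACKs it.
        return self.queue[0] if self.queue else None

    def ack(self, message):
        # Subscriber confirms processing; only now is the message removed.
        if self.queue and self.queue[0] == message:
            self.queue.popleft()

topic = Topic()
topic.publish({"url": "https://www.dailymail.co.uk/news/article-1.html"})
msg = topic.pull()   # message delivered to the subscriber
topic.ack(msg)       # ACK removes it from the topic queue
```

Note that without the ACK the message stays on the queue, which is what lets Pub/Sub redeliver messages the subscriber failed to process.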
Go to https://console.cloud.google.com and select a project. In this example the project is
"linux-academy-project-91522". The project you select should be your own (you might need to create one), so that
you have the correct permissions. Unless you request them, you will not have the correct permissions on the
"AI for Good" Google Cloud project "eng-lightning-244220".
From https://console.cloud.google.com, use the UI to search for Pub/Sub, or go directly to https://console.cloud.google.com/cloudpubsub/
Step 1) Create a topic. In this example the topic is called "newsarticles".
The topic name is
- projects/linux-academy-project-91522/topics/newsarticles
You do not need to define the schema of the topic.
Step 2) Create a subscription. This can be done by checking a box while creating the topic, or afterwards using the web UI. Make sure to select "newsarticles" as the topic and use the default settings.
The subscription name is
- projects/linux-academy-project-91522/subscriptions/newsarticles-sub
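The fully-qualified names above follow a fixed pattern, so they can be built from the project and resource IDs. A minimal sketch (the google-cloud-pubsub client provides equivalent helpers, but plain string formatting shows the format):

```python
def topic_path(project_id: str, topic_id: str) -> str:
    # Fully-qualified topic name used by the Pub/Sub API.
    return f"projects/{project_id}/topics/{topic_id}"

def subscription_path(project_id: str, subscription_id: str) -> str:
    # Fully-qualified subscription name.
    return f"projects/{project_id}/subscriptions/{subscription_id}"

print(topic_path("linux-academy-project-91522", "newsarticles"))
# projects/linux-academy-project-91522/topics/newsarticles
```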
Go to https://console.cloud.google.com/bigquery?project=linux-academy-project-91522 and create a dataset. The dataset name in this example is "my_dataset".
The dataset "my_dataset" contains a table called "my_table2". The table schema is shown in the screenshot.
The schema for the table gives each article the following fields and types:
- 'name': 'url', 'type': 'STRING'
- 'name': 'title', 'type': 'STRING'
- 'name': 'author', 'type': 'STRING'
- 'name': 'date', 'type': 'TIMESTAMP'
- 'name': 'tags', 'type': 'STRING'
- 'name': 'text', 'type': 'STRING'
The URI for schema creation is https://console.cloud.google.com/cloudpubsub/schema/list?cloudshell=true&project=linux-academy-project-91522
If you later try to import pandas DataFrames with a different schema, the import will fail with the message "Please verify that the structure and data types in the DataFrame match the schema of the destination table." This error can also appear if the DataFrame contains no data.
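One way to avoid that error is to check each article dict against the table schema before it is added to the DataFrame. A sketch of such a check (the field list mirrors the schema above; this helper is illustrative, not code from the repo):

```python
# Mirrors the BigQuery table schema defined above.
TABLE_SCHEMA = [
    {"name": "url",    "type": "STRING"},
    {"name": "title",  "type": "STRING"},
    {"name": "author", "type": "STRING"},
    {"name": "date",   "type": "TIMESTAMP"},
    {"name": "tags",   "type": "STRING"},
    {"name": "text",   "type": "STRING"},
]

def matches_schema(article: dict) -> bool:
    # The DataFrame columns must match the destination table exactly,
    # otherwise the BigQuery import fails with the error quoted above.
    expected = {field["name"] for field in TABLE_SCHEMA}
    return set(article) == expected

article = {
    "url": "https://www.dailymail.co.uk/news/article-1.html",
    "title": "Example", "author": "Staff",
    "date": "2021-01-01T00:00:00Z", "tags": "news", "text": "Body text",
}
assert matches_schema(article)
assert not matches_schema({"url": "x"})  # missing columns -> import would fail
```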
To begin scraping articles into this Google Cloud project you need to run the Python scraping code. The steps are described below; follow them to import Daily Mail articles into "my_table2".
This step proves that we have the local PC configured so that the Flask server can run there. The Flask server will run the scraping functions.
Go to the root of this repo, install the requirements using pip, then run the examples.py file:
git clone
cd infrastructure
pip install -r ./cloud_functions/dailymail/requirements.txt
python ./cloud_functions/dailymail/examples.py
You will see
(venv) C:\Users\richard.hill\PycharmProjects\infrastructure>python ./cloud_functions/dailymail/examples.py
* Serving Flask app "examples" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: on
* Restarting with stat
* Debugger is active!
* Debugger PIN: 206-483-474
* Running on http://127.0.0.1:8088/ (Press CTRL+C to quit)
Open the page http://127.0.0.1:8088/
The page will display "Hello world". This proves the Flask server is running. To scrape using this Flask server you need to make an HTTP request to a different endpoint on the same server.
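For reference, the server in examples.py boils down to something like the sketch below (the endpoint names and port are taken from this guide; the actual scraping logic is elided and the route bodies here are stubs):

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    # Served at http://127.0.0.1:8088/ to prove the server is up.
    return "Hello world"

@app.route("/scrapeurls")
def scrape_urls():
    # In examples.py this scrapes the domain for article URLs
    # and publishes them to the "newsarticles" topic.
    return "scrapeurls stub"

@app.route("/publisharticles")
def publish_articles():
    # In examples.py this pulls from the subscription and
    # writes article rows to BigQuery.
    return "publisharticles stub"

# To start the development server (this call blocks):
# app.run(host="127.0.0.1", port=8088, debug=True)
```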
This step configures the scraper to use the topic, dataset, and table you created in your Google Cloud project. The instructions assume that you used the same values we did above.
Open cloud_functions/dailymail/examples.py in an editor or IDE and configure the Tool in the file to match your project:
tool = Tool(domain_url='https://www.dailymail.co.uk/', project_id="linux-academy-project-91522", gps_topic_id="newsarticles",
gbq_dataset='my_dataset', gbq_table='my_table2')
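The Tool parameters above map directly onto the resources created earlier: the domain to scrape, the project, the Pub/Sub topic, and the BigQuery dataset and table. A hypothetical sketch of the shape of this configuration (the real Tool class in the repo also implements the scraping and publishing methods):

```python
from dataclasses import dataclass

@dataclass
class Tool:
    # Illustrative only: the fields Tool is configured with in examples.py.
    domain_url: str    # site to scrape
    project_id: str    # Google Cloud project
    gps_topic_id: str  # Pub/Sub topic for article URLs
    gbq_dataset: str   # BigQuery dataset
    gbq_table: str     # BigQuery destination table

tool = Tool(domain_url="https://www.dailymail.co.uk/",
            project_id="linux-academy-project-91522",
            gps_topic_id="newsarticles",
            gbq_dataset="my_dataset",
            gbq_table="my_table2")
```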
Save the cloud_functions/dailymail/examples.py file.
Run examples.py in your Python environment:
python cloud_functions/dailymail/examples.py
An HTTP GET request will cause the domain to be scraped for article URLs, which are then published to the "newsarticles" topic you created.
Open a browser or HTTP client and GET http://127.0.0.1:8088/scrapeurls
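The URL-collection step amounts to parsing anchor tags out of the site's HTML. A minimal sketch using only the standard library (the filter on '/article-' reflects Daily Mail's URL pattern; the real scraper's selection rules live in the repo):

```python
from html.parser import HTMLParser

class ArticleLinkParser(HTMLParser):
    """Collect href values of anchors that look like article links."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            # Daily Mail article URLs contain an '/article-<id>' segment.
            if "/article-" in href:
                self.urls.append(href)

html = '<a href="/news/article-123/story.html">story</a><a href="/home">home</a>'
parser = ArticleLinkParser()
parser.feed(html)
# parser.urls == ["/news/article-123/story.html"]
```

In the real flow each collected URL would then be published as a message to the "newsarticles" topic.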
An HTTP GET request will subscribe to the topic, fetch the article content, and publish it to the BigQuery table you created.
Open a browser or HTTP client and GET http://127.0.0.1:8088/publisharticles
In my case this is visible at
https://console.cloud.google.com/bigquery?project=linux-academy-project-91522&d=my_dataset&p=linux-academy-project-91522&page=dataset
