From 842c9f8a3a7e14023f451ba7a83dbcf75e0b236b Mon Sep 17 00:00:00 2001 From: Thijs Dalhuijsen Date: Fri, 18 Jul 2025 21:07:53 +0200 Subject: [PATCH 1/4] docs: added flowchart diagrams --- README.md | 33 +++++++++++++++++++++++++++++++-- 1 file changed, 31 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index d6093b3..f2f823d 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,18 @@ # Whirlwind Tour of Common Crawl's Datasets using Python The Common Crawl corpus contains petabytes of crawl data, including raw web page data, metadata extracts, and text extracts. Common Crawl's data storage is a little complicated, as you might expect for such a large and rich dataset. We make our crawl data available in a variety of formats (WARC, WET, WAT) and we also have two index files of the crawled webpages: CDXJ and columnar. +```mermaid +flowchart TD + WEB["WEB"] -- crawler --> cc["Common Crawl"] + cc --> WARC["WARC"] & WAT["WAT"] & WET["WET"] & CDXJ["CDXJ"] & Columnar["Columnar"] & etc["..."] + WEB@{ shape: cyl} + WARC@{ shape: stored-data} + WAT@{ shape: stored-data} + WET@{ shape: stored-data} + CDXJ@{ shape: stored-data} + Columnar@{ shape: stored-data} + etc@{ shape: stored-data} +``` The goal of this whirlwind tour is to show you how a single webpage appears in all of these different places. That webpage is [https://an.wikipedia.org/wiki/Escopete](https://an.wikipedia.org/wiki/Escopete), which we crawled on the date 2024-05-18T01:58:10Z. On the way, we'll also explore the file formats we use and learn about some useful tools for interacting with our data! @@ -96,7 +108,12 @@ Now that we've looked at the uncompressed versions of these files to understand ## Task 2: Iterate over WARC, WET, and WAT files The [warcio](https://github.com/webrecorder/warcio) Python library lets us read and write WARC files programmatically. - +```mermaid +flowchart LR + user["user (r/w)"]--warcio (r) -->warc + user--warcio (w) -->warc + warc@{shape: cyl} +``` Let's use it to iterate over our WARC, WET, and WAT files and print out the record types we looked at before. First, look at the code in `warcio-iterator.py`:
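The iterator script itself sits outside these hunks, so it isn't reproduced in the diff. A minimal sketch of the shape such a script usually takes, using nothing beyond warcio's documented `ArchiveIterator` (the real `warcio-iterator.py` may differ in its arguments, file handling, and output):

```python
import sys

from warcio.archiveiterator import ArchiveIterator


def print_record_types(path):
    """Print the record type and target URI of every record in one file."""
    with open(path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            uri = record.rec_headers.get_header('WARC-Target-URI') or '-'
            print(record.rec_type, uri)


if __name__ == '__main__':
    # usage: python warcio-iterator.py <file.warc.gz> [more files ...]
    for path in sys.argv[1:]:
        print_record_types(path)
```

Because WET and WAT files are themselves WARC-format containers, the same loop handles all three.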
@@ -161,6 +178,12 @@ The output has three sections, one each for the WARC, WET, and WAT. For each one ## Task 3: Index the WARC, WET, and WAT The example WARC files we've been using are tiny and easy to work with. The real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, we have around 24 million of these files! To read all of them, we could iterate, but what if we wanted random access so we could read just one particular record? We do that with an index. +```mermaid +flowchart LR + warc[warc] --> indexer --> cdxj[.cdxj] & columnar[.parquet] + warc@{shape: cyl} +``` + We have two versions of the index: the CDX index and the columnar index. The CDX index is useful for looking up single pages, whereas the columnar index is better suited to analytical and bulk queries. We'll look at both in this tour, starting with the CDX index. @@ -196,7 +219,7 @@ The JSON blob has enough information to extract individual records: it says whic ## Task 4: Use the CDXJ index to extract raw content from the local WARC, WET, and WAT -Normally, compressed files aren't random access. However, the WARC files use a trick to make this possible, which is that every record needs to be separately compressed.The `gzip` compression utility supports this, but it's rarely used. +Normally, compressed files aren't random access. However, the WARC files use a trick to make this possible, which is that every record needs to be separately compressed. The `gzip` compression utility supports this, but it's rarely used. To extract one record from a warc file, all you need to know is the filename and the offset into the file. If you're reading over the web, then it really helps to know the exact length of the record. @@ -312,6 +335,12 @@ Make sure you compress WARCs the right way! ## Task 6: Use cdx_toolkit to query the full CDX index and download those captures from AWS S3 Some of our users only want to download a small subset of the crawl. They want to run queries against an index, either the CDX index we just talked about, or in the columnar index, which we'll talk about later. +```mermaid +flowchart LR + user --cdx_toolkit--> cdxi + cdxi@{shape: cyl} +``` + The [cdx_toolkit](https://github.com/cocrawler/cdx_toolkit) is a set of tools for working with CDX indices of web crawls and archives. It knows how to query the CDX index across all of our crawls and also can create WARCs of just the records you want. We will fetch the same record from Wikipedia that we've been using for the whirlwind tour. From 0a790a52b5b11dd74945fb1b40f6505621ae5fae Mon Sep 17 00:00:00 2001 From: Thijs Dalhuijsen Date: Fri, 18 Jul 2025 21:18:00 +0200 Subject: [PATCH 2/4] docs: added bonus multi-task exercise --- README.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/README.md b/README.md index f2f823d..581f60d 100644 --- a/README.md +++ b/README.md @@ -536,10 +536,18 @@ download instructions. All of these scripts run the same SQL query and should return the same record (written as a parquet file). +## Bonus 2: combine some steps + +1. Use the DuckDb techniques from [Task 8](#task-8-query-using-the-columnar-index--duckdb-from-outside-aws) and the [Index Server](https://index.commoncrawl.org) to find a new webpage in the archives. +2. Note its url, warc, and timestamp. +3. Now open up the Makefile from [Task 6](#task-6-use-cdx_toolkit-to-query-the-full-cdx-index-and-download-those-captures-from-aws-s3) and look at the actions from the cdx_toolkit section. +4. 
Repeat the cdx_toolkit steps, but for the page and date range you found above. + ## Congratulations! You have completed the Whirlwind Tour of Common Crawl's Datasets using Python! You should now understand different filetypes we have in our corpus and how to interact with Common Crawl's datasets using Python. To see what other people have done with our data, see the [Examples page](https://commoncrawl.org/examples) on our website. Why not join our Discord through the Community tab? + ## Other datasets We make more datasets available than just the ones discussed in this Whirlwind Tour. Below is a short introduction to some of these other datasets, along with links to where you can find out more. From 6f6ffa01deafd03c7891b5e50967b5b0256be46a Mon Sep 17 00:00:00 2001 From: Thijs Dalhuijsen Date: Fri, 18 Jul 2025 21:35:29 +0200 Subject: [PATCH 3/4] docs: cleanup after testing of diagrams --- README.md | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index 581f60d..e0f65db 100644 --- a/README.md +++ b/README.md @@ -110,8 +110,8 @@ Now that we've looked at the uncompressed versions of these files to understand The [warcio](https://github.com/webrecorder/warcio) Python library lets us read and write WARC files programmatically. ```mermaid flowchart LR - user["user (r/w)"]--warcio (r) -->warc - user--warcio (w) -->warc + user["userprocess (r/w)"]--warcio (w) -->warc + warc --warcio (r)--> user warc@{shape: cyl} ``` Let's use it to iterate over our WARC, WET, and WAT files and print out the record types we looked at before. First, look at the code in `warcio-iterator.py`: @@ -177,11 +177,13 @@ The output has three sections, one each for the WARC, WET, and WAT. For each one ## Task 3: Index the WARC, WET, and WAT -The example WARC files we've been using are tiny and easy to work with. The real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, we have around 24 million of these files! To read all of them, we could iterate, but what if we wanted random access so we could read just one particular record? We do that with an index. +The example WARC files we've been using a\e tiny and easy to work with. The real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, we have around 24 million of these files! To read all of them, we could iterate, but what if we wanted random access so we could read just one particular record? We do that with an index. ```mermaid flowchart LR - warc[warc] --> indexer --> cdxj[.cdxj] & columnar[.parquet] + warc --> indexer --> cdxj & columnar warc@{shape: cyl} + cdxj@{ shape: stored-data} + columnar@{ shape: stored-data} ``` @@ -335,12 +337,6 @@ Make sure you compress WARCs the right way! ## Task 6: Use cdx_toolkit to query the full CDX index and download those captures from AWS S3 Some of our users only want to download a small subset of the crawl. They want to run queries against an index, either the CDX index we just talked about, or in the columnar index, which we'll talk about later. -```mermaid -flowchart LR - user --cdx_toolkit--> cdxi - cdxi@{shape: cyl} -``` - The [cdx_toolkit](https://github.com/cocrawler/cdx_toolkit) is a set of tools for working with CDX indices of web crawls and archives. It knows how to query the CDX index across all of our crawls and also can create WARCs of just the records you want. We will fetch the same record from Wikipedia that we've been using for the whirlwind tour. 
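In the tour these steps are driven from a Makefile, which isn't part of this diff. In Python, the equivalent index lookup with cdx_toolkit looks roughly like the sketch below; the one-month window around the 2024-05-18 capture is an assumption, and the Makefile's exact invocations may differ:

```python
import cdx_toolkit

# source='cc' queries Common Crawl's CDX index across all crawls.
cdx = cdx_toolkit.CDXFetcher(source='cc')

url = 'https://an.wikipedia.org/wiki/Escopete'

# A window around May 2024 is enough to catch the 2024-05-18 capture.
for capture in cdx.iter(url, from_ts='20240501', to='20240601', limit=5):
    print(capture['timestamp'], capture['status'],
          capture['filename'], capture['offset'], capture['length'])
```

Each capture carries the WARC filename, offset, and length, and cdx_toolkit can also write a small WARC containing just the records you asked for, which is what Task 6 goes on to do.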
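The bonus exercise above asks you to combine the columnar index with these cdx_toolkit steps. For the columnar half, run from outside AWS, here is a sketch of a DuckDB query followed by a ranged request for the record it points at. The crawl label CC-MAIN-2024-22, the column names, and the anonymous S3 setup are assumptions; the Task 8 scripts in the repository are the authoritative version of the query:

```python
import gzip
import urllib.request

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("SET s3_region = 'us-east-1';")
# Depending on your DuckDB version you may also need to configure anonymous
# S3 credentials; see the Task 8 scripts for a known-good setup.

# Columnar index for one crawl. CC-MAIN-2024-22 is assumed to be the crawl
# that holds the 2024-05-18 capture used throughout the tour.
index_glob = ("s3://commoncrawl/cc-index/table/cc-main/warc/"
              "crawl=CC-MAIN-2024-22/subset=warc/*.parquet")

# Scanning a whole crawl's index for one URL is slow; it touches metadata for
# a few hundred Parquet files, so expect this to take a while.
row = con.execute(f"""
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM read_parquet('{index_glob}')
    WHERE url = 'https://an.wikipedia.org/wiki/Escopete'
    LIMIT 1
""").fetchone()

url, filename, offset, length = row
print(url, filename, offset, length)

# With the filename, offset, and length in hand, one ranged HTTP request
# pulls exactly one gzipped record, just as Task 4 does with the local files.
req = urllib.request.Request(
    f"https://data.commoncrawl.org/{filename}",
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
)
with urllib.request.urlopen(req) as resp:
    record = gzip.decompress(resp.read())

print(record[:300].decode("utf-8", errors="replace"))
```

These are the same filename, offset, and length coordinates the CDXJ index gave us in Task 4, just served from the columnar side.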
From a45e92bc1ad1435f69295a033556bdeeae3b3469 Mon Sep 17 00:00:00 2001 From: Thijs Dalhuijsen Date: Fri, 25 Jul 2025 19:52:12 +0200 Subject: [PATCH 4/4] docs: typo fix --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index e0f65db..60757c0 100644 --- a/README.md +++ b/README.md @@ -177,7 +177,7 @@ The output has three sections, one each for the WARC, WET, and WAT. For each one ## Task 3: Index the WARC, WET, and WAT -The example WARC files we've been using a\e tiny and easy to work with. The real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, we have around 24 million of these files! To read all of them, we could iterate, but what if we wanted random access so we could read just one particular record? We do that with an index. +The example WARC files we've been using are tiny and easy to work with. The real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, we have around 24 million of these files! To read all of them, we could iterate, but what if we wanted random access so we could read just one particular record? We do that with an index. ```mermaid flowchart LR warc --> indexer --> cdxj & columnar