@@ -81,7 +81,7 @@ In the next lesson, we'll start with our Node.js project. First we'll be figurin

### Extract the price of IKEA's most expensive artificial plant

At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/), use CSS selectors and HTML elements manipulation in the **Console** to extract the price of the most expensive artificial plant (sold in Sweden, as you'll be browsing their Swedish offer). Before opening DevTools, use your judgment to adjust the page to make the task as straightforward as possible. Finally, use the [`parseInt()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt) function to convert the price text into a number.
At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/), use CSS selectors and HTML elements manipulation in the **Console** to extract the price of the most expensive artificial plant (sold in Sweden, as you'll be browsing their Swedish offer). Before opening DevTools, use your judgment to adjust the page to make the task as straightforward as possible. Finally, use the [`parseInt()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt) function to convert the price text into a number (you may need [`replace()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace) to handle spaces).

<details>
<summary>Solution</summary>
@@ -93,8 +93,8 @@ At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/a
1. Notice that the price is structured into two elements, with the integer separated from the currency, under a class named `plp-price__integer`. This structure is convenient for extracting the value.
1. In the **Console**, execute `document.querySelector('.plp-price__integer')`. This returns the element representing the first price in the listing. Since `document.querySelector()` returns the first matching element, it directly selects the most expensive plant's price.
1. Save the element in a variable by executing `price = document.querySelector('.plp-price__integer')`.
1. Convert the price text into a number by executing `parseInt(price.textContent)`.
1. At the time of writing, this returns `699`, meaning [699 SEK](https://www.google.com/search?q=699%20sek).
1. Convert the price text into a number by executing `parseInt(price.textContent.replace(' ', ''))`. The `replace(' ', '')` call removes the first space from the price string. Without it, `parseInt()` would stop parsing at the space and return only `1`.
1. At the time of writing, this returns `1299`, meaning [1 299 SEK](https://www.google.com/search?q=1299%20sek).

</details>
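
The space handling in the solution above can be sketched in isolation. This is a minimal illustration, not part of the exercise: `parsePrice` is a hypothetical helper name, and the sample strings are made up. It assumes the price text may contain more than one space or a non-breaking space, which a plain `replace(' ', '')` would not fully cover because it only replaces the first match.

```javascript
// Hypothetical helper (not part of the exercise): converts price text to an integer.
function parsePrice(text) {
  // Remove every whitespace character, including non-breaking spaces (U+00A0).
  const digits = text.replace(/\s/g, '');
  // Always pass the radix so the text is parsed as a decimal number.
  return parseInt(digits, 10);
}

// parseInt() stops at the first character it cannot parse, here the space.
console.log(parseInt('1 299', 10));

// A string pattern in replace() only replaces the first occurrence.
console.log(parseInt('12 345 678'.replace(' ', ''), 10));

// The regular expression with the g flag removes all whitespace.
console.log(parsePrice('12 345 678'));
```

For a price such as `1 299`, both approaches give the same result. The regular expression only matters once the text contains several separators.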

@@ -119,7 +119,7 @@ On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selecto

### Extract details about the first post on Guardian's F1 news

On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph, and URL of the associated photo.
On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph (if it has one), and URL of the associated photo.

![F1 news page](../scraping_basics/images/devtools-exercise-guardian2.png)

@@ -132,7 +132,7 @@ On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone),
1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tag names and randomized classes, requiring you to rely on the element hierarchy and order instead.
1. In the **Console**, execute `post = document.querySelector('#maincontent ul li')`. This returns the element representing the first post.
1. Extract the post's title by executing `post.querySelector('h3').textContent`.
1. Extract the lead paragraph by executing `post.querySelector('span div').textContent`.
1. Extract the lead paragraph (if it has one) by executing `post.querySelector('span div').textContent`.
1. Extract the photo URL by executing `post.querySelector('img').src`.

</details>
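
When the first post has no lead paragraph, `querySelector()` returns `null`, and reading `.textContent` from it throws a `TypeError`. One way to guard against that is optional chaining, sketched below. The `textOrNull` helper and the `fakePost` object are hypothetical: the stub only stands in for a DOM element so that the snippet runs outside the browser. In the **Console** you would pass the real `post` element instead.

```javascript
// Hypothetical helper: returns the text of the first matching descendant,
// or null when the selector matches nothing.
function textOrNull(element, selector) {
  // ?. skips the property access if querySelector() returned null,
  // and ?? turns the resulting undefined into an explicit null.
  return element.querySelector(selector)?.textContent ?? null;
}

// Stand-in for a DOM element (an assumption made for illustration only).
// It "contains" an h3, but no lead paragraph.
const fakePost = {
  querySelector: (selector) =>
    selector === 'h3' ? { textContent: 'Example title' } : null,
};

console.log(textOrNull(fakePost, 'h3'));
console.log(textOrNull(fakePost, 'span div'));
```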
@@ -39,7 +39,7 @@ if (response.ok) {
}
```

Calling [`toArray()`](https://cheerio.js.org/docs/api/classes/Cheerio#toarray) converts the Cheerio selection to a standard JavaScript array. We can then loop over that array and process each selected element.
Calling [`toArray()`](https://cheerio.js.org/docs/api/classes/cheerio#toarray) converts the Cheerio selection to a standard JavaScript array. We can then loop over that array and process each selected element.

Cheerio requires us to wrap each element with `$()` again before we can work with it further, and then we call `.text()`. If we run the code, it… well, it definitely prints _something_…

@@ -136,7 +136,7 @@ When translated to a tree of JavaScript objects, the element with class `price`
- a `span` HTML element,
- a textual node representing the actual amount and possibly also white space.

We can use Cheerio's [`.contents()`](https://cheerio.js.org/docs/api/classes/Cheerio#contents) method to access individual nodes. It returns a list of nodes like this:
We can use Cheerio's [`.contents()`](https://cheerio.js.org/docs/api/classes/cheerio#contents) method to access individual nodes. It returns a list of nodes like this:

```text
LoadedCheerio {
@@ -197,7 +197,7 @@ if (response.ok) {
}
```

We're enjoying the fact that Cheerio selections provide utility methods for accessing items, such as [`.first()`](https://cheerio.js.org/docs/api/classes/Cheerio#first) or [`.last()`](https://cheerio.js.org/docs/api/classes/Cheerio#last). If we run the scraper now, it should print prices as only amounts:
We're enjoying the fact that Cheerio selections provide utility methods for accessing items, such as [`.first()`](https://cheerio.js.org/docs/api/classes/cheerio#first) or [`.last()`](https://cheerio.js.org/docs/api/classes/cheerio#last). If we run the scraper now, it should print prices as only amounts:

```text
$ node index.js
@@ -237,7 +237,7 @@ Macao, China

:::tip Need a nudge?

You may want to check out Cheerio's [`.eq()`](https://cheerio.js.org/docs/api/classes/Cheerio#eq).
You may want to check out Cheerio's [`.eq()`](https://cheerio.js.org/docs/api/classes/cheerio#eq).

:::

@@ -290,7 +290,7 @@ Hamilton reveals distress over ‘devastating’ groundhog accident at Canadian
:::tip Need a nudge?

- HTML's `time` element can have a `datetime` attribute, which [contains data in a machine-readable format](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time), such as ISO 8601.
- Cheerio gives you [.attr()](https://cheerio.js.org/docs/api/classes/Cheerio#attr) to access attributes.
- Cheerio gives you [.attr()](https://cheerio.js.org/docs/api/classes/cheerio#attr) to access attributes.
- In JavaScript you can use an ISO 8601 string to create a [`Date`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date) object.
- To get the date, you can call `.toDateString()` on `Date` objects.

@@ -57,7 +57,7 @@ We'll start from a boilerplate that's very similar to the scraper we built in [B
{Example}
</RunnableCodeBlock>

Aside from importing libraries and downloading HTML, we load the HTML into Cheerio and then use it to retrieve all the `<a>` elements. After that, we iterate over the collected links and print their `href` attributes, which we access using the [`.attr()`](https://cheerio.js.org/docs/api/classes/Cheerio#attr) method.
Aside from importing libraries and downloading HTML, we load the HTML into Cheerio and then use it to retrieve all the `<a>` elements. After that, we iterate over the collected links and print their `href` attributes, which we access using the [`.attr()`](https://cheerio.js.org/docs/api/classes/cheerio#attr) method.

When you run the above code, you'll see quite a lot of links in the terminal. Some of them may look wrong, because they don't start with the regular `https://` protocol. We'll learn what to do with them in the following lessons.

@@ -78,7 +78,7 @@ In the next lesson, we'll start with our Python project. First we'll be figuring

### Extract the price of IKEA's most expensive artificial plant

At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/), use CSS selectors and HTML elements manipulation in the **Console** to extract the price of the most expensive artificial plant (sold in Sweden, as you'll be browsing their Swedish offer). Before opening DevTools, use your judgment to adjust the page to make the task as straightforward as possible. Finally, use JavaScript's [`parseInt()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt) function to convert the price text into a number.
At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/), use CSS selectors and HTML elements manipulation in the **Console** to extract the price of the most expensive artificial plant (sold in Sweden, as you'll be browsing their Swedish offer). Before opening DevTools, use your judgment to adjust the page to make the task as straightforward as possible. Finally, use JavaScript's [`parseInt()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt) function to convert the price text into a number (you may need [`replace()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace) to handle spaces).

<details>
<summary>Solution</summary>
@@ -90,8 +90,8 @@ At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/a
1. Notice that the price is structured into two elements, with the integer separated from the currency, under a class named `plp-price__integer`. This structure is convenient for extracting the value.
1. In the **Console**, execute `document.querySelector('.plp-price__integer')`. This returns the element representing the first price in the listing. Since `document.querySelector()` returns the first matching element, it directly selects the most expensive plant's price.
1. Save the element in a variable by executing `price = document.querySelector('.plp-price__integer')`.
1. Convert the price text into a number by executing `parseInt(price.textContent)`.
1. At the time of writing, this returns `699`, meaning [699 SEK](https://www.google.com/search?q=699%20sek).
1. Convert the price text into a number by executing `parseInt(price.textContent.replace(' ', ''))`. The `replace(' ', '')` call removes the first space from the price string. Without it, `parseInt()` would stop parsing at the space and return only `1`.
1. At the time of writing, this returns `1299`, meaning [1 299 SEK](https://www.google.com/search?q=1299%20sek).

</details>

@@ -116,7 +116,7 @@ On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selecto

### Extract details about the first post on Guardian's F1 news

On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph, and URL of the associated photo.
On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph (if it has one), and URL of the associated photo.

![F1 news page](../scraping_basics/images/devtools-exercise-guardian2.png)

@@ -129,7 +129,7 @@ On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone),
1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tag names and randomized classes, requiring you to rely on the element hierarchy and order instead.
1. In the **Console**, execute `post = document.querySelector('#maincontent ul li')`. This returns the element representing the first post.
1. Extract the post's title by executing `post.querySelector('h3').textContent`.
1. Extract the lead paragraph by executing `post.querySelector('span div').textContent`.
1. Extract the lead paragraph (if it has one) by executing `post.querySelector('span div').textContent`.
1. Extract the photo URL by executing `post.querySelector('img').src`.

</details>