How to extract image URLs and save the images locally? #1164

Hi,

I am working with Norconex and find myself a little lost this time. The title describes what I am trying to achieve, but I am not making any progress.

First, I want to extract the image URLs from the downloaded HTML (downloading the HTML works fine). The images are `img` elements inside a `div` with the class `article-split__content`. I have tried several options, but nothing ever shows up in the resulting JSON file, so I cannot tell whether any of them works.

I have tried these options:

- RegexTagger
- HTMLLinkExtractor
- DOMTagger
- HtmlContentExtractor

My best guess is that I need the `DOMTagger`, so I set up a post-parse handler (`postParseHandlers`). This is my config file:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="config-id">
  <workDir>./work</workDir>

  <crawlers>
    <crawler id="crawler-id">

      <sitemapResolver ignore="true" />

      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://mydomain.com/imagepage</url>
      </startURLs>

      <maxDepth>0</maxDepth>

      <importer>
        <postParseHandlers>
          <handler class="com.norconex.importer.handler.tagger.impl.DOMTagger">
            <dom selector="div.article-split__content img" toField="links" extract="attr(src)" />
          </handler>
        </postParseHandlers>
      </importer>

      <committers>
        <committer class="com.norconex.committer.core3.fs.impl.JSONFileCommitter">
          <directory>./link</directory>
          <docsPerFile>50</docsPerFile>
          <indent>3</indent>
        </committer>
      </committers>

    </crawler>
  </crawlers>

</httpcollector>
```
Does this look correct?

Here is an example of the HTML to be parsed:
```html
<div class="article-split__content">

Zes spellen, 456 deelnemers, en 45,6 miljard won aan prijzengeld. Squid Game houdt de wereld al drie jaar in de ban van een verknipte strijd om leven en dood, en heel veel geld. Nu is het tijd voor de ontknoping. En daarin kan er maar één winnaar zijn. 


Door: Jip Soekhai&nbsp;Ieder spel komt ooit tot een einde, zo ook Squid Game. Nog geen half jaar na de release van Squid Game 2 dropt Netflix het derde én laatste seizoen van hun best bekeken serie ooit. Het verhaal dat drie jaar geleden begon als kritiek op kapitalisme en klassenverschillen is uitgegroeid tot een wereldwijd fenomeen. Inmiddels wordt het nagespeeld in realityshows, om zo zelf de ‘Squid Game-experience’ mee te maken. Het gaat in tegen alles wat maker Hwang Dong-hyuk probeert over te brengen met zijn serie. Maargoed, wat is menselijkheid nog waard als er bakken met geld tegenover staan? Na drie seizoenen van luguber en mensonterend ‘vermaak’ is dat de vraag die blijft hangen.&nbsp;<img src="https://assets.prod.npo3.npox.nl/media_item/1197/63/Squidgame_Unit_311_N013244-1750968924.jpg" alt="Squid Game Season 3 - Netflix" style="width: 100%;">
</div>
```
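To make clear what I expect the selector `div.article-split__content img` with `attr(src)` to return for this sample, here is a small sketch outside Norconex. It uses only the Python standard library; the names `ImgSrcCollector` and `extract_img_sources` are my own and not part of Norconex, the sample text is shortened, and the nesting logic is a simplification that assumes every non-void tag inside the `div` is properly closed.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ImgSrcCollector(HTMLParser):
    """Collect the src of every <img> nested in a <div> carrying target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # nesting depth inside the matching div (0 = outside)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            # Already inside the target div: record images, track nesting.
            if tag == "img" and attrs.get("src"):
                self.sources.append(attrs["src"])
            if tag not in VOID_TAGS:
                self.depth += 1
        elif tag == "div" and self.target_class in (attrs.get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


def extract_img_sources(html, target_class="article-split__content"):
    parser = ImgSrcCollector(target_class)
    parser.feed(html)
    parser.close()
    return parser.sources


# Shortened version of the sample above (article text trimmed).
SAMPLE = """<div class="article-split__content">
Door: Jip Soekhai&nbsp;<img src="https://assets.prod.npo3.npox.nl/media_item/1197/63/Squidgame_Unit_311_N013244-1750968924.jpg" alt="Squid Game Season 3 - Netflix" style="width: 100%;">
</div>"""

if __name__ == "__main__":
    print(extract_img_sources(SAMPLE))
```

For the sample this yields a single URL, the `Squidgame_Unit_311_N013244-1750968924.jpg` image, which is the value I would expect to see in the `links` field of the committed JSON.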

As a next step, I would like to download every image whose URL ends up in the extracted field (`links` in the config above). I found the `FileCommitter`, but it seems to exist only for version 2, while I am on version 3.1.0. While searching I came across this issue https://github.com/Norconex/crawlers/issues/1160, which gives me the idea that `collector-filesystem` is now used to save images. Is that right?

I also checked the migration guide at https://opensource.norconex.com/crawlers/web/v3/migration, but I see no reference to the `FileCommitter`.

If the images cannot be saved, I would already be happy if the links could be extracted.

Thank you for your help.
