feat: centralize HTML link extraction and grouping in link_extractor + probe_html_portal #407
Merged
…ctor
Create toolkit/scout/link_extractor.py as the central contract for extracting and grouping data links from HTML pages.
- DataLink and LinkGroup as dataclasses with metadata (format, prefix, years)
- extract_data_links() unifies the previous implementations
- group_links() for smart grouping
- extract_candidate_links() kept as a backward-compatibility bridge
- http.py: removed _AnchorParser, extract_candidate_links redirects to link_extractor
- mcp/scout_ops.py: response enriched with data_links[] and groups[]
Files: +link_extractor.py, __init__.py, http.py, scout_ops.py
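To make the contract described in this commit concrete, here is a minimal sketch of how the dataclasses and the two functions could fit together. It is not the code from link_extractor.py: the field layout of LinkGroup, the DATA_FORMATS set, the year regex, the prefix rule, and the helper _HrefCollector are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

# Assumption: which extensions count as "data" is not stated in the PR.
DATA_FORMATS = {"csv", "xlsx", "xls", "json", "zip", "ods"}
# Assumption: a year is a standalone 4-digit number starting with 19 or 20.
YEAR_RE = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


@dataclass
class DataLink:
    url: str
    format: str
    prefix: str
    years: list[int] = field(default_factory=list)


@dataclass
class LinkGroup:
    # Assumption: a group is keyed by prefix and format.
    prefix: str
    format: str
    links: list[DataLink] = field(default_factory=list)


class _HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_data_links(html: str, base_url: str) -> list[DataLink]:
    """Return one DataLink per anchor whose path ends in a data extension."""
    parser = _HrefCollector()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        url = urljoin(base_url, href)
        name = PurePosixPath(urlparse(url).path)
        fmt = name.suffix.lstrip(".").lower()
        if fmt not in DATA_FORMATS:
            continue
        years = [int(y) for y in YEAR_RE.findall(name.stem)]
        # Prefix = file stem with the years stripped out.
        prefix = YEAR_RE.sub("", name.stem).strip("_-. ")
        links.append(DataLink(url, fmt, prefix, years))
    return links


def group_links(links: list[DataLink]) -> list[LinkGroup]:
    """Group links that share the same prefix and format."""
    groups: dict[tuple[str, str], LinkGroup] = {}
    for link in links:
        key = (link.prefix, link.format)
        groups.setdefault(key, LinkGroup(link.prefix, link.format)).links.append(link)
    return list(groups.values())
```

With this sketch, two anchors pointing at popolazione_2021.csv and popolazione_2022.csv would end up in a single group with prefix popolazione, which is the kind of yearly series the grouping is meant to surface.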
Add a fixed-cost probe to discover the structure of an HTML portal:
- GET /robots.txt and extract the Sitemap: URLs
- Try canonical paths (/sitemap.xml, /sitemap_index.xml)
- Download and parse the sitemap into a list of pages
- Look for RSS feeds and JSON:API patterns in the homepage
- Count internal links
Results on real portals:
- opencivitas: 575 pages from the sitemap
- giustizia_statistiche: 123 pages from the sitemap
- aifa, dait: robots.txt found but no sitemap
Files: probe.py (+probe_html_portal, +PortalProfile), __init__.py
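The first two probe steps can be sketched without any network access. The functions below are hypothetical (sitemaps_from_robots and sitemap_candidates are not names from the PR); they only illustrate reading Sitemap: lines from a robots.txt body and falling back to the canonical paths listed above.

```python
from urllib.parse import urljoin

# Canonical paths as listed in the commit message.
CANONICAL_SITEMAP_PATHS = ("/sitemap.xml", "/sitemap_index.xml")


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs declared on 'Sitemap:' lines (field name is case-insensitive)."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "://" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def sitemap_candidates(base_url: str, robots_txt: str) -> list[str]:
    """Declared sitemaps first, then canonical paths not already declared."""
    declared = sitemaps_from_robots(robots_txt)
    fallback = [urljoin(base_url, path) for path in CANONICAL_SITEMAP_PATHS]
    return declared + [url for url in fallback if url not in declared]
```

Putting the declared sitemaps first keeps the probe cheap: the canonical paths are only worth a request when robots.txt did not already name them.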
- _fetch_and_parse_sitemap renamed to the public fetch_sitemap_pages
- Exported in __init__.py
- Usage in probe_html_portal updated
Files: probe.py, __init__.py
- If base_url is a path (e.g. /opendata/...), also try the domain root for robots.txt and the sitemap
- Added /opendata/sitemap.xml to the canonical paths
- Deduplicate the URLs found
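One possible reading of this commit, as a sketch: build the list of URLs to probe from both the given base_url and the domain root, then drop duplicates while preserving order. The function names and the exact set of URLs tried at the base path are assumptions, not the code in probe.py.

```python
from urllib.parse import urlsplit, urlunsplit

# Canonical paths after this commit, resolved against the domain root.
CANONICAL_PATHS = ("/sitemap.xml", "/sitemap_index.xml", "/opendata/sitemap.xml")


def domain_root(url: str) -> str:
    """Return scheme://host with no path, query, or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "", "", ""))


def candidate_urls(base_url: str) -> list[str]:
    """URLs to probe for base_url and its domain root, deduplicated in order."""
    base = base_url.rstrip("/")
    root = domain_root(base_url)
    urls = [f"{base}/robots.txt", f"{base}/sitemap.xml", f"{root}/robots.txt"]
    urls += [root + path for path in CANONICAL_PATHS]
    # dict.fromkeys keeps the first occurrence of each URL and preserves order.
    return list(dict.fromkeys(urls))
```

For a base_url of https://example.org/opendata/, the URL https://example.org/opendata/sitemap.xml is produced twice (once from the base path, once from the canonical list), which is exactly the case the dedup step covers.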
- fetch_sitemap_pages now handles XML sitemap indexes (<sitemapindex><sitemap><loc>) with recursive fetching
- MCP formats computed from the URL path (urlparse) instead of the raw URL
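Both changes in this commit can be illustrated with a short sketch. collect_sitemap_pages is a hypothetical stand-in for fetch_sitemap_pages: its signature, the injected fetch callable, and the max_depth guard are assumptions, chosen so the example runs without network access. url_format shows why the format is taken from the parsed path rather than the raw URL.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath
from typing import Callable
from urllib.parse import urlparse


def _local(tag: str) -> str:
    """Strip the '{namespace}' part ElementTree prepends to tag names."""
    return tag.rsplit("}", 1)[-1]


def collect_sitemap_pages(
    url: str, fetch: Callable[[str], str], max_depth: int = 3
) -> list[str]:
    """Return page URLs from a sitemap, following <sitemapindex> entries recursively."""
    root = ET.fromstring(fetch(url))
    locs = [
        el.text.strip()
        for el in root.iter()
        if _local(el.tag) == "loc" and el.text
    ]
    if _local(root.tag) != "sitemapindex":
        return locs  # plain <urlset>: the <loc> values are the pages
    if max_depth <= 0:
        return []  # assumption: stop instead of recursing without bound
    pages: list[str] = []
    for child_url in locs:
        pages.extend(collect_sitemap_pages(child_url, fetch, max_depth - 1))
    return pages


def url_format(url: str) -> str:
    """File extension taken from the URL path, ignoring query and fragment."""
    return PurePosixPath(urlparse(url).path).suffix.lstrip(".").lower()
```

Taking the suffix of the raw string https://example.org/files/data.csv?download=1 would yield csv?download=1, whereas parsing the URL first gives csv.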
6f99096 to 002274a
What changes
Centralizes in the toolkit all the logic for extracting links from HTML, grouping them, and probing portals, which was previously duplicated between toolkit and source-observatory.
Changes
New module:
toolkit/scout/link_extractor.py
- DataLink / LinkGroup: dataclasses with metadata (url, format, prefix, years)
- extract_data_links(): link extraction with metadata, unifies the old implementations
- group_links(): smart grouping by prefix and URL pattern
- extract_candidate_links(): backward-compatibility bridge
toolkit/scout/probe.py
- PortalProfile: portal profile dataclass
- probe_html_portal(): lightweight probe (robots.txt → sitemap → pages, RSS, JSON:API)
- fetch_sitemap_pages(): public, extracts page URLs from sitemap XML
toolkit/scout/http.py
- Duplicate _AnchorParser removed
- extract_candidate_links() redirects to link_extractor
toolkit/mcp/scout_ops.py
- mcp_html_extract_links enriched: returns data_links[] and groups[]
- Existing fields (links, total, formats) kept for backward compatibility
Downstream impact
- feat/toolkit-link-extractor (corresponding): imports from the toolkit instead of having its own logic, duplication removed
- _fetch_sitemap, _extract_data_links, _extract_prefix, _extract_years
- opencivitas moves from page_url_template to discovery: auto
Metrics