CommonCrawl index date range code is broken #26

@wumpus

cdxt --cc --from 2021 --to 2020 -v -v --limit 1 iter https://www.pbm.com/
INFO:cdx_toolkit.cli:set loglevel to DEBUG
DEBUG:cdx_toolkit.myrequests:getting https://index.commoncrawl.org/collinfo.json None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): index.commoncrawl.org:443
DEBUG:urllib3.connectionpool:https://index.commoncrawl.org:443 "GET /collinfo.json HTTP/1.1" 200 1157
INFO:cdx_toolkit.commoncrawl:Found 87 endpoints in the Common Crawl index
INFO:cdx_toolkit:making a custom cc index list
INFO:cdx_toolkit.commoncrawl:using cc index range from https://index.commoncrawl.org/CC-MAIN-2021-04-index to https://index.commoncrawl.org/CC-MAIN-2020-50-index
INFO:cdx_toolkit:get_more: fetching cdx from https://index.commoncrawl.org/CC-MAIN-2021-04-index

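The above date range should be empty: --from 2021 is later than --to 2020, yet cdxt builds a cc index range from CC-MAIN-2021-04 to CC-MAIN-2020-50 and starts fetching from it.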
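
One possible guard is to compare the two bounds before building the index list and return nothing when they are inverted. The sketch below is not cdx_toolkit's actual code: `crawl_year` and `select_indexes` are hypothetical helpers, and comparing whole years (rather than full timestamps) is a simplifying assumption made only for illustration.

```python
import re
from typing import List


def crawl_year(index_id: str) -> int:
    """Extract the year from an id or URL containing 'CC-MAIN-YYYY-WW'."""
    match = re.search(r"CC-MAIN-(\d{4})-\d{2}", index_id)
    if match is None:
        raise ValueError(f"unrecognized index id: {index_id!r}")
    return int(match.group(1))


def select_indexes(index_ids: List[str], from_year: int, to_year: int) -> List[str]:
    """Return the indexes whose crawl year lies in [from_year, to_year].

    Hypothetical helper: an inverted range (from_year > to_year) is treated
    as empty instead of being turned into a reversed index range.
    """
    if from_year > to_year:
        return []
    return [i for i in index_ids if from_year <= crawl_year(i) <= to_year]


ids = [
    "https://index.commoncrawl.org/CC-MAIN-2020-50-index",
    "https://index.commoncrawl.org/CC-MAIN-2021-04-index",
]
print(select_indexes(ids, 2021, 2020))  # → []
```

With the bounds in the normal order, for example `select_indexes(ids, 2020, 2021)`, both indexes are returned.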
