Question: File Counts and Dataset Size #44

@darien-schettler

I recently downloaded The Stack (the-stack-dedup) from Hugging Face via Git LFS. I have two questions:

  1. The size on disk of the dedup dataset is only around 900 GB, much smaller than the 1.5 TB indicated on the data card (https://huggingface.co/datasets/bigcode/admin/resolve/main/the-stack-infographic-v11.png). What accounts for the difference?

  2. Is there somewhere where the file counts are listed in full for each dataset (dedup and full), broken down by language?

Essentially, I want to make sure that I have downloaded the entire dataset, so I need either to understand the size difference or to know how many files there should be for each language so I can validate my download. Ideally both.

Thanks in advance!
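
For reference, the local side of the validation described above could be sketched roughly as follows. This is a minimal sketch, not an official tool: it assumes the clone keeps its files under a `data/<language>/` directory (an assumption about the repository layout), and the names `summarize_clone` and `is_lfs_pointer` are hypothetical. It counts files and bytes per language and also flags files that are still Git LFS pointer stubs, since unfetched pointers are one possible way for a clone to end up smaller than expected.

```python
from collections import defaultdict
from pathlib import Path

# Git LFS pointer files are small text stubs that start with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file looks like an unfetched Git LFS pointer stub."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def summarize_clone(root, data_dir="data"):
    """Count files, bytes, and LFS pointer stubs per language directory.

    Assumes a layout of <root>/<data_dir>/<language>/... (an assumption,
    adjust data_dir if the local clone is organized differently).
    """
    summary = defaultdict(lambda: {"files": 0, "bytes": 0, "pointers": 0})
    base = Path(root) / data_dir
    for path in sorted(base.rglob("*")):
        if not path.is_file():
            continue
        parts = path.relative_to(base).parts
        # Files sitting directly in data_dir have no language directory.
        lang = parts[0] if len(parts) > 1 else "(no language dir)"
        entry = summary[lang]
        entry["files"] += 1
        entry["bytes"] += path.stat().st_size
        if is_lfs_pointer(path):
            entry["pointers"] += 1
    return dict(summary)


def print_summary(summary):
    """Print one line per language, followed by a total size in GB (10^9 bytes)."""
    total_bytes = 0
    for lang in sorted(summary):
        entry = summary[lang]
        total_bytes += entry["bytes"]
        print(f"{lang}: {entry['files']} files, {entry['bytes']} bytes, "
              f"{entry['pointers']} LFS pointers")
    print(f"total: {total_bytes / 1e9:.1f} GB")
```

Calling `print_summary(summarize_clone("the-stack-dedup"))`, where the path is a placeholder for the local clone, would give per-language counts to compare against whatever official figures are available. A non-zero pointer count would suggest that some files were never fetched by Git LFS.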
