daily-file-diet: replace expensive find | wc -l with git ls-tree #256

@lkraav

IMHO :copilot: is correct to flag as follows ⬇️

find ... -exec wc -l {} \; runs one wc process per file and scans every file type (-name '*'). On larger repos this can be very slow and risks hitting workflow timeouts. Prefer restricting to source extensions and using a single wc invocation (e.g., -exec wc -l {} + or -print0 | xargs -0 wc -l) while keeping existing path exclusions.

    - "find . -type f \\( -name '*.go' -o -name '*.py' -o -name '*.ts' -o -name '*.js' -o -name '*.rb' -o -name '*.java' -o -name '*.rs' -o -name '*.cs' -o -name '*.cpp' -o -name '*.c' \\) -not -path '*/.git/*' -not -path '*/node_modules/*' -not -path '*/vendor/*' -not -path '*/dist/*' -not -path '*/build/*' -not -path '*/.next/*' -not -path '*/target/*' -not -path '*/__pycache__/*' -not -path '*/coverage/*' -not -path '*/venv/*' -not -path '*/.tox/*' -not -path '*/.mypy_cache/*' -print0 | xargs -0 wc -l 2>/dev/null"
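The other batching form mentioned above, -exec wc -l {} +, can be sketched as follows. This is a minimal illustration, not the workflow's actual command: the function name count_lines_batched, the two example extensions, and the shortened exclusion list are assumptions chosen to keep it brief.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count lines in source files under a directory,
# letting find pass many paths to each wc call instead of one per file.
count_lines_batched() {
  local dir="${1:-.}"
  # '{} +' appends as many paths as fit on one command line, so wc is
  # started once per batch rather than once per file ('{} \;').
  find "$dir" -type f \( -name '*.js' -o -name '*.ts' \) \
    -not -path '*/.git/*' -not -path '*/node_modules/*' \
    -exec wc -l {} + 2>/dev/null
}
```

Calling count_lines_batched . prints one count per matching file, plus a total line whenever a single wc call receives more than one path.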

What about doing something like git ls-tree -r -t -l --full-name HEAD | grep -E '\.c?jsx?$' | sort -rn -k 4 | head -n 10 for a near-instant calculation even on larger repos? Note that the fourth column is the blob size in bytes, so this ranks files by size as a proxy for line count.

Example output
± git ls-tree -r -t -l --full-name HEAD | grep -E '\.c?jsx?$' | sort -rn -k 4 | head -n 10
100644 blob 65d31932231ed13af4fc89e6d6a427f1355a5159   61617    apps/api/src/resources/payment/payment.service.js
100644 blob dffccf9f746756dbd0126cb6320a94ea660c87b6   51076    apps/testers-portal-api/src/resources/test/get/validation-helpers.js
100644 blob eb680232f5ba857d69d3edbdf525aaac80ed7719   50382    apps/web/src/client/pages/test-results/survey/survey.jsx
100644 blob b99b9e9e942cf1d3691e57bbcb4b22e4385e79c1   48097    apps/admin/src/client/components/tests-table/tests-table.jsx
100644 blob 95ed5702d9652b6e5b2a0a3b168d8af9c72f6e97   47869    apps/api/src/resources/test/test.service.js
100644 blob 8c750d70131c032c6f54f6eb3a1d8a15c4891374   47060    apps/web/src/client/pages/test/steps/payment/payment.jsx
100644 blob 4a445db81b6c76916039e16945acf9338e97143b   43841    packages/lib/client/components/logo/logo-type.jsx
100644 blob 74fca97dcab4310b3bf9681792d14451a2af35b7   42685    apps/api/src/helpers/create-survey-results-csv.helper.spec.js
100644 blob 09a19ca908cbcaa442e31adc7c191814b44ce4c5   41439    apps/web/src/client/components/common-payment/common-payment.jsx
100644 blob 867a95cf3fb3423e90f4b6535d0a21145c46e43f   41114    apps/web/src/client/pages/test-results/standard-test/standard-test.jsx

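The exact grep pattern still needs to be worked out, but because git ls-tree only lists tracked files, this approach also avoids manually ignorelisting a ton of potentially unrelated paths (node_modules, dist, and other typically untracked directories) in a general-purpose solution such as the one we're shipping here.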
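One possible way to pin down the filter is to match the extension against the path field only, so that nothing in the mode, type, or hash columns can match by accident. The sketch below assumes a hypothetical function name (largest_tracked_files) and a default pattern that mirrors the JS/JSX example above; neither is part of the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<size-in-bytes><TAB><path>" for the N largest
# tracked files at HEAD whose path matches an extended regex.
largest_tracked_files() {
  local pattern="${1:-}"
  local count="${2:-10}"
  # Default pattern is an assumption: .js, .jsx, .cjs, .cjsx
  # ([.] avoids backslash handling in awk -v assignments).
  [ -n "$pattern" ] || pattern='[.]c?jsx?$'
  # git ls-tree -l prints: <mode> <type> <object> <size><TAB><path>
  git ls-tree -r -l --full-name HEAD \
    | awk -F'\t' -v re="$pattern" '
        $2 ~ re { split($1, meta, " "); print meta[4] "\t" $2 }' \
    | sort -rn -k1,1 \
    | head -n "$count"
}
```

For example, largest_tracked_files '[.](go|py|rs)$' 20 would list the 20 largest tracked Go, Python, and Rust files. Sizes are blob sizes in bytes, not line counts.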

Source: https://stackoverflow.com/questions/9456550/how-can-i-find-the-n-largest-files-in-a-git-repository
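If true line counts are still wanted rather than byte sizes, a middle ground could be to take the file list from git and hand it to wc in batches. This swaps in git ls-files, which is not what the command above uses; the function name and the pathspec list are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: top N tracked files by line count.
# Assumes at least one tracked file matches the pathspecs.
largest_tracked_by_lines() {
  local count="${1:-10}"
  # git ls-files reads the index, so untracked directories such as
  # node_modules never appear. -z / -0 keep unusual filenames intact.
  git ls-files -z -- '*.js' '*.jsx' '*.cjs' '*.cjsx' \
    | xargs -0 wc -l 2>/dev/null \
    | grep -v ' total$' \
    | sort -rn \
    | head -n "$count"
}
```

This is slower than the ls-tree variant because every matching file is read, but it needs no exclusion list and starts wc once per batch rather than once per file.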
