Commit dc7c758

Build production-grade meta-ads-collector v1.0.0
Add 6 major feature modules built by specialized agents:
- Core hardening: browser fingerprint randomization, dynamic doc_id extraction, token verification, session refresh, proxy pool
- Search & collection: filters, dedup, page search, URL parsing, page-level collection
- Media enrichment: image/video downloads, ad detail endpoint
- Async support: httpx-based async client/collector, event emitter, webhooks
- Testing & quality: structured logging, collection reporting, comprehensive edge case tests
- Docs & release: README, CHANGELOG, 11 doc guides, Makefile, CI/CD

Fix 18 audit findings from live API testing:
- Critical: parser mismatch with actual API format, response flattening, SpendRange/ImpressionRange zero-bound bug
- Should fix: datetime.utcnow deprecation, sort relevancy, export method params, async session refresh, proxy rotation, webhook types, enrich_ad deepcopy, doc_id warnings, changelog date
- Nice to have: search constants, filter raw_data check, logging handler safety, detail parser error handling

767 tests passing, ruff clean, mypy clean, 0 deprecation warnings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 39f7c15 commit dc7c758

73 files changed

Lines changed: 18351 additions & 317 deletions

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+---
+name: Bug Report
+about: Report a bug in meta-ads-collector
+title: "[Bug] "
+labels: bug
+assignees: ""
+---
+
+## Describe the bug
+
+A clear and concise description of what the bug is.
+
+## To reproduce
+
+Steps to reproduce the behavior:
+
+1. Install version `...`
+2. Run this code / command:
+
+```python
+# Your code here
+```
+
+or
+
+```bash
+# Your CLI command here
+```
+
+3. See error
+
+## Expected behavior
+
+What you expected to happen.
+
+## Actual behavior
+
+What actually happened. Include the full error message or traceback if applicable.
+
+## Environment
+
+- OS: [e.g., Windows 11, Ubuntu 22.04, macOS 14]
+- Python version: [e.g., 3.12.1]
+- meta-ads-collector version: [e.g., 1.0.0]
+- Using proxy: [yes/no]
+- Using async: [yes/no]
+
+## Additional context
+
+Add any other context about the problem here (log output, screenshots, etc.).
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+---
+name: Feature Request
+about: Suggest a new feature for meta-ads-collector
+title: "[Feature] "
+labels: enhancement
+assignees: ""
+---
+
+## Problem
+
+A clear and concise description of the problem this feature would solve.
+
+Example: "I want to be able to ..."
+
+## Proposed solution
+
+Describe the solution you'd like to see.
+
+## Alternatives considered
+
+Describe any alternative solutions or workarounds you've considered.
+
+## Additional context
+
+Any other context, examples, or references that would help understand the request.

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+## Summary
+
+Brief description of what this PR does.
+
+## Changes
+
+- Change 1
+- Change 2
+
+## Type of change
+
+- [ ] Bug fix (non-breaking change that fixes an issue)
+- [ ] New feature (non-breaking change that adds functionality)
+- [ ] Breaking change (fix or feature that would cause existing functionality to change)
+- [ ] Documentation update
+- [ ] Refactoring (no functional changes)
+
+## Checklist
+
+- [ ] My code follows the project's style guidelines (`ruff check .` passes)
+- [ ] I have added tests that cover my changes
+- [ ] All new and existing tests pass (`pytest` passes)
+- [ ] Type checking passes (`mypy meta_ads_collector/ --ignore-missing-imports`)
+- [ ] I have updated the documentation if needed
+- [ ] I have updated `CHANGELOG.md` if this is a user-facing change
+
+## Test plan
+
+How was this tested?

.github/workflows/ci.yml

Lines changed: 11 additions & 3 deletions
@@ -10,6 +10,7 @@ jobs:
   test:
     runs-on: ubuntu-latest
     strategy:
+      fail-fast: false
       matrix:
         python-version: ["3.9", "3.10", "3.11", "3.12", "3.13"]
 
@@ -24,13 +25,20 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
-          pip install -e ".[dev]"
+          pip install -e ".[dev,async]"
 
       - name: Lint with ruff
         run: ruff check .
 
       - name: Type check with mypy
         run: mypy meta_ads_collector/ --ignore-missing-imports
 
-      - name: Run tests
-        run: pytest --cov=meta_ads_collector --cov-report=term-missing
+      - name: Run tests with coverage
+        run: pytest --cov=meta_ads_collector --cov-report=term-missing --cov-report=xml
+
+      - name: Upload coverage
+        if: matrix.python-version == '3.12'
+        uses: actions/upload-artifact@v4
+        with:
+          name: coverage-report
+          path: coverage.xml

.github/workflows/publish.yml

Lines changed: 5 additions & 0 deletions
@@ -24,5 +24,10 @@ jobs:
       - name: Build package
         run: python -m build
 
+      - name: Verify package
+        run: |
+          pip install dist/*.whl
+          python -c "import meta_ads_collector; print(meta_ads_collector.__version__)"
+
       - name: Publish to PyPI
         uses: pypa/gh-action-pypi-publish@release/v1

.gitignore

Lines changed: 41 additions & 7 deletions
@@ -1,25 +1,40 @@
 # Python
 __pycache__/
 *.py[cod]
+*$py.class
+*.so
 *.egg-info/
+*.egg
 dist/
 build/
-*.egg
+sdist/
+wheels/
 
 # Virtual environments
 venv/
 .venv/
 env/
+.env/
 
-# Output data
-*.json
-*.csv
-*.jsonl
-output/
+# Testing
+.pytest_cache/
+.coverage
+htmlcov/
+coverage.xml
+*.cover
+
+# Type checking
+.mypy_cache/
+
+# Linting
+.ruff_cache/
 
 # IDE
 .vscode/
 .idea/
+*.swp
+*.swo
+*~
 
 # Claude Code
 .claude/
@@ -28,6 +43,25 @@ output/
 NUL
 Thumbs.db
 .DS_Store
+Desktop.ini
 
-# Environment
+# Environment files
 .env
+.env.local
+.env.*.local
+
+# Output data
+*.json
+!.github/**/*.json
+*.csv
+*.jsonl
+output/
+ad_media/
+
+# SQLite state files
+*.db
+*.sqlite
+*.sqlite3
+
+# Log files
+*.log

CHANGELOG.md

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [1.0.0] - 2026-02-08
+
+### Added
+
+#### Core
+- `MetaAdsCollector` high-level interface with `search()`, `collect()`, and export methods
+- `MetaAdsClient` low-level HTTP client with session management and GraphQL request handling
+- Browser fingerprint randomization across Chrome versions, platforms, viewports, and DPR values
+- Dynamic `doc_id` extraction from Ad Library page HTML with hardcoded fallbacks
+- Token extraction (LSD, CSRF, session IDs) with verification and fallback generation
+- Automatic session refresh on 403 responses with configurable max refresh attempts
+- Session staleness detection with 30-minute max age
+- Challenge/verification handling for Facebook's bot detection
+
+#### Search & Collection
+- Keyword search, exact phrase search, and page-level search modes
+- Page-level collection by URL (`collect_by_page_url`), by name (`collect_by_page_name`), and by ID (`collect_by_page_id`)
+- Typeahead page search (`search_pages`) for resolving page names to IDs
+- URL parser for extracting page IDs from Ad Library URLs, profile URLs, and numeric paths
+- Pagination with cursor-based traversal
+- Configurable page size, max results, sort order, and country
+- Ad type filtering: all, political, housing, employment, credit
+- Status filtering: active, inactive, all
+- Ad enrichment via detail/snapshot endpoint (`enrich_ad`)
+- Stream mode yielding lifecycle events alongside ads (`stream`)
+
+#### Filtering
+- `FilterConfig` dataclass with 11 filter fields
+- Impression range filters (min/max using conservative bound logic)
+- Spend range filters (min/max)
+- Date range filters (start_date, end_date)
+- Media type filter (image, video, meme, none)
+- Publisher platform filter (facebook, instagram, messenger, audience_network)
+- Language filter
+- Boolean filters: has_video, has_image
+- AND logic across all filters with missing-data-inclusive policy
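The filtering entries above describe AND logic with a missing-data-inclusive policy. The sketch below illustrates one way that could work; it is not code from this commit. `Range`, `SketchFilter`, and `passes` are hypothetical names, and the reading of "conservative bound logic" (reject an ad only when its whole range falls outside the requested window) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Range:
    """Hypothetical stand-in for an impression or spend range."""
    lower: Optional[int] = None
    upper: Optional[int] = None


@dataclass
class SketchFilter:
    """Hypothetical filter carrying only three illustrative fields."""
    min_impressions: Optional[int] = None
    max_impressions: Optional[int] = None
    media_type: Optional[str] = None


def passes(filt, impressions, media_type):
    """Return True only if every applicable filter accepts the ad (AND logic)."""
    checks = []

    if filt.min_impressions is not None and impressions is not None:
        # Assumed conservative rule: reject only when even the upper bound
        # is below the minimum. `is not None` keeps 0 a valid bound.
        if impressions.upper is not None:
            checks.append(impressions.upper >= filt.min_impressions)

    if filt.max_impressions is not None and impressions is not None:
        if impressions.lower is not None:
            checks.append(impressions.lower <= filt.max_impressions)

    if filt.media_type is not None and media_type is not None:
        checks.append(media_type == filt.media_type)

    # Missing data adds no check, so it can never exclude an ad.
    return all(checks)
```

Comparing bounds with `is not None` rather than truthiness matters here: a lower bound of 0 is falsy in Python, so a truthiness check would silently treat it as missing.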
+
+#### Deduplication
+- `DeduplicationTracker` with two modes: in-memory and persistent (SQLite)
+- `has_seen()` and `mark_seen()` for ad ID tracking
+- `get_last_collection_time()` and `update_collection_time()` for incremental collection
+- Context manager protocol with automatic save on exit
+- `count()` and `clear()` utility methods
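To illustrate the two tracking modes listed above, here is a minimal sketch built on the standard-library `sqlite3` module. It is not the project's `DeduplicationTracker`: the class name `SeenTracker`, the `seen` table, and the constructor argument are all assumptions made for illustration.

```python
import sqlite3
from typing import Optional


class SeenTracker:
    """Hypothetical tracker: in-memory set by default, SQLite when db_path is set."""

    def __init__(self, db_path: Optional[str] = None):
        self._seen = set()
        self._conn = None
        if db_path is not None:
            self._conn = sqlite3.connect(db_path)
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS seen (ad_id TEXT PRIMARY KEY)"
            )
            self._conn.commit()

    def has_seen(self, ad_id: str) -> bool:
        if self._conn is None:
            return ad_id in self._seen
        row = self._conn.execute(
            "SELECT 1 FROM seen WHERE ad_id = ?", (ad_id,)
        ).fetchone()
        return row is not None

    def mark_seen(self, ad_id: str) -> None:
        if self._conn is None:
            self._seen.add(ad_id)
        else:
            # INSERT OR IGNORE keeps repeated marks idempotent.
            self._conn.execute(
                "INSERT OR IGNORE INTO seen (ad_id) VALUES (?)", (ad_id,)
            )

    def count(self) -> int:
        if self._conn is None:
            return len(self._seen)
        return self._conn.execute("SELECT COUNT(*) FROM seen").fetchone()[0]

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Persist pending rows and release the database on exit.
        if self._conn is not None:
            self._conn.commit()
            self._conn.close()
        return False
```

Used as `with SeenTracker("state.db") as tracker:`, the commit in `__exit__` plays the role of the "automatic save on exit" behavior; the file name is only an example.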
+
+#### Media Downloads
+- `MediaDownloader` for downloading images, videos, and thumbnails from ad creatives
+- `MediaDownloadResult` frozen dataclass with success/failure details
+- File extension detection from URL path and Content-Type headers
+- Retry with exponential backoff on download failures
+- Skip-existing-file optimization
+- `collect_with_media()` convenience method on the collector
+- `download_ad_media()` for single-ad media downloads
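Retry with exponential backoff, mentioned above for download failures, can be sketched generically as follows. The function name `retry_with_backoff`, its parameters, and the doubling schedule are assumptions; the commit's actual retry parameters are not shown on this page.

```python
import time


def retry_with_backoff(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Waits grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
```

Passing `sleep` as a parameter is a design choice for testability: a test can record the requested delays instead of actually waiting.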
+
+#### Events & Webhooks
+- `EventEmitter` with synchronous callback dispatch and exception isolation
+- 7 lifecycle event types: collection_started, ad_collected, page_fetched, error_occurred, rate_limited, session_refreshed, collection_finished
+- `Event` dataclass with event_type, data payload, and UTC timestamp
+- Convenience callback registration via `callbacks` parameter on collector init
+- `WebhookSender` for POSTing ad data to external HTTP endpoints
+- Retry with exponential backoff on webhook failures
+- Optional batch mode for webhook sends
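"Synchronous callback dispatch and exception isolation" means a failing listener must not stop the remaining listeners or the collection itself. A minimal sketch of that idea follows; `SketchEvent`, `SketchEmitter`, and the `on`/`emit` method names are hypothetical and may differ from the project's `EventEmitter` API.

```python
import logging
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone

logger = logging.getLogger(__name__)


@dataclass
class SketchEvent:
    """Hypothetical event: type, payload, and a timezone-aware UTC timestamp."""
    event_type: str
    data: dict = field(default_factory=dict)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class SketchEmitter:
    def __init__(self):
        self._listeners = defaultdict(list)

    def on(self, event_type, callback):
        """Register a callback for one event type."""
        self._listeners[event_type].append(callback)

    def emit(self, event_type, data=None):
        """Call every listener in registration order, isolating failures."""
        event = SketchEvent(event_type, data or {})
        for callback in list(self._listeners[event_type]):
            try:
                callback(event)
            except Exception:
                # Isolation: log and continue so one listener cannot break others.
                logger.exception("listener failed for %s", event_type)
        return event
```

The timestamp uses `datetime.now(timezone.utc)` rather than the deprecated `datetime.utcnow()`, which returns a naive datetime.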
+
+#### Async Support
+- `AsyncMetaAdsClient` using httpx for non-blocking HTTP
+- `AsyncMetaAdsCollector` mirroring the sync API with `async for` generators
+- Async `search()`, `collect()`, `collect_to_json()`, `collect_to_csv()`, `search_pages()`
+- Optional dependency: `pip install meta-ads-collector[async]`
+
+#### Proxy Support
+- Single proxy configuration (host:port or host:port:user:pass)
+- `ProxyPool` with round-robin selection across multiple proxies
+- Per-proxy failure tracking with configurable max failures threshold
+- Dead proxy cooldown with automatic revival
+- `ProxyPool.from_file()` for loading proxies from text files
+- Proxy URL format detection (plain, URL, SOCKS5)
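Round-robin selection, per-proxy failure tracking, and cooldown-based revival fit together roughly as in the sketch below. This is an illustration under stated assumptions, not the project's `ProxyPool`: the class name `SketchProxyPool`, the method names, and the default threshold and cooldown values are all hypothetical.

```python
import itertools
import time


class SketchProxyPool:
    """Hypothetical pool: round-robin with failure counts and a cooldown."""

    def __init__(self, proxies, max_failures=3, cooldown=300.0, clock=time.monotonic):
        self._proxies = list(proxies)
        self._max_failures = max_failures
        self._cooldown = cooldown
        self._clock = clock
        self._failures = {p: 0 for p in self._proxies}
        self._dead_until = {}
        self._cycle = itertools.cycle(self._proxies)

    def get(self):
        """Return the next live proxy in round-robin order, or None if all are dead."""
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            revive_at = self._dead_until.get(proxy)
            if revive_at is None:
                return proxy
            if self._clock() >= revive_at:
                # Cooldown elapsed: revive the proxy with a clean record.
                del self._dead_until[proxy]
                self._failures[proxy] = 0
                return proxy
        return None

    def report_failure(self, proxy):
        """Count a failure and mark the proxy dead once it hits the threshold."""
        self._failures[proxy] += 1
        if self._failures[proxy] >= self._max_failures:
            self._dead_until[proxy] = self._clock() + self._cooldown

    def report_success(self, proxy):
        """Reset the failure count after a successful request."""
        self._failures[proxy] = 0
```

The injectable `clock` lets a test advance time without sleeping through the cooldown.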
+
+#### Export
+- JSON export with metadata envelope (query, country, stats, timestamps)
+- CSV export with 25-column flattened schema
+- JSONL export (one JSON object per line)
+- Export methods: `collect_to_json()`, `collect_to_csv()`, `collect_to_jsonl()`
+
+#### Logging & Reporting
+- `setup_logging()` with text or JSON format selection
+- `JSONFormatter` producing single-line JSON log records
+- Optional file handler with automatic directory creation
+- `CollectionReport` dataclass with throughput metrics
+- `format_report()` for human-readable summary text
+- `format_report_json()` for machine-readable JSON output
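A formatter that produces single-line JSON log records can be built by subclassing the standard-library `logging.Formatter`. The sketch below shows the general technique; the class name `SketchJSONFormatter` and the chosen field names are assumptions and need not match the project's `JSONFormatter`.

```python
import json
import logging
from datetime import datetime, timezone


class SketchJSONFormatter(logging.Formatter):
    """Render each log record as one line of JSON."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        # json.dumps escapes newlines, so the record stays on a single line.
        return json.dumps(payload)
```

Attach it with `handler.setFormatter(SketchJSONFormatter())` on any `logging.Handler`.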
+
+#### Data Models
+- `Ad` dataclass with 30+ fields covering all Ad Library data
+- `AdCreative` with body, title, description, link URL, image/video URLs, CTA
+- `PageInfo` with ID, name, profile picture, URL, likes, verification status
+- `PageSearchResult` for typeahead search results
+- `ImpressionRange` and `SpendRange` with lower/upper bounds
+- `AudienceDistribution` for demographic and regional data
+- `SearchResult` for paginated result sets
+- `Ad.from_graphql_response()` parser handling multiple response formats
+
+#### CLI
+- Full CLI with 35+ flags via argparse
+- All search parameters, filtering, proxy, dedup, media, enrichment, webhook, logging, and reporting flags
+- `python -m meta_ads_collector` entry point
+- `meta-ads-collector` console script
+- Page search mode (`--search-pages`)
+- Page collection modes (`--page-url`, `--page-name`)
+
+#### Exceptions
+- `MetaAdsError` base exception
+- `AuthenticationError` for session/token failures
+- `RateLimitError` with retry_after attribute
+- `SessionExpiredError` for unrecoverable session failures
+- `ProxyError` for proxy configuration issues
+- `InvalidParameterError` with param name, value, and allowed values
+
+#### Infrastructure
+- PEP 561 `py.typed` marker for type checking support
+- CI pipeline with Python 3.9-3.13 matrix testing
+- Automated PyPI publishing on GitHub release
+- 642 tests covering all modules
+
+[1.0.0]: https://github.com/Yossef/meta-ads-collector/releases/tag/v1.0.0

MANIFEST.in

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+include LICENSE
+include README.md
+include CHANGELOG.md
+include pyproject.toml
+include meta_ads_collector/py.typed
+recursive-include docs *.md
