This project began with a simple football question:
Do Tottenham Hotspur concede more “Great Goals” than other Premier League teams?
The idea was to use Reddit highlight posts labeled “Great Goal” as a proxy for spectacular goals and compare how often teams appear as the conceding side versus the scoring side. If reliable, this could offer an interesting, fan-driven perspective on defensive vulnerability and attacking flair.
Over the course of building the data pipeline, however, it became clear that answering this question responsibly required first answering a more fundamental one:
Is Reddit search data a reliable source for reconstructing historical Premier League goal events?
This repository documents that investigation.
The data source under investigation:
- Platform: Reddit
- Subreddit: r/soccer
- Signal: Posts using the “Great Goal” flair
- Access method: Unauthenticated Reddit JSON search endpoint
The dataset was never intended to be scraped indiscriminately; instead, the goal was to construct a reproducible pipeline that could retrieve, parse, and filter posts under real-world API constraints.
The core ingestion logic lives in src/full_season_scraper.py and demonstrates:
- Paginated querying using Reddit’s search.json endpoint
- Rate limiting and defensive request handling
- Filtering posts by flair and subreddit
- Extracting structured information from unstructured titles:
  - Teams
  - Scores (including bracketed score conventions)
  - Goal scorer
  - Minute scored (including added time)
- Filtering to Premier League vs Premier League fixtures using a curated alias map (pl_teams.py)
This pipeline reflects realistic data engineering tradeoffs when working with third-party APIs that are not designed for historical analysis.
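The pagination-plus-rate-limiting pattern can be sketched roughly as follows. Function and parameter names are illustrative, not the actual API of src/full_season_scraper.py; the fetch callable is injected so the cursor logic can be exercised without network access.

```python
import time
from typing import Callable, Iterator, Optional

def paginate_search(fetch: Callable[[Optional[str]], dict],
                    max_pages: int = 10,
                    delay: float = 2.0) -> Iterator[dict]:
    """Yield post dicts page by page, following Reddit's `after` cursor.

    `fetch(after)` returns one parsed JSON listing page.
    """
    after: Optional[str] = None
    for _ in range(max_pages):
        data = fetch(after).get("data", {})
        children = data.get("children", [])
        if not children:
            break  # empty page: the (undocumented) search depth limit was hit
        for child in children:
            yield child["data"]
        after = data.get("after")
        if after is None:
            break  # no further cursor
        time.sleep(delay)  # defensive rate limiting between requests
```

In real use, fetch would GET https://www.reddit.com/r/soccer/search.json with query parameters along the lines of a flair-restricted search plus the after cursor.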
Reddit post titles often follow patterns such as:
- Chelsea [1] - 0 Liverpool – Moisés Caicedo 14’
- Brighton 1 - [1] Newcastle United – Nick Woltemade 76’
The project implemented regex-based parsing to extract:
- Home team
- Away team
- Score state immediately after the goal
- Goal scorer and minute
When present, bracketed scores were treated as the strongest indicator of which team scored the highlighted goal.
Despite this, many titles remain ambiguous due to:
- Inconsistent formatting
- Missing brackets
- Equalizers without explicit indicators
- Editorial text mixed into titles
Ambiguous cases were intentionally left unresolved rather than force-labeled.
Using the parsed data, the initial plan was to:
- Identify the scoring team and conceding team for each “Great Goal”
- Aggregate results by club
- Compare Tottenham Hotspur’s conceded “Great Goals” against other EPL teams
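The planned aggregation step can be sketched as below. Field names ('home', 'away', 'scored_by') are illustrative assumptions; records whose scorer could not be classified are skipped rather than force-labeled, matching the parsing policy above.

```python
from collections import Counter

def aggregate_great_goals(records: list[dict]) -> tuple[Counter, Counter]:
    """Tally scored and conceded "Great Goals" per club."""
    scored, conceded = Counter(), Counter()
    for r in records:
        team = r.get("scored_by")
        if team is None:
            continue  # ambiguous post: excluded from the tallies
        other = r["away"] if team == r["home"] else r["home"]
        scored[team] += 1
        conceded[other] += 1
    return scored, conceded
```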
At a glance, early aggregates appeared promising.
However, deeper validation revealed fundamental issues.
Through multiple iterations of scraping, parsing, and aggregation, the following constraints became unavoidable:
- Limited historical depth: Reddit search endpoints impose undocumented limits on how far back results can be retrieved, even with pagination. This makes it impossible to guarantee season-level completeness.
- Duplicate posts: the same goal frequently appears multiple times with slightly different titles, timestamps, or contexts. Without external ground truth, deduplication cannot be done reliably.
- Classification ambiguity: even with bracket logic, a non-trivial share of posts cannot be confidently classified as “scored” vs “conceded” without match-event data.
- Selection bias: only goals perceived as “great” by users, and flaired as such, appear in the dataset. This introduces subjective and engagement-driven bias that varies by team and era.
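The duplicate-post problem resists even reasonable-looking fixes. A naive fingerprint dedup, sketched below, collapses exact and near-exact repeats, but any repost with different wording produces a new fingerprint, which is why external ground truth is needed.

```python
import re

def fingerprint(title: str) -> str:
    """Normalize a title to a crude fingerprint: lowercase, word chars only."""
    return re.sub(r"\W+", " ", title.lower()).strip()

def dedupe(titles: list[str]) -> list[str]:
    """Keep the first title seen per fingerprint; later repeats are dropped."""
    seen: set[str] = set()
    kept: list[str] = []
    for t in titles:
        fp = fingerprint(t)
        if fp not in seen:
            seen.add(fp)
            kept.append(t)
    return kept
```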
While Reddit “Great Goal” posts are a rich qualitative signal for fan engagement and highlight culture, they cannot be used as a reliable standalone dataset for reconstructing historical Premier League goal events.
As a result, the original Tottenham comparison question cannot be answered responsibly using this data alone.
This conclusion is not a failure of implementation, but a necessary outcome of rigorous data evaluation.
Even so, the project demonstrates:
- Designing ingestion pipelines under real API constraints
- Extracting structured data from noisy, user-generated text
- Applying domain knowledge to filtering and validation
- Recognizing and articulating when a dataset is not fit for purpose
- Making principled decisions to avoid overclaiming insights
If continuing this line of inquiry, a more reliable approach would be to:
- Use official match-event data (e.g., Opta, FBref, StatsBomb) as ground truth
- Treat Reddit highlights as a secondary engagement signal
- Join social perception with objective event data rather than replacing it
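That join might be sketched as follows. Field names and the join key are assumptions for illustration; real data would need fuzzier matching (name normalization, minute tolerance).

```python
# Join official goal events (ground truth) with Reddit "Great Goal" posts
# (secondary engagement signal). All field names here are illustrative.
def join_highlights(events: list[dict], posts: list[dict]) -> list[dict]:
    flagged = {(p["match_date"], p["scorer"], p["minute"]) for p in posts}
    return [
        {**e, "reddit_great_goal": (e["match_date"], e["scorer"], e["minute"]) in flagged}
        for e in events
    ]
```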
To avoid implying completeness or accuracy that cannot be guaranteed:
- Full scraped datasets are not committed
- Only small, representative samples are included
- The emphasis is on process, evaluation, and judgment, not final metrics