This project began with a simple football question:
Do Tottenham Hotspur concede more “Great Goals” than other Premier League teams?
The idea was to use Reddit highlight posts labeled “Great Goal” as a proxy for spectacular goals and compare how often teams appear as the conceding side versus the scoring side. If reliable, this could offer an interesting, fan-driven perspective on defensive vulnerability and attacking flair.
Over the course of building the data pipeline, however, it became clear that answering this question responsibly required first answering a more fundamental one:
Is Reddit search data a reliable source for reconstructing historical Premier League goal events?
This repository documents that investigation.
The data source under investigation:
- Platform: Reddit
- Subreddit: r/soccer
- Signal: Posts using the “Great Goal” flair
- Access method: Unauthenticated Reddit JSON search endpoint
The dataset was never intended to be scraped indiscriminately; instead, the goal was to construct a reproducible pipeline that could retrieve, parse, and filter posts under real-world API constraints.
The core ingestion logic lives in src/full_season_scraper.py and demonstrates:
- Paginated querying using Reddit’s search.json endpoint
- Rate limiting and defensive request handling
- Filtering posts by flair and subreddit
- Extracting structured information from unstructured titles:
  - Teams
  - Scores (including bracketed score conventions)
  - Goal scorer
  - Minute scored (including added time)
- Filtering to Premier League vs Premier League fixtures using a curated alias map (pl_teams.py)
This pipeline reflects realistic data engineering tradeoffs when working with third-party APIs that are not designed for historical analysis.
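The pagination-plus-rate-limiting pattern can be sketched roughly as follows. Function and parameter names are illustrative, not the actual API of src/full_season_scraper.py; the fetch callable is injected so the cursor logic can be exercised without network access.

```python
import time
from typing import Callable, Iterator, Optional

def paginate_search(fetch: Callable[[Optional[str]], dict],
                    max_pages: int = 10,
                    delay: float = 2.0) -> Iterator[dict]:
    """Yield post dicts page by page, following Reddit's `after` cursor.

    `fetch(after)` returns one parsed JSON listing page.
    """
    after: Optional[str] = None
    for _ in range(max_pages):
        data = fetch(after).get("data", {})
        children = data.get("children", [])
        if not children:
            break  # empty page: the (undocumented) search depth limit was hit
        for child in children:
            yield child["data"]
        after = data.get("after")
        if after is None:
            break  # no further cursor
        time.sleep(delay)  # defensive rate limiting between requests
```

In real use, fetch would GET https://www.reddit.com/r/soccer/search.json with query parameters along the lines of a flair-restricted search plus the after cursor.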
Reddit post titles often follow patterns such as:
- Chelsea [1] - 0 Liverpool – Moisés Caicedo 14’
- Brighton 1 - [1] Newcastle United – Nick Woltemade 76’
The project implemented regex-based parsing to extract:
- Home team
- Away team
- Score state immediately after the goal
- Goal scorer and minute
When present, bracketed scores were treated as the strongest indicator of which team scored the highlighted goal.
Despite this, many titles remain ambiguous due to:
- Inconsistent formatting
- Missing brackets
- Equalizers without explicit indicators
- Editorial text mixed into titles
Ambiguous cases were intentionally left unresolved rather than force-labeled.
Using the parsed data, the initial plan was to:
- Identify the scoring team and conceding team for each “Great Goal”
- Aggregate results by club
- Compare Tottenham Hotspur’s conceded “Great Goals” against other EPL teams
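The planned aggregation step can be sketched as below. Field names ('home', 'away', 'scored_by') are illustrative assumptions; records whose scorer could not be classified are skipped rather than force-labeled, matching the parsing policy above.

```python
from collections import Counter

def aggregate_great_goals(records: list[dict]) -> tuple[Counter, Counter]:
    """Tally scored and conceded "Great Goals" per club."""
    scored, conceded = Counter(), Counter()
    for r in records:
        team = r.get("scored_by")
        if team is None:
            continue  # ambiguous post: excluded from the tallies
        other = r["away"] if team == r["home"] else r["home"]
        scored[team] += 1
        conceded[other] += 1
    return scored, conceded
```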
At a glance, early aggregates appeared promising.
However, deeper validation revealed fundamental issues.
Through multiple iterations of scraping, parsing, and aggregation, the following constraints became unavoidable:
- Limited historical depth: Reddit search endpoints impose undocumented limits on how far back results can be retrieved, even with pagination. This makes it impossible to guarantee season-level completeness.
- Duplicate posts: the same goal frequently appears multiple times with slightly different titles, timestamps, or contexts. Without external ground truth, deduplication cannot be done reliably.
- Classification ambiguity: even with bracket logic, a non-trivial share of posts cannot be confidently classified as “scored” vs “conceded” without match-event data.
- Selection bias: only goals perceived as “great” by users, and flaired as such, appear in the dataset. This introduces subjective and engagement-driven bias that varies by team and era.
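The duplicate-post problem resists even reasonable-looking fixes. A naive fingerprint dedup, sketched below, collapses exact and near-exact repeats, but any repost with different wording produces a new fingerprint, which is why external ground truth is needed.

```python
import re

def fingerprint(title: str) -> str:
    """Normalize a title to a crude fingerprint: lowercase, word chars only."""
    return re.sub(r"\W+", " ", title.lower()).strip()

def dedupe(titles: list[str]) -> list[str]:
    """Keep the first title seen per fingerprint; later repeats are dropped."""
    seen: set[str] = set()
    kept: list[str] = []
    for t in titles:
        fp = fingerprint(t)
        if fp not in seen:
            seen.add(fp)
            kept.append(t)
    return kept
```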
While Reddit “Great Goal” posts are a rich qualitative signal for fan engagement and highlight culture, they cannot be used as a reliable standalone dataset for reconstructing historical Premier League goal events.
As a result, the original Tottenham comparison question cannot be answered responsibly using this data alone.
This conclusion is not a failure of implementation, but a necessary outcome of rigorous data evaluation.
Even so, the project demonstrates:
- Designing ingestion pipelines under real API constraints
- Extracting structured data from noisy, user-generated text
- Applying domain knowledge to filtering and validation
- Recognizing and articulating when a dataset is not fit for purpose
- Making principled decisions to avoid overclaiming insights
If continuing this line of inquiry, a more reliable approach would be to:
- Use official match-event data (e.g., Opta, FBref, StatsBomb) as ground truth
- Treat Reddit highlights as a secondary engagement signal
- Join social perception with objective event data rather than replacing it
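That join might be sketched as follows. Field names and the join key are assumptions for illustration; real data would need fuzzier matching (name normalization, minute tolerance).

```python
# Join official goal events (ground truth) with Reddit "Great Goal" posts
# (secondary engagement signal). All field names here are illustrative.
def join_highlights(events: list[dict], posts: list[dict]) -> list[dict]:
    flagged = {(p["match_date"], p["scorer"], p["minute"]) for p in posts}
    return [
        {**e, "reddit_great_goal": (e["match_date"], e["scorer"], e["minute"]) in flagged}
        for e in events
    ]
```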
To avoid implying completeness or accuracy that cannot be guaranteed:
- Full scraped datasets are not committed
- Only small, representative samples are included
- The emphasis is on process, evaluation, and judgment, not final metrics