mtchynkstff/reddit-epl-great-goals-feasibility


Evaluating Reddit “Great Goal” Posts as a Data Source for Premier League Analysis

Project Overview

This project began with a simple football question:

Do Tottenham Hotspur concede more “Great Goals” than other Premier League teams?

The idea was to use Reddit highlight posts labeled “Great Goal” as a proxy for spectacular goals and compare how often teams appear as the conceding side versus the scoring side. If reliable, this could offer an interesting, fan-driven perspective on defensive vulnerability and attacking flair.

Over the course of building the data pipeline, however, it became clear that answering this question responsibly required first answering a more fundamental one:

Is Reddit search data a reliable source for reconstructing historical Premier League goal events?

This repository documents that investigation.


Data Source

  • Platform: Reddit
  • Subreddit: r/soccer
  • Signal: Posts using the “Great Goal” flair
  • Access method: Unauthenticated Reddit JSON search endpoint

The aim was never to scrape indiscriminately. It was to build a reproducible pipeline that retrieves, parses, and filters posts under real-world API constraints.


Pipeline Design

The core ingestion logic lives in src/full_season_scraper.py and demonstrates:

  • Paginated querying using Reddit’s search.json endpoint
  • Rate limiting and defensive request handling
  • Filtering posts by flair and subreddit
  • Extracting structured information from unstructured titles:
    • Teams
    • Scores (including bracketed score conventions)
    • Goal scorer
    • Minute scored (including added time)
  • Filtering to Premier League vs Premier League fixtures using a curated alias map (pl_teams.py)

This pipeline reflects realistic data engineering tradeoffs when working with third-party APIs that are not designed for historical analysis.
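The pagination and rate-limiting steps can be sketched as follows. This is a minimal illustration, not the code in src/full_season_scraper.py: the function names, the User-Agent string, the delay, and the `flair:"Great Goal"` query syntax are all assumptions. The `after` cursor and the `data.children` layout follow Reddit's standard listing format.

```python
import json
import time
import urllib.parse
import urllib.request

SEARCH_URL = "https://www.reddit.com/r/soccer/search.json"


def build_search_url(after=None, limit=100):
    """Build the URL for one page; `after` is the cursor from the previous page."""
    params = {
        "q": 'flair:"Great Goal"',  # assumed query syntax for the flair filter
        "restrict_sr": "1",         # keep results inside r/soccer
        "sort": "new",
        "limit": str(limit),
    }
    if after:
        params["after"] = after
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


def fetch_page(after=None):
    """Fetch one listing page over HTTP and decode the JSON body."""
    request = urllib.request.Request(
        build_search_url(after),
        headers={"User-Agent": "great-goals-feasibility-sketch/0.1"},  # hypothetical
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_posts(fetch=fetch_page, max_pages=10, delay_seconds=2.0):
    """Yield post dicts page by page, pausing between requests."""
    after = None
    for page_number in range(max_pages):
        if page_number:
            time.sleep(delay_seconds)  # crude rate limit between pages
        listing = fetch(after).get("data", {})
        children = listing.get("children", [])
        for child in children:
            yield child["data"]
        after = listing.get("after")
        if not after or not children:
            break  # no further cursor: the endpoint has nothing more to give
```

Passing `fetch` as a parameter keeps the loop testable without network access. The loop stops when the endpoint stops returning a cursor, which says nothing about whether older posts exist.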


Parsing Strategy

Reddit post titles often follow patterns such as:

  • Chelsea [1] - 0 Liverpool – Moisés Caicedo 14’
  • Brighton 1 - [1] Newcastle United – Nick Woltemade 76’

The project implemented regex-based parsing to extract:

  • Home team
  • Away team
  • Score state immediately after the goal
  • Goal scorer and minute

When present, bracketed scores were treated as the strongest indicator of which team scored the highlighted goal.

Despite this, many titles remain ambiguous due to:

  • Inconsistent formatting
  • Missing brackets
  • Equalizers without explicit indicators
  • Editorial text mixed into titles

Ambiguous cases were intentionally left unresolved rather than force-labeled.
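A minimal sketch of this parsing approach, assuming titles shaped like the two examples above. The regex, the function name, and the output keys are illustrative and are not taken from the repository. Titles that do not match return None, and titles with no bracket, or with both scores bracketed, keep `scoring_side` as None instead of guessing.

```python
import re

# Home team, score pair (either side may be bracketed), away team,
# a dash separator, scorer, minute with optional added time, minute mark.
TITLE_RE = re.compile(
    r"^\s*(?P<home>.+?)\s+(?P<home_score>\[?\d+\]?)\s*[-–]\s*"
    r"(?P<away_score>\[?\d+\]?)\s+(?P<away>.+?)\s+[-–]\s+"
    r"(?P<scorer>.+?)\s+(?P<minute>\d+)(?:\+(?P<added>\d+))?\s*['’′]"
)


def parse_title(title):
    """Return a dict of parsed fields, or None if the title does not fit."""
    match = TITLE_RE.match(title)
    if match is None:
        return None

    home_raw = match["home_score"]
    away_raw = match["away_score"]
    home_bracketed = home_raw.startswith("[")
    away_bracketed = away_raw.startswith("[")

    # Exactly one bracketed score identifies the scoring side.
    if home_bracketed and not away_bracketed:
        scoring_side = "home"
    elif away_bracketed and not home_bracketed:
        scoring_side = "away"
    else:
        scoring_side = None  # ambiguous: left unresolved on purpose

    return {
        "home": match["home"],
        "away": match["away"],
        "home_score": int(home_raw.strip("[]")),
        "away_score": int(away_raw.strip("[]")),
        "scoring_side": scoring_side,
        "scorer": match["scorer"],
        "minute": int(match["minute"]),
        "added": int(match["added"]) if match["added"] else None,
    }
```

For the first example title, this yields Chelsea as the home side with `scoring_side` set to "home". For the second, `scoring_side` is "away" and the away team is Newcastle United.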


The Original Analysis Goal

Using the parsed data, the initial plan was to:

  • Identify the scoring team and conceding team for each “Great Goal”
  • Aggregate results by club
  • Compare Tottenham Hotspur’s conceded “Great Goals” against other EPL teams
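The planned aggregation can be sketched as below. The record keys (`home`, `away`, `scoring_side`) and the alias entries are hypothetical. The real alias map lives in pl_teams.py and its contents are not reproduced here.

```python
from collections import Counter

# Illustrative alias entries only; not the contents of pl_teams.py.
ALIASES = {
    "chelsea": "Chelsea",
    "liverpool": "Liverpool",
    "brighton": "Brighton",
    "newcastle": "Newcastle United",
    "newcastle united": "Newcastle United",
    "spurs": "Tottenham Hotspur",
    "tottenham": "Tottenham Hotspur",
    "tottenham hotspur": "Tottenham Hotspur",
}


def canonical_team(name):
    """Map a raw team string to its canonical name, or None if unknown."""
    return ALIASES.get(name.strip().casefold())


def tally_great_goals(records):
    """Count scored and conceded goals per club, tracking what was skipped."""
    scored, conceded, skipped = Counter(), Counter(), Counter()
    for record in records:
        home = canonical_team(record["home"])
        away = canonical_team(record["away"])
        if home is None or away is None:
            skipped["not_pl_fixture"] += 1
            continue
        side = record.get("scoring_side")
        if side == "home":
            scored[home] += 1
            conceded[away] += 1
        elif side == "away":
            scored[away] += 1
            conceded[home] += 1
        else:
            skipped["ambiguous"] += 1  # never force-labeled
    return scored, conceded, skipped
```

Returning the skipped counts alongside the tallies makes it visible how much of the input never reached the aggregate, which matters when judging whether a per-club comparison means anything.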

At a glance, early aggregates appeared promising.

However, deeper validation revealed fundamental issues.


Key Feasibility Findings

Repeated rounds of scraping, parsing, and aggregation surfaced the following constraints:

1. Incomplete Historical Coverage

Reddit search endpoints impose undocumented limits on how far back results can be retrieved, even with pagination. This makes it impossible to guarantee season-level completeness.
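One way to make this gap visible is to compare the oldest retrieved post with the start of the season of interest. The sketch below assumes each post dict carries Reddit's `created_utc` field (seconds since the Unix epoch). The function name and the season start date are supplied by the caller and are hypothetical.

```python
from datetime import datetime, timezone


def coverage_gap_days(posts, season_start):
    """Days between the season start and the oldest retrieved post.

    Returns None when there are no posts, and 0.0 when the oldest post
    predates the season start.
    """
    if not posts:
        return None
    oldest_timestamp = min(post["created_utc"] for post in posts)
    oldest = datetime.fromtimestamp(oldest_timestamp, tz=timezone.utc)
    gap_days = (oldest - season_start).total_seconds() / 86400
    return max(gap_days, 0.0)
```

A positive result shows that the start of the season is missing. A result of zero does not establish completeness, since posts can still be absent from the middle of the range.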

2. Duplicate and Reposted Content

The same goal frequently appears multiple times with slightly different titles, timestamps, or contexts. Without external ground truth, deduplication cannot be done reliably.
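A heuristic deduplication pass might key each parsed record on fixture, score state, and minute, as sketched below. The record keys are hypothetical, and this is an illustration of the limitation, not a fix for it.

```python
def dedup_key(record):
    """Build a comparison key from the parsed fields of one post."""
    return (
        record["home"].casefold(),
        record["away"].casefold(),
        record["home_score"],
        record["away_score"],
        record["minute"],
        record.get("added"),
    )


def drop_reposts(records):
    """Keep the first record for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = dedup_key(record)
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique
```

This collapses reposts whose parsed fields agree exactly. It misses reposts that report a different minute or use a different team alias, and it wrongly merges distinct goals from different matches that share the same teams, score state, and minute. Without match-event data there is no way to tell which of these cases applies.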

3. Ambiguous Goal Attribution

Even with bracket logic, a non-trivial share of posts cannot be confidently classified as “scored” vs “conceded” without match-event data.

4. Sampling Bias

Only goals that users perceive as “great” and flair as such appear in the dataset. This introduces subjective, engagement-driven bias that varies by team and era.


Conclusion

While Reddit “Great Goal” posts are a rich qualitative signal for fan engagement and highlight culture, they cannot be used as a reliable standalone dataset for reconstructing historical Premier League goal events.

As a result, the original Tottenham comparison question cannot be answered responsibly using this data alone.

This conclusion is not a failure of implementation, but a necessary outcome of rigorous data evaluation.


What This Project Demonstrates

  • Designing ingestion pipelines under real API constraints
  • Extracting structured data from noisy, user-generated text
  • Applying domain knowledge to filtering and validation
  • Recognizing and articulating when a dataset is not fit for purpose
  • Making principled decisions to avoid overclaiming insights

What I Would Do Next

If continuing this line of inquiry, a more reliable approach would be to:

  • Use established match-event data (e.g., Opta, FBref, StatsBomb) as ground truth
  • Treat Reddit highlights as a secondary engagement signal
  • Join social-perception data to objective event data rather than substituting one for the other
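A rough sketch of such a join, with event data as the base table and Reddit posts attached as a count. The schema (`home`, `away`, `minute`) is hypothetical and does not match any particular provider's format.

```python
def attach_highlights(events, highlights):
    """Left-join highlight posts onto ground-truth goal events.

    Every event is kept. Events with no matching post get a count of 0,
    so the event data stays the source of truth.
    """
    index = {}
    for post in highlights:
        key = (post["home"], post["away"], post["minute"])
        index.setdefault(key, []).append(post)

    joined = []
    for event in events:
        key = (event["home"], event["away"], event["minute"])
        joined.append({**event, "highlight_posts": len(index.get(key, []))})
    return joined
```

Because the join starts from the event table, a goal that nobody posted still appears in the output, and a post that matches no event is dropped instead of creating a phantom goal.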

Repository Notes

To avoid implying completeness or accuracy that cannot be guaranteed:

  • Full scraped datasets are not committed
  • Only small, representative samples are included
  • The emphasis is on process, evaluation, and judgment, not final metrics
