This project aims to analyze game patch notes to gain insights into how games evolve over time, how developers iterate on gameplay, and how user experience is affected by changes. The first step in this project involves selecting appropriate data sources and building a dataset from them.
We chose Steam as our only data source. Steam is a digital distribution platform developed by Valve Corporation, and it is the largest such platform for PC games.
- Massive Coverage: As of 2021, Steam hosted over 30,000 games, ranging from AAA titles to indie releases.
- High Engagement: In 2021, the platform recorded 132 million monthly active users.
- Most major games are available on Steam.
- The platform is open to indie developers (99% of games on Steam are indie games), ensuring a broad representation of genres and development styles.
- It provides a comprehensive sampling frame of the gaming ecosystem.
- Focusing on a single, dominant platform allows us to standardize the data retrieval process across a wide range of games and patch histories.
We use the official Steam Web API to query data.
- API Documentation: https://steamcommunity.com/dev
- API Terms of Use: https://steamcommunity.com/dev/apiterms
Permissions: use of the Steam Web API is permitted under its terms of use (linked above), provided we remain compliant.
| Limitation | Mitigation |
|---|---|
| API Rate Limit: 100,000 calls/day (~200 calls/5 min) | Implement an automated, incremental retrieval process to distribute requests across the day |
| Data Noise: Many app IDs are not games (DLCs, tools, etc.) | Implement filtering logic to include only actual games |
| Inconsistent Metadata: Duplicate entries, changing appIDs | Add validation and logging steps to identify inconsistencies |
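The rate-limit mitigation in the table can be approximated by spacing requests evenly. The following is a minimal sketch, not the notebook's actual code: the `Throttle` class and its parameters are hypothetical names, and the defaults simply restate the ~200 calls per 5 minutes figure from the table.

```python
import time


class Throttle:
    """Space out calls so that roughly `max_calls` happen per `period` seconds.

    Hypothetical helper for illustration; the clock and sleep functions are
    injectable so the behaviour can be checked without actually waiting.
    """

    def __init__(self, max_calls=200, period=300.0,
                 clock=time.monotonic, sleep=time.sleep):
        # Minimum delay between two consecutive calls (1.5 s by default).
        self.min_interval = period / max_calls
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

Calling `throttle.wait()` before each request keeps the pace at one call every 1.5 seconds with the defaults. At that pace at most 57,600 calls fit in a day, which stays under the 100,000 daily cap.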
We plan to build a local dataset representing the Steam game catalog, from which we can later sample games for patch note analysis.
This step builds a local dataset of Steam games by querying the appdetails API and collecting selected metadata.
- Initial App List Retrieval
  - Download the full list of app IDs from the Steam endpoint (includes games and non-game apps).
  - Save the list to applist.json.
- Filtering and Metadata Collection
  - For each app ID in the list:
    - Query the `appdetails` API individually.
    - Record every query in queries.json to prevent redundant calls.
    - Check whether the app is categorized as a game.
    - If it is a game:
      - Extract basic metadata and store it locally in the folder ./raw_metadata_dataset/.
      - Filenames follow the format: `{appid}__{name}.json`.
      - At this stage, only high-level metadata is collected (no patch notes or extended data).
  - This step is implemented in a Jupyter notebook: game_metadata_extraction.ipynb.
  - Note: Due to Steam API rate limits (100,000 calls/day, ~200 every 5 minutes), the script is designed to run incrementally over several days. It is intended to be launched once per app ID list.
- Metadata Formatting
  - After metadata extraction, format the dataset as one CSV file: games_metadata.csv
  - Each row represents a game, with the following metadata fields as columns:
    - name, steam_appid, required_age, is_free, number_dlc, developers, publishers, price_currency, price_initial, price_final, windows, mac, linux, metacritic_score, categories, genres, recommendations_total, achievements_total, release_date
  - The resulting CSV file is compatible with our internal sampling tool.
  - This step is implemented in a Jupyter notebook: appdetails_to_csv.ipynb.
- Descriptive Stats of the Selected Metadata
  - The columns of the CSV file are analyzed to gain insight into their content.
  - This is implemented in a Jupyter notebook: dataset_overview.ipynb.
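The game check, the filename format, and the CSV columns described in the steps above can be sketched as follows. This is an illustration under assumptions, not the notebooks' actual code: the function names are hypothetical, the response shape (`success`, `data`, `type`, `price_overview`, `platforms`, and so on) is assumed to match what the `appdetails` endpoint commonly returns, and both the `;` separator for multi-valued fields and the filename sanitization are choices made for this example.

```python
import re


def parse_appdetails(appid, response):
    """Return the `data` dict if the response describes a game, else None."""
    entry = response.get(str(appid), {})
    if not entry.get("success"):
        return None
    data = entry.get("data", {})
    return data if data.get("type") == "game" else None


def metadata_filename(data):
    """Build `{appid}__{name}.json`, replacing characters unsafe in filenames."""
    # Assumption: sanitization is needed; the source only gives the format.
    safe_name = re.sub(r'[\\/:*?"<>|]', "_", data["name"]).strip()
    return f"{data['steam_appid']}__{safe_name}.json"


def _descriptions(items):
    """Join the `description` fields of a list of {id, description} dicts."""
    return ";".join(item["description"] for item in (items or []))


def to_csv_row(data):
    """Flatten one game's metadata into the columns listed above."""
    price = data.get("price_overview") or {}
    platforms = data.get("platforms") or {}
    return {
        "name": data.get("name"),
        "steam_appid": data.get("steam_appid"),
        "required_age": data.get("required_age"),
        "is_free": data.get("is_free"),
        "number_dlc": len(data.get("dlc") or []),
        "developers": ";".join(data.get("developers") or []),
        "publishers": ";".join(data.get("publishers") or []),
        "price_currency": price.get("currency"),
        "price_initial": price.get("initial"),
        "price_final": price.get("final"),
        "windows": platforms.get("windows"),
        "mac": platforms.get("mac"),
        "linux": platforms.get("linux"),
        "metacritic_score": (data.get("metacritic") or {}).get("score"),
        "categories": _descriptions(data.get("categories")),
        "genres": _descriptions(data.get("genres")),
        "recommendations_total": (data.get("recommendations") or {}).get("total"),
        "achievements_total": (data.get("achievements") or {}).get("total"),
        "release_date": (data.get("release_date") or {}).get("date"),
    }
```

Missing optional blocks (for example, no price for a free game) yield `None` rather than raising, so each row keeps the same 19 columns and can be written with `csv.DictWriter`.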
The organization of our repository follows the implementation of the data extraction pipeline described above.
This repository is structured as follows:
- raw_metadata_dataset/
  - {appid}.json
- steamspy_dataset/
  - {appid}.json
- patches/
  - raw_news/
    - {appid}.json
  - filtered_patches/
    - {appid}.json
  - cleaned_patches/
    - {appid}.json
  - flagged_patches/
    - {appid}.json
  - logs/
    - flagged_app_ids.txt
    - total_flagged_notes.txt
- outputs/
  - applist.json
  - queries.json
  - game_metadata.csv
  - game_metadata_totals.csv
- scripts/
  - cleaning_script/
    - files_handler.py
    - filter_patch_notes.py
    - game_data.py
    - run_all.py
    - strip_html.py
  - find_embedded_data/
    - files_handler.py
    - file_embedded_data.py
    - game_data.py
  - get_news_script/
    - logs/
      - skipped_files.txt
      - output.log
    - create_json_files.py
    - get_patch_notes.py
    - load_batches.py
    - process_notes.py
    - session.py
  - Supporting_Script/
    - add_totals_to_csv.py
    - total_patch_count.py
  - appdetails_to_csv.ipynb
  - dataset_overview.ipynb
  - game_metadata_extraction.ipynb
  - steamspy_extraction.ipynb
Each component is modular and can be reused or extended to support more detailed patch note collection and analysis in future work.
- ✅ Initial app ID list retrieved (257,148 entries total)
- ✅ Incremental filtering process implemented to extract game metadata
- ✅ Metadata stored in structured JSON files for local use
- Aug 8, 2025: 27% of queries made
- Aug 15, 2025: 72% of queries made
- ✅ Format selected metadata from individual JSON files into one CSV
- ✅ Script to analyze the selected metadata (descriptive stats)
- ⬜ Fetch complementary metadata for each game? (e.g., users, hours played)
- ⬜ Write a short update script to refresh the dataset with new entries without re-fetching the entire list.
- ⬜ Begin defining sampling strategy for selecting games from the dataset for patch note analysis.
- 99% of games on Steam are indie games, but how do their total playtime and user-base size compare with those of AAA games?
- Can we use the raw game metadata to identify game clusters (PCA)?
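For the update script listed among the open tasks, one possible starting point is a set difference between a freshly downloaded app list and the IDs already queried. The sketch below is only an illustration: both function names are hypothetical, applist.json is assumed to store the app list response verbatim (an `applist` object containing an `apps` array of `appid`/`name` entries), and the queried IDs are assumed to be available as a plain list, which may not match the actual layout of queries.json.

```python
def app_ids_from_applist(applist):
    """Extract app IDs, assuming the file stores the raw app list response."""
    return [app["appid"] for app in applist["applist"]["apps"]]


def pending_app_ids(app_ids, queried_ids):
    """Return IDs not yet queried, in order, with duplicates dropped."""
    done = set(queried_ids)
    seen = set()
    pending = []
    for appid in app_ids:
        if appid not in done and appid not in seen:
            seen.add(appid)
            pending.append(appid)
    return pending
```

Only the IDs returned by `pending_app_ids` would then need an `appdetails` query, so a refresh costs one call per new entry instead of one per entry in the full list.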