iremcimen/google_play_store_analysis


📱 Google Play Store Apps

Data Cleaning & Exploratory Data Analysis (EDA)

This project covers data cleaning, preprocessing, and exploratory data analysis (EDA) of the Google Play Store Apps dataset. The dataset contains metadata for Android applications, including installs, reviews, ratings, price, size, category, genres, and Android version.

The primary aim is to transform raw, inconsistent data into a clean and analyzable format and extract meaningful insights through visualization and aggregation.


📂 Dataset Overview

  • Source: Google Play Store Apps dataset
  • Initial Size: ~10,800 applications
  • Data Types: Mixed (numeric + categorical)
  • Target: No explicit target (EDA & preprocessing focused)

Common Data Issues

  • Non-numeric characters in numeric columns
  • Missing or inconsistent values
  • Duplicate app entries
  • Mixed units and formats

🧹 Data Cleaning Steps

✔ Reviews

  • Detected non-numeric values
  • Removed invalid row(s)
  • Converted Reviews to int64
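
The Reviews steps above can be sketched as follows. This is a minimal illustration on a toy DataFrame, not the notebook's actual code; the sample values (including the malformed "3.0M" entry) are assumptions made for the example.

```python
import pandas as pd

# Toy frame standing in for the raw data; "3.0M" mimics a malformed Reviews entry.
df = pd.DataFrame({"App": ["A", "B", "C"], "Reviews": ["159", "3.0M", "87510"]})

# Coerce to numbers; anything unparseable becomes NaN so it can be inspected.
parsed = pd.to_numeric(df["Reviews"], errors="coerce")
invalid_rows = df[parsed.isna()]

# Drop the invalid rows and store the column as int64.
df = df[parsed.notna()].copy()
df["Reviews"] = parsed[parsed.notna()].astype("int64")
```

Inspecting `invalid_rows` before dropping makes it easy to confirm that only genuinely broken rows are removed.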

✔ Size

  • Converted:
    • "M" → megabytes
    • "k" → kilobytes
  • Replaced "Varies with device" with NaN
  • Converted Size to float
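
One possible way to implement the Size conversion is shown below. The README does not state the final unit, so this sketch assumes megabytes as the target and 1 MB = 1024 kB; `parse_size_mb` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def parse_size_mb(value):
    """Convert strings such as '19M' or '201k' to megabytes (assumed target unit)."""
    if not isinstance(value, str) or value == "Varies with device":
        return np.nan
    if value.endswith("M"):
        return float(value[:-1])
    if value.endswith("k"):
        return float(value[:-1]) / 1024  # assumption: 1 MB = 1024 kB
    return np.nan  # unknown format, treated as missing

sizes = pd.Series(["19M", "201k", "Varies with device"])
size_mb = sizes.map(parse_size_mb).astype(float)
```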

✔ Installs

  • Removed "+" and "," characters
  • Converted to int64
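
A short sketch of the Installs cleanup, using sample strings assumed for illustration:

```python
import pandas as pd

installs = pd.Series(["10,000+", "500,000,000+", "0"])

# regex=False treats "+" as a literal character rather than a regex quantifier.
installs_clean = (
    installs.str.replace("+", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype("int64")
)
```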

✔ Price

  • Removed "$" symbol
  • Converted to float
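
The Price step follows the same pattern. The sample values here are assumptions for the example:

```python
import pandas as pd

price = pd.Series(["0", "$4.99", "$399.99"])

# "$" is a regex anchor, so regex=False is needed to strip it literally.
price_clean = price.str.replace("$", "", regex=False).astype(float)
```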

✔ Dates

  • Converted Last Updated to datetime
  • Extracted:
    • Day
    • Month
    • Year
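
The date handling above might look like this. The sketch assumes dates are written like "January 7, 2018"; if the raw column uses a different layout, the `format` argument would need to change.

```python
import pandas as pd

df = pd.DataFrame({"Last Updated": ["January 7, 2018", "August 1, 2018"]})

# Assumed input format: full month name, day, four-digit year.
df["Last Updated"] = pd.to_datetime(df["Last Updated"], format="%B %d, %Y")

# Split the timestamp into separate numeric columns.
df["Day"] = df["Last Updated"].dt.day
df["Month"] = df["Last Updated"].dt.month
df["Year"] = df["Last Updated"].dt.year
```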

✔ Duplicates

  • Detected duplicate app names
  • Removed duplicates based on the App column
  • Kept the first occurrence
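
The duplicate handling described above maps directly onto pandas built-ins. A minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"App": ["X", "Y", "X"], "Reviews": [10, 20, 30]})

# Count rows whose App name already appeared earlier in the frame.
n_dupes = int(df["App"].duplicated().sum())

# Keep only the first occurrence of each app name.
deduped = df.drop_duplicates(subset="App", keep="first").reset_index(drop=True)
```

Because the first occurrence is kept, the row order before this step determines which version of a duplicated app survives.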

📊 Feature Categorization

Numeric Features

  • Reviews
  • Size
  • Installs
  • Price
  • Rating
  • Day, Month, Year

Categorical Features

  • App
  • Category
  • Type
  • Content Rating
  • Genres
  • Android Version
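
Once the columns are cleaned, the numeric/categorical split can be derived from the dtypes instead of being listed by hand. This is a sketch on a toy frame with a subset of the columns, not necessarily how the notebook builds the lists:

```python
import pandas as pd

df = pd.DataFrame({
    "App": ["A", "B"],
    "Category": ["GAME", "TOOLS"],
    "Reviews": [10, 20],
    "Price": [0.0, 1.99],
})

# Numeric dtypes go in one list; everything else is treated as categorical.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
```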

📈 Exploratory Data Analysis (EDA)

🔹 Numeric Features

  • Used KDE plots to analyze distributions
  • Visually inspected skewness and outliers
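
The project inspects skewness and outliers visually. As a numeric complement (not something the README says the notebook does), the sketch below computes a skewness coefficient and flags outliers with the common 1.5 × IQR rule on made-up install counts.

```python
import pandas as pd

# Made-up install counts with one extreme value.
installs = pd.Series([100, 1_000, 5_000, 10_000, 50_000, 1_000_000_000])

# Positive skewness indicates a long right tail.
skewness = installs.skew()

# Flag values outside 1.5 * IQR from the quartiles.
q1, q3 = installs.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = installs[(installs < q1 - 1.5 * iqr) | (installs > q3 + 1.5 * iqr)]
```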

🔹 Categorical Features

  • Count plots for:
    • App Type
    • Content Rating

🔹 Category-Level Analysis

  • Calculated total installs by category
  • Identified the top 10 categories by total installs
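
The category-level aggregation can be sketched with a group-by and `nlargest`. The data below is invented for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "SOCIAL"],
    "Installs": [100, 200, 50, 400],
})

# Sum installs per category, then keep the ten largest totals.
top_categories = df.groupby("Category")["Installs"].sum().nlargest(10)
```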

🔹 App-Level Insights

  • Identified Top 5 apps by installs in major categories:
    • GAME
    • COMMUNICATION
    • TOOLS
    • PRODUCTIVITY
    • SOCIAL
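
One way to get the top 5 apps per category is to sort by installs and take the head of each group. The sketch below uses invented app names and counts; the list of major categories comes from the bullets above.

```python
import pandas as pd

major = ["GAME", "COMMUNICATION", "TOOLS", "PRODUCTIVITY", "SOCIAL"]

df = pd.DataFrame({
    "App": ["g1", "g2", "g3", "g4", "g5", "g6", "t1", "f1"],
    "Category": ["GAME"] * 6 + ["TOOLS", "FAMILY"],
    "Installs": [600, 500, 400, 300, 200, 100, 50, 1000],
})

# Restrict to the major categories, sort descending, keep five rows per category.
top_apps = (
    df[df["Category"].isin(major)]
    .sort_values("Installs", ascending=False)
    .groupby("Category")
    .head(5)
)
```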

🔹 Genre Analysis

  • Calculated mean installs per genre
  • Created a numeric encoding column, Genres Encoded, that maps each genre to its mean installs
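
The genre encoding above amounts to a group-wise mean broadcast back onto each row. A minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "Genres": ["Action", "Action", "Tools"],
    "Installs": [100, 300, 50],
})

# transform("mean") returns one value per row, aligned with the original index.
df["Genres Encoded"] = df.groupby("Genres")["Installs"].transform("mean")
```

If Installs were later used as a prediction target, an encoding derived from it would leak the target, so it would need to be computed on training data only.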

📊 Key Insights

  • A small number of apps account for the majority of installs
  • Game and Communication categories dominate total downloads
  • Free apps vastly outnumber paid apps
  • Install counts show extreme skewness
  • Genre-based encoding can be useful for downstream modeling

🛠️ Technologies Used

  • Python
  • NumPy
  • Pandas
  • Matplotlib
  • Seaborn
  • Jupyter Notebook

