This project focuses on data cleaning, preprocessing, and exploratory data analysis (EDA) of the Google Play Store Apps dataset. The dataset contains metadata of Android applications, including installs, reviews, ratings, price, size, category, genres, and Android version.
The primary aim is to transform raw, inconsistent data into a clean and analyzable format and extract meaningful insights through visualization and aggregation.
- Source: Google Play Store Apps dataset
- Initial Size: ~10,800 applications
- Data Types: Mixed (numeric + categorical)
- Target: No explicit target (EDA & preprocessing focused)
- Non-numeric characters in numeric columns
- Missing or inconsistent values
- Duplicate app entries
- Mixed units and formats
- Detected non-numeric values
- Removed invalid row(s)
- Converted Reviews to int64
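
The Reviews steps above can be sketched roughly as follows. This is a minimal illustration on a toy frame, not the project's actual notebook code; the sample values and the use of `str.isnumeric()` for detection are assumptions.

```python
import pandas as pd

# Toy stand-in for the raw data; values are illustrative only.
df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Reviews": ["159", "3.0M", "967"],
})

# Flag entries that are not made up entirely of digits.
is_numeric = df["Reviews"].str.isnumeric()
print(df[~is_numeric])  # rows that would break a direct integer cast

# Drop the invalid rows, then cast the remaining strings to int64.
df = df[is_numeric].copy()
df["Reviews"] = df["Reviews"].astype("int64")
```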
- Converted:
- "M" → megabytes
- "k" → kilobytes
- Replaced "Varies with device" with NaN
- Converted Size to float
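
One possible way to implement the Size conversion is shown below. The helper name `size_to_mb`, the choice of megabytes as the common unit, and the factor of 1024 kilobytes per megabyte are all assumptions, since the common unit is not stated here.

```python
import numpy as np
import pandas as pd

def size_to_mb(value):
    """Convert a raw Size string to megabytes (hypothetical helper)."""
    if isinstance(value, str):
        if value.endswith("M"):
            return float(value[:-1])          # already in megabytes
        if value.endswith("k"):
            return float(value[:-1]) / 1024   # assumed: 1024 kB per MB
    # "Varies with device" and anything unrecognized become NaN.
    return np.nan

# Toy values, illustrative only.
df = pd.DataFrame({"Size": ["19M", "512k", "Varies with device"]})
df["Size"] = df["Size"].map(size_to_mb).astype(float)
```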
- Removed "+" and "," characters
- Converted Installs to int64
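
A short sketch of the Installs cleanup, assuming the column holds strings such as "10,000+" (the sample values are made up):

```python
import pandas as pd

# Toy values, illustrative only.
df = pd.DataFrame({"Installs": ["10,000+", "500,000+", "0"]})

# Strip the literal "+" and "," characters, then cast to int64.
df["Installs"] = (
    df["Installs"]
    .str.replace("+", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype("int64")
)
```

Passing `regex=False` keeps "+" from being read as a regular-expression quantifier.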
- Removed "$" symbol
- Converted Price to float
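
The Price cleanup could look like this, assuming prices are stored as strings such as "$4.99" with free apps recorded as "0" (sample values are illustrative):

```python
import pandas as pd

# Toy values, illustrative only.
df = pd.DataFrame({"Price": ["0", "$4.99", "$0.99"]})

# Remove the literal "$" (regex=False, since "$" is a regex anchor) and cast.
df["Price"] = df["Price"].str.replace("$", "", regex=False).astype(float)
```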
- Converted Last Updated to datetime
- Extracted:
- Day
- Month
- Year
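
The date handling above might be sketched as follows. The input format "January 7, 2018" (and therefore the `format` string) is an assumption about how the raw column is written.

```python
import pandas as pd

# Toy values, illustrative only; the date format is assumed.
df = pd.DataFrame({"Last Updated": ["January 7, 2018", "August 1, 2018"]})

# Parse the strings into datetime64 values.
df["Last Updated"] = pd.to_datetime(df["Last Updated"], format="%B %d, %Y")

# Derive the calendar components as separate integer columns.
df["Day"] = df["Last Updated"].dt.day
df["Month"] = df["Last Updated"].dt.month
df["Year"] = df["Last Updated"].dt.year
```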
- Detected duplicate app names
- Removed duplicates based on the App column
- Kept the first occurrence
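
A minimal sketch of the duplicate handling, using a toy frame with made-up values:

```python
import pandas as pd

# Toy values, illustrative only: app "A" appears twice.
df = pd.DataFrame({"App": ["A", "B", "A"], "Reviews": [10, 20, 30]})

# Count rows whose App name has already appeared earlier.
n_dupes = df["App"].duplicated().sum()

# Keep only the first row for each App name.
df = df.drop_duplicates(subset="App", keep="first").reset_index(drop=True)
```

Keeping the first occurrence is order-dependent, so sorting beforehand (for example by Reviews) would change which row survives.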
- Reviews
- Size
- Installs
- Price
- Rating
- Day, Month, Year
- App
- Category
- Type
- Content Rating
- Genres
- Android Version
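
After cleaning, the two column groups listed above (numeric first, then categorical) can be recovered from the dtypes. This is a sketch on a toy frame containing only a few of the columns; the variable names are hypothetical.

```python
import pandas as pd

# Toy frame with a subset of the columns, illustrative only.
df = pd.DataFrame({
    "App": ["A", "B"],
    "Category": ["GAME", "TOOLS"],
    "Reviews": [10, 20],
    "Rating": [4.1, 3.9],
})

# Split column names by dtype: numeric versus everything else.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
```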
- KDE plots used to analyze distributions
- Skewness and outliers visually inspected
- Count plots for:
- App Type
- Content Rating
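
As a numeric cross-check on what the KDE and count plots show, the same information can be summarized with plain pandas. This is a complement to the plots rather than the plotting code itself, and the sample values are made up.

```python
import pandas as pd

# Toy values, illustrative only.
df = pd.DataFrame({
    "Installs": [10, 100, 1_000, 1_000_000],
    "Type": ["Free", "Free", "Free", "Paid"],
})

# Sample skewness: a positive value indicates a long right tail.
installs_skew = df["Installs"].skew()

# Frequency table behind a count plot of App Type.
type_counts = df["Type"].value_counts()
```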
- Total installs calculated by category
- Identified Top 10 categories by total installs
- Identified Top 5 apps by installs in major categories:
- GAME
- COMMUNICATION
- TOOLS
- PRODUCTIVITY
- SOCIAL
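
The category aggregations above can be sketched like this. The toy data and variable names are assumptions; the five category names are taken from the list above.

```python
import pandas as pd

# Toy values, illustrative only.
df = pd.DataFrame({
    "App": ["g1", "g2", "c1", "t1", "f1"],
    "Category": ["GAME", "GAME", "COMMUNICATION", "TOOLS", "FAMILY"],
    "Installs": [1000, 500, 2000, 100, 50],
})

# Total installs per category, largest first.
installs_by_category = (
    df.groupby("Category")["Installs"].sum().sort_values(ascending=False)
)
top_categories = installs_by_category.head(10)

# Top 5 apps by installs within each of the major categories.
major = ["GAME", "COMMUNICATION", "TOOLS", "PRODUCTIVITY", "SOCIAL"]
top_apps = (
    df[df["Category"].isin(major)]
    .sort_values("Installs", ascending=False)
    .groupby("Category")
    .head(5)
)
```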
- Calculated mean installs per genre
- Created a numeric encoding:
- Genres Encoded (mean installs per genre)
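
One way to build the genre encoding is a grouped transform, sketched here on toy data (values are illustrative):

```python
import pandas as pd

# Toy values, illustrative only.
df = pd.DataFrame({
    "Genres": ["Action", "Action", "Tools"],
    "Installs": [100, 300, 50],
})

# Replace each genre with the mean installs of all apps in that genre.
df["Genres Encoded"] = df.groupby("Genres")["Installs"].transform("mean")
```

If Installs later becomes a prediction target, computing these means on the full dataset would leak target information; fitting the encoding on a training split is the usual workaround.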
- A small number of apps account for the majority of installs
- Game and Communication categories dominate total downloads
- Free apps vastly outnumber paid apps
- Install counts show extreme skewness
- Genre-based encoding can be useful for downstream modeling
- Python
- NumPy
- Pandas
- Matplotlib
- Seaborn
- Jupyter Notebook