
Data cleaning of a dataset about apartments in Moscow

1. Project description
2. Solving case
3. Data summary
4. Stages of the project work
5. Results

Project description

The data comes from a competition on the Kaggle platform, initiated by Sber and held in 2017. Sber's task was to build a model that predicts housing prices in Moscow based on the parameters of a property.

⬆️To contents

Solving case

Prepare the data for model building, which includes:

I. basic analysis of the data structure
II. detection of missing data
III. processing of missing data
IV. outlier detection and cleaning
V. search for and elimination of duplicates

Quality metric
The output should be clean data: duplicates and outliers removed, and the data converted into a form suitable for analysis.

What we practice
Processing and cleaning data to prepare it for further work.

Data summary

The data is a set of 30,471 records of real properties in Moscow and the Moscow region, with attributes that presumably influence the price. For each record, 61 attributes are provided: the total area of the property, living area, district, price, ecology, information about kindergartens, schools and hospitals, distance to the city center, to subway and train stations, to stores, museums, etc.

⬆️To contents

Stages of the project work

  1. basic analysis of the data structure

Analyzing the data structure and determining the necessary transformations.
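This step can be sketched in pandas. The DataFrame below is a toy stand-in; column names such as `full_sq` and `price_doc` are illustrative assumptions, not confirmed names from the Sber dataset.

```python
import pandas as pd

# Toy stand-in for the housing data (hypothetical columns).
df = pd.DataFrame({
    "full_sq": [50.0, 62.0, None, 45.0],
    "life_sq": [30.0, 40.0, 25.0, None],
    "price_doc": [6_000_000, 8_500_000, 5_200_000, 4_800_000],
})

# Basic structure analysis: dimensions, column types, memory footprint.
print(df.shape)              # (rows, columns)
print(df.dtypes)             # which features are numeric vs. object
df.info(memory_usage="deep") # non-null counts and memory usage per column
```

`df.info()` alone already reveals most of the transformations needed: object columns to convert, and columns with non-null counts below the row count.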

  2. detection of missing data

Missing value detection and analysis: identifying entries/features that need to be removed and those that need to be processed.
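A minimal sketch of this step, again with hypothetical column names; the 0.5 removal threshold is an illustrative assumption, not a rule from the project.

```python
import pandas as pd

df = pd.DataFrame({
    "full_sq": [50.0, 62.0, None, 45.0],
    "life_sq": [30.0, None, None, 38.0],
    "sub_area": ["A", "B", "B", None],
})

# Share of missing values per feature, worst first.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Features missing in more than half the rows are candidates for removal;
# the rest are candidates for imputation.
to_drop = missing_share[missing_share > 0.5].index.tolist()
to_fill = missing_share[(missing_share > 0) & (missing_share <= 0.5)].index.tolist()
```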

  3. processing of missing data

Analysis of missing data, filling numerical features with median values and categorical features with modes.
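The median/mode imputation described above can be written generically, without knowing the exact columns, by splitting on dtype (the column names below are placeholders):

```python
import pandas as pd

df = pd.DataFrame({
    "full_sq": [50.0, 62.0, None, 45.0],   # numeric feature
    "sub_area": ["A", "B", "B", None],     # categorical feature
})

# Numerical features -> median of the observed values.
for col in df.select_dtypes(include="number"):
    df[col] = df[col].fillna(df[col].median())

# Categorical features -> mode (most frequent observed value).
for col in df.select_dtypes(exclude="number"):
    df[col] = df[col].fillna(df[col].mode()[0])
```

Median and mode are preferred over the mean here because both are insensitive to the outliers that the next stage has not yet removed.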

  4. outlier detection and cleaning

Finding and identifying outliers using domain logic, the three-sigma rule, and Tukey's method.
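Both rules can be sketched on a toy series (the values are made up for illustration). Note that on small or heavily skewed samples the extreme value inflates the standard deviation, so the three-sigma rule can miss an outlier that Tukey's interquartile-range fences catch:

```python
import pandas as pd

s = pd.Series([48, 50, 52, 51, 49, 500])  # 500 is an obvious outlier

# Three-sigma rule: flag values more than 3 standard deviations from the mean.
# Here the outlier itself inflates the std, so nothing exceeds |z| > 3.
z = (s - s.mean()) / s.std()
sigma_outliers = s[z.abs() > 3]

# Tukey's method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
tukey_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
```

This robustness to the outliers being hunted is why Tukey's method is usually applied alongside, not instead of, the three-sigma rule.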

  5. search and elimination of duplicates

Searching for and deleting duplicate records.
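A minimal sketch of full-row deduplication (toy data, hypothetical columns):

```python
import pandas as pd

df = pd.DataFrame({
    "full_sq": [50, 62, 50],
    "price_doc": [6_000_000, 8_500_000, 6_000_000],  # row 2 repeats row 0
})

# Count fully duplicated rows, then drop them, keeping the first occurrence.
n_dupes = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)
```

Passing a `subset` of key columns to `duplicated()`/`drop_duplicates()` would additionally catch records that differ only in irrelevant fields.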

⬆️To contents

Results

As a result of the work, we obtained a dataset cleared of duplicates, missing data, and outliers.

⬆️To contents
