This project uses Databricks to process and analyze Call Detail Records (CDRs) from a public S3 bucket. The goal is to extract meaningful insights about SMS, call, and internet activity over a specific time period.
The dataset comprises Call Detail Records (CDRs) located in WeCloudData's S3 bucket:
- S3 Bucket Link: `s3://weclouddata/datasets/telecom/CDR`
- Data Structure:
  - The data is organized into monthly subfolders, with each day's records stored in a text file.
  - We focus on the `cdr_by_grid_december` folder and analyze the first five days of December.
The dataset contains the following columns (the raw files do not include a header row); a possible schema is sketched after the list:
- `square-id`: ID of the square in the Milano GRID (numeric)
- `time-interval`: Beginning of the time interval (numeric, milliseconds since Unix Epoch)
- `country-code`: Phone country code (numeric)
- `sms-in-activity`: Received SMS count in the square during the time interval (numeric)
- `sms-out-activity`: Sent SMS count in the square during the time interval (numeric)
- `call-in-activity`: Received call count in the square during the time interval (numeric)
- `call-out-activity`: Issued call count in the square during the time interval (numeric)
- `internet-traffic-activity`: Internet traffic count in the square during the time interval (numeric)
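Given these columns, a PySpark schema for the headerless files might look like the following sketch. The exact Spark types (e.g., `DoubleType` for the activity counts) are assumptions, not confirmed by the source data:

```python
from pyspark.sql.types import (
    StructType, StructField, IntegerType, LongType, DoubleType,
)

# Assumed schema for the headerless CDR text files; the numeric types
# (IntegerType vs. LongType vs. DoubleType) are educated guesses.
cdr_schema = StructType([
    StructField("square-id", IntegerType(), True),
    StructField("time-interval", LongType(), True),  # ms since Unix Epoch
    StructField("country-code", IntegerType(), True),
    StructField("sms-in-activity", DoubleType(), True),
    StructField("sms-out-activity", DoubleType(), True),
    StructField("call-in-activity", DoubleType(), True),
    StructField("call-out-activity", DoubleType(), True),
    StructField("internet-traffic-activity", DoubleType(), True),
])
```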
- Mount the CDR data from the S3 bucket to Databricks.
- Create a schema with appropriate headers.
- Perform various data transformations and analyses as outlined below.
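Before the detailed steps, here is a minimal sketch of the first two tasks: mounting the bucket and reading it with the schema above. The tab delimiter and the `/mnt/cdr` mount point are assumptions; inspect a raw file and adjust as needed:

```python
# Mount the public CDR bucket to DBFS (the mount point name is arbitrary).
dbutils.fs.mount(
    source="s3a://weclouddata/datasets/telecom/CDR",
    mount_point="/mnt/cdr",
)

# Read the December grid data using the schema sketched above; the tab
# delimiter is an assumption. To restrict the read to the first five days,
# list those day files explicitly instead of the whole folder.
cdr_df = (
    spark.read.format("csv")
    .option("delimiter", "\t")
    .schema(cdr_schema)
    .load("/mnt/cdr/cdr_by_grid_december/")
)
```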
- Data Mounting: Mounted the CDR data from the specified S3 subfolder.
- Data Cleaning: Renamed columns by replacing `-` with `_`.
- Feature Engineering:
  - Added a new column, `sms_ratio`, showing the ratio of `sms-in-activity` to `sms-out-activity`.
  - Created a date column from `time-interval` and formatted it as `yyyy/MM/dd`.
- Summary Statistics: Calculated aggregated statistics at the `square-id` level, including means, minimums, and maximums for various activities.
- Min/Max Values: Found minimum and maximum values for key activity metrics grouped by `square-id`.
- Aggregate Table: Generated an activity summary table for SMS, call, and internet activities by country and date, and wrote this DataFrame to the `tmp` folder in Parquet format.
- Data Export: Mounted a personal AWS S3 bucket and exported the summary statistics to it.
- Ranking Internet Activity: Used window functions to rank internet activity by country code and date (a combined code sketch of these steps follows this list).
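A combined sketch of the steps above, continuing from the `cdr_df` DataFrame in the earlier sketch; the aggregate column aliases and the exact output path under `/tmp` are illustrative, not taken from the source:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Data cleaning: replace '-' with '_' in every column name.
clean_df = cdr_df.toDF(*[c.replace("-", "_") for c in cdr_df.columns])

# Feature engineering: sms_ratio plus a yyyy/MM/dd date column derived
# from the millisecond epoch timestamp.
feat_df = (
    clean_df
    .withColumn("sms_ratio", F.col("sms_in_activity") / F.col("sms_out_activity"))
    .withColumn(
        "date",
        F.date_format(
            F.from_unixtime((F.col("time_interval") / 1000).cast("long")),
            "yyyy/MM/dd",
        ),
    )
)

# Summary statistics at the square_id level (aliases are illustrative).
summary_df = feat_df.groupBy("square_id").agg(
    F.mean("sms_in_activity").alias("avg_sms_in"),
    F.min("sms_in_activity").alias("min_sms_in"),
    F.max("sms_in_activity").alias("max_sms_in"),
    F.mean("internet_traffic_activity").alias("avg_internet"),
    F.min("internet_traffic_activity").alias("min_internet"),
    F.max("internet_traffic_activity").alias("max_internet"),
)

# Activity summary by country and date, written to /tmp in Parquet format.
activity_df = feat_df.groupBy("country_code", "date").agg(
    F.sum("sms_in_activity").alias("total_sms_in"),
    F.sum("sms_out_activity").alias("total_sms_out"),
    F.sum("call_in_activity").alias("total_call_in"),
    F.sum("call_out_activity").alias("total_call_out"),
    F.sum("internet_traffic_activity").alias("total_internet"),
)
activity_df.write.mode("overwrite").parquet("/tmp/cdr_activity_summary")

# Rank internet activity within each country code and date via a window.
w = Window.partitionBy("country_code", "date").orderBy(
    F.desc("internet_traffic_activity")
)
ranked_df = feat_df.withColumn("internet_rank", F.rank().over(w))
```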
- Prerequisites: Ensure you have access to Databricks and the necessary AWS credentials.
- AWS Configuration: Set up AWS credentials using the Databricks CLI commands provided below.
- Run the Notebook: Execute the notebook cells in order to mount the data, process it, and generate the summary statistics.
To set up AWS credentials, follow these steps:
- Obtain AWS Access Key and Secret Access Key:
  - Log in to your AWS Management Console.
  - Navigate to the IAM (Identity and Access Management) service.
  - Click on Users and select your user account.
  - Go to the Security Credentials tab.
  - Click on Create Access Key to generate a new access key pair (Access Key ID and Secret Access Key).
  - Save these keys securely; you will need them for accessing your S3 bucket.
- Set Up AWS Credentials in Databricks:
  - Use the Databricks CLI to set your AWS credentials. Run the following commands in your terminal:

    ```
    databricks secrets create-scope --scope my_secret_scope
    databricks secrets put --scope my_secret_scope --key AWS_ACCESS_KEY_ID
    databricks secrets put --scope my_secret_scope --key AWS_SECRET_ACCESS_KEY
    ```

  - Replace `my_secret_scope` with your desired secret scope name.
- Configure AWS S3 Access in Your Notebook:
  - Follow the instructions in the notebook to retrieve and configure the AWS credentials using Databricks secrets (a sketch follows this list).
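Inside the notebook, retrieving the keys and mounting a personal bucket might look like the sketch below. The scope name matches the CLI commands above; the bucket name is a placeholder, and `summary_df` refers to the summary statistics DataFrame from the earlier sketch:

```python
# Retrieve the keys stored via the Databricks CLI (scope name from above).
access_key = dbutils.secrets.get(scope="my_secret_scope", key="AWS_ACCESS_KEY_ID")
secret_key = dbutils.secrets.get(scope="my_secret_scope", key="AWS_SECRET_ACCESS_KEY")
encoded_secret = secret_key.replace("/", "%2F")  # URL-encode slashes for the mount URI

# Mount a personal bucket (name is a placeholder) and export the summary stats.
dbutils.fs.mount(
    source=f"s3a://{access_key}:{encoded_secret}@my-personal-bucket",
    mount_point="/mnt/my-bucket",
)
summary_df.write.mode("overwrite").parquet("/mnt/my-bucket/cdr_summary_stats")
```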
- Extend the analysis to include more data months or other data sources.
- Implement visualizations to better illustrate trends and patterns in the CDR data.
- Dataset sourced from WeCloudData.
- Thanks to WeCloudData for providing the project.
- Special thanks to the Databricks community for ongoing support.
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.