This project uses Databricks to process and analyze Call Detail Records (CDRs) from a public S3 bucket. The goal is to extract meaningful insights about SMS, call, and internet activity over a specific time period.
The dataset comprises Call Detail Records (CDRs) located in WeCloudData's S3 bucket:
- S3 Bucket Link: `s3://weclouddata/datasets/telecom/CDR`
- Data Structure:
  - The data is organized into monthly subfolders, with each day's records stored in a text file.
  - We focus on the `cdr_by_grid_december` folder and analyze the first five days of December.
The dataset contains the following columns (the raw files do not include a header row); a possible schema is sketched after the list:
- `square-id`: ID of the square in the Milano GRID (numeric)
- `time-interval`: Beginning of the time interval (numeric, milliseconds since Unix Epoch)
- `country-code`: Phone country code (numeric)
- `sms-in-activity`: Received SMS count in the square during the time interval (numeric)
- `sms-out-activity`: Sent SMS count in the square during the time interval (numeric)
- `call-in-activity`: Received call count in the square during the time interval (numeric)
- `call-out-activity`: Issued call count in the square during the time interval (numeric)
- `internet-traffic-activity`: Internet traffic count in the square during the time interval (numeric)
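Given these columns, a PySpark schema for the headerless files might look like the following sketch. The exact Spark types (e.g., `DoubleType` for the activity counts) are assumptions, not confirmed by the source data:

```python
from pyspark.sql.types import (
    StructType, StructField, IntegerType, LongType, DoubleType,
)

# Assumed schema for the headerless CDR text files; the numeric types
# (IntegerType vs. LongType vs. DoubleType) are educated guesses.
cdr_schema = StructType([
    StructField("square-id", IntegerType(), True),
    StructField("time-interval", LongType(), True),  # ms since Unix Epoch
    StructField("country-code", IntegerType(), True),
    StructField("sms-in-activity", DoubleType(), True),
    StructField("sms-out-activity", DoubleType(), True),
    StructField("call-in-activity", DoubleType(), True),
    StructField("call-out-activity", DoubleType(), True),
    StructField("internet-traffic-activity", DoubleType(), True),
])
```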
- Mount the CDR data from the S3 bucket to Databricks.
- Create a schema with appropriate headers.
- Perform various data transformations and analyses as outlined below.
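Before the detailed steps, here is a minimal sketch of the first two tasks: mounting the bucket and reading it with the schema above. The tab delimiter and the `/mnt/cdr` mount point are assumptions; inspect a raw file and adjust as needed:

```python
# Mount the public CDR bucket to DBFS (the mount point name is arbitrary).
dbutils.fs.mount(
    source="s3a://weclouddata/datasets/telecom/CDR",
    mount_point="/mnt/cdr",
)

# Read the December grid data using the schema sketched above; the tab
# delimiter is an assumption. To restrict the read to the first five days,
# list those day files explicitly instead of the whole folder.
cdr_df = (
    spark.read.format("csv")
    .option("delimiter", "\t")
    .schema(cdr_schema)
    .load("/mnt/cdr/cdr_by_grid_december/")
)
```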
- Data Mounting: Mounted the CDR data from the specified S3 subfolder.
- Data Cleaning: Renamed columns by replacing `-` with `_`.
- Feature Engineering:
  - Added a new column, `sms_ratio`, showing the ratio of `sms-in-activity` to `sms-out-activity`.
  - Created a date column from `time-interval` and formatted it as `yyyy/MM/dd`.
- Summary Statistics: Calculated aggregated statistics at the `square-id` level, including means, minimums, and maximums for various activities.
- Min/Max Values: Found minimum and maximum values for key activity metrics grouped by `square-id`.
- Aggregate Table: Generated an activity summary table for SMS, call, and internet activities by country and date, and wrote this DataFrame to the `tmp` folder in Parquet format.
- Data Export: Mounted a personal AWS S3 bucket and exported the summary statistics to it.
- Ranking Internet Activity: Used window functions to rank internet activity by country code and date (a combined code sketch of these steps follows this list).
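A combined sketch of the steps above, continuing from the `cdr_df` DataFrame in the earlier sketch; the aggregate column aliases and the exact output path under `/tmp` are illustrative, not taken from the source:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Data cleaning: replace '-' with '_' in every column name.
clean_df = cdr_df.toDF(*[c.replace("-", "_") for c in cdr_df.columns])

# Feature engineering: sms_ratio plus a yyyy/MM/dd date column derived
# from the millisecond epoch timestamp.
feat_df = (
    clean_df
    .withColumn("sms_ratio", F.col("sms_in_activity") / F.col("sms_out_activity"))
    .withColumn(
        "date",
        F.date_format(
            F.from_unixtime((F.col("time_interval") / 1000).cast("long")),
            "yyyy/MM/dd",
        ),
    )
)

# Summary statistics at the square_id level (aliases are illustrative).
summary_df = feat_df.groupBy("square_id").agg(
    F.mean("sms_in_activity").alias("avg_sms_in"),
    F.min("sms_in_activity").alias("min_sms_in"),
    F.max("sms_in_activity").alias("max_sms_in"),
    F.mean("internet_traffic_activity").alias("avg_internet"),
    F.min("internet_traffic_activity").alias("min_internet"),
    F.max("internet_traffic_activity").alias("max_internet"),
)

# Activity summary by country and date, written to /tmp in Parquet format.
activity_df = feat_df.groupBy("country_code", "date").agg(
    F.sum("sms_in_activity").alias("total_sms_in"),
    F.sum("sms_out_activity").alias("total_sms_out"),
    F.sum("call_in_activity").alias("total_call_in"),
    F.sum("call_out_activity").alias("total_call_out"),
    F.sum("internet_traffic_activity").alias("total_internet"),
)
activity_df.write.mode("overwrite").parquet("/tmp/cdr_activity_summary")

# Rank internet activity within each country code and date via a window.
w = Window.partitionBy("country_code", "date").orderBy(
    F.desc("internet_traffic_activity")
)
ranked_df = feat_df.withColumn("internet_rank", F.rank().over(w))
```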
- Prerequisites: Ensure you have access to Databricks and the necessary AWS credentials.
- AWS Configuration: Set up AWS credentials using the Databricks CLI commands provided below.
- Run the Notebook: Execute the notebook cells in order to mount the data, process it, and generate the summary statistics.
To set up AWS credentials, follow these steps:
- Obtain AWS Access Key and Secret Access Key:
  - Log in to your AWS Management Console.
  - Navigate to the IAM (Identity and Access Management) service.
  - Click on Users and select your user account.
  - Go to the Security Credentials tab.
  - Click on Create Access Key to generate a new access key pair (Access Key ID and Secret Access Key).
  - Save these keys securely; you will need them for accessing your S3 bucket.
- Set Up AWS Credentials in Databricks:
  - Use the Databricks CLI to set your AWS credentials. Run the following commands in your terminal:

    ```
    databricks secrets create-scope --scope my_secret_scope
    databricks secrets put --scope my_secret_scope --key AWS_ACCESS_KEY_ID
    databricks secrets put --scope my_secret_scope --key AWS_SECRET_ACCESS_KEY
    ```

  - Replace `my_secret_scope` with your desired secret scope name.
- Configure AWS S3 Access in Your Notebook:
  - Follow the instructions in the notebook to retrieve and configure the AWS credentials using Databricks secrets (a sketch follows this list).
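Inside the notebook, retrieving the keys and mounting a personal bucket might look like the sketch below. The scope name matches the CLI commands above; the bucket name is a placeholder, and `summary_df` refers to the summary statistics DataFrame from the earlier sketch:

```python
# Retrieve the keys stored via the Databricks CLI (scope name from above).
access_key = dbutils.secrets.get(scope="my_secret_scope", key="AWS_ACCESS_KEY_ID")
secret_key = dbutils.secrets.get(scope="my_secret_scope", key="AWS_SECRET_ACCESS_KEY")
encoded_secret = secret_key.replace("/", "%2F")  # URL-encode slashes for the mount URI

# Mount a personal bucket (name is a placeholder) and export the summary stats.
dbutils.fs.mount(
    source=f"s3a://{access_key}:{encoded_secret}@my-personal-bucket",
    mount_point="/mnt/my-bucket",
)
summary_df.write.mode("overwrite").parquet("/mnt/my-bucket/cdr_summary_stats")
```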
- Extend the analysis to include more data months or other data sources.
- Implement visualizations to better illustrate trends and patterns in the CDR data.
- Dataset sourced from WeCloudData.
- Thanks to WeCloudData for providing the project.
- Special thanks to the Databricks community for ongoing support.
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.