Pan-cancer analysis involves assessing frequently mutated genes and other genomic abnormalities common to many different cancers, regardless of tumor origin. Using next-generation sequencing (NGS), pan-tumor projects such as The Cancer Genome Atlas have made significant contributions to our understanding of DNA and RNA variants across many cancer types.
The objective of this project is to train regression models on data from The Cancer Genome Atlas (TCGA) Pan-Cancer analysis project to predict a continuous value representing a specific molecular characteristic of cancers. The aims are the following:
- Study the genomes of various cancers to better understand their molecular characteristics.
- Assist in developing new treatment strategies.
The data, collected from different types of tumors, can be downloaded from the two links below:
This project implements four different regression models using cancer genome data:
- Simple Linear Regression (SLR)
- Multiple Linear Regression (MLR)
- Ridge Regression (RR)
- Lasso Regression (LR)
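As a rough illustration of how the four models listed above might be set up, here is a minimal sketch. It assumes scikit-learn (the project's actual library is not stated here) and uses synthetic data in place of the TCGA features; the `alpha` values, the feature matrix `X`, and the target `y` are hypothetical placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic stand-in for the real data (assumption: the TCGA data is not loaded here).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 0.0])
y = X @ true_coef + rng.normal(scale=0.1, size=200)

# Each entry pairs an estimator with the features it is trained on.
# Simple linear regression uses a single feature; the others use all of them.
models = {
    "SLR": (LinearRegression(), X[:, [0]]),
    "MLR": (LinearRegression(), X),
    "RR": (Ridge(alpha=1.0), X),   # alpha chosen arbitrarily for illustration
    "LR": (Lasso(alpha=0.1), X),   # alpha chosen arbitrarily for illustration
}

# Fit every model on its feature set.
fitted = {name: est.fit(feats, y) for name, (est, feats) in models.items()}

for name, est in fitted.items():
    print(name, est.coef_)
```

Ridge adds an L2 penalty on the coefficients and Lasso an L1 penalty, so both trade a little fit on the training data for smaller (Ridge) or sparser (Lasso) coefficients.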
The models are evaluated using mean squared error (MSE) and cross-validation scores, alongside each model's score:
| Model | Score | Cross-validation | Mean Squared Error |
|---|---|---|---|
| Simple Linear Regression | 0.98 | 0.98 | 0.05 |
| Multiple Linear Regression | 0.98 | 0.98 | 0.04 |
| Ridge Regression | 0.97 | 0.96 | 0.05 |
| Lasso Regression | 0.92 | 0.94 | 0.07 |
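The three columns in the table could be computed along the following lines. This is a sketch under assumptions, not the project's actual evaluation code: it assumes scikit-learn, treats "Score" as the estimator's `score` method (R² for regressors), uses 5-fold cross-validation, and runs on synthetic data rather than the TCGA data. The split ratio and fold count are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the real data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# Hold out 20% of the samples for testing (ratio is an illustrative choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# "Score": R^2 on the held-out set.
test_score = model.score(X_test, y_test)

# "Mean Squared Error": average squared difference between predictions and targets.
test_mse = mean_squared_error(y_test, model.predict(X_test))

# "Cross-validation": mean R^2 over 5 folds (R^2 is the default scoring for regressors).
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)
cv_mean = cv_scores.mean()

print(f"score={test_score:.2f}  cv={cv_mean:.2f}  mse={test_mse:.2f}")
```

Reporting the cross-validated mean next to the single held-out score helps show whether a result depends on one particular train/test split.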
- Simple and multiple linear regression perform comparably (score 0.98, MSE 0.05 and 0.04), while Ridge and Lasso regression score slightly lower (0.97 and 0.92).