Case Study 2: TCGA BRCA Data Analysis — Machine Learning and Deep Learning for Breast Cancer Subtype Classification

This case study demonstrates my expertise in genomic data analysis, machine learning, and deep learning. The project involved accessing public genomic datasets from the GDC data portal, performing comprehensive exploratory data analysis, and applying both classical machine learning and deep learning approaches to classify breast cancer subtypes.

Project Background

The Cancer Genome Atlas (TCGA) is a comprehensive, publicly available archive of genomic data from over 20,000 primary cancer and matched normal samples. This project focused on the TCGA-BRCA dataset, which contains RNA-Seq expression data from 885 breast cancer and normal tissue samples.

The primary objective was to:

Access and prepare large-scale genomic data from the GDC data portal
Perform exploratory data analysis to understand the underlying structure
Identify molecular subtypes (PAM50) as a major source of variation
Apply machine learning and deep learning models to classify samples into breast cancer subtypes
Compare classical ML (KNN) with deep learning approaches

Dataset Overview

The TCGA-BRCA dataset provides comprehensive RNA-Seq expression data:

Total samples: 885 samples (primary tumours, metastatic tumours, and normal tissue)
Gene features: 60,660 genes (raw counts)
Post-filtering: 18,303 genes (after removing unexpressed genes)
Molecular classification: PAM50 breast cancer subtypes (basal, her2, lumA, lumB, normal)

Data Access and Preparation

GDC Data Portal Integration

The dataset was accessed through the GDC (Genomic Data Commons) data portal using:

GDC portal filters to identify TCGA-BRCA samples
Manifest file generation for batch download
R-based data loading using TCGAbiolinks package functions
GDCdownload() and GDCprepare() to create a SummarizedExperiment (SE) object

Data Structure

The prepared dataset contained:

Dimensions: 60,660 rows (genes) × 885 columns (samples)
Data format: SummarizedExperiment object with raw RNA-Seq counts
Metadata: Sample annotations including tissue type and PAM50 subtype classification

Exploratory Data Analysis (EDA)

1. Data Filtering and Normalization

Raw RNA-Seq counts were processed to prepare for analysis:

Filtering: Removed unexpressed genes using filterByExpr() (60,660 → 18,303 genes)
Normalization: Variance Stabilizing Transformation (VST) applied to normalize data and stabilize variance across expression levels
Output: VST-normalized counts matrix ready for downstream analysis

2. Principal Component Analysis (PCA)

PCA was performed to visualize the relationships between samples and identify major sources of variation:

PCA plot of VST-normalized counts showing tissue type separation — FIGURE 1: PCA plot of VST-normalized counts. Clear separation is evident between tissue types (primary tumour, metastatic, and normal tissue), with no major outliers visible.

3. PAM50 Subtype Analysis

The PAM50 classification system represents the molecular subtypes of breast cancer. PCA analysis revealed that PAM50 subtype is a major source of variation in the data:

PCA plot colored by PAM50 breast cancer subtype — FIGURE 2: PCA plot colored by PAM50 subtype. Different molecular subtypes (basal, her2, lumA, lumB, normal) show distinct clustering patterns, indicating that molecular subtype is a major driver of variation in gene expression.

4. PC1 Score by PAM50 Subtype

Box plots revealed significant differences in PC1 scores across different breast cancer subtypes, confirming the strong association between molecular subtype and overall expression profile:

Box plot of PC1 scores by PAM50 subtype — FIGURE 3: Distribution of PC1 scores across PAM50 subtypes. Significant differences in principal component loadings across subtypes confirm that molecular subtype is associated with the major axis of variation.

5. Sample-to-Sample Distance Heatmap

A hierarchical clustering heatmap visualized the relationships between all 885 samples based on VST-normalized gene expression:

Machine Learning and Deep Learning Models

Data Preprocessing for Python ML Pipeline

Data prepared in R was transferred to Python for machine learning:

Matrix transposition: Converted to sample-by-gene format (885 samples × 18,303 genes)
Metadata integration: PAM50 subtype labels merged with expression data by sample ID
Normalization: Z-score normalization applied to ensure equal feature contribution
Encoding: PAM50 subtype labels (basal, her2, lumA, lumB, normal) converted to numerical encoding
Train-test split: Data divided into training and test sets

1. K-Nearest Neighbours (KNN) Classification

Algorithm: K-Nearest Neighbours is a simple, instance-based classification algorithm.

Application:

Trained on z-scaled gene expression data
Target: PAM50 subtype classification (5 classes)
Hyperparameter tuning: K values tested from 1 to 31

Results:

          Optimal K value: K = 2
Peak accuracy: 76.23%
Performance range: 71% - 77% accuracy across tested K values
Insight: KNN demonstrates reasonable performance, with the optimal K suggesting that nearest neighbors are informative for subtype classification

        

KNN classification accuracy across K values — FIGURE 5: KNN classification accuracy as a function of K value. Peak performance of 76.23% was achieved at K=2, with performance degrading gradually as K increases.

2. Deep Learning Model (Multi-Layer Perceptron)

Architecture: A neural network designed to capture complex, non-linear relationships in high-dimensional gene expression data:

Input layer: 18,303 gene expression features
Hidden layer 1: 256 neurons with ReLU activation
Hidden layer 2: 128 neurons with ReLU activation
Output layer: 5 neurons (one per PAM50 subtype) with softmax activation
Training: 50 epochs with GPU acceleration
Loss function: Categorical cross-entropy
Optimizer: Adam optimizer with appropriate learning rate tuning

Results:

          Peak accuracy: 86.84%
Performance improvement: +10.61% compared to KNN (76.23%)
Key finding: Deep learning significantly outperforms classical ML for this classification task
Interpretation: The improved performance suggests that complex, non-linear relationships in gene expression data are better captured by neural networks than by distance-based methods

        

Key Findings and Conclusions

This comprehensive analysis of the TCGA-BRCA dataset demonstrates several important insights:

1. Data Quality and Structure

The TCGA-BRCA dataset is well-characterized with clear separation between tissue types and strong molecular subtype signals, as evidenced by PCA and clustering analyses.

2. Molecular Subtype as a Major Driver

PAM50 breast cancer subtype is the dominant source of variation in gene expression, explaining a substantial portion of the variance captured by the first principal component.

3. Deep Learning Outperforms Classical ML

The deep learning model (86.84% accuracy) substantially outperformed K-Nearest Neighbours (76.23%), suggesting that non-linear relationships in high-dimensional genomic data benefit from the representational capacity of neural networks.

          Clinical Significance: Accurate PAM50 subtype classification is important for breast cancer treatment planning, as different molecular subtypes respond differently to various therapeutic approaches. These models demonstrate the potential for automated classification from RNA-Seq data.
        

Technical Methods and Tools

1. Genomic Data Access and Processing (R)

Packages:

TCGAbiolinks - Access GDC data portal and download genomic datasets
DESeq2 - Convert raw counts to DESeqDataSet and apply filtering
edgeR - filterByExpr() for identifying unexpressed genes
ggplot2 - Publication-quality visualizations (PCA plots, box plots)
pheatmap - Hierarchical clustering heatmaps

2. Machine Learning Pipeline (Python)

Core Libraries:

Pandas & NumPy - Data manipulation and numerical operations
Scikit-learn - KNN implementation and model evaluation metrics
PyTorch - Deep learning model development and training
Scikit-learn.preprocessing.LabelEncoder - Encoding categorical variables

3. Reproducibility

The complete analysis pipeline, including all code for data access, processing, visualization, and machine learning modeling, is available on GitHub for full reproducibility and transparency.

Resources and References

For those interested in exploring this analysis further:

GitHub Repository - Complete code and analysis pipeline
TCGA Project Information
GDC Data Portal - Access the TCGA-BRCA dataset