Colin Shepherd

Case Study 2: TCGA BRCA Data Analysis — Machine Learning and Deep Learning for Breast Cancer Subtype Classification

This case study demonstrates my expertise in genomic data analysis, machine learning, and deep learning. The project involved accessing public genomic datasets from the GDC data portal, performing comprehensive exploratory data analysis, and applying both classical machine learning and deep learning approaches to classify breast cancer subtypes.

Project Background

The Cancer Genome Atlas (TCGA) is a comprehensive, publicly available archive of genomic data from over 20,000 primary cancer and matched normal samples. This project focused on the TCGA-BRCA dataset, which contains RNA-Seq expression data from 885 breast cancer and normal tissue samples.

The primary objectives were to:

  • Access and prepare large-scale genomic data from the GDC data portal
  • Perform exploratory data analysis to understand the underlying structure
  • Assess whether PAM50 molecular subtype is a major source of variation
  • Apply machine learning and deep learning models to classify samples into breast cancer subtypes
  • Compare classical ML (KNN) with deep learning approaches

Dataset Overview

The TCGA-BRCA dataset provides comprehensive RNA-Seq expression data:

  • Total samples: 885 (primary tumours, metastatic tumours, and normal tissue)
  • Gene features: 60,660 genes (raw counts)
  • Post-filtering: 18,303 genes (after removing unexpressed and very lowly expressed genes)
  • Molecular classification: PAM50 breast cancer subtypes (basal, her2, lumA, lumB, normal)

Data Access and Preparation

GDC Data Portal Integration

The dataset was accessed through the GDC (Genomic Data Commons) data portal using:

  • GDC portal filters to identify TCGA-BRCA samples
  • Manifest file generation for batch download
  • R-based data loading using TCGAbiolinks package functions
  • GDCdownload() and GDCprepare() to create a SummarizedExperiment (SE) object

Data Structure

The prepared dataset contained:

  • Dimensions: 60,660 rows (genes) × 885 columns (samples)
  • Data format: SummarizedExperiment object with raw RNA-Seq counts
  • Metadata: Sample annotations including tissue type and PAM50 subtype classification

Exploratory Data Analysis (EDA)

1. Data Filtering and Normalization

Raw RNA-Seq counts were processed to prepare for analysis:

  • Filtering: Removed unexpressed and very lowly expressed genes using filterByExpr() (60,660 → 18,303 genes)
  • Normalization: Variance Stabilizing Transformation (VST) applied to normalize data and stabilize variance across expression levels
  • Output: VST-normalized counts matrix ready for downstream analysis
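
The filtering idea can be illustrated in Python. This is a minimal sketch of a counts-per-million (CPM) filter, which is the general principle behind expression filtering; it is not a port of edgeR's filterByExpr(), whose thresholds depend on library sizes and group sizes. The function name filter_low_expression, the thresholds, and the toy matrix are all hypothetical.

```python
import numpy as np

def filter_low_expression(counts, min_cpm=1.0, min_samples=2):
    """Keep genes whose CPM exceeds min_cpm in at least min_samples samples.

    counts: genes x samples array of raw read counts.
    Returns (filtered_counts, keep_mask).
    Thresholds are illustrative assumptions, not edgeR defaults.
    """
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)          # total reads per sample
    cpm = counts / lib_sizes * 1e6          # counts per million
    keep = (cpm > min_cpm).sum(axis=1) >= min_samples
    return counts[keep], keep

# Toy matrix: 4 genes x 3 samples
toy_counts = np.array([
    [500, 600, 550],   # expressed in every sample
    [0,   0,   1],     # detected in a single sample only
    [300, 250, 400],   # expressed in every sample
    [0,   0,   0],     # never detected
])

filtered, keep = filter_low_expression(toy_counts)
print(keep)            # boolean mask over genes
print(filtered.shape)  # (genes kept, samples)
```

Filtering on CPM rather than raw counts keeps the cutoff comparable across samples with different sequencing depth.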

2. Principal Component Analysis (PCA)

PCA was performed to visualize the relationships between samples and identify major sources of variation:

FIGURE 1: PCA plot of VST-normalized counts. Clear separation is evident between tissue types (primary tumour, metastatic, and normal tissue), with no major outliers visible.
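
For readers who want to see the mechanics, PCA on a samples × genes matrix can be sketched with a singular value decomposition. The snippet below uses synthetic data rather than the TCGA matrix, and pca_scores is a hypothetical helper name; the original analysis was done in R.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """PCA via SVD for a samples x genes matrix.

    Returns the per-sample scores and the fraction of variance
    explained by each retained component.
    """
    Xc = X - X.mean(axis=0)                          # centre each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # one row per sample
    explained = S**2 / np.sum(S**2)                  # rows of Vt are the gene loadings
    return scores, explained[:n_components]

# Synthetic stand-in: two groups of 20 samples, 50 genes,
# with the second group shifted in the first 10 genes.
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 50))
X[group == 1, :10] += 3.0

scores, explained = pca_scores(X)
pc1_gap = scores[group == 1, 0].mean() - scores[group == 0, 0].mean()
print(scores.shape, explained)
```

Scores describe samples (what the plot shows), whereas loadings describe how much each gene contributes to a component.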

3. PAM50 Subtype Analysis

PAM50 is a 50-gene expression signature that assigns breast tumours to intrinsic molecular subtypes. PCA revealed that PAM50 subtype is a major source of variation in the data:

FIGURE 2: PCA plot colored by PAM50 subtype. Different molecular subtypes (basal, her2, lumA, lumB, normal) show distinct clustering patterns, indicating that molecular subtype is a major driver of variation in gene expression.

4. PC1 Score by PAM50 Subtype

Box plots showed clear differences in PC1 scores across breast cancer subtypes, consistent with a strong association between molecular subtype and overall expression profile:

FIGURE 3: Distribution of PC1 scores across PAM50 subtypes. Clear differences in PC1 scores between subtypes indicate that molecular subtype is associated with the major axis of variation.

5. Sample-to-Sample Distance Heatmap

A hierarchical clustering heatmap visualized the relationships between all 885 samples based on VST-normalized gene expression:

FIGURE 4: Hierarchical clustering heatmap of pairwise sample distances. The heatmap reveals clustering structure and sample relationships based on gene expression similarity across all 885 samples.
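
The computation behind such a heatmap can be sketched in Python with SciPy: pairwise distances between samples, followed by hierarchical clustering to order the rows and columns. The data are synthetic, and Euclidean distance with complete linkage is an assumption; the source does not state which settings were used in pheatmap.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import pdist, squareform

# Synthetic stand-in: 12 samples x 30 genes, second half shifted
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 30))
X[6:, :] += 4.0

# Pairwise Euclidean distances between samples (rows)
condensed = pdist(X, metric="euclidean")
dist_matrix = squareform(condensed)            # 12 x 12, symmetric, zero diagonal

# Hierarchical clustering; linkage method is an assumed choice
tree = linkage(condensed, method="complete")
order = leaves_list(tree)                      # dendrogram leaf order

# Reorder rows and columns so similar samples sit next to each other
ordered = dist_matrix[np.ix_(order, order)]
print(order)
```

The reordered matrix is what a heatmap function would colour; blocks of small distances along the diagonal correspond to clusters of similar samples.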

Machine Learning and Deep Learning Models

Data Preprocessing for Python ML Pipeline

Data prepared in R was transferred to Python for machine learning:

  • Matrix transposition: Converted to sample-by-gene format (885 samples × 18,303 genes)
  • Metadata integration: PAM50 subtype labels merged with expression data by sample ID
  • Normalization: Z-score normalization applied to ensure equal feature contribution
  • Encoding: PAM50 subtype labels (basal, her2, lumA, lumB, normal) converted to numerical encoding
  • Train-test split: Data divided into training and test sets
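
The preprocessing steps above can be sketched as follows. This is an illustration on synthetic data, not the project's code: the column names sample_id and pam50, the 80/20 stratified split, and the choice to fit the scaler on the training set only are all assumptions, since the source does not specify them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(42)
subtypes = ["basal", "her2", "lumA", "lumB", "normal"]
sample_ids = [f"S{i:03d}" for i in range(50)]

# Genes x samples matrix, as it might be exported from R (hypothetical layout)
expr = pd.DataFrame(
    rng.normal(size=(20, 50)),
    index=[f"gene{j}" for j in range(20)],
    columns=sample_ids,
)
meta = pd.DataFrame({"sample_id": sample_ids, "pam50": np.tile(subtypes, 10)})

# Transpose to samples x genes and merge labels by sample ID
data = expr.T.join(meta.set_index("sample_id"), how="inner")

# Encode subtype labels as integers 0..4
encoder = LabelEncoder()
y = encoder.fit_transform(data["pam50"])
features = data.drop(columns="pam50").to_numpy()

# Stratified split keeps subtype proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    features, y, test_size=0.2, stratify=y, random_state=0
)

# Z-score using training-set statistics only, then apply to the test set
scaler = StandardScaler().fit(X_train)
X_train_z = scaler.transform(X_train)
X_test_z = scaler.transform(X_test)
```

Fitting the scaler on the training set alone is one way to keep test-set information out of the model; scaling before the split is simpler but lets test statistics influence training.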

1. K-Nearest Neighbours (KNN) Classification

Algorithm: K-Nearest Neighbours is a simple, instance-based classification algorithm that assigns each sample the most common label among its K closest training samples.

Application:

  • Trained on z-scaled gene expression data
  • Target: PAM50 subtype classification (5 classes)
  • Hyperparameter tuning: K values tested from 1 to 31

Results:

  • Optimal K value: K = 2
  • Peak accuracy: 76.23%
  • Performance range: approximately 71% to 76% accuracy across tested K values
  • Insight: KNN performs reasonably well, and the small optimal K suggests that a sample's closest neighbours in expression space are the most informative for subtype classification

FIGURE 5: KNN classification accuracy as a function of K value. Peak performance of 76.23% was achieved at K=2, with performance degrading gradually as K increases.
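
A sweep over K like the one described can be sketched with scikit-learn. The data below come from make_classification as a synthetic stand-in, so the accuracies it prints say nothing about the TCGA results; the split proportions and random seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic 5-class problem standing in for the expression matrix
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=10,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Z-scale features so no single feature dominates the distance
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Fit one classifier per K from 1 to 31 and record held-out accuracy
accuracy_by_k = {}
for k in range(1, 32):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracy_by_k[k] = model.score(X_test, y_test)

best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(best_k, round(accuracy_by_k[best_k], 4))
```

If K is selected on the same split that is used to report accuracy, the reported figure tends to be optimistic; selecting K by cross-validation on the training set is a common alternative.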

2. Deep Learning Model (Multi-Layer Perceptron)

Architecture: A neural network designed to capture complex, non-linear relationships in high-dimensional gene expression data:

  • Input layer: 18,303 gene expression features
  • Hidden layer 1: 256 neurons with ReLU activation
  • Hidden layer 2: 128 neurons with ReLU activation
  • Output layer: 5 neurons (one per PAM50 subtype) with softmax activation
  • Training: 50 epochs with GPU acceleration
  • Loss function: Categorical cross-entropy
  • Optimizer: Adam, with a tuned learning rate
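
The architecture listed above could be written in PyTorch roughly as follows. This is a sketch under assumptions: the class name SubtypeMLP, the learning rate of 1e-3, the batch size, and the synthetic batch are all hypothetical, and only three epochs are run here instead of the 50 used in the project. In PyTorch, nn.CrossEntropyLoss applies log-softmax internally, so the model returns raw logits and softmax is applied only when probabilities are needed.

```python
import torch
from torch import nn

N_GENES, N_CLASSES = 18303, 5

class SubtypeMLP(nn.Module):
    """18,303 inputs -> 256 -> 128 -> 5 logits, with ReLU activations."""

    def __init__(self, n_features=N_GENES, n_classes=N_CLASSES):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_classes),  # raw logits; softmax lives in the loss
        )

    def forward(self, x):
        return self.net(x)

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SubtypeMLP().to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is an assumption

# Synthetic batch standing in for z-scored expression data
x = torch.randn(32, N_GENES, device=device)
y = torch.randint(0, N_CLASSES, (32,), device=device)

losses = []
for epoch in range(3):  # the project trained for 50 epochs
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Class probabilities for reporting or thresholding
with torch.no_grad():
    probs = torch.softmax(model(x), dim=1)

n_params = sum(p.numel() for p in model.parameters())
print(n_params)
```

Almost all of the parameters sit in the first layer (18,303 × 256 weights), which is worth keeping in mind when the training set has fewer than a thousand samples.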

Results:

  • Peak accuracy: 86.84%
  • Performance improvement: +10.61 percentage points compared to KNN (76.23%)
  • Key finding: The multi-layer perceptron clearly outperformed KNN on this classification task
  • Interpretation: The improvement is consistent with neural networks capturing non-linear relationships in gene expression that distance-based methods miss; distances also become less discriminative across 18,303 dimensions, which may put KNN at a disadvantage

Key Findings and Conclusions

This comprehensive analysis of the TCGA-BRCA dataset demonstrates several important insights:

1. Data Quality and Structure

The TCGA-BRCA dataset is well-characterized with clear separation between tissue types and strong molecular subtype signals, as evidenced by PCA and clustering analyses.

2. Molecular Subtype as a Major Driver

PAM50 breast cancer subtype is a major source of variation in gene expression and is strongly associated with the first principal component.

3. Deep Learning Outperforms Classical ML

The deep learning model (86.84% accuracy) substantially outperformed K-Nearest Neighbours (76.23%), suggesting that non-linear relationships in high-dimensional genomic data benefit from the representational capacity of neural networks.

Clinical Significance: Accurate PAM50 subtype classification is important for breast cancer treatment planning, as different molecular subtypes respond differently to various therapeutic approaches. These models demonstrate the potential for automated classification from RNA-Seq data.

Technical Methods and Tools

1. Genomic Data Access and Processing (R)

Packages:

  • TCGAbiolinks - Access GDC data portal and download genomic datasets
  • DESeq2 - Convert raw counts to a DESeqDataSet and apply the variance stabilizing transformation
  • edgeR - filterByExpr() for identifying unexpressed and very lowly expressed genes
  • ggplot2 - Publication-quality visualizations (PCA plots, box plots)
  • pheatmap - Hierarchical clustering heatmaps

2. Machine Learning Pipeline (Python)

Core Libraries:

  • Pandas & NumPy - Data manipulation and numerical operations
  • Scikit-learn - KNN implementation and model evaluation metrics
  • PyTorch - Deep learning model development and training
  • sklearn.preprocessing.LabelEncoder - Encoding categorical variables

3. Reproducibility

The complete analysis pipeline, including all code for data access, processing, visualization, and machine learning modeling, is available on GitHub for full reproducibility and transparency.

Resources and References

For those interested in exploring this analysis further: