Jiaqi Wu

Prepping TCGA data for machine learning projects

05 Jan 2019

Last semester, I wanted to explore the role of epigenetic markers in differentiating types of cancers for a course project. I turned to The Cancer Genome Atlas’s (TCGA) data portal, which contains 7 types of genomic data for 33 types of cancer.

Processing and organizing the data from TCGA to a form that I could use for Tensorflow turned out to be the most difficult part of this project, so here are the steps I followed, and resources I found helpful along the way.

1: MicroRNA expression profiles

I started by reading discussions on Biostars. I was surprised by the lack of detailed documentation for handling TCGA data for beginners.

1.1: Downloading raw data with GDC

I downloaded 17 datasets from TCGA, but will only focus on three in this post. TCGA has two tiers of data: free and controlled. The controlled set requires special access (you need to apply for it). So instead, I chose only the TXT files (consisting of transcriptome profiles). This sped up my processing a bit, since the controlled tier contains BAM sequencing reads of ~60-200MB.

Once I had all my TXT files for one tissue origin site in the cart, I downloaded the GDC Manifest. Since I was downloading to a cluster, I used the GDC-Client:

./gdc-client download -m $MANIFEST

Note: The expression profiles only give you the miRNA_ID and sequencing read information; if you want more information about the sample (i.e., metastatic, primary, or normal tissue), you have to download the Sample Sheet from TCGA as well.

1.2: Preparing the data using R

Each sample downloaded creates a new folder in your download directory, and inside the folder are two files: *.mirbase21.mirnas.quantification.txt and *.mirbase21.isoforms.quantification.txt. I will be working with miRNA isoform expression profiles.

More on miRNA isoforms

Each row in the isoforms file contains the calculated expression for each individual miRNA isoform observed, per sample. A common way to process this data is by taking the max or sum of all counts associated with a specific miRNA. If there are multiple isoforms for a given miRNA in each sample, then the expression is summed.

Now, I move all the files out of their folders and into one directory. We can use rptashkin’s TCGA_miRNAseq_matrix R script to obtain a complete matrix of summed miRNA isoforms from the individual samples. The rows represent miRNAs, and the columns represent samples.

I add a row called “label” and manually label it with the corresponding cancer type (i.e., BRCA, ESCA, SKCM). Then, I repeat this process with all the cancer types (I had 17). Once I have 17 labeled matrices, I merge them in R, and transpose. Now, the final matrix has samples from all 17 cancer types, where the columns represent miRNAs, and the rows are samples. I randomize, split the label column, and import into Python for machine learning analysis.

Importing R dataframe to Python

There are several ways to do this. I wrote my R matrix to disc (in a .txt file), and then imported it into Python using Pandas.

After splitting my data into a train and test set, and isolating the labels columns, I converted the data frame to numpy array to use for further analysis.

2: DNA methylation expression profiles

DNA methylation was a little more complicated than miRNAs, and I could no longer use the same script to turn individual samples into one big matrix. So, I had to do some more digging.

2.1: Downloading TCGA data using TCGA2STAT

I found an R package TCGA2STAT that enables users to directly download TCGA data into an analysis-ready format, which is exactly what I was looking for.

Because I only needed it for methylation, I extracted the code relevant to DNA methylation. The altered version is available in my repository.

TCGA2STAT allows you to specify the type of data you wish to download; I used Illumina 27K methylation profiles.

# 27K is selected by default
methyl <- getTCGA(disease="LUAD", data.type="Methylation")

# 450K example—450K isn't available for every dataset. Check in TCGA before downloading 
# methyl <- getTCGA(disease="LUAD", data.type="Methylation", type="450K")

# Table of samples vs. CpG site beta-values is found in $dat 
data = methyl$dat

# Script to see distribution of primary/metastatic/normal tissue. 

samples <- colnames(data)
parsed <- strsplit(samples,"-")
parsed <- lapply(parsed, `[[`, 4)
parsed <- unlist(parsed, recursive=FALSE)

# Replace sample names, transform, and add label.  
colnames(data) <- gsub(pattern='-', replacement='_', x=colnames(data))
t_data <- t(data)
t_data = as.data.frame(t_data)
t_data$type = c(rep(CODE,nrow(t_data)))

# Add row names to make sure table prints out correctly in .txt format 
t_data <- cbind(Row.names = rownames(t_data), t_data)

write.table(t_data,file=paste0("LUAD-methyl-labeled",".txt"), append = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)

The $dat attribute of the returned data structure gives us a matrix, where, again, the rows represent a CpG site and columns represent a sample. I transform the matrix, add a label for the origin site/disease type (i.e., LUAD), and store as a .txt file. This is repeated for each origin site I wanted to look at, and then all the .txt files are merged, resulting in a matrix with their labeled origin sites.

2.2: Feature selection in R

Since DNA methylation yields a much larger dataset, I did some feature selection before transforming the data matrix. I removed all features (CpG sites) with more than 30% “NA” values. If you want to go even further with the feature selection at this stage, you can consider filtering the matrix based on differential expression using variance cut-offs or limma. You can filter for differential expression between different origin sites, between tumor vs. normal tissue samples, etc.

3: Conclusion

Now, we have two datasets ready to be analyzed in Python. Examples of how to build machine learning models based on the processed data are available on GitHub.

The main effort was processing TCGA data, a step that I could only find very scattered documentation for. Hope this post helps other beginners trying to work with microRNA and DNA methylation data! :)

Here’s a list of resources I referenced for data processing:

Tweet me @https://twitter.com/thejiaqiwu if you like this post.