Reproducibility of Microarray and Gene Expression Analysis

The “Reproducibility of Microarray and Gene Expression Analysis” course focuses on reproducing a figure from the paper “Repeated observation of breast tumor subtypes in independent gene expression data sets” by Sørlie et al. Supporting Figure 6 shows a heat map of the gene expression patterns across 122 breast tissue samples, together with two dendrograms representing the gene and array clustering. Reproducibility of published results is a central concern in biomedical informatics and data science. Information visualizations have become a key element of scientific presentations, so it is important to be able to recreate them by reusing the original data and following the same steps.

The educational material supports the primary objective of the course and is organized into four modules comprising six main lessons. It provides full guidance to any interested scientist with minimal to moderate knowledge of the topic, as illustrated in the User Stories. The course starts with an initial Setup phase for data retrieval and software installation and continues with four 1-hour modules, each of which includes one or more lessons focusing on specific subtopics.

Module 1 includes three lessons that provide an introduction and guidelines for reproducing Supporting Figure 6 using the tool described in the paper. Lessons 1 and 2 discuss the key concepts in microarray analysis and hierarchical clustering that are necessary for understanding the heat map and dendrogram visualization in Supporting Figure 6. Lesson 3 steps through the reconstruction of the figure, using the most recent version of the TreeView tool described in the paper’s methodology to visualize the clustered data distributed by the authors.
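To make the idea behind the dendrograms concrete, the following minimal sketch clusters four made-up expression profiles. It is written in plain Python and is not part of the course material, which relies on TreeView, Cluster 3.0, and R. The sample names, the values, and the choice of a Pearson correlation distance with average linkage are assumptions made for illustration; they are not taken from the paper or the lessons. Each merge is reported with its height, which corresponds to the branch level at which two groups join in a dendrogram.

```python
from itertools import combinations
from math import sqrt


def pearson_distance(a, b):
    """Return 1 - Pearson correlation of two equal-length profiles.

    Assumes neither profile is constant (non-zero variance).
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / sqrt(var_a * var_b)


def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    Returns a list of merges (members_left, members_right, height),
    in the order in which the clusters were joined.
    """
    names = list(profiles)
    # Distances between every pair of individual samples.
    dist = {
        frozenset(pair): pearson_distance(profiles[pair[0]], profiles[pair[1]])
        for pair in combinations(names, 2)
    }

    def cluster_distance(c1, c2):
        # Average of all pairwise distances between the two clusters.
        total = sum(dist[frozenset((i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Join the two closest clusters and record the merge height.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical log-ratio profiles (four genes per sample), illustration only.
samples = {
    "tumor_A": [2.0, 1.5, -1.0, -2.0],
    "tumor_B": [1.8, 1.6, -0.8, -2.1],
    "tumor_C": [-1.5, -2.0, 1.2, 2.2],
    "tumor_D": [-1.2, -1.8, 1.0, 2.5],
}

merges = average_linkage(samples)
for left, right, height in merges:
    print(f"{' + '.join(left)}  |  {' + '.join(right)}  at height {height:.3f}")
```

Samples with similar profiles join at low heights, while the final merge between the two dissimilar groups happens at a much greater height. The same reading applies to the branch levels of the dendrograms in the figure.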

Module 2 with Lesson 4 describes the use of hive plots for visualization of the clustered data. Hive plots present an interesting alternative to the heat map-dendrogram combination of the original figure and can be created in R using the original data and code provided in the lesson.

Module 3 with Lesson 5 describes how to reformat the semi-processed data for the Cluster 3.0 tool (an earlier version of which was used for the analysis presented in the paper), how to run the clustering itself, and how to visualize the results in TreeView 3.0. The new figure and the clustering results are then compared with the original findings, and any differences or similarities are discussed.
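The reformatting step can be pictured with a short sketch. It assumes the tab-delimited input layout described in the Cluster 3.0 manual: an identifier column, an optional NAME column, one column per array, and empty cells for missing values. The function name `to_cluster_table`, the `UNIQID` label, and all gene and array names are hypothetical; the layout of the actual semi-processed data and the exact steps are given in the lesson itself.

```python
import csv
import io


def to_cluster_table(gene_ids, gene_names, array_names, values):
    """Render an expression matrix as tab-delimited text.

    values[i][j] is the log ratio of gene i on array j;
    None marks a missing measurement and becomes an empty cell.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Header row: identifier label, NAME column, then one column per array.
    writer.writerow(["UNIQID", "NAME"] + list(array_names))
    for gene_id, name, row in zip(gene_ids, gene_names, values):
        cells = ["" if v is None else f"{v:.3f}" for v in row]
        writer.writerow([gene_id, name] + cells)
    return buffer.getvalue()


# Hypothetical genes and arrays, illustration only.
table = to_cluster_table(
    gene_ids=["GENE0001", "GENE0002"],
    gene_names=["hypothetical gene one", "hypothetical gene two"],
    array_names=["array_1", "array_2", "array_3"],
    values=[[0.52, -1.1, None], [-0.3, 0.0, 2.25]],
)
print(table)
```

Writing the table to a string first makes it easy to inspect before saving it to a file for loading into the clustering tool.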

Module 4 with Lesson 6 follows the same structure; however, the clustering and visualization are conducted in R. The complete details at each step and the corresponding R code are provided in this lesson.

The Software Carpentry templates have been used to structure lessons into coherent blocks with estimates for the time required.

Prerequisites

Extensive background knowledge of gene expression or data analysis is not required, though participants should be familiar with basic scientific principles regarding data quality, research procedures, and the publication process. Prior experience with the R language is helpful for following certain lessons, but fully working code is provided throughout, so anyone can reach the results.

Schedule

Setup: Download files used in the lesson.
00:00 Introduction to Gene Expression
      What is a microarray?
      What biological concepts underlie microarray analysis?
      How are the results of microarray analysis visualized?
00:20 Introduction to Hierarchical Clustering
      What is the basic principle behind hierarchical clustering?
      How can we interpret the branch levels?
      How are the cancer subtypes defined and illustrated in Supporting Figure 6?
00:40 Recreating Supporting Figure 6 with TreeView
      Can Supporting Figure 6 be reproduced with the same visualization tool?
01:00 Recreating Supporting Figure 6 with Hive Plots
      What is a hive plot?
      How can a hive plot be created using R?
      How can gene expression data be represented with a hive plot?
02:00 Recreating the Analysis with Cluster and TreeView
      What tools were used for analysis in the original paper?
      Can certain analysis steps be reproduced?
03:00 Recreating the Analysis with R
      How can R be used for hierarchical clustering and visualization of gene expression data?
      Can Supporting Figure 6 be recreated in R?
04:00 Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.