Quality Control Plots
Disclaimer: We have yet to add plots to this article. Please refer to the publication for examples. Sorry!
This section provides explanations on how to interpret the quality control plots produced by the various commands, and additional details on how these are created.
An MA plot is an application of the Bland-Altman plot to genomic data, where each gene is plotted as a point at M and A coordinates. The M coordinate for each gene is the log-ratio of the expression of that gene between two samples (see footnote 1). The A coordinate is the average expression of that gene in the two samples.
If one or both of the samples is instead a group of samples, the M coordinate is calculated as the log-ratio of the groupwise medians of the expression of each gene in the sample groups. The A coordinate is the average expression across all the samples.
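The two definitions above can be sketched in a few lines of Python. This is only an illustration, not bioTEA's implementation: the function names are hypothetical, the inputs are assumed to be strictly positive linear-scale expression values, and A is assumed to be averaged on the log2 scale.

```python
import math
from statistics import mean, median


def ma_two_samples(test, control):
    """Per-gene (M, A) for two samples given as lists of linear expression."""
    # M follows footnote 1: log2(test) - log2(control).
    m = [math.log2(t) - math.log2(c) for t, c in zip(test, control)]
    # A is the average of the two log2 expression values (assumption).
    a = [(math.log2(t) + math.log2(c)) / 2 for t, c in zip(test, control)]
    return m, a


def ma_two_groups(test_group, control_group):
    """Per-gene (M, A) for two groups; each group is a list of samples."""
    n_genes = len(test_group[0])
    m, a = [], []
    for g in range(n_genes):
        test_vals = [sample[g] for sample in test_group]
        ctrl_vals = [sample[g] for sample in control_group]
        # M: log-ratio of the groupwise medians.
        m.append(math.log2(median(test_vals)) - math.log2(median(ctrl_vals)))
        # A: average log2 expression across all samples of both groups.
        a.append(mean(math.log2(v) for v in test_vals + ctrl_vals))
    return m, a
```

For example, `ma_two_samples([8, 4], [2, 4])` gives M values of 2 and 0, with both genes at A = 2.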
MA plots generated by bioTEA provide density shading, showing regions with gradually increasing point density from blue to red.
An example MA plot can be seen in Figure (Coming soon (tm)), from a spike-in experiment. It is generally expected to see a large number of genes around low A values, with low M spread. These are generally low-expressed genes, whose M values are non-zero due to biological variability. Genes that have high A values but low M values (on the center-right of the plot) are generally housekeeping genes. Genes that have high absolute M values are putative DEGs.
The overall trend of the plot is shown with a GAM regression line, in red.
We expect MA plots for normalized data to be roughly linear and centered (that is, with the GAM line flat and centered on zero). Both biotea prepare and biotea analyze make one MA plot for each sample, plotting it versus the median expression of all other samples. When these commands produce MA plots, the plots are ordered and numbered from the most distorted to the least, allowing the user to detect possibly distorted samples at a glance.
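As a rough illustration of the "one sample versus the median of all others" comparison, consider the sketch below. It assumes the input is a list of samples holding log2-scale expression values. The text does not say how distortion is measured, so the mean absolute M used for ranking here is purely a hypothetical stand-in, as are the function names.

```python
from statistics import mean, median


def ma_versus_others(log_expr, index):
    """(M, A) of one sample against the per-gene median of all other samples."""
    sample = log_expr[index]
    others = [s for i, s in enumerate(log_expr) if i != index]
    # Per-gene median across the remaining samples acts as the reference.
    reference = [median(values) for values in zip(*others)]
    m = [x - r for x, r in zip(sample, reference)]
    a = [(x + r) / 2 for x, r in zip(sample, reference)]
    return m, a


def rank_by_distortion(log_expr):
    """Sample indices ordered from most to least distorted.

    Hypothetical score: mean absolute M. The real ordering criterion
    is not stated in the text.
    """
    scores = []
    for i in range(len(log_expr)):
        m, _ = ma_versus_others(log_expr, i)
        scores.append((mean(abs(v) for v in m), i))
    return [i for _, i in sorted(scores, reverse=True)]
```

A sample whose values are shifted relative to the rest ends up first in the ranking, which mirrors the idea of numbering plots from most to least distorted.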
If a sample remains distorted after normalization, a possible course of action is to remove it from the analysis. Otherwise, one may enable the corresponding option to perform quantile-quantile normalization on the data, forcing distorted samples back in line with the others. Note however that doing this might also distort the results of the analysis.
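To show why this kind of normalization forces samples to look alike, here is a minimal sketch of standard quantile normalization. It is an assumption that this matches what the tool applies. Ties are not handled specially, and the function name is hypothetical.

```python
def quantile_normalize(samples):
    """Give every sample the same distribution: the mean of each rank.

    samples: list of samples, each a list of per-gene expression values.
    """
    n_genes = len(samples[0])
    # For each sample, gene indices ordered from lowest to highest value.
    orders = [sorted(range(n_genes), key=sample.__getitem__) for sample in samples]
    # Mean across samples of the value found at each rank.
    rank_means = [
        sum(sample[order[r]] for sample, order in zip(samples, orders)) / len(samples)
        for r in range(n_genes)
    ]
    normalized = []
    for order in orders:
        out = [0.0] * n_genes
        # The gene holding rank r in this sample receives the shared rank mean.
        for r, g in enumerate(order):
            out[g] = rank_means[r]
        normalized.append(out)
    return normalized
```

After this step every sample contains exactly the same set of values, only assigned to different genes. This is why a distorted sample stops looking distorted, and also why real differences can be flattened along with the artifacts.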
The command also produces MA plots, with the purpose of highlighting differences between the groups of interest during DEA, as well as showing detected DEGs in a plot. An example of such an annotated plot is shown in Figure (COMING SOON (tm)).
Each boxplot in an expression boxplot plot represents a different sample, with each point representing the expression value of one gene. An example of such a plot is shown in Figure (COMING SOON (tm)). We expect the general shape of all boxplots to be roughly similar, especially after normalization. If one or more samples have different shapes than the others, it is possible to employ the same procedures proposed in the MA plot section: use the additional quantile-quantile normalization or eliminate the non-homogeneous sample(s).
Hierarchical clustering dendrograms (hereafter referred to as "dendrograms"), PCA plots and Scree plots allow the visual detection of clustering in the data.
Dendrograms show the hierarchical distances between the samples. Samples that are connected more closely than others are more similar to each other. An example of such a plot can be seen in Figure (COMING SOON (tm)).
PCA plots show the result of the PCA of the data. The main PCA plot shows principal component 1 vs principal component 2 of the samples. An example of such a plot can be seen in Figure (COMING SOON (tm)). An additional collection of PCA plots, named "PCA pairs", is also produced, showing plots generated from combinations of the other principal components.
Closely associated with the PCA plots is the Scree plot. The Scree plot shows the variability captured by each principal component, as well as the cumulative sum of the captured variability. An example Scree plot can be seen in Figure (COMING SOON (tm)). It is useful to consider the Scree plot together with the PCA plots. If the plotted principal components capture only a small share of the variability of the data, the PCA plots are less informative.
We expect the samples to be homogeneously distant from each other, with any obvious clustering a sign of a possible batch effect. The main exception to this rule of thumb is the clustering of different types of samples, for instance between the conditions of interest. While such clustering (especially if very dramatic) could be due to batch effects that are not of interest, it could also be a sign of important effects of the actual conditions. Therefore, light clustering of samples with the same conditions of interest can be normal.
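The two quantities shown in a Scree plot can be computed as in the sketch below. It assumes the variance of each principal component is already available from a PCA; the function name is hypothetical.

```python
from itertools import accumulate


def scree_values(component_variances):
    """Fraction of variability per component and its cumulative sum."""
    total = sum(component_variances)
    explained = [v / total for v in component_variances]
    cumulative = list(accumulate(explained))
    return explained, cumulative
```

With component variances of 6, 3 and 1, the first two components capture about 90% of the variability, so a PC1 vs PC2 plot would be quite informative. If the cumulative value after two components were much lower, the same plot should be read with more caution.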
Poisson plots, or SD vs Mean plots, show the relationship between the standard deviation and the average of all genes in a certain group or groups. An example of such a plot can be seen in Figure (COMING SOON (tm)).
The data is expected to be roughly Poissonian, with the standard deviation positively correlated with the mean. Any distortion or artifacts in these plots could be symptoms of deeper problems in the array hybridization, exposure, or both. The specific problem should be identifiable from the other quality-control plots.
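The points of such a plot can be obtained with a sketch like the following. It assumes the sample standard deviation is used and that the input is a list of samples from one group; the function name is hypothetical.

```python
from statistics import mean, stdev


def sd_vs_mean(samples):
    """One (mean, standard deviation) point per gene across the given samples."""
    points = []
    # zip(*samples) iterates gene by gene across all samples.
    for values in zip(*samples):
        points.append((mean(values), stdev(values)))
    return points
```

Plotting the second element of each pair against the first gives the SD vs Mean view described above.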
Volcano plots are useful to show the result of a DEA. The Y axis shows the p-values associated with each gene in a certain contrast (such as group "B" vs group "A"), on a negative log10 scale, so that a wide range of p-values can be plotted and smaller p-values appear higher up. The X axis shows the respective log2 fold-changes of each gene.
Threshold lines are drawn on the plot: a horizontal line representing the p-value threshold for statistical significance, and two vertical lines showing the log2 fold-change thresholds for a gene to be considered a DEG. Therefore, all genes that fall in the upper-left and upper-right quadrants are downregulated and upregulated DEGs, respectively.
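The quadrant rule can be sketched as follows. The default thresholds (0.05 and 1.0) and the function names are hypothetical, chosen only for illustration, and the Y coordinate assumes the conventional negative log10 transform of the p-value.

```python
import math


def volcano_point(p_value, log2_fc):
    """(x, y) coordinates of a gene on the volcano plot."""
    return log2_fc, -math.log10(p_value)


def classify_gene(p_value, log2_fc, p_threshold=0.05, fc_threshold=1.0):
    """Label a gene by the quadrant it falls in (thresholds are illustrative)."""
    significant = p_value < p_threshold  # above the horizontal line
    if significant and log2_fc >= fc_threshold:
        return "up"  # upper-right quadrant
    if significant and log2_fc <= -fc_threshold:
        return "down"  # upper-left quadrant
    return "not DEG"
```

A gene needs to pass both the horizontal and one of the vertical thresholds to be called a DEG. A highly significant gene with a small fold change stays in the upper-middle region and is not highlighted.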
Note that the Y axis shows raw p-values, which are not corrected for multiple hypothesis testing. It is the p-value threshold that is adjusted instead, so the highlighted genes are the same ones that would be shown by plotting FDR or PFP values on the Y axis and using a threshold of 0.05. The only difference is graphical: raw p-values show the genes as more spread out, while corrected values are squashed together.
Volcano plots created by bioTEA can be labelled with HUGO symbols, if an annotation database is specified when calling the command. If unspecified, the probe ids are used instead (as seen in Figure (COMING SOON (tm))).
1: When reading "test vs control", it is assumed that the M value is calculated as log2(test) − log2(control).
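The equivalence between thresholding corrected values and moving the threshold on raw values can be checked with a small sketch. It assumes the Benjamini-Hochberg procedure as the FDR correction, which the text does not confirm; the function names are hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def bh_raw_threshold(p_values, alpha=0.05):
    """Largest raw p-value still significant at FDR alpha (0.0 if none)."""
    m = len(p_values)
    threshold = 0.0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank * alpha / m:
            threshold = p
    return threshold
```

For the raw p-values `[0.041, 0.001, 0.216, 0.008, 0.06]`, selecting genes with an adjusted value of at most 0.05 and selecting genes with a raw value of at most `bh_raw_threshold(...)` (here 0.008) pick the same two genes. Only the position of the points along the Y axis differs between the two representations.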