Package: sylsnet
Authors: Chunxuan Shao c.shao@dkfz.de
Compiled: 2015-10-14
Here is a step-by-step tutorial of the package synlet. This package processes data from cellBasedArrays. In this vignette, we will show a quick tour of using the synlet with dummy/stimulated data, including quality control, data visualization, and most importantly, hits selection.
First, Let’s have a look at the example data.
library(synlet)
## Loading required package: ggplot2
data("exampleDat")
head(exampleDat)
## PLATE MASTER_PLATE WELL_CONTENT_NAME EXPERIMENT_TYPE
## 1 1031121 P001 AAK1 si3 sample
## 2 1031121 P001 PLK1 si1 control_positive
## 3 1031121 P001 lipid only control_negative
## 4 1031121 P001 scrambled control si1 control_negative
## 5 1031121 P001 lipid only control_negative
## 6 1031121 P001 AARSD1 si3 sample
## EXPERIMENT_MODIFICATION ROW_NAME COL_NAME READOUT
## 1 treatment C 10 455
## 2 treatment C 11 814
## 3 treatment C 12 537
## 4 treatment C 13 568
## 5 treatment C 14 566
## 6 treatment C 15 632
The exampleDat is a data.frame containing 8 columns, all of them are mandatory and please do NOT change the column names. Besides that, users are free to add new columns. However, please not that the new columns may not appear in the final results. Users need to generate exampleDat like data.frame from the screen results.
In the exampleDat:
PLATE, the name of individual plate. Must be numeric value.
MASTER_PLATE, usually there are plate replicates designed in the RNAi screen experiments. Several individual PLATE having the identical layout are group to a MASTER_PLATE.
WELL_CONTENT_NAME, the name of siRNA used in each cell, in the format of “geneName si(number)”, as usually several siRNAs are available to control for the off-target effect.
EXPERIMENT_TYPE, indicates the cells are sample (interested knock-down), or negative control (control_negative), or positive control (control_positive).
EXPERIMENT_MODIFICATION, there are two conditions in synthetic lethal RNAi screen, named treatment and control.
ROW_NAME and COL_NAME, the position of siRNA reagent in the plate.
READOUT, the output of RNAi screen output for each cell in the plate. Here the data are randomized.
The exampleDat contains three MASTER_PLATE, each MASTER_PLATE has three plates for treatment and control.
Z’ factor and Z factor are widely used quality metrics in RNAi experiments.
Z’ factor = 1- 3*(\(\delta_p\) + \(\delta_n\)) / |\(\mu_p\) - \(\mu_n\)|
Z factor = 1- 3*(\(\delta_s\) + \(\delta_n\)) / |\(\mu_s\) - \(\mu_n\)|
In Z’ factor calculation, \(\delta_p\), \(\delta_n\), \(\mu_p\), \(\mu_n\) are the standard deviation and mean of positive control and negative control signal, respectively; while in Z factor formular \(\delta_s\), \(\delta_n\), \(\mu_s\), \(\mu_n\) are standard deviation and mean of samples and negative control signal, respectively. From the definition, we could see that Z’ factor measures the quality of optimization of plates, and Z factor accesses the performance of screen on actual samples. Generally, the plates with value >= 0.5 are considered as excellent assay. See [1] and [2] for more information and discussion.
By default \(\mu\) in the denominator are mean value of signals, zFactor function offers the option to use median instead, which could be more robust in certain conditions.
Here is an example, we specify negative control is “scrambled control si1”, and positive control is “PLK1 si1”. These informations are stored in EXPERIMENT_TYPE column.
res <- zFactor(exampleDat, negativeCon = "scrambled control si1", positiveCon = "PLK1 si1")
## --- number of plates to calculate Z factor: 18---
head(res)
## zFactor zPrimeFactor
## 1031121 -10.63422 -41.36317
## 1031122 -33.61932 -605.06754
## 1031123 -23.36895 -30.10814
## 1031124 -32.31389 -14.53771
## 1031125 -115.06929 -137.87454
## 1031126 -23.08055 -51.94328
As the READOUT are shuffled data, not surprising the Z and Z’ factor are negative value.
Synlet could plot the screen results in the format of heatmap, in which a dot represents a single cell. All plates are organized together in single figure. This provides a general view of RNAi screen quality.
plateHeatmap(exampleDat)
Usually in each plate there are negative and positive control siRNAs, which set the bound of siRNA effect. The scatter plot provide a direct way to examine the bound of controls. It is possible to specify multiple positive / negative controls in the function scatterPlot, and the output of all plates are arranged in a single plot.
scatterPlot(exampleDat, controlOnly = FALSE, colour = rainbow(10), "PLK1 si1", "scrambled control si1", "lipid only")
## Warning: Removed 2 rows containing missing values (geom_point).
It is often intuitive to look at the knock-down effect of a single gene in RNAi screen experiments. The function siRNAPlot plot the normalized and raw signals, the positive and negative control signals, and Z’ factor of plates into a single graph. siRNAPlot provides the option to specify control siRNAs and normalization methods. The following codes show the knock-down effect of “AAK1”.
We need to calculate the Z’ factor based on mean and median of signals. By default, a pdf file named “AAK1.pdf” will be generated in the working directory, the return value contains all subplot and could be plotted separately.
zF_mean <- zFactor(exampleDat, negativeCon = "scrambled control si1", positiveCon = "PLK1 si1")
## --- number of plates to calculate Z factor: 18---
zF_med <- zFactor(exampleDat, negativeCon = "scrambled control si1", positiveCon = "PLK1 si1", useMean = FALSE)
## --- number of plates to calculate Z factor: 18---
tem.plot <- siRNAPlot("AAK1", exampleDat, controlsiRNA = c("lipid only", "scrambled control si1"), FILEPATH = ".", zPrimeMed = zF_med,
zPrimeMean = zF_mean, treatment = "treatment", control = "control", normMethod = c("PLATE", "lipid only", "scrambled control si1"))
## +++ ProcessingAAK1+++
##
## Using siRNA, norMethod as id variables
The main goal of synthetic lethal RNAi screen experiments is to identify interesting genes led to reliable difference in mortality between treatment and control plates, which is a difficult task because of cell heterogeneity, reagent efficiency and intrinsic character of genes. Synlet tries to improve the results of hits selection by employing several algorithms that explore data from different directions, including student’s t-test, median ± k median absolute deviation, rank products and redundant siRNA activity (RSA).
Student’s t-test is commonly used to test whether the mean from two samples are identical, thus it could be a helpful strategy in identifying synthetic lethal genes [3]. Synlet applies t-test to robust B-score value calculated from raw data of each plates, and the BH method is used to correct for the multiple comparisons.
We start with B-score calculation.
bscore.res <- sapply(as.character(unique(exampleDat$MASTER_PLATE))[1], bScore, exampleDat, control = "control", treatment = "treatment", simplify = FALSE)
## ---Processing PLATE:1031121---
## 1: 34257
## Final: 33965.5
## ---Processing PLATE:1031122---
## 1: 35666
## Final: 35464.75
## ---Processing PLATE:1031123---
## 1: 26858
## Final: 26757
## ---Processing PLATE:1031613---
## 1: 33731
## Final: 33589
## ---Processing PLATE:1031614---
## 1: 32981
## 2: 32617.25
## Final: 32546.69
## ---Processing PLATE:1031615---
## 1: 37003
## 2: 36509
## Final: 36360.38
head(bscore.res$P001)
## 1031121 1031122 1031123 1031613 1031614
## PDE12 si3 -0.3888287 -0.6491217 0.149729886 1.5912198 1.0046037
## PDE12 si1 -0.3817946 -0.2183379 -0.355019918 2.6084846 0.1178065
## PDE12 si2 0.1082468 -1.4949766 2.588726140 3.6026142 -0.1232492
## MARCH1 si1 -0.1082468 0.5960921 0.006591882 -0.1949474 0.3042604
## MARCH1 si3 -3.1329041 3.1906146 -0.144079702 1.0244094 1.0653504
## MARCH1 si2 2.8445065 0.2183379 -0.980306989 -0.3807088 0.2343841
## 1031615
## PDE12 si3 -1.5310097
## PDE12 si1 -0.3052868
## PDE12 si2 0.8711040
## MARCH1 si1 0.4640073
## MARCH1 si3 2.2841457
## MARCH1 si2 -0.5888389
bscore.res is a list containing B-score of plates belonging to the same master plate. The first three plates are treament, and the following three plates are control.
Now let’s apply the student’s t-test to the B-score and combine the results together.
bscore.ttest <- sapply(names(bscore.res), tTest, bscore.res, numTreat = 3, numCont = 3, simplify = FALSE, USE.NAMES = TRUE)
## Processing MASTER PLATE:P001
bscore.combined <- data.frame(do.call(rbind, lapply(names(bscore.ttest), function(x) if (!is.null(bscore.ttest[[x]])) {data.frame(MASTER_PLATE = x, siRNAs = rownames(bscore.ttest[[x]]), bscore.ttest[[x]])})))
head(bscore.combined)
## MASTER_PLATE siRNAs pvalue Treat_Cont p_adj
## PDE12 si3 P001 PDE12 si3 0.5706962 -0.6510114 0.8836586
## PDE12 si1 P001 PDE12 si1 0.3412250 -1.1253856 0.8761181
## PDE12 si2 P001 PDE12 si2 0.5544972 -1.0494909 0.8761181
## MARCH1 si1 P001 MARCH1 si1 0.9332815 -0.0262944 0.9789864
## MARCH1 si3 P001 MARCH1 si3 0.5036774 -1.4867582 0.8761181
## MARCH1 si2 P001 MARCH1 si2 0.4953927 0.9392337 0.8761181
The columns of bscore.combined are self-explanatory.
Hits selection based on median ± k median absolute deviation (MAD) is widely used in RNAi screen data analysis due to the easy calculation and robustness to outliers in the real data and simulation study [4]. The function madSelect could calculate the median of ratio between normalized treatment and control plates and label the hits based the k (the default value is 3). By default madSelect will use the fraction of samples normalization method, this could be overridden by specifying the negative control siRNA through the normMethod parameter.
We now apply the madSelect to our example data. madSelection is a list containing results for each master plate, and we combine the results together to madSelection.c. In the setting of synthetic lethal screen we are looking for the genes have a stronger knock-down effect, thus “Yes” means the ratio < Median - k*MAD.
madSelection <- sapply(as.character(unique(exampleDat$MASTER_PLATE)), madSelect, exampleDat, control = "control", treatment = "treatment", simplify = FALSE)
madSelection.c <- do.call(rbind, lapply(names(madSelection), function(x) madSelection[[x]]))
head(madSelection.c)
## MASTER_PLATE treat_cont_ratio treat_median control_median Hits
## AAK1 si3 P001 0.8966 0.8981399 1.001757 No
## AARSD1 si3 P001 1.0917 1.1078002 1.014778 No
## AATK si2 P001 1.0618 1.0398583 0.979346 No
## ABCA12 si1 P001 0.4237 0.7905346 1.865749 No
## ABCA4 si3 P001 0.9299 1.4215600 1.528736 No
## ABCA8 si3 P001 0.7434 1.1248893 1.513181 No
The rank products is a non-parametric statistic method proposed to find consistently up/down-regulated genes between treatment and controls replicates in microarray data, and has been successfully used in the RNAi screen data [5]. It has several advantages over the parametric Student’s t-test, including clear biological meaning, fewer assumptions of the data and improved performance. Detailed information about rank products could be found in [6] and [7].
It is straight forward to use rank product in synlet, as showed in the following codes.
rankp.res <- sapply(as.character(unique(exampleDat$MASTER_PLATE)), rankProdHits, exampleDat, control = "control", treatment = "treatment", simplify = FALSE)
## Rank Product analysis for two-class case
##
## Starting 100 permutations...
## Computing pfp ..
## Outputing the results ..
## Rank Product analysis for two-class case
##
## Starting 100 permutations...
## Computing pfp ..
## Outputing the results ..
## Rank Product analysis for two-class case
##
## Warning: There are 1 genes with at least one missing value.
##
## This value is not used to compute rank product.
##
## Starting 100 permutations...
## Computing pfp ..
## Outputing the results ..
rankp.c <- data.frame(do.call(rbind, lapply(names(rankp.res), function(x) rankp.res[[x]])))
head(rankp.c)
## MASTER_PLATE pvalue_treat_lowerthan_cont
## AAK1 si3 P001 0.58322917
## AARSD1 si3 P001 0.45723958
## AATK si2 P001 0.83661458
## ABCA12 si1 P001 0.05942708
## ABCA4 si3 P001 0.48682292
## ABCA8 si3 P001 0.48505208
## FDR_treat_lowerthan_cont treat_cont_log2FC
## AAK1 si3 0.9822807 0.01849477
## AARSD1 si3 0.9439785 -0.09956684
## AATK si2 0.9915432 0.33258218
## ABCA12 si1 1.0372727 -0.81790325
## ABCA4 si3 0.9254455 0.17295802
## ABCA8 si3 0.9313000 0.05368989
We could select interesting siRNA hits by FDR or pvalue in the rankp.c,
Redundant siRNA activity (RSA) is a novel method proposed for RNAi screen data, which systemically employs the information provided by multiple siRNAs targeting a single gene to reduce the off-target and improve confirmation rate. Briefly, RSA calculates a P-value for the rank distribution of multiple siRNAs silenced the same gene under the background of all siRNA signals in the experiment by iterative hypergeometric distribution formula. Compared to the methods mentioned above, siRNAs targeted the same genes have identical P-value, and genes with several moderately effect siRNAs may have smaller P-value than genes with fewer strong effect siRNAs. Synlet provides a wrapper function to use the RSA R codes [8].
rsaHits(exampleDat, treatment = "treatment", control = "control", normMethod = "PLATE", LB = 0.2, UB = 0.8, revHits = FALSE, Bonferroni = FALSE, outputFile = "RSAhits.csv")
## File written to:RSAhits.csv
## Summary of RSA:
## Total #Genes = 192
## Total #Wells = 576
## Total #Hit Genes = 104
## Total #Hit Wells = 133
rsa.res <- read.delim("RSAhits.csv", check.names = FALSE)
head(rsa.res)
## Gene_ID Well_ID Score LogP OPI_Hit #hitWell #totalWell rank
## 1 ACOT4 ACOT4 si1 0.2343 -3.743455 1 3 3 4
## 2 ACOT4 ACOT4 si3 0.2413 -3.743455 1 3 3 5
## 3 ACTG2 ACTG2 si1 0.5399 -2.332236 1 3 3 27
## 4 ACTG2 ACTG2 si3 0.5417 -2.332236 1 3 3 28
## 5 ACTG2 ACTG2 si2 0.7171 -2.332236 1 3 3 97
## 6 ACBD4 ACBD4 si2 0.1613 -2.283301 1 2 3 1
## OPI_Rank Cutoff_Rank EXP_Rank
## 1 1 4 1
## 2 2 5 2
## 3 3 27 3
## 4 4 28 4
## 5 5 97 5
## 6 6 1 1
The meaning of column names:
Gene_ID,Well_ID,Score: columns from input spreadsheet
LogP: OPI p-value in log10, i.e., -2 means 0.0
OPI_Hit: whether the well is a hit, 1 means yes, 0 means no
#hitWell: number of hit wells for the gene
OPI_Rank: ranking column to sort all wells for hit picking
Cutoff_Rank: ranking column to sort all wells based on Score in the simple activity-based method
Note: a rank value of 999999 means the well is not a hit
Users could pick up hits based on rsa.res, or calculate the FDR from LogP to set a stringent cut-off.
We have went through the process of hits selection in synlet. However, the siRNA hits picked by each algorithms may be different. How to get a reasonable gene lists for further verification? Here are some simple suggestions:
Rule of thumb, an interesting genes should show significant knock-down effect in at least two siRNA to avoid off-target effect.
High confidence: genes are picked up by at least two methods.
Low confidence: Get a union of siRNA identified as hits, then select genes targeted by at least 2 siRNA.
It is always helpful to check the interested genes by siRNAPlot.
[1] Zhang J.H., Chung T.D. & Oldenburg K.R. A simple statistical parameter for use in evaluation and validation of high throughput screening assays. J. Biomol. Screen. B, 4 67-73 (1999).
[2] Birmingham,A. et al. (2009) Statistical methods for analysis of high-throughput RNA interference screens. Nat Methods, 6, 569–575.
[3] Whitehurst,A.W. et al. (2007) Synthetic lethal screen identification of chemosensitizer loci in cancer cells. Nature, 446, 815–819.
[4] Chung,N. et al. (2008) Median Absolute Deviation to Improve Hit Selection for Genome-Scale RNAi Screens. Journal of Biomolecular Screening, 13, 149–158.
[5] Rieber,N. et al. (2009) RNAither, an automated pipeline for the statistical analysis of high-throughput RNAi screens. Bioinformatics, 25, 678–679.
[6] Breitling,R. et al. (2004) Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett, 573, 83–92.
[7] Hong,F. et al. (2006) RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics, 22, 2825–2827.
[8] Koenig, R. et al. A probability-based approach for the analysis of large-scale RNAi screens. Nat Methods 4, 847-849 (2007).
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] synlet_1.0.0 ggplot2_1.0.1 BiocStyle_1.8.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.1 knitr_1.11 magrittr_1.5
## [4] splines_3.2.2 MASS_7.3-44 munsell_0.4.2
## [7] lattice_0.20-33 colorspace_1.2-6 R6_2.1.1
## [10] dplyr_0.4.3 stringr_1.0.0 plyr_1.8.3
## [13] tools_3.2.2 parallel_3.2.2 grid_3.2.2
## [16] gtable_0.1.2 DBI_0.3.1 htmltools_0.2.6
## [19] lazyeval_0.1.10 assertthat_0.1 yaml_2.1.13
## [22] survival_2.38-3 digest_0.6.8 Matrix_1.2-2
## [25] RColorBrewer_1.1-2 reshape2_1.4.1 formatR_1.2.1
## [28] evaluate_0.8 rmarkdown_0.8.1 labeling_0.3
## [31] doBy_4.5-13 stringi_0.5-5 scales_0.3.0
## [34] RankProd_2.42.0 proto_0.3-10