RNA sequencing (RNA-seq) has become the standard
technology for measuring the expression abundance of genes,
transcripts, exons or splicing junctions. Numerous
methods have been proposed to quantify these abundances.
We have proposed a series of statistical metrics and data
visualization techniques to evaluate the performance of
transcript quantification. This online tool provides an easy
way to submit quantifications from any prospective method,
and links to the resulting metrics and visualization reports.
Three datasets are provided here for evaluating quantification methods. For each dataset, two files should be prepared before submission. One is a quantification table in text format, which is detailed below. The other is a log text file recording the pipeline's running parameters and statistics, such as pipeline versions, commands used, and time and memory cost.
This dataset was created using 30 samples from the Geuvadis project.
These samples were selected to represent a random sample of individuals.
To introduce batch-effect-like variability, we selected 15 samples from one center
and 15 from another. These were then randomly divided into two groups,
each having 7 samples from one center and 8 from the other.
Because the samples were assigned at random, this is a null experiment
and we can consider the 15 samples in each group to be replicates.
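The random assignment described above can be sketched as follows. This is a minimal illustration, not the procedure actually used to build the dataset: the function name `split_into_groups`, the sample labels, and the seed are hypothetical, and the only property taken from the text is that each group receives 7 samples from one center and 8 from the other.

```python
import random


def split_into_groups(center_a, center_b, seed=0):
    """Split 15 + 15 samples into two groups of 15.

    Group 1 gets 7 samples from center A and 8 from center B;
    group 2 gets the remaining 8 from A and 7 from B.
    """
    if len(center_a) != 15 or len(center_b) != 15:
        raise ValueError("expected 15 samples from each center")
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    a = list(center_a)
    b = list(center_b)
    rng.shuffle(a)
    rng.shuffle(b)
    group1 = a[:7] + b[:8]
    group2 = a[7:] + b[8:]
    return group1, group2


if __name__ == "__main__":
    # Hypothetical sample labels, for illustration only.
    center_a = [f"A{i}" for i in range(1, 16)]
    center_b = [f"B{i}" for i in range(1, 16)]
    g1, g2 = split_into_groups(center_a, center_b, seed=1)
    print(len(g1), len(g2))
```

Because group membership is independent of any biological variable, differences between the two groups before the simulated transcripts are added reflect only sampling and center effects.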
To distinguish the two groups, we used computer simulations to generate
2,424 transcripts on chromosome 2 designed to be differentially expressed between
the two groups.
To make these abundances mimic experimental data,
we adapted the Polyester method to include GC-bias imitating the bias observed in the actual data.
We combined all simulated reads and reads originally from chromosome 1 to generate the final
SIMULATION reads or
ORIGINAL SIMULATION reads.
This dataset was simulated using an unstranded paired-end sequencing protocol.
The sample meta information is available for download.
Note the difference between the two versions of the simulation:
In the ORIGINAL SIMULATION, we unintentionally included multiple alignments when extracting chr1 reads from the GEUVADIS alignment files. When multiple alignments were available for one read of a pair and the other read was missing (e.g. it had more than 20 alignments and was in the unmapped.bam file), such an alignment would be written by bedtools bamtofastq to both FASTQ files.
In the SIMULATION, we extracted only unique, properly paired alignments from chr1 to combine with the original simulated reads from transcripts on chr2. We recommend using this new simulation dataset for testing purposes.
We have generated both GTF and FASTA files for the transcripts to be quantified in this dataset. As mentioned above, these transcripts are from chromosomes 1 and 2 only. Note that these annotation files are based on human genome hg19. If the genome sequence is needed, you can restrict it to chromosomes 1 and 2 to reduce analysis cost.
The quantification file should be organized as a 30-column tab-delimited text file with the first row as column names. The columns should be ordered from sample 1 to sample 30, as indicated by the reads file names. The total number of rows should equal the number of transcripts in the FASTA file, in exactly the same order. Missing values written as "NA" (no quotes) are allowed, but negative quantifications are not. Here is a quantification example.
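Before submitting, it can help to check a table against the rules above. The following is a minimal sketch, not the tool's own validation: the function name `validate_quant_table` is hypothetical, and it assumes the table contains only the value columns (no transcript ID column) and that the transcript count is known from the FASTA file.

```python
import csv
import io
import math


def validate_quant_table(text, n_samples, n_transcripts):
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = []
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["table is empty"]
    header, body = rows[0], rows[1:]
    if len(header) != n_samples:
        problems.append(f"header has {len(header)} columns, expected {n_samples}")
    if len(body) != n_transcripts:
        problems.append(f"{len(body)} data rows, expected {n_transcripts}")
    for line_no, row in enumerate(body, start=2):
        if len(row) != n_samples:
            problems.append(f"line {line_no}: {len(row)} columns, expected {n_samples}")
            continue
        for value in row:
            if value == "NA":  # missing values are allowed
                continue
            try:
                number = float(value)
            except ValueError:
                problems.append(f"line {line_no}: non-numeric value {value!r}")
                continue
            if math.isnan(number) or number < 0:
                problems.append(f"line {line_no}: invalid value {value!r}")
    return problems


if __name__ == "__main__":
    header = "\t".join(f"sample{i}" for i in range(1, 31))  # hypothetical names
    row = "\t".join(["1.5"] * 29 + ["NA"])
    print(validate_quant_table(header + "\n" + row + "\n", 30, 1))
```

The check does not verify that rows follow the FASTA order, since the table itself carries no transcript identifiers; that ordering has to be guaranteed when the table is written.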
The ENCODE Project has been working on evaluating RNA-seq techniques to provide more reliable standards for the whole community. Two cell types, GM12878 and K562, with reasonably high sequencing depth, are used for this purpose. For this webtool, we selected the dataset generated with the dUTP PolyA+ long RNA-seq protocol, which is available at the ENCODE portal with accession IDs ENCSR000AED (ENCLB037ZZZ & ENCLB038ZZZ) and ENCSR000AEM (ENCLB055ZZZ & ENCLB056ZZZ).
The GENCODE V16 genome annotation is used for quantification. Specifically, only protein-coding transcripts are used in this online tool. For pipelines that use a transcriptome-based aligner or pseudoalignment, the FASTA sequences of the protein-coding transcripts may be sufficient for quantification. However, for pipelines that use a genome-based aligner, the whole-genome GTF file is also necessary.
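For pipelines that start from the whole-genome GTF, the protein-coding transcripts can be listed with a short script. This is a sketch under assumptions: the function name is hypothetical, and it assumes that transcript records carry a `transcript_type "protein_coding"` attribute, which should be checked against the annotation file actually downloaded.

```python
import re

# Matches quoted GTF attributes such as: transcript_id "ENST...";
_ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def protein_coding_transcript_ids(gtf_lines):
    """Return transcript IDs of protein-coding transcripts, in file order."""
    ids = []
    for line in gtf_lines:
        if line.startswith("#"):  # skip header comments
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # keep one record per transcript, ignore exons etc.
        attributes = dict(_ATTRIBUTE.findall(fields[8]))
        # Assumption: the biotype is stored under "transcript_type".
        if attributes.get("transcript_type") == "protein_coding":
            transcript_id = attributes.get("transcript_id")
            if transcript_id is not None:
                ids.append(transcript_id)
    return ids


if __name__ == "__main__":
    # "annotation.gtf" is a placeholder path, not a file named by this site.
    with open("annotation.gtf") as handle:
        print(len(protein_coding_transcript_ids(handle)))
```

The order returned here is the order in the GTF file, which is not guaranteed to match the row order required for submission; that order should be taken from the annotation provided with the dataset.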
The quantification file should be organized as a 4-column tab-delimited text file with the first row as column names. The columns should be in this order: (ENCLB037ZZZ, ENCLB038ZZZ, ENCLB055ZZZ, ENCLB056ZZZ). The total number of rows should equal the number of protein-coding transcripts in the GENCODE annotation, in exactly the same order. Missing values written as "NA" (no quotes) are allowed, but negative quantifications are not. Here is a quantification example.
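One way to assemble such a table from per-library results is sketched below. The function name `build_encode_table` and the input layout (one dictionary of transcript ID to abundance per library) are hypothetical, and using the library accessions as column names is an assumption; the text above only fixes the column order.

```python
ENCODE_LIBRARIES = ["ENCLB037ZZZ", "ENCLB038ZZZ", "ENCLB055ZZZ", "ENCLB056ZZZ"]


def build_encode_table(transcript_order, per_library):
    """Build the 4-column tab-delimited table as a string.

    transcript_order: transcript IDs in annotation order.
    per_library: {library accession: {transcript ID: abundance}}.
    Transcripts missing from a library are written as NA.
    """
    lines = ["\t".join(ENCODE_LIBRARIES)]
    for transcript_id in transcript_order:
        cells = []
        for library in ENCODE_LIBRARIES:
            value = per_library.get(library, {}).get(transcript_id)
            if value is None:
                cells.append("NA")
            elif value < 0:
                raise ValueError(f"negative abundance for {transcript_id} in {library}")
            else:
                cells.append(str(value))
        lines.append("\t".join(cells))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical transcript IDs and values, for illustration only.
    example = {"ENCLB037ZZZ": {"T1": 1.5}, "ENCLB038ZZZ": {"T1": 2.0}}
    print(build_encode_table(["T1", "T2"], example), end="")
```

Iterating over the annotation order rather than over the pipeline's output keeps the row count and order fixed even when a method drops some transcripts.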
As mentioned for the simulation dataset with 15 replicates, some of the samples are confounded with batch effects. Specifically, we chose samples 8 through 23, 16 in total, as the confounded dataset here.
The annotation files are the same as for the simulation dataset with 15 replicates.
The quantification file should be organized as a 16-column tab-delimited text file with the first row as column names; all other requirements are the same as for the simulation dataset with 15 replicates. Here is a quantification example.
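If a pipeline quantifies each sample independently, the 16-column table can be derived from an existing 30-column table by keeping columns 8 through 23. The sketch below assumes exactly that situation; the function name `subset_sample_columns` is hypothetical, and pipelines that share information across samples would need to be rerun on the 16 samples instead.

```python
def subset_sample_columns(text, first=8, last=23):
    """Keep columns first..last (1-based, inclusive) of a tab-delimited table."""
    if first < 1 or last < first:
        raise ValueError("invalid column range")
    kept = []
    for line in text.splitlines():
        if not line:
            continue  # ignore blank lines
        cells = line.split("\t")
        if len(cells) < last:
            raise ValueError(f"row has only {len(cells)} columns")
        kept.append("\t".join(cells[first - 1:last]))
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    header = "\t".join(f"sample{i}" for i in range(1, 31))  # hypothetical names
    values = "\t".join(str(i) for i in range(1, 31))
    print(subset_sample_columns(header + "\n" + values + "\n"), end="")
```

The header row is sliced together with the data rows, so the column names stay aligned with the retained samples.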
This website is designed and maintained by Mingxiang Teng and Rafael A. Irizarry.