News
- AnnoLnc2 reports known repeats in the input lncRNAs, given their potential importance for lncRNA function. Users can find the related results in the Genomic Location module.
- AnnoLnc2 reports known localization-related motifs in the query sequence based on a curated list compiled from published studies. Users can find the related results in the Subcellular Localization module for human.
- AnnoLnc2 improved the visualization component so that miRNA/protein binding sites are superimposed directly on the RNA secondary structure. Users can find the related results in the miRNA Regulation and Protein Interaction modules.
Data Source
The genome 2bit files for human (hg38) and mouse (mm10) were downloaded from the UCSC Genome Browser, and the corresponding gene annotation files in GTF format were downloaded from GENCODE (v32 for human and vM23 for mouse). RNA-seq datasets of human tissues and cancer cell lines were downloaded from GTEx and CCLE (ArrayExpress experiment accession E-MTAB-2770), and those of normal human cell lines and mouse were downloaded from ENCODE. Binding sites of transcription factors (TFs) were downloaded directly from the GTRD database. All CLIP-Seq peak files of RNA binding proteins (RBPs) were downloaded from the POSTAR database. In addition, all GEO CLIP-Seq raw read files published after POSTAR and before August 24th, 2019 were downloaded and had their peaks called (see below for details about peak calling). The phyloP and phastCons scores for human among primates, mammals, and vertebrates were computed from the 100-way multiple alignment data in MAF format from the UCSC Genome Browser (see below for details about the computation); the scores for mouse were based on the 60-way conservation scores among Glires, Euarchontoglires, placental mammals, and vertebrates from the UCSC Genome Browser. The trait-associated variants and eQTLs for human were downloaded from the NHGRI GWAS Catalog and GTEx, respectively, and phenotypic alleles and QTL alleles for mouse were collected from MGI.
More details about the data collection:
RNA-seq data processing
Raw RNA sequencing data were mapped to the reference genome (hg38 or mm10) using HISAT2 (version 2.0.4), and FPKM (fragments per kilobase of transcript per million mapped reads) values were quantified by StringTie (version 2.0) based on the GENCODE annotation (v32 for human and vM23 for mouse). To compare expression levels across samples, we normalized the expression profiles with the geometric method, treating normal and cancer samples of human (or tissue and cell line samples of mouse) separately.
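As an illustration of the quantities above, the following minimal sketch computes FPKM from its definition and applies a median-of-ratios scaling across samples. It assumes that the "geometric method" refers to the geometric-mean-based normalization popularized by DESeq and Cufflinks; the function names and the toy input layout are hypothetical, and this is not the AnnoLnc2 code.

```python
import math
from statistics import median


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """FPKM = fragments / (length in kb) / (mapped fragments in millions)."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def geometric_size_factors(expr):
    """Median-of-ratios size factors (assumed meaning of 'geometric method').

    expr maps a sample name to a list of expression values, with the same
    gene order in every sample.
    """
    samples = list(expr)
    n_genes = len(expr[samples[0]])
    # Log geometric mean per gene, using only genes expressed in every sample.
    log_geo_means = {}
    for g in range(n_genes):
        values = [expr[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means[g] = sum(math.log(v) for v in values) / len(values)
    if not log_geo_means:
        raise ValueError("no gene is expressed in all samples")
    # A sample's size factor is the median ratio to the per-gene geometric mean.
    return {
        s: math.exp(median(math.log(expr[s][g]) - m
                           for g, m in log_geo_means.items()))
        for s in samples
    }


def normalize(expr):
    """Divide every sample by its size factor."""
    factors = geometric_size_factors(expr)
    return {s: [v / factors[s] for v in vals] for s, vals in expr.items()}
```

Under this assumption, each group of samples (for example, normal and cancer samples of human) would be passed to `normalize` separately, as described above.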
CLIP-seq data processing
For each GEO CLIP-Seq dataset, we first trimmed the adapters with FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) and retained only those trimmed reads that were longer than 12 nt and had a quality score above 20 at 80% or more of their positions. We then removed duplicated trimmed reads (i.e., trimmed reads with the same sequence) with FASTX-Toolkit and mapped the remaining reads to the human (hg38) or mouse (mm10) genome using bowtie with the parameters “-m 1 --best --strata” (i.e., keeping uniquely mapped reads only). Finally, to call peaks as protein binding sites, we used Piranha (on HITS-CLIP, PAR-CLIP and iCLIP datasets, with the parameters “-s -b 20 -d ZeroTruncatedNegativeBinomial -p 0.01”), PARalyzer (on PAR-CLIP datasets) and CLIP Tool Kit (CTK; on HITS-CLIP and iCLIP datasets), the latter two with default parameters.
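The read-filtering and duplicate-removal criteria described above can be sketched in a few lines. This is only an illustration of the stated thresholds, not a replacement for FASTX-Toolkit: it assumes Phred+33 quality encoding, treats "above 20" as strictly greater (the tool's own boundary handling may differ), and uses hypothetical function names.

```python
def passes_filter(seq, qual, min_q=20, min_percent=80, min_len=12, offset=33):
    """Return True if a trimmed read meets the length and quality criteria.

    seq  : read sequence after adapter trimming
    qual : quality string of the same length (Phred+33 assumed)
    """
    if len(seq) <= min_len:  # the read must be longer than 12 nt
        return False
    good = sum(1 for c in qual if ord(c) - offset > min_q)
    # Integer arithmetic avoids floating-point issues at the 80% boundary.
    return good * 100 >= min_percent * len(qual)


def filter_and_collapse(reads):
    """Keep reads that pass the filter, then drop exact sequence duplicates.

    reads is an iterable of (sequence, quality) pairs; the result is a list
    of unique sequences in first-seen order.
    """
    seen = set()
    kept = []
    for seq, qual in reads:
        if passes_filter(seq, qual) and seq not in seen:
            seen.add(seq)
            kept.append(seq)
    return kept
```

Only the unique sequences returned by `filter_and_collapse` would then be passed on to the mapping step.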
Conservation score calculation
We calculated human conservation scores following the UCSC Genome Browser pipeline (https://groups.google.com/a/soe.ucsc.edu/forum/#!msg/genome-mirror/giZ09PCNBq8/EmkvI1BpBAAJ). Briefly, we downloaded the 100-way multiple alignment data from the UCSC Genome Browser, constructed a phylogenetic model for each clade using the phyloFit program in the PHAST software package, and computed phastCons and phyloP scores with phastCons and phyloP, respectively.
Subcellular localization dataset generation
We used StringTie (version 1.3.0) to profile nuclear/cytosolic expression of GENCODE v32 genes on the 10 ENCODE datasets, and discarded lowly expressed transcripts (i.e., FPKM < 0.1 in all samples). We then converted FPKM to read counts with the prepDE.py script from the StringTie package and performed differential expression analysis across subcellular compartments using DESeq2 (version 1.18.1). The localization preference of a transcript was then defined as its estimated fold change (nuclear/cytosolic).
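The expression filter and the direction of the fold change can be illustrated with a small sketch. The filter follows the stated rule directly. The fold-change function is a deliberately naive stand-in (a log2 ratio of mean expression with a pseudocount) and does not reproduce DESeq2's model-based estimate; all names and the pseudocount are hypothetical.

```python
import math


def drop_lowly_expressed(fpkm_by_transcript, threshold=0.1):
    """Discard transcripts whose FPKM is below the threshold in all samples."""
    return {
        tx: values
        for tx, values in fpkm_by_transcript.items()
        if any(v >= threshold for v in values)
    }


def naive_log2_fold_change(nuclear, cytosolic, pseudocount=1.0):
    """Illustrative log2(nuclear / cytosolic) ratio of mean expression.

    NOT the DESeq2 estimate: there is no dispersion modelling or shrinkage.
    Positive values indicate nuclear preference, negative values cytosolic.
    """
    mean_nuclear = sum(nuclear) / len(nuclear)
    mean_cytosolic = sum(cytosolic) / len(cytosolic)
    return math.log2((mean_nuclear + pseudocount) / (mean_cytosolic + pseudocount))
```

With the nuclear/cytosolic orientation used here, a positive value indicates a transcript enriched in the nucleus and a negative value one enriched in the cytosol.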