public data mining resources

By Tidyomics Team | December 25, 2018

As sequencing technologies advance, more and more public data are available for you to mine. You do not have to produce your own data; mining public data sets can help generate hypotheses and even lead to decent papers if done properly.

In this blog post, I am going to list some of the public data resources one can take advantage of.

Gene Expression Omnibus

Gene Expression Omnibus (GEO) is an NCBI-supported public functional genomics data repository. Array- and sequence-based data are deposited by researchers. Many journals require authors to publish a GEO link to their data along with the paper. Sequencing files are deposited in SRA format, and NCBI provides the SRA Toolkit specifically to work with those files.

You can use ascp within the SRA Toolkit's prefetch for much faster downloads:

prefetch -t ascp -a "${ASCP_PATH}/connect/bin/ascp|${ASCP_PATH}/connect/etc/asperaweb_id_dsa.openssh" --max-size 1000GB "${SRA_ACCESSION_ID}"

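Or you can use fasterq-dump, the multi-threaded successor to fastq-dump, to get the fastq files. Here -t sets the directory for temporary files and -e the number of threads:

time fasterq-dump SRR000001 -t /dev/shm -e 8

Another option is parallel-fastq-dump: https://github.com/rvalieris/parallel-fastq-dump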
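For more than a handful of runs, the prefetch and fasterq-dump steps can be wrapped in a small loop. The following is a minimal sketch, not part of the SRA Toolkit: fetch_runs and the DRY_RUN switch are hypothetical names used for illustration, and it assumes prefetch and fasterq-dump are on your PATH.

```shell
#!/usr/bin/env bash
# Sketch only: fetch_runs and DRY_RUN are hypothetical names for illustration.
# Assumes prefetch and fasterq-dump from the SRA Toolkit are on PATH.

# Usage: fetch_runs OUTDIR ACCESSION [ACCESSION ...]
fetch_runs() {
    local outdir="$1"
    shift
    local acc
    for acc in "$@"; do
        if [ -n "${DRY_RUN:-}" ]; then
            # Only print what would be run
            echo "prefetch $acc"
            echo "fasterq-dump $acc -e 8 -O $outdir"
        else
            # Download the .sra file, then convert it to fastq with 8 threads
            prefetch "$acc" && fasterq-dump "$acc" -e 8 -O "$outdir"
        fi
    done
}
```

Setting DRY_RUN=1 before calling fetch_runs fastq SRR000001 prints the two commands without running them, which is a cheap way to check an accession list before starting a long download.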

European Nucleotide Archive

The European Nucleotide Archive (ENA) provides a comprehensive record of the world’s nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation.

The good part of ENA is that fastq files are available for download directly. With GEO, one has to convert the SRA files to fastq files, which can take a long time for big files. For this reason, I always go to the ENA ftp site to find the fastq files for the same study. To understand the structure of the ftp site, see a gist from Mike Love: https://gist.github.com/mikelove/f539631f9e187a8931d34779436a1c01

Archive-generated fastq files are organised by run accession number under the vol1/fastq directory in ftp.sra.ebi.ac.uk:

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/<dir1>[/<dir2>]/<run accession>

<dir1> is the first 6 letters and numbers of the run accession (e.g. ERR000 for ERR000916).

<dir2> does not exist if the run accession has six digits.

For example, fastq files for run ERR000916 are in directory: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000916/.

If the run accession has seven digits then <dir2> is 00 + the last digit of the run accession.

For example, fastq files for run SRR1016916 are in directory: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR101/006/SRR1016916/.

If the run accession has eight digits then <dir2> is 0 + the last two digits of the run accession.

If the run accession has nine digits then <dir2> is the last three digits of the run accession.
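The eight- and nine-digit rules have no worked example above, so a small shell function can help check them. This is a minimal sketch of the rules as described here, not an ENA tool: ena_fastq_dir is a hypothetical helper name, and it assumes a three-letter prefix (ERR, SRR or DRR) followed by six to nine digits.

```shell
#!/usr/bin/env bash
# Sketch of the ENA directory rules described above.
# ena_fastq_dir is a hypothetical helper name, not part of any ENA tool.
# Assumes accessions look like ERR/SRR/DRR followed by 6 to 9 digits.

ena_fastq_dir() {
    local acc="$1"
    local prefix="ftp://ftp.sra.ebi.ac.uk/vol1/fastq"
    local digits="${acc#???}"            # drop the three-letter prefix
    local dir1
    dir1=$(printf '%.6s' "$acc")         # first six characters
    local dir2
    case "${#digits}" in
        6) dir2="" ;;                                            # no second level
        7) dir2="00$(printf '%s' "$acc" | tail -c 1)/" ;;        # 00 + last digit
        8) dir2="0$(printf '%s' "$acc" | tail -c 2)/" ;;         # 0 + last two digits
        9) dir2="$(printf '%s' "$acc" | tail -c 3)/" ;;          # last three digits
        *) echo "unexpected accession: $acc" >&2; return 1 ;;
    esac
    echo "$prefix/$dir1/$dir2$acc/"
}
```

Calling ena_fastq_dir ERR000916 and ena_fastq_dir SRR1016916 reproduces the two directories given above; an eight-digit accession such as the made-up SRR12345678 maps to SRR123/078/SRR12345678/.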

Even better, one can stream the ENA fastq files with a small stream_ena script instead of downloading them first, for example for RNA-seq quantification with salmon:

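#!/bin/bash
# from http://www.nxn.se/valent/streaming-rna-seq-data-from-ena
# Usage: ./stream_ena SRR3185782.fastq
fastq="$1"

prefix=ftp://ftp.sra.ebi.ac.uk/vol1/fastq

# Run accession: everything before the first '.' or '_'
accession=$(echo "$fastq" | tr '.' '_' | cut -d'_' -f 1)

# First-level directory: first six characters of the accession
dir1=${accession:0:6}

# Second-level directory depends on the accession length
# (three letters plus six to nine digits)
a_len=${#accession}
if (( a_len == 9 )); then
    dir2=""
elif (( a_len == 10 )); then
    dir2="00${accession:9:1}/"
elif (( a_len == 11 )); then
    dir2="0${accession:9:2}/"
else
    dir2="${accession:9:3}/"
fi

url="$prefix/$dir1/$dir2$accession/$fastq.gz"

# Download quietly and decompress to stdout
curl --keepalive-time 4 -s "$url" | zcat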

How to use it:

./stream_ena SRR3185782.fastq | head
@SRR3185782.1 HWI-D00361:180:HJG3GADXX:2:1101:1460:2181/1
AGTGTGTTCATCAGTGTGGATTTGCCAATGCCGGTCTCCCCCACACAGAG
+
BBBFFBFFFB<FFFFFBFF<FFFFFFFFFFFFFIIIIFFFFFFFFIFFFF
@SRR3185782.2 HWI-D00361:180:HJG3GADXX:2:1101:1613:2218/1
GCCAATTTTCTTAATGTAAGTGCTGACTTCCTTAACAATTTCCTCATATC
+
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR3185782.3 HWI-D00361:180:HJG3GADXX:2:1101:2089:2243/1
CGGGTTCTTGGACTTCAGCCAGTTGAGCAGGGCATCCTTGTTGAAGGCGG


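# Single-end reads (library type U: unstranded); -i expects a salmon index
salmon quant -l U \
-i Homo_sapiens.GRCh38.78.cdna_ERCC_repbase.fa \
-r <(./stream_ena SRR3185782.fastq) -o SRR3185782

# Paired-end reads (library type IU: inward, unstranded)
salmon quant -l IU \
-i Homo_sapiens.GRCh38.78.cdna_ERCC_repbase.fa \
-1 <(./stream_ena SRR1274127_1.fastq) \
-2 <(./stream_ena SRR1274127_2.fastq) -o SRR1274127

# Stream into FastQC; the output directory must exist beforehand
mkdir -p SRR1274127_1_fastqc
./stream_ena SRR1274127_1.fastq | fastqc -o SRR1274127_1_fastqc -f fastq stdin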

RNAseq/microarray specific databases

BioJupies automatically generates RNA-seq data analysis notebooks. With BioJupies you can produce in seconds a customized, reusable, and interactive report from your own raw or processed RNA-seq data through a simple user interface.
RNA Meta Analysis (https://rnama.com/) has ~26,700 studies (5,717 RNA-Seq and 20,955 microarray). According to the site, based on 750 manually labeled studies, its clustering algorithm correctly identifies 91% of sample groups.
ReCount is an online resource consisting of RNA-seq gene count datasets built using the raw data from 18 different studies. An updated version is also available.
The conquer (consistent quantification of external rna-seq data) repository is developed by Charlotte Soneson and Mark D Robinson at the University of Zurich, Switzerland. It provides single-cell RNA-seq data sets.
The Lair: a resource for exploratory analysis of published RNA-Seq data. From Lior Pachter's group!
The Digital Expression Explorer (DEE) is a repository of digital gene expression profiles mined from public RNA-seq data sets. These data are obtained from NCBI Short Read Archive. There is also a blog post about it.
Extracting allelic read counts from 250,000 human sequencing runs in Sequence Read Archive data
SHARQ: search public, human, RNA-seq experiments by cell, tissue type, and other features. Indexes 19,807 files.
MetaSRA: normalized sample-specific metadata for the Sequence Read Archive
ARCHS4: Massive Mining of Publicly Available RNA-seq Data from Human and Mouse. ARCHS4 provides access to gene counts from HiSeq 2000, HiSeq 2500 and NextSeq 500 platforms for human and mouse experiments from GEO and SRA.
RESTful RNA-seq Analysis API: a simple RESTful API to access analysis results of all public RNA-seq data for nearly 200 species in the European Nucleotide Archive.
intropolis is a list of exon-exon junctions found across 21,504 human RNA-seq samples on the Sequence Read Archive (SRA) from spliced read alignment to hg19 with Rail-RNA.
ExpressionAtlas bioconductor package: This package is for searching for datasets in EMBL-EBI Expression Atlas, and downloading them into R for further analysis. Each Expression Atlas dataset is represented as a SimpleList object with one element per platform. Sequencing data is contained in a SummarizedExperiment object, while microarray data is contained in an ExpressionSet or MAList object.
GTEx resources in the UCSC Browser: signal tracks on a track hub.
Batch recompute of ~20,000 RNA-seq samples from large sequencing projects such as TCGA, TARGET and GTEx. Uses hg38 and GENCODE v21 as annotation.
A cloud-based workflow to quantify transcript-expression levels in public cancer compendia: uses kallisto for TCGA/CCLE datasets and GENCODE v24 as annotation.
OMics Compendia Commons (OMiCC): a community-based, biologist-friendly web platform for creating and (meta-) analyzing annotated gene-expression data compendia across studies and technology platforms for more than 24,000 human and mouse studies from Gene Expression Omnibus (GEO).
GEOdiver An easy to use web tool for analyzing GEO datasets.
ScanGEO - parallel mining of high-throughput gene expression data
shinyGEO a web-based application for performing differential expression and survival analysis on Gene Expression Omnibus datasets.
GREIN: An interactive web platform for re-analyzing GEO RNA-seq data
ImaGEO Integrative Meta-Analysis of GEO Data.
Expression Atlas update: an integrated database of gene and protein expression in humans, animals and plants. It consists of selected microarray and RNA-sequencing studies from ArrayExpress, which have been manually curated, annotated with ontology terms, checked for high quality and processed using standardized analysis methods. Since the last update, Atlas has grown seven-fold (1572 studies as of August 2015), and incorporates baseline expression profiles of tissues from Human Protein Atlas, GTEx and FANTOM5, and of cancer cell lines from ENCODE, CCLE and Genentech projects.
scRNASeqDB: a database for gene expression profiling in human single cells by RNA-seq.
JingleBells - A repository of standardized single cell RNA-Seq datasets for analysis and visualization in IGV of the raw reads at the single cell level. Currently focused on immune cells. (http://www.jimmunol.org/content/198/9/3375.long)

ChIPseq specific databases

ENCODE
Cistrome: The best place for wet lab scientists to check binding sites. Developed by Shirley Liu's lab at Harvard.
ChIP-Atlas is an integrative and comprehensive database for visualizing and making use of public ChIP-seq data. ChIP-Atlas covers almost all public ChIP-seq data submitted to the SRA (Sequence Read Archives) in NCBI, DDBJ, or ENA, and is based on over 78,000 experiments.
ReMap: an integrative analysis of transcriptional regulators' ChIP-seq experiments from both public and ENCODE datasets. The ReMap atlas consists of 80 million peaks from 485 transcription factors (TFs), transcription coactivators (TCAs) and chromatin-remodeling factors (CRFs).
UniBind: a map of direct TF-DNA interactions in the human genome. UniBind is a comprehensive map of direct interactions between transcription factors (TFs) and DNA. High confidence TF binding site predictions were obtained from uniform processing of thousands of ChIP-seq data sets using the ChIP-eat software.

Other field specific

Genotype-Tissue Expression (GTEx)
TCGA The Cancer Genome Atlas.
CCLE Broad Institute Cancer Cell Line Encyclopedia.
TARGET Therapeutically Applicable Research To Generate Effective Treatments.
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.
1000 genomes project

There are many other databases that I may have missed here. As you can see, the amount of data available is immense. It is a good time to be a research parasite :)
