Next-Gen Data Analysis

Next-Gen Genomics has arrived, and it has changed the way the research community looks at diseases, genetic makeup and the genome. Ocimum Biosolutions now offers Next-Gen Data Analysis for data from all popular Next-Gen Sequencing platforms, such as Roche 454, Illumina Solexa and ABI SOLiD™, as well as Sanger sequencing.
In the past four years, high-throughput DNA sequencing platforms have become widely available, causing the cost of DNA sequencing to drop sharply. The speed and accuracy of these platforms have led to wide acceptance among individual investigators and have empowered them. Progress in parallel fields such as data storage and computation has also supported the growth of next-generation sequencing technology.
Next-generation DNA sequencing has changed the way genomic research is done today by enabling comprehensive analysis of genomes, transcriptomes and epigenomes without significant effort. Genetic variation, protein-DNA interactions, noncoding RNA expression profiles and more can be assessed by next-generation sequencing. Applications of next-generation (short-read) sequencing range from identifying etiology and new variations that increase the risk of chronic disorders and diseases, to enhanced characterization of livestock genome sequences, identification of virulence markers associated with crop diseases, and inference of population structure in microbial ecology studies. The major Next-Gen Sequencing platforms today are Roche 454, Illumina Genome Analyzer and ABI SOLiD™.
| Platforms | Roche (454) GS-FLX | Illumina Genome Analyzer II system | ABI SOLiD |
|---|---|---|---|
| Starting DNA (μg) | 3 – 5 | 0.1 – 1 | 0.1 – 20 |
| Amplification | Emulsion PCR | Bridge PCR | Emulsion PCR |
| Sequencing method | Pyrosequencing | Sequencing by synthesis | Sequencing by ligation |
| Read length (bases) | 250 | 32-40 | 35 |
| Throughput capability (Gb per run) | 0.1 | 1.3 | 4 |
| Reagent cost per run (list prices) | 8500 | 3000 | 3400 |
| Run time | 7.5 h | 3 d | 7 d |

Ref: George E. Liu. Applications and Case Studies of the Next-Generation Sequencing Technologies in Food, Nutrition and Agriculture. Recent Patents on Food, Nutrition & Agriculture, 2009, 1, 75-79.

Next Generation Data Analysis deliverables
| Whole Genome & targeted re-sequencing | Transcriptome analysis | Small RNA analysis | De novo assembly | ChIP-Seq analysis |
|---|---|---|---|---|
| SNVs | Read counts | Read counts | Assembled contigs and scaffolds | Protein bound regions |
| Small insertions and deletions | RPKM values | Mapped reads in BAM format | Quality reports | Motif analysis |
| Structural variations | Expressed SNVs | Coverage files (wig format) | | Mapped reads in BAM format |
| Mapped reads in BAM format | Alternative splice events | Quality reports | | Coverage files (wig format) |
| Coverage files (wig format) | Novel transcribed regions | | | Quality reports |
| Quality reports | Mapped reads in BAM format | | | |
| | Coverage files (wig format) | | | |
| | Quality reports | | | |
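
The RPKM values listed among the transcriptome deliverables are commonly defined as reads per kilobase of transcript per million mapped reads. The following is a minimal sketch under that standard definition; the function and argument names are illustrative assumptions and do not describe Ocimum's actual pipeline.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    The factor 1e9 combines the per-kilobase (1e3) and
    per-million-reads (1e6) scaling.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical numbers for illustration only:
# 500 reads on a 2,000 bp gene in a library of 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))  # prints 25.0
```

Normalizing by both gene length and library size is what makes values comparable across genes and across runs of different depth.
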
QA/QC reports
| Primary analysis | Secondary analysis | Tertiary analysis |
|---|---|---|
| Summary on reference sequence or genomic reference (length, composition, gaps) | Summary of matching results | SNP frequency report |
| Reports on raw data based on quality values: (a) Avg. quality value by base position of the read; (b) Distribution of quality values | Coverage distribution (for each chromosome/reference sequence) | Report on indel sizes |
| Summary of raw read sequences | Regions not covered after mapping | Distribution of expression values (e.g. RPKM values) |
| Total number of reads by tile/panel | Distribution of matched read lengths | Distribution of peak region lengths |
| Base composition by base position of the read | Error rates by base position of the read sequence | Distribution of small RNA lengths |
| Distribution of read lengths | Data visualization | |
| De novo assembly report (N50 calculation, distribution of contig lengths, base composition of assembled contigs), etc. |
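
The N50 calculation mentioned in the de novo assembly report can be sketched as follows. This is a minimal illustration assuming the usual definition of N50 (the contig length at which contigs of that length or longer contain at least half of the assembled bases); the function name and the example lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bases)."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(example_lengths))  # prints 70
```

Here the total is 300 bases, and the two longest contigs (80 + 70) already reach the halfway mark of 150, so the N50 is 70.
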
Some techniques in Next Generation Genomics that have transformed Genomics research are as follows:
Identifying causative mutations, low-frequency SNPs and genome variation within populations is the fundamental objective of targeted and whole genome resequencing. This requires analyzing millions of sequences, with reads long enough to be mapped onto the genome accurately. In addition, the mapping process itself must be accurate and quick. Next-generation sequencing platforms have a clear edge over electrophoresis-based sequencing in such cases. These systems generate gigabytes of data in a single run, providing full genome coverage.
De novo sequencing allows generation of the primary genetic sequence of an organism. With longer read lengths and faster data analysis, de novo sequencing is much cheaper and quicker than Sanger sequencing.
Small RNAs can serve as important biomarkers for diagnostic purposes. Digital gene expression using next-generation genomic technologies can help in the discovery and analysis of RNA without the need for prior sequence information. These systems are highly accurate and suited to analyzing low RNA expression levels.
ChIP-Seq data sets are analyzed and annotated to identify protein-DNA interactions across an entire genome.
Metagenomic annotation and classification require high-throughput or short-read sequencing technologies in cases where DNA is purified from an environmental sample and sequenced. The sequences may come from several different species and need to be assembled.






