De Novo based sequence assembly of next generation sequence data without chimeras: improved annotation, gene expression profiles and haplotype
reconstruction.
The vast quantities of read data generated by current sequencing platforms, commonly called "next generation sequencers" (NGS), have led to previously unattainable insight into biology. Generally, before use, reads must be corrected for errors and then either mapped to a reference or assembled into contig sequences representing transcripts or chromosomes. Although many early difficulties have been overcome, accurately reconstructing diversity within complex datasets, such as those from transcriptomes harbouring large amounts of isoform variation or from rapidly evolving viral populations, has remained elusive. Reference-based approaches are limited to cases where a reference exists. De novo approaches can produce chimeric contigs that, despite having sequence similarity to transcripts, do not maintain relationships between co-evolving sites, recombination breakpoints or gene expression profiles. Chimeras reduce the power of NGS to dissect evolutionary dynamics. They also risk misleading future studies because, being routinely deposited in public databases, they reduce the quality of future annotations and reference datasets. This proposal will develop a solution to the problem of chimeric sequence assembly and analysis. Previously, members of this team developed algorithms for assembling non-chimeric contigs from small datasets containing long reads, as well as for reliably annotating such contigs. Initially we will adapt our assembly algorithm to accommodate large datasets and a wide range of read lengths. It will then be used to explore and develop feature-rich tools, within an integrated framework, for the assembly, annotation and analysis of data derived from sources harbouring complex variation.
The framework will be applied primarily to a number of key areas of research, including (i) economically important fish populations such as sardine and the farmed fish species Dicentrarchus labrax and Sparus aurata; (ii) soil metagenomic studies with a focus on radioactive contamination of soil and water within deactivated uranium mines in Portugal; (iii) other environmental studies, including the role of transcriptomics in localized environmental adaptation; and (iv) parasitic nematodes, medical venomics and rapidly evolving viruses such as HIV-1. Additionally, our host institute is in the late stages of setting up an NGS genomics and bioinformatics facility with 3.1 million euros from the European Commission Seventh Framework Programme (grant no. 286431). As such, directly available to this project are (i) maintained servers (and other hardware) capable of the computational tasks required, (ii) two Illumina NGS platforms and (iii) a vast increase in bioinformatics expertise. Funding the cutting-edge research proposed in this project will place our host institute at the forefront of bioinformatics research both within Portugal and internationally.