We used high-throughput DNA/RNA sequencing to assess the genetic complexity of Tara Oceans plankton samples. Multi-marker DNA metabarcoding (metaB) provided a first overview of the diversity and abundance of prokaryotic and eukaryotic taxa inhabiting the global ocean. Metagenomics (metaG) and metatranscriptomics (metaT) revealed the gene content and functional potential of viruses, prokaryotes and eukaryotes, generating large databases that can be leveraged to provide insights into population structure, gene evolution and mutation pressure across the three kingdoms of Life. In addition, metaT provides information about gene expression under natural conditions for both eukaryotes and prokaryotes. These efforts have led to the creation of large metabarcode and gene collections (Deliverables F), annotated for their taxonomic and functional content as well as their abundance in the ocean. These collections are a unique resource not only for marine biology and ecology, but also for other disciplines.
As marine eukaryotes are poorly represented in public databases, additional efforts have been made to produce new reference genomes from uncultured protists, isolated by flow cytometry and then subjected to whole genome amplification and de novo sequencing, as well as new reference transcriptomes from cultured marine protists. Together these provide taxonomic references in poorly represented areas of the tree of life. From an experimental point of view, the processing of Tara Oceans ‘omics’ samples has been very challenging because open ocean sampling constraints limited the quantity of nucleic acids that could be extracted. Consequently, molecular biology protocols were either developed specifically for Tara Oceans samples or carefully selected amongst existing ones, in order to limit technical biases and ensure the cross-comparability of the results (Alberti et al., Sci. Data, in review). For example, different cDNA synthesis protocols were applied to generate metaT data depending on the expected eukaryotic (polyA+ mRNA) or prokaryotic dominance of the sampled community. New “low input” cDNA synthesis methods (Alberti et al., BMC Genomics, 2014) were developed and applied to samples yielding only low amounts of total RNA.
In total, the current sequencing effort has produced >250 billion DNA paired reads from >4 300 size-fractionated plankton communities, making it by far the largest homogeneous multi-omics data set for any biome (>50 Terabases of raw data, Table 1). Finally, metagenetics catalogs were produced for viruses (Deliverables F1, 2), prokaryotes (Deliverables F3, 4), and eukaryotes (Deliverables F6, 8-11), using new bioinformatics pipelines adapted to each kingdom of Life. These catalogs allow gene-centric analyses to be performed and pave the way toward genome-based meta-analyses.
Table 1. Summary of Tara Oceans ‘omics’ datasets, generated from size-fractionated plankton samples, and single cells collected across the world oceans (see Fig. 1).
Size fractions (μm) | Target groups | Omics type | # of plankton communities analysed | Mean # of reads/sample (millions of paired reads) | Total # of reads (billions of paired reads) |
---|---|---|---|---|---|
< 0.2 | phages | metaG | 112 | 86 | 9.6 |
0.2-1.6; 0.1-0.2; 0.45-0.8; 0.2-0.45 | giruses (giant DNA viruses) | metaG | 73 | 111 | 9 |
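0.2-1.6; 0.2-3 | giruses and prokaryotes | metaT (random priming) | 153 | 160 | 33 |
0.2-1.6; 0.2-3 | giruses and prokaryotes | metaG | 243 | 117 | 25 |
0.2-1.6; 0.2-3 | giruses and prokaryotes | 16S metaB | 1142 | 0.5 | 0.44 |
0.8-inf; 3-inf; 0.8-5; (0.8-3); 5-20 (3-20); 20-180; 180-2000 | protists and metazoa | 16S metaB | 968 | 0.5 | 0.44 |
0.8-inf; 3-inf; 0.8-5; (0.8-3); 5-20 (3-20); 20-180; 180-2000 | protists and metazoa | 18S metaB | 850 | 1.9 | 1.7 |
0.8-inf; 3-inf; 0.8-5; (0.8-3); 5-20 (3-20); 20-180; 180-2000 | protists and metazoa | metaG | 401 | 160 | 83 |
0.8-inf; 3-inf; 0.8-5; (0.8-3); 5-20 (3-20); 20-180; 180-2000 | protists and metazoa | metaT (polyA RNA) | 441 | 160 | 86 |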
Transcriptomes | protists | De novo sequencing | 78 cultured organisms | 30 | 2.1 |
SAGs samples | protists | De novo sequencing | 281 single cells | 33 | 11 |
TOTAL | | | 4 383 communities | | 252 billion paired reads |
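
The community total in Table 1 can be checked with a short tally. The sketch below is only an illustration: it assumes that the multi-valued cells in the "giruses and prokaryotes" and "protists and metazoa" rows map positionally onto the omics types listed in the same row, and the names `COMMUNITIES`, `total_communities` and `communities_by_omics` are hypothetical, not part of any Tara Oceans pipeline. Read totals are not recomputed here because the per-row values in the table are rounded.

```python
from collections import defaultdict

# Community counts transcribed from Table 1 (size-fractionated samples only;
# cultured transcriptomes and single cells are not counted as communities).
# Assumption: values in multi-valued cells are matched positionally to the
# omics types listed in the same row.
COMMUNITIES = {
    ("phages", "metaG"): 112,
    ("giruses", "metaG"): 73,
    ("giruses and prokaryotes", "metaT (random priming)"): 153,
    ("giruses and prokaryotes", "metaG"): 243,
    ("giruses and prokaryotes", "16S metaB"): 1142,
    ("protists and metazoa", "16S metaB"): 968,
    ("protists and metazoa", "18S metaB"): 850,
    ("protists and metazoa", "metaG"): 401,
    ("protists and metazoa", "metaT (polyA RNA)"): 441,
}


def total_communities(counts):
    """Sum the community counts over all (target group, omics type) pairs."""
    return sum(counts.values())


def communities_by_omics(counts):
    """Aggregate community counts per omics type across target groups."""
    totals = defaultdict(int)
    for (_target, omics), n in counts.items():
        totals[omics] += n
    return dict(totals)


if __name__ == "__main__":
    print("Total communities:", total_communities(COMMUNITIES))
    for omics, n in sorted(communities_by_omics(COMMUNITIES).items()):
        print(f"{omics}: {n}")
```

Under the positional assumption stated above, the per-row counts add up to the 4 383 communities reported in the TOTAL row.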