We used high-throughput DNA/RNA sequencing to assess the genetic complexity of Tara Oceans plankton samples. Multi-marker DNA metabarcoding (metaB) provided a first overview of the diversity and abundance of prokaryotic and eukaryotic taxa inhabiting the global ocean. Metagenomics (metaG) and metatranscriptomics (metaT) revealed the gene content and functional potential of viruses, prokaryotes and eukaryotes, generating large databases that can be leveraged to gain insights into population structure, gene evolution and mutation pressure across the three kingdoms of life. In addition, metaT provides information about gene expression under natural conditions for both eukaryotes and prokaryotes. These efforts have led to the creation of large metabarcode and gene collections (Deliverables F), annotated for their taxonomic and functional content as well as their abundance in the ocean. They form a unique resource not only for marine biology and ecology, but also for other disciplines.

As marine eukaryotes are poorly represented in public databases, additional efforts were made to produce new reference genomes from uncultured protists, isolated by flow cytometry and then subjected to whole-genome amplification and de novo sequencing, as well as new reference transcriptomes from cultured marine protists. Both provide taxonomic references in poorly represented areas of the tree of life. From an experimental point of view, the processing of Tara Oceans ‘omics’ samples has been very challenging because open-ocean sampling constraints limited the quantity of nucleic acids that could be extracted. Consequently, molecular biology protocols were either developed specifically for Tara Oceans samples or carefully selected from existing ones, in order to limit technical biases and ensure the cross-comparability of the results (Alberti et al., Sci. Data, in review). For example, different cDNA synthesis protocols were applied to generate metaT data depending on whether the sampled community was expected to be dominated by eukaryotes (polyA+ mRNA) or by prokaryotes. New “low input” cDNA synthesis methods (Alberti et al., BMC Genomics, 2014) were developed and applied to samples yielding only low amounts of total RNA.

In total, the current sequencing effort has produced >250 billion DNA paired reads from >4 300 size-fractionated plankton communities, making it by far the largest homogeneous multi-omics data set for any biome (>50 terabases of raw data, Table 1). Finally, metagenetics catalogs were produced for viruses (Deliverables F1, 2), prokaryotes (Deliverables F3, 4), and eukaryotes (Deliverables F6, 8-11), using new bioinformatics pipelines adapted to each kingdom of life. These catalogs allow gene-centric analyses to be performed and pave the way toward genome-based meta-analyses.

Table 1. Summary of Tara Oceans ‘omics’ datasets generated from size-fractionated plankton samples and single cells collected across the world oceans (see Fig. 1).

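| Size fractions (μm) | Target groups | Omics type | # of plankton communities analysed | Mean # of reads/sample (millions of paired reads) | Total # of reads (billions of paired reads) |
|---|---|---|---|---|---|
| < 0.2 | phages | metaG | 112 | 86 | 9.6 |
| 0.2-1.6; 0.1-0.2; 0.45-0.8; 0.2-0.45 | giruses (giant DNA viruses) | metaG | 73 | 111 | 9 |
| 0.2-1.6; 0.2-3 | giruses and prokaryotes | metaT (random priming) | 153 | 160 | 33 |
| 0.2-1.6; 0.2-3 | giruses and prokaryotes | metaG | 243 | 117 | 25 |
| 0.2-1.6; 0.2-3 | giruses and prokaryotes | 16S metaB | 1142 | 0.5 | 0.44 |
| 0.8-inf; 3-inf; 0.8-5; (0.8-3); 5-20 (3-20); 20-180; 180-2000 | protists and metazoa | 16S metaB | 968 | 0.5 | 0.44 |
| 0.8-inf; 3-inf; 0.8-5; (0.8-3); 5-20 (3-20); 20-180; 180-2000 | protists and metazoa | 18S metaB | 850 | 1.9 | 1.7 |
| 0.8-inf; 3-inf; 0.8-5; (0.8-3); 5-20 (3-20); 20-180; 180-2000 | protists and metazoa | metaG | 401 | 160 | 83 |
| 0.8-inf; 3-inf; 0.8-5; (0.8-3); 5-20 (3-20); 20-180; 180-2000 | protists and metazoa | metaT (polyA RNA) | 441 | 160 | 86 |
| Transcriptomes | protists | De novo sequencing | 78 cultured organisms | 30 | 2.1 |
| SAGs samples | protists | De novo sequencing | 281 single cells | 33 | 11 |
| TOTAL | | | 4 383 communities | | 252 billion paired reads |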
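
The TOTAL row of Table 1 can be cross-checked by summing the community counts of the nine plankton rows; the cultured transcriptomes and single cells are reported separately, as organisms and cells. The following is a minimal Python sketch of that check. The dictionary keys and the helper name `total_communities` are illustrative choices for this example, not part of the Tara Oceans pipelines, and the giruses metaG row is read as 73 communities.

```python
# Community counts per Table 1 row, keyed by (target group, omics type).
# Values are read from Table 1; the key labels are shorthand for this sketch.
communities = {
    ("phages", "metaG"): 112,
    ("giruses", "metaG"): 73,  # assumption: this row is read as 73 communities
    ("giruses and prokaryotes", "metaT"): 153,
    ("giruses and prokaryotes", "metaG"): 243,
    ("giruses and prokaryotes", "16S metaB"): 1142,
    ("protists and metazoa", "16S metaB"): 968,
    ("protists and metazoa", "18S metaB"): 850,
    ("protists and metazoa", "metaG"): 401,
    ("protists and metazoa", "metaT"): 441,
}

# Total reported in the TOTAL row of Table 1.
REPORTED_TOTAL = 4383


def total_communities(counts):
    """Sum the number of plankton communities over all table rows."""
    return sum(counts.values())


computed = total_communities(communities)
print(f"computed={computed}, reported={REPORTED_TOTAL}, match={computed == REPORTED_TOTAL}")
```

The same pattern could be extended to the read totals, but the per-sample means in Table 1 are rounded, so a product of counts and means should only be expected to approximate the reported totals.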