Metagenomics and its connection to microbial community organization
Abstract
Microbes dominate most global biogeochemical cycles, and microbial metagenomics (studying the collective microbial genomes) provides invaluable new insights into microbial systems, independent of cultivation. Metagenomic approaches targeting specific genes, e.g. small subunit (ssu) ribosomal RNA (rRNA), can be used to investigate microbial community organization by efficiently showing which taxa of organisms are present, while shotgun approaches show all genes and can indicate what functions the organisms are capable of. But collecting and organizing comprehensive shotgun data is extremely challenging and costly, and, in theory, predicting functionalities from microbial identities alone would save immense effort. However, we don’t yet know to what extent such predictions are applicable.
Keywords
- PCR, polymerase chain reaction
- rRNA, ribosomal RNA
- ssu, small subunit
Introduction
Microbes are critical to the functioning of all ecosystems on Earth, not to mention most animals including ourselves [1], and are often the dominant players in most biogeochemical cycles (C, N, S, etc. [2]), so understanding the makeup and organization of microbial communities is crucial to understanding natural systems. Traditional studies that relied on cultivation missed the vast majority of organisms, with rare exceptions. But newer methods can assess microbial communities by studying collective community DNA. This article will discuss how this approach has evolved considerably, yielding several important discoveries, and now generates a veritable tsunami of sequence data. While such data contain immense amounts of useful information, we certainly do not need all of it to ascertain community organization. But are shortcuts suitable?
The development of metagenomics
In the 1980s, Norman Pace’s lab introduced the game-changing idea that microorganisms could be studied by the wholesale extraction of mixed microbial nucleic acid (DNA and RNA) from environmental samples and then analysis of the sequences, first RNA and then DNA [3,4]. Originally, when sequencing was time-consuming and costly, only certain phylogenetic marker genes were analysed, initially rRNA. Once polymerase chain reaction (PCR) was invented, it was used to selectively amplify rRNA genes, which were then cloned and sequenced to indicate which organisms were present [5]. The ssu rRNA (16S and 18S) genes were used because they are universally present in cellular life and allow every organism to be placed on a single phylogenetic “tree of life” [6]. With such an approach, any organism, even one that is uncultured and distantly related to anything previously studied, could be put into phylogenetic context. So, we could finally list what kinds of microbes occurred where, with the rRNA sequences providing “names”. This approach yielded remarkable and unexpected discoveries, such as the existence and high abundance of “non-extremophile” marine archaea in a novel major division deeply related to thermoacidophiles [7,8]. Dozens of new major microbial divisions, at the phylum or perhaps even kingdom level, were discovered, greatly expanding our view of the microbial universe [6,9].
As sequencing got cheaper, more than just phylogenetic marker genes could be studied, allowing us, in theory, to predict the potential functions of the collective organisms in a sample. A new name was coined in 1998 when Handelsman used the term “metagenome” to describe the collective genomes of soil microflora [10], and now “metagenome” is used to describe the collective genomes of any sample (usually microbial). Handelsman’s lab and DeLong’s lab were among the first to examine large cloned fragments (>40 kb) of genomic DNA extracted from nature, with a goal of linking organisms and functions. Early on, Beja et al. reported a marine proteorhodopsin [11], among what we now know are incredibly widespread rhodopsins in many bacterial and archaeal lineages [11-13], apparently with an evolutionary origin in Euryarchaea [14]. Although many such rhodopsins appear to function as light-driven proton pumps, few organisms with the gene seem to gain a direct growth benefit from light, and the ecological functions of these rhodopsins are still enigmatic [13,15,16], a reminder that even well-studied genes may have unclear functions.
In contrast to metagenomics with large DNA fragments, “random shotgun sequencing” uses a different approach where the DNA is fragmented into pieces a few thousand bases long, cloned and sequenced (at least the ends), and assembled. Assembly is on the basis of overlapping identical sequences and the knowledge that the two ends of a single fragment are connected. This shotgun assembly approach was used by Venter et al. [12] for the Global Ocean Survey, yielding many discoveries [17,18]. One such assembled fragment pointed to the possibility that the marine archaea oxidize ammonia to nitrite, a key step in the global nitrogen cycle, a function previously thought confined to bacteria [12]. Metagenomics further clarified this unexpected archaeal function, with a fosmid-based study in soils [19] that showed an ammonia oxidation gene unambiguously connected to archaeal genes, and this functionality was confirmed by cultivation of an ammonia oxidizing archaeon, whose isolation was driven by metagenomic discoveries [20]. We now recognize that such archaea, unknown until 1992, are major players globally in the nitrogen cycle of waters and soils, with many implications for ecology and agriculture [21].
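The overlap-based assembly described above can be illustrated with a toy example. This is only a minimal sketch of the general idea, assuming error-free reads that all come from the same strand; it uses a simple greedy merge and ignores the paired-end information that real shotgun assemblers also exploit. The function names and the `min_overlap` parameter are hypothetical, chosen for illustration, and are not taken from any of the cited studies.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that exactly matches a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the two sequences sharing the longest exact overlap.

    Stops when no remaining pair overlaps by at least `min_overlap` bases
    and returns whatever contigs are left.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        # Join the pair, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    toy_reads = ["ATGGCGT", "GCGTACG", "TACGGAT"]  # made-up sequences
    print(greedy_assemble(toy_reads))
```

With these three made-up reads, each consecutive pair shares four bases, so the sketch joins them into a single contig. Real metagenomes are far harder: sequencing errors, repeats, and closely related strains all produce ambiguous overlaps, which is one reason low-diversity samples were assembled first.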
Metagenomics can also be used to ascertain essentially complete genomes of uncultivated organisms – stitching them together bioinformatically from fragments. Initially, this was done from low diversity samples like acid mine drainage where only a few taxa dominated, making the job easier [22]. Next generation sequencing of metagenomes, which requires no cloning steps, has now enabled such work in very complex environments like cow rumen, where 268 gigabases of DNA sequences were used to assemble 15 microbial genomes [23], and 58 gigabases of mate-paired short-read sequences allowed assembly of several near-complete genomes from uncultivated, relatively minor constituents of complex marine samples [14].
These metagenomic studies have greatly expanded our knowledge of what organisms occur in the “wild” and what collections of functions they possess, but how do they contribute to our understanding of microbial community organization? And more to the point, is metagenomics a suitable, efficient, and cost-effective approach to routinely assess microbial community organization? Metagenomic studies are generating terabases of raw sequence, which are hard to transmit between labs, let alone readily compare across studies or even easily comprehend. Obviously, we don’t want (and usually can’t afford) to analyse gigabases of sequence just to assess which organisms are in one sample, when we might need to analyse hundreds or thousands of samples in one study. For such questions, another version of metagenomic analysis is more suitable: a logical extension of the original PCR approach used to yield individual rRNA clones, in which the ssu rRNA genes are first amplified and the products are then sequenced directly. In this “tag-sequencing” approach, next generation sequencing effectively supercharges the data collection of phylogenetic marker genes like 16S rRNA, generating a million or more rRNA sequences in a single run, with numbers and lengths of sequences depending on the sequencing platform used, e.g. 454 [24-27] or Illumina [28]. Barcoding allows hundreds of samples to be combined in one run, easily yielding tens of thousands of tag sequences per sample at reasonable cost per sample. Therefore, even rare organisms are readily detected and compared across samples or globally [25]. And the data can be readily compared, ideally when the same primers are used, because all are based on a single gene that has been incredibly well characterized phylogenetically [9].
Another advantage of this approach over the shotgun approach arises when one is interested in the bacteria and archaea but cannot separate them well from large amounts of animal, plant, or protistan biomass, in which case the shotgun sequences would be dominated by eukaryotic DNA in the bulk extract. Targeted bacterial/archaeal PCR primers amplify only the DNA of interest, although chloroplast sequences also amplify and appear among the cyanobacteria.
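The barcoding step mentioned above for tag sequencing can be sketched as follows. This is an illustrative sketch only, assuming that every barcode has the same length, sits at the very start of the read, and is matched exactly; production pipelines typically also tolerate sequencing errors in the barcode. The function name, the barcode sequences, and the sample names are all hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Assign each read to a sample using its leading barcode.

    The barcode is trimmed from assigned reads. Reads whose prefix matches
    no known barcode are kept, untrimmed, under the key "unassigned".
    """
    barcode_len = len(next(iter(barcode_to_sample)))  # assumes equal lengths
    per_sample = defaultdict(list)
    for read in reads:
        sample = barcode_to_sample.get(read[:barcode_len])
        if sample is None:
            per_sample["unassigned"].append(read)
        else:
            per_sample[sample].append(read[barcode_len:])
    return dict(per_sample)


if __name__ == "__main__":
    # Hypothetical barcodes and samples, for illustration only.
    barcodes = {"ACGT": "surface_water", "TGCA": "deep_water"}
    pooled_reads = ["ACGTGGATTAGC", "TGCAGGATCAGC", "ACGTGGATTTGC", "NNNNGGATTAGC"]
    for sample, tags in demultiplex(pooled_reads, barcodes).items():
        print(sample, len(tags))
```

Because each tag sequence is traced back to its sample in this way, many samples can share one sequencing run without their reads being confused.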
Which metagenomic approach is best for community organization?
From a metagenomic sample, tag sequencing efficiently provides the distribution of phylogenetic/taxonomic types with considerable sensitivity and depth of coverage. Shotgun sequencing provides genome-wide information about all potential functionalities, and has yielded remarkable results in many systems [29-34]. But shotgun results are “diluted” and most informative about the more abundant members of the community, providing much less information about rarer organisms. So, which is more valuable, tag or shotgun, for evaluating community organization? If all you want are identities, tag sequencing is the clear choice in terms of “bang for the buck”. But if you care about functional types, will tag sequences do? How well can we predict functions from taxonomy, i.e. how closely correlated are phylogenetic marker genes to the ones that define functions? If they are well correlated, then identities alone may suffice. The question is important because microbes can have remarkably plastic genome content, even in a single species. For example, the genomes of two Escherichia coli strains can differ by as much as a third, and sometimes two organisms with extremely close 16S rRNA sequences have significant differences in their major functions [35]. If such variation were the norm and happened randomly, then predicting functions from identity alone would be almost hopeless. Yet it does not seem hopeless; at least for some habitats, there is evidence that particular phylogenetic types have predictable distributions in time and space, and such predictability suggests that particular functions correlate consistently with particular identities (from phylogenetic markers). One set of examples comes from two different long-term ocean plankton time series off California and England, where microbial communities exhibit annually repeating patterns of community composition, whether measured by community fingerprinting [36] or 16S tag sequencing [37].
Another example is the highly structured and predictable global distributions of closely related varieties of the abundant marine cyanobacterium Prochlorococcus, suggesting niche partitioning [38]. A further example is the consistent co-occurrence patterns of microbes, as identified by 16S rRNA sequences, across multiple habitats [39]. These robust patterns would not exist if niche-defining functions were not well correlated to marker-gene based identities. Also consistent with such correlations, a large study of gut microbiota of 18 humans and 33 mammals, as related to diet, showed strong concordance between patterns of 16S rRNA and functional gene distributions [1].
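One simple way to make the idea of "concordance" between marker-gene and functional-gene patterns concrete is sketched below. This is not the analysis used in the cited gut study; it is a hedged illustration that assumes each sample is summarized by two abundance vectors (taxa and functional genes), computes Bray-Curtis dissimilarities between every pair of samples for each data type, and then correlates the two sets of dissimilarities (a Mantel-style statistic, without the permutation test normally used to assess significance). All names and numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors (0 = identical)."""
    total = sum(a) + sum(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / total if total else 0.0


def pairwise_dissimilarities(profiles):
    """Dissimilarity for every pair of samples, in a fixed pair order."""
    return [
        bray_curtis(profiles[i], profiles[j])
        for i, j in combinations(range(len(profiles)), 2)
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)


def profile_concordance(taxa_profiles, function_profiles):
    """Correlate sample-to-sample dissimilarities from the two data types."""
    return pearson(
        pairwise_dissimilarities(taxa_profiles),
        pairwise_dissimilarities(function_profiles),
    )


if __name__ == "__main__":
    # Made-up counts: rows are samples, columns are taxa or functional genes.
    taxa = [[10, 0, 0], [8, 2, 0], [0, 0, 10]]
    functions = [[5, 5, 0, 0], [4, 5, 1, 0], [0, 1, 4, 5]]
    print(round(profile_concordance(taxa, functions), 3))
```

A value near 1 would mean that samples with similar taxonomic composition also have similar functional gene content, which is the situation in which identities alone might suffice; a value near 0 would suggest that taxonomy says little about function in that data set.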
Figure 1.
Schematic of shotgun and targeted metagenomic analysis
The metagenome is the collective genomic DNA of organisms (usually microorganisms) in a given sample. After sampling, and sometimes attempts to separate microorganisms from larger organisms, DNA is extracted from the total biomass. For overall genomic analysis, older studies cloned various size fragments and sequenced them. Next generation shotgun sequencing studies now generally fragment the DNA and sequence the fragments directly, assembling some of them into overlapping ‘contigs’, larger ‘scaffolds’, or even whole genomes. In contrast, tag sequencing uses polymerase chain reaction to amplify specific genes of interest, most often 16S rRNA, and the amplified fragments are sequenced directly.
Future prospects
It remains to be seen how consistently identity from tag sequences correlates to functionality in non-marine environments, like soils and animal or plant microbiomes. Marine planktonic bacteria, which tend to be free-living and survive on low levels of nutrients, have streamlined genomes compared with most studied bacteria [40], and these genomes are probably more stable than those of other organisms such as potential pathogens [41]. rRNA tag sequencing alone is therefore unsuitable for clearly identifying pathogens. The phylogenetic resolution of the selected tag sequences also matters, and we need widely collected shotgun data and curated database systems [42,43], as well as sequenced genomes from infrequently studied organisms [44], to link functionalities to identities more broadly. Efforts like the Earth Microbiome Project [45] (http://www.earthmicrobiome.org) are working to integrate such information from samples collected globally to assess worldwide patterns of microbial diversity.