Reference Databases For Taxonomic Assignment In Metagenomics Of The Human


Motivation: Deep metagenomic sequencing of biological samples has the potential to recover otherwise difficult-to-detect microorganisms and accurately characterize biological samples with limited prior knowledge of sample contents. Existing metagenomic taxonomic classification algorithms, however, do not scale well to analyze large metagenomic datasets, and balancing classification accuracy with computational efficiency presents a fundamental challenge.

Results: A method is presented to shift computational costs to an off-line computation by creating a taxonomy/genome index that supports scalable metagenomic classification. Scalable performance is demonstrated on real and simulated data to show accurate classification in the presence of novel organisms on samples that include viruses, prokaryotes, fungi and protists. Taxonomic classification of the previously published 150 giga-base Tyrolean Iceman dataset was found to take <20 h on a single node 40 core large memory machine and provide new insights on the metagenomic contents of the sample.

Availability: Software was implemented in C++ and is freely available at

Supplementary information:Supplementary data are available at Bioinformatics online.


Metagenomics is a powerful tool for assessing the functional and taxonomic contents in biological samples. Early shotgun metagenomics projects used giga-bases of genetic data (Venter et al., 2004) to demonstrate accurate sample surveys with less bias than previous methods. The potential to detect even lower abundance organisms and provide more accurate surveys across a broad spectrum of biological environments is being advanced now by sequencers reported to generate up to 1.3 mega-bases per second (Knight et al., 2012) (Calculated by dividing total base output by total number of sequencer hours run for the HiSeq 2500 rapid-run mode. Excludes library and sample preparation time).

Increased sequencing throughput presents a major scaling challenge to existing shotgun metagenomic classification algorithms (Drge and McHardy, 2012). The ability for an algorithm to scale can be measured by the difference between sample classification run time and sequencer run time and assumes sufficient computing resources for each sequencer run. Scaling is being addressed through the use of larger compute clusters, which can be managed by a third party service (Cloud computing) (Schatz et al., 2010). As sequencer use grows, however, algorithms that run on a single node and scale with sequencer output could be paired with individual sequencers and eliminate the need for high bandwidth network connections, which are not always available.

In this article, we attempt to meet the scaling goal, running fast and accurate taxonomic sample classification on a single compute node to match analysis throughput with sequencer output. Two major design choices were made, which present possible limitations: (i) a larger than typically used single address space memory resource is exploited (0.5–1 terabytes) and (ii) larger search seeds are used than default sensitive BLAST settings for matching reads to a reference database. Relaxing conventional memory constraints allows a reference genome database to be annotated with taxonomic information and indexed to support fast metagenomic taxonomy classification of every sequencer read for all microbial taxa, including virus, prokaryotes, fungi and protists. Decreasing memory costs make this approach accessible to many practitioners because the cost of a single large memory compute node remains a fraction of the initial sequencer cost and need not require specialized system administration expertise. Large search seed sizes can potentially limit the ability to detect novel organisms but nonetheless is proving to be more effective, as the number of microbes with representative reference genomes grows for environments like the human microbiome (Martin et al., 2012).

Our goal is to efficiently assign taxonomic labels to the reads down to the species level for reads with reference representation and maintain accuracy in the presence of novel organisms by avoiding overly specific (e.g. species and strain) taxonomic assignments. This alleviates the computational bottleneck by limiting the number of unlabeled reads subjected to additional computational interrogation. The results show comparable or better accuracy than existing methods, and even with novel genomes in a sample, accurate and scalable classification is obtained in the vast majority of cases. The methods are made available as an open source software package, Livermore Metagenomics Analysis Toolkit (LMAT).

Existing bioinformatic approaches address scalability in three ways: query size reduction, reference database size reduction and faster database search. Query size reduction is achieved with metagenomic assembly and clustering, which merges overlapping and redundant reads into longer contiguous genomic segments (Pell et al., 2012). Metagenomic assembly improves the strength of the taxonomic signal contained in individual short reads but careful parameter settings are required to avoid mis-assembly, and assembly costs could remain high (Mende et al., 2012; Teeling and Glöckner, 2012). Reference database size reduction is achieved through the use of genetic markers storing only the more informative sequences (Berendzen et al., 2012; Liu et al., 2011; Segata et al., 2012). Marker-based approaches offer efficient summarization of metagenomic contents, but only cover a portion of the query set, leaving novel and other informative reads buried within the larger pool of unclassified reads, which could require additional examination (Mohammed et al., 2011). A less lossy approach reduces sequence redundancy by storing only the genetic differences among reference genomes. This approach was shown to speed up BLAST and BLAT genome database searches (Loh et al., 2012). Faster database search methods apply larger search seeds, and examples include BLAT (Sharma et al., 2012), BWA (Davenport et al., 2012) and other read mapping tools (Martin et al., 2012), but analyzing the search results remains a challenge with some approaches selecting the lowest common ancestor (LCA) of multiple matches and others using variants of a best match selection procedure to improve rank specificity of the reported taxonomic label. Moreover, parameter settings of the search tools can dramatically alter the outcome of the reported label and must be considered carefully (Mande et al., 2012).

Our approach uses faster search using larger seeds (k-mers) with a non-redundant search of taxonomic identifiers associated with the k-mers found in the reference genome database. Our k-mer/taxonomy database supports efficient retrieval of detailed taxonomic information and allows for an exhaustive comparison between competing taxonomic assignments using a novel rank-flexible classification procedure. Our new classification algorithm invokes variants of LCA and best match selection depending on the context of the search results. The approach differs from compositional binning methods (Leung et al., 2011), as it uses larger values for k (17–20) and, unlike alignment search, each k-mer is mapped to the individual source genomes minus the genome position. The method compensates for the lack of positional information by resolving the multiple k-mer/taxonomy associations recovered during search to assign each read the most rank-specific taxonomy identifier possible.


2.1 k-mer/taxonomy database

A reference genome database consists of a collection of genome sequences with each genome sequence assigned a taxonomic identifier. The first step is to convert this ‘raw’ reference genome database into a searchable k-mer/taxonomy database by storing every overlapping k-mer along with select taxonomic IDs. A reference database was constructed from complete and partial microbial genome sequences from the NCBI genome database on October 17, 2011. The ‘raw’ genome database included 301 935 distinct genomic segments (plasmids, chromosomes and other genomic segments) and contains 67 073 viral, 4366 bacterial and 236 archaea segments. The remaining segments are shared among draft eukaryotic microbial genomes that were included as assemblies with many contigs and supplementary mitochondrial genomes from eukaryotes. The reference set includes 1272 bacterial species, 121 archaeal species, 3048 viral species and 335 eukaryotic species. Microbial genome segments range in length from a small number of single read contigs of length less than 100 bases up to a 13 033 779-base chromosome (for Sorangium cellulosum).

Taxonomic IDs represent nodes in the taxonomy tree and cover all ranks from an individual genome or strain up to the highest order domains. Figure 1 shows an example representation for the searchable database. As input, database construction requires (i) an NCBI taxonomy tree, (ii) a reference genome sequence database and (iii) mappings between the genome sequence identifiers and taxonomy identifiers from the taxonomy tree. Then, all overlapping k-mers from the genome database are computed, and the LCA for the taxonomic IDs for each k-mer is identified. Finally, a post-order tree traversal up to the LCA counts the number of genomes that contain the k-mer for each taxonomy node in the traversal. Our initial expectation was to use k-mer counts to weigh each k-mer’s contribution to a candidate taxonomic label assignment. However, the weighting procedure was found to be sensitive to genome representation bias and therefore a binary scoring scheme was chosen (see Fig. 2).

Fig. 1.

Example k-mer/taxonomy database. Input includes the taxonomy tree with interior tree nodes , and leaf node genomes , all of which are labeled with taxonomy IDs. k-mers (k-mer1, k-mer2, k-mer3) are linked to their source genomes (dotted circles) and their taxonomy hierarchy up to the LCA

Fig. 1.

Example k-mer/taxonomy database. Input includes the taxonomy tree with interior tree nodes , and leaf node genomes , all of which are labeled with taxonomy IDs. k-mers (k-mer1, k-mer2, k-mer3) are linked to their source genomes (dotted circles) and their taxonomy hierarchy up to the LCA

Fig. 2.

Example scoring procedure. The query read is converted to k-mers (k-mer1, k-mer2, k-mer3), and their associated taxonomy information retrieved from the database. A classification table is created with columns for candidate taxonomic IDs and rows representing a specific k-mer, with a binary entry reporting the presence or absence of the k-mer in some genome associated with the taxonomic node. The last row shows k-mer row sum divided by the total number of k-mer rows. The underlined entries highlight nodes that are created at run time

Fig. 2.

Example scoring procedure. The query read is converted to k-mers (k-mer1, k-mer2, k-mer3), and their associated taxonomy information retrieved from the database. A classification table is created with columns for candidate taxonomic IDs and rows representing a specific k-mer, with a binary entry reporting the presence or absence of the k-mer in some genome associated with the taxonomic node. The last row shows k-mer row sum divided by the total number of k-mer rows. The underlined entries highlight nodes that are created at run time

In our experiments, we use a ‘full’ k-mer/taxonomy database (kFull) and a smaller database built from a ‘marker library’ (kML). A marker library contains the most taxonomically informative set of k-mers present in the raw genome database. The marker library is created by separating k-mers into disjoint groups. A k-mer is in exactly one group. Each group has a unique label, which consists of the names of all genomes (or more generally, sequences including genome, plasmid, segment or chromosome) that contain the k-mers in the group. For example, k-mers that occur in exactly the genomes A, B and C are in one group ‘A, B, C’, k-mers that occur only in genomes B and C are another group ‘B, C’ and k-mers in genomes A, C and D are a third group ‘A, C, D’. As a more concrete example, suppose k-mer1 is present in Yersinia pestis KIM, Y.pestis CO92 and Yersinia pseudotuberculosis IP32953. k-mer2 is present in Y.pestis CO92 and Y.pseudotuberculosis IP32953. Both k-mers have the LCA of Yersinia, but in building the marker library, they are in two separate groups based on the exact set of genomes that contain the k-mer: k-mer1’s group is labeled ‘Y.pestis KIM, Y.pestis CO92, Y.pseudotuberculosis IP32953’, and k-mer2’s group is labeled ‘Y.pestis CO92, Y.pseudotuberculosis IP32953’. All the k-mers in groups containing more than 1000 k-mers are included in the marker library. So, if there are 1000 k-mers in k-mer1’s group, then all those k-mers go into the marker library. If there are 999 k-mers in k-mer2’s group, then none of those k-mers go into the marker library. k-mers whose LCA is above the taxonomic rank of family are not included in the marker library. A k-mer/taxonomy database is created from the marker library’s set of k-mers.

2.2 Scoring a read’s taxonomic IDs

In the k-mer/taxonomy database, each k-mer of length k is associated with a list of taxonomic IDs as outlined in Section 2.1. The first step of determining a taxonomy ID for a read from the query set is to assign a score to each taxonomy ID of each k-mer in the read. The score is derived from the proportion of k-mers of the read that occurs under that taxonomy node normalized by the proportion of k-mers of a random read that also appears under that taxonomy node.

To illustrate the details of the scoring algorithm, the example in Figure 2 shows a query read with three k-mers. The tax IDs of the read’s constituent k-mers are retrieved from the k-mer/taxonomy database. For each read s of length l, a binary classification table C of size is constructed, in which the K rows represent the k-mers retrieved from the read and the T columns represent candidate taxonomic IDs (k-mers are stored in the database in canonical order removing strand specificity. Only canonically ordered k-mers are used to query the database.). A ‘1’ entry indicates the tax ID (column) belongs to the associated k-mer (row). For each tax ID j, the proportion of k-mers having that tax ID is computed (shown in Fig. 2 in the last row): . In the example, for genome G2.

The scoring method uses a random model to limit genome representation bias in the database and avoid assigning taxonomic labels by random chance when a novel organism is not represented in the reference database. The score Sj of read s for a taxonomic ID j is defined as , where PRj represents the proportion of k-mers associated with taxonomy ID j in the random model. The random model estimates PRj, the chance of assigning a read of length l to taxonomy ID j owing to random chance (for simplicity the random score is not shown in Fig. 2). The random model is precomputed for read length l and assumes a random nucleotide composition, which can optionally sample explicitly from a range of GC content values. Reads are randomly generated, and then are searched against the database to calculate a Pj value for each random read r and observed taxonomy ID j. PRj is set to the maximum observed value . Further details of random model construction are included in the Supplementary Material.

2.3 Rank-flexible read classification

Taxonomy classification combines LCA selection with the read label score evaluation. Candidate labels are examined in order by decreasing read label score as shown in Figure 3 (for illustration, the numerator of the read label score Pj is shown). The most specific taxonomic label is selected such that no other taxonomic label from a conflicting lineage has a comparable read label score. Comparable is defined as a score within one standard deviation of the best candidate using the read label score distribution for the read. The best candidate is found using the taxonomic lineage from the highest scoring taxonomy label. The path from the highest scoring node to the LCA is created (LCAs for individual k-mers identified off-line are used as a starting point to find the LCA for all retrieved k-mers online, which in some cases can reduce run time costs) and each subsequent label is evaluated for consistency with the lineage. When conflicting labels are encountered, the lineage is pruned further up the tree to resolve the conflict. In the example shown in Figure 3, the lineage from G1 to the LCA n5 is constructed first. When G3 is examined, it is identified as conflicting with G1 and the candidate lineage is pruned to node n1. The process continues until the ninth label is encountered (), which for the purpose of illustration has a score below the threshold of comparable scores. The classification procedure terminates with the read assigned label n3.

Fig. 3.

Label selection process. The list of candidate taxonomic labels ordered 1 through 9 is sorted by label score (G1,1.0), … ,(G7,0.33). Steps 1, 2, 6 and 9 where an action is performed are shown. At step 1, the taxonomy lineage is constructed from the best first label G1. Step 2, G3 conflicts with G1 and the lineage is pruned to n1. Step 6, G4 conflicts with n1 and the lineage is pruned to n3. For demonstration at step 9, score 0.33 for the G7 label is below threshold and the procedure terminates and returns n3 as the classification


An increasing amount of species and gene identification studies rely on the use of next generation sequence analysis of either single isolate or metagenomics samples. Several methods are available to perform taxonomic annotations and a previous metagenomics benchmark study has shown that a vast number of false positive species annotations are a problem unless thresholds or post-processing are applied to differentiate between correct and false annotations. MGmapper is a package to process raw next generation sequence data and perform reference based sequence assignment, followed by a post-processing analysis to produce reliable taxonomy annotation at species and strain level resolution. An in-vitro bacterial mock community sample comprised of 8 genuses, 11 species and 12 strains was previously used to benchmark metagenomics classification methods. After applying a post-processing filter, we obtained 100% correct taxonomy assignments at species and genus level. A sensitivity and precision at 75% was obtained for strain level annotations. A comparison between MGmapper and Kraken at species level, shows MGmapper assigns taxonomy at species level using 84.8% of the sequence reads, compared to 70.5% for Kraken and both methods identified all species with no false positives. Extensive read count statistics are provided in plain text and excel sheets for both rejected and accepted taxonomy annotations. The use of custom databases is possible for the command-line version of MGmapper, and the complete pipeline is freely available as a bitbucked package ( A web-version ( provides the basic functionality for analysis of small fastq datasets.

Citation: Petersen TN, Lukjancenko O, Thomsen MCF, Maddalena Sperotto M, Lund O, Møller Aarestrup F, et al. (2017) MGmapper: Reference based mapping and taxonomy annotation of metagenomics sequence reads. PLoS ONE 12(5): e0176469.

Editor: Lingling An, University of Arizona, UNITED STATES

Received: November 14, 2016; Accepted: April 11, 2017; Published: May 3, 2017

Copyright: © 2017 Petersen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files.

Funding: This study was funded by the European Community's Seventh Framework Programme [FP7/2007–2013], under grant agreement n°613754, and by The Villum Foundation (VKR023052). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.


The advances in rapid and efficient DNA sequencing technologies have made it possible to study microbial communities from a wide variety of environments, such as sediments [1][2], water [3], ice [4], and humans [5][6]. Among the known DNA sequencing platforms, Illumina HiSeq and MiSeq are often preferred for both single genome and metagenomics studies, due to the large data output and a relatively low cost per base pair. Applying the whole genome shotgun sequencing technique, all DNA in a biological sample is sequenced and several millions of short read nucleotide sequences are produced. Metagenomics data from a single human gut sample is a complex system representing hundreds of organism and even more diversity is expected when samples originate as a mixture from many individuals e.g. humans or animals from sewage systems, public transportation sites or animal farms. The interest in analyzing such datasets may be monitoring of bacterial or viral pathogens, identification of anti-microbial resistance genes, phage identification or simply obtaining a complete catalogue of the organisms that are present. Such analysis is not straight-forward, it requires programs that can perform the mapping of fastq sequence reads to many reference sequence databases without an extensive memory usage, parsing and validating sequence alignment hits, taxonomy annotation with a reduced false positive rate and finally presenting output of the taxonomy analysis and also providing files for further downstream analysis (SNP or contig assembly. MGmapper was made to provide an access for routine analysis of complex datasets, enabling usage of many whole genome reference sequence databases like bacteria, virus, fungi, plant, vertebrate-mammals, invertebrates and also enables the usage of gene databases like anti-microbial resistance genes, 16S rRNA or any custom database based on a set of fasta sequences. The huge fasta sequence databases like plant (208gb fasta) vertebrate mammals (316gb fasta) or invertebrates (150gb fasta) can be split into smaller chunks (10gb fasta) thus the total memory requirement is reduced to 30–40gb memory when running MGmapper. Also, most tasks are run in parallel for fast execution.

The task to assign each of those nucleotide reads to the genome that they represent is challenging and the problem of false positive predictions is always an issue to be considered for alignment based methods where a query sequence is mapped against a large database of target sequences. As target databases increase in size, the chance of finding hits for random reasons also increases. For decades the Blast program suite [7] has been one of the most frequently used programs for pairwise alignment of a query sequence against a large database of target sequences. Blast utilizes a filter in form of an expect value as a threshold, to reduce the number of false positives. In general methods within the field of taxonomy annotation rarely use filters or cutoffs, but a recent benchmark study showed the need, as several methods vastly over predict the number of species present when evaluated on both in vitro and in silico datasets [8]. The study involved benchmarking of 15 taxonomy annotation methods where two of them, a filtered version Kraken [9] and CARMA3 [10], correctly identified all species present in an in vitro dataset, using a read count abundance threshold at 0.1%. Also, the methods MEGAN4 [11][12] RAPSearch2 [13] performed well with only one false positive species annotation.

The read count abundance measure is biased as more reads are sequenced from larger genomes compared to smaller genomic sequences like viruses and plasmids. Thus a normalization of read count abundances with the reference sequence size can adjust for a skewed performance that favors large reference sequences. In the MGmapper method we have introduced such a measure to reduce the false positive taxonomy annotations. During whole genome shotgun sequencing, all DNA is fragmented and sequenced from one or both ends depending on whether single-end or paired-end reads are produced. Thus our size normalized read count abundance (S_Abundance) is divided by 2 in case paired-end reads are used and multiplied by 100 for convenience i.e. 100*ReadCount/Size(bp)*2, where Size is the length of a reference sequence. Using the normal read count abundance (ReadCount/Reads_in_sample) as in the work by Peabody et al. [8], a threshold at 0.1% appeared to be the best cut-off to differentiate between true and false taxonomy annotations. For the S_Abundance, a threshold of 0.01 was the best cut-off, based on benchmarking data as used by Peabody et al. One drawback of using a size normalized abundance as criterion for true positive annotations is that, in case of small reference sequences, only a few assigned reads are needed to pass the cutoff. Therefore a lower read-count abundance-threshold could be introduced, although the exact value may be difficult to assign.

Identifying a specific reference sequence e.g. a bacterial strain or an anti-microbial resistance gene in a pool of highly similar sequences is a challenge for any taxonomy annotation method. Only a few sequence reads aligned to specific marker regions may enable the differentiation between closely related genes or strains. For this reason the presence of uniquely mapped sequence reads to one specific target reference sequence is a strong indicator that the target sequence is actually present in the sample. K-mer based methods like Kraken [9] or KmerFinder [14] require a 100% identity between a query fragment and a database hit with the annotation for that fragment. Typically k-mers with a of length of 31bp are used and at that size a kmer may be assigned to several reference sequences. For the Kmerfinder [14] method an expectation value is calculated as the number kmers assigned to a specific reference sequence within a database compared to the number of hits to other reference sequences in the same database. The Kraken [9] method utilizes another approach based on identifying the Lowest Common Ancestor (LCA) for each of the kmers originating from a sequence read. A score is calculated as the fraction of kmers that are rooted to a specific taxa compared to total number of LCA kmers that are assigned to a fastq read. As Kraken can calculate a score for a fastq read based on k-mer counts, alignment based methods utilize a sequence alignment score for the individual reads. Alignment based methods can handle nucleotide variations between a query read and a reference sequence, and the alignment score is typically used to differentiate the best alignment from secondary alignments with a lower score. The scoring scheme itself is based on heuristics arguments as in Blast [7], and BWA-mem [15], where scores for a nucleotide match, mis-match and inserts/deletions (INDELs) are summed to an overall alignment score. Thus alignment-based methods provide both numbers for uniquely mapped reads and total number of nucleotide matches, mis-matches and INDELs. Also, in BWA-mem [15] the edit-distance is the number of changes that are needed to obtain a perfect match between a sequence read and a reference sequence.

In summary, the normalized read count abundance, a low-read-count value, the number uniquely mapped reads and the edit-distance are measures used by MGmapper, rather than a single read count abundance threshold, with the aim to reduce the number of false positive taxonomy annotations from next generation sequence data.


Mock bacterial datasets

Two mock bacterial datasets were previously used to benchmark metagenomics classification methods at genus and species level [8]. An in vitro dataset comprised of single-end reads with an average length of 223bp can be downloaded from the metagenomics RAST server [16] with accession id 4545485.3. Also, 4 paired-end in silico datasets were downloaded: 4545483.3 (100bp), 4548385.3 (250bp), 4548991.3 (500bp) and 4548990.3 (100bp). There are a few changes in species nomenclature and dataset composition compared to the taxonomy that was presented in the original work by Peabody et al.[8]. For the sake of clarity the updated taxonomy annotations for both the in vitro and in silico datasets are given in Table 1.

Both the in vitro and the four in silico datasets were mapped against a bacteria and a plasmid reference sequence database. Databases were compiled as subset of entries from the assembly_summary file available at Criteria for sequence selection were defined by these parameters: version_status = ‘latest’, genome_rep = ‘Full’ and assembly_level = 'Complete Genome' or assembly_level = 'Chromosome'. In total the bacteria database is composed of 7451 genomic sequences (created: Feb 23, 2016), where entries with the word ‘plasmid’ in the fasta header were compiled into a separate plasmid database composed of 4429 genomic sequences.


The MGmapper package consists of a pipeline of scripts to process FASTQ files as either single or paired-end reads to perform sequence mapping and taxonomy annotation against user defined reference sequence databases. MGmapper utilizes a number of publicly available programs: Cutadapt [17] for trimming and adaptor removal, BWA-mem [15] and SAMtools [18] to produce and process the reference based sequence alignments to one of many reference sequence databases. A short summary of the procedure is described below for paired-end sequence data, followed by more details outlined in the section “Fastq mapping procedure”.

Initially, a filtering step checks for properly paired reads, followed by trimming and adaptor removal. The biological relevant reads are obtained by always mapping to a PhiX bacteria phage and continuing with the subset of reads that do not align to the PhiX genome (commonly used as a control in Illumina sequencing runs). Next, sequence reads are mapped to user defined reference sequences and only properly paired reads are accepted, provided that both reads pass a lower alignment score threshold and relative alignment length. After mapping reads to all reference sequence databases (eg human, bacteria, fungi etc.), some reads may align to reference sequences in different databases and depending on the mapping mode (bestmode or fullmode explained further down) the best hit is identified and used to assign taxonomy. Taxonomy annotations ( are added via lookup in a pre-made Kyoto Cabinet database ( containing key, value pairs in form of the reference sequence name (the key) and full taxonomy path from strain to superfamily clades (the value). Finally, a post-processing step (section “post-processing”) identifies confident assignments at strain, species, genus or any user defined taxonomy clade up to superfamily.

MGmapper can map sequence reads against any nucleotide sequence database i.e. both genomic and gene sequence databases and for each database the mapping can be performed in either bestmode or fullmode. In bestmode, reads are assigned to only one reference sequence if it is the best hit that is observed when mapping to all databases specified for bestmode mapping. Best hit is identified based on the highest alignment score. In fullmode reads are assigned to a reference sequence even if a better hit is seen when mapping to another database. Typically the fullmode is used to search for sequences (e.g. a gene database), that may be a subset of another database (e.g. a full genome database). Analyzing a sample for both genomic bacterial composition and anti-microbial resistance genes is a situation where MGmapper should be run with the bacterial database specified for bestmode mapping and at the same time specifying the anti-microbial resistance gene database for fullmode mapping. The reason is that the resistance genes are or may be a subset of the bacterial genomic sequences and we want to assign a sequence read both a bacteria genome and also a resistance gene. If both databases were specified for bestmode mapping (bacteria, anti-microbial genes), then a read can only be assigned to one of the databases and if identical alignment scores are observed, then priority is to the database that was specified first.

Fastq read mapping procedure

The MGmapper pipeline analysis is done in four main steps: I. Pre-processing of raw reads to remove potential positive control reads, II. Mapping of reads to one or more reference sequence databases and filtration of alignment hits, III. Identification the best hits, and IV. Post-processing of taxonomy annotations and preparation of excel and text files with insert-size distribution, size normalized abundances, read and nucleotide count statistics, depth, and coverage. A schematic flowchart of the paired-end mapping processing is shown in Fig 1.

Fig 1. A schematic flowchart for processing of paired-end sequences with MGmapper.

MGmapper processes fastq reads in four steps. These consist of: (I) Trimming and mapping reads against a phiX bacteriophage to remove potential positive control reads. (II) Mapping to specified reference databases, post-processing of BWA-mem alignments to remove reads with low alignment score or insufficient alignment coverage. (III) Identification of best hits in bestmode: Assignment of a read-pairs to only one specific reference sequence based on the highest sum of alignment scores. In fullmode, assigned a read-pair to a reference sequence even if a higher alignment score is found when mapping to another reference sequence database. This will provide best target match, considering only the sequences present in one particular reference database. (IV) Compilation of abundance statistics, read and nucleotide counts, depth, coverage, and summary reports.

I. Pre-processing of raw reads.

An optional trimming and filtering of raw reads is performed by use of the Cutadapt [17] program. Users can skip this step if reads are already trimmed. Default setting is that reads are initially trimmed before searching for adaptor sequences (equivalent to the Cutadapt option—q). In addition, a read is discarded unless a minimum of 30 nucleotides remains after trimming. Trimmed reads are next paired up and singleton reads are removed when using the paired-end version of MGmapper. To this follows another cleaning process where reads from potential PhiX Control v3 adapter-ligated libraries are removed via BWA-mem [15] and SAMtools [18], as they may originate from a control for Illumina sequencing runs ( The outcome is a cleaned set of reads that are believed to originate from the biological sample of interest. The number of reads in this set (noPhiX dataset) is set to 100% and used for calculation of R_abundance, a read count abundance measure.

II. Mapping of reads to reference sequence databases and alignment based filtering.

FASTQ reads are first extracted from the noPhiX set and mapped to one or several reference databases via ‘bwa mem—t procs—M database’ marking shorter splits as secondary hits, which are then removed when piping to ‘samtools view -F 256 -Sb -f2’ in paired-end mode or ‘samtools view -F 260 –Sb’ in single-end mode i.e. keeping properly paired reads or mapped reads, respectively. Next, reads with insufficient alignment qualities are removed based on user-defined minimum alignment score (MAS) and minimum fraction of nucleotides being assigned with an ‘M’ state in the CIGAR format, where an ‘M’ indicates a match or mismatch. The user-defined threshold for fraction of matches+mismatches (FMM) is in relation to the full length of a read. In paired-end mapping both reads are removed if just one of them does not fulfill the filtering criteria. Default settings in the MGmapper programs are MAS = 30 and FMM = 0.8. At this step properly paired read may align to more than one reference sequences, located in different reference sequence databases. In bestmode a read pair can only be assigned to one reference sequence (section “Identification of the best hit”).

III. Identification of the best hit.

Having paired-end sequences, both the forward and the reverse fastq reads are aligned to a reference sequence, each with an associated alignment score. The sum of alignment scores (SAS) is used as a measure to identify the best hit for a read-pair. Typically, all input query reads are mapped to multiple reference sequence databases e.g. bacteria, virus, fungi, human and others. Thus a read-pair may map to multiple reference sequences from different databases and in bestmode the taxonomy annotation is only assigned to one best hit, namely the one with the highest SAS score.

For single-end reads mapped to several databases, the best hit is the one with the highest alignment score. In cases where a read or read-pair achieves identical alignment scores to reference sequences from different databases, the priority is given to the order by which the databases are specified by the user, and thus a read or read-pair can still be associated to one single reference sequence.

IV. Output and post-processing of results.

The fastq reads are mapped to multiple user-defined reference sequence databases. A tab-separated file is produced for each database including reference sequence hits with read count statistics provided at strain level. A strain is named according to the header name originating from the fasta file that was used to make the database. The tab-separated file is composed of 14+16 columns of read count statistics and annotations, where the latter are taxid and taxonomy clade name for 8 clades, i.e. strain, species, genus, family, order, class, order and superfamily.

The first 14 columns are described below in Table 2.

The tab-separated files contains the unprocessed results as obtained by the BWA-mem [15] mapping and Samtools mpileup [18]. As false positive annotations are likely to be present, a subset of confident mapping results is obtained at a specified clade level (strain, species …superfamily) via a post-processing procedure described in the section below.


A combination of four criteria (I-IV) is used to identify a positive taxonomy annotation. Identifiers highlighted in italics are also described in Table 2.

  1. I. Minimum ReadCount of 10
  2. II. Mismatch ratio < 0.01, defined as Mismatches/Nucloetides.
  3. III. S_Abundance, the size normalized abundance > 0.01.
  4. IV. Unique read count fraction > 0.5%, defined as ReadCount uniq/ReadCount.

At strain level all four criteria are imposed. At species level, the criteria IV is used in a pre-cycle, to identify the lowest S_Abundance for the selected species. The new S_Abundance threshold is used in a second round where criteria IV are omitted. At genus level or higher only criteria I, II and III are used.

Taxid values are used to identify strains belonging to the same species or species belonging to the same genus etc. All identifiers as shown in Table 2, are summed at clade levels higher than strain i.e. the S_Abundance value for a species is the sum of all strain S_Abundance values. It is likewise for R_Abundance, Size, Seq_Count, Nucleotides, Covered positions, Coverage Depth, ReadCount, ReadCount uniq and Mismatches.


Benchmarking of MGmapper at strain, species and genus level against the in vitro and the four in silico datasets is shown in Tables 3 and 4. Excel sheets are provided as supplementary information in S1–S10 Files, at strain and species level for both annotations that passes the post-processing criteria and for those that are rejected. Also, plasmids sequence annotations are provided in the supplementary excel sheets.

A summary of mapping the in vitro and in silico datasets are shown in Tables 3 and 4, respectively.

For both the in vitro data and the four in silico datasets, MGmapper identifies all species correctly with no false positive predictions.

The work by Peabody et al. [8] benchmarked several methods at species level using a read count abundance > 0.1% for the in vitro data and the in silico dataset (although only for the 250bp dataset).

When benchmarked against the in vitro dataset, the methods that correctly identified all species with no false positives were a filtered version of Kraken [9] and CARMA3 [10]. For the in silico dataset (250bp), six methods performed with no errors i.e. CLARK, Kraken, Kraken filtered, MEGAN4 BLASTN, MetaCV and RITA.

In Table 3 we showed that MGmapper was able to identify all species and genuses present in the in vitro dataset without any false positives. Peabody et al., reported that a filtered version of Kraken the same result. Both MGmapper and the filtered version of Kraken need to reject sequence reads in order to report only reliable annotations. In Tables 5 and 6 MGmapper and Kraken have been benchmarked on the percentage of reads that are assigned to each of the genusus and species present in the in vitro dataset. The Kraken data was prepared by filtering the read assignments, using a threshold in kraken-filter (version 0.10.6-unreleased-20160118) of 0.2. Next, setting a fair read count abundance threshold for Kraken is not straightforward. We chose to set it at 1.1% such that all species were correctly identified by the Kraken method with no false positives. Also, at genus level a threshold was set at 2.2% to identify all 8 genuses with no false positives. The filters for MGmapper were those described in section ‘Post-processing’. Using these filters the read count abundances are shown in Tables 5 and 6 at genus and species level, respectively. The comparison shows that there is a high degree of similarity between the reported abundances for the individual species and genuses. Also, it shows that fewer reads are mis-classified by MGmapper where a higher total fraction of reads that have been correctly assigned at genus and species clade levels. A strain level comparison was not possible as taxid numbers that are now obsolete and only provided at species level and above.


The intention behind the development of the MGmapper pipeline is to simplify the processing of next generation sequences from biological samples, and to enable users an easy access to an NGS analysis without necessarily understanding all the computational details of the process. Also, setting up threshold values to provide reliable taxonomy annotations by reducing the number of false positives.

In its present form MGmapper follows a mapping protocols against reference sequence databases, and provide BAM files, text and Excel summary files. These contain read-count statistics for those reference sequences that passed a post-processing procedure, but also for those annotated reference sequences that did not meet the criteria set up in the post-processing, thus enabling a user to see discarded mapping results and possibly redo the post-processing if other threshold settings are preferred. Thus no time consuming fastq mapping needs to be re-done, just the fast post-processing that finishes in seconds.

One of the challenging issues that arise when short sequence reads are mapped against a set of reference sequences is that a read, or a pair of reads, may map equally well to more than one reference in a sequence database i.e. multiple hits with identical alignment scores. When this happens, the reference sequence assignment reported by BWA-mem [15] is arbitrary, as there is not yet any procedure available to unravel the multiple hit ambiguity. However, the fact that there are a number of reads that can be uniquely assigned to one single reference sequence, is a strong indicator that a reported strain or species is actually the one present in the sample. Benchmarking against an in vitro dataset, we obtained a sensitivity and precision at 75% for taxonomy annotations at strain level resolution. At higher clade level annotations we identified all species and genuses with no false positives.

The four in silico datasets proved easier to annotate correctly compared to the in vitro dataset. Only two false negatives were observed at strain level for the 100bp and 250bp datasets and one false negative for the 500bp and 1000bp datasets. At higher clade level annotations, i.e., species and genus, we obtained 100% correct taxonomy annotation. As the most challenging sample was the in vitro dataset, we used that in a benchmark analysis to the well performing Kraken method [9]. Overall the MGmapper method correctly assigned 84.82% of the reads at species level, compared to the Kraken method that assigned 70.45% of the reads correctly. At genus level the percentages were 90.75% and 87.21% for MGmapper and Kraken, respectively.

Benchmarking of metagenomics taxonomy classification methods is a challenging effort as programs produce different output and the benchmarking can be done in terms of correctly annotated reads or collapsed into strain/species annotations, run time and memory usage. We chose to compare to the extensive work by Peabody et al. [8], where methods were benchmarked to correctly identify species taxonomy. One of the main results from that study was that all methods, evaluated on an in vitro dataset, vastly over predicted the number of species present in a sample unless a post-processing was performed. In this work we have shown that our mapping results together with the post-processing procedure provide 100% correct taxonomy annotations at species and genus level and even at strain level resolution we have trustworthy annotations with sensitivity and precision at 75% when using a combination of up to four criteria to filter the initial mapping results. However, those datasets have limited complexity and for real metagenomics data we can expect much more diversity compared to the 11 and 12 species that are present in the in vitro and in silico datasets. Metagenomic samples from soil, human gut, sewage or public sites like metro stations, will likely contain a highly diverse set of organisms and also many closely related strains. Reference based sequence alignment methods allow for nucleotide mis-matches between a query read and a reference sequence and the mis-match threshold can be adjusted to assign sequence reads with a remote identity. For practical purposes, a threshold at 10–15% nucleotide mis-matches may be used. A specific organism assigned to be present in a sample may be a representative sequence for closely related sequences with nucleotide variations (SNPs or INDEls). Also Kmer based methods are challenged by highly diverse metagenomics data as they rely on perfect matches between a query fragment and a database hit.

The mapping procedure is the most time-consuming task when running MGmapper. Small datasets that only needs to be mapped against a bacteria, plasmid, fungi or virus databases will finish within minutes up to an hour, whereas mapping millions of fastq reads against many big reference sequence databases like plant (208gb fasta), vertebrate mammals (316gb fasta), invertebrates (150gb fasta) and nt (125gb fasta) will finish days (3–7) even when run in parallel on 16 processors. The in vitro and in silico datasets used in this study were both run against a bacteria and a plasmid database using 8 cores. The runtime for each of the datasets was 7 min using Computerome—the Danish National Supercomputer for Life Sciences (

In future golden benchmark dataset would be most welcome and initiatives like CAMI (Critical Assessment of Metagenome Interpretation) may be a useful platform for benchmark comparison of metagenomics classification tools.


We would like to thank Peter Wad Sackett and John Damm Sørensen for their contribution to programming and computational setup of the pipeline. This study was funded by the European Community's Seventh Framework Programme [FP7/2007–2013], under grant agreement n°613754, and by The Villum Foundation (VKR023052).

Author Contributions

  1. Conceptualization: TNP O. Lund FMA TSP.
  2. Data curation: TNP O. Lukjancenko.
  3. Formal analysis: TNP O. Lukjancenko.
  4. Funding acquisition: TSP FMA.
  5. Investigation: TNP.
  6. Methodology: TNP.
  7. Project administration: TSP FMA.
  8. Resources: FMA TSP.
  9. Software: TNP MCFT TSP.
  10. Supervision: TSP.
  11. Validation: O. Lukjancenko MCFT.
  12. Visualization: TNP TSP.
  13. Writing – original draft: TNP.
  14. Writing – review & editing: TNP MMS.


  1. 1. Mackelprang R, Waldrop MP, DeAngelis KM, David MM, Chavarria KL, Blazewicz SJ, et al. Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw. Nature. 2011. pp. 368–371.
  2. 2. Mason OU, Scott NM, Gonzalez A, Robbins-Pianka A, Bælum J, Kimbrel J, et al. Metagenomics reveals sediment microbial community response to Deepwater Horizon oil spill. ISME J. 2014;8: 1464–75. pmid:24451203
  3. 3. Hazen TC, Dubinsky EA, DeSantis TZ, Andersen GL, Piceno YM, Singh N, et al. Deep-sea oil plume enriches indigenous oil-degrading bacteria. Science. 2010;330: 204–208. pmid:20736401
  4. 4. Hauptmann AL, Stibal M, Bælum J, Sicheritz-Pontén T, Brunak S, Bowman JS, et al. Bacterial diversity in snow on North Pole ice floes. Extremophiles. 2014.
  5. 5. Arumugam M, Raes J, Pelletier E, Le Paslier D, Yamada T, Mende DR, et al. Enterotypes of the human gut microbiome. Nature. 2011;473: 174–180. pmid:21508958
  6. 6. Nielsen HB, Almeida M, Juncker AS, Rasmussen S, Li J, Sunagawa S, et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat Biotechnol. 2014;32: 822–828. pmid:24997787
  7. 7. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research. 1997. pp. 3389–3402. pmid:9254694
  8. 8. Peabody MA, Van Rossum T, Lo R, Brinkman FSL. Evaluation of shotgun metagenomics sequence classification methods using in silico and in vitro simulated communities. BMC Bioinformatics. BMC Bioinformatics; 2015;16: 363.
  9. 9. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15: R46. pmid:24580807
  10. 10. Gerlach W, Stoye J. Taxonomic classification of metagenomic shotgun sequences with CARMA3. Nucleic Acids Res. 2011;39.
  11. 11. Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Res. 2007;17: 377–386. pmid:17255551
  12. 12. Huson DH, Mitra S, Ruscheweyh H-J, Weber N, Schuster SC. Integrative analysis of environmental sequences using MEGAN4. Genome Res. 2011;21: 1552–1560. pmid:21690186
  13. 13. Zhao Y, Tang H, Ye Y. RAPSearch2: A fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics. 2012;28: 125–126. pmid:22039206
  14. 14. Larsen M V., Cosentino S, Lukjancenko O, Saputra D, Rasmussen S, Hasman H, et al. Benchmarking of methods for genomic taxonomy. J Clin Microbiol. 2014;52: 1529–1539. pmid:24574292
  15. 15. Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26: 589–595. pmid:20080505
  16. 16. Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, Kubal M, et al. The metagenomics RAST server—a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;9: 386. pmid:18803844

0 Replies to “Reference Databases For Taxonomic Assignment In Metagenomics Of The Human”

Lascia un Commento

L'indirizzo email non verrà pubblicato. I campi obbligatori sono contrassegnati *