What Kind of Data Can I Upload to MG-RAST

Abstract

As technologies change, MG-RAST is adapting. Newly available software is being included to improve accuracy and performance. As a computational service constantly running large-volume scientific workflows, MG-RAST is the right location to perform benchmarking and implement algorithmic or platform improvements, in many cases involving trade-offs between specificity, sensitivity and run-time cost. The work in [Glass EM, Dribinsky Y, Yilmaz P, et al. ISME J 2014;8:1–3] is an example; we use existing well-studied data sets as gold standards representing different environments and different technologies to evaluate any changes to the pipeline. Currently, we use well-understood data sets in MG-RAST as a platform for benchmarking. The use of artificial data sets for pipeline performance optimization has not added value, as these data sets do not present the same challenges as real-world data sets. In addition, the MG-RAST team welcomes suggestions for improvements of the workflow. We are currently working on versions 4.02 and 4.1, both of which contain significant input from the community and our partners that will enable double barcoding and stronger inferences supported by longer-read technologies, and will increase throughput while maintaining sensitivity by using Diamond and SortMeRNA. On the technical platform side, the MG-RAST team intends to support the Common Workflow Language as a standard to specify bioinformatics workflows, to facilitate both development and efficient high-performance implementation of the community's data analysis tasks.

Introduction

The ever-increasing amount of DNA sequence data [1] has motivated significant developments in biomedical research. Currently, however, many researchers continue to struggle with large-scale computing and data management requirements. Numerous approaches have been proposed and are being pursued to alleviate this burden on application scientists. The approaches include focusing on the user-interface layer while relying primarily on legacy technology [2]; reimplementing significant chunks of code in new languages [3]; and developing clean-slate designs [4]. Breakthroughs that appreciably reduce computational burden, such as Diamond [5], are the exception. While important, few if any of the solutions contribute to solving the central problem: data analysis is becoming increasingly expensive in terms of both time and cost, with reference databases growing rapidly and data volumes rising. In essence, more and more data are being produced without sufficient resources to analyze them. All indicators show that this trend will continue in the foreseeable future [1].

We strongly believe that a change in how the research community handles routine data analytics is required. While we cannot predict the outcome of this evolutionary process, scalable, flexible and, most important, efficient platforms will, in our opinion, be part of any 'new computational ecosystem'. MG-RAST [6] is one such platform that handles hundreds of submissions daily, often aggregating >0.5 terabytes in a 24 h period.

MG-RAST is a hosted, open-source, open-submission platform ('Software as a Service') that provides robust analysis of environmental DNA data sets (where environment is broadly defined). The system has three main components: a workflow, a data warehouse and an API (with a Web frontend). The workflow combines automated quality control, automated analysis and user-driven parameter- and database-flexible analysis. The data warehouse supports data archiving, discovery and integration. The platform is accessible via a Web interface [7], as well as a RESTful API [8].

Analysis of environmental DNA (i.e. metagenomics) presents a number of challenges, including feature extraction (e.g. gene calling) from (mainly unassembled) often lower-quality sequence data; data warehousing; movement of often large data sets to be compared against many equally large data sets; and data discovery. A key insight (see Lessons learned, L1) is that the challenges faced here are distinct from the challenges facing groups that provide services for individual genomes or even sets of genomes [9].

Several hosted systems currently provide services in this field: JGI IMG/M [10], EBI MG-Portal [11] and MG-RAST [6]. Myriad stand-alone tools exist, including integrative user-friendly interfaces [12]; feature prediction tools [5, 13]; tools that 'bin' individual reads using codon frequencies, read abundance and cross-sample abundance [14–17]; and sets of marker genes reducing the search space for analysis with associated visualization tools [18]. MG-RAST seeks to select the best-in-class implementations and provide a hosted resource-efficient service that strikes a balance between custom analysis and one-size-fits-all recipes. MG-RAST pursues this goal by defining parameters late, during download or analysis, rather than a priori before running a set of analysis tools. The analysis workflow in MG-RAST is identical across all data sets, except for data set-specific operations such as host DNA removal and variations in filtering to suit different user-submitted data types.

While many approaches to metagenome analysis exist, we chose an approach that allows large-scale analysis and massive comparisons. The core principle for the design of MG-RAST was to provide consistent analyses as deep and unbiased as possible at affordable computational cost. Other approaches, such as comprehensive genome and protein binning, adopted by IMG/M, or the profile hidden Markov model-based approaches used by MG-Portal, do add value and provide valuable alternative analyses. These portals complement each other's capabilities, and we routinely share best practices with them. MG-RAST's strong suit is handling raw reads directly from a sequencing service. It has been extended to handle assembled metagenomes and metatranscriptomes as well. In its current form, however, it does not support metagenomic assembly or a genome-centric approach to metagenomics (i.e. binning).

Like many hosted applications (not just in bioinformatics), MG-RAST started out as a traditional database-oriented system using largely traditional design patterns. While expanding the number of machines able to execute MG-RAST workflows, we learned that data input and output (I/O) is as limiting a factor as processing power or memory (see Lessons learned, L2 and L8). MG-RAST has rapidly adapted [19–21] to meet the needs of a growing user community, as well as the changing technology landscape. We have run MG-RAST workflows on several computational platforms, including OpenStack [22], Amazon's AWS [23], Microsoft's Azure [24], several local clusters and even laptops on occasion. In many ways, MG-RAST has evolved to be the counterpoint to the now-abundant one-offs that are routinely implemented in many laboratories for sequence analysis. It offers reproducibility and was designed for efficient execution [9] (see Lessons learned, L9).

To date, MG-RAST has processed >295 000 data sets from 23 000 researchers. As of June 2017, over 1 trillion individual sequences totaling >40 terabase pairs have been processed, and the total volume of data generated is well over half a petabyte. A fair assessment is that we do a lot of the heavy lifting of high-volume automated analysis of amplicon and shotgun metagenomes as well as metatranscriptomes for a large user community. Currently, only 20% (44 000) of the data sets in MG-RAST are publicly available. Data are frequently shared by researchers with only their collaborators. In future releases, we will introduce a series of features to incentivize data publication.

The vast majority of the data sets in MG-RAST represent user submissions; <3000 are data sets extracted from SRA by the developers. Determining identity between any two data sets is far from trivial if available metadata does not provide sufficient evidence, which is one more reason to incentivize metadata (see Lessons learned, L7). Nevertheless, the developers are working jointly with researchers at EBI to synchronize the contents of EBI's ENA with the contents of MG-RAST. Currently, to the best of our combined knowledge, there is little overlap between the data sets in SRA/ENA and those in MG-RAST.

The analysis shown in Figure 1 is typical for one class of user queries. We note that in addition to requesting SEED annotations, the user might also request annotations from the M5NR sub-databases (i.e. namespaces) such as KEGG pathways [25], KEGG orthologues [26], COG [27] and RefSeq [28]. Providing a smart data product that can be projected with no computation onto other namespaces (read: annotation databases) saves a significant amount of computational resources (see Lessons learned, L3).

Figure 1.

MG-RAST data and analysis results can be reused for other purposes. Here, we show a MUSCLE [29] alignment of the Prodigal translations of filtered sequences from the following unauthenticated API call: http://api.metagenomics.anl.gov//annotation/sequence/mgm4662210.3?evalue=10&type=function&source=Subsystems&filter=Inosine-5.

Compared with previous versions of MG-RAST, the latest version has increased throughput dramatically while using the same amount of resources: ∼22 million core-hours annually are used to run the MG-RAST workflow for user-based submissions. In addition, the RESTful API has allowed a rethinking and restructuring of the user interface (a different model–view–controller pattern) and, most important, the reproducibility of results using containers.

Implementation

While MG-RAST started as an application built on a traditional LAMP stack [30], we quickly realized that a single database could not provide sufficient flexibility to support diverse underlying components at the scale required. Instead, we chose to rely on an open API [8] that provides the ability to change underlying components as required by scale.

We note that MG-RAST is not a comprehensive analysis tool offering every conceivable kind of analysis. By making some data-filtering parameters user adjustable at analysis or download time, MG-RAST provides flexibility. Via the API, users and developers can pipe MG-RAST data and results into their in-house analysis procedures. Figure 1 shows an MG-RAST API query for sequences with similarities to proteins with the SEED Subsystem namespace annotation inosine-5′ phosphate dehydrogenase from the soil metagenome mgm4662210.3 that is streamed into a filtering and alignment procedure. A key feature of MG-RAST (see Lessons learned, L6) is its ability to adjust database match parameters at query time, a function frequently not recognized by researchers and in some cases missed even by studies comparing systems [31].
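The query described here can be assembled programmatically. The following is a minimal sketch in Python that only builds the request URL from the parameters visible in the Figure 1 caption; the helper name `build_annotation_url` is hypothetical, and the sketch makes no claim about which other parameters the API accepts or whether the endpoint is still served at this address.

```python
from urllib.parse import urlencode

# Host taken from the Figure 1 caption; treat it as an assumption, since
# service addresses change over time.
API_BASE = "http://api.metagenomics.anl.gov"


def build_annotation_url(metagenome_id, evalue, annotation_type, source, text_filter):
    """Assemble the annotation/sequence query shown in the Figure 1 caption.

    Hypothetical helper: only the four query parameters that appear in the
    caption are used here.
    """
    query = urlencode({
        "evalue": evalue,
        "type": annotation_type,
        "source": source,
        "filter": text_filter,
    })
    return f"{API_BASE}/annotation/sequence/{metagenome_id}?{query}"


url = build_annotation_url("mgm4662210.3", 10, "function", "Subsystems", "Inosine-5")
print(url)
```

The resulting string matches the caption's URL apart from the doubled slash after the host name. The response could then be streamed into a downstream filter or aligner, for example with `urllib.request.urlopen`, without first writing it to disk.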

MG-RAST has been designed to treat every data set with the same pipeline. Given the expected volume and variety of data sets, per-data set optimization of parameters has not been a design goal. The system is optimized for robust handling of a wide variety of input types, and users can perform optimizations within sets of parameters that filter the pipeline results. The automatic setting of, for example, detection thresholds for dramatically different data types and research questions is not the role of a data analysis platform. While this one-size-fits-all nature of the processing might somewhat limit sensitivity and potentially limit downstream scientific inquiry, these limitations are balanced by the vast scope of the consistently analyzed data universe that the uniformly applied workflows and data management and discovery systems enable researchers to access. We believe that relying on smart data products that enable adjustment of parameters after processing and using custom downstream analysis scripts more than compensate for any reduction in sensitivity (see Lessons learned, L3 and L6).

Backend components

Figure 2 shows the current design of the MG-RAST backend components, using various databases and caching systems [32–35] as appropriate to support the API with the performance needed.

Figure 2.

Backend of MG-RAST version 4 using several database systems to enable efficient querying via the API.

A major criterion for success of the workflow is the ability to scale to the throughput levels required. Algorithmic changes (e.g. adoption of Diamond [5]) can help, but the design of the execution environment, most specifically its portability, is the single key to scaling (see Lessons learned, L4 and L5).

Access to data

In computational biology, shared filesystems traditionally are used to serve data to the computational resources. Sharing data between multiple computers is necessary because the data typically require more computational resources than a single machine can provide. Shared filesystems can make data accessible on several computers. This approach, however, limits the range of available platforms or requires significant time for configuring access or moving data into the platform. In addition, many shared filesystems exhibit poor scaling performance for science applications. Slow or inadequate shared filesystems have been observed by almost every practitioner of bioinformatics (see Lessons learned, L2). This situation has forced the use of complex I/O middleware to transform science I/O workloads into patterns that can scale in various science domains, including quantum chromodynamics and astrophysics [36], molecular dynamics [37], fusion science [38] and climate [39].

Rather than adopting this approach, we conducted a detailed analysis of our workloads, which revealed that individual computational units (e.g. cluster nodes) typically use a small fraction of the data and do not require access to the entire data set. Consequently, we chose to centralize data into a single point and access it in a RESTful way, thus providing efficient access while requiring no configuration for the vast majority of computing systems. A single object store can support distributed streaming analysis of data across many computers (see Lessons learned, L8).

The SHOCK object store [40] provides secure access to data and, most important, to subsets of the data. A computational client node can request a number of sequence records or sets of records meeting specific criteria. Data are typically streamed at significant fractions of line speed, and as results are often returned as indices that are much smaller than the original data files, writing is extremely efficient. Furthermore, the data are primarily write-once, which significantly simplifies the design of the object store with respect to data consistency.
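To illustrate why index-based access to subsets is cheap, the sketch below builds a byte-offset index over FASTA records and then reads a range of records with a single seek. It is a generic illustration in Python under our own assumptions, not SHOCK's implementation or API; the function names `build_record_index` and `read_records` are hypothetical.

```python
import io


def build_record_index(handle):
    """Return one (offset, length) byte span per FASTA record in a binary file."""
    spans = []
    start = None
    pos = 0
    for line in handle:
        if line.startswith(b">"):
            if start is not None:
                spans.append((start, pos - start))  # close the previous record
            start = pos
        pos += len(line)
    if start is not None:
        spans.append((start, pos - start))  # close the final record
    return spans


def read_records(handle, spans, first, count):
    """Read `count` consecutive records starting at record number `first`."""
    chosen = spans[first:first + count]
    if not chosen:
        return b""
    begin = chosen[0][0]
    end = chosen[-1][0] + chosen[-1][1]
    handle.seek(begin)
    return handle.read(end - begin)


# Toy data standing in for a large sequence file held by an object store.
data = io.BytesIO(b">r1\nACGT\n>r2\nGG\nCC\n>r3\nTTTT\n")
index = build_record_index(data)
subset = read_records(data, index, first=1, count=1)
print(index, subset)
```

Because the file is written once and never modified, such an index stays valid after it is built, which is the consistency simplification mentioned above.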

Data in SHOCK are available to third parties via a RESTful API, and thus, SHOCK supports the reuse of both data and results.

Execution format

Executing workflows across a number of systems requires that the code be made available in suitable binary form on those platforms. Among the emerging challenges, reproducibility is a key problem for scientific disciplines that rely on the results of sequence analysis at scale without the ability to validate every single computational step in depth.

Virtual machines have been used to provide stable and portable execution environments [41] for a number of years. Nevertheless, because of many technical details (e.g. the significant number of binary formats required to cover all platforms) and significant overhead [42] in execution, containers provide a more suitable platform for most scientific computations.

In particular, the relatively recent advent of binary Linux containers (notably, Docker) in computing affords a novel way to distribute execution environments. Containers reduce the set of requirements for any given software package to one: a container. We have devised a scalable system [43] to execute scientific workflows across a number of containers connected only via a RESTful interface to an object store. With increasing numbers of systems supporting containerized execution [44] and with compatibility mechanisms [45] emerging to support legacy installations, Linux containers are quickly becoming the lingua franca of binary execution environments (see Lessons learned, L5). As with all of MG-RAST, the recipes for building the containers ('Dockerfiles') are available as open source on GitHub, and the binary containers are available on DockerHub. The resulting containers are not specific to the MG-RAST systems, and the binary containers and the recipes are available to third parties for their adoption.

Current MG-RAST workflow

MG-RAST has been used for tens of thousands of data sets. This extensive use has led to a level of stability and robustness that few sequence analysis workflows can match.

The workflow (version 4.01) consists of the following logical steps:

  1. Data hygiene: Providing quality control and normalization steps that also include mate pair merging with ea-utils fastq-join [46–48]. The focus, however, is on artifact removal and host DNA removal [48, 49].

  2. Feature extraction: Using a predictor that has been shown to be robust against sequence noise (FragGeneScan [50]) to predict potentially protein-coding features, and using a purposefully simple similarity-based approach to predict ribosomal RNAs using VSEARCH [51]. The similarity-based predictions use a version of M5RNA [52] that was clustered at 70% identity to find candidate ribosomal RNA sequences.

  3. Data reduction: Clustering of predicted features at 90% identity (protein coding) and 97% identity (ribosomal RNA). Features overlapping with predicted ribosomal RNA (rRNA) sequences are removed. For each cluster, the longest representative is used.

  4. Feature annotation: Using similarity-based mapping of cluster representatives against the super nonredundant M5NR [52] with a parallelized version of BLAT [53] for candidate proteins and ribosomal RNAs. This creates 'annotations' with M5NR database identifiers only.

  5. Profile creation: Mapping the M5NR identifiers to several functional namespaces (e.g. RefSeq or SEED) and hierarchical namespaces (e.g. COG and Subsystems), pivoting into functional and taxonomic categories, and thus creating a reduced fingerprint ('profile') for each namespace and hierarchy.

  6. Database load: Uploading profiles to the various MG-RAST backend databases that support the API.
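The two rules stated in the data reduction step (dropping features that overlap predicted rRNA, and keeping the longest member of each cluster) can be sketched in a few lines of Python. This is an illustration only: the data layout, the half-open coordinate convention and the function names are our assumptions, and the clustering itself is taken as given rather than computed.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals [start, end) share a position."""
    return a_start < b_end and b_start < a_end


def drop_rrna_overlaps(features, rrna_hits):
    """Remove features that overlap a predicted rRNA region on the same read.

    Both arguments are lists of (read_id, start, end) tuples (assumed layout).
    """
    kept = []
    for read_id, start, end in features:
        clash = any(
            hit_read == read_id and overlaps(start, end, hit_start, hit_end)
            for hit_read, hit_start, hit_end in rrna_hits
        )
        if not clash:
            kept.append((read_id, start, end))
    return kept


def pick_representatives(clusters):
    """Map each cluster id to its longest member (first one wins on ties)."""
    return {cid: max(members, key=len) for cid, members in clusters.items()}


features = [("read1", 0, 90), ("read1", 200, 290), ("read2", 10, 100)]
rrna_hits = [("read1", 50, 150)]
kept = drop_rrna_overlaps(features, rrna_hits)

clusters = {"c1": ["MKT", "MKTAYIAK", "MKTAY"], "c2": ["MSLLE"]}
reps = pick_representatives(clusters)
print(kept, reps)
```

Only the representatives are passed on to the similarity search, which is where the reduction in computational cost comes from.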

We note that the approach taken to sequence analysis is different from the state of the art for more or less complete microbial genomes [54].

Using data from MG-RAST

A key problem of current big data bioinformatics is the barrier to reuse of data and results. Comparing results of an expensive computational procedure with results from another laboratory can be problematic if the procedures used are not identical (potentially compromising integrity of the study). Another common approach is to not reuse existing results but to do an expensive reanalysis of both data sets, thus duplicating the work originally performed. One key problem with this approach is that data-driven science is no longer reviewable, as no reviewer can be expected to retrace the steps of the investigators while duplicating their computational work. If the data and results (including intermediate results) were available as reproducible entities, the problem of data uncertainty and costly recomputations would disappear.

This exceptional waste of computer time is acceptable default behavior in a discipline that is rich in computation and poor in data. In a data-rich ecosystem, however, either the terms of engagement have to change or the percentage of the research budgets allocated to computational resources has to dramatically increase. One of the key goals of MG-RAST is to provide a wealth of data sets and the underlying analysis results. Both the Web-based user interface and the RESTful API make these results accessible. To get closer to our goal of transparent and reproducible MG-RAST data analysis, we already execute all workflow steps in containers. The missing building block, which we are currently working on and which will enable every interested party to easily execute, compare or modify our analysis pipeline, is support for the Common Workflow Language (CWL) [55] in our workflow engine. We think that producing data with a CWL workflow adds more value because it adds executable provenance (see Lessons learned, L4). Executable provenance is critical, as it allows recreation of the results on a wide variety of computational platforms.

Using profiles generated by MG-RAST

Profiles are the primary data product generated by MG-RAST, and they feed into the Web user interface and the various other tools. They encode the abundance of entities in a given sample, combining information from several databases. Most important, profiles include information on the quality of the underlying observation (e.g. results of sequence similarity search) (Figure 3). Profiles are a compressed representation of the environmental samples, allowing large-scale comparisons.

Another critical feature is the ability to adjust matching parameters (e.g. minimal alignment length required for inclusion) at analysis time, allowing data reuse without the need for recomputing the profile with different cutoffs. With this 'smart data product', data consumers can switch between reference databases and parameter sets without recomputing the underlying sequence similarity searches (see Lessons learned, L3).
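A minimal Python sketch of this idea follows: each stored hit keeps its match statistics, so a consumer can apply new cutoffs and project onto another namespace by filtering and table lookup alone. The record layout, the identifiers, the mapping table and the default cutoffs are all illustrative assumptions and do not reproduce the actual MG-RAST profile format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    m5nr_id: str      # placeholder identifier, not a real M5NR id
    abundance: int    # number of reads behind this observation
    evalue_exp: int   # exponent of the e-value, e.g. -20 for 1e-20
    identity: float   # percent identity of the match
    align_len: int    # alignment length


# Hypothetical identifier-to-label tables, one per namespace.
NAMESPACE_MAP = {
    "SEED": {"id1": "IMP dehydrogenase", "id2": "IMP dehydrogenase", "id3": "GMP synthase"},
    "RefSeq": {"id1": "Proteobacteria", "id2": "Firmicutes", "id3": "Proteobacteria"},
}


def project(hits, namespace, max_evalue_exp=-5, min_identity=60.0, min_align_len=15):
    """Filter stored hits by cutoffs and sum abundance per namespace label."""
    counts = Counter()
    for hit in hits:
        passes = (
            hit.evalue_exp <= max_evalue_exp
            and hit.identity >= min_identity
            and hit.align_len >= min_align_len
        )
        label = NAMESPACE_MAP[namespace].get(hit.m5nr_id)
        if passes and label is not None:
            counts[label] += hit.abundance
    return dict(counts)


hits = [
    Hit("id1", 12, -30, 85.0, 60),
    Hit("id2", 5, -6, 62.0, 20),
    Hit("id3", 7, -3, 55.0, 12),
]
seed_view = project(hits, "SEED")
strict_refseq_view = project(hits, "RefSeq", max_evalue_exp=-20)
print(seed_view, strict_refseq_view)
```

Both views are derived from the same stored hits; no similarity search is rerun when the cutoffs or the namespace change.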

Metadata—making data discoverable

A key component of data reuse is the much-discussed 'metadata' (or 'data describing data'). With tens of thousands of data sets available, the ability to identify the relevance of data sets has become critical. Approaches include 'simple' machine-readable encoding of data items such as pH, temperature and location and the use of controlled vocabularies to allow unambiguous encoding of, for example, anatomical organs via [56] or geographical features using the ENVO ontology [57].

Machine-readable metadata, such as the concepts championed by the Genomic Standards Consortium (GSC) [58], is key. GSC metadata is intentionally kept as simple and lightweight as possible while trying to meet the needs of the data producers and data consumers. Despite its simplicity, however, for the occasional user (e.g. a scientist depositing data), it is still cumbersome. Tools such as Metazen [59] help bridge the gap between data scientists and occasional users. MG-RAST implements the core MIxS [60] checklist, as well as all available environmental packages [61].

GSC-compliant, machine-readable markup of data sets at the time of upload to or deposition in online resources offers a unique opportunity. Data become discoverable, and analysis is made easier. MG-RAST incentivizes the addition of metadata by offering priority access to the compute resources to data sets with valid GSC metadata (see Lessons learned, L7).
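What machine-readable validation and a metadata-based incentive could look like is sketched below in Python. The field list is a small illustrative subset chosen by us, not the MIxS checklist, and the priority rule and its labels are hypothetical; the sketch does not describe how MG-RAST schedules jobs.

```python
# Illustrative subset of fields; the MIxS checklist and the environmental
# packages define the authoritative names and formats.
REQUIRED_FIELDS = ("collection_date", "latitude", "longitude", "env_package")


def missing_fields(record):
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


def coordinates_valid(record):
    """Check that latitude and longitude parse as numbers in legal ranges."""
    try:
        lat = float(record["latitude"])
        lon = float(record["longitude"])
    except (KeyError, TypeError, ValueError):
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def queue_priority(record):
    """Hypothetical rule: complete, valid metadata earns a higher priority."""
    if not missing_fields(record) and coordinates_valid(record):
        return "high"
    return "normal"


complete = {"collection_date": "2016-05-01", "latitude": "41.7",
            "longitude": "-87.9", "env_package": "soil"}
partial = {"collection_date": "2016-05-01", "latitude": "", "env_package": "soil"}
print(queue_priority(complete), queue_priority(partial), missing_fields(partial))
```

Reporting the list of missing fields back to the submitter, rather than a bare pass or fail, keeps the check useful for occasional users.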

Web user interface

Not all scientists spend a significant fraction of their time on the command line or enjoy using the command line to solve their bioinformatics questions. Extracting and displaying the relative abundance of proteins classified as part of the subsystem class 'Protein Metabolism' from the phylum Proteobacteria is simple via the Web interface (Figure 4) but requires many command line invocations.

Figure 3.

MG-RAST profile encoding abundance and matching parameter information as well as information on the observed entities.

For these users, MG-RAST provides a graphical user interface (GUI) implemented in JavaScript/HTML5. The GUI provides guidance for nontrivial procedures such as data upload and validation, data sharing and data discovery, as well as data analysis (Figure 5A). Data export in various formats is also supported (Figure 5B).

Figure 4.

Relative abundance of protein functional classes ('Subsystems') in Proteobacteria ('RefSeq Phylum') displayed as a waterfall diagram for data sets in study mgp128 as displayed by the version 4.0 MG-RAST graphical user interface.

User's view of MG-RAST

Every user has a different view of the data in MG-RAST. All users have access to the public metagenomics data, but shared or private data available to the user are linked to the user's login information. Each data set has a unique identifier and information on visibility; until the data are made publicly available, temporary identifiers are used to minimize the number of data sets mentioned in the literature without being publicly available. Figure 6 provides a comparison of public and private data sets and highlights the sharing and data organization capabilities of the platform.

Figure 5.

(A) Heatmap and clustering of the occurrence of Corynebacteria in study mgp128 as displayed by the MG-RAST web frontend. (B) Data export options available for the data and visualization, including sequences and abundance in tabular and JSON format.

Figure 6.

Public study (with permanent unique identifier mgp128) and private study set with temporary identifier. A study groups multiple data sets, provides a single identifier and allows sharing via simply providing an email address for the person the data are to be shared with.

A key design feature of MG-RAST is to allow private data sets; users are in charge of uploading, sharing and releasing the data. Once submitted, data are private to the submitting user. The submitting user is reminded to share their data at their earliest convenience.

In addition to data, the processing pipeline and the data warehousing, MG-RAST provides an analytical tool set. It is implemented as a user-friendly Web application and consumes the profiles generated by the MG-RAST pipeline.

Future work

As technologies change, MG-RAST is adapting. Newly available software is being included to improve accuracy and performance. As a computational service constantly running large-volume scientific workflows, MG-RAST is the right location to perform benchmarking and implement algorithmic or platform improvements, in many cases involving trade-offs between specificity, sensitivity and run-time cost. The work in [62] is an example. We use existing well-studied data sets as gold standards representing different environments and different technologies to evaluate any changes to the pipeline. Currently, we use well-understood data sets in MG-RAST as a platform for benchmarking. The use of artificial data sets for pipeline performance optimization has not added value because these data sets do not present the same challenges as real-world data sets do.

The MG-RAST team welcomes suggestions for improvements of the workflow. We are currently working on versions 4.02 and 4.1, both of which contain significant input from the community and our partners that will enable double barcoding and stronger inferences supported by longer-read technologies and will increase throughput while maintaining sensitivity by using Diamond and SortMeRNA.

On the technical platform side, the MG-RAST team intends to support the CWL as a standard to specify bioinformatics workflows, to facilitate both development and efficient high-performance implementation of the community's data analysis tasks.

Lessons learned

L1. Analyzing large-scale environmental DNA is different from genomics.

Because of the absence of high-quality assembled data (in most projects) and the lack of good models for removing contaminations upstream, a metagenomics portal site has to take over quality control and normalization and get good at it.

L2. Data I/O is as limiting as CPU and RAM.

A bad tradition in bioinformatics is ignoring the cost of I/O. Large-scale distributed systems need to model the I/O cost explicitly and design their solution to include I/O cost as well as CPU cost.

L3. Using smart data products helps avoid costly recomputations and empowers downstream tool builders.

The bad tradition of downloading raw data and creating spreadsheets with results is not sustainable. While bioinformatics is not yet able to fully rely on disseminating data as research objects [63], we need to move toward them.

L4. The use of reproducible workflows such as CWL [55, 64] is a crucial requirement for any service generating data meant for reuse.

Providing a detailed, portable, executable recipe for how the data were generated is important to data consumers. In addition, making the recipes available supports improvement to the workflows by third parties.

L5. Containers should be used to capture the execution environment.

Containers (e.g. Linux containers) capture the environment in a reproducible format.

Workflows without their environment are less than useful.
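Lessons L4 and L5 can be illustrated together. The sketch below builds a minimal CWL `CommandLineTool` description as a Python dictionary and serializes it to JSON (CWL documents may be written in JSON or YAML). The tool (`echo`), the input name `message` and the container image `debian:stable-slim` are placeholders chosen for illustration; they are not part of the MG-RAST pipeline.

```python
import json


def echo_tool(image: str = "debian:stable-slim") -> dict:
    """Return a minimal CWL v1.0 tool description (illustrative placeholder)."""
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": "echo",
        # The container image pins the execution environment (lesson L5).
        "hints": {"DockerRequirement": {"dockerPull": image}},
        "inputs": {
            "message": {"type": "string", "inputBinding": {"position": 1}},
        },
        "outputs": [],
    }


if __name__ == "__main__":
    # Print the recipe so it can be saved and handed to a CWL runner.
    print(json.dumps(echo_tool(), indent=2))
```

Keeping the recipe and the image reference in one document means a data consumer receives both the steps and the environment they ran in.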

L6. Data reuse is critical for saving computational cost.

While the reproducibility resulting from reproducible execution environments is great, providing intermediate results adds significantly more value to reviewers and fosters reuse of computational results for a variety of purposes, such as building software to improve existing components (e.g. feature predictors) or using the data for scientific projects.
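One way to picture the reuse of intermediate results is a store keyed by the step name, the step version and a hash of the input, so that a repeated request is served without recomputation. The sketch below is an assumption-laden illustration: `ResultCache` is a hypothetical in-memory stand-in and says nothing about how MG-RAST actually stores its data products.

```python
import hashlib
from typing import Callable, Dict


class ResultCache:
    """In-memory stand-in for a store of intermediate data products."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}
        self.computations = 0  # counts how often a step actually ran

    @staticmethod
    def key(step: str, version: str, data: bytes) -> str:
        """Derive a cache key from step name, step version and input bytes."""
        h = hashlib.sha256()
        for part in (step.encode(), version.encode(), data):
            # A length prefix keeps ("ab", "c") distinct from ("a", "bc").
            h.update(len(part).to_bytes(8, "big"))
            h.update(part)
        return h.hexdigest()

    def run(self, step: str, version: str, data: bytes,
            func: Callable[[bytes], bytes]) -> bytes:
        """Return the cached result, computing it only on a cache miss."""
        k = self.key(step, version, data)
        if k not in self._store:
            self._store[k] = func(data)
            self.computations += 1
        return self._store[k]
```

Including the step version in the key means that an improved component (for example a new feature predictor) produces a fresh result instead of silently reusing an outdated one.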

L7. Metadata is invaluable and should be required.

Users require encouragement to provide metadata. We aim to make users submit metadata as early as possible, and to incentivize users, we provide high-quality tools that make metadata collection easy.

L8. The complexity of shared filesystems should be avoided whenever possible.

Relying on RESTful interfaces instead of shared filesystems provides cross-cloud execution capabilities, allowing us to run on almost any computational platform, including the cheapest one available.

L9. Portals are the right place for performance engineering.

While many biomedical informatics groups are computationally proficient, the convergence of large-scale processing and domain expertise makes portals an ideal location for optimization. Running many workflows thousands of times and providing services to many other groups is a good platform for accumulating expertise.

Discussion

As more environmental DNA sequence data become available to the research community, a new set of challenges emerges. These challenges require a change in approach to computing at the community level. We describe a domain-specific portal that, like its European companion system [11], acts as an integrator of data and efficiently implements domain-specific workflows. The lessons learned about building scale-out infrastructure dedicated to executing bioinformatics workflows and the resulting middleware systems [19, 20, 40, 59, 65] will benefit both the community of users and researchers attempting to build efficient sequence analytics workflows.

Reproducible efficient execution of domain-specific workflows is a key contribution of the MG-RAST system. Provisioning of data and results via a Web interface and a RESTful API is another key aspect. Encouraging data reuse by provisioning both data and results (as well as intermediate files) via a stable API is a key function that serves the community of bioinformatics developers, who can use precomputed data that are well described by a workflow rather than implementing their own (frequently subpar) preprocessing steps, and thus can focus on their key mission.
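From a developer's point of view, reuse through a stable RESTful API amounts to building a resource URL and reading a JSON document. The sketch below shows only that general pattern. The base URL `https://api.example.org`, the resource name `metagenome`, the identifier and the `verbosity` parameter are placeholders assumed for illustration; the service's own API documentation defines the real endpoints.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

API_BASE = "https://api.example.org"  # hypothetical; replace with the real service URL


def resource_url(resource: str, identifier: str, **params: str) -> str:
    """Build the URL of one precomputed data product (illustrative only)."""
    url = f"{API_BASE}/{quote(resource)}/{quote(identifier)}"
    if params:
        # Sort parameters so the same request always yields the same URL.
        url += "?" + urlencode(sorted(params.items()))
    return url


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """Download and decode a JSON document from the given URL."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

A downstream tool would call `fetch_json(resource_url("metagenome", "sample-1", verbosity="minimal"))` against the real service and work with the returned dictionary, with no shared filesystem involved.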

By providing preanalyzed data (using an open recipe that is available to the community for discussion and improvement), MG-RAST can help reduce the current 'method uncertainty', where individual data sets analyzed with different analysis strategies can lead to dramatically different interpretations.

The role of MG-RAST is not one-size-fits-all. Rather than being the one and only analysis mechanism, MG-RAST is a well-designed high-performance system on top of an efficient scale-out platform [66] that can take some of the heavy lifting off the shoulders of individual researchers. Researchers can add their own custom boutique analyses at a fraction of the computational and development cost, allowing them to focus on their specific problem and thus maximizing overall productivity.

With the state of the art of sequencing technology shifting, MG-RAST will adapt to extract maximum value, for instance by explicitly supporting value-added information from longer sequences with multiple features (e.g. for taxonomy calling). We also anticipate that the currently used alignment-based methods will be supplemented by profile-based methods for performance reasons within a few years.

Key Points

  • Analyzing the growing volume of biomedical environmental sequence data requires cost-effective, reproducible and flexible analysis platforms and data reuse and is significantly different from analyzing (almost) complete genomes.

  • The hosted MG-RAST service provides a Linux container-based workflow system and a RESTful API that allow data and analysis reuse.

  • Community portals are the right location for performance engineering, as they operate at the required scale.

Folker Meyer is a Senior Computational Biologist at Argonne National Laboratory; a Professor at the Department of Medicine, University of Chicago; and a Senior Fellow at the Computation Institute at the University of Chicago. He is also deputy division director of the Biology Division at Argonne National Laboratory and a senior fellow at the Institute of Genomics and Systems Biology (a joint Argonne National Laboratory and University of Chicago Institute).

Saurabh Bagchi is a Professor in the School of Electrical and Computer Engineering and the Department of Computer Science (by courtesy) at Purdue University. He is the founding Director of CRISP, a university-wide resiliency center at Purdue.

Somali Chaterji is a biomedical engineer and medical information analyst. She is a research faculty member at Purdue University, specializing in high-performance computing infrastructures and algorithms for synthetic biology and epigenomics.

Wolfgang Gerlach, PhD is a Bioinformatics Senior Software Engineer at the University of Chicago with a joint appointment at Argonne National Laboratory.

Ananth Grama is a Professor of Computer Science at Purdue University. He also serves as the Associate Director of the Center for Science of Information, a Science and Technology Center of the National Science Foundation.

Travis Harrison is a Bioinformatics Senior Software Engineer at the University of Chicago with a joint appointment at Argonne National Laboratory.

Tobias Paczian is a Senior Developer at the University of Chicago with a joint appointment at Argonne National Laboratory. He has more than a decade of experience building User Interfaces for bioinformatics applications.

William L. Trimble, PhD is a postdoctoral researcher at Argonne National Laboratory with a background in physics and data science.

Andreas Wilke is a Principal Bioinformatics Specialist at Argonne National Laboratory with a joint appointment at the University of Chicago. He has more than a decade of experience building bioinformatics applications.

Acknowledgements

The authors thank Dion Antonopoulos, Gail Pieper and Robert Ross for their input and assistance.

Funding

The work reported in this article was supported in part by National Institutes of Health (NIH) grant 1R01AI123037-01. Work on this article was also supported by NSF award 1645609. This work was supported in part by the NIH award U01HG006537 'OSDF: Support infrastructure for NextGen sequence storage, analysis, and management' and by the Gordon and Betty Moore Foundation with the grant '6-34881, METAZen-Going the Final Mile for Solving the Metadata Crunch'. This material was based on work supported by the US Department of Energy, Office of Science, under contract DE-AC02-06CH11357.

References

2. Afgan E, Baker D, van den Beek M, et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res 2016;44:W3–10.

3. Doring A, Weese D, Rausch T, et al. SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinformatics 2008;9:11.

4. Xia F, Dou Y, Xu J. Families of FPGA-based accelerators for BLAST algorithm with multi-seeds detection and parallel extension. In: Elloumi M, Küng J, Linial M, et al. (eds), Bioinformatics Research and Development: Second International Conference, BIRD 2008 Vienna, Austria, July 7-9, 2008 Proceedings. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, 43–57.

5. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods 2015;12:59–60.

6. Meyer F, Paarmann D, D'Souza M, et al. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 2008;9:386.

7. Wilke A, Bischof J, Gerlach W, et al. The MG-RAST metagenomics database and portal in 2015. Nucleic Acids Res 2016;44:D590–4.

8. Wilke A, Bischof J, Harrison T, et al. A RESTful API for accessing microbial community data for MG-RAST. PLoS Comput Biol 2015;11:e1004008.

9. Desai N, Antonopoulos D, Gilbert JA, et al. From genomics to metagenomics. Curr Opin Biotechnol 2012;23:72–6.

10. Chen IA, Markowitz VM, Chu K, et al. IMG/M: integrated genome and metagenome comparative data analysis system. Nucleic Acids Res 2017;45:D507–16.

11. Mitchell A, Bucchini F, Cochrane G, et al. EBI metagenomics in 2016–an expanding and evolving resource for the analysis and archiving of metagenomic data. Nucleic Acids Res 2016;44:D595–603.

12. Huson DH, Weber N. Microbial community analysis using MEGAN. Methods Enzymol 2013;531:465–85.

13. Kopylova E, Noe L, Touzet H. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics 2012;28:3211–7.

14. Kang DD, Froula J, Egan R, et al. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 2015;3:e1165.

15. Eren AM, Esen OC, Quince C, et al. Anvi'o: an advanced analysis and visualization platform for 'omics data. PeerJ 2015;3:e1319.

16. Imelfort M, Parks D, Woodcroft BJ, et al. GroopM: an automated tool for the recovery of population genomes from related metagenomes. PeerJ 2014;2:e603.

17. Alneberg J, Bjarnason BS, de Bruijn I, et al. Binning metagenomic contigs by coverage and composition. Nat Methods 2014;11:1144–6.

18. Segata N, Waldron L, Ballarini A, et al. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods 2012;9:811–4.

19. Tang W, Bischof J, Desai N, et al. Workload characterization for MG-RAST metagenomic data analytics service in the cloud. In: Proceedings of IEEE International Conference on Big Data, Washington, DC, USA, 2014. IEEE Press, Piscataway, NJ, USA.

20. Tang W, Wilkening J, Bischof J, et al. Building scalable data management and analysis infrastructure for metagenomics. In: 5th International Workshop on Data-Intensive Computing in the Clouds, Poster at Supercomputing 2013.

21. Wilke A, Wilkening J, Glass EM, et al. An experience report: porting the MG-RAST rapid metagenomics analysis pipeline to the cloud. Concurr Comput 2011;23:2250–7.

25. Kanehisa M, Goto S, Sato Y, et al. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res 2014;42:D199–205.

27. Tatusov RL, Fedorova ND, Jackson JD, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003;4:41.

28. O'Leary NA, Wright MW, Brister JR, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 2016;44:D733–45.

29. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004;32:1792–7.

31. Plummer E, Twin J, Bulach DM, et al. A comparison of three bioinformatics pipelines for the analysis of preterm gut microbiota using 16S rRNA gene sequencing data. J Proteom Bioinform 2016;283–91.

32. Alexandre R. Instant Apache Solr for Indexing Data How-to. Packt Publishing Limited, 2013.

36. Bent J, Gibson M, Grider G, et al. A Checkpoint Filesystem for Parallel Applications. 2009.

37. Jens Freche WF, Sutmann G. High-Throughput Parallel-I/O using SIONlib for Mesoscopic Particle Dynamics Simulations on Massively Parallel Computers, Advances in Parallel Computing; Volume 19: Parallel Computing: From Multicores and GPU's to Petascale, 371–78, DOI: 10.3233/978-1-60750-530-3-371. IOS Press, Amsterdam.

38. Jay FL, Scott K, Karsten S, et al. Flexible IO and integration for scientific codes through the adaptable IO system (ADIOS). In: Proceedings of the 6th International Workshop on Challenges of Large Applications in Distributed Environments. Boston, MA: ACM, 2008, 15–24.

39. Dennis JM, Edwards J, Loy R, et al. An application level parallel I/O library for earth system models. Int J High Perform Comput Appl 2012;26:43–53.

40. Bischof J, Wilke A, Gerlach W, et al. Shock: active storage for multicloud streaming data analysis. In: 2nd IEEE/ACM International Symposium on Big Data Computing. Limassol, Cyprus, 2015.

41. Wilkening J, Wilke A, Desai N, et al. Using Clouds for Metagenomics: A Case Study. CLUSTER. New Orleans, LA: IEEE Computer Society, 2009, 1–6.

43. Gerlach W, Tang W, Keegan K, et al. Skyport: container-based execution environment management for multi-cloud scientific workflows. In: Proceedings of the 5th International Workshop on Data-Intensive Computing in the Clouds. 2014, 25–32. IEEE Press, Piscataway, NJ, USA.

44. Kurtzer G, Sochat V, Bauer M. Singularity: Scientific containers for mobility of compute. PLoS ONE 2017;12(5):e0177459.

46. Keegan KP, Trimble WL, Wilkening J, et al. A platform-independent method for detecting errors in metagenomic sequencing data: DRISEE. PLoS Comput Biol 2012;8:e1002541.

47. Marcais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 2011;27:764–70.

48. Aronesty E. Comparison of sequencing utility programs. Open Bioinform J 2013;7:1–8.

49. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9:357–9.

50. Rho M, Tang H, Ye Y. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res 2010;38:e191.

51. Rognes T, Flouri T, Nichols B, et al. VSEARCH: a versatile open source tool for metagenomics. PeerJ 2016;4:e2584.

52. Wilke A, Harrison T, Wilkening J, et al. The M5nr: a novel non-redundant database containing protein sequences and annotations from multiple sources and associated tools. BMC Bioinformatics 2012;13:141.

53. Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res 2002;12:656–64.

54. Overbeek R, Bartels D, Vonstein V, et al. Annotation of bacterial and archaeal genomes: improving accuracy and consistency. Chem Rev 2007;107:3431–47.

56. Mungall CJ, Torniai C, Gkoutos GV, et al. Uberon, an integrative multi-species anatomy ontology. Genome Biol 2012;13:R5.

57. Buttigieg PL, Morrison N, Smith B, et al. The environment ontology: contextualising biological and biomedical entities. J Biomed Semantics 2013;4:43.

58. Field D, Sterk P, Kottmann R, et al. Genomic standards consortium projects. Stand Genomic Sci 2014;9:599–601.

59. Bischof J, Harrison T, Paczian T, et al. Metazen - metadata capture for metagenomes. Stand Genomic Sci 2014;9:18.

60. Yilmaz P, Kottmann R, Field D, et al. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nat Biotechnol 2011;29:415–20.

61. Glass EM, Dribinsky Y, Yilmaz P, et al. MIxS-BE: a MIxS extension defining a minimum information standard for sequence data from the built environment. ISME J 2014;8:1–3.

62. Trimble WL, Keegan KP, D'Souza M, et al. Short-read reading-frame predictors are not created equal: sequence error causes loss of signal. BMC Bioinformatics 2012;13:183.

63. Sean B, Buchan I, De Roure D, et al. Why linked data is not enough for scientists. Fut Gener Comput Syst 2013;29(2):599–611.

64. Crusoe MR, Brown CT. Walking the talk: adopting and adapting sustainable scientific software development processes in a small biology lab. J Open Res Softw 2016;4:e44.

65. Tang W, Wilkening J, Desai N, et al. A scalable data analysis platform for metagenomics. In: 2013 IEEE International Conference on Big Data, Silicon Valley, CA, USA, 2013. IEEE Press, Piscataway, NJ, USA.

66. Michael M, Moreira JE, Shiloach D, et al. Scale-up x scale-out: a case study using Nutch/Lucene. In: 2007 IEEE International Parallel and Distributed Processing Symposium. 2007, 1.

This work is written by US Government employees and is in the public domain in the US.

Source: https://academic.oup.com/bib/article/20/4/1151/4237462
