MaizeGDB Genome Assembly and Annotation Manifesto

Genome Assembly
Genome Annotation
Suggested Elements for a Successful Collaboration with the Maize Community


If you need to document involvement of MaizeGDB in your planned assembly or annotation efforts, contact Carson Andorf (carson.andorf@ars.usda.gov) for a letter of collaboration.




B73 GENOME ASSEMBLY

It is imperative that the community work from the same genome coordinate system across projects so that the data generated by various groups can be fully leveraged and displayed in a comparable manner. Like many Model Organism Databases, MaizeGDB is charged with facilitating this process and is committed to releasing official genome assemblies as they are made available.


November 2010
B73 RefGen_v2 was released as the default view of the assembly at MaizeGDB. This version was produced by the Maize Genome Sequencing Consortium and became available via GenBank on December 7th, 2012. The GenBank project record is 10769.


April 2013
The next version, B73 RefGen_v3, became the default assembly view of the MaizeGDB Genome Browser in April 2013. RefGen_v3 was not a global re-assembly. Instead, it used Roche/454 reads produced from a whole genome shotgun (WGS) sequencing library to capture gene space missing within and between the original BACs. The 454 reads were assembled into contigs with ABySS and aligned to the B73 RefGen_v2 assembly to identify new contiguous pieces of DNA sequence that were not already represented in the v2 assembly. In addition, ~65,000 full-length cDNAs (FLcDNAs, from the Maize Full Length cDNA project; more information here and here) were aligned to both the B73 RefGen_v2 contigs and the new contigs. B73 RefGen_v3 was the final product of the Maize Genome Sequencing Consortium.


August 2016
An entirely new assembly of the maize genome (B73 RefGen_v4) was constructed from PacBio Single Molecule Real-Time (SMRT) sequencing at approximately 60-fold coverage and scaffolded with the aid of a high-resolution whole-genome restriction (optical) map. This new assembly was constructed without the assistance of the BAC physical map that had been used to guide the previous v1-v3 assemblies. The pseudomolecules of maize B73 RefGen_v4 were assembled nearly end-to-end, representing a 52-fold improvement in average contig size relative to the previous reference (B73 RefGen_v3). Additional information on this assembly can be found here. B73 RefGen_v4 was funded by the NSF IOS #1112127 award to Gramene.


ADDITIONAL GENOME ASSEMBLIES

Initially, the B73 genome was the only reference-quality genome assembly available for maize, due to the high costs of sequencing and assembling a large (~2.1 Gb) genome. More recently, as those costs have dropped, a number of maize research groups have constructed reference-quality genome assemblies for some of the more widely used maize inbred lines. Detailed information on those genome assemblies can be found here.


GENOME ASSEMBLY AND GENE MODEL NOMENCLATURE

A well-developed nomenclature system is necessary to prevent confusion and to relay as much information as possible without being overly cumbersome. A nomenclature system needs to account for species-specific information so that the exact inbred line used and project-specific metadata can be accessed easily. The change from GRMZM IDs to the new nomenclature was made for several reasons, the main one being to indicate which maize line the models are derived from. This is particularly important in maize, which is well documented to contain substantial presence/absence variation (PAV) and copy number variation (CNV) across inbred lines. To make this transition easier, older maize nomenclature is retained as a synonym and can be used to look up gene models at MaizeGDB. Note also that gene model names in maize DO NOT CONVEY ORDER ALONG THE PSEUDOMOLECULE. Specific details on the current maize nomenclature standards in use can be found here and here.
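As an illustration of how line-specific identifiers can be told apart from legacy ones, the following is a minimal Python sketch. It is not the official specification: the shape assumed for new-style identifiers (a "Zm" prefix, a five-digit line code, a one- or two-letter annotation version, and a six-digit gene number, as in Zm00001d027230) is an assumption made for this example, as are the function and field names. The nomenclature standards referenced above remain the authoritative source.

```python
import re

# ASSUMPTION: new-style identifiers look like "Zm" + 5-digit line code
# + 1-2 lowercase letters (annotation version) + 6-digit gene number.
# This pattern is illustrative only, not the official standard.
NEW_STYLE = re.compile(
    r"^Zm(?P<line>\d{5})(?P<version>[a-z]{1,2})(?P<number>\d{6})$"
)


def classify_gene_id(gene_id):
    """Classify a gene model identifier as 'new', 'legacy', or 'unknown'.

    For new-style identifiers the line code and annotation version are
    returned as well. The gene number is deliberately NOT interpreted as
    a position, because names do not convey order along the pseudomolecule.
    """
    gene_id = gene_id.strip()
    match = NEW_STYLE.match(gene_id)
    if match:
        return {
            "style": "new",
            "line_code": match.group("line"),
            "annotation_version": match.group("version"),
        }
    # Legacy identifiers are recognized here only by their GRMZM prefix.
    if gene_id.startswith("GRMZM"):
        return {"style": "legacy"}
    return {"style": "unknown"}


new_id = classify_gene_id("Zm00001d027230")
old_id = classify_gene_id("GRMZM2G123456")  # hypothetical legacy-style ID
```

Because the line code is part of the identifier, a lookup tool can route a query to the correct assembly before consulting any synonym table for older names.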


MAIZEGDB ACCEPTS FUNCTIONAL ANNOTATION

Functional annotation can mean different things to different people. It generally involves attaching information regarding gene product identity, biological or biochemical function, expression, regulation, and interactions to a genomic DNA sequence. Are you generating RNA-seq data that you would like aligned to assemblies to show that the genes in a particular region are expressed? Do you have a mutation in a gene that is mapped to a genome assembly, with a known mutant phenotype? Have you experimentally determined the temporal and spatial regulation of a small group of transcription factors? MaizeGDB is interested in both small and large functional annotation data sets, whether determined by in silico analysis or by experimental validation. Contact us at MaizeGDB to find out how your functional annotations can be included in the MaizeGDB resource.

In addition to the types of functional annotations already described, we at MaizeGDB accept functional annotations that are based upon assignment of terms from the Gene Ontologies (GO; http://www.geneontology.org) to gene structures. When GO terms are assigned to a particular gene, standard Evidence Codes are required to document how the inference of function was made. For example, an annotation that was made on the basis of a published, peer-reviewed experiment would have the evidence code EXP, whereas an annotation made on the basis of an enzyme assay would have the evidence code IDA. Evidence Codes used by the Gene Ontology Consortium are available here.
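To show how Evidence Codes make it possible to separate experimentally supported annotations from purely in silico ones, here is a minimal Python sketch. The code groupings follow the categories published by the Gene Ontology Consortium, but placing the high-throughput codes under "experimental" is a choice made for this example. The function names, the tuple layout, and the gene names gene_A and gene_B are hypothetical, and nothing here describes MaizeGDB's actual submission format.

```python
from collections import defaultdict

# GO evidence codes grouped by broad category.
EXPERIMENTAL = frozenset({"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"})
HIGH_THROUGHPUT = frozenset({"HTP", "HDA", "HMP", "HGI", "HEP"})
COMPUTATIONAL = frozenset(
    {"ISS", "ISO", "ISA", "ISM", "IGC", "IBA", "IBD", "IKR", "IRD", "RCA", "IEA"}
)
STATEMENT = frozenset({"TAS", "NAS", "IC", "ND"})


def evidence_class(code):
    """Map a GO evidence code to 'experimental', 'in_silico', or 'statement'."""
    code = code.strip().upper()
    # Design choice for this sketch: high-throughput codes count as experimental.
    if code in EXPERIMENTAL or code in HIGH_THROUGHPUT:
        return "experimental"
    if code in COMPUTATIONAL:
        return "in_silico"
    if code in STATEMENT:
        return "statement"
    raise ValueError(f"unrecognized evidence code: {code!r}")


def partition_by_evidence(annotations):
    """Group (gene_id, go_id, evidence_code) tuples by evidence class."""
    groups = defaultdict(list)
    for gene_id, go_id, code in annotations:
        groups[evidence_class(code)].append((gene_id, go_id, code))
    return dict(groups)


# Hypothetical annotations: gene_A and gene_B are made-up gene names.
example = [
    ("gene_A", "GO:0003824", "IDA"),  # supported by a direct (enzyme) assay
    ("gene_B", "GO:0003824", "IEA"),  # inferred from electronic annotation
]
groups = partition_by_evidence(example)
```

Rejecting unrecognized codes with an error, rather than silently filing them under a default class, keeps a mistyped code from being mistaken for experimental support.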


SUGGESTED GUIDELINES FOR RESEARCH GROUPS PLANNING TO SEQUENCE, ASSEMBLE, AND ANNOTATE A MAIZE GENOME FOR SUBMISSION TO MAIZEGDB

A plan for providing documentation that is complete, accurate, and timely. A centrally accessible plan should be made available at the time that your project begins and include a timeline for data delivery. Functional and structural annotation should be provided with standard evidence codes, clearly discriminating annotation with experimental evidence from purely in silico analyses.

A plan for developing a close working relationship with MaizeGDB as the ultimate disseminators of the information. Assemblies and annotations should be delivered to MaizeGDB regularly and in a timely fashion. MaizeGDB will create a webpage for your genome assembly to display your project metadata, and can display the deliverable dates there so as to keep the community informed. Ideally, these dates should be known in advance and adhered to if at all possible. It is understood that delays can occur; the intent here is to make the process more transparent to the research community.

A mechanism for interacting with the maize community directly and with a single voice. Maize researchers comprise a vibrant community with researchers at all levels in both the public and private sectors. A bidirectional means of communicating with the maize community should be deployed at the start of the project so that the maize community can both absorb and respond to new project information quickly. The goal is to provide all community members with the same information at the same time so that they can plan their research activities accordingly. This can be accomplished in many ways (FAQs, blogs, social media, conferences, etc.) and all options should be considered so as to reach the largest number of stakeholders.

A robust way to capture genome assembly and annotation information from the community. For any genome assembly, researchers often have high-quality structural and functional annotations for their genes of interest, both stored on lab computers and documented in publications. Researchers are usually willing to share this information freely, but currently there is no robust means to capture it. Groups developing genome assemblies are encouraged to work with MaizeGDB to develop a plan for collecting high-value annotations that are specific to their assemblies. All annotations submitted by community members for specific genome assemblies will be vetted by MaizeGDB curators and then incorporated into the assembly, with an indication of who provided the data. Although comparatively little data is expected to enter the assembly process in this way, these data should be of very high quality.