For my last post, I wanted to talk about metadata. These are the data that describe the who, what, where and how of the genomic data that we submit to the GenBank SRA and/or to Dryad. Despite efforts by journals to require that this information be submitted alongside the genetic data, many papers provide minimal information: Pope et al. (2015) found that 31% of papers in Molecular Ecology did not provide enough information to recreate the analysis.
Nowadays, if I wanted to do a meta-analysis of local adaptation in echinoderms, I would have to start with a literature search, and then recover the associated datasets from Dryad one at a time. Imagine if I could simply search for genomic-scale datasets and download all the data straightaway. That is what we were shooting for when we created the Genomic Observatories Metadatabase (GeOMe). Initially, our goal was just to capture all published population genetic data from the Indo-Pacific, but I was lucky to meet up with John Deck, who was building the Field Information Management System, and we found common cause (Deck et al. 2017).
The database we’ve developed is an exciting new approach to the problem of metadata and reproducibility. Notably, we don’t try to reinvent the wheel by providing storage for the genomic data. Instead, you fill out one of our metadata templates and upload it to GeOMe. GeOMe validates your entries in a variety of ways and then provides you with all the files necessary for direct upload to the GenBank Sequence Read Archive. Once the data are uploaded, GeOMe harvests the accession information, thereby creating a permanent link between your metadata and the genomic data.
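To give a feel for what validating a metadata sheet can involve, here is a minimal Python sketch. It is not GeOMe's actual validation code, and the rules are my own assumptions for illustration: the required columns are named after Darwin Core terms (materialSampleID, scientificName, decimalLatitude, decimalLongitude, eventDate), coordinates must be finite numbers within range, and dates must be in ISO format. The function names `validate_row` and `validate_sheet` are hypothetical.

```python
import csv
import io
import math
from datetime import date

# Assumed set of required columns, named after Darwin Core terms.
# A real project would choose its own required fields.
REQUIRED = (
    "materialSampleID",
    "scientificName",
    "decimalLatitude",
    "decimalLongitude",
    "eventDate",
)


def validate_row(row):
    """Return a list of problems found in one metadata row (empty if none)."""
    problems = []

    # Every required field must be present and non-blank.
    for name in REQUIRED:
        if not (row.get(name) or "").strip():
            problems.append(f"missing {name}")

    # Coordinates must be finite numbers within the valid range.
    for name, limit in (("decimalLatitude", 90.0), ("decimalLongitude", 180.0)):
        value = (row.get(name) or "").strip()
        if value:
            try:
                number = float(value)
            except ValueError:
                problems.append(f"{name} is not a number")
            else:
                if not math.isfinite(number) or abs(number) > limit:
                    problems.append(f"{name} out of range")

    # Collection dates must parse as ISO dates.
    event = (row.get("eventDate") or "").strip()
    if event:
        try:
            date.fromisoformat(event)
        except ValueError:
            problems.append("eventDate is not an ISO date (YYYY-MM-DD)")

    return problems


def validate_sheet(text):
    """Validate a CSV metadata sheet; map spreadsheet row number to problems."""
    reader = csv.DictReader(io.StringIO(text))
    report = {}
    for number, row in enumerate(reader, start=2):  # row 1 is the header
        problems = validate_row(row)
        if problems:
            report[number] = problems
    return report
```

Calling `validate_sheet` on the text of a CSV export returns a dictionary keyed by spreadsheet row number, so an empty dictionary means the sheet passed every check in this sketch.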
GeOMe is organized by discrete “Projects.” Each project can represent all of the data from one particular researcher, or a consortium of researchers, or all of the data generated by a grant or an institution. Groups working together on projects can determine which metadata fields are most important, and whether to keep the metadata in a structured, relational format, or a simple flat spreadsheet. Once these decisions have been made, GeOMe provides a spreadsheet template with all field definitions ready to go: your data management plan is done! Within each project are “Expeditions,” which can contain all of the data from an actual expedition, or all the data underlying a particular paper. Metadata fields in GeOMe follow Darwin Core, the community standard for biodiversity data. Finally, upon accession, every sample is given a globally unique identifier, which means that anyone will be able to find your data into the next century.
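The Project, Expedition and sample hierarchy can be pictured with a small Python sketch. This is only an illustration under my own assumptions, not GeOMe's data model: the class and field names are hypothetical, and a random UUID stands in for whatever identifier scheme GeOMe actually uses.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Sample:
    material_sample_id: str  # the name used locally, e.g. on the tube
    scientific_name: str
    # Stand-in for a globally unique identifier (assumption: a random UUID).
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class Expedition:
    title: str  # a field trip, or the dataset behind one paper
    samples: list = field(default_factory=list)


@dataclass
class Project:
    title: str  # a researcher, consortium, grant or institution
    expeditions: list = field(default_factory=list)

    def find(self, guid):
        """Return (expedition, sample) for an identifier, or None if absent."""
        for expedition in self.expeditions:
            for sample in expedition.samples:
                if sample.guid == guid:
                    return expedition, sample
        return None
```

Because the identifier is unique across the whole project rather than within one spreadsheet, a sample can be looked up without knowing which expedition it came from.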
For a lot of researchers, curation of the data underlying a paper is a little-valued last step after the grueling process of getting the paper submitted. We hope that GeOMe will put metadata collection in a more central place in your project, so that you are thinking about this important aspect as you collect the data, and so that archiving the data will be as painless as possible.
Literature Cited
Deck J, Gaither MR, Ewing R, Bird CE, Davies N, Meyer C, Riginos C, Toonen RJ, Crandall ED. 2017. The Genomic Observatories Metadatabase (GeOMe): A new repository for field and sampling event metadata associated with genetic samples. PLoS Biol 15:e2002925.
Pope LC, Liggins L, Keyse J, Carvalho SB, Riginos C. 2015. Not the time or the place: the missing spatio-temporal link in publicly available genetic data. Mol Ecol 24:3802–3809.