Authors: Rosemary Hartman and the IEP Data Utilization Work Group
Here at IEP, we collect a lot of data, and we do a lot of science. However, people haven’t always realized how much data we collect because it hasn’t always been easy to find. For scientists who were able to find the data, sometimes it was difficult to understand or it was shared in a hard-to-use format. That’s why IEP’s Data Utilization Work Group (DUWG) has been pushing for more Open Science practices over the past five years to make our data more FAIR (Findable, Accessible, Interoperable, and Reusable). And wow! We’ve come a long way in a short time.
This image was created by Scriberia for The Turing Way community and is used under a CC-BY license. DOI: 10.5281/zenodo.3332807
What is Open Science anyway? Well, I was going to call it “the cool-kids club” but really it’s the opposite of a club! It’s the anti-club that makes sure everyone has access to science – no membership required. Open science means that all scientists communicate in a transparent, reproducible way, with open-access publications, freely shared data, open-source software, and openness to diversity of knowledge. Open science encourages collaboration and breaks down silos between researchers – so it’s a natural fit for a 9-member collaborative organization like IEP.
For IEP, the ‘open data’ component is where we’ve really been making strides. While “share your data freely” sounds easy, it’s actually taken a lot of work to make our data FAIR. As government entities, theoretically all of the data we collect is held in the public trust, but putting data in a format that other people can use is not simple. Here are some of the things we have done to make IEP data more open:
Data Management Plans
The first thing the DUWG did was get all IEP projects to fill out a simple, 2-page data management plan outlining what was being done with the data in short, clear sections:
- Who: Principal investigator and point of contact for the data.
- What: Description of data to be collected and any related data that will be incorporated into the analysis.
- Metadata: How the metadata will be generated and where it will be made available.
- Format: What format the data will be stored in and what format it will be shared in, which may not be the same. For example, you may store data in an Access database but share it in non-proprietary .csv formats.
- Storage and Backup: Where you will put the data as you are collecting it and how it will be backed up for easy recovery. This is about short-term storage.
- Archiving and Preservation: This is about long-term storage to keep your data for someone years down the line. This is best done with publication on a data archive platform, such as the Environmental Data Initiative (EDI).
- Quality Assurance: Brief description of Quality Assurance and Quality Control (QAQC) procedures and where a data user can access full QAQC documentation.
- Access and Sharing: How can users find your data? Is it posted online or available by request? Are there any restrictions on how the data can be used or shared?
You can find instructions (PDF) and a template (PDF) for Data Management Plans on the DUWG page. All of IEP’s data management plans are also posted on the IEP website.
Data Publication
Many IEP agencies were already sharing data on agency websites, but most of this was done without formal version control, machine-readable metadata, or digital object identifiers (DOIs), making it difficult to track how data were being used. Now IEP is recommending publishing data on EDI or other data archives. Datasets now have robust metadata, open, non-proprietary data formats (like .csv tables instead of Microsoft Access databases), and DOIs for each version so studies using these data can be reproduced easily.
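As a minimal sketch of the "store it in a database, share it as .csv" idea, the Python example below exports a table to CSV text. It uses an in-memory SQLite database as a stand-in for an Access database, and the table name `zoop_catch`, its columns, and the sample rows are all hypothetical, made up for illustration rather than taken from any IEP dataset.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn, table, out):
    """Write every row of `table` to `out` as CSV, header row first.

    Returns the number of data rows written.
    """
    # Table names cannot be bound as SQL parameters, so accept only simple identifiers.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    cursor = conn.execute(f"SELECT * FROM {table}")
    header = [col[0] for col in cursor.description]
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    n_rows = 0
    for row in cursor:
        writer.writerow(row)
        n_rows += 1
    return n_rows


# Hypothetical working database (a stand-in for the storage format).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE zoop_catch (station TEXT, sample_date TEXT, taxon TEXT, count INTEGER)"
)
conn.executemany(
    "INSERT INTO zoop_catch VALUES (?, ?, ?, ?)",
    [("STN_A", "2021-05-03", "Taxon 1", 42), ("STN_B", "2021-05-03", "Taxon 2", 7)],
)

# Export to an in-memory buffer; a real export would write to a file instead.
buffer = io.StringIO()
rows_written = export_table_to_csv(conn, "zoop_catch", buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a file rather than a string, pass a file opened with `open("zoop_catch.csv", "w", newline="")` in place of the buffer. The resulting plain-text table can be read by almost any software, which is the point of sharing in a non-proprietary format.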
Metadata Standards
The term “metadata” can mean different things to different people. Some people may think it simply means the definitions of all the columns in a dataset. Some people may think it means a history of changes to the sampling program. Some people think it’s your standard operating procedures. Some people may think it means data about social media networks. What is it? Well, it’s the “who, what, where, when, why, and how” of your dataset. It should include everything a data user needs to understand your data as well as you do. The DUWG developed a template for metadata that includes everything we think you should include in full documentation for a dataset. Some of it might not apply to every dataset, but it is a good checklist to get you started.
You can find the Metadata template (PDF) on the DUWG page.
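Column definitions are only one piece of the metadata, but they are an easy piece to check automatically. The Python sketch below compares the header of a .csv file against a data dictionary and reports any mismatch in either direction. The column names, definitions, and units are hypothetical examples, and the DUWG template covers far more than this.

```python
import csv
import io

# Hypothetical data dictionary: column name -> (definition, units).
DATA_DICTIONARY = {
    "station": ("Sampling station code", "none"),
    "sample_date": ("Date of sample collection, YYYY-MM-DD", "none"),
    "water_temp": ("Surface water temperature", "degrees C"),
}


def check_columns(csv_file, dictionary):
    """Compare a CSV header against a data dictionary.

    Returns (undocumented, unused): columns in the file with no dictionary
    entry, and dictionary entries that match no column in the file.
    """
    header = next(csv.reader(csv_file))
    columns = [name.strip() for name in header]
    undocumented = [c for c in columns if c not in dictionary]
    unused = [c for c in dictionary if c not in columns]
    return undocumented, unused


# Hypothetical data file with one column ("secchi") nobody has defined yet.
sample = io.StringIO(
    "station,sample_date,water_temp,secchi\n"
    "STN_A,2021-05-03,18.2,55\n"
)
undocumented, unused = check_columns(sample, DATA_DICTIONARY)
print("Columns with no definition:", undocumented)
print("Definitions with no column:", unused)
```

A check like this could run before each data publication so that a new column never ships without a definition.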
QAQC Standards
The DUWG is just starting to dig into QAQC. Quality assurance is an integrated system of management activities to prevent errors in your data, while quality control is a system of technical activities to find errors in your data. QAQC systems have become standard practice in analytical labs, but the formalization and standardization of QAQC practices is new for a lot of the fish-and-bug-counters at IEP. The DUWG QAQC sub-team developed a template for Standard Operating Procedures (PDF), and is working to provide guidance for QAQC of all types of data, and for integrating QAQC into all sampling programs. This promotes consistency across time, people, and space, increases transparency, and gives users more confidence in your data.
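To make the quality control half of that definition concrete, here is a minimal Python sketch of an automated range check. It flags suspicious values instead of deleting them, so a person can review each one later. The field names and plausible ranges are hypothetical placeholders, not IEP thresholds, and real QAQC procedures involve much more than range checks.

```python
# Hypothetical plausible ranges; placeholders for illustration only.
PLAUSIBLE_RANGES = {
    "water_temp": (0.0, 40.0),  # degrees C
    "ph": (0.0, 14.0),
}


def qc_flag(record, ranges):
    """Return a dict of field -> flag for every field listed in `ranges`.

    Flags are 'ok', 'missing', 'not_numeric', or 'out_of_range'.
    """
    flags = {}
    for field, (low, high) in ranges.items():
        value = record.get(field)
        if value is None or value == "":
            flags[field] = "missing"
            continue
        try:
            number = float(value)
        except (TypeError, ValueError):
            flags[field] = "not_numeric"
            continue
        flags[field] = "ok" if low <= number <= high else "out_of_range"
    return flags


# Hypothetical records, as they might be read from a .csv file (values as text).
records = [
    {"station": "STN_A", "water_temp": "18.2", "ph": "7.9"},
    {"station": "STN_B", "water_temp": "182", "ph": ""},  # likely a missed decimal point
]

# Keep the original values and attach the flags alongside them.
flagged = [dict(r, qc=qc_flag(r, PLAUSIBLE_RANGES)) for r in records]
for r in flagged:
    print(r["station"], r["qc"])
```

Keeping the flags next to the original values, rather than overwriting or dropping data, leaves the decision about each flagged value transparent to data users.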
Dataset Integration
One of the great things about laying down a framework for open data that includes data publication, documentation, and quality control is that it then becomes much easier to integrate datasets across programs. The IEP synthesis team (spearheaded by Sam Bashevkin of the Delta Science Program) has developed several integrated datasets that pull publicly accessible data from multiple programs, convert them to a standard format, and publish them as a single, easy-to-use dataset.
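The basic pattern behind an integrated dataset can be sketched in a few lines of Python: rename each program's columns to a shared set of names, record where each row came from, and stack the results. The survey names, column names, and mappings below are hypothetical, and this is only an illustration of the idea, not the synthesis team's actual code.

```python
STANDARD_COLUMNS = ["source", "station", "date", "taxon", "count"]

# Hypothetical mapping from each survey's column names to the standard names.
COLUMN_MAPS = {
    "survey_a": {"Station": "station", "SampleDate": "date", "Species": "taxon", "Catch": "count"},
    "survey_b": {"site_id": "station", "date": "date", "taxon_name": "taxon", "n": "count"},
}


def standardize(rows, source, column_map):
    """Rename columns to the standard schema and tag each row with its source."""
    out = []
    for row in rows:
        new_row = {"source": source}
        for old_name, new_name in column_map.items():
            new_row[new_name] = row[old_name]
        # Counts may arrive as text from one survey and numbers from another.
        new_row["count"] = int(new_row["count"])
        out.append(new_row)
    return out


# Hypothetical rows from two surveys that record the same kind of data differently.
survey_a = [{"Station": "STN_A", "SampleDate": "2021-05-03", "Species": "Taxon 1", "Catch": "12"}]
survey_b = [{"site_id": "B-07", "date": "2021-05-04", "taxon_name": "Taxon 1", "n": 3}]

integrated = (
    standardize(survey_a, "survey_a", COLUMN_MAPS["survey_a"])
    + standardize(survey_b, "survey_b", COLUMN_MAPS["survey_b"])
)
for row in integrated:
    print(row)
```

The `source` column matters: it lets users of the combined table trace every row back to the original published dataset and its metadata.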
Spreading the Word
We’re also making sure EVERYONE knows about how great our data are.
- We’ve revamped the data access webpage on our IEP site.
- Publishing data on EDI makes it available on DataONE, which allows searches across multiple platforms.
- Publishing data papers is a relatively new way to let people know about a dataset. For example, this zooplankton data paper was recently published in PLOS ONE.
- We’ve made presentations at the Water Data Science Symposium and other scientific meetings.
- We published an Open Data Framework Essay in San Francisco Estuary and Watershed Science.
- We also put on a Data Management Showcase (video) that you can watch via the Department of Water Resources YouTube Channel.
- Plus, we have lots more data management resources available on the DUWG website.
Together, we're putting IEP Data on the Open Science Train to global recognition.
Questions? Feel free to reach out to the DUWG co-chairs: Rosemary Hartman and Dave Bosworth. If you have any suggestions for improving data management or sharing, we want to hear about it.
Further Reading
- Baerwald MR, Davis BE, Lesmeister S, Mahardja B, Pisor R, Rinde J, Schreier B, Tobias V. 2020. An Open Data Framework for the San Francisco Estuary. San Francisco Estuary and Watershed Science. 18(2).
- DataONE. Primer on Data Management: What you always wanted to know. 2012.
- Borer ET, Seabloom EW, Jones MB, Schildhauer M. 2009. Some simple guidelines for effective data management. The Bulletin of the Ecological Society of America. 90(2):205-214.
- Hampton SE, Anderson SS, Bagby SC, Gries C, Han X, Hart EM, Jones MB, Lenhardt WC, MacDonald A, Michener WK. 2015. The Tao of open science for ecology. Ecosphere. 6(7):1-13.
- The Turing Way Community, Becky Arnold, Louise Bowler, Sarah Gibson, Patricia Herterich, Rosie Higman, Anna Krystalli, Alexander Morley, Martin O'Reilly, & Kirstie Whitaker. (2019). The Turing Way: A Handbook for Reproducible Data Science (Version v0.0.4). Zenodo. https://doi.org/10.5281/zenodo.3233986
- National Center for Ecological Analysis and Synthesis (NCEAS) Learning Hub Curriculum: NCEAS materials for reproducible research and synthesis
- USGS data management website