Science Stories: Adventures in Bay-Delta Data

rss
  • July 5, 2022

Authors: Rosemary Hartman and the IEP Data Utilization Work Group

Here at IEP, we collect a lot of data, and we do a lot of science. However, people haven’t always realized how much data we collect because it hasn’t always been easy to find. For scientists that were able to find the data, sometimes it was difficult to understand or it was shared in a hard-to-use format. That’s why IEP’s Data Utilization Work Group (DUWG) has been pushing for more Open Science practices over the past five years to make our data more F.A.I.R (Findable, Accessible, Interoperable, and Reusable). And wow! We’ve come a long way in a short time.

A staircase with FAIR Principles written on it and stick figures climbing it. Circles are around the staircase.  One shows a map pin that says Persistent and Findable. One shows an open lock that says 'Accessible' with meaningful interaction. One shows a person and a puzzle and says 'Reusable with Full Disclosure', and one shows two computers with a line between them and says 'Interoperable'.

This image was created by Scriberia for The Turing Way community and is used under a CC-BY license. DOI: 10.5281/zenodo.3332807

What is Open Science anyway? Well, I was going to call it “the cool-kids club” but really it’s the opposite of a club! It’s the anti-club that makes sure everyone has access to science – no membership required. Open science means that all scientists communicate in a transparent, reproducible way, with open-access publications, freely shared data, open-source software, and openness to diversity of knowledge. Open science encourages collaboration and breaks down silos between researchers – so it’s a natural fit for a 9-member collaborative organization like IEP.

For IEP, the ‘open data’ component is where we’ve really been making strides. While “share your data freely” sounds easy, it’s actually taken a lot of work to make our data FAIR. As government entities, theoretically all of the data we collect is held in the public trust, but putting data in a format that other people can use is not simple. Here are some of the things we have done to make IEP data more open:

Data Management Plans

The first thing the DUWG did was get all IEP projects to fill out a simple, 2-page data management plan outlining what was being done with the data in short, clear sections:

  • Who: Principal investigator and point of contact for the data.
  • What: Description of data to be collected and any related data that will be incorporated into the analysis.
  • Metadata: How the metadata will be generated and where it will be made available.
  • Format: What format the data will be stored in and what format it will be shared in, which may not be the same. For example, you may store data in an Access database but share it in non-proprietary .csv formats.
  • Storage and Backup: Where you will put the data as you are collecting it and how it will be backed-up for easy recovery. This is about short-term storage.
  • Archiving and Preservation: This is about long-term storage to keep your data for someone years down the line. This is best done with publication on a data archive platform, such as the Environmental Data Initiative (EDI).
  • Quality Assurance: Brief description of Quality Assurance and Quality Control (QAQC) procedures and where a data user can access full QAQC documentation.
  • Access and Sharing: How can users find your data? Is it posted on line or by request? Are there any restrictions on how the data can be used or shared? 

You can find instructions (PDF) and a template (PDF) for Data Management Plans on the DUWG page. All of IEP’s data management plans are also posted on the IEP website.

Data Publication

Many IEP agencies were already sharing data on agency websites, but most of this was done without formal version control, machine-readable metadata, or digital object identifiers (DOIs), making it difficult to track how data were being used. Now IEP is recommending publishing data on EDI or other data archives. Datasets now have robust metadata, open-source data formats (like .csv tables instead of Microsoft Access databases), and DOIs for each version so studies using these data can be reproduced easily.

Cartoons of stick people illustrating the phases of the data life cycle with arrows connecting them. Data collection - People with nets catching shapes.  Data processing - people take shapes out of a box labeled short-term storage and lay them out on a table. Data Study and Analysis - people make patterns with the shapes. Data publishing and access - People present the data to an audience. Data Preservation - People put shapes in tubes and boxes. Data re-use - people open tubes and a string of shapes come out. Research ideas - Shapes inside a light bulb.

This image was created by Scriberia for The Turing Way community and is used under a CC-BY license. DOI: 10.5281/zenodo.3332807

Metadata Standards

The term “metadata” can mean different things to different people. Some people may think it simply means the definitions of all the columns in a data set. Some people may think it means a history of changes to the sampling program. Some people think it’s your standard operating procedures. Some people may think it means data about social media networks. What is it? Well, it’s the “who, what, where, when, why, and how” of your data set. It should include everything a data user needs to understand your data as well as you do. The DUWG developed a template for metadata that includes everything we think you should include in full documentation for a dataset. Some of it might not apply to every dataset, but it is a good checklist to get you started.

You can find the Metadata template (PDF) on the DUWG page.

QAQC standards

The DUWG is just starting to dig into QAQC. Quality assurance is an integrated system of management activities to prevent errors in your data, while quality control is a system of technical activities to find errors in your data. QAQC systems have become standard practice in analytical labs, but the formalization and standardization of QAQC practices is new for a lot of the fish-and-bug-counters at IEP. The DUWG QAQC sub-team developed a template for Standard Operating Procedures (PDF), and is working to provide guidance for QAQC of all types of data, and for integrating QAQC into all sampling programs. This promotes consistency across time, people, and space, increases transparency, and gives users more confidence in your data.

Dataset integration

One of the great things about laying down the framework for open data that includes data publication, documentation, and quality control is that it then becomes much easier to integrate datasets across programs. The IEP synthesis team (spearheaded by Sam Bashevkin of the Delta Science Program) has developed several integrated datasets that pull publicly accessible data, put them in a standard format, and publish them in a single, easy-to-use format.

Spreading the Word

We’re also making sure EVERYONE knows about how great our data are.

  • We’ve revamped the data access webpage on our IEP site.
  • Publishing data on EDI makes it available on DataOne, which allows searches across multiple platforms.
  • Publishing data papers is a relatively new way to let people know about a dataset. For example, this zooplankton data paper was recently published in PLOSOne.
  • We’ve made presentations at the Water Data Science Symposium and other scientific meetings.
  • We published an Open Data Framework Essay in San Francisco Estuary and Watershed Sciences.
  • We also put on a Data Management Showcase (video) that you can watch via the Department of Water Resources YouTube Channel.
  • Plus, we have lots more data management resources available on the DUWG website.

Together, we're putting IEP Data on the Open Science Train to global recognition. 

Questions? Feel free to reach out to the DUWG co-chairs: Rosemary Hartman and Dave Bosworth. If you have any suggestions for improving data management or sharing, we want to hear about it.

Two birds are in a fountain labeled Fountain of Open Data. One asks: You mind if I reuse this data? The other says: Go ahead! we can even work together on it.

This image was created by Scriberia for The Turing Way community and is used under a CC-BY license. DOI: 10.5281/zenodo.3332807

Further Reading

Categories: BlogDataScience