Science Stories: Adventures in Bay-Delta Data

  • July 5, 2022

Authors: Rosemary Hartman and the IEP Data Utilization Work Group

Here at IEP, we collect a lot of data, and we do a lot of science. However, people haven’t always realized how much data we collect because it hasn’t always been easy to find. For scientists who were able to find the data, sometimes it was difficult to understand or was shared in a hard-to-use format. That’s why IEP’s Data Utilization Work Group (DUWG) has been pushing for more Open Science practices over the past five years to make our data more F.A.I.R. (Findable, Accessible, Interoperable, and Reusable). And wow! We’ve come a long way in a short time.

A staircase with FAIR Principles written on it and stick figures climbing it. Circles are around the staircase.  One shows a map pin that says Persistent and Findable. One shows an open lock that says 'Accessible' with meaningful interaction. One shows a person and a puzzle and says 'Reusable with Full Disclosure', and one shows two computers with a line between them and says 'Interoperable'.

This image was created by Scriberia for The Turing Way community and is used under a CC-BY license. DOI: 10.5281/zenodo.3332807

What is Open Science anyway? Well, I was going to call it “the cool-kids club” but really it’s the opposite of a club! It’s the anti-club that makes sure everyone has access to science – no membership required. Open science means that all scientists communicate in a transparent, reproducible way, with open-access publications, freely shared data, open-source software, and openness to a diversity of knowledge. Open science encourages collaboration and breaks down silos between researchers – so it’s a natural fit for a nine-member collaborative organization like IEP.

For IEP, the ‘open data’ component is where we’ve really been making strides. While “share your data freely” sounds easy, it has actually taken a lot of work to make our data FAIR. As government entities, we hold all of the data we collect in the public trust, at least in theory, but putting data in a format that other people can use is not simple. Here are some of the things we have done to make IEP data more open:

Data Management Plans

The first thing the DUWG did was get all IEP projects to fill out a simple, 2-page data management plan outlining what was being done with the data in short, clear sections:

  • Who: Principal investigator and point of contact for the data.
  • What: Description of data to be collected and any related data that will be incorporated into the analysis.
  • Metadata: How the metadata will be generated and where it will be made available.
  • Format: What format the data will be stored in and what format it will be shared in, which may not be the same. For example, you may store data in an Access database but share it in non-proprietary .csv formats (a short sketch of this pattern follows this list).
  • Storage and Backup: Where you will put the data as you are collecting it and how it will be backed up for easy recovery. This is about short-term storage.
  • Archiving and Preservation: This is about long-term storage, so your data are still available to someone years down the line. This is best done through publication on a data archive platform, such as the Environmental Data Initiative (EDI).
  • Quality Assurance: Brief description of Quality Assurance and Quality Control (QAQC) procedures and where a data user can access full QAQC documentation.
  • Access and Sharing: How can users find your data? Is it posted online or available by request? Are there any restrictions on how the data can be used or shared?
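For example, here is a minimal R sketch of the “store in Access, share as .csv” pattern from the Format section above. The database path and table name are hypothetical, and the connection string assumes the Microsoft Access ODBC driver is installed:

```r
# A minimal sketch: read a table from an Access database and share it as a
# non-proprietary .csv file. Path and table name are hypothetical.
library(DBI)
library(odbc)

con <- dbConnect(
  odbc::odbc(),
  .connection_string = paste0(
    "Driver={Microsoft Access Driver (*.mdb, *.accdb)};",
    "Dbq=C:/data/survey.accdb;"  # hypothetical database path
  )
)

catch <- dbReadTable(con, "catch_table")  # hypothetical table name
dbDisconnect(con)

# Share in an open, non-proprietary format
write.csv(catch, "catch_table.csv", row.names = FALSE)
```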

You can find instructions (PDF) and a template (PDF) for Data Management Plans on the DUWG page. All of IEP’s data management plans are also posted on the IEP website.

Data Publication

Many IEP agencies were already sharing data on agency websites, but most of this was done without formal version control, machine-readable metadata, or digital object identifiers (DOIs), making it difficult to track how data were being used. Now IEP is recommending publishing data on EDI or other data archives. Datasets now have robust metadata, open-source data formats (like .csv tables instead of Microsoft Access databases), and DOIs for each version so studies using these data can be reproduced easily.
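Once a dataset is published this way, anyone can pull it straight into an analysis from a stable URL. A minimal R sketch, with a hypothetical download URL standing in for the real one listed on an EDI package’s landing page:

```r
# A minimal sketch of reading a published dataset straight from a data
# archive. The URL is a hypothetical placeholder; EDI lists a stable
# download URL for each data entity on the package landing page.
url <- "https://portal.edirepository.org/nis/dataviewer?packageid=edi.0.1&entityid=example"
fish <- read.csv(url)
str(fish)
```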

Cartoons of stick people illustrating the phases of the data life cycle with arrows connecting them. Data collection - People with nets catching shapes.  Data processing - people take shapes out of a box labeled short-term storage and lay them out on a table. Data Study and Analysis - people make patterns with the shapes. Data publishing and access - People present the data to an audience. Data Preservation - People put shapes in tubes and boxes. Data re-use - people open tubes and a string of shapes come out. Research ideas - Shapes inside a light bulb.

This image was created by Scriberia for The Turing Way community and is used under a CC-BY license. DOI: 10.5281/zenodo.3332807

Metadata Standards

The term “metadata” can mean different things to different people. Some people may think it simply means the definitions of all the columns in a data set. Some people may think it means a history of changes to the sampling program. Some people think it’s your standard operating procedures. Some people may think it means data about social media networks. What is it? Well, it’s the “who, what, where, when, why, and how” of your data set. It should include everything a data user needs to understand your data as well as you do. The DUWG developed a template for metadata that covers everything we think you should include in full documentation for a dataset. Some of it might not apply to every dataset, but it is a good checklist to get you started.

You can find the Metadata template (PDF) on the DUWG page.
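For machine-readable metadata, many archives (including EDI) use Ecological Metadata Language (EML). As a flavor of what that looks like, here is a hedged sketch using the R EML package’s list-based interface; every name, title, and identifier below is a made-up placeholder:

```r
# A minimal sketch of writing machine-readable metadata in Ecological
# Metadata Language (EML) with the R 'EML' package. All values are
# hypothetical placeholders; a real record needs far more detail.
library(EML)

creator <- list(individualName = list(givenName = "Jane", surName = "Doe"))

my_eml <- list(
  packageId = "hypothetical-id-0.1",
  system = "uuid",
  dataset = list(
    title = "Hypothetical fish survey, 2000-2020",
    creator = creator,
    contact = creator,
    abstract = "The who, what, where, when, why, and how of the data set."
  )
)

write_eml(my_eml, "metadata.xml")   # serialize to XML
eml_validate("metadata.xml")        # check against the EML schema
```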

QAQC standards

The DUWG is just starting to dig into QAQC. Quality assurance is an integrated system of management activities to prevent errors in your data, while quality control is a system of technical activities to find errors in your data. QAQC systems have become standard practice in analytical labs, but formalizing and standardizing QAQC practices is new for a lot of the fish-and-bug-counters at IEP. The DUWG QAQC sub-team developed a template for Standard Operating Procedures (PDF) and is working to provide guidance for QAQC of all types of data and for integrating QAQC into all sampling programs. This promotes consistency across time, people, and space; increases transparency; and gives users more confidence in your data.
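As a flavor of what automated quality control can look like, here is a small R sketch that flags suspect values instead of deleting them. The column names, stations, and acceptable range are all hypothetical:

```r
# A minimal sketch of an automated QC check: flag out-of-range or missing
# values rather than deleting them. Columns and limits are hypothetical.
library(dplyr)

wq <- data.frame(
  station = c("D4", "D4", "P8"),
  temperature_C = c(18.2, 45.0, NA)  # 45 C is implausible for the estuary
)

wq_flagged <- wq %>%
  mutate(temp_flag = case_when(
    is.na(temperature_C) ~ "missing",
    temperature_C < 0 | temperature_C > 35 ~ "suspect: out of range",
    TRUE ~ "pass"
  ))

wq_flagged
```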

Dataset integration

One of the great things about laying down a framework for open data that includes data publication, documentation, and quality control is that it then becomes much easier to integrate datasets across programs. The IEP synthesis team (spearheaded by Sam Bashevkin of the Delta Science Program) has developed several integrated datasets that pull publicly accessible data, put them into a standard format, and publish them as a single, easy-to-use dataset.
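Once each survey is published with clear metadata, integration can be as mechanical as renaming columns to a shared standard and stacking the tables. A hedged R sketch with two made-up surveys:

```r
# A minimal sketch of dataset integration: map each survey's column names
# onto a shared standard, then stack the rows. Both surveys are made up.
library(dplyr)

survey_a <- data.frame(Station = "A1", SampleDate = "2021-06-01", Count = 5)
survey_b <- data.frame(site = "B2", date = "2021-06-02", n_fish = 3)

integrated <- bind_rows(
  survey_a %>%
    rename(station = Station, date = SampleDate, count = Count) %>%
    mutate(source = "Survey A"),
  survey_b %>%
    rename(station = site, count = n_fish) %>%
    mutate(source = "Survey B")
)

integrated
```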

Spreading the Word

We’re also making sure EVERYONE knows how great our data are.

  • We’ve revamped the data access webpage on our IEP site.
  • Publishing data on EDI makes it available on DataOne, which allows searches across multiple platforms.
  • Publishing data papers is a relatively new way to let people know about a dataset. For example, this zooplankton data paper was recently published in PLOS ONE.
  • We’ve made presentations at the Water Data Science Symposium and other scientific meetings.
  • We published an Open Data Framework Essay in San Francisco Estuary and Watershed Sciences.
  • We also put on a Data Management Showcase (video) that you can watch via the Department of Water Resources YouTube Channel.
  • Plus, we have lots more data management resources available on the DUWG website.

Together, we're putting IEP Data on the Open Science Train to global recognition. 

Questions? Feel free to reach out to the DUWG co-chairs: Rosemary Hartman and Dave Bosworth. If you have any suggestions for improving data management or sharing, we want to hear them.

Two birds are in a fountain labeled Fountain of Open Data. One asks: You mind if I reuse this data? The other says: Go ahead! we can even work together on it.

This image was created by Scriberia for The Turing Way community and is used under a CC-BY license. DOI: 10.5281/zenodo.3332807

Further Reading

  • May 17, 2021

One of life’s greatest joys is playing with data. However, not everyone has the time or experience needed to make fancy graphs. Fortunately, the number of online web applications that let people with no data analysis experience visualize the status and trends of data across space and time has exploded in recent years.

Three fish look at a graph. One says 'I want to make graphs, but I can't type with fins'. Another says 'I don't even know where to get the data!'. The third says 'Don't worry, there are lots of apps that you can use to graph the data automatically.'
Figure 1. Fish love data, but they need a little help making their graphs.

Some of the first data visualization tools were the mapping widgets on the CDFW website. These maps, available since the late 1990s, let you plot the catch of different fish species as different-sized bubbles.
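In R, the bubble-map idea boils down to mapping catch onto point size. A hedged sketch with made-up station coordinates and counts:

```r
# A minimal sketch of a catch "bubble map": point size scales with catch.
# Station coordinates and counts are made up.
library(ggplot2)

stations <- data.frame(
  lon = c(-121.9, -121.8, -121.7),
  lat = c(38.05, 38.10, 38.02),
  catch = c(12, 150, 40)
)

ggplot(stations, aes(x = lon, y = lat, size = catch)) +
  geom_point(alpha = 0.6) +
  labs(title = "Hypothetical catch by station", size = "Catch")
```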

But we needed better ways to display data from multiple surveys at once at the click of a button. The website Bay Delta Live was launched in 2007 as a home for Bay-Delta data and data visualizations. It includes summaries, graphs, and interactive visualizations for water quality, operations, fish monitoring, and special studies.

A similar website, SacPas, was built specifically for synthesizing, summarizing, and displaying data for salmonids in the Central Valley. It allows a user to visualize data on salmon abundance, temperature thresholds, river conditions, and hydrologic conditions. It also lets you play with a nifty Chinook Salmon population model and download all the underlying data.

Three fish look at a map. The tule perch says 'This app lets you see how much flow you need to get different amounts of salmon habitat.' The splittail says 'This is great! Where is the splittail habitat app?'
Figure 2. FlowWest's Central Valley Instream Rearing Habitat Calculator Shiny app

Custom-built websites like Bay Delta Live and SacPas are great, but they are built by web developers, not fisheries scientists. Now, thanks to user-friendly data display tools such as Tableau and the increase in coding literacy among environmental scientists, more and more people can create their own online data visualizations. This means the number of data visualization apps has grown astronomically in the past few years, and many apps are custom-built for specific scientific questions.

The Delta Science Program now hosts a number of these visualizations built with the R package “shiny”.

Three fish look at a map. The tule perch says 'This app lets you make maps of all the IEP fish sampling stations.' The striped bass says 'Oh, good, now I know all the places I should avoid.'
Figure 3. You can now map all the stations monitored by IEP's long-term surveys.

Other Shiny apps have recently launched on a variety of platforms:

Three fish look at a graph of salmon survival. The tule perch says 'You can use the STARS model to look at survival probabilities.' The splittail says 'I'm glad I don't have to migrate through the Delta.'
Figure 4. CalFishTrack includes a Shiny app of their Survival Travel time And Routing Simulation (STARS).

USGS has developed several new dashboards for mapping water and water quality data.

With all these tools out there, it’s one big data playground! If you’re interested in making your own, it’s easy to get started with Shiny. Visit the Learn Shiny video tutorial!
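To give a feel for how little code a basic app takes, here is a minimal, self-contained Shiny sketch; R's built-in faithful dataset stands in for real monitoring data:

```r
# A minimal Shiny app: a slider controls the number of histogram bins.
# The built-in 'faithful' dataset stands in for real monitoring data.
library(shiny)

ui <- fluidPage(
  titlePanel("Hypothetical data explorer"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 25),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins,
         main = "Eruption durations", xlab = "Minutes")
  })
}

shinyApp(ui, server)
```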
