| tags: [ open data open science reproducibility ecology evolution ] categories: [reading ]

Culina et al. 2018 Navigating the unfolding open data landscape in ecology and evolution

The open data movement has the capacity to provide new and powerful insights into complex systems in ecology and evolution. However, uptake and implementation of open data in ecology and evolution have lagged behind other disciplines (e.g. medicine, climate sciences).

Why open data?

  • identify broader eco evo processes across space, time, species
  • reanalysing data using new statistical approaches
  • error checking
  • using existing data to answer new questions
  • era of the Anthropocene: large, complicated questions with a high degree of uncertainty require combined data from multiple sources, and multidisciplinary data synthesis.

Aim of paper: to provide ecologists and evolutionary biologists with tools to navigate this emerging open data landscape, so that we may increase the use of open data and facilitate robust and comprehensive analysis and inference.

State of the open data landscape

Data are fragmented. They exist in a multitude of locations: in the supplementary materials of the papers they accompany, on personal websites, or in a data repository. There are many data repositories, and there is a registry of them, re3data.org. But there is no unified system for searching all of these repositories and data sources, so locating appropriate data is extremely difficult.

Long tail of science.

Dispersed scientific research that is conducted by many individual researchers/teams, and is often of a limited spatial and temporal scale. Data produced in the long tail tend to be small in volume, and less standardized within the same field of study. The majority of scientific funding is spent on this type of research.

I believe the above “long tail of science” compounded by a lack of unified data infrastructure to be a significant barrier to the advancement of ecology and conservation, and perhaps even science more broadly.

  • Impediment to meta-research: existing evidence and knowledge cannot be validated through reproductions and replications.
  • Lost opportunities - unable to answer novel questions and test new hypotheses, perhaps at new scales
  • Wasted money - are people paying for new data when existing data could do the job?
  • Impediment to longitudinal studies or retrospective evaluations
  • Source of uncertainty in models for applied contexts - the data’s out there we just can’t find it, or get our hands on it, or it’s in the wrong format!

Transitioning to Open Science

Citation

  • Need to give research objects such as datasets the same value and status as a journal paper: “first-class research objects”
  • Give your data a DOI so people can cite it!

Misinterpretation and potential biases

drumroll… METADATA!!!

Even with good descriptions of a dataset (how, where and when the data were collected, and how they were processed and analysed), a lot of ecological datasets lack the complete information needed for a full understanding of what the data describe.

Why? Reuse is not in the authors’ minds, really! I’ll add that it takes TIME to properly describe your data, as well as familiarity with, and access to, a good set of tools and workflows for doing so. Moreover, we lack a standardised method of describing our data.
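To make the "what should a description contain?" question concrete, here is a minimal sketch of a metadata record with a completeness check. It is not a standard and not something the paper proposes: the class `DatasetMetadata`, its field names, the helper `missing_fields` and the example values are all made up for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DatasetMetadata:
    """Hypothetical minimal metadata record; field names are illustrative only."""
    title: str = ""
    collection_method: str = ""   # how the data were collected
    location: str = ""            # where they were collected
    collection_period: str = ""   # when they were collected
    processing_steps: list = field(default_factory=list)  # how they were processed/analysed
    contact: str = ""             # who to ask about study-system subtleties


def missing_fields(record):
    """Return the names of fields left empty, in declaration order."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Made-up example record with only three fields filled in.
record = DatasetMetadata(
    title="Example nest-box survey",
    collection_method="weekly nest-box checks",
    location="(site name)",
)
print(missing_fields(record))
```

Even a checklist this small would flag the "when", "how processed" and "who to contact" gaps before a dataset is deposited.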

Another issue the paper raises is that the subtleties of the study system are difficult or impossible to describe in metadata. The paper suggests contacting the authors to overcome this issue.

Another important point of consideration is that: “working with a large amount of data requires careful consideration of the possible biases, statistical issues and inferences that can be drawn when using these data.”

For example, one recent study [32] identified multidimensional biases, gaps and uncertainties in global plant occurrence information data in the GBIF database, while another work [33] examined spatial biases in collected data sets used in two different meta-analyses that (wrongly) concluded that there was no net loss of biodiversity due to anthropogenic disturbances.

The future of open data in EcoEvo

“However, the major historical drawbacks of ecological research (the challenge to standardize, validate and generalize findings) often limit the relevance of ecological findings for most urgent societal and scientific needs.” Page 5

So in the era of the Anthropocene, the transition to open data in EcoEvo is one of pressing concern. (Can put EcoEvo decision spin on this here).

Essentially the open data landscape in EcoEvo is in its infancy.

## Future direction:

  • Need uptake by the EcoEvo research community: as the resources are increasingly adopted, this should provide the impetus for improvements.
  • Increase the reuse of open data. The paper argues this can be done by raising awareness a) that EcoEvo open data repositories exist, and b) of how to find / where to look for open data.

I think this is an important first step. More needs to be done in terms of unified infrastructure, methods, tools for collecting, locating, accessing and synthesizing open data.

Ontologies: The paper also created its own ontology for “describing data sources that contain or refer to datasets relevant to the EcoEvo”. It “makes a distinction between the data source and the collection of EcoEvo datasets that the data source contains” … “enabling description and identification of data sources.” So in the sense of Madin et al [@Madin:2008jv], the paper’s ontology is a “framework ontology”: an ontology that links multiple domains and data sources, each of which might have its own ontology for its domain of interest.
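The data source vs. dataset-collection distinction can be sketched as a small data structure. This is only my reading of the idea, not the paper's actual ontology: the classes `Dataset` and `DataSource`, the helper `sources_containing`, and the example entries are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A single dataset held somewhere (hypothetical fields)."""
    title: str
    keywords: list = field(default_factory=list)


@dataclass
class DataSource:
    """A place that contains or refers to datasets, described separately from them."""
    name: str
    kind: str  # e.g. "repository", "journal supplement", "personal website"
    datasets: list = field(default_factory=list)


def sources_containing(sources, keyword):
    """Names of sources holding at least one dataset tagged with the keyword."""
    return [s.name for s in sources
            if any(keyword in d.keywords for d in s.datasets)]


# Made-up sources and datasets, for illustration only.
sources = [
    DataSource("Repo A", "repository",
               [Dataset("Bird counts", ["birds", "abundance"])]),
    DataSource("Lab website B", "personal website",
               [Dataset("Plant traits", ["plants", "traits"])]),
    DataSource("Supplement C", "journal supplement",
               [Dataset("Bird phenology", ["birds", "phenology"])]),
]
print(sources_containing(sources, "birds"))
```

Describing the source separately from its datasets is what lets you ask "where should I look?" before asking "which dataset do I want?".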