| tags: [ modelling open science transparency computational reproducibility reproducibility QRP ] categories: [synthesis ]

Choose your own adventure: researcher degrees of freedom and questionable research practices in ecological modelling for decision support

Abstract: Initial meta-science research in ecology and evolution suggests that the discipline is not immune to the sorts of reproducibility issues highlighted in other scientific disciplines, such as Psychology and Medicine. A recent study has revealed rates of self-reported Questionable Research Practices (QRPs) in ecology and evolution comparable to other disciplines. QRPs include practices such as cherry-picking, p-hacking and hypothesising after results are known (HARKing), and result in low rates of reproducibility, contributing to biased accounts of the subject domain in the body of literature.

While nearly all QRP research is focussed on Null Hypothesis Significance Testing, in applied ecology and conservation decision making we often utilise non-frequentist statistical analyses and analytic approaches outside of the classical hypothetico-deductive model of science. The risk of QRPs in ecological modelling is arguably heightened due to the sheer number of ‘researcher degrees of freedom’ inherent in these approaches: researchers make numerous choices during model development, fitting, calibration, validation and reporting. Moreover, these degrees of freedom are typically undocumented due to institutional and editorial disincentives for full and transparent model reporting practices.

In this talk I expand the concept of QRPs to ecological modelling for decision support. I present a summary of the group discussions held at the QAECO/CEBRA lab retreat, where we brainstormed QRPs in four areas of applied ecological research: Bayesian statistics, multiple models and model selection, species distribution modelling, and field data collection. Finally, I give an overview of work in progress to develop a survey aimed at measuring the prevalence and potential impact of QRPs in ecological and conservation decision support.

This post represents a written version of the content of the presentation, which can be found here: INSERT LINK


  1. Repro crisis in ecology
  2. HF’s work, QRPs in ecology.
  3. But, QRP work and broader repro research is focussed almost exclusively on frequentist statistics and a hypothetico-deductive model of science.
  4. In ecology, especially in applied ecology and conservation, we often employ non-frequentist methods: Bayesian statistics, decision theoretic approaches, etc. etc.
  5. Given the prevalence of NHST QRPs (Fraser), and the inherent researcher degrees of freedom involved in developing complex models in ecology, it is likely that there is a QRP problem in the work we do too. (Give a little choose-your-own-adventure spiel to tie into the title.) 5.a. What are the consequences / implications for irreproducibility of decisions?
  6. However, the very definition of QRPs has been developed in an NHST framework. I overview a proposed/working expansion of the concept of QRPs to non-frequentist and non-NHST methods, and present the findings from the QAECO retreat where we generated lists of QRPs for four areas / methods in ecology and conservation.
  7. I then discuss next steps for investigating the prevalence and impact of QRPs across the discipline.
  8. If we have time we might engage in a voting exercise.

Choose your own adventure: researcher degrees of freedom and questionable research practices in ecological modelling for decision support

Motives / aims of this talk: feedback.

Preamble:

Present some preliminary work from the first chapter of my PhD. Hope to test out some ideas, so I welcome any feedback. Invite people to save their questions for a brief discussion at the end. Some content might be familiar, but I’ll try and cover the background material. In particular seeking feedback around the design of the survey, as well as the model of QRPs.

Talk Structure:

  1. Problem Background - 5 mins
  2. Report Results, and what we did - 5 mins
  3. What’s next - QRP study - 5 mins
  4. Discussion - 15 mins

The presentation for this post can be found [here](/blog/html/QAECO-seminar/index.html). (LINK IS NOT WORKING... to update!)

Researcher degrees of freedom

What’s special about ecological modelling, and why does it warrant investigation of QRPs? The number of decision points, and the choice of analytical methods and statistics at each point. What is the relevance of this, and why is it important? That is, what is the relationship of QRPs to the reproducibility / replicability of ecological models? Just like a choose-your-own-adventure book, the more decision points there are, the greater the likelihood of arriving at a different story should the procedure be replicated. There is the potential for the system to be ‘gamed’, either wittingly or unwittingly.

Arguably such approaches are subject to greater risk of questionable research practices because of the sheer number of ‘researcher degrees of freedom’ involved in ecological modelling: from data collection to the reporting of results, researchers make numerous choices during model development, fitting, calibration, validation and reporting. Moreover, the risk of QRP occurrence is potentially amplified by institutional and editorial disincentives to thoroughly and transparently document the full model development process. These decision points, along with the justifications and assumptions behind the choices, remain undocumented and hidden from scrutiny – much like a “Choose your own adventure” book where the choice has been made for you: if not everything is reported, then you are potentially getting a different story.
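
As a rough illustration of how quickly these degrees of freedom compound, here is a minimal R sketch; the number of defensible choices at each step is purely hypothetical and chosen only for illustration.

# Minimal sketch with hypothetical numbers: if each step of a modelling
# workflow offers a handful of defensible choices, the number of distinct
# analysis paths is their product.
choices <- c(
  data_cleaning   = 3,  # e.g. alternative outlier-handling rules
  predictor_set   = 4,  # candidate covariate sets
  model_structure = 3,  # e.g. link functions or random-effect structures
  prior_choice    = 3,  # alternative prior specifications
  validation      = 2   # alternative hold-out or cross-validation schemes
)
prod(choices)  # 216 distinct 'adventures' from the same data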

Transparency and Model Reporting Practices - what are the issues?

What exactly are the transparency problems for environmental models, as opposed to other NHST-based research? With an emphasis on models used to inform management and policy decisions…

I propose that the set of conditions fostering QRPs is somewhat particular to environmental models, due in part to their inherent properties (greater complexity and a larger number of analytical steps result in greater researcher degrees of freedom), and in part to clear institutional disincentives to fully transparent model documentation and reporting practices.

These operate on top of the general set of conditions that apply to empirical research more broadly. Environmental modelling is arguably more fraught and prone to poor reporting practice because of the myriad modelling choices that remain undocumented.

“The general thesis of these papers is very simple: the inherent flexibility of empirical research (e.g., sample size, operationalizations, controls, data management) provides a near limitless number of opportunities for the data to support a researcher’s prior theory—or any story, for that matter.” (Jebb, Parrigon, and Woo 2017, p. 268)

Issues

  • Publication bias: modelling papers are focused on scientific novelty rather than on model documentation (de Vos, Alexandrov 2011; Schmolke 2010).
  • No consensus on standards or agreed-upon reporting of environmental models (de Vos).
  • Daisy-chaining: parts of the model are documented separately in papers that cross-reference each other (de Vos).

Barriers to transparent reporting practices

Problem Background

  • repro crisis and meta-science research in ecology
  • outline thesis goals and chapters - then zoom in on the QRP one.
  • Outline QRP research… grey areas; there is disagreement because it is inherently uncertain whether the practices are okay or not.
  • Researcher degrees of freedom: what are they, and why are they important in ecological modelling? Modelling approaches have many components to them! Ambiguity, cognitive fallacies and human biases (SEM paper?). This leads on nicely to Jian and Ralph’s recent paper: “There is no universally agreed-upon measure for the goodness-of-fit of a model to data. The amount of variance (or deviance) that needs to be explained or predicted to make a model ‘good’ is subjective, and depends on the ecological question and the intended use of the model.” Mac Nally et al. (2018). These are exactly the conditions that the ambiguity notion describes!!
  • Defining QRPs - trust and belief, given assumptions (that paper?!?)

Failing to provide all analyses undertaken prevents readers from making proper evaluations of a model:

“Such measures allow editors, reviewers and ultimately readers to make their own judgements of: (1) whether the nominal best model is a meaningful fit to the data; and (2) model inferences, especially on the relative importance of predictor variables.” Mac Nally et al. (2018).

Those other papers looking at problems of model documentation and model evaluation reporting in ecology.

A quick tour of QRP Research

Expanding QRP research to non-NHST research

Problems

I have some reservations about this definition, chiefly:

  1. Models are necessarily abstractions: simplifications of systems and of shared theories or expectations. The notion of a true model arguably doesn’t make sense (King 1991, De Vos 2011): “all models are wrong, but some are useful”. So how do we accept this understanding and still put forth a definition / conceptual model of the influence of QRPs, i.e. conceptualise the effect or influence of QRPs on a model? We need some baseline representing the model and its outputs as if there were no impact of the QRP.
  2. What about in a Bayesian framework? The idea of a ‘true’ model conflicts with the philosophical underpinnings of Bayesian probability theory: is there even any notion of a ‘true’ model?
  3. Parsimony: prediction vs. fit, and model purpose. What do we care about? This definition is arguably centred around modelling in a predictive capacity, but what about explanatory models (where we care less about the outputs of the model and more about its form and fit)? We’ve taken ‘signal’ in the above definition to be a signal in the outputs. Perhaps this focus is okay, given that predictive models are often the backbone of formal decision support systems.

QAECO Retreat Discussions

Reminder: what we did

Results: here’s the lists

Give the data collection list in the interests of everyone at the lab, but otherwise don’t focus on it. I just want to think about modelling for now (though some of the QRPs from the other areas could feed into this; after all, there is an interplay between data collection and analysis that is often itself a problematic research practice).

What I did with the results: I then consulted the literature to build upon the lists for each area, and mapped them onto modelling or analysis workflows identified in the literature (this is a work in progress). The objective was to identify all the points in the analysis where the same class of QRP can occur.

Summarising, what are the key themes from this?

For each of the four groups, I mapped the QAECO-generated QRPs onto an overarching workflow representing discrete steps in the modelling process. I then searched the literature to add further practices to this workflow. Finally, I attempted to code each of the QRPs according to the broad classes of QRPs already identified for NHST research:

  1. Cherry-picking
  2. HARKing
  3. Test-statistic hacking
  4. Failing to report assumption violations or methodological flaws that alter or invalidate conclusions
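
The two workflow sketches below (drawn with the DiagrammeR package) are examples of the overarching workflows used in this mapping exercise: a Bayesian model-fitting workflow and a model selection workflow. Each node is a step at which one or more of the QRP classes above can occur.
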
suppressPackageStartupMessages(library(DiagrammeR))

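# Bayesian workflow sketch: data collection, model development, MCMC convergence
# checks, model checking, model comparison and robustness analysis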
DiagrammeR("graph LR
A(Data Collection)-->B(Model<br />development)
B--> C(Test for convergence<br />of MCMC chains)
C--> |Fail | B
C--> |Pass | D(Model<br />Checking)
D--> |Fail | B
D--> |Pass | E(Model<br />Comparison)
E--> F(Robustness<br />Analysis)
        ")
DiagrammeR("graph LR
subgraph Hypothesis and Model Generation
A(Generate candidate<br />biological hypotheses<br />as candidate models)-->
B(Translate all models<br />into mathematical models)
end
subgraph Model Fitting and Evaluation
B-->C(Fit Global Model)
C-->
E{Goodness of fit <br />test for global model}
E-->|Pass|F(Fit models to data)
end
D(Data Collection)-->C
subgraph Model Selection
F-->G(Calculate model selection criterion,<br />compute scores for each model)
end
           ")

Examining each of the research practices reveals that the same broad classes of QRPs applicable to NHST research still hold for non-NHST research. The primary difference in terms of the classes is that practices falling under the banner of p-hacking are applicable to other test statistics.

  1. Each of these issues is able to be distilled into the same class of QRPs common to NHST research (Cherry picking, HARKing, p-hacking – except with other test-statistics etc).
  2. It’s just that the particular form they take is unique to the type of method under consideration (e.g. SPARKing), and moreover, that the same general QRP can occur at multiple points in a modelling / analytical workflow, because there are many sub-components. (Demonstrate using the workflows)

Key Take-aways of the workflow mapping exercise:

  1. Researcher degrees of freedom - there are many, many choices and tests that a modeller must make and undertake throughout this process: when to stop collecting data, what prior to select, whether there was convergence when fitting the model, what model checking or robustness tests should be undertaken, and what thresholds should be used.
  2. QRPs can occur multiple times in the entire process, at each of these decision points
  3. The same class of QRP can also occur at multiple points in the modelling process

Discussion of key themes and summary of issues:

Does reporting remedy the issue? Looking at the reasoning about the questionability of the practices, it seems that most people consider a practice not to be questionable if it is reported. (Give example from the list.) This sentiment is mirrored in the published literature, with calls for QRPs to be re-framed as questionable reporting practices (INSERT PAPER). However, arguably reporting doesn’t rectify the issue for all practices. This is arguably the case for model dredging (especially in a predictive or confirmatory inference application), and potentially also for selective debugging (selective analysis, undertaken only when results are interesting or there appears to be a big problem).

Ralph Mac Nally (2018): AUC is susceptible to poor choices of model formulation… (Johnson et al. 2014).

Is failing to undertake particular analyses, such as checking the influence of priors, model checking or other forms of model evaluation, a questionable practice, or just ‘bad’ practice? (A paper reviewing Bayesian model checking found that most people don’t check their Bayesian models.) Arguably, failing to undertake, or to give thorough treatment to, proper model evaluation gives a false impression of the quality of the model. There is also the issue of selective actions: checking only if the result is interesting or strong, or only if it deviates from expectation. Where do we draw the line in defining the boundaries of QRPs?
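
To make “checking the influence of priors” concrete, here is a minimal sketch of a prior sensitivity check using a conjugate Beta-Binomial example; the data and priors are hypothetical and purely illustrative.

# Hypothetical example: 7 detections out of 20 surveys; refit under several
# priors and compare posterior means rather than reporting only one.
y <- 7; n <- 20
priors <- list(flat = c(1, 1), weakly_informative = c(2, 2), sceptical = c(1, 9))
sapply(priors, function(p) {
  a <- p[1] + y            # posterior Beta shape parameters (conjugate update)
  b <- p[2] + n - y
  a / (a + b)              # posterior mean detection probability
})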


Model or data dredging practices are not inherently questionable; they can yield meaningful insight.

This leads to the next issue: some practices are questionable only under particular types of inference, so the purpose of the model is an important moderator of the questionability of these practices. Give an example from the list.

Inference: Confirmatory vs. Exploratory Analysis -> Model Purpose: Explanatory vs. Predictive Models

Model Purpose: Shmueli (2010), explanatory vs. predictive modelling.

Problem occurs when we cast exploratory work as confirmatory: “Confusion might arise because the same behavior can be either EDA or p-hacking depending on how it is reported.” Jebb et al. “Thus, “EDA as CDA” is harmful because the results are given an inflated degree of credibility; it breaks the cardinal rule that a result cannot be discovered and validated using the same data.” Jebb, A. T., Parrigon, S., Woo, S. E. (2017) Exploratory data analysis as a foundation of inductive research. Human Resource Management Review. 27, 265–276.

“Current practices with many journals that all but required a deductive approach have led to authors positioning papers as deductive even when the underlying research was not. This masking of inductive and abductive research is at the heart of the debate on questionable research practices.” Woo, S. E., O’Boyle, E. H., Spector, P. E. (2017) Best practices in developing, conducting, and evaluating inductive research. Human Resource Management Review. 27, 255–264.

Then see the discussion on inductive and abductive inference, and EDA / CDA.


Bayesianism is NOT a remedy for QRPs… as has been argued in reproducibility literature! - cite example saying Bayesianism is a remedy - then point to the workflow model here - then cite evidence about invalidation

Designing the Study

  • GOALS
  • QUESTIONS

Measuring prevalence: Fiedler and Schwarz (2015) identify two components of prevalence.
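
A minimal sketch of how these two components might be separated in survey data; the responses are simulated and the decomposition (ever engaged vs. how often) is my reading of Fiedler and Schwarz, not a prescribed analysis.

# Simulated self-report data: has a respondent ever engaged in the practice,
# and if so, how many times (hypothetical numbers).
set.seed(1)
ever  <- rbinom(200, size = 1, prob = 0.3)             # 1 = has used the practice at least once
times <- ifelse(ever == 1, rpois(200, lambda = 2), 0)  # self-reported frequency of use
mean(ever)               # component 1: proportion of researchers who have ever done it
mean(times[ever == 1])   # component 2: how often it occurs among those who have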


Review:

You’re not even aware of when you’re doing it or not - so how do you measure it? That is the same for non-NHST and NHST work. –> Show the actual questions that were used in Hannah’s survey and what kind of question led to those bars on the graph; this helps with imagining how to get from general issues to a question.

Don’t get bogged down in debate about whether you disclose or not. Undisclosed flexibility -> makes it more straightforward. Interesting question, but it zooms in on the scope of the chapter.

Eliciting the impact isn’t going to work… too soon. People don’t believe they have an impact or cause a Type I error.

Option to tick “yes I did it, but I think it was justified when I did it”, from Hannah’s research.

–> When giving the Brisbane and HPS talks: Manifesto for Reproducibility –> juxtapose my Bayesian workflow against the frequentist one; this is what we know about frequentist QRPs and how they influence results.

Confusion around prevalence and impact (the chance that there is a Type I error), but then there’s the value judgement of intent: how bad is it, given the consequence and impact?

Eliciting quantitative estimates: people think there is no impact. We don’t yet have a quantitative benchmark. Maybe there’s one QRP that we can benchmark against.

Threatened Species: Deciding whether to list or delist, might be a good case study. Threatened Species Committee.

How reproducible? National Energy Guarantee. Experiment.

How bad is it?

HPS: What’s the equivalent of a false positive? It can be a range of things. Sometimes there is an immediate translation: wherever there is a threshold, it translates. Sometimes there is no equivalent.

Model fit statistics are not as wedded to publication bias. Sometimes the translation is really obvious (just take out 0.05 and substitute). Then show the diagram demonstrating the drive for certainty.

Working definition -> doesn’t capture ALL the things that are QRPs. It doesn’t capture HARKing… there’s an extra step of perceptions or belief about the model.

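# Research cycle sketch: hypothesis generation through to publication and the next experiment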
DiagrammeR::DiagrammeR("graph LR
A(Generate &<br /> Specify Hypotheses)-->B(Design Study)
B--> C(Collect Data)
C-->D(Analyse Data &<br /> Test Hypotheses)
D-->E(Interpret Data)
E--> F(Publish or Conduct<br /> Next Experiment)
F-->A
        ")

Fiedler, Klaus, and Norbert Schwarz. 2015. “Questionable Research Practices Revisited.” Social Psychological and Personality Science 7 (1): 45–52. https://doi.org/10.1177/1948550615612150.

Jebb, Andrew T, Scott Parrigon, and Sang Eun Woo. 2017. “Exploratory data analysis as a foundation of inductive research.” Human Resource Management Review 27 (2). Elsevier Inc.: 265–76. https://doi.org/10.1016/j.hrmr.2016.08.003.