Summary and purpose of the workshop:
We ran a workshop of small-group discussions to identify questionable research practices (QRPs) that arise in ecological and conservation research using non-frequentist and/or non-hypothetico-deductive inquiry.
The purpose of the workshop was two-fold:
- To raise awareness within our research group of QRPs relevant to the types of work the group does, and to consider how reproducibility issues might affect us, even where reproducibility research seems irrelevant.
- To inform future work where we plan to survey the broader ecology community to measure the prevalence of non-frequentist QRPs. The goal was to brainstorm as many QRPs as possible.
Hannah Fraser presented a summary of her recent research on the prevalence of questionable research practices in ecology. I proposed a working definition of QRPs for non-frequentist / non-hypothetico-deductive research, as these methodological frameworks are common in applied ecology and conservation decision-making.
- Gave the proposed definition to participants and asked them to refer back to it during their discussions and when listing QRPs.
- Supplied a list of QRP examples from null hypothesis significance testing (NHST), so that participants could consider direct analogues for statistical thresholds other than p-values.
Participants divided into groups according to their research expertise and interests. The four discussion groups were:
- Bayesian statistics
- Species Distribution Modelling
- Multiple models: model dredging and model selection
- Field study design and data collection
We asked participants in each group to do the following:
- List as many QRPs applicable to your group as possible.
- Describe each QRP.
- Provide a reason why those practices might or might not be questionable. If the questionability of a practice is context-dependent, explain why and how it is questionable.
- Don’t focus on trying to reach consensus among the group; note down the point of disagreement and move on.
Each group was assigned a facilitator.
    library(tidyverse)
    library(kableExtra)

    dat <- readr::read_csv("../../public/files/data/retreat_qaeco_qrps.csv")

    # One row per QRP, one reasoning column per value of `Questionable`
    qrp_wide <- dat %>%
      tidyr::spread(key = Questionable, value = Reasoning) %>%
      dplyr::arrange(Group, QRP)

    # First and last table row of each group, for checking the ranges below
    first_last <- qrp_wide %>%
      dplyr::mutate(num = dplyr::row_number()) %>%
      dplyr::group_by(Group) %>%
      dplyr::summarise(first = first(num), last = last(num))

    # In the sorted table the SDM rows (19-27) come before the
    # study design rows (28-40), so the labels follow that order
    qrp_wide %>%
      dplyr::select(-Group) %>%
      kable() %>%
      group_rows("Bayesian", 1, 11) %>%
      group_rows("Multiple models", 12, 18) %>%
      group_rows("SDM", 19, 27) %>%
      group_rows("Study Design and Data Collection", 28, 40)
|Bayesian|
|Failing to report influence of prior.||Must be reported, otherwise is QRP.||NA||NA|
|HARKing||NA||NA||This is still an issue with Bayesian methods.|
|Interpreting credible intervals in NHST framework||Potential p-hacking issues.||NA||NA|
|Large computational burden might dissuade full / thorough analysis / checking of results if an interesting result appears.||NA||NA||NA|
|MCMC convergence - is it checked thoroughly, is a wide range of initial values used?||NA||NA||NA|
|Model selection: failing to report all models.||NA||NA||Equivalent to NHST issue of failing to report all covars etc.|
|Priors, MCMC convergence: will debug if results are unexpected, but might not if results are “expected” or “exciting”.||NA||NA||NA|
|Selection of prior: are priors measured on the same scale / units of likelihood? If the result is interesting, you might keep the result.||NA||NA||NA|
|Selection of prior: checking influence.||Is it often done? Is it reported?, QRP if not reported.||NA||NA|
|Selection of prior: weighting.||NA||NA||Questions on how to weight if from another location for inference. Post-hoc rationalisation of weighting is a QRP.|
|Use of a model as source of priors||NA||Consensus that not questionable.||NA|
|Multiple models|
|Absence of well specified a priori hypotheses: “let’s test this too!”||Inference vs. prediction, or both?||NA||NA|
|Combining categories of an independent variable: regrouping post collecting||Done if not enough data per category. Bad if the recategorisation is done to impose fit.||NA||NA|
|Dredging across many models but only reporting a subset||OK if put in supplementary materials.||NA||NA|
|Post-hoc change of random effects: removing extra random effects after looking at model results.||BUT, if you report that, is it still a QRP if you still remove the random effects?||NA||If study has a nested structure, then that is your model!!|
|Post-hoc variation additions: “why not collect this too”||Is it still bad if you do it before looking at your data?||NA||NA|
|Shape testing: univariate GAMs for variable shape||What is an alternative?||NA||NA|
|Univariate to start: exploring single variable models to choose which rain data.||Maybe necessary if computationally complex||NA||NA|
|SDM|
|AUC rate hacking||NA||NA||similar to p-hacking|
|checking / changing map based on decision-maker expert opinion||NA||You don’t want a crap map, why not use expert opinion?||NA|
|Cherry picking case studies||Just don’t pretend it was random chance||isn’t this just science?||confirmation bias|
|Cherry picking which papers / maps you publish||NA||NA||NA|
|Fitting everything available||Depends on what your aim is? Do you want a good map, or do you care about what’s driving the distribution? [is this the same binary as the prediction vs. explanation inference binary?]||Necessary?||Unthinking, inflates the chance of finding a significant model|
|HARKing: narrative due to sexy variable importance. Adjusting which values you use based on results.||NA||NA||e.g. throwing out old records or trying different lab study values in mechanistic modelling|
|Overfitting models to improve ability to cross-validate||We’re always using a subset of our data because we have too little data to hold some out entirely.||NA||NA|
|Partial covariate reporting||NA||NA||NA|
|Using inappropriately scaled data||NA||But could send results either way, so is it a QRP?||not made for scale of your point data|
|Study Design and Data Collection|
|Cherry picking data to use||NA||NA||NA|
|Choosing which list to make species sound more threatened||NA||NA||NA|
|Design based on costs / external factors [instead of??]||NA||NA||NA|
|Falsely claiming adaptive design / management||NA||NA||NA|
|Filling in missing data with best guess||NA||NA||NA|
|Non-random site selection (look for species, choose strongest example)||NA||NA||NA|
|Not reporting / looking into limitations of data appropriateness||NA||NA||NA|
|Not reporting data conversion (e.g. changing conditions to discrete, ambiguous condensing / collapsing, retrofitting response groups)||NA||NA||NA|
|Not reporting exceptions||NA||NA||NA|
|Not reporting who surveyed what||NA||NA||Potential for the identity of the surveyor / data recorder to contribute to noise.|
|Presenting / cherry picking best example (e.g. sites)||NA||NA||NA|
|shifting sites for an outcome or need (but not stating)||NA||NA||NA|
|Simplifying methods [for write up], omitting modifications [minor].||NA||NA||NA|
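The group labels in the table above are attached using hard-coded row ranges, which can silently drift if rows are added or the sort order changes. A minimal sketch of deriving the ranges from the data instead is given here. The `toy` data frame and its contents are hypothetical stand-ins (only the column names follow the chunk above), and `pivot_wider()` is used as the current tidyr equivalent of `spread()`.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the workshop CSV: same column names, made-up rows
toy <- tibble::tribble(
  ~Group,         ~QRP,           ~Questionable, ~Reasoning,
  "Bayesian",     "HARKing",      "Yes",         "reason a",
  "Bayesian",     "Prior choice", "Depends",     "reason b",
  "SDM",          "AUC hacking",  "Yes",         "reason c",
  "SDM",          "AUC hacking",  "No",          "reason d",
  "Study design", "Site choice",  "Yes",         "reason e"
)

# One row per QRP, one reasoning column per value of `Questionable`
toy_wide <- toy %>%
  tidyr::pivot_wider(names_from = Questionable, values_from = Reasoning) %>%
  dplyr::arrange(Group, QRP)

# Row ranges per group, computed on the table as it will be printed
group_ranges <- toy_wide %>%
  dplyr::mutate(row = dplyr::row_number()) %>%
  dplyr::group_by(Group) %>%
  dplyr::summarise(first = min(row), last = max(row))

print(group_ranges)
```

Each row of `group_ranges` could then be passed to `group_rows()` in place of the literal start and end rows. kableExtra's `pack_rows()` also accepts an `index` argument of named row counts, which may be a simpler route.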
Summaries / key points and other information
Multiple Model selection:
Key issues were the absence of well-specified a priori hypotheses / dredging, and post-hoc changing of random effects.
Proposed solution: if you report it, is it OK? You must be explicit about what you did and how you made your decisions.
Study design and data collection:
- Seeking strongest signal
- Confirmation bias
- Inappropriate use of data
- Not reporting methodological decisions / realities
Species Distribution Modelling (SDM):
“A lot of the issues with mechanistic modelling apply to PVAs.”
The top 3 QRPs were:
- Fitting everything available,
- cherry picking case studies,
- and HARKing the narrative due to sexy variable importance.
I wonder whether the PVA reproducibility paper might have some QRP content we can bring over into the SDM domain?
Bayesian statistics:
- “Will debug if results are unexpected. Might not if results are ‘expected’ or ‘exciting’. Relevant to choice of priors and MCMC convergence.”
- “Checking the influence of priors: failing to report, and not doing at all.”
- “Weighting: post-hoc rationalisation, and how to weight if from other locations, or sources, for inference.”
Designing the survey for the QAECO group, and also for the paper:
Exercise two was abandoned, but the design of the next exercise will be an ongoing task for me over the next week or so.
PV: Need to give the voting exercise a bit more thought: I want to capture the multi-factorial aspects in the voting and not lose those finer nuances of information.
Fiona: we need to separate impact from commonality. That is, some QRPs might have a low impact in terms of their severity in causing a type I error, but be very common (an easy target in terms of trying to change research culture, low-hanging fruit?). I guess the cause for concern would be if we identify relatively common practices that some weight as severe.
The other aspect that we need to be careful about considering in the voting exercise is the particular context that shapes the questionable nature of the practice. For example, if some practices are only questionable
Later chapters: replication case study
DD suggested a case study for testing the problem of replicating a DSS, specifically in an expert elicitation setting.
Tasks to follow up:
- Jane and Andrew each had a few ideas. Fiona would like to be present for both of those discussions.
- Human Ethics Sub-committee (HESC) meeting submission deadlines: 24 August and 21 September. I can’t see when the HEAG (Human Ethics Advisory Group; first port of call) submission deadlines are (access problem). Hannah, can I please see your Human Ethics application for the survey, as a guide?