| tags: [ accuracy false positive transparency confidence interval p-value practice QRP ] categories: [synthesis ]

Bayesianism and questionable research practices

  • What are the different model types / applications used in ecology / conservation?
  • what is the process for deriving, evaluating, and reporting these models?
  • At what point in the process might QRPs arise?
  • Can the same QRP arise at different points during the analysis?

Next actions

  • read the stopping rules paper I just imported into papers
  • add prior selection / weighting and other qaeco retreat points to the table below
  • can we generalise overarching workflows across methods? Can we generalise QRPS across different points in a workflow?

Types of Bayesian methods / analyses

  • Bayesian Hierarchical Models: further sub-types used in ecology include spatial regression, N-mixture models, and Hidden Markov Models (Conn et al. 2018)
  • Bayesian estimation (Simonsohn 2014) is this used in ecology?
  • Bayes factors (Simonsohn 2014) is this used in ecology?
  • Bayesian P values (and other metrics used in model checking) (Conn et al. 2018)
read_csv("../../static/files/data/Bayesian_qrps.csv") %>% knitr::kable()
Method QRP Source Applicable to Frequentist Reason is QRP Other comment
Bayesian estimation P-hacking: inc data-peeking, dropping conditions or covariates, dropping outliers (Simonsohn 2014) Y inflates risk of type I error to the same magnitude as frequentist methods NA
Bayesian P values (e.g. used in model checking) Could theoretically be prone to P-hacking See discussion in (Conn et al. 2018) Y As above Conn et al demonstrated that posterior predictive P values used in model checking (as well as other metrices) “can have a larger than nominal value, so that our ability to”reject" the null hypothesis that data arose from the model is over-stated." In asecond case study (spatial regression), “the Bayesian P value often failed to reject models without spatial structure even when data were simulated with considerable spatial autocorrelation. The overstated probability of rejection is due to the double use of data, which are used both to fit the model and also to calculate a tail probability.”
Bayes factors As above As above Y As above NA
Bayesian Hierarchical Modelling Failure to undertake and / or report model checking (Conn et al. 2018) Y Not enough to report MCMC chain convergence to a stationary distribution only: “Perhaps there is a mistaken belief among authors and reviewers that convergence to a stationary dis- tribution, combined with a lack of prior sensitivity, implies that a model fits the data. In reality, convergence diagnostics such as trace plots only allow us to check the algorithm for fitting the model, not the model itself” (Conn et al. 2018). “Results of such goodness-of-fit tests are routinely reported when publishing analyses in the ecological literature. The implicit requirement that one conduct model checking exercises is not often adhered to when reporting results of Bayesian analyses. For instance, a search of Ecology articles published in 2014 indicated that only 25% of articles employing Bayesian analysis on real data sets reported any model checking or goodness-of-fit testing” (Conn et al. 2018)
Bayesian Hierarchical Modelling Selective reporting of model checking (Conn et al. 2018) Y NA “As in the case of “P hacking” (Head et al. 2015), care should be taken to choose appropriate goodness of fit measures without first peeking at results (i.e., employing multiple discrepancy measures and only reporting those that indicate adequate fit)." (Conn et al. 2018). Their paper demonstrates that different model checking methods may be better at diagnosing the particular causes or regions in the model where there is a lack of fit. THe plethora of choices for model checking givse rise to researcher degrees of freedom, and hence the very real potential for P-hacking and its equivalents.
Bayesian Hierarchical Modelling Failure to report convergence diagnostic of MCMC chains (Conn et al. 2018) N NA “all of the articles we examined did a commendable job in reporting convergence diagnostics to support their contention that MCMC chains had reached their stationary distribution” (Conn et al. 2018).
Bayesian Hierarchical Modelling Selective reporting of convergence diagnostics (Conn et al. 2018) N NA NA

Bayesian vs. frequentist.The generic form of the QRPs are common across frequentist and Bayesian methods (e.g. selective reporting, and even P-hacking). E.g. the failure to report or selctive reporting problem can occur in both frequentist and Bayesian analyses, but the set of required things to report and the overarching workflow in reporting differs between Bayesian and frequentist methods. For example, in Bayesian methods, reporting of diagnostics around MCMC convergence is an additional reporting element.

Moreover, more complex modelling methods such as Bayesian Hierarchical Modeling are potentially more prone to QRPs, simply because “there are more places where things can go wrong” (Conn et al. 2018). This is known as researcher degrees of freedom. Simply because there are more decision points in the overall workflow, AND at each decision point there are so many potential choices of analyses to implement. Potentially, the magnitude is more when there are more than one QRP in a study / model, or the same QRP arises at multiple points in the model development workflow.

Data Collection

  • stopping rules (optional or not, v. controversial)

Model Development and fitting

  • report all covariates
  • report all conditions
  • report if outliers dropped and justify, but report model results for both.

Model Development: Prior selection

  • Do not peek at data. Rationalise choice before fitting model.
  • Check influence and report
  • Do not selectively debug (e.g. if results look interesting / unexpected)
  • Ensure measured on the same scale as likelihood

MCMC Convergence

  • check and report ALL attempts
  • perform for Null model and all alternative models
  • do not selectively debug

Model Checking

  • Check and report all diagnostic analyses
  • do not selectively debug
  • ensure assumptions of model are not violated, report.

Model Selection and Comparison

  • report all models, including the null model and alternative models
  • Do not interpret in frequentist framework (remember that the relative plausibility of different models should be interpreted as conditioned on the data… not on some “hypothetical truth” (Rouder 2014)).
nodes <- DiagrammeR::create_node_df(7, label = c("Data Collection", 
                                                 "Model Development and Fitting", 
                                                 "Test for Convergence", 
                                                 "Model Checking", 
                                                 "Model Comparison", 
                                                 "Robustness Analysis"))
from <- 1:length(nodes$label[-7])
to <- lead(from,default = length(nodes$label))
edges <- DiagrammeR::create_edge_df(from = from, to = to)

#Add in Yes / No
nodes <- nodes %>%
        bind_rows(create_node_df(2, label = c("Yes", "No")) %>%
                          dplyr::mutate(id = id + 7))
edges <- edges %>%
        bind_rows(create_edge_df(from = c(3,9,5,9,3,8,4,8), to = c(9,2,9,2,8,4,8,5)))

graph <- DiagrammeR::create_graph(nodes_df = nodes, edges_df = edges)
DiagrammeR::render_graph(graph, layout = "tree")
##### ggdag
## Attaching package: 'ggdag'
## The following object is masked from 'package:ggplot2':
##     expand_scale
## The following object is masked from 'package:stats':
##     filter
ggdag::dagify(robustnessanalysis ~ modelcomparison,
              modelcomparison ~ pass,
              pass ~ modelchecking + yes,
              modelchecking ~ testforconvergence + yes1,
              testforconvergence ~ modeldevelopmentandfitting,
              modeldevelopmentandfitting ~ datacollection + no,
              yes1 ~ testforconvergence,
              yes ~ modelchecking,
              no ~ testforconvergence,
              labels = c(
                      "robustnessanalysis" = "Robustness\n Analysis",
                      "modelcomparison" = "Model Comparison\n and Selection",
                      "pass" = "Pass?",
                      "modelchecking" = "Model\n Checking",
                      "modeldevelopmentandfitting" = "Model Development\n and Fitting",
                      "yes1" = "yes",
                      "yes" = "yes"
              )) %>% 
        ggdag(use_labels = "label", edge_type = "link", layout = "tree")
## Warning in layout_as_tree(structure(list(10, TRUE, c(0, 1, 1, 2, 3, 4, 5, :
## At structural_properties.c:3346 :graph contains a cycle, partial result is
## returned

Bayesian methods equally suscpetible and invalidated by p-hacking as frequentist methods

The probability of finding a type I error increases substantially due to undisclosed flexibility in data collection (Simmons, Nelson, and Simonsohn 2011). Bayesian methods have been proposed as a remedy to these questionable research practices: “We have diverse views about how best to improve reproducibility, and many of us believe that other ways of summarizing the data, such as Bayes factors or other posterior summaries based on clearly articulated model assumptions, are preferable to P values” (Benjamin et al. 2018). However Simonsohn (2014) demonstrates that this misconception is false, finding that both Bayesian confidence intervals and Bayes factors are equally suscpetible and invalidated to the same degree by “p-hacking” practices as their frequentist inference equivalents.

The p-hacking practices included in Simonsohn (2014) were: data-peeking, dropping dependent variables, dropping conditions, dropping outliers.

Criticisms of Bayesian P values used in model checking

type II error rates

Posterior predictive checks are straightforward to implement. Unfortunately, Bayesian P values based on these checks tend to be conservative in the sense that the distribution of P values calculated under a null model (i.e., when the data generating model and estimation model are the same) is often dome shaped instead of the uniform distribution expected of frequentist P values (Robins et al. 2000). This feature arises because data are used twice: once to approximate the posterior distribution and to simulate the reference distribution for the discrepancy measure, and a second time to calculate the tail probability (Bayarri and Berger 2000).

As such, the power of posterior predictive Bayesian P values to detect significant differences in the discrepancy measure is low. Evidently, the degree of conservatism can vary across data, models, and discrepancy functions, making it difficult to interpret or compare Bayesian P values across models. In an extreme example, Zhang (2014) found that posterior predictive P values almost never rejected a model, even when the model used to fit the data differed considerably from the model used to generate it.

Failure to report model checking / evaluation diagnostics

Implications of failure to report model checking

The role of model-checking in assessing model reliability of inferences:

The goal of model checking is not to “develop a model that fits perfectly, but rather to probe models for assumption violations that result in systematic errors” (Conn et al. 2018). Fitting models is a tricky business in ecology, because “we expect lack of fit in our models” (due to simplistic processes in ecology being rare, and environmental hetereogeneity manifesting in individuals, patchy responses, and always some persisting variation that can never be explained by all covariates). However, a goal of modeling should be to “minimise biases attributable to poor modeling assumptions.”

“Ultimately, the reliability of inference from a fitted model (Bayesian or otherwise) depends on how well the model approximates reality. There are multiple ways of assessing a model’s performance in representing the system being stud- ied. A first step is often to examine diagnostics that compare observed data to model output to pinpoint if and where any systematic differences occur. This process, which we term model checking, is a critical part of statistical inference because it helps diagnose assumption violations and illumi- nate places where a model might be amended to more faith- fully represent gathered data. Following this step, one might proceed to compare the performance of alternative models embodying different hypotheses using any number of model comparison or out-of-sample predictive performance met- rics (see Hooten and Hobbs 2015 for a review) to gauge the support for alternative hypotheses or optimize predictive ability (Fig. 1).” (Conn et al. 2018).

And the implications of the presence of systematic errors in our models differ depending on the use context of the model, Conn describe two examples:

  1. When the models are used to underpin management decisions
  2. Basic-science

In the first context, if for instance a model is deployed and is underdispersed because the wrong distribution has been used (Poisson instead of -ve Binomial, or normal instead of the t distribution), estimates are often too precise, and predictions less extreme than in the real world. The effect of such a model used to inform environmental policy would “tend to make decision makers overly confident in their projections.” (Conn et al. 2018). In the second context, “overconfidence can have real ramifications for confirmation or refutation of existing theory.”

Reasons giving rise to failure to report model checking / evaluation results

Conn et al (2018) posit two reasons as to why Bayesians fail to report (and hence possibly undertake at all) model checking results:

  1. A misunderstanding of the information conveyed in MCMC convergence diagnostics

Perhaps there is a mistaken belief among authors and reviewers that convergence to a stationary distribution, combined with a lack of prior sensitivity, implies that a model fits the data. In reality, convergence diagnostics such as trace plots only allow us to check the algorithm for fitting the model, not the model itself.

  1. Structural disensitive / burden to engage in model evaluation

Finally, it may just be a case of fatigue: it takes considerable effort to envision and code complex hierarchical models of ecological systems, and the extra step of model checking may seem burdensome.

  • [ ] I believe the TRACE papers and other “Good Environmental Modeling Practice” papers discuss this same problem for non-Bayesian ecological models. Perhaps the technical burden of fitting Bayesian models is even more cumbersome than for other ecological modelling practices? This point is something to follow up on.

Optional Stopping Rules and Bayes Factors


Benjamin, Daniel J, James O Berger, Magnus Johannesson, Brian A Nosek, E J Wagenmakers, Richard Berk, Kenneth A Bollen, et al. 2018. “Redefine statistical significance.” Nature Human Behaviour, January. Springer US, 1–5. https://doi.org/10.1038/s41562-017-0189-z.

Conn, Paul B, Devin S Johnson, Perry J Williams, Sharon R Melin, and Mevin B Hooten. 2018. “A guide to Bayesian model checking for ecologists.” Ecological Monographs 9 (June): 341–17. https://doi.org/10.1002/ecm.1314.

Rouder, Jeffrey N. 2014. “Optional stopping: No problem for Bayesians.” Psychonomic Bulletin & Review 21 (2): 301–8. https://doi.org/10.3758/s13423-014-0595-4.

Simmons, Joseph P, Leif D Nelson, and Uri Simonsohn. 2011. “False-Positive Psychology.” Psychological Science 22 (11). SAGE PublicationsSage CA: Los Angeles, CA: 1359–66. https://doi.org/10.1177/0956797611417632.

Simonsohn, U. 2014. “Posterior-hacking: Selective reporting invalidates Bayesian results also.” SSRN Electronic Journal. https://doi.org/10.2139/ssrn.2374040.