SciBeh-Topic-Visualization

bayesian, causal, measurement, replication, statistical

Topic 21

bayesian causal measurement replication statistical error bias heterogeneity explanation pervasiveness prior significant prediction multiverse naïve

Measuring evidence for mediation in the presence of measurement error

Mediation analysis empirically investigates the process underlying the effect of an experimental manipulation on a dependent variable of interest. In the simplest mediation setting, the experimental treatment can affect the dependent variable through the mediator (indirect effect) and/or directly (direct effect). Recent methodological advances made in the field of mediation analysis aim at developing statistically reliable estimates of the indirect effect of the treatment on the outcome. However, what appears to be an indirect effect through the mediator may reflect a data generating process without mediation, regardless of the statistical properties of the estimate. To overcome this indeterminacy where possible, we develop the insight that a statistically reliable indirect effect combined with strong evidence for conditional independence of treatment and outcome given the mediator is unequivocal evidence for mediation (as the underlying causal model generating the data) into an operational procedure. Our procedure combines Bayes factors as principled measures of the degree of support for conditional independence, i.e., the degree of support for a null hypothesis, with latent variable modeling to account for measurement error and discretization in a fully Bayesian framework. We illustrate how our approach facilitates stronger conclusions by re-analyzing a set of published mediation studies.
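The decisive ingredient in the procedure described above is quantifying support for a null hypothesis: conditional independence of treatment and outcome given the mediator. The Python sketch below illustrates only that idea, using the common BIC approximation to the Bayes factor on simulated data; it is not the authors' procedure and deliberately omits their latent-variable handling of measurement error and discretization.

```python
# Illustrative sketch: approximate a Bayes factor for conditional independence
# of treatment T and outcome Y given mediator M, via BF_01 ~ exp((BIC_1 - BIC_0) / 2).
# This omits the latent-variable measurement-error correction described in the abstract.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
T = rng.integers(0, 2, n)          # randomized treatment
M = 0.6 * T + rng.normal(size=n)   # mediator affected by treatment
Y = 0.5 * M + rng.normal(size=n)   # outcome depends on M only (full mediation)

X0 = sm.add_constant(np.column_stack([M]))      # H0: Y independent of T given M
X1 = sm.add_constant(np.column_stack([M, T]))   # H1: a direct effect of T remains
bic0 = sm.OLS(Y, X0).fit().bic
bic1 = sm.OLS(Y, X1).fit().bic
bf01 = np.exp((bic1 - bic0) / 2)   # > 1 favors conditional independence
print(f"Approximate BF in favor of conditional independence: {bf01:.2f}")
```

In this simplified setting, a BF well above 1, alongside a reliable indirect effect, would be the kind of pattern the abstract treats as evidence for mediation.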
analysis
modeling
statistics
scientific method
methodology
bayesian, causal, measurement, replication, statistical
neural, accuracy, elusive, task, trade
Bayesian evaluation of replication studies

In this paper a method is proposed to determine whether the result from an original study is corroborated in a replication study. The paper is illustrated using data from the Reproducibility Project: Psychology by the Open Science Collaboration. This method emphasizes the need to determine what one wants to replicate: the hypotheses as formulated in the introduction of the original paper, or hypotheses derived from the research results presented in the original paper. The Bayes factor is used to determine whether the hypotheses evaluated in, or resulting from, the original study are corroborated by the replication study. Our method for assessing the success of a replication will better fit the needs and desires of researchers in fields that use replication studies.
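For intuition only, the sketch below computes a simplified normal-approximation replication Bayes factor in Python, comparing a point-null hypothesis against the original study's estimate used as a prior. The paper's own method evaluates informative hypotheses taken from the original study rather than this textbook variant, and all numbers below are hypothetical.

```python
# Minimal sketch of a replication Bayes factor under normal approximations:
# H0: effect = 0  versus  Hr: effect distributed as the original study's posterior.
import numpy as np
from scipy.stats import norm

d_orig, se_orig = 0.40, 0.15   # hypothetical original estimate and its SE
d_rep,  se_rep  = 0.10, 0.10   # hypothetical replication estimate and its SE

m0 = norm.pdf(d_rep, loc=0.0, scale=se_rep)                               # marginal under H0
mr = norm.pdf(d_rep, loc=d_orig, scale=np.sqrt(se_orig**2 + se_rep**2))   # marginal under Hr
print(f"BF_r0 (original-informed hypothesis vs. null): {mr / m0:.2f}")
```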
data
evaluation
open science
collaboration
psychology
study
replication
bayesian
bayesian, causal, measurement, replication, statistical
peer, publish, publication, review, preregistration
Missing data should be handled differently for prediction than for description or causal explanation

Missing data is much studied in epidemiology and statistics. Theoretical development and application of methods for handling missing data have mostly been conducted in the context of prospective research data, and with a goal of description or causal explanation. However, it is now common to build predictive models using routinely collected data, where missing patterns may convey important information, and one might take a pragmatic approach to optimising prediction. Therefore, different methods to handle missing data may be preferred. Furthermore, an underappreciated issue in prediction modelling is that the missing data method used in model development may not match the method used when a model is deployed. This may lead to over-optimistic assessments of model performance.
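A small sketch, with simulated data, of one pragmatic prediction-oriented choice consistent with the abstract: let missingness indicators act as predictors and learn the imputation rule once at model development, so exactly the same rule is applied when the model is deployed. The pipeline below is illustrative, not a recommendation from the paper.

```python
# Prediction-oriented handling of missing data: missingness flags as features,
# and an imputation rule that is learned at development and reused at deployment.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + rng.normal(size=400) > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan            # inject missing values

model = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),  # imputation + missing-value flags
    LogisticRegression(),
)
model.fit(X, y)                                  # the imputation rule is learned here...
X_new = np.array([[np.nan, 0.3, -1.2]])
print(model.predict_proba(X_new))                # ...and applied unchanged at deployment
```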
epidemiology
modeling
statistics
prediction
scientific practice
optimization
scientific method
causality
humanity, statistic, wrong, communicate, fit
bayesian, causal, measurement, replication, statistical
New statistical metrics for multisite replication projects

Increasingly, researchers are attempting to replicate published original studies by using large, multisite replication projects, at least 134 of which have been completed or are ongoing. These designs are promising to assess whether the original study is statistically consistent with the replications and to reassess the strength of evidence for the scientific effect of interest. However, existing analyses generally focus on single replications; when applied to multisite designs, they provide an incomplete view of aggregate evidence and can lead to misleading conclusions about replication success. We propose new statistical metrics representing, firstly, the probability that the original study's point estimate would be at least as extreme as it actually was if in fact the original study were statistically consistent with the replications, and, secondly, the estimated proportion of population effects agreeing in direction with the original study. Generalized versions of the second metric enable consideration of only meaningfully strong population effects that agree in direction, or alternatively that disagree in direction, with the original study. These metrics apply when there are at least 10 replications (unless the heterogeneity estimate τ̂ = 0, in which case the metrics apply regardless of the number of replications). The first metric assumes normal population effects but appears robust to violations in simulations; the second is distribution free. We provide R packages (Replicate and MetaUtility).
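As a rough numerical sketch of the first metric as we read its definition (the authors' R package Replicate is the authoritative implementation), the snippet below compares a hypothetical original estimate with a pooled replication estimate on a scale that combines the heterogeneity estimate and both standard errors.

```python
# Sketch of the first metric: probability that the original point estimate would be
# at least as extreme as observed if the original study were consistent with the
# replications.  All inputs are hypothetical.
import numpy as np
from scipy.stats import norm

y_orig, se_orig = 0.50, 0.12      # original estimate and its SE
mu_hat, se_mu   = 0.15, 0.04      # pooled replication estimate and its SE
tau_hat         = 0.08            # estimated between-site heterogeneity (SD)

denom = np.sqrt(tau_hat**2 + se_orig**2 + se_mu**2)
p_orig = 2 * (1 - norm.cdf(abs(y_orig - mu_hat) / denom))
print(f"P_orig = {p_orig:.3f}")   # small values flag inconsistency with the replications
```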
metascience
replication
r package
bayesian, causal, measurement, replication, statistical
peer, publish, publication, review, preregistration
Robust Bayesian Meta-Analysis: Addressing Publication Bias with Model-Averaging

Meta-analysis is an important quantitative tool for cumulative science, but its application is frustrated by publication bias. In order to test and adjust for publication bias, we extend model-averaged Bayesian meta-analysis with selection models. The resulting Robust Bayesian Meta-analysis (RoBMA) methodology does not require all-or-none decisions about the presence of publication bias, can quantify evidence in favor of the absence of publication bias, and performs well under high heterogeneity. By model-averaging over a set of 12 models, RoBMA is relatively robust to model misspecification and simulations show that it outperforms existing methods. We demonstrate that RoBMA finds evidence for the absence of publication bias in Registered Replication Reports and reliably avoids false positives. We provide an implementation in R and JASP so that researchers can easily apply the new methodology to their data.
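The model-averaging logic, stripped of all meta-analytic detail, is arithmetic over posterior model probabilities. The toy sketch below uses hypothetical marginal likelihoods for four models (two with and two without a selection component) to form an inclusion Bayes factor for publication bias; it is not the RoBMA package itself, which is available in R and JASP.

```python
# Toy model-averaging arithmetic: posterior model probabilities from hypothetical
# log marginal likelihoods, and an inclusion Bayes factor for publication bias.
import numpy as np

log_ml = np.array([-52.1, -51.8, -53.0, -52.4])   # hypothetical log marginal likelihoods
has_bias = np.array([False, False, True, True])    # which models include a selection component
prior = np.full(4, 0.25)                           # equal prior model probabilities

post = prior * np.exp(log_ml - log_ml.max())
post /= post.sum()
prior_odds = prior[has_bias].sum() / prior[~has_bias].sum()
post_odds = post[has_bias].sum() / post[~has_bias].sum()
print(f"Inclusion BF for publication bias: {post_odds / prior_odds:.2f}")  # < 1 favors no bias
```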
modeling
metascience
false positive
meta-analysis
implementation
r
bayesian
publication bias
bayesian, causal, measurement, replication, statistical
patient, hydroxychloroquine, cohort, mortality, observational
The practical alternative to the p-value is the correctly used p-value

Due to the strong overreliance on p-values in the scientific literature some researchers have argued that p-values should be abandoned or banned, and that we need to move beyond p-values and embrace practical alternatives. When proposing alternatives to p-values statisticians often commit the ‘Statistician’s Fallacy’, where they declare which statistic researchers really ‘want to know’. Instead of telling researchers what they want to know, statisticians should teach researchers which questions they can ask. In some situations, the answer to the question they are most interested in will be the p-value. For as long as null-hypothesis tests have been criticized, researchers have suggested including minimum-effect tests and equivalence tests in our statistical toolbox, and these tests (even though they return p-values) have the potential to greatly improve the questions researchers ask. It is clear there is room for improvement in how we teach p-values. If anyone really believes p-values are an important cause of problems in science, preventing the misinterpretation of p-values by developing better evidence-based education and user-centered statistical software should be a top priority. Telling researchers which statistic they should use has distracted us from examining more important questions, such as asking researchers what they want to know when they do scientific research. Before we can improve our statistical inferences, we need to improve our statistical questions.
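One of the concrete recommendations above, equivalence testing, is easy to illustrate. The sketch below runs a two-sample TOST with statsmodels; the equivalence bounds of ±0.5 raw units are an arbitrary choice for the example, not a general default.

```python
# Two one-sided tests (TOST) for equivalence of two independent groups.
import numpy as np
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(42)
x1 = rng.normal(loc=5.0, scale=1.0, size=60)
x2 = rng.normal(loc=5.1, scale=1.0, size=60)

p, lower, upper = ttost_ind(x1, x2, low=-0.5, upp=0.5)
print(f"TOST p-value: {p:.4f}")   # small p: the group difference lies within +/- 0.5
```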
statistics
metascience
misinterpretation
bayesian, causal, measurement, replication, statistical
peer, publish, publication, review, preregistration
How meaningful are parameter estimates from models of inter-temporal choice?

Decades of work have been dedicated to developing and testing models that characterize how people make inter-temporal choices. Although parameter estimates from these models are often interpreted as indices of latent components of the choice process, little work has been done to examine their reliability. This is problematic, because estimation error can bias conclusions that are drawn from these parameter estimates. We examine the reliability of inter-temporal choice model parameter estimates by conducting a parameter recovery analysis of 11 prominent models. We find that the reliability of parameter estimation varies considerably between models and the experimental designs upon which parameter estimates are based. We conclude that many parameter estimates reported in previous research are likely unreliable and provide recommendations on how to enhance reliability for those wishing to use inter-temporal choice models for measurement purposes.
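A parameter recovery analysis of the kind described can be sketched for a single hyperbolic discounting model (not one of the 11 models benchmarked in the paper): simulate choices with known parameters, refit by maximum likelihood, and compare recovered with true values.

```python
# Parameter recovery sketch for a hyperbolic discounting model with a softmax choice rule.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n_trials = 300
amt_now = rng.uniform(5, 40, n_trials)       # immediate amounts
amt_later = rng.uniform(20, 80, n_trials)    # delayed amounts
delay = rng.uniform(1, 180, n_trials)        # delays in days

def choice_prob(k, beta):
    v_now = amt_now
    v_later = amt_later / (1 + k * delay)    # hyperbolic discounting of the delayed amount
    return 1 / (1 + np.exp(-beta * (v_later - v_now)))   # P(choose delayed option)

k_true, beta_true = 0.02, 0.4
choices = rng.random(n_trials) < choice_prob(k_true, beta_true)   # simulated choices

def neg_log_lik(params):
    k, beta = np.exp(params)                 # fit on the log scale to keep parameters positive
    p = np.clip(choice_prob(k, beta), 1e-9, 1 - 1e-9)
    return -np.sum(np.where(choices, np.log(p), np.log(1 - p)))

fit = minimize(neg_log_lik, x0=np.log([0.05, 1.0]), method="Nelder-Mead")
k_hat, beta_hat = np.exp(fit.x)
print(f"true k={k_true}, recovered k={k_hat:.3f}; true beta={beta_true}, recovered beta={beta_hat:.3f}")
```

Repeating this over many simulated data sets and experimental designs, and correlating true with recovered values, is the essence of the reliability assessment the abstract describes.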
bias
reliability
variation
network, complex, graph, multiplex, structure
bayesian, causal, measurement, replication, statistical
Estimating the deep replicability of scientific findings using human and artificial intelligence

Replicability tests of scientific papers show that the majority of papers fail replication. Moreover, failed papers circulate through the literature as quickly as replicating papers. This dynamic weakens the literature, raises research costs, and demonstrates the need for new approaches for estimating a study’s replicability. Here, we trained an artificial intelligence model to estimate a paper’s replicability using ground truth data on studies that had passed or failed manual replication tests, and then tested the model’s generalizability on an extensive set of out-of-sample studies. The model predicts replicability better than the base rate of reviewers and about as well as prediction markets, the best present-day method for predicting replicability. In out-of-sample tests on manually replicated papers from diverse disciplines and methods, the model had strong accuracy levels of 0.65 to 0.78. Exploring the reasons behind the model’s predictions, we found no evidence for bias based on topics, journals, disciplines, base rates of failure, persuasion words, or novelty words like “remarkable” or “unexpected.” We did find that the model’s accuracy is higher when trained on a paper’s text rather than its reported statistics and that n-grams, higher order word combinations that humans have difficulty processing, correlate with replication. We discuss how combining human and machine intelligence can raise confidence in research, provide research self-assessment techniques, and create methods that are scalable and efficient enough to review the ever-growing numbers of publications, a task that entails extensive human resources to accomplish with prediction markets and manual replication alone.
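For readers unfamiliar with the general setup, the sketch below shows the bare bones of an n-gram text classifier for replication outcomes using scikit-learn, with made-up toy texts and labels; it is in no way the authors' model, data, or feature set.

```python
# Toy n-gram text classifier for (hypothetical) replication outcomes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "we find a remarkable and unexpected effect of priming on judgments",
    "a large preregistered sample shows a modest but consistent effect",
    "surprising interaction emerges in a small convenience sample",
    "direct replication with high power confirms the original estimate",
]
replicated = [0, 1, 0, 1]   # hypothetical ground-truth labels

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, replicated)
print(clf.predict_proba(["well powered preregistered study with modest effect"]))
```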
prediction
science
artificial intelligence
ai
model
article
replication
likelihood
replicability
paper
machine, twitter, learn, technology, application
bayesian, causal, measurement, replication, statistical
Power and precision
June 12, 2020 · Original resource · blog

One of the common claims of anti-significance-testing reformers is that power analysis is flawed, and that we should be planning for study “precision” instead. I think this is wrong for several reasons that I will outline here. In summary:
- “Precision” is not itself a primitive theoretical concept. It is an intuition that is manifest through other, more basic concepts, and it is those more basic concepts that we must understand.
- Precision can be thought of as the ability to avoid confusion between close-by regions of the parameter space. When we define power properly, we see that power is directly connected to precision. We don’t replace power with precision; we explain precision using power.
- Expected CI width (which some associate with “precision”) can depend on the parameter value, except in special cases. Power analysis directs your attention to a specific area of interest, linked to the purpose of the study, and hence overcomes this problem with CI-only concepts of precision.
- (One-tailed) power is a flexible way of thinking about precision; confidence intervals (CIs), computed with equal probability in each tail, have difficulties with error trade-offs (asymmetrically tailed CIs, though possible, would surely confuse people). We should thus keep the concept of power, and explain CIs and precision using confusion/error as the primitives.
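A small numerical illustration of this framing, with assumed numbers rather than anything from the post: the same standard error drives both the one-tailed power to tell apart two close-by parameter values and the expected half-width of a 95% CI.

```python
# One-tailed power to distinguish two nearby parameter values, and the 95% CI
# half-width, both computed from the same (assumed) standard error.
import numpy as np
from scipy.stats import norm

se = 0.10                      # standard error of the estimate
theta0, theta1 = 0.00, 0.25    # null value and a nearby value of interest
alpha = 0.05

z_alpha = norm.ppf(1 - alpha)  # one-tailed critical value
power = 1 - norm.cdf(z_alpha - (theta1 - theta0) / se)
ci_half_width = norm.ppf(0.975) * se
print(f"one-tailed power to distinguish {theta0} from {theta1}: {power:.2f}")
print(f"95% CI half-width: {ci_half_width:.3f}")
```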
statistics
scientific practice
power
theory
humanity, statistic, wrong, communicate, fit
bayesian, causal, measurement, replication, statistical
A Bootstrap Based Between-Study Heterogeneity Test in Meta-Analysis

Meta-analysis combines pertinent information from existing studies to provide an overall estimate of population parameters/effect sizes, as well as quantify and explain the differences between studies. However, testing the between-study heterogeneity is one of the most troublesome topics in meta-analysis research. The existing methods, such as the Q test and likelihood ratio (LR) tests, are criticized for their failure to control the Type I error rate and/or failure to attain enough statistical power. Although better reference distribution approximations have been proposed in the literature, the expression is complicated and the application is limited. We propose a bootstrap-based heterogeneity test combining the restricted maximum likelihood (REML) ratio test or the Q test with bootstrap procedures, denoted B-REML-LRT and B-Q, respectively. Simulation studies were conducted to examine and compare the performance of the proposed methods with the regular LR tests, the regular Q test, and the improved Q test in both the random-effects meta-analysis and mixed-effects meta-analysis. Based on both Type I error rates and statistical power, B-REML-LRT is recommended when effect sizes are standardized mean differences and Fisher-transformed Pearson's correlations. When effect sizes are natural-logarithm-transformed odds ratios, B-REML-LRT (provided that study-level sample sizes are not small) and B-Q are recommended. The improved Q test is recommended when it is applicable. An R package boot.heterogeneity is provided to facilitate the implementation of the proposed method.
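A simplified parametric-bootstrap Q test conveys the general idea (the paper's own implementation is the R package boot.heterogeneity, and the details of B-Q and B-REML-LRT may differ): generate data under τ² = 0 at the fixed-effect estimate and compare the observed Q statistic with its bootstrap distribution.

```python
# Simplified parametric-bootstrap Q test for between-study heterogeneity.
import numpy as np

rng = np.random.default_rng(3)
y = np.array([0.30, 0.10, 0.45, 0.05, 0.25, 0.38])   # hypothetical study effect sizes
v = np.array([0.02, 0.03, 0.02, 0.04, 0.03, 0.02])   # their sampling variances
w = 1 / v

def q_stat(y, w):
    mu = np.sum(w * y) / np.sum(w)                   # fixed-effect (common-effect) estimate
    return np.sum(w * (y - mu) ** 2)                 # Cochran's Q

q_obs = q_stat(y, w)
mu_fe = np.sum(w * y) / np.sum(w)

boot = np.array([
    q_stat(rng.normal(mu_fe, np.sqrt(v)), w)         # data generated under tau^2 = 0
    for _ in range(5000)
])
p_boot = np.mean(boot >= q_obs)
print(f"Q = {q_obs:.2f}, bootstrap p-value = {p_boot:.3f}")
```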
heterogeneity
meta-analysis
implementation
r
bayesian, causal, measurement, replication, statistical
patient, hydroxychloroquine, cohort, mortality, observational