Drug-Coated Balloons for Revascularization of Infrapopliteal Arteries: A Meta-Analysis of Randomized Trials.

S. Cassese,G. Ndrepepa,F. Liistro,F. Fanelli,S. Kufner,I. Ott,K. Laugwitz,H. Schunkert,A. Kastrati,M. Fusaro

DOI: https://doi.org/10.1016/j.jcin.2016.02.011

2016-05-23


Consensus-based guidance for conducting and reporting multi-analyst studies

Balazs Aczel,Barnabas Szaszi,Gustav Nilsonne,Olmo R van den Akker,Casper J Albers,Marcel ALM van Assen,Jojanneke A Bastiaansen,Daniel Benjamin,Udo Boehm,Rotem Botvinik-Nezer,Laura F Bringmann,Niko A Busch,Emmanuel Caruyer,Andrea M Cataldo,Nelson Cowan,Andrew Delios,Noah NN van Dongen,Chris Donkin,Johnny B van Doorn,Anna Dreber,Gilles Dutilh,Gary F Egan,Morton Ann Gernsbacher,Rink Hoekstra,Sabine Hoffmann,Felix Holzmeister,Juergen Huber,Magnus Johannesson,Kai J Jonas,Alexander T Kindel,Michael Kirchler,Yoram K Kunkels,D Stephen Lindsay,Jean-Francois Mangin,Dora Matzke,Marcus R Munafò,Ben R Newell,Brian A Nosek,Russell A Poldrack,Don van Ravenzwaaij,Jörg Rieskamp,Matthew J Salganik,Alexandra Sarafoglou,Tom Schonberg,Martin Schweinsberg,David Shanks,Raphael Silberzahn,Daniel J Simons,Barbara A Spellman,Samuel St-Jean,Jeffrey J Starns,Eric Luis Uhlmann,Jelte Wicherts,Eric-Jan Wagenmakers

DOI: https://doi.org/10.7554/eLife.72185

IF: 7.7

2021-11-09

eLife

Abstract:Any large dataset can be analyzed in a number of ways, and it is possible that the use of different analysis strategies will lead to different results and conclusions. One way to assess whether the results obtained depend on the analysis strategy chosen is to employ multiple analysts and leave each of them free to follow their own approach. Here, we present consensus-based guidance for conducting and reporting such multi-analyst studies, and we discuss how broader adoption of the multi-analyst approach has the potential to strengthen the robustness of results and conclusions obtained from analyses of datasets in basic and applied research.

biology
Toward a more credible assessment of the credibility of science by many-analyst studies

Katrin Auspurg,Josef Brüderl

DOI: https://doi.org/10.1073/pnas.2404035121

2024-09-17

Abstract:We discuss a relatively new meta-scientific research design: many-analyst studies that attempt to assess the replicability and credibility of research based on large-scale observational data. In these studies, a large number of analysts try to answer the same research question using the same data. The key idea is that the greater the variation in results, the greater the uncertainty in answering the research question and, accordingly, the lower the credibility of any individual research finding. Compared to individual replications, the large crowd of analysts allows for a more systematic investigation of uncertainty and its sources. However, many-analyst studies are also resource-intensive, and there are some doubts about their potential to provide credible assessments. We identify three issues that any many-analyst study must address: 1) identifying the source of variation in the results; 2) providing an incentive structure similar to that of standard research; and 3) conducting a proper meta-analysis of the results. We argue that some recent many-analyst studies have failed to address these issues satisfactorily and have therefore provided an overly pessimistic assessment of the credibility of science. We also provide some concrete guidance on how future many-analyst studies could provide a more constructive assessment.
Real Effect or Bias? Good Practices for Evaluating the Robustness of Evidence From Comparative Observational Studies Through Quantitative Sensitivity Analysis for Unmeasured Confounding

Douglas Faries,Chenyin Gao,Xiang Zhang,Chad Hazlett,James Stamey,Shu Yang,Peng Ding,Mingyang Shan,Kristin Sheffield,Nancy Dreyer

DOI: https://doi.org/10.1002/pst.2457

2024-12-06

Pharmaceutical Statistics

Abstract:The assumption of "no unmeasured confounders" is a critical but unverifiable assumption required for causal inference, yet quantitative sensitivity analyses to assess the robustness of real‐world evidence remain under‐utilized. The lack of use is likely due in part to the complexity of implementation and the often specific and restrictive data requirements for applying each method. With the advent of methods that are broadly applicable, in that they do not require identification of a specific unmeasured confounder (along with publicly available code for implementation), roadblocks toward broader use of sensitivity analyses are decreasing. To spur greater application, here we offer good practice guidance to address the potential for unmeasured confounding at both the design and analysis stages, including framing questions and an analytic toolbox for researchers. The questions at the design stage guide the researcher through steps evaluating the potential robustness of the design while encouraging gathering of additional data to reduce uncertainty due to potential confounding. At the analysis stage, the questions guide quantifying the robustness of the observed result and provide researchers with a clearer indication of the strength of their conclusions. We demonstrate the application of this guidance using simulated data based on an observational fibromyalgia study, applying multiple methods from our analytic toolbox for illustration purposes.

pharmacology & pharmacy,statistics & probability
Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty

Nate Breznau,Eike Mark Rinke,Alexander Wuttke,Hung H V Nguyen,Muna Adem,Jule Adriaans,Amalia Alvarez-Benjumea,Henrik K Andersen,Daniel Auer,Flavio Azevedo,Oke Bahnsen,Dave Balzer,Gerrit Bauer,Paul C Bauer,Markus Baumann,Sharon Baute,Verena Benoit,Julian Bernauer,Carl Berning,Anna Berthold,Felix S Bethke,Thomas Biegert,Katharina Blinzler,Johannes N Blumenberg,Licia Bobzien,Andrea Bohman,Thijs Bol,Amie Bostic,Zuzanna Brzozowska,Katharina Burgdorf,Kaspar Burger,Kathrin B Busch,Juan Carlos-Castillo,Nathan Chan,Pablo Christmann,Roxanne Connelly,Christian S Czymara,Elena Damian,Alejandro Ecker,Achim Edelmann,Maureen A Eger,Simon Ellerbrock,Anna Forke,Andrea Forster,Chris Gaasendam,Konstantin Gavras,Vernon Gayle,Theresa Gessler,Timo Gnambs,Amélie Godefroidt,Max Grömping,Martin Groß,Stefan Gruber,Tobias Gummer,Andreas Hadjar,Jan Paul Heisig,Sebastian Hellmeier,Stefanie Heyne,Magdalena Hirsch,Mikael Hjerm,Oshrat Hochman,Andreas Hövermann,Sophia Hunger,Christian Hunkler,Nora Huth,Zsófia S Ignácz,Laura Jacobs,Jannes Jacobsen,Bastian Jaeger,Sebastian Jungkunz,Nils Jungmann,Mathias Kauff,Manuel Kleinert,Julia Klinger,Jan-Philipp Kolb,Marta Kołczyńska,John Kuk,Katharina Kunißen,Dafina Kurti Sinatra,Alexander Langenkamp,Philipp M Lersch,Lea-Maria Löbel,Philipp Lutscher,Matthias Mader,Joan E Madia,Natalia Malancu,Luis Maldonado,Helge Marahrens,Nicole Martin,Paul Martinez,Jochen Mayerl,Oscar J Mayorga,Patricia McManus,Kyle McWagner,Cecil Meeusen,Daniel Meierrieks,Jonathan Mellon,Friedolin Merhout,Samuel Merk,Daniel Meyer,Leticia Micheli,Jonathan Mijs,Cristóbal Moya,Marcel Neunhoeffer,Daniel Nüst,Olav Nygård,Fabian Ochsenfeld,Gunnar Otte,Anna O Pechenkina,Christopher Prosser,Louis Raes,Kevin Ralston,Miguel R Ramos,Arne Roets,Jonathan Rogers,Guido Ropers,Robin Samuel,Gregor Sand,Ariela Schachter,Merlin Schaeffer,David Schieferdecker,Elmar Schlueter,Regine Schmidt,Katja M Schmidt,Alexander Schmidt-Catran,Claudia Schmiedeberg,Jürgen Schneider,Martijn Schoonvelde,Julia Schulte-Cloos,Sandy Schumann,Reinhard Schunck,Jürgen Schupp,Julian Seuring,Henning Silber,Willem Sleegers,Nico Sonntag,Alexander Staudt,Nadia Steiber,Nils Steiner,Sebastian Sternberg,Dieter Stiers,Dragana Stojmenovska,Nora Storz,Erich Striessnig,Anne-Kathrin Stroppe,Janna Teltemann,Andrey Tibajev,Brian Tung,Giacomo Vagni,Jasper Van Assche,Meta van der Linden,Jolanda van der Noll,Arno Van Hootegem,Stefan Vogtenhuber,Bogdan Voicu,Fieke Wagemans,Nadja Wehl,Hannah Werner,Brenton M Wiernik,Fabian Winter,Christof Wolf,Yuki Yamada,Nan Zhang,Conrad Ziller,Stefan Zins,Tomasz Żółtak

DOI: https://doi.org/10.1073/pnas.2203150119

Abstract:This study explores how researchers' analytical choices affect the reliability of scientific findings. Most discussions of reliability problems in science focus on systematic biases. We broaden the lens to emphasize the idiosyncrasy of conscious and unconscious decisions that researchers make during data analysis. We coordinated 161 researchers in 73 research teams and observed their research decisions as they used the same data to independently test the same prominent social science hypothesis: that greater immigration reduces support for social policies among the public. In this typical case of social science research, research teams reported both widely diverging numerical findings and substantive conclusions despite identical start conditions. Researchers' expertise, prior beliefs, and expectations barely predict the wide variation in research outcomes. More than 95% of the total variance in numerical results remains unexplained even after qualitative coding of all identifiable decisions in each team's workflow. This reveals a universe of uncertainty that remains hidden when considering a single study in isolation. The idiosyncratic nature of how researchers' results and conclusions varied is a previously underappreciated explanation for why many scientific hypotheses remain contested. These results call for greater epistemic humility and clarity in reporting scientific findings.
Many-analysts religion project: reflection and conclusion

Suzanne Hoogeveen,Alexandra Sarafoglou,Michiel van Elk,Eric-Jan Wagenmakers

DOI: https://doi.org/10.1080/2153599X.2022.2070263

2022-07-06

Religion, Brain and Behavior

Abstract:In the main article on the Many-Analysts Religion Project (MARP) the results of the 120 analysis teams were summarized by taking each team's reported effect size and subjective assessment of the relation between religiosity and well-being, and the moderating role of cultural norms on this relation (Hoogeveen et al., 2022). The many-analysts approach allowed us to appraise the uncertainty of the outcomes, which has been identified as one of the pillars of good statistical practice (Wagenmakers et al., 2021). A downside of this approach, however, is that a fine-grained consideration of the details and nuances of the results becomes difficult. Summaries of the individual approaches are documented in the teams' OSF project folders, but time and space did not permit the inclusion of details on each of the individual analysis pipelines in the main article.
Combining support for hypotheses over heterogeneous studies with Bayesian Evidence Synthesis: A simulation study

Thom Benjamin Volker,Irene Klugkist

DOI: https://doi.org/10.48550/arXiv.2312.15032

2023-12-23

Abstract:Scientific claims gain credibility by replicability, especially if replication under different circumstances and varying designs yields equivalent results. Aggregating results over multiple studies is, however, not straightforward, and when the heterogeneity between studies increases, conventional methods such as (Bayesian) meta-analysis and Bayesian sequential updating become infeasible. *Bayesian Evidence Synthesis*, built upon the foundations of the Bayes factor, allows aggregating support for conceptually similar hypotheses over studies, regardless of methodological differences. We assess the performance of Bayesian Evidence Synthesis over multiple effect and sample sizes, with a broad set of (inequality-constrained) hypotheses using Monte Carlo simulations, focusing explicitly on the complexity of the hypotheses under consideration. The simulations show that this method can evaluate complex (informative) hypotheses regardless of methodological differences between studies, and performs adequately if the set of studies considered has sufficient statistical power. Additionally, we pinpoint challenging conditions that can lead to unsatisfactory results, and provide suggestions on handling these situations. Ultimately, we show that Bayesian Evidence Synthesis is a promising tool that can be used when traditional research synthesis methods are not applicable due to insurmountable between-study heterogeneity.

Methodology
How the Post-Data Severity Converts Testing Results into Evidence for or against Pertinent Inferential Claims

Aris Spanos

DOI: https://doi.org/10.3390/e26010095

IF: 2.738

2024-01-23

Entropy

Abstract:The paper makes a case that the current discussions on replicability and the abuse of significance testing have overlooked a more general contributor to the untrustworthiness of published empirical evidence, which is the uninformed and recipe-like implementation of statistical modeling and inference. It is argued that this contributes to the untrustworthiness problem in several different ways, including [a] statistical misspecification, [b] unwarranted evidential interpretations of frequentist inference results, and [c] questionable modeling strategies that rely on curve-fitting. What is more, the alternative proposals to replace or modify frequentist testing, including [i] replacing p-values with observed confidence intervals and effect sizes, and [ii] redefining statistical significance, will not address the untrustworthiness of evidence problem since they are equally vulnerable to [a]–[c]. The paper calls for distinguishing between unduly data-dependent 'statistical results', such as a point estimate, a p-value, and accept/reject H0, from 'evidence for or against inferential claims'. The post-data severity (SEV) evaluation of the accept/reject H0 results converts them into evidence for or against germane inferential claims. These claims can be used to address/elucidate several foundational issues, including (i) statistical vs. substantive significance, (ii) the large n problem, and (iii) the replicability of evidence. Also, the SEV perspective sheds light on the impertinence of the proposed alternatives [i]–[iii], and oppugns [iii] the alleged arbitrariness of framing H0 and H1 which is often exploited to undermine the credibility of frequentist testing.

physics, multidisciplinary
A survey of experts to identify methods to detect problematic studies: Stage 1 of the INSPECT-SR Project

Jack Wilkinson,Calvin Heal,George A Antoniou,Ella Flemyng,Alison Avenell,Virginia Barbour,Esmee M Bordewijk,Nicholas J L Brown,Mike Clarke,Jo Dumville,Steph Grohmann,Lyle C Gurrin,Jill A Hayden,Kylie E Hunter,Emily Lam,Toby Lasserson,Tianjing Li,Sarah Lensen,Jianping Liu,Andreas Lundh,Gideon Meyerowitz-Katz,Ben W Mol,Neil E O'Connell,Lisa Parker,Barbara Redman,Anna Lene Seidler,Kyle Sheldrick,Emma Sydenham,Darren L Dahly,Madelon van Wely,Lisa Bero,Jamie J Kirkham

DOI: https://doi.org/10.1016/j.jclinepi.2024.111512

2024-08-31

Abstract:Background: Randomised controlled trials (RCTs) inform healthcare decisions. Unfortunately, some published RCTs contain false data, and some appear to have been entirely fabricated. Systematic reviews are performed to identify and synthesise all RCTs which have been conducted on a given topic. This means that any of these 'problematic studies' are likely to be included, but there are no agreed methods for identifying them. The INSPECT-SR project is developing a tool to identify problematic RCTs in systematic reviews of healthcare-related interventions. The tool will guide the user through a series of 'checks' to determine a study's authenticity. The first objective in the development process is to assemble a comprehensive list of checks to consider for inclusion. Methods: We assembled an initial list of checks for assessing the authenticity of research studies, with no restriction to RCTs, and categorised these into five domains: Inspecting results in the paper; Inspecting the research team; Inspecting conduct, governance, and transparency; Inspecting text and publication details; Inspecting the individual participant data. We implemented this list as an online survey, and invited people with expertise and experience of assessing potentially problematic studies to participate through professional networks and online forums. Participants were invited to provide feedback on the checks on the list, and were asked to describe any additional checks they knew of, which were not featured in the list. Results: Extensive feedback on an initial list of 102 checks was provided by 71 participants based in 16 countries across five continents. Fourteen new checks were proposed across the five domains, and suggestions were made to reword checks on the initial list. An updated list of checks was constructed, comprising 116 checks. Many participants expressed a lack of familiarity with statistical checks, and emphasized the importance of feasibility of the tool. Conclusions: A comprehensive list of trustworthiness checks has been produced. The checks will be evaluated to determine which should be included in the INSPECT-SR tool.
What do meta-analysts need in primary studies? Guidelines and the SEMI checklist for facilitating cumulative knowledge

Belén Fernández-Castilla,Sameh Said-Metwaly,Rodrigo S. Kreitchmann,Wim Van Den Noortgate

DOI: https://doi.org/10.3758/s13428-024-02373-9

IF: 5.953

2024-04-17

Behavior Research Methods

Abstract:Meta-analysis is often recognized as the highest level of evidence due to its notable advantages. Therefore, ensuring the precision of its findings is of utmost importance. Insufficient reporting in primary studies poses challenges for meta-analysts, hindering study identification, effect size estimation, and meta-regression analyses. This manuscript provides concise guidelines for the comprehensive reporting of qualitative and quantitative aspects in primary studies. Adhering to these guidelines may help researchers enhance the quality of their studies and increase their eligibility for inclusion in future research syntheses, thereby enhancing research synthesis quality. Recommendations include incorporating relevant terms in titles and abstracts to facilitate study retrieval and reporting sufficient data for effect size calculation. Additionally, a new checklist is introduced to help applied researchers thoroughly report various aspects of their studies.

psychology, experimental, mathematical
Methods proposed for monitoring the implementation of evidence-based research: a cross-sectional study

Livia Puljak,Małgorzata M Bala,Joanna Zając,Tomislav Meštrović,Sandra Buttigieg,Mary Yanakoulia,Matthias Briel,Carole Lunny,Wiktoria Lesniak,Tina Poklepović Peričić,Pablo Alonso-Coello,Mike Clarke,Benjamin Djulbegovic,Gerald Gartlehner,Konstantinos Giannakou,Anne-Marie Glenny,Claire Glenton,Gordon Guyatt,Lars G Hemkens,John P A Ioannidis,Roman Jaeschke,Karsten Juhl Jørgensen,Carolina Castro Martins-Pfeifer,Ana Marušić,Lawrence Mbuagbaw,Jose Francisco Meneses Echavez,David Moher,Barbara Nussbaumer-Streit,Matthew J Page,Giordano Pérez-Gaxiola,Karen A Robinson,Georgia Salanti,Ian J Saldanha,Jelena Savović,James Thomas,Andrea C Tricco,Peter Tugwell,Joost van Hoof,Dawid Pieper

DOI: https://doi.org/10.1016/j.jclinepi.2024.111247

IF: 7.407

2024-04-01

Journal of Clinical Epidemiology

Abstract:OBJECTIVES: Evidence-based research (EBR) is the systematic and transparent use of prior research to inform a new study so that it answers questions that matter in a valid, efficient, and accessible manner. This study surveyed experts about existing (e.g., citation analysis) and new methods for monitoring EBR and collected ideas about implementing these methods. STUDY DESIGN AND SETTING: We conducted a cross-sectional study via an online survey between November 2022 and March 2023. Participants were experts from the fields of evidence synthesis and research methodology in health research. Open-ended questions were coded by recurring themes; descriptive statistics were used for quantitative questions. RESULTS: Twenty-eight expert participants suggested that citation analysis should be supplemented with content evaluation (not just what is cited but also in which context), content expert involvement, and assessment of the quality of cited systematic reviews. They also suggested that citation analysis could be facilitated with automation tools. They emphasized that EBR monitoring should be conducted by ethics committees and funding bodies before the research starts. Challenges identified for EBR implementation monitoring were resource constraints and clarity on responsibility for EBR monitoring. CONCLUSION: Ideas proposed in this study for monitoring the implementation of EBR can be used to refine methods and define responsibility but should be further explored in terms of feasibility and acceptability. Different methods may be needed to determine if the use of EBR is improving over time.

public, environmental & occupational health,health care sciences & services
An evidence-based method for assessing the value of a search tool: a pilot study

Donald Stanley Pearson,Stevo Roksandic,Jill Kilanowski

DOI: https://doi.org/10.5195/jmla.2018.287

Abstract:Objective: The objective of this study was to develop an evidence-based method with a set of metrics that could be used to assess an information search tool. Methods: This pilot study analyzed a two-group convenience sample of graduate nursing students and resident physicians. The intervention group received ten minutes of instruction on a familiar search tool (eSearcher). Each group was provided one prompt to search for clinical guidelines on a given topic within their scope of practice and asked to find the best result using only eSearcher (intervention group) or specifically excluding eSearcher (comparison group). Three measurements of search results were employed: time elapsed to complete the search, an accuracy score, and a participant-reported score of confidence in the result. Results: Forty-two students participated in this study (23 graduate nursing students and 19 resident physicians). The intervention group consisted of 22 participants (12 graduate nursing students and 10 resident physicians), and the comparison group consisted of 20 participants (11 graduate nursing students and 9 resident physicians). The intervention group had lower mean ranks in both accuracy and confidence compared to the comparison (not eSearcher) group, although these differences were not significant. However, the intervention (eSearcher) group had significantly longer search times compared to the comparison (not eSearcher) group. Discussion: These findings provided new insights into the performance of the search tool and how users felt about their search experience. The quantitative evidence gained from this study led directly to an informed decision to explore other options for search tools. The evidence-based methods and process developed in this pilot study will enable similar studies to test other student groups and other search tools, leading to better informed purchasing and instructional decisions.
Benchmarking Human-AI Collaboration for Common Evidence Appraisal Tools

Tim Woelfle,Julian Hirt,Perrine Janiaud,Ludwig Kappos,John P.A. Ioannidis,Lars G. Hemkens

DOI: https://doi.org/10.1016/j.jclinepi.2024.111533

IF: 7.407

2024-09-13

Journal of Clinical Epidemiology

Abstract:Background: It is unknown whether large language models (LLMs) may facilitate time- and resource-intensive text-related processes in evidence appraisal. Objectives: To quantify the agreement of LLMs with human consensus in appraisal of scientific reporting (PRISMA) and methodological rigor (AMSTAR) of systematic reviews and design of clinical trials (PRECIS-2), and to identify areas where human-AI collaboration would outperform the traditional consensus process of human raters in efficiency. Design: Five LLMs (Claude-3-Opus, Claude-2, GPT-4, GPT-3.5, Mixtral-8x22B) assessed 112 systematic reviews applying the PRISMA and AMSTAR criteria, and 56 randomized controlled trials applying PRECIS-2. We quantified agreement between human consensus and (1) individual human raters; (2) individual LLMs; (3) combined LLMs approach; (4) human-AI collaboration. Ratings were marked as deferred (undecided) in case of inconsistency between combined LLMs or between the human rater and the LLM. Results: Individual human rater accuracy was 89% for PRISMA and AMSTAR, and 75% for PRECIS-2. Individual LLM accuracy ranged from 63% (GPT-3.5) to 70% (Claude-3-Opus) for PRISMA, 53% (GPT-3.5) to 74% (Claude-3-Opus) for AMSTAR, and 38% (GPT-4) to 55% (GPT-3.5) for PRECIS-2. Combined LLM ratings led to accuracies of 75-88% for PRISMA (4-74% deferred), 74-89% for AMSTAR (6-84% deferred), and 64-79% for PRECIS-2 (29-88% deferred). Human-AI collaboration resulted in the best accuracies, from 89-96% for PRISMA (25/35% deferred), 91-95% for AMSTAR (27/30% deferred), and 80-86% for PRECIS-2 (76/71% deferred). Conclusions: Current LLMs alone appraised evidence worse than humans. Human-AI collaboration may reduce workload for the second human rater for the assessment of reporting (PRISMA) and methodological rigor (AMSTAR) but not for complex tasks such as PRECIS-2.
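The deferral rule described in this abstract (a rating is marked as deferred when the human rater and the LLM disagree) can be illustrated with a minimal Python sketch. The function names, the toy ratings, and the choice to compute accuracy only over non-deferred items are assumptions made for illustration; they are not taken from the study's code or data.

```python
def collaborate(human_rating, llm_rating):
    """Return the shared rating, or None (deferred) when the two raters disagree."""
    return human_rating if human_rating == llm_rating else None


def summarize(pairs, consensus):
    """Accuracy over non-deferred items and the share of deferred items.

    pairs: list of (human_rating, llm_rating) tuples (hypothetical data)
    consensus: list of reference ratings from the human consensus process
    """
    decided = [(collaborate(h, l), c) for (h, l), c in zip(pairs, consensus)]
    kept = [(r, c) for r, c in decided if r is not None]
    deferred_share = 1 - len(kept) / len(decided)
    # Assumption: accuracy is judged only on items that were not deferred.
    accuracy = sum(r == c for r, c in kept) / len(kept) if kept else float("nan")
    return accuracy, deferred_share


# Toy example: four items, one disagreement (deferred), one shared but wrong rating.
pairs = [("yes", "yes"), ("no", "yes"), ("no", "no"), ("yes", "yes")]
consensus = ["yes", "no", "yes", "yes"]
accuracy, deferred_share = summarize(pairs, consensus)
```

In this toy setup a deferred item would be passed on to a second human rater, which is where the abstract locates the potential workload saving.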

public, environmental & occupational health,health care sciences & services
Evidence Profiles for Validity Threats in Program Comprehension Experiments

Marvin Muñoz Barón,Marvin Wyrich,Daniel Graziotin,Stefan Wagner

DOI: https://doi.org/10.1109/ICSE48619.2023.00162

2023-01-25

Abstract:Searching for clues, gathering evidence, and reviewing case files are all techniques used by criminal investigators to draw sound conclusions and avoid wrongful convictions. Similarly, in software engineering (SE) research, we can develop sound methodologies and mitigate threats to validity by basing study design decisions on evidence. Echoing a recent call for the empirical evaluation of design decisions in program comprehension experiments, we conducted a two-phase study consisting of systematic literature searches, snowballing, and thematic synthesis. We found out (1) which validity threat categories are most often discussed in primary studies of code comprehension, and we collected evidence to build (2) the evidence profiles for the three most commonly reported threats to validity. We discovered that few mentions of validity threats in primary studies (31 of 409) included a reference to supporting evidence. For the three most commonly mentioned threats, namely the influence of programming experience, program length, and the selected comprehension measures, almost all cited studies (17 of 18) did not meet our criteria for evidence. We show that for many threats to validity that are currently assumed to be influential across all studies, their actual impact may depend on the design and context of each specific study. Researchers should discuss threats to validity within the context of their particular study and support their discussions with evidence. The present paper can be one resource for evidence, and we call for more meta-studies of this type to be conducted, which will then inform design decisions in primary studies. Further, although we have applied our methodology in the context of program comprehension, our approach can also be used in other SE research areas to enable evidence-based experiment design decisions and meaningful discussions of threats to validity.

Software Engineering
Practitioner-generated blog posts as evidence for software engineering research: attitudinal survey and preliminary checklist

Austen Rainer,Ashley Williams

DOI: https://doi.org/10.48550/arXiv.2103.01845

2021-03-03

Abstract:Background: Blog posts are frequently used by software practitioners to share information about their practice. Blog posts therefore provide a potential source of evidence for software engineering (SE) research. The use of blog posts as evidence for research appears contentious amongst some SE researchers. Objective: To better understand the actual and perceived value of blog posts as evidence for SE research, and to develop guidance for SE researchers on the use of blog posts as evidence. Method: We further analyse responses from a previously conducted attitudinal survey of 44 software engineering researchers. We conduct a heatmap analysis, simple statistical analysis, and a thematic analysis. Results: We find no clear consensus from respondents on researchers' attitudes to the credibility of blog posts, or on a standard set of criteria to evaluate blog-post credibility. We show that some of the responses to the survey exhibit characteristics similar to the content of blog posts, e.g., asserting prior beliefs as claims, with no citations and little supporting rationale. We illustrate our insights with ~60 qualitative examples from the survey (~40% of the total responses). We complement our quantitative and qualitative analyses with preliminary checklists to guide SE researchers. Conclusion: Blog posts are relevant to research because they are written by software practitioners describing their practice and experience. But evaluating the credibility of blog posts, so as to select the higher-quality content, remains an ongoing challenge. The quantitative and qualitative results, with the proposed checklists, are intended to stimulate reflection and action in the research community on the role of blog posts as evidence in software engineering research. Finally, our findings on researchers' attitudes to blog posts also provide more general insights into researchers' values for SE research.

Software Engineering
Bayesian evidence synthesis as a flexible alternative to meta-analysis: A simulation study and empirical demonstration

Elise van Wonderen,Mariëlle Zondervan-Zwijnenburg,Irene Klugkist

DOI: https://doi.org/10.3758/s13428-024-02350-2

IF: 5.953

2024-03-28

Behavior Research Methods

Abstract:Synthesizing results across multiple studies is a popular way to increase the robustness of scientific findings. The most well-known method for doing this is meta-analysis. However, because meta-analysis requires conceptually comparable effect sizes with the same statistical form, meta-analysis may not be possible when studies are highly diverse in terms of their research design, participant characteristics, or operationalization of key variables. In these situations, Bayesian evidence synthesis may constitute a flexible and feasible alternative, as this method combines studies at the hypothesis level rather than at the level of the effect size. This method therefore poses fewer constraints on the studies to be combined. In this study, we introduce Bayesian evidence synthesis and show through simulations when this method diverges from what would be expected in a meta-analysis to help researchers correctly interpret the synthesis results. As an empirical demonstration, we also apply Bayesian evidence synthesis to a published meta-analysis on statistical learning in people with and without developmental language disorder. We highlight the strengths and weaknesses of the proposed method and offer suggestions for future research.
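To make "combining studies at the hypothesis level" concrete, the following is a minimal Python sketch of one common formulation of Bayesian evidence synthesis: each study contributes a Bayes factor for the same pair of hypotheses, and, assuming independent studies, the Bayes factors are multiplied and applied to the prior odds. The function name, the example Bayes factors, and the equal prior odds are illustrative assumptions, not values or code from the paper.

```python
from math import prod


def synthesize(bayes_factors, prior_odds=1.0):
    """Aggregate per-study Bayes factors (H1 versus H2) over independent studies.

    Returns the combined Bayes factor and the posterior probability of H1,
    given prior odds for H1 versus H2 (1.0 means equal prior probabilities).
    """
    combined_bf = prod(bayes_factors)
    posterior_odds = prior_odds * combined_bf
    posterior_prob = posterior_odds / (1.0 + posterior_odds)
    return combined_bf, posterior_prob


# Hypothetical Bayes factors from three studies with different designs.
study_bfs = [3.0, 0.8, 5.0]
combined_bf, posterior_prob = synthesize(study_bfs)
```

Because only the Bayes factors are combined, the studies do not need to report effect sizes on a common scale, which is the flexibility the abstract attributes to the method.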

psychology, experimental, mathematical
Identifying priority questions regarding rapid systematic reviews' methods: protocol for an eDelphi study

Ariany M Vieira,Geneviève Szczepanik,Chiara de Waure,Andrea C Tricco,Sandy Oliver,Jovana Stojanovic,Paula A B Ribeiro,Danielle Pollock,Elie A Akl,John Lavis,Tanja Kuchenmuller,Peter Bragge,Laurenz Langer,Simon Bacon

DOI: https://doi.org/10.1136/bmjopen-2022-069856

IF: 3.006

2023-07-07

BMJ Open

Abstract:Introduction: Rapid systematic reviews (RRs) have the potential to provide timely information to decision-makers, thus directly impacting healthcare. However, the lack of consensus regarding the most efficient approaches to performing RRs and the presence of several unaddressed methodological issues pose challenges. With such a large potential research agenda for RRs, it is unclear what should be prioritised. Objective: To elicit a consensus from RR experts and interested parties on the most important methodological questions (from the generation of the question to the writing of the report) for the field to address in order to guide the effective and efficient development of RRs. Methods and analysis: An eDelphi study will be conducted. Researchers with experience in evidence synthesis and other interested parties (eg, knowledge users, patients, community members, policymakers, industry, journal editors and healthcare providers) will be invited to participate. The following steps will be taken: (1) a core group of experts in evidence synthesis will generate the first list of items based on the available literature; (2) using LimeSurvey, participants will be invited to rate and rank the importance of suggested RR methodological questions. Questions with open format responses will allow for modifications to the wording of items or the addition of new items; (3) three survey rounds will be performed asking participants to re-rate items, with items deemed of low importance being removed at each round; (4) a list of items will be generated with items believed to be of high importance by ≥75% of participants being included and (5) this list will be discussed at an online consensus meeting that will generate a summary document containing the final priority list. Data analysis will be performed using raw numbers, means and frequencies. Ethics and dissemination: This study was approved by the Concordia University Human Research Ethics Committee (#30015229). Both traditional (for example, scientific conference presentations and publication in scientific journals) and non-traditional (for example, lay summaries and infographics) knowledge translation products will be created.
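The inclusion rule in step (4) of the protocol above can be illustrated with a short Python sketch. Only the ≥75% threshold comes from the abstract; the 9-point rating scale, the cutoff of 7 for "high importance", the function names and the sample ratings are all hypothetical choices made for illustration, not details of the study.

```python
from statistics import mean


def high_importance_share(ratings, cutoff=7):
    """Fraction of participants who rated an item at or above `cutoff`.

    The cutoff of 7 (on an assumed 9-point scale) is a hypothetical choice.
    """
    return sum(r >= cutoff for r in ratings) / len(ratings)


def select_priority_items(item_ratings, cutoff=7, threshold=0.75):
    """Keep items rated high-importance by at least `threshold` of participants."""
    kept = []
    for item, ratings in item_ratings.items():
        share = high_importance_share(ratings, cutoff)
        if share >= threshold:
            kept.append((item, share, mean(ratings)))
    # Order by share of high ratings, then by mean rating, highest first.
    return sorted(kept, key=lambda t: (t[1], t[2]), reverse=True)


# Made-up final-round ratings from four participants.
ratings = {
    "Q1": [9, 8, 7, 7],  # 4 of 4 rate it high
    "Q2": [9, 8, 7, 3],  # 3 of 4 rate it high (exactly 75%)
    "Q3": [9, 8, 2, 3],  # 2 of 4 rate it high
}
priority = select_priority_items(ratings)
for item, share, avg in priority:
    print(f"{item}: {share:.0%} high importance, mean rating {avg:.2f}")
```

With these made-up ratings, Q1 and Q2 meet the threshold and Q3 is dropped. The share, the mean and the underlying counts correspond to the "raw numbers, means and frequencies" the abstract names for the analysis.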
Tools for assessing the methodological limitations of a QES—a short note

Heid Nøkleby,Heather Melanie R. Ames,Lars Jørun Langøien,Christine Hillestad Hestevik

DOI: https://doi.org/10.1186/s13643-024-02511-6

2024-04-07

Systematic Reviews

Abstract:The increasing prevalence and application of qualitative evidence syntheses (QES) in decision-making processes underscore the need for robust tools to assess the methodological limitations of a completed QES. This commentary discusses the limitations of three existing tools and presents the authors' efforts to address this gap. Through a simple comparative analysis, the three tools are examined in terms of their coverage of essential topic areas. The examination finds that existing assessment tools lack comprehensive coverage, clarity, and grounding in qualitative research principles. The authors advocate for the development of a new collaboratively developed evidence-based tool rooted in qualitative methodology and best practice methods. The conclusion emphasizes the necessity of a tool that can provide a comprehensive judgement on the methodological limitations of a QES, addressing the needs of end-users, and ultimately enhancing the trustworthiness of QES findings in decision-making processes.

medicine, general & internal

Conflict diagnostics for evidence synthesis in a multiple testing framework

Anne M. Presanis,David Ohlssen,Kai Cui,Magdalena Rosinska,Daniela De Angelis

DOI: https://doi.org/10.48550/arXiv.1702.07304

2017-09-14

Abstract:Evidence synthesis models that combine multiple datasets of varying design, to estimate quantities that cannot be directly observed, require the formulation of complex probabilistic models that can be expressed as graphical models. An assessment of whether the different datasets synthesised contribute information that is consistent with each other, and in a Bayesian context, with the prior distribution, is a crucial component of the model criticism process. However, a systematic assessment of conflict suffers from the multiple testing problem, through testing for conflict at multiple locations in a model. We demonstrate the systematic use of conflict diagnostics, while accounting for the multiple hypothesis tests of no conflict at each location in the graphical model. The method is illustrated by a network meta-analysis to estimate treatment effects in smoking cessation programs and an evidence synthesis to estimate HIV prevalence in Poland.
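The multiple testing problem described above can be illustrated with a generic correction. The abstract does not say which adjustment the authors use, so the sketch below substitutes the standard Holm step-down procedure, which controls the family-wise error rate. The node names and p-values are invented for illustration.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        adj = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, adj)
        adjusted[i] = running_max
    return adjusted


# Hypothetical conflict p-values, one per location (node) in a graphical model.
conflict_p = {"node_a": 0.004, "node_b": 0.03, "node_c": 0.20, "node_d": 0.012}

names = list(conflict_p)
adjusted = dict(zip(names, holm_adjust([conflict_p[n] for n in names])))
flagged = [n for n in names if adjusted[n] <= 0.05]
print(adjusted)
print("conflict flagged at:", flagged)
```

In this made-up example, node_b has an unadjusted p-value of 0.03 and would be flagged if tested in isolation, but it is no longer flagged after adjustment across the four locations.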

Methodology

Quantifying convergence and consistency

Nicholas J. Matiasz,Justin Wood,Alcino J. Silva

DOI: https://doi.org/10.1111/ejn.16561

IF: 3.698

2024-10-17

European Journal of Neuroscience

Abstract:The authors discuss the cumulative evidence index (CEI), a Bayesian metric that quantifies both the consistency and convergence of evidence across different types of studies, emphasizing the greater epistemological weight typically attributed to convergence. The CEI addresses the reproducibility crisis by showing how convergent evidence across multiple study types can advance scientific consensus, even when individual studies fail to yield reproducible results. The reproducibility crisis highlights several unresolved issues in science, including the need to develop measures that gauge both the consistency and convergence of data sets. While existing meta‐analytic methods quantify the consistency of evidence, they do not quantify its convergence: the extent to which different types of empirical methods have provided evidence to support a hypothesis. To address this gap in meta‐analysis, we and colleagues developed a summary metric—the cumulative evidence index (CEI)—which uses Bayesian statistics to quantify the degree of both consistency and convergence of evidence regarding causal hypotheses between two phenomena. Here, we outline the CEI's underlying model, which quantifies the extent to which studies of four types—positive intervention, negative intervention, positive non‐intervention and negative non‐intervention—lend credence to any of three types of causal relations: excitatory, inhibitory or no‐connection. Along with p‐values and other measures, the CEI can provide a more holistic perspective on a set of evidence by quantitatively expressing epistemic principles that scientists regularly employ qualitatively. The CEI can thus address the reproducibility crisis by formally demonstrating how convergent evidence across multiple study types can yield progress toward scientific consensus, even when an individual type of study fails to yield reproducible results.

neurosciences

Evaluation of Nine Consensus Indices in Delphi Foresight Research and Their Dependency on Delphi Survey Characteristics: A Simulation Study and Debate on Delphi Design and Interpretation

Stanislav Birko,Edward S Dove,Vural Özdemir

DOI: https://doi.org/10.1371/journal.pone.0135162

IF: 3.7

2015-08-13

PLoS ONE

Abstract:The extent of consensus (or the lack thereof) among experts in emerging fields of innovation can serve as antecedents of scientific, societal, investor and stakeholder synergy or conflict. Naturally, how we measure consensus is of great importance to science and technology strategic foresight. The Delphi methodology is a widely used anonymous survey technique to evaluate consensus among a panel of experts. Surprisingly, there is little guidance on how indices of consensus can be influenced by parameters of the Delphi survey itself. We simulated a classic three-round Delphi survey building on the concept of clustered consensus/dissensus. We evaluated three study characteristics that are pertinent for design of Delphi foresight research: (1) the number of survey questions, (2) the sample size, and (3) the extent to which experts conform to group opinion (the Group Conformity Index) in a Delphi study. Their impacts on the following nine Delphi consensus indices were then examined in 1000 simulations: Clustered Mode, Clustered Pairwise Agreement, Conger's Kappa, De Moivre index, Extremities Version of the Clustered Pairwise Agreement, Fleiss' Kappa, Mode, the Interquartile Range and Pairwise Agreement. The dependency of a consensus index on the Delphi survey characteristics was expressed from 0.000 (no dependency) to 1.000 (full dependency). The number of questions (range: 6 to 40) in a survey did not have a notable impact whereby the dependency values remained below 0.030. The variation in sample size (range: 6 to 50) displayed the top three impacts for the Interquartile Range, the Clustered Mode and the Mode (dependency = 0.396, 0.130, 0.116, respectively). 
The Group Conformity Index, a construct akin to measuring stubbornness/flexibility of experts' opinions, greatly impacted all nine Delphi consensus indices (dependency = 0.200 to 0.504), except the Extremity CPWA and the Interquartile Range that were impacted only beyond the first decimal point (dependency = 0.087 and 0.083, respectively). Scholars in technology design, foresight research and future(s) studies might consider these new findings in strategic planning of Delphi studies, for example, in rational choice of consensus indices and sample size, or accounting for confounding factors such as experts' variable degrees of conformity (stubbornness/flexibility) in modifying their opinions.
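Three of the simpler indices named above (Mode, Pairwise Agreement and the Interquartile Range) can be sketched for a single survey question. The definitions below are plain textbook versions and may differ in detail from those used in the paper, which the abstract does not spell out. The response data and function names are made up.

```python
from collections import Counter
from itertools import combinations
from statistics import quantiles


def mode_share(responses):
    """Proportion of experts who chose the most common response."""
    counts = Counter(responses)
    return counts.most_common(1)[0][1] / len(responses)


def pairwise_agreement(responses):
    """Proportion of expert pairs who gave identical responses."""
    pairs = list(combinations(responses, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


def interquartile_range(responses):
    """Spread between the first and third quartiles (smaller means more consensus)."""
    q1, _, q3 = quantiles(responses, n=4, method="inclusive")
    return q3 - q1


# Hypothetical final-round answers of eight experts on a 5-point scale.
round_3 = [4, 4, 4, 5, 3, 4, 2, 4]

print("mode share:", mode_share(round_3))
print("pairwise agreement:", round(pairwise_agreement(round_3), 3))
print("interquartile range:", interquartile_range(round_3))
```

Even on this small example the indices express consensus on different scales, so a study has to choose and report its index deliberately, as the abstract suggests.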
