## Meta-analysis: Can We Extract Gold from the Biomedical Literature?


Deborah V. Dawson and Derek R. Blanchette

Scientific investigation does not march to its conclusions on a straight and open road. Studies may be inconclusive, or even produce conflicting results. They may have defects in design or execution that leave their conclusions open to question. The body of completed studies addressing a particular question may vary considerably in critical aspects such as design, population, measurement, or clinical/experimental protocols, all of which may cloud the interpretation of results and make meaningful comparisons difficult. This process is further complicated by the sheer magnitude of the ever-expanding scientific literature.

Readers frequently turn to review articles to obtain an overview of the available literature, and to attempt to get a grasp on the level of consensus—and controversy—associated with a particular line of scientific inquiry. Over the last four decades, biomedical research has increasingly turned to the systematic review format to synthesize and provide an overview of the available primary research pertinent to a particular scientific question. Such reviews are associated with careful delineation of a research question, followed by development of a detailed research protocol before commencing the review, including explicit specification of inclusion criteria.

This is followed by a comprehensive search to identify relevant studies that meet those criteria, and the application of rigorous methods for collection and evaluation of the available information, resulting in the research synthesis. Such procedures have been extensively codified by bodies such as the Cochrane Collaboration. The goal is to limit bias, and thus improve the reliability and accuracy of conclusions.

The intention is also that the comprehensive systematic review approach will help researchers avoid some of the pitfalls of the more-traditional “narrative” type of review. While traditional-style reviews have undoubtedly been of value, particularly when constructed by skillful, knowledgeable authors with a balanced outlook, they may suffer from inadequate literature searches and may be inefficient when reviewing a large number of studies. Further, by their very nature, they may involve selection bias—the reviewers’ choices may reflect personal biases, focus on the best (or worst!) studies, or tend to highlight conflicts rather than resolve them.

In contrast, the systematic review aims to provide a comprehensive representation of all of the relevant literature. In some instances, a systematic review is accompanied by a quantitative synthesis of the results of the identified studies; this statistical treatment is termed a *meta-analysis*.

#### What Is Meta-analysis?

The term meta-analysis originated in 1975 with Gene Glass, whose efforts at research synthesis focused on psychotherapy outcomes. The “meta” does not refer to change (as in *metamorphosis*), but is more in the spirit of transcendence (as in *metaphysics*)—a going beyond analysis, producing a study of studies. When researchers seek to summarize the available literature through a systematic review with a quantitative synthesis, this constitutes a meta-analysis. Therefore, the rigorous search and synthesis protocols of the systematic review are considered part and parcel of the meta-analysis.

The late Ingram Olkin, who made significant contributions to meta-analysis, was known to say that the statistical portion of most meta-analyses was quite straightforward and almost minor—that the real effort was to be thorough and meticulous in the initial and critical tasks of identifying the question, developing protocols, and completing the search and abstraction activities.

#### Why Meta-analysis?

What can the reader reasonably expect to gain from digesting a completed meta-analysis? As mentioned, meta-analysis can assist in answering a specific scientific question by summarizing data from multiple studies. It may be, for example, that multiple small studies are each inconclusive, but when combined via a meta-analytic synthesis, can provide a definitive answer. Similarly, individual studies may not have a sample size large enough to compare results in subgroups based on sex or ethnicity, but this may be possible when the information in these studies is combined via meta-analysis.

Meta-analysis can be helpful when studies disagree, by showing where the weight of the evidence lies, and can help researchers explore the nature of inconsistencies. In fact, a particularly useful feature of meta-analysis is its ability to examine and explore the heterogeneity in study outcomes—it may provide some useful information about the sources of variability among study findings.

Finally, meta-analysis can help identify gaps and problems in the published literature, and provide information for the planning of future, more-definitive, studies.

#### Examining Heterogeneity in Meta-analysis

Evaluation of variability is at the heart of meta-analysis. The very interpretation of the results of a meta-analysis hinges on how heterogeneity among the component studies is conceptualized. Suppose all of the studies are measuring the same thing—say, the difference in functional gain between a particular treatment and placebo—and there is one right answer: a single number corresponding to that true average difference. Studies may then vary in their estimates of that one right answer, but that variation is basically random noise. The combined estimate from the meta-analysis is thus an estimate of that one fixed number.

This conceptualization is referred to as the *fixed effects model*. But what if the mean difference itself is truly different in some studies? What if the impact of treatment really does differ for men and women, for older or younger subjects, or for delivery by one method or another?

In such cases, we would have a *distribution* of mean treatment differences in the component studies, and the combined estimate from the meta-analysis represents the mean difference in treatment effect in that population of studies. The variability associated with that estimate would have two components: the random noise component and the component associated with study-to-study variability.

Study-to-study variation might arise from differences in study protocols, populations, approaches to measurement, and so forth. This conceptualization is called the *random effects model*.

The meta-analytic method used will differ depending on the conceptual framework—which of these two situations is believed to reflect reality? It is possible to formally test whether there is significant study-to-study heterogeneity within the framework of meta-analysis.

If there is strong evidence of heterogeneity among the component studies, it would be reasonable to adopt the random effects approach. However, there are often cogent reasons to suspect heterogeneity among the studies identified from the literature—for example, in surgical studies, different groups of investigators may use varying surgical protocols, or different definitions of what constitutes a good outcome. In such instances, meta-analysts often elect to adopt a random effects model. It is a somewhat more-conservative approach, and will tend to yield wider confidence intervals for the effect estimate. It also protects against using an inappropriate model where heterogeneity goes undetected.
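The formal test of heterogeneity mentioned above can be sketched in a few lines. The effect estimates and standard errors below are invented for illustration (they are not data from the article); the statistic is Cochran's Q, and I² expresses the share of total variability attributable to study-to-study differences.

```python
# Hypothetical effect estimates (mean differences) and standard errors
# abstracted from eight component studies -- invented numbers for illustration.
effects = [12.0, 20.5, -1.5, 0.0, 18.0, 25.0, 9.5, 22.0]
ses = [4.0, 3.0, 5.5, 6.0, 2.5, 4.5, 3.5, 2.0]

weights = [1 / se ** 2 for se in ses]  # inverse-variance weights
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

# Cochran's Q: weighted squared deviations from the pooled estimate
q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
df = len(effects) - 1

# I-squared: percentage of total variation due to between-study heterogeneity
i_squared = max(0.0, (q - df) / q) * 100

print(f"Q = {q:.1f} on {df} df, I^2 = {i_squared:.0f}%")
# Q is referred to a chi-square distribution with k - 1 df; for 7 df the
# 5% critical value is about 14.07, so a Q well above that suggests heterogeneity.
```

With these invented numbers Q far exceeds the critical value, so a random effects model would be the natural choice.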

Other explorations of heterogeneity are also possible with meta-analysis. It is possible to see which studies make the greatest contributions to heterogeneity. Identification of these studies may provide some interesting information, and potentially clues to the source of the heterogeneity.

For example, suppose that in a meta-analysis of dental treatments, one particular study provides the largest contribution to heterogeneity by far. Upon closer inspection, it is discovered that this study included molars and premolars, while the others were confined to only a single tooth type—the first molar. This would suggest that treatment effects may differ with tooth type, representing a potential explanation for the observed heterogeneity.

It is also possible to formally compare subgroups of studies to see whether the effect estimates differ significantly among subgroups of studies defined by surgical protocol, population type, and so forth. Such explorations can provide useful suggestions for further research.

#### Combining and Displaying the Results of Individual Studies

After formulating the research question, the investigator formalizes the hypothesis to be tested and selects the appropriate statistical measure to reflect that hypothesis. The researcher then abstracts the estimate of this quantity from each of the various studies identified through the search protocol.

For example, if the intent is to compare the effect of two treatments, as measured by a quantitative outcome, the mean difference between treatment responses is the effect estimate of interest. For each of the component studies, we would require the difference in sample averages and its associated standard error, the latter serving as a measure of variability. Other types of investigations will use other summary measures.

Regardless of whether the effect estimate of interest is the difference between treatment responses, a measure of risk (such as an odds ratio), or a measure of association (such as a correlation coefficient), these quantities and their associated standard errors must be identified accurately and recorded from the extant literature. They are then combined via meta-analytic techniques to obtain an overall estimate of the effect, as well as an estimate of the variability of this combined estimate, a pooled standard error.

In the absence of heterogeneity, a fixed effects model may be used; when there is evidence of heterogeneity in effect sizes among studies, or if there is reason to suspect heterogeneity, that calls for using the methods associated with the random effects model. However, it may be noted that, in practice, power to test for heterogeneity may be limited, providing further argument for the use of the random effects model. If the data suggest little heterogeneity, the random effects model would be expected to give results not too different from those of the fixed effects model in the majority of cases. The growing emphasis on the random effects approach in the literature may reflect these considerations.
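The two pooling approaches can be sketched as follows, again with invented numbers (not the data behind Figure 1): the fixed effects estimate uses inverse-variance weights, while the random effects version adds the DerSimonian-Laird moment estimate of the between-study variance (tau-squared) to each study's variance before weighting.

```python
import math

# Invented effect estimates and standard errors for eight studies
effects = [12.0, 20.5, -1.5, 0.0, 18.0, 25.0, 9.5, 22.0]
ses = [4.0, 3.0, 5.5, 6.0, 2.5, 4.5, 3.5, 2.0]

# Fixed effects: inverse-variance weighting
w = [1 / se ** 2 for se in ses]
fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)

# DerSimonian-Laird moment estimate of the between-study variance tau^2
q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
df = len(effects) - 1
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - df) / c)

# Random effects: each study's variance gains the tau^2 component
w_re = [1 / (se ** 2 + tau2) for se in ses]
random_eff = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
se_re = 1 / math.sqrt(sum(w_re))  # pooled standard error
ci = (random_eff - 1.96 * se_re, random_eff + 1.96 * se_re)

print(f"fixed = {fixed:.1f}, random = {random_eff:.1f}, "
      f"95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
```

Note how the random effects interval is wider than a fixed effects interval would be, reflecting the extra between-study variance component described above.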

Whichever model is used, the results can be used to obtain confidence intervals, and to formally test hypotheses. The combined results are often presented along with the results from the individual component studies in a graphical display known as a *forest plot*.

Figure 1 provides such a plot. In this example, based on synthetic data, the quantitative outcome is a measure of gain in function after injury. In each of the eight component studies, there are two independent groups of subjects—those randomized to the treatment group and those randomized to receive a placebo. The effect estimate of interest is the mean difference in outcome (gain in function) between treatment and placebo, calculated as (treatment mean minus placebo mean); for each study, its estimate is denoted by a square. A value of zero corresponds to no effect of treatment relative to placebo.

For each of the eight studies in Figure 1, the effect estimate is provided along with a 95% confidence interval, indicated by a horizontal bar. The interpretation of such a confidence interval is that there is 95% confidence that the true value of the mean difference in treatment effect for that study has been captured within the limits of the interval.

We notice that the mean differences for the different studies vary considerably. In this example, the majority of the studies show results favoring treatment, but one favors placebo, however slightly; another shows that, on average, there is no difference between treatment and placebo. In some instances, the 95% confidence interval for a particular study straddles the zero line, implying there is no evidence of a statistically significant difference between treatment and placebo based upon that particular study. The widths of the confidence intervals also vary considerably, reflecting the precision associated with each study.

Larger studies tend to have considerably greater precision and thus narrower confidence intervals, compared to smaller studies.

Finally, the size of the squares denoting the effect estimate (the mean difference in functional gain between treatment and placebo) varies among the studies; the size of the square is inversely proportional to the standard error—studies with greater precision (smaller standard errors) have larger squares and more weight in the meta-analysis. Studies with less precision (greater standard errors) have smaller squares and are weighted less heavily.
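A basic forest plot of this kind can be drawn with standard tools. The sketch below, assuming matplotlib is available and using invented study numbers (not those of Figure 1), draws each study's square and 95% confidence bar, sizing the squares inversely to the standard errors as described above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# Invented per-study mean differences and standard errors
effects = [12.0, 20.5, -1.5, 0.0, 18.0, 25.0, 9.5, 22.0]
ses = [4.0, 3.0, 5.5, 6.0, 2.5, 4.5, 3.5, 2.0]
labels = [f"Study {i + 1}" for i in range(len(effects))]

fig, ax = plt.subplots(figsize=(6, 4))
for i, (eff, se) in enumerate(zip(effects, ses)):
    y = len(effects) - i      # plot the first study at the top
    half = 1.96 * se          # half-width of the 95% confidence interval
    ax.plot([eff - half, eff + half], [y, y], color="black")  # CI bar
    ax.plot(eff, y, marker="s", color="black",
            markersize=10 / se)  # square sized inversely to the standard error
ax.axvline(0, linestyle="--", color="gray")  # line of no treatment effect
ax.set_yticks(range(1, len(effects) + 1))
ax.set_yticklabels(labels[::-1])
ax.set_xlabel("Mean difference (treatment minus placebo)")
```

A call such as `fig.savefig("forest.png")` would then write the figure to disk; a full forest plot would also add the pooled-estimate diamond at the bottom.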

If the formal test of study-to-study variability is performed for the example given in Figure 1, there is strong evidence of heterogeneity among the eight component studies, and the random effects approach is used to estimate the combined effect size and its standard error. At the bottom of the figure, the diamond shape conveys information about the overall estimate of the mean difference in gain in function between treatment and placebo. The diamond is centered at the overall effect estimate, and its breadth reflects the corresponding meta-analytic 95% confidence interval; both are derived from the combined information from all available studies, weighted as just described, based upon the random effects model.

In this instance, we see that patients receiving treatment had greater gain in function than those on placebo—an average of 16.9 points more. This is our estimate, based upon all available data, of the average effect size among this population of studies. Its associated standard error includes components representing both random variation and study-to-study variability. The resulting 95% confidence interval for the combined estimate is 11.9 to 22.0. This interval does *not* include the value zero that corresponds to the case where there is no difference between treatment and placebo.

The meta-analysis can also provide a *P*-value associated with the formal statistical test for treatment effect, again based on the combined data from the eight studies. In this case, *P* < 0.0001, and this very small value provides strong evidence that the treatment is associated with greater gain in function than the placebo. The forest plot has provided insight into the variation among the component studies, as well as a representation of the combined results of the meta-analysis, which strongly supports the superiority of treatment relative to placebo.
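As a quick arithmetic check, the reported interval and *P*-value can be reproduced from one another under a normal approximation: the sketch below backs the pooled standard error out of the published 95% limits (z = 1.96) and recomputes the two-sided *P*-value.

```python
import math

# Reported meta-analytic summary: combined mean difference 16.9 with a
# 95% confidence interval of (11.9, 22.0), P < 0.0001
estimate, lower, upper = 16.9, 11.9, 22.0

se = (upper - lower) / (2 * 1.96)  # back out the pooled standard error
z = estimate / se                  # test statistic for H0: no difference
p = math.erfc(z / math.sqrt(2))    # two-sided P-value, normal approximation

print(f"SE ~ {se:.2f}, z ~ {z:.2f}, P ~ {p:.1e}")
# The recomputed P-value is far below 0.0001, consistent with the report
```

The interval midpoint (16.95) also matches the reported 16.9 up to rounding.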

#### Meta-analysis of Individual Participant-level Data

Another context in which meta-analytic techniques may be applied is when a meta-analysis is conducted at the level of the individual participant data, rather than based upon the aggregate results taken from the component studies. These aggregate results were, of course, derived from the individual participant data, so, in a sense, the meta-analyst is going back to the original source material.

The goal of the meta-analysis remains the same: to answer a particular question by summarizing the available evidence. However, the use of the raw, individual level data in the meta-analysis may permit adjustment for confounding factors, and may increase the power to detect differential treatment effects for subgroups—provided that the necessary information is available in the component studies. It may provide a way to impose consistency in inclusion/exclusion criteria across all studies, and provides the opportunity to assess the validity of model assumptions associated with statistical methods.

Further, meta-analysis at the individual level allows the analyst to use approaches that take account of the correlation among multiple outcomes, such as when subjects are followed longitudinally. (It is crucial in such meta-analyses that the within-study clustering of the data is recognized and retained in the conduct of the analysis; it would be quite inappropriate to simply pool data from all individuals and treat them as if they came from a single study.)
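The warning in parentheses can be made concrete with a toy example (all numbers invented): in each of two studies the treatment improves the outcome by 5 points, yet naively pooling all individuals reverses the apparent direction of effect, because treatment allocation is confounded with study-level baselines.

```python
# Two hypothetical studies with very different baselines and allocation ratios
study1 = {"treatment": [105.0] * 10, "placebo": [100.0] * 90}  # high baseline
study2 = {"treatment": [55.0] * 90, "placebo": [50.0] * 10}    # low baseline

def mean(xs):
    return sum(xs) / len(xs)

# Correct: estimate the treatment effect within each study, then combine
within = [mean(s["treatment"]) - mean(s["placebo"]) for s in (study1, study2)]

# Inappropriate: pool all individuals as if they came from a single study
all_treat = study1["treatment"] + study2["treatment"]
all_placebo = study1["placebo"] + study2["placebo"]
naive = mean(all_treat) - mean(all_placebo)

print(within)  # [5.0, 5.0] -- treatment helps in every study
print(naive)   # -35.0 -- naive pooling suggests treatment is harmful
```

Retaining the within-study clustering preserves the consistent +5 effect; discarding it produces a badly misleading answer.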

#### Addressing Concerns About Bias

The possibility of publication bias is a serious threat to the validity of a meta-analysis. Studies showing no effect of treatment, or no difference between treatments, are less likely to be published, particularly when such “negative” studies have limited sample sizes. Even when such studies do come to publication, it may be in the form of a shorter, less-detailed communication, which may or may not contain sufficient information to permit inclusion in a meta-analysis.

How can one claim to represent the whole of the research evidence that has been produced if a proportion of that evidence never gets to publication? A number of methods, both analytic and graphical (e.g., funnel plots), have been developed to try to identify publication bias. It may, however, be difficult to apply them when the number of publications in the relevant literature is limited.
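One analytic method in this family (named here as a concrete example; the article does not single it out) is Egger's regression test, which regresses each study's standardized effect on its precision; an intercept away from zero flags funnel-plot asymmetry. A minimal sketch with invented data:

```python
def egger_intercept(effects, ses):
    """Least-squares intercept of (effect / SE) regressed on (1 / SE)."""
    xs = [1 / se for se in ses]                   # precision
    ys = [e / se for e, se in zip(effects, ses)]  # standardized effect
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    return ybar - slope * xbar

ses = [0.5, 1.0, 1.5, 2.0, 2.5]

# No small-study effect: every study estimates the same true effect of 0.5
symmetric = [0.5] * len(ses)
# Small-study effect: the less-precise studies report inflated effects
biased = [0.5 + 0.3 * se for se in ses]

print(egger_intercept(symmetric, ses))  # near 0: funnel looks symmetric
print(egger_intercept(biased, ses))     # near 0.3: asymmetry is flagged
```

With only a handful of studies such tests have little power, which is exactly the limitation noted above.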

Meta-analysts also strive to address the potential problem of publication bias by searching for relevant abstracts, scrupulously including articles in any language, and directly contacting authors to try to fill in gaps in the published literature. They may also look for relevant information in the so-called “gray literature”—literature outside the traditional publishing and distribution channels. Examples might include theses and dissertations, technical reports, unpublished studies, or studies published outside broadly available mainstream journals.

It is similarly important to consider the possibility of other types of bias, including bias in the component studies contributing to the meta-analysis. Suboptimal conduct of research may lead to bias—for example, because of differential attrition among treatment groups, biased selection of subjects, inadequate randomization protocols, or failure to mask subjects or evaluators as to which treatment was assigned. The quality of the component studies is critical to the validity of the meta-analysis.

#### The Movement Toward Quality Assessment

In recent years, there has been an increasing emphasis on quality assessment—in terms of both the quality of the meta-analysis itself, and the quality of the component studies. Both have bearing on the validity and usefulness of the meta-analysis. A number of checklists intended to address quality of the meta-analysis have been developed for use by authors, editors, reviewers, decision-makers—and readers.

These checklists are worth consulting by the reader of the scientific literature, since they provide a useful guide to the characteristics possessed by a sound and well-conducted meta-analysis.

Typically, a checklist will include items to help assess the quality and rigor associated with the different phases of the meta-analysis—the search protocol, selection of inclusion criteria, abstraction procedures, analytic approach, and efforts to avoid publication bias. One such checklist is Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA), which has 27 items targeting aspects ranging from the formulation of the research question through the methodology and synthesis of results.

Other checklists are specifically tailored to assist in evaluating either meta-analyses of clinical trials, such as QUOROM (Quality of Reporting of Meta-analyses), or those intended for use with observational studies, such as Meta-analysis Of Observational Studies in Epidemiology (MOOSE).

Equally important is the quality of the component studies. For this reason, increasing scrutiny has been brought to bear on the conduct of these contributing studies. In the case of a clinical trial, appropriate questions to ask include: Was the study randomized? Was the person assessing the clinical outcome blinded to the treatment applied? Were there any conflicts of interest?

One instrument that tries to look at the potential impact of such features in a systematic manner is the Cochrane Collaboration tool for assessing risk of bias in randomized trials. These considerations are paramount, since one can hardly expect a meta-analysis, however carefully and conscientiously conducted, to overcome the deficiencies and biases of the studies that contribute to it.

#### Consensus?

Not everyone is entirely enamored of meta-analysis. Titles in the literature addressing meta-analysis include phrases such as “statistical alchemy” and even “meta-analysis/shmeta-analysis.” Clearly, as discussed, the meta-analyst cannot expect to be a magician, transcending all shortcomings of the studies in the extant literature.

Nevertheless, provided that the potential pitfalls of meta-analysis—such as poorly conducted component studies and the risk of publication bias—are recognized, meta-analysis can be a trustworthy and useful tool. It can provide a quantitative synthesis of the existing literature with a measure of objectivity. In many cases, the additional power of its combined sample sizes will not only build consensus and resolve conflicts in the published literature, but may even make it possible to answer questions that cannot be addressed by any single study, such as inferences related to subgroups, or dose-response effects.

For all of these reasons, meta-analysis has become a fundamental tool in efforts to attain evidence-based clinical practice. Finally, even in relative failure, where consensus is not reached or the painstaking methodology of meta-analysis reveals a less-than-pretty state of the current science, meta-analysis has its contributions to make: It makes us cognizant of the failings of the available evidence, whether through deficiencies in sample size, design or execution. Where evidence is scant and there are gaps in the literature, it provides suggestions for next steps.

#### Further Reading

Borenstein, M., Hedges, L., Higgins, J., and Rothstein, H. 2009. *Introduction to Meta-Analysis*. Chichester: Wiley & Sons, Ltd.

Brusselaers, N. 2015. How to teach the fundamentals of meta-analyses. *Annals of Epidemiology* 25:948–54.

Dawson, D.V., Pihlstrom, B.L., and Blanchette, D.R. 2016. Understanding and evaluating meta-analysis. *Journal of the American Dental Association* 147:264–70.

Sutton, A.J., Abrams, K.R., Jones, D.R., Sheldon, T.A., and Song, F. 2000. *Methods for Meta-Analysis in Medical Research*. Chichester: Wiley & Sons, Ltd.

#### About the Authors

Deborah V. Dawson is a professor of biostatistics in the Iowa Institute for Oral Health Research, and the first recipient of the Morris Bernstein Professorship in Dentistry. She served for 16 years as director of biostatistics and research design at the University of Iowa College of Dentistry. Dawson received her ScM in biostatistics from Johns Hopkins University and her PhD in biostatistics from the University of North Carolina at Chapel Hill. Her research interests focus on dental caries and enamel defects, gene-environment interaction, meta-analysis, and the history of statistics—especially the origins of the discipline of biometry.

Derek R. Blanchette is a statistician in the Division of Biostatistics and Computational Biology at the University of Iowa College of Dentistry. He received his MS in biostatistics from the University of Iowa College of Public Health and is accredited by the American Statistical Association. His interests include statistical computing, data visualization, and meta-analysis.

*An earlier version of this article gave credit to Seymour Glass for the term meta-analysis. This has since been corrected.*
