Ten Pitfalls to Avoid When Evaluating the Primary Literature



Critically evaluating the primary literature and applying the information to patient care is vital to ensuring optimal patient outcomes.  Unfortunately, the foundational knowledge and skills that most of us acquire during our formal education and post-graduate training programs are unlikely to fully prepare us for the challenges and intricacies of interpreting the evolving methods used in clinical drug studies today.  Like any skill, critical appraisal requires practice and refinement over time.  In this TOP TEN list, we reflect on some important concepts that can get overlooked or misinterpreted.  These are not in order of importance.  Indeed, our list is subjective and open to debate.  We would love to hear from others willing to share their experience and wisdom.


Pitfall 1: Confusing “non-significant” with “inconclusive” results

Interpreting the results of a superiority trial requires a basic understanding of statistical inference.  In this type of study design, one assumes that the null hypothesis (i.e. there is no difference between groups) is correct.  It is only when the results of the trial indicate that an observed difference between the groups has a very low probability of being due to chance alone — by convention, p < 0.05 — that the null hypothesis is rejected.  In this situation, we conclude that there is a statistically significant difference between the groups studied, and discussion can then focus on whether the difference is clinically meaningful.  Most trainees and practitioners grasp this concept without much difficulty.


On the other hand, how should we interpret results when the p value is > 0.05?  Many practitioners assume this means there is no difference between groups and may conclude that the two interventions are equivalent.  This is not an accurate interpretation.  A p > 0.05 merely indicates that we do not have sufficient confidence to reject the null hypothesis.  This could be due either to the absence of a true difference between treatments or to a lack of adequate power to detect a meaningful difference.  When a study adequately powered to detect a meaningful difference in a specific endpoint yields a p > 0.05, the result is correctly interpreted as “non-significant.”  An underpowered study, in contrast, may simply fail to detect a meaningful difference between groups; in that case, a p > 0.05 should be considered “inconclusive,” because the lack of power may have led to a type II error.


To illustrate this concept, one only needs to look at recent controversies over the appropriate targets for systolic blood pressure in patients with diabetes.  In the Action to Control Cardiovascular Risk in Diabetes (ACCORD) Blood Pressure trial,1 patients with type 2 diabetes and increased cardiovascular risk were randomized to an intensive therapy (goal SBP<120mmHg) or to standard therapy (goal SBP<140mmHg).  The sample size of 4200 participants was designed to give the study a power of 94% to detect a difference in the primary endpoint of first occurrence of a major cardiovascular event.  Unfortunately, this was based on the assumption that the incidence of the primary endpoint would occur at a rate of 4% per year in the control group, which turned out to be grossly overestimated compared to the actual event rate of only 2.09%.  Despite a 12% reduction in the risk of adverse cardiovascular events with intensive treatment, this did not result in a statistically significant difference between the two interventions (p=0.20).  However, the 95% confidence interval suggests that there may be as much as a 27% reduction in cardiovascular events with intensive management.  Moreover, the pre-specified secondary endpoint of stroke was significantly reduced by 41%.  These findings indicate that the results of the ACCORD-BP trial are inconclusive due to lack of sufficient power, and that there may be a clinically meaningful benefit from intensive blood pressure management in patients with type 2 diabetes.  This interpretation is supported by the results of the more recently published Systolic Blood Pressure Intervention Trial (SPRINT)2, which demonstrated significant benefits with intensive blood pressure management in non-diabetic patients at high cardiovascular risk. 
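The effect of an overestimated control event rate on power can be sketched numerically.  The following is a rough illustration (not the ACCORD investigators' actual event-driven calculation) using the standard normal approximation for comparing two proportions; the sample size and cumulative event rates here are illustrative assumptions only:

```python
from statistics import NormalDist

nd = NormalDist()

def power_two_proportions(n_per_arm, p_control, rel_risk_reduction, alpha=0.05):
    """Approximate power of a two-arm superiority trial comparing event
    proportions, via the normal approximation to the difference in proportions."""
    p_treat = p_control * (1 - rel_risk_reduction)
    diff = p_control - p_treat
    se = ((p_control * (1 - p_control) + p_treat * (1 - p_treat)) / n_per_arm) ** 0.5
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # two-sided critical value (~1.96)
    return nd.cdf(diff / se - z_alpha)

# Hypothetical cumulative event rates: the power to detect a 12% relative
# risk reduction falls sharply when the control-group rate turns out to be
# half of what the sample-size calculation assumed.
planned = power_two_proportions(2100, 0.20, 0.12)
actual = power_two_proportions(2100, 0.10, 0.12)
```

Because fewer events inflate the relative standard error, `actual` comes out well below `planned`.  The general lesson: check the event-rate assumptions in the power calculation against what was actually observed in the trial.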


Pitfall 2: Overlooking the effect of alpha spending functions on Type 1 error rates

In superiority trials, type I errors occur when investigators incorrectly conclude that there is a difference between an intervention and a control when, in fact, there is no difference.  Because most clinical drug trials, by convention, use an alpha level (significance threshold) of 0.05, the risk of committing a type I error in any single comparison is 5%.  What often gets overlooked is that the overall type I error rate for a study depends on the number of endpoints being examined and the number of times the investigators analyze the data for each endpoint.  A study with 20 different endpoints that uses an alpha level of 0.05 for each would be expected to have at least one statistically significant result by chance alone (with 20 independent tests, the probability of at least one false positive is roughly 64%).  Similarly, analyzing the data for one endpoint at different times over the course of a trial (for example, during interim analyses) also increases the chance of committing a type I error.  This concept is known as the alpha spending function.  In simple terms, investigators have a total error rate of 5% (or 0.05) to “spend” any way they want over the course of a trial.  However, every time they analyze the data for an endpoint, their alpha level must be decreased by a specified amount to account for the number of analyses performed.  Several statistical methods are available to make this adjustment and keep the overall type I error rate at 5%.  Studies that omit these adjustments run an increased risk of erroneously concluding that a finding is statistically significant when in fact there is no true difference.
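The arithmetic behind multiplicity is simple to sketch.  Assuming independent tests (a simplification, since trial endpoints are usually correlated), the family-wise error rate and the simplest possible correction (Bonferroni) look like this:

```python
def family_wise_error(n_tests, alpha=0.05):
    """Probability of at least one false positive across n independent tests
    when every null hypothesis is true and each test uses the same alpha."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(n_tests, overall_alpha=0.05):
    """Simplest multiplicity correction: divide alpha across the analyses."""
    return overall_alpha / n_tests

# Twenty endpoints tested at alpha = 0.05 each: roughly a 64% chance of at
# least one spurious "significant" result.
uncorrected = family_wise_error(20)                      # ~0.64
corrected = family_wise_error(20, bonferroni_alpha(20))  # ~0.049
```

In practice, trials use more refined alpha spending approaches (e.g., O'Brien-Fleming-type boundaries) that spend very little alpha at early interim analyses; Bonferroni is shown here only as the simplest illustration.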


The Myocardial Ischemia Reduction with Aggressive Cholesterol Lowering (MIRACL) study3 is one example where the issue of alpha spending functions and potential type I errors created considerable controversy.  The MIRACL study sought to determine whether early initiation of a statin would reduce the risk of cardiovascular events in patients after an acute coronary syndrome (ACS).  The design of the trial included three planned interim analyses, and the final alpha level was set to 0.049 in order to keep the overall type I error rate at 5%.  The final results indicated that atorvastatin 80mg/day was associated with a 16% reduction in the risk of the primary endpoint compared to placebo (p=0.048, 95% CI 0.70-1.00).  This finding was arguably borderline from a statistical standpoint, since the confidence interval did not exclude the possibility of no difference between the groups.  The difficulty in interpreting the data was compounded by conflicting reports over whether an additional unplanned interim analysis had been performed.  If true, this would have required an even lower alpha level to accommodate the alpha spending function, and the result would have failed to reach statistical significance.  Although there is little doubt that high-intensity statins are beneficial in patients who have experienced an ACS, the need to initiate the agent within the first 24-96 hours based on the results of the MIRACL trial is still up for debate.


Pitfall 3: Assuming that active controls are valid comparators

When deciding on the potential superiority or non-inferiority of a new agent compared to the currently available treatment, it is typically assumed that the comparator (active control) is the standard of care at the time the study was conducted.  The “standard of care” is not only the intervention (e.g. drug) that is widely accepted as the “best” treatment; it must also be given at the recommended intensity (e.g. dose), and best practices for patient management must be followed.  When this basic premise is violated, the relative efficacy of the new agent may be difficult to determine.  Nowhere is this more evident than in the phase 3 clinical trials that investigated the efficacy of the direct oral anticoagulants (DOACs) for atrial fibrillation.  Each of these agents was compared to an active control consisting of warfarin titrated to an INR of 2.0-3.0.  Overall, the DOACs demonstrated either non-inferiority or superiority to warfarin with regard to the risk of stroke or systemic embolism.  However, subsequent analyses of time in therapeutic range (TTR) for patients receiving warfarin in these trials raised concerns regarding the validity of the comparator group.  In the RELY4 trial, high-dose dabigatran demonstrated superiority over warfarin for the primary endpoint.  Although the overall TTR in the control group was 64%, there was significant variability between study sites.  When quartile TTR data were analyzed, it was found that the benefits of dabigatran became non-significant as TTR improved, leading to questions about the advantage of dabigatran compared to well-controlled warfarin.5  Similarly, in the ROCKET-AF6 trial, rivaroxaban was found to be non-inferior to warfarin despite an overall TTR of only 55% in the control group.  Strictly speaking, this shows only that rivaroxaban is no worse than poorly controlled warfarin; how it compares to well-controlled warfarin remains uncertain.
These examples serve to remind us that the active control group must be considered when assessing the relative efficacy of the intervention.


Pitfall 4:  Relying on surrogate markers as substitutes for POEMs

In an ideal world, the determination of safety and efficacy in all clinical drug trials would be based on the use of endpoints that are clinically important to patients.  These types of measures, which have been described as patient-oriented endpoints that matter (POEMs), include reduced mortality, improved functioning, and better quality of life.  However, studies that include POEMs as primary endpoints require significant investments in time and resources.  In other words, these trials are often very expensive to conduct.  As an alternative, investigators often use surrogate outcomes — biomarkers or other physical findings that correlate with the development of the condition of interest.  Surrogate markers include A1c levels for diabetes complications, cholesterol levels for cardiovascular disease, and spirometric findings for pulmonary disease exacerbations.  Surrogate outcomes may be useful as substitutes for POEMs when conducting early drug studies, but they can be misleading.  One classic example is the Cardiac Arrhythmia Suppression Trial (CAST)7.  Prior to the publication of the CAST results in 1991, antiarrhythmic agents such as encainide and flecainide were widely used because they could suppress premature ventricular contractions (PVCs), which commonly occur following a myocardial infarction and were observed to be associated with an increased risk of sudden cardiac death.  These class Ic antiarrhythmic agents suppressed PVCs by more than 80%, so it was assumed they saved lives.  It was not until the CAST study was terminated early, due to a nearly three-fold excess in mortality among patients who received a class Ic antiarrhythmic, that this faulty assumption was recognized.  When evaluating studies that use surrogate markers, practitioners should consider whether the surrogate has been well validated.  Even then, a surrogate may not correlate with the desired POEM.


Pitfall 5: Applying the results of observational and non-randomized studies as high-level evidence

Although randomized controlled trials (RCTs) are generally considered the “gold standard” for determining the efficacy and safety of drugs, they do have limitations that may lead investigators to use alternative study designs.  For example, cohort and case-control studies enable the examination of outcomes in large numbers of patients over long periods of time.  These study designs can help identify drug safety issues that may occur too rarely in RCTs to support any meaningful conclusions.  Despite their widespread use in these situations, evidence suggests that the results from these types of studies are often contradicted by well-designed RCTs.  One classic example of this phenomenon involves the use of hormone replacement therapy in post-menopausal women.  The Nurses’ Health Study8 was a prospective cohort study that investigated the use of hormone therapy in over 120,000 patients over 10 years.  The results indicated that current users of post-menopausal hormone therapy had half the risk of major coronary disease compared with patients who had never used estrogen.  Coupled with the biological plausibility of reduced cardiovascular outcomes from hormonal therapy and the beneficial effects observed on surrogate outcomes in these patients, hormone replacement therapy was widely used to reduce cardiovascular events in post-menopausal women in the 1990s.  It wasn’t until the publication of the Women’s Health Initiative9, a well-designed RCT comparing hormonal therapy to placebo, that this practice was widely discredited.  Many other examples of contradicted results have been published, and practitioners should be aware of this phenomenon when interpreting the results of studies with observational or non-randomized designs.10


Pitfall 6: Interpreting the results of sub-group and post-hoc analyses as anything other than “hypothesis generating”

In a well-designed RCT, the study is powered to detect a meaningful difference in the occurrence of a primary endpoint (or a composite endpoint) between two or more treatment groups.  Secondary endpoints, including subgroup and post-hoc analyses, may be useful for identifying interesting trends in the data, but caution should be exercised when interpreting these secondary findings due to a lack of power or violation of the basic assumptions underlying statistical inference.  An example of this can be found in the Prospective Randomized Amlodipine Survival Evaluation (PRAISE)11 trial, which compared amlodipine to placebo in patients with chronic heart failure.


The results of the PRAISE trial indicated no differences between amlodipine and placebo for the composite primary outcome in patients with severe heart failure.  However, the investigators noted a significant reduction in the primary composite endpoint and in death for patients who were deemed to have non-ischemic heart failure.  This finding prompted the PRAISE-2 study, which attempted to determine whether patients with heart failure but without cardiac ischemia would benefit from amlodipine.12  The PRAISE-2 study found no differences in outcomes between amlodipine and placebo.  This example reminds us that significant findings in subgroups of patients may not be validated in subsequent trials.


Conversely, when significant differences are not found in subgroup analyses, that does not inherently mean that no differences exist.  In the landmark 4S trial, in a predefined subgroup analysis, the risk of death in women who took simvastatin was not significantly different from the risk in women who took placebo (RR 1.12; 95% CI 0.65-1.93).  However, less than 20% of the patients in this study were female, and only 57 women experienced this primary outcome.  Thus, the trial was not adequately powered to detect a difference in the primary outcome of death in this subpopulation (i.e. women).13  Subsequent statin trials14 have continued to underrepresent females.  This has led to some concerns about the benefits of statin use in women.  However, the weight of evidence suggests (but doesn’t conclusively prove) that statins are effective for cardiovascular event reduction in both sexes.


Pitfall 7: Ignoring the possible “regression to the mean” when a study is stopped early

Contemporary studies involving large numbers of patients often incorporate interim analyses at specified points over the course of the trial.  The purpose of these interim analyses is usually to provide information to the Data and Safety Monitoring Board (DSMB) that can be used to determine whether the trial should be terminated early due to overwhelming evidence of efficacy, safety concerns, or futility.  Although the decision to stop a study early is usually based on pre-defined stopping rules, there is always the possibility that continuing the trial could change the results.  One well-described reason why the results might change over time is the phenomenon known as “regression to the mean”.  This refers to the observation that early differences in outcomes between groups in a clinical trial often diminish as the trial continues and more data points are collected.


The recently published LIGHT trial15 illustrates this point.  This was a non-inferiority trial designed to compare the combination of naltrexone-bupropion with placebo with respect to the risk of major adverse cardiovascular events (MACE).  At the first interim analysis, when approximately 25% of the planned events were expected to have occurred, the results indicated that naltrexone-bupropion was associated with a statistically significant 41% reduction in MACE.  Despite what appears to be an overwhelming benefit, the result did not exceed the boundary required for early termination of the trial.  Unfortunately, the manufacturer prematurely released the results to the public by filing a patent application for the product.  This potentially introduced bias into the conduct of the trial, so the investigators decided to terminate the study and report the results for all patients who had been enrolled up to that point.  In the final analysis, which included approximately 64% of the planned events, the risk of MACE with naltrexone-bupropion was reduced by a mere 5% compared to placebo — a result that was not statistically significant.  Had this trial been stopped after the initial interim analysis, prior to the “regression to the mean”, many practitioners might have routinely recommended this combination therapy to their obese patients based on an erroneous belief that it would reduce cardiovascular event rates.  In addition to the potential for overestimating treatment benefits, other issues that may arise from early termination of a trial include underestimation of adverse event rates, particularly for effects that take longer periods of time to emerge, as well as permanent forfeiture of important data on subgroups.  Practitioners should carefully evaluate whether clear stopping rules were followed when interpreting the results of a trial that was terminated early.


Pitfall 8: Forgetting that clinically important results are in the eye of the beholder

In 2015, the results of the IMPROVE-IT trial16, a 6-year study with over 18,000 patients, were published, revealing a statistically significant reduction in a composite cardiovascular outcome (including cardiovascular mortality, major cardiovascular events, and nonfatal stroke) with simvastatin plus ezetimibe compared to simvastatin alone.  With an NNT of 50 (corresponding to a 2% absolute risk reduction), the face-value results suggest this combination improves patient outcomes.  However, a closer evaluation of the results shows some glaring limitations.


To begin, the number of enrolled subjects was massive and the treatment duration was considerably longer than in most studies.  Thus, this study had considerable power to detect relatively small differences between the treatments.  In our opinion, a treatment that reduces cardiovascular event rates by a mere 2% (ARR) over 6 years is unimpressive.  Furthermore, for the primary outcome, the HR was 0.94 with a 95% confidence interval of 0.89 to 0.99.  Inspection of the individual components of the primary outcome reveals no differences in death from cardiovascular causes or stroke, and the difference in combined major coronary events is not even reported.
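The NNT arithmetic above is a one-line calculation.  The event rates below are illustrative round numbers chosen to reproduce the 2% ARR quoted in the text, not figures taken from the published report:

```python
def absolute_risk_reduction(rate_control, rate_treatment):
    """ARR: the difference in event proportions between the two arms."""
    return rate_control - rate_treatment

def number_needed_to_treat(arr):
    """NNT: patients treated (for the trial's duration) to prevent one event."""
    return 1.0 / arr

# A 2% ARR corresponds to an NNT of 50 over the trial's follow-up period.
arr = absolute_risk_reduction(0.347, 0.327)  # illustrative rates yielding 2%
nnt = number_needed_to_treat(arr)            # ~50
```

Note that an NNT is only meaningful alongside the trial's duration and the baseline risk of the population studied; an NNT of 50 over 6 years is a very different proposition from an NNT of 50 over 1 year.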


While the study indicates that ezetimibe and simvastatin may be a benefit over simvastatin alone, critical analysis of the results leads us to conclude that the difference is not clinically meaningful.  But, hey, everyone is entitled to an opinion!  Bottom line – whether the results of a “positive” clinical trial are clinically meaningful to you (and your patients) will depend on your perspective, values, and available alternatives.


Pitfall 9: Failing to examine the disposition of patients in a clinical trial

During the course of any large trial, investigators will inevitably be faced with situations that affect how the data are handled for individual patients.  There may be subjects who violate the study protocol.  There may be crossovers between the different treatment groups.  Some patients will be lost to follow-up.  The manner in which the investigators choose to handle these issues in the analysis may have an impact on the study’s conclusions.  The most common approach to handling patient dropouts involves choosing between an intention-to-treat (ITT) and a per-protocol (PP) analysis.  An ITT analysis, in which the results are determined based on the patient’s original assignment group, is usually preferred for superiority studies because it provides a more conservative and real-world approximation of the efficacy of a drug when used in actual clinical practice.  The PP analysis, on the other hand, includes data only from those patients who took the medication in the protocol-specified manner and for the protocol-specified duration.  A PP analysis provides a better estimate of the true efficacy of an intervention under ideal conditions, but it tends to overestimate benefits.


To illustrate how an ITT and PP analysis might arrive at very different conclusions, let’s look at a recent study that evaluated the effects of early introduction of allergenic foods in breastfed infants.  The Enquiring About Tolerance (EAT) study17 investigated the development of food allergies in 3-month-old, breastfed infants who were randomized to receive early introduction of allergenic foods or standard care, including continued breastfeeding to the age of six months.  At the conclusion of the trial, the ITT analysis demonstrated no difference in the primary outcome between the groups (5.6% in the early-introduction group vs. 7.1% in the standard-care group, p=0.23).  In the PP analysis, which included only those participants who were adherent to the regimen, a statistically significant reduction in the risk of food allergies (2.4% in the early-introduction group vs. 7.3% in the standard-care group, p=0.01) was demonstrated.  One might be tempted to conclude from these results that the early introduction of allergenic foods is beneficial in reducing the risk of food allergies.  However, as the authors of the study correctly pointed out, there are several potential flaws with that assumption.  Chief among these is the possibility of reverse causality, in which patients who developed early food allergies would have been less adherent to the assigned regimen.  This could have led to more favorable-appearing results for the early-introduction group in the PP analysis because of post-hoc selection bias.
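Reverse causality of this kind can be made concrete with a toy simulation (entirely hypothetical numbers, not the EAT data): even when the treatment has no true effect, letting adherence depend on the outcome produces a flattering per-protocol estimate.

```python
import random

def simulate_reverse_causality(n=20000, base_risk=0.07, seed=0):
    """Toy model: allergy risk is identical in both arms (no true effect),
    but infants destined to develop an allergy are less likely to remain
    adherent in the early-introduction arm.  Returns the (ITT, per-protocol)
    event rates for that arm."""
    rng = random.Random(seed)
    itt_events = n_treat = 0
    pp_events = pp_n = 0
    for i in range(n):
        treated = (i % 2 == 0)              # 1:1 randomization
        allergy = rng.random() < base_risk  # same true risk in both arms
        if not treated:
            continue
        n_treat += 1
        itt_events += allergy
        # Reverse causality: allergic infants tend to abandon the regimen.
        adherent = rng.random() < (0.3 if allergy else 0.9)
        if adherent:
            pp_n += 1
            pp_events += allergy
    return itt_events / n_treat, pp_events / pp_n

itt_rate, pp_rate = simulate_reverse_causality()
```

With these assumed adherence probabilities, the ITT rate stays near the true 7% risk while the per-protocol rate drops well below it, despite a treatment effect of exactly zero.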


Pitfall 10: Disregarding the possibility of publication bias, data manipulation, and statistical errors

When evaluating studies that are published in peer-reviewed journals, the tendency is to accept without question that the presented information is an accurate, complete, and honest summary of the data.  Unfortunately, multiple investigations cast doubt on this assumption.   A growing body of evidence suggests that errors, both intentional and unintentional, are common and widespread in the biomedical literature.  For example, a recent analysis18 compared the results for studies published in high-impact journals with the results from those same studies posted to ClinicalTrials.gov.  There was a discordance between the results for the primary endpoint in 16% of the reviewed trials, and the differences would have led to changes in the trial interpretation in 7% of the trials.  Similarly, a review of the psychology literature19 found that 18% of the statistical results were reported inaccurately, and that 15% of the 281 reviewed articles contained at least one erroneous conclusion based on an incorrectly calculated statistical test.  Obviously, this makes it difficult for practitioners to draw firm conclusions about the efficacy of an intervention.


Publication bias is another problem that has the potential to alter a practitioner’s view of the effectiveness of a drug or even an entire class of medications through selective reporting of positive results across multiple studies.  One landmark study20 examined the findings of published trials for the 12 antidepressant agents approved in the US from 1987 to 2004.  The investigators collected data on all submissions to the FDA through the agency’s clinical trials registry, and then compared that information to the results reported in the published literature on those trials.  There was strong evidence of bias towards the publication of positive results.  Of 74 registered trials, only 38 (51%) had been considered positive by the FDA, while the remaining 36 (49%) were considered either negative or inconclusive.  The investigators found that 97% of the positive trials were published, while 92% of the negative or inconclusive trials were either not published or were published in a manner that suggested a positive outcome.  Overall, studies deemed positive by the FDA in this analysis were 12 times more likely to have been published than studies that were either negative or inconclusive.  Thus, publication bias may lead to exaggerated beliefs about the effectiveness of individual drugs and entire drug classes.  Calls for greater transparency in all phases of the clinical trial process have been widespread and loud.


Comment from Lucas G Hill:

Thank you for this incredibly detailed and practical review. As I listened to the podcast, I found the discussion of power to be particularly thought-provoking. Students are generally taught that power should be calculated a priori, and that the metric for meeting power is sample size. That is clearly false, but I don't know what else to do as a clinician and educator. Should I search for the equation and calculate power post-hoc? If so, why isn't post-hoc power calculation a standard reporting expectation in RCTs? It would seem to be the most accurate representation of the study's actual ability to rule out a significant difference.

Dr. Hill: Those are all great questions!  The issue of post-hoc power is a topic of controversy that has been hotly debated in the literature.  Although it seems reasonable and is often done, this practice has been discredited by many (see Pharmacotherapy 2001;21:405-9).  In fact, it has been shown that post-hoc power is determined entirely by the p value from the reported results and is completely independent of the study’s methods.  This can be difficult to address, but we recommend the following steps: (1) review the underlying assumptions of the power calculation (sample size and expected treatment effects); (2) if no difference was found but the assumptions were not met, consider the results inconclusive; (3) review the 95% CI to see if the margins may still include a clinically meaningful benefit despite the lack of statistical significance; (4) apply the potential benefit to the absolute risk of your unique patient population; (5) individualize the results.  I'd love to hear how others approach this issue!
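To make the point about the p value concrete: for a simple two-sided z-test, post-hoc (“observed”) power can be computed from the p value alone, with no other study detail entering the calculation.  A minimal sketch:

```python
from statistics import NormalDist

nd = NormalDist()

def observed_power(p_value, alpha=0.05):
    """Post-hoc 'observed' power for a two-sided z-test: treats the observed
    effect as if it were the true effect, so only the p value matters."""
    z_obs = nd.inv_cdf(1 - p_value / 2)  # z-score implied by the p value
    z_crit = nd.inv_cdf(1 - alpha / 2)   # significance threshold (~1.96)
    return nd.cdf(z_obs - z_crit)

# A result sitting exactly at p = 0.05 always yields observed power of 50%,
# regardless of sample size, event rates, or study design.
half = observed_power(0.05)
```

This is why a post-hoc power calculation adds no information beyond the p value itself: a "non-significant" result will always appear "underpowered" by this metric.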
Comment from Dan Gillis:

Great response. I find that one of the biggest gaps in student (and sometimes peer) comprehension of drug lit review principles is the difference between the "power" level and outcome assumptions set for a priori sample size determination and the actual power, or lack thereof, of the study at the end of the day based on its actual results. I have read explanations from statisticians (that were admittedly somewhat beyond my comprehension) on why post-hoc calculations of power are not statistically appropriate. I teach that one should compare the actual outcomes to the estimates used for sample size determination to get a qualitative idea of whether the study was potentially underpowered. I also tell students that if you list "met power" as a strength of the study merely because they accrued the target number of patients then that just tells me that you don't understand the concept of statistical power.

A lack of a difference between two treatment arms is not necessarily due to a lack of power -- it could reflect an actual lack of difference.  In the example of the ACCORD trial used to explain pitfall #1, the primary outcome was not statistically different between the two groups.  There are several explanations: 1) there is no difference; 2) the study was underpowered because of the relatively few events; or 3) there is too much variance in the data (due to the low number of events).  One can't simply look at the confidence intervals and determine the study is underpowered.  While there could have been a 27% decrease in the primary outcome, there also could have been a 6% increase.  That's a very large span of possible truth, and the answer, as the writers point out, is that the results are inconclusive.
Comment from Lauren M Caldas:

Thank you for this great commentary. I enjoyed the actual examples of studies for these common "pitfalls". I plan to have my APPE students review your commentary before we have journal clubs.