In observational, non-randomized settings such as non-interventional trials and registries, patients are channeled into therapies as a consequence of their baseline profile and through bedside discussions between the physician and the patient or the patient's caregivers. As a consequence, there are selection biases associated with the chosen therapies. Despite this, there is strong interest in such data because they reflect practice in conventional community clinics. Randomized trials, by contrast, rely on a large number of inclusion criteria and on academic clinical settings, so the inferences drawn from them may not readily extend to the typical patient. Further, in open-label randomized contexts both patients and physicians know the intervention and are acutely aware that one therapy is being tested against another. Subjective biases about relative merit will likely influence results. In non-interventional studies, registries, chart reviews and EMR (electronic medical record) data, both retrospective and prospective, there is usually no awareness of what groups might eventually be compared. With adequate adjustment for differences in baseline profiles between groups, these data sources can provide inferences that are just as reliable, if not more so.

This note looks at some issues encountered in reporting differences between treatment groups in observational contexts. Two features of interest in such analyses are adjustment for covariates and concerns with reporting unplanned multiple p-values. Covariate adjustment attempts to remove the effect of selection biases when evaluating group differences.

## First, the covariate adjustment method

A covariate is a factor, such as age or gender, which can differ across the treatment or other groups we hope to compare. Analysis of covariance, as it is usually called, has been around almost as long as statistics. The usual problems with this method are the wrong choice of covariates, the omission of a covariate, and the absence of a clear separation between the initial analyses that identify covariates and the eventual analysis of the differences between groups.

To see how covariate adjustment works, consider a comparison between two doublet (combination of two therapies) inductions in a non-randomized setting subject to selection biases. One regimen contains a harsh chemotherapy usually not recommended for the elderly, and the other contains a milder immunotherapy appropriate for all ages.

Let’s say we are looking at three-year %survival. An adjusted analysis using age would first estimate the decrease in %survival with increasing age. We might estimate from our data a drop of 8 percentage points in three-year %survival for every 10-year increase in age.

Then, if the chemotherapy group is on average 10 years younger than the group on the immune regimen, we would reduce the unadjusted %survival in the chemotherapy group by 8 percentage points to make it comparable to the immunotherapy group. For instance, if the unadjusted three-year %survival was 60% in the chemotherapy group versus 50% in the immunotherapy arm, then the adjusted analysis will support a comparison of 52% versus 50%, a difference which may not be statistically significant. Had we not used age as a covariate, we would have been very wrong and would have concluded in favor of the chemotherapy.
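The arithmetic of this example can be written out as a minimal sketch. The function name and its parameters are illustrative assumptions, not part of any library, and the sketch assumes the simple linear age effect described above; a real analysis would estimate the slope and the adjusted difference together in a model.

```python
def age_adjusted_survival(unadjusted_pct, years_younger, drop_per_decade):
    """Shift a group's %survival to the mean age of the comparison group.

    unadjusted_pct  : observed three-year %survival in this group
    years_younger   : how many years younger this group is, on average
    drop_per_decade : estimated drop (percentage points) per 10 years of age
    All names are hypothetical and chosen only for this illustration.
    """
    return unadjusted_pct - drop_per_decade * (years_younger / 10.0)


# Numbers taken from the example in the text.
chemo_adjusted = age_adjusted_survival(60.0, years_younger=10.0, drop_per_decade=8.0)
immuno = 50.0

print("Adjusted chemotherapy %survival:", chemo_adjusted)
print("Adjusted difference (points):", chemo_adjusted - immuno)
```

The unadjusted gap of 10 points shrinks to 2 points once the age difference is accounted for.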

Covariates chosen to remove selection biases should (A) differ across the groups being compared AND (B) have a plausible effect (based on prior data or on some other reasonable scientific basis) on the outcomes being studied. This choice of covariates should be made without using any outcome data. In the example above, we would first check whether the chemotherapy and immunotherapy groups have significantly different age distributions. Only after the covariate set has been determined should the covariates be used in a separate model evaluating the effect of treatment group on the survival outcome. Such a firewall between the analysis to choose covariates and the eventual inferences about the bias-adjusted group differences avoids the criticism that the analyst may have gamed the inferences through the choice of covariates.
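A minimal sketch of such a screening step follows. It is an illustration under stated assumptions rather than a recommended procedure: the data are made up, the function names are hypothetical, and condition (A) is checked with a standardized mean difference above an assumed cutoff of 0.1 instead of the significance test mentioned above. Note that the function never receives outcome data, which is the firewall in code form.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(a) + variance(b)) / 2.0)
    return (mean(a) - mean(b)) / pooled_sd


def choose_covariates(baseline, group, plausible, threshold=0.1):
    """Keep covariates that (A) differ across groups and (B) are plausible.

    baseline  : dict of covariate name -> list of baseline values
    group     : list of 0/1 treatment indicators, same order as the values
    plausible : names the clinical team considers related to outcome
    threshold : assumed cutoff on |SMD| for calling a covariate imbalanced
    No outcome data are passed in, by design.
    """
    chosen = []
    for name, values in baseline.items():
        g0 = [v for v, g in zip(values, group) if g == 0]
        g1 = [v for v, g in zip(values, group) if g == 1]
        imbalanced = abs(standardized_mean_difference(g0, g1)) > threshold
        if imbalanced and name in plausible:
            chosen.append(name)
    return chosen


# Hypothetical data: 0 = chemotherapy, 1 = immunotherapy.
group = [0, 0, 0, 0, 1, 1, 1, 1]
baseline = {
    "age": [52, 58, 61, 55, 66, 71, 63, 70],        # imbalanced and plausible
    "weight_kg": [70, 82, 75, 78, 76, 71, 81, 77],  # balanced across groups
    "record_id": [1, 2, 3, 4, 5, 6, 7, 8],          # imbalanced, not plausible
}
plausible = {"age", "weight_kg"}

chosen = choose_covariates(baseline, group, plausible)
print(chosen)
```

Only the covariates returned by this step would then enter the separate outcome model.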

Another method of adjusting for selection biases is the use of propensity scores, which collapse all the variables in (A) into one score that is then used in the analysis. As with covariates, a firewall is advocated between obtaining the propensity score and its eventual use in comparing the propensity-balanced groups. Some analysts may prefer a propensity-based approach to one using separate covariates. In the pharmaceutical context, one may often do a propensity score based analysis only as a sensitivity analysis, because there is interest in the effects of the individual covariates (especially among payors) even though they constitute the black box that produces the primary adjusted inferences. Both approaches may still be subject to a degree of unknown bias due to unmeasured covariates, and care must be exercised at data collection to capture all pertinent data.

To be thorough in removing selection biases from our groups, one may test as many as 40 variables to see if they differ across groups. Perhaps this is overkill, and a clinical team could help us with (B) above and limit the statistical testing. This is a good segue to the topic of conducting repeated statistical comparisons and reporting p-values.

## Reporting unplanned multiple p-values

The tests that identify baseline differences through significant p-values, mentioned above, are conducted so that we can obtain our core analysis results after removing the effects of these differences. Similar p-value calculations arise in other contexts, or simply out of curiosity. These analyses are typically just the little nuts and bolts in the larger machinery supporting our core planned and hypothesized analysis, and they should be de-emphasized. Given the large number of tests, some of the significant p-values are likely to reflect random variation in large data sets and no real effect. If we conduct 20 independent statistical tests at the 5% level and no real effects exist, the chance of at least one false positive is about 64%, not the nominal 5%. These results are likely non-replicable. You took 20 shots at a dart board, made one or two bullseyes and bragged about it. A targeted, pre-planned and hypothesized result is very credible, and we are likely to see replications of such results in the future. These are like taking one or two shots at the dart board and consistently making the bullseye.
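The inflation from repeated testing can be checked with a short calculation. This is a sketch under two assumptions that the argument relies on: the tests are independent, and none of the tested effects is real. The function names are illustrative.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive among n independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(n_tests, alpha=0.05):
    """Average number of false positives among n null tests."""
    return n_tests * alpha


# 1 is a single planned test; 20 and 40 are the counts mentioned in the text.
for n in (1, 20, 40):
    rate = familywise_error_rate(n)
    expected = expected_false_positives(n)
    print(f"{n:>2} tests: P(at least one false positive) = {rate:.2f}, "
          f"expected false positives = {expected:.1f}")
```

With 20 tests, one false positive is expected on average, which is the "one or two bullseyes" of the dart board analogy.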

Another context in which p-values are reported uses an identical statistical model for the outcome. This is predictive multivariate modelling, where treatment group and the covariates have equal status in the model and one would call all of them ‘predictors’, whereas in the covariate-adjusted analyses the covariates were accessories subordinate to treatment group. Multivariate modelling looks at the predictive ability of each variable in the presence of the other predictors in the model. P-values are often used in this context to gauge the utility of a variable in predicting outcome, and in initial univariate (relating outcome to one predictor at a time) screening analyses. As in the covariate selection stage of a covariate-adjusted analysis, one must be somewhat wary of the reported p-values. As noted earlier, the p-value for the targeted, pre-planned hypothesis about differences between treatment groups in the covariate-adjusted analysis is credible, and the inference based on it is not likely to be spurious.

The excerpt below reports on the ban of p-values in a psychology journal. A lot of the disdain for p-values likely comes from erratic signals, which can be neither interpreted nor replicated, arising out of the overuse of statistical testing. This is the “bathwater”. A hit on a well-defined, pre-specified inference, presented with estimates and confidence intervals indicative of meaningful effects, is the “baby” we do not want to lose.