Inverted Simulations

Ecological Fallacies: Using Aggregate Inferences for Individual Choice

This page has an illustrative calculator examining highly significant statistical results from planned studies. Such studies have sample sizes computed to detect clinically meaningful, non-trivial differences between groups with adequate power. The default graphic and calculator show the separation between the simulated distributions, both in the aggregate and for individuals, when you see 5 zeroes in the p-value – a one-in-a-million chance that the null hypothesis of no difference between groups is supported rather than the alternative. Further, the aggregate differences between groups corresponding to these p-values would be deemed clinically significant. The histograms in our graphic, however, show considerable overlap in individual measures. We computed a flipped proportion – the proportion of individuals in the group that is inferior in the aggregate who nevertheless have better outcomes than individuals in the aggregately better group – at 30%: not inconsiderable, and likely much larger than most people would infer from the p-value or the clinically meaningful aggregate difference. A discussion of this ecological fallacy in aggregate statistical inferences is at this page, where we note that even with a complex baseline profile including therapy one often obtains a concordant proportion of about 70%, implying discordance of about 30% (details are in this hyperlinked article – Srinivasan, S. 2018. “Inverted Simulations Demonstrating Strong Ecological Fallacies in Cohort Studies.” Journal of Mathematics and System Science 8 (5): 119–39). Our discordant or flipped proportion is inspired by the Concordance Index (C-Index) attributed to Professor Harrell at Vanderbilt. A calculator providing post-hoc assessments of reported study results, including an assessment of this flipped proportion, is at this page.
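The flipped proportion is easy to reproduce by simulation. The sketch below assumes normally distributed BP reductions with illustrative parameters chosen to match the scenario described here: an observed group difference of 5.25 (about 50% larger than the meaningful 3.5) and a common SD of 7; these specific values are assumptions for the illustration, not the calculator's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative parameters: BP reductions of 5 (standard) vs
# 10.25 (new), a difference ~50% larger than the meaningful 3.5,
# with a common SD of 7, as in the five-zero p-value scenario.
standard = rng.normal(loc=5.0, scale=7.0, size=200_000)
new = rng.normal(loc=10.25, scale=7.0, size=200_000)

# Flipped proportion: standard-therapy individuals whose reduction
# beats that of a randomly matched new-therapy individual.
# Analytically this is Phi(-5.25 / (7 * sqrt(2))) ≈ 0.298.
flipped = float(np.mean(standard > new))
```

With these assumed parameters the flipped proportion lands near 0.30, the figure quoted above, despite the large aggregate separation.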

Despite ecological fallacies, results in the aggregate continue to be relevant. They can be a first step towards personalized approaches, followed by the use of additional information provided freely by the individual to gauge whether one might be in the 30% or so who buck the aggregate trend. Such information, with additional supporting data, may help achieve a personalized solution instead of a push for all towards the option supported in the aggregate. Certain fields of endeavor may be inherently ecological in nature, such as the study of infectious diseases, climate change and mass species extinction. Even here, ecological perspectives have driven unnecessary and inhumane reactions, such as those to the AIDS epidemic and, earlier, the separation of those afflicted with leprosy into leper colonies. Actions here, even when justified, often set precedents for action in other contexts where they are not relevant. Even in these ecological settings it is helpful to bring in individualized measures – a recent report by a scientist on NPR about the 70% decline in the insect population in Europe used, as a measure, the frequency of bugs squashed on one’s car windscreen. Given ecological fallacies, the use of inferences based on aggregate stochastic data should be nuanced, generating few rules that are applicable universally – like legal systems seeking to protect individual rights in free societies (there may still be a few!) while providing some support for norms which have wide consensual backing.

The inverted simulation and calculator

The interactive calculator has three tabs simulating data in two-arm clinical trials. The first tab is for continuous data, such as the reduction in BP. The second has survival data, looking at the time to an event such as death or disease progression. The third tab has binomial data, such as the achievement of a response threshold on therapy. For each of these calculators you can change the sample size per group (standard therapy or new therapy) and the number of zeroes in the p-value associated with the difference between the two therapies. The default sample sizes in the three calculators of 85, 176 and 230 allow no more than the usual 5% two-sided type I error and a 10% type II error (errors defined here). For the continuous calculator we can detect a meaningful difference of 3.5 (half a standard deviation is usually considered meaningful – Norman GR et al. Medical Care. 2003; 41, 5: 582–592) between a reduction of 5 for standard therapy (something like a diuretic) and a reduction of 8.5 for a new test therapy (something like a diuretic/beta-blocker combination), for an SD of 7 for BP reductions over time. For the other two default sample sizes we use a hazard ratio of 0.7 and a difference in proportion responding of 15%. Further details on the default sample sizes and the inverted simulation are in this attached document #1. Sample sizes larger than these would be called over-powered and would tend to detect trivial differences smaller than those considered meaningful. Smaller sample sizes would be under-powered and would tend to rule a new therapy ineffective even when it does have that minimal amount of effectiveness in the aggregate. The default number of zeroes in the p-value is 5 – only a one-in-a-million chance supporting the hypothesis of no difference between therapy groups.
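The default of 85 per arm for the continuous tab can be reproduced with the standard two-sample normal-approximation sample-size formula – a sketch of the usual calculation, not necessarily the calculator's own derivation (which is in the attached document):

```python
from math import ceil
from statistics import NormalDist

# Two-sample normal approximation:
# n per arm = 2 * ((z_{1-alpha/2} + z_{1-beta}) * sigma / delta)^2.
# Values mirror the continuous-calculator defaults described above.
alpha, power = 0.05, 0.90   # 5% two-sided type I, 10% type II error
sigma, delta = 7.0, 3.5     # SD of BP reductions, meaningful difference

z = NormalDist().inv_cdf
n_per_arm = ceil(2 * ((z(1 - alpha / 2) + z(power)) * sigma / delta) ** 2)
print(n_per_arm)  # 85
```

With delta equal to half a standard deviation, the sigma/delta ratio is fixed at 2, so the same n = 85 applies to any outcome where the meaningful difference is half an SD.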
As noted earlier, these default values still leave considerable overlap in the individual data histograms across groups, with an estimated 30% of the standard-therapy reductions in BP exceeding those for new therapy – despite that extreme p-value and a difference in reductions between groups about 50% larger than the clinically meaningful difference of 3.5. This is because the p-values are aggregate inferences which derive from the separation of the peaked notional distributions of the average in the graphic below, rather than from the distribution of data on individuals. Note that density estimates, such as histograms, are rarely presented even though considerable research has been conducted in this area. See Terrell and Scott (JASA, 80, pp. 209–214). Lo, Mack and Wang (Prob. and Rel. Fields, 80 (1989), pp. 461–473) look at density estimation in the context of survival data with censoring. Other characterizations of inter-subject variation, such as spaghetti plots with individual longitudinal trends, are rarely presented.
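The contrast between the two scales can be made explicit under a normal model: the averages are separated relative to the standard error sigma/sqrt(n), which shrinks with n, while the overlap of individuals is governed by sigma alone. A minimal sketch, assuming the default continuous-tab values:

```python
from math import sqrt
from statistics import NormalDist

sigma, n = 7.0, 85   # SD of individual reductions, per-arm sample size
diff = 5.25          # observed group difference (~1.5x the meaningful 3.5)

# The p-value reflects the "skinny" distributions of the group
# averages, whose spread is the standard error sigma / sqrt(n) ...
se_of_mean = sigma / sqrt(n)

# ... while the individual flipped proportion depends only on sigma:
# P(standard individual > new individual) = Phi(-diff / (sigma * sqrt(2))).
flipped = NormalDist().cdf(-diff / (sigma * sqrt(2)))
print(round(se_of_mean, 2), round(flipped, 2))  # 0.76 0.3
```

Increasing n drives the standard error (and hence the p-value) down without moving the flipped proportion at all – the point made in the next paragraph about 'bigger data'.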

When you increase the sample size for any fixed number of zeroes in the p-value, watch the skinny distributions of the average approach each other and get skinnier. The estimated percent chance of individual BP reductions for standard therapy exceeding individual BP reductions for new therapy moves into the 35% to 40% range. So ‘bigger data’ helps discriminate smaller differences in the aggregate but does not tell us any more about individual differences – it is likely that big-data-based ‘significant’ conclusions for stochastic data are no more predictive, and often less predictive, for the individual. Meta-analyses, combining data from multiple studies, are often considered even better than the constituent complete set of blinded randomized studies testing a hypothesis. They can resolve conclusions about the aggregate when some studies in the mix are neutral and some are negative. This, however, has the same ‘bigger data’ issue and does not help us any more in evaluating effect in individual subjects. In contrast, if we return to the calculator and reduce the sample size below the default value for any fixed number of zeroes in the p-value, you will see that ‘small’ data may actually be more useful. Try the survival calculator and the binomial calculator tabs as well. As with the continuous calculator, you still have individual survival times for the standard therapy better than those for new therapy more than 35% of the time, despite the very meaningful estimated ratio of hazards of death of new therapy to standard of close to 0.6, and that one-in-a-million p-value supporting the superiority of the new therapy (‘in the aggregate’ – people sometimes leave that part out). We collect all survivals beyond 7 years into the last bar in the histogram, revealing a more marked difference. One should consider this, and the 65% estimated chance of the new-therapy survival being larger than that for standard therapy, when making choices.
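The survival-tab flip rate has a simple closed form if one assumes exponential survival times (an assumption of this sketch, not necessarily the calculator's model): for hazard ratio h of new to standard therapy, the chance that a standard-therapy survival time exceeds a new-therapy one is h / (1 + h), independent of the baseline hazard.

```python
import numpy as np

# Closed form under exponential survival: with hazard ratio h
# (new/standard), P(T_standard > T_new) = h / (1 + h).
hr = 0.6
flip = hr / (1 + hr)   # 0.375 – independent of the baseline hazard

# Monte Carlo check of the closed form.
rng = np.random.default_rng(1)
t_std = rng.exponential(scale=1.0, size=200_000)        # baseline hazard 1
t_new = rng.exponential(scale=1.0 / hr, size=200_000)   # hazard 0.6 -> longer times
sim_flip = float(np.mean(t_std > t_new))
```

A hazard ratio of 0.6 thus gives a flip rate of 37.5% – consistent with the "more than 35% of the time" noted above, and complementary to the 65% chance favoring the new therapy.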
However, with the flip rate of 35% one might consider the standard therapy if it is more tolerable and/or if there are other indicators that one would be in the 35%. There is a quality-of-life indicator called the EQ-5D index, where a score of zero is evaluated as a state equivalent to death and a 1 corresponds to health and happiness. Cancer databases often have patients reporting negative indices – presumably a state worse than death. In the binary outcome calculator, the percent of times the standard-therapy individuals do as well as or better than someone on new therapy is as high as 60% – but it is usually possible to break the ‘non-responder’ and ‘responder’ labels into many ordinal grades, and when you do that and reassess how often the standard therapy is as good or better, you get closer to the 30% number.
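A quick decomposition shows why the binary 'as well or better' figure runs so high: most pairs are ties, and ties count in standard therapy's favor. The response rates below are assumptions for illustration (the defaults specify only a 15-point gap); the exact figure depends on the rates chosen.

```python
# Hypothetical response rates (assumed for illustration; the defaults
# specify only a 15-point difference in proportion responding).
p_std, p_new = 0.30, 0.45

# For one individual per arm, the pairwise comparison decomposes as:
std_strictly_better = p_std * (1 - p_new)   # std responds, new does not
new_strictly_better = (1 - p_std) * p_new   # new responds, std does not
tie = 1 - std_strictly_better - new_strictly_better

# Ties count toward "as well or better", inflating the binary figure;
# ordinal grades break the ties and pull the flip rate back down.
as_good_or_better = std_strictly_better + tie
print(round(as_good_or_better, 3), round(tie, 3))  # 0.685 0.52
```

Here over half of all pairs are exact ties, which is why grading the outcome more finely (breaking the ties) moves the comparison back toward the ~30% flip rate seen with continuous data.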

Edit the blue cells in the spreadsheet to enter your data; the calculations in the bottom box of the spreadsheet and the graphics will refresh.