## Slag, smoke and molten metal

The tapping of molten iron from a blast furnace. You stand transfixed as you watch the molten metal flow out – a feast for the senses and a heat you can feel in your bones. I was told that the workers tapping the blast furnace used to be the most heavily insured – not much of a life expectancy.

Blast furnaces – huge behemoths powering a different era. The queen bee of an industrial hive of mines, factories and railroads with a voracious appetite for iron ore, limestone and coke. A chemical reaction at more than 1200 degrees in its gut separates molten iron from a detritus called slag and produces a lot of smoke. And this is where I will bring in my college professor in metallurgy, RK Srikanta Kumaraswamy at IIT Madras, who taught us that there was a lot to learn from the slag and the smoke as well. He had been a superintendent at a blast furnace before he joined the faculty and had a unique mix of theory and practice, with a wonderful flair for mixing anecdotes from his industrial experience with conceptual explanations. He would talk about seeing the color and consistency of the smoke from the blast furnace in the morning and rushing in to work to rebalance the mix of raw materials entering the furnace, or about looking at the slag and inferring the grade of the iron ore used in the previous shift. All this would be neatly tied to a rational explanation in terms of the chemical processes inside the blast furnace – intuitive explanations going into the core of how things work. I hope Mr Kumaraswamy has taught me well. As someone who went on to become a bio-statistician in the US, I will try to bring in the ‘smoke and the slag’ of statistics. The emphasis will be on the dominant pervasive inferential approaches used in practice, with some levity thrown in (the discussion here has been published and is available at this link).

You see the expression ‘dominant pervasive inferential approaches’ and you are thinking that you would rather eat a plateful of wriggly worms, but you must read on – coming up is a story about a trip to South America to see an eclipse in the shadow of World War 1 and a joke about physics (is that even possible!!). And here’s a starter and a shocker for some – Kellyanne Conway is a genius!

## Alternative Truths

Kellyanne Conway is a genius. I believe a lot of people don’t like her opinions, but the expression ‘alternative facts’ that she coined is brilliant, and very germane to how inferences are drawn from stochastic data (data associated with noise or a degree of erratic randomness) using statistical analyses. Much of statistics is built on a duality between a true state of nature and the experiences deriving from it. Such dualities are not uncommon in the sciences. We are always presuming that something meaningful is behind what we experience, observe and measure. A rather prosaic set of questions may help deem someone as being depressed, whatever that means. Mathematics, the skeletal basis of much of science, abstracts all numbers through the x’s and y’s of algebraic expressions – something we all have trouble with when we move from numbers to algebra in grade school.

Statisticians refer to this true state of nature as a parameter, typically unknown. Statisticians will go to great lengths distinguishing the average from the mean, the computed standard deviation from an underlying true standard deviation (usually denoted as sigma), and a proportion from a probability. The latter in each of these three pairs is the true state of nature and the former is a computed measure which attempts to get at it. We have hypotheses – two, many, or from a continuum – about the nature of this truth. We have been bordering on blasphemous for a long time, much before Kellyanne Conway. We are comfortable with alternate truths and alternative facts. These can be invariant truths, as in a frequentist statistical framework, or varying truths with subjective probabilities, when we adopt the Bayesian framework. Data is deemed to derive from these true states of nature. If we were looking at data on the trajectory over time of a free-falling apple on planet earth, this data derives from and supports Newtonian hypotheses about gravity as the true state behind this experience. The link here between the true state of nature and our trajectory data is deterministic. We come in as statisticians when this link between what may be true and what we observe is probabilistic.

In probabilistic settings the true state of nature can then throw us varying experiences. We can have alternate facts – very different ones emanating from the same underlying truth. Consider an incident reported in the media: a vegan woman of the Bahai faith involved in a shooting at a YouTube facility. Vegans typically wouldn’t hurt a fly, the Bahai are of a pacifist faith, and she was a woman – and yet we had it in the news, an indisputable fact, a person with all three characteristics shooting to hurt and kill. Of course, we have thousands of other facts about other people meeting that profile who wouldn’t do a thing like that, and some who might misbehave in milder ways. All alternate facts. And there are a lot of other possible explanations for acts of violence in this example and in other contexts – all alternate truths! In clinical trials, a cancer patient on a standard therapy may survive 4 years while another on a new therapy, considered an improvement, may succumb to the illness in 3 years. Statistics comes in to help make sense of it all, in this and other stochastic settings, through a statistical hypothesis testing framework.

There I go again with ‘statistical hypothesis testing framework’ and you may be thinking of going to a tropical rain forest to have the leeches suck your blood instead of reading any further – but hold on – I have that oxymoron coming up – a joke in physics! – at least one in statistics has some small finite negative probability! And about certain sharps that scientists carry – turn them in before they do any damage!

## The Statistical Hypothesis Testing Framework

In the classical frequentist framework, we would start with a set of dueling hypotheses. For the example above, of a new therapy versus a standard therapy for cancer, one would start with the hypotheses that there is no difference between the standard therapy and the new therapy and pit this against the hypothesis of better survival outcome with the new therapy. The former is called the null hypothesis and reflects prevalent opinion, while the latter is called an alternate hypothesis and is something we hope to establish.

There are various data schemas we can use to ascertain the merit of these hypotheses, ranging from retrospective real-world data collection to controlled prospective designed studies. One may conduct a clinical study, with a sufficiently large number of patients randomly assigned to the two therapies, to assess which of the two hypotheses above is supported. There is usually some quibbling about how conclusions of such studies are expressed. Statisticians are usually taught to express one of two conclusions supported by the data. If the data indicates a lack of a difference between therapies we would say “We were unable to reject the null hypothesis of no difference between therapies”, with the odd double negative suggesting that currently available evidence does not support a difference but more evidence might. If the data does support a difference we would say “We reject the null hypothesis of no difference in favor of the alternate hypothesis that the new therapy is superior to standard therapy.”
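The two canned conclusions above can be sketched in a few lines of code. This is a minimal illustration, not anything from an actual trial: I assume normally distributed BP reductions, hypothetical means of 5 and 8.5 mm Hg, a common SD of 7, and a simple one-sided two-sample z-test (a real analysis might use a t-test or survival methods).

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical simulated data: BP reductions (mm Hg) under each therapy
standard = [random.gauss(5.0, 7.0) for _ in range(85)]
new = [random.gauss(8.5, 7.0) for _ in range(85)]

def two_sample_z(a, b):
    """Approximate two-sample z-test for a difference in means.
    Returns the z statistic and the one-sided p-value (mean of b > mean of a)."""
    se = math.sqrt(statistics.pvariance(a) / len(a) + statistics.pvariance(b) / len(b))
    z = (statistics.mean(b) - statistics.mean(a)) / se
    p = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z) under the null
    return z, p

z, p = two_sample_z(standard, new)
if p < 0.05:
    print("We reject the null hypothesis of no difference in favor of the alternate.")
else:
    print("We were unable to reject the null hypothesis of no difference.")
```

Note that the conclusion is binary either way – the double negative in the second branch is doing a lot of quiet work.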

This framework is consistent with an early philosophy of science framework called falsification – any explanation continues to be relevant till it is rejected by another which is better supported by the data. We spoke earlier about data on a falling object supporting a Newtonian world view as an example of the fact/explanation dichotomy in science and statistics. Let’s look at how we moved on from that, in order to explain falsification. In 1919, in the immediate aftermath of World War 1, a team of European scientists, totally out of touch with the realities of their time, took a long arduous trip fraught with risk to South America to record data during a solar eclipse. Their goal was to see if light was deflected by gravitational fields, in line with Einstein’s theory of relativity or the more strongly supported Newtonian hypotheses of the time. Their data did support the theory of relativity. If the falsification framework is the right perspective, then we can note that, to date, we have been unable to reject Einstein’s theory of relativity in favor of another such elegant alternate theory of all things. Though I do hear news at times of particles travelling faster than the speed of light. I believe that means that there are physicists out there standing at two ends of a long tube with one telling the other “I see it coming out at this end – have you sent it out as yet?”!! And one such physicist will come to us eventually with an alternate explanation as simple as Einstein’s explanations using analogies of objects falling from a window of a train, relative to observers on the moving train and relative to those outside (for Einstein’s explanations written for the lay person and for the story about the confirmation of his theory see Relativity: The Special and General Theory, 1920, Methuen).

One must digress a little to note that our small band of explorers with a lot of chutzpah, watching that eclipse, have likely done more for us than the warring nations and leaders of their time. Their brethren were led on by jingoistic tunes like ‘Dulce et decorum est, pro patria mori’ – ‘It is sweet and fitting to die for one’s country’. There is a poem by Wilfred Owen with this title about how one wouldn’t say that when walking behind a cart full of dead and dying soldiers choking on mustard gas. The rationality of science, which seeks explanations that cohere our experiences and eventually helps us choose between them when we can, would have stopped at “My country when right” and not gone on to the usual “My country – right or wrong”.

Competing hypotheses to explain experience – a simple and transparent framework which can be explained in a nutshell, as I have attempted to do. Unfortunately, rationality like this can have very sharp edges – Occam’s razor serves well but can cut you very badly! Simple frameworks do leave things out. Many will argue that experience alone has a richness which does not need explanations, as in this experience/explanation framework – most have an immersive experience when looking at an expressionist’s masterpiece or listening to a famous composer’s music, without necessarily evaluating or understanding. Like someone once said, “what can be explained in a nutshell” might ‘deserve to stay there’!

However, before we get too cynical about what we do, note that the clinical study comparing therapies, falling into this evaluative framework, answers a simple but very important question for most of us – does this new therapy work? This is a necessary first step. We need to move on to – is it safe? Does it work on laboratory tests and other objective measures relevant to the disease? Do patients like it – does it improve subjective measures of quality of life? And finally – will it work for me? Or will I be like that patient who died early on the new therapy (likely to have done just as well or better on standard therapy), despite some statistician reporting a “rejection of the null hypothesis of no difference in favor of the alternate hypothesis” that the new therapy improves survival. We will look at these questions later, but first let’s explore further how inferences are drawn. Bear with me while I discuss that —

Don’t move on to that loud recording of screaming banshees (the birds or the band!). Instead read on, folks, about something on a cat in a limbo between life and death – creepy!

## The decision theoretic framework

A popular analogy used to explain how we statisticians choose between two hypotheses is based on the criminal justice system. There, we err when we hold someone guilty when innocent and when we acquit someone who is guilty. We are fine when we find someone guilty or innocent when they are. We would like to reduce the error rates. Analogously, in statistical inference, we want to control the error rate of concluding that the alternate hypothesis of an effect is true when the null hypothesis of no difference is a better characterization. This is usually called the type I error rate. The error of concluding in favor of no difference when there is one is called a type II error. We usually rule in favor of the alternate hypothesis (a difference in therapies) when the possibility of being wrong (the type I error rate) is less than 5%. One would then call the result statistically significant. It is necessary to note that this conclusion, very much unlike the legal analogy, pertains to aggregates computed over individuals rather than to each individual. We conclude differences between therapies on aggregate characteristics such as medians and means, which may not hold as convincingly between two individuals given the differing interventions. The fallacy of presuming individual effect based on an effect in aggregates has been criticized for a while by epidemiologists as an ‘ecological fallacy’ (see Rothman and Greenland (editors), Modern Epidemiology, 1998, Lippincott, Williams, and Wilkins). This led to a movement in epidemiology away from analyses where the data records were aggregates over large units such as counties or other sub-divisions – median income, disease rates, racial or ethnic composition, etc. – towards case-control and cohort studies which have data on individual subjects. Here we will be demonstrating that these studies based on individual records (with examples from cohort studies) are infected with this fallacy as well.
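The 5% type I error rate can be seen directly by simulation. A hedged sketch with made-up numbers: I repeatedly simulate a trial in which the null hypothesis is true by construction (both arms drawn from the same normal distribution) and count how often a two-sided z-test at the 5% level ‘convicts’ the innocent null.

```python
import math
import random
import statistics

random.seed(42)

def z_test_rejects(a, b, alpha=0.05):
    """Two-sided two-sample z-test: True if it rejects at level alpha."""
    se = math.sqrt(statistics.pvariance(a) / len(a) + statistics.pvariance(b) / len(b))
    z = (statistics.mean(b) - statistics.mean(a)) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return p < alpha

# Null hypothesis true by construction: both arms drawn from N(5, 7)
trials = 2000
rejections = sum(
    z_test_rejects([random.gauss(5, 7) for _ in range(85)],
                   [random.gauss(5, 7) for _ in range(85)])
    for _ in range(trials)
)
print(rejections / trials)  # hovers around 0.05
```

Swapping in a mean difference for the second arm would, in the same way, estimate the power (one minus the type II error rate).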

The general decision theoretic approach described above is not very different when Bayesian approaches are applied to clinical trial data. The Bayesian approach has, over time, in the clinical trial context, been forced into a frequentist mold through the use of non-informative prior information (which is the same as not using any prior information!) and adaptations of the frequentist decision theoretic framework. The general Bayesian framework may help by holding on to multiple hypotheses, with high or low probabilities – a ‘this and this’ framework rather than a ‘this or this’ framework. Frequently however, a hypothesis with a high Bayesian posterior probability (a revised probability of the hypothesis given the observed data) is likely to lead to choices favoring it in just the same manner as those favoring a hypothesis retained by a frequentist analysis. As with the frequentist approach, we adopt the experience/explanation framework, and conclusions tend to be based on aggregates and represent a first step to help in approaches customized to the individual. For instance, statistician Dr. Peter Thall and colleagues at the MD Anderson Cancer Center support the use of the probability that a true proportion responding to therapy (an aggregate random parameter) in one group exceeds that in another to aid in the choice between therapies. It is important to understand inferences in the aggregate, drawn from statistical analyses, and see why these may not always hold for individuals. Let’s look at a major innovation in statistical theory, used often in frequentist approaches, which drove statisticians into inferences about aggregates – unlike that legal framework discussed earlier which looks at individuals.

If you are getting a little numb with all this theory and considering electro-shock therapy to break out of it, then just hang in there – up ahead is that cat that appears to be blissfully purring as it arches its back – but shadows in a distance could deceive – is it alive or is it in the grips of *Yama* Himself and half-way to the nether-world! And also, about yellow, green, brown and black belts -are we talking karate!

## The central limit theorem

This major result, sometimes conflated with the law of large numbers (a related but distinct result), features in most inferential analyses. It states that even when the distribution of data on individuals is erratic and non-standard, the distribution of aggregate statistics has a tractable form (usually the symmetric bell-shaped normal distribution or a related distribution), allowing us to read off probabilities. For instance, we might have a skewed distribution for the reduction in diastolic blood pressure (BP) under therapy for individual patients due to resistance to therapy. This might give a distribution with more probability mass to the left of the peak, reflecting some likelihood of a lack of response, than to the right. If we looked at the distribution of the average BP reduction of a sufficiently large number of patients, it would tend to have the symmetrical normal distribution. Let’s look at the histograms in my interactive display and calculator below. We have two groups randomized to two therapies capable of reducing blood pressure. We look at the reduction in BP in mm Hg for the two groups using the two histograms. The distributions of the individual BP changes are the skewed wide distributions I mentioned earlier – it is simulated data and you may not always see the skew in a given simulation. The distribution of the average is much skinnier and always peaked and symmetric as shown. The measure of the spread of the distribution of the average, the standard error, is lower than the corresponding measure for the parent distribution, the standard deviation, by a factor given by the square root of the sample size. Statistical p-values and inferences are drawn based on the separation of the tighter known distributions of the average rather than the wider intractable parent distributions in the histograms. Bayesian formulations in clinical trials, as noted earlier, also rely on a skinny distribution of the aggregate and draw conclusions about the aggregate.
Bayesian approaches would look at the distribution of the mean (a random parameter), while the frequentist approach would consider the mean invariant and look at the distribution of the sample average – both are aggregates.
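The skinny-versus-wide contrast is easy to check numerically. A sketch under assumed values (a skewed exponential parent distribution with mean 5, hence SD 5, and samples of size 50): the spread of the averages comes out near 5/√50 ≈ 0.71, just as the square-root law says, while the individual values keep their spread of about 5.

```python
import math
import random
import statistics

random.seed(1)
n, reps = 50, 2000

# A skewed parent distribution: exponential with mean (and SD) of 5
samples = [[random.expovariate(1 / 5) for _ in range(n)] for _ in range(reps)]
averages = [statistics.mean(s) for s in samples]

sd_individual = statistics.pstdev([x for s in samples for x in s])  # near 5
se_average = statistics.pstdev(averages)                            # near 5/sqrt(50)
print(round(sd_individual, 2), round(se_average, 2), round(5 / math.sqrt(n), 2))
```

A histogram of `averages` would also show the skew of the parent distribution largely washed out – the central limit theorem at work.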

The reader may ask “What do you mean by the distribution of the aggregate? Don’t you just get one average in each of the two groups?”. Yes, the distribution of the average has no real existence unless we repeat the study 100 times or so, and we usually do it only once. This notional distribution of the average allows us to assess differences in the aggregate. Ordinary synonyms for the word notional, which I used for the distribution of the average, include apparitional, illusory and chimerical. When we apply anything we infer from the average to individuals, we are invoking Schrodinger’s cat through this result about aggregates – Schrodinger’s cat! – simultaneously alive and dead in the mythology of physics, when we would rather be looking at all the field mice scurrying around! (and maybe bunnies, goons and fairies – see my daughter’s favorite kindergarten rhyme here)! However, most statisticians including myself will swear by the central limit theorem and the use of this notional distribution of the average. Some excellent mathematicians built gilded marble steps, without any logical cracks or crevices, all the way up to it. And we do say —

I know it is real, it exists and I have seen it!

In management parlance there is often reference to six sigma quality, and as you move up the certification levels, you go from a yellow belt all the way to the black belt. Sigma is the quantity estimated by the standard deviation, the measure of spread of the wider distribution of individual measures. When we get a statistically significant result we are actually happy with a difference of about twice a standard error – and if management found out what a standard error is they would give us statisticians no more than a belting! I just bought a million-dollar personal injury insurance policy and so here goes – a standard error is the measure of the spread of the distribution of the averages, given, as noted earlier, by the standard deviation divided by the square root of the sample size. Proportional to the widths of the skinny distributions I mentioned earlier. Not much of a thing!

To accentuate the magnitude of an effect, senior clinical team members sometimes get us to chime in with a question like – “now, how many zeroes did we have in that p-value?”. Some of us statisticians may cringe a little, but it might help with the bonus, and so we do give them that nice number we obtained in our statistically significant analysis. The default Excel graphic provides the separation between the distributions, both in the aggregate and in the individual, when you see 5 zeroes in the p-value – often interpreted as a one in a million chance that the null hypothesis of no effect is supported as opposed to the alternate that the new therapy is superior to the standard therapy. Any statistician would label this ‘significant’ – synonymous with something ‘noteworthy’, if you look it up in a dictionary. Further, a clinician would look at the difference in the average reductions in BP across the two therapy groups of much more than half the standard deviation (a threshold often used to gauge if differences are meaningful) and deem it clinically significant. The histograms however indicate considerable overlap in individual BP reductions. We computed the proportion of individuals in the standard therapy arm whose reduction in BP exceeds that of someone receiving the new therapy at about 30%. The reduction in BP for individuals on the new therapy exceeds that for standard, despite the highly significant finding, at a rate of only about 70%. Two things can happen, like in an exam with true-false choices, and the new therapy barely passes with 70%. In my corporate compliance training courses, I usually need 80% to pass and there are usually questions with more than two options. Many results deemed as statistically significant are indicative of probabilistic propensities for superiority and it may not be appropriate to use them as prescriptive for certain choices for an individual patient.
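That 30% figure is easy to reproduce with a back-of-the-envelope normal approximation. A hedged sketch, not the calculator’s simulation: I assume independent normal BP reductions in both arms with a common SD of 7 and a difference of about 5.25 mm Hg (the value implied later in the text), and use the closed form P(X_std > X_new) = Φ(−Δ/(σ√2)).

```python
import math

def flip_probability(delta, sd):
    """P(an individual on standard therapy improves more than one on new therapy),
    assuming independent normal outcomes with a common SD in both arms."""
    z = delta / (sd * math.sqrt(2))
    return 0.5 * math.erfc(z / math.sqrt(2))  # equals Phi(-z)

# Assumed values: difference of ~5.25 mm Hg between arms, SD of 7
print(round(flip_probability(5.25, 7.0), 2))  # prints 0.3
```

With no difference at all (`delta = 0`) the flip probability is of course 50% – a coin toss.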

Well, here I am bleating on and on and you are firing up your barbeque, getting your magic sauce ready, and those skewers, and getting ready to make lamb kabab of me – but wait it is still a while till the end of summer – coming up are some interactive graphics – pretty much the max amount of fun a statistician can give you! And after that calculator we will even bring in billiards!

## The inverted simulation and calculator

The interactive calculator has three tabs simulating data in two-arm clinical trials. The first tab is for continuous data such as the reduction in BP. The second has survival data looking at the time to an event such as a death or disease progression. The third tab has binomial data such as the achievement of a response threshold on therapy. For each of these calculators you can change the sample size per group (standard therapy or new therapy) and the number of zeroes in the p-value associated with the difference between the two therapies. The default sample sizes in the three calculators of 85, 176 and 230 allow no more than the usual 5% two-sided type I error and a 10% type II error (errors defined earlier). For the continuous calculator we can detect a meaningful difference of 3.5 (half a standard deviation is usually considered meaningful) between a reduction of 5 for standard therapy (something like a diuretic) and a reduction of 8.5 for a new test therapy (something like a diuretic/beta-blocker combination), for an SD of 7 for BP reductions over time. For the other two default sample sizes we use a hazard ratio of 0.7 and a difference in proportion responding of 15%. Further details on the default sample sizes and the inverted simulation are in this attached document #1. Sample sizes larger than these would be called over-powered and would tend to detect trivial differences smaller than those considered meaningful. Smaller sample sizes would be under-powered and would tend to rule a new therapy ineffective even when it does have that minimal amount of effectiveness in the aggregate. The default number of zeroes in the p-value is 5 – only a 1 in a million chance supporting the hypothesis of no difference between therapy groups.
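The continuous-calculator default of 85 per group can be reproduced from the usual normal-approximation sample size formula: twice the square of (z for alpha plus z for power), scaled by (SD/difference) squared. A sketch under the stated design values (difference 3.5, SD 7, two-sided 5% alpha, 90% power); the survival and binomial defaults use different formulas not shown here.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.90):
    """Per-group sample size for a two-sided two-sample z-test (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
    z_b = NormalDist().inv_cdf(power)          # about 1.28
    return math.ceil(2 * ((z_a + z_b) * sd / delta) ** 2)

print(n_per_group(3.5, 7.0))  # 85, the continuous calculator's default
```

Halving the detectable difference quadruples the required sample size – the square in the formula is why over-powering creeps in so quickly with big data.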
As noted earlier these default values still leave considerable overlap on the individual data histograms across groups, with an estimated 30% of the standard therapy reductions in BP exceeding those for new therapy – despite that extreme p-value and a difference in reductions between groups about 50% larger than the clinically meaningful difference of 3.5. Note that density estimates, such as histograms, are rarely presented even though considerable research has been conducted in this area. See Terrell and Scott (JASA, 80, pp. 209–214). Lo, Mack and Wang (Prob. Theory and Rel. Fields, 80 (1989), pp. 461–473) look at density estimation in the context of survival data with censoring.

When you increase the sample size for any fixed number of zeroes in the p-value, watch the skinny notional distributions of the average approach each other and get skinnier. The estimate of the percent chance of individual BP reductions for standard therapy exceeding individual BP reductions for new therapy gets into the 35% to 40% range. So ‘bigger data’ helps discriminate smaller differences in the aggregate but does not tell us any more about individual differences – it is likely that big data based ‘significant’ conclusions for stochastic data are no more predictive, and often less predictive, for the individual. Meta-analyses, combining data from multiple studies, are often considered even better than the constituent complete set of blinded randomized studies testing a hypothesis. They can resolve conclusions about the aggregate when some studies in the mix are neutral and some are negative. This however has the same ‘bigger data’ issue and does not help us any more in evaluating effect in individual subjects. In contrast, if we return to the calculator and reduce the sample size below the default value for any fixed number of zeroes in the p-value, you will see that ‘small’ data may actually be more useful. Try the survival calculator and the binomial calculator tabs as well. As with the continuous calculator, you still have individual survival times for standard therapy better than those for new therapy more than 30% of the time, despite the very meaningful estimated ratio of hazards of death of new therapy to standard of close to 0.6, and that one in a million p-value supporting the superiority of the new therapy (‘in the aggregate’ – people sometimes leave that part out). We collect all survivals beyond 7 years into the last bar in the histogram, revealing a more marked difference. One should consider this and the 70% estimated chance of the new therapy survival being larger than that for standard therapy when making choices.
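The ‘bigger data’ point can be made in a couple of lines: as the sample size grows, the p-value for a fixed aggregate difference collapses toward zero, while the individual flip probability does not move at all. A hedged sketch with assumed numbers (a fixed difference of 2 mm Hg, SD 7, normal approximations throughout):

```python
import math
from statistics import NormalDist

nd = NormalDist()
delta, sd = 2.0, 7.0  # assumed fixed difference in average BP reduction, common SD

results = []
for n in (85, 500, 5000):
    se_diff = sd * math.sqrt(2 / n)          # SE of the difference in averages shrinks
    p_value = 2 * nd.cdf(-delta / se_diff)   # aggregate significance sharpens with n
    flip = nd.cdf(-delta / (sd * math.sqrt(2)))  # individual overlap does not budge
    results.append((n, p_value, round(flip, 3)))

for n, p, f in results:
    print(n, f"p={p:.2e}", f"flip={f}")
```

The flip probability stays near 42% at every sample size – more data sharpens the verdict about the averages, not about any one patient.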
However, with the flip rate of about 35% one might consider the standard therapy if it is more tolerable and/or if there are other indicators that one would be in the 35%. There is a quality of life indicator called the EQ-5D index where a zero score is evaluated as a state equivalent to death and a 1 corresponds to health and happiness. Cancer databases often have patients reporting negative indices – presumably a state worse than death. In the binary outcome calculator, the percent of times the standard therapy individuals do as well as or better than someone on new therapy is higher than about 60% – but usually it is possible to break the ‘non-responder’ and ‘responder’ labels into many ordinal grades, and when you do that and reassess how often the standard therapy is as good or better you would get closer to the 30% number.

All these numbers! – you may be inclined to string some of them together, X out your browser and rush to the nearest convenience store to play your state lottery. I plead that you desist and continue reading, as there might even be a slightly better chance of gaining from this than from the one in a million shot at winning the mega-million jackpot! Coming up after this calculator we will bring in billiards – yes, the number 8 ball in the far-left corner and the number 7 to the far-right! And we will even dwell a little on the paranormal!

**Edit the blue cells in the spreadsheet and enter your data, and the calculations in the bottom box of the spreadsheet and the graphics will refresh.**

## What’s great about prospective clinical trials?

Clinical trials comparing therapies usually involve an adequate amount of follow-up and an adequate number of subjects to uncover unusual and rare adverse events associated with the therapies studied, in addition to providing good estimates of rates for common side-effects. Studies typically have a large number of clinic and hospital sites participating across countries and continents, and a large number of contracted organizations to create databases, verify the accuracy of data entered by the sites, ensure randomization without bias, and maintain any blinding of patients and site personnel to therapies. These sites and organizations are entities independent of sponsors and are subject to audits by both the regulatory agencies and the sponsor – the data is likely robust and reliable, and inferences about aggregate effect are accurate.

In Europe, North America and many other countries there is a requirement that sponsors of clinical trials provide details such as the primary and key objectives, hypotheses and endpoints at or before the start of a trial to an online database. In the US such a database is at clinicaltrials.gov. Further, regulators require sponsors to provide a study protocol before the study starts enrolling and detailed statistical analysis plans shortly after, before any unblinding and analysis is conducted. Study protocols are also shared with Institutional Review Boards (IRBs) before trials are initiated. These commitments reduce publication bias as there is a requirement to report results to clinicaltrials.gov or a similar online database irrespective of whether results were negative or positive. The prospective statistical plan also helps prevent cherry picking among the many choices when carving out analysis populations, endpoints, data cuts, hypotheses and analysis methods. A statistical plan specifies and selects one from all these choices of data presentation. A statistician will tell you that this controls the type I error rate – the likelihood of falsely concluding a difference between therapies when there is none. A pre-specified analysis has a lot of credibility – it is like calling a shot in billiards before making it. Further, regulatory agencies usually require two successful well controlled studies before the approval of a therapy, adding to the credibility of inferences supporting a therapy.

Though I make the case that most statistically significant results only represent some stochastic incrementalism in the aggregate, clinical trials are usually sized right to detect clinically meaningful differences in the aggregate – a chunky incrementalism. A statistician would size a study to detect reasonable improvements in the aggregate reducing the possibility of triggering a signal based on trivial differences. A series of such increments could add up to marked improvements in both efficacy and safety over time. Such an approach is a critical first step even if it may not help entirely in the choice of therapy by and for the individual patient. We will get to a discussion on that shortly. There can be long periods of stasis in drug development where new therapies continue to be compared on efficacy against old standards with little effect, or in trials establishing non-inferiority with a current effective standard. These can get approved based on improved and/or differing safety profiles, quality of life or economic benefit. Conclusions from well conducted clinical trials are likely to be far more reliable than those from data sources I describe below.

## Big and Easy Data

Obamacare accelerated trends towards the use of electronic records and other uncontrolled prospective and retrospective data. Acquiring such data can cost a tenth or less of what running a controlled clinical trial to obtain the data would. Very large datasets can be obtained, with the downside mentioned earlier of trivial results being flagged as noteworthy. Further, there can be substantive biases due to the uncontrolled nature of the data. There exist methods, using propensity scores, such as those by Professors Rubin and Rosenbaum, which can control for these biases when all likely confounding variables are available. Note that while statisticians can try to pre-specify analyses for such data, one is not required to publish either the analysis plan or any obtained negative results.

Further, the analyses can be overly managed at the institution conducting the research, resulting in multiple inferential analyses, population carve-outs, endpoints, hypotheses and analysis methods, and subsequent choices amongst these on results worth publishing. Other reasons why results do not see the light of day are that they are negative or neutral, not in the best interest of the organization, inconsistent with other published results emanating from the institution, or contrary to opinions of influential external opinion leaders. Conclusions in such contexts should note that results are exploratory or hypothesis generating and that multiple analyses in addition to the reported results were conducted without adjustments for multiple testing to control the overall false positive rate. Such acknowledgements are necessary as data presentation for such data often mirrors presentations for prospective controlled clinical trials.

Next is another God-Awful calculator.

## God-Awful Publication Bias Calculator

In the last section we looked at a large number of reasons why results don’t get published. We will look at the consequence of just two of the reasons I mention, negative or neutral results, on the validity of results that do get published. The calculator uses Bayes’ Theorem and mathematical details are in the attached document #2. In the first tab of the calculator, you can enter the probability that analyses conducted at your institution will be published given a positive finding (a statistically significant result) and that for publishing given a negative finding. I have default values of 80% and 10%. The third entry is the nominal false positive rate used in the analysis – usually a two-sided 5%. You can enter the actual p-value you obtain in your analysis. The actual false positive rate corresponding to the nominal 5% is close to 30%. We are still talking aggregates here and predictability in individuals is likely much lower – even with our p-values with 5 zeroes, which needed no adjustment upwards, the patient level estimated rate of flipped efficacy was near 30%. Further adjustments for any of the multiple analyses mentioned above would make the results even less credible. Young and Karr [Significance. 2011; Vol 8, #3: 116-120] looked at 52 claims from uncontrolled studies with significant results which were published in reputed journals like NEJM, JAMA and JNCI, and note that none of these significant findings held up in randomized clinical trials – 5 were supported in the opposite direction.
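For the curious, the core of that calculation can be sketched in a few lines. This is my reconstruction of the logic from the defaults above (the full formulas are in the attached document #2): among analyses of true nulls, condition on having been published, with publication acting as the selective filter.

```python
def published_false_positive_rate(p_pub_pos=0.80, p_pub_neg=0.10, alpha=0.05):
    """Among analyses of true nulls that get published, the fraction that
    are (falsely) statistically significant, via Bayes' Theorem."""
    # A true null comes up significant with probability alpha.
    p_sig_and_pub = alpha * p_pub_pos           # significant and published
    p_nonsig_and_pub = (1 - alpha) * p_pub_neg  # non-significant and published
    return p_sig_and_pub / (p_sig_and_pub + p_nonsig_and_pub)

print(f"{published_false_positive_rate():.1%}")  # 29.6% – the 'close to 30%'
```

With the 80%/10% defaults, the nominal two-sided 5% inflates to roughly 30% among what actually reaches print.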

After the calculator look for talk about the paranormal! – a portal to the future through a blue haze – are you getting the shivers!

**Edit the blue cells in the spreadsheet and enter your data and the calculations in the bottom box of the spreadsheet will refresh.**

## My Psychic Story

You see it in my calculator – a tab for a paranormal calculator and you don’t have to be too prescient to tell that I am getting to it next. But let me break into it slowly with a story. This starts with a youth in college. A freshman, even undecided on a major – life is a big blank slate. Hopes, ambitions, aspirations, desires and a confusing profusion of choices. The freshman is walking around late evening in the main street near college after submitting a particularly grueling class project. With thoughts of possibly failing physics and statistics and never ever finding true love. And then, right in front, he sees a blinking sign for a psychic reading at a discounted student rate. And he knew – this must be it – the way out! After all those quick Ramen Noodle meals during exam week there was money to spare. The psychic was in – she had known for a while that he would come! He had his questions – Will I find the one? Will she be beautiful and nice? And will I be rich and successful? The crystal globe on the psychic’s table was glowing, connected through an astral portal to the future and sending in messages only the psychic could discern. She stared into the globe. A blue haze suffused the room and she started answering his questions. She said “You will do well – just take marketing and finance instead! And true love is coming – she will be all you want and more.”

He did as he was told and sure enough things went very well. She was beautiful and nice and helped him through his MBA. Now he had made it and was a big time pharmaceutical executive with 2.6 kids – it was a boy and a girl and the third was coming – as a statistician I don’t like rounding too much and hence the 2.6. Then he remembered the psychic from his college town and how she seemed way more predictive than the company statisticians! A bit of the scientific temperament had rubbed off on him through all that freshman physics and statistics. So, he made a trip to his college town psychic before each of the first 6 World Series ball games. And through that blue haze came the scores and the name of the winning team. It cost a little more this time as he could neither get the student nor the faculty discount despite trying. Though a few bets with different bookies got him very much in the black.

Now, as I said, he had made it, but he felt he still had a ways to go – there was that million-dollar boat to buy and that corner office at the workplace. There was this mega-million-dollar clinical study initiating. Wouldn’t it be nice to find out how it will end up? Not very good if it fails – all that money down the drain. If it does not fail, or if he could change it to something else which works – now that would get him somewhere. He goes back to the psychic and she pulls up a press briefing from three years later and tells him that the study is negative – the drug doesn’t work – your company is in trouble – you are on the dole – your wife has left you. So, the executive goes back to headquarters and cancels the clinical trial and pushes a different compound through clinical trials. Now if all this does happen then we need to have two pathways through space and time – one in which the failed trial did occur and one in which the failed trial did not occur. If only the latter remains, the psychic would not have been able to tell the executive about the outcome of the trial. The executive is better off in one of the two parallel pathways – let the Gods and Goddesses judge which one that is! So here we now have twice the amount of real estate in space time. How can that be? – you may ask. Elon Musk of Tesla and a lot of others are speculating that the universe is somewhat unreal, virtual – a simulation. And as statisticians we know it does not take too much work for more simulations to be created – just one more do-loop somewhere. All rather fantastic but you know being Bayesian with some probability, keeps me sane – just one of many worldviews I hold with some little probability when Bayesian – that we might be simultaneously living in these parallel timelines and that we are able to send some kind of information packet across these and back and forth in time.
Kind of fun and eerie stuff for a statistician to think about – a probabilistic past in addition to a probabilistic future! Others with a lot of time to kill might want to puzzle through things like probabilistic or deterministic causal inference, and if you are in business it will be tricky thinking about insider trading. But now I have done that nifty paranormal drop-down in my calculator and I will be damned if I don’t sneak it into this webpage before some frequentist falsifies my hypotheses.

So, let’s get back to that God-Awful calculator and select the even more awful calculator tab for the paranormal. I have as default the probability of 1.0 for that timeline T1 if the executive is assured by the psychic that the study is a success, as he will not cancel it then. If it is a negative finding he will cancel the study and move along timeline T2 instead where he tries something else. So, for a negative finding the probability of T1 is 0.5. I am unable to argue that one timeline has more of a probability than another – both will be equally ‘real’ real estate to us – though the probability might be different if a lot more options are explored. As before, I apply Bayes’ Theorem and there is close to a doubling of the false positive rate.
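The same Bayes filter from the publication bias calculator works here, with timeline survival replacing publication. A sketch under my reading of the defaults above – a positive study is never cancelled (T1 survives for sure), while a negative study leaves T1 as one of two equally real timelines:

```python
def timeline_false_positive_rate(p_t1_pos=1.0, p_t1_neg=0.5, alpha=0.05):
    """False positive rate among studies of true nulls that actually ran
    (timeline T1), when negative studies get cancelled into timeline T2."""
    p_sig_and_t1 = alpha * p_t1_pos           # significant, study ran
    p_nonsig_and_t1 = (1 - alpha) * p_t1_neg  # negative, study still ran
    return p_sig_and_t1 / (p_sig_and_t1 + p_nonsig_and_t1)

print(f"{timeline_false_positive_rate():.1%}")  # the nominal 5% nearly doubles
```

With these defaults the nominal 5% rises to about 9.5%; explore more alternative timelines (a smaller survival probability for T1 after a negative finding) and the inflation gets worse.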

Now, you need to climb onboard your time machine, buckle up, and get back to these miserable times – Marquis de Sade beckons! We will flog that mythical Schrödinger’s cat to life a few more times – poor thing just has 9 lives! (and 9 or 10 deaths depending on whether you start from before or after the first life). I think the physicists have used a few already and I know Jim Cramer from CNBC used one when he called a little uptick in the financial markets, after a crash, a ‘dead cat bounce’. But next, let’s look at the relationship between ice-cream sales and the number of sand castles built at Miami Beach – must be all that sugar!

## Correlation isn’t much association either

A typical example on correlation used by professors at college starts with funky data: the number of sand castles built plotted on the Y-axis against ice-cream sales on the X-axis. You look at the scatter plot and see a narrowish cloud of data aligned with the lower end of the cloud at lower values of both the number of sand castles and the ice-cream sales and with the upper end having both high. And you also know that whenever you have something with Y’s and X’s, the Y’s belong on the left-hand side of some kind of equation with the X’s on the right leading to the Y’s. So, you are thinking “I get it! I get it! I get it!” and you are raising your hand up for all those class-participation points – ‘Professor, it is all that high-fructose corn-syrup in the ice-cream, obviously, which is getting those kids into a castle building frenzy’. Then your professor tells you ‘Na, Na, Na, Na, Na! there is a third thing you see, the number of people at the beach, which is likely leading to highs or lows on both simultaneously – correlation is merely association and not causation!’. As an aside, please note that it is a well-known fact (with no known alternative facts!) that the best way to land a cute statistics professor is to hang out at the beach before school starts and look for her (or him) counting sand castles. This professor here spent a good 50 days at the beach, put the data together and got a p-value with the 5 zeroes when rejecting a null hypothesis of a zero correlation. I will not give you another messy calculator this time but will tell you that this would correspond to a calculated correlation coefficient of 0.613. A correlation coefficient between two measures is roughly a measure of the tendency of one measure to be above or below its average when the other measure is also above or below its average, and goes from -1 for a perfect negative correlation to +1 for a perfect positive one.
It uses the two averages, something we had a little trouble with earlier, and like the average it is an aggregate statistic. It can be shown that, for a close to ellipsoidal scatter plot of data, a good 29.0% of data points will show an association in the opposite direction when the correlation coefficient is the 0.613 above – on 29.0% of the days at the beach, ice-cream sales will be below the average ice-cream sale with the number of sand castles built being above average, and vice-versa (details on calculations in the attached document #3). Correlation isn’t much association for a good 29.0% of the individual data points – they associate in a direction opposite to that indicated by the correlation coefficient. The discordant percentages are 33.3%, 25.3% and 20.5% for correlations of 0.5, 0.7 and 0.8 respectively. The discordance rate can be quickly assessed for a correlation coefficient r as arccos(|r|)/π – the inverse cosine of the absolute value of r, divided by the constant π.
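That closing formula is easy to check – assuming, as noted, a roughly ellipsoidal (bivariate normal) cloud of points:

```python
import math

def discordance_rate(r):
    """Fraction of points in a bivariate normal scatter whose deviations
    from the two means go in opposite directions, for correlation r."""
    return math.acos(abs(r)) / math.pi

for r in (0.5, 0.613, 0.7, 0.8):
    print(f"r = {r}: {discordance_rate(r):.1%} discordant")
# r = 0.5: 33.3%, r = 0.613: 29.0%, r = 0.7: 25.3%, r = 0.8: 20.5%
```

Note how slowly discordance falls as the correlation climbs – even a ‘strong’ r of 0.8 leaves one point in five associating the wrong way.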

Let’s move on – the time has come, once again, to pull our cats back from limbo!

This time we will get some help with these unruly cats, walking in and out, at will, through that trap-door to the other realm. Our peers working in the social sciences, psychology, quality of life, mental health, drug abuse and addiction, criminal justice and other similar disciplines will help. They often start with the maddeningly rich, complex profusion of phenomena in their areas of endeavor. The necessary first step for specialists faced with this is to order these around a conceptual framework. A lot of collective thought and effort by committed and inspired people goes into developing such frameworks over time. The latent concepts themselves have no direct measures – they are very nuanced, finely crafted, concise and very convincing embodiments of all those difficult and varied phenomena we mentioned. Like most people, I am usually very sold on these characterizations – though some may argue that the scientist has conjured our mythical cat. Then typically a group of experts derive measures likely to be linked to or informative about the abstract concepts of interest, with some iterative refinement of the measures, based on data. One might end up with as many as 25 or so measures. Then the scientist collects those 25 or so measures on each of a sufficiently large number of subjects and consults his friendly neighborhood statistician. There are various methods such as principal component analysis, factor analysis etc., and these methods look at correlations amongst the measures and put subsets of these together depending on their inter-correlations or their covariances (a close cousin) – maybe 4, 5 or 6 items which are highly related to each other. Correlations are generated between the combination of these items, an abstract composite, and the individual items that went into it. The scientist then interprets this combination of items based on the measures that went into it while broadly remaining consistent with his initial conceptual basis.
Similarly, interpretations are obtained for some 4 or 5 more combinations derived from the other measures in the 25 or so considered. There is some more iterative refinement of the measures and the number and definition of these notional concepts, built both on prior theory and on those empirical legs. In the aggregate, these notional concepts look good, each standing on the 4 or 5 or more measures they derive from. However, based on our discussions on correlations, the cat invoked, for a good number of individual subjects in the study, may have 1, 2 or more legs sticking straight up out of their backs! – their associations are discordant with the aggregate correlations. There are a number of different validities used to support these theoretical constructs and many involve aggregate statistics and correlations. Relationships, primarily stochastic, are assessed with external measures not used to obtain these constructs, to further justify the validity of the constructs. All incorrect for a large number of individuals. Another methodology involves classifying similar subjects into a small number of clusters based on a large set of initial measures. Then the means of these measures for each of these clusters can be used to provide a conceptual characterization of the clusters. These cluster means are usually significantly separated from those of other clusters. The point about the skinny distribution for the aggregate as opposed to the broad distribution for individual data continues to hold. The cluster mean is now a point in higher dimensions, with the aggregate distribution being compact and tight like the body of the sun, while that for the individual points can be like the far-flung planets, asteroids and comets orbiting it. Any such mix of conceptual and empirical may take a good 3 years of work and a lot of journal publications, and almost everyone acknowledges the feline.

I know it is real, it exists and I have seen it!

I am quite a sucker for some of the things that come out of this line of work such as personality profiling – one I like is called the enneagram. I come out part peacemaker, part designer and part performer and there are some swell things written about these kinds of people.

Next, I get into some implications for public policy and then look at the fallacies in histograms and the standard deviation. Sounds like a reading you would save for the 29th or even the 30th of February! Keep going folks for we will bring back the executive, the psychic and the statistician – something about an interaction term between the defensive indifference baseball statistic and the color of the pitcher’s underwear! And then let me ask you – are you dressed right? – we are headed to a tea party! – the ladies are waiting!

## Who will bell the cat?

In public policy contexts, often being labelled into one of those conceptual categories connotes something perverse and the truth of this is usually made patent through some stochastic data on those who have it and those who don’t. Then the game of scientific tag starts based on this measurable composite thing and if ‘you are it’ you may be deemed to need various interventions, with or without your consent. These would be based on those side-by-side studies, as in the first calculator, amongst others who were earlier deemed to be ‘it’. Then there are studies establishing adverse consequences when interventions are withdrawn. Likely all the steps – from establishing the relevance of the discriminating conceptualization, to its perverse nature, to the need for ceaseless interventions – rest on stochastic data. This pathway will usually gain credence through statistically significant relationships, effects deemed meaningful by investigators, and some may even be supported by the 5 zeroes in the p-value of a statistical test. Any enforcement is usually very rule-bound, by people uncomfortable with exceptions or uncertainty and driven by speed and efficiency. Those in these dystopic pathways are often trolled, usually using wrongly acquired private or publicly available data, now or in the future, through the set of very disagreeable options in this pathway, even when it is clear to both the enforcer and the victim that better choices for and by the individual exist. Often you are ‘it’ based on some youthful indiscretion or lapse of reason. If it were a sin (usually it hurt you more than others and it was within the norms of behavior at the time), you have sought redemption from yourself or some larger consciousness. You have moved on – but your free expression and action continues to be restricted and you are subject to various intrusions and interventions by those trolling and shaming you based on your history.

As demonstrated earlier, the conclusions about the pathway derive from aggregate propensities and we can have 20% to 30% of individuals countering each one of its patterns. What is worse is that data supporting the links in this chain may be uncontrolled and highly filtered, resulting in publication and other biases. Much worse is the use of case reports and anecdotal evidence. We do tend to be overly reactive to information – two Native American students attending a US university tour were pulled out and held for questioning because a parent reported them to law enforcement out of discomfort with the t-shirts they were wearing. Our paranoia is leading to a need for ‘transparency’ – a nice word justifying a move towards oppressive surveillance regimes. Transparency is necessarily two-way and most of us are not that interested in other people’s lives and don’t want ours intruded into either. If we keep heading this way, at the least we should track the number of individuals erroneously targeted (likely very large given the discussion above) with adverse interventions and the cost of each of these interventions to the individuals targeted. Going by the Black Lives Matter movement, it is clear that both the number and the cost are severely underestimated.

I think I may have raised a little bit of a stink here – no, it is not from your cat’s litter box! So, don’t break away to check. You have been with me for some 8500 words when you would have preferred reading the entire United States Tax Code in Latin. Hang in there – just another 4500 words! We will shortly return to the psychic and the executive and a cheese steak and fries outlier – figure that one out! And then further down, we will head to that tea party – so classy and British!

## Fallacies in the histogram and the standard deviation as well

You will have to wait a little to meet those people again. First recall how we spoke earlier about the skinny notional distribution of the average with its spread characterized by the narrow standard error and the wider distribution for individuals with its spread characterized by the standard deviation. That can be a fallacy as well unless every individual’s measurement has an identical distribution – otherwise what are we talking about when we look at the histogram? This assumption is also made in Bayesian approaches when obtaining the likelihood used to update the prior distribution of an underlying aggregate to obtain the updated posterior distribution of the aggregate (see discussions on exchangeable observations in Spiegelhalter, Abrams and Myles, Bayesian Approaches to Clinical Trials and Health-Care Evaluation, 2004, John Wiley). These measurements should be independent of each other – otherwise, for instance for a positive dependence, if we get some measures on one side of the histogram most others would tend to be on that side skewing the histogram away from the true distribution. These assumptions drive the classical central limit theorem used to derive the notional distribution of the average and as noted, the Bayesian approaches as well.

We have frequently referred to an alternate additional statistic reflecting individual variation through the proportion of times a patient on the standard had a better reduction in BP than a patient on a new therapy. We can try to move from this data-driven fact to a state of nature true for all patients. We might infer that this proportion is an estimate of the probability of ‘any’ patient responding better to the standard therapy than to the new therapy in similar patients elsewhere outside the confines of the clinical trial. The assumptions in both the frequentist and Bayesian setting that we are making about ‘identical’ distributions across patients and independence across patients would support the previous statement. Returning to our BP example, this would mean that all patients under standard and new therapy would tend to hit reductions in BP of say 5 and 8.5 respectively, give or take about a standard deviation of 7. This, however, wouldn’t be true if, for instance, a patient had a skinnier or wider distribution under the two therapies than that reflected by our wide histograms, or if the patient distributions are centered at different BP reduction levels than those for the wide histograms. It can be shown that drawing independent identical observations from a distribution obtained as the average of individual non-identical distributions is equivalent to drawing an observation from each of the independent non-identical individual distributions. The grouped averages and the histograms could be estimating either something interpretable as the true underlying mean and true underlying distribution of identically distributed individual measures or the average of the differing underlying means and distributions of individuals with differing distributions (see attached document #4 for details demonstrating these observations).
The grouped variance obtains as an average of the individual variances plus the mean of the squared differences of the individual means from the average of the means. At one extreme this identity allows differing individual means with all variances of zero – these distributions have singularities at the mean values. At the other extreme all patients have the same underlying mean with identical or differing variances. Likely we have differing underlying variances and means for individuals, and an argument that they have identical distributions is likely specious. Computed statistics such as the estimated aggregate treatment differences with confidence intervals, and p-values, will be identical – inferences that are likely masquerading as applicable equally to all individuals.
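The variance identity above is easy to verify numerically. The per-patient underlying means and variances below are made up for illustration; the point is that the grouped variance is blind to how it is split between its two terms:

```python
from statistics import fmean

# Hypothetical underlying (mean, variance) pairs for four very different
# patients whose pooled data could masquerade as one common distribution.
patients = [(2.0, 1.0), (5.0, 4.0), (8.0, 9.0), (11.0, 2.0)]
means = [m for m, _ in patients]
variances = [v for _, v in patients]

grand_mean = fmean(means)
# Grouped variance = average of the individual variances
#                  + mean squared deviation of the individual means.
grouped_var = fmean(variances) + fmean([(m - grand_mean) ** 2 for m in means])
print(grand_mean, grouped_var)  # 6.5 15.25
```

Any other split – four identical patients with variance 15.25, or four singular patients with zero variance and more spread-out means – yields the same grouped mean and variance, and hence the same aggregate statistics.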

The data-derived proportion of times patients on standard have a larger BP reduction under standard is likely estimating the mean probability over all patients of patients responding better on standard, rather than the probability for any and every patient. An estimated proportion of 30% could mean some patients have a 20% or less probability of doing better and some have a probability of 40% or more. So, the statistic we used to demonstrate an ecological fallacy itself has an ecological fallacy. I beat the word ‘underlying’ (mean, variance and distribution), presumed to be different or the same across patients, to death in the last paragraph. I was making the scientific distinction that I had made earlier between an experience and an explanation. We are encouraged to look at an ordering of the underlying means from a clinical trial and presume that a patient is always innately better off on one therapy and not look at the experience – which tells us that a good number of patients ended up with a better response on standard than on new therapy. I personally have trouble with this as well, as I am trained to value an explanation or a rule more than experience/phenomenon, as all experience should fall in place when we have the rule. A little explanation in physics will tell us that we will come out screaming, but alive and unhurt, from every roller-coaster ride. But this context here is somehow different. When evaluating whether someone will respond to a therapy we should be referencing the mean measures on therapies, any established ordering of means for the patient (differing or identical distributions across patients), and observed prior data such as the flipped proportion estimate. In India, when someone is being fixed up for good, the prospective bride (and sometimes the groom) is usually told that we know their family, we know their community – they are good people!
I am reminded about a little ditty the girls used in my Indian hometown when a boy misbehaved, which goes – ‘handsome is as handsome does!’. There is a strong move in India now towards allowing this assessment adequately. We need that and it is still helpful to reference the ‘good’ if it is not based on some prejudice.

We spoke earlier about a patient on standard therapy surviving 4 years when a patient on new therapy survived 3 years. Would that standard therapy patient have survived 5 years if he had taken the new therapy instead – hard to tell – most of us have only one life to live. We need to be able to assess the distribution for measures of interest, within subject and under both therapies before we can make a statement about a personalized probability of one therapy being better than the other for the patient. Sounds very much like the thing with the average – we usually have just one observation and that too on just one therapy. Perhaps the notion of a distribution underlying that one thing, identical or different across patients, is a convenient bit of fiction as well. More so for survival data and for diseases where we have only one shot at therapy. Cross-over and N-of-1 designs, where we can switch between short term therapies as the disease recurs on stopping therapy, can allow us to use a notion of a patient ‘distribution’ and help crudely assess that distribution. Even for survival data some kind of a distribution could be assessed using response on some quick leading indicators of future survival. But first let’s motivate this within subject distribution by returning to our executive, psychic and statistician.

## Cheese steak and fries outlier

Earlier I said ‘most of us’ have only one life to live and if your mind is not stuck in an eddy inside a dense probabilistic whirl then you will remember our account of the executive who did have at least those two parallel lives. It turns out that in the parallel timeline with the cancelled study, the executive and a company statistician had much more water-cooler break times before the alternate trial replacing the cancelled study got initiated. As luck would have it the executive dropped in when the statistician was bragging about his non-linear, triple interaction predictive model, using more than 20 batting, fielding and pitching statistics for all players over the last three years. He had even thrown in obscure things like defensive indifference, the color of the pitcher’s underwear and the anticipated cloud cover over the stadium during the game, and he had called 4 of the first 6 World Series games. The executive overheard and couldn’t resist busting the statistician’s bubble. He told him he had predicted all 6 games so far and told him the outcome and the final score in the deciding 7th game. Then he walked away with a smug smile. The statistician did a quick calculation based on the 6 in 6 and rejected the null hypothesis that the executive’s stated predictions were a fluke in favor of the hypothesis that they were not, as he had computed a two-sided p-value of 0.03125. A few days later the 7th game turned out exactly as predicted. Being part Bayesian, like most statisticians are these days, he gave some credence to the prior information based on the 6 in 6 and updated the probability of a fluke given the likelihood of that win and that score in the 7th game. His conclusion was that the executive was onto something! If he could find out what, it would make a great publication in the Journal of the American Statistical Association (JASA).
So, he landed up at the executive’s favorite drinking hole on Friday and sucked up majorly, even telling him how God-like he was in baseball prognostication and in marketing and finance. Even ordered the next few rounds of drinks and it all came out – about the psychic in the blue haze. Bit of a disappointment that! – JASA wouldn’t publish a thing like that you know!
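The statistician’s quick calculation checks out, by the way – under the null that each call is a coin flip, getting all 6 games right is one of the two equally extreme outcomes:

```python
# Two-sided p-value for 6 correct calls in 6 games under a 50/50 null:
# P(all 6 right) + P(all 6 wrong), the two most extreme outcomes.
p_two_sided = 2 * 0.5 ** 6
print(p_two_sided)  # 0.03125
```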

The executive was sloshed and happy and was back from one of many trips to the rest-room and complaining about what his prescription diuretic was doing to him and how those beers weren’t helping. That’s when the statistician figured something that JASA would definitely take. So, he told him about a k timeline, N-of-1 trial. If he volunteered, he would get to be second author in JASA; for k = 10, he would get to keep the incentive money across all timelines and get to enjoy 10 times the real estate, 10 wives and 26 children. He needed a little reassurance that he would be aware of only one existence and that it would really be just the 2.6 children. Next Monday, he was sober, still convinced on scientific merit and signed informed consent. Then the statistician rushed to the psychic. The psychic knew she had seen him in the future and said ‘you are a pretty strange one – I see you giving me a number – a reduction in BP – a 7.2 under diuretic – no wait it is 2.3 under diuretic – no wait it is a 5.2 under beta-blocker/diuretic combination – oh sorry this is mighty confusing’. The statistician asked her to take a deep breath and told her ‘yes, there are many – 10 in fact, 5 for diuretic and 5 for beta-blocker/diuretic – just slow down and tell me all – I will tell you why in just a minute’. Then the statistician noted all the data, pulled out his random code and subsequently got the help of the psychic to randomize the executive in 5 parallel timelines to 4 weeks of diuretic and in 5 more to 4 weeks of a beta-blocker/diuretic combination. And all that gives us the truth and nothing but the truth (roughly) about the within subject distribution for the reduction in BP for the executive. You will see details about this in an upcoming issue of JASA. There was actually one outlier – one timeline where the executive ate cheese steak and fries three days in a row before his clinic visit instead of the arugula salad that the doctor had advised.

## Ahem! We got pedigree too!

The physicists have their mythology, and I brought one of them alive (I think) in this commentary. There are others they have as well, such as that apple knocking Newton cold. Things to help us understand matters of gravitas! We humble statisticians have them too – no, it is not about that ‘average apple’ falling a little to the left of a statistician with all that dense probability mass buzzing him out. It is about a gentleman – a gentleman and intellectual – a British gentleman and intellectual – it is difficult to add in ‘statistician’ given some two or three contradictions when I string it all together – but here goes – take a deep breath or swallow hard please – it is about a British gentleman, intellectual and statistician at tea with a Lady.

The table was set. There were walnut scones, Jumbleberry jam, biscuits and tea – lots of that – 8 cups – all for the Lady. Let me tell you why. Earlier, the Lady, Lady Muriel Bristol, had claimed that she could tell whether the tea or the milk had been added to the cup first. Now our statistician, Sir Ronald Fisher, likely a bit of a skeptic, said ‘let’s find out if you can’. The rest is part of our folklore. Sir Ronald Fisher had 4 cups made with the milk first and 4 with the tea first and then randomly ordered them on the table, with the Lady blinded to the ordering. You see the words random and blinded – and that’s about where it all started – randomized, blinded, experimental design (for the tea tasting design and details on experimental design see Hinkelmann and Kempthorne, 1994, Design and Analysis of Experiments, John Wiley). The Lady picked all four cups out correctly. Sir Ronald Fisher figured a p-value of 1 in 70, or 0.0143. He might have got her to accompany him later to the Royal Ascot horse races to help him pick some winners – but that part of the story is as yet unverified.
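Fisher’s 1-in-70 figure is pure counting: under the null of guessing, every way of picking the 4 ‘milk-first’ cups out of 8 is equally likely, and exactly one of those choices is fully correct. A minimal sketch:

```python
from math import comb

# The Lady must pick which 4 of the 8 cups had milk added first.
# Under the null (pure guessing), every choice of 4 cups out of 8
# is equally likely, and exactly one of them is fully correct.
arrangements = comb(8, 4)      # 70 equally likely guesses
p_value = 1 / arrangements     # P(all four correct | guessing)

print(arrangements, round(p_value, 4))   # 70 0.0143
```

No distributional assumptions, no asymptotics – the randomization of the cups alone justifies the calculation.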

This experiment, you should notice, is very much an N-of-1 trial. There was N = 1 Lady, two interventions occurring 4 times each, and the outcome was a correct guess. The probability of picking all four cups correctly by guessing at random is just 1.4% – so we reject the null and conclude that this individual, Lady Muriel Bristol, very likely wasn’t guessing. The conclusion was not about an aggregate – we are not making the much weaker claim that people in the aggregate tend to be able to tell if the tea was added first. We could do a study asking a large number of people whether a cup of tea had the milk or the tea added first. A 65% correct rate could be statistically significantly different (with those 5 zeroes in the p-value) from guessing (null of 50%), if we had the right number of subjects. A 15% difference! – some may even call it meaningfully significant – but 35% of the individual subjects got it wrong!
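To see how easily the aggregate clears the significance bar, here is a quick sketch with made-up numbers (200 tasters, 130 of them – 65% – correct; the sample size is mine, purely for illustration), using an exact one-sided binomial test against the 50% guessing null:

```python
from math import comb

def binom_upper_p(n, k, p=0.5):
    """Exact one-sided P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Illustrative numbers (not from any real study): 200 tasters,
# 130 correct (65%), null of pure guessing (50%).
n, k = 200, 130
p_value = binom_upper_p(n, k)
print(p_value < 0.001)   # True: the aggregate is wildly 'significant'
# ...yet 70 of the 200 individuals (35%) still guessed wrong.
```

The p-value shrinks with n; the 35% of individuals who got it wrong does not.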

Being of Indian origin, I think I can tell if my masala chai had the cardamom, ginger or cinnamon added into the boiling water before the tea leaves and not after. And my statistics Ph.D. minor was in experimental design. Lots of things I share with the attendees at that tea party. So, even if it is a bit of a downer having my daughter complain to her mom about my occasional unplanned bodily noises, I know, deep down, that I am someone with pedigree!

The next section is going to be a bit of a slog. A little better than getting teeth pulled out, though. Call your dentist and tell him you plan on having just a little more fun instead. For coming up after that you will get to read my ‘Doctor, Doctor’ jokes!

## Getting to personalized medicine

The standard deviation (SD) of diastolic blood pressure measures (usually about 8 mm Hg) is a little larger than the SD of the reduction in blood pressure (about 7 mm Hg), and this is due to the correlation between the before and after measures for a subject. A before-after correlation in the 0.6 to 0.7 range would give an SD for the reduction close to that 7 mm Hg. A cross-over design considers a similar difference of effects of two or more therapies given to the same subject and can be more efficient when correlations between effects are greater than zero, leading to a smaller sample size requirement than a parallel arm study. It would also get at the within-subject distribution we referred to earlier, to overcome issues with aggregates. However, cross-over analyses additionally provide between-subject aggregates of within-subject differences between therapies and are hence subject to fallacies similar to those described earlier. Menard et al. (Hypertension, v 11, 2, 1988) report results of a hypertension trial involving cross-overs between a beta-blocker alone, a combination of two diuretics, and the two diuretics in combination with the beta-blocker. The mean reductions from a prior placebo washout, after adjusting for carryover effects, were 3.2, 4.0 and 6.5 mm Hg respectively. These differences were not demonstrated to be statistically significant in this cross-over study with 24 patients. After patients went through the three therapies, they continued on the therapy with the largest reduction in diastolic BP. An N-of-1 trial would have been similar but would have had just one patient, with more repetition of therapies to gauge effect and variance for that one subject (see Design and Implementation of N-of-1 Trials: A User’s Guide, AHRQ). Like the Menard trial, this would allow a more informed patient choice, should the patient provide informed consent to such an approach.
There would be the additional burden on the patient of switching back and forth through therapy and wash-out periods. And there are scores of options for hypertension and other indications. Other issues with cross-over designs include difficulty with drop-outs, inappropriateness for many conditions, carry-over of previous treatment effects, and some difficulty in analysis and in confidence in findings given carry-over and period effects (see Senn, Cross-over Trials in Clinical Research, John Wiley). In the Menard trial, 3 patients (12.5%) had an insufficient response on every therapy. Seven (29.2%) had their largest reduction in BP on the beta-blocker – the therapy with the lowest aggregate response. Four (16.7%) had their largest reduction in BP on the diuretics, and 10 (41.7%) on the diuretics in combination with the beta-blocker – the therapy with the largest aggregate response. Had it been a larger study yielding statistically significant results (the difference between the largest and smallest average reductions in the Menard study was close to half the standard deviation), a winner-takes-all treatment strategy would have put all subjects on the diuretics/beta-blocker combination. This result supports the earlier discussion of the considerable percentage of subjects who buck aggregate data supporting a therapy and do well on a therapy deemed inferior.
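The arithmetic relating the two SDs above (8 mm Hg for a BP measure, about 7 mm Hg for a reduction) is just the variance of a difference of correlated measures: when both measures share the same SD, SD(before − after) = SD · √(2(1 − ρ)). A small sketch – the 8 mm Hg is the illustrative figure from the text, the correlations are values I picked to show the behavior:

```python
from math import sqrt

def sd_of_difference(sd, rho):
    """SD of (before - after) when both measures have the same SD
    and correlation rho: sd * sqrt(2 * (1 - rho))."""
    return sd * sqrt(2 * (1 - rho))

sd_measure = 8.0                      # diastolic BP, mm Hg (illustrative)
for rho in (0.0, 0.5, 0.62, 0.7):
    print(rho, round(sd_of_difference(sd_measure, rho), 2))
# rho 0.0 -> 11.31, 0.5 -> 8.0, 0.62 -> 6.97, 0.7 -> 6.2
```

The same formula is what makes within-subject designs efficient: any positive correlation between a subject’s responses shrinks the variance of the treatment contrast.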

In our first calculator we are looking at completely randomized designs, or designs where there is blocking just to ensure that the randomization does not give us an imbalance in the sizes of the two groups. Often we have randomized block designs where subjects are randomized within blocks – typically a set of 4 patients within a stratum (for two treatment groups), with 2 patients randomly assigned to each group. One might stratify a hypertension trial by gender and age (<50, >= 50 years). Computing our statistic on the flipped proportion within each of the 4 {Gender x Age} strata should lead to a smaller proportion when there is increased homogeneity of measures within strata for both therapy groups. This allows some degree of personalization of effect for patients. Dynamic randomization, which balances group membership within a larger number of strata, as well as propensity score matching in the non-randomized setting, could help tease out an effect better for more complex patient profiles. Supervised machine learning methods used in artificial intelligence (AI) are built on a decision-theoretic framework reducing error around an expected mean given a patient profile (see Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, 2008, Springer). One must note that predictions based on machine learning are also based on aggregate data, though they are trained to a larger degree by data closer to a profile. The ideas discussed earlier about the skinny distribution for aggregates and the wide distribution for the patient still apply. These distributions now associate with a complex patient profile, including characteristics and therapeutic interventions, rather than standing on one feature as before (a particular therapy – new or standard). Both the skinny and the wide distributions will be tighter due to homogeneity when looking at a narrow patient profile.
There are a number of machine learning methods to choose from, which will give somewhat different widths, locations and shapes for these distributions. Hastie et al. provide expressions, for some models, for the much wider prediction intervals for individuals as well as for the confidence interval of the aggregate prediction. Clear expressions of this distinction between patient and aggregate predictions are provided in standard books on multivariate regression (see for instance Raymond H. Myers, Classical and Modern Regression with Applications, 1989, Duxbury). The patient prediction is a lot more useful for a physician treating a patient – indicating that some patients with an adverse profile might do well and some others with a good profile might do poorly, calling for a more personalized look at patient data.
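Under a simple normal model the flipped proportion even has a rough closed form: if two therapies differ by delta in mean reduction and a given patient’s outcome under each has SD sd (treated here as independent draws – an assumption of this sketch, not of the commentary’s data-based statistic), the chance that the ‘better’ therapy comes out worse for that patient is Φ(−delta / (sd·√2)). With Menard-like numbers:

```python
from math import erf, sqrt

def flip_proportion(delta, sd):
    """Rough normal-theory sketch: P that a patient's outcome on the
    'better' therapy is worse than on the 'worse' one, when aggregate
    means differ by delta and each within-subject outcome has SD sd
    (independent draws assumed)."""
    z = delta / (sd * sqrt(2))
    return 0.5 * (1 - erf(z / sqrt(2)))   # Phi(-z), via the error function

# Menard-like illustration: mean reductions 6.5 vs 3.2 (delta = 3.3), SD ~ 7.
print(round(flip_proportion(3.3, 7.0), 2))   # ~0.37
```

A roughly one-in-three flip rate, in the same ballpark as the 29.2% of Menard patients whose best response came on the therapy with the lowest aggregate response.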

Another good statistic to look at is the concordance probability – the probability that a randomly selected pair of patients, one with a poorer outcome than the other, will be correctly ordered based on inputting the two patient profiles into the model (see Harrell, Frank E., Regression Modeling Strategies, Springer). This statistic inspires the flip proportion statistic in this commentary, and like that statistic it tends to stay about the same when the size of the data set changes – it is not an artifact of big and bigger data. It tends not to be too large. The historically popular Framingham Heart Study model (2002 version) had a concordance probability of a little more than 70% (see Pencina and D’Agostino, Statistics in Medicine, 2004, v23, #13, 2109-2123). Other, more predictive disease contexts and a good choice of model and relevant predictors could provide larger concordance probabilities, while still leaving between 10% and 35% of patients characterized incorrectly. Patient factors and populations not considered during training of the model could lead to biased estimates through an AI model. Even randomized and blinded studies can show strong expectancy effects, with data moving towards patient and physician biases. The data itself is ‘trained’ towards these biases, and an AI model trained on the data may quantify and perpetuate them. In a different context, a predictive model used by Chicago law enforcement resulted in inappropriate racial profiling not very different from human racial profiling, possibly due to inherent biases in databases. Often the model is based on correlations without a strong theoretically linked basis. For instance, a model may use the bill payment history of subjects, and a poor history may signal a cognitive disability requiring intervention. Clearly there are other contexts, such as poverty or unemployment, which could lead to such a poor history.
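Computing a concordance probability needs nothing fancy: walk over every pair of patients with different outcomes and count how often the model’s risk scores order them correctly, with tied scores counting half (a common convention). A small sketch of the idea – the scores and outcomes are made up, and this is the bare c-statistic, not Harrell’s own software:

```python
from itertools import combinations

def concordance(scores, outcomes):
    """Concordance probability (c-statistic): among pairs with different
    outcomes, the fraction where the patient with the worse outcome also
    got the higher risk score. Tied scores count half."""
    num = den = 0.0
    for (s1, y1), (s2, y2) in combinations(zip(scores, outcomes), 2):
        if y1 == y2:
            continue                  # pair carries no ordering information
        den += 1
        worse_first = y1 > y2         # outcome 1 = event, 0 = no event
        if s1 == s2:
            num += 0.5
        elif (s1 > s2) == worse_first:
            num += 1
    return num / den

# Toy illustration (made-up risk scores and 0/1 outcomes):
scores   = [0.9, 0.8, 0.4, 0.3, 0.2]
outcomes = [1,   0,   1,   0,   0]
print(concordance(scores, outcomes))   # 5 of 6 informative pairs concordant
```

Because it is a per-pair proportion, piling on more patients sharpens its estimate but does not inflate it – which is exactly why it is not an artifact of big and bigger data.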

Very kind of you to stay on while I blathered and blathered. Thank you! — but I am really writing for two very interesting people – find out who when you get to the end! And there are those ‘Doctor, Doctor’ jokes!

## Implications for patients

Significant stochastic propensities for efficacy, small or large, are likely to be associated with ecological fallacies when we look at individual patients. We computed the proportion of patients responding favorably to a therapy despite aggregate data significantly (and even clinically significantly) supporting its inferiority to an alternative, as a statistic demonstrating this fallacy. We had some difficulty ascribing meaning to this statistic, but I hope I have convinced you of its relevance. This flipped proportion was based on a single core endpoint and not on the complete profile of effects of a new therapy, including safety, financial costs, convenience and some assessment of short and long term adverse effects. When an informed patient and physician choice of therapy occurs using such a complete, holistic view of the effects surrounding therapy, the flipped proportion is likely even less predictive of the eventual intervention and outcome.

We discussed emerging trends in personalized medicine. Models with a high degree of complexity, incorporating methods for reducing bias and variability, are likely needed to help make individualized predictions and choices. Personalized medicine in many contexts is currently too rudimentary to justify a normative therapeutic recommendation based on a patient’s genetic or disease profile, particularly when there are larger variances associated with patient-specific predictions. We have to hold on to a large number of choices for patients. Randomized controlled clinical trials are a critical first step in demonstrating the validity of each of these choices. Uncontrolled studies can support a therapeutic intervention if there is a large number of other independent analyses supporting the intervention.

This rather complex landscape, without clear answers for all patients, necessarily requires consent from patients after alternatives are discussed adequately. When faced with a set of equally unfavorable choices, patients should have the right to forego any therapy. It is necessary to factor in the patient’s subjective quality-of-life assessments under therapy and consider changes in therapy which improve them. One may consider available care options which are infrequently used due to less commercial backing, such as therapies which are off patent or those for which intellectual property rights do not apply. In many cases, evidence is available about these therapies from randomized controlled trials (see for instance a review of Hibiscus sabdariffa extract for hypertension by Hopkins et al., Fitoterapia, 2013, Mar, 85, 84-89). I sometimes hear of elderly patients taking as many as a dozen pills a day. Perhaps there is overuse, and the physician can monitor closely and do nothing, or consider preventive care, especially when evidence supporting action is weak. The physician’s intuition derived from training, skill and experience, bedside communications, patient reports of health, sickness and pain, and patient-specific data continue to be critical. Aggregate data and model-based predictions are helpful but are not likely to entirely replace personalized care.

I must add that I am neither a scientist nor a physician. I currently work for a pharmaceutical company in oncology as a statistician, on some clinical trial data and a lot on an oncology non-interventional registry. I worked more than 15 years back on hypertension trials, and the little I know has probably aged – consider all I say about diastolic BP as pedagogical and illustrative of the points I make about personalized medicine and personalized care. I did do my doctoral dissertation on the sample mean – so I think I have a pretty mean understanding of the sample mean! When people see the Dr. next to my name they sometimes ask me if I am ‘like a real Doctor, Doctor or just a Doctor’ and I have to say ‘Doctor’. I have painted a somewhat dismal picture of the real strength of clinical data and would like to note that your physician, who gets to say “yes, Doctor, Doctor”, has an MD and considerable training and experience on your ailment. Some may have an MD and a Ph.D., and you might hear them say “yes, Doctor, Doctor, Doctor” if you ask them that question. It may be wise to heed your physician’s advice, with second opinions when needed. Usually they give you room to choose among a set of alternatives, including palliative care when the end seems near. Sometimes, though, there may be almost a coercion towards certain therapies – not necessarily from your physician, but from the concerned troop of caregivers, family and friends who surround you when you are indisposed. You could use this commentary to help in a situation like this, and to push against any kind of normative and prescriptive directives or interventions by public, private and corporate agents restricting personal choice in medical as well as social settings – much of the data used to support these is likely to reflect mild to moderate aggregate probabilistic tendencies which do not always justify action on individuals.

The FDA has a list of errant disqualified clinical investigators. You probably don’t want to go to them – they are likely Doctor, Doctors who Doctored!

Well, Tweedledee and Tweedledum, I am finally done!