Logistic Regression Sample Size

Sample Size for Logistic Regression

Here we present a calculator implementing the sample size formula provided by Hsieh (1989) for multiple logistic regression using a continuous predictor.

Hsieh (1989) notes that epidemiological studies often involve a continuous risk factor predicting the occurrence of a disease, and that it is often necessary to determine the size of the cohort needed to demonstrate a significant association between the risk factor and the disease.

To determine the appropriate size of the cohort, one needs estimates of the probability of the event at the mean of the predictor and of the probability of the event at the mean plus one standard deviation of the predictor. One also needs a false positive rate (alpha), which is the probability of concluding in favor of an association between predictor and event when there is none, and a false negative rate (beta), which is the probability of concluding there is no association when the risk factor does predict the event. Also needed are the multiple correlation coefficient between the predictor of interest and the other predictors/covariates in the model, and the anticipated drop-out rate from the study.
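As a rough guide to how these inputs combine, the formula is commonly stated as below. This is a sketch from the usual textbook presentation, not a transcription of the calculator's spreadsheet, and the symbols are assumptions introduced here: P is the event probability at the mean of the predictor, theta is the log odds ratio for a one standard deviation increase in the predictor, z denotes standard normal quantiles, and rho is the multiple correlation coefficient. The drop-out adjustment is not shown.

```latex
\[
n = \frac{\left[ z_{1-\alpha} + z_{1-\beta}\, e^{-\theta^{2}/4} \right]^{2} (1 + 2P\delta)}{P\,\theta^{2}},
\qquad
\delta = \frac{1 + (1+\theta^{2})\, e^{5\theta^{2}/4}}{1 + e^{-\theta^{2}/4}},
\qquad
n_{\mathrm{adj}} = \frac{n}{1-\rho^{2}}
\]
```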

The default example in the calculator is from Agresti (2002), where there is an estimated 8% probability of heart disease at the average cholesterol level in the study (about 5 mmol/l) and an estimated increase to a 12% probability of heart disease at the average plus one standard deviation (about 5 + 1.2 = 6.2 mmol/l). Another factor, blood pressure, will be included in the model, and the correlation coefficient between cholesterol and blood pressure is estimated to be 0.4. With this information the calculator computes that a cohort of 729 is needed to demonstrate, with 90% probability, an effect of cholesterol on heart disease when using a one-sided logistic regression hypothesis test at a significance level of 5%. One can enter a correlation coefficient of zero for the case where the predictor is the sole independent variable in the model.
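The example above can be checked with a short Python sketch. It is not the calculator's own implementation: the function name hsieh_sample_size and its parameter names are hypothetical, the formula is the commonly cited form of Hsieh's approximation, and the drop-out adjustment (dividing by one minus the drop-out rate) is an assumption about how the calculator treats that input.

```python
import math
from statistics import NormalDist


def hsieh_sample_size(p_mean, p_plus_sd, alpha=0.05, power=0.90,
                      rho=0.0, dropout=0.0):
    """Approximate cohort size for a one-sided test of a continuous predictor.

    p_mean    : event probability at the mean of the predictor
    p_plus_sd : event probability at the mean plus one standard deviation
    alpha     : one-sided false positive rate
    power     : 1 - beta, where beta is the false negative rate
    rho       : multiple correlation between the predictor and other covariates
    dropout   : anticipated drop-out fraction (adjustment assumed, see lead-in)
    """
    z = NormalDist().inv_cdf  # standard normal quantile function

    # Log odds ratio for a one standard deviation increase in the predictor.
    odds_mean = p_mean / (1 - p_mean)
    odds_plus_sd = p_plus_sd / (1 - p_plus_sd)
    theta = math.log(odds_plus_sd / odds_mean)
    t2 = theta ** 2

    shrink = math.exp(-t2 / 4)
    delta = (1 + (1 + t2) * math.exp(5 * t2 / 4)) / (1 + shrink)

    # Sample size when the predictor is the only variable in the model.
    n = (z(1 - alpha) + z(power) * shrink) ** 2 * (1 + 2 * p_mean * delta)
    n /= p_mean * t2

    # Inflate for correlation with other covariates, then for drop-out.
    n /= 1 - rho ** 2
    n /= 1 - dropout
    return math.ceil(n)


# Cholesterol example: 8% at the mean, 12% at mean + 1 SD, rho = 0.4.
print(hsieh_sample_size(0.08, 0.12, rho=0.4))
```

With the defaults shown (one-sided alpha of 5%, 90% power, no drop-out), this should reproduce the cohort size of 729 quoted above. Setting rho to 0 gives the size for a model with cholesterol as the sole predictor.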

For details on the calculations see the following attachment.

Enter your data in the blue cells of the spreadsheet and the calculations will refresh.

