# JOURNEY TO THE BOTTOM OF MULTIPLE REGRESSION

By: **Aly Diana**

Multiple regression analysis is a cornerstone technique in scientific research for understanding complex relationships among variables. This powerful statistical method allows us to examine the effects of multiple predictors on a single outcome while controlling for other variables. Although I have practiced multiple regression analysis repeatedly, I must admit that I sometimes forget the simple steps or checks needed to ensure the “final” model is robust. Thus, I have written this to refresh my memory and, hopefully, help others ensure that we conduct this analysis correctly. While this is only a brief overview, I hope it will be helpful.

Objectives of Multivariate Analysis

The main objectives of performing multivariate analysis include identifying relationships, controlling confounding variables, making predictions, and understanding interactions. By determining how multiple independent variables collectively influence the dependent variable, we can isolate the effect of each predictor while controlling for the influence of other variables. This approach also helps in developing models that can predict outcomes based on several predictors and exploring how the effect of one variable on the outcome may depend on the level of another variable.

Step-by-Step Guide to Conducting Multiple Regression Analysis

**Data Preparation**

Before performing multiple regression analysis, it is crucial to prepare your data. First, check for missing data and handle missing values through imputation or exclusion to protect the integrity of the dataset. Identifying and addressing outliers is also essential, as they can disproportionately influence the regression coefficients. The decision to remove or keep outliers should be based on whether the values are plausible within the context of the data. Next, assess multicollinearity using the Variance Inflation Factor (VIF). High VIF values (a common rule of thumb is >10, although some authors use a stricter cutoff of >5) indicate that a predictor is highly correlated with the other predictors, which can inflate standard errors and make it difficult to assess the individual effect of each predictor. To deal with multicollinearity, you can remove one of the correlated variables or combine them if theoretically justifiable.
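To make the VIF check concrete, here is a minimal sketch in Python with NumPy. The choice of Python, the helper name `vif`, and the simulated predictors are all assumptions made for illustration; they are not part of the original workflow. The sketch regresses each predictor on the others and computes 1 / (1 − R²).

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (rows = observations)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Hypothetical data: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs.round(1))  # x1 and x2 should stand out; x3 should be close to 1
```

In this simulated case, dropping either `x1` or `x2` (or combining them) would be the kind of remedy described above.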

**Model Specification**

Specify the model by identifying the dependent variable (Y) and the independent variables (X1, X2, …, Xn). Ensure that your model is theoretically sound and includes all relevant variables. Distinguish between predictors, which are variables hypothesized to affect the outcome, and confounding factors, which influence both the predictors and the outcome. Confounding factors can bias the estimated effect of predictors if not accounted for in the model.

**Assumption Checking**

Multiple regression analysis relies on several assumptions. The relationship between the dependent variable and each independent variable should be linear; otherwise the model misrepresents the true relationship. Observations should be independent of each other, meaning the value of one observation should not influence another. For example, in a study on patients, if measurements are taken from the same patient multiple times, independence is violated, and the standard errors and p-values can no longer be trusted. The variance of the residuals (errors) should be constant across all levels of the independent variables (homoscedasticity); unequal variance distorts the standard errors. Residuals should be approximately normally distributed, which is important for valid hypothesis testing and confidence intervals, particularly in small samples. Use diagnostic plots (e.g., residual plots, Q-Q plots) and statistical tests (e.g., Shapiro-Wilk test for normality) to check these assumptions. Note that the normality assumption applies to the residuals, not to the variables themselves: neither the dependent variable nor the independent variables need to be normally distributed. Moreover, with a large enough sample size, the central limit theorem implies that the sampling distribution of the coefficient estimates approaches normality, so inference is fairly robust to moderate departures from normal residuals.
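The residual checks above are usually done visually, but a rough numeric stand-in can be sketched as follows. This is only an illustration on simulated data, and the two summary numbers (`spread_corr` and `qq_corr`) are informal indicators chosen for this sketch, not formal tests such as Shapiro-Wilk.

```python
from statistics import NormalDist

import numpy as np

# Hypothetical data that satisfies the assumptions by construction.
rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

# Fit by least squares and compute residuals.
design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ beta
resid = y - fitted

# Homoscedasticity: the size of the residuals should not track the fitted
# values, so this correlation should be near 0.
spread_corr = np.corrcoef(np.abs(resid), fitted)[0, 1]

# Normality: a numeric version of a Q-Q plot. Sorted residuals are compared
# with theoretical normal quantiles; a correlation near 1 suggests normality.
probs = (np.arange(1, n + 1) - 0.5) / n
normal_quantiles = np.array([NormalDist().inv_cdf(p) for p in probs])
qq_corr = np.corrcoef(np.sort(resid), normal_quantiles)[0, 1]

print(f"|residual| vs fitted correlation: {spread_corr:.3f}")
print(f"Q-Q correlation: {qq_corr:.3f}")
```

Plots remain the better tool, since they show the shape of any departure rather than a single number.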

**Model Fitting**

Fit the model using statistical software like R, Stata, or SPSS. The general form of the multiple regression equation is:

**Y = β0 + β1X1 + β2X2 + … + βnXn + ϵ**

where β0 is the intercept, β1, β2, …, βn are the coefficients for the independent variables, and ϵ is the error term.
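As an illustration of this equation, the sketch below simulates data with known coefficients and recovers them by ordinary least squares. It uses Python with NumPy rather than the packages named above, which is an assumption made purely for the example; the "true" values of β0, β1, and β2 are invented.

```python
import numpy as np

# Simulate Y = β0 + β1*X1 + β2*X2 + ϵ with hypothetical true values
# β0 = 1, β1 = 2, β2 = -3.
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 2))
eps = rng.normal(scale=0.5, size=n)
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + eps

# A column of ones lets the intercept β0 be estimated with the slopes.
design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

print(beta_hat.round(2))  # estimates of β0, β1, β2, in that order
```

The estimates should land close to the simulated values, with the gap shrinking as the sample size grows.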

**Interpreting Results**

Each coefficient (β) represents the expected change in the dependent variable for a one-unit change in the corresponding independent variable, holding all other variables constant. For example, if β1 = 2, a one-unit increase in X1 is associated with an average 2-unit increase in Y, with the other predictors held fixed.

P-values indicate the statistical significance of each predictor, with a low p-value (typically < 0.05) suggesting that the predictor is significantly associated with the outcome. Adjusted R-squared indicates how much of the variability in the dependent variable the model explains, penalized for the number of predictors in the model.
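The quantities above can be computed by hand, as in the following sketch. The function name `ols_summary` and the simulated data are hypothetical. The p-values use a large-sample normal approximation to the t distribution (an assumption made to stay within NumPy and the standard library), so statistical packages will report slightly different values in small samples.

```python
from statistics import NormalDist

import numpy as np

def ols_summary(X, y):
    """Coefficients, standard errors, approximate p-values, R² and adjusted R²."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    df_resid = n - p - 1
    sigma2 = (resid @ resid) / df_resid
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(design.T @ design)))
    z = beta / se
    # Two-sided p-values from the standard normal (large-sample approximation).
    p_values = np.array([2.0 * (1.0 - NormalDist().cdf(abs(v))) for v in z])
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / df_resid
    return {"coef": beta, "se": se, "p_values": p_values, "r2": r2, "adj_r2": adj_r2}

# Hypothetical data: x1 has a real effect (slope 2), x2 has none.
rng = np.random.default_rng(11)
n = 400
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 + 2.0 * x1 + rng.normal(size=n)

summary = ols_summary(np.column_stack([x1, x2]), y)
for name, value in summary.items():
    print(name, np.round(value, 3))
```

Adjusted R-squared can never exceed R-squared, because the factor (n − 1) / (n − p − 1) is at least 1.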

**Interaction Analysis in a Multivariate Model**

Interaction analysis explores whether the effect of one independent variable on the dependent variable depends on the level of another independent variable. Including interaction terms in your regression model allows you to capture these complex relationships.

**Example: Factors Affecting Late Presentation in People Living with HIV (PLWH)**

Consider a study examining factors affecting late presentation (defined as having a low CD4 count at diagnosis) among PLWH. The factors include the year of diagnosis, reflecting changes in policy and public health interventions over time, and the testing approach, comparing provider-initiated testing and counseling (PITC) versus voluntary counseling and testing (VCT). The interaction between the year of diagnosis and the testing approach explores whether the effectiveness of PITC versus VCT has changed over time. The model can be specified as:

Late Presentation = β0 + β1 (Year at Diagnosis) + β2 (Testing Approach) + β3 (Year at Diagnosis × Testing Approach) + ϵ

Steps for interaction analysis include creating interaction terms by multiplying the independent variables of interest, including the interaction terms in the model along with the main effects, and interpreting the interaction. The coefficient of the interaction term (β3) indicates whether the effect of the testing approach on late presentation changes depending on the year of diagnosis. A significant interaction term suggests a varying impact over time.
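The steps above can be sketched on simulated data. Everything in this example is hypothetical: the variable names `year` and `pitc`, the coefficient values, and the continuous outcome used as a stand-in so that the linear form applies directly. With a binary outcome such as late presentation, logistic regression would typically be used, but the interaction term is built the same way.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000
year = rng.integers(0, 10, size=n).astype(float)  # years since a hypothetical baseline
pitc = rng.integers(0, 2, size=n).astype(float)   # 1 = PITC, 0 = VCT

# Simulated truth: β0 = 5, β1 = -0.2, β2 = 1.0, β3 = -0.3.
y = 5.0 - 0.2 * year + 1.0 * pitc - 0.3 * year * pitc + rng.normal(size=n)

# Step 1: create the interaction term by multiplying the two predictors.
interaction = year * pitc

# Step 2: fit the model with both main effects and the interaction.
design = np.column_stack([np.ones(n), year, pitc, interaction])
b0, b1, b2, b3 = np.linalg.lstsq(design, y, rcond=None)[0]

# Step 3: interpret. The slope for year differs between the two groups by β3.
slope_vct = b1
slope_pitc = b1 + b3
print(f"interaction coefficient: {b3:.2f}")
print(f"year slope, VCT: {slope_vct:.2f}; year slope, PITC: {slope_pitc:.2f}")
```

Keeping both main effects in the model matters here: without them, β3 would absorb part of the main effects and lose its interpretation as a difference in slopes.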

**Adjusted vs. Non-Adjusted Models**

Non-adjusted models include only the primary predictors of interest without accounting for potential confounding variables, providing a crude estimate of the relationships. Adjusted models include additional variables that may confound the relationship between the primary predictors and the outcome. Adjusting for these confounders generally gives a less biased estimate of the effect of the predictors, provided the confounders are measured and modeled appropriately.
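A small simulation can show how far a crude estimate may drift from an adjusted one. The data below are invented for illustration: `z` plays the role of a confounder that influences both the predictor `x` and the outcome `y`, and the true effect of `x` is set to 1.0.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(design, y, rcond=None)[0]

rng = np.random.default_rng(3)
n = 2000
z = rng.normal(size=n)                       # confounder
x = 0.8 * z + rng.normal(size=n)             # predictor, partly driven by z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # true effect of x is 1.0

crude = ols(x, y)[1]                             # non-adjusted: z is omitted
adjusted = ols(np.column_stack([x, z]), y)[1]    # adjusted: z is included

print(f"crude estimate:    {crude:.2f}")
print(f"adjusted estimate: {adjusted:.2f}")
```

In this setup the crude coefficient overstates the effect of `x` because it also picks up the effect of `z`; the adjusted coefficient should sit close to the simulated value of 1.0.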

**Conclusion**

Multiple regression analysis is a versatile tool that, when used correctly, can provide deep insights into complex data. By understanding the assumptions, correctly specifying our model, and incorporating interaction terms, we can uncover intricate relationships within our data. Remember, the key to a successful regression analysis lies in meticulous data preparation, thorough assumption checking, and careful interpretation of results. Best wishes to all of us in our analytical endeavors.

Reference

Field, A. (2018). Discovering statistics using IBM SPSS statistics (5th ed.). Sage Publications.