Linear Regression

We will work with Linear Regression In R.

Let’s fit a Normal Linear Regression Model to this.

# Fitting the linear model
# Note: lm() automatically handles missing values by omitting them
model <- lm(Ozone ~ Temp, data = data)

Inference

The inference follows from the summary of the model. The coefficients with a p-value< (acceptable significance level) are regarded as statistically significant. Here, Temperature positively influences the Ozone concentration, since coefficient is positive. R squared value denotes the extent to which the model with temperature explains the variation in data.

# Displaying the results summary
summary(model)
## 
## Call:
## lm(formula = Ozone ~ Temp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.729 -17.409  -0.587  11.306 118.271 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -146.9955    18.2872  -8.038 9.37e-13 ***
## Temp           2.4287     0.2331  10.418  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.71 on 114 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.4877, Adjusted R-squared:  0.4832 
## F-statistic: 108.5 on 1 and 114 DF,  p-value: < 2.2e-16

Multiple linear Regression

Has multiple regressors. \(\hat{Y}=b_0+b_1X_1+b_2X_2+\ldots+b_pX_p\) is the estimated regressions line. Once again, we can check for significance level of each variable by using summary(). Lets build a model like this.

m<-mean(data$Ozone,na.rm=TRUE)
data$Ozone[is.na(data$Ozone)]<-m
m<-mean(data$Temp,na.rm=TRUE)
data$Ozone[is.na(data$Temp)]<-m
m<-mean(data$Wind,na.rm=TRUE)
data$Ozone[is.na(data$Wind)]<-m
m<-mean(data$Solar.R,na.rm=TRUE)
data$Ozone[is.na(data$Solar.R)]<-m
# We are cleaning the dataaset before fitting the model

multi_model <- lm(Ozone ~ Solar.R + Wind + Temp, data = data)

Regression diagnostics

Diagnostic plots

Once a model is fit, we must check the residuals (the difference between observed and predicted values) to ensure they meet standard statistical assumptions

  • The plot(model) command in R provides four critical graphs: Residuals vs Fitted: Checks for linearity. Points should be randomly scattered around the zero line.
  • Normal Q-Q: Checks if residuals follow a normal distribution. Points should follow the straight diagonal line.
  • Scale-Location: Checks for homoscedasticity (constant variance).
  • Residuals vs Leverage: Identifies influential data points (outliers) that strongly impact the model.
# Set up a 2x2 grid to view all diagnostic plots at once 
par(mfrow = c(2, 2)) 
plot(multi_model)

# Reset layout 
par(mfrow = c(1, 1))

Normality tests

Shapiro-Wilks Test

Beyond visual plots, we can statistically test if the residuals are normally distributed. A p-value > 0.05 suggests that we fail to reject the null hypothesis of normality.

# Extract residuals
res <- residuals(multi_model)

# Run Shapiro-Wilk test
shapiro.test(res)
## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.90307, p-value = 2.777e-08

The p-value that you obtain over here is less than 0.05, meaning that there is significant evidence against the null hypothesis of Normality. The Normality assumption of the Normal Linear Regression has failed. Therefore, we need to look for other models. GLMs can be applied here

Jarque-Bera test

The Jarque-Bera test is a statistical test used to assess whether a given dataset follows a normal distribution based on its skewness and kurtosis.

# Install the tseries package (only need to run this once)
#install.packages("tseries")

# Load the package
library(tseries)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
jb_test_result <- jarque.bera.test(res)
print(jb_test_result)
## 
##  Jarque Bera Test
## 
## data:  res
## X-squared = 145.17, df = 2, p-value < 2.2e-16

Again JB test gives strong evidence against null hypothesis.

Pearsons chi-suqred test for Normality

This is a general goodness of fit test which you use in many places in simulation and statistics.

#install.packages("nortest")
library(nortest)

# Perform Pearson chi-square test for normality
pearson.test(res)
## 
##  Pearson chi-square normality test
## 
## data:  res
## P = 24.137, p-value = 0.01949

Again the Normality is rejected. Strong evidence against Null. ### Visualizing the regression line

ggplot(data, aes(x = Temp, y = Ozone)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", color = "blue", se = TRUE) +
  labs(title = "Linear Regression: Ozone vs Temperature",
       subtitle = "Blue line represents the best-fit linear model",
       x = "Temperature (F)",
       y = "Ozone (ppb)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

VIF for multi collinearity

VIF quantifies how much the variance (and thus the standard error) of an estimated regression coefficient is “inflated” due to its correlation with other predictors in the model.

While exact limits vary by field, usually:

  • VIF < 5: Generally considered acceptable; multicollinearity is low to moderate.

  • VIF > 5 to 10: Indicates noticeable multicollinearity that may require further investigation.

  • VIF > 10: Signifies severe multicollinearity. This makes coefficient estimates unstable, meaning small changes in data can lead to large, erratic swings in your results.

Use the car package to ensure predictors are not overly correlated.

# You may need to install the car package first: install.packages("car")
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
# Calculate VIF for each predictor
vif(multi_model)
##  Solar.R     Wind     Temp 
## 1.089300 1.267492 1.367450

Tests for endogeneity

Endogeneity is a situation that arises when residuals are correlated with one or more independent variables. Durbin-Wu-Hausman (DWH) Test can be used to test for this. This violates a key assumption of Ordinary Least Squares (OLS) regression, which requires all independent variables to be exogenous—meaning they are not influenced by the system being studied and are uncorrelated with the error term. This is primarily due to omitted variables, reverse causality, measurement errors, etc.

Because OLS fails under endogeneity, researchers use alternative methods involving Instrumental Variables (IV) by Introducing a new variable (an “instrument”) that is correlated with the endogenous regressor but has no direct relationship with the error term or the dependent variable.

To perform the Durbin-Wu-Hausman (DWH) test on your airquality regression model in R, you must first identify a potential instrumental variable. In the airquality dataset, the Month or Day variables are often used as instruments for environmental factors like Solar.R or Temp due to their seasonal nature.

# install.packages("ivreg")
library(ivreg)

# 1. Estimate the IV model
# Using 'Month' as an instrument for 'Solar.R'
iv_model <- ivreg(Ozone ~ Solar.R + Wind + Temp | Month + Wind + Temp, 
                  data = airquality)

# 2. Run diagnostics
summary(iv_model, diagnostics = TRUE)
## 
## Call:
## ivreg(formula = Ozone ~ Solar.R + Wind + Temp | Month + Wind + 
##     Temp, data = airquality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -48.3444 -20.2114   0.2776  18.8440  86.0428 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -54.4092    30.7068  -1.772   0.0793 .  
## Solar.R       0.2592     0.1365   1.899   0.0603 .  
## Wind         -3.4628     0.8554  -4.048 9.79e-05 ***
## Temp          1.0672     0.5111   2.088   0.0392 *  
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value  
## Weak instruments   1 107     5.486   0.021 *
## Wu-Hausman         1 106     3.895   0.051 .
## Sargan             0  NA        NA      NA  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.54 on 107 degrees of freedom
## Multiple R-Squared: 0.3335,  Adjusted R-squared: 0.3148 
## Wald test: 32.31 on 3 and 107 DF,  p-value: 5.976e-15

0.051 is the p-value corresponding to the Durbin-Wu-Hausman (DWH) test. This can be found in the Wu-Hausman row in the summary. If p-value< \(\alpha\), then this suggests endogeneity. For an \(\alpha\) of 0.05, the p-value is above this threshold, so we fail to reject null hypothesis. Which means regressors are exogeneous.

Tests for homoscedasticity

The Breusch-Pagan test provides a formal p-value for homoscedasticity. It is available in the lmtest package via the bptest() function. The tests has a null hypothesis that assumes homoscedasticity. If p<\(\alpha\) we reject homoscedasticity.

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(multi_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  multi_model
## BP = 6.3382, df = 3, p-value = 0.09627

At a significance level of 0.05, we fail to reject \(H_0\). Homoscedasticity holds.

If heteroscedasticity is detected, the usual way out is to apply a log transformation on \(Y\) or using the Weighted Least Squares Technique for coefficient estimation. OLS calculates standard errors based on the assumption of constant variance. When variance is not constant, the Standard Errors become incorrect (usually underestimated). As a result, you may often find some variables significant when they are infact not. This is due to the way data points are weighted in OLS. OLS gives each point (residual in fact) equal weightage. However, if some observations have much higher variance (noise) than others, they are less reliable. OLS ignores this and treats the “noisy” data with the same importance as the “precise” data. This makes your estimates less efficient, meaning they will vary wildly from one sample to another.

Linear Regression for prediction

To demonstrate how linear regression is used for prediction, we follow a three-step process: fitting the model to your existing data, creating new input values, and using the predict() function to generate the estimated outcome.

Let’s split the data into training set and testing set.

# install.packages("rsample")
library(rsample)

set.seed(123) # important for reproducibility

# Create a split object (70% training)
data_split <- initial_split(data, prop = 0.70)

# Extract the sets using dedicated functions
train_data <- training(data_split)
test_data  <- testing(data_split)

Now, we have created training and testing datasets. We fit a model on the training set and assess its capabilities using the testing set.

lm_model <- lm(Ozone ~ Temp, data = train_data)

summary(lm_model)
## 
## Call:
## lm(formula = Ozone ~ Temp, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.244 -22.962  -7.918   7.805 169.940 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -66.4161    29.6852  -2.237 0.027376 *  
## Temp          1.4457     0.3787   3.818 0.000228 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.45 on 105 degrees of freedom
## Multiple R-squared:  0.1219, Adjusted R-squared:  0.1135 
## F-statistic: 14.58 on 1 and 105 DF,  p-value: 0.0002281

Model is now fitted on the training data. The next part is to predict using this model to new data. Now, we use the model to predict Ozone levels for the test_data (data the model has never seen before).

test_predictions <- predict(lm_model, newdata = test_data)

# Compare actual vs predicted values
comparison <- data.frame(
  Actual = test_data$Ozone,
  Predicted = test_predictions
)

cor.test(test_data$Ozone,test_predictions)
## 
##  Pearson's product-moment correlation
## 
## data:  test_data$Ozone and test_predictions
## t = 1.9519, df = 44, p-value = 0.05733
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.00871886  0.52922233
## sample estimates:
##       cor 
## 0.2822937

We perform a simple correlation test to see if the model agrees with observed data. We see that 95% CI does not contain 0. This means that the correlation among the true values and the estimated values are significant. However, this only means that they would agree in terms of their directions. i.e., one would go up when another goes up or one would go down when another goes down. Moreover, correlation is the degree of linear association. Let’s visualize this through a plot.

ggplot(comparison, aes(x = Actual, y = Predicted)) +
  geom_point(color = "darkblue") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(title = "Actual vs. Predicted Ozone",
       subtitle = "Red dashed line represents perfect prediction",
       x = "Actual Ozone (ppb)",
       y = "Predicted Ozone (ppb)") +
  theme_minimal()

From the figure, we can easily see that the predictions are not that good. We could do better.

Assessing model performance

Here is the easiest way to get model performance measure. We use a one line command from the performance package to get all relevant performance measures in one go.

# install.packages("performance")
library(performance)

# Get all indices at once: AIC, BIC, R2, RMSE, Sigma, etc.
#sigma is the residual standard error
model_performance(lm_model,data=test_data)
## # Indices of model performance
## 
## AIC    |   AICc |    BIC |    R2 | R2 (adj.) |   RMSE |  Sigma
## --------------------------------------------------------------
## 1083.0 | 1083.2 | 1091.0 | 0.122 |     0.114 | 37.100 | 37.452

There are no well defined acceptance thresholds for these. AIC and BIC are used two compare two competing models. Lower values of AIC and BIC indicate a better model. Higher values of \(R^2\) and \(Adj\ R^2\) is better. We usually say \(R^2\) suggests unacceptability only when <0.3. In our case it is true for the test data. We see the model has overfitted the training data and is unable to capture the behaviour of the test data. In terms of RMSE, it should be small compared to the mean and SD of your data. For our Ozone data data$Ozone the mean is 48.68 and SD is 41.35. The RMSE is smaller than this. Also the range of this data seems is about 180. The RMSE suggests an average prediction error of approx. 20% (\(\frac{37}{180}\times 100\)).It is context dependent whether the 20% error is acceptable.

Summary

For the dataset that we just explored, although we carried out all operations and tests, We also established that a Normal Linear model is not a good option since the response variable is not Normally distributed. We have to look for other modelling alternatives.