We will work with linear regression in R. Let's begin by fitting a normal linear regression model to this data.
# Fitting the linear model
# Note: lm() automatically handles missing values by omitting them
model <- lm(Ozone ~ Temp, data = data)
The inference follows from the model summary. Coefficients with a p-value below the chosen significance level are regarded as statistically significant. Here, temperature positively influences the ozone concentration, since its coefficient is positive. The R-squared value denotes the extent to which the model with temperature explains the variation in the data.
# Displaying the results summary
summary(model)
##
## Call:
## lm(formula = Ozone ~ Temp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.729 -17.409 -0.587 11.306 118.271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -146.9955 18.2872 -8.038 9.37e-13 ***
## Temp 2.4287 0.2331 10.418 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.71 on 114 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.4877, Adjusted R-squared: 0.4832
## F-statistic: 108.5 on 1 and 114 DF, p-value: < 2.2e-16
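Beyond reading the printed summary, the same numbers can be pulled out programmatically. A small sketch (refitting the same model on the built-in airquality data so the snippet is self-contained):

```r
# Refit the simple model on the built-in airquality data
model <- lm(Ozone ~ Temp, data = airquality)

coef(model)                   # point estimates: intercept and Temp slope
confint(model, level = 0.95)  # 95% confidence intervals for the coefficients
summary(model)$r.squared      # R-squared as a plain number
```

These accessors are handy when the estimates feed into later computations rather than a report.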
Multiple linear regression has several regressors, and \(\hat{Y}=b_0+b_1X_1+b_2X_2+\ldots+b_pX_p\)
is the estimated regression line. Once again, we can check the
significance of each variable using summary().
Let's build such a model.
# We are cleaning the dataset (mean-imputing missing values) before fitting the model
m <- mean(data$Ozone, na.rm = TRUE)
data$Ozone[is.na(data$Ozone)] <- m
m <- mean(data$Temp, na.rm = TRUE)
data$Temp[is.na(data$Temp)] <- m
m <- mean(data$Wind, na.rm = TRUE)
data$Wind[is.na(data$Wind)] <- m
m <- mean(data$Solar.R, na.rm = TRUE)
data$Solar.R[is.na(data$Solar.R)] <- m
multi_model <- lm(Ozone ~ Solar.R + Wind + Temp, data = data)
Once a model is fit, we must check the residuals (the differences between observed and predicted values) to ensure they meet the standard statistical assumptions.
# Set up a 2x2 grid to view all diagnostic plots at once
par(mfrow = c(2, 2))
plot(multi_model)
# Reset layout
par(mfrow = c(1, 1))
Beyond visual plots, we can statistically test if the residuals are normally distributed. A p-value > 0.05 suggests that we fail to reject the null hypothesis of normality.
# Extract residuals
res <- residuals(multi_model)
# Run Shapiro-Wilk test
shapiro.test(res)
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.90307, p-value = 2.777e-08
The p-value obtained here is less than 0.05, meaning there is significant evidence against the null hypothesis of normality. The normality assumption of the normal linear regression model has failed, so we need to look for other models; GLMs can be applied here.
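Should you want to try one, here is a minimal sketch (not part of the original analysis): Ozone is positive and right-skewed, so a Gamma GLM with a log link is one natural candidate.

```r
# A sketch of one GLM alternative: Gamma family with a log link,
# appropriate for a positive, right-skewed response such as Ozone
data(airquality)
gamma_model <- glm(Ozone ~ Solar.R + Wind + Temp,
                   family = Gamma(link = "log"),
                   data = airquality)  # rows with NAs are dropped automatically
summary(gamma_model)
```

GLMs are covered properly later in most courses; this is only to show that the fitting interface mirrors lm().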
The Jarque-Bera test is a statistical test used to assess whether a given dataset follows a normal distribution based on its skewness and kurtosis.
# Install the tseries package (only need to run this once)
#install.packages("tseries")
# Load the package
library(tseries)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
jb_test_result <- jarque.bera.test(res)
print(jb_test_result)
##
## Jarque Bera Test
##
## data: res
## X-squared = 145.17, df = 2, p-value < 2.2e-16
Again, the JB test gives strong evidence against the null hypothesis.
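For intuition, the JB statistic can be computed by hand from the sample skewness \(S\) and kurtosis \(K\): \(JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)\). A sketch (it uses sd(), which divides by \(n-1\), so the result differs slightly from the tseries value in small samples):

```r
# JB = n/6 * (S^2 + (K - 3)^2 / 4), with S = skewness, K = kurtosis
jb_by_hand <- function(x) {
  n <- length(x)
  z <- (x - mean(x)) / sd(x)  # standardize; sd() divides by n - 1
  S <- mean(z^3)              # sample skewness (0 for a normal)
  K <- mean(z^4)              # sample kurtosis (3 for a normal)
  (n / 6) * (S^2 + (K - 3)^2 / 4)
}
```

Under normality the statistic is approximately chi-squared with 2 degrees of freedom, which is where the printed df = 2 comes from.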
The Pearson chi-square test below is a general goodness-of-fit test used in many places in simulation and statistics.
#install.packages("nortest")
library(nortest)
# Perform Pearson chi-square test for normality
pearson.test(res)
##
## Pearson chi-square normality test
##
## data: res
## P = 24.137, p-value = 0.01949
Again, normality is rejected: strong evidence against the null hypothesis.
### Visualizing the regression line
library(ggplot2)  # required for the plot below
ggplot(data, aes(x = Temp, y = Ozone)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", color = "blue", se = TRUE) +
labs(title = "Linear Regression: Ozone vs Temperature",
subtitle = "Blue line represents the best-fit linear model",
x = "Temperature (F)",
y = "Ozone (ppb)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
VIF quantifies how much the variance (and thus the standard error) of an estimated regression coefficient is “inflated” due to its correlation with other predictors in the model.
While exact limits vary by field, the usual guidelines are:
VIF < 5: generally considered acceptable; multicollinearity is low to moderate.
VIF 5 to 10: indicates noticeable multicollinearity that may require further investigation.
VIF > 10: signifies severe multicollinearity. This makes coefficient estimates unstable, meaning small changes in the data can lead to large, erratic swings in your results.
Use the car package to ensure predictors are not overly correlated.
# You may need to install the car package first: install.packages("car")
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
# Calculate VIF for each predictor
vif(multi_model)
## Solar.R Wind Temp
## 1.089300 1.267492 1.367450
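The same numbers can be reproduced by hand, since the VIF of a predictor is \(1/(1-R^2)\) from regressing that predictor on the remaining ones. A sketch for Temp (computed on complete cases of airquality, so it matches car::vif() only when the model was fit on the same data):

```r
# VIF_Temp = 1 / (1 - R^2) of the auxiliary regression Temp ~ other predictors
cc  <- na.omit(airquality[, c("Solar.R", "Wind", "Temp")])
aux <- lm(Temp ~ Solar.R + Wind, data = cc)
vif_temp <- 1 / (1 - summary(aux)$r.squared)
vif_temp
```

A VIF near 1, as here, means the predictor carries information largely independent of the others.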
Endogeneity arises when the residuals are correlated with one or more independent variables. This violates a key assumption of ordinary least squares (OLS) regression, which requires all independent variables to be exogenous, meaning they are not influenced by the system being studied and are uncorrelated with the error term. Endogeneity typically stems from omitted variables, reverse causality, measurement error, etc. The Durbin-Wu-Hausman (DWH) test can be used to detect it.
Because OLS fails under endogeneity, researchers use alternative methods involving instrumental variables (IV): introducing a new variable (an "instrument") that is correlated with the endogenous regressor but has no direct relationship with the error term or the dependent variable.
To perform the Durbin-Wu-Hausman (DWH) test on your airquality regression model in R, you must first identify a potential instrumental variable. In the airquality dataset, the Month or Day variables are often used as instruments for environmental factors like Solar.R or Temp due to their seasonal nature.
# install.packages("ivreg")
library(ivreg)
# 1. Estimate the IV model
# Using 'Month' as an instrument for 'Solar.R'
iv_model <- ivreg(Ozone ~ Solar.R + Wind + Temp | Month + Wind + Temp,
data = airquality)
# 2. Run diagnostics
summary(iv_model, diagnostics = TRUE)
##
## Call:
## ivreg(formula = Ozone ~ Solar.R + Wind + Temp | Month + Wind +
## Temp, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.3444 -20.2114 0.2776 18.8440 86.0428
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -54.4092 30.7068 -1.772 0.0793 .
## Solar.R 0.2592 0.1365 1.899 0.0603 .
## Wind -3.4628 0.8554 -4.048 9.79e-05 ***
## Temp 1.0672 0.5111 2.088 0.0392 *
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 1 107 5.486 0.021 *
## Wu-Hausman 1 106 3.895 0.051 .
## Sargan 0 NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.54 on 107 degrees of freedom
## Multiple R-Squared: 0.3335, Adjusted R-squared: 0.3148
## Wald test: 32.31 on 3 and 107 DF, p-value: 5.976e-15
The Wu-Hausman row of the summary reports the p-value of the Durbin-Wu-Hausman (DWH) test: 0.051 here. If the p-value is below \(\alpha\), this suggests endogeneity. For an \(\alpha\) of 0.05, the p-value is above the threshold, so we fail to reject the null hypothesis, which means the regressors are exogenous.
The Breusch-Pagan test provides a formal p-value for homoscedasticity. It is available in the lmtest package via the bptest() function. The test has a null hypothesis of homoscedasticity; if p < \(\alpha\), we reject homoscedasticity.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(multi_model)
##
## studentized Breusch-Pagan test
##
## data: multi_model
## BP = 6.3382, df = 3, p-value = 0.09627
At a significance level of 0.05, we fail to reject \(H_0\). Homoscedasticity holds.
If heteroscedasticity is detected, the usual remedies are to apply a log transformation to \(Y\) or to use the weighted least squares (WLS) technique for coefficient estimation. OLS calculates standard errors under the assumption of constant variance; when the variance is not constant, the standard errors become incorrect (usually underestimated), so you may find some variables significant when they are in fact not. The root cause is how data points are weighted: OLS gives every residual equal weight, but if some observations have much higher variance (noise) than others, they are less reliable. OLS ignores this and treats the noisy points with the same importance as the precise ones, which makes the estimates less efficient, meaning they will vary wildly from one sample to another.
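A sketch of both remedies on the airquality data (the two-step variance model used below for the weights is one common heuristic, not the only way to choose them):

```r
cc <- na.omit(airquality)

# Remedy 1: log-transform the response (valid here since Ozone > 0)
log_model <- lm(log(Ozone) ~ Solar.R + Wind + Temp, data = cc)

# Remedy 2: weighted least squares. Estimate how the error spread varies
# with the fitted values, then weight each point by its inverse variance.
ols <- lm(Ozone ~ Solar.R + Wind + Temp, data = cc)
aux <- lm(abs(residuals(ols)) ~ fitted(ols))
w   <- 1 / pmax(fitted(aux), 1e-6)^2  # guard against non-positive fits
wls_model <- lm(Ozone ~ Solar.R + Wind + Temp, data = cc, weights = w)
summary(wls_model)
```

The `weights` argument of lm() does the WLS estimation; the coefficient table then downweights the noisiest observations.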
To demonstrate how linear regression is used for prediction, we follow a three-step process: fitting the model to your existing data, creating new input values, and using the predict() function to generate the estimated outcome.
Let’s split the data into training set and testing set.
# install.packages("rsample")
library(rsample)
set.seed(123) # important for reproducibility
# Create a split object (70% training)
data_split <- initial_split(data, prop = 0.70)
# Extract the sets using dedicated functions
train_data <- training(data_split)
test_data <- testing(data_split)
Now, we have created training and testing datasets. We fit a model on the training set and assess its capabilities using the testing set.
lm_model <- lm(Ozone ~ Temp, data = train_data)
summary(lm_model)
##
## Call:
## lm(formula = Ozone ~ Temp, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.244 -22.962 -7.918 7.805 169.940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -66.4161 29.6852 -2.237 0.027376 *
## Temp 1.4457 0.3787 3.818 0.000228 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.45 on 105 degrees of freedom
## Multiple R-squared: 0.1219, Adjusted R-squared: 0.1135
## F-statistic: 14.58 on 1 and 105 DF, p-value: 0.0002281
The model is now fitted on the training data. The next step is to apply
this model to new data. Now, we use the model to predict Ozone
levels for the test_data (data the model has never seen
before).
test_predictions <- predict(lm_model, newdata = test_data)
# Compare actual vs predicted values
comparison <- data.frame(
Actual = test_data$Ozone,
Predicted = test_predictions
)
cor.test(test_data$Ozone,test_predictions)
##
## Pearson's product-moment correlation
##
## data: test_data$Ozone and test_predictions
## t = 1.9519, df = 44, p-value = 0.05733
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.00871886 0.52922233
## sample estimates:
## cor
## 0.2822937
We perform a simple correlation test to see if the model agrees with the observed data. Note that the 95% CI contains 0 (and the p-value is 0.057), so the correlation between the true and predicted values is not statistically significant at the 5% level. Even a significant correlation would only mean agreement in direction, i.e., one goes up when the other goes up, and correlation measures only the degree of linear association. Let's visualize this through a plot.
ggplot(comparison, aes(x = Actual, y = Predicted)) +
geom_point(color = "darkblue") +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
labs(title = "Actual vs. Predicted Ozone",
subtitle = "Red dashed line represents perfect prediction",
x = "Actual Ozone (ppb)",
y = "Predicted Ozone (ppb)") +
theme_minimal()
From the figure, we can easily see that the predictions are not that good. We could do better.
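Before reaching for a package, the basic error metrics can be computed directly from the actual-vs-predicted pairs. A small sketch (err_metrics is an illustrative helper name, not from any package):

```r
# RMSE and MAE from actual and predicted values
err_metrics <- function(actual, predicted) {
  c(RMSE = sqrt(mean((actual - predicted)^2)),
    MAE  = mean(abs(actual - predicted)))
}
# e.g. err_metrics(comparison$Actual, comparison$Predicted)
```

RMSE penalizes large misses more heavily than MAE, which is why the two can diverge when a few predictions are badly off.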
Here is the easiest way to get model performance measures: a one-line
command from the performance package returns all
relevant measures in one go.
# install.packages("performance")
library(performance)
# Get all indices at once: AIC, BIC, R2, RMSE, Sigma, etc.
#sigma is the residual standard error
model_performance(lm_model,data=test_data)
## # Indices of model performance
##
## AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma
## --------------------------------------------------------------
## 1083.0 | 1083.2 | 1091.0 | 0.122 | 0.114 | 37.100 | 37.452
There are no well-defined acceptance thresholds for these. AIC and
BIC are used to compare two competing models; lower values of AIC and
BIC indicate a better model. Higher values of \(R^2\) and adjusted
\(R^2\) are better. We usually say \(R^2\) suggests unacceptability only when
it is below 0.3, which is the case here for the test data. We see the model has
overfitted the training data and is unable to capture the behaviour of
the test data. In terms of RMSE, it should be small compared to the mean
and SD of your data. For our ozone data (data$Ozone) the mean
is 48.68 and the SD is 41.35, and the RMSE is smaller than both. The range
of the data is about 180, so the RMSE suggests an average prediction
error of approximately 20% (\(\frac{37}{180}\times
100\)). Whether a 20% error is acceptable is context dependent.
Although we carried out all of these operations and tests on the dataset we just explored, we also established that a normal linear model is not a good option, since the response variable is not normally distributed. We have to look for other modelling alternatives.