Polynomial regression

Normal linear models and generalized linear models have limited predictive power since they include only linear terms. Regularization and principal component methods try to improve model performance by simplifying the model; another approach is to relax the assumption of linearity. The first method is to include higher-order powers of the predictor variables so the model approximates the data better; this is called polynomial regression, and we introduce it first. We then consider regression splines, which divide the range of a predictor into regions and fit a separate polynomial in each; the region boundaries are called knots, and the separate functions join at these boundaries. Finally, Generalized Additive Models extend splines and polynomial regression to account for multiple predictors.

Implementing polynomial regression

Consider the built-in dataset mtcars and the relationship between mileage (mpg) and horsepower (hp).

data(mtcars)
plot(mtcars$mpg~mtcars$hp)

Clearly there is a non-linear relationship between mileage and hp. Let’s try to capture that through a non-linear function.

model1<-lm(mpg~poly(hp,2,raw=TRUE),data=mtcars)
summary(model1)
## 
## Call:
## lm(formula = mpg ~ poly(hp, 2, raw = TRUE), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5512 -1.6027 -0.6977  1.5509  8.7213 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               4.041e+01  2.741e+00  14.744 5.23e-15 ***
## poly(hp, 2, raw = TRUE)1 -2.133e-01  3.488e-02  -6.115 1.16e-06 ***
## poly(hp, 2, raw = TRUE)2  4.208e-04  9.844e-05   4.275 0.000189 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.077 on 29 degrees of freedom
## Multiple R-squared:  0.7561, Adjusted R-squared:  0.7393 
## F-statistic: 44.95 on 2 and 29 DF,  p-value: 1.301e-09

The above code fits a quadratic polynomial, yielding the relation \[mileage=40.41-0.2133\times hp+0.00042\times hp^2+\epsilon\]
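As a sanity check on this equation, the fitted values can be rebuilt by hand from the estimated coefficients and compared against what `fitted()` returns:

```r
data(mtcars)
model1 <- lm(mpg ~ poly(hp, 2, raw = TRUE), data = mtcars)

# Reconstruct the fitted curve directly from the coefficients
b <- coef(model1)
manual_fit <- b[1] + b[2] * mtcars$hp + b[3] * mtcars$hp^2
all.equal(unname(manual_fit), unname(fitted(model1)))  # TRUE
```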

Had we used raw=FALSE (the default), we would have obtained another regression equation where the regressors are transformed into orthogonal polynomials, so each coefficient captures an independent contribution: \[mileage=20.09-26.05\times f_1(hp)+13.15\times f_2(hp)+\epsilon\]

The coefficient of \(f_1\) represents the linear trend, and that of \(f_2\) represents the curvature, i.e., the contribution of the quadratic trend. The transformed predictors are also not multicollinear, which makes this form more sensible statistically (the raw-polynomial model inflates standard errors). For interpreting effects on the original scale, however, the raw predictors are easier to read than their orthonormal transformation. In the orthonormal model: the intercept is the actual mean of \(mpg\); the linear term gives the overall linear trend across the entire range of the data; and the quadratic term gives the curvature remaining after the linear trend has been accounted for. Rounding errors due to scaling are also avoided in this form.
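To see the orthogonality concretely, here is a small check (a sketch: `poly()` with the default raw = FALSE returns orthonormal columns, so their cross-product is the identity matrix, and the intercept of the fit equals the sample mean of mpg):

```r
data(mtcars)

# Columns of poly(hp, 2) are orthonormal: crossprod gives the identity
P <- poly(mtcars$hp, 2)
round(crossprod(P), 10)

# In the orthogonal parameterization the intercept is mean(mpg)
model_orth <- lm(mpg ~ poly(hp, 2), data = mtcars)
c(intercept = unname(coef(model_orth)[1]), mean_mpg = mean(mtcars$mpg))
```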

Predictions based on the fitted model

hprange<-range(mtcars$hp)
hp_grid<-seq(from=hprange[1],to=hprange[2])
predicted_mileage<-predict(model1,newdata=list(hp=hp_grid))

Let’s plot these predicted values on the graph.

library(ggplot2)
ggplot()+
  geom_point(aes(x = hp, y = mpg),data=mtcars,color="black")+
  geom_line(aes(x = hp_grid,y=predicted_mileage), color = "red") 

Let’s now consider two more models, one with cubic and one with fourth-order terms, and compare them all for any significant difference.

model2<-lm(mpg~poly(hp,3,raw=T),data=mtcars)
model3<-lm(mpg~poly(hp,4,raw=T),data=mtcars)
model0<-lm(mpg~hp,data=mtcars)
anova(model0,model1,model2,model3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ hp
## Model 2: mpg ~ poly(hp, 2, raw = TRUE)
## Model 3: mpg ~ poly(hp, 3, raw = T)
## Model 4: mpg ~ poly(hp, 4, raw = T)
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 447.67                                   
## 2     29 274.63  1   173.043 17.3352 0.0002868 ***
## 3     28 269.61  1     5.026  0.5035 0.4840465    
## 4     27 269.52  1     0.087  0.0088 0.9261317    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If \(p\text{-}value<\alpha\), the test supports the hypothesis that the two models are statistically different, i.e., that the higher-order term is required. We observe that in our case the linear and quadratic models are statistically different, so the quadratic model is needed. But moving from quadratic to cubic is not: there is no significant advantage in using cubic or higher-order terms.

Splines

Next, we introduce the notion of regression splines. These models divide the range of \(x\) into regions and fit separate polynomial regression models to each region. While polynomials are “global” (changing a point on one end affects the whole curve), splines are “local”—they use different pieces of curves joined at points called knots.

Instead of poly(hp, 2), we use bs(hp, df = ...). The df (degrees of freedom) determines how many “knots” or bends the spline has. By default, the bs() function uses cubic polynomials (degree 3) for every segment. When you change df, you aren’t changing the power (\(x^3\) stays \(x^3\)); you are changing how many “knots” (joints) are used to connect those cubic segments.

library(splines)
model_spline <- lm(mpg ~ bs(hp, df = 5), data = mtcars)
summary(model_spline)
## 
## Call:
## lm(formula = mpg ~ bs(hp, df = 5), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0551 -1.6748 -0.3748  1.3276  9.4184 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       30.616      2.804  10.918  3.3e-11 ***
## bs(hp, df = 5)1   -1.791      5.698  -0.314  0.75581    
## bs(hp, df = 5)2  -11.435      3.148  -3.633  0.00121 ** 
## bs(hp, df = 5)3  -15.946      5.556  -2.870  0.00805 ** 
## bs(hp, df = 5)4  -18.905      5.497  -3.439  0.00198 ** 
## bs(hp, df = 5)5  -15.258      4.268  -3.575  0.00140 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.202 on 26 degrees of freedom
## Multiple R-squared:  0.7633, Adjusted R-squared:  0.7178 
## F-statistic: 16.77 on 5 and 26 DF,  p-value: 2.04e-07

bs(hp, df = 5)1 is not significant (\(p = 0.755\)). This suggests that in the very first segment of horsepower (the lowest HP cars), the spline is flat or doesn’t have a strong local trend. With df = 5, R has placed internal knots at the 33rd and 66th percentiles of your horsepower data. You can actually see these “joints” if you plot the spline:

knots <- attr(bs(mtcars$hp, df = 5), "knots")
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "darkgrey", alpha = 0.5) +
  # The Spline Curve
  stat_smooth(method = "lm", formula = y ~ bs(x, df = 5), color = "blue", linewidth = 1) +
  # Vertical lines representing the "Knots"
  geom_vline(xintercept = knots, linetype = "dashed", color = "darkgreen") +
  annotate("text", x = knots, y = 33, label = "Knot", color = "darkgreen", vjust = -1) +
  labs(title = "Where the Spline 'Joints' Are",
       subtitle = "The curve is a different cubic equation between each green line",
       x = "Horsepower", y = "MPG")

ggplot(mtcars, aes(x = hp, y = mpg)) +
  # 1. The data points
  geom_point(color = "black", size = 1, alpha = 0.6) +
  
  # 2. The original quadratic model
  stat_smooth(method = "lm", 
              formula = y ~ poly(x, 2), 
              aes(color = "Quadratic Polynomial"), 
              se = FALSE, linewidth = 1.2) +
  
  # 3. The regression spline (cubic pieces, df = 5)
  stat_smooth(method = "lm", 
              formula = y ~ bs(x, df = 5), 
              aes(color = "Regression Spline (df=5)"), 
              se = FALSE, linewidth = 1.2) +
  
  # 4. Customizing colors and labels
  scale_color_manual(name = "Model Type", 
                     values = c("Quadratic Polynomial" = "red", 
                                "Regression Spline (df=5)" = "blue")) +
  labs(title = "Model Comparison: Polynomial vs. Spline",
       x = "Horsepower (hp)",
       y = "Miles Per Gallon (mpg)")

Fitting a GAM

GAMs allow us to find the optimal amount of smoothing automatically, rather than choosing the number of knots by hand, and they also let us include other covariates.

library(mgcv)
## Loading required package: nlme
## This is mgcv 1.9-3. For overview type 'help("mgcv-package")'.
# s(hp) tells R to find the optimal smoothness for horsepower
model_gam <- gam(mpg ~ s(hp), data = mtcars)
summary(model_gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## mpg ~ s(hp)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  20.0906     0.5487   36.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##         edf Ref.df     F p-value    
## s(hp) 2.618  3.263 26.26  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.735   Deviance explained = 75.7%
## GCV = 10.862  Scale est. = 9.6335    n = 32

The edf (effective degrees of freedom) value of 2.618 tells us that the best-fit curve is a little more complex than a quadratic (edf = 2) but less complex than a cubic (edf = 3).
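The edf can also be pulled out programmatically (a sketch assuming mgcv’s summary.gam layout, where the smooth-term statistics live in the `s.table` component):

```r
library(mgcv)
data(mtcars)
model_gam <- gam(mpg ~ s(hp), data = mtcars)

# One row per smooth term, with columns edf, Ref.df, F, p-value
summary(model_gam)$s.table
```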

library(ggplot2)

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "darkgrey", size = 1.5) +
  # method = "gam" uses the mgcv logic automatically
  stat_smooth(method = "gam", 
              formula = y ~ s(x), 
              color = "purple") +
  labs(title = "GAM: Automatically Smoothed Relationship")

Adding other predictors

library(gam)
## Loading required package: foreach
## Loaded gam 1.22-7
## 
## Attaching package: 'gam'
## The following objects are masked from 'package:mgcv':
## 
##     gam, gam.control, gam.fit, s
model_gam2 <- gam(mpg ~ s(hp) + s(wt), data = mtcars)
plot.Gam(model_gam2,se=TRUE,col="darkblue")

Let’s compare all models.

anova(model1,model_spline,model_gam2)
## Analysis of Variance Table
## 
## Model 1: mpg ~ poly(hp, 2, raw = TRUE)
## Model 2: mpg ~ bs(hp, df = 5)
## Model 3: mpg ~ s(hp) + s(wt)
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     29 274.63                                   
## 2     26 266.52  3     8.107  0.6045    0.6187    
## 3     23 102.81  3   163.713 12.2080 5.544e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The GAM clearly outperforms both other models, though note that much of the gain comes from adding wt as a second predictor, not from the smoothing alone.
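Because these three models are not nested, the F-test above is only a heuristic; an information criterion such as AIC gives a safer comparison. A minimal sketch, refitting the models (with mgcv’s gam) so the snippet is self-contained:

```r
library(splines)
library(mgcv)
data(mtcars)

m_poly   <- lm(mpg ~ poly(hp, 2, raw = TRUE), data = mtcars)
m_spline <- lm(mpg ~ bs(hp, df = 5), data = mtcars)
m_gam    <- gam(mpg ~ s(hp) + s(wt), data = mtcars)

# Lower AIC = better trade-off between fit and complexity
AIC(m_poly, m_spline, m_gam)
```

The GAM with both predictors should come out with the lowest AIC, agreeing with the anova comparison above.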