GLM implementation in R

Most of the tools that we learned in the GLM course are illustrated in this markdown document. To illustrate the implementation of these techniques, we shall use some datasets built into R or create our own. Here we use the esoph dataset. It contains data from a case-control study of (o)esophageal cancer in Ille-et-Vilaine, France, with the variables “agegp” (age group), “alcgp” (alcohol consumption), “tobgp” (tobacco consumption), “ncases” (number of subjects with oesophageal cancer for that covariate pattern), and “ncontrols” (number of controls without oesophageal cancer for that covariate pattern). Except for the numbers of cases and controls, all the variables are qualitative/ordinal. Let’s load it and see what it looks like.

data(esoph)
df<-esoph
head(esoph)
##   agegp     alcgp    tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day      0        40
## 2 25-34 0-39g/day    10-19      0        10
## 3 25-34 0-39g/day    20-29      0         6
## 4 25-34 0-39g/day      30+      0         5
## 5 25-34     40-79 0-9g/day      0        27
## 6 25-34     40-79    10-19      0         7

We also explore the structure of this dataset and its descriptive statistics.

str(df)
## 'data.frame':    88 obs. of  5 variables:
##  $ agegp    : Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ alcgp    : Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ tobgp    : Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ ncases   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ncontrols: num  40 10 6 5 27 7 4 7 2 1 ...
summary(df)
##    agegp          alcgp         tobgp        ncases         ncontrols     
##  25-34:15   0-39g/day:23   0-9g/day:24   Min.   : 0.000   Min.   : 0.000  
##  35-44:15   40-79    :23   10-19   :24   1st Qu.: 0.000   1st Qu.: 1.000  
##  45-54:16   80-119   :21   20-29   :20   Median : 1.000   Median : 4.000  
##  55-64:16   120+     :21   30+     :20   Mean   : 2.273   Mean   : 8.807  
##  65-74:15                                3rd Qu.: 4.000   3rd Qu.:10.000  
##  75+  :11                                Max.   :17.000   Max.   :60.000

Contingency tables

The code for creating a contingency table is as follows. We use the command table with the names of the categorical variables passed as arguments. The command cross-tabulates the count in each cell. We create an \(I\times J\) table to start. This logic can be used to create multidimensional tables too. Let’s create a table.

t<-table(df$agegp,df$alcgp)
t
##        
##         0-39g/day 40-79 80-119 120+
##   25-34         4     4      3    4
##   35-44         4     4      4    3
##   45-54         4     4      4    4
##   55-64         4     4      4    4
##   65-74         4     3      4    4
##   75+           3     4      2    2
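As noted above, the same logic extends to higher dimensions: passing a third factor to table() gives a three-way table, and ftable() flattens it for easier reading. A minimal sketch:

```r
# Assumes df <- esoph as loaded earlier
data(esoph); df <- esoph

# Three-way contingency table: age group x alcohol x tobacco
t3 <- table(df$agegp, df$alcgp, df$tobgp)
dim(t3)  # 6 x 4 x 4

# ftable() flattens the 3-way array into a readable two-way layout
ftable(t3)
```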

We can now calculate marginal totals for this table by using the addmargins() command and passing the table name as the argument. Also, we can convert the table into proportions by using the proportions() command. addmargins(t, 1) appends a row of column sums, and addmargins(t, 2) appends a column of row sums.

print("Table with marginal totals:")
## [1] "Table with marginal totals:"
addmargins(t)
##        
##         0-39g/day 40-79 80-119 120+ Sum
##   25-34         4     4      3    4  15
##   35-44         4     4      4    3  15
##   45-54         4     4      4    4  16
##   55-64         4     4      4    4  16
##   65-74         4     3      4    4  15
##   75+           3     4      2    2  11
##   Sum          23    23     21   21  88
print("Contingency Table with Proportions:")
## [1] "Contingency Table with Proportions:"
proportions(t)
##        
##          0-39g/day      40-79     80-119       120+
##   25-34 0.04545455 0.04545455 0.03409091 0.04545455
##   35-44 0.04545455 0.04545455 0.04545455 0.03409091
##   45-54 0.04545455 0.04545455 0.04545455 0.04545455
##   55-64 0.04545455 0.04545455 0.04545455 0.04545455
##   65-74 0.04545455 0.03409091 0.04545455 0.04545455
##   75+   0.03409091 0.04545455 0.02272727 0.02272727

The proportions() command calculates the joint probability estimates for the table (under a multinomial sampling design, these correspond to the MLE point estimates of the population proportions). To calculate conditional proportions, we use proportions(t, 1) to condition on rows and proportions(t, 2) to condition on columns. Let’s see how this works.

proportions(t,2)
##        
##         0-39g/day     40-79    80-119      120+
##   25-34 0.1739130 0.1739130 0.1428571 0.1904762
##   35-44 0.1739130 0.1739130 0.1904762 0.1428571
##   45-54 0.1739130 0.1739130 0.1904762 0.1904762
##   55-64 0.1739130 0.1739130 0.1904762 0.1904762
##   65-74 0.1739130 0.1304348 0.1904762 0.1904762
##   75+   0.1304348 0.1739130 0.0952381 0.0952381
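For completeness, proportions(t, 1) conditions on rows instead, so each row (age group) sums to one. A quick sketch:

```r
# Assumes t <- table(df$agegp, df$alcgp) as built earlier
data(esoph); df <- esoph
t <- table(df$agegp, df$alcgp)

# Row-conditional proportions: each row sums to 1
pr <- proportions(t, 1)
rowSums(pr)
```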

These numbers look so uniform probably because this is a case-control (retrospective) study in which the numbers of cases and controls were fixed by design. Here we also uncover a serious error in how we built the table: table() counts each row of the data frame as a single occurrence, but in our dataset each row is a covariate pattern with separate case and control frequency columns. Those frequencies, not the row counts, should be the cell counts of the contingency table, so some preprocessing is required before we analyze our data.

Let’s look at one easy way to accomplish this with xtabs(). Here we examine how the case and control counts vary with the categorical variable alcgp. You could add more dimensions with ~ alcgp + agegp + tobgp.

ct<-xtabs(cbind(ncases,ncontrols) ~ alcgp, data = df)
pt<-proportions(ct)
pt
##            
## alcgp           ncases  ncontrols
##   0-39g/day 0.02974359 0.39589744
##   40-79     0.07692308 0.28717949
##   80-119    0.05230769 0.08923077
##   120+      0.04615385 0.02256410
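Extending the formula as suggested above gives case/control counts for every combination of covariates; a sketch using ftable() to keep the three-way output compact:

```r
data(esoph); df <- esoph

# Case/control counts cross-classified by alcohol AND tobacco consumption
ct3 <- xtabs(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = df)
ftable(ct3)
```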

Doing some analysis

Let’s do some analysis using standard table-analysis tools. We check for independence first and then proceed further. The test takes the table of counts as its input.

chisq.test(ct)
## 
##  Pearson's Chi-squared test
## 
## data:  ct
## X-squared = 158.95, df = 3, p-value < 2.2e-16
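Beyond the overall p-value, the returned test object stores the expected counts and standardised residuals, which show which cells drive the association; a sketch:

```r
data(esoph); df <- esoph
ct <- xtabs(cbind(ncases, ncontrols) ~ alcgp, data = df)

res <- chisq.test(ct)
res$expected  # counts expected under independence
res$stdres    # standardised Pearson residuals; |value| > 2 flags a notable cell
```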

The result gives \(p < 0.05\), our usual admissible type-I error threshold. Therefore, we can reject the null hypothesis of independence between cancer and alcohol consumption with very little risk of a type-I error. An alternative is to run several tests together, such as the likelihood-ratio and Pearson \(\chi^2\) tests, using assocstats() from the vcd package. This is illustrated below.

library(vcd)  # provides assocstats()
assocstats(ct)
##                     X^2 df P(> X^2)
## Likelihood Ratio 146.50  3        0
## Pearson          158.95  3        0
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.374 
## Cramer's V        : 0.404

However, the accuracy of these tests is questionable because they rely on asymptotic (large-sample) approximations. Let’s instead use Fisher’s exact test. Although the test was originally described for \(2\times 2\) tables, statistical software can apply it to \(I\times J\) tables too, though this is computationally intensive.

fisher.test(ct)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  ct
## p-value < 2.2e-16
## alternative hypothesis: two.sided
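For tables where the exact computation becomes infeasible, fisher.test() can instead simulate the null distribution with Monte-Carlo resampling (the simulate.p.value argument); a sketch:

```r
data(esoph); df <- esoph
ct <- xtabs(cbind(ncases, ncontrols) ~ alcgp, data = df)

# Monte-Carlo p-value based on B resampled tables,
# avoiding the exact network algorithm
set.seed(1)
fisher.test(ct, simulate.p.value = TRUE, B = 2000)
```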

Fisher’s test produces a similarly extreme p-value, so it is safe to conclude that the variables are dependent. But now we need to figure out the nature of this relationship. Let’s use the trend test, which is part of the Cochran-Mantel-Haenszel (CMH) family of tests; CMHtest() is provided by the vcdExtra package.

library(vcdExtra)  # provides CMHtest()
CMHtest(ct)
## Cochran-Mantel-Haenszel Statistics for alcgp by  
## 
##                  AltHypothesis  Chisq Df       Prob
## cor        Nonzero correlation 152.97  1 3.8818e-35
## rmeans  Row mean scores differ 158.79  3 3.3413e-34
## cmeans  Col mean scores differ 152.97  1 3.8818e-35
## general    General association 158.79  3 3.3413e-34
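The “cor” line above is the 1-df linear-trend statistic. A comparable trend test available in base R is prop.trend.test(), sketched here on the case proportions per alcohol level with its default equally spaced scores:

```r
data(esoph); df <- esoph

# Cases and totals for each alcohol-consumption level
cases  <- tapply(df$ncases, df$alcgp, sum)
totals <- cases + tapply(df$ncontrols, df$alcgp, sum)

# Chi-squared test for a linear trend in proportions (1 df)
prop.trend.test(cases, totals)
```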

Visualizations

Let’s look at one visualization tool that is useful for multidimensional tables: the doubledecker plot, also provided by vcd.

doubledecker(ncases ~ alcgp +tobgp, data=df)

Let’s also look at another plot that captures the odds ratios.

lor <- oddsratio(ct)  # capture log odds ratios
plot(lor, 
     xlab="Alcohol (g/day)", 
     ylab="Log Odds Ratio (Cases)")
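For interpretation, the same kind of log odds ratios can be computed by hand in base R from each adjacent pair of alcohol levels (note the sign convention may differ from vcd’s oddsratio()); a sketch:

```r
data(esoph); df <- esoph
ct <- xtabs(cbind(ncases, ncontrols) ~ alcgp, data = df)

# Log odds ratio from each adjacent 2x2 subtable of alcohol levels
lor_adj <- sapply(1:(nrow(ct) - 1), function(i) {
  m <- ct[i:(i + 1), ]
  log(m[1, 1] * m[2, 2] / (m[1, 2] * m[2, 1]))
})
names(lor_adj) <- paste(rownames(ct)[-nrow(ct)], rownames(ct)[-1], sep = ":")
lor_adj
```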