Most of the tools covered in the GLM course are illustrated in this markdown document. To demonstrate these techniques, we use datasets built into R or create our own. Here we use the esoph dataset, which comes from a case-control study of (o)esophageal cancer in Ille-et-Vilaine, France. It has the variables “agegp” (age group), “alcgp” (alcohol consumption), “tobgp” (tobacco consumption), “ncases” (number of subjects with esophageal cancer for each covariate pattern), and “ncontrols” (number of controls without esophageal cancer for each covariate pattern). Except for the numbers of cases and controls, all the variables are qualitative/ordinal. Let’s load it and see what it looks like.
data(esoph)
df<-esoph
head(esoph)
## agegp alcgp tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day 0 40
## 2 25-34 0-39g/day 10-19 0 10
## 3 25-34 0-39g/day 20-29 0 6
## 4 25-34 0-39g/day 30+ 0 5
## 5 25-34 40-79 0-9g/day 0 27
## 6 25-34 40-79 10-19 0 7
We also explore the structure of this dataset and its descriptive statistics.
str(df)
## 'data.frame': 88 obs. of 5 variables:
## $ agegp : Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ alcgp : Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
## $ tobgp : Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
## $ ncases : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ncontrols: num 40 10 6 5 27 7 4 7 2 1 ...
summary(df)
## agegp alcgp tobgp ncases ncontrols
## 25-34:15 0-39g/day:23 0-9g/day:24 Min. : 0.000 Min. : 0.000
## 35-44:15 40-79 :23 10-19 :24 1st Qu.: 0.000 1st Qu.: 1.000
## 45-54:16 80-119 :21 20-29 :20 Median : 1.000 Median : 4.000
## 55-64:16 120+ :21 30+ :20 Mean : 2.273 Mean : 8.807
## 65-74:15 3rd Qu.: 4.000 3rd Qu.:10.000
## 75+ :11 Max. :17.000 Max. :60.000
The code for creating a contingency table is as follows. We use the table() command with the names of the categorical variables passed as arguments; it cross-tabulates the count in each cell. We create an \(I\times J\) table to start; the same logic extends to multidimensional tables too. Let’s create a table.
t<-table(df$agegp,df$alcgp)
t
##
## 0-39g/day 40-79 80-119 120+
## 25-34 4 4 3 4
## 35-44 4 4 4 3
## 45-54 4 4 4 4
## 55-64 4 4 4 4
## 65-74 4 3 4 4
## 75+ 3 4 2 2
We can now calculate marginal totals for this table with the addmargins() command, passing the table as the argument. We can also convert the table into proportions with the proportions() command. addmargins(t, 1) appends a Sum row containing the column totals, while addmargins(t, 2) appends a Sum column containing the row totals.
print("Table with marginal totals:")
## [1] "Table with marginal totals:"
addmargins(t)
##
## 0-39g/day 40-79 80-119 120+ Sum
## 25-34 4 4 3 4 15
## 35-44 4 4 4 3 15
## 45-54 4 4 4 4 16
## 55-64 4 4 4 4 16
## 65-74 4 3 4 4 15
## 75+ 3 4 2 2 11
## Sum 23 23 21 21 88
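To get only one set of margins, pass the margin argument explicitly; a quick sketch on the same table:

```r
# Build the same age-by-alcohol table from the built-in esoph data
data(esoph)
t <- table(esoph$agegp, esoph$alcgp)

# margin = 1 sums over the rows, appending a "Sum" row of column totals;
# margin = 2 sums over the columns, appending a "Sum" column of row totals
col_totals <- addmargins(t, 1)
row_totals <- addmargins(t, 2)
col_totals
row_totals
```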
print("Contingency Table with Proportions:")
## [1] "Contingency Table with Proportions:"
proportions(t)
##
## 0-39g/day 40-79 80-119 120+
## 25-34 0.04545455 0.04545455 0.03409091 0.04545455
## 35-44 0.04545455 0.04545455 0.04545455 0.03409091
## 45-54 0.04545455 0.04545455 0.04545455 0.04545455
## 55-64 0.04545455 0.04545455 0.04545455 0.04545455
## 65-74 0.04545455 0.03409091 0.04545455 0.04545455
## 75+ 0.03409091 0.04545455 0.02272727 0.02272727
The proportions() command calculates the joint probability estimates for the table (under a multinomial sampling design, these are the MLE point estimates of the population proportions). To calculate conditional proportions, we use proportions(t, 1) to condition on the rows (each row sums to 1) and proportions(t, 2) to condition on the columns (each column sums to 1). Let’s see how this works.
proportions(t,2)
##
## 0-39g/day 40-79 80-119 120+
## 25-34 0.1739130 0.1739130 0.1428571 0.1904762
## 35-44 0.1739130 0.1739130 0.1904762 0.1428571
## 45-54 0.1739130 0.1739130 0.1904762 0.1904762
## 55-64 0.1739130 0.1739130 0.1904762 0.1904762
## 65-74 0.1739130 0.1304348 0.1904762 0.1904762
## 75+ 0.1304348 0.1739130 0.0952381 0.0952381
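For completeness, the row-conditional version works the same way; each row of the result sums to 1:

```r
data(esoph)
t <- table(esoph$agegp, esoph$alcgp)

# proportions(t, 1) divides each cell by its row total,
# so every row of the result sums to 1
p1 <- proportions(t, 1)
rowSums(p1)   # all equal to 1
```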
The reason these numbers look so uniform is likely that this is a case-control (retrospective) study, where the numbers of cases and controls are fixed in advance. But here we also uncover a serious error in how we transformed the data frame: table() counts each row of the data frame as one occurrence, whereas in our dataset each row is a covariate pattern with separate case and control frequency columns. Those frequencies are the counts that belong in the contingency table, so some preprocessing is required before we analyze the data.
Let’s look at one easy way to accomplish this. Here we examine how the case and control counts vary with the categorical variable alcgp. You could add more dimensions with ~ alcgp + agegp + tobgp.
ct<-xtabs(cbind(ncases,ncontrols) ~ alcgp, data = df)
pt<-proportions(ct)
pt
##
## alcgp ncases ncontrols
## 0-39g/day 0.02974359 0.39589744
## 40-79 0.07692308 0.28717949
## 80-119 0.05230769 0.08923077
## 120+ 0.04615385 0.02256410
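As noted above, the same xtabs() call extends to more covariates; ftable() then flattens the resulting multi-way table for printing. A sketch with alcohol and age:

```r
data(esoph)

# Three-way table: alcohol group x age group x (cases, controls)
ct3 <- xtabs(cbind(ncases, ncontrols) ~ alcgp + agegp, data = esoph)
dim(ct3)      # 4 x 6 x 2
ftable(ct3)   # flat, readable print of the multi-way table
```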
## Doing some analysis
Let’s do some analysis using standard table analysis tools. We check for independence first and then proceed further. The test takes the table of counts as input.
chisq.test(ct)
##
## Pearson's Chi-squared test
##
## data: ct
## X-squared = 158.95, df = 3, p-value < 2.2e-16
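The statistic can be reproduced by hand from \(X^2=\sum_{i,j}(O_{ij}-E_{ij})^2/E_{ij}\), with expected counts under independence \(E_{ij}=n_{i+}n_{+j}/n\); a sketch:

```r
data(esoph)
ct <- xtabs(cbind(ncases, ncontrols) ~ alcgp, data = esoph)

# Expected counts under independence: E_ij = (row total * column total) / n
E  <- outer(rowSums(ct), colSums(ct)) / sum(ct)
X2 <- sum((ct - E)^2 / E)
X2   # matches the X-squared reported by chisq.test (about 158.95)
```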
The result gives \(p < 0.05\), our usual admissible type-I error threshold. Therefore, we can reject the null hypothesis of independence between cancer and alcohol consumption with very little risk of a type-I error. An alternative is to run several tests together, such as the likelihood-ratio and Pearson \(\chi^2\) tests. This is illustrated below.
library(vcd)  # assocstats() is provided by the vcd package
assocstats(ct)
## X^2 df P(> X^2)
## Likelihood Ratio 146.50 3 0
## Pearson 158.95 3 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.374
## Cramer's V : 0.404
However, the accuracy of these tests is questionable, since they rely on large-sample (asymptotic) approximations. Let’s instead use Fisher’s exact test. Although the test was originally described for \(2\times 2\) tables, statistical software extends it to \(I\times J\) tables, though this is computationally intensive.
fisher.test(ct)
##
## Fisher's Exact Test for Count Data
##
## data: ct
## p-value < 2.2e-16
## alternative hypothesis: two.sided
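On a single \(2\times 2\) sub-table, Fisher’s test additionally reports a (conditional MLE) odds-ratio estimate with a confidence interval. For instance, comparing only the lowest and highest alcohol groups (a sketch):

```r
data(esoph)
ct <- xtabs(cbind(ncases, ncontrols) ~ alcgp, data = esoph)

# Keep only the two extreme alcohol groups: a 2x2 sub-table
sub <- ct[c("0-39g/day", "120+"), ]
res <- fisher.test(sub)
res$estimate   # odds of being a case in 0-39g/day vs 120+ (well below 1)
res$p.value
```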
Fisher’s test also produces a highly significant result. Therefore, it is safe to conclude that there is some dependence between the variables. But now we need to figure out the nature of this relationship. Let’s use a trend test, here via the Cochran-Mantel-Haenszel (CMH) family of tests.
library(vcdExtra)  # CMHtest() is provided by the vcdExtra package
CMHtest(ct)
## Cochran-Mantel-Haenszel Statistics for alcgp by
##
## AltHypothesis Chisq Df Prob
## cor Nonzero correlation 152.97 1 3.8818e-35
## rmeans Row mean scores differ 158.79 3 3.3413e-34
## cmeans Col mean scores differ 152.97 1 3.8818e-35
## general General association 158.79 3 3.3413e-34
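Base R offers a closely related linear-trend check in prop.trend.test(): pass the case counts and the group totals, optionally with numeric scores for the ordered categories (the default scores are 1, 2, …). A sketch:

```r
data(esoph)
ct <- xtabs(cbind(ncases, ncontrols) ~ alcgp, data = esoph)

# Cases per alcohol group and total subjects per group
cases  <- ct[, "ncases"]
totals <- rowSums(ct)

# Chi-squared test (1 df) for a linear trend in proportions
prop.trend.test(cases, totals, score = seq_along(cases))
```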
Let’s look at one visualization tool that is useful for multidimensional tables: the doubledecker plot (also from the vcd package).
doubledecker(ncases ~ alcgp +tobgp, data=df)
Let’s also look at another plot, which captures the log odds ratios.
lor <- oddsratio(ct) # capture log odds ratios
plot(lor,
xlab="Alcohol (g/day)",
ylab="Log Odds Ratio (Cases)")
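As a sanity check on the plotted values, the log odds ratio for one pair of adjacent alcohol groups can be computed directly from the cross-product formula (a sketch; vcd may apply a 0.5 continuity correction when a table contains zero cells, which is not needed here):

```r
data(esoph)
ct <- xtabs(cbind(ncases, ncontrols) ~ alcgp, data = esoph)

# Log odds ratio of being a case: 0-39g/day vs 40-79
lor_12 <- log((ct["0-39g/day", "ncases"] * ct["40-79", "ncontrols"]) /
              (ct["0-39g/day", "ncontrols"] * ct["40-79", "ncases"]))
lor_12   # negative: the odds of being a case are lower in the 0-39g/day group
```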