Data Analysis on Wine Data Sets with R
We will apply some methods for supervised and unsupervised analysis to two datasets. These datasets describe red and white variants of the Portuguese vinho verde wine and are available from the UCI ML repository.
Our goal is to characterize the relationship between wine quality and its analytical characteristics.
The output variable is a quality score between 0 (the worst) and 10 (the best) determined by expert evaluation. Since the scores can be treated either as discrete categories or as numeric values, the problem can be formulated as classification or as regression.
We will develop a model of wine quality for each wine variant to determine whether the quality of red and white wine is influenced predominantly by the same or by different analytical properties.
At the end of this tutorial we will combine the analytical data from both variants and look for well-defined clusters.
1: Data Summary
First, we load the required libraries and data sets.
library(leaps)
library(ggplot2)
library(reshape2)
library(MASS)
library(ggcorrplot)
library(plotmo)
library(glmnet)
whiteDat = read.table("winequality-white.csv", sep = ";", header = T)
redDat = read.table("winequality-red.csv", sep = ";", header = T)
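If the CSV files are not already present locally, they can be downloaded first. A minimal sketch, assuming the historical UCI download paths (verify against the dataset page before use):
# Assumed UCI URLs -- adjust if the repository layout has changed
baseUrl = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/"
download.file(paste0(baseUrl, "winequality-white.csv"), "winequality-white.csv")
download.file(paste0(baseUrl, "winequality-red.csv"), "winequality-red.csv")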
Then we create a shortcut function to sort the dataframe columns by the absolute value of their correlation with the outcome.
sortByCorr = function(dataset, refColName) {
# Sort the dataframe columns by the absolute value of their correlation with
# a given column
#
# Args:
# dataset: A vector, matrix, or data frame to sort
# refColName: The name of the reference column for the correlation
#
# Returns:
# The sorted dataframe
refColIdx = grep(refColName, colnames(dataset))
corrTmp = cor(dataset)[, refColIdx]
# Reorder the columns by decreasing absolute correlation with the reference
dataset[, order(abs(corrTmp), decreasing = TRUE)]
}
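As a quick sanity check, we can run it on a built-in dataset (purely illustrative; mtcars is not part of this analysis):
# Columns of mtcars ordered by decreasing |correlation| with mpg
colnames(sortByCorr(dataset = mtcars, refColName = "mpg"))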
1.1: White wine data
1.1.a: Summary
dim(whiteDat)
## [1] 4898 12
sapply(whiteDat, class)
## fixed.acidity volatile.acidity citric.acid
## "numeric" "numeric" "numeric"
## residual.sugar chlorides free.sulfur.dioxide
## "numeric" "numeric" "numeric"
## total.sulfur.dioxide density pH
## "numeric" "numeric" "numeric"
## sulphates alcohol quality
## "numeric" "numeric" "integer"
# We take a look at the data distribution:
head(whiteDat)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.0 0.27 0.36 20.7 0.045
## 2 6.3 0.30 0.34 1.6 0.049
## 3 8.1 0.28 0.40 6.9 0.050
## 4 7.2 0.23 0.32 8.5 0.058
## 5 7.2 0.23 0.32 8.5 0.058
## 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
# Get some metrics about the variables
summary(whiteDat)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
The white wine dataset has 4898 observations, 11 predictors and 1 outcome (quality). All of the predictors are numeric; the outcome is an integer.
The summary statistics show that most variables have a wide range compared to their IQR, which may indicate spread in the data and the presence of outliers.
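One rough way to quantify this, as a sketch: the ratio of each predictor's range to its IQR (large values hint at long tails or outliers).
# Range-to-IQR ratio per predictor; large values suggest outliers
sapply(whiteDat[, 1:11], function(v) diff(range(v)) / IQR(v))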
We investigate further by producing boxplots for each of the variables:
oldpar = par(mfrow = c(2,6))
for ( i in 1:11 ) {
boxplot(whiteDat[[i]])
mtext(names(whiteDat)[i], cex = 0.8, side = 1, line = 2)
}
par(oldpar)
The boxplots show that all variables except alcohol contain outliers.
We now use a scatter plot matrix to get an idea of where the outliers lie:
pairs(whiteDat[, -grep("quality", colnames(whiteDat))])
We see that the outliers seem to lie mostly at the higher end.
Now we look at the distributions of the variables:
oldpar = par(mfrow = c(6,2))
for ( i in 1:12 ) {
truehist(whiteDat[[i]], xlab = names(whiteDat)[i], col = 'lightgreen', main = paste("Average =", signif(mean(whiteDat[[i]]),3)))
}
par(oldpar)
We note that all the variables have positively skewed distributions except quality, which is approximately normally distributed. citric.acid shows a peak at the lower end.
1.1.b: Outlier detection
For each variable, we consider observations that lie more than 1.5 * IQR beyond the quartiles as outliers.
outliers = c()
for ( i in 1:11 ) {
stats = boxplot.stats(whiteDat[[i]])$stats
bottom_outlier_rows = which(whiteDat[[i]] < stats[1])
top_outlier_rows = which(whiteDat[[i]] > stats[5])
outliers = c(outliers , top_outlier_rows[ !top_outlier_rows %in% outliers ] )
outliers = c(outliers , bottom_outlier_rows[ !bottom_outlier_rows %in% outliers ] )
}
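The number of flagged rows can be checked directly (the exact count depends on the data):
# How many observations were flagged by the 1.5 * IQR rule
length(outliers)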
We use Cook's distance to detect influential observations.
mod = lm(quality ~ ., data = whiteDat)
cooksd = cooks.distance(mod)
plot(cooksd, pch = "*", cex = 2, main = "Influential Obs by Cooks distance")
abline(h = 4*mean(cooksd, na.rm = T), col = "red")
head(whiteDat[cooksd > 4 * mean(cooksd, na.rm=T), ])
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 18 6.2 0.66 0.48 1.2 0.029
## 21 6.2 0.66 0.48 1.2 0.029
## 99 9.8 0.36 0.46 10.5 0.038
## 251 5.9 0.21 0.28 4.6 0.053
## 252 8.5 0.26 0.21 16.2 0.074
## 254 5.8 0.24 0.44 3.5 0.029
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 18 29 75 0.9892 3.33 0.39
## 21 29 75 0.9892 3.33 0.39
## 99 4 83 0.9956 2.89 0.30
## 251 40 199 0.9964 3.72 0.70
## 252 41 197 0.9980 3.02 0.50
## 254 5 109 0.9913 3.53 0.43
## alcohol quality
## 18 12.8 8
## 21 12.8 8
## 99 10.1 4
## 251 10.0 4
## 252 9.8 3
## 254 11.7 3
By looking at each row we can find out why it is influential:
- Rows 99 and 252 have very high residual.sugar.
- Rows 18 and 21 have very high free.sulfur.dioxide.
- Row 251 has very high sulphates.
- Rows 254 and 99 have very low free.sulfur.dioxide.
We remove all the outliers in our list from the dataset and create a new set of histograms:
coutliers = as.numeric(rownames(whiteDat[cooksd > 4 * mean(cooksd, na.rm=T), ]))
outliers = c(outliers , coutliers[ !coutliers %in% outliers ] )
cleanWhiteDat = whiteDat[-outliers, ]
oldpar = par(mfrow=c(6,2))
for ( i in 1:12 ) {
truehist(cleanWhiteDat[[i]], xlab = names(cleanWhiteDat)[i], col = 'lightgreen', main = paste("Average =", signif(mean(cleanWhiteDat[[i]]),3)))
}
par(oldpar)
After removing the outliers, the dataset is reduced to 3999 observations.
Now the variables are approximately normally distributed, except for residual.sugar, which is unimodal and skewed to the right. This could be an interesting fact for further analysis.
We now use a scatterplot matrix to roughly determine whether there are linear correlations between our variables:
pairs(cleanWhiteDat, col = cleanWhiteDat$quality, pch = cleanWhiteDat$quality)
pairs(cleanWhiteDat[,c(7, 8, 10, 11)], col = cleanWhiteDat$quality, pch = cleanWhiteDat$quality)
Only the residual.sugar/density and density/alcohol pairs seem to have a linear correlation.
We note a trend with the alcohol variable: the higher the alcohol value, the better the quality.
Conversely, it seems that the lower the density, the better the quality.
1.1.c: Correlation
The following correlation matrix confirms the strong correlation between residual.sugar/density and density/alcohol. It also confirms that alcohol is the variable with the highest correlation with quality. To a lesser degree, density and chlorides also have a significant correlation with quality.
ggcorrplot(cor(cleanWhiteDat), hc.order = TRUE, type = "lower", lab = TRUE, insig = "blank")
colnames(sortByCorr(dataset = cleanWhiteDat, refColName = 'quality'))
## [1] "quality" "alcohol" "density"
## [4] "chlorides" "total.sulfur.dioxide" "residual.sugar"
## [7] "volatile.acidity" "pH" "fixed.acidity"
## [10] "citric.acid" "sulphates" "free.sulfur.dioxide"
1.2: Red wine data
We apply the same procedure to the red wine dataset.
1.2.a: Summary
dim(redDat)
## [1] 1599 12
sapply(redDat, class)
## fixed.acidity volatile.acidity citric.acid
## "numeric" "numeric" "numeric"
## residual.sugar chlorides free.sulfur.dioxide
## "numeric" "numeric" "numeric"
## total.sulfur.dioxide density pH
## "numeric" "numeric" "numeric"
## sulphates alcohol quality
## "numeric" "numeric" "integer"
# We take a look at the data distribution:
head(redDat)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
# Get some metrics about the variables
summary(redDat)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The red wine dataset has 1599 observations, 11 predictors and 1 outcome (quality). The variables are the same as in the white wine dataset. All of the predictors are numeric; the outcome is an integer.
As with the white wine dataset, the summary statistics show that most variables have a wide range compared to their IQR, which may indicate spread in the data and the presence of outliers.
We investigate further by producing boxplots for each of the variables:
oldpar = par(mfrow = c(2,6))
for ( i in 1:11 ) {
boxplot(redDat[[i]])
mtext(names(redDat)[i], cex = 0.8, side = 1, line = 2)
}
par(oldpar)
We can see that all variables, this time even alcohol, contain outliers.
To get an idea of where the outliers lie we can use a scatter plot matrix.
pairs(redDat[, -grep("quality", colnames(redDat))])
We see that the outliers seem to lie at both the lower and upper ends.
Now we look at the distributions of the variables:
oldpar = par(mfrow = c(6,2))
for ( i in 1:12 ) {
truehist(redDat[[i]], xlab = names(redDat)[i], col = 'lightgreen', main = paste("Average =", signif(mean(redDat[[i]]),3)), nbins = 50)
}
par(oldpar)
We note that almost all distributions are positively skewed. Quality, pH and density are approximately normally distributed.
1.2.b: Outlier detection
For each variable, we consider observations that lie more than 1.5 * IQR beyond the quartiles as outliers.
redOutliers = c()
for ( i in 1:11 ) {
stats = boxplot.stats(redDat[[i]])$stats
bottom_outlier_rows = which(redDat[[i]] < stats[1])
top_outlier_rows = which(redDat[[i]] > stats[5])
redOutliers = c(redOutliers , top_outlier_rows[ !top_outlier_rows %in% redOutliers ] )
redOutliers = c(redOutliers , bottom_outlier_rows[ !bottom_outlier_rows %in% redOutliers ] )
}
We use Cook's distance to detect influential observations.
mod = lm(quality ~ ., data = redDat)
cooksd = cooks.distance(mod)
plot(cooksd, pch = "*", cex = 2, main = "Influential Obs by Cooks distance")
abline(h = 4*mean(cooksd, na.rm = T), col = "red")
head(redDat[cooksd > 4 * mean(cooksd, na.rm=T), ])
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 14 7.8 0.610 0.29 1.6 0.114
## 34 6.9 0.605 0.12 10.7 0.073
## 46 4.6 0.520 0.15 2.1 0.054
## 80 8.3 0.625 0.20 1.5 0.080
## 87 8.6 0.490 0.28 1.9 0.110
## 92 8.6 0.490 0.28 1.9 0.110
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 14 9 29 0.9974 3.26 1.56 9.1
## 34 40 83 0.9993 3.45 0.52 9.4
## 46 8 65 0.9934 3.90 0.56 13.1
## 80 27 119 0.9972 3.16 1.12 9.1
## 87 20 136 0.9972 2.93 1.95 9.9
## 92 20 136 0.9972 2.93 1.95 9.9
## quality
## 14 5
## 34 6
## 46 4
## 80 4
## 87 6
## 92 6
By looking at each row we can find out why it is influential:
- Rows 14 and 46 have very low free.sulfur.dioxide.
- Row 34 has very high residual.sugar.
- Row 46 has very high alcohol.
- Rows 87 and 92 have very high sulphates.
We can also inspect residuals-vs-leverage diagnostics (plot 5 of plot.lm) for single-predictor fits on a few variables of interest:
plot(lm(quality~citric.acid, data=redDat), which=c(5))
plot(lm(quality~total.sulfur.dioxide, data=redDat), which=c(5))
plot(lm(quality~sulphates, data=redDat), which=c(5))
plot(lm(quality~alcohol, data=redDat), which=c(5))
We remove all the outliers in our list from the dataset and create new histograms:
coutliers = as.numeric(rownames(redDat[cooksd > 4 * mean(cooksd, na.rm=T), ]))
redOutliers = c(redOutliers , coutliers[ !coutliers %in% redOutliers ] )
cleanRedDat = redDat[-redOutliers, ]
oldpar = par(mfrow=c(6,2))
for ( i in 1:12 ) {
truehist(cleanRedDat[[i]], xlab = names(cleanRedDat)[i], col = 'lightgreen', main = paste("Average =", signif(mean(cleanRedDat[[i]]),3)), nbins = 50)
}
par(oldpar)
After removing the outliers, the dataset is reduced to 1179 observations.
From the above plots, we can see that:
- fixed.acidity has a roughly normal distribution.
- volatile.acidity seems to display a bimodal distribution; log-transforming it makes the distribution approximately normal.
- citric.acid, free.sulfur.dioxide and total.sulfur.dioxide are skewed to the right.
- residual.sugar, chlorides, density and pH have roughly normal distributions.
- alcohol varies from 8.70 to 13.40, with most of the data in the 9-10 range.
Overall, the variable distributions are now much closer to normal, which could be an interesting fact for further analysis.
We now use a scatterplot matrix to roughly determine whether there are linear correlations between our variables:
pairs(cleanRedDat, col = cleanRedDat$quality, pch = cleanRedDat$quality)
pairs(cleanRedDat[,c(1, 8, 9, 10, 11)], col = cleanRedDat$quality, pch = cleanRedDat$quality)
Only the fixed.acidity/density and fixed.acidity/pH pairs seem to have a linear correlation.
We note a trend with the sulphates variable: the higher the sulphates value, the better the quality. The same holds for the alcohol variable.
1.2.c: Correlation
The following correlation matrix confirms the strong correlation between fixed.acidity/density and fixed.acidity/pH. It also reveals the correlations between fixed.acidity/citric.acid, volatile.acidity/citric.acid and total.sulfur.dioxide/free.sulfur.dioxide.
The variables with the highest correlation with quality are alcohol and sulphates.
ggcorrplot(cor(cleanRedDat), hc.order = TRUE, type = "lower", lab = TRUE, insig = "blank")
colnames(sortByCorr(dataset = cleanRedDat, refColName = 'quality'))
## [1] "quality" "alcohol" "sulphates"
## [4] "volatile.acidity" "citric.acid" "total.sulfur.dioxide"
## [7] "density" "chlorides" "fixed.acidity"
## [10] "pH"
1.3: Should we log-transform the data?
Log transformation is widely used to make data conform to normality and/or to reduce their variability. For the white wine, since our data are now close to normally distributed, we don't need to transform them.
In both cases we can see in the following pairwise plots that log-transformation doesn't improve the linearity of the relationships between predictors and outcome.
pairs(log(cleanWhiteDat))
pairs(log(cleanRedDat))
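To illustrate the point on a single predictor, we can compare the raw and log-scale distributions of residual.sugar, the most skewed white wine variable (a sketch):
oldpar = par(mfrow = c(1, 2))
truehist(cleanWhiteDat$residual.sugar, xlab = "residual.sugar", col = "lightgreen")
truehist(log(cleanWhiteDat$residual.sugar), xlab = "log(residual.sugar)", col = "lightgreen")
par(oldpar)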
1.4: Initial model performance
In order to have a point of comparison for the models to come, we fit an initial model for each dataset using all variables:
whiteFit = lm(quality~., cleanWhiteDat)
redFit = lm(quality~., cleanRedDat)
summary(whiteFit)
##
## Call:
## lm(formula = quality ~ ., data = cleanWhiteDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.92980 -0.51195 -0.04107 0.45599 2.45905
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.983e+02 2.702e+01 7.339 2.60e-13 ***
## fixed.acidity 1.521e-01 2.690e-02 5.652 1.69e-08 ***
## volatile.acidity -1.887e+00 1.605e-01 -11.754 < 2e-16 ***
## citric.acid 6.282e-02 1.399e-01 0.449 0.653443
## residual.sugar 9.737e-02 1.011e-02 9.630 < 2e-16 ***
## chlorides -3.589e+00 1.442e+00 -2.489 0.012865 *
## free.sulfur.dioxide 4.415e-03 1.038e-03 4.253 2.15e-05 ***
## total.sulfur.dioxide 2.656e-04 4.372e-04 0.607 0.543577
## density -1.995e+02 2.738e+01 -7.287 3.81e-13 ***
## pH 9.555e-01 1.275e-01 7.494 8.20e-14 ***
## sulphates 6.767e-01 1.243e-01 5.444 5.52e-08 ***
## alcohol 1.272e-01 3.375e-02 3.768 0.000167 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7241 on 3987 degrees of freedom
## Multiple R-squared: 0.2623, Adjusted R-squared: 0.2602
## F-statistic: 128.9 on 11 and 3987 DF, p-value: < 2.2e-16
# Calculate the MSE
mean(whiteFit$residuals^2)
## [1] 0.5226782
summary(redFit)
##
## Call:
## lm(formula = quality ~ ., data = cleanRedDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92124 -0.36701 -0.05888 0.40757 1.87569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.997e+01 2.672e+01 0.747 0.45513
## fixed.acidity 2.427e-02 3.024e-02 0.802 0.42246
## volatile.acidity -8.010e-01 1.461e-01 -5.481 5.18e-08 ***
## citric.acid -2.679e-01 1.657e-01 -1.617 0.10615
## residual.sugar 2.781e-03 4.985e-02 0.056 0.95552
## chlorides -1.778e+00 1.341e+00 -1.327 0.18489
## free.sulfur.dioxide 3.327e-03 2.578e-03 1.290 0.19715
## total.sulfur.dioxide -2.882e-03 9.219e-04 -3.126 0.00181 **
## density -1.602e+01 2.727e+01 -0.588 0.55687
## pH -5.468e-01 2.242e-01 -2.439 0.01489 *
## sulphates 1.657e+00 1.662e-01 9.973 < 2e-16 ***
## alcohol 2.810e-01 3.315e-02 8.478 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5752 on 1167 degrees of freedom
## Multiple R-squared: 0.3947, Adjusted R-squared: 0.389
## F-statistic: 69.17 on 11 and 1167 DF, p-value: < 2.2e-16
# Calculate the MSE
mean(redFit$residuals^2)
## [1] 0.3274646
White wine model:
- The most important variables are fixed.acidity, volatile.acidity, residual.sugar, free.sulfur.dioxide, density, pH, sulphates and alcohol.
- citric.acid and total.sulfur.dioxide don't seem to be relevant in this model.
- Adjusted R2 = 0.2602
- MSE = 0.5226782
Red wine model:
- The most important variables are volatile.acidity, sulphates and alcohol.
- fixed.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide and density don't seem to be relevant in this model.
- Adjusted R2 = 0.389
- MSE = 0.3274646
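To keep these reference numbers handy for the comparisons below, we can collect them in a small table (a sketch built from the fits above):
# Headline metrics for the two baseline models
data.frame(model = c("white", "red"),
adj.r.squared = c(summary(whiteFit)$adj.r.squared, summary(redFit)$adj.r.squared),
mse = c(mean(whiteFit$residuals^2), mean(redFit$residuals^2)))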
2: Choose optimal models by exhaustive, forward and backward selection
2.1: White wine
Using the pre-processed white wine data, let's use regsubsets to choose an optimal set of variables for modeling wine quality with the exhaustive, forward and backward methods:
testRegsubsets = function(dataset, yColName, methods, metrics) {
# Perform the given best-subset selection methods and plot the corresponding
# model metrics and the variable membership for each model
#
# Args:
# dataset: The data frame on which to perform subset selection methods
# yColName: The name of the outcome variable
# methods: Vector. The list of the subset selection methods to perform
# metrics: Vector. The list of metrics to plot
#
# Returns:
# Plot the given model metrics and the variable membership for each model
summaryMetrics = NULL
whichAll = list()
for (myMthd in methods) {
rsRes = regsubsets(dataset[, yColName]~ ., dataset[, -which(names(dataset) %in% yColName)], method = myMthd, nvmax = 11)
summRes = summary(rsRes)
whichAll[[myMthd]] = summRes$which
for (metricName in metrics) {
summaryMetrics = rbind(summaryMetrics,
data.frame(method = myMthd, metric = metricName,
nvars = 1:length(summRes[[metricName]]),
value = summRes[[metricName]]))
}
}
# Plot variable membership for each model
old.par = par(mfrow = c(1,2), ps = 16, mar = c(5,7,2,1))
for (myMthd in names(whichAll)) {
image(1:nrow(whichAll[[myMthd]]),
1:ncol(whichAll[[myMthd]]),
whichAll[[myMthd]], xlab = 'N(vars)', ylab = '',
xaxt = 'n', yaxt = 'n', breaks = c(-0.5, 0.5, 1.5),
col = c('white', 'lightblue1'), main = myMthd)
axis(1, 1:nrow(whichAll[[myMthd]]), rownames(whichAll[[myMthd]]))
axis(2, 1:ncol(whichAll[[myMthd]]), colnames(whichAll[[myMthd]]), las = 2)
}
par(old.par)
# Plot the model metrics
ggplot(summaryMetrics, aes(x = nvars, y = value, shape = method, colour = method)) + geom_path() + geom_point() + facet_wrap(~metric, scales = "free") + theme(legend.position = "top")
}
testRegsubsets(dataset = cleanWhiteDat, yColName = "quality", methods = c("exhaustive", "backward", "forward"), metrics = c("rsq", "rss", "adjr2", "cp", "bic"))
We see that the exhaustive and forward subset selection methods produce models of very comparable performance by every metric, while the backward method performs differently when the number of predictors is less than 6; beyond that, performances are very comparable.
For all methods, increasing the number of variables results in progressive improvement of the fit. The optimal model appears to contain 9 variables. For all methods, the 9-variable model contains density, residual.sugar, volatile.acidity, pH, fixed.acidity, sulphates, free.sulfur.dioxide, alcohol and chlorides.
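To read off the coefficients of that 9-variable model directly, we can rerun regsubsets on the full data (a sketch; rsWhite is a name introduced here for illustration):
rsWhite = regsubsets(quality ~ ., cleanWhiteDat, method = "exhaustive", nvmax = 11)
coef(rsWhite, 9)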
2.2: Red wine
testRegsubsets(dataset = cleanRedDat, yColName = "quality", methods = c("exhaustive", "backward", "forward"), metrics = c("rsq", "rss", "adjr2", "cp", "bic"))
We see that all subset selection methods produce models of very comparable performance by every metric. All metrics except BIC reach their optimum when 6 out of 11 variables are in the model. Since the gain from 5 to 6 variables is very small, it seems safe to say that the best model should contain 5 predictors.
For both the backward and forward methods, the 5-variable model contains alcohol, sulphates, volatile.acidity, total.sulfur.dioxide and pH.
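As for the white wine, the chosen 5-variable red wine model can be extracted directly (a sketch):
rsRed = regsubsets(quality ~ ., cleanRedDat, method = "exhaustive", nvmax = 11)
coef(rsRed, 5)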
2.3: Comparison between red and white wine optimal models
The best models for both red wine and white wine contain alcohol, sulphates, volatile.acidity and pH.
The best model for red wine is the only one to contain total.sulfur.dioxide, while the best model for white wine is the only one to contain density, residual.sugar, fixed.acidity, free.sulfur.dioxide and chlorides.
We now fit the best models found in this section for both datasets:
whiteFit2 = lm(quality~density + residual.sugar + volatile.acidity + pH + fixed.acidity + sulphates + free.sulfur.dioxide + alcohol + chlorides, cleanWhiteDat)
summary(whiteFit2)
##
## Call:
## lm(formula = quality ~ density + residual.sugar + volatile.acidity +
## pH + fixed.acidity + sulphates + free.sulfur.dioxide + alcohol +
## chlorides, data = cleanWhiteDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.92869 -0.51616 -0.04218 0.46081 2.45222
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.928e+02 2.594e+01 7.430 1.32e-13 ***
## density -1.939e+02 2.629e+01 -7.375 1.98e-13 ***
## residual.sugar 9.557e-02 9.810e-03 9.741 < 2e-16 ***
## volatile.acidity -1.872e+00 1.546e-01 -12.110 < 2e-16 ***
## pH 9.430e-01 1.265e-01 7.457 1.08e-13 ***
## fixed.acidity 1.509e-01 2.661e-02 5.670 1.53e-08 ***
## sulphates 6.809e-01 1.241e-01 5.485 4.39e-08 ***
## free.sulfur.dioxide 4.831e-03 8.271e-04 5.840 5.63e-09 ***
## alcohol 1.318e-01 3.320e-02 3.971 7.27e-05 ***
## chlorides -3.505e+00 1.435e+00 -2.442 0.0147 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7239 on 3989 degrees of freedom
## Multiple R-squared: 0.2622, Adjusted R-squared: 0.2605
## F-statistic: 157.5 on 9 and 3989 DF, p-value: < 2.2e-16
# Calculate the MSE
mean(whiteFit2$residuals^2)
## [1] 0.5227566
redFit2 = lm(quality~alcohol + sulphates + volatile.acidity + total.sulfur.dioxide + pH, cleanRedDat)
summary(redFit2)
##
## Call:
## lm(formula = quality ~ alcohol + sulphates + volatile.acidity +
## total.sulfur.dioxide + pH, data = cleanRedDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.99787 -0.35463 -0.07231 0.42725 1.85975
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.608455 0.450967 8.002 2.93e-15 ***
## alcohol 0.300591 0.019112 15.728 < 2e-16 ***
## sulphates 1.634988 0.156711 10.433 < 2e-16 ***
## volatile.acidity -0.693162 0.113788 -6.092 1.51e-09 ***
## total.sulfur.dioxide -0.002519 0.000666 -3.783 0.000163 ***
## pH -0.493500 0.133387 -3.700 0.000226 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.576 on 1173 degrees of freedom
## Multiple R-squared: 0.3899, Adjusted R-squared: 0.3873
## F-statistic: 149.9 on 5 and 1173 DF, p-value: < 2.2e-16
# Calculate the MSE
mean(redFit2$residuals^2)
## [1] 0.3300581
White wine model:
- All the variables in the model are important predictors.
- chlorides is the least relevant explanatory variable.
- Adjusted R2 = 0.2605 (0.2602 in the previous model)
- MSE = 0.5227566 (doesn't improve)
Red wine model:
- All the variables in the model are important predictors.
- Adjusted R2 = 0.3873 (0.389 in the previous model)
- MSE = 0.3300581 (doesn't improve)
Compared to our initial models, performance doesn't improve, but the models are simpler because they contain fewer variables.
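Since each reduced model is nested in the corresponding full model, a nested-model F-test can confirm that dropping the excluded variables does not significantly degrade the fit (a sketch using the fits above):
# Reduced vs. full model: a large p-value supports the simpler model
anova(whiteFit2, whiteFit)
anova(redFit2, redFit)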
3: Optimal model by cross-validation
For each dataset, we will use the bootstrap resampling method to estimate the test error of models with different numbers of variables, and then compare it with the results of the cross-validation method.
predict.regsubsets = function(object, newdata, id, ...) {
# Predict from a regsubsets fit: rebuild the model matrix from the original
# formula and multiply it by the coefficients of the id-variable model
form = as.formula(object$call[[2]])
mat = model.matrix(form, newdata)
coefi = coef(object, id = id)
xvars = names(coefi)
mat[, xvars] %*% coefi
}
# Best subset and bootstrap
resampleMSEregsubsetsWineDat = function(dataset, inpMthd, subsetMthd, nTries = 100) {
if (!inpMthd %in% c("traintest", "bootstrap")) {
stop("Unexpected resampling method!")
}
dfTmp = NULL
whichSum = array(0, dim = c(ncol(dataset)-1, ncol(dataset), length(subsetMthd)), dimnames = list(NULL, colnames(model.matrix(quality~ ., dataset)), subsetMthd))
for (iTry in 1:nTries) {
trainIdx = NULL
if (inpMthd == "traintest") {
trainIdx = sample(nrow(dataset), nrow(dataset)/2)
} else if (inpMthd == "bootstrap") {
trainIdx = sample(nrow(dataset), nrow(dataset), replace = TRUE)
}
for (jSelect in subsetMthd) {
rsTrain = regsubsets(quality~ ., dataset[trainIdx, ], nvmax = ncol(dataset)-1, method = jSelect)
whichSum[,,jSelect] = whichSum[,,jSelect] + summary(rsTrain)$which
for (kVarSet in 1:(ncol(dataset)-1)) {
kCoef = coef(rsTrain, id = kVarSet)
testPred = model.matrix(quality~ ., dataset[-trainIdx, ])[, names(kCoef)] %*% kCoef
mseTest = mean((testPred-dataset[-trainIdx, "quality"])^2)
dfTmp = rbind(dfTmp, data.frame(sim = iTry, sel = jSelect, vars = kVarSet, mse = c(mseTest, summary(rsTrain)$rss[kVarSet]/length(trainIdx)), trainTest = c("test", "train")))
}
}
}
list(mseAll = dfTmp, whichSum = whichSum, nTries = nTries)
}
# Cross-validation training and test error
xvalMSEregsubsetsWinedat = function(dataset, subsetMthd, nTries = 30, kXval = 5) {
retRes = NULL
for (iTry in 1:nTries) {
xvalFolds = sample(rep(1:kXval, length.out = nrow(dataset)))
# Try each method available in regsubsets
# to select the best model of each size:
for (jSelect in subsetMthd) {
mthdTestErr2 = NULL
mthdTrainErr2 = NULL
for (kFold in 1:kXval) {
rsTrain = regsubsets(quality~ ., dataset[xvalFolds != kFold, ], nvmax = 11, method = jSelect)
# Calculate test error for each set of variables
# using predict.regsubsets implemented above:
nVarTestErr2 = NULL
nVarTrainErr2 = NULL
for (kVarSet in 1:11) {
# make predictions for given number of variables and cross-validation fold:
kCoef = coef(rsTrain, id = kVarSet)
testPred = model.matrix(quality~ ., dataset[xvalFolds == kFold, ])[, names(kCoef)] %*% kCoef
nVarTestErr2 = cbind(nVarTestErr2, (testPred-dataset[xvalFolds == kFold, "quality"])^2)
trainPred = model.matrix(quality~ ., dataset[xvalFolds != kFold, ])[, names(kCoef)] %*% kCoef
nVarTrainErr2 = cbind(nVarTrainErr2, (trainPred-dataset[xvalFolds != kFold, "quality"])^2)
}
# accumulate training and test errors over all cross-validation folds:
mthdTestErr2 = rbind(mthdTestErr2, nVarTestErr2)
mthdTrainErr2 = rbind(mthdTrainErr2, nVarTrainErr2)
}
# add to data.frame for future plotting:
retRes = rbind(retRes,
data.frame(sim = iTry, sel = jSelect, vars = 1:ncol(nVarTrainErr2), mse = colMeans(mthdTrainErr2), trainTest = "train"),
data.frame(sim = iTry, sel = jSelect, vars = 1:ncol(nVarTrainErr2), mse = colMeans(mthdTestErr2), trainTest = "test"))
}
}
retRes
}
3.1: White wine
3.1.a: The bootstrap
We resample using the bootstrap; observations not drawn into a given bootstrap sample serve as the test set:
whiteBootRes = resampleMSEregsubsetsWineDat(cleanWhiteDat, "bootstrap", c("exhaustive", "backward", "forward"), 30)
Plot resulting training and test MSE:
ggplot(whiteBootRes$mseAll, aes(x = factor(vars), y = mse, colour = sel)) + geom_boxplot()+facet_wrap(~trainTest)
Test error improves as model size increases up to about 9 variables: the median test MSE of the larger model is comparable to the lower quartile of MSE for the smaller model. Going from 8 to 9 variables may also decrease test MSE slightly on average, although that decrease is small compared to the variability observed across resampling tries.
The test MSEs of the 9- and 10-variable models are very comparable, but it is best to select the model with the fewest variables. This compromise between completeness and performance seems justified, since increasing the model complexity would only slightly improve it.
old.par = par(mfrow = c(1, 2), ps = 16, mar = c(5, 7, 2, 1))
for (myMthd in dimnames(whiteBootRes$whichSum)[[3]]) {
tmpWhich = whiteBootRes$whichSum[,,myMthd] / whiteBootRes$nTries
image(1:nrow(tmpWhich), 1:ncol(tmpWhich), tmpWhich,
xlab = "N(vars)", ylab = "", xaxt = "n", yaxt = "n", main = myMthd,
breaks = c(-0.1, 0.1, 0.25, 0.5, 0.75, 0.9, 1.1),
col = gray(seq(1, 0, length = 6)))
axis(1, 1:nrow(tmpWhich), rownames(tmpWhich))
axis(2, 1:ncol(tmpWhich), colnames(tmpWhich), las = 2)
}
par(old.par)
Plots of average variable membership in the model suggest that:
- citric.acid and total.sulfur.dioxide don't get included until all variables are required to be in the model.
- either alcohol or density is included when only one variable is chosen.
- for models with nine variables, chlorides or fixed.acidity is the last variable in the model.
3.1.b: K-fold Cross-validation
We can compare the bootstrap estimates of the test error with the results of the cross-validation resampling method on the same dataset:
dfTmp = rbind(data.frame(xvalMSEregsubsetsWinedat(cleanWhiteDat, c("exhaustive", "backward", "forward"), 30, kXval = 2), xval = "2-fold"),
data.frame(xvalMSEregsubsetsWinedat(cleanWhiteDat, c("exhaustive", "backward", "forward"), 30, kXval = 5), xval = "5-fold"),
data.frame(xvalMSEregsubsetsWinedat(cleanWhiteDat, c("exhaustive", "backward", "forward"), 30, kXval = 10), xval = "10-fold"))
ggplot(dfTmp, aes(x = factor(vars), y = mse, colour = sel)) + geom_boxplot()+facet_wrap(~trainTest+xval)
The average test error estimates obtained by cross-validation are approximately comparable to those obtained by the bootstrap approach, and are far less variable across multiple rounds of cross-validation, in particular for higher numbers of cross-validation folds.
In conclusion, the best model here contains 9 variables: alcohol, sulphates, pH, density, free.sulfur.dioxide, chlorides, residual.sugar, volatile.acidity and fixed.acidity.
3.2: Red wine
3.2.a: The bootstrap
As for the white wine data, we resample using the bootstrap method and plot the resulting training and test MSE:
redBootRes = resampleMSEregsubsetsWineDat(cleanRedDat, "bootstrap", c("exhaustive", "backward", "forward"), 30)
ggplot(redBootRes$mseAll, aes(x = factor(vars), y = mse, colour = sel)) + geom_boxplot()+facet_wrap(~trainTest)
Test error improves as model size increases up to about 5 variables: the median test MSE of the larger model is comparable to the lower quartile of MSE for the smaller model. Since the test MSEs of models with more than 5 variables are very comparable, the 5-variable model is considered the best.
old.par = par(mfrow = c(1, 2), ps = 16, mar = c(5, 7, 2, 1))
for (myMthd in dimnames(redBootRes$whichSum)[[3]]) {
tmpWhich = redBootRes$whichSum[,,myMthd] / redBootRes$nTries
image(1:nrow(tmpWhich), 1:ncol(tmpWhich), tmpWhich,
xlab = "N(vars)", ylab = "", xaxt = "n", yaxt = "n", main = myMthd,
breaks = c(-0.1, 0.1, 0.25, 0.5, 0.75, 0.9, 1.1),
col = gray(seq(1, 0, length = 6)))
axis(1, 1:nrow(tmpWhich), rownames(tmpWhich))
axis(2, 1:ncol(tmpWhich), colnames(tmpWhich), las = 2)
}
par(old.par)
Plots of average variable membership in the model suggest that:
- residual.sugar and density don't get included until all variables are required to be in the model.
- alcohol is included when only one variable is chosen.
- for models with five variables, pH or total.sulfur.dioxide is the last variable in the model.
3.2.b: K-fold Cross-validation
As previously, we can compare the bootstrap estimates of the test error with the results of the cross-validation resampling method on the same dataset:
dfTmp = rbind(data.frame(xvalMSEregsubsetsWinedat(cleanRedDat, c("exhaustive", "backward", "forward"), 30, kXval = 2), xval = "2-fold"),
data.frame(xvalMSEregsubsetsWinedat(cleanRedDat, c("exhaustive", "backward", "forward"), 30, kXval = 5), xval = "5-fold"),
data.frame(xvalMSEregsubsetsWinedat(cleanRedDat, c("exhaustive", "backward", "forward"), 30, kXval = 10), xval = "10-fold"))
ggplot(dfTmp, aes(x = factor(vars), y = mse, colour = sel)) + geom_boxplot()+facet_wrap(~trainTest+xval)
The same conclusion as for the white wine dataset holds here: the average test error estimates obtained by cross-validation are approximately comparable to those obtained by the bootstrap approach, and are far less variable across multiple rounds of cross-validation, in particular for higher numbers of cross-validation folds.
In conclusion, the best model here contains 5 variables: alcohol, sulphates, pH, total.sulfur.dioxide and volatile.acidity.
3.3: Comparing red and white wine models
The best model for red wine contains only 5 variables while the best model for white wine contains 9. Red wine quality thus seems less complex to explain and predict.
These results are similar to what we found using regsubsets in the previous task.
The resulting white wine model is the same as the one we found in sub-problem 2. It contains the following 9 variables: alcohol, sulphates, pH, density, free.sulfur.dioxide, chlorides, residual.sugar, volatile.acidity and fixed.acidity.
As with the white wine model, the best red wine model here is the same as previously. It contains the following 5 variables: alcohol, sulphates, pH, total.sulfur.dioxide (or free.sulfur.dioxide) and volatile.acidity.
Both contain alcohol, sulphates, pH and volatile.acidity, which may indicate that these variables are the most important predictors of quality, whatever the wine color.
4: Lasso and ridge regression
We now use regularized approaches (lasso and ridge) to model the quality of red and white wine separately, and compare the resulting models (in terms of number of variables and their effects) to those selected in the previous two tasks (by regsubsets and resampling), commenting on their differences and similarities.
4.1: White wine
4.1.a: Lasso regression
x = model.matrix(quality~ ., cleanWhiteDat)[, -1]
y = cleanWhiteDat[, "quality"]
lassoRes = glmnet(x, y, alpha = 1)
plot_glmnet(lassoRes)
We see that as the constraint is relaxed, more attributes show up: alcohol is the first to emerge, followed by volatile.acidity and chlorides, then pH, residual.sugar, sulphates... and finally density. Because the lasso can be viewed as an attribute selection method, the attributes emerging first are the ones that explain most of the outcome. We also note that all the coefficients are positive except those of density, chlorides and volatile.acidity.
cvLassoRes = cv.glmnet(x, y, alpha = 1)
plot(cvLassoRes)
cvLassoRes = cv.glmnet(x, y, alpha = 1, lambda = 10^((-200:20)/80))
plot(cvLassoRes)
This plot of cv.glmnet shows us that as the model complexity decreases, the MSE increases. However, as long as there are at least 7 attributes in the model, the MSE stays low.
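For reference, the two conventional choices of λ can be read from the cross-validation object:
cvLassoRes$lambda.min # lambda minimizing the CV error
cvLassoRes$lambda.1se # largest lambda within one SE of the minimum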
predict(lassoRes, type = "coefficients", s = cvLassoRes$lambda.min)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.206083e+02
## fixed.acidity 8.571880e-02
## volatile.acidity -1.864622e+00
## citric.acid 1.077039e-02
## residual.sugar 6.781266e-02
## chlorides -4.164688e+00
## free.sulfur.dioxide 4.679359e-03
## total.sulfur.dioxide .
## density -1.205164e+02
## pH 6.610149e-01
## sulphates 5.397290e-01
## alcohol 2.115829e-01
The optimal (in the min-λ sense) model by lasso includes all variables except total.sulfur.dioxide.
predict(lassoRes, type = "coefficients", s = cvLassoRes$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.666865609
## fixed.acidity .
## volatile.acidity -1.526877996
## citric.acid .
## residual.sugar 0.013638167
## chlorides -4.054700600
## free.sulfur.dioxide 0.003589056
## total.sulfur.dioxide .
## density .
## pH 0.095449835
## sulphates 0.098144930
## alcohol 0.311437899
Unlike what was seen above, the optimal (in the min+1SE sense) model by lasso includes 7 variables: volatile.acidity, residual.sugar, chlorides, free.sulfur.dioxide, pH, sulphates and alcohol. fixed.acidity, citric.acid, total.sulfur.dioxide and density are left out.
For a better understanding of these results, we use resampling to estimate the test error of lasso models fit to training data, and the stability of the variable selection by lasso across different splits of the data into training and test sets.
dfTmp = NULL
# Get the indices of the predictors selected by lasso (nonzero coefficients,
# intercept excluded, so the indices line up with the data columns)
coefs = predict(lassoRes, type = "coefficients", s = cvLassoRes$lambda.1se)
coefsIdx = which(coefs[-1, 1] != 0)
lassoLnWhiteDat = cleanWhiteDat[, c(coefsIdx, which(names(cleanWhiteDat) == "quality"))]
whichSum = array(0, dim = c(length(coefsIdx), length(coefsIdx) + 1, 3),
dimnames = list(NULL, colnames(model.matrix(quality~ ., lassoLnWhiteDat)),
c("exhaustive", "backward", "forward")))
# Split data into training and test 30 times:
nTries <- 30
for ( iTry in 1:nTries ) {
bTrain = sample(rep(c(TRUE, FALSE), length.out = nrow(lassoLnWhiteDat)))
# Try each method available in regsubsets
# to select the best model of each size:
for ( jSelect in c("exhaustive", "backward", "forward") ) {
rsTrain = regsubsets(quality~ ., lassoLnWhiteDat[bTrain,], nvmax = 11, method = jSelect)
# Add up variable selections:
whichSum[,,jSelect] = whichSum[,,jSelect] + summary(rsTrain)$which
# Calculate test error for each set of variables
# using predict.regsubsets implemented above:
for ( kVarSet in 1:length(coefsIdx) ) {
# make predictions:
testPred = predict(rsTrain, lassoLnWhiteDat[!bTrain,], id = kVarSet)
# calculate MSE:
mseTest = mean((testPred-lassoLnWhiteDat[!bTrain, "quality"])^2)
# add to data.frame for future plotting:
dfTmp = rbind(dfTmp, data.frame(sim = iTry, sel = jSelect, vars = kVarSet,
mse = c(mseTest, summary(rsTrain)$rss[kVarSet]/sum(bTrain)), trainTest = c("test","train")))
}
}
}
# plot MSEs by training/test, number of
# variables and selection method:
ggplot(dfTmp, aes(x = factor(vars), y = mse, colour = sel)) + geom_boxplot()+facet_wrap(~trainTest)
As we can see, the 5-, 6- and 7-variable models perform very similarly.
whiteLassoCoefCnt = 0
whiteLassoMSE = NULL
for (iTry in 1:30) {
bTrain = sample(rep(c(TRUE, FALSE), length.out = nrow(x)))
cvLassoTrain = cv.glmnet(x[bTrain, ], y[bTrain], alpha = 1, lambda = 10^((-120:0)/20))
lassoTrain = glmnet(x[bTrain, ], y[bTrain], alpha = 1, lambda = 10^((-120:0)/20))
lassoTrainCoef = predict(lassoTrain, type = "coefficients", s = cvLassoTrain$lambda.1se)
whiteLassoCoefCnt = whiteLassoCoefCnt + (lassoTrainCoef[-1, 1] != 0)
lassoTestPred = predict(lassoTrain, newx = x[!bTrain, ], s = cvLassoTrain$lambda.1se)
whiteLassoMSE = c(whiteLassoMSE, mean((lassoTestPred-y[!bTrain])^2))
}
mean(whiteLassoMSE)
## [1] 0.5464914
quantile(whiteLassoMSE)
## 0% 25% 50% 75% 100%
## 0.5167767 0.5356218 0.5488897 0.5565241 0.5698861
whiteLassoCoefCnt
## fixed.acidity volatile.acidity citric.acid
## 1 30 1
## residual.sugar chlorides free.sulfur.dioxide
## 30 29 30
## total.sulfur.dioxide density pH
## 0 0 16
## sulphates alcohol
## 15 30
4.1.b: Ridge regression
ridgeRes = glmnet(x, y, alpha = 0)
plot_glmnet(ridgeRes)
We see that as λ decreases, the coefficients grow in magnitude and more attributes become visible on the path (ridge shrinks coefficients but never sets them exactly to zero).
cvRidgeRes = cv.glmnet(x, y, alpha = 0)
plot(cvRidgeRes)
The cv.glmnet plot shows us that the 11-variable model has a low error as long as log(λ) stays below about -1.
cvRidgeRes$lambda.min
## [1] 0.04016908
cvRidgeRes$lambda.1se
## [1] 0.1953263
predict(ridgeRes, type = "coefficients", s = cvRidgeRes$lambda.min)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 6.298621e+01
## fixed.acidity 4.321403e-02
## volatile.acidity -1.762595e+00
## citric.acid 3.957287e-02
## residual.sugar 4.333255e-02
## chlorides -5.739301e+00
## free.sulfur.dioxide 5.289764e-03
## total.sulfur.dioxide -3.744853e-04
## density -6.184207e+01
## pH 4.729349e-01
## sulphates 4.547114e-01
## alcohol 2.545227e-01
predict(ridgeRes, type = "coefficients", s = cvRidgeRes$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 3.650802e+01
## fixed.acidity 1.204128e-02
## volatile.acidity -1.358885e+00
## citric.acid 9.474516e-02
## residual.sugar 2.245641e-02
## chlorides -7.436166e+00
## free.sulfur.dioxide 4.781170e-03
## total.sulfur.dioxide -6.107318e-04
## density -3.381372e+01
## pH 3.076046e-01
## sulphates 3.117719e-01
## alcohol 2.129788e-01
ridgeResScaled = glmnet(scale(x), y, alpha = 0)
cvRidgeResScaled = cv.glmnet(scale(x), y, alpha = 0, lambda = 10^((-50:60)/20))
predict(ridgeResScaled, type = "coefficients", s = cvRidgeResScaled$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 5.950737684
## fixed.acidity 0.014457345
## volatile.acidity -0.115066652
## citric.acid 0.006527448
## residual.sugar 0.139162023
## chlorides -0.069276285
## free.sulfur.dioxide 0.076193500
## total.sulfur.dioxide -0.023026773
## density -0.112759045
## pH 0.048630857
## sulphates 0.035401235
## alcohol 0.283546517
Scaling the predictors makes the coefficient magnitudes directly comparable, and the impact of alcohol and volatile.acidity becomes more apparent.
whiteRidgeCoefCnt = 0
whiteRidgeCoefAve = 0
whiteRidgeMSE = NULL
for ( iTry in 1:30 ) {
bTrain = sample(rep(c(TRUE, FALSE), length.out = nrow(x)))
cvridgeTrain = cv.glmnet(x[bTrain, ], y[bTrain], alpha = 0, lambda = 10^((-50:50)/20))
ridgeTrain = glmnet(x[bTrain, ], y[bTrain], alpha = 0, lambda = 10^((-50:50)/20))
ridgeTrainCoef = predict(ridgeTrain, type = "coefficients", s = cvridgeTrain$lambda.1se)
whiteRidgeCoefCnt = whiteRidgeCoefCnt + (ridgeTrainCoef[-1,1] != 0)
whiteRidgeCoefAve = whiteRidgeCoefAve + ridgeTrainCoef[-1, 1]
ridgeTestPred = predict(ridgeTrain, newx = x[!bTrain, ], s = cvridgeTrain$lambda.1se)
whiteRidgeMSE = c(whiteRidgeMSE, mean((ridgeTestPred-y[!bTrain])^2))
}
whiteRidgeCoefAve = whiteRidgeCoefAve / length(whiteRidgeMSE)
whiteRidgeCoefAve
## fixed.acidity volatile.acidity citric.acid
## 1.021117e-02 -1.317364e+00 8.189876e-02
## residual.sugar chlorides free.sulfur.dioxide
## 2.093080e-02 -7.168219e+00 4.698647e-03
## total.sulfur.dioxide density pH
## -6.438735e-04 -3.344705e+01 3.158749e-01
## sulphates alcohol
## 2.489312e-01 2.049461e-01
mean(whiteRidgeMSE)
## [1] 0.5448221
quantile(whiteRidgeMSE)
## 0% 25% 50% 75% 100%
## 0.5098039 0.5369563 0.5454565 0.5560579 0.5697058
4.2: Red wine
4.2.a: Lasso regression
redX = model.matrix(quality~ ., cleanRedDat)[, -1]
redY = cleanRedDat[, "quality"]
redLassoRes = glmnet(redX, redY, alpha = 1)
plot_glmnet(redLassoRes)
Again, as the constraint is relaxed, more attributes show up: alcohol is still the first to emerge, followed by sulphates and volatile.acidity; density is the last one. We also note that most of the coefficients are negative, the main exceptions being sulphates and alcohol.
redCvLassoRes = cv.glmnet(redX, redY, alpha = 1)
plot(redCvLassoRes)
cvLassoRes = cv.glmnet(redX, redY, alpha = 1, lambda = 10^((-200:20)/80))
plot(cvLassoRes)
This plot of cv.glmnet shows us that as the model complexity decreases, the MSE increases. However, as long as there are at least 5 attributes in the model, the MSE stays low.
predict(redLassoRes, type = "coefficients", s = redCvLassoRes$lambda.min)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 16.035168846
## fixed.acidity 0.017016809
## volatile.acidity -0.776022056
## citric.acid -0.218999533
## residual.sugar .
## chlorides -1.745040572
## free.sulfur.dioxide 0.002978777
## total.sulfur.dioxide -0.002813497
## density -12.035079066
## pH -0.552975064
## sulphates 1.645358247
## alcohol 0.283365046
The optimal (in the min-λ sense) model by lasso includes all variables except residual.sugar.
predict(redLassoRes, type = "coefficients", s = redCvLassoRes$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.6373451145
## fixed.acidity .
## volatile.acidity -0.5771157203
## citric.acid .
## residual.sugar .
## chlorides .
## free.sulfur.dioxide .
## total.sulfur.dioxide -0.0008526568
## density .
## pH -0.0481091426
## sulphates 1.2937924905
## alcohol 0.2595807999
Similarly to what was seen above with regsubsets and resampling, the optimal (in the min+1SE sense) model by lasso includes 5 variables: alcohol, sulphates, volatile.acidity, total.sulfur.dioxide and pH.
We use resampling to estimate the test error of lasso models fit to training data, and the stability of the variable selection by lasso across different splits of the data into training and test sets.
dfTmp = NULL
# Get the indices of the predictors selected by lasso (nonzero coefficients,
# intercept excluded, so the indices line up with the data columns)
redCoefs = predict(redLassoRes, type = "coefficients", s = redCvLassoRes$lambda.1se)
redCoefsIdx = which(redCoefs[-1, 1] != 0)
lassoLnRedDat = cleanRedDat[, c(redCoefsIdx, which(names(cleanRedDat) == "quality"))]
whichSum = array(0, dim = c(length(redCoefsIdx), length(redCoefsIdx) + 1, 3),
dimnames = list(NULL, colnames(model.matrix(quality~ ., lassoLnRedDat)),
c("exhaustive", "backward", "forward")))
# Split data into training and test 30 times:
nTries <- 30
for ( iTry in 1:nTries ) {
bTrain = sample(rep(c(TRUE, FALSE), length.out = nrow(lassoLnRedDat)))
# Try each method available in regsubsets
# to select the best model of each size:
for ( jSelect in c("exhaustive", "backward", "forward") ) {
rsTrain = regsubsets(quality~ ., lassoLnRedDat[bTrain,], nvmax = 11, method = jSelect)
# Add up variable selections:
whichSum[,,jSelect] = whichSum[,,jSelect] + summary(rsTrain)$which
# Calculate test error for each set of variables
# using predict.regsubsets implemented above:
for ( kVarSet in 1:length(redCoefsIdx) ) {
# make predictions:
testPred = predict(rsTrain, lassoLnRedDat[!bTrain,], id = kVarSet)
# calculate MSE:
mseTest = mean((testPred-lassoLnRedDat[!bTrain, "quality"])^2)
# add to data.frame for future plotting:
dfTmp = rbind(dfTmp, data.frame(sim = iTry, sel = jSelect, vars = kVarSet,
mse = c(mseTest, summary(rsTrain)$rss[kVarSet]/sum(bTrain)), trainTest = c("test","train")))
}
}
}
# plot MSEs by training/test, number of
# variables and selection method:
ggplot(dfTmp, aes(x = factor(vars), y = mse, colour = sel)) + geom_boxplot()+facet_wrap(~trainTest)
The 5-variable model is the best one, with the lowest variability.
redLassoCoefCnt = 0
redLassoMSE = NULL
for (iTry in 1:30) {
bTrain = sample(rep(c(TRUE, FALSE), length.out = nrow(redX)))
cvLassoTrain = cv.glmnet(redX[bTrain, ], redY[bTrain], alpha = 1, lambda = 10^((-120:0)/20))
lassoTrain = glmnet(redX[bTrain, ], redY[bTrain], alpha = 1, lambda = 10^((-120:0)/20))
lassoTrainCoef = predict(lassoTrain, type = "coefficients", s = cvLassoTrain$lambda.1se)
redLassoCoefCnt = redLassoCoefCnt + (lassoTrainCoef[-1, 1] != 0)
lassoTestPred = predict(lassoTrain, newx = redX[!bTrain, ], s = cvLassoTrain$lambda.1se)
redLassoMSE = c(redLassoMSE, mean((lassoTestPred-redY[!bTrain])^2))
}
mean(redLassoMSE)
## [1] 0.3511264
quantile(redLassoMSE)
## 0% 25% 50% 75% 100%
## 0.3142218 0.3399837 0.3498090 0.3669581 0.3966957
redLassoCoefCnt
## fixed.acidity volatile.acidity citric.acid
## 3 30 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 3 0
## total.sulfur.dioxide density pH
## 18 0 8
## sulphates alcohol
## 30 30
4.2.b: Ridge regression
redRidgeRes = glmnet(redX, redY, alpha = 0)
plot_glmnet(redRidgeRes)
We see that as λ decreases, the coefficients grow in magnitude and more attributes become visible on the path.
redCvRidgeRes = cv.glmnet(redX, redY, alpha = 0)
plot(redCvRidgeRes)
The cv.glmnet plot shows us that the 11-variable model has a low error as long as log(λ) stays below about 0.
redCvRidgeRes$lambda.min
## [1] 0.0412185
redCvRidgeRes$lambda.1se
## [1] 0.508161
predict(redRidgeRes, type="coefficients", s = redCvRidgeRes$lambda.min)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 35.729725864
## fixed.acidity 0.032107613
## volatile.acidity -0.723939083
## citric.acid -0.146414693
## residual.sugar 0.020209457
## chlorides -1.866020986
## free.sulfur.dioxide 0.003096153
## total.sulfur.dioxide -0.002884088
## density -32.085525788
## pH -0.420103364
## sulphates 1.617287384
## alcohol 0.250677310
predict(redRidgeRes, type="coefficients", s = redCvRidgeRes$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 40.854127773
## fixed.acidity 0.022933819
## volatile.acidity -0.506314907
## citric.acid 0.155144522
## residual.sugar 0.024354939
## chlorides -2.422340931
## free.sulfur.dioxide 0.001213229
## total.sulfur.dioxide -0.002199765
## density -36.878523406
## pH -0.180025866
## sulphates 1.137551541
## alcohol 0.161626564
ridgeResScaled = glmnet(scale(redX), redY, alpha = 0)
cvRidgeResScaled = cv.glmnet(scale(redX), redY, alpha = 0, lambda = 10^((-50:60)/20))
predict(ridgeResScaled, type = "coefficients", s = cvRidgeResScaled$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 5.64546226
## fixed.acidity 0.03479110
## volatile.acidity -0.08526128
## citric.acid 0.02612294
## residual.sugar 0.01123119
## chlorides -0.03435434
## free.sulfur.dioxide 0.01188344
## total.sulfur.dioxide -0.05933244
## density -0.06002607
## pH -0.02527837
## sulphates 0.13715987
## alcohol 0.16219241
Scaling again makes the impact of alcohol and sulphates more apparent.
ridgeCoefCnt = 0
ridgeCoefAve = 0
ridgeMSE = NULL
for ( iTry in 1:30 ) {
bTrain = sample(rep(c(TRUE, FALSE), length.out = nrow(redX)))
cvridgeTrain = cv.glmnet(redX[bTrain, ], redY[bTrain], alpha = 0, lambda = 10^((-50:50)/20))
ridgeTrain = glmnet(redX[bTrain, ], redY[bTrain], alpha = 0, lambda = 10^((-50:50)/20))
ridgeTrainCoef = predict(ridgeTrain, type = "coefficients", s = cvridgeTrain$lambda.1se)
ridgeCoefCnt = ridgeCoefCnt + (ridgeTrainCoef[-1,1] != 0)
ridgeCoefAve = ridgeCoefAve + ridgeTrainCoef[-1, 1]
ridgeTestPred = predict(ridgeTrain, newx = redX[!bTrain, ], s = cvridgeTrain$lambda.1se)
ridgeMSE = c(ridgeMSE, mean((ridgeTestPred-redY[!bTrain])^2))
}
ridgeCoefAve = ridgeCoefAve / length(ridgeMSE)
ridgeCoefAve
## fixed.acidity volatile.acidity citric.acid
## 0.020897851 -0.495097415 0.169620191
## residual.sugar chlorides free.sulfur.dioxide
## 0.026170861 -2.553976376 0.001198722
## total.sulfur.dioxide density pH
## -0.002100249 -36.084894539 -0.178762367
## sulphates alcohol
## 1.113489212 0.157424874
4.3: Conclusion
The use of regularized approaches to model the quality of white wine leads us to a new model containing 7 variables instead of the 9-variable model previously found. This new model contains volatile.acidity, residual.sugar, chlorides, free.sulfur.dioxide, pH, sulphates and alcohol.
density and fixed.acidity are left out.
We fit this new model to compare its performance with the previous ones:
whiteFit4 = lm(quality~volatile.acidity + residual.sugar + chlorides + free.sulfur.dioxide + pH + sulphates + alcohol, cleanWhiteDat)
summary(whiteFit4)
##
## Call:
## lm(formula = quality ~ volatile.acidity + residual.sugar + chlorides +
## free.sulfur.dioxide + pH + sulphates + alcohol, data = cleanWhiteDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9500 -0.5177 -0.0304 0.4434 2.4820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4340237 0.3293309 4.354 1.37e-05 ***
## volatile.acidity -1.9936400 0.1545821 -12.897 < 2e-16 ***
## residual.sugar 0.0263779 0.0028882 9.133 < 2e-16 ***
## chlorides -5.9584457 1.4016860 -4.251 2.18e-05 ***
## free.sulfur.dioxide 0.0049055 0.0008315 5.899 3.95e-09 ***
## pH 0.3071981 0.0860110 3.572 0.000359 ***
## sulphates 0.3980956 0.1187513 3.352 0.000809 ***
## alcohol 0.3577995 0.0128648 27.812 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7287 on 3991 degrees of freedom
## Multiple R-squared: 0.2521, Adjusted R-squared: 0.2508
## F-statistic: 192.2 on 7 and 3991 DF, p-value: < 2.2e-16
# Calculate the MSE
mean(whiteFit4$residuals^2)
## [1] 0.529885
White wine model:
- All the variables in the model are important predictors.
- chlorides is the least relevant explanatory variable.
- Adjusted R2 = 0.2508 (slightly lower than the 0.2605 of the 9-variable model).
- MSE = 0.529885 (slightly higher than before).
Red wine model:
For the red wine, the 5-variable model includes the same variables as in the two previous sub-problems: alcohol, sulphates, pH, total.sulfur.dioxide and volatile.acidity.
sulphates and alcohol appear to be the most influential variables.
5: PCA
We perform a principal components analysis on the scaled and unscaled merged wine data and produce corresponding plots.
cleanWhiteDat["color"] = "grey"
cleanRedDat["color"] = "red"
# Merge the two datasets
mergedWine = rbind(cleanWhiteDat, cleanRedDat)
old.par = par(mfrow=c(2,2),ps=16)
for (bScale in c(FALSE,TRUE)) {
pcaResTmp = prcomp(mergedWine[, -c(12, 13)], scale.=bScale)
biplot(pcaResTmp)
mtext(paste(ifelse(bScale, "scaled", "untransformed"), "data"))
plot(pcaResTmp$x[, 1:2], xlim = range(pcaResTmp$x[, 1])*1.1)
mtext(paste(ifelse(bScale, "scaled", "untransformed"), "data"))
for (iPC in 1:2) {
cat(paste0("Ten largest by absolute value loadings, ", ifelse(bScale, "scaled", "untransformed"), " data -- PC", iPC, ":"), fill = TRUE)
print(pcaResTmp$rotation[order(abs(pcaResTmp$rotation[, iPC]), decreasing = TRUE)[1:10], iPC])
}
cat("PVE by first five PCs (", ifelse(bScale, "scaled","untransformed"), " data):", fill = TRUE, sep = "")
cat(100*pcaResTmp$sdev[1:5]^2 / sum(pcaResTmp$sdev^2), fill = TRUE)
}
## Ten largest by absolute value loadings, untransformed data -- PC1:
## total.sulfur.dioxide free.sulfur.dioxide residual.sugar
## -0.9747293028 -0.2188279196 -0.0440596856
## fixed.acidity alcohol volatile.acidity
## 0.0068663532 0.0050860565 0.0012617925
## pH sulphates citric.acid
## 0.0007336095 0.0006786373 -0.0004936396
## chlorides
## 0.0001504207
## Ten largest by absolute value loadings, untransformed data -- PC2:
## free.sulfur.dioxide total.sulfur.dioxide residual.sugar
## -9.744501e-01 2.206421e-01 -4.140942e-02
## alcohol fixed.acidity volatile.acidity
## -5.260996e-03 4.595047e-03 5.612383e-04
## sulphates pH citric.acid
## -3.386107e-04 -3.014149e-04 1.777537e-04
## density
## -5.192438e-06
## PVE by first five PCs (untransformed data):
## 96.10652 3.354252 0.4713446 0.03957368 0.02660596
## Ten largest by absolute value loadings, scaled data -- PC1:
## total.sulfur.dioxide chlorides volatile.acidity
## -0.4207814 0.4172822 0.4025068
## free.sulfur.dioxide sulphates fixed.acidity
## -0.3710354 0.3027714 0.2673589
## residual.sugar pH citric.acid
## -0.2599425 0.2402709 -0.1857009
## density
## 0.1598898
## Ten largest by absolute value loadings, scaled data -- PC2:
## density alcohol residual.sugar
## -0.58081693 0.51938791 -0.42476884
## chlorides total.sulfur.dioxide free.sulfur.dioxide
## -0.25346298 -0.21569599 -0.19716372
## fixed.acidity pH sulphates
## -0.19481681 0.10332072 -0.08433094
## citric.acid
## -0.06446984
## PVE by first five PCs (scaled data):
## 32.42024 23.09982 13.69431 8.490649 5.56921
par(old.par)
On the scaled version, we can clearly see two clusters.
pcaResTmp = prcomp(mergedWine[, -c(12, 13)], scale.= T)
old.par = par(mfrow=c(1,2))
plot(pcaResTmp$x[, 1:2], xlim = range(pcaResTmp$x[, 1])*1.1, col = mergedWine$quality, main = "Scaled - Display quality")
legend(5.07, 4.44, legend=unique(mergedWine$quality), col=unique(mergedWine$quality), lty=1:2, cex=0.8)
plot(pcaResTmp$x[, 1:2], xlim = range(pcaResTmp$x[, 1])*1.1, col = mergedWine$color, main = "Scaled - Display color")
legend(4.29, 4.44, legend=c("White", "Red"), col=c("grey", "red"), lty=1:2, cex=0.8)
par(old.par)
On the first plot we can observe the distribution of wine quality. There is no strong trend, but it appears that observations of quality 6 are spread everywhere, while observations of quality 7 and 8 tend to be located at the top of each cluster and observations of quality 5 at the bottom.
The second plot shows the distribution of wine color. Clearly, the two clusters identified above represent white and red wine, located to the left and the right respectively.
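The proportion of variance explained by the leading components can also be read from the prcomp summary (a sketch):
# Row 2 of the importance matrix is the proportion of variance explained
summary(pcaResTmp)$importance[2, 1:5]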
6: Model wine quality using principal components
We compute the PCA representation of the red wine data, excluding the wine quality attribute (of course!), and use the resulting principal components (slot x in the output of prcomp) as new predictors to fit a linear model of wine quality. We then compare the resulting fit (in terms of MSE, R-squared, etc.) to those obtained above and comment on the differences and similarities.
pcaRed = prcomp(cleanRedDat[, -c(12, 13)], scale.= T)
fit = lm(cleanRedDat$quality~pcaRed$x[,1]+pcaRed$x[,2]+pcaRed$x[,3]+pcaRed$x[,4]+pcaRed$x[,5]+pcaRed$x[,6]+pcaRed$x[,7]+pcaRed$x[,8]+pcaRed$x[,9]+pcaRed$x[,10]+pcaRed$x[,11])
summary(fit)
##
## Call:
## lm(formula = cleanRedDat$quality ~ pcaRed$x[, 1] + pcaRed$x[,
## 2] + pcaRed$x[, 3] + pcaRed$x[, 4] + pcaRed$x[, 5] + pcaRed$x[,
## 6] + pcaRed$x[, 7] + pcaRed$x[, 8] + pcaRed$x[, 9] + pcaRed$x[,
## 10] + pcaRed$x[, 11])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92124 -0.36701 -0.05888 0.40757 1.87569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.645462 0.016751 337.018 < 2e-16 ***
## pcaRed$x[, 1] 0.048462 0.009749 4.971 7.66e-07 ***
## pcaRed$x[, 2] -0.262717 0.011270 -23.311 < 2e-16 ***
## pcaRed$x[, 3] -0.124273 0.013202 -9.413 < 2e-16 ***
## pcaRed$x[, 4] 0.101262 0.016228 6.240 6.11e-10 ***
## pcaRed$x[, 5] -0.012832 0.017991 -0.713 0.47582
## pcaRed$x[, 6] 0.050962 0.020192 2.524 0.01174 *
## pcaRed$x[, 7] 0.134745 0.021044 6.403 2.20e-10 ***
## pcaRed$x[, 8] -0.001788 0.027009 -0.066 0.94722
## pcaRed$x[, 9] 0.079421 0.028696 2.768 0.00573 **
## pcaRed$x[, 10] -0.114098 0.039794 -2.867 0.00421 **
## pcaRed$x[, 11] 0.081291 0.067025 1.213 0.22543
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5752 on 1167 degrees of freedom
## Multiple R-squared: 0.3947, Adjusted R-squared: 0.389
## F-statistic: 69.17 on 11 and 1167 DF, p-value: < 2.2e-16
# Calculate the MSE
mean(fit$residuals^2)
## [1] 0.3274646
- PC1, PC2, PC3, PC4 and PC7 are the important predictors.
- PC5, PC8 and PC11 are not required to explain the variability in quality.
- Adjusted R2 = 0.389 and MSE = 0.3274646, identical to the initial full red wine model of section 1.4. This is expected: regressing on all 11 principal components is just a rotation of the original predictors, so the fit is unchanged.
- Note that the figures quoted in the previous section (adjusted R2 = 0.2508, MSE = 0.529885) refer to the white wine model, so they are not directly comparable to this red wine fit.