Overview Of The Project

The goal of this blog post is to show my thought process behind building a classification-based model. In the first part of the project I’ll start with Initial Data Analysis (IDA), and in the second part I’ll proceed with Exploratory Data Analysis (EDA). Based on the findings in EDA, I’ll create new features that may potentially provide better results as measured by the performance metric.

About The Data

The data set is based on the “Wine Quality Data Set” from the UCI Machine Learning Repository. There are 4,898 instances, 12 features, and no missing values, and the area of research is business. There are actually two data sets, but for this blog post I’ll use the “White Wine Data Set”, because it has the greater number of instances.

Due to privacy, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). The data set can be viewed as either a classification or a regression task; I’ll use classification, because I want to see how accurately I can predict the quality of wine from the other features.

For more information about the data set, you can visit directly the UCI Machine Learning Repository.

————————————————————————————————————————

Predicting Wine Quality – R

Before proceeding, you could also look at my blog post where I discuss my thought process behind making a template for supervised machine learning models. It is great for beginners who want to understand what goes into a machine learning project, while intermediate and advanced users might pick up a few new tricks for feature engineering or visualization in general.

Initial Data Analysis

Libraries

library(dplyr) #data manipulation
library(tidyr) #data manipulation
library(psych) #descriptive statistics
library(Hmisc) #descriptive statistics

library(purrr)
library(tidyverse)
library(purrrlyr) #data type convertor

library(kableExtra)
library(xtable)

library(ggpubr) #multi plot
library(grid) #multi plot

library(caret) #machine learning

Loading Data Into Environment

winequality_white <- read_delim("winequality-white.csv", 
    ";", escape_double = FALSE, trim_ws = TRUE)

I advise using RStudio Projects if you work with R for machine learning. There are numerous advantages to using a ‘Project’ instead of loose ‘scripts’: a Project makes data and project management much easier, and it simplifies reproducing the code with other team members.

Data Set Overview

One of the preliminary steps of Initial Data Analysis, or IDA, is to understand what the data looks like and whether the ‘reader’ function properly loaded the data into the environment.

#You can as well use glimpse(), str() or summary() function
#The one below is a custom made one

head(winequality_white) %>%
  knitr::kable() %>%
    kable_styling(full_width = F, position = "left") %>%
    row_spec(0, color = "white", background = "#6897BB")

Data Formatting

Put the quality feature in the first position in the table: that immensely increases the readability of the data, and you’ll know at any given point where your target feature is. In addition, create a new categorical feature with only two values: low and high quality wine.

winequality_white <- winequality_white %>%
  select(quality, everything())

Some of the feature names contain a space, and that is not a desirable way to store features in your local environment. When a name has a space between two words, R, Python, or any other programming language can have trouble identifying the feature in a data frame, which may cause malfunctions while loading the data. So I advise preventing that problem by removing the spaces.

oldnames = c("fixed acidity","volatile acidity", "residual sugar", 
"free sulfur dioxide", "total sulfur dioxide", "citric acid")
newnames = c("fixedAcidity","volatileAcidity", "residualSugar", 
"freeSulfurdioxide", "totalSulfurDioxide", "citricAcid")

winequality_white <-winequality_white %>% 
  rename_at(vars(oldnames), ~ newnames) %>%
  mutate(quality = as.factor(ifelse(quality < 6, "low", "high")))

#Alternatively, you can replace the spaces with underscores:
colnames(winequality_white) <- gsub(" ", "_", colnames(winequality_white))

Data Type Check Up

When importing data into the environment, there is a chance that the ‘reader’ package/function might not do the job you hoped for. That’s why it is imperative to check the data types before proceeding with the analysis. There is a great package for categorical data called forcats; it simplifies the whole process of changing the data type of one or more categorical features.
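As a small illustration of forcats (assuming you have it installed; it ships with the tidyverse), here is how you could coerce and reorder a categorical feature:

```r
library(forcats)

x <- c("low", "high", "high", "low")
f <- as_factor(x)           #as_factor() keeps the order of appearance
f <- fct_relevel(f, "high") #make "high" the reference (first) level
levels(f)                   #"high" "low"
```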

#Select data types
data_types <- sapply(winequality_white, class)
data_types <- data.frame(feature = names(data_types), values = data_types, row.names = NULL)

#transpose the training set
values <- t(winequality_white)

#select the new transposed matrix and convert it to data frame class
datamatrix = as.data.frame(values, stringsAsFactors = FALSE)

#bind both data frames 
structure_df <- bind_cols(data_types, datamatrix)

#select only three columns 
structure_dfa <- structure_df[,1:3]

structure_dfa %>%
knitr::kable() %>%
kable_styling(full_width = F, position = "left") %>%
row_spec(0, color = "white", background = "#6897BB", align = "center")

Do note that in R you can use the factor data type directly in your machine learning projects; you don’t have to use one-hot encoding if you don’t want to. R is built in a way that it recognizes the difference between the factor and character data types. If you ever wondered how to change the data type of a certain feature, the example below shows how:

winequality_white <- winequality_white %>%
  mutate_if(sapply(winequality_white, is.integer), as.factor)

#the problem could be also solved like this:
#list the feature or features for which you would like to change data type
col <- c("quality")

#then use purrrlyr's dmap function to change the data type of the feature
winequality_white[, col] <- winequality_white %>%
  select(one_of(col)) %>%
  dmap(as.factor)

NAs Check

#Sometimes values are NA, but recorded under some other marker. 
#Replace some of the common NA markers with real NAs:
na_strings <- c("?", ".", ",", "!", "na")
winequality_white[] <- lapply(winequality_white,
                              function(x) replace(x, x %in% na_strings, NA))

na <- winequality_white %>%
  map_df(function(x) sum(is.na(x))) %>%
  gather(feature, num_nulls) %>%
  arrange(desc(num_nulls)) %>%
  print(n = 10)

There aren’t any NAs in the data set, which is great. However, do note that NAs are sometimes recorded as ‘?’, ‘-’, ‘!’, ‘na’, or some other string. That’s why it is important to take a good look at the data set: visualization plus descriptive statistics can help you identify those kinds of errors, and catching these potential issues early can save a lot of headache later.

I usually check for NAs before the data split, to make sure none of them slip unnoticed into the test set. Check the NA status before the split, but don’t perform extensive exploratory analysis on the whole data set, since that introduces potential bias from the test set, which you want to avoid at all cost. The reason to avoid peeking into the test set is the risk of overfitting.

na %>%
  knitr::kable() %>%
      kable_styling(full_width = F, position = "left") %>%
      row_spec(0, color = "white", background = "#6897BB")

If there were NAs, you could use one of the many R packages for imputing or visualizing them.

Splitting The Data

set.seed(92)
trainIndex <- createDataPartition(winequality_white$quality, p = .7,
                                  list = FALSE,
                                  times = 1)

train <- winequality_white[ trainIndex,]
test <- winequality_white[-trainIndex,]

createDataPartition is a function from the caret package, which is one of my favorite packages in the whole R environment; it is partially equivalent to scikit-learn in Python. It is advisable to use set.seed whenever you split data or fit a model, because it lets you replicate the exact results you obtained.
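The reproducibility point is easy to verify: rerunning the partition with the same seed gives an identical index (a quick sketch, assuming winequality_white is already loaded):

```r
set.seed(92)
idx1 <- createDataPartition(winequality_white$quality, p = .7, list = FALSE, times = 1)

set.seed(92)
idx2 <- createDataPartition(winequality_white$quality, p = .7, list = FALSE, times = 1)

identical(idx1, idx2) #TRUE - the same seed reproduces the exact split
```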

Visualize The Stratified Split

I tried different ways to sample from the main data set. This data set samples best with stratified random sampling, set at 70% for the training set and 30% for the test set. Simple random sampling does not perform as well: since our target feature is quality, it does not produce a constant, training-set-like class distribution at each iteration. You want a training set that resembles the test set and mirrors the data the model will see once put into production. For categorical targets it is advisable to use stratified sampling, so that each category is neither under- nor over-represented.

rbind(data.frame(group = "train", train),
      data.frame(group = "test", test)) %>%
  gather(x, y, quality:sulphates) %>%
  ggplot(aes(x = y, color = group, fill = group)) +
    geom_density(alpha = 0.3) +
    facet_wrap( ~ x, scales = "free", ncol = 3)

Stratified sampling has yielded the results that I was hoping for. The next step is to see if the target feature is separated in the same proportion. For more information on sampling types, go here.

#Since both Hmisc and psych have the same describe function, by using :: you exactly specify
#which package's function would you like to use. 
Hmisc::describe(train$quality)

Value       high   low
Frequency   2281  1148
Proportion  0.66  0.34

Hmisc::describe(test$quality)

Value       high   low
Frequency    977   492
Proportion  0.66  0.34

The quality feature is split in equal proportions in both the training and the test set, which is why stratified sampling is so useful in this particular case: it guarantees that both sets have similar class distributions. That is great when you know that the new data the predictions will be made on is somewhat similar; otherwise, use simple random sampling.
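To see why simple random sampling is less reliable here, you can compare the class proportions it produces against the stratified split (a quick sketch using base R's sample()):

```r
set.seed(92)
randomIndex <- sample(nrow(winequality_white), size = 0.7 * nrow(winequality_white))

#proportions under simple random sampling - they drift from run to run
prop.table(table(winequality_white$quality[randomIndex]))
prop.table(table(winequality_white$quality[-randomIndex]))

#proportions under the stratified split - pinned at ~0.66/0.34 in both sets
prop.table(table(train$quality))
prop.table(table(test$quality))
```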

Exploratory Data Analysis

Descriptive Statistics

Numerical Data Type

The next step is to take all the numerical features and see what characteristics they have. It is a good idea to do this before visualization, since you can immediately spot potential errors from the data wrangling stage. It is interesting to see the range of each feature (the difference between min and max), how skewed it is, and the size of the peak (the kurtosis).
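For a single feature, these statistics boil down to the following (skew() and kurtosi() come from the psych package loaded earlier):

```r
#range, skewness and kurtosis for one feature, computed by hand
c(range    = max(train$alcohol) - min(train$alcohol),
  skewness = psych::skew(train$alcohol),
  kurtosis = psych::kurtosi(train$alcohol))
```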

df_num <- train %>%
  select(-quality) %>%
  summarise_all(funs(min = min, 
                      Q25 = quantile(., 0.25), 
                      median = median, 
                      Q75 = quantile(., 0.75), 
                      max = max,
                      mean = mean, 
                      sd = sd,
                      skewness = skew,
                      kurtosis = kurtosi))

# reshape it using tidyr functions
summary_num_stat <- df_num %>% gather(stat, val) %>%
  separate(stat, into = c("features", "stat"), sep = "_") %>%
  spread(stat, val) %>%
  select(features, min, Q25, median, Q75, max, mean, sd, skewness, kurtosis)

summary_num_stat %>%
  knitr::kable() %>%
    kable_styling(full_width = F, position = "left") %>%
    row_spec(0, color = "white", background = "#6897BB", align = "right")

It appears that freeSulfurdioxide, residualSugar and totalSulfurDioxide have quite a large range of values (the range is computed as max minus min). Interestingly enough, those are also the features that are relatively right-skewed. None of the features have negative values, and that is important because it means we can use the Box-Cox transformation to further ‘normalize’ the data; if there were negative values, we would have to use the Yeo–Johnson transformation instead. Most of the features are manageable, except chlorides, which has lots of outliers. It would be interesting to see how those outliers relate to low and high quality wine.
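A minimal sketch of that transformation with caret's preProcess, using two of the strictly positive, right-skewed features (the Box-Cox lambda is estimated per feature from the training data):

```r
#estimate Box-Cox transformations on the training set only
pp <- preProcess(train[, c("residualSugar", "freeSulfurdioxide")],
                 method = "BoxCox")

#apply the transformation; skewness should move much closer to 0
train_bc <- predict(pp, train[, c("residualSugar", "freeSulfurdioxide")])
psych::skew(train_bc$residualSugar)
```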

Categorical Data Type

quality <- train %>%
  group_by(quality) %>%
  summarise (n = n()) %>%
  mutate(freq = n / sum(n))

quality %>%
  knitr::kable() %>%
    kable_styling(full_width = F, position = "left") %>%
    row_spec(0, color = "white", background = "#6897BB", align = "right")

The good news is that this is not an imbalanced categorical feature, which is great since it simplifies the whole predictive modeling process.

Univariate and Bivariate Relationships and Associations

Categorical Variables
  • Count of each category
  • Proportion of each category
  • Imbalanced categories
  • Mislabeled categories
ggplot(data = train, aes(x = quality)) +
   geom_bar(color = 'black', fill = "#6897BB") +
   geom_text(stat='count', aes(label=..count..), vjust=3, size=5, color = "white") +
   #theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
   guides(fill=FALSE) +
   xlab("quality") + ylab("Count") +
   labs(title="Count per category for 'quality' feature", 
   #subtitle="Count of All Positive and Negative Samples", 
   caption="Source: winequality_white")

Continuous Variables
  • Measures of location
  • Measures of spread
  • Asymmetry
  • Outliers
  • Gaps
train %>%
  gather(-quality, key = "var", value = "value") %>% 
  ggplot(aes(x = value, fill = as.factor(quality))) +
  geom_histogram(color = "white", bins = 30, alpha = 0.6) +
  facet_wrap(~ var, scales = "free") +
  ylab("Frequency") +
  xlab("All Continuous Features") +
  labs(title = "The distribution of low and high quality wine",
       subtitle = "Average") +
  #scale_fill_brewer(name = "New Legend Title") +
  scale_fill_manual(values = c("#ff1919", "#0e2b58"), name = "Quality") +
  theme(text = element_text(color = "gray20"),
        legend.position = c("right"), # position the legend in the upper left 
        legend.direction = "vertical",
        legend.justification = 0.1, # anchor point for legend.position.
        legend.text = element_text(size = 11, color = "gray10"),
        axis.text = element_text(face = "italic"),
        axis.ticks.y = element_blank(), # element_blank() is how we remove elements
        axis.line.y = element_blank(),
        panel.grid.major = element_line(),
        panel.grid.major.x = element_blank())

There are a few things that stick out. It seems that the greater the wine quality, the higher the alcohol level. Lower quality wine is concentrated around 9 to 12 percent alcohol, while better wine is spread more evenly across the range. It would be interesting to see which quality group the chlorides values between 0.1 and 0.2 belong to. The same goes for virtually all features except pH, which seems fairly normally distributed. All the values of density are between 0.94 and 1.04; I might even turn that feature into a categorical one.

 

train %>%
  gather(-quality, key = "var", value = "value") %>% 
  ggplot(aes(x = value, y= value, fill = as.factor(quality))) +
  geom_boxplot() +
  #scale_y_log10() +
  #scale_x_log10() +
  facet_wrap(~ var, scales = "free") +
  ylab("Feature") +
  xlab("Feature") +
  labs(title = "'Box Plot per Quality'") +
  #scale_fill_manual(name = "Boxplot", palette = "Paired") +
  scale_fill_manual(values = c("#ff1919", "#0e2b58"), name = "Quality") +
  theme(text = element_text(color = "gray20"),
        legend.position = c("right"), # position the legend in the upper left 
        legend.direction = "vertical",
        legend.justification = 0.1, # anchor point for legend.position.
        legend.text = element_text(size = 11, color = "gray10"),
        axis.text.x=element_blank(), #hides x axis
        axis.text = element_text(face = "italic"),
        axis.ticks.y = element_blank(), # element_blank() is how we remove elements
        axis.line.y = element_blank(),
        panel.grid.major = element_line(),
        panel.grid.major.x = element_blank())

 

 

train %>%
  gather(-quality, key = "var", value = "value") %>% 
  ggplot(aes(x = value, y= value, fill = as.factor(quality))) +
  geom_violin(alpha = 0.6) +
  #scale_y_log10() +
  #scale_x_log10() +
  facet_wrap(~ var, scales = "free") +
  ylab("Features") +
  xlab("Features") +
  labs(title = "'Violin Plot per Quality'") +
  #scale_fill_brewer(name = "Boxplot", palette = "Paired")
  #scale_fill_manual(name = "Boxplot", palette = "Paired") +
  scale_fill_manual(values = c("#ff1919", "#0e2b58"), name = "Quality") +
  theme(text = element_text(color = "gray20"),
        legend.position = c("right"), # position the legend in the upper left 
        legend.direction = "vertical",
        legend.justification = 0.1, # anchor point for legend.position.
        legend.text = element_text(size = 11, color = "gray10"),
        axis.text = element_text(face = "italic"),
        axis.ticks.y = element_blank(), # element_blank() is how we remove elements
        axis.line.y = element_blank(),
        panel.grid.major = element_line(),
        panel.grid.major.x = element_blank())

 

class_comparison <- train %>%
  #Creating new feature by breaking alcohol into 2 levels 
  mutate(category=cut(alcohol, breaks=c(-Inf, 10, Inf), labels=c("low_alcohol", "high_alcohol"))) %>%
  select(quality, category, alcohol) 

#being more explicit in regards to what is meant by low and high
levels(class_comparison$quality) <- c('high_quality', 'low_quality')

#Calculating mean for each category
class_mean <- class_comparison %>%
  group_by(quality, category) %>%
  summarise(Mean = mean(alcohol))

#Creating the label and rounding it to at most 3 decimals. 
class_mean <- class_mean %>%
  mutate(Label = paste0(round(Mean, 3)))

#Histogram Plot
ggplot(class_comparison, aes(x = alcohol, fill = as.factor(category))) +
  geom_histogram(bins = 30, color = "white") +
  scale_fill_manual(values=c("#F8766D", "#6897BB")) +
  geom_vline(data = class_mean, aes(xintercept = Mean), linetype = "dashed", alpha = .6) +
  facet_grid(quality ~ category, scales = "free") +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 2), limits = c(0, 150)) +
  ylab("Frequency") +
  xlab("Alcohol") +
  labs(title = "Mean Alcohol value devided by Quality & Alcohol - low/high levels") +
  #subtitle = "The higher the level of wine, the greater is chance that it is considered high quality") +
  geom_text(data = class_mean, aes(x = Mean, y = 130, label = Label),
            family = "Georgia", size = 3, hjust = -.1) +
  theme(text = element_text(size=9, family = "Georgia"),
        legend.position='none',
        axis.ticks = element_blank(),
        plot.title = element_text(size = 15, margin = margin(b = 8)),
        plot.subtitle = element_text(size = 10, color = "darkslategrey", margin = margin(b = 20)),
        plot.caption = element_text(size = 6, margin = margin(t = 8), color = "grey70", hjust = 0),
        strip.text = element_text(size = 10, face="bold"))

 

train %>%
  gather(-quality, -alcohol, key = "var", value = "value") %>% 
  ggplot(aes(x = value, y = alcohol, color = quality)) +
  geom_point(alpha = 0.25) +
    scale_color_manual(name = "",
                     values = c("#ff1919", "#0e2b58")) +
  facet_wrap(~ var, scales = "free") + 
  labs(title = "Scatterplot") +
  theme(text = element_text(size=9, family = "Georgia"),
        axis.ticks = element_blank(),
        plot.title = element_text(size = 15, margin = margin(b = 8)),
        plot.subtitle = element_text(size = 10, color = "darkslategrey", margin = margin(b = 20)),
        plot.caption = element_text(size = 6, margin = margin(t = 8), color = "grey70", hjust = 0),
        strip.text = element_text(size = 10, face="bold"))

 

Correlation Matrix

#First, convert the factor data type to integer so that the cor function can work
train_a <- train %>%
  mutate_if(sapply(train, is.factor), as.integer)

#as.integer encodes the factor levels as high = 1 and low = 2, so recode "low" to 0
train_a$quality <- ifelse(train_a$quality == 2, 0, train_a$quality)

#implement cor function
M <- cor(train_a)
#head(round(M,2))

# mat : is a matrix of data
# ... : further arguments to pass to the native R cor.test function
cor.mtest <- function(mat, ...) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat<- matrix(NA, n, n)
  diag(p.mat) <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], ...)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
    }
  }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}
# matrix of the p-value of the correlation
p.mat <- cor.mtest(train_a)

#Correlation plot
library(corrplot)
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method="color", col=col(200),  
         type="upper", order="hclust", 
         tl.cex = 0.75, 
         number.cex = .55,
         addCoef.col = "black", # Add coefficient of correlation
         tl.col="black", tl.srt=45, #Text label color and rotation
         # Combine with significance
         p.mat = p.mat, sig.level = 0.01, insig = "blank", 
         # hide correlation coefficient on the principal diagonal
         diag=FALSE)

 

The results aren’t surprising. Alcohol correlates positively with the quality feature, which means that the higher the alcohol level of the wine, the greater the grade. Density correlates negatively with quality, meaning that the greater the density, the lower the quality of the wine.

Correlation Significance

To determine whether the correlation between variables is significant, compare the p-value to your significance level. Usually, a significance level (denoted as α or alpha) of 0.05 works well. An α of 0.05 indicates that the risk of concluding that a correlation exists—when, actually, no correlation exists—is 5%. The p-value tells you whether the correlation coefficient is significantly different from 0. (A coefficient of 0 indicates that there is no linear relationship.)

P-value ≤ α: The correlation is statistically significant
If the p-value is less than or equal to the significance level, then you can conclude that the correlation is different from 0.
P-value > α: The correlation is not statistically significant
If the p-value is greater than the significance level, then you cannot conclude that the correlation is different from 0. Source

(Those results refer to the quoted source’s example data, not to the wine data set: there, the p-values for the correlations between porosity and hydrogen and between strength and hydrogen are both below 0.05, so those coefficients are significant, while the p-value between strength and porosity is 0.0526, which is inconclusive at the 0.05 level.)
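You can also test a single pair directly with base R's cor.test(), which returns both the coefficient and its p-value (using the numeric copy train_a built for the correlation matrix):

```r
ct <- cor.test(train_a$alcohol, train_a$density)
ct$estimate #the correlation coefficient
ct$p.value  #compare against alpha = 0.05
```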

#rcorr needs a numeric matrix, so reuse the numeric copy train_a from above
res2 <- rcorr(as.matrix(train_a))
cor_sig <- as.data.frame(res2$P)

cor_sig[is.na(cor_sig)] <- 0

cor_sig[,1:12] <- round(cor_sig, digits = 3)

cor_sig <- cor_sig %>% 
  mutate(
  features = row.names(.),
  quality = cell_spec(quality, color = ifelse(quality > 0.05, "red", "black")),
  fixedAcidity = cell_spec(fixedAcidity, color = ifelse(fixedAcidity > 0.05, "red", "black")),
  volatileAcidity = cell_spec(volatileAcidity, 
color = ifelse(volatileAcidity > 0.05, "red", "black")),
  citricAcid = cell_spec(citricAcid, color = ifelse(citricAcid > 0.05, "red", "black")),
  residualSugar = cell_spec(residualSugar, color = ifelse(residualSugar > 0.05, "red", "black")),
  chlorides = cell_spec(chlorides, color = ifelse(chlorides > 0.05, "red", "black")),
  freeSulfurdioxide = cell_spec(freeSulfurdioxide, 
color = ifelse(freeSulfurdioxide > 0.05, "red", "black")),
  totalSulfurDioxide = cell_spec(totalSulfurDioxide, 
color = ifelse(totalSulfurDioxide > 0.05, "red", "black")),
  density = cell_spec(density, color = ifelse(density > 0.05, "red", "black")),
  pH = cell_spec(pH, color = ifelse(pH > 0.05, "red", "black")),
  sulphates = cell_spec(sulphates, color = ifelse(sulphates > 0.05, "red", "black")),
  alcohol = cell_spec(alcohol, color = ifelse(alcohol > 0.05, "red", "black")))


cor_sig %>%
  #select_all() %>%
  kable(escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "float_right") %>%
  row_spec(0, color = "white", background = "#6897BB", align = "center")

citricAcid and freeSulfurdioxide appear to have quite large p-values, which suggests that both features might in fact have close to zero correlation with the quality feature. The result is not surprising at all. There is a high chance that those two features will be discarded once the feature selection stage kicks in.

Feature Engineering

Now the next step is to create features based on the findings from the EDA stage. During the modeling stage, the baseline training set and the training set with engineered features will be compared, to see how much gain we achieved with the newly created features. By baseline training set, I mean the training set that contains only the initial features that were given.

Create ‘Isolation’ features where:

  • chloridesHIGH = chlorides < 0.06,
  • citricAcidBETWEEM = citricAcid < 0.5 & citricAcid > 0.1,
  • freeSulfurdioxideHIGH = freeSulfurdioxide > 65,
  • pHHIGH = pH > 3.40,
  • residualSugarHIGH = residualSugar > 14,
  • sulphatesHIGH = sulphates > 0.6,
  • volatileAcidityHIGH = volatileAcidity > 0.45,
  • fixedAcidityLOW = fixedAcidity < 6,
  • alcoholLOW = alcohol < 9,
  • alcoholHIGH = alcohol > 12.5

Create ‘Interaction’ features based on the product of ‘alcohol’ and every other feature except the target feature, ‘quality’. ‘Alcohol’ is very interesting since it has the highest correlation with our target feature, and by multiplying it with the other features I hope to create greater variation that helps explain the target quality feature.

trainFE <- train %>%
  mutate(
    
    #Isolation features
    chloridesHIGH = chlorides < 0.06,
    citricAcidBETWEEM = citricAcid < 0.5 & citricAcid > 0.1,
    freeSulfurdioxideHIGH = freeSulfurdioxide > 65,
    pHHIGH = pH > 3.40,
    residualSugarHIGH = residualSugar > 14,
    sulphatesHIGH = sulphates > 0.6,
    volatileAcidityHIGH = volatileAcidity > 0.45,
    fixedAcidityLOW = fixedAcidity < 6,
    alcoholLOW = alcohol < 9,
    alcoholHIGH = alcohol > 12.5,
    
    #Interaction features
    chloridesAL = chlorides * alcohol,
    fixedAcidityAL = fixedAcidity * alcohol,
    volatileAcidityAL = volatileAcidity * alcohol,
    citricAcidAL = citricAcid * alcohol,
    residualSugarAL = residualSugar * alcohol,
    freeSulfurdioxideAL = freeSulfurdioxide * alcohol,
    totalSulfurDioxideAL = totalSulfurDioxide * alcohol,
    densityAL = density * alcohol,
    pHAL = pH * alcohol
  )

 

col <- c("chloridesHIGH", "citricAcidBETWEEM", "freeSulfurdioxideHIGH", "pHHIGH", 
"residualSugarHIGH", "sulphatesHIGH", "volatileAcidityHIGH", "fixedAcidityLOW", 
"alcoholLOW", "alcoholHIGH")

#take columns from col object and transform it from logical data type to a factor
trainFE[, col] <- trainFE %>%
  select(one_of(col)) %>%
  dmap(as.factor)
x = nearZeroVar(trainFE, saveMetrics = TRUE)

  • freqRatio – ratio of the frequency of the most common value over the second most common value
  • percentUnique – percentage of unique data points out of the total number of data points
  • zeroVar – a logical vector indicating whether the predictor has only one distinct value
  • nzv – a logical vector indicating whether the predictor is a near-zero variance predictor
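As a sanity check, freqRatio can be reproduced by hand for one of the engineered features (a sketch; trainFE comes from the feature engineering step above):

```r
#frequency of the most common value over the second most common value
tab <- sort(table(trainFE$freeSulfurdioxideHIGH), decreasing = TRUE)
as.numeric(tab[1] / tab[2]) #should match the freqRatio reported by nearZeroVar
```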

Overview of the features that have potentially no variance:

head(x) %>%
  knitr::kable() %>%
    kable_styling(full_width = F, position = "left") %>%
    row_spec(0, color = "white", background = "#6897BB", align = "right")

x[x[,"zeroVar"] == TRUE, ]
x[x[, "nzv"] == TRUE, ]

The only near-zero variance predictor is freeSulfurdioxideHIGH, so I drop it:

trainFE <- trainFE %>%
  select(-freeSulfurdioxideHIGH)

 

Modeling

trControl = trainControl(method = "repeatedcv", 
                         number = 5, 
                         repeats = 5, 
                         savePredictions = TRUE, 
                         verboseIter = FALSE)

 

set.seed(92)
model_lm <- caret::train(quality ~ .,
                         data = train,
                         method = "lm",
                         preProcess = c("BoxCox","scale", "center"),
                         trControl = trControl)

model_lmFE <- caret::train(quality ~ .,
                         data = trainFE,
                         method = "lm",
                         preProcess = c("BoxCox","scale", "center"),
                         trControl = trControl)

model_glm <- caret::train(quality ~ .,
                         data = train,
                         method = "glm",
                         preProcess = c("BoxCox","scale", "center"),
                         trControl = trControl)

model_glmFE <- caret::train(quality ~ .,
                         data = trainFE,
                         method = "glm",
                         preProcess = c("BoxCox","scale", "center"),
                         trControl = trControl)


library(elasticnet)
model_lasso <- caret::train(quality ~ .,
                         data = train,
                         method = "lasso",
                         preProcess = c("BoxCox", "scale", "center"),
                         trControl = trControl)

model_lassoFE <- caret::train(quality ~ .,
                         data = trainFE,
                         method = "lasso",
                         preProcess = c("BoxCox", "scale", "center"),
                         trControl = trControl)

model_ridge <- caret::train(quality ~ .,
                         data = train,
                         method = "ridge",
                         preProcess = c("BoxCox", "scale", "center"),
                         trControl = trControl)

model_ridgeFE <- caret::train(quality ~ .,
                         data = trainFE,
                         method = "ridge",
                         preProcess = c("BoxCox", "scale", "center"),
                         trControl = trControl)

library(glmnet)
model_glmnet <- caret::train(quality ~ .,
                         data = train,
                         method = "glmnet",
                         preProcess = c("BoxCox","scale", "center"),
                         trControl = trControl)



model_glmnetFE <- caret::train(quality ~ .,
                         data = trainFE,
                         method = "glmnet",
                         preProcess = c("BoxCox","scale", "center"),
                         trControl = trControl)

model_rf <- caret::train(quality ~ .,
                         data = train,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trControl)

model_rfFE <- caret::train(quality ~ .,
                         data = trainFE,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trControl)

set.seed(92)
model_Cart <- caret::train(quality ~ .,
                         data = train,
                         method = "rpart",
                         preProcess = c("BoxCox", "scale", "center"),
                         trControl = trControl)

model_CartFE <- caret::train(quality ~ .,
                         data = trainFE,
                         method = "rpart",
                         preProcess = c("BoxCox", "scale", "center"),
                         trControl = trControl)

model_knn <- caret::train(quality ~ .,
                         data = train,
                         method = "knn",
                         preProcess = c("scale", "center"),
                         trControl = trControl)

model_knnFE <- caret::train(quality ~ .,
                         data = trainFE,
                         method = "knn",
                         preProcess = c("scale", "center"),
                         trControl = trControl)

 

# Compare model performances using resamples()
models_compare <- resamples(list(LM = model_lm, LM_FE = model_lmFE, 
                                 GLM = model_glm, GLM_FE = model_glmFE,
                                 Lasso = model_lasso, LassoFE = model_lassoFE, 
                                 Ridge = model_ridge, RidgeFE = model_ridgeFE, 
                                 Elastic_Net = model_glmnet, Elastic_NetFE = model_glmnetFE,
                                 RF = model_rf, RFFE = model_rfFE,
                                 CART = model_Cart, CARTFE = model_CartFE, 
                                 KNN = model_knn, KNNFE = model_knnFE))

# Summary of the models performances
summary(models_compare)

 

dotplot(models_compare)

 

bwplot(models_compare)

 

# dot plots of accuracy
scales <- list(x=list(relation="free"), y=list(relation="free"))
dotplot(models_compare, scales=scales)

 

parallelplot(models_compare)

 

xyplot(models_compare, models=c("RF", "RFFE"))

 

 

varimp_glmnet <- varImp(model_glmnet)
plot(varimp_glmnet, main = "Variable Importance with glmnet")

PROJECT IS STILL UNDER CONSTRUCTION

 

SOURCES:

Best Practices for Feature Engineering
