1 Preliminaries

In this R Markdown document, we highlight the process by which we have obtained the machine learning results on our dataset. The document is organized into several sections, which can be accessed through the navigation bar on the left.

In the code chunk below, we load the data into the R environment and install/load all needed packages for our analysis.

1.1 Initialization

if (!require("pacman")) install.packages("pacman") # checks if the pacman package is installed locally
pacman::p_load(DT, foreach, magrittr, tidyverse,fst, # important analytic pkgs
               knitr, kableExtra, # to print nice looking tables
               dataPreparation, DataExplorer, zoo, # important for exploratory data analysis
               recipes, caret, caretEnsemble, doParallel, MLmetrics, # bases for our ML approach
               rpart, glmnet, Matrix, naivebayes, nnet, randomForest, # needed packages for ML
               kernlab, # needed for svm
               xgboost, plyr, # needed packages for xgboost
               caTools, pROC, # for classification metrics
               ROSE, DMwR, # for sampling methods
               lubridate # for parsing date-times in the time-based sampling scenario
)

1.2 Loading Data

df = read_csv("../Data/REG500_SCELag7.csv")
df$driver = NULL
df$distance = NULL
colnames(df) = c('date', 'driver', 'intervalTime', 'speedMean', 'speedSD', 'age', 'gender', 'prepInten', 'prepProb', 'windSpeed', 'visibility', 'cumDrive', 'dayOfWeek','weekend', 
                 'holiday', 'hourDayCat', 'SCELag7','SCE') # setting the column names

df %<>% mutate_if(is.character, as.factor) # converting strings to factor variables
df$holiday %<>% as.factor() # converting the holiday variable to a factor
df$SCE = as.factor(df$SCE)
levels(df$SCE) = list(Yes=1, No=0) # specifying the levels of the response
levels(df$hourDayCat)= list(rush1 = "6 a.m. - 10 a.m.",
                            midDay ="11 a.m. - 14 p.m.",
                            rush2 ="15 p.m. - 20 p.m.",
                            oNight ="21 p.m. -  5 a.m.") # renaming the levels of the hourDayCat variable

1.3 Session Information for Reproducibility

A primary motivation for presenting this work in R Markdown format is to provide the reader with both the code and the corresponding analysis/results. However, we understand that R, like any open-source environment, is constantly updated with new package versions, which may change some of the functions used. In an effort to mitigate this, we provide three main pieces of information in the code chunk below:
- Setting the seed to 2020, which ensures that the same subsamples, and more generally the same generated random numbers, are obtained by other users if they apply our procedure to our data.
- Session information, where we print out and save an object that contains:
* the R version, platform, and operating system;
* the attached base packages;
* the other attached packages and their versions, which allows readers to replicate our work using these exact package versions on their system if future package releases deprecate or change some of the functions used in our analysis. We refer the reader to this RStudio Support article for more details on this process.

source("functions.R") # Our custom built functions in R
set.seed(2020) 

sInfo = sessionInfo()

save(sInfo, file="Data/sessionInfo.RData")

2 Descriptive Analysis

Here, we perform an exploratory data analysis to better understand the features of our dataset. For the sake of convenience, we display the different components of the exploratory analysis in different tabs to reduce the length of this HTML document. We rely heavily on functions from the DataExplorer and summarytools packages in this section.

Summary of the Continuous Predictors

descr(df, stats = c("min", "q1", "mean", "med", "q3", "max", "sd", "skewness", "kurtosis"), 
      headings = F) %>% round(digits = 4)

Predictor Histograms

plot_histogram(df,  ncol=3, nrow = 4,
               ggtheme = theme_bw())
ggsave(filename = "../Results/histogram.png", width = 6.5, height = 4, dpi = 600,
       units = 'in', device = 'png')

Box Plots by SCE

plot_boxplot(df, by="SCE", ncol=3, nrow = 4,  
             geom_boxplot_args = list("outlier.size" = 0.25),
             ggtheme = theme_bw())
ggsave(filename = "../Results/boxplot.png", width = 6.5, height = 4, dpi = 600,
       units = 'in', device = 'png')

Correlation Plot

plot_correlation(df, maxcat = 5L)


3 Predictive Modeling

In this study, we have considered three sampling scenarios. In the first scenario (random-based sampling), we randomly assign trips to the training and testing sets such that the proportion of events (i.e., y = 1) is the same in both sets. In the second, driver-based scenario, we separate the training and testing data by driver, i.e., each driver's trips are assigned to either the training or the testing set. This allows us to evaluate the generalizability of the forecasts to drivers not seen in training. This experiment is referred to as driver-based sampling. Finally, we sample the training and testing sets chronologically, i.e., all testing trips occur after all training trips, which allows us to evaluate the potential of the models for predicting future events. This experiment is referred to as time-based sampling.

To predict the dichotomous SCE variable in each scenario, we have developed the following parallel computational setup, which capitalizes on the 28 cores of our server:
- We divide our dataset into a training set (80% of the data) and a testing/holdout set (20% of the data). The two samples were created using a stratified sampling approach such that the distribution of SCE events is similar in the training and testing samples.
- To train the models, we have selected 9 popular statistical/machine learning algorithms and:
- used 5-fold cross-validation;
- applied 'down' sampling to obtain a balanced training dataset;
- examined 20 random combinations of the tuning parameter(s) whenever an algorithm requires tuning;
- selected the model one standard error away from the best model in an effort to reduce overfitting. Note that this process selects a representative model for each algorithm (based on its best tuning parameters and the 5 folds) to be used for prediction on the holdout/test dataset.

We have also performed a linear greedy optimization of the AUC using caretEnsemble() from the caretEnsemble package. The goal of the chunk below is to examine whether a linear combination of the models results in improved prediction performance when compared to the individual models. To improve the prediction, we examine the following combinations of models:

  • an ensemble of ridge and xgb
  • an ensemble of cart, svm, and glm
  • an ensemble of all the models together

The reader should note the following: (a) our approach in this section is to examine viable candidates for ensembling, to highlight whether such an approach can improve the performance over the individual machine learning models examined above; (b) all greedy ensembles attempt to maximize the overall ROC; and (c) other possible ensembles could have been examined (i.e., we are not claiming that this is an exhaustive list of all possible ensembles). We do not examine any further combinations given that the ensembling did not result in a practical increase in the AUC values (especially when their results are compared to the XGB algorithm).

Random-based sampling scenario

# Create Training and Testing Datasets

numCores = detectCores()
cl = makePSOCKcluster(numCores , outfile ="../Results/trainLog.txt") # Telling R to run on # cores
registerDoParallel(cl)


df$driver = NULL
df$date = NULL

## Random-based sampling scenario
trainRowNumbers = createDataPartition(df$SCE,
                                      p=0.8, 
                                      list = F) %>% as.vector()

trainData = df[trainRowNumbers,] # Training Dataset
testData = df[-trainRowNumbers,] # Testing Dataset


# Cross Validation Setup
fitControl = trainControl(
        method = "cv", # k-fold cross validation
        number = 5, # Number of Folds
        sampling = "down", # Down sampling
        search = "random", # Random search for parameter tuning when applicable
        summaryFunction = outputFun, # see functions.R file
        classProbs = T, # should class probabilities be returned
        selectionFunction = "best", # select the tuning parameters with the best resampled performance
        savePredictions = "final",
        index = createResample(trainData$SCE, 5))


# Model Training
models = caretList(SCE ~., data = trainData, metric="ROC", 
                   tuneList= list(
                           cart= caretModelSpec(method="rpart", tuneLength=20),
                           glm = caretModelSpec(method = "glm", family= "binomial"),
                           lasso = caretModelSpec(method = "glmnet", family= "binomial", 
                                                  tuneGrid = expand.grid(.alpha=1,
                                                                         .lambda= seq(0.00001, 2, length=20))),
                           nb = caretModelSpec(method="naive_bayes", tuneLength=20),
                           nnet = caretModelSpec(method="avNNet", tuneLength=20),
                           rf = caretModelSpec(method="rf", tuneLength=20),
                           ridge = caretModelSpec(method = "glmnet", family= "binomial", 
                                                  tuneGrid = expand.grid(.alpha=0,
                                                                         .lambda= seq(0.00001, 2, length=20))),
                           svm = caretModelSpec(method="svmRadial", tuneLength=20),
                           xgb = caretModelSpec(method="xgbTree", tuneLength=20) 
                   ),  
                   trControl=fitControl, continue_on_fail = T,
                   preProcess = c("nzv", "center", "scale", "corr")
)

Driver-based scenario

# Create Training and Testing Datasets

numCores = detectCores()
cl = makePSOCKcluster(numCores , outfile ="../Results/trainLog.txt") # Telling R to run on # cores
registerDoParallel(cl)


## Driver-based sampling scenario

df$date = NULL

d = as.data.frame(unique(df$driver))
ind = floor(nrow(d)  * 0.8)
dtrain = as.vector(d[1:ind,])
dtest = as.vector(d[(ind + 1):nrow(d),])

trainData = df %>%
        filter(driver %in% dtrain)
trainData$driver = as.character(trainData$driver)

testData = df %>%
        filter(driver %in% dtest)
testData$driver = as.character(testData$driver)

trainData$driver = NULL
testData$driver = NULL


# Cross Validation Setup
fitControl = trainControl(
        method = "cv", # k-fold cross validation
        number = 5, # Number of Folds
        sampling = "down", # Down sampling
        search = "random", # Random search for parameter tuning when applicable
        summaryFunction = outputFun, # see functions.R file
        classProbs = T, # should class probabilities be returned
        selectionFunction = "best", # select the tuning parameters with the best resampled performance
        savePredictions = "final",
        index = createResample(trainData$SCE, 5))


# Model Training
models = caretList(SCE ~., data = trainData, metric="ROC", 
                   tuneList= list(
                           cart= caretModelSpec(method="rpart", tuneLength=20),
                           glm = caretModelSpec(method = "glm", family= "binomial"),
                           lasso = caretModelSpec(method = "glmnet", family= "binomial", 
                                                  tuneGrid = expand.grid(.alpha=1,
                                                                         .lambda= seq(0.00001, 2, length=20))),
                           nb = caretModelSpec(method="naive_bayes", tuneLength=20),
                           nnet = caretModelSpec(method="avNNet", tuneLength=20),
                           rf = caretModelSpec(method="rf", tuneLength=20),
                           ridge = caretModelSpec(method = "glmnet", family= "binomial", 
                                                  tuneGrid = expand.grid(.alpha=0,
                                                                         .lambda= seq(0.00001, 2, length=20))),
                           svm = caretModelSpec(method="svmRadial", tuneLength=20),
                           xgb = caretModelSpec(method="xgbTree", tuneLength=20) 
                   ),  
                   trControl=fitControl, continue_on_fail = T,
                   preProcess = c("nzv", "center", "scale", "corr")
)

Time-based scenario

# Create Training and Testing Datasets

numCores = detectCores()
cl = makePSOCKcluster(numCores , outfile ="../Results/trainLog.txt") # Telling R to run on # cores
registerDoParallel(cl)

## Time-based sampling scenario

df$date = ymd_hms(df$date)
df = arrange(df, date)
df$date = NULL
df$driver = NULL
ind = floor(nrow(df)  * 0.8)
trainData = df[1:ind,]
testData = df[(ind + 1): nrow(df),]


# Cross Validation Setup
fitControl = trainControl(
        method = "cv", # k-fold cross validation
        number = 5, # Number of Folds
        sampling = "down", # Down sampling
        search = "random", # Random search for parameter tuning when applicable
        summaryFunction = outputFun, # see functions.R file
        classProbs = T, # should class probabilities be returned
        selectionFunction = "best", # select the tuning parameters with the best resampled performance
        savePredictions = "final",
        index = createResample(trainData$SCE, 5))


# Model Training
models = caretList(SCE ~., data = trainData, metric="ROC", 
                   tuneList= list(
                           cart= caretModelSpec(method="rpart", tuneLength=20),
                           glm = caretModelSpec(method = "glm", family= "binomial"),
                           lasso = caretModelSpec(method = "glmnet", family= "binomial", 
                                                  tuneGrid = expand.grid(.alpha=1,
                                                                         .lambda= seq(0.00001, 2, length=20))),
                           nb = caretModelSpec(method="naive_bayes", tuneLength=20),
                           nnet = caretModelSpec(method="avNNet", tuneLength=20),
                           rf = caretModelSpec(method="rf", tuneLength=20),
                           ridge = caretModelSpec(method = "glmnet", family= "binomial", 
                                                  tuneGrid = expand.grid(.alpha=0,
                                                                         .lambda= seq(0.00001, 2, length=20))),
                           svm = caretModelSpec(method="svmRadial", tuneLength=20),
                           xgb = caretModelSpec(method="xgbTree", tuneLength=20) 
                   ),  
                   trControl=fitControl, continue_on_fail = T,
                   preProcess = c("nzv", "center", "scale", "corr")
)

Greedy Ensembles

numCores = detectCores()
cl = makePSOCKcluster(numCores , outfile ="../Results/greedy_all.txt") # Telling R to run on # cores
registerDoParallel(cl)

## e_all: all together
ens_all = models[c("cart","glm","lasso", "nb", "nnet", "rf", "ridge", "xgb")]
greedy_ensemble_all = caretEnsemble(
        ens_all, 
        metric="ROC",
        trControl= fitControl,
        preProcess = c("nzv", "center", "scale", "corr"))

ensResults_all = ensResults_fun(greedy_ensemble_all, testData, ensName = "e_all")


results = cbind(predResults, ensResults_all)
stopCluster(cl)

## e1: ridge and xgb

numCores = detectCores()
cl = makePSOCKcluster(numCores , outfile ="../Results/greedy1.txt") # Telling R to run on # cores
registerDoParallel(cl)

ens1 = models[c("ridge", "xgb")]
greedy_ensemble1 = caretEnsemble(
        ens1, 
        metric="ROC",
        trControl= fitControl,
        preProcess = c("nzv", "center", "scale", "corr"))

ensResults1 = ensResults_fun(greedy_ensemble1, testData, ensName = "e1")
results = cbind(results, ensResults1)

## e2: svm cart glm

ens2 = models[c("cart", "glm", "svm")]
greedy_ensemble2 = caretEnsemble(
        ens2, 
        metric="ROC",
        trControl= fitControl,
        preProcess = c("nzv", "center", "scale", "corr"))

ensResults2 = ensResults_fun(greedy_ensemble2, testData, ensName = "e2")
results = cbind(results, ensResults2)

stopCluster(cl)
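
The weights that the greedy optimization assigns to each base learner, together with the ensemble's cross-validated ROC, can be inspected directly from the fitted objects. Below is a minimal sketch for the all-model ensemble (the same call applies to greedy_ensemble1 and greedy_ensemble2); the formal comparisons reported in the next section are produced by our custom functions.

# Print the linear weights assigned to each base learner and the
# cross-validated ROC of the greedy ensemble
summary(greedy_ensemble_all)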

4 Models’ Evaluation and Insight

We report five performance measures for each machine learning model: (a) accuracy, which measures the proportion of observations (both positive and negative) that were correctly classified; (b) sensitivity, the proportion of actual positive observations that were correctly classified as positive; (c) specificity, which measures the proportion of actual negative observations that were correctly classified as negative; (d) G-mean, the geometric mean of sensitivity and specificity, which consequently gives a balanced measure of model performance on the majority and minority classes; and (e) AUC, the area under the ROC curve, which captures how well the model discriminates actual 0s from actual 1s.
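
As an illustration, the chunk below is a minimal sketch (not the exact code behind the tabs below, which rely on our custom functions in functions.R) of how these five measures can be computed for a single model, here the xgb fit, on the holdout set. It assumes that the models list and testData from the relevant scenario in Section 3 are still in memory.

# Minimal sketch: accuracy, sensitivity, specificity, G-mean, and AUC for the
# xgb model on the holdout set (assumes models and testData are in memory)
predClass = predict(models$xgb, newdata = testData)               # predicted classes
predProb = predict(models$xgb, newdata = testData, type = "prob") # class probabilities

cm = confusionMatrix(predClass, testData$SCE, positive = "Yes")
sens = cm$byClass[["Sensitivity"]] # proportion of actual positives correctly classified
spec = cm$byClass[["Specificity"]] # proportion of actual negatives correctly classified
gmean = sqrt(sens * spec)          # geometric mean of sensitivity and specificity
auc = pROC::auc(pROC::roc(testData$SCE, predProb[, "Yes"]))

round(c(Accuracy = cm$overall[["Accuracy"]], Sensitivity = sens,
        Specificity = spec, Gmean = gmean, AUC = as.numeric(auc)), 4)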

As a final step in our analysis, we also measure variable importance based on the random-based sampling scenario. We have used two methods to assess variable importance: (a) the standard variable importance methodology from the R caret package, which attempts to measure the contribution of each predictor to each class of the response variable through an ROC curve analysis on the training set, and (b) a sequential approach to variable importance on the testing set, where we divided the predictors into five blocks and examined the change in prediction accuracy as the variables from a given block are made available to the models.

Random-based sampling scenario

Driver-based scenario

Time-based scenario

Ensemble models

Variable importance

Results from method (a)

# Method (a)

## By specifying "useModel = FALSE" we use the default (model-agnostic) variable importance methodology from the caret package

xgb.var = varImp(models$xgb, useModel = FALSE)
varImp = data.frame(Predictor = rownames(xgb.var$importance),
                    Importance = xgb.var$importance[, 1]) # filter-based importance for each predictor
varImp$Importance = round(varImp$Importance, 4)
varImp = varImp[order(varImp$Importance),]

varImp %>%
        ggplot() +
        geom_col(aes(x = Importance, y = reorder(Predictor, Importance)),
                 color = "black") + theme_bw() +
        xlab("Relative Importance") + ylab("Predictor")
ggsave(filename = "../Results/VarImp.png", width = 6.5, height = 4, dpi = 600,
       units = 'in', device = 'png')

Results from method (b)
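
The chunk below is a minimal sketch of how method (b) can be implemented. The five predictor blocks shown here are hypothetical placeholders rather than the exact block definitions used in our analysis, and we track the holdout AUC as one example performance measure; it assumes the trainData, testData, and fitControl objects from the random-based sampling scenario are in memory.

# Minimal sketch of the sequential (block-wise) variable importance in method (b).
# The block definitions below are hypothetical placeholders; we retrain xgbTree on
# an expanding set of predictors and record the holdout AUC after each block is added.
blocks = list(driving  = c("intervalTime", "speedMean", "speedSD", "cumDrive"),
              driver   = c("age", "gender"),
              weather  = c("prepInten", "prepProb", "windSpeed", "visibility"),
              temporal = c("dayOfWeek", "weekend", "holiday", "hourDayCat"),
              history  = c("SCELag7"))

aucByBlock = numeric(length(blocks))
used = character(0)
for (i in seq_along(blocks)) {
        used = c(used, blocks[[i]]) # make the i-th block of predictors available
        fit = train(SCE ~ ., data = trainData[, c(used, "SCE")],
                    method = "xgbTree", metric = "ROC",
                    trControl = fitControl, tuneLength = 5)
        prob = predict(fit, newdata = testData, type = "prob")[, "Yes"]
        aucByBlock[i] = as.numeric(pROC::auc(pROC::roc(testData$SCE, prob)))
}
names(aucByBlock) = names(blocks)
aucByBlock # holdout AUC as each block of predictors is made available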

