Tuesday, at the end of my 5-hour crash course on machine learning for actuaries, Pierre asked me an interesting question about computational time of different techniques. I’ve been presenting the philosophy of various algorithm, but I forgot to mention computational time. I wanted to try several classification algorithms on the dataset used to illustrate the techniques

> rm(list=ls()) > myocarde=read.table( "http://freakonometrics.free.fr/myocarde.csv", head=TRUE,sep=";") > levels(myocarde$PRONO)=c("Death","Survival")

But the dataset is rather small, with 71 observations and 7 explanatory variables. So I decided to replicate the observations, and to add some covariates,

> levels(myocarde$PRONO)=c("Death","Survival") > idx=rep(1:nrow(myocarde),each=100) > TPS=matrix(NA,30,10) > myocarde_large=myocarde[idx,] > k=23 > M=data.frame(matrix(rnorm(k* + nrow(myocarde_large)),nrow(myocarde_large),k)) > names(M)=paste("X",1:k,sep="") > myocarde_large=cbind(myocarde_large,M) > dim(myocarde_large) [1] 7100 31 > object.size(myocarde_large) 2049.064 kbytes

The dataset is not big… but at least, it does not take 0.0001 sec. to run a regression. Actually, to run a **logistic regression**, it takes 0.1 second

> system.time(fit< glm(PRONO~., + data=myocarde_large, family="binomial")) user system elapsed 0.114 0.016 0.134 > object.size(fit) 9,313.600 kbytes

And I was surprised that the regression object was 9Mo, which is more than four times the size of the dataset. With a large dataset, 100 times larger,

> dim(myocarde_large_2) [1] 710000 31

it takes 20 sec.

> system.time(fit<-glm(PRONO~., + data=myocarde_large_2, family="binomial")) utilisateur système écoulé 16.394 2.576 19.819 > object.size(fit) 90,9025.600 kbytes

and the object is ‘only’ ten times bigger.

Note that with a spline, computational time is rather similar

> library(splines) > system.time(fit<-glm(PRONO~bs(INSYS)+., + data=myocarde_large, family="binomial")) user system elapsed 0.142 0.000 0.143 > object.size(fit) 9663.856 kbytes

If we use another function, more specifically the one I use for *multinomial* regressions, it is two times longer

> library(VGAM) > system.time(fit1<-vglm(PRONO~., + data=myocarde_large, family="multinomial")) user system elapsed 0.200 0.020 0.226 > object.size(fit1) 6569.464 kbytes

while the object is smaller. Now, if we use a stepwise procedure, backward, it is a bit long : almost one minute, 500 times longer than a single logistic regression

> system.time(fit<-step(glm(PRONO~.,data=myocarde_large, family="binomial"))) ... Step: AIC=4118.15 PRONO ~ FRCAR + INCAR + INSYS + PRDIA + PVENT + REPUL + X16 Df Deviance AIC <none> 4102.2 4118.2 - X16 1 4104.6 4118.6 - PRDIA 1 4113.4 4127.4 - INCAR 1 4188.4 4202.4 - REPUL 1 4203.9 4217.9 - PVENT 1 4215.5 4229.5 - FRCAR 1 4254.1 4268.1 - INSYS 1 4286.8 4300.8 user system elapsed 50.327 0.050 50.368 > object.size(fit) 6,652.160 kbytes

I also wanted to try caret. This package is nice to compare models. In a review of the bookComputational Actuarial Science with R in JRSS-A, Andrey Kosteko noticed that this package was not even mentioned, and it was missing. So I tried a logistic regression

> library(caret) > system.time(fit<-train(PRONO~., + data=myocarde_large,method="glm")) user system elapsed 5.908 0.032 5.954 > object.size(fit) 12,676.944 kbytes

It took 6 seconds (50 times more than a standard call of the glm function), and the object is rather big. It is even worst if we try to run a stepwise procedure

> system.time(fit<-train(PRONO~., + data=myocarde_large,method="glmStepAIC")) ... Step: AIC=4118.15 .outcome ~ FRCAR + INCAR + INSYS + PRDIA + PVENT + REPUL + X16 Df Deviance AIC <none> 4102.2 4118.2 - X16 1 4104.6 4118.6 - PRDIA 1 4113.4 4127.4 - INCAR 1 4188.4 4202.4 - REPUL 1 4203.9 4217.9 - PVENT 1 4215.5 4229.5 - FRCAR 1 4254.1 4268.1 - INSYS 1 4286.8 4300.8 user system elapsed 1063.399 2.926 1068.060 > object.size(fit) 9,978.808 kbytes

which took 15 minutes, with only 30 covariates… Here is the plot (I used microbenchmark to plot it)

Let us consider some **trees**.

> library(rpart) > system.time(fit<-rpart(PRONO~., + data=myocarde_large)) user system elapsed 0.341 0.000 0.345 > object.size(fit4) 544.664 kbytes

Here it is fast, and the object is rather small. And if we change the complexity parameter, to get a deeper tree, it is almost the same

> system.time(fit<-rpart(PRONO~., + data=myocarde_large,cp=.001)) user system elapsed 0.346 0.000 0.346 > object.size(fit) 544.824 kbytes

But again, if we run the same function through caret, it is more than ten times slower,

> system.time(fit<-train(PRONO~., + data=myocarde_large,method="rpart")) user system elapsed 4.076 0.005 4.077 > object.size(fit) 5,587.288 kbytes

and the object is ten times bigger. Now consider some **random forest**.

> library(randomForest) > system.time(fit<-randomForest(PRONO~., + data=myocarde_large,ntree=50)) user system elapsed 0.672 0.000 0.671 > object.size(fit) 1,751.528 kbytes

With ‘only’ 50 trees, it is only two times longer to get the output. But with 500 trees (the default value) it takes twenty times more (with a reasonable proportional time, growing 500 trees instead of 50)

> system.time(fit<-randomForest(PRONO~., + data=myocarde_large,ntree=500)) user system elapsed 6.644 0.180 6.821 > object.size(fit) 5,133.928 kbytes

If we change the number of covariates to use, at each node, we can see that there is almost no impact. With 5 covariates (which is the square root of the total number of covariates, i.e. it is the default value), it takes 6 seconds,

> system.time(fit<-randomForest(PRONO~., + data=myocarde_large,mtry=5)) user system elapsed 6.266 0.076 6.338 > object.size(fit) 5,161.928 kbytes

but if we use 10, it is almost the same (even less)

> system.time(fit<-randomForest(PRONO~., + data=myocarde_large,mtry=10)) user system elapsed 5.666 0.076 5.737 > object.size(fit) 2,501.928 bytes

If we use the random forest algorithm within caret, it takes 10 minutes,

> system.time(fit<-train(PRONO~., + data=myocarde_large,method="rf")) user system elapsed 609.790 2.111 613.515

and the visualisation is

If we consider a *k*-nearest neighbor technique, with caret again, it takes some time, with again 10 minutes

> system.time(fit<-train(PRONO~., + data=myocarde_large,method="knn")) user system elapsed 66.994 0.088 67.327 > object.size(fit) 5,660.696 kbytes

which is almost the same time as a bagging algorithm, on trees

> system.time(fit<-train(PRONO~., + data=myocarde_large,method="treebag")) Le chargement a nécessité le package : plyr user system elapsed 60.526 0.567 61.641 > object.size(fit) 72,048.480 kbytes

but this time, the object is quite big !

We can also consider **SVM** techniques, with standard Euclidean distance

> library(kernlab) > system.time(fit<-ksvm(PRONO~., + data=myocarde_large, + prob.model=TRUE, kernel="vanilladot")) Setting default kernel parameters user system elapsed 14.471 0.076 14.698 > object.size(fit) 801.120 kbytes

or using some kernel

> system.time(fit<-ksvm(PRONO~., + data=myocarde_large, + prob.model=TRUE, kernel="rbfdot")) user system elapsed 9.469 0.052 9.701 > object.size(fit) 846.824 kbytes

Both techniques take around 10 seconds, much more than our basic logistic regression (one hundred times more). And again, if we try to use caret to do the same, it takes a while….

> system.time(fit<-train(PRONO~., + data=myocarde_large, method="svmRadial")) user system elapsed 360.421 2.007 364.669 > object.size(fit) 4,027.880 kbytes

The output is the following

I also wanted to try some functions, like ridge and LASSO.

> library(glmnet) > idx=which(names(myocarde_large)=="PRONO") > y=myocarde_large[,idx] > x=as.matrix(myocarde_large[,-idx]) > system.time(fit<-glmnet(x,y,alpha=0,lambda=.05, + family="binomial")) user system elapsed 0.013 0.000 0.052 > system.time(fit<-glmnet(x,y,alpha=1,lambda=.05, + family="binomial")) user system elapsed 0.014 0.000 0.013

I was surprised to see how fast it. And if we use cross validation to quantify the penalty

> system.time(fit10<-cv.glmnet(x,y,alpha=1, + type="auc",nlambda=100, + family="binomial")) user system elapsed 11.831 0.000 11.831

It takes some time… but it is reasonnable, compared with other techniques. And finally, consider some **boosting** packages.

> system.time(fit<-gbm.step(data=myocarde_large, + gbm.x = (1:(ncol(myocarde_large)-1))[-idx], + gbm.y = ncol(myocarde_large), + family = "bernoulli", tree.complexity = 5, + learning.rate = 0.01, bag.fraction = 0.5)) user system elapsed 364.784 0.428 365.755 > object.size(fit) 8,607.048 kbytes

That one was long. More than 6 minutes. Using the glmboost package via caret was much faster, this time

> system.time(fit<-train(PRONO~., + data=myocarde_large,method="glmboost")) user system elapsed 13.573 0.024 13.592 > object.size(fit) 6,717.400 bytes

While using gbm via caret was ten times longer,

> system.time(fit<-train(PRONO~., + data=myocarde_large,method="gbm")) user system elapsed 121.739 0.360 122.466 > object.size(fit) 7,115.512 kbytes

All that was done one a laptop. I now have to run the same codes on a faster machine, to try much larger datasets….

Thanks, nice post!.

Just a remainder about the possibility to run caret::train in parallel if your machine has multiple cores (not in Windows and not if you run any of the RWeka imported algorithms).

See the last example in train’s help page as the reference on how easy is to run that with just two lines of code.

Interesting post, thanks!

Maybe that’s what you did but I’d like to make sure: use the function ‘caret::train’, with the parameter ‘trControl = trainControl(method = “none”, …)’. Otherwise, ‘caret::train’ makes a resampling by default, with default tuning parameters (it’s actually training several resampled models instead of one). Which could explain why the computation is longer or why the object is sometimes bigger.

I guess you’re right ! it’s odd that a simple logistic regression is ten times (!) longer…

When you consider that the caret function you’re using fits 26 logistic regression models, 10 times longer seems pretty good to me =D

Interesting post. Maybe that’s what you did but I’d like to make sure: Use the function ‘caret::train’, with the parameter ‘trControl = trainControl(method = “none”, …)’. Otherwise, ‘caret::train’ makes a resampling by default, with default tuning parameters (it’s actually training several resampled models instead of one). Which could explain why the computation is longer or why the object is sometimes bigger.

I suppose that these will largely depend on the implementation, won’t it? For instance, see http://scikit-learn.org/ml-benchmarks/

Also, if I recall correctly, you can get pretty fast results with stochastic gradient descent too, in case you’d want to add some algorithms to your benchmark 🙂

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html

Thanks for your posts!

actually, one point of my talk is that there are usually two important things to consider a machine learning algorithm : the algorithm and its implementation. Some algorithms are nice, and clever, but slow, because of the way it was implemented. Others can have nice functions… I did not mention also algorithms ‘that do not converge’, which is a nightmare in real life

Indeed, there can be a 100x or even larger difference in training time for the various implementations. For example you can run logistic regression as glmnet(…, family = “binomial”, lambda = 0) or glm(…, family = binomial()) and it’s a huge difference. See a minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.) here: https://github.com/szilard/benchm-ml