Sunday, July 26, 2015

Predicting Titanic deaths on Kaggle II: gbm

Following my previous post I have decided to try and use a different method: generalized boosted regression models (gbm). I have read the background in Elements of Statistical Learning and arthur charpentier's nice post on it. This data is a nice occasion to get my hands dirty.

Data 

Data as before. However, I have added some more variables. In addition, during the analysis it appeared that gbm does not like to have logical variables in the x-variables. One missing value of Fare in the test set gets the median value in order to avoid having missing values in the data. I must say I like using dplyr for this data handing. It allows me to use the same code for training and test with minimum effort.
library(dplyr)
library(gbm)

set.seed(4321)
titanic <- read.csv('train.csv') %>%
    mutate(.,Pclass=factor(Pclass),
        Survived=factor(Survived),
        age=ifelse(is.na(Age),35,Age),
        age = cut(age,c(0,2,5,9,12,15,21,55,65,100)),
        Title=sapply(Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2]),
        Title=gsub(' ','',Title),
        Title =ifelse(Title %in% c('Capt','Col','Don','Sir','Jonkheer','Major'),'Mr',Title),
        Title =ifelse(Title %in% c('Lady','Ms','theCountess','Mlle','Mme','Ms','Dona'),'Miss',Title),
        Title = factor(Title),
        A=factor(grepl('A',Cabin)),
        B=factor(grepl('B',Cabin)),
        C=factor(grepl('C',Cabin)),
        D=factor(grepl('D',Cabin)),
        E=factor(grepl('E',Cabin)),
        F=factor(grepl('F',Cabin)),
        ncabin=nchar(as.character(Cabin)),
        PC=factor(grepl('PC',Ticket)),
        STON=factor(grepl('STON',Ticket)),
        cn = as.numeric(gsub('[[:space:][:alpha:]]','',Cabin)),
        oe=factor(ifelse(!is.na(cn),cn%%2,-1)),
        train = sample(c(TRUE,FALSE),
            size=891,
            replace=TRUE, 
            prob=c(.9,.1)   ) )

test <- read.csv('test.csv') %>%
    mutate(.,
        Embarked=factor(Embarked,levels=levels(titanic$Embarked)),
        Pclass=factor(Pclass),
  #      Survived=factor(Survived),
        age=ifelse(is.na(Age),35,Age),
        age = cut(age,c(0,2,5,9,12,15,21,55,65,100)),
        Title=sapply(Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2]),
        Title=gsub(' ','',Title),
        Title =ifelse(Title %in% c('Capt','Col','Don','Sir','Jonkheer','Major'),'Mr',Title),
        Title =ifelse(Title %in% c('Lady','Ms','theCountess','Mlle','Mme','Ms','Dona'),'Miss',Title),
        Title = factor(Title),
        A=factor(grepl('A',Cabin)),
        B=factor(grepl('B',Cabin)),
        C=factor(grepl('C',Cabin)),
        D=factor(grepl('D',Cabin)),
        E=factor(grepl('E',Cabin)),
        F=factor(grepl('F',Cabin)),
        ncabin=nchar(as.character(Cabin)),
        PC=factor(grepl('PC',Ticket)),
        STON=factor(grepl('STON',Ticket)),
        cn = as.numeric(gsub('[[:space:][:alpha:]]','',Cabin)),
        oe=factor(ifelse(!is.na(cn),cn%%2,-1))
)
test$Fare[is.na(test$Fare)]<- median(titanic$Fare)

Age

Age has missing values, thus the first task is to fill those. Since gbm is the method used for the main analysis, I will be used it for age too. This has the added advantage that I can exercise with both a numerical and a categorical variable as response.
One of the things with boosting is that it opens itself to over fitting. Boosting consists of adding trees which are structured to improve fit. At some point the trees will just start boost the noise rather than the structure. Gbm comes with a cross validation (cv) option, which is the preferred way to get the predictive qualities of models, and cv is used to determine the optimum number of trees. But, there is catch, it throws an error if there are variables in the data.frame which are not used in the model. Hence in the code below first the data is selected, and subsequently the model run.
The model parameters, interaction.depth=4, shrinkage=0.0005 come from a bit of experimenting. n.trees has to be high enough that it is clear the optimum number of trees is lower than the number estimated. It seems n.cores=2 works under both windows and linux.
forage <- filter(titanic,!is.na(titanic$Age)) %>%
     select(.,Age,SibSp,Parch,Fare,Sex,Pclass,Title,Embarked,A,B,C,D,E,F,ncabin,PC,STON,oe)

rfa1 <- gbm(Age ~ ., 
    data=forage,
    interaction.depth=4,
    cv.folds=10,
    n.trees=8000,
    shrinkage=0.0005,
    n.cores=2)

gbm.perf(rfa1)
Using cv method...
[1] 6824
It is time here to confess that I have been working on this over several sessions. It appears that when I created the code, 7118 trees were optimum, while I stored that data for a session with 6824 trees. Thus is the way of these methods, unlike traditional statistical methods, they have a different result any time. But, as can be seen from the plot, the difference should be minimal.
titanic$AGE<- titanic$Age
titanic$AGE[is.na(titanic$AGE)] <- predict(rfa1,titanic,n.trees=7118)[is.na(titanic$Age)]
test$AGE<- test$Age
test$AGE[is.na(test$AGE)] <- predict(rfa1,test,n.trees=7118)[is.na(test$Age)]

Survival

During the calculations I learned that the response should be a float containing 0 and 1. With two categories there are various distributions to be used: bernoulli, huberized and adaboost. Using the 10% test data I had set apart, it seemed adaboost worked best for these data. 
gb1 <- filter(titanic,train) %>%
    select(.,age,SibSp,Parch,Fare,Sex,Pclass,
        Title,Embarked,A,B,C,D,E,F,ncabin,PC,STON,oe,AGE,Survived)%>%
      mutate(Survived=c(0,1)[Survived]) # not integer or factor but float
#table(gb1$Survived)
gb1m <-      gbm(Survived ~ .,
         cv.folds=11,
         n.cores=2,
         interaction.depth=5,
         shrinkage = 0.0005,
         distribution='adaboost',
    data=gb1,
    n.trees=10000)
gbm.perf(gb1m)
Using cv method...
[1] 6355

In my code I have used 6000 trees. 
One thing about gbm is that it does not respond with categories. It is a proportion answers for either category.
preds <- predict(gb1m,titanic,
    n.trees=6000,type='response')
density(preds) %>% plot
Thus there is a need or opportunity to determine the cut off point. For this my test set comes in somewhat handy.
preds2<- preds[!titanic$train]
target <- c(0,1)[titanic$Survived[!titanic$train]]
sapply(seq(.3,.7,.01),function(step)
  c(step,sum(ifelse(preds2<step,0,1)!=target)))
     [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
[1,]  0.3  0.31  0.32  0.33  0.34  0.35  0.36  0.37  0.38  0.39   0.4  0.41
[2,] 17.0 17.00 17.00 17.00 17.00 17.00 17.00 17.00 18.00 17.00  16.0 16.00
     [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
[1,]  0.42  0.43  0.44  0.45  0.46  0.47  0.48  0.49   0.5  0.51  0.52  0.53
[2,] 16.00 16.00 15.00 14.00 14.00 14.00 13.00 15.00  15.0 15.00 16.00 16.00
     [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36]
[1,]  0.54  0.55  0.56  0.57  0.58  0.59   0.6  0.61  0.62  0.63  0.64  0.65
[2,] 16.00 17.00 17.00 17.00 17.00 18.00  18.0 18.00 18.00 19.00 20.00 20.00
     [,37] [,38] [,39] [,40] [,41]
[1,]  0.66  0.67  0.68  0.69   0.7
[2,] 20.00 21.00 21.00 21.00  21.0
It is a bit messy output, but at 0.48 the least errors are found.

Predictions

This is fairly straightforward. I am not unhappy to report an improvement, bringing me from tail to middle region of the peloton.
pp <- predict(gb1m,test,n.trees=6000,type='response')
pp <- ifelse(pp<0.48,0,1)
out <- data.frame(
    PassengerId=test$PassengerId,
    Survived=pp,row.names=NULL)
write.csv(x=out,
    file='gbm.csv',
    row.names=FALSE,
    quote=FALSE)

No comments:

Post a Comment