Introduction to Machine Learning
NPFL 054
http://ufal.mff.cuni.cz/course/npfl054
Barbora Hladká, Martin Holub, Ivana Lukšová
{Hladka | Holub | Luksova}@ufal.mff.cuni.cz
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Practical feature selection and ensemble learning
1 NLI task
2 Feature selection
3 Ensemble learning - bagging and boosting
4 Ensemble learning - Random Forests
Native language identification
[Figure: the eleven L1 labels (ARA, CHI, FRE, GER, HIN, ITA, JPN, KOR, SPA, TEL, TUR) surrounded by sample L2 sentences from the corpus, each annotated with the writer's proficiency (low/medium/high) and essay prompt (P1-P8); e.g. "Changing your habbits is not an easy thing , but one is urged to do it for a number of reasons ." (Proficiency: high, Prompt: P2)]
Native language identification
Identifying the native language (L1) of a writer based on a sample of their writing in a second language (L2)
Our data:
• L1s: ARA, CHI, FRE, GER, HIN, ITA, JPN, KOR, SPA, TEL, TUR
• L2: English
• Real-world objects: For each L1, 1000 texts in L2 from the ETS Corpus of Non-Native Written English (formerly TOEFL11)
• Target class: L1
Features used in the intro
Relative character frequencies, i.e. 96 numerical features
"Finally having people with many academic broad know"
<SPACE> a b c d e
0.17073171 0.14634146 0.02439024 0.04878049 0.04878049 0.07317073
m n o F g h
0.04878049 0.09756098 0.07317073 0.02439024 0.02439024 0.04878049
i k l p r t
0.09756098 0.02439024 0.07317073 0.04878049 0.02439024 0.02439024
v w y
0.02439024 0.04878049 0.04878049
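A minimal sketch of how such relative character frequencies can be computed in R (the helper name char.freqs is our own illustration, not course code):

char.freqs <- function(s) {
  chars <- strsplit(s, "")[[1]]    # split the string into single characters
  table(chars) / length(chars)     # relative frequency of each character
}

> char.freqs("Finally having people with many academic broad know")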
T: Load the data and remove the "file" column.
T: Split it randomly into train and test sets.
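A possible sketch of this task, assuming the dataset sits in a CSV file (the file name nli.csv and the 80/20 split ratio are our assumptions):

data <- read.csv("nli.csv")                   # hypothetical file name
data$file <- NULL                             # remove the "file" column
set.seed(123)                                 # reproducible split
idx <- sample(nrow(data), 0.8 * nrow(data))   # assumed 80/20 split
train <- data[idx, ]
test <- data[-idx, ]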
Feature selection
Recall feature selection:
• reduce the feature space dimension in the dataset
• analyse the impact/importance of the extracted features
Feature selection methods can be divided into
• filters
• wrappers
• embedded methods
Feature selection in R - package FSelector
FSelector contains algorithms for:
• filtering features based on different criteria
• entropy based filters
• chi-squared based filter
• ...
• wrapping classifiers and searching the feature space
• greedy approaches: forward search, backward search, best first search
• exhaustive search
• hill climbing search
Look at the package documentation
FSelector - filtering features
Entropy based filters
• information.gain
  H(Class) + H(Attribute) - H(Class, Attribute)
• gain.ratio
  (H(Class) + H(Attribute) - H(Class, Attribute)) / H(Attribute)
• symmetrical.uncertainty
  2 (H(Class) + H(Attribute) - H(Class, Attribute)) / (H(Attribute) + H(Class))
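To make the formulas concrete, here is a minimal sketch of information gain computed by hand for one discrete attribute (FSelector itself discretizes numerical features first; the helper names below are our own):

entropy <- function(x) {
  p <- table(x) / length(x)
  -sum(p * log2(p + (p == 0)))   # the (p == 0) guard makes 0 * log2(0) contribute 0
}
joint.entropy <- function(x, y) entropy(paste(x, y))   # H(Class, Attribute)
info.gain <- function(cls, attr) entropy(cls) + entropy(attr) - joint.entropy(cls, attr)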
FSelector - filtering features
T: Install and load the FSelector package
T: Find the 5 most useful features with information.gain (look at the help)
Useful function: cutoff.k
> weights1=information.gain(class ~ ., data)
> weights1
      attr_importance
c.1.1     0.029832319
c.1.2     0.007194058
c.1.3     0.000000000
c.1.4     0.013223123
c.1.5     0.114330009
...
> feature.subset1=cutoff.k(weights1, 5)
> feature.subset1
[1] "c.1.5" "c.1.77" "c.1.89" "c.1.64" "c.1.21"
FSelector - filtering features
T: Compare the results with the other 2 entropy-based filters
> weights2=gain.ratio(class ~ ., data)
> feature.subset2=cutoff.k(weights2, 5)
> feature.subset2
[1] "c.1.14" "c.1.63" "c.1.71" "c.1.5" "c.1.77"
> weights3=symmetrical.uncertainty(class ~ ., data)
> feature.subset3=cutoff.k(weights3, 5)
> feature.subset3
[1] "c.1.5" "c.1.77" "c.1.64" "c.1.89" "c.1.71"
Information gain feature importance (figure)
Gain ratio feature importance (figure)
Symmetrical uncertainty feature importance (figure)
FSelector - filtering features
Useful function: as.simple.formula
> f1=as.simple.formula(feature.subset1, "class")
> f1
class ~ c.1.5 + c.1.77 + c.1.89 + c.1.64 + c.1.21
T: Use the selected feature subsets to train 3 different classifiers and compare their performance.
> model.f1=rpart(f1, train, cp=0.001)
> model.f2=rpart(f2, train, cp=0.001)
> model.f3=rpart(f3, train, cp=0.001)
> mean(test$class == predict(model.f1, test, type="class"))
[1] 0.2345455
> mean(test$class == predict(model.f2, test, type="class"))
[1] 0.2254545
> mean(test$class == predict(model.f3, test, type="class"))
[1] 0.2318182
FSelector - wrapping an rpart classifier
You need to define a function that returns a value evaluating the given feature subset:
evaluator <- function(subset) {
  # k-fold cross validation
  k = 5
  splits = runif(nrow(data.small))
  results = sapply(1:k, function(i) {
    test.idx = (splits >= (i - 1) / k) & (splits < i / k)
    train.idx = !test.idx
    test = data.small[test.idx, , drop=FALSE]
    train = data.small[train.idx, , drop=FALSE]
    tree = rpart(as.simple.formula(subset, "class"), train)
    error.rate = mean(test$class != predict(tree, test, type="class"))
    return(1 - error.rate)
  })
  print(subset)
  print(mean(results))
  return(mean(results))
}
HW: Use the forward search algorithm to get a feature subset
# forward search on all features except the class
> subset=forward.search(names(data.small)[-97], evaluator)
> subset
[1] "c.1.5" "c.1.21" "c.1.44" "c.1.60" "c.1.64" "c.1.77" "c.1.86"
Then compare the classifier performance with the previous filter-based methods.
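One way to run the comparison, reusing the train/test split from before (a sketch, not the official solution):

> f.wrap = as.simple.formula(subset, "class")
> model.wrap = rpart(f.wrap, train, cp=0.001)
> mean(test$class == predict(model.wrap, test, type="class"))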
Bagging and boosting
Recall:
• methods for combining multiple classifiers
• the more complementary the classifiers, the better the results that can usually be achieved
• a remedy for unstable algorithms (such as decision trees)
• the output is most simply obtained by voting (possibly weighted)
Bagging and boosting
Bagging (= "Bootstrap Aggregation")
• generates slightly different training sets by sampling with replacement
• each training sample is used to learn a possibly different predictor
• output is obtained by majority voting
Boosting
• sequential training of weak classifiers
• each base classifier is trained on data that is weighted based on the performance of the previous classifier
• output is obtained by weighted majority voting
• AdaBoost
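For intuition, a minimal bagging sketch built directly on rpart (the 25 trees and all variable names are our own choices; the adabag package introduced below wraps up the same idea):

library(rpart)
set.seed(42)
trees = lapply(1:25, function(i) {
  boot = train[sample(nrow(train), replace = TRUE), ]   # bootstrap sample
  rpart(class ~ ., boot, control = rpart.control(cp = 0.001))
})
# collect each tree's predictions and take a row-wise majority vote
votes = sapply(trees, function(t) as.character(predict(t, test, type = "class")))
pred = apply(votes, 1, function(v) names(which.max(table(v))))
mean(test$class == pred)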
Bagging and boosting in R - package ada
package ada
• only binary classification
• uses rpart decision trees
• nice visualisation
• function ada provides a variety of stochastic boosting models
– based on the work published by Friedman et al. (Additive Logistic Regression: A Statistical View of Boosting, 2000)
Bagging and boosting in R - package adabag
package adabag
• simpler than ada
• also uses rpart decision trees
• multiple class prediction
• bagging: Breiman's Bagging algorithm with function bagging
• boosting: Adaboost.M1 algorithm with function boosting
bagging in package adabag
> model.bagging = bagging(class~., data=train, mfinal=50, control=rpart.control(cp=0.001))
Parameters:
• formula
• data
• mfinal - number of trained classifiers
• control - parameters used in the training of the rpart classifiers
bagging in package adabag
> model.bagging$trees
... displays the trained rpart classifiers
> pred.bagging = predict.bagging(model.bagging, newdata = test, newmfinal=30)
• newmfinal - number of trees used in the prediction (can be used for pruning the model)
> mean(test$class == pred.bagging$class)
[1] 0.285454545454545
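adabag also provides errorevol to watch how the test error develops as trees are added; a sketch (assuming, per the package documentation, that the returned object carries an error vector):

> evol.test = errorevol(model.bagging, newdata = test)
> plot(evol.test$error, type = "l", xlab = "number of trees", ylab = "test error")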
boosting in package adabag
> model.adaboost = boosting(class~., data=train, mfinal=50, control=rpart.control(cp=0.001))
Parameters:
• formula
• data
• mfinal - number of trained classifiers
• control - parameters used in the training of the rpart classifiers
boosting in package adabag
> model.adaboost$trees
... displays the trained rpart classifiers
# displays the weights of classifiers
> model.adaboost$weights
[1] 0.002000003 0.002000003 0.002000003 0.002000003 0.002000003 ...
[199] 0.002000003 0.002000003
> pred.adaboost = predict.boosting(model.adaboost, newdata = test, newmfinal=30)
• newmfinal - number of trees used in the prediction (can be used for pruning the model)
> mean(test$class == pred.adaboost$class)
[1] 0.291818181818182
Random Forests
• an ensemble method based on decision trees and bagging
• builds a number of random decision trees and then uses voting
• very good (state-of-the-art) prediction performance
• avoids overfitting
• 2 types of randomness:
• random creation of training sets using sampling with replacement
• random selection of features used in the training of each classifier
• each tree in the ensemble is grown out fully; each one overfits, but in different ways, and the mistakes are averaged out over them all
• important: the more trees in the ensemble, the better (see the sketch below)
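A quick sketch of the last point: train forests of increasing size and watch the test accuracy climb and then flatten (the grid of sizes is arbitrary; train and test are the splits from before):

library(randomForest)
sizes = c(10, 50, 100, 250, 500)
accs = sapply(sizes, function(n) {
  rf = randomForest(class ~ ., data = train, ntree = n)
  mean(test$class == predict(rf, test))
})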
Random Forests in R
package randomForest
predictor = randomForest(formula, data=, ntree=, mtry=)
Parameters:
• ntree - number of trees in the ensemble, default value 500
• mtry - number of features randomly selected as candidates at each split; default value is sqrt(p) for classification and p/3 for regression, where p is the number of features
For other parameters, see the help
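For example, the classification default can be written out explicitly (a sketch; ncol(train) - 1 assumes the class column is the only non-feature column):

> p = ncol(train) - 1                  # number of features
> model.rf2 = randomForest(class ~ ., data = train, ntree = 500, mtry = floor(sqrt(p)))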
Random Forests in R
T: Split the data into train and test sets
T: Train a random forest on the train set
> model.rf = randomForest(class~., data=train, ntree=500)
T: Look at the structure of the trained predictor
> str(model.rf)
T: Predict on the test set and compute the accuracy of the predictor
> mean(test$class == predict(model.rf, test))
[1] 0.3654545
Ensemble methods - back to feature selection
Predictors in the adabag and rpart packages compute feature importance during the training process
> model.bagging$imp
c.1.1 c.1.10 c.1.11 c.1.12 c.1.13 c.1.14
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 2.22814204 ...
> model.adaboost$imp
c.1.1 c.1.10 c.1.11 c.1.12 c.1.13 c.1.14
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 2.2491597 ...
Ensemble methods - back to feature selection
Predictors in the adabag and randomForest packages compute feature importance during the training process
> importance(model.rf)
      MeanDecreaseGini
c.1.1     1.585058e+02
c.1.2     2.228907e+01
c.1.3     5.502966e+00
c.1.4     3.807578e+01
c.1.5     3.509182e+02
...
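The forest's importance ranking can itself serve as a filter, e.g. to build a 5-feature formula in the style of FSelector (a sketch; MeanDecreaseGini is the column importance() reports for classification forests):

> imp = importance(model.rf)[, "MeanDecreaseGini"]
> top5 = names(sort(imp, decreasing = TRUE))[1:5]
> f.rf = as.simple.formula(top5, "class")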
Variable importance in model.bagging
> barplot(model.bagging$imp[order(model.bagging$imp, decreasing = TRUE)[1:7]],
    ylim = c(0, 50), main = "Bagging - Variables rel. importance",
    col = "lightblue")
Variable importance in model.adaboost
> barplot(model.adaboost$imp[order(model.adaboost$imp, decreasing = TRUE)[1:7]],
    ylim = c(0, 50), main = "Adaboost - Variables rel. importance",
    col = "lightblue")
Variable importance in model.rf
> varImpPlot(model.rf)
Homework exercises
• Look at the documentation of the packages FSelector, ada, adabag and randomForest
• Good luck with your term project!