(1)

Introduction to Machine Learning

NPFL 054

http://ufal.mff.cuni.cz/course/npfl054

Barbora Hladká Martin Holub Ivana Lukšová

{Hladka | Holub | Luksova}@ufal.mff.cuni.cz

Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

(2)

Practical feature selection and ensemble learning

1 NLI task

2 Feature selection

3 Ensemble learning - bagging and boosting

4 Ensemble learning - Random Forests

NPFL054, 2014 Hladká & Holub & Lukšová Lab 13, page 2/34

(3)

Native language identification

ARA, CHI, FRE, GER, HIN, ITA, JPN, KOR, SPA, TEL, TUR

[Figure: short excerpts of English essays written by non-native speakers (e.g. "Changing your habbits is not an easy thing , but one is urged to do it for a number of reasons ."), arranged around the L1 labels above; each excerpt is annotated with a proficiency level (low/medium/high) and a prompt ID (P1-P8).]

(4)

Native language identification

Identifying the native language (L1) of a writer based on a sample of their writing in a second language (L2)

Our data:

• L1s: ARA, CHI, FRE, GER, HIN, ITA, JPN, KOR, SPA, TEL, TUR

• L2: English

• Real-world objects: for each L1, 1000 texts in L2 from The ETS Corpus of Non-Native Written English (formerly known as TOEFL11)

Target class: L1


(5)

Features used in the intro

Relative character frequencies, i.e. 96 numerical features

"Finally having people with many academic broad know"

   <SPACE>          a          b          c          d          e
0.17073171 0.14634146 0.02439024 0.04878049 0.04878049 0.07317073
         f          g          h          i          k          l
0.02439024 0.02439024 0.04878049 0.09756098 0.02439024 0.07317073
         m          n          o          p          r          t
0.04878049 0.09756098 0.07317073 0.04878049 0.02439024 0.02439024
         v          w          y
0.02439024 0.04878049 0.04878049

T: Load the data and remove the "file" column.

Split it randomly into a train set and a test set.
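
A minimal sketch of these two tasks, assuming the feature table is stored in a CSV file (the file name nli-features.csv and the 80/20 split ratio are illustrative choices, not part of the assignment):

# load the feature table; the file name is only an assumption for illustration
data = read.csv("nli-features.csv")

# remove the "file" column, which merely identifies the source document
data = data[, names(data) != "file"]

# random 80/20 train/test split
set.seed(123)
train.idx = sample(nrow(data), round(0.8 * nrow(data)))
train = data[train.idx, ]
test  = data[-train.idx, ]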

(6)

Feature selection

Recall feature selection:

• reduce the feature space dimension in the dataset

• analyse the impact/importance of the extracted features

Feature selection methods can be divided into

• filters

• wrappers

• embedded methods



(8)

Feature selection in R - package FSelector

FSelector contains algorithms for:

• filtering features based on different criteria

entropy based filters

chi-squared based filter

...

• wrapping classifiers and searching the feature space

greedy approaches: forward search, backward search

best first search

exhaustive search

hill climbing search

Look at the package documentation


(9)

FSelector - filtering features

Entropy based filters

• information.gain

IG = H(Class) + H(Attribute) − H(Class, Attribute)

• gain.ratio

GR = ( H(Class) + H(Attribute) − H(Class, Attribute) ) / H(Attribute)

• symmetrical.uncertainty

SU = 2 · ( H(Class) + H(Attribute) − H(Class, Attribute) ) / ( H(Attribute) + H(Class) )
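
To make the information.gain formula concrete, a minimal sketch that computes it by hand for a single feature, assuming the data frame from the previous slides and using a simple equal-width binning of the numeric feature (FSelector applies its own discretization internally, so the exact value will differ):

library(FSelector)

# entropy of a discrete variable, in bits
H = function(x) {
    p = table(x) / length(x)
    p = p[p > 0]                 # drop empty categories
    -sum(p * log2(p))
}

# discretize one character-frequency feature into 5 equal-width bins
f = cut(data$c.1.5, breaks = 5)

# IG = H(Class) + H(Attribute) - H(Class, Attribute)
H(data$class) + H(f) - H(paste(data$class, f))

# the value reported by the package for the same feature
information.gain(class ~ c.1.5, data)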

(10)

FSelector - filtering features

T: Install and load the FSelector package

T: Find the 5 most useful features with information.gain (look at the help)

Useful function: cutoff.k

> weights1=information.gain(class ~ ., data)

> weights1

      attr_importance
c.1.1     0.029832319
c.1.2     0.007194058
c.1.3     0.000000000
c.1.4     0.013223123
c.1.5     0.114330009
...

> feature.subset1=cutoff.k(weights1, 5)

> feature.subset1

[1] "c.1.5" "c.1.77" "c.1.89" "c.1.64" "c.1.21"



(12)

FSelector - filtering features

T: Compare the results with the other 2 entropy-based filters

> weights2=gain.ratio(class ~ ., data)

> feature.subset2=cutoff.k(weights2, 5)

> feature.subset2

[1] "c.1.14" "c.1.63" "c.1.71" "c.1.5" "c.1.77"

> weights3=symmetrical.uncertainty(class ~ ., data)

> feature.subset3=cutoff.k(weights3, 5)

> feature.subset3

[1] "c.1.5" "c.1.77" "c.1.64" "c.1.89" "c.1.71"



(14)

Information gain feature importance


(15)

Gain ratio feature importance

(16)

Symmetrical uncertainty feature importance


(17)

FSelector - filtering features

Useful function: as.simple.formula

> f1=as.simple.formula(feature.subset1, "class")

> f1

class ~ c.1.5 + c.1.77 + c.1.89 + c.1.64 + c.1.21

T: Use the selected feature subsets to train 3 different classifiers and compare their performance.
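
The solution code on the next slide also uses formulas f2 and f3, which are not constructed there; presumably they are built from the other two filter subsets in the same way as f1:

> f2=as.simple.formula(feature.subset2, "class")
> f3=as.simple.formula(feature.subset3, "class")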

(18)

FSelector - filtering features

T: Use the selected feature subsets to train 3 different classifiers and compare their performance.

> library(rpart)
> model.f1=rpart(f1, train, cp=0.001)
> model.f2=rpart(f2, train, cp=0.001)
> model.f3=rpart(f3, train, cp=0.001)

> sum(mean(test$class == predict(model.f1, test, type="c")))
[1] 0.2345455
> sum(mean(test$class == predict(model.f2, test, type="c")))
[1] 0.2254545
> sum(mean(test$class == predict(model.f3, test, type="c")))
[1] 0.2318182


(19)

FSelector - wrapping an rpart classifier

You need to define a function that returns a value to evaluate the given feature subset

evaluator <- function(subset) {
    # k-fold cross validation
    k = 5
    # data.small: the dataset used for the wrapper search
    splits = runif(nrow(data.small))
    results = sapply(1:k, function(i) {
        test.idx  = (splits >= (i - 1) / k) & (splits < i / k)
        train.idx = !test.idx
        test  = data.small[test.idx, , drop=FALSE]
        train = data.small[train.idx, , drop=FALSE]
        tree = rpart(as.simple.formula(subset, "class"), train)
        error.rate = mean(test$class != predict(tree, test, type="c"))
        return(1 - error.rate)
    })
    print(subset)
    print(mean(results))
    return(mean(results))
}
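
A quick way to try the evaluator before running a full search, assuming data.small is first created, e.g. as a random sample of the data so that the repeated cross validation stays fast (the sample size and the two feature names, taken from the filter results above, are illustrative):

> data.small=data[sample(nrow(data), 2000), ]
> evaluator(c("c.1.5", "c.1.77"))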

(20)

FSelector - wrapping an rpart classifier

HW: Use the forward search algorithm to get a feature subset

# forward search on all features except the class

> subset=forward.search(names(data.small)[-97], evaluator)

> subset

[1] "c.1.5" "c.1.21" "c.1.44" "c.1.60" "c.1.64" "c.1.77" "c.1.86"

Then compare the classifier performance with the previous filter-based methods.
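
A minimal sketch of that comparison, reusing the train/test split and the rpart settings from the filter experiments; the wrapper-selected subset is turned into a formula and evaluated in exactly the same way:

> f.wrapper=as.simple.formula(subset, "class")
> model.wrapper=rpart(f.wrapper, train, cp=0.001)
> sum(mean(test$class == predict(model.wrapper, test, type="c")))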


(21)

Bagging and boosting

Recall:

• methods for combining multiple classifiers

• the more complementary the classifiers, the better the results that can usually be achieved

• remedy for unstable algorithms (such as decision trees)

• the output is most simply obtained by voting, possibly weighted (see the sketch below)
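
A minimal sketch of (weighted) majority voting over the predictions of the three rpart models trained earlier; the weights are purely illustrative:

# class predicted by each base classifier for every test example
votes = data.frame(m1 = predict(model.f1, test, type="c"),
                   m2 = predict(model.f2, test, type="c"),
                   m3 = predict(model.f3, test, type="c"))

# unweighted majority vote (ties resolved by taking the first maximum)
majority = apply(votes, 1, function(v) names(which.max(table(v))))
mean(test$class == majority)            # accuracy of the simple ensemble

# weighted vote: each classifier's vote counts with its weight
w = c(1, 0.5, 0.5)                      # illustrative weights
weighted = apply(votes, 1, function(v) names(which.max(tapply(w, v, sum))))
mean(test$class == weighted)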

(22)

Bagging and boosting

Bagging (= "Bootstrap Aggregation")

• method generates slightly different training sets using sampling with replacement

• each training sample is used to learn a possibly different predictor

• output is obtained by majority voting

Boosting

• sequential training of weak classifiers

• each base classifier is trained on data that is weighted based on the performance of the previous classifier

• output is obtained by weighted majority voting

AdaBoost



(24)

Bagging and boosting in R - package ada

package ada

• only binary classification

• uses rpart decision trees

• nice visualisation

• the function ada provides a variety of stochastic boosting models

– based on the work of Friedman et al. (Additive Logistic Regression: A Statistical View of Boosting, 2000)
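
Because ada supports only binary classification, applying it to the 11-class NLI data would require recasting the task, e.g. as one language versus the rest. A minimal sketch under that assumption (the choice of ARA, iter=50 and the derived is.ara column are illustrative):

> library(ada)
> train$is.ara=factor(train$class == "ARA")
> test$is.ara=factor(test$class == "ARA")
> model.ada=ada(is.ara ~ . - class, data=train, iter=50)
> mean(predict(model.ada, test) == test$is.ara)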


(25)

Bagging and boosting in R - package adabag

package adabag

• simpler than ada

• also uses rpart decision trees

• supports multi-class prediction

• bagging: Breiman's Bagging algorithm with the function bagging

• boosting: the AdaBoost.M1 algorithm with the function boosting

(26)

bagging in package adabag

> library(adabag)
> model.bagging = bagging(class~., data=train, mfinal=50, control=rpart.control(cp=0.001))

Parameters:

• formula

• data

• mfinal - number of trained classifiers

• control - parameters used in the training of the rpart classifiers


(27)

bagging in package adabag

> model.bagging$trees

... displays the trained rpart classifiers

> pred.bagging = predict.bagging(model.bagging, newdata = test, newmfinal=30)

• newmfinal - number of trees used in the prediction (can be used for pruning the model)

> sum(mean(test$class == pred.bagging$class))
[1] 0.285454545454545
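
The object returned by predict.bagging contains more than the predicted classes; among other components, it stores a confusion matrix and the error rate on the new data, which give the same accuracy figure more directly:

> pred.bagging$confusion   # confusion matrix on the test set
> pred.bagging$error       # error rate, i.e. 1 - accuracy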

(28)

boosting in package adabag

> model.adaboost = boosting(class~., data=train, mfinal=50, control=rpart.control(cp=0.001))

Parameters:

• formula

• data

• mfinal - number of trained classifiers

• control - parameters used in the training of the rpart classifiers


(29)

boosting in package adabag

> model.adaboost$trees

... displays the trained rpart classifiers

# displays the weights of classifiers

> model.adaboost$weights

[1] 0.002000003 0.002000003 0.002000003 0.002000003 0.002000003 ...

[199] 0.002000003 0.002000003

> pred.adaboost = predict.boosting(model.adaboost, newdata = test, newmfinal=30)

• newmfinal - number of trees used in the prediction (can be used for pruning the model)

> sum(mean(test$class == pred.adaboost$class))
[1] 0.291818181818182

(30)

Random Forests

• an ensemble method based on decision trees and bagging

• builds a number of random decision trees and then uses voting

• very good (state-of-the-art) prediction performance

• avoid overfitting

• 2 types of randomness:

random creation of training sets using sampling with replacement

random selection of features used in the training of each classifier

• each tree in the ensemble is grown out fully; each one overfits, but in different ways, and the mistakes are averaged out over the whole ensemble

• important: the more trees in the ensemble, the better



(32)

Random Forests in R

package randomForest

predictor = randomForest(formula, data=, ntree=, mtry=)

Parameters:

• ntree - number of trees in the ensemble, default value 500

• mtry - number of features randomly selected as candidates at each split; default value √(#features) for classification, #features/3 for regression

For other parameters, see the help.
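
For this dataset the default would be mtry = floor(√96) = 9; a short sketch of checking the default and trying an explicit value (mtry=20 is purely illustrative):

> library(randomForest)
> floor(sqrt(96))      # default mtry for a 96-feature classification task
[1] 9
> model.rf.20=randomForest(class~., data=train, ntree=500, mtry=20)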


(33)

Random Forests in R

T: Split the data into a train and a test set

T: Train a random forest on the train set

> model.rf = randomForest(class~., data=train, ntree=500)

T: Look at the structure of the trained predictor

> str(model.rf)

T: Predict on the test set and compute the accuracy of the predictor

> sum(mean(test$class == predict(model.rf, test)))
[1] 0.3654545
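
Printing the fitted forest is a useful check that needs no test set at all: for a classification forest it reports the out-of-bag (OOB) error estimate and the OOB confusion matrix:

> print(model.rf)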


(37)

Ensemble methods - back to feature selection

Predictors in the adabag and rpart packages compute feature importance during the training process

> model.bagging$imp

c.1.1 c.1.10 c.1.11 c.1.12 c.1.13 c.1.14

0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 2.22814204 ...

> model.adaboost$imp

c.1.1 c.1.10 c.1.11 c.1.12 c.1.13 c.1.14

0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 2.2491597 ...

(38)

Ensemble methods - back to feature selection

Predictors in the adabag and randomForest packages compute feature importance during the training process

> importance(model.rf)
      MeanDecreaseGini
c.1.1     1.585058e+02
c.1.2     2.228907e+01
c.1.3     5.502966e+00
c.1.4     3.807578e+01
c.1.5     3.509182e+02

...


(39)

Variable importance in model.bagging

> barplot(model.bagging$imp[order(model.bagging$imp, decreasing = TRUE)[1:7]],
          ylim = c(0, 50), main = "Bagging - Variables rel. importance",
          col = "lightblue")

(40)

Variable importance in model.adaboost

> barplot(model.adaboost$imp[order(model.adaboost$imp, decreasing = TRUE)[1:7]],
          ylim = c(0, 50), main = "Adaboost - Variables rel. importance",
          col = "lightblue")


(41)

Variable importance in model.rf

> varImpPlot(model.rf)

(42)

Homework exercises

• Look at the documentation of the packages FSelector, ada, adabag and randomForest

• Good luck with your term project!

