
Introduction to Machine Learning

NPFL 054

http://ufal.mff.cuni.cz/course/npfl054

Barbora Hladká hladka@ufal.mff.cuni.cz

Martin Holub holub@ufal.mff.cuni.cz

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics


Lecture #6

Outline

Logistic regression

Evaluation of binary classifiers


Decision boundary

A task of binary classification: Y = {0, 1}

The decision boundary takes the form of a function f and partitions the feature space into two regions, one for each class.


Binary classification: Hyperplane

A hyperplane is a linear decision boundary of the form

Θᵀx = 0,

where the direction of ⟨θ₁, θ₂, …, θₘ⟩ is perpendicular to the hyperplane and θ₀ determines the position of the hyperplane with respect to the origin.


Hyperplane

• point if m = 1, line if m = 2, plane if m = 3, . . .

• we can use a hyperplane for classification so that

f(x) = 1 if θ₀ + θ₁x₁ + ⋯ + θₘxₘ ≥ 0
f(x) = 0 if θ₀ + θ₁x₁ + ⋯ + θₘxₘ < 0

Linear classifiers classify examples using hyperplanes.
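A minimal R sketch of this rule; the parameter values are hypothetical, chosen only for illustration:

  # hypothetical parameters for a two-feature example
  theta0 <- -1           # position of the hyperplane w.r.t. the origin
  theta  <- c(2, 3)      # direction perpendicular to the hyperplane

  # classify x by which side of the hyperplane it falls on
  f <- function(x) as.integer(theta0 + sum(theta * x) >= 0)

  f(c(0.5, 0.1))   # 1: theta0 + theta'x = 0.3 >= 0
  f(c(0.1, 0.1))   # 0: theta0 + theta'x = -0.5 < 0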


Binary classification

Can we use linear regression?


Fit the data with a linear function f


Classify:

• if f (x) ≥ 0.5, predict 1

• if f (x) < 0.5, predict 0
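A sketch of this idea in R, on hypothetical data generated only for illustration:

  set.seed(1)
  A1 <- c(rnorm(20, mean = 0), rnorm(20, mean = 3))   # hypothetical feature
  Y  <- rep(c(0, 1), each = 20)                       # binary target

  fit  <- lm(Y ~ A1)                        # fit the data with a linear function f
  pred <- as.integer(predict(fit) >= 0.5)   # classify by thresholding f at 0.5
  mean(pred == Y)                           # training accuracy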


Add one more training instance with an extreme feature value: the fitted line tilts, the point where f(x) = 0.5 shifts, and examples that were classified correctly before are now misclassified.


We are heading for the logistic regression algorithm.

[Plot: binary training data, target Y ∈ {0, 1} plotted against feature A1]


Logistic regression

Logistic regression is a classification algorithm.

Its target hypothesis f for binary classification has the form of a sigmoid function

f(x; Θ) = 1 / (1 + e^(−Θᵀx)) = e^(Θᵀx) / (1 + e^(Θᵀx))

[Plot: the sigmoid function g(z), rising from 0 to 1 for z from −6 to 6]

g(z) = 1 / (1 + e^(−z))

• lim z→+∞ g(z) = 1

• lim z→−∞ g(z) = 0
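In R the sigmoid is available as plogis, the logistic distribution function; a quick check of the definition and its limits:

  g <- function(z) 1 / (1 + exp(-z))   # sigmoid
  g(0)                                 # 0.5
  g(c(-10, 10))                        # close to the limits 0 and 1
  all.equal(g(2), plogis(2))           # TRUE: plogis is R's built-in sigmoid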


f(x; Θ) = 1 / (1 + e^(−θ₀ − θ₁x₁))

[Plot: sigmoid curves over feature A1 with θ₁ = 2 fixed and θ₀ ∈ {0, 3, −3}; varying θ₀ shifts the curve along A1]


[Plot: sigmoid curves over feature A1 with θ₀ = 0 fixed and θ₁ ∈ {2, −2, 6}; the sign of θ₁ sets the direction of the transition and its magnitude the steepness]


Classification rule

Predict a target value using f̂(x; Θ̂) so that

• if f̂(x; Θ̂) ≥ 0.5, i.e. Θ̂ᵀx ≥ 0, predict 1

• if f̂(x; Θ̂) < 0.5, i.e. Θ̂ᵀx < 0, predict 0

[Plot: the fitted sigmoid f̂ over feature A1 together with the training data; the 0.5 threshold splits A1 into the two predicted classes]
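The rule in R with glm, reusing the hypothetical A1/Y data from the linear-regression sketch above:

  set.seed(1)
  A1 <- c(rnorm(20, mean = 0), rnorm(20, mean = 3))
  Y  <- rep(c(0, 1), each = 20)

  fit  <- glm(Y ~ A1, family = binomial)    # logistic regression
  phat <- predict(fit, type = "response")   # f-hat(x; Theta-hat)
  pred <- as.integer(phat >= 0.5)           # the 0.5 rule
  table(pred, Y)                            # predictions vs. true classes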


Logistic regression: Derivation

Interpretation of f(x; Θ): it models the conditional probability Pr(y = 1 | x; Θ):

f(x; Θ) = Pr(y = 1 | x; Θ)

1. The target attribute is categorical: Y = {0, 1}.

2. A linear model of the target itself, y = θ₀ + θ₁x₁ + ⋯ + θₘxₘ, does not work (see above); model Pr(Y = y | x) instead, e.g. Pr(Y = 1 | x).

3. A linear model of the probability, Pr(Y = 1 | x) = θ₀ + θ₁x₁ + ⋯ + θₘxₘ, does not work either (see above): the right-hand side is not confined to [0, 1].

4. Model the odds: odds(Pr(Y = 1 | x)) = Pr(Y = 1 | x) / Pr(Y = 0 | x) = Pr(Y = 1 | x) / (1 − Pr(Y = 1 | x)) ∈ [0, +∞)


odds = Pr(success) / Pr(failure)

Example: the Titanic data set

> d <- read.csv("train.csv")
> attach(d)
> table(Sex, Survived)
        Survived
Sex        0   1
  female  81 233
  male   468 109
> detach()

• the odds of surviving for males:
  Pr(Survived = 1 | Sex = male) / Pr(Survived = 0 | Sex = male) = 109/468 = 0.23

• the odds of surviving for females:
  Pr(Survived = 1 | Sex = female) / Pr(Survived = 0 | Sex = female) = 233/81 = 2.88

• the ratio of the odds for females to the odds for males: 2.88/0.23 = 12.52
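The same numbers can be computed directly from the contingency table; a minimal sketch assuming only the counts above:

  # counts from the table above: rows = Sex, columns = Survived (0, 1)
  tab <- matrix(c(81, 468, 233, 109), nrow = 2,
                dimnames = list(Sex = c("female", "male"),
                                Survived = c("0", "1")))

  odds <- tab[, "1"] / tab[, "0"]   # odds of surviving for each sex
  odds                              # female ~2.88, male ~0.23
  odds["female"] / odds["male"]     # odds ratio for females vs. males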


Logit

5. Transform [0, +∞) to (−∞, +∞): model

logit(Pr(Y = 1 | x)) = ln(odds(Pr(Y = 1 | x))) = ln( Pr(Y = 1 | x) / (1 − Pr(Y = 1 | x)) )

6. Use linear regression:

ln( Pr(Y = 1 | x) / (1 − Pr(Y = 1 | x)) ) = θ₀ + θ₁x₁ + ⋯ + θₘxₘ

i.e.,

Pr(Y = 1 | x) = 1 / (1 + e^(−θ₀ − θ₁x₁ − ⋯ − θₘxₘ))

f(xᵢ; Θ) = Pr(Yᵢ = 1 | xᵢ; Θ) = 1 / (1 + e^(−Θᵀxᵢ))
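A quick numerical check of the logit/sigmoid inversion in R (qlogis is the logit, plogis its inverse):

  p <- 0.8
  z <- qlogis(p)      # logit: ln(p / (1 - p))
  log(p / (1 - p))    # the same value, ~1.386
  plogis(z)           # back to 0.8: the sigmoid inverts the logit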


Binary features

• Use female = {1, 0} instead of Sex = {female, male}.

• In linear regression, y = θ₀ + θ₁ · female:
  θ₀ is the average y for males
  θ₀ + θ₁ is the average y for females
  θ₁ is the average difference in y between females and males

• In logistic regression, with p = Pr(Survived = 1 | x; Θ): ln( p / (1 − p) ) = θ₀ + θ₁ · female

If female == 0:

p = p₁ → ln( p₁ / (1 − p₁) ) = θ₀ → p₁ / (1 − p₁) = e^θ₀

• the intercept θ₀ is the log odds for males

If female == 1:

p = p₂ → ln( p₂ / (1 − p₂) ) = θ₀ + θ₁ → p₂ / (1 − p₂) = e^(θ₀ + θ₁)

• odds ratio = ( p₂ / (1 − p₂) ) / ( p₁ / (1 − p₁) ) = e^θ₁

• the parameter θ₁ is the log odds ratio between females and males
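A sketch of this in R, assuming the Titanic train.csv from the earlier slide is available:

  d <- read.csv("train.csv")
  d$female <- as.integer(d$Sex == "female")   # recode Sex as a 0/1 feature

  fit <- glm(Survived ~ female, data = d, family = binomial)
  coef(fit)        # theta0 = log odds for males, theta1 = log odds ratio
  exp(coef(fit))   # the odds for males and the female/male odds ratio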


Parameter interpretation: Numerical features

θᵢ gives the average change in logit(f(x)) for a one-unit change in Aᵢ, holding all other features fixed.
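For example, with a hypothetical θᵢ = 0.5, a one-unit increase in Aᵢ adds 0.5 to the log odds, i.e. multiplies the odds by e^0.5:

  exp(0.5)   # ~1.65: factor by which the odds change per one unit of Ai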


Loss function

L(Θ) = − Σᵢ₌₁ⁿ [ yᵢ log P(yᵢ | xᵢ; Θ) + (1 − yᵢ) log(1 − P(yᵢ | xᵢ; Θ)) ]

See Maximum Likelihood Principle for derivation of this loss function.

Optimization problem

Θ* = argmin_Θ L(Θ)
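A minimal sketch of minimizing L(Θ) numerically in R on the hypothetical A1/Y data from earlier, instead of relying on glm's built-in solver:

  set.seed(1)
  A1 <- c(rnorm(20, mean = 0), rnorm(20, mean = 3))
  Y  <- rep(c(0, 1), each = 20)

  # negative log-likelihood; Theta = c(theta0, theta1)
  nll <- function(Theta, x, y) {
    p <- plogis(Theta[1] + Theta[2] * x)   # f(x; Theta)
    -sum(y * log(p) + (1 - y) * log(1 - p))
  }

  opt <- optim(c(0, 0), nll, x = A1, y = Y)   # general-purpose minimizer
  opt$par   # close to coef(glm(Y ~ A1, family = binomial))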


Parameter estimates

L(Θ) = − Σᵢ₌₁ⁿ [ yᵢ log P(yᵢ | xᵢ; Θ) + (1 − yᵢ) log(1 − P(yᵢ | xᵢ; Θ)) ]

[Plot: the per-example loss −log(p) as a function of f(xᵢ) for yᵢ = 1; the loss approaches 0 as f(xᵢ) → 1 and grows without bound as f(xᵢ) → 0]
