
Introduction to Machine Learning

NPFL 054

http://ufal.mff.cuni.cz/course/npfl054

Barbora Hladká hladka@ufal.mff.cuni.cz

Martin Holub holub@ufal.mff.cuni.cz

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics


Lecture #6

Outline

Logistic regression

Evaluation of binary classifiers


Decision boundary

A task of binary classification: Y = {0, 1}

The decision boundary takes the form of a function f and partitions the feature space into two regions, one for each class.


Binary classification: Hyperplane

A hyperplane is a linear decision boundary of the form

Θᵀx = 0,

where the direction of ⟨θ₁, θ₂, …, θₘ⟩ is perpendicular to the hyperplane and θ₀ determines the position of the hyperplane with respect to the origin.


Hyperplane

• point if m = 1, line if m = 2, plane if m = 3, . . .

• we can use a hyperplane for classification so that

f(x) = 1 if θ₀ + θ₁x₁ + ⋯ + θₘxₘ ≥ 0
f(x) = 0 if θ₀ + θ₁x₁ + ⋯ + θₘxₘ < 0

Linear classifiers classify examples using hyperplanes.
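A minimal R sketch of this rule; the parameter values are hypothetical, chosen only for illustration:

  # hypothetical parameters for a two-feature example
  theta0 <- -1           # position of the hyperplane w.r.t. the origin
  theta  <- c(2, 3)      # direction perpendicular to the hyperplane

  # classify x by which side of the hyperplane it falls on
  f <- function(x) as.integer(theta0 + sum(theta * x) >= 0)

  f(c(0.5, 0.1))   # 1: theta0 + theta'x = 0.3 >= 0
  f(c(0.1, 0.1))   # 0: theta0 + theta'x = -0.5 < 0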


Binary classification

Can we use linear regression?


Fit the data with a linear function f


Classify:

• if f (x) ≥ 0.5, predict 1

• if f (x) < 0.5, predict 0
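A sketch of this idea in R, on hypothetical data generated only for illustration:

  set.seed(1)
  A1 <- c(rnorm(20, mean = 0), rnorm(20, mean = 3))   # hypothetical feature
  Y  <- rep(c(0, 1), each = 20)                       # binary target

  fit  <- lm(Y ~ A1)                        # fit the data with a linear function f
  pred <- as.integer(predict(fit) >= 0.5)   # classify by thresholding f at 0.5
  mean(pred == Y)                           # training accuracy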


Add one more training instance with an extreme feature value: the fitted line tilts, the point where f(x) = 0.5 shifts, and examples that were classified correctly before are now misclassified.


We are heading for the logistic regression algorithm.

[Plot: binary training data, target Y ∈ {0, 1} plotted against feature A1]


Logistic regression

Logistic regression is a classification algorithm.

Its target hypothesis f for binary classification has the form of a sigmoid function

f(x; Θ) = 1 / (1 + e^(−Θᵀx)) = e^(Θᵀx) / (1 + e^(Θᵀx))

[Plot: the sigmoid function g(z), rising from 0 to 1 for z from −6 to 6]

g(z) = 1 / (1 + e^(−z))

• lim z→+∞ g(z) = 1

• lim z→−∞ g(z) = 0
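In R the sigmoid is available as plogis, the logistic distribution function; a quick check of the definition and its limits:

  g <- function(z) 1 / (1 + exp(-z))   # sigmoid
  g(0)                                 # 0.5
  g(c(-10, 10))                        # close to the limits 0 and 1
  all.equal(g(2), plogis(2))           # TRUE: plogis is R's built-in sigmoid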


f(x; Θ) = 1 / (1 + e^(−θ₀ − θ₁x₁))

[Plot: sigmoid curves over feature A1 with θ₁ = 2 fixed and θ₀ ∈ {0, 3, −3}; varying θ₀ shifts the curve along A1]


[Plot: sigmoid curves over feature A1 with θ₀ = 0 fixed and θ₁ ∈ {2, −2, 6}; the sign of θ₁ sets the direction of the transition and its magnitude the steepness]


Classification rule

Predict a target value using f̂(x; Θ̂) so that

• if f̂(x; Θ̂) ≥ 0.5, i.e. Θ̂ᵀx ≥ 0, predict 1

• if f̂(x; Θ̂) < 0.5, i.e. Θ̂ᵀx < 0, predict 0

[Plot: the fitted sigmoid f̂ over feature A1 together with the training data; the 0.5 threshold splits A1 into the two predicted classes]
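The rule in R with glm, reusing the hypothetical A1/Y data from the linear-regression sketch above:

  set.seed(1)
  A1 <- c(rnorm(20, mean = 0), rnorm(20, mean = 3))
  Y  <- rep(c(0, 1), each = 20)

  fit  <- glm(Y ~ A1, family = binomial)    # logistic regression
  phat <- predict(fit, type = "response")   # f-hat(x; Theta-hat)
  pred <- as.integer(phat >= 0.5)           # the 0.5 rule
  table(pred, Y)                            # predictions vs. true classes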


Logistic regression: Derivation

Interpretation of f(x; Θ): it models the conditional probability Pr(y = 1 | x; Θ):

f(x; Θ) = Pr(y = 1 | x; Θ)

1. The target attribute is categorical: Y = {0, 1}.

2. A linear model of the target itself, y = θ₀ + θ₁x₁ + ⋯ + θₘxₘ, does not work (see above); model Pr(Y = y | x) instead, e.g. Pr(Y = 1 | x).

3. A linear model of the probability, Pr(Y = 1 | x) = θ₀ + θ₁x₁ + ⋯ + θₘxₘ, does not work either (see above): the right-hand side is not confined to [0, 1].

4. Model the odds: odds(Pr(Y = 1 | x)) = Pr(Y = 1 | x) / Pr(Y = 0 | x) = Pr(Y = 1 | x) / (1 − Pr(Y = 1 | x)) ∈ [0, +∞)


odds = Pr(success) / Pr(failure)

Example: the Titanic data set

> d <- read.csv("train.csv")
> attach(d)
> table(Sex, Survived)
        Survived
Sex        0   1
  female  81 233
  male   468 109
> detach()

• the odds of surviving for males:
  Pr(Survived = 1 | Sex = male) / Pr(Survived = 0 | Sex = male) = 109/468 = 0.23

• the odds of surviving for females:
  Pr(Survived = 1 | Sex = female) / Pr(Survived = 0 | Sex = female) = 233/81 = 2.88

• the ratio of the odds for females to the odds for males: 2.88/0.23 = 12.52
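The same numbers can be computed directly from the contingency table; a minimal sketch assuming only the counts above:

  # counts from the table above: rows = Sex, columns = Survived (0, 1)
  tab <- matrix(c(81, 468, 233, 109), nrow = 2,
                dimnames = list(Sex = c("female", "male"),
                                Survived = c("0", "1")))

  odds <- tab[, "1"] / tab[, "0"]   # odds of surviving for each sex
  odds                              # female ~2.88, male ~0.23
  odds["female"] / odds["male"]     # odds ratio for females vs. males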


Logit

5. Transform [0, +∞) to (−∞, +∞): model

logit(Pr(Y = 1 | x)) = ln(odds(Pr(Y = 1 | x))) = ln( Pr(Y = 1 | x) / (1 − Pr(Y = 1 | x)) )

6. Use linear regression:

ln( Pr(Y = 1 | x) / (1 − Pr(Y = 1 | x)) ) = θ₀ + θ₁x₁ + ⋯ + θₘxₘ

i.e.,

Pr(Y = 1 | x) = 1 / (1 + e^(−θ₀ − θ₁x₁ − ⋯ − θₘxₘ))

f(xᵢ; Θ) = Pr(Yᵢ = 1 | xᵢ; Θ) = 1 / (1 + e^(−Θᵀxᵢ))
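A quick numerical check of the logit/sigmoid inversion in R (qlogis is the logit, plogis its inverse):

  p <- 0.8
  z <- qlogis(p)      # logit: ln(p / (1 - p))
  log(p / (1 - p))    # the same value, ~1.386
  plogis(z)           # back to 0.8: the sigmoid inverts the logit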


Binary features

• Use female = {1, 0} instead of Sex = {female, male}.

• In linear regression, y = θ₀ + θ₁ · female:
  θ₀ is the average y for males
  θ₀ + θ₁ is the average y for females
  θ₁ is the average difference in y between females and males

• In logistic regression, with p = Pr(Survived = 1 | x; Θ): ln( p / (1 − p) ) = θ₀ + θ₁ · female

If female == 0:

p = p₁ → ln( p₁ / (1 − p₁) ) = θ₀ → p₁ / (1 − p₁) = e^θ₀

• the intercept θ₀ is the log odds for males

If female == 1:

p = p₂ → ln( p₂ / (1 − p₂) ) = θ₀ + θ₁ → p₂ / (1 − p₂) = e^(θ₀ + θ₁)

• odds ratio = ( p₂ / (1 − p₂) ) / ( p₁ / (1 − p₁) ) = e^θ₁

• the parameter θ₁ is the log odds ratio between females and males
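A sketch of this in R, assuming the Titanic train.csv from the earlier slide is available:

  d <- read.csv("train.csv")
  d$female <- as.integer(d$Sex == "female")   # recode Sex as a 0/1 feature

  fit <- glm(Survived ~ female, data = d, family = binomial)
  coef(fit)        # theta0 = log odds for males, theta1 = log odds ratio
  exp(coef(fit))   # the odds for males and the female/male odds ratio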


Parameter interpretation: Numerical features

θᵢ gives the average change in logit(f(x)) for a one-unit change in Aᵢ, holding all other features fixed.
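For example, with a hypothetical θᵢ = 0.5, a one-unit increase in Aᵢ adds 0.5 to the log odds, i.e. multiplies the odds by e^0.5:

  exp(0.5)   # ~1.65: factor by which the odds change per one unit of Ai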


Loss function

L(Θ) = − Σᵢ₌₁ⁿ [ yᵢ log P(yᵢ | xᵢ; Θ) + (1 − yᵢ) log(1 − P(yᵢ | xᵢ; Θ)) ]

See Maximum Likelihood Principle for derivation of this loss function.

Optimization problem

Θ* = argmin_Θ L(Θ)
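A minimal sketch of minimizing L(Θ) numerically in R on the hypothetical A1/Y data from earlier, instead of relying on glm's built-in solver:

  set.seed(1)
  A1 <- c(rnorm(20, mean = 0), rnorm(20, mean = 3))
  Y  <- rep(c(0, 1), each = 20)

  # negative log-likelihood; Theta = c(theta0, theta1)
  nll <- function(Theta, x, y) {
    p <- plogis(Theta[1] + Theta[2] * x)   # f(x; Theta)
    -sum(y * log(p) + (1 - y) * log(1 - p))
  }

  opt <- optim(c(0, 0), nll, x = A1, y = Y)   # general-purpose minimizer
  opt$par   # close to coef(glm(Y ~ A1, family = binomial))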


Parameter estimates

L(Θ) = − Σᵢ₌₁ⁿ [ yᵢ log P(yᵢ | xᵢ; Θ) + (1 − yᵢ) log(1 − P(yᵢ | xᵢ; Θ)) ]

[Plot: the per-example loss −log(p) as a function of f(xᵢ) for yᵢ = 1; the loss approaches 0 as f(xᵢ) → 1 and grows without bound as f(xᵢ) → 0]
