(1)

Introduction to Machine Learning

NPFL 054

http://ufal.mff.cuni.cz/course/npfl054

Barbora Hladká hladka@ufal.mff.cuni.cz

Martin Holub holub@ufal.mff.cuni.cz

Charles University,

Faculty of Mathematics and Physics,

Institute of Formal and Applied Linguistics

(2)

Lecture #4

Outline

Linear regression

• Auto data set

(3)

Dataset Auto from the ISLR package

392 instances on the following 9 features:

mpg           Miles per gallon
cylinders     Number of cylinders between 4 and 8
displacement  Engine displacement (cu. inches)
horsepower    Engine horsepower
weight        Vehicle weight (lbs.)
acceleration  Time to accelerate from 0 to 60 mph (sec.)
year          Model year (modulo 100)
origin        Origin of car (1. American, 2. European, 3. Japanese)
name          Vehicle name
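
The data set can be loaded and inspected directly in R; a minimal sketch (assuming the ISLR package is installed):

    library(ISLR)                 # provides the Auto data set
    data(Auto)
    dim(Auto)                     # 392 rows, 9 features
    summary(Auto$mpg)             # distribution of the target attribute
    pairs(Auto[, c("mpg", "cylinders", "horsepower", "weight")])   # scatterplot matrix as on the next slide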

(4)

Dataset Auto from the ISLR package

[Figure: scatterplot matrix of mpg, cylinders, horsepower, and weight]

(5)

Linear regression

(6)

Linear regression

Linear regression is a class of regression algorithms that assume an (at least approximately) linear dependence between the target attribute and the features.

The target hypothesis f has the form of a linear function

f(x; Θ) = θ_0 + θ_1 x_1 + · · · + θ_m x_m    (1)

θ_0, . . . , θ_m are the regression parameters

simple linear regression: the case m = 1

(7)

Linear regression

Notation

y = (y_1, . . . , y_n)^T

x_i = ⟨1, x_i1, . . . , x_im⟩

Θ^T = (θ_0, . . . , θ_m)

X =
    1  x_11  . . .  x_1m
    1  x_21  . . .  x_2m
    .    .    .       .
    1  x_n1  . . .  x_nm

Now we can write y = XΘ and f(x) = Θ^T x
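
As an illustration of the notation, the design matrix X with the leading column of ones can be built in R with model.matrix(); a minimal sketch on the Auto data (the choice of features is only an example):

    library(ISLR)
    X <- model.matrix(mpg ~ weight + horsepower, data = Auto)   # adds the intercept column of 1s
    y <- Auto$mpg
    head(X)        # rows x_i = (1, x_i1, ..., x_im)
    dim(X)         # n x (m + 1), here 392 x 3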

(8)

Parameter interpretation

Numerical feature

θ_i is the average change in y for a unit change in A_i, holding all other features fixed

(9)

Parameter interpretation

Categorical feature with k values

Replace the feature with k − 1 dummy numerical features DA_1, . . . , DA_{k−1}.

Example: run simple linear regression mpg ∼ origin

            DA_1   DA_2
American      0      0
European      1      0
Japanese      0      1

y = θ_0 + θ_1 DA_1 + θ_2 DA_2

y = θ_0 + θ_1   if the car is European
y = θ_0 + θ_2   if the car is Japanese
y = θ_0         if the car is American

θ_0 is the average mpg for American cars

θ_1 is the average difference in mpg between European and American cars

θ_2 is the average difference in mpg between Japanese and American cars
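
In R, this dummy coding is produced automatically when the feature is treated as a factor; a minimal sketch (the coefficient names are those printed by R, the mapping to θ_0, θ_1, θ_2 follows the slide):

    library(ISLR)
    fit <- lm(mpg ~ factor(origin), data = Auto)   # origin: 1 = American, 2 = European, 3 = Japanese
    coef(fit)
    # (Intercept)      ... theta_0: average mpg of American cars
    # factor(origin)2  ... theta_1: average mpg difference, European vs. American
    # factor(origin)3  ... theta_2: average mpg difference, Japanese vs. American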

(10)

Parameter estimates Least Square Method

• residual: y_i − ŷ_i, where ŷ_i = f̂(x_i) = Θ̂^T x_i

Loss function: Residual Sum of Squares  RSS(Θ̂) = Σ_{i=1}^n (y_i − ŷ_i)^2

[Figures: ISLR: Auto data set. Miles Per Gallon vs. Weight (scatterplot); and "Residual in 3D", a 3D plot of miles per gallon over weight and horsepower.]

(11)

Parameter estimates Least Square Method

Optimization problem

Θ* = argmin_Θ RSS(Θ)

The argmin operator returns the Θ for which RSS(Θ) is minimal.

(12)

Parameter estimates Least Square Method

Solving the optimization problem analytically: Normal Equations, Calculus

Theorem

Θ* is a least squares solution to y = XΘ ⇔ Θ* is a solution to the normal equation X^T X Θ = X^T y.

Θ* = (X^T X)^{-1} X^T y

Computational complexity of inverting an (m + 1) × (m + 1) matrix is O((m + 1)^3) :-(
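
A minimal R sketch of the closed-form solution, checked against lm() (the feature choice is only an example; solve() is used to solve the normal equation rather than inverting X^T X explicitly):

    library(ISLR)
    X <- model.matrix(mpg ~ weight + horsepower, data = Auto)
    y <- Auto$mpg
    theta <- solve(t(X) %*% X, t(X) %*% y)            # solves X^T X Theta = X^T y
    drop(theta)
    coef(lm(mpg ~ weight + horsepower, data = Auto))  # the same estimates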

(13)

Parameter estimates Least Square Method

Solving the optimization problem numerically

Gradient Descent Algorithm

(14)

Gradient Descent Algorithm

Assume: simple regression, θ_0 = 0, θ_1 ≠ 0

(15)

Gradient Descent Algorithm

Assume: simple regression, θ_0 ≠ 0, θ_1 ≠ 0

[Figure: 3D surface of the loss function L(theta0, theta1); the Loss Function L has a minimum value at the red point. Contours of the Loss Function over theta0 and theta1.]

(16)

Gradient Descent Algorithm

Gradient descent algorithm is an optimization algorithm for finding a local minimum of a function f.

(17)

Gradient Descent Algorithm

1. Start with some x_0.

(18)

Gradient Descent Algorithm

2. Keep changing x_i to reduce f(x_i).

Which direction to go? How big a step to take?

(19)

Gradient Descent Algorithm

Credits: Andrew Ng

(20)

Gradient Descent Algorithm

• We are seeking a minimum of a function f(x). Given some initial value x_0, we can change its value in many directions.

• What is the best direction to minimize f? We take the gradient ∇f of f:

∇f(x_1, x_2, . . . , x_m) = ⟨ ∂f(x_1, x_2, . . . , x_m)/∂x_1 , . . . , ∂f(x_1, x_2, . . . , x_m)/∂x_m ⟩

• Intuitively, the gradient of f at any point tells which direction is the steepest

from that point and how steep it is. So we change x in the opposite direction

to lower the function value.

(21)

Gradient Descent Algorithm

Choice of the step size: assume a constant value

If the step is too small, GDA can be slow.

(22)

Gradient Descent Algorithm

Choice of the step size

If the step is too large, GDA can overshoot the minimum.

It may fail to converge, or even diverge.

(23)

Gradient Descent Algorithm

repeat until convergence {
    Θ^{K+1} := Θ^K − α ∇f(Θ^K)
}

α is a positive step-size hyperparameter
(another option is to choose a different step size α_K at each iteration)

I.e., simultaneously update all θ_j, j = 0, . . . , m

(24)

Linear regression

Gradient Descent Algorithm

For linear regression, f = RSS:

θ_j^{K+1} := θ_j^K − α (1/n) Σ_{i=1}^n ((Θ^K)^T x_i − y_i) x_ij

RSS is a convex function, so there are no local optima, only the global minimum.
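
A minimal gradient descent sketch for simple linear regression on the Auto data (illustrative only: the feature is standardized so that a single constant step size α works well, which is an extra assumption not made on the slide):

    library(ISLR)
    x <- drop(scale(Auto$weight))    # standardized feature
    y <- Auto$mpg
    theta <- c(0, 0)                 # (theta_0, theta_1)
    alpha <- 0.1                     # constant step size
    for (k in 1:1000) {
      yhat  <- theta[1] + theta[2] * x
      grad0 <- mean(yhat - y)        # j = 0 component (x_i0 = 1)
      grad1 <- mean((yhat - y) * x)  # j = 1 component
      theta <- theta - alpha * c(grad0, grad1)
    }
    theta
    coef(lm(y ~ x))                  # least squares solution for comparison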

(25)

Polynomial regression

Polynomial regression is an extension of linear regression where the relationship between features and target value is modelled as a d-th order polynomial.

Simple regression: y = θ_0 + θ_1 x_1

Polynomial regression:

y = θ_0 + θ_1 x_1 + θ_2 x_1^2 + · · · + θ_d x_1^d

It is still a linear model with features A_1, A_1^2, . . . , A_1^d.

The "linear" in "linear model" refers to the hypothesis parameters, not to the features. Thus, the parameters θ_0, θ_1, . . . , θ_d can be easily estimated using least squares linear regression.
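
A hedged R sketch of how such fits can be obtained (the formula mpg ∼ horsepower and the degrees 2 and 5 mirror the figure on the next slide; this is an illustration, not necessarily the code behind that figure):

    library(ISLR)
    fit1 <- lm(mpg ~ horsepower, data = Auto)           # linear
    fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto)  # degree-2 polynomial
    fit5 <- lm(mpg ~ poly(horsepower, 5), data = Auto)  # degree-5 polynomial
    # poly() expands horsepower into polynomial features (orthogonalized by default);
    # each model is still fitted by ordinary least squares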

(26)

Polynomial regression Auto data set

[Figure: ISLR: Auto data set. Miles Per Gallon (y axis) vs. a feature with range 50–200 (presumably horsepower), with three fitted curves: Linear, Degree 2, Degree 5]


(27)

Assessing the accuracy of the model

Coefficient of determination R^2 measures the proportion of variation in the target value that is explained by taking x into account:

R^2 = (TSS − RSS) / TSS = 1 − RSS/TSS,

where the Total Sum of Squares TSS = Σ_{i=1}^n (y_i − ȳ)^2; R^2 ∈ ⟨0, 1⟩

Mean Squared Error MSE

MSE = (1/n) · RSS
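
Both quantities can be computed from a fitted model; a minimal R sketch (the feature choice is only an example):

    library(ISLR)
    fit <- lm(mpg ~ weight + horsepower, data = Auto)
    y   <- Auto$mpg
    rss <- sum(residuals(fit)^2)
    tss <- sum((y - mean(y))^2)
    1 - rss / tss        # R^2, also reported as summary(fit)$r.squared
    rss / length(y)      # MSE on the training data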

(28)

R^2

(29)

Population regression line vs. Least squares line

• Population regression line: θ_0, . . . , θ_m

• Least squares line: θ̂_0, . . . , θ̂_m

• Assume a random variable Y and a sample D = {y_1, . . . , y_n}

• Estimate of the population mean µ: µ̂, e.g., µ̂ = ȳ = (1/n) Σ_{i=1}^n y_i

• Standard Error of µ̂: SE(µ̂)^2 = σ^2 / n

(30)

Population regression line vs. Least squares line

How accurate is θ̂_i as an estimate of θ_i?

• Statistical hypothesis testing (details will be provided later on):

H_0 (null hypothesis): θ_i = 0; H_1 (alternative hypothesis): θ_i ≠ 0,

i.e. there exists a relationship between the target attribute and the feature A_i; t-test, p-value, significance level α (the more stars, the more significant the feature); we reject H_0 if p ≤ α

• Adjusted R-squared = R^2 adjusted for the number of features used in the model
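
In R, the t-tests, p-values, significance stars, and adjusted R-squared discussed above are printed by the model summary; a minimal sketch (the feature choice is only an example):

    library(ISLR)
    fit <- lm(mpg ~ weight + horsepower + factor(origin), data = Auto)
    summary(fit)   # per-coefficient t statistics, p-values and stars;
                   # Multiple R-squared and Adjusted R-squared at the bottom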

(31)

Summary of Examination Requirements

• Linear regression, simple linear regression, polynomial regression

• Parameter interpretation

• Least Square Method

• Gradient Descent Algorithm

• Coefficient of Determination, Mean Squared Error
