
2. Knowledge discovery methods

2.1 Cluster analysis

2.1.1 Introduction

Cluster analysis, or clustering, is the grouping of a set of objects into groups or clusters so that objects within one group are more similar to each other than to objects in other groups. This is useful for market segmentation: by understanding our target audience better, we are able to offer specific products to specific people depending on the segment they fall into. It has many further uses, for example in medical imaging, crime analysis and biology.

There are different types of clustering, the two main categories being hierarchical and non-hierarchical. Hierarchical clustering creates a tree of clusters and is well suited for tasks such as the taxonomy of animals or the hierarchical structure of an organization. The main non-hierarchical clustering approaches are distribution-based, density-based and centroid-based.

Distribution-based clustering assumes the data is composed of distributions; based on the distance from a distribution's center, it is calculated whether and to what degree a point belongs to that distribution. It can only be used when we know the type of distribution present in our data. Density-based clustering connects areas of high density into clusters. Centroid-based clustering is the best known and we will be using it in our example. It organizes the data into non-hierarchical clusters. A common and simple algorithm of this kind, which we will use, is K-means.[22]

Figure 2.1: Hierarchical vs non-hierarchical clustering.

2.1.2 K-means clustering

K-means is an iterative, unsupervised algorithm. It iterates until the assignment of data points to clusters stops changing, with each point ending up in exactly one group. It is unsupervised because we do not know the correct answer in advance: we do not know how many clusters the data is supposed to have or where each point is supposed to belong. Due to its simplicity it is one of the most popular algorithms for clustering.[23]

K-means clustering algorithm steps:

1. Choose the number of clusters (possible techniques will be described below).

2. Randomly choose a centroid for each cluster (centroid should be the middle of a cluster).

3. Assign each point to the closest centroid.

4. Recompute the centroid by taking the average of all points in the cluster, then re-assign the points to the now nearest centroid.

5. Repeat the calculation of the centroids until points stop changing clusters or, in the case of larger datasets, until convergence is reached.[22]
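To make the steps above concrete, the following Python sketch implements them with NumPy. The sample data, the choice of k = 2 and the iteration cap are illustrative assumptions, not part of the cited method.

import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly choose an initial centroid for each cluster from the data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to the closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the average of the points assigned to it
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids, and therefore the assignments, no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Illustrative data: two loose blobs of points in 2D.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = k_means(data, k=2)
print(centroids)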

Techniques for selecting the number of clusters:

• Elbow method

• Gap statistic

• Silhouette coefficient

Elbow method

The elbow method is one of the heuristics used to determine the number of clusters in a dataset. Explained variation measures the proportion to which a mathematical model accounts for the dispersion in the data, that is, how stretched or squeezed the data is. The elbow method looks at the percentage of variance explained as a function of the number of clusters. The idea is to select a number of clusters such that adding a further cluster would not noticeably increase the quality of the model. The method plots the explained variation against the number of clusters: the first clusters explain a lot of variance and add a lot of information to the model, but at some point the marginal gain drops, producing an angle (the "elbow") in the plot. The number of clusters is chosen at this point. Despite its popularity, the technique is considered subjective and unreliable, since the elbow cannot always be unambiguously identified.

In the example below we can see the entire process. The four images first show the creation of the centroids, that is, their final positions at the center of their data points.

In the first image we have only one centroid, meaning a single cluster and a distortion score of 608.47. Distortion here measures how far the data points lie from their assigned centroid (typically the sum of squared distances), in other words how poorly the centroids represent the data. The distortion is high because a single centroid is trying to represent all of the data. With two clusters the distortion lowers to 130.40. With three clusters it drops significantly to 27.24. With four clusters it drops only to 20.62. The best way to visualize this is a line graph; for this example we see that the optimal number of clusters is three, since the line flattens out after that point.

Figure 2.2: Clustering example with centroids.

Figure 2.3: Clustering example with elbow method.
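A plot of this kind can be reproduced in spirit with the short Python sketch below, using scikit-learn. The synthetic data from make_blobs and the range of k values are assumptions chosen for illustration, so the distortion scores will differ from the ones quoted above.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data with three underlying blobs.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

ks = range(1, 8)
distortions = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of points to their closest centroid,
    # i.e. the distortion score discussed above.
    distortions.append(model.inertia_)

plt.plot(ks, distortions, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("distortion (sum of squared distances)")
plt.show()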

2.2 Regression

2.2.1 Linear Regression

There are two types of linear regression. Simple linear regression allows us to study the relationship between two continuous variables. The model function is:

Y = a + bX + u

The variable denoted X is regarded as the explanatory variable, the variable that we use to predict the variable Y. The variable denoted Y is regarded as the response, the outcome of the model we get once we input the explanatory variable. Since Y depends on X, it is also called the dependent variable, whereas X is called the independent variable.[24] The residual value u, which is the difference between the actual outcome and the predicted outcome, is included in the model to account for variation the line does not capture. Variable a is the y-intercept (constant term) and b is the slope coefficient for the explanatory variable.[25]

In the image below we can see an example application of linear regression. Blue dots represent data points and the red line represents the linear function onto which we are mapping our data.

Figure 2.4: Linear regression.

Illustrative examples include: predicting GPA based on SAT scores, predicting work performance based on IQ test scores and predicting soil erosion based on rainfall.
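As an illustration, the Python sketch below fits the model Y = a + bX + u with scikit-learn. The synthetic X values, the true coefficients and the noise are assumptions made up for the example.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))                 # explanatory variable X
Y = 2.0 + 0.5 * X[:, 0] + rng.normal(0, 1, 50)       # response Y = a + bX plus noise u

model = LinearRegression().fit(X, Y)
print("intercept a:", model.intercept_)              # estimate of a
print("slope b:", model.coef_[0])                    # estimate of b
print("prediction for X = 4:", model.predict([[4.0]])[0])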

Linear regression makes five assumptions that have to be fulfilled in order to apply it to our data[26]:

• Linearity: There has to be a linear relationship between the explanatory and the outcome variable. In the image below we can see an example of the data where no linearity is present.

Figure 2.5: No linearity.

• Homoscedasticity: Noise is the same across all explanatory variables.

• Independence: Observations are independent of each other.[26]

• Normality: Linear combination of the random variables should have a normal distribution.[27]

• No or little multicollinearity: Multicollinearity occurs when the independent variables are too highly correlated with each other. Some examples are: height and weight (taller people likely weigh more), two variables that seem different but are in fact the same (weight in kilograms and in pounds), one variable that can be derived from another variable, and two variables that are the same but have a different name.[28] A small check using variance inflation factors is sketched after this list.
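A common way to detect multicollinearity is the variance inflation factor (VIF), where values much larger than about 10 signal a problem. The sketch below is a minimal illustration with statsmodels; the synthetic height, weight and income columns are assumptions chosen to mimic the height-and-weight example from the list above.

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(1)
height = rng.normal(170, 10, 200)
weight = 0.9 * height + rng.normal(0, 3, 200)     # strongly tied to height
income = rng.normal(30000, 5000, 200)             # unrelated variable

# Add a constant column, as required for computing VIF on a regression design matrix.
X = add_constant(pd.DataFrame({"height": height, "weight": weight, "income": income}))
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=X.columns[1:],
)
print(vif)   # height and weight should show strongly inflated VIF values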

Multiple linear regression uses multiple explanatory variables to predict the outcome variable.

The model function is:

Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u

As we can see it is very similar, the only difference being the multiple independent variables. The variable denoted Y is regarded as the dependent variable and the Xi are the explanatory (independent) variables. Variable a is the y-intercept (constant term) and the bi are the slope coefficients for each explanatory variable. Variable u is the model's error term.

Multiple linear regression is used more commonly than simple linear regression. This is because in the real world it is difficult to find a dependent variable that would be explained by only one variable; there are typically many variables that have an impact on an outcome.
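The sketch below fits a multiple linear regression with two explanatory variables using statsmodels OLS. The generated X1, X2 and the true coefficients are illustrative assumptions for this example only.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))                               # columns X1 and X2
Y = 1.0 + 0.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1, 100)     # a + b1X1 + b2X2 + u

# add_constant prepends the intercept column a to the design matrix.
model = sm.OLS(Y, sm.add_constant(X)).fit()
print(model.params)   # estimated a, b1, b2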

2.2.2 Logistic Regression

Logistic regression is used to explain the relationship between one or more explanatory variables of any type and one binary dependent variable.[29] We are trying to predict whether something is True or False. Illustrative examples include: checking whether an email is spam, whether a patient is healthy, whether a student will pass or fail, and any other yes/no type of prediction.[30]

Figure 2.6: Logistic regression.
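To illustrate, the sketch below fits a logistic regression for a pass/fail prediction with scikit-learn. The "hours studied" feature and the generated labels are assumptions loosely matching the student example above.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=(200, 1))
# Students who studied more are more likely to pass (1 = pass, 0 = fail).
passed = (hours[:, 0] + rng.normal(0, 1.5, 200) > 5).astype(int)

model = LogisticRegression().fit(hours, passed)
print("P(pass | 3 hours):", model.predict_proba([[3.0]])[0, 1])
print("P(pass | 8 hours):", model.predict_proba([[8.0]])[0, 1])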