
An algorithm in data mining (or machine learning) is a set of heuristics and calculations that creates a model from data [15]. Algorithms come in different levels of complexity, and sometimes basic approaches to solving a problem can yield results just as good as those requiring large computational capacity. That is why a “simplicity-first” methodology is considered a valuable practice. The OneR classification algorithm is a good example of a tool for simple data analysis.

Often, users need to try several different methods in order to choose the model that works best for them. This chapter describes more complex algorithms that have a good track record in the practical resolution of the tasks described in chapter 2.1. Some of them were used in the practical part of this thesis.

2.2.1 Regression analysis

Regression analysis methods are often used in data mining, gaining popularity thanks to easily comprehensible results and the high-level nature of the algorithms. In regression, the main task is to explain the effect that a set of independent variables has on the dependent variable. In simple linear regression, the model takes the form y = β0 + β1x, where the coefficients β0 and β1 are initially unknown and need to be estimated using a statistical approach called the least-squares criterion. The goal of this criterion is to minimize the sum of squared differences between the actual and predicted output values. Furthermore, linear regression may be defined by multiple independent variables instead of just one.

Figure 1 shows an optimal regression line – the one that is closest to all the dots – and vertical lines that indicate the distance (residual) between the dots and the predicted line.

This distance is then used in the residual sum of squares formula:

$$ RSS = e_1^2 + e_2^2 + \dots + e_n^2 $$

where $e_i$ denotes the i-th residual. The quality of the obtained model is usually evaluated using the R-squared coefficient of determination, which describes the proportion of the variance of the dependent variable explained by the model. Its values lie in the range from 0 to 1, with results closer to 1 indicating a more accurate model. Alternatively, linear regression can be used for classification tasks by presenting the attributes in binary form, although there are certain drawbacks to this approach [16].

Figure 1. Linear regression. Source: [17].
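To make the above concrete, the following is a minimal sketch of fitting a least-squares line and evaluating it with RSS and R-squared using the Scikit-learn library mentioned later in this chapter; the synthetic data and coefficient values are illustrative assumptions.

```python
# A minimal sketch: least-squares fit, residual sum of squares, and R-squared.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 2.0 + 3.0 * x[:, 0] + rng.normal(scale=2.0, size=100)  # assumed "true" beta0 = 2, beta1 = 3

model = LinearRegression().fit(x, y)   # least-squares estimates of beta0 and beta1
y_pred = model.predict(x)

rss = np.sum((y - y_pred) ** 2)        # residual sum of squares
print("beta0:", model.intercept_, "beta1:", model.coef_[0])
print("RSS:", rss, "R^2:", r2_score(y, y_pred))
```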

Many tasks are solved using binary classification, or can be transformed into such a format. A logistic model is especially good at dealing with this type of problem. The sigmoid (or logistic) function, whose output lies in the range from 0 to 1, is used as its base element. The function can be expressed with the following equation:

$$ f(y) = \frac{e^y}{1 + e^y} = \frac{1}{1 + e^{-y}} $$

where −∞ < y < ∞; as y → ∞, f(y) → 1, and as y → −∞, f(y) → 0. To determine the target classes, a threshold is needed – usually, the threshold value is 0.5. The constant effect that one variable has on another is calculated using the logarithm of the odds [16]:

$$ \ln\left(\frac{p}{1 - p}\right), \quad p = f(y) $$

For tasks that require working with more than two target classes, the One-vs-Rest (OvR) and One-vs-One (OvO) methods are the most common. Since not all algorithms have native support for multi-class classification, it is advisable to check whether the task can be simplified into a binary form before starting the modelling process. By default, the Python Scikit-learn library uses the One-vs-Rest method for logistic regression, where each individual class is compared to all other classes. The analysis of three groups in the form class1 vs [class2, class3], class2 vs [class1, class3], class3 vs [class1, class2] is an apt example. The One-vs-One method also simplifies the problem by converting it into binary tasks but, unlike the previous method, it compares a given class to the other classes one by one (class1 vs class2, class1 vs class3, etc.). The support vector machine algorithm, for example, uses this type of solution [18].
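The two decomposition strategies can also be applied explicitly through Scikit-learn's wrapper classes. The following is a minimal sketch; the iris dataset and the choice of base estimators are illustrative assumptions.

```python
# A minimal sketch of One-vs-Rest and One-vs-One decomposition in Scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # three target classes

# One-vs-Rest: one binary classifier per class (class_i vs. all remaining classes)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One-vs-One: one binary classifier for every pair of classes
ovo = OneVsOneClassifier(SVC()).fit(X, y)

print(ovr.predict(X[:3]), ovo.predict(X[:3]))
```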

Since the least squares method is prone to overfitting when used for regression (especially with small amounts of data), there is an optimization method called regularization that helps prevent overfitting by adding a regularization term to the residual sum of squares.

Lasso and ridge regression respond well to such optimization. The core difference between them lies in the form of regularization used. More specifically, lasso regression uses the L1 regularization technique:

$$ RSS + \alpha \sum_{j=1}^{p} |\beta_j| $$

while ridge regression uses the L2 regularization technique:

$$ RSS + \alpha \sum_{j=1}^{p} \beta_j^2 $$

where increasing the alpha (tuning) parameter shrinks the coefficients. That is why lasso coefficients are often exactly equal to 0, and why lasso models are easier to interpret than ridge regression models, whose coefficients usually approach zero but rarely reach it. Both are linear models, and both are currently very popular thanks to their high prediction accuracy. The Scikit-learn library provides a cross-validation technique for selecting the alpha parameter [19].
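As a brief illustration of that cross-validation, the following sketch uses Scikit-learn's LassoCV and RidgeCV estimators; the synthetic dataset and the alpha grid are illustrative assumptions.

```python
# A minimal sketch of selecting alpha by cross-validation for lasso and ridge.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

lasso = LassoCV(cv=5).fit(X, y)                            # L1: some coefficients become exactly 0
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)   # L2: coefficients shrink towards 0

print("lasso alpha:", lasso.alpha_, "| zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("ridge alpha:", ridge.alpha_)
```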

2.2.2 Decision tree

The decision tree is part of the divide-and-conquer algorithm design paradigm, and it also falls into the supervised learning category. The structure is made of a hierarchy of nodes, where attributes are tested, and leaves, which represent the classification results. The tree grows downwards, from the root until the moment when all elements are grouped correctly, or until the tree has finished searching the training set [3]. Figure 2 shows the steps required to build a simple version of this algorithm.

When building decision trees, it is important to choose the right object classification criteria. The split of the group must be as accurate as possible to make sure that each subset falls into one class only; how this is done always depends on the attribute type.

Another interesting topic is how the issue of missing values is dealt with. When a value is missing, it either needs to be given special importance, or no importance at all. While the second option is easy to implement, it is too drastic, since useful information is very likely to be lost. A more logical and accurate approach is to select the most popular branch by counting the elements that descend down the tree. Another option for dealing with missing values is to use a numeric weight in the range from 0 to 1, dividing the original instance into parts. The weight is then distributed proportionally to the number of training instances going down each branch. Finally, when the pieces reach the leaves, they need to be merged back together [3].

Figure 2. Decision tree building algorithm. Source: [11].

Another topic that needs to be mentioned when discussing decision tree algorithms is branch pruning, a technique that becomes necessary when dealing with large trees with a complicated structure. The subtree replacement and subtree raising methods are often used to transform them. The first method works by replacing a subtree with a single leaf, which provides good results on the test set. The second method is more difficult to implement, but just as effective – the subtree of the branch with the most training examples can be raised to replace its entire parent node [3].

The main advantages of decision trees are the following: the ability to work with both numerical and categorical data types; easily comprehensible results in the majority of tasks; transparency of how the data are processed using Boolean logic; and support for statistical tests. Among the weaknesses, excessive tree complexity, instability of the results in case of minor changes in the data, and the fact that the search for an optimal tree is an NP-complete problem need to be mentioned.

The description above provides a general explanation of how decision tree algorithms work in data analysis. However, there are different versions of these algorithms that differ in how the tree structure is built. One of the first was the ID3 (Iterative Dichotomiser 3) algorithm invented by J. Ross Quinlan in the 1980s. It passes through the instances from top to bottom, selecting at each iteration the best feature to create a node (the greedy approach). A measure called information gain is used when working with features, calculating how well an attribute separates the classes. The trees grow until they reach maximum depth, and often need to be pruned when they do. The main drawback of this algorithm is the issue of overfitting, which occurs even with small samples [20, 3].
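The information-gain criterion itself can be sketched in a few lines of Python; the helper names and the toy label arrays below are illustrative assumptions, not Quinlan's original code.

```python
# A minimal sketch of entropy and information gain for a candidate split.
import numpy as np

def entropy(labels):
    """Shannon entropy of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, child_subsets):
    """Entropy of the parent node minus the weighted entropy of its children."""
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in child_subsets)
    return entropy(parent_labels) - weighted

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])                # labels before the split
split = [np.array([1, 1, 1, 0]), np.array([1, 0, 0, 0])]   # labels in the two branches
print(information_gain(parent, split))                     # higher gain = better split
```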

C4.5 and C5.0 followed the ID3 algorithm as its improved versions. Apart from the categorical instance types already supported in the original algorithm, they are also capable of processing numeric values. The overfitting issue has also been dealt with, using a pruning strategy. Thanks to good results and documentation, the C4.5 algorithm has become a standard in decision trees and can be found in all popular data mining systems. The newer C5.0 version was released under Quinlan's license, introducing improvements to C4.5 and bringing better controls, reduced memory usage, and higher accuracy [11].

Classification and Regression Tree (CART) was designed to construct binary decision trees regardless of instance type, meaning that each node contains only two child nodes. It also differs from C4.5 in the type of pruning it uses, which in the case of CART is cost-complexity pruning. This is based on the idea of first pruning the subtrees that, relative to their size, lead to the smallest increase in error on the training data. Then, when the final predictive tree is being selected, validation is performed using a holdout set or cross-validation to estimate the error rate [3]. With some data, this method results in smaller trees than C4.5 would produce. A third difference lies in the use of the Gini index instead of entropy-based information gain as the splitting criterion. The Gini index has the advantage of slightly better computation time, but both measures are very popular in machine learning algorithms. It is also worth mentioning that CART became the first regression tree system to support numerical target variables [11].
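Scikit-learn's tree implementation is based on an optimised version of CART, so the criterion choice and cost-complexity pruning can be demonstrated with it; the dataset and parameter values in this sketch are illustrative assumptions.

```python
# A minimal sketch of a CART-style binary tree with Gini splitting and pruning.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",   # Gini index instead of entropy-based information gain
    max_depth=4,        # limit the complexity of the tree
    ccp_alpha=0.01,     # strength of cost-complexity pruning
    random_state=0,
).fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
```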

2.2.3 Neural networks

“Neural networks offer a mathematical model that attempts to mimic the human brain.” [11, p. 254] Miloš Křivan gives a more detailed definition: an artificial neural network is an oriented graph with dynamically evaluated, labelled vertices and edges. The building blocks of neural networks are called perceptrons, with a core structure made of inputs, weights, a bias, and an activation function. Figure 3 shows a simple example of an artificial neuron. Groups of perceptrons form coherent networks whose goal is to obtain one or several output values [22].

The basic computation of a result follows these steps: all inputs are multiplied by the corresponding weights, the sum of the obtained components is calculated, the bias is added, and then the activation function is applied. The first initialization of the weights can be performed in different ways, for instance using a uniform probability distribution. The range can be, for example, between -0.03 and 0.03, with numbers outside the range not being considered at all.

Another popular weight initialization method is the Laplace-Gauss (normal) distribution, which is characterized by its expected value, dispersion, skewness coefficient, and kurtosis. Initialization schemes such as He, Xavier, and LeCun uniform initialization are often used in the field, building on the distributions mentioned above [22].

Figure 3. Perceptron structure. Source: [17].
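A single perceptron's forward pass, including the uniform weight initialization described above, can be sketched as follows; the input vector and the use of a sigmoid activation are illustrative assumptions.

```python
# A minimal sketch of one perceptron: weighted sum of inputs, bias, activation.
import numpy as np

rng = np.random.default_rng(0)

x = np.array([0.5, -1.2, 3.0])                  # inputs
w = rng.uniform(-0.03, 0.03, size=x.shape)      # uniformly initialized weights
b = 0.0                                         # bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# multiply inputs by weights, sum, add the bias, apply the activation function
output = sigmoid(np.dot(w, x) + b)
print(output)
```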

The activation function is also important for the cooperation between artificial neurons. Without it, a group of perceptrons is unable to function properly, because the result would be equivalent to the work of just one neuron. The goal is to generate a number that allows all elements to become connected while remaining different from one another. For every layer of a neural network, a specific activation function type must be selected. Here, the Rectified Linear Unit (ReLU), first presented in 2011, is a popular choice. It is a piecewise linear function composed of several straight-line segments. If the input is negative, the result will be 0, while for positive values the function returns the original, unchanged value. The function can be expressed with the following equation:

$$ f(x) = x^+ = \max(0, x) $$

where x is the perceptron's input. Sometimes, additional changes are made to the function – namely a decrease of the coefficient or a left/right offset [22].

The determination of output probabilities for the output neurons is performed with the softmax method. This stage, performed once all scores are known, allows the user to deduce by how much each obtained result is more probable than another. Figure 4 demonstrates at which stage the softmax method becomes involved. Values s1, s2, s3, and s4 represent the neuron output scores, while p1, p2, p3, and p4 represent the target probabilities.

Figure 4. Use of the softmax method with four neurons. Source: [23].
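Both the ReLU activation and the softmax stage can be written in a few lines; the example score vector below is an illustrative assumption.

```python
# A minimal sketch of ReLU and softmax applied to a vector of output scores.
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the maximum for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

s = np.array([2.0, 1.0, 0.1, -1.0])     # neuron output scores s1..s4
print(relu(s))                          # hidden-layer style activation
print(softmax(s), softmax(s).sum())     # target probabilities p1..p4, summing to 1
```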

Currently, many types of neural networks exist, with the Multi-layer Perceptron being one of the most popular. A Multi-layer Perceptron (MLP) is a network composed of layers of neurons with no connections between neurons in the same layer; instead, each neuron in one layer is connected to all neurons in the adjacent layer [20]. The algorithm uses the backpropagation method and a sigmoid activation function for all perceptrons except the input ones – for those, different functions must be used. The essence of backpropagation lies in updating the neuron weights, working from the output layer towards the input layer. It can be divided into several steps [22]:

1. Finding the network error by comparing the outputs of the last nodes to the target labels.

2. Calculating the output neuron deltas using the error, the target, and the output node values.

3. Calculating the deltas of all remaining neurons, starting at the end.

4. Updating the neuron weights using the deltas and the perceptron outputs.

The Multi-layer Perceptron is a reliable data analysis technique, and the most popular software platforms support this algorithm. Its main drawback is its high dependency on the parameter settings (the number of layers in the network, etc.).
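Scikit-learn provides an MLP implementation whose weights are fitted via backpropagation; the dataset, layer sizes, and other hyperparameters in this sketch are illustrative assumptions.

```python
# A minimal sketch of a multi-layer perceptron classifier in Scikit-learn.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),   # results depend strongly on such settings
    activation="logistic",         # sigmoid activation in the hidden layers
    max_iter=500,
    random_state=0,
).fit(X_train, y_train)            # weights are updated by backpropagation

print("test accuracy:", mlp.score(X_test, y_test))
```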

To work with a sequence of elements that affect one another, a different type of neural network is used – the Recurrent Neural Network (RNN). Solving such tasks requires using information from previous inputs and results. The key component is a memory that stores a certain state. Unlike the Multi-layer Perceptron, an RNN uses a modified version of backpropagation called backpropagation through time (BPTT). The principles underlying both approaches are almost identical, except for the fact that BPTT is applied to sequence data. The structure of an RNN can vary; it can be one-to-many, many-to-one, many-to-many, or one-to-one. A one-to-many architecture takes a single input and generates a further sequence. The other structures are defined analogously, based on their goals – for instance, speech or text recognition, classification, regression, etc. [24]
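The role of the stored state can be illustrated with a bare recurrent cell processed over a short sequence (a many-to-one setup); the dimensions, random weights, and input sequence are illustrative assumptions.

```python
# A minimal sketch of a recurrent cell's forward pass over a sequence.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4

W_x = rng.normal(scale=0.1, size=(n_hidden, n_in))        # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(n_hidden, n_hidden))    # hidden-to-hidden weights (the "memory")
b = np.zeros(n_hidden)

h = np.zeros(n_hidden)                                    # initial hidden state
sequence = rng.normal(size=(5, n_in))                     # 5 time steps of input

for x_t in sequence:
    # the new state depends on the current input and on the previous state
    h = np.tanh(W_x @ x_t + W_h @ h + b)

print(h)   # final state, usable for a many-to-one prediction
```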

2.2.4 Ensemble methods

In some cases, the best way to solve a task is to consolidate the output of several different models. This led to the creation of tools capable of working with ensembles of algorithms, often achieving better analysis results than single methods. These models are divided into three groups – stacking, bagging, and boosting – and can be used for classification or regression tasks.

In “Data Science and Big Data Analytics”, the following description of the bagging approach is given: “Bagging (or bootstrap aggregating) uses the bootstrap technique that repeatedly samples with replacement from a dataset according to a uniform probability distribution. “With replacement” means that when a sample is selected for a training or testing set, the sample is still kept in the dataset and may be selected again. Because the sampling is with replacement, some samples may appear several times in a training or testing set, whereas others may be absent. A model or base classifier is trained separately on each bootstrap sample, and a test sample is assigned to the class that received the highest number of votes.” [16, p. 228] The Random Forest algorithm is classified as a bagging method with some improvements. For instance, in cases where one strong predictor is present, Random Forest will not necessarily factor it in during each split, because only a random subset of predictors is considered at every split – this behavior is sometimes referred to as decorrelating the trees. To summarize, the main difference between the two approaches is in the choice of a predictor subset [17]. Despite Random Forest having multiple advantages (versatility, high prediction accuracy, etc.), it has one significant drawback – processing times are long when working with big data.

The boosting approach shares some principles with the bagging method: it uses voting and averaging to combine the outputs of models of the same type. The main difference is that boosting is iterative – each new model attempts to correct the prediction errors of the previous ones. AdaBoost and Stochastic Gradient Boosting (used in the practical part of the thesis) are examples of popular boosting algorithms. Gradient Boosting Machines use a loss function (for example, logarithmic loss for classification) and weak learners (for example, decision trees) to obtain the results. The algorithm adds more weight to instances that contain errors, thus training the model. At each stage, the tree predictions are compared with the correct result, and then the gradient indicating the direction that reduces the error rate is calculated. The parameters are similar to those of Random Forest, for instance, tree depth, number of iterations, or the number of samples per leaf node [17].
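The following sketch places a bagging-style and a boosting-style ensemble from Scikit-learn side by side; the dataset and parameter values are illustrative assumptions.

```python
# A minimal sketch comparing Random Forest (bagging) and Gradient Boosting.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging with decorrelated trees: each split considers only a random subset of predictors
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

# Boosting: trees are added iteratively, each one correcting the errors of the previous ones
gb = GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.1,
                                random_state=0)

for name, model in [("random forest", rf), ("gradient boosting", gb)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```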

2.2.5 K-nearest neighbors

The K-nearest neighbors method falls into the category of supervised learning algorithms and can be applied to both classification and regression tasks. The classifiers are based on the learning-by-analogy principle, where a given tuple is compared with the training tuples that are most similar to it. The algorithm attempts to find the k tuples (points in an n-dimensional space) that are closest to the new tuple [1]. Figure 5 clearly illustrates this behavior. If we assume that k = 1, the circle will be classified as a triangle, since a triangle is closest to the unknown object. If k = 3, the circle would still be classified as a triangle (since two triangles and only one square are found).

Figure 5. K-nearest neighbors algorithm with an unknown tuple. Source: [25].

The closeness level is calculated using a distance metric, with the Euclidean distance between tuples being the most common one. The Scikit-learn library also allows the use of the Minkowski distance, which can be considered a generalization of both the Euclidean distance and the Manhattan distance [26].

The algorithm does have its drawbacks – it is very demanding when it comes to memory, and the method works better with large datasets. It is recommended to use cross-validation to select the model parameters (e.g. number of neighbors).
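A minimal sketch of that parameter selection with Scikit-learn follows; the dataset and the candidate values of k are illustrative assumptions.

```python
# A minimal sketch of choosing the number of neighbors by cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    KNeighborsClassifier(metric="minkowski", p=2),   # p=2 makes the Minkowski metric Euclidean
    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},
    cv=5,
)
search.fit(X, y)

print("best k:", search.best_params_["n_neighbors"],
      "| cv accuracy:", round(search.best_score_, 3))
```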