
Prague University of Economics and Business

Faculty of Informatics and Statistics

REAL-WORLD DATA MINING TASK

MASTER THESIS

Study programme: Applied Informatics
Field of study: Knowledge and Web Technologies

Author: Bc. Aleksandr Liskov
Supervisor: prof. Ing. Petr Berka, CSc.

Prague, April 2021


Declaration

I hereby declare that I wrote the master thesis “Real-world data mining task” independently, using only the sources and literature listed therein.

Prague, 2 April 2021

...

Bc. Aleksandr Liskov


Acknowledgement

I hereby wish to express my appreciation and gratitude to the supervisor of my thesis, prof. Ing. Petr Berka, CSc.


Abstract

The “Real-world data mining task” thesis deals with the popularity of online news articles. The main goal of the thesis was to create a prediction model based on historical data of the Mashable company. The analysis was performed using the CRISP-DM methodology, which consists of several stages: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. Several libraries from the Python programming language were used in solving the data mining task as well.

The first chapters of the thesis provide theoretical information about the field, task types, required methods, and modern data analysis tools. Next, each stage of the CRISP-DM methodology is described, with the aim of applying it in the subsequent practical work. Regression analysis, decision tree, ensemble methods, K-nearest neighbors, and Multi-layer Perceptron neural network algorithms were used to predict whether articles would fall into the popular or the unpopular category. In the final chapters, the results were evaluated, methods to improve the popularity of the company’s articles were suggested, and options for potential integration of the models into the real workflow were discussed.

After the optimal parameters were tuned, the Stochastic Gradient Boosting model performed best, achieving the highest scores. This algorithm allowed us to analyze feature importance and arrive at certain conclusions regarding the effect that some groups of attributes have on the popularity of Mashable’s articles. Accuracy, precision, recall, F1-score, and AUC score metrics were used in the practical part of the thesis.

Keywords

CRISP-DM, data mining, online articles, prediction, Python.

JEL Classification

C8


Abstrakt

Diplomová práce „Real-world data mining task“ se zabývá problematikou popularity novinových článků na internetu. Hlavním cílem práce je vytvoření predikčního modelu na základě historických dat společnosti Mashable. Analýza je provedena s využitím metodologie CRISP-DM, která se skládá z několika fází: Business Understanding (porozumění problematice), Data Understanding (porozumění datům), Data Preparation (příprava dat), Modeling (modelování), Evaluation (vyhodnocení výsledků) a Deployment (využití výsledků). Pro řešení úloh dobývání znalostí z databází bylo rovněž využito několik knihoven programovacího jazyka Python.

V prvních kapitolách diplomové práce jsou prezentovány teoretické oblasti o zkoumané problematice, typech úloh, nezbytných metodách a moderních nástrojích pro informační analýzu. Dále je uveden podrobný popis každé fáze CRISP-DM metodologie, s cílem aplikovat je posléze i v praxi. Pro predikci zařazení článků do populární či nepopulární skupiny byly využity algoritmy regresní analýzy, rozhodovacích stromů, ensemble metod, k-nejbližších sousedů a neuronové sítě Multi-layer Perceptron. V posledních kapitolách je pak prezentováno hodnocení všech získaných výsledků, jsou navrženy metody pro zvýšení popularity článků společnosti a zváženy možnosti integrace modelů do reálného pracovního procesu tohoto novinkového portálu.

Nejlepších ukazatelů dosáhl po nastavení optimálních parametrů model Stochastic Gradient Boosting – tento algoritmus umožnil analyzovat tzv. feature importance a učinit určité závěry o tom, které skupiny atributů mají největší vliv na popularitu článků portálu Mashable. Mezi využité metriky patří accuracy, precision, recall, F1-score a hodnota AUC.

Klíčová slova

CRISP-DM, data mining, online články, predikce, Python.

JEL Klasifikace

C8


Contents

Introduction ... 9

1 Application field ... 10

2 Data mining and Knowledge Discovery ... 12

2.1 Types of data mining tasks ... 12

2.1.1 Classification ... 13

2.1.2 Estimation ... 13

2.1.3 Prediction... 13

2.1.4 Association rules ...14

2.1.5 Clustering ...14

2.1.6 Description and profiling ...14

2.2 Algorithms ... 15

2.2.1 Regression analysis ... 15

2.2.2 Decision tree ... 17

2.2.3 Neural networks ...19

2.2.4 Ensemble methods ... 21

2.2.5 K-nearest neighbors ... 22

2.3 Analytical tools ... 23

2.3.1 Weka ... 23

2.3.2 IBM SPSS Modeler ... 24

2.3.3 RapidMiner ... 25

2.3.4 Python ... 26

2.3.5 R ... 29

3 CRISP-DM methodology ... 30

3.1 Business Understanding ... 31

3.2 Data Understanding ... 31

3.3 Data Preparation ... 32

3.4 Modeling ... 32

3.5 Evaluation ... 33

3.6 Deployment ... 34

4 Real-world data mining task ... 35

4.1 Business Understanding ... 35

4.1.1 Determining Objectives ... 35


4.1.2 Assessing the Situation ... 35

4.1.3 Producing a Project Plan ... 37

4.2 Data Understanding... 37

4.2.1 Collecting Initial Data ... 37

4.2.2 Describing Data ... 38

4.2.3 Exploring Data ... 40

4.2.4 Verifying Data Quality... 46

4.3 Data Preparation ... 47

4.3.1 Selecting and Cleaning Data ... 47

4.3.2 Constructing and Formatting New Data ... 48

4.4 Modeling ... 51

4.4.1 Building the Models ... 51

4.4.2 Assessing the Model ... 56

4.5 Evaluation ... 60

4.6 Deployment ... 63

Conclusion ... 65

List of references ... 66

Attachments ...

A: Distribution of articles by channels and days ... I

B: Confusion matrix for Random Forest, Decision Tree and regression models ... II

C: Confusion matrix for KNN, Gradient Boosting and MLP models ... III


List of tables

Tab. 1. Description of the dataset attributes. ... 38

Tab. 2. Attributes divided into groups... 40

Tab. 3. Basic statistic of the dataset attributes. ...41

Tab. 4. Correlation between independent attributes ... 44

Tab. 5. A list of issues that must be resolved in data preprocessing. ... 47

Tab. 6. Dataset after data preparation. ... 49

Tab. 7. Train and test subsets ... 51

Tab. 8. Parameter settings for logistic regression ... 52

Tab. 9. Parameter settings for ridge regression ... 52

Tab. 10. Parameter settings for Random Forest. ... 53

Tab. 11. Parameter settings for a decision tree ... 54

Tab. 12. Parameter settings for K-nearest neighbors ... 54

Tab. 13. Parameter settings for Stochastic Gradient Boosting. ... 55

Tab. 14. Parameter settings for Multi-layer Perceptron. ... 55

Tab. 15. Logistic regression metrics for test data ... 56

Tab. 16. Logistic regression metric for train data. ... 57

Tab. 17. Ridge regression metrics for test data ... 57

Tab. 18. Ridge regression metrics for train data ... 57

Tab. 19. Random Forest metrics for test data ... 58

Tab. 20. Random Forest metrics for train data ... 58

Tab. 21. Decision Tree metrics for test data ... 58

Tab. 22. Decision Tree metrics for train data ... 58

Tab. 23. K-nearest neighbors metrics for test data ... 59

Tab. 24. K-nearest neighbors metrics for train data ... 59

Tab. 25. Stochastic Gradient Boosting metrics for test data ... 59

Tab. 26. Stochastic Gradient Boosting metrics for train data ... 60

Tab. 27. Multi-layer Perceptron metrics for test data ... 60

Tab. 28. Multi-layer Perceptron metrics for train data ... 60

Tab. 29. Model results in rank-order ... 62


Introduction

Over the past decades, various types of technology have led to a general increase in the quantity of data. Social media, genetic engineering, the internet of things, and many other fields constantly create vast amounts of information in pursuit of success in their respective domains. To take advantage of all these possibilities, modern analysis methods capable of improving the process of discovering new, useful, and practical knowledge are necessary. Today, data mining and knowledge discovery in databases, fields focused precisely on that process, are among the most popular areas of IT. Different authors describe the relationship between the two disciplines differently – some perceive the terms as synonymous, while others consider data mining to be just one part of the KDD process. One of the defining characteristics of data analysis is a well-understood sequence of task-processing stages, some requiring technical skills (modeling, data preprocessing) and others requiring business knowledge and a critical approach to the search for new solutions. A good understanding of each stage helps prevent mistakes and facilitates systematic work on real-life projects [1].

Usually, data mining experts work with data that were first extracted and then uploaded into a database. Pattern discovery in these data is an integral part of the entire process, and the general expectation is that this process is automated, or at least partially automated. The demand from businesses is always high, because data mining results can often be transformed into an economic benefit. Thanks to large investments made into data collection processes, almost every imaginable aspect is now open to data mining. Many internal processes, such as money transfers, logistics, or client evaluation, are already subject to data analysis, as are external events such as news and competitor evaluation [2, 3].

In this thesis, we focus on the popularity of online news articles published by the Mashable media website. Popularity is measured by the number of shares – that is, how many times internet users share an article online. The attributes that describe the target variable fall into different categories, such as NLP, time data, or digital media. The data mining process allowed us to arrive at results that the company can use and integrate into future real projects. This is highly relevant today, and it was one of the reasons why this particular field of data examination was chosen for the thesis. Another important factor was the desire to apply the theoretical knowledge obtained during the master’s program in practice.

The main goal of this thesis is to create a prediction model for article popularity based on Mashable’s historical data. To achieve this goal, the CRISP-DM methodology and the Python programming language were used.

The thesis is divided into two parts – theoretical and practical. The first chapters describe the research field, the main types of data mining tasks, and the tools and algorithms that were necessary to complete the task at hand. In chapter four, Mashable’s data are analyzed using the CRISP-DM methodology.


1 Application field

Online media companies publish articles several times per day, and for many people around the globe this represents a significant source of information. Well-known news companies (BBC News, The New York Times, CNN, etc.) aim to create new and attractive content and make it available to all internet users, which is why several methods are used to analyze content quality. One method is based on collecting information on articles published in the past, with the aim of predicting popularity and finding which features play a key role in it. To evaluate popularity, features like the number of comments, shares, likes, etc. can be used. The general assumption is that the actions a user takes are important to them and express certain emotions and opinions.

Popularity prediction is often framed as either a classification or a regression task. All necessary attributes can be divided into two groups: a content-based group and a temporal-features group, with the first containing attributes obtained from the text of the content and the second describing attributes extracted from the click time series of an article. When collecting data from the content of news articles, the usual structure of these articles must be kept in mind. Often, such articles are made up of a headline, lead, events, consequences/reactions, and comments [4]. Typically, the information presented in these subitems is stored in Hypertext Markup Language elements, which is why specific tools are necessary for this type of task. Instances can belong to the Natural Language Processing field (text subjectivity, rate of positive words, title polarity), digital media (number of videos, pictures, infographics), or website metadata. In some cases, the information is completely dependent on the target source (links to articles on the same website). In the temporal-features category, a good example would be the number of clicks within 30 minutes of publishing, or the number of article visits an hour after publication [5].

The popularity prediction task can be approached using one of two main methods – using attributes that exist before or after an article is published. The first option is usually more time consuming, and the results do not always achieve scores as high as the second approach. Works using attributes that describe an article after its publication are more common and are less vulnerable to erroneous statements. For instance, the authors of “A Supervised Method to Predict the Popularity of News Articles” [6] used machine learning algorithms to predict the number of comments in the comment sections, basing their prediction on text and temporal features that existed before publication. They were able to build several models for classification and regression tasks and to find the elements that affected the results the most. Instances from the “politics” and “economy” categories, as well as the presence of pictures in the news articles, were identified as key in the work’s summary.

Another research project, “Predicting the Popularity of News Articles” [5], evaluated article popularity based on the number of page views in the 24 hours after publication. The accuracy of the regression models was around 0.8 for the coefficient of determination. The Washington Post data were used in the analysis, and it was determined that the page view time series, the normalized page view time series, and the number of the author’s subscribers (the number of people who shared the article within 30 minutes of publication) affected the results more than any other feature. In another work, “A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News” [7], the authors attempted to create an intelligent system for the analysis of online articles. The model was comprised of several levels (data extraction and processing, prediction, and optimization stages) that complemented each other before the result was computed. Attributes based on Natural Language Processing and shares of links were among the most important features.

The popularity of news articles can also be examined in a local or a global context. A local task is carried out inside a specific company, and thus certain selected attributes are only suitable for that specific situation, while the global context is aimed at news agencies in general, so broader data analysis is necessary [5].

The newspaper industry is able to benefit from data mining by integrating it into its business processes. Most often, the improvements affect customer relationship management, an area focused on churn prevention, the creation of best-customer profiles, etc. Real tasks require experts from different fields, such as statistics, programming, and marketing, to work together, with marketing experts usually being the ones who make sure that the data correspond to the business goals. One remaining issue related to the use of data mining in real projects is the novelty of the field, which makes it incompatible with strict standards and industry-wide operating procedures; confidentiality is another issue that forces companies to create new software methods able to deal with the tasks at hand. Questions can arise about when it is appropriate to approach a client after their subscription has ended, whether there are any seasonal trends in the company, or what the profile of a subscriber who has never had a subscription before looks like. Correctly defined business goals and the right data storage strategy are key to improving the data mining process and achieving the objective [8].


2 Data mining and Knowledge Discovery

The process of discovering patterns in data is defined as Data Mining [3]. This field of study was first touched upon in the early 1990s at various conferences on artificial intelligence. The need to explore a new stage of data analysis arose with the rapid growth of the amount of information. The tools available at the time, the development of technology, and strong interest shown by the academic community, businesses, and other parties facilitated the creation of a new discipline – Knowledge Discovery in Databases (KDD), with Data Mining becoming a part of it [1]. It was an innovative direction that allowed users to take a different approach to statistical methods of analysis, storage methods, data processing methods, and other topics related to acquiring knowledge from large volumes of data.

P. Tan, M. Steinbach and V. Kumar give a more detailed definition in their book: “Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are deployed to scour large databases in order to find novel and useful patterns that might otherwise remain unknown.” [9, p. 2] Data are collected through various operations performed by a specific user and are then stored in a database. However, such information is often in a format that is impractical for analysis, which in turn leads to inaccurate results or mistakes in modelling.

That is why KDD contains further, equally important, steps: cleaning and integration, and selection and transformation. The next step is the data mining process itself, after which the results are processed during the evaluation and presentation stage.

2.1 Types of data mining tasks

In the process of discovering hidden knowledge, an analyst will often encounter a large number of tasks that can be divided into different categories. These tasks can be regarded as either supervised or unsupervised learning. The first group requires human involvement in labeling the data before a classifier is created and falls under predictive techniques. The second category falls under descriptive techniques and does not require data labeling. Sometimes, semi-supervised learning is also applied – a mix of the two previous methods that uses partial object labeling to train the model. The last approach can be useful in scenarios with large numbers of target instances.

In data analysis, each of the listed methods is used to process specific tasks [10]:

• Classification

• Estimation

• Prediction

• Association rules

• Clustering

• Description and profiling


2.1.1 Classification

The classification task falls into the supervised category. Its goal is to determine to which of the known classes the investigated object belongs. The target class (dependent variable) is determined based on the independent variables (those that form the description) [11, 12].

The dependent parameter is of a categorical type. An email inbox filter is a good example, classifying received mail as either spam or regular mail. It is worth noting that the classification approach emphasizes current behavior rather than a future prognosis. Among the most popular classification algorithms are logistic regression, the Naïve Bayes classifier, Support Vector Machines, and decision trees.

Sometimes, classification results suffer from overfitting and underfitting. A model can start memorizing too many irrelevant properties, or the patterns found can be insufficient to solve the task correctly. Mistakes in attributes, missing values, attribute incompatibility, and an incorrect choice of pattern-discovery method also strongly affect the outcome [13].

2.1.2 Estimation

Just like classification, an estimation model helps the analyst find an unknown output attribute. The only difference is that in an estimation task, the attribute is in a numeric format [11]. Good examples of estimation are estimating the probability of fraud in online transactions or appraising a company’s investment after rebranding. Support Vector Machine, lasso, and linear regression algorithms work well with this type of task.

Data can also be transformed from one type to another. This becomes necessary when a certain analysis method is not supported by the algorithm. For instance, people’s height in inches (60, 65, and 70) can be replaced with the values 0, 1, and 2 (short, medium, and tall).

2.1.3 Prediction

Sometimes, experts need to work with data that can be used to produce a prognosis of results. The key particularity of these tasks is that they focus on the future value of an object; the time dimension plays the most important role in these cases [11].

Predicting next month’s prices of goods in an online store, determining which customers of a wireless carrier will change their plan half a year after signing a contract, or forecasting how many goals a team will score in the 2021 season of the English Premier League – all these are examples of prediction tasks.

All prediction results are based on historic data and are either in numeric or categorical format. To build a model, Naive Bayes, Support Vector Machine, and Random Forest algorithms are used often [12].


2.1.4 Association rules

At first, association rules were used to analyze the behavior of supermarket shoppers. All purchases were included in the research, with the goal of finding connections between different goods and items. That is why association rules are also often mentioned in connection with market basket analysis [10]. A classic example is the X → Y rule with 20% support and 80% confidence. That means that 20% of customers buy products X and Y together, and there is an 80% probability that customers purchasing product X will also purchase product Y. When put into practice, the rules are much more complex, because there can be multiple conditions between the elements.

When the search for association rules involves a number of candidate relations that exceeds what can be examined directly, special algorithms are used, with Apriori, Eclat, and FP-Growth being among the most popular ones [3].

The task of finding association rules is in high demand not just in market basket analysis – other fields, such as telecommunications, medicine, or web mining, use this method successfully as well.

2.1.5 Clustering

“Clustering is the process of organizing objects into self‐similar groups by discovering the boundaries between these groups algorithmically using a number of different statistical algorithms and methods.” [13, p. 132] In general, clustering falls into the unsupervised learning category. Sometimes, the solutions can be determined clearly, without contamination, but more often, one is bound to encounter objects with imprecise description or probabilistic values. All results are always dependent on attributes available in the database. Moreover, it is important to consider the fact that some instances cannot be assigned to the same groups.

A Venn diagram can demonstrate a situation where examples belong to more than one cluster. A hierarchical arrangement of objects, with just a few clusters on the upper levels and sub-clusters on the lower levels, can be visualized using dendrograms. The probability of belonging to particular groups is recorded in simple tables [11].

K-means, Mean-Shift, EM, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithms are often used to solve clustering tasks. A typical area of use is marketing, where experts use the results to make market and customer segmentation decisions and to develop the best strategy.

2.1.6 Description and profiling

Describing input data can also be a data mining task. The main goal of such a task is to find a dominant structure or hidden connections. With these tasks, it is crucial to keep the results transparent and comprehensible to people [12].

Many companies attempt to analyze their clients’ data or, more precisely, to create customer profiles. Such a process can be broken down into a factual and a behavioral stage. The first profile provides demographic information (name, salary, date of birth, and so forth) and can also contain conclusions made on the basis of transactional data (last month, the client’s most expensive purchase was over three thousand Czech crowns). The second profile captures the customer’s actions; for instance, it contains information that the client spends less than a thousand Czech crowns on weekends. Association rules and regression trees are suitable for this type of task [14].

2.2 Algorithms

An algorithm in data mining (or machine learning) is a set of heuristics and calculations that creates a model from data [15]. Algorithms come in different levels of complexity, and sometimes basic approaches to solving a problem can yield results just as good as those requiring large computation capacity. That is why a “simplicity-first” methodology is considered valuable practice. The OneR classification algorithm is a good example of a tool for simple data analysis.

Often, users need to try several different methods in order to choose the model that works best for them. This chapter provides a description of more complex algorithms that have a good track record in practical resolution of tasks described in chapter 2.1. Some of them were used in the practical part of this thesis.

2.2.1 Regression analysis

Regression analysis methods are often used in data mining, gaining popularity thanks to easily comprehensible results and the high-level nature of the algorithms. In regression, the main task is to attempt to explain the effect a set of independent variables has on the outcome (dependent variable) [16].

Linear regression is a special type of analysis that can be expressed with the following equation:

$$Y = \beta_0 + \beta_1 X_1$$

where Y is the dependent variable, X₁ the independent variable, β₀ the intercept term, and β₁ the slope term. At the start, the β₀ and β₁ coefficients are unknown and need to be estimated using a statistical approach called the least-squares criterion. The goal of this criterion is to minimize the sum of squared differences between the actual and predicted output values. Furthermore, linear regression may be defined by multiple independent variables instead of just one.

Figure 1 shows an optimal regression line – the one that is closest to all the dots – and vertical lines that indicate the distance (residual) between the dots and the predicted line. This distance is then used in the residual sum of squares formula:

$$RSS = e_1^2 + e_2^2 + \dots + e_n^2$$

where eᵢ is the i-th residual. The quality of the obtained model is usually assessed using the R-squared coefficient of determination, which describes the proportion of the variance of the dependent variable explained by the model. The values range from 0 to 1, with results closer to 1 being more accurate. Alternatively, linear regression can be used for classification tasks by presenting the attributes in binary form, although there are certain drawbacks to this approach [16].

Figure 1. Linear regression. Source: [17].

Many tasks are solved using binary classification, or they can be transformed into such a format. A logistic model is especially good at dealing with this type of problem. The sigmoid (logistic) function, whose output lies in the range from 0 to 1, is used as the base element. It can be expressed with the following equation:

$$f(y) = \frac{e^y}{1 + e^y} = \frac{1}{1 + e^{-y}}$$

where −∞ < y < ∞, f(y) → 1 as y → ∞, and f(y) → 0 as y → −∞. To determine the target classes, a threshold is needed – usually, the threshold value is 0.5. The calculation of the constant effect that one variable has on another is performed with the odds ratio [16]:

$$\ln\left(\frac{p}{1 - p}\right), \quad p = f(y)$$

For tasks involving more than two target classes, the One-vs-Rest (OvR) and One-vs-One (OvO) methods are the most common. Since not all algorithms natively support multi-class classification, it is advisable to check whether the task can be simplified into a binary form before starting the modelling process. By default, the Python Scikit-learn library applies the One-vs-Rest approach for logistic regression, where each individual class is compared to all other classes. Analysis of three groups in the form of class1 vs [class2, class3], class2 vs [class1, class3], class3 vs [class1, class2] is an apt example. The One-vs-One method also simplifies the problem by converting it into binary tasks but, unlike the previous method, it compares a given class to the other classes one by one (class1 vs class2, class1 vs class3, etc.). The support vector machine algorithm, for example, supports this type of solution [18].
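The One-vs-Rest decomposition can be reproduced in a few lines with Scikit-learn. The sketch below is illustrative only (synthetic data and arbitrary parameter values, not code from the thesis):

```python
# Minimal sketch: One-vs-Rest logistic regression on a synthetic three-class problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Wrap the binary logistic model in an explicit One-vs-Rest meta-classifier;
# each class is fitted against all remaining classes.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("Accuracy:", clf.score(X_test, y_test))
```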


Since the least squares method is prone to overfitting when used for regression (especially with small amounts of data), an optimization technique called regularization helps prevent overfitting by adding a regularization term to the residual sum of squares. Lasso and ridge regression respond well to such optimization. The core difference between them lies in the form of regularization. More specifically, lasso regression uses the L1 regularization technique:

$$RSS + \alpha \sum_{j=1}^{p} |\beta_j|$$

while the ridge regression uses L2 regularization technique:

$$RSS + \alpha \sum_{j=1}^{p} \beta_j^2$$

where increasing the alpha (tuning) parameter shrinks the coefficients. That is why the lasso coefficients often become exactly 0, and why lasso models are easier to interpret than ridge regression models, whose coefficients usually approach zero but rarely reach it. Both are linear models, and both are currently very popular thanks to their high prediction accuracy. The Scikit-learn library provides a cross-validation technique for evaluating the alpha parameter [19].
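As a short, hedged illustration of the two penalties and of cross-validated selection of alpha, the following sketch uses Scikit-learn’s LassoCV and RidgeCV on synthetic data; the data and the alpha grid are assumptions for demonstration only:

```python
# Illustrative sketch: lasso (L1) and ridge (L2) regression with the
# regularization strength alpha chosen by cross-validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

alphas = np.logspace(-3, 3, 50)                  # candidate tuning parameters
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)   # L1 penalty
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)   # L2 penalty

# Lasso tends to zero out coefficients entirely, ridge only shrinks them.
print("Lasso alpha:", lasso.alpha_,
      "zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("Ridge alpha:", ridge.alpha_)
```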

2.2.2 Decision tree

The decision tree belongs to the divide-and-conquer algorithm design paradigm, and it also falls into the supervised learning category. The structure is made of a hierarchy of nodes, where attributes are tested, and leaves, which represent the classification results. The tree grows downwards, from the root until all elements are grouped correctly, or until the tree finishes its search over the training set [3]. Figure 2 shows the steps required to build a simple version of this algorithm.

When building decision trees, it is important to choose the right object classification criteria. The division of the group must be as accurate as possible to make sure that each subset falls into one class only; the procedure always depends on the attribute type.

Another interesting topic is how the issue of missing values is dealt with. When an element is missing, it either needs to be given special importance, or no importance at all. While the second option is easy to implement, it is too definite since it is very likely that useful information will be lost. A more logical and accurate approach would be to select the most popular branch by counting the elements that descend down the tree. Another option to deal with missing values is to use numeric weight in the range from 0 to 1, dividing the original attribute into parts. The weight is then distributed proportionally to the number of training instances going down that branch. Finally, when the objects reach the leaves, they need to be merged back [3].


Figure 2. Decision tree building algorithm. Source: [11].

Another topic that needs to be mentioned when discussing decision tree algorithms is branch pruning, a technique that becomes necessary when dealing with complex trees with a complicated structure. Subtree replacement and subtree raising are often used for this transformation. The first method works by replacing a subtree with a single leaf, which provides good results on the test set. The second method is more difficult to implement but just as effective – the subtree of the branch with the most training examples can be raised to replace its parent node [3].

The main advantages of decision trees are the following: the ability to work with both numerical and categorical data types; easily comprehensible results in the majority of tasks; clarity of how the data are processed using Boolean logic; and support for statistical tests. Among the weaknesses, excessive tree complexity, instability of the results under minor changes in the data, and the fact that the search for an optimal tree is an NP-complete problem need to be mentioned.

The description above provides a general explanation of how decision tree algorithms work in data analysis. However, there are different versions of these algorithms that differ in their approach to building the tree structure. One of the first was the ID3 (Iterative Dichotomiser 3) algorithm invented by J. Ross Quinlan in the 1980s. It passes through the instances from top to bottom, selecting at each iteration the best feature to create a node (the greedy approach). A measure called information gain is used when working with features, calculating how well an attribute separates the classes. The trees grow until they reach maximum depth and often need to be pruned afterwards. The main drawback of this algorithm is overfitting, which occurs even with small samples [20, 3].

C4.5 and C5.0 followed the ID3 algorithm as its improved versions. Apart from the categorical instances already supported in the original algorithm, they are also capable of processing numeric values. The overfitting issue has also been addressed with a pruning strategy. Thanks to good results and documentation, the C4.5 algorithm has become a standard among decision trees and can be found in all popular data mining systems. The newer C5.0 version was released under Quinlan’s license, introducing improvements over C4.5 and bringing better controls, reduced memory usage, and higher accuracy [11].


Classification and Regression Trees (CART) were designed to construct binary decision trees regardless of the instance type, meaning that each node has only two child nodes. CART also differs from C4.5 in the type of pruning it uses, namely cost-complexity pruning. It is based on the idea of first pruning the subtrees that, relative to their size, lead to the smallest increase in error on the training data. Then, when the final predictive tree is being selected, validation is performed using a holdout set or cross-validation to estimate the error rate [3]. With some data, this method results in smaller trees than C4.5 would produce. A third difference lies in the use of the Gini index instead of the entropy-based criterion in information gain. The Gini index has a slight edge in computation time, but both measures are very popular in machine learning algorithms. It is also worth mentioning that CART was the first regression tree system to support numerical target variables [11].
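Scikit-learn’s DecisionTreeClassifier is an optimized CART-style implementation, so a minimal sketch of the ideas above (Gini splitting, limited depth, cost-complexity pruning) can look as follows; the dataset and parameter values are illustrative assumptions, not the thesis settings:

```python
# Sketch of a CART-style decision tree: Gini impurity for splitting and
# cost-complexity pruning via ccp_alpha.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(criterion="gini",   # Gini index instead of entropy
                              max_depth=4,        # limit growth to reduce overfitting
                              ccp_alpha=0.01,     # cost-complexity pruning strength
                              random_state=0)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```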

2.2.3 Neural networks

“Neural networks offer a mathematical model that attempts to mimic the human brain.” [11, p. 254] Miloš Křivan gives a more detailed definition: an artificial neural network is an oriented graph with dynamically evaluated vertices and edges, and it can be expressed as an ordered quintuple [V, E, ε, y, w] where:

• V – the set of vertices (neurons)

• E – the set of edges (synapses)

• ε – the incidence mapping between vertices and edges (ε : E → V × V)

• y – the dynamic evaluation of vertices (y : V × t → ℝ)

• w – the dynamic evaluation of edges (w : ε(E) × T → ℝ)

and for ∀t ∈ t or ∀T ∈ T it holds that y([i, t]) ≡ yᵢ(t) or w([i, j, T]) ≡ wᵢⱼ(T) [21]. The basic building blocks of neural networks are called perceptrons, with a core structure made of inputs, weights, a bias, and an activation function. Figure 3 shows a simple example of an artificial neuron. Groups of perceptrons form coherent networks whose goal is to produce one or several output values [22].

The basic computation of a result follows these steps: all inputs are multiplied by the corresponding weights, the sum of the obtained components is calculated, the bias is added, and then the activation function is applied. The initialization of the weights can be performed in different ways, for instance using a uniform probability distribution. The range can be between -0.03 and 0.03, with numbers outside the range not considered at all. Another popular weight initialization method is the Laplace–Gauss distribution, which depends on the expected value, dispersion, asymmetry coefficient, and kurtosis. Initialization schemes such as He, Xavier, and LeCun uniform are often used in the field, building on the tools mentioned above [22].


Figure 3. Perceptron structure. Source: [17].

The activation function is also important for the cooperation between artificial neurons. Without it, a group of perceptrons cannot function properly, because the result would be equivalent to the work of just one neuron. The goal is to generate a number that allows all elements to be connected yet remain different from one another. A specific activation function type must be selected for every layer of a neural network. Here, the Rectified Linear Unit (ReLU), first presented in 2011, is a popular choice. It is a piecewise linear function composed of several straight-line segments. If the input is negative, the result is 0, while for positive values the function returns the original, unchanged value. The function can be expressed with the following equation:

$$f(x) = x^{+} = \max(0, x)$$

where x is equal to the perceptron’s input. Sometimes, additional changes are made to the function – namely a decrease of the slope coefficient or a left/right offset [22].

The determination of probabilities for the output neurons is performed with the softmax method. This stage, performed once all scores are known, allows the user to deduce how much more probable each obtained result is than another. Figure 4 demonstrates at which stage the softmax method becomes involved. The values s1, s2, s3, and s4 represent the neuron output scores, while the results p1, p2, p3, and p4 represent the target probabilities.

Figure 4. Use of the softmax method with four neurons. Source: [23].
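The two building blocks described above can be sketched in a few lines of NumPy; the example scores are made up for demonstration:

```python
# Minimal NumPy sketch of ReLU, f(x) = max(0, x), and of softmax, which turns
# raw output scores s1..s4 into probabilities p1..p4 that sum to 1.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(scores):
    shifted = scores - np.max(scores)      # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

s = np.array([2.0, 1.0, 0.1, -1.0])        # raw outputs of four output neurons
p = softmax(s)
print("probabilities:", p, "sum:", p.sum())
print("relu of [-2, 0, 3]:", relu(np.array([-2.0, 0.0, 3.0])))
```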


Currently, many types of neural networks exist, with the Multi-layer Perceptron being one of the most popular. A Multi-layer Perceptron (MLP) is a network comprised of layers of neurons with no connections between neurons in the same layer; instead, a neuron from one layer is connected to all neurons in the adjacent layer [20]. The algorithm uses a backpropagation method and a sigmoid activation function for all perceptrons except for the first ones – for those, different functions must be used. The essence of backpropagation lies in updating neuron weights, working from the output layer towards the input layer. It can be divided into several steps [22]:

1. Finding the network error by comparing the outputs of the last layer to the target labels.

2. Calculating the output neuron deltas using the error, the target label, and the output node value.

3. Calculating the deltas of all remaining neurons, starting from the end.

4. Updating the neuron weights using the deltas and the perceptron outputs.

Multi-layer Perceptron is a reliable data analysis technique; the most popular software platforms support this algorithm. Its main drawback is its high dependency on the parameter settings (number of layers in the network etc.).
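A minimal, hedged sketch of a Multi-layer Perceptron in Scikit-learn follows; the layer sizes, activation, and dataset are illustrative choices, not the settings used in the thesis:

```python
# Sketch: a two-hidden-layer MLP classifier on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Neural networks are sensitive to feature scale, so standardize first.
scaler = StandardScaler().fit(X_train)
mlp = MLPClassifier(hidden_layer_sizes=(50, 25),  # two hidden layers
                    activation="relu",
                    max_iter=500,
                    random_state=1)
mlp.fit(scaler.transform(X_train), y_train)
print("Test accuracy:", mlp.score(scaler.transform(X_test), y_test))
```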

To work with a sequence of elements that affect one another, a different type of neural network is used – the Recurrent Neural Network (RNN). Solving these tasks requires using information from previous results (inputs). The key component is a memory that stores a certain state. Unlike the Multi-layer Perceptron, an RNN uses a modified version of backpropagation called backpropagation through time (BPTT). The principles underlying both approaches are almost identical, except that BPTT is used for sequence data. The structure of an RNN can vary; it can be one-to-many, many-to-one, many-to-many, or one-to-one. The one-to-many architecture takes a single input and generates a sequence. The other structures are defined in an analogous manner, based on their goals; for instance, speech or text recognition, classification, regression, etc. [24].

2.2.4 Ensemble methods

In some cases, the best way to solve a task is to consolidate the outputs of several different models. This led to the creation of tools capable of working with ensembles of algorithms, which often achieve better analysis results than single methods. These models are divided into three groups – stacking, bagging, and boosting – and can be used for classification or regression tasks.

In “Data Science and Big Data Analytics”, the following description of the bagging approach is given: “Bagging (or bootstrap aggregating) uses the bootstrap technique that repeatedly samples with replacement from a dataset according to a uniform probability distribution. “With replacement” means that when a sample is selected for a training or testing set, the sample is still kept in the dataset and may be selected again. Because the sampling is with replacement, some samples may appear several times in a training or testing set, whereas others may be absent. A model or base classifier is trained separately on each bootstrap sample, and a test sample is assigned to the class that received the highest number of votes.” [16, p. 228] The Random Forest algorithm is classified as a bagging method, with some improvements. For instance, in cases where one high-value predictor is present, Random Forest will not necessarily factor it in during the split process – this behavior is sometimes referred to as decorrelating the trees. To summarize, the main difference between the two approaches is in the choice of the predictor subset [17]. Despite Random Forest having multiple advantages (versatility, high prognosis accuracy, etc.), it has one significant drawback – processing times are long when working with big data.

The boosting approach shares some principles with the bagging method: it uses voting and averaging to combine the outputs of models of the same type. The main difference is that boosting is iterative – each new model attempts to correct the prediction errors of the previous ones. AdaBoost and Stochastic Gradient Boosting (used in the practical part of the thesis) are examples of popular boosting algorithms. Gradient Boosting Machines use a loss function (for example, logarithmic loss for classification) and weak learners (for example, decision trees) to obtain the results. The algorithm puts more emphasis on the examples that were predicted with errors, thus training the model. In each stage, the predictions of the trees are compared with the correct results, and the gradient indicating the direction in which the error can be reduced is calculated. The parameters are similar to those of Random Forest, for instance tree depth, the number of iterations, or the number of samples per leaf node [17].
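A short, hedged comparison of the two ensemble families in Scikit-learn is sketched below; the hyper-parameters are placeholders, not those used in the practical part of the thesis (setting subsample below 1 is what makes the gradient boosting "stochastic"):

```python
# Illustrative comparison of a bagging-style ensemble (Random Forest) and a
# boosting ensemble (Stochastic Gradient Boosting) on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=25, random_state=7)

rf = RandomForestClassifier(n_estimators=200, random_state=7)
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                max_depth=3, subsample=0.8,  # subsample < 1 => stochastic
                                random_state=7)

for name, model in [("Random Forest", rf), ("Stochastic Gradient Boosting", gb)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```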

2.2.5 K-nearest neighbors

The K-nearest neighbors method falls into the category of supervised learning algorithms and can be applied to solve classification and regression tasks. Classifiers are based on the learning-by-analogy principle, where a given tuple is compared with training tuples that are similar to it. The algorithm attempts to find the k tuples (points in an n-dimensional space) that are closest to the new tuple [1]. Figure 5 clearly illustrates this behavior. If we assume that k=1, the circle will be classified as a triangle, since a triangle is closest to the unknown object. If k=3, the circle would still be classified as a triangle (since two triangles and only one square are found).

Figure 5. K-nearest neighbors algorithm with an unknown tuple. Source: [25].


The closeness level is calculated using distance metrics, with the Euclidean distance between tuples being the most common one. The Scikit-learn library also allows the use of the Minkowski distance, which can be considered a generalization of both the Euclidean and the Manhattan distance [26].

The algorithm does have its drawbacks – it is very demanding when it comes to memory, and the method works better with large datasets. It is recommended to use cross-validation to select the model parameters (e.g. number of neighbors).
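The following sketch shows K-nearest neighbors with cross-validation over the number of neighbors, using the Minkowski metric mentioned above (p=2 corresponds to the Euclidean distance); the dataset and candidate k values are illustrative assumptions:

```python
# Sketch: KNN classifier with the number of neighbors chosen by cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Distance-based methods benefit from feature scaling, so use a pipeline.
pipe = make_pipeline(StandardScaler(),
                     KNeighborsClassifier(metric="minkowski", p=2))
grid = GridSearchCV(pipe,
                    param_grid={"kneighborsclassifier__n_neighbors": [3, 5, 7, 9, 11]},
                    cv=5)
grid.fit(X, y)
print("Best k:", grid.best_params_, "CV accuracy:", round(grid.best_score_, 3))
```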

2.3 Analytical tools

To engage in data mining, a user must choose a tool suitable for the task they wish to perform. Some data mining systems are fee-based or require at least basic programming skills, which makes them suboptimal in certain scenarios. Generally, the tools can be divided into two groups: ready-made solutions and programming languages. In this chapter, we provide examples from both groups, presenting tools that we had already used when working on other university projects. Currently, the Python programming language and the RapidMiner software are among the most popular platforms. The Weka and IBM SPSS Modeler systems, although less popular, are also counted among reliable data analysis tools [27].

2.3.1 Weka

Weka (Waikato Environment for Knowledge Analysis) is fully Java-based open-source software. The most popular version, with a complete GUI, allows users to perform all necessary data mining operations, including data visualization, pre-processing, etc. No programming skills are required, making Weka a good option for the general public. It is also possible to operate Weka from the command line and through the Java API, which allows the user to benefit from manual code-writing skills. One of Weka’s great advantages is its ability to integrate with R, Python, and Spark tools. However, this requires the user to install additional libraries, and the features of the various approaches can sometimes be limited. Weka has a good track record in solving tasks with small and medium amounts of data. Detailed user documentation is available on the official website of the University of Waikato [13]. An example of data loaded in the Weka Explorer can be seen in Figure 6.


Figure 6. Weka Explorer. Source: author.

2.3.2 IBM SPSS Modeler

IBM SPSS Modeler is a fee-based data mining tool developed by IBM that focuses on predictive tasks, allows easy integration with corporate business processes via CRISP-DM method, and supports large datasets. Users do not need any programming skills or knowledge since the software comes with a GUI. Other IBM products, such as Modeler Server, Modeler Batch or Modeler Solution Publisher, complement the IBM SPSS Modeler very well; a flexible architecture for large-scale projects can be built by combining these tools [28].

IBM offers several versions of the program, with each having a different set of features. For instance, the IBM SPSS Modeler Gold allows the client to perform text analysis and use more professional chart building options in comparison to the Professional release. There is also an option to expand the analysis with Python or R languages, which makes the data mining work more versatile. A detailed user documentation in various languages is available at the IBM’s official website.

By default, the SPSS Modeler is a standalone application with a set of algorithms. Using the Modeler Server technology is recommended when working with large amounts of data. This setup works as a client/server architecture that processes difficult and demanding operations remotely. Another add-on, the Modeler Administration Console, was created as a tool to configure the aforementioned server. Modeler Batch is another product, used when it is better for the user to carry out tasks from the command line, while the Modeler Solution Publisher technology creates a “packaged” version of a stream to make the work with information like aggregate data easier [28].


Figure 7. Examples of modeling using IBM SPSS Modeler. Source: author.

2.3.3 RapidMiner

RapidMiner is another popular data mining system providing users with a wide array of algorithms and data visualization tools. It is compatible with a variety of databases and file types and offers a user-friendly GUI. The company of the same name offers two versions of the software – a free Community Edition and a commercial Enterprise Edition. The main advantages of RapidMiner are cross-platform support, detailed documentation available at the company’s website, large number of practical examples that can be found in books and papers, and ability to integrate with programming languages. All these features make the platform perfect for the majority of projects [29].

Inside the system, the entire data mining process is modelled as an operator chain. RapidMiner has over 400 of its own operators, which, in turn, have ports that allow the elements to interconnect. These objects contain their own parameters, and the application provides a description for all of them. A set of interconnected operators analyzing the data is called a process. The obtained results, data, and processes are stored in the Repository tab. A simple workflow usually consists of data import and preparation, model building, model validation, and, finally, model application [30].

The RapidMiner app is also able to launch analytical processes remotely, while the RapidMiner Server gives clients an option to create projects together and share the results. The entire setup is comprised of seven key components: RapidMiner Studio, RapidMiner Server, RapidMiner Job Agent, RapidMiner Job Container, the RapidMiner Server repository, Data sources, and the Operations database. HTTP(S) protocols are used to transfer the data, with the Job Agent managing the life cycle. RapidMiner Studio is used to access the settings and customization options [29].


Figure 8. RapidMiner graphical user interface. Source: author.

2.3.4 Python

Python is a high-level, general-purpose programming language [31]. According to the well-established and reputable websites Tiobe and KDnuggets, in 2020 Python dominated the list of the most preferred platforms for data analysis, as well as for other areas [27, 32]. Python owes its success to many factors; among these are legibility, code brevity and intelligibility, support for many contemporary libraries, compatibility with other programming languages, platform-independent script launch, a large worldwide audience, etc. The software releases can be divided into two main categories – version 2.7 and Python 3. While they share a lot of similarities, there are also significant differences that are often critical for client products. The latest stable version, 3.9, introduced changes in dictionary updates, work with time zones, a relaxed decorator syntax, and a new PEG parser [33]. The PEP 8 standard, containing information on the best practices in script writing, is also an inseparable element of the Python language. Guidelines that help improve code quality are another valuable feature, reducing the time spent on repeatedly deciphering code. The package-management system pip is also worth mentioning – it allows users to easily download third-party instruments, which makes it an essential part of the workflow.

The combination of different libraries allows users to achieve great results in data analysis. The NumPy and SciPy libraries have proven effective for tasks related to information manipulation, making it easier to work with multidimensional matrices and to perform mathematical and numerical analysis. The Pandas library can be characterized in a similar manner, helping with structured (tabular, multidimensional, potentially heterogeneous) and time series data. Pandas’ key elements are its data structures – Series (a one-dimensional labeled array with axis labels) and DataFrame (a two-dimensional labeled data structure). A DataFrame usually uses Series, lists, dicts, and other Python data types as entry values. The Pandas library was designed to help with the following types of problems: handling of missing data, working with columns and rows, intelligent slicing and data indexing, hierarchical labeling of axes, reading/writing various data formats (including csv and excel), time-series data processing, etc. [34]. Figure 9 shows an example of the Pandas library being used in the Jupyter Notebook web app.
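A small, hedged Pandas sketch illustrating the Series and DataFrame structures and a few of the operations listed above (missing data, row selection, CSV output) follows; the toy data are invented for demonstration:

```python
# Sketch: basic Pandas data structures and operations on made-up article data.
import pandas as pd

shares = pd.Series([1200, 850, None, 3100],
                   index=["a1", "a2", "a3", "a4"], name="shares")   # labeled 1-D array

articles = pd.DataFrame({
    "title_words": [9, 12, 7, 10],
    "num_images": [1, 0, 3, 2],
    "shares": shares.values,
})

articles = articles.fillna(articles["shares"].median())   # handle the missing value
popular = articles[articles["shares"] > 1000]              # boolean row selection
popular.to_csv("popular_articles.csv", index=False)        # write a CSV file
print(popular)
```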

Many convenient data visualization libraries are also available for Python. The abovementioned Pandas package, built on top of another library, Matplotlib, allows users to quickly build images by simply using the plot() option and selecting the graph type. Below is a list of several other types of data visualization [16], followed by a short plotting sketch:

• A bar plot is a chart created to compare multiple independent categories (there is a space between the columns) in the form of rectangular bars whose size is determined by the input values. The results can be represented vertically or horizontally using the bar() and barh() methods, respectively.

• Histograms provide a graphical representation of the frequency of numerical data in columns without space between the bars, using the hist() method.

• A pie chart is used to separate individual parts or components from the whole. Most often, the elements are supplemented with a percentage ratio. It is also recommended to show the audience a maximum of three attributes. The rendering is performed with the pie() function.

• A scatter plot shows the interconnections between attributes on the X and Y axes. The data are displayed as dots, usually covering 2-4 variables. A scatter plot could be used, for example, for the graph of the linear regression algorithm described in chapter 2.2.1. The scatter() method uses five basic arguments – X coordinates, Y coordinates, dot size, dot color, and keyword arguments.

• Box plots are constructed on the basis of five parameters: the minimum value, the first quartile, the second quartile, the third quartile, and the maximum value. A box plot is a great way to compare categorical attributes and the distribution of quantitative values of a certain topic, and to identify outliers. The result is a rectangle with lines at opposite sides. It can be constructed using several different methods, with box() and boxplot() being the most commonly used.
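The plotting sketch referred to above is given here; it uses the Pandas/Matplotlib interface with made-up data and simply saves the figure to disk:

```python
# Illustrative sketch: bar plot, histogram, and scatter plot with Pandas/Matplotlib.
import matplotlib
matplotlib.use("Agg")            # render without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({"channel": ["tech", "world", "lifestyle", "business"],
                   "articles": [420, 310, 150, 275]})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df.plot.bar(x="channel", y="articles", ax=axes[0], legend=False)       # bar plot
pd.Series(np.random.normal(size=500)).plot.hist(bins=30, ax=axes[1])   # histogram
axes[2].scatter(np.random.rand(50), np.random.rand(50))                # scatter plot
fig.tight_layout()
fig.savefig("example_plots.png")
```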

Machine learning algorithms are also provided by various libraries. One of these is Scikit-learn, free software designed for the Python programming language that gives users access to a wide array of supervised and unsupervised learning methods, model evaluation tools, utilities aimed at solving real-world tasks with large amounts of data, and other useful data science tools. First, the Scikit-learn packages are imported as normal Python modules. Then, the desired object is instantiated and its parameters are set. Parameters that cannot be learned directly within estimators are called hyper-parameters; the Scikit-learn library offers an option to tune the hyper-parameters using the Random Search and Grid Search techniques or, more precisely, RandomizedSearchCV and GridSearchCV. Both methods are optimized by a cross-validated search over the parameter settings, hence the “CV” in their names. Apart from Random Search and Grid Search, the latest version of Scikit-learn includes two new tools, HalvingGridSearchCV and HalvingRandomSearchCV, which can sometimes achieve better results. Estimator objects, inspection utilities, results visualization (e.g. the ROC curve), and dataset transformations can also be found on the list of basic work areas. Compared to other libraries, Scikit-learn can boast many strengths and advantages; however, it falls slightly short in the area of neural networks, where users would benefit more from tools designed specifically for that purpose [26].
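A hedged sketch of the hyper-parameter tuning utilities mentioned above follows: a randomized, cross-validated search over a small parameter grid for a gradient boosting classifier. The grid and dataset are illustrative assumptions, not the configuration used in the practical part:

```python
# Sketch: RandomizedSearchCV over a small hyper-parameter grid.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=800, n_features=20, random_state=3)

param_distributions = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4],
}
search = RandomizedSearchCV(GradientBoostingClassifier(random_state=3),
                            param_distributions,
                            n_iter=10, cv=5, scoring="roc_auc", random_state=3)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best cross-validated AUC:", round(search.best_score_, 3))
```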

Another open-source library, Keras, helps users to solve machine learning problems with a focus on modern deep learning. The platform was created for meticulous work with algorithms, allowing the users to customize the features of virtually any method. Neural networks are the main focus of Keras that is able to build complex models with a minimal number of code lines thanks to working with low-level TensorFlow framework features [35].

In 2020, the combination of Keras and TensorFlow was the most popular deep learning stack in both the scientific and the business community, with companies like Netflix or Uber using it in their products [36]. It is also worth mentioning that using Keras together with Scikit-learn is very convenient, since some tools are not available in Keras out of the box and would otherwise need to be programmed manually. A good example of this is the abovementioned search for hyper-parameters, where the user can take a shortcut by simply pulling up the Scikit-learn methods.
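A minimal sketch of a Keras model definition is shown below; the layer sizes, the assumed number of input features (58), and the binary output are illustrative choices, not the architecture used later in this thesis.

# A minimal Keras Sequential sketch; the layer sizes and the assumed
# 58 input features are illustrative, not the thesis configuration.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(58,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # binary popular/unpopular output
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.summary()  # prints the layer structure and parameter counts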

TensorFlow was developed by Google and is the default backend of the Keras library. Key elements of TensorFlow include computational graphs (networks of nodes and edges) and tensors (n-dimensional arrays). The new TensorFlow 2 received improvements like eager execution enabled by default, allowing dynamic model definition and immediate evaluation of operations without the need to build graphs first. Another important upgrade is the support of imperative Python code with an option to convert it into graphs. Among the best practices of working with TensorFlow are the use of the high-level API, following Python code-writing standards, and distributing training across multiple GPUs and TPUs with Keras. The TensorFlow developers have created an entire ecosystem designed for interworking. There is the TensorFlow Lite framework for running models on mobile devices; a collection of Google datasets called TensorFlow Datasets; and the machine learning library TensorFlow.js for JavaScript and browser deployment. At the same time, TensorFlow and Keras are not inseparable, which allows clients to choose their preferred software [37, 38].
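The two mechanisms mentioned above, eager execution and graph conversion via tf.function, can be illustrated with a minimal sketch; the tensor values are arbitrary examples.

# Eager execution is the default in TensorFlow 2: operations are
# evaluated immediately, without building a graph first.
import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(tf.reduce_sum(x))           # tf.Tensor(10.0, ...)

# tf.function converts imperative Python code into a callable graph.
@tf.function
def scale_and_sum(t, factor):
    return tf.reduce_sum(t * factor)

print(scale_and_sum(x, 2.0))      # tf.Tensor(20.0, ...)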

Figure 9. Jupyter Notebook web application. Source: author.


2.3.5 R

R is a programming language and environment for statistical computing and graphics. In the data science community, R is considered to be a reliable, growing tool with a wide selection of libraries and integrated development environments (IDE). The R project was created on the basis of another programming language, S, introducing many key changes.

Today, it is released under the GNU General Public License. According to the developers, the product stands out thanks to the quality of plot compilation, effective data handling, and operators for calculations on arrays. It is worth noting that R has a LaTeX-like documentation format, which helps users present their results. The list of fundamental data types includes numeric, integer, logical, character, and complex data. Popular data structures available to users include vectors, lists, data frames, and matrices. Free versions of the R language and its packages are available via the CRAN (Comprehensive R Archive Network) software repository, with RStudio being the most popular IDE. RStudio is available as a desktop application, but it can also be accessed through a browser when hosted on a Linux server. The desktop configuration is available for all major platforms [13, 39].

Figure 10. RStudio graphical user interface. Source: [40].


3 CRISP-DM methodology

In the second half of the 1990s, the CRoss-Industry Standard Process for Data Mining (CRISP-DM) project was launched with the goal of creating a set of clear-cut steps for obtaining data mining results regardless of tools and data types. Such a methodology allows real-world tasks to be carried out more quickly, at lower cost, and with reduced risk. At the very beginning, the project was developed by just a few companies: the carmaker Daimler-Benz, the insurance provider OHRA, the IT corporation NCR, and the company Integral Solutions Ltd.

Together, they formed a consortium by the name of CRISP-DM Special Interest Group, bringing more user attention to the CRISP-DM product and publishing the first version of the standard [41].

The CRISP-DM methodology breaks down the data mining process into six phases. Figure 11 shows all these phases and illustrates the relationships between them. The sequence of phases is not fixed; their order can change. The results of each phase impact the next, and sometimes it is necessary to go back to a previous phase to introduce changes. The outer circle of large arrows represents the cyclical nature of the CRISP-DM process, while the arrows inside the circle represent the most frequent dependencies between phases [20].

Figure 11. Phases of CRISP-DM methodology. Source: [42].

Currently, other data mining process methodologies exist as well; however, CRISP-DM is still counted among the most frequently used solutions [43]. This was also one of the reasons why the CRISP-DM methodology was selected for the empirical part of this thesis. Other reasons include the availability of detailed documentation on IBM's official website and first-hand experience with the use of this standard in practical, real cases that was obtained over the course of the master's program. The following sections provide a more detailed description of each of the CRISP-DM phases.

3.1 Business Understanding

The goal of the Business Understanding phase is to understand the project objectives from a business perspective, to determine the requirements for meeting these objectives, to understand how the company will benefit from the data analysis, which factors could affect the task at hand, and how likely the identified risks are. If this phase is skipped, chances are high that all the subsequent efforts will be ineffectual. Usually, this phase takes up 10-20% of the entire data mining project. The Business Understanding phase is comprised of several different steps [20, 44, 41]:

• Determining Business Objectives. This step is aimed at determining the client’s goals, to be able to guarantee that the solution will match their interests. Key questions and factors must be defined, and business task information must be gathered.

• Assessing the Situation. Assessment of all available tools is an important step, too. Here, the goal is to identify risks and plan for their mitigation, to ascertain the availability of personnel required to carry out the tasks, to create a glossary providing definitions of the key terms and processes, and to determine a deadline. Financial costs and the potential income of the project are also considered in this step.

• Determining Data Mining Goals. The aim of this step is to formulate the data mining tasks on the basis of the business goals determined earlier. All expected results are described in technical terms, for instance, the required accuracy of a customer churn prediction. If issues arise while determining the goals, it is best to review the previous steps.

• Producing a Project Plan. The steps described above help to compose a project plan, the key document in the entire data analysis process. Such a plan can be useful for other colleagues engaged in the same or similar work. Among other things, it also contains information on the results and dependencies and includes an evaluation of the used tools.

3.2 Data Understanding

Statistically, experts spend around 20-30% of the project time on this phase, examining the data more closely. In general, this is the phase when actual work with the data begins, and it is thus recommended to carefully review the quality of the results. The Data Understanding process uses visualization methods and various descriptive approaches to make the familiarization process easier.

The Data Understanding phase is comprised of the following steps [20, 44, 41]:

• Collecting Initial Data. The target data can be obtained from different sources; for instance, they can be extracted manually or acquired from companies. The goal of this step is to upload the data into storage, to decide how the raw data will be processed, to check whether there is enough data to achieve the goal, etc. Creating a report to share the knowledge within the team or for future use is considered good practice here.

• Describing Data. Quality and quantity metrics are well-suited for this task. The analyst must evaluate the data types, the number of attributes, the need to complete the data using an external dataset, and create new labels for groups of attributes to improve clarity. Then, a report containing all the core information is compiled.

• Exploring Data. In this step, the data are explored using various types of visualization. Then, ideas on how to accomplish the goals determined in the Business Understanding phase are laid down. This is also when the first candidate solutions, as well as potentially different points of view, start to appear. These are also recorded in the report.

• Verifying Data Quality. In the final step of the Data Understanding phase, the data are checked for potential issues, like missing values, inconsistent units, typos, column names not matching the values in them, etc. A short pandas sketch covering several of these steps is shown after this list.
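The sketch below assumes a pandas installation as well as a hypothetical file name ("OnlineNewsPopularity.csv") and column name ("shares"); neither is a claim about the actual files used in the thesis.

# A minimal pandas sketch of the Data Understanding steps; the file name
# "OnlineNewsPopularity.csv" and the column "shares" are assumptions.
import pandas as pd

df = pd.read_csv("OnlineNewsPopularity.csv")

# Describing Data: shape, data types, and basic statistics
print(df.shape)
print(df.dtypes)
print(df.describe())

# Exploring Data: a quick look at the distribution of a numeric column
df["shares"].hist(bins=50)

# Verifying Data Quality: missing values and duplicated records
print(df.isnull().sum())
print(df.duplicated().sum())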

3.3 Data Preparation

In this phase, the data are finalized and prepared to be used in different analysis methods.

The two previous phases should help minimize obstacles and difficulties; however, experts commonly spend a lot of time (ranging from 50% to 70%) on data crunching. This phase is comprised of the following steps [20, 44, 41]:

• Selecting Data. First, it is necessary to determine what tables or specific subsets are needed to achieve the data mining goals, followed by an explanation of why some instances were excluded and why others will be used.

• Cleaning Data. The data selected in the previous step are processed to be fit for future work with algorithms. Details on existing issues can be found by referring to the Data Quality report. It is also recommended to document all changes.

• Constructing New Data. In certain situations, the dataset needs to be supplemented with new values, or changes must be made to the existing values. A typical example would be the transformation of one type of attribute into another, or the derivation of a single column from several existing ones that is better suited to achieving the goals.

• Integrating Data. Techniques like data merging or appending combine several tables into one. The process is usually finalized with data aggregation and the generation of new records.

• Formatting Data. Some tools require a change in data format (without adjusting the values), for instance, reordering the elements or another type of syntactic adjustment. A short pandas sketch covering several of these preparation steps is shown after this list.
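The following is a minimal sketch, assuming hypothetical column names and an example popularity threshold of 1400 shares; none of these values are taken from the thesis.

# A minimal pandas sketch of typical Data Preparation steps; the column
# names and the 1400-share threshold are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "title_length": [10, 12, None, 9],
    "channel": ["tech", "world", "tech", "lifestyle"],
    "shares": [1200, 800, 4500, 950],
})

# Cleaning Data: impute a missing numeric value with the column median
df["title_length"] = df["title_length"].fillna(df["title_length"].median())

# Constructing New Data: derive a binary target from an existing column
df["popular"] = (df["shares"] >= 1400).astype(int)

# Formatting Data: adjust types and ordering without changing the values
df = df.astype({"channel": "category"}).sort_values("shares")
print(df)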

3.4 Modeling

In the modeling phase, algorithms are applied with the goal of selecting the best one. The process is iterative, since parameters (and often hyperparameters, too) must be set, followed by testing the method on the data. In some cases, experts are forced to go back to the Data Preparation phase and adjust the data before modeling can continue.
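A minimal sketch of one such modeling iteration with scikit-learn is shown below; the synthetic data, the decision tree estimator, and its parameter values are illustrative assumptions, not the models evaluated later in the thesis.

# A minimal modeling sketch: fit one candidate algorithm and check its
# accuracy on held-out data; the dataset here is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))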
