Types of data mining tasks - Hlavní práce70941

In the process of discovering hidden knowledge, an analyst will often encounter a wide variety of tasks that can be divided into different categories. These tasks can be regarded as either supervised or unsupervised learning. The first group requires human involvement in labeling the data before a classifier is created, and falls under the predictive techniques.

The second group falls under the descriptive techniques and does not require data labeling.

Sometimes, semi-supervised learning is also applied: a mix of the two previous methods that uses partially labeled data to train the model. This approach can be useful in scenarios with large numbers of instances, where labeling every one of them would be impractical.

In data analysis, each of the listed methods is used to solve specific tasks [10]:

• Classification
• Estimation
• Prediction
• Association rules
• Clustering
• Description and profiling

2.1.1 Classification

The classification task falls into the supervised category. Its goal is to determine which of the known classes the investigated object belongs to. The target class (the dependent variable) is determined on the basis of the independent variables (those that are part of the object's description) [11, 12].

The dependent variable in a classification task is categorical. An email inbox filter is a good example: it classifies received mail as either spam or regular mail. It is worth noting that the classification approach emphasizes current behavior, not a future prognosis. Among the most popular classification algorithms are logistic regression, the Naïve Bayes classifier, Support Vector Machine, and decision trees.
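The spam filter example can be illustrated with a minimal sketch of a Naïve Bayes classifier, one of the algorithms named above. The training messages, the function names `train` and `classify`, and the use of Laplace smoothing are assumptions made for this illustration; a real filter would be trained on far more data.

```python
import math
from collections import Counter

# Hypothetical labeled messages. Labeling them by hand is the human step
# that makes classification a supervised task.
TRAINING = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda for monday", "regular"),
    ("lunch on monday", "regular"),
]


def train(examples):
    """Count word occurrences per class and messages per class."""
    word_counts = {}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    return word_counts, class_counts


def classify(text, word_counts, class_counts):
    """Return the class with the highest log-posterior score."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_messages = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Log prior of the class.
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(counts.values())
        for word in text.split():
            # Laplace (add-one) smoothing avoids log(0) for unseen words.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


word_counts, class_counts = train(TRAINING)
print(classify("cheap money", word_counts, class_counts))       # spam
print(classify("agenda for lunch", word_counts, class_counts))  # regular
```

The output is one of the known classes, which is what distinguishes classification from tasks with a numeric target.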

Sometimes, classification results suffer from overfitting or underfitting. An overfitted model memorizes too many irrelevant properties of the training data, whereas an underfitted model finds patterns that are insufficient to solve the task correctly. Errors in attributes, missing values, attribute incompatibility, and an incorrect choice of pattern-discovery method also strongly affect the outcome [13].

2.1.2 Estimation

Just like classification, an estimation model helps the analyst find an unknown output attribute.

The only difference is that in an estimation task, the attribute is numeric [11]. Good examples of estimation are fraud probability estimation in online transactions or a company’s investment appraisal after rebranding. Support Vector Machine, lasso, and linear regression algorithms are well suited to this type of task.
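As a minimal sketch of estimation, the following fits a simple linear regression (ordinary least squares with one input variable) and uses it to estimate a numeric output. The data points and the names `fit_line` and `estimate` are hypothetical and serve only to show that the target is a number rather than a class.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate(x, slope, intercept):
    """Estimate the numeric output attribute for a new input value."""
    return slope * x + intercept


# Hypothetical observations that lie exactly on the line y = 2x + 1.
inputs = [1, 2, 3, 4]
outputs = [3, 5, 7, 9]

slope, intercept = fit_line(inputs, outputs)
print(slope, intercept)                 # 2.0 1.0
print(estimate(5, slope, intercept))    # 11.0
```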

Data can also be transformed from one type to another. This becomes necessary when the chosen algorithm does not support a particular data type. For instance, in a task containing people’s height in inches, the numeric values 60, 65, and 70 can be replaced with the categories 0, 1, and 2 (short, medium, and tall).
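The height transformation can be sketched as a simple binning function. The cut-off values of 63 and 68 inches are assumptions chosen only so that 60, 65, and 70 map to 0, 1, and 2; the source does not state where the boundaries lie.

```python
# Hypothetical boundaries in inches; the source gives only the three sample values.
SHORT_BELOW = 63
TALL_FROM = 68

LABELS = {0: "short", 1: "medium", 2: "tall"}


def height_category(height_in):
    """Map a numeric height in inches to a category code (0, 1, or 2)."""
    if height_in < SHORT_BELOW:
        return 0
    if height_in < TALL_FROM:
        return 1
    return 2


heights = [60, 65, 70]
codes = [height_category(h) for h in heights]
print(codes)                        # [0, 1, 2]
print([LABELS[c] for c in codes])   # ['short', 'medium', 'tall']
```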

2.1.3 Prediction

Sometimes, experts need to work with data that can be used to forecast results. The distinguishing feature of these tasks is that they focus on the future value of an object. The time dimension plays the most important part in these cases [11].

Examples of prediction tasks include predicting next month’s prices of goods in an online store, determining which customers of a wireless carrier will change their plan half a year after signing a contract, and forecasting how many goals a team will score in the 2021 season of the English Premier League.

All prediction results are based on historical data and are in either numeric or categorical format. Naïve Bayes, Support Vector Machine, and Random Forest algorithms are often used to build the model [12].

2.1.4 Association rules

At first, association rules were used to analyze the behavior of supermarket shoppers. All purchases were included in the research, with the goal of finding connections between different goods and items. That is why association rules are often mentioned in connection with market basket analysis [10]. A classic example is the rule X → Y with 20% support and 80% confidence. That means that 20% of customers buy products X and Y together, and there is an 80% probability that customers purchasing product X will also purchase product Y.
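The two measures can be computed directly from a list of transactions. The sketch below builds a hypothetical set of 20 baskets in which the rule X → Y has exactly the support and confidence quoted above; the item name `Z` and the function names are illustrative assumptions.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in transactions if items <= basket) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Share of transactions containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical data: 4 baskets with X and Y, 1 basket with X only,
# and 15 baskets with an unrelated item Z.
transactions = (
    [{"X", "Y"}] * 4
    + [{"X"}]
    + [{"Z"}] * 15
)

rule_support = support(transactions, {"X", "Y"})
rule_confidence = confidence(transactions, {"X"}, {"Y"})
print(f"support = {rule_support:.0%}")        # support = 20%
print(f"confidence = {rule_confidence:.0%}")  # confidence = 80%
```

Here 4 of the 20 baskets contain both products (20% support), and 4 of the 5 baskets containing X also contain Y (80% confidence).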

In practice, the rules are much more complex, because they can involve multiple conditions across several items.

When the volume of data is too large for a straightforward search of all candidate rules, specialized algorithms are used, with Apriori, Eclat, and FP-Growth being among the most popular [3].

The task of finding association rules is in high demand not just in market basket analysis. Other fields, such as telecommunications, medicine, and web mining, use this method successfully as well.

2.1.5 Clustering

“Clustering is the process of organizing objects into self‐similar groups by discovering the boundaries between these groups algorithmically using a number of different statistical algorithms and methods.” [13, p. 132] In general, clustering falls into the unsupervised learning category. Sometimes the groups can be determined clearly, without contamination, but more often one is bound to encounter objects with imprecise descriptions or probabilistic values. The results always depend on the attributes available in the database. Moreover, it is important to keep in mind that some instances cannot be assigned to the same groups.

A Venn diagram can depict a situation where examples belong to more than one cluster. A hierarchical arrangement of objects, with just a few clusters on the upper levels and sub-clusters on the lower levels, can be visualized using dendrograms.

The probability of belonging to particular groups can be recorded in simple tables [11].

K-means, Mean-Shift, Expectation-Maximization (EM), and Density-based spatial clustering of applications with noise (DBSCAN) algorithms are often used to solve clustering tasks. A typical area of use is marketing, where experts use the results to make market and customer segmentation decisions and to develop the best strategy.
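As an illustration of customer segmentation, the following is a minimal one-dimensional K-means sketch. The spending values, the choice of two clusters, and the initial centroids are all assumptions; practical implementations work on many attributes and choose starting centroids more carefully.

```python
def k_means_1d(points, initial_centroids, max_iterations=100):
    """Group numbers into clusters around the nearest centroid (1-D K-means)."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # No centroid moved, so the assignment is stable.
        centroids = new_centroids
    return centroids, clusters


# Hypothetical monthly spending of six customers; no labels are provided.
spending = [10, 12, 11, 50, 52, 49]

centroids, clusters = k_means_1d(spending, initial_centroids=[10, 12])
print([sorted(c) for c in clusters])  # [[10, 11, 12], [49, 50, 52]]
```

No class labels are supplied at any point, which is what places clustering in the unsupervised category.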

2.1.6 Description and profiling

Describing the input data can also be a data mining task. The main goal of such a task is to find a dominant structure or hidden connections. With these tasks, it is crucial to keep the results transparent and comprehensible to people [12].

Many companies attempt to analyze their clients’ data, or, more precisely, to create customer profiles. Such a profile can be broken down into a factual and a behavioral part.

The factual profile provides demographic information (name, salary, date of birth, and so forth) and can also contain conclusions drawn from transactional data (for example, that the client’s most expensive purchase last month was over three thousand Czech crowns). The behavioral profile describes the customer’s actions; for instance, it records that the client spends less than a thousand Czech crowns on weekends. Association rules and regression trees are suitable for this type of task [14].
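The two kinds of profile can be sketched as summaries derived from one client's transactions. Everything in the example is hypothetical: the client name, the dates and amounts, the field names, and the interpretation of weekend spending as a total over the observed period.

```python
from datetime import date

# Hypothetical transactions for one client: (purchase date, amount in CZK).
transactions = [
    (date(2021, 3, 3), 3200),   # Wednesday
    (date(2021, 3, 6), 400),    # Saturday
    (date(2021, 3, 7), 350),    # Sunday
    (date(2021, 3, 10), 900),   # Wednesday
]


def factual_profile(name, transactions):
    """Facts about the client, including conclusions drawn from transactions."""
    return {
        "name": name,
        "most_expensive_purchase": max(amount for _, amount in transactions),
    }


def behavioral_profile(transactions):
    """Description of how the client behaves, here weekend spending."""
    # date.weekday() returns 5 for Saturday and 6 for Sunday.
    weekend_total = sum(amount for day, amount in transactions if day.weekday() >= 5)
    return {
        "weekend_spending": weekend_total,
        "spends_under_1000_on_weekends": weekend_total < 1000,
    }


factual = factual_profile("Example Client", transactions)
behavioral = behavioral_profile(transactions)
print(factual["most_expensive_purchase"])           # 3200
print(behavioral["weekend_spending"])               # 750
print(behavioral["spends_under_1000_on_weekends"])  # True
```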
