

4.4.2 Assessing the Model

In this chapter, we analyze the results achieved after the algorithm parameters were selected. The Scikit-learn library provides the classification_report method, which includes the following metrics: accuracy, precision, recall, support, and F1-score (F-measure). Below, each model is assessed twice, once on the train data and once on the test data. Attachments B and C of the thesis also include confusion matrices with detailed statistics of predicted and actual values (to facilitate interpretation, the values were filtered so that the True positive class contains the group of popular articles).
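A minimal sketch of how such a report can be produced, assuming a fitted estimator `model` and the usual `X_train`, `X_test`, `y_train`, `y_test` splits (the variable names are illustrative, not the thesis's code):

```python
# Minimal sketch: per-split classification report and confusion matrix.
from sklearn.metrics import classification_report, confusion_matrix

for name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
    y_pred = model.predict(X)
    print(f"--- {name} data ---")
    print(classification_report(y, y_pred, target_names=["Unpopular", "Popular"]))
    # Rows are actual classes, columns are predicted classes.
    print(confusion_matrix(y, y_pred))
```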

All of the metrics mentioned above are closely related and are calculated using the following formulas:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where TP and FN stand for the numbers of True positive and False negative predictions (the instances that are actually positive and were classified correctly and incorrectly, respectively). Likewise, TN and FP stand for the numbers of True negative and False positive predictions (for the instances that are actually negative). The support metric gives the number of elements of a specific class within the dataset [2, 26].

• Logistic regression. The overall accuracy of the model was 66% on the test data, with the precision, recall, and F1-score of the unpopular class 1-2 percentage points higher than those of the popular class. The True positive group of the confusion matrix contained 3706 correctly classified popular articles, and the True negative group contained 3935 articles.

Similar values were observed on the train data, with an accuracy of 66% and maximum values of 66% for precision and 67% for recall and F1-score. Compared to the other methods, logistic regression achieved average results.
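A sketch of this step under illustrative assumptions; the quantile transformation mentioned in chapter 4.5 is included as a pipeline step, and the hyperparameters are not the thesis's actual values:

```python
# Sketch of the logistic regression model; parameter values are assumptions.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer

log_reg = make_pipeline(
    QuantileTransformer(output_distribution="normal"),  # assumed setting
    LogisticRegression(max_iter=1000),
)
log_reg.fit(X_train, y_train)
print(classification_report(y_test, log_reg.predict(X_test)))

# Sanity check of the reported counts: 3706 true positives plus 3935 true
# negatives out of 11535 test instances gives (3706 + 3935) / 11535 ≈ 0.66,
# matching the accuracy in Tab. 15.
```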

Tab. 15. Logistic regression metrics for test data. Source: author.

            Precision   Recall   F1-score   Support
Unpopular   0.67        0.67     0.67       5864
Popular     0.66        0.65     0.66       5671
Accuracy                         0.66       11535

Tab. 16. Logistic regression metrics for train data. Source: author.

            Precision   Recall   F1-score   Support
Unpopular   0.66        0.67     0.67       13680
Popular     0.66        0.65     0.65       13233
Accuracy                         0.66       26913

• Ridge regression. On the test data, the algorithm achieved the same accuracy as logistic regression, 66%. The recall of the unpopular class was slightly better, reaching 69%, while the recall and F1-score of the popular class were slightly worse. The True negative value of the confusion matrix, i.e. the unpopular category of Mashable's articles, was above 4000, which is a high figure.

The train data contained a total of 26913 instances, with the accuracy (66%) unchanged. In general, the values matched those of the test subset, with differences observed only in the recall and F1-score of the popular class, which reached 63% and 64% respectively. The ridge regression model therefore also falls into the category of average-performing algorithms.
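A comparable sketch for this model, with an illustrative (untuned) regularization strength. One practical detail relevant to the later AUC comparison: RidgeClassifier has no predict_proba, so decision_function supplies the scores for the ROC curve.

```python
# Sketch of the ridge classifier; alpha is an illustrative default.
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import classification_report

ridge = RidgeClassifier(alpha=1.0)
ridge.fit(X_train, y_train)
print(classification_report(y_test, ridge.predict(X_test)))
scores = ridge.decision_function(X_test)  # usable for roc_curve/roc_auc_score
```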

Tab. 17. Ridge regression metrics for test data. Source: author.

            Precision   Recall   F1-score   Support
Unpopular   0.66        0.69     0.67       5864
Popular     0.66        0.64     0.65       5671
Accuracy                         0.66       11535

Tab. 18. Ridge regression metrics for train data. Source: author.

            Precision   Recall   F1-score   Support
Unpopular   0.66        0.69     0.67       13680
Popular     0.66        0.63     0.64       13233
Accuracy                         0.66       26913

• Random Forest. This model was tuned using a new Scikit-learn method called HalvingGridSearchCV and achieved an overall accuracy of 67% on the test data. Precision was 67% for both classes, while the maximum recall, 68%, was observed in the unpopular class; the lowest value, a recall of 65%, belonged to the popular class. The number of True positives in the popular class was 3702, with a further 1859 articles that the algorithm incorrectly classified as popular. The train data scores were considerably higher than on the test subset, with a 22-percentage-point difference in accuracy, which suggests possible overfitting.
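A sketch of the halving search described above, with an illustrative parameter grid (the thesis's actual grid is not reproduced here). HalvingGridSearchCV was still experimental in Scikit-learn at the time, hence the explicit enabling import:

```python
# Sketch of hyperparameter tuning with successive halving; grid is illustrative.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {"n_estimators": [100, 300, 500], "max_depth": [10, 20, None]}
search = HalvingGridSearchCV(RandomForestClassifier(random_state=42),
                             param_grid, cv=5, factor=2)
search.fit(X_train, y_train)
best_rf = search.best_estimator_

# Comparing train and test accuracy exposes the ~22-point gap noted above.
print("train:", best_rf.score(X_train, y_train))
print("test: ", best_rf.score(X_test, y_test))
```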

Tab. 19. Random Forest metrics for test data. Source: author.

            Precision   Recall   F1-score   Support
Unpopular   0.67        0.68     0.68       5864
Popular     0.67        0.65     0.66       5671
Accuracy                         0.67       11535

Tab. 20. Random Forest metrics for train data. Source: author.

            Precision   Recall   F1-score   Support
Unpopular   0.89        0.90     0.89       13680
Popular     0.90        0.88     0.89       13233
Accuracy                         0.89       26913

• Decision Tree. Table 21 shows the results of the next model. On the test data, this algorithm's precision was 4-5 percentage points lower than that of the previous methods. The recall was also low, especially in the popular class. Precision peaked at 62% for both article categories. Although the overall scores were worse, the algorithm still correctly categorized 3763 popular and 3387 unpopular articles.

With the train subset, the results were slightly better, with an accuracy of around 67%. The F1-score difference between the two classes reached 2 percentage points, with recall reaching a fairly high 69%.
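For completeness, a minimal sketch of this step; max_depth is an assumed constraint, not the value selected in the thesis:

```python
# Sketch of the decision tree model; parameters are illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

tree = DecisionTreeClassifier(max_depth=10, random_state=42)
tree.fit(X_train, y_train)
print(classification_report(y_test, tree.predict(X_test)))
```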

Tab. 21. Decision Tree metrics for test data. Source: author.

            Precision   Recall   F1-score   Support
Unpopular   0.62        0.64     0.63       5864
Popular     0.62        0.60     0.61       5671
Accuracy                         0.62       11535

Tab. 22. Decision Tree metrics for train data. Source: author.

            Precision   Recall   F1-score   Support
Unpopular   0.67        0.69     0.68       13680
Popular     0.67        0.65     0.66       13233
Accuracy                         0.67       26913

• K-nearest neighbors. The next model achieved an overall accuracy of 63% on the 11535 instances of the test data. Of the 5864 unpopular articles, 4013 were correctly classified, forming the True negative group of the confusion matrix.

In the popular class, however, the K-nearest neighbors method achieved the lowest recall of all the previous models, reaching only 58%. The maximum precision was 64%, which is lower than the scores achieved by the Random Forest, Logistic Regression, and Ridge Regression algorithms.

The values achieved on the train subset were all slightly higher than on the test data. The difference in precision between the two classes was 3 percentage points, with the unpopular class achieving the best score, 74%, just as in the test sample.
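A sketch of this step; n_neighbors is an assumed value, and a scaler is included because distance-based models are sensitive to feature ranges:

```python
# Sketch of the K-nearest neighbors model; parameters are assumptions.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
knn.fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))
```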

Tab. 23. K-nearest neighbors metrics for test data. Source: author.

            Precision   Recall   F1-score   Support
Unpopular   0.63        0.68     0.65       5864
Popular     0.64        0.58     0.61       5671
Accuracy                         0.63       11535

Tab. 24. K-nearest neighbors metrics for train data. Source: author.

            Precision   Recall   F1-score   Support
Unpopular   0.68        0.74     0.71       13680
Popular     0.71        0.63     0.67       13233
Accuracy                         0.69       26913

• Stochastic Gradient Boosting. On the test subset, the recall of this ensemble method in the unpopular class reached 68%, with the algorithm classifying 4009 articles correctly and 1855 incorrectly, one of the highest results achieved so far. The accuracy reached 67%, and the precision of both classes was likewise 67% in the Scikit-learn report.

The values were slightly better on the train data; for instance, the accuracy was 4 percentage points higher than on the test sample. This supports the conclusion that the use of HalvingGridSearchCV allowed the Stochastic Gradient Boosting classifier to achieve more balanced and accurate results than the other algorithms.
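A sketch of this model: in Scikit-learn, GradientBoostingClassifier behaves as stochastic gradient boosting when subsample is below 1.0. The hyperparameters shown are assumptions, not the values found by HalvingGridSearchCV in the thesis.

```python
# Sketch of stochastic gradient boosting; subsample < 1.0 makes it stochastic.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

sgb = GradientBoostingClassifier(subsample=0.8, n_estimators=300,
                                 learning_rate=0.1, random_state=42)
sgb.fit(X_train, y_train)
print(classification_report(y_test, sgb.predict(X_test)))
```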

Tab. 25. Stochastic Gradient Boosting metrics for test data. Source: author.

            Precision   Recall   F1-score   Support
Unpopular   0.67        0.68     0.68       5864
Popular     0.67        0.65     0.66       5671
Accuracy                         0.67       11535

Tab. 26. Stochastic Gradient Boosting metrics for train data. Source: author.

            Precision   Recall   F1-score   Support
Unpopular   0.71        0.72     0.71       13680
Popular     0.71        0.69     0.70       13233
Accuracy                         0.71       26913

• Multi-layer Perceptron. A neural network with the hyperbolic tangent activation function achieved the best recall (71%) of all models used in the practical part of the thesis. Other indicators were less impressive, however: the precision of the popular class was 67%, with the True positive group containing 3507 correctly classified articles (Stochastic Gradient Boosting, Random Forest, and both regressions performed better). The recall of the popular class (62%) significantly lowered its F1-score. Performance was slightly better on the train data, where almost all values improved by 1 percentage point.
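A sketch of the neural network with the hyperbolic tangent activation; the layer sizes and iteration limit are illustrative assumptions:

```python
# Sketch of the multi-layer perceptron with tanh activation.
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

mlp = MLPClassifier(activation="tanh", hidden_layer_sizes=(100,),
                    max_iter=500, random_state=42)
mlp.fit(X_train, y_train)
print(classification_report(y_test, mlp.predict(X_test)))
```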

Tab. 27. Multi-layer Perceptron metrics for test data. Source: author.

            Precision   Recall   F1-score   Support
Unpopular   0.66        0.71     0.68       5864
Popular     0.67        0.62     0.64       5671
Accuracy                         0.66       11535

Tab. 28. Multi-layer Perceptron metrics for train data. Source: author.

            Precision   Recall   F1-score   Support
Unpopular   0.66        0.71     0.69       13680
Popular     0.68        0.63     0.65       13233
Accuracy                         0.67       26913

4.5 Evaluation

In the evaluation stage, all of the obtained results were brought together, the contribution of each algorithm was assessed, and the most accurate method for solving the main data mining task was chosen. Since the goal of this thesis was to build a prediction model, the achieved results were evaluated on the test sample of the dataset. It is also worth mentioning that the average values of each criterion, such as precision or recall, were calculated using the classification_report method of the Scikit-learn library. Of the two available averaging options, the macro average (the unweighted mean per label) was chosen for further practical work [26].
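A sketch of how the macro-averaged values can be read programmatically, assuming predictions `y_pred` for the test labels `y_test`:

```python
# Sketch: classification_report returns macro averages directly as a dict.
from sklearn.metrics import classification_report

report = classification_report(y_test, y_pred, output_dict=True)
print(report["macro avg"])  # unweighted mean of per-label precision/recall/F1
print(report["accuracy"])
```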

Chapter 4.1.1 provides a list of metrics for evaluating method quality. From this list, the AUC was selected as the key factor to monitor when comparing the different models.

One of the reasons this metric was chosen is its connection to the ROC curve, which makes the results very easy to interpret, especially in classification tasks. AUC stands for "area under the curve", and the larger the area, the better the model performs. The AUC value is expressed as a fraction of the unit square and therefore lies between 0 and 1 [2].
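A sketch of the AUC computation, assuming the fitted models from the sketches above (the names are illustrative); probabilistic models use the positive-class probability, while RidgeClassifier falls back to its decision function:

```python
# Sketch of the AUC comparison across models.
from sklearn.metrics import roc_auc_score

def auc_of(model, X, y):
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X)[:, 1]  # positive-class probability
    else:
        scores = model.decision_function(X)    # e.g. RidgeClassifier
    return roc_auc_score(y, scores)

for name, m in {"SGB": sgb, "MLP": mlp, "Ridge": ridge}.items():
    print(name, round(auc_of(m, X_test, y_test), 2))
```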

The algorithm results are listed in Table 29. The best AUC score (0.73) was achieved by the Stochastic Gradient Boosting, Multi-layer Perceptron, and Random Forest models. Of these three methods, the neural network's values were slightly lower in all of the remaining criteria, at 66%. The possible overfitting of the Random Forest algorithm, described in the previous chapter, must also be taken into account.

The Logistic and Ridge regressions achieved an AUC score of 0.72. This is a good result compared to the other models, achieved after transforming the data with the QuantileTransformer using quantile information. The transformation had a positive effect on these algorithms, which got 66% of their predictions right and 34% wrong.

The results of the K-nearest Neighbors and Decision Tree methods were slightly worse. The K-nearest neighbors AUC in the prediction task was 4 percentage points below that of the neural network and the ensemble-based models. All of its other criteria were also noticeably below the top modeling results, and in 37% of cases articles were classified incorrectly. The Decision Tree had the lowest average accuracy in predicting Mashable's articles, only 62%. Its area under the curve improved the general impression slightly, but was still 8 percentage points below the neural network. These two models are the least suited to our data mining task.

The last factor in the algorithm evaluation was the time it took to build each model. Table 29 shows that the regression methods and the Decision Tree completed the task faster than the others, while Stochastic Gradient Boosting, Multi-layer Perceptron, and K-nearest neighbors all took more time to process the data. Of the listed algorithms, Random Forest took the longest to finish.

After summarizing all of the available information, the conclusion was reached that the Stochastic Gradient Boosting method was the most relevant and the most accurate, outperforming a random classification model by 23 percentage points of AUC. The other metrics confirm this result: the algorithm predicted the class of a Mashable article incorrectly in just 33% of cases. Despite the many issues in the initial data quality, the Gradient Boosting algorithm achieved high scores once its parameters were tuned.

Tab. 29. Model results in rank order. Source: author.

Model name                     AUC score   Accuracy   Precision   Recall   F1-score   Runtime
Stochastic Gradient Boosting   0.73        67 %       67 %        67 %     67 %       Average
Random Forest                  0.73        67 %       67 %        67 %     67 %       High
Multi-layer Perceptron         0.73        66 %       66 %        66 %     66 %       Average
Ridge Regression               0.72        66 %       66 %        66 %     66 %       Low
Logistic Regression            0.72        66 %       66 %        66 %     66 %       Low
K-nearest neighbors            0.69        63 %       63 %        63 %     63 %       Average
Decision tree                  0.65        62 %       62 %        62 %     62 %       Low

Figure 17 provides a graphic representation of the area under the ROC curve for each method, after optimal parameter selection.

Figure 17. ROC curve representation of all models. Source: author.

4.6 Deployment

With the Scikit-learn library, it is possible not only to deploy the prediction task but, in some cases, also to extract the importance of each feature for the final result. Figure 18 lists the features of the Stochastic Gradient Boosting model that took part in the data mining process (a sketch of the extraction follows the list below). Several attributes with rather high coefficients stand out, with the top five consisting of kw_avg_avg, self_reference_avg_shares, is_weekend, kw_min_avg, and data_channel_is_entertainment. To facilitate understanding, the attributes can be divided by the areas they belong to:

• Keywords

• Number of shares of the referenced articles

• Article publication period

• Channel type
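As promised above, a sketch of the extraction behind Figure 18, assuming the fitted model `sgb` from chapter 4.4 and the dataset's column names in `feature_names`:

```python
# Sketch of feature-importance extraction; `sgb` and `feature_names` are
# assumed to exist from the modeling stage.
import matplotlib.pyplot as plt
import pandas as pd

importances = pd.Series(sgb.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(5))
# Expected leaders per the thesis: kw_avg_avg, self_reference_avg_shares,
# is_weekend, kw_min_avg, data_channel_is_entertainment.
importances.nlargest(15).sort_values().plot.barh()  # chart akin to Figure 18
plt.tight_layout()
plt.show()
```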

This list confirms the statement made in chapter 4.2.3 concerning the effect that the channel type and publication day have on the popularity of published articles. Using such knowledge, the company (or any other interested party) can adjust its articles, manage publication quality and, where appropriate, accept or refuse requests from third-party authors to publish news content. Below is a list of steps, based on the analysis results, recommended to help make Mashable's online articles more popular:

• Insert at least one image or one video into every article.

• The articles should express opinions and emotions using popular keywords (unfortunately, a list of such keywords for the Mashable website is unavailable, but it is safe to assume these would be words describing current hot topics, such as coronavirus, names of popular brands, etc.).

• Content should be published in data channels like Entertainment and Social Media.

• More focus on publishing news on weekends is advisable.

• The expressed mood should be positive rather than negative.

• The articles need to reference other published content that was shared at least 2950 times.

The project can also be integrated and automated to become part of a real workflow. One way to do this would be to connect the prediction models of the data mining task with the Mashable website. Since the Python programming language offers many libraries for web scraping (Beautiful Soup, Selenium, Scrapy) and web app development (Django, Flask), the authors have an opportunity to improve the results by collecting new data. After extraction, the information can easily be stored in a database; Python supports several, with SQLite, MySQL, and PostgreSQL being the most well-known. Object-relational mapping (ORM), which connects databases with the concepts of object-oriented programming and helps speed up development, is also worth considering.

Streamlit is another popular library, created specifically for building data-driven applications. It supports the Scikit-learn, TensorFlow, and Keras libraries. The key advantages of Streamlit are quick and simple app development that requires no knowledge of backend technologies, network protocols, Cascading Style Sheets (CSS), or Hypertext Markup Language (HTML). It is also a great choice when results need to be presented to company personnel; a minimal sketch is given below. Alternatively, clickstream analysis, a fast-developing field, can also play a useful role in improving data mining processes, and a web mining approach can help broaden the perspective on increasing the popularity of online articles.
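As an illustration only, a hypothetical Streamlit sketch (saved as app.py and launched with `streamlit run app.py`); the model file name and the input fields are assumptions, and a real app would collect every feature the model was trained on:

```python
# Hypothetical deployment sketch; file name and inputs are assumptions.
import joblib
import streamlit as st

model = joblib.load("sgb_model.joblib")  # previously saved pipeline (assumed)
st.title("Article popularity prediction")
kw_avg_avg = st.number_input("Average keyword performance (kw_avg_avg)")
is_weekend = st.checkbox("Published on a weekend")
# ...a real app would collect the remaining model features here...
if st.button("Predict"):
    features = [[kw_avg_avg, int(is_weekend)]]  # illustrative only
    st.write("Popular" if model.predict(features)[0] == 1 else "Unpopular")
```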

Our models are capable of working with other target attributes as well, although the parameter settings depend on the type of data analyzed. Another possible use of the algorithms is dividing the news portal's users into groups for targeted email newsletters about various events, offering them relevant content, or showing them relevant ads. Further development of the project will, however, depend on Mashable's policies.

Figure 18. Feature importance of online articles. Source: author.

Conclusion

The thesis focused on the popularity of Mashable's news articles. To achieve the main goal, several prediction models were created using Python programming language libraries, and one method was then chosen as the best solution. Several ways to improve the publication structure in order to gain more shares were proposed as well.

The theoretical part consists of several chapters providing information on the types of data mining tasks, the algorithms, and the stages of the CRISP-DM methodology that were later applied to the real data. Describing the different modeling methods was the most time-consuming part of the thesis, since some algorithms can be implemented in several different ways, and not all variants are available in the Scikit-learn library. The documentation available on the official websites of the individual methods and the knowledge acquired during the master's classes were extremely useful in this part of the thesis.

The practical part of the thesis describes the data mining process, from the definition of business goals to suggestions on how the results could be integrated into the company's projects. Python libraries allowed us to accomplish all the tasks using the CRISP-DM methodology, which guided us through each stage. Stochastic Gradient Boosting was the most accurate model, reaching 67% on all metrics except the AUC level; the ROC graph showed an area under the curve of 0.73, which is 23 percentage points better than a random classification model. Next, the feature importance coefficients affecting the results were extracted for all attributes, and a bar plot identified the group of attributes with the largest values. Considering all these results, we created a list of recommendations for Mashable's editors that would allow them to evaluate and enhance the company's content to make it more popular. Some of the other algorithms can also be applied in practice; for instance, the neural network achieved slightly worse results on this dataset, but its recall was the highest of all the models we built.

In the practical part, the stages related to the description and transformation of data (Data Understanding and Data Preparation) were the most time-consuming, largely because of the need to return to previous iterations several times. The modeling process was somewhat less demanding, but still rather lengthy; the search for the best parameters with GridSearchCV and HalvingGridSearchCV sometimes took hours to complete.

In conclusion, data mining is a tool that allows its users to achieve useful results and, in turn, improve the quality of life of many people around the globe. In recent years, intriguing technologies and fascinating research projects have appeared in large numbers, along with many new job opportunities, which makes this field extremely future-oriented, with huge potential.

List of references

[1] HAN, Jiawei, KAMBER, Micheline and PEI, Jian. Data Mining: Concepts and Techniques. 3rd ed. Burlington: Morgan Kaufmann, 2011. 744 p. ISBN 978-9380931913.

[2] PROVOST, Foster. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking. Newton: O'Reilly Media, 2013. 414 p. ISBN 978-1449361327.

[3] WITTEN, Ian H., FRANK, Eibe and HALL, Mark A. Data Mining: Practical Machine Learning Tools and Techniques. 3rd ed. Burlington: Morgan Kaufmann, 2011. 664 p. ISBN 978-0123748560.

[4] VAN DIJK, T. A. Discourse Analysis: Its Development and Application to the Structure of News. Journal of Communication, 1983, vol. 33, no. 2, p. 20-43.

[5] KENESHLOO, Y. – WANG, S. Predicting the Popularity of News Articles. In: Proceedings of the 2016 SIAM International Conference on Data Mining, June 2017, p. 441-449.

[6] BALALI, A. – ASADPOUR, M. – FAILI, H. A Supervised Method to Predict the Popularity of News Articles. Computación y Sistemas, January 2017, vol. 21, no. 4, p. 703-716.

[7] FERNANDES, K. – VINAGRE, P. – CORTEZ, P. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Portuguese Conference on Artificial Intelligence, March 2017, p. 535-546.

[8] GUNNARSSON, C. L. – WALKER, M. M. – WALATKA, V. – SWANN, K. Lessons learned: A case study using data mining in the newspaper industry. Journal of Database Marketing & Customer Strategy Management, July 2007, vol. 14, no. 4, p. 271-280.

[9] TAN, Pang-Ning, STEINBACH, Michael and KUMAR, Vipin. Introduction to Data Mining. London: Pearson, 2005. 792 p. ISBN 978-0321321367.

[10] LINOFF, Gordon and BERRY, Michael. Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management. 3rd ed. Hoboken: Wiley, 2011. 888 p. ISBN 978-0470650936.

[11] ROIGER, Richard. Data Mining: A Tutorial-Based Primer. 2nd ed. London: Chapman and Hall/CRC.
