
Assignment of master's thesis

Title: Machine learning based query analysis for an online medical clinic

Student: Bc. Adam Jankovec

Supervisor: doc. Ing. Pavel Kordík, Ph.D.

Study program: Informatics

Branch / specialization: Knowledge Engineering

Department: Department of Applied Mathematics

Validity: until the end of summer semester 2022/2023

Instructions

Use a CRISP-DM related methodology to assist the online medical clinic uLekare.cz in setting up machine learning based automation. Analyse related processes in the company, design and devise machine learning approaches to automate processes and save costs. Build a prototype of an ML-based patient query analysis support system. Evaluate the performance on real data and discuss possible outcomes and further extensions.

Electronically approved by Ing. Karel Klouda, Ph.D. on 10 February 2021 in Prague.


Master’s thesis

Machine learning based query analysis for an online medical clinic

Bc. Adam Jankovec

Department of Applied Mathematics
Supervisor: doc. Ing. Pavel Kordík, Ph.D.

May 3, 2021


Acknowledgements

First of all, I would like to thank my supervisor, doc. Ing. Pavel Kordík, Ph.D., for investing his time and connecting me with uLékaře.cz. Many thanks to Ing. Stanislav Valášek for his mentorship and to everyone from uLékaře.cz for a warm welcome and an introduction to the company's processes. Last but not least, thanks to my family and my love for their endless support during my five-year journey to higher education.


Declaration

I hereby declare that the presented thesis is my own work and that I have cited all sources of information in accordance with the Guideline for adhering to ethical principles when elaborating an academic final thesis.

I acknowledge that my thesis is subject to the rights and obligations stipulated by the Act No. 121/2000 Coll., the Copyright Act, as amended. In accordance with Article 46(6) of the Act, I hereby grant a nonexclusive authorization (license) to utilize this thesis, including any and all computer programs incorporated therein or attached thereto and all corresponding documentation (hereinafter collectively referred to as the "Work"), to any and all persons that wish to utilize the Work. Such persons are entitled to use the Work in any way (including for-profit purposes) that does not detract from its value. This authorization is not limited in terms of time, location and quantity.

In Prague on May 3, 2021 . . . .


© 2021 Adam Jankovec. All rights reserved.

This thesis is school work as defined by the Copyright Act of the Czech Republic.

It has been submitted at the Czech Technical University in Prague, Faculty of Information Technology. The thesis is protected by the Copyright Act, and its usage without the author's permission is prohibited (with the exceptions defined by the Copyright Act).

Citation of this thesis

Jankovec, Adam. Machine learning based query analysis for an online medical clinic. Master’s thesis. Czech Technical University in Prague, Faculty of Information Technology, 2021.


Abstrakt

Tato práce se zaměřuje na aplikaci strojového učení pro automatizaci procesů online lékařské kliniky s použitím metodologie Cross-industry standard process for data mining. Výsledkem je úspěšné nasazení služby, která na základě dotazů pacientů předpovídá lékařské obory, do produkčního prostředí a navázání spolupráce mezi Fakultou informačních technologií a společností uLékaře.cz.

Klíčová slova strojové učení, klasifikace, zpracování přirozeného jazyka, vnoření slov, zdravotnictví

Abstract

This thesis focuses on machine learning automation of the processes of an online medical clinic, following the Cross-industry standard process for data mining. The result is the successful deployment into a production environment of a service that predicts medical fields based on patients' queries, and the establishment of cooperation between the Faculty of Information Technology and the uLékaře.cz company.

Keywords machine learning, classification, NLP, embedding, healthcare


Contents

Citation of this thesis

Introduction

1 Knowledge Discovery and Data Mining
  1.1 Popular KDDM models
  1.2 Cross-Industry Standard Process for Data Mining

2 Business Understanding
  2.1 Research
  2.2 About the Client
    2.2.1 Services & Processes
    2.2.2 Areas for Improvement
  2.3 Summary

3 Data Understanding
  3.1 Research
  3.2 Data Description
  3.3 Data Exploration
  3.4 Data Quality
  3.5 Summary

4 Data Preparation
  4.1 Research
    4.1.1 Data preparation in CRISP-DM
    4.1.2 Preprocessing of text data
  4.2 Application
  4.3 Summary

5 Modelling & Evaluation
  5.1 Research & State of the art
    5.1.1 Modelling in CRISP-DM
    5.1.2 Text Embedding & Classification
      5.1.2.1 Bag of Words
      5.1.2.2 Term frequency–Inverse Document Frequency
    5.1.3 State of the Art in Text Embedding
      5.1.3.1 LASER
      5.1.3.2 ALBERT
      5.1.3.3 XLM-RoBERTa
    5.1.4 Evaluation
  5.2 Urgency Prediction
    5.2.1 Summary
  5.3 Other Tasks
    5.3.1 Keyword Extraction
    5.3.2 Specialisation Prediction
  5.4 Summary

6 Model Deployment
  6.1 Research
  6.2 Application
  6.3 Summary

Conclusion

Bibliography

A Specialisations

B Acronyms

C Contents of the enclosed memory card


List of Figures

1.1 Popular KDDM process models
1.2 CRISP-DM Process Diagram
2.1 Question form
3.1 Database Diagram
3.2 Pie chart of urgent and non-urgent questions
3.3 Specialisation histogram
3.4 Distribution of languages
4.1 Birth year distribution by urgency
4.2 Question length distribution by urgency
5.1 Bag of Words as Count Vector
5.2 LASER Architecture
5.3 LASER Embeddings in 2D Space
5.4 BERT Architecture
5.5 Masking of words for BERT
5.6 Cross-lingual language model pretraining
5.7 Confusion matrix example
5.8 Receiver Operating Characteristic
5.9 UI concept of urgency prediction
5.10 Top 100 non-urgent words
5.11 Top 100 urgent words
5.12 Example of MorphoDiTa preprocessing
5.13 Separation of classes with LASER
5.14 LASER's classifier training
5.15 Confusion matrix of LASER + NN
5.16 ROC AUC of LASER + NN
5.17 Separation of classes with ALBERT
5.18 Confusion matrix of ALBERT + NN
5.21 Separation of classes with XLM-R
5.22 XLM-R's classifier training
5.23 Confusion matrix of XLM-R + NN
5.24 ROC AUC of XLM-R + NN
5.25 Model performance comparison chart
5.26 Tf-idf x LogReg prediction histograms
5.27 Sample question
5.28 Extracted keywords
5.29 Specialisation histograms of XLM-R and Tf-idf
5.30 Clustering of specialisations
5.31 Confusion matrix of specialisation prediction
A.1 Examples of specialisation word clouds
A.2 Scatterplot of wordclouds
A.3 Cluster observations


List of Tables

2.1 Project plan
3.1 Database statistics
5.1 Selected modelling techniques
5.2 LASER's NN classifier architecture
5.3 ALBERT's NN classifier architecture
5.4 XLM-R's NN classifier architecture
5.5 Model performance comparison on urgency prediction
A.1 Mapping of specialisations into groups


Introduction

The COVID-19 pandemic has heightened several kinds of uncertainty, but one trend has become clear: it has vastly accelerated digital adoption. Businesses that were able to adapt to digital platforms generally thrived, while traditional retailers with weak online strategies dwindled.

Healthcare is amongst the fields where digital platforms are booming. The risk of contracting the virus in the doctor's waiting room is deterring patients from approaching their doctors in person; therefore, telemedicine (the practice of caring for patients remotely) is becoming increasingly popular. During the pandemic, the uLékaře.cz company – the most prominent Czech online medical clinic, further referred to as the Client – has managed to triple in size and is looking for ways to automate its processes through machine learning.

The thesis aims to apply the Cross-Industry Standard Process for Data Mining methodology while assisting the Client in setting up machine learning based automation. The thesis goals are: analyse related processes in the company, design and devise machine learning approaches to automate processes and save costs, build a prototype of an ML-based patient query analysis support system, and evaluate the performance on actual data and discuss possible outcomes and further extensions.

This thesis consists of six chapters. The first chapter (Knowledge Discovery and Data Mining) describes popular data science methodologies, including the Cross-Industry Standard Process for Data Mining. The second chapter (Business Understanding) introduces the Client, its processes, the identified areas for improvement and the selection of the primary task. The third chapter (Data Understanding) analyses the Client's data related to the selected task. The fourth chapter (Data Preparation) documents common data and text preprocessing methods and their application. The fifth chapter (Modelling & Evaluation) explains, applies and evaluates selected modelling techniques on the selected tasks. The sixth chapter (Model Deployment) describes the deployment of the final model into the Client's production environment.


Chapter 1

Knowledge Discovery and Data Mining

In technical development, data is considered a critical factor for a company's economic success. However, collecting and storing data is no warranty of growth and profit. The actual value lies in extracting both structured and unstructured information, and those who are doing it best – Google, Facebook, Amazon, Microsoft – are the oil barons of the information economy. [1]

To extract knowledge from data, a knowledge discovery approach needs to be established to ensure the success of Knowledge Discovery and Data Mining (KDDM) projects. A methodology is adopted for better organisation and separation of concerns between the process's phases. The process models should ideally be independent of specific applications, tools, and vendors. [2]

1.1 Popular KDDM models

Figure 1.1 illustrates common KDDM process models, including the Cross-industry standard process for data mining (CRISP-DM), which is currently the most popular and most broadly adopted KDDM model [2].

Although CRISP-DM is currently the most popular model, it is no longer actively maintained. Further, the framework itself has not been updated for working with new technologies, such as Big Data. An alternative framework, named Team Data Science Process (TDSP), was introduced by Microsoft; it aims to include Big Data as a data source while increasing the complexity of the Data Understanding phase. [3]

The TDSP's two advantages are that it is more modern, with updated technology stacks and considerations, and that Microsoft provides more in-depth documentation than exists for CRISP-DM. Its disadvantages are that it is verbose and can make the process unnecessarily complex. Therefore, using CRISP-DM is encouraged for a straightforward and effective process. [4]


Figure 1.1: Popular KDDM process models. Source: [2]

Figure 1.2: CRISP-DM Process Diagram. Source: [4]

1.2 Cross-Industry Standard Process for Data Mining

“The CRISP-DM process model is a high-level, extensible process that is an effective framework for data science projects. Similarly to other Agile software development processes, CRISP-DM is an iterative process framework. Each step can be revisited as many times as needed to refine problem understanding and results. This iterative cycle enables information to be shared and lessons to be learned between project activities. Rather than trying to perfect one stage before moving to another, the project team can create a minimally viable product in a rapid-prototyping mode.” [4]

The CRISP-DM methodology defines six phases, which are shown in Figure 1.2. The phases are the following:

1. Business Understanding “focuses on understanding the project objectives and requirements from a business perspective, then converting it into a data mining problem definition and a preliminary plan designed to achieve the objectives.” [2, slide 24]

2. Data Understanding “starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information.” [2, slide 27]

3. Data Preparation “covers all activities to construct the final dataset from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection, and transformation and cleaning of data for modelling tools.” [2, slide 31]

4. Modelling “selects various modelling techniques, applies and calibrates them to optimal values. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often necessary.” [2, slide 34]

5. Evaluation “thoroughly evaluates the model and reviews the steps executed to construct the model to be sure it properly achieves the business objectives. A key objective is to determine if there is some critical business issue that has not been sufficiently considered. A decision on using the data mining results should be reached.” [2, slide 37]

6. Deployment “presents and organises the knowledge in a way that the customer can use it. However, depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable process across the enterprise.” [2, slide 40]


Chapter 2

Business Understanding

This chapter introduces the Client, including its field of work, business model and processes. It provides a list of proposed improvements and summarises them together with the primary modelling task, which was selected as the first project of the newly established cooperation.

2.1 Research

The first stage of the CRISP-DM process focuses on understanding the business, its processes and objectives. Performing the following tasks leads to a higher chance of success:

1. Determine business objectives,

2. Assess the current situation,

3. Determine data mining goals,

4. Produce a project plan. [5]

Determining the business objectives requires a thorough understanding of the client's needs from a business perspective. Uncovering the crucial factors before the project is initiated can mitigate the risk of expending effort on producing the correct answers to the wrong questions. [2, slide 25]

Situation assessment involves more detailed fact-finding about the resources, constraints and other factors that need to be considered when determining the data mining goals. It is also essential to list the project requirements, which may consist of the schedule of completion, quality of the results, data security concerns, and any legal issues. [5]

Data mining goals state project objectives in technical terms, unlike business goals, which state the objectives in business terminology. An example of a business goal is “Increase catalogue sales to existing customers”, while a corresponding data mining goal is “Predict how many widgets a customer will buy given their purchases over the past three years, demographic information (age, salary, city) and the price of the item.” [2, slide 26]

According to [2, slide 26], a project plan describes the intended plan for achieving the data mining goals and thereby the business goals. The anticipated set of steps should be documented, including an initial selection of tools and techniques. [5] also mentions that the stages should be listed with their duration, inputs, outputs and dependencies.

2.2 About the Client

Our Client is “the biggest Czech online medical clinic and a pioneer in telemedicine, and aims to streamline and improve the quality of Czech healthcare. A team of more than 350 doctors, including dozens of specialists, serves an average of 5 000 users per month. The counselling centre is available to patients 24 hours a day, 7 days a week, both on the web and on mobile.” [6]

The Client's main success revolves around its service for responding to patient questions. Patients can ask a question about any medical issue and receive a recommendation on how to treat it, either from a general practitioner (GP) or from a qualified specialist. In total, the Client's service has responded to over 340 000 patient questions. [6]

Besides the medical clinic, other services provided by the Client are:

• Appointment Setup – post an issue online and get an appointment with a relevant doctor set up by a nurse,

• Virtual waiting room – a platform where clients can consult their own doctors online, without having to meet in person,

• Enterprise healthcare package – complete healthcare for a company's employees with regular preventive examinations.

2.2.1 Services & Processes

This section describes each service in greater detail, including the processes tied to each of the Client's prominent services.

Medical Clinic

The medical clinic provides answers to any health-related question. It is important to emphasise that the medical clinic does not replace a doctor who physically provides medical care. The clinic gives the patient initial feedback, which spares an unnecessary trip to the doctor's office in approximately one third of cases.



Figure 2.1: Question form. Source: [6]

After entering the clinic's page, patients fill in a simple form, shown in Figure 2.1, where they type their question or describe their issue. They can also attach files (e.g. photos of the problematic area or medical reports) to help doctors with the diagnosis. A typical question looks as follows:

“Dobrý den, již jsem dnes psala, avšak pani doktorka pochopila, ze jsem momentálně těhotná. Těhotná jsem byla pred třemi lety, proto zasilam dotaz jeste jednou. V duhovce mam malé hnědé tečky, které si myslim objevily prave pred 3 lety v tehotenstvi. Drive jsem si jich nevšimla, tyto tečky byly zkontrolovane minuly rok ocnim lékařem a sděleno, ze se jedna o znaminka. Nyní jsem si všimla, ze kolem jedne zornicky, je hnědé pole, viz foto. Taková ta hnědsi část kolem černé tečky a druhé oko to nema tak zřetelně, je to tedy v pořádku nebo se jedna o nějakou abnormalitu? Dekuji”


GPs are the first line of contact with the patient. They process the queue of questions, and according to the service-level agreement (SLA), each question needs to be processed within 24 hours of being posted. A GP's processing of a question consists of the following actions:

• assign a diagnosis according to [7],

• assign a specialisation (medical field),

• create a title to summarize the question,

• provide an answer in accordance with the established standards,

• remove any personal data in the question (e.g. surname, personal identification number, . . . ) to preserve anonymity,

• select one of the following recommendations:

  – the patient should visit their regular doctor,

  – the patient should visit a doctor other than their regular doctor,

  – the question is urgent and the patient should seek treatment at the emergency room immediately,

• decide if a specialist is required or not.

If the GP decides that a specialist is required to provide a diagnosis, the GP provides at least a basic answer and then selects an appropriate medical field out of the 79 possible specialisations.

A specialist is obliged to provide an answer within 72 hours. The specialist's processing of a question consists of the following actions:

• provide another, more precise answer,

• alter the diagnosis, specialisation, title or recommendation if necessary.

If a doctor’s visit is recommended, the Client’s medical team arranges an appointment after consulting with the patient about the desired time and location of the visit.

Appointment Setup

When a patient requests an appointment with a doctor in the Business-to-Consumer variant, they are required to fill out a form similar to the medical clinic question form (Figure 2.1). After the patient submits a request describing their condition, a nurse contacts them and asks about the preferred location and time. The nurse then selects a doctor with the corresponding specialisation and location and sets up an appointment according to the patient's preferences. In the Business-to-Business variant, the appointment setup can be requested directly.



Virtual Waiting Room

When a GP is registered on the platform, the GP's patients can ask questions without travelling to the doctor's office. The patient is presented with the same form as in the medical clinic (Figure 2.1). The Client preprocesses each question by having its own GPs prepare a draft reply for the registered GP. All the registered GP has to do is accept the proposed reply, or alter it if needed, which saves the time otherwise needed to reply to all patients.

The main upsides of the virtual waiting room, as presented by the Client, are:

• medical conditions can be consulted anywhere at any time,

• a doctor will reply as soon as possible,

• a patient can get an analysis of a condition without the need to come in person and risk exposure to other sick people in the waiting room,

• if a doctor decides that a visit is necessary, an appointment is set up automatically.

Enterprise Counseling Package

The Client provides an enterprise package for companies to ensure a more effective way of taking care of their employees' and their families' health. This package aims to provide free counselling for the employees and to set up a calendar of regular preventive examinations. The Client provides health counselling for more than 72 000 employees of 120 companies across the Czech Republic. Partners who provide these services to their employees include, for example, Škoda Auto, Alza, Ernst & Young, Havel & Partners, Linet, L'Oréal, Bluelink Services, Henkel and Danone.

2.2.2 Areas for Improvement

After examining the processes, the Client was presented with several proposed improvements, which would introduce a certain level of automation to its services. These improvements would save time for the doctors, leading to an increase in productivity and faster response times. The thesis supervisor has also provided difficulty and reliability estimates for each proposal.

Specialisation Prediction

After a patient asks a question, a doctor has to assign a specialisation to the question manually. Automating this task could speed up the processing of questions. At the moment, there are 79 specialisations to classify into.

Expected difficulty: easy

Expected reliability: low (due to the high number of classes)


Diagnosis Prediction

After a patient asks a question, a doctor has to assign a diagnosis to the question manually. Automating this task could speed up the processing of questions. At the moment, there are 12 246 diagnoses to classify into [7].

Expected difficulty: easy

Expected reliability: very low (due to the enormous number of classes)

Sentence Completion

A doctor could be assisted with sentence completion when writing replies to patients. Such an assistant could improve answer quality.

Expected difficulty: mediocre

Expected reliability: mediocre

Personal Data Identification

Questions may contain personal data, which need to be erased before the question is published. Data such as the name and surname, address, or personal identification number should be removed so that the question's author remains anonymous to the public. Doctors have to do this manually, and sometimes personal data get published unnoticed. Automating this task could speed up the processing of questions and increase the reliability of anonymisation.

Expected difficulty: hard

Expected reliability: mediocre to high

Question Urgency Prediction

After a patient asks a question, a doctor has to decide whether the question is urgent or not. Automating this task could speed up the processing of questions and possibly save lives, as potentially urgent questions could otherwise remain unprocessed for up to 24 hours. The system could mark potentially urgent questions for a preliminary review so that the doctors can verify the urgency.

Expected difficulty: low

Expected reliability: high

Answer Quality Evaluation

The Client's quality assurance section randomly checks doctors' answers and evaluates them based on established criteria, such as “Contains vulgar terms”, “Too short” or “Negative sentiment”. Automating this task would allow evaluation of all answers, not just random samples. The evaluation could be provided in real time while the doctor writes the answer, giving instant feedback.

Expected difficulty: low

Expected reliability: mediocre to high



Stage                    Due date       Outputs
Business Understanding   30. 9. 2020    Documented goal and processes
Data Understanding       18. 10. 2020   Data distribution graphs
Data Preparation         30. 10. 2020   Clean dataset
Research of approaches   18. 11. 2020   Five modelling techniques
Data modelling           25. 12. 2020   Working pipelines
Evaluation               30. 12. 2020   Model performance comparison
Acceptance               15. 1. 2021    Solution accepted by the Client
Deployment               30. 1. 2021    Source codes with documentation

Table 2.1: Project plan

2.3 Summary

The Client has evaluated the proposed improvements, and the three most valuable ones are:

1. Urgency prediction,

2. Specialisation prediction,

3. Personal data identification.

Out of these proposals, urgency prediction was selected as the ideal candidate for the first project between the Client and the faculty, due to the task's expected low difficulty and high reliability. The Client was provided with the project plan, shown in Table 2.1, to track the project's progress.


Chapter 3

Data Understanding

This chapter introduces the Client’s data by exploring the attributes and distributions and analysing statistically significant features. It also examines the data quality of the patient questions.

3.1 Research

The second stage of the CRISP-DM process requires the data listed in the project resources. It is recommended to record encountered problems and their resolutions, which helps with replicating the project and executing similar future projects. [5]

To describe the data, it is essential to examine the “surface” properties of the data, including its format, quantity (e.g. the number of records and fields in each table) and the meaning of each attribute. Any data requirements should also be evaluated. [5]

Data exploration addresses data mining questions using querying, data visualisation and reporting techniques. Examples of these techniques are the distribution of key attributes and relationships between pairs or small numbers of attributes. [5]

To verify data quality, it is vital to check the completeness of the data (coverage of the required cases), its consistency (errors and their rarity) and missing values (where they occur and how they are represented). If quality problems occur, possible solutions should be suggested. These solutions generally depend on both data and business knowledge. [5]


Figure 3.1: Simplified database diagram

3.2 Data Description

The Client's database contains 128 tables. Out of these, seven were identified as relevant to the problem. Figure 3.1 shows a simplified view (relevant columns only) of these tables. The Client proposed the relevant columns, which are described in Table 3.1.


Column name    Description                    Data type
question       Patient's query                string
isUrgent       Urgency indicator              boolean
birthYear      Patient's year of birth        integer
gender         Patient's gender               string
questionDate   When the question was posted   date
lang           Language                       string
specId         Specialisation identifier      integer
specName       Specialisation name            string

Table 3.1: Database statistics

Figure 3.2: Pie chart of non-urgent (97.23 %) and urgent (2.77 %) questions

3.3 Data Exploration

The questions are already labelled by the doctors as urgent or non-urgent; therefore, the task of urgency prediction classifies as supervised learning. Figure 3.2 shows that the data are unbalanced: non-urgent questions make up 97.23 % of the samples, while urgent questions make up only 2.77 %.

Null-value inspection has revealed that the gender column contains 0 non-null values; therefore, it should be removed. The other attributes do not contain any null values. The histogram of specialisations, illustrated in Figure 3.3, shows an imbalanced distribution of samples across the classes, with some classes containing zero samples. The language attribute contains only two values: “cs” (Czech) and “en” (English). Figure 3.4 shows that the languages are also highly imbalanced: 99.61 % of the requests are Czech and only 0.39 % English.
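These checks can be reproduced with a few lines of pandas. The sketch below is illustrative and assumes the query results were exported to a CSV file named questions.csv; the file name is an assumption, while the column names follow Table 3.1.

import pandas as pd

# Assumed CSV export of the query results; the file name is illustrative.
df = pd.read_csv("questions.csv")

print(df["isUrgent"].value_counts(normalize=True))  # ~97.23 % vs. ~2.77 %
print(df.isnull().sum())            # reveals the entirely empty gender column
print(df["lang"].value_counts(normalize=True))      # ~99.61 % "cs", ~0.39 % "en"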


Figure 3.3: Specialisation histogram

Figure 3.4: Distribution of languages: Czech 99.61 %, English 0.39 %


3.4 Data Quality

The Client has provided information about test data in the dataset. Test samples are identified by a “*test*” string present in the question column. There are 781 test rows in total.

Further inspection of the question column has shown that some questions contain fragments of template questions asked by the Client. The form, which the patients have to fill in, contains several questions that were probably pre-written in the text area; if the patient does not delete these template questions, they remain in the text. Examples of these fragments are: “Co vás trápí? Jak se potíže projevují a jak dlouho trvají?:”, “Jak dlouho problém trvá?:”, “Jak vám můžeme pomoci? Na co se našich lékařů chcete zeptat?:”. A regular expression targeting the distinctive string “?:” has uncovered 15 types of template fragments, which should be removed.

3.5 Summary

The dataset contains 44 093 rows of highly unbalanced data with only 2.77 % urgent questions, and the Czech language dominates with 99.61 % of the questions. Inspection has revealed that the data need to be cleaned of the test samples and template fragments that occur among the questions.


Chapter 4

Data Preparation

This chapter focuses on general data preprocessing methods, such as handling unbalanced data, as well as methods for text processing, including lemmatisation and tokenisation.

4.1 Research

The data preparation phase covers all activities needed to construct the final dataset from the raw data and typically takes over 70 % of a project's time. The tasks are usually not performed in a specified order and include data selection, cleaning, construction, integration and formatting. [2, slide 31]

4.1.1 Data preparation in CRISP-DM

Data selection decides on the data to be used for analysis. The criteria for a decision may include the relevance of the data to the data mining goal, the quality of the data, and technical constraints. A list of data to be included/excluded and the reasons for these decisions should be provided. [5]

Cleaning increases the data quality to the level required by the selected analysis techniques, using clean data subsets, insertion of default values, or imputation of missing data by a model. A data cleaning report describes the decisions and actions taken to address data quality problems and considers the data transformations and their possible impact on the results. [5]

Construction produces derived attributes, entirely new records, or transformed values for existing attributes. Derived attributes are new attributes constructed from one or more existing attributes in the same record (e.g., length and width can be used to calculate an area as a new variable). Generating new records can be used to cover cases which do not appear in the raw data, e.g. customers who made no purchase in the past year. There is no reason for such records to be present in the raw data, but they can be useful for modelling purposes. [5]


Integration methods combine information from multiple databases, tables or records to create new records or values. These methods include data merging and aggregation. Data merging refers to joining together two or more tables that have different information about the same objects. An example could be information about a store, where one table has the store's general characteristics and another table has the store's summarised sales data; both tables refer to the same store and can be combined into a new table.

Data aggregation refers to operations in which new values are computed by summarising information from multiple records or tables. An example could be a list of customer purchases, which can be transformed into a record with a customer and their number of purchases. [5]

4.1.2 Preprocessing of text data

The task of urgency prediction from a patient's question requires processing text data, which calls for a specific set of steps from natural language processing (NLP) to prepare the data. The raw text needs to be preprocessed by converting it into a well-defined sequence of linguistically meaningful units to be machine-understandable. The processing methods are language-dependent and are already implemented by projects such as [8].

“Text preprocessing is an approach for cleaning and preparing text data for use in a specific context. Developers use it in almost all NLP pipelines, including voice recognition software, search engine lookup, and machine learning model training. It is an essential step because text data can vary. From its format (website, text message, voice recognition) to the people who create the text (language, dialect), there are plenty of things that can introduce noise into the data. The ultimate goal of cleaning and preparing text data is to reduce the text to only the words needed for the data mining goals.” [9]

Tokenisation and noise removal are essential in the data cleaning step. Unwanted information such as punctuation and accents, special characters, numeric digits, leading, ending and vertical whitespace, or HTML formatting are all types of noise whose relevance depends on the source's domain. Sentence splitting and tokenisation are used for splitting the text into sentences and words. [9]

Some data may require further preprocessing through a set of text normalisation methods to reduce dimensionality. These methods include upper- or lowercasing, stopword removal, stemming and lemmatisation. Stopwords are usually the most common words in a language, which do not provide any information (e.g. “a”, “an”, “the”). Stemming removes word prefixes and suffixes (e.g., it transforms “going” into “go”). Lemmatisation is a method for casting words to their root (canonical) forms (e.g. the words “are”, “is”, “was”, “were” share the lemma “be”), which is a more involved process than stemming because it requires part-of-speech tagging. [9]
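As a minimal illustration of these normalisation steps, the following Python sketch lowercases the text, strips digits and punctuation, tokenises, and removes stopwords. The stopword set is a tiny illustrative sample rather than a complete Czech list, and a full pipeline would add lemmatisation (e.g. via MorphoDiTa, which appears later in this thesis).

import re

# Tiny illustrative Czech stopword set; a real pipeline would use a full list.
STOPWORDS = {"a", "se", "je", "na", "to", "jsem"}

def preprocess(text):
    """Lowercase, drop digits, tokenise into words and remove stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)          # remove numeric digits
    tokens = re.findall(r"[^\W\d_]+", text)   # keep only runs of letters
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Dobrý den, mám 3 dny bolesti hlavy a je mi špatně."))
# ['dobrý', 'den', 'mám', 'dny', 'bolesti', 'hlavy', 'mi', 'špatně']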


4.2 Application

The Client had already identified potentially relevant columns during the data integration phase and provided an SQL query, shown in Listing 4.1, which returns 44 093 rows in total. The query columns are described in Table 3.1.

SELECT
    q.text AS question,
    (
        td.name = 'suggested_relevance_type_model'
        AND srt.urgent = 1
    ) AS isUrgent,
    q.birth_year AS birthYear,
    q.gender,
    q.question_date AS questionDate,
    l.name AS lang,
    mp.id AS specId,
    mp.name AS specName
FROM question q
JOIN answer a ON q.id = a.question_id
JOIN suggested_relevance_type srt ON a.suggested_relevance_id = srt.id
JOIN translation t ON srt.name = t.identifier
JOIN translation_domain td ON t.translation_domain_id = td.id
JOIN locale l ON q.locale_id = l.id
JOIN medical_problem mp ON q.medical_problem_id = mp.id;

Listing 4.1: SQL query provided by the Client
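The query can be executed from Python and the results loaded into a pandas DataFrame for the following phases. This is only a sketch: the connection string, dialect and file name are placeholders, and the actual credentials depend on the Client's environment.

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; dialect, host and credentials are assumptions.
engine = create_engine("mysql+pymysql://user:password@host/clinic")

with open("query.sql") as f:        # the query from Listing 4.1
    df = pd.read_sql(f.read(), engine)

print(len(df))                      # expected: 44 093 rows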

In the feature construction phase, there was one idea for a feature that could potentially affect the urgency: the question length (number of characters) could vary depending on urgency. Urgent questions might be primarily short because people are in a hurry; alternatively, they might be mostly long because people have many symptoms to describe.

In the feature selection phase, only these attributes are considered for testing: question, gender, birth year and question length. IsUrgent is the target variable, and specId and specName are filled in by the doctor, so they cannot be used to predict urgency. The gender attribute contains no values, so there is no need to test it. The birth year and question length were the only attributes tested for their significance, using their distributions. Birth years and question lengths were separated by urgency, and the distributions (shown in Figure 4.1 and Figure 4.2) indicate that these attributes do not significantly separate the urgent and non-urgent classes, so they will not aid in the classification. Therefore, only the question attribute will be used for modelling.


Figure 4.1: Birth year distribution by urgency

Figure 4.2: Question length distribution by urgency
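Distribution comparisons like Figures 4.1 and 4.2 can be produced with a short matplotlib sketch, continuing with the DataFrame df from the previous sketch; the derived questionLength column is the constructed feature described above.

import matplotlib.pyplot as plt

df["questionLength"] = df["question"].str.len()   # derived feature

# Overlay normalised histograms of question length for both classes.
for urgent, group in df.groupby("isUrgent"):
    plt.hist(group["questionLength"], bins=50, density=True, alpha=0.5,
             label=f"isUrgent={urgent}")
plt.xlabel("Question length (characters)")
plt.ylabel("Density")
plt.legend()
plt.show()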


During data cleaning, test records were removed because they bring no value to the classification. A record was considered a test record if its question matched the regular expression “\*test\*”; there were 781 such records in total. The second part of the cleaning process was removing the template fragments described in Section 3.4, because these fragments were not present in every record and could potentially affect the modelling in an undesirable manner.
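A pandas sketch of this cleaning step, continuing with the same DataFrame df, might look as follows; the fragment list is abbreviated to two of the 15 patterns mentioned in Section 3.4.

# Drop test records marked with the literal "*test*" string.
is_test = df["question"].str.contains(r"\*test\*", regex=True, na=False)
df = df[~is_test]                                   # removes the 781 records

# Strip pre-written template fragments (two of the 15 patterns shown).
TEMPLATE_FRAGMENTS = [
    "Co vás trápí? Jak se potíže projevují a jak dlouho trvají?:",
    "Jak dlouho problém trvá?:",
]
for fragment in TEMPLATE_FRAGMENTS:
    df["question"] = df["question"].str.replace(fragment, " ", regex=False)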

4.3 Summary

Data was successfully queried from the Client's database, with 44 093 rows in total. Feature selection has ruled out the gender, birth year and question length attributes; therefore, only the question text itself will be used as the source variable for urgency prediction. Data cleaning has removed the test records used by the Client and the template question fragments.


Chapter 5

Modelling & Evaluation

This chapter includes research of several deep and non-deep learning modelling techniques, as well as the baseline selection and research of the state of the art. It compares the performance of the selected models on the task of urgency prediction and also presents results on the secondary tasks of keyword extraction and specialisation prediction.

5.1 Research & State of the art

This section explains the concept of text embedding and classification and the reasoning behind the selection of the modelling techniques, and provides details about each technique.

5.1.1 Modelling in CRISP-DM

The first step in modelling is the selection of modelling techniques. Documenting the modelling techniques, as well as the specific assumptions they make about the data, is essential. Examples of such assumptions are that all attributes have uniform distributions, that no missing values are allowed, or that the class attribute must be symbolic. [5]

Building a procedure or mechanism to test the model's quality and validity comes next. For example, in supervised data mining tasks such as classification, it is common to use error rates as quality measures for data mining models. Therefore, it is typical to separate the dataset into train and test sets, build the model on the train set, and estimate its quality on the separate test set. A primary component of the plan is determining how to divide the available dataset into training, test and validation datasets. [5]
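For the urgency task, the split should preserve the heavy class imbalance described in Section 3.3 in both parts. Below is a sketch with scikit-learn's stratified split; the 80/20 ratio is an assumption, not a figure from the thesis.

from sklearn.model_selection import train_test_split

# Stratification keeps the ~2.77 % urgent share identical in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    df["question"], df["isUrgent"],
    test_size=0.2, stratify=df["isUrgent"], random_state=42)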

When building a model, the parameter settings should be listed with their chosen values, along with the rationale for each choice. The resulting models should be described in an interpretation report, and any encountered difficulties should be documented. [5]


Embedding      Classifier                   Deep learning
BoW            K-NN / Logistic regression   No
Tf-idf         K-NN / Logistic regression   No
LASER          NN classifier                Yes
ALBERT         NN classifier                Yes
XLM-RoBERTa    NN classifier                Yes

Table 5.1: Selected modelling techniques

Model assessment is done according to the domain knowledge, the data mining success criteria and the test design. The results are usually generated with several different techniques, and each model should be ranked according to the evaluation criteria in relation to the others. Finally, the parameter settings should be iterated until the best model is found. [5]

5.1.2 Text Embedding & Classification

Text classification is a crucial and well-proven method for organising large collections of documents, and it has been widely used in many natural language processing and information retrieval tasks. [10]

“Since the machine learning algorithms' input must be a fixed-length feature vector, the documents are often represented with a vector space model. Each dimension corresponds to one word, and the dimensionality of the vector is the size of the vocabulary. However, this kind of representation has two disadvantages:

1. the represented vector is often high dimensional and very sparse, which brings a challenge for a traditional machine learning algorithm,

2. it ignores the semantics of the words.” [10]

“Recently, word embeddings are becoming more and more popular and have shown excellent performance in various natural language processing tasks. Each word is represented by a dense vector, and words with similar meanings will be close to each other in the vector space.” [10]

[11] has conducted a comprehensive review of more than 150 deep learning based models (models which utilise neural networks) and non-deep learning based models (models which do not) for text classification. It explains each concept and then conducts a thorough evaluation of the models on multiple tasks. For the task of urgency classification, a mix of both approaches was selected, as summarised in Table 5.1, with one non-deep learning model serving as the baseline.


Figure 5.1: Bag of Words as Count Vector. Source: [12, Slide 4]

5.1.2.1 Bag of Words

According to [12, slide 4], Bag of Words (BoW) is the simplest and most naive approach to representing a document, which is why it was selected as the embedding for the baseline model. Words from each document are combined into a single vocabulary. For each word, the number of occurrences in a document is represented by the number in the corresponding column, as illustrated in Figure 5.1. Word occurrence can be represented either as a boolean or as an integer; if the latter variant is selected, the representation is also called a Count Vector. [12, slide 4]

An implementation of this technique is available as the CountVectorizer class in the scikit-learn Python library, which converts a collection of text documents to a matrix of token counts using a sparse representation of the counts. [13]
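A minimal usage sketch of CountVectorizer on two toy English documents follows; the vocabulary is sorted alphabetically and each row is the count vector of one document.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log"]

vectorizer = CountVectorizer()           # pass binary=True for the boolean variant
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(counts.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]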

5.1.2.2 Term frequency–Inverse Document Frequency

[12, slide 5] describes Term frequency–Inverse document frequency (Tf-idf) as a measure of how specific individual words are to a document in a collection of documents (corpus). Tf-idf is defined by the following expression:

$$\mathrm{tf\text{-}idf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D) = \frac{\mathrm{Count}_{t,d}}{\sum_{t' \in d} \mathrm{Count}_{t',d}} \times \frac{|D|}{|\{d \in D : t \in d\}|}$$

where:

t is a specific term,

d is a specific document,

D is the set of all documents (corpus),

Count_{t,d} is the number of occurrences of term t in document d,

tf(t, d) expresses how important term t is in document d (the number of occurrences of term t divided by the number of all terms in d),

idf(t, D) expresses how rare term t is across the set of all documents D (the number of documents divided by the number of documents containing term t).

The Tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. [14]
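The definition above can be computed directly; note that it omits the logarithm that many implementations apply to the idf term (e.g. scikit-learn's TfidfVectorizer, which also smooths the idf). A small sketch on tokenised toy documents:

from collections import Counter

def tf_idf(term, doc, corpus):
    """Tf-idf exactly as defined above (no logarithm in the idf term)."""
    counts = Counter(doc)
    tf = counts[term] / sum(counts.values())
    idf = len(corpus) / sum(1 for d in corpus if term in d)
    return tf * idf

corpus = [["bolest", "hlavy", "a", "teplota"],
          ["bolest", "zad"],
          ["kašel", "a", "teplota"]]
print(tf_idf("bolest", corpus[0], corpus))   # (1/4) * (3/2) = 0.375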

5.1.3 State of the Art in Text Embedding

Following the success of word embeddings, there has been an increasing interest in learning continuous vector representations of longer linguistic units like sentences. These sentence embeddings are commonly obtained using a Recurrent Neural Network (RNN) encoder, which is typically trained in an unsupervised way over large collections of unlabelled corpora. For instance, the skip-thought model couples the encoder with an auxiliary decoder and trains the system to predict the surrounding sentences over a collection of books. [15]

It was later shown that more competitive results could be obtained by training the encoder over labelled Natural Language Inference (NLI) data. This was later extended to multitask learning, combining different training objectives like those of skip-thought, NLI and machine translation. [15] “While the previous methods consider a single language at a time, multilingual representations have attracted a large attention in recent times. Most of this research focuses on cross-lingual word embeddings, which are commonly learned jointly from parallel corpora. An alternative approach that is becoming increasingly popular is to train word embeddings separately for each language, and map them to a shared space based on a bilingual dictionary or even in a fully unsupervised manner.” [15]

“One of the biggest challenges in NLP is the shortage of training data. Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labelled training examples. However, modern deep learning-based NLP models see benefits from much larger amounts of data, improving when trained on millions, or billions, of annotated training examples. To help close this gap in data, researchers have developed a variety of techniques for training general-purpose language representation models using the enormous amount of unannotated text on the web (known as pre-training). The pre-trained model can then be fine-tuned on small-data NLP tasks like question answering and sentiment analysis, resulting in substantial accuracy improvements compared to training on these datasets from scratch.” [16]

Figure 5.2: Architecture of LASER for multilingual sentence embeddings. Source: [15]

The performance comparison conducted by [11] was the starting point for the selection of the deep learning based models. According to [11], Bidirectional Encoder Representations from Transformers (BERT) based models and XLNet scored the highest, with Universal Language Model Fine-tuning (ULMFiT) also among the promising candidates. Although XLNet scored the highest, it was not selected because of its memory requirements. A search for Czech versions of these models found a Czech version of A Lite BERT (ALBERT) [17] and ULMFiT for Czech [18], which were both trained solely on a Czech corpus. However, after several failed attempts to run ULMFiT for Czech, it was removed from the list. The supervisor also suggested LASER [15] by Facebook Research.

The final list of selected deep learning based techniques is summarised in Table 5.1. It contains two multilingual models (LASER, XLM-RoBERTa) and one model trained solely on Czech (Czech ALBERT).

5.1.3.1 LASER

[15] introduces an architecture to learn joint multilingual sentence representations for 93 languages (Czech and English among them). Cross-lingual document classification is a typical application of multilingual representations. Bitext mining is another natural application of multilingual sentence embeddings: given two comparable corpora in different languages, the task consists of identifying sentence pairs that are translations of each other, as shown in Figure 5.3. “This is the first successful exploration of general-purpose massively multilingual sentence representations.” [15]

Figure 5.2 illustrates LASER's architecture, which is based on [20]. “Sentence embeddings are obtained by applying a max-pooling over the output of a Bidirectional Long Short Term Memory (BiLSTM) encoder. These sentence embeddings are used to initialize the decoder Long Short Term Memory (LSTM) through a linear transformation and are also concatenated to its input embeddings at every time step.” There is no other connection between the encoder and the decoder, as the goal is to have all the relevant information of the input sequence captured by the sentence embedding. [15]

Figure 5.3: LASER's multilingual embeddings projected into two-dimensional space. The same sentences in different languages are clustered closely together. Source: [19]

The system uses a single encoder and decoder, which are shared by all languages involved. “For that purpose, a joint byte-pair encoding (BPE) vocabulary with 50 000 operations is learned on the concatenation of all training corpora. This way, the encoder has no explicit signal on what the input language is, encouraging it to learn language-independent representations. In contrast, the decoder takes a language ID embedding that specifies the language to generate, which is concatenated to the input and sentence embeddings at every time step.” The resulting sentence representations are 1024-dimensional. [15]

“LASER also offers several additional benefits:

1. It delivers extremely fast performance, processing up to 2,000 sentences per second on a Graphics Processing Unit (GPU).

2. The sentence encoder is implemented in PyTorch with minimal external dependencies.

3. Languages with limited resources can benefit from joint training over many languages.


4. The model supports the use of multiple languages in one sentence.

5. Performance improves as new languages are added, as the system learns to recognize characteristics of language families.” [21]

Figure 5.4: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs. Source: [16]

Figure 5.5: Example of masking words with BERT. Source: [16]

Facebook has open-sourced this work, making LASER the first successful exploration of massively multilingual sentence representations to be shared publicly with the NLP community. The toolkit now works with more than 90 languages, written in 28 different alphabets. The code is available under a Berkeley Software Distribution (BSD) License in [19]. [21]
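As an illustration, the community laserembeddings package wraps the released encoder; whether it matches the exact toolkit version used elsewhere in this thesis is an assumption. The sketch embeds a Czech and an English sentence into the shared 1024-dimensional space.

# pip install laserembeddings
# python -m laserembeddings download-models
from laserembeddings import Laser

laser = Laser()

sentences = ["Bolí mě hlava a mám teplotu.",
             "I have a headache and a fever."]
embeddings = laser.embed_sentences(sentences, lang=["cs", "en"])
print(embeddings.shape)   # (2, 1024); similar sentences end up close together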

5.1.3.2 ALBERT

In 2018, researchers from Google open-sourced a new technique for NLP pre-training called Bidirectional Encoder Representations from Transformers (BERT). BERT builds upon recent work in pre-training contextual representations, including Semi-supervised Sequence Learning, Generative Pre-Training, deep contextualised word representations using Embeddings from Language Models (ELMo, explained in [22]), and ULMFiT, published in [23]. “However, unlike these previous models, BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (in this case, Wikipedia).” [16]

A visualisation of BERT's neural network architecture compared to previous state-of-the-art contextual pre-training methods is shown in Figure 5.4.


“The arrows indicate the information flow from one layer to the next. The top green boxes indicate the contextualized representation of each input word.” [16]

“It is impossible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that's being predicted to indirectly ‘see itself' in a multi-layer model.” Figure 5.5 illustrates a solution to this problem: a technique of masking out some of the words in the input and then conditioning each word bidirectionally to predict the masked words. “While this idea has been around for a very long time, BERT is the first time it was successfully used to pre-train a deep neural network.” [16]

[24] presents A Lite version of BERT (ALBERT). It advances the state-of-the-art performance on tasks such as the competitive Stanford Question Answering Dataset (SQuAD 2.0) [25] and the Large-scale ReAding Comprehension Dataset From Examinations (RACE) benchmark [26]. ALBERT has been released as an open-source implementation on top of the TensorFlow library and includes a number of ready-to-use ALBERT pre-trained language representation models.

“The key to optimizing performance, captured in ALBERT's design, is to allocate the model's capacity more efficiently. Input-level embeddings (words, sub-tokens, etc.) need to learn context-independent representations (e.g., the word ‘bank'). In contrast, hidden-layer embeddings need to refine that into context-dependent representations, e.g., a representation for ‘bank' in the context of financial transactions, and a different representation for ‘bank' in the context of river-flow management.” [27]

The efficiency is achieved by factorisation of the embedding parametrisation. The embedding matrix is split between input-level embeddings with a relatively low dimension (e.g., 128), while the hidden-layer embeddings use higher dimensionalities (768 as in the BERT case, or more). With this step alone, ALBERT achieves an 80 % reduction in the parameters of the projection block, at the expense of only a minor drop in performance (-0.1 on SQuAD 2.0 and -0.3 on RACE), with all other conditions the same as for BERT. [27]

Transformer-based neural network architectures rely on independent layers which do not share parameters. ALBERT allows parameter sharing across the layers, which slightly diminishes the accuracy, but the more compact size is well worth the tradeoff. “Parameter sharing achieves a 90 % parameter reduction for the attention-feedforward block (a 70 % reduction overall), which, when applied in addition to the factorization of the embedding parameterization, incurs a slight performance drop of -0.3 on SQuAD 2.0 to 80.0, and a larger drop of -3.9 on RACE score to 64.0.” [27]

ALBERT supports a new tokenizer called SentencePiece [28]. SentencePiece is a language-independent tokenizer and detokenizer developed at Google. The SentencePiece model first needs to be trained on a corpus of sentences. The output is a model file and a vocabulary file, which are self-contained and portable. It is recommended to apply the same text preprocessing techniques both during training and fine-tuning. [17]


“Fine-tuning is the process of adapting the pretrained BERT model to a different task than it was pretrained on. The first step is modifying the task's input, so it fits the BERT inputs and somewhat resembles the pretraining tasks. The second step is adding additional layer(s) on top of the core (on top of the output embeddings) to answer the task. This usually means adding a classification layer on the output embedding of the first token ([CLS]) if we want to classify the whole input.” The final step is to train the modified model on the task-specific data. The fine-tuning process is much faster than training a whole new model because the model is already trained on the Czech corpus. [17]

There are two published transformer-based language models that support the Czech language:

• Multilingual BERT, which supports 104 languages,

• Slavic BERT, which supports Czech, Russian, Bulgarian and Polish, and has a bigger vocabulary overlap with the Czech language than the Multilingual BERT. [17]

[17] presents an ALBERT model trained solely on the Czech language. Its performance in sentiment analysis and question answering was tested on Czech datasets, such as the Propaganda dataset (published in [29]) and the Novinky.cz dataset (published in [30]). Czech ALBERT, which was trained on a smaller set of data, with four times smaller vocabulary size than the Multilingual and the Slavic BERT, still manages to outperform Multilingual BERT and nearly matches the Slavic BERT.

5.1.3.3 XLM-RoBERTa

[31] shows that pretraining of multilingual language models leads to significant performance gains for a wide range of cross-lingual transfer tasks. A Transformer-based masked language model is trained on one hundred languages (including Czech), using more than two terabytes of filtered web crawl data. The XLM-RoBERTa (XLM-R) model significantly outperforms Multilingual BERT (mBERT – state-of-the-art multilingual BERT-based language model) on a variety of cross-lingual benchmarks. The results show, for the first time, the possibility of multilingual modelling without sacrificing per-language performance.


Figure 5.6: Cross-lingual language model pretraining. “The MLM objective is similar to the one of [16], but with continuous streams of text as opposed to sentence pairs. The TLM objective extends MLM to pairs of parallel sentences. To predict a masked English word, the model can attend to both the English sentence and its French translation, and is encouraged to align English and French representations. Position embeddings of the target sentence are reset to facilitate the alignment.” Source: [32]

To train the XLM-R, [31] follows the approach introduced by [32], which presents three language modelling objectives. Two of them only require monolingual data (unsupervised), while the third one requires parallel sentences (supervised). The objectives are as follows:

• Causal Language Modeling (CLM),

• Masked Language Modeling (MLM),

• Translation Language Modeling (TLM).

The CLM task consists of a Transformer language model trained to model the probability P(w_t | w_1, ..., w_{t−1}, θ) of a word given the previous words in a sentence. In the case of Transformers, previous hidden states can be passed to the current batch to provide context to the first words in the batch. “However, this technique does not scale to the cross-lingual setting, so the first words are left out in each batch without context for simplicity.” [32]
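In other words, CLM factorizes the probability of a sentence into a product of per-token conditionals. The following sketch makes this explicit; lm_prob is a hypothetical model interface, not a real library call:

    import math

    def sentence_log_prob(words, lm_prob):
        # lm_prob(w, prefix) returns P(w | prefix) under the language model.
        return sum(math.log(lm_prob(w, words[:t]))
                   for t, w in enumerate(words))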

The MLM task consists of randomly sampling 15 % of the BPE tokens from the text streams, replacing them by a “[MASK]” token 80 % of the time, by a random token 10 % of the time, and keeping them unchanged 10 % of the time. [31] uses text streams of an arbitrary number of sentences (truncated at 256 tokens) instead of pairs of sentences. To counter the imbalance between rare and frequent tokens (e.g. punctuation or stop words), tokens in a text stream are sampled according to a multinomial distribution whose weights are proportional to the square root of their inverse frequencies. The MLM objective is illustrated in Figure 5.6. [32]
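The masking recipe can be made concrete with a short sketch of my own (not the reference implementation):

    import random

    def mask_tokens(tokens, vocab, mask_token="[MASK]", p_select=0.15):
        masked, targets = [], []
        for tok in tokens:
            if random.random() >= p_select:
                masked.append(tok)
                targets.append(None)                 # token is not predicted
                continue
            targets.append(tok)                      # model must recover this token
            r = random.random()
            if r < 0.8:
                masked.append(mask_token)            # 80 %: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10 %: random token
            else:
                masked.append(tok)                   # 10 %: keep unchanged
        return masked, targets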

“Both the CLM and MLM objectives are unsupervised and only require monolingual data.” The TLM objective is an extension of MLM, where instead of considering monolingual text streams, parallel sentences are concatenated as illustrated in Figure 5.6. Words are randomly masked in both the source and target sentences. “To predict a word masked in an English sentence, the model can either attend to surrounding English words or to the French translation, encouraging the model to align the English and French representations.” [32]

“XLM-R outperforms mBERT on cross-lingual classification by up to 23 % accuracy on low-resource languages. It outperforms the previous state of the art by 5.1 % average accuracy, 2.42 % average F1-score on Cross-lingual Natural Language Inference (XNLI), and 9.1 % average F1-score on Cross-lingual Question Answering.” XLM-R is also evaluated on monolingual fine-tuning on the General Language Understanding Evaluation (GLUE) and XNLI benchmarks, where the obtained results compete with state-of-the-art monolingual models, including RoBERTa [33], from which XLM-R is derived. “These results demonstrate, for the first time, that it is possible to have a single large model for all languages without sacrificing per-language performance.” [31]

5.1.4 Evaluation

According to [5], this CRISP-DM step assesses the degree to which the model meets the business objectives and seeks to determine if there is some business reason why this model is deficient. Another option is to test the model on test applications in a real environment, if time and budget constraints permit.

The evaluation phase also involves assessing any other data mining results that were generated. These include models that are necessarily related to the original business objectives, as well as findings that are not necessarily related to them but might unveil additional challenges, information, or hints for future directions.

When the resulting model appears to be satisfactory, it is appropriate to review the data mining process and check for any issues (e.g., is the model built correctly, is the testing set adequately separated). Finally, the next steps should be determined and a list of potential further actions, along with the reasons for and against each option, should be provided. After a decision is made, the rationale should also be documented. [5]


Figure 5.7: Confusion matrix example. Prediction rates in order left to right, top to bottom: True Negative (TN), False Negative (FN), False Positive (FP), True Positive (TP)

Quantifying the quality of predictions

Scikit-learn [34] provides the package sklearn.metrics with a set of metrics to measure the model’s performance. The accuracy metric alone would be insufficient due to class imbalances in the data. The following metrics were used to track model scores:

• Precision,

• Recall,

• F1 score,

• Area Under the Receiver Operating Characteristic Curve (ROC AUC).

Figure 5.7 illustrates a confusion matrix from metrics.confusion_matrix, which is useful for visualising model prediction rates.

Sklearn provides methods for Precision (metrics.precision_score) and Recall (metrics.recall_score). The precision is intuitively the classifier’s ability not to label as positive a sample that is negative. The recall is intuitively the ability of the classifier to find all the positive samples.

precision_score = TP / (TP + FP),    recall_score = TP / (TP + FN)

The F1 score from metrics.f1_score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at one and worst score at zero. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:

F1 = (2 × precision × recall) / (precision + recall)

Figure 5.8: Receiver Operating Characteristic example. Source: [35]

ROC curves are typically used in binary classification to study the output of a classifier. Figure 5.8 illustrates an example of a ROC curve from metrics.roc_curve and metrics.auc. “ROC curves typically feature true positive rate on the Y axis and false positive rate on the X axis. This means that the top left corner of the plot is the “ideal” point – a false positive rate of zero and a true positive rate of one. This is not very realistic, but it does mean that a larger area under the curve (AUC) is usually better.” [35]
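A minimal sketch of computing these metrics with scikit-learn follows; the label and score vectors are illustrative placeholders, not the thesis data:

    from sklearn.metrics import (confusion_matrix, precision_score,
                                 recall_score, f1_score, roc_auc_score)

    y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                   # ground-truth labels
    y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                   # hard class predictions
    y_score = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1, 0.7, 0.3]   # predicted probabilities

    print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]]
    print(precision_score(y_true, y_pred))    # TP / (TP + FP)
    print(recall_score(y_true, y_pred))       # TP / (TP + FN)
    print(f1_score(y_true, y_pred))           # harmonic mean of the two
    print(roc_auc_score(y_true, y_score))     # AUC needs scores, not hard labels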

5.2 Urgency Prediction

The task of urgency prediction, described in Subsection 2.2.2, is a supervised binary classification problem. To provide a better perspective for the Client, a User Interface (UI) concept (illustrated in Figure 5.9) was made to visualise urgency prediction in the context of the Client’s application. The questions would be classified into the following classes: urgent (red “yes”), non-urgent (green “no”), and unsure (orange “maybe”).

The dataset consists of two columns: question (source variable X) and isUrgent (target variable Y). It was split into three parts, with the following distributions (see the sketch after the list):

• Training set (80 %),

• Validation set (10 %),

• Testing set (10 %).
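The split could be reproduced with scikit-learn roughly as follows. This sketch assumes a pandas DataFrame df with the columns question and isUrgent; stratifying by the target is my addition, motivated by the class imbalance noted earlier:

    from sklearn.model_selection import train_test_split

    X, y = df["question"], df["isUrgent"]

    # 80 % training; the remaining 20 % is halved into validation and testing.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)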
