
University of Economics and Business Prague

Faculty of Informatics and Statistics

NLP TECHNOLOGIES AND AI FOR BUSINESS: APPLICATIONS AND METHODS UTILIZED TO REDUCE COSTS AND WORKFLOW IMPROVEMENT

MASTER THESIS

Study programme: Applied Informatics, Information Systems Management
Field of study: Economic Data Analysis

Author: Ekaterina Staroverova
Supervisor: Ing. Martin Potančok, Ph.D.

Prague, 2021


Declaration

I hereby declare that I am the sole author of the thesis entitled "NLP Technologies and AI for business: applications and methods utilized to reduce costs and workflow improvement". I duly identified all citations. The used literature and sources are stated in the attached list of references.

Prague Signature: Ekaterina Staroverova


Abstract

The rapidly changing situation in different business sectors forces companies to improve constantly in all areas, such as maintaining the financial workflow, organising candidate search and hiring, customer support, internal purchasing, document flow etc.

Technologies evolve dramatically, along with the number of tasks, the volume of data and growing customer demand. The more successful companies want to be in a certain sector, the more effort management should spend on both process improvement and data investigation, researching potential problems and gaps.

Thousands of modern managers consider Artificial Intelligence (AI) a way to deal with all of these tasks, especially Natural Language Processing (NLP), which is able to handle tasks connected with speech and text - from recognition to building dialogues and data reorganisation.

The Master Thesis examines the potential impact of tools and frameworks based on NLP and AI on different divisions in companies. While collecting information, it investigates approaches for workflow improvement and calculates the profit to be expected after implementing the technologies.

To obtain results, multiple methods were utilised. Firstly, programming code based on data sets with employee requests was created.

Secondly, chatbot and mailbot designs and cost calculations were made, and workflows for data normalisation and text preparation, model training and testing were presented.

Thirdly, interviews with managers from companies that create applications and tools based on AI and NLP were conducted. To collect specialists' opinions, a questionnaire and a short form with two business cases were prepared.

Finally, recent articles and research about NLP and AI were analysed, and the potential impact on different types of business was obtained and supported by examples and cases.

During the research preparation, several conclusions were drawn:

- NLP technologies and AI tools can increase profit and make the workflow more transparent
- Data for NLP analysis can be a perfect instrument for understanding the workflow and for further analysis, including customers' behaviour
- AI is able to substitute a human being, but requires both proper data collection and analysis

Keywords

Contents

Abstract 2

Keywords 2

Introduction 4

NLP Technologies for processes improvement 7

Services implementing NLP/AI technologies and applications 12

Data Analysis Framework Applying 13

Chatbot Implementation 20

Service Selection 21

Dialog Flow Design for HR Chatbot 24

Building Dialogs in Microsoft QnA Maker 26

Chatbot Managing and Developing 28

Mailbot Implementation 29

Model Selection for Classification 30

Model Cost Calculation 34

Approaches and methods for NLP Models building 36

Platforms for NLP project maintain and deployment 44

Validation of updated services based on NLP and AI 48

Financial assessment of applications and platforms utilisation 48

Approaches for the Model performance measurement 49

Model explanation with SHAP 52

Maintaining of projects based on NLP 54

Consequences for HR and labor market 55

Conclusion 57

Programs 59

List of abbreviations and terms 75

List of Figures and Tables 76

References 79

Annexes 82

Annex A: Libraries and tools for NLP Analysis and Model Building 82

Services for bots maintain and build 82

List of Python Libraries for NLP 82

Annex B: RACI Explanation 85

Annex C: Interviews with NLP and AI Professionals 86

Annex D: Questionnaire for NLP and AI Professionals 90


Introduction

Artificial Intelligence can be considered "an attempt to find ways in which such behaviour could be engineered in any types of artefacts" (Blay Whitby, 2008, p.20). When speaking about AI, business usually means a great variety of techniques and products - from optical character recognition and speech understanding to the Internet of Things and self-driving cars. Businesses can regard AI as a working and promising instrument for improving and even managing processes.

NLP, in turn, is a branch of AI and also a set of techniques utilised in programming for text analytics and modelling. Products and tools based on NLP approaches are a new target for thousands of IT companies, media, factories, e-education, logistics and other sectors.

According to the Reportlinker research about the NLP field, the market will reach $29.5 billion by 2025 with a compound annual growth rate of 20.5% (Global Natural Language Processing Market, 2019). The chatbot market size is projected to grow from USD 2.6 billion in 2019 to USD 9.4 billion by 2024, at a CAGR of 29.7% during the forecast period (Markets Insider, 2019).

However, the market can grow even more due to the specific worldwide economic situation in 2020-2021 and the growing percentage of remote workers, which forced companies to automate some processes. The impact of the coronavirus situation on the NLP market is difficult to estimate for now. However, experts interviewed for the research confirmed the growing interest in NLP because of these unusual circumstances.

The aim of the Thesis is to select and implement NLP technologies, AI frameworks and applications that are able to influence and benefit business purposes and processes for certain cases (such as proposals for HR, Procurement and Sales divisions), based on investigating and revealing process flows in existing workflows and the business organization. The purpose of this thesis is to understand and calculate the potential profit for the business obtained after the implementation of tools and approaches based on NLP. The MT determines at least two ways to achieve it:

- workflow improvement, which leads to better process organisation. Consequently, it can influence customers' satisfaction and increase earnings, or remove unnecessary parts, divisions or suppliers


- workforce substitution. Replacement of some specialists or even divisions with AI or at least tools based on RPA.

The Objective of the thesis is to obtain sufficient and complete proof of how applications and platforms based on NLP and AI are able to improve the workflow, what technologies are utilized and how much money and time can be saved.

Chatbots, voice recognition tools and helpers, and tools for classifying comments, requests and news help to reduce costs by replacing manual work. Even though services like chatbots seem insignificant in comparison with the organization of HR or financial departments, they are an important part of process improvement. Replacement leads to the implementation of a new architecture in the workflow, which influences the whole business organisation.

In the chapter NLP Technologies for processes improvement the research describes technologies and platforms which are able to improve the workflow and reduce costs, providing examples of the most popular tools and approaches.

In Services implementing NLP/AI technologies and applications the MT focuses on data preparation, algorithms and tools for application and feature creation, and provides the organization of the Data Analysis Framework and the steps for chatbot and mailbot implementation, including cost calculation and the selection of ML models for NLP tasks.

In the chapter Validation of updated services based on NLP and AI the research reveals how to measure performance, how to maintain projects based on NLP, and what the consequences can be for both the employee and the employer in whose workflow tools based on NLP were embedded.

Both the Service Implementation and the Validation parts are supported with code created in Python for the purpose of demonstrating the processes; it includes different approaches from the NLP field - cluster analysis, classification with TF-IDF or other data vectorization, N-gram analysis and visualisation of the most important results. The obtained data was considered an instrument for workflow improvement, and based on the results a new method based on AI was proposed. Furthermore, the chapter reveals the requirements mandatory for creating the Model - the algorithm or set of algorithms which is the essential part of all possible features.


Potential outputs of the thesis are:

○ Creating a transparent Data Preparation framework for NLP tasks

○ Inventing a framework for chatbot and mailbot building, including preparing the dialog flow and cost calculation

○ Providing a full explanation of the mailbot building based on a classification model

○ Understanding the model validation and evaluation process for mailbots and chatbots.

All these steps are capable of proving how technologies based on NLP and AI are able to reduce costs and increase revenue, how cognitive technologies are able to reorganize work in divisions such as accounting, HR, IT-developing etc.

What else will be determined is what can be done for this digital transformation and how to estimate the result of the improvements. To achieve these, multiple techniques, explanations and frameworks were provided, code for algorithm performance demonstration was created.

The ideas represented in the thesis will be based on scripts created in Python, investigation of NLP techniques, experts' opinions, companies' experience and the capabilities of cognitive technologies that can be utilized today or in the near future. In addition, the thesis will determine the strategy for different departments and formulate a plan for how to reach the goals and meet the criteria. The data set used to demonstrate the techniques for data preprocessing for NLP tasks, topic extraction, classification and model evaluation consists of requests from employees about several subjects, including processing documents for new and leaving employees, purchasing and setting up equipment for them, setting up computer programs and giving access.


NLP Technologies for processes improvement

According to the Deloitte research "AI for the Real World" (Deloitte, 2018), almost 50% of companies that participated in the survey use AI for Robotics and Cognitive Automation, 30% for Cognitive Insight - analysis and predictions - and 16% for Cognitive Engagement - to "engage employees and customers using natural language processing chatbots". In general, 51% of companies emphasized that they use AI to "enhance the features, functions, and performance of our products". In other words, process automation has already become an important part of routine process improvement in different fields of business.

NLP plays an important role here as a branch of AI and a powerful instrument for solving tasks connected with text, speech and translation.

NLP is utilized for multiple purposes - data analysis to create models for classification, clustering of texts, information extraction, chatbot building. In some cases the business idea can be realised fast and without a lot of human resources; in others, tasks can require weeks and even months of data collection, script building and investigation.

Dan Jurafsky and Christopher Manning from the Natural Language Processing Group at Stanford University distinguished at least three levels of NLP tasks categorized by complexity for both science and business (Introduction to NLP, 2012).

Mostly solved: Spam Detection, Part-of-speech tagging, Named Entity Recognition.

Making good progress: Sentiment Analysis, Coreference Resolution, Parsing, Machine Translation, Information Extraction.

Still really hard: Question Answering, Paraphrase, Summarization, Dialog.

Nevertheless, even simple tasks such as Spam Detection or Classification can take some time due to data insufficiency, multiple languages or difficulties with understanding the workflow. Still, tools based on NLP usually show better performance, at least because they can be created and replicated for different departments, maintained by one team, provide transparent results and be available 24 hours a day.

All departments - finance, HR, procurement, support - operate every day on gigabytes of data, including invoices, contracts, internal and external requests, purchase orders, tasks for employees etc.

One of the usual tasks connected with NLP methods and crucial for digitalisation and workflow organization is document recognition. There is a variety of platforms and applications based on NLP capable of handling both digital and non-digital documents. In addition, scans of invoices, reports, forms and contracts are not always of proper quality. For these purposes there exist applications based on computer vision: the neural network reads the document, captures potentially important parts (company name, signature, dates, specific fields etc.) and transfers the document into a digital file. After these steps it is possible to perform topic extraction or send the document for processing. Usually, applications for the computer vision step offer a framework for data extraction - the possibility to determine keywords and even places on the document, language detection, and basic text analysis. Popular applications based on computer vision and useful for document recognition are Microsoft Azure Form Recognizer, Google Cloud's Vision and ABBYY. All these platforms also offer cloud-based pre-trained models for different languages, data storage, an API for model deployment and analytics for model evaluation.

After such a platform is implemented, the specialist no longer needs to read all documents to extract important information; he or she can get a context summarization or obtain only the information necessary for specific procedures or for maintaining the workflow.

Another typical task connected with language processing - classification of documents, tasks, requests, emails - can also be performed with NLP technologies and approaches. For business it opens a variety of possibilities: reducing the costs of manual classification (usually performed by a certain department), implementing spam detection, and organizing the document flow with logical and transparent categorization. In the case of non-digital documents, classification is impossible without document recognition - from the .pdf to a format suitable for the document analysis tools. One of the most popular instruments in this field is ABBYY. The company provides an API which is capable of building applications for the classification of documents like invoices, licences, contracts, tax forms etc. Fast and precise document recognition is crucial for different business fields - for instance, for public health, banks and insurance companies, which have dozens of paper and .pdf format documents.

It was already mentioned that the utilization of all variants of bots - chatbots, mailbots and voicebots - is increasing dramatically. Chatbots are the most popular asset of the bot family due to their simplicity and many ways of utilization. Banks, e-education, shops - the majority of big companies from these sectors already utilize chatbots on their sites and applications, mostly for communication with customers, providing information about services and collecting requests.


A chatbot is usually represented as a dialog window on a site or application. Chatbots may be closer to a simple FAQ and answer typical questions, or provide a sophisticated dialog with the customer. In the first implementation the chatbot usually works with a selection of questions represented in the dialog window; the second has more to do with intents - trigger words - and is based on AI. A chatbot built with technologies based on NLP methods is able to maintain a dialog with potential customers or clients. Consequently, the tool can not only replace the support agent, but also collect information and perform self-study, which means constant improvement. One of the leaders in the chatbot field is the question-answering computer system IBM Watson, which is capable of both answering direct and indirect questions and improving its own performance by analysing the data obtained from users. Hence, the interaction between the service and customers increases; in addition, the business no longer needs to analyse and estimate the employees' behaviour towards customer service - everything happens on the service (chatbot) side.

In general, there are several typical tasks that can be covered by bots. A mailbot is usually utilized for collecting and classifying messages from customers or employees. For instance, the bot can collect purchase orders by extracting information from e-mails and initiate the purchase-order process in the company's purchasing system. As a typical bot, the mailbot is capable of evolving its performance by analysing the results of classification (assigning the task/forming the request). Voicebots, on the other hand, are based on both voice recognition and text analysis. In business they can usually be found as a replacement of a whole call-center. In this situation the voicebot is able to answer simple questions, provide necessary information and connect the customer to the proper specialist.

One of the tasks most helpful for the business, which helps to get insights from the data for further decisions, is Cluster Analysis. For instance, the Customer Support division can perform cluster analysis of reviews to understand which product categories customers mention more often and what kinds of problems and questions are connected with them. In this case it is possible to separate customers into cohorts - for each product category or different regions.

The example of code for this task can be found in Code 1.5 Example of program's code - Cluster Analysis. According to the task solved in the script, the Customer Support division of the company can obtain a table with keywords for each cluster and investigate whether a product got positive or negative feedback and what exactly customers liked or disliked most.
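The following is a minimal sketch of such a cluster analysis, assuming a small hypothetical list of review texts; it uses scikit-learn's TF-IDF vectorizer and K-means and is only illustrative - the thesis's Code 1.5 may be organised differently.

# Sketch of review clustering: TF-IDF vectorization followed by K-means,
# then the top keywords per cluster. Sample reviews are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

reviews = [
    "The laptop battery dies after two hours",
    "Great battery life on this laptop",
    "The headphones broke after one week",
    "Amazing sound quality in these headphones",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

# Print the most important terms for each cluster centre
terms = vectorizer.get_feature_names_out()
for i, centre in enumerate(kmeans.cluster_centers_):
    top_terms = [terms[idx] for idx in centre.argsort()[::-1][:3]]
    print(f"Cluster {i}: {top_terms}")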


Information extraction from bots, mails and reviews also opens a great opportunity for topic and sentiment analysis. "If you want to understand people, especially your customers…then you have to be able to possess a strong capability to analyze text," said Paul Hoffman, CTO of Space-Time Insight, on his Twitter. Analysis of customer behaviour can be extremely useful in a saturated market - the faster a business reacts to an issue and replies to each request, the higher the probability that the customer will stay with the company. The task of sentiment and topic analysis can be performed not with platforms but by creating custom scripts applicable to the case.

Relationships between businesses located in different countries have increased the demand for translation; in this situation machine translation can be implemented to automate the translation of documents, messages and dialog flows. Both Microsoft and Google provide SDKs for translation from multiple languages, including rare ones. It is easy to imagine the benefits it can bring.

For instance, the SMM division of an international company wants to analyse reviews and comments from different countries, and it already has a tool to extract all posts and comments from social media where the brand was mentioned. An example of such a tool is the Social Media Monitoring and Analysis System Brand Analytics, which basically creates reports based on keywords - brand and product names. Nevertheless, only collecting the data is not enough in this case - the output will be multilingual, so the company would need a division with translators for different languages. Still, there are tools for detecting and translating even reviews with several languages in one message. For instance, Googletrans, an open-source library based on Google's translate API, is capable of translating from more than one hundred languages. It is supported in a variety of languages, including Python, Node.js and Java. The utilization of the library is free for several thousands of characters per month (with a 24-hour quota), but 575 000 characters processed in a month will cost $1.50 (Google Cloud). In addition, the API offers the possibility to train custom models, perform audio translation, and support a custom glossary (the collection of words connected with the brand).
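A minimal sketch of the detection-and-translation step with the unofficial Googletrans Python library could look as follows; the sample reviews are hypothetical, and the library's behaviour may change between versions, since it wraps a public Google endpoint rather than the paid Cloud API.

# Sketch: detect the language of each review and translate it to English for further analysis.
from googletrans import Translator

reviews = ["Skvělý produkt, doporučuji!", "Der Akku hält nicht lange."]

translator = Translator()
for review in reviews:
    result = translator.translate(review, dest="en")
    print(f"{result.src} -> en: {result.text}")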

When mentioning NLP, many specialists forget that voice recognition, voicebots and other tools based on speech understanding are also important tools for business analysis and process improvement. For example, the Amazon Lex chatbot service is capable of understanding telephone calls and communicating with customers. As the main advantage of the voice bot, Amazon mentions "reduced customer call time", but it is obvious that this technology also helps to reduce personnel in call-centers.

As can be seen, tools and libraries based on NLP are capable of dealing with a variety of tasks.

Nevertheless, none of them is possible without reliable and readable data. Consequently, the main role in the success of all these tasks is played by proper data collection and a data analysis workflow - without these two steps, performing any steps connected with NLP or AI is impossible.

Consequently, the readiness of a company is connected to the maturity of its management - for instance, the CTO - and the willingness to share data and specify business requirements with the Data Science or Advanced Analytics team. Dependency on the amount and quality of data can be considered the only weakness of automatisation based on NLP. On the other hand, even human beings cannot make rational decisions without knowledge based on collected data or personal experience.


Services implementing NLP/AI technologies and applications

Implementation of NLP/AI technologies can be separated into several steps which help to avoid mistakes and collect specific information. The data preparation and data investigation can be crucial, and it is typical to spend a lot of time on collecting information and creating a framework for data analysis.

The implementation of applications based on NLP/AI may be done with the CRISP-DM approach described by IBM.

Figure 1. The data mining life cycle (IBM)

As an example of implementation based on the CRISP approach, the Mailbot can be considered. In this case, Business Understanding means understanding which tasks can be solved by the bot and how they can improve the work and decrease costs. Tasks that can be handled by a Mailbot with significant performance are: email classification and entity detection (or information extraction according to specific rules and patterns).

The Data Understanding step can include data analysis of the mails with all possible NLP techniques: topic extraction, bigram analysis and others. Based on this information, the ML algorithm for model training can be selected and the modelling step can be performed. The model, in other words, is the output obtained after running the algorithm (Algorithm and Model in Machine Learning, 2020); for example, "the decision tree algorithm results in a model comprised of a tree of if-then statements with specific values". Nevertheless, for NLP tasks all these steps are impossible without semi-structured or structured data - emails, reviews, requests, comments, or even data originally obtained as a voice message but transferred to text. The Modeling step for an NLP job is also unusual, since text vectorization or at least tokenization must be included, which requires powerful processors or even supercomputers with powerful clusters. In general, for NLP purposes a lot of popular algorithms can be utilized - from Linear and Logistic Regression and different types of classification algorithms, such as random trees and artificial neural networks, to clustering algorithms such as K-means. As for the Mailbot, it can be based on an ML technique called BERT, which is capable of both classification and entity extraction.

The Evaluation step must be aligned with the business, which means that the base metrics and results should be documented and agreed in advance; the acceptance criteria must be formed even before modelling. The criteria can be the model accuracy or costs.

Finally, the Deployment can be performed only after agreement that the model performs as expected and all previous steps have finished successfully.

The development of a chatbot can follow the same framework. Nevertheless, the Data Preparation and Understanding step is important for building both bots and for performing all tasks connected with NLP.

Data Analysis Framework Applying

Every NLP task is always built on a significant amount of text or voice data. In cases connected with building models based on text - such as classification, topic extraction or translation - there are various requirements for the data.

○ Quality - texts must be real, without cleaning and improvement before the text analysis and preprocessing. Otherwise, new batches of data may be obtained in different formats.

○ Sufficiency - the question about the amount of data is one of the most difficult, especially for classification tasks with many categories. If the classification is going to be checked or investigated, each label must be represented in a sufficient amount. The minimum quantity of text from a certain category must be discussed before the further steps.

○ Actuality - the data should be up to date, with all potential patterns.

○ Seasonality - data must represent a certain period of time which can show time trends. For example, if customers' activity changes during the year, the sample data must cover all periods.

○ Certainty - data owners should describe the sources from which the data was obtained, or at least notify about this possibility. Different sources usually mean different text structures.

○ Uniformity - if texts were not obtained in the same format, the extra preprocessing step for unification should be performed.

Once all requirements for the data are satisfied, text normalisation can be performed. Text Normalisation is a set of processes "that should be followed to wrangle, clean, and standardize textual data into a form that can be consumed by other NLP and analytics systems and applications as input" (Text Analytics with Python, 2016, p.115). Below are several of the most important text preprocessing techniques.

Tokenization

Split the text into sentences and/or words and represent them as tokens - small units of the whole model. There are several approaches to tokenization; the most common, usually utilized for vectorization, is simply separating the sentence into words. The BERT model requires a specific type of tokenization which splits words with logic that differs between languages. When using BERT from Transformers it is better to keep the context and not to overdo text cleaning. For example, this approach does not need stop-word removal, because it considers word connections and sequence. Nevertheless, the specific tokenization must be performed.

Examples of tokenization preprocessing are provided in the chapter Code 1.3 Example of program's code - Tokenization.
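As an illustration (not the thesis's Code 1.3), the difference between plain word tokenization and BERT's subword tokenization can be sketched as follows, assuming NLTK and the Hugging Face transformers library are installed.

# Sketch of two tokenization styles: plain word tokenization with NLTK and
# BERT's subword (WordPiece) tokenization via the transformers library.
import nltk
from nltk.tokenize import word_tokenize
from transformers import BertTokenizer

nltk.download("punkt", quiet=True)

text = "Please reset my Visual Studio licence."

print(word_tokenize(text))
# ['Please', 'reset', 'my', 'Visual', 'Studio', 'licence', '.']

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(bert_tokenizer.tokenize(text))
# lowercased tokens; words missing from the vocabulary are split into '##' subword pieces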

Removing stopwords

Remove words that are not significant for the data analysis, do not influence the results and only create noise. In addition, some algorithms work with a limited number of tokens, and those should rather be important words.

NLP libraries such as NLTK and SpaCy have a set of stop-words for each language model. In the English models there are words like: {'him', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'i', 'after', 'few', 'whom', 't' etc.}.
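A minimal sketch of stop-word removal with NLTK's built-in English stop-word list, on a hypothetical request sentence:

# Sketch of stop-word removal with NLTK
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

stop_words = set(stopwords.words("english"))

tokens = word_tokenize("I had no access to the annual leave form after all")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['access', 'annual', 'leave', 'form'] - only content-bearing words remain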

Stemming and Lemmatization

Correcting repetitive words and spelling - stemming and lemmatization are procedures for removing affixes.

For example, 'Office catalogs' becomes 'offic catalog' after stemming; the idea behind it is to unify words for the next steps and make frequency calculation more representative. Otherwise words like 'office' and 'offices' would be considered different. In other languages (like Russian, Greek, Czech or German) that have several cases and gender variability the problem is even more serious, since one word can have at least ten different variants of spelling.
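The difference can be sketched with NLTK's PorterStemmer and WordNetLemmatizer (illustrative only):

# Sketch comparing stemming (rule-based suffix stripping) and lemmatization
# (dictionary-based reduction to the base form) with NLTK.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["office", "offices", "catalogs"]
print([stemmer.stem(w) for w in words])          # ['offic', 'offic', 'catalog']
print([lemmatizer.lemmatize(w) for w in words])  # ['office', 'office', 'catalog']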

For some steps it is important not only to normalize the text but also to create the bag-of-words - a list of all tokens in the document. The document, in turn, is a set of texts - mails, reviews, requests.

For this task all words in the document must be lowercase, in some cases even stemmed or lemmatized, and punctuation marks should be removed. With a bag-of-words it is possible to calculate the frequency of words and the lexical diversity - to determine how versatile and complex the given data is.

For instance, lexical diversity is calculated as the number of unique words divided by the total number of words: the smaller the value, the fewer original words in the text, and if the frequency of some words is high - they repeat again and again - there is a pattern. For example, in Code 1.1 Example of program's code - N-gram analysis there are several common bigrams: 'importance high', 'annual leave', 'visual studio' etc. That indicates that questions connected with these collocations are the most asked among all requests.
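A minimal sketch of building a bag-of-words and computing lexical diversity as defined above, on a few hypothetical requests:

# Sketch of a bag-of-words with word frequencies and the lexical diversity measure
# (unique words divided by total words).
from collections import Counter
import string

documents = [
    "Annual leave form request",
    "New starter form for the new employee",
    "Access to Visual Studio for the new starter",
]

# Lowercase, strip punctuation and flatten all documents into one bag of words
tokens = []
for doc in documents:
    cleaned = doc.lower().translate(str.maketrans("", "", string.punctuation))
    tokens.extend(cleaned.split())

bag_of_words = Counter(tokens)
print(bag_of_words.most_common(5))

lexical_diversity = len(set(tokens)) / len(tokens)
print(f"Lexical diversity: {lexical_diversity:.2f}")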

Extracting words and collocation

○ Counting each word and determining the more frequently mentioned ones - already mentioned above in connection with the bag-of-words. Most-common-words and n-gram analysis helps to understand the main ideas and creates the foundation for the next tasks. An example of this procedure is presented in the chapter Pictures, tables and programs, Code 1.1 Example of program's code - N-gram analysis. The data processed in the code are employees' requests to support in an industrial company. Even such a quick and easy-to-develop approach reveals topics in one of the categories close to the HR field:

○ questions about documents and forms (bigrams: 'annual leave', 'leaver form', 'expense report')

○ requests to set up the place for a new employee (bigrams: 'new starter', 'starter form') and access to programs ('visual studio')

○ messages about technical problems and issues, including requests to help establish access for clients ('clients subsidiaries')

It is not only valuable information about context understanding but also the perfect base for Question Answering because it requires a set of common questions from employees.

The trigrams 'importance high expire' and 'importance high create' indicate that several levels of importance can be added when a task is created. For instance, the bot can ask for the problem category.

Bigrams and words such as 'line manager', 'administration officer', 'engineer' show the responsible persons involved in the dialog, which can also be useful for bot building. These people can be involved in interviews about understanding the workflow and, later, in bot beta-testing.

From the code example with the Support Data, a visualisation of the most common words can be obtained; a minimal sketch of how such a chart can be produced is shown below. This chart can be utilized not only for data analysis, but also for monitoring the context of requests. It is, basically, a perfect tool for analysing employees' issues and questions and a perfect indicator of how topics change from week to week.
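A minimal sketch of such a chart, assuming a hypothetical list of request texts:

# Sketch of a "most common words" bar chart like the one in Figure 2.
from collections import Counter
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

requests = [
    "Please process the annual leave form",
    "New starter needs Visual Studio access",
    "Annual leave request for next week",
]

stop_words = set(stopwords.words("english"))
tokens = [t.lower() for text in requests for t in word_tokenize(text)
          if t.isalpha() and t.lower() not in stop_words]

common = Counter(tokens).most_common(10)
words, counts = zip(*common)

plt.bar(words, counts)
plt.xticks(rotation=45)
plt.title("Most common words in employee requests")
plt.tight_layout()
plt.show()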


Figure 2. Most Common Words Visualisation (Jupyter Notebook)

Text Vectorization

Finally, the transformation of texts to vectors with TF-IDF or advanced word vectorization models is a necessary step of text preprocessing before modeling. Only after this can the majority of algorithms perform different tasks with the data.

TF-IDF - the computation of this method can be found in the documentation. Typically, the tf-idf weight is composed of two terms: the first is the normalized Term Frequency (TF), the number of times a word appears in a document divided by the total number of words in that document; the second is the Inverse Document Frequency (IDF), "computed as the logarithm of the number of the documents in the corpus divided by the number of documents where the specific term appears".
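Written out, the standard formulation consistent with this description is (scikit-learn's implementation additionally applies smoothing, so exact values differ slightly):

\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}

\mathrm{idf}(t, D) = \log\frac{N}{|\{d \in D : t \in d\}|}

\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)

where f_{t,d} is the number of occurrences of term t in document d and N is the number of documents in the corpus D.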


The advantage of this technique is that after this preprocessing it is possible to use a variety of algorithms for data classification and cluster analysis - from RandomForest to K-means.

CountVectorizer is another approach to word vectorization, which, according to the documentation of the Sklearn library, "converts a collection of text documents to a matrix of token counts".
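A minimal sketch contrasting the two scikit-learn vectorizers on a toy corpus:

# CountVectorizer produces raw token counts; TfidfVectorizer re-weights them
# by inverse document frequency, so words common to every document get lower weights.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "annual leave form",
    "new starter form",
    "visual studio access for new starter",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(corpus)
print(count_vec.get_feature_names_out())
print(counts.toarray())  # one row per document, one column per token

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(corpus)
print(weights.toarray().round(2))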

POS Tagging

Tagging each word with its part of speech and type (geographical and personal names, company titles etc.) for further tag-based phrase extraction, for obtaining collocations and sentences with a specific order, or to understand dependencies between words in the sentence.

Figure 3. Dependency Parse Visualisation (Explosion.ai)

In general, the ability of libraries to tag words by part of speech and place in the sentence (subject, predicate, definition etc.) can be utilized for extracting entities in the text and expressions that refer to those entities. An example is coreference resolution - "an exceptionally versatile tool and can be applied to a variety of NLP tasks such as text understanding, information extraction, machine translation, sentiment analysis, or document summarization" (Intro to coreference resolution in NLP, 2020).

Figure 4. Coreference Resolution (Towardsdatascience.com)


Shallow Parsing

Chunking sentences and matching words with specific groups based on part of speech.

Dependency Parsing

Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object (Spacy, NLTK).
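A minimal sketch of POS tagging and dependency parsing with spaCy, assuming the small English model en_core_web_sm has been downloaded (python -m spacy download en_core_web_sm):

# For each token print its text, part of speech, dependency label and head token.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The new employee requested access to Visual Studio.")

for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)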

Named Entity Recognition (NER)

Labelling named "real-world" objects, like persons, companies or locations (possible with libraries like Spacy and NLTK). Useful for context understanding and for extracting collocations with specific words. For example, the Spacy library offers the possibility to build an Entity Ruler - to add custom words, sets of words or even patterns. This technique can be useful if the business needs to extract specific names or product titles together with actions. For instance, the business may want to understand which words customers usually use when asking about different products, which can be useful for creating some kind of map for each product or brand.
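A minimal sketch of adding custom patterns with spaCy's EntityRuler; the product names and the PRODUCT label used below are hypothetical examples:

# Add custom product patterns before the statistical NER component.
import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": "Visual Studio"},
    {"label": "PRODUCT", "pattern": [{"LOWER": "expense"}, {"LOWER": "report"}]},
])

doc = nlp("The new starter needs Visual Studio and help with the expense report.")
print([(ent.text, ent.label_) for ent in doc.ents])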

Topic extraction

Determine a certain number of topics in the text or messages. There are several techniques for this, for example LDA or LSI. In Code 1.4 Example of program's code - Topic Extraction with LSI there is an example of the LSI framework - from data preparation and obtaining a suitable number of topics to extracting frequent words from each topic. This technique is able to calculate the number of topics in a document, which can be useful for dividing the document into chunks and investigating each of them with regard to context.
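A minimal sketch of LSI topic extraction with the gensim library on toy, pre-tokenized requests (not the thesis's Code 1.4):

# Build a dictionary and bag-of-words corpus, then fit an LSI model with two topics
# and print the most important words for each topic.
from gensim import corpora, models

texts = [
    ["annual", "leave", "form", "request"],
    ["new", "starter", "form", "equipment"],
    ["visual", "studio", "access", "request"],
    ["expense", "report", "form"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
for topic_id, topic in lsi.print_topics(num_topics=2, num_words=4):
    print(topic_id, topic)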

Creating a table with synonyms/antonyms

Prepare keywords for a lexical pipeline for the bots; chatbots and mailbots must have different lexical bases - it is an important first step of the Dialog Flow preparation. Making a table for teaching the bots: the algorithm must have a set of words for each intent.

Each task requires several steps of text preprocessing. The information extraction can be organized with specific steps which help to understand the existing workflow, reveal problems and bottlenecks in the work processes and find out text patterns - the more frequent questions and requests from customers or personnel, and the text structure. This knowledge is an essential part of the workflow improvement, the first step and the base for further research.


There are several approaches that are capable of giving some insights about the context of texts.

To achieve the business targets multiple techniques can be utilized, and their selection depends on whether the task is about primary classification, cluster or topic analysis, or obtaining a model. The model is usually a set of algorithms or a neural network capable of providing efficient results for a business, such as classification or feature extraction. Solving each business task requires not only Business and Data Understanding but also insight into which algorithm or machine learning technique is suitable for the certain job.

Chatbot Implementation

The cost, time and preparation for implementing a chatbot depend on its type. Bots of the first category are capable of answering specific questions - they are called Question Answering bots. In this case users usually select one of the topics, get an answer, see another list of questions and select the next question. It looks more like an FAQ, but with a certain structure, and can even be considered a part of entertainment. There are no sophisticated algorithms, only a tree structure. Still, building this dialog is impossible without preceding text analysis - at least to get the most common questions.

The second chatbot type is more sophisticated and requires training a model on prepared keywords, collocations and sentences. In this case users type questions or keywords and the system is able to understand what the customer wants to see. In other words, the model in the tool utilized for bot building must be trained on keywords, phrases and collocations. This training process is an essential part of the chatbot building, because it makes it possible to understand a customer even though he or she does not ask exactly the same question that was in the training database, only one close to it in context. Consequently, the bot development team does not need to provide the full set of rephrased questions; NLP platforms like IBM Watson, Luis.ai and Qnamaker.ai are capable of understanding from the entities - key words and tokens in phrases.

As an example of a chat bot, the dialog bot for the HR department can be considered. The purpose of this chat bot is to answer employees' questions and provide information about vacation, payment, benefits, insurance. Besides, the bot is able to search among employees in


In other words, the bot is capable both of finding answers based on employees' questions or keywords and of providing a list of services from HR. Consequently, it will be a hybrid chatbot that requires both the preparation and development of the question-answering part and a model trained on questions and answers and capable of providing results for a search.

The main question before the team starts to work on the chatbot implementation is the place where the employee can connect with the bot. There are several places where it can be built:

○ Intranet - the internal social network, which requires some frontend development for the chatbot form

○ Company site - but in this case the chatbot must distinguish visitors from the company's employees (sometimes even from certain regions)

○ Corporate messenger - where the bot-messenger connection is supported with an API.

Service Selection

The first step in building a chatbot is selecting the platform for both the question-answering and the cognitive part, ideally with both provided by one company.

The most popular services utilized for chatbot building are listed below. All of them are free if the company only wants to try to build a chatbot and test it with a limited amount of data.

Service: IBM Watson
Products: Watson Assistant (build the bot), Watson Natural Language Understanding, Watson Speech to Text, Watson Text to Speech, Text Classification
Features: Bot builder based on AI. Capable of building dialogs and connecting key words with answers
Price, monthly: From $140. The minimum price covers max 50 000 users, max 50 skills, analytics for 30 days, max 10 dialog versions

Service: Microsoft QnA Maker
Products: Building question-answer dialogs
Features: Bot builder for QnA tasks. Based on databases created manually or from the FAQ page
Price, monthly: $10 for unlimited managed documents; Azure Cognitive Search (for running the service) - Standard plan costs $0.336/hour; Azure App Service (for hosting the data) - Basic costs $0.075/hour

Service: Azure Bot Service
Products: Building multi-language chatbots for different purposes
Features: Creates chatbots with customization for different languages
Price, monthly: $0.50 per 1,000 messages, for unlimited messages

Service: Microsoft LUIS - Language Understanding Service
Products: Controlling IoT devices using a voice assistant, message classification
Features: More targeted at text/speech classification. Broadly utilized for message classification
Price, monthly: Text requests - $1.50 per 1,000 prediction transactions; speech requests - $5.50 per 1,000 prediction transactions

Service: Amazon Lex - Conversational AI for Chatbots
Products: Speech recognition and natural language understanding, context management, 8 kHz telephony audio support, multi-turn dialog
Features: Chatbot and voicebot builder
Price, monthly: 8,000 speech requests - $32.00; 2,000 text requests - $1.50

Table 1. Services for Chatbots building (Author)

The most convenient way is to use two different platforms from Microsoft - QnA Maker for the question-answer part (capable of answering specific questions from the list and connecting to databases) and Azure Bot Service for the cognitive search part (can answer questions that employees print in the bot window). There are at least two reasons:

○ Costs - these services are cheaper, and a common subscription will give the possibility to track spending and make it more transparent.

○ Process organization - monitoring and releases in one place, same environment.


The prerequisites can be divided into several steps: data analysis and preparation, dialog creation and connection understanding, and training the bot on the platform.

According to Microsoft Cognitive Services, QnA Maker helps "to build, train and publish a sophisticated bot using FAQ pages, support websites, product manuals" (Qnamaker.ai). The Azure bot, on the other hand, offers the platform for building the cognitive service for text understanding. In other words, the pretrained model learns on answers, keywords and questions and is able to be retrained, evolve and become more accurate.

Id / Item / Description / Input / Output

1. Data Preparation
1.1 Data Analysis - Analyse requests from the semantic and lexical point of view. Input: data from the support service or HR department. Output: Excel with common words and bigrams.
1.2 Questions/Answers - Create questions and answers for dialog and model training. Input: 1.1. Output: Excel with questions, including rephrasings, and answers.

2. Dialog building
2.1 Connections - Create the list of necessary connections with different databases and landings. Input: 1.2. Output: document with all connections mentioned: Intranet DB connection (employees database), list of URLs of regional HR pages, HR Document DB.
2.2 Dialog Flow Diagram - Draw the diagram with the dialog flow. Input: 2.1, 1.2. Output: diagram with all steps and connections represented.
2.3 Dialog Design - Visualisation of the dialog window. Input: 1.1, 2.2. Output: the layout of the dialog window.

3. Services
3.1 Azure Bot - Train the model with keywords, answers and questions. Input: 2.2, 1.1. Output: trained chatbot.
3.2 QnA - Set questions and answers. Input: 2.2, 1.2. Output: QnA database.

Table 2. Chatbot Requirement Catalog (Author)

Dialog Flow Design for HR Chatbot

The Dialog Flow is utilized to provide the scheme for the bot. It is a base architectural decision that can be considered the first step of the chatbot creation.

Furthermore, the Dialog Flow documentation determines not only the dialog itself, but also the connections between the services and databases. For example, if an employee asks about the remaining vacation days, the bot will transfer him to the page with this information, or if some worker needs the telephone number of a certain department, the bot will provide him with the number. The hidden pitfall is access control - each user should see only specific results; this must be emphasised in the diagram and represented as an additional question with a proposal to register or provide a code.

Another obstacle can be the organisation and availability of databases with which connections were established. There must be specific requirements and agreement with the database owners.

As can be seen from the dialog flow, from the beginning the user can either select an answer (then QnA Maker is used) or type the question (Azure Bot). After each session the chatbot asks to rate the service; this interaction is important for understanding customer satisfaction and can be used for model retraining and database improvements.

The advantage of the dialog flow diagram is that it basically can be used for implementing chatbots with different languages - it is the universal schema for the flow and does not specify question and answer. However, for each language the table with questions and answers must be created.


Figure 5. The DialogFlow for QnA HR bot (Draw.io)


Building Dialogs in Microsoft QnA Maker

Besides the dialog flow, an Excel file with questions for the bot or a keyword base for the bot's skills must be provided as an input.

In the given example the Dialog Flow Q/A part may be built on the most common questions obtained from data analysis - mostly, messages from employees. The example of the data analysis that must be conducted to get this information was provided above.

The second step (after data analysis and keywords, topic extraction) is the creation of questions and answers. QnA Maker (or the alternative - IBM Watson) is capable of working with several data formats such as .json, .pdf, .xlsx.

Set of question-answer pairs in .xlsx format:

Figure 6. Questions - Input Example for Dialog Building (Excel)

After all questions (in alignment with the dialog flow) are prepared, it is possible to take the next step - creating the dialog inside the chatbot building platform. As can be seen from the example, the QnA service offers to add alternative phrasings for each question, which increases the bot's ability to understand what the employee wants to know or get. Training with several items usually takes several minutes and it is very easy to conduct retraining - to add new questions or extend existing ones.


Figure 7. QnA Dialog Building Process (Qnamaker.ai)

After the dialog building, the chatbot is ready for testing and deployment. The module with the bot can be added to the site, application or messenger (for instance, Teams). In the case of an internal HR chatbot the design is not so crucial, since it is not a product for customers and does not need strong marketing support such as an avatar or even some character.

For the mockup several entities must be determined:

- Base colors and fonts
- Dialog cloud shapes
- Bot logo and how the employee logo may look (shape, source)
- Open and close message design
- Multichoice option design


Table 3. Chatbot design layout (Botframe.com)

Chatbot Managing and Developing

The HR bot from the example can be managed and maintained by a small team; it is not necessary for all members to have a full allocation to the project, because the main part of the work (data storage, training, backend) is done by the chatbot platform. Usually the most time-consuming part of the work is approving the dialog flow and the questions with the business. The minimum requirements regarding the team for building, developing and maintaining the chatbot are provided in the RACI chart:

Project Activities - Project Manager / Dialog Architect / Data Analyst / Backend Developer / Frontend Developer

Dialog Creation - A / I / R / I / C
Data Analysis - C / C / R / I / I
Dialog Architect Mockup - I / R / I / I / C
Dialog Building - I / R / I / A / C
Backend Integration - A / I / I / R / C
Dialog Frontend - A / C / I / C / R
Project Planning - R / I / I / I / I
ChatBot Deployment - A / I / I / R / C
ChatBot Testing - R / I / I / A / C
ChatBot Retraining - A / I / R / R / I

Table 4. RACI Matrix - chatbot (Author)


According to jobs.cz, the average salaries for the given specialists in the Czech Republic are:

Project Manager - 50 000 - 70 000 Kč
Data Analyst - 50 000 - 70 000 Kč
Backend Developer - 70 000 - 100 000 Kč
Frontend Developer - 40 000 - 60 000 Kč

Nevertheless, as was mentioned before, the personnel cost can be split between departments that are also willing to have chatbots. It is a typical situation for international companies that need at least the same bot in different languages.

The chatbot also requires additional money for monthly support such as data collection, retraining and improvement, reporting and bug fixing. Even so, in the majority of cases the biggest share of the budget is allocated to data preparation and bot building, which usually takes a minimum of one month of teamwork.

Mailbot Implementation

In comparison with a chatbot, the mailbot usually requires more time both for data investigation and for bot building. The mailbot can be considered a tool that covers several features - mail classification, information extraction from text, and preprocessing text data for business needs (for example, organizing the email structure). For classification, a bot can be built with the same tools mentioned in the chatbot section. Nevertheless, for this a precise data analysis must be conducted to extract keywords and collocations for each label.

Moreover, these keywords usually change over time and the bot often requires retraining. But what if each label has the same keywords, with differences so small that they are not noticeable to the tool?

In this situation the team creates a custom model for text classification, and the selection of the algorithm is a question not only from the technical point of view but also from the business one - some approaches are more costly than others.

Id / Item / Description / Input / Output

1. Data
1.1 Emails Analysis - Analyse emails from employees. Input: emails from the support service or HR department. Output: Excel with keywords, topic extraction, label checking and investigation.
1.2 Data Preparation Framework - Create a framework for data cleaning and preparation. Input: 1.1. Output: script for data preparation.
1.3 Data Storage - Set up Data Storage - create the account name, organize the key vault. Input: 1.1. Output: Table Storage with partition keys; Blob Storage for trained and tested data.

2. Model
2.1 Service - Determine the resource group and clusters for the model training. Input: 1.3. Output: document with all connections mentioned: Intranet DB connection (employees database), list of URLs of regional HR pages, HR Document DB.
2.2 Model - Set the model that will be utilized for data classification. Input: 1.3. Output: model outputs, evaluation report.

Table 5. Mailbot Requirement Catalog (Author)

Model Selection for Classification

One of the crucial steps is selecting the algorithm for the classification - it is directly connected with costs, since some models require more computing power for data preparation, training and testing, while others can be cheaper from the maintenance side but are not flexible and cannot be retrained.

As was mentioned before, one of the most popular approaches for classification and entity recognition is BERT - pre-training of Deep Bidirectional Transformers for Language Understanding, a machine learning technique for a variety of NLP tasks developed by Google in 2018. "As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once" (2018, Horev, R). Consequently, it is possible to say that BERT considers several factors for language understanding and model training.


Figure 8. BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings (Google AI Language)

As can be understood from the description, BERT not only splits the text into words and sentences but uses several approaches to deal with the data and to catch the context and word sequence. However, it requires a lot of time and resources for training and development, and a huge volume of data (for each label).

Nevertheless, models based on BERT show impressive results. According to the paper published by Google AI, it achieves state-of-the-art "results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1" (2019, Google AI Language).
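A minimal sketch of how a BERT-based classifier can be set up with the Hugging Face transformers library; the seven-label setup mirrors the HR case discussed later, and fine-tuning, data loading and hyperparameters are omitted.

# Load a pre-trained BERT encoder with a fresh classification head and run one email
# through it; without fine-tuning the probabilities are not yet meaningful.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=7)

emails = ["Please send me a salary confirmation document."]
inputs = tokenizer(emails, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probabilities = torch.softmax(logits, dim=-1)
print(probabilities)  # one probability per label; the model still needs fine-tuning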

While BERT is usually considered a sophisticated classification approach, since it takes into account not only vectorized words but also word sequence, other approaches only take into account words turned into vectors. Moreover, such algorithms require both text preparation and word vectorisation.

For instance, in Code 1.2 Example of the program's code - Classification, before running the algorithms CountVectorizer - one of the most popular word vectorization methods - was applied. Only after this was the set of algorithms run, and this is also one of the advantages of machine learning - depending on the data set, it is possible to try different methods and select the best one.

In the case of the given data set, the most significant result - almost 87% accuracy - is shown by LogisticRegression. That means the classification can be performed by the model with only 13% of mistakes, which is quite an impressive result for this approach. Nevertheless, the data set has only two labels (negative and positive) and 25 000 examples for each. In reality, there can be more labels with different volumes of data, and some labels may not have unique patterns. In such a situation the performance decreases dramatically.
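A minimal sketch of the CountVectorizer + LogisticRegression approach described above, on toy data (the thesis's Code 1.2 works with the 25 000-example sentiment data set):

# Vectorize texts with token counts, train a logistic regression classifier
# and measure accuracy on a held-out test set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

texts = ["great product", "terrible support", "really helpful", "waste of money"] * 50
labels = ["positive", "negative", "positive", "negative"] * 50

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

print(accuracy_score(y_test, pipeline.predict(X_test)))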

To understand how email classification performed by a model differs from manual classification, a business case with email classification for the HR department of an industrial company can be considered.

Case

It is determined that the emails obtained from employees are diverse. In other words, at least seven topics can be distinguished:

Jobs - questions and mailings about new vacancies
Documents Request - requesting documents like salary or work confirmation
Trainings - questions and information about trainings
Benefits - information about benefits, including insurance
Contracts - mails about contracts
Salary - information about salary and bonuses
Other - other categories

Each of these categories is assigned to different specialists and divisions; consequently, it is important not to mix the mails up - otherwise a mail can be lost and the employee will not get an answer to the request.

Manual approach

Labels are assigned by specialists from the HR department, who claim this procedure takes a minimum of one hour of working time every day. The procedure itself consists of the following parts:

○ Read the message in the common email box, mark it as resolved

○ Transfer the message and employee number to the ticketing service, label the message

○ Create ticket with explanation and comments

○ Assign ticket to the responsible person/department

○ Send the answer to the requester

This manual process has several crucial disadvantages:

○ HR specialists often forget to mark the message as already transferred to the ticketing service, which leads to task duplication in the system

○ The requester does not get a reply with the ticket number, hence he cannot see the progress

○ Answers must be sent by email, and HR specialists spend some time finding the email in the box and writing the answer (since it is not automated).

ML approach

Labels are assigned by the classification model, which reads each email and outputs a .json file with probabilities for the classes. The model transfers the output with the results and the mail text to the ticketing service via API. The ticketing service, triggered by the model's request, automatically creates a ticket with the label and assigns it to the responsible division or person; simultaneously, the employee receives the ticket number by email and is then able to add information and check the progress.

The process takes several seconds, and the HR department has to process manually only the emails the model could not determine, whose share can be 5-10% with a high-accuracy model.

Hence, the automated classification saves time for the HR department's workers, supports communication with employees and organizes the workflow, from obtaining the request to solving it. The model is capable of understanding mails in multiple languages, which allows the classification to be performed without quality loss.
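
The automated flow described above could look roughly like the following sketch. The ticketing endpoint URL, the payload fields, and the already trained model and vectorizer objects are hypothetical assumptions for illustration, not a description of a specific service.

# Sketch: classify an incoming mail, build a JSON payload with class
# probabilities and create a ticket via the ticketing service's API.
# The endpoint URL and fields are hypothetical placeholders; `model` and
# `vectorizer` are assumed to be an already trained scikit-learn classifier
# and its fitted vectorizer.
import requests

def classify_and_create_ticket(mail_text, employee_id, model, vectorizer):
    # Class probabilities from the trained classifier
    probs = model.predict_proba(vectorizer.transform([mail_text]))[0]
    classes = list(model.classes_)   # e.g. the seven categories listed above
    payload = {
        "employee_id": employee_id,
        "text": mail_text,
        "probabilities": {c: float(p) for c, p in zip(classes, probs)},
        "label": classes[int(probs.argmax())],
    }
    # The ticketing service creates the ticket and notifies the requester by email
    response = requests.post("https://ticketing.example.com/api/tickets", json=payload)
    return response.json()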

Nevertheless, there are some challenges for implementation of such services.

○ Organization of data collection - the business should decide where and how the text will be collected; a service has to get the mail from the mailbox and feed it to the model. In other words, there are at least three participants in the process: the data collection service (text, personal data, timestamp), the classification model and the ticketing service. Basically, all this infrastructure can be built on some cloud system like AWS or Azure ML.

○ Data cleaning. Each mail must be prepared for classification (cleaned of artifacts and, in some cases, vectorized). If it is planned to store mails in the cloud, personal data such as names, telephone numbers and email addresses should be deleted (a minimal cleaning sketch follows this list).

○ Model retraining. There must be a decision about the amount of data sufficient for retraining and about the frequency of the retraining process (for model improvement).
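
As an illustration of the data cleaning challenge, the sketch below strips simple artifacts and masks personal data before classification. The regular expressions are deliberately simplified assumptions, not a production anonymizer.

# Minimal mail-cleaning sketch: drop HTML artifacts and mask personal data
# (e-mail addresses and phone numbers). The patterns are simplified examples.
import re

def clean_mail(text):
    text = re.sub(r"<[^>]+>", " ", text)                         # drop HTML tags
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)   # mask e-mails
    text = re.sub(r"\+?\d[\d\s()-]{7,}\d", "[PHONE]", text)      # mask phone numbers
    return re.sub(r"\s+", " ", text).strip()                     # normalize whitespace

print(clean_mail("<p>Please call +420 123 456 789 or write to jan.novak@firma.cz</p>"))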


In sum, the mailbot building process is more time consuming than the chatbot one and also requires more expertise in data preparation, algorithms and parameter tuning. Nevertheless, the classification of emails, requests and support messages is extremely important for customer service improvement: it helps to answer faster and organize tasks better, and it is worth all the investment and effort spent on it.

Model Cost Calculation

Since the mailbot requires more time and preparation, it also needs specialists who are capable of building the data science part and making the architectural decisions regarding service integration. The RACI chart demonstrates the minimum requirements for the mailbot team.

Project Activities                 Project Manager   Data Scientist   Data Analyst   Backend Developer   System Architect
Data Extraction and Preparation           A                 I               R                 I                  I
Data Analysis                             C                 C               R                 I                  I
Model Building - Train/Test               I                 R               I                 I                  C
Retraining Model                          I                 R               I                 A                  C
Backend Integration                       A                 I               I                 R                  C
Project Planning                          R                 I               I                 I                  C
Deployment                                A                 I               I                 R                  C
Service Integration                       R                 I               I                 A                  C
Service Testing                           I                 I               I                 C                  R

Table 6. Mailbot RACI Matrix (Author)


In addition to the salaries of the specialists mentioned in the chatbot chapter, mailbot building also requires a Data Scientist and a System Architect, whose salaries are approximately 70 000 - 100 000 Kč. Furthermore, the mailbot creation process cannot be performed as fast as the chatbot one; it will take at least half a year.

The cost calculation usually includes not only personnel but also a platform for model building and all data operations. For example, Azure ML works in this case as Pay-As-You-Go, meaning that developers pay only for the clusters used to run the algorithms. The price for one instance (that can train, test and deploy the model) starts from $0.10 per hour. For BERT model training the cost can increase dramatically, because this model needs a powerful GPU. In addition, Azure Cognitive Services charges for extra tasks connected with NLP that are useful for mailbot building.
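
A back-of-the-envelope estimate of the compute part of the cost can be made from the listed instance price. The usage hours and the GPU instance price below are purely hypothetical assumptions chosen for illustration.

# Rough compute cost estimate for the pay-as-you-go model.
# The basic instance price comes from the text ($0.10/hour); the GPU price
# and the monthly usage hours are hypothetical assumptions.
cpu_price_per_hour = 0.10      # basic instance for training/testing/deployment
gpu_price_per_hour = 3.00      # assumed price of a GPU instance for BERT training
monthly_cpu_hours = 100        # assumed experimentation and retraining time
monthly_gpu_hours = 20         # assumed BERT fine-tuning time

monthly_compute_cost = (cpu_price_per_hour * monthly_cpu_hours
                        + gpu_price_per_hour * monthly_gpu_hours)
print(f"Estimated monthly compute cost: ${monthly_compute_cost:.2f}")  # $70.00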

Figure 9. Azure Cognitive Services pricing (Azure.microsoft.com)


Approaches and methods for NLP Models building

Model selection and building is one of the most important steps in the majority of business tasks such as classification of mails or reviews, information extraction from documents, and text mining.

Utilisation of tools and platforms based on NLP is also usually performed with custom scripts written in Python, R or Java, or at least complemented by them. These scripts can be created for data preprocessing, custom models and algorithms, evaluation metrics and reports.

For example, some libraries (SpaCy and NLTK) contain pre-trained linguistic models for multiple languages. SpaCy maintains core models for Chinese, Danish, Dutch, English, French, German, Greek, Italian, Japanese, Lithuanian, Norwegian Bokmål, Polish, Portuguese, Romanian, Spanish and Russian; in total, about twenty languages.

What does a core model mean? Usually the authors use data sets with news, open-source sites like Wikipedia and open digital libraries. Each core model contains millions of words, each marked as a part of speech and as an entity (geographical, name, date etc.) and represented as a vector. Thus, the models can be utilised for all types of data analysis and almost all algorithms: thanks to the vector representation, the problem of algorithms working only with numerical data does not exist anymore. For instance, consider de_core_news_sm, the German language model from SpaCy. According to the SpaCy documentation, it is a "German multi-task CNN" trained on:

○ TIGER - 900 000 tokens of German newspaper text, taken from the Frankfurter Rundschau (University of Stuttgart, Institute for NLP)

○ WikiNER corpora - "Learning multilingual named entity recognition from Wikipedia", Artificial Intelligence 194 (DOI: 10.1016/j.artint.2012.03.006)

Nevertheless, both SpaCy and NLTK libraries can be considered mostly for obtaining linguistic features - tokenization, POS-tagging, morphology, lemmatization, dependency parsing, entity annotations and labels, NER, entity linking.
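
A minimal sketch of obtaining such linguistic features with spaCy is shown below. It assumes the small English core model has already been downloaded (python -m spacy download en_core_web_sm), and the sentence is an arbitrary example.

# Minimal spaCy sketch: tokenization, lemmatization, POS tags, dependency
# labels and named entities from one example sentence.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The HR department in Prague received three requests on Monday.")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)   # token-level features

for ent in doc.ents:
    print(ent.text, ent.label_)                               # named entities (e.g. Prague -> GPE)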

Furthermore, with each linguistic model it is possible to build custom solutions for a variety of business tasks. In other words, core models can be improved for study or business purposes, but building pre-trained models requires vast text data sets. In spaCy there are project templates that provide a workflow for training custom models, adding and controlling data, and determining features.

Figure 10. End-to-end spaCy workflows (SpaCy.io)

SpaCy and NLTK can be enough for some tasks connected with linguistic features, data extraction and labeling, but what about classification and cluster analysis? For this kind of work, ML libraries can be utilized. One of the largest and most popular is Scikit-learn, which offers different types of algorithms for classification and clustering. The library contains algorithms and utilities supporting all the necessary steps: training, hyperparameter tuning and model validation (a minimal sketch follows the list below).

○ Classification, i.e. identifying which category a text belongs to, offers a variety of algorithms: Linear Models, Support Vector Machines, Stochastic Gradient Descent, Nearest Neighbors, Naive Bayes, Decision Trees, Supervised Neural Networks.

○ Feature selection - the process of selecting features (for text, words) that influence the decision about the label.

○ Model selection and evaluation - cross-validation, tuning the hyper-parameters, metrics and scoring

○ Standardization, mean removal and variance scaling.
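
The sketch below shows how a comparison similar to the one summarized in Tables 7 and 8 can be assembled with Scikit-learn: several classifiers are evaluated with 3-fold cross-validation on the same vectorized texts. The toy texts and labels are placeholders, not the thesis data set.

# Compare several scikit-learn classifiers by accuracy over 3 cross-validation folds.
# `texts` and `labels` are toy placeholders for the prepared data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

texts = ["example request about salary", "question about trainings", "contract issue"] * 10
labels = ["Salary", "Trainings", "Contracts"] * 10

features = TfidfVectorizer().fit_transform(texts)

models = [MultinomialNB(), LogisticRegression(max_iter=1000),
          LinearSVC(), RandomForestClassifier()]
for model in models:
    scores = cross_val_score(model, features, labels, cv=3, scoring="accuracy")
    for fold, acc in enumerate(scores):
        print(type(model).__name__, fold, round(acc, 4))   # model, fold, accuracy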

Code 1.2 (Example of the program's code - Classification) presents a comparison of four different models evaluated by accuracy and trained with cross-validation. In the given example the data set is unbalanced, which leads to more than acceptable overall results but affects the classes with smaller amounts of data; this can be crucial if the business requires the same quality for all classes.

Model Name               Folder   Accuracy
MultinomialNB            2        0.875370
LogisticRegression       0        0.870205
MultinomialNB            1        0.869619
LogisticRegression       1        0.867154
MultinomialNB            0        0.866344
LinearSVC                2        0.865922
LinearSVC                1        0.858774
RandomForestClassifier   2        0.758298
RandomForestClassifier   0        0.757825
RandomForestClassifier   1        0.749261

Table 7. Algorithms Testing and comparison with unbalanced data (Jupyter Notebook)

The same experiment can be conducted with balanced data, which predictably decreases the overall accuracy but makes the accuracy for each class more balanced. Before the sampling, the model predicted in favor of the major class.

Model Name               Folder   Accuracy
MultinomialNB            0        0.7975
MultinomialNB            2        0.7960
MultinomialNB            1        0.7940
LogisticRegression       1        0.7855
LogisticRegression       0        0.7740
LogisticRegression       2        0.7740
LinearSVC                1        0.7540
LinearSVC                2        0.7480
LinearSVC                0        0.7455
RandomForestClassifier   1        0.7000
RandomForestClassifier   0        0.6795
RandomForestClassifier   2        0.6780

Table 8. Algorithms Testing and comparison with balanced data (Jupyter Notebook)

As mentioned before, each model from the list requires preprocessed data and a suitable data set for training and testing. According to the Google Machine Learning documentation, the training set is "a subset to train a model" and the test set is "a subset to test the trained model".

Before training, the data was vectorized and separated into the training set (70% of reviews) and the test set (30%). The ratio can differ - 80/20, 90/10 or even 50/50 - but the training set usually contains more data, because the more data the model is fed, the higher its performance.

Another important step is to shuffle the data, especially if the data set consists of emails, which are typically ordered chronologically. If the data is sufficient and resources allow multiple trainings, it is also typical to keep a validation set for evaluating the model during hyperparameter tuning.
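
A minimal sketch of such a separation is shown below. It assumes a 70/15/15 train/validation/test split with shuffling; the placeholder data frame stands in for the prepared data set (the 70/30 variant from the text would simply omit the second split).

# Train/validation/test separation with shuffling (70/15/15 as an assumed example).
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data set; in practice this would be the prepared review/e-mail data
df = pd.DataFrame({"text": [f"message {i}" for i in range(100)],
                   "label": ["positive" if i % 2 else "negative" for i in range(100)]})

train_df, rest_df = train_test_split(df, test_size=0.30, shuffle=True, random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=0.50, shuffle=True, random_state=42)
print(len(train_df), len(val_df), len(test_df))   # 70 15 15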


Figure 11. Data Separation schema. Example (Draw.io)

Besides text preparation and normalization, data sets frequently require other preprocessing steps before the training and testing separation. A standard problem is a huge imbalance between labels, which confuses models and makes them label data in favor of the class with the largest amount of data.

There are two different approaches to balancing: upsampling (multiplying rows from the minor classes) and downsampling (reducing the number of examples from the major classes).
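
Both approaches can be sketched with sklearn.utils.resample; the toy data frame with a 90/10 imbalance is an assumption used only for illustration.

# Upsampling and downsampling of an imbalanced data set with sklearn.utils.resample.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"text": ["msg"] * 100,
                   "label": ["negative"] * 90 + ["positive"] * 10})
majority_df = df[df["label"] == "negative"]
minority_df = df[df["label"] == "positive"]

# Upsampling: replicate rows of the minor class until it matches the major one
upsampled = resample(minority_df, replace=True, n_samples=len(majority_df), random_state=42)
balanced_up = pd.concat([majority_df, upsampled])

# Downsampling: keep only a sample of the major class of the minor class's size
downsampled = resample(majority_df, replace=False, n_samples=len(minority_df), random_state=42)
balanced_down = pd.concat([downsampled, minority_df])

print(balanced_up["label"].value_counts().to_dict(),
      balanced_down["label"].value_counts().to_dict())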

Another threat is overfitting and underfitting. To avoid underfitting, several techniques can be applied: "Increase model complexity, raise the number of features, remove noise from the data, increase the number of epochs or increase the duration of training to get better results" (Underfitting and Overfitting in Machine Learning, 2020).

Overfitting can be dealt with by increasing the training data, decreasing model complexity and using cross-validation: "cross validation avoids overfitting and is getting more and more popular, with K-fold Cross Validation being the most popular method of cross validation" (Shah, T., 2017).

Unfortunately, in some cases it is difficult to understand which steps must be performed before training, and this can be found out only during experiments.

For text analysis, pattern investigation, and the understanding of topics and workflows, not only algorithms from Scikit-learn are utilized. For instance, context analysis includes several approaches which can help to understand the meaning of certain texts without reading them. One of the most popular is LSI, a model able to handle large corpora that is provided by the Gensim library. As the corpus, the model takes all the preprocessed texts; the decomposition is explained in the article Fast and Faster: A Comparison of Two Streamed Matrix Decomposition Algorithms (2011). According to the article, LSI uses Singular Value Decomposition (SVD). In the Code 1.4 example, the first step is to determine the proper number of topics. Simply speaking, the model tries to estimate how many topics can be found in the documents; in the case under study it is a data set with reviews of clothes bought in some e-commerce shop.

Figure 12. LSI - Amount of topics estimation (Jupyter Notebook)
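
The sketch below shows how such an LSI model can be built with Gensim on already preprocessed (tokenized and stemmed) texts. The four toy documents are placeholders, and estimating the number of topics with a coherence score is only one common approach, assumed here for illustration.

# Minimal Gensim LSI sketch: build a dictionary and bag-of-words corpus,
# compare coherence for several topic counts, then print the topic words.
from gensim import corpora
from gensim.models import LsiModel, CoherenceModel

docs = [["sent", "updat", "issu", "approv"],
        ["inform", "need", "access", "help"],
        ["confluenc", "approv", "sent", "issu"],
        ["inform", "date", "need", "submit"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Estimate a suitable number of topics by comparing coherence scores
for num_topics in range(2, 5):
    lsi = LsiModel(corpus, id2word=dictionary, num_topics=num_topics)
    coherence = CoherenceModel(model=lsi, corpus=corpus, dictionary=dictionary,
                               coherence="u_mass").get_coherence()
    print(num_topics, round(coherence, 3))

# Words with the highest weights for each topic of the selected model
for topic_id, topic in LsiModel(corpus, id2word=dictionary, num_topics=4).print_topics():
    print(topic_id, topic)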

After the number of topics has been estimated, it is possible to see the features or words that determine each topic. The output consists of words with "coordinates", or weights, for each item. For example, topic one is determined by information about changing, approving and updating an employee's status and solving issues and errors.

Topic #1     Topic #2    Topic #3       Topic #4
"sent"       "inform"    "confluenc"    "inform"
"updat"      "need"      "approv"       "date"
"issu"       "access"    "sent"         "need"
"approv"     "help"      "issu"         "submit"
