
VŠB – Technical University of Ostrava Faculty of Electrical Engineering and

Computer Science

Analysis of process data and their social aspects

PHD THESIS

2018 Martin Kopka


Declaration statement

I hereby declare that this PhD thesis was written by myself. I have quoted all the references I have drawn upon.


Acknowledgements

I would like to express my thanks to my supervisor, prof. RNDr. Václav Snášel, CSc., for supporting my doctoral studies, for his motivation and the extensive knowledge that led me to find methods and solutions. His guidance has helped me throughout the research and writing of this work.

Furthermore, I would like to express my thanks to doc. Mgr. Miloš Kudělka, PhD. for the opportunity to consult with him on procedures and ideas in the area of data analysis, as well as the opportunity to work with him on experiments and for his time and positive energy during consultations.

My special thanks to my family for their support throughout my studies, especially during the completion of my work.


Abstract

Information systems support and ensure the practical running of most critical business processes. For most of these processes, records (logs) of the process running in the information system exist or can be reconstructed, with information about the participants and the processed objects. Computational data mining methods can be used to analyze process data, supported by techniques of machine learning and complex network analysis. Process mining is able to analyze and reconstruct the model of a running process from its process log. Analyzing participant behavior by transforming the process log into a complex network of its participants is not a widely used approach; much more frequently, only quantitative parameters are analyzed. Here we show how data mining and process mining methods can be used to analyze a running process, and how participant behavior can be analyzed from the process log using network (community or cluster) analysis of a complex network constructed from the SAP business process log. This work formulates and develops a methodology covering data integration, pre-processing and transformation, and data mining with subsequent interpretation and decision support – the work was realized and experimentally verified on sets of real logs from an SAP business process. A modified canonical process log structure is suggested with respect to the SAP environment – in principle, it can be applied to any SAP system. The approach constructs a complex network from the process log in a given context and then finds communities or patterns in this network.

The found communities and patterns are analyzed using knowledge of the business process and the environment in which the process operates. The results demonstrate the possibility of uncovering not only quantitative but also qualitative relations (i.e. hidden behavior of participants) using the process log and specific knowledge of the business case. This approach proved to be a useful starting point for decision support analysis, helping managers gain knowledge from process data (logs). While process mining can provide a model (visual or formal) of the running process, complex network analysis can uncover behavioral relations of participants that remain hidden in quantitative models of the process log.

Key Words

Decision support, process log data, data mining, process mining, SAP log, graph construction, network construction, visualization (visual data mining), community detection, graph clustering, pattern analysis, outlier analysis, behavior.


Table of Contents

DECLARATION STATEMENT ... 2

ACKNOWLEDGEMENTS ... 3

1 LIST OF USED SYMBOLS AND ABBREVIATIONS ... 8

2 LIST OF ILLUSTRATIONS ... 9

3 LIST OF TABLES ... 11

4 INTRODUCTION ... 13

4.1 MOTIVATION ... 13

4.2 ACTUAL STATUS ... 14

4.3 WHAT IS THE PROBLEM ... 14

4.4 WHAT CAN BE IMPROVED? ... 15

4.5 RESEARCH OBJECTIVES ... 16

4.6 STRUCTURE OF THE THESIS ... 16

5 RELATED WORKS ... 17

5.1 PROCESS MINING ... 19

5.2 PROCESS MINING AND SOCIAL NETWORKS ... 20

5.3 COMPLEX NETWORKS ... 21

5.4 (BIG) DATA MINING: DATA PREPARING AND PREPROCESSING ... 22

6 THEORETICAL BACKGROUND ... 23

6.1 DATA MINING ... 23

6.1.1 Data ... 23

6.2 CLUSTERING METHODS ... 24

6.2.1 Similarity ... 25

6.2.2 Construction of network and clusters identification ... 25

6.2.3 Representative of cluster – patterns ... 26

6.3 COMPLEX NETWORKS, REPRESENTATION AND MEASURING ... 27

6.3.1 Representation ... 27

6.3.2 Bipartite network projection ... 27

6.3.3 Degree ... 28

6.3.4 Local and global distances ... 28

6.3.5 Structural network properties ... 28

6.3.6 Mean of dataset ... 29


6.3.7 Silhouette ... 29

6.3.8 Clustering coefficients ... 30

6.3.9 Detection of outliers ... 31

6.4 XES STANDARD ... 31

6.5 SAP BUSINESS WORKFLOW ... 32

7 AIMS OF THE THESIS ... 34

8 VALUE ESTIMATION OF USE CASE ... 35

Conclusion ... 35

Presented result ... 36

9 PROCESS DATA ANALYSIS ... 37

9.1 USED CONCEPT OF KDD AND SPECIFIC TASKS ... 37

9.1.1 Integration ... 39

9.1.2 Pre-processing SAP data ... 41

9.1.3 Transformation ... 42

9.1.4 Data mining and Interpretation ... 44

9.1.5 Domain based interpretation ... 44

9.1.6 Qualitative validation ... 44

9.1.7 Feedback ... 44

9.2 PROCESS MINING APPLICATION FOR PROCESS IN SAP ... 44

Conclusion ... 46

Another visualization method ... 46

Experiments and results ... 47

Presented result ... 48

9.3 SOCIAL NETWORK ANALYSIS OF BUSINESS PROCESS IN SAP(L1) ... 48

9.3.1 Business process and data model ... 49

9.3.2 Used approach ... 49

9.3.3 Specifics of data preparation phase ... 52

9.3.4 Data mining analysis, results ... 52

9.3.5 Conclusion of L1 ... 53

9.3.6 Presented result ... 54

9.4 NETWORK ANALYSIS FOR BUSINESS PROCESS IN SAP(L2) ... 54

9.4.1 Business process and data model ... 54


9.4.2 Used approach ... 54

9.4.3 Hypothesis ... 56

9.4.4 Data mining analysis for Users – User-Attribute, results ... 57

9.4.5 “Recursive” analysis of input data from specific cluster ... 67

9.4.6 Model for back analysis of objects from patterns ... 76

9.4.7 Data mining analysis for Vendors – Vendor-Attribute, results ... 81

9.4.8 “Recursive” analysis of input data from specific cluster 3 ... 90

9.4.9 Conclusion of L2 ... 97

9.4.10 Presented result... 97

10 CONCLUSION ... 98

11 AUTHOR'S PUBLICATIONS ... 100

12 REFERENCES ... 101

APPENDIX 1 ... 105


1 List of used symbols and abbreviations

Symbol / Abbreviation Annotation

Business document Business document in SAP system means any business level document that is represented by its transactions, screens and relations (purchase order, sales order, delivery note, invoice, production order, ...)

CRUD Abbreviation for basic operations on database (Create/insert, Read, Update, Delete)

EDI Electronic Data Interchange

iDOC Data format for data interchange among SAP systems

KDD Knowledge Discovery from Data

SAP Producer of enterprise application software (www.sap.se)

XES XML-based standard for event logs


2 List of illustrations

Fig. 1. Picture representation of network (directed x undirected graph) ... 27

Fig. 2. Visualization of silhouette of clusters ... 30

Fig. 3. UML class diagram for model of XES standard ... 32

Fig. 4. Approach of decision support in use-case evaluation ... 35

Fig. 5. Used concept of Knowledge Discovery from Data ... 37

Fig. 6. Process model - Path 1 ... 45

Fig. 7. Process model - Path 2 ... 45

Fig. 8. Process model - Path 6 ... 46

Fig. 9. Most used sequence of the process visualized by turtle graphics ... 48

Fig. 10. Topology deviations of the sequence... 48

Fig. 11. Two mode (affiliation) network ... 50

Fig. 12. Constructed network ... 50

Fig. 13. Local community illustration ... 50

Fig. 14. Visualization of detected communities ... 52

Fig. 15. Data mining tasks ... 57

Fig. 16. Visualization of clusters in 9.4 ... 62

Fig. 17. Degree distribution in D1 for 9.4 ... 62

Fig. 18. Connections of high-degree users in 9.4 ... 64

Fig. 19. Silhouette of patterns in 9.4 ... 64

Fig. 20. Representatives of patterns in D1 for 9.4 ... 65

Fig. 21. Distribution of Avg time in D1 Pattern 1 in 9.4 ... 66

Fig. 22. Comparison of representative and outliers in D1 cluster 1 in 9.4 ... 67

Fig. 23. Difference in attributes (compare outliers with representative) in D1 cluster 1 in 9.4 ... 67

Fig. 24. Visualization of clusters for D2 in 9.4 ... 72

Fig. 25. Degree distribution in D2 for 9.4 ... 73

Fig. 26. Silhouette of patterns of D2 in 9.4 ... 74

Fig. 27. Representatives of patterns in D2 for 9.4 ... 75

Fig. 28. Distinction of patterns to base in D2 ... 76

Fig. 29. Principle of finding pattern for new object ... 77

Fig. 30. Principle of identifying nearest objects for pattern Pr ... 78


Fig. 31. Visualization of clusters in 9.4.7 ... 86

Fig. 32. Degree distribution in D3 for 9.4.7 ... 87

Fig. 33. Silhouette of patterns in 9.4.7 ... 88

Fig. 34. Representatives of all patterns in D3 for 9.4.7 ... 88

Fig. 35. Representatives of non-trivial patterns in D3 for 9.4.7 ... 89

Fig. 36. Distribution of Avg time in D3 Pattern 3 in 9.4.7 ... 90

Fig. 37. Visualization of clusters for D4 in 9.4.8 ... 94

Fig. 38. Degree distribution in D4 for 9.4.8 ... 95

Fig. 39. Silhouette of patterns of D4 in 9.4.8 ... 95

Fig. 40. Representatives of patterns in D4 for 9.4.8 ... 96

Fig. 41. Distinction of patterns to base in D4 ... 97


3 List of tables

Table 1. Normalized average values of pattern (model) ... 26

Table 2. Confidence interval ... 26

Table 3. System tables used for SAP workflow log ... 40

Table 4. System tables used for SAP change management trigger ... 40

Table 5. System tables used for SAP status log ... 41

Table 6. System tables used for SAP application log ... 41

Table 7. System tables used for SAP EDI log ... 41

Table 8. Set of data collected from SAP ... 43

Table 9. Canonical log structure with relation to SAP dataset ... 43

Table 10. Interesting paths in the process model ... 45

Table 11. Interpretation of process steps in turtle graphics ... 47

Table 12. Log structure for 9.3 ... 51

Table 13. Interpretation of found communities in 9.3 ... 53

Table 14. Found communities A1-4, B1-4 ... 53

Table 15. Source log for Network analysis ... 54

Table 16. User-Attributes/Vendors-Attributes data table for Network analysis... 55

Table 17. Checklist for data mining steps ... 57

Table 18. Summary of results of experiment 9.4... 60

Table 19. Patterns – table of profile parameters in experiment 9.4 ... 61

Table 20. Patterns – confidence interval in experiment 9.4 ... 61

Table 21. Outliers in network of experiment 9.4 ... 63

Table 22. Pattern representatives in D1 for 9.4 ... 65

Table 23. Patterns – table of profile parameters in dataset D2 ... 71

Table 24. Outliers in network of D2 in 9.4 ... 74

Table 25. Pattern representatives in D2 for 9.4 ... 75

Table 26. Patterns – types of attributes (R/C/M) in experiment 9.4 ... 78

Table 27. Finding original record in experiment 9.4 ... 81

Table 28. Typical curves for distance distribution of distances to pattern ... 81

Table 29. Summary of results of experiment 9.4.7 ... 84

Table 30. Patterns – table of profile parameters in experiment 9.4.7 ... 84


Table 31. Patterns – confidence interval in experiment 9.4.7 ... 85

Table 32. Outliers in network of experiment 9.4.7 ... 87

Table 33. Pattern representatives in D3 for 9.4.7... 89

Table 34. Pattern 5 representative from D3 for 9.4.7 ... 89

Table 35. Pattern 3 representative from D3 for 9.4.7 ... 90

Table 36. Patterns – table of profile parameters in experiment 9.4.8 ... 93

Table 37. Patterns – confidence interval in experiment 9.4.8 ... 94

Table 38. Pattern representatives in D4 for 9.4.8... 96

Table 39. System tables available for SAP workflow ... 105


4 Introduction

Information systems process a large number of processes; some of them run automatically, many of them run directly mediated by users. A process has multiple participants – not only users, but also objects or subjects of the business process (business documents, business partners, master data, ...).

Information systems collect plenty of detailed transactional information about these objects, saved in process logs or attached databases [AA11] – this information testifies to the behavior of the business process participants over time.

4.1 Motivation

The basic motivation for this work comes from practical challenges in areas that I have been professionally engaged in since 1996, namely project management and the design, development and implementation of SAP business process workflows.

The mentioned challenges aim to help simplify decision making and/or identify a wrong or unexpected run of an SAP business process. Managers necessarily need information about the domain they are responsible for. What they can obtain from current information systems is information of a quantitative type – how many, how long, who, what. But users are connected by formal and informal interconnections, they share knowledge, and their processes and behavior can display certain common parameters that are not visible in hard numbers (behavior patterns). Patterns are interesting inputs for managers' decisions, as they can help to understand some hidden similarities.

The first challenge was the possibility of automating decision-making in typical tasks (e.g. estimating the labor time of a use-case in service/development based on classified parameters). Many service or development tasks can be described by a use-case, they can be parametrized, and a new request can then be evaluated against a previously prepared knowledge base.

We identified several challenges in the area of business process analysis based on process logs of recently executed processes. There is still a gap between the languages describing workflow management systems and formal methods for the description of processes [AH03]. As a consequence, many practical implementations do not include a formal description of the processes against which a running process could be monitored and formally evaluated.

Evaluation of process parameters relative to direct participants and roles (e.g. the biggest delay in the delivery confirmation process occurs in the XY role in the AB period, the longest processing of the vendor invoice occurs with the XY invoice type from AB suppliers) can be obtained directly and is not usually a problem.

Recognition and analysis of a process based on its process log (for example, an analysis of the typical run, the longest and shortest runs and their frequency) is not a common feature of information systems, with some exceptions (the latest version of the SAP system provides a process mining tool).

The possibility of comparing the current process with this correct / incorrect pattern model then creates an opportunity to respond to unwanted situations in the process.

Process comparison, for example in change management during company mergers, can lead to the evaluation and comparison of corresponding processes and the selection of a more appropriate model [SB17].


Recognizing relationships and behaviors of business process actors (relations and behaviors that are not apparent from the definition or role of an actor) when a given phenomenon occurs is another challenge for managers. Finding some “typical symptom” of the phenomenon would also be expected.

Finally, it is challenging to use the acquired knowledge of behavior to identify a certain phenomenon in the future, based on the fulfillment of the parameters that accompanied it.

As it turned out, the individual challenges can be generalized and more general models can be defined for their solution. I expect that it will be possible to bring the used methods back into practical use, as partial results of the applied methods have already been demonstrated on real data samples.

4.2 Actual status

Today, companies enact most of their business processes with the support of information systems. These “running processes” are represented by transactions and workflows, partly managed by the information system and partly managed by users’ decisions and activities. Given the increasing number of such processes and the competitive environment, the requirement to monitor their effectiveness, to find reserves and to ensure gradual improvement of these processes is becoming more important. It is not easy to understand whether a specific process runs efficiently, because a) usually many various activities are processed in parallel and the process definition allows plenty of process enactment variations, and b) usually no formal definition of the implemented processes running in information systems exists (only a small part of them run as strictly defined workflows [AH03]).

Information systems provide detailed technical logs of running processes at a low level (often every user activity in the SAP information system is logged). How detailed the provided log is also depends on the implementation of the specific information system.

Some quantitative parameters (such as the load of resources, their frequency of activation and direct involvement in the process over time) can be established by direct evaluation of the log, but it is not enough for a manager to work only with quantitative parameters.

4.3 What is the problem

A manager needs a set of quantitative parameters (related to a specific process) together with other important qualitative parameters (relationships among resources and participants) to be able to decide and manage process evaluation and change management. In the taxonomy introduced by Power [PO02] we speak about data-driven and knowledge-driven decision support systems.

Based on technical logs it is possible to get some quantitative parameters (load of resources, their frequency of activation and direct involvement in the process over time – user XY changed value A to value B). These fragments are not enough for decisions, but they provide technical data for subsequent analysis.

Technical logs contain real data at various levels, typically from many processes, with much noise, records of incomplete processes and so on [AA11]. It is important to provide clean samples as an input for analysis.

The reason is that technical logs combine records from different processes (not only the observed process and period) and from different levels of the process.


When a business process does not run according to a strictly defined workflow definition, there is no standard possibility to reconstruct the typical and special/abnormal process run model from historic logs and compare it to the currently running process for early watch or trend alerts.

The business process is usually not formally described in the implemented information system, but it contains many identifiers that are present in the technical log. These identifiers can be analyzed and applied in subsequent analysis.

Processes in information systems use several resources, and we concentrate on participants – users/people. The behavior of users, their decisions, delays, knowledge etc. significantly influence the process.

Neither the current SAP information system nor other common systems provide results of analyses taking into account the behavior and social aspects (relations) of participants in a common ERP process.

Such analysis could help managers to see relationships that are not visible from the process data, because they are not present in specific logs.

4.4 What can be improved?

There is a way to enable very detailed process monitoring and logging in the latest versions of SAP systems – Solution Manager Process Management. It is based on a rigorous specification of the observed processes on defined business objects, then switching on the monitoring and obtaining specific quantitative KPIs about the specific process. This possibility was also enabled in later versions of SAP (using the external but integrated components of ARIS Business Process Management).

In practice, the use of such tools is exceptional, because it brings significant additional costs and extends the implementation time. For that reason, this work does not assume this exceptional environment (where an integrated business monitor is available) – in such a case we could support the subsequent analysis by switching on a more detailed log or by preparing an additional log even from standard transactions with business process stamps.

In all cases, pre-processing of the source data log is needed. We can utilize the fact that the standard data model and standard tools used in SAP are the same in all instances, so the prepared tools remain usable across systems. In case the integrated business monitor is not available, we can use external methods of process mining for reconstructing the model of the running process – this was one part of my work, with published results [SKo13].

As most systems and business processes do not have high-level monitoring available, using process mining methods and reconstructing variants of the model of the running process (from system logs) can meaningfully help to understand how the process runs.

Results of such analysis can be used for detecting a normal/abnormal process model based on the observed parameters of the process (a similar method as in [SS12], [SS13], [SK12], [SK13]).

We can use data mining and machine learning methods that help with the analysis of the social aspects of the inspected processes (typically the analysis of participants’ behavior and relationships).

These analyses are not available in existing information systems.

Participants of a business process can be viewed as members of a social network. The social network is not defined by explicit relations as known from public social networks (friend, follower, ...), but it can be calculated using some common metrics or based on the specific process. This kind of analysis requires detailed and specific business knowledge of the user.

4.5 Research objectives

The described motivation and the analysis of related works brought me to the following research objectives.

- Formulate and develop a methodology for Knowledge Discovery from Data covering data integration (from a real SAP process log), pre-processing and transformation, and data mining (complex network analysis) with subsequent interpretation and decision support potential. Verify the approach on sets of real logs from an SAP business process. Make this methodology applicable for common usage in the SAP environment.

- Support decision-making during project management using a classified knowledge base.

- Prepare and experimentally use a method for preparing data from SAP for data mining methods.

- Use a process mining method to construct a model from the process logs of a specific process in SAP.

4.6 Structure of the thesis

This thesis is structured as follows.

The Introduction chapter presents the initial motivation for the work, the identification of the set of real problems, and the areas where improvements would help.

The Related works chapter provides an overview of known work and conclusions relevant to the discussed topics (process mining, social networks and data pre-processing). My own planned contribution to the solution is added as the last part of this chapter.

The Theoretical background chapter provides definitions and formal specifications of the tools and methods used in the work.

The Aims of the thesis chapter takes the planned contributions and defines the aims of the work.

The Value estimation of use-case chapter presents an experimentally verified decision support method for managers working with use-cases in project management.

The Process data analysis chapter formulates the approach and methodology for analyzing process log data from SAP. This methodology is used and verified in experiments with different data mining techniques. These experiments and their results are also presented in this chapter.

The Conclusion chapter summarizes the work presented in the thesis.

The last two chapters, Author's publications and References, serve as the reference list of papers and other sources.


5 Related works

In recent decades, information systems have started to become more and more process oriented [DA05].

The shift to process-oriented systems was motivated by the idea of supporting systems for the daily business, shifting the knowledge about operations that could be described as processes from humans to systems. Process-oriented systems started to be worshipped as the only way to control the processes and activities that had to be enacted. The knowledge about such processes and their enactment was transferred to information systems.

The shift from the data-oriented systems to the process-oriented systems brought companies tools to control and check the enactment of the processes and resources that are involved.

To build a quality process-oriented system means to build a system supported by a strong workflow system.

There are reasons why this is not always possible, such as cost, schedule, or frequent changes in the way transactions are done. Thus, there are many implementations of enterprise systems that are process supportive but not exactly process driven, like ERP systems (SAP ERP, CRM systems, B2B systems and many other types of enterprise information systems). There are also cases when the system provides a tool for the implementation of a process-oriented system (like S/4HANA – Solution Manager – Process Management), but the specific implementation does not use it.

Sometimes these systems are not even aware of the processes that are supported. Processes were not defined at the beginning of the systems implementation or were defined but lost in the long-term usage and maintenance of the system.

The need for proper knowledge about business data led to the usage of Business Intelligence (BI) tools. Since systems shifted from data-oriented to process-oriented, a new subdomain of BI, Business Process Intelligence (BPI), was defined. BPI and its supportive tools help users manage process execution quality through analysis, prediction, monitoring, control and optimization [GC04].

On the other hand, there is Business Process Management (BPM). BPM [WA04] can be defined as the management and optimization of the business processes of a whole company. Its concern is process improvement and its alignment with the needs of clients. The BPM lifecycle consists of design, modeling, execution, monitoring and optimization. It means that BPM takes care of the composition, enactment and analysis of the operational business processes.

Business process definitions are sometimes quite complex and allow many variations. All of these variations are then implemented in supportive systems. If you want to follow some business process in a system, you face many decisions and the process is sometimes lost in its variations.

Analyzing process data generated by information systems falls into the category of Big Data. The term Big Data was introduced in 1998 by John Mashey as a reaction to how fast data is generated by information systems, technological and user devices, and global applications. Since then, this phenomenon has affected all areas of analysis and work with data.

We recognize two basic types of big data:


- Structured data represent words and numbers that can be easily categorized and analyzed – these data are usually generated by ERP systems (financial or logistics figures, EDI, logs, ...).

- Unstructured data include more complex information (photos, multimedia, content of social media) – they cannot be easily categorized.

The research and samples processed during this work deal mainly with structured data, but they must respect the basic principles of big data introduced below.

Big data is described by the concept of Big Data 3Vs that was presented in [DO01] and later it was extended to Big Data 5Vs – Volume, Velocity, Variety, Veracity and Value. This concept is considered when processing the big data.

- Volume points out the quantity of data produced continuously. Not only enterprise applications (information systems and their digital content, sensor data), but all text, picture and video content on social networks, produced by mobile devices, e-mails and other communication streams, generate a big load of data every second. We must consider that this quantity is growing and that most of the data is stored and not deleted over time.

- Velocity points out the speed at which new data is generated and the speed at which data moves in the information space.

- Variety points out the diversity of data types produced and saved today – the previously used structured data have been supplemented by unstructured data (documents, photos, video).

- Veracity points out the trustworthiness of the data and their frequent disorder. There is less control of accuracy and quality.

- Value points out our ability to turn data into value. Enterprises must collect and leverage big data, but they must take care not to glut themselves with noise data without understanding it.

Another concept, the HACE theorem, presented in [WZ14], is used to characterize Big Data. HACE stands for heterogeneous, autonomous, complex and evolving.

- Heterogeneous data means that the same data can be represented in different formats (several information sources can contain the same data).

- Autonomous refers to the distributed computing platform – the data sources are distributed across different stand-alone systems without centralized servers.

- Complex and evolving means the growing complexity of data with the growing number of connected participants, interactions and processes.

DIKW is an often used concept of the pyramid Data –> Information –> Knowledge –> Wisdom. All data mining methods follow this concept, where

- Data is the source set of logs from information systems

- Information is constructed in terms of data; it can answer questions like who, what, where, when ... (in our case, information is the XES log prepared for processing by a specific method)

- Knowledge represents the result of the used method applied back to the source environment (i.e. the business understanding of the data mining result)

- Wisdom is the application of the result to management (change management, ...)


5.1 Process Mining

Modeling and simulations can help you to adjust the process, find weaknesses and bottlenecks during the design phase of the process. Sometimes you can guess or know the patterns and occurrence probabilities of variations that are used during the execution phase.

However, not even modeling and simulation of the processes can tell you how processes are really enacted in the system, what, for example, the percentage usage of the variations is, and whether some variations are enacted at all. If you want to analyze the real usage of the system and recognize its weaknesses, bottlenecks or anomalies on real data, you have to know how the process was followed in reality.

Process mining is an approach used for the analysis of the real enactment of processes. Process mining uses logs of real process enactments to analyze the process itself. Process mining can answer the question of how the process was really executed, which variations were used and what the probabilities of the enactment of each process variation are. Process mining can be seen as a supportive method for BP and BPI analysis and, from the perspective of BPM, can be used as feedback to the BPM methods [AH03].

As a case study on a real sample of data from a business process was realized during my work, it was relevant to observe outputs from this area.

There are many papers that describe new ways or improvements of methods, techniques and algorithms used in process mining. Surprisingly, only several papers focus on case studies, i.e. on the practical usage of process mining. A detailed overview of case studies up to 2011 can be found in [WS13]. The authors describe 11 case studies in several domains, mainly in public services [5-15].

Since then, some other papers describing the usage of process mining were published. In [LB13], the authors applied fuzzy mining, trace clustering and other additional methodologies within ProM to analyze the block movement process in shipyards. They used real-world event logs of the Block Movement Control System (BMCS). In connection with ProM, papers were published that present the analysis of patterns from application logs (e.g. [SW17]). Pre-processing of data from the original log is mentioned as a complex problem; many methods and tools were published, such as [TS17].

In the financial sector, the authors of [AS12] used process mining for the identification of financial processes to analyze compliance with the security requirements needed by a security audit. Another work [WB16] presents data mining techniques used for the detection of financial fraud.

In [LI12], the authors demonstrate the applicability of process mining to the discovery of processes that characterize knowledge maintenance, and the organizational perspective is used to find relations between individual performers. This case study was made on the real case of a knowledge maintenance process in an aviation institute.

The range of process mining case studies and variations of process mining types and methods shows the wide range of process mining applicability to answer different questions in different domains.


5.2 Process Mining and Social Networks

Process mining provides information about the process topology and moreover also information about the involved resources, for example, about users (actors). Finding information about relationships among actors is the aim of the social network analysis.

The first extensive analysis in this area was published by Jacob Levy Moreno [MO34] in 1934. He used several sociometric approaches to assign residents to various residential cottages. He is also the person who coined the term sociometry. Sociometry is a quantitative method for measuring social relationships. Since his time, a lot of sociometric studies have been published. At present, thanks to electronic data, there are many possibilities to gather and analyze data, for example emails [FE87]. As it is difficult to distinguish important from unimportant emails, it is crucial to have structured information.

Fortunately, today's enterprise systems provide structured information on all transactions of the system in the log. It usually contains all the necessary information, for example the start and end of process activities, which are performed by particular users with specific data.

An overview of the methods and usage of social networks in process mining was described by van der Aalst in [AS04]. The paper introduced an approach combining workflow management and social network analysis. It defines several metrics for mining organizational relations. The first approach uses metrics based on causality: performers are related if a case is passed from one performer to another (for example the handover-of-work metric). The second approach uses metrics based on joint cases: the authors ignore causal dependencies and simply count how often two individuals perform activities for the same case (for example the working-together metric). The third approach uses metrics based on joint activities: the distance between two performers is measured by several metrics based on a comparison of the corresponding vectors, where a vector shows how often a particular activity is performed. These metrics were implemented in the tool MiSoN (Mining Social Networks).
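To make the handover-of-work idea concrete, the following is a minimal sketch, assuming a hypothetical event log with columns case_id, resource and timestamp (the column names and data are illustrative, not taken from the thesis):

```python
# Minimal sketch of the handover-of-work metric described above.
# Assumes a hypothetical event log with columns: case_id, resource, timestamp.
import pandas as pd
import networkx as nx

def handover_of_work(log: pd.DataFrame) -> nx.DiGraph:
    """Build a directed social network: an edge u -> v is weighted by how
    often performer u directly hands a case over to performer v."""
    graph = nx.DiGraph()
    for _, case in log.sort_values("timestamp").groupby("case_id"):
        performers = case["resource"].tolist()
        for u, v in zip(performers, performers[1:]):   # consecutive events in one case
            if u != v:
                weight = graph.get_edge_data(u, v, {"weight": 0})["weight"]
                graph.add_edge(u, v, weight=weight + 1)
    return graph

# Example usage with a tiny artificial log
events = pd.DataFrame({
    "case_id":  [1, 1, 1, 2, 2],
    "resource": ["anna", "ben", "carl", "anna", "carl"],
    "timestamp": pd.to_datetime([
        "2018-01-01 08:00", "2018-01-01 09:00", "2018-01-01 10:00",
        "2018-01-02 08:00", "2018-01-02 09:30"]),
})
g = handover_of_work(events)
print(sorted(g.edges(data="weight")))
```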

In [AR05] the same authors use these metrics in a case study. They used the log of a provincial office of a Dutch national public works department, which employs about 1000 civil servants. The MiSoN tool was later no longer supported, and its methods are contained in a plugin of the ProM framework [DM05].

As described above, sociometry is a part of process mining. A more specific part of sociometry is social network analysis (SNA). This refers to the collection of methods, techniques and tools for the analysis of social networks. The following paragraphs describe certain usages of SNA in various applications.

One of the issues which were resolved by social networks is finding people among employees, who can help to resolve problems in company processes [LC13]. These problems are based on employee knowledge and a lack of information in daily work, specifically incomplete information, missing contacts for key persons or non-existing documents. The methodology [LC13] aims to record which employees keep knowledge concerning certain parts of a process. The authors propose creating a framework based on the internal social network system which is linked to the process of an organization. This solution should provide the sharing of knowledge among employees in the same area of expertise or between different employees from two collaborating teams.


Social networks of collaborators can be created in many ways, for example, a social network established by monitoring e-mail exchanges between employees using a distributed system [OY01]. This approach helps to find and gather the important members in an organization.

The study in [HB12] is also based on emails. The authors focused on the recommendation of email recipients. It can often happen, when a user forgets to add all recipients of an email, that some users miss important information in the email. The authors use an algorithm based on social networks.

Evaluation of this approach was conducted on the Enron Email Corpus and Lotus Note Email Corpus.

The initial feedback received from participants in the pilot project was very positive.

There is another usage of social networks – the project management issue – allocation of human resources, especially in IT development projects. In [ZH08] Lixin Zhou comes up with a human resource allocation method. The method of assigning tasks to employees is based on employee skills, interpersonal relations, and employee preference and task parameters. The goal is to find the most suitable employee within the company for particular tasks, thus reducing the time required for completion of the given task and ensuring the quality of the work.

The authors of the paper [HL12] provide a methodology for deriving an organizational structure that is suitable for a business process performed in an organization. It is done through a social network analyzed by various SNA techniques. The nodes of the social network are organizational units and the arcs represent the transfer of work between organizational units in terms of the current business process.

In the paper [FA11] the authors have focused on large event logs. They proposed a methodology for simplifying complex models of social networks by clustering users into communities. The result of the analysis is a social network of communities. The authors used this methodology in a case study in a medium-sized hospital. This approach is implemented in the ProM framework [DM05].

5.3 Complex networks

Network science is an integrated field of research that studies networks from many real areas (such as computer networks, biological networks, telecommunication networks and, finally, social networks). The basics of graph theory were defined by Leonhard Euler, and applications in many other areas were founded later. A probabilistic model in network (graph) theory was introduced as the Erdős–Rényi model.

A model of organizations (the PCANS model) that introduces three domains within organizations (individuals, resources and tasks) was defined in [KC98]. The authors presented the concept that networks (relations) exist across these domains and that these networks can be analyzed.

During research on the topological parameters of real networks (the web, social networks, ...) it was found that the probabilistic model does not exhibit the same structural parameters as the studied real networks (degree distribution, clustering parameters). New models that fit more precisely were introduced – the small-world network [WS98] and the scale-free network [BL99]. Applications of complex network analysis using these models were presented in several application areas, e.g. [ST01] (food webs, molecular interaction) and [SE18] (electric power systems).

A scale-free network is a network topology that contains hub vertices (vertices with many connections), which grow in a way that maintains a constant ratio of the number of their connections to all other nodes. Many networks (the internet) maintain this property, while in other networks the distributions of nodes only approximate scale-free ratios. As presented in [BL99], the existence of hubs (the “rich get richer” process) is principally caused by new vertices tending to connect to more popular (more connected) sites.

5.4 (Big) Data mining: Data preparing and preprocessing

Every data mining method expects the analyzed data to be prepared into input table(s) satisfying the assumptions of the given method. Source data (in our case technical logs from the SAP information system) are in raw format and often come from several sources, with many incomplete records, missing relations, and records mixed from many observed processes.

This is the reason the source data must be preprocessed – preprocessing is a critical precondition for using each method. Preparation of the pre-processing task requires detailed technical and also process knowledge of the data provider.

Obtaining data from the system log for process mining was described in the papers [AG98, AD02, AW03, AD03, CW98]. They usually focused on the control flow. The log should abstract time, date, event type and other mandatory information important for workflow mining, process mining, or social network analysis. A suggested model of the log is described in [AA11]; the authors use the openXES model and structures as defined in [GU09].
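As an aside, event logs in the XES format mentioned above can be loaded with common process mining libraries; a minimal sketch using pm4py, with a hypothetical file name:

```python
# Minimal sketch of loading an XES event log; "example_log.xes" is a hypothetical file name.
import pm4py

log = pm4py.read_xes("example_log.xes")  # recent pm4py versions return a pandas DataFrame
print(log.head() if hasattr(log, "head") else log)
```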


6 Theoretical background

This chapter provides theoretical background for topics and methods used in this work.

6.1 Data mining

Data mining is known as an interdisciplinary specialization that uses tools and knowledge from statistics, machine learning, artificial intelligence, graph theory and complex networks to analyze usually large datasets. Data mining aims to find useful patterns in data. Data mining tools are often used for analyzing large datasets for finding hidden patterns or relationships and building new knowledge. Data mining is a synonym for Knowledge Discovery from Data (KDD). KDD is defined by Frawley [FS92, p. 2] as follows: Knowledge discovery is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Given a set of facts (data) F, a language L, and some measure of certainty C, we define a pattern as a statement S in L that describes relationships among a subset FS of F with a certainty c, such that S is simpler (in some sense) than the enumeration of all facts in FS.

A pattern that is interesting and certain enough (both according to the user's criteria) is then called knowledge.

There is one important distinction between traditional statistics and data mining – while statistical methods verify or reject pre-defined hypothesis, data mining generates new ones.

The whole knowledge discovery process contains several steps, and data mining is one part of it. Data must be integrated from the needed sources, then pre-processed, passed to a data mining algorithm and then interpreted. As mentioned by Bramer [B40, p. 3], the preprocessing and the interpretation (as opposed to the blind use) of the results are both of great importance.

There are many applications where data mining is applied (as financial forecasting, fraud detection, medical diagnosis, weather forecasting, marketing ...).

6.1.1 Data

There are two basic types of data – labeled and unlabeled data.

Labeled data have a special attribute, and the aim of data mining is to predict the value of this attribute for cases that have not been inspected yet. The method used is called classification (if the attribute is categorical/enumerated) or regression (if it is numerical). Data mining using labeled data is called supervised learning.

Unlabeled data have no such special attribute. The aim of data mining with unlabeled data is to extract as much information from the data as possible. Data mining using unlabeled data is called unsupervised learning. Methods of clustering and association rules are used in unsupervised learning.

Basic concepts

We work with inspected objects; each object is described by a set of variables (attributes). The set of attribute values of one object is also called a record, instance or vector. All data provided to data mining – the variables of all objects – are called a dataset. A dataset is commonly interpreted as a two-dimensional table where rows represent objects and columns represent attributes.

Variable types


We work with the following basic types of variables:

- nominal variable: enumeration values that can be categorized (like username, name of city, ...) but do not define an ordering; numerical values can also be nominal, but no mathematical relations apply to them;

- ordinal variable: enumeration values that can be categorized and can be arranged (like secondary school, grammar school, university);

- integer variable: values with meaningful integer properties – mathematical operations can be applied to them (like number of children);

- binary variable: a special nominal variable with values 0 = false, 1 = true;

- interval variable: a numerical variable where the difference between two values is meaningful, but the ratio between values is not (40 °C is not twice as hot as 20 °C);

- ratio variable: both ratios and differences are meaningful (quantity, age, Kelvin temperature).

Steps in KDP and Data preparation phase

Several knowledge discovery process (KDP) models are described in the literature – the models define the steps of the KDP process and the position of data mining within it. The first was the academic model presented by Fayyad [FP96], a nine-step iterative model with possible loops between any two steps. A business-oriented model was presented in [CC00] as part of the commercial CRISP methodology [CP07]. After several further models, Cios et al. [CP07] published a hybrid academic/industrial model; it is based on the CRISP model and provides more detailed descriptions of the steps. An elaborate comparison of the previous models is also given in [CP07].

In this thesis (chapter 9.1) we introduce a knowledge discovery model inspired by [CP07]. The presented model focuses on the technical aspects of every step; in particular it considers that the process and data domain is an SAP application. Also, the step of understanding the problem and the data is not analyzed in great detail – we expect that these parts are recognized and known.

6.2 Clustering methods

Clustering methods (as unsupervised data mining methods) provide automatic discovery of structures (clusters) in the initial dataset without the need for supervision.

We will assume a dataset $X = \{x_1, x_2, \ldots, x_n\}$, where $x_k = (x_{k1}, x_{k2}, \ldots, x_{km}) \in R^m$ is the vector of m attributes describing object k. The whole dataset is represented as the matrix

$$
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1m} \\
x_{21} & x_{22} & \cdots & x_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nm}
\end{pmatrix},
$$

where the row $x_i = (x_{i1}, x_{i2}, \ldots, x_{im}) \in R^m$ represents object i and the column $X_j = (x_{1j}, x_{2j}, \ldots, x_{nj}) \in R^n$ represents attribute j.

When we discuss a cluster, we understand a cluster as a set of objects that are more similar to each other than to objects outside the cluster.

The m-dimensional Cartesian coordinate space is defined via m standard basis vectors $e_j = (0, \ldots, 1_j, \ldots, 0)^T$. Any vector in $R^m$ can be written as a linear combination of the standard basis vectors:

$$x_i = x_{i1} e_1 + x_{i2} e_2 + \cdots + x_{im} e_m = \sum_{j=1}^{m} x_{ij} e_j \qquad (1)$$


Objects can be visualized in 2D-3D space.

6.2.1 Similarity

Similarity and distance between objects underpin the computational methods. Let x, y, z be three objects (vectors); the distance $d(x, y)$ must be a function producing a non-negative value and meeting the following conditions:

$d(x, x) = 0$ (identity)

$d(x, y) = d(y, x)$ (symmetry)

$d(x, z) + d(z, y) \geq d(x, y)$ (triangle inequality)

Frequently used similarity functions are known from numeric models, but they can also be defined for non-numeric descriptions.

Hamming distance: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$ (2)

Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ (3)

Chebyshev distance: $d(x, y) = \max_i |x_i - y_i|$ (4)

Cosine distance: $\cos(\theta) = \dfrac{x \cdot y}{|x|\,|y|} = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$ (5)
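To make the distance functions (2)–(5) concrete, the following is a minimal NumPy sketch; the function and variable names are illustrative, not taken from the thesis:

```python
# Minimal NumPy sketch of the distance functions (2)-(5); names are illustrative.
import numpy as np

def hamming(x, y):      # eq. (2): sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

def euclidean(x, y):    # eq. (3)
    return np.sqrt(np.sum((x - y) ** 2))

def chebyshev(x, y):    # eq. (4): largest coordinate difference
    return np.max(np.abs(x - y))

def cosine_similarity(x, y):  # eq. (5): cosine of the angle between x and y
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.0, 1.0, 2.0])
print(hamming(x, y), euclidean(x, y), chebyshev(x, y), cosine_similarity(x, y))
```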

6.2.2 Construction of network and clusters identification

The pre-processed log was prepared in an Object-Attribute format, where the attributes were converted into numerical form. We use the Euclidean distance as the similarity function, so similarity can be easily measured. The issue is that high-dimensional data in vector format cannot be effectively visualized.

As we want to visualize the data and results for managers, we decided to transform the initial Object-Attribute table into a network. There are many visualization techniques and tools for networks (graphs) that can be used – including visualization of the found patterns. Visualization provides a fast, user-friendly tool for recognizing specific situations and relations in the network.

The method used for network construction was presented in [OZ17]; we use the Louvain method of community detection [BG08]. Network construction is based on the method [HF07] for nearest neighbor analysis, where the k nearest neighbors must be specified and known. The method we use employs nearest neighbors in another way.

The representativeness of the source objects (and potential graph vertices) is used; we expect that objects have different representativeness. Representativeness is a local property based on the number of objects that are nearest neighbors of a selected node.

Edges between all pairs of nearest neighbors are created first; then additional edges between individual data objects are created, in a number proportional to the representativeness of these objects.

The representativeness of nodes in the constructed graph then corresponds approximately to the representativeness of objects in the data. This yields a natural graph representation of the original data, which preserves their local properties.

The algorithm [OZ17] proceeds in the following steps:

1. Create the similarity matrix S of dataset D.

2. Calculate the representativeness of all objects Oi.

3. Create the set V of nodes of graph G so that node vi of graph G represents object Oi of dataset D.

4. Create the set of edges E of graph G so that E contains an edge eij between nodes vi and vj (i ≠ j) if Oj is the nearest neighbor of Oi or Oj is a representative neighbor of Oi.

The time complexity of the algorithm is $O(|D|^2)$.
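The following Python sketch illustrates the general idea of this kind of construction (connect each object to its nearest neighbor, then detect communities with the Louvain method). It is a simplified illustration under stated assumptions, not the exact algorithm of [OZ17]; in particular, the representativeness-based edges are omitted:

```python
# Simplified illustration: build a graph from an Object-Attribute table by
# connecting every object to its single nearest neighbor, then detect
# communities with the Louvain method. The representativeness-based edges
# of [OZ17] are omitted for brevity.
import numpy as np
import networkx as nx

def build_nn_graph(data: np.ndarray) -> nx.Graph:
    n = len(data)
    # pairwise Euclidean distance matrix, O(n^2) as in the text
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)           # an object is not its own neighbor
    graph = nx.Graph()
    graph.add_nodes_from(range(n))
    for i in range(n):
        j = int(np.argmin(dist[i]))          # nearest neighbor of object i
        graph.add_edge(i, j)
    return graph

rng = np.random.default_rng(0)
# two artificial groups of objects in attribute space
data = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(5, 1, (20, 5))])
g = build_nn_graph(data)
communities = nx.algorithms.community.louvain_communities(g, seed=0)
print([sorted(c) for c in communities])
```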

6.2.3 Representative of cluster – patterns

Patterns were identified by the cluster analysis, and those clusters must be analyzed further. The following statistical analysis was done on the vectors that are members of the identified clusters.

The normalized average values of the coordinates of all cluster members define the representation (representative vector) of the given cluster.

Let $X = \{x_1, x_2, \ldots, x_n\}$, $x_k = (x_{k1}, x_{k2}, \ldots, x_{km}) \in R^m$, be the original dataset.

The vector of maximal values of every attribute is calculated: $x_{max} = (max_1, max_2, \ldots, max_m)$.

Let cluster $P_j$ contain $n_j$ original objects, $P_j = \{y_1, \ldots, y_{n_j}\}$, where $\forall i \in \{1, \ldots, n_j\}: y_i \in X$, $y_k = (y_{k1}, y_{k2}, \ldots, y_{km})$.

A table $T_j$ of normalized average values (Table 1) is calculated for every cluster j, where:

$T_j = (t_{jA}, t_{jB}, t_{j1}, \ldots, t_{jm})$

$t_{jA}$ ... ID of the pattern (= j)

$t_{jB}$ ... number of members in pattern $P_j$ (= $n_j$)

$t_{j1}, \ldots, t_{jm}$ ... representative vector; $\forall w \in \{1, \ldots, m\}: t_{jw} = \dfrac{1}{n_j \cdot max_w} \sum_{i=1}^{n_j} y_{iw}$

The values $t_{jw}$ are normalized per column (attribute) against the maximal value of the given attribute in the whole dataset, so that they can be visualized in one picture.

PATTERN j COUNT ActivitiesNR TimeTotal TimeAverage TimeMax TimeMin ...

T1 1 75 0,013101 0,052 0,029822 0,183087 0,002122

T2 2 45 0,002201 0,00562 0,021185 0,056982 0,008092

T3 3 69 0,000292 0,001519 0,041694 0,046393 0,028812

T4 4 42 0,000187 0,0011 0,053018 0,048879 0,054223

Table 1. Normalized average values of pattern (model)

For every value $t_{jw}$ in Table 1, a 95% confidence interval (CI95) is calculated into Table 2.

PATTERN ActivitiesNR TimeTotal TimeAverage TimeMax TimeMin ...

1 0,055705938 0,589809836 0,015292769 1,063547475 0,004158323
2 0,000994289 0,098973122 0,138267375 0,622442662 0,014628092
3 0,069092742 0,572755889 0,010018928 0,886886835 0,056472154
4 0,000761284 0,005646703 0,043908205 0,015342395 0,103811321

Table 2. Confidence interval


These representative vectors of the patterns in Table 1 and the confidence intervals in Table 2 provide a tabular and visual view of the pattern model. The descriptive model of the patterns helps analysts understand how the patterns are constructed (it shows the parameters of the pattern representatives).
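A small sketch of how such a pattern representative and a per-attribute 95% confidence interval could be computed, assuming a cluster given as a NumPy array of its member vectors and the column maxima of the whole dataset; the normal-approximation CI (1.96 × standard error) is an assumption, since the thesis does not specify the exact CI95 formula:

```python
# Sketch: representative vector (normalized column means) and a 95% CI half-width
# per attribute for one cluster. Normalization by the column maxima of the whole
# dataset follows the text; the 1.96 * SE confidence interval is an assumption.
import numpy as np

def pattern_representative(cluster: np.ndarray, col_max: np.ndarray):
    n_j = len(cluster)
    representative = cluster.mean(axis=0) / col_max          # t_j1 .. t_jm
    std_err = cluster.std(axis=0, ddof=1) / np.sqrt(n_j)
    ci95 = 1.96 * std_err / col_max                          # half-width on the same scale
    return n_j, representative, ci95

dataset = np.random.default_rng(1).uniform(0, 100, size=(200, 4))
col_max = dataset.max(axis=0)
cluster = dataset[:75]                                       # e.g. the members of pattern 1
print(pattern_representative(cluster, col_max))
```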

6.3 Complex networks, representation and measuring

6.3.1 Representation

A picture (drawing of the graph) is a visual method for representing a network.

Fig. 1. Picture representation of network (directed x undirected graph)

The adjacency matrix is a square matrix $A_{ij}$ whose elements indicate whether a given pair of vertices is adjacent in the graph or not. The data representations of the adjacency matrix for the graphs in Fig. 1 are defined below.

An arrow $(v_1, v_4)$ is directed from $v_1$ to $v_4$; $v_4$ is then called the head and $v_1$ the tail of the arrow; $v_4$ is said to be a direct successor of $v_1$ and $v_1$ a direct predecessor of $v_4$.

Directed graph:

$$A_{ij} = \begin{cases} 1 & \text{when } v_i \text{ is the tail and } v_j \text{ the head of an edge} \\ 0 & \text{otherwise} \end{cases}$$

Undirected graph:

$$A_{ij} = \begin{cases} 1 & \text{when } v_i \text{ and } v_j \text{ are adjacent} \\ 0 & \text{otherwise} \end{cases}$$

(Example adjacency matrices for the directed and undirected graphs in Fig. 1.)

The incidence matrix is an n × m matrix B, where n is the number of vertices and m is the number of edges, defined as follows (for the graphs in Fig. 1).

Directed graph:

$$B_{ij} = \begin{cases} 1 & \text{when } v_i \text{ is the tail of } e_j \\ 0 & \text{otherwise} \end{cases}$$

Undirected graph:

$$B_{ij} = \begin{cases} 1 & \text{when } v_i \text{ is incident with } e_j \\ 0 & \text{otherwise} \end{cases}$$

(Example incidence matrices for the directed and undirected graphs in Fig. 1.)
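As a small illustration of these representations, the following sketch builds both matrices for an arbitrary example graph (not the graph of Fig. 1):

```python
# Adjacency and incidence representations of an arbitrary small directed graph
# (not the graph of Fig. 1).
import networkx as nx

g = nx.DiGraph([("v1", "v4"), ("v2", "v1"), ("v2", "v3"), ("v3", "v4")])
nodes = sorted(g.nodes)

A = nx.to_numpy_array(g, nodelist=nodes)   # adjacency matrix, A[i][j] = 1 for an arc vi -> vj
# Note: networkx's oriented incidence matrix uses -1 for the tail and +1 for the
# head, which differs slightly from the 0/1 definition given above.
B = nx.incidence_matrix(g, nodelist=nodes, oriented=True).toarray()

print(nodes)
print(A)
print(B)
```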

6.3.2 Bipartite network projection

A bipartite network is a network whose vertices can be divided into two disjoint sets V1 and V2 such that every edge connects one vertex from V1 with one vertex from V2.

Bipartite network projection is a method used for compressing a bipartite network into a one-mode network. The resulting one-mode network contains nodes of only one set (V1 or V2), where two nodes are connected only when they have at least one common neighboring node in the original bipartite network.

Several weighting methods have been proposed in the literature:

- Simple weighting: weights are calculated as the number of common associations

- Hyperbolic weighting: weights are calculated with a scaling factor 1/(n - 1)

- Resource allocation weighting: produces an initial distribution of some resource among the nodes and eliminates the omission of nodes with degree 1

In our work (section 9.3) we used a weighting given by the minimum of common associations, as we also want a measure of similarity of the connected vertices.
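A minimal networkx sketch of a weighted one-mode projection; the weight function here implements simple counting of common neighbors, and the node names are illustrative only:

```python
# Minimal sketch of a bipartite projection onto the "user" node set.
# The weight of an edge (u, v) is the number of documents both users touched;
# node names are illustrative only.
import networkx as nx
from networkx.algorithms import bipartite

b = nx.Graph()
users = ["u1", "u2", "u3"]
docs = ["d1", "d2", "d3", "d4"]
b.add_nodes_from(users, bipartite=0)
b.add_nodes_from(docs, bipartite=1)
b.add_edges_from([("u1", "d1"), ("u1", "d2"), ("u2", "d2"),
                  ("u2", "d3"), ("u3", "d3"), ("u3", "d4")])

# Simple weighting: count of common neighbors in the bipartite network.
# The minimum-of-common-associations weighting used in section 9.3 could be
# implemented with bipartite.generic_weighted_projected_graph and a custom
# weight function.
proj = bipartite.weighted_projected_graph(b, users)
print(sorted(proj.edges(data="weight")))
```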

6.3.3 Degree

The degree d(v) of a vertex v is defined as the number of edges incident to the given vertex.

We calculate the local vertex degree $d_i = \sum_j A(i, j)$ and the global mean degree $\mu_d = \frac{1}{n} \sum_i d_i$. The degree sum of a graph $G = (V, E)$ is $\sum_{v \in V} \deg(v) = 2|E|$.

6.3.4 Local and global distances

The distance $d(v_i, v_j)$ between vertices $v_i$ and $v_j$ is defined as the length of the shortest path between $v_i$ and $v_j$ in the network.

The eccentricity $e(v_i)$, as a local property of vertex $v_i$, is defined as the longest distance from vertex $v_i$ to any other vertex of the network:

$$e(v_i) = \max_j \{d(v_i, v_j)\} \qquad (6)$$

The diameter $d(G)$ of a graph G is the maximum eccentricity of any vertex in the graph, i.e. $d(G)$ is the greatest distance between any pair of vertices:

$$d(G) = \max_i \{e(v_i)\} = \max_{i,j} \{d(v_i, v_j)\} \qquad (7)$$

The global mean distance is defined as

$$\mu_L = \frac{2}{n(n-1)} \sum_i \sum_{j>i} d(v_i, v_j) \qquad (8)$$
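These distance-based properties are available directly in networkx; a short example on an illustrative graph:

```python
# Eccentricity, diameter and mean shortest-path distance of a small example graph.
import networkx as nx

g = nx.path_graph(5)                       # 0 - 1 - 2 - 3 - 4
print(nx.eccentricity(g))                  # longest distance from each vertex, eq. (6)
print(nx.diameter(g))                      # maximum eccentricity, eq. (7)
print(nx.average_shortest_path_length(g))  # global mean distance, eq. (8)
```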

6.3.5 Structural network properties

We are interested in the importance of vertices in a graph (network). Several structural properties representing the "importance" of a given vertex have been defined (vertex v can be important for other vertices if a change in vertex v produces changes in many other vertices of the network). There are many published definitions of and approaches to the importance property.

One often used concept of importance is centrality. This concept originated in social network analysis and was later developed further, e.g. by [BO87] and others. Every concept of centrality is based on some assumptions and should be analyzed to see whether it meets our interest.

Most centrality measures are calculated from the adjacency matrix. Often used centralities are:

- Based on degree (Degree, Eigenvector, PageRank)

- Based on paths (Closeness, Betweenness)

Degree centrality is defined as the degree of the vertex (see 6.3.3).

Closeness centrality of a vertex is a measure of local centrality in a network. Closeness is defined using the mean distance. Let $d_{ij}$ denote the length of the shortest path between vertices i and j. The mean distance of vertex i is defined as

$$l_i = \frac{1}{n} \sum_j d_{ij} \qquad (8)$$

The mean distance of a vertex that is close to the others is small. Such a vertex has fast access to other vertices and can influence other vertices quickly. Closeness is defined as

$$C_i = \frac{1}{l_i} = \frac{n}{\sum_j d_{ij}} \qquad (9)$$

Closeness has known problems with:

- networks with more components (closeness is calculated for each vertex in the context of its own component separately, but this handicaps vertices in large networks)

- the range of values, which has size of order log n; this is a small range, so the distinction between vertices is small

Betweenness centrality is a measure based on shortest paths in a network – the betweenness of a vertex is the number of shortest paths between any two vertices in the network that pass through the given vertex. In contrast to closeness, betweenness centrality represents the "degree" to which the vertex lies between the other vertices. Let $n^i_{st}$ be defined as 1 if vertex i is on the shortest path between s and t, and 0 otherwise. Let $g_{st}$ be the number of shortest paths between s and t. The betweenness centrality $x_i$ of vertex i is defined as

$$x_i = \sum_{st} \frac{n^i_{st}}{g_{st}}; \qquad \text{normalized } x_i = \frac{1}{n^2} \sum_{st} \frac{n^i_{st}}{g_{st}} \qquad (10)$$
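A brief networkx sketch of the centralities discussed above, computed on an illustrative example graph:

```python
# Degree, closeness and betweenness centrality on a small illustrative graph.
import networkx as nx

g = nx.karate_club_graph()   # well-known example network shipped with networkx

degree = nx.degree_centrality(g)
closeness = nx.closeness_centrality(g)          # based on mean shortest-path distance
betweenness = nx.betweenness_centrality(g)      # fraction of shortest paths through a node

top = lambda d: sorted(d, key=d.get, reverse=True)[:3]
print("top degree:", top(degree))
print("top closeness:", top(closeness))
print("top betweenness:", top(betweenness))
```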

6.3.6 Mean of dataset

The mean of a dataset (represented by a data matrix X) is the vector whose coordinates are the averages of the corresponding coordinates over all records.

6.3.7 Silhouette

The silhouette is a method for validating the consistency of clustered data. The silhouette value is a measure of how similar an object is to its own cluster compared to how similar it is to other clusters. The silhouette value ranges from −1 to +1; a higher value indicates that the object is similar to its own cluster and only weakly similar to neighboring clusters.

Stable clusters have most objects with high value. If many objects have a low or negative value, then the clustering configuration is not stable.

The silhouette can be calculated with any distance metric, such as the Euclidean distance or the Manhattan distance.
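For illustration, scikit-learn provides this measure directly. A minimal sketch on artificial data follows; the KMeans clustering here is only used to obtain some labels, whereas the thesis detects communities in the constructed network:

```python
# Minimal silhouette example on artificial data; KMeans is used only to obtain
# cluster labels - the thesis detects communities in a constructed network.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(6, 1, (50, 4))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
print("mean silhouette:", silhouette_score(data, labels, metric="euclidean"))
print("per-object values in [-1, 1]:", silhouette_samples(data, labels)[:5])
```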
