
VŠB – Technical University of Ostrava
Faculty of Electrical Engineering and Computer Science

Analysis of process data and their social aspects

SELF CITATIONS for PHD-THESIS

2018 Martin Kopka

Abstract

Information systems support and ensure the practical running of most critical business processes. For most processes, a record (log) of the process run exists in the information system, or can be reconstructed, with information about the participants and the processed objects. Computer methods of data mining can be used for the analysis of process data, utilizing supporting techniques of machine learning and complex network analysis. Process mining is able to analyze and reconstruct the model of a running process from its process log. Analyzing the behavior of a running process's participants from a process log transformed into a complex network of its participants is not a widely used approach; much more frequently, quantitative parameters are analyzed. Here we show how data and process mining methods can be used for analyzing a running process, and how participants' behavior can be analyzed from the process log using network (community or cluster) analyses on a complex network constructed from the SAP business process log. This work formulated and developed a methodology covering data integration, pre-processing and transformation, and data mining with subsequent interpretation and decision support – the work was realized and experimentally verified on sets of real logs from an SAP business process. A modified canonical process log structure is suggested with respect to the SAP environment – in principle, it can be applied to any SAP system. This approach constructs a complex network from the process log in the given context and then finds communities or patterns in this network.

The found communities or patterns are analyzed using knowledge of the business process and the environment in which the process operates. The results demonstrate the possibility of uncovering not only quantitative but also qualitative relations (i.e. hidden behavior of participants) using the process log and specific knowledge of the business case. This approach was found to be a useful starting point for decision support analysis, helping managers gain knowledge from process data (logs). While process mining can provide the (visual or formal) model of a running process, complex network analysis can uncover behavioral relations of participants that are hidden in quantitative models of the process log.

Key Words

Decision support, process log data, data mining, process mining, SAP log, graph construction, visualization (visual data mining), community detection, graph clustering, pattern analysis, outlier analysis, behavior.

Abstrakt

Information systems support and ensure the practical running of the most important business processes. The run of most processes in an information system yields an available process log with information about the participants and the processed objects. Computer methods of data analysis can be used for analyzing process data, utilizing supporting techniques of machine learning and complex network analysis. Process mining makes it possible to analyze and reconstruct the model of a running process from its log. Analyzing the behavior of a running process's participants from its log transformed into a complex network of its participants is not a widely used approach; more often, the quantitative parameters of the process are analyzed. In this work we show how data mining processes can be used for the analysis of process data, and how participants' behavior can be analyzed from the available process log data using network (community or cluster) analyses on a complex network constructed from the SAP business process log. This work formulated and developed a methodology covering data integration, data pre-processing and transformation, and data analysis with subsequent interpretation and provision of inputs for decision support. The work was realized and experimentally verified on sets of real logs from an SAP business process. A modified log structure is suggested with respect to the SAP environment – it can therefore be used for any SAP system. The described approach constructs a complex network from the process log in the given context and then finds communities or patterns in this network.

The found communities or patterns are analyzed using knowledge of the business process and the environment in which the process operates. The results show the possibility of uncovering not only quantitative but also qualitative relations (i.e. the behavior of participants) based on the process log and specific knowledge of the business case. This approach was found to be a useful starting point for decision support analysis aimed at helping managers gain knowledge from process data (logs). While data analysis in general can provide a (visual or formal) model of a running process, complex network analysis can uncover behavioral relations of participants that are hidden in quantitative models of the process log, also with the help of visualization of the analyzed network.

Klíčová slova

Decision support, process log data, data mining, process mining, SAP log, graph construction, visualization (visual data mining), community detection, graph clustering, pattern analysis, outlier analysis, behavior.


Table of Contents

1 LIST OF ILLUSTRATIONS ... 6

2 LIST OF TABLES ... 8

3 AIMS OF THE THESIS ... 10

4 VALUE ESTIMATION OF USE CASE ... 11

Conclusion ... 11

Presented result ... 12

5 PROCESS DATA ANALYSIS ... 13

5.1 USED CONCEPT OF KDD AND SPECIFIC TASKS ... 13

5.1.1 Integration ... 14

5.1.2 Pre-processing SAP data ... 16

5.1.3 Transformation ... 17

5.1.4 Data mining and Interpretation ... 19

5.1.5 Domain based interpretation ... 19

5.1.6 Qualitative validation ... 19

5.1.7 Feedback ... 19

5.2 PROCESS MINING APPLICATION FOR PROCESS IN SAP ... 19

Conclusion ... 21

Another visualization method ... 21

Experiments and results ... 22

Presented result ... 23

5.3 SOCIAL NETWORK ANALYSIS OF BUSINESS PROCESS IN SAP(L1) ... 23

5.3.1 Business process and data model ... 24

5.3.2 Used approach ... 24

5.3.3 Specifics of data preparation phase ... 26

5.3.4 Data mining analysis, results ... 27

5.3.5 Conclusion of L1 ... 28

5.3.6 Presented result ... 29

5.4 NETWORK ANALYSIS FOR BUSINESS PROCESS IN SAP(L2) ... 29

5.4.1 Business process and data model ... 29

5.4.2 Used approach ... 29

5.4.3 Hypothesis ... 31


5.4.4 Data mining analysis for Users – User-Attribute, results ... 32

5.4.5 “Recursive” analysis of input data from specific cluster ... 42

5.4.6 Model for back analysis of objects from patterns ... 51

5.4.7 Data mining analysis for Vendors – Vendor-Attribute, results ... 56

5.4.8 “Recursive” analysis of input data from specific cluster 3 ... 65

5.4.9 Conclusion of L2 ... 72

5.4.10 Presented result... 72

6 CONCLUSION ... 73

7 AUTHOR'S PUBLICATIONS ... 75

8 REFERENCES ... 76


1 List of illustrations

Fig. 1. Approach of decision support in use-case evaluation ... 11

Fig. 2. Used concept of Knowledge Discovery from Data ... 13

Fig. 3. Process model - Path 1 ... 20

Fig. 4. Process model - Path 2 ... 20

Fig. 5. Process model - Path 6 ... 21

Fig. 6. Most used sequence of the process visualized by turtle graphics ... 23

Fig. 7. Topology deviations of the sequence... 23

Fig. 8. Two mode (affiliation) network ... 25

Fig. 9. Constructed network ... 25

Fig. 10. Local community illustration ... 25

Fig. 11. Visualization of detected communities ... 27

Fig. 12. Data mining tasks ... 32

Fig. 13. Visualization of clusters in 5.4 ... 37

Fig. 14. Degree distribution in D1 for 5.4 ... 37

Fig. 15. Connections of high-degree users in 5.4 ... 39

Fig. 16. Silhouette of patterns in 5.4 ... 39

Fig. 17. Representatives of patterns in D1 for 5.4 ... 40

Fig. 18. Distribution of Avg time in D1 Pattern 1 in 5.4 ... 41

Fig. 19. Comparison of representative and outliers in D1 cluster 1 in 5.4 ... 42

Fig. 20. Difference in attributes (compare outliers with representative) in D1 cluster 1 in 5.4 ... 42

Fig. 21. Visualization of clusters for D2 in 5.4 ... 47

Fig. 22. Degree distribution in D2 for 5.4 ... 48

Fig. 23. Silhouette of patterns of D2 in 5.4 ... 49

Fig. 24. Representatives of patterns in D2 for 5.4 ... 50

Fig. 25. Distinction of patterns to base in D2 ... 51

Fig. 26. Principle of finding pattern for new object ... 52

Fig. 27. Principle of identifying nearest objects for pattern Pr ... 53

Fig. 28. Visualization of clusters in 5.4.7 ... 61

Fig. 29. Degree distribution in D3 for 5.4.7 ... 62

Fig. 30. Silhouette of patterns in 5.4.7 ... 63


Fig. 31. Representatives of all patterns in D3 for 5.4.7 ... 63

Fig. 32. Representatives of non-trivial patterns in D3 for 5.4.7 ... 64

Fig. 33. Distribution of Avg time in D3 Pattern 3 in 5.4.7 ... 65

Fig. 34. Visualization of clusters for D4 in 5.4.8 ... 69

Fig. 35. Degree distribution in D4 for 5.4.8 ... 70

Fig. 36. Silhouette of patterns of D4 in 5.4.8 ... 70

Fig. 37. Representatives of patterns in D4 for 5.4.8 ... 71

Fig. 38. Distinction of patterns to base in D4 ... 72


2 List of tables

Table 1. System tables used for SAP workflow log ... 15

Table 2. System tables used for SAP change management trigger ... 15

Table 3. System tables used for SAP status log ... 16

Table 4. System tables used for SAP application log ... 16

Table 5. System tables used for SAP EDI log ... 16

Table 6. Set of data collected from SAP ... 18

Table 7. Canonical log structure with relation to SAP dataset ... 18

Table 8. Interesting paths in the process model ... 20

Table 9. Interpretation of process steps in turtle graphics ... 22

Table 10. Log structure for 5.3 ... 26

Table 11. Interpretation of found communities in 5.3 ... 28

Table 12. Found communities A1-4, B1-4 ... 28

Table 13. Source log for Network analysis ... 29

Table 14. User-Attributes/Vendors-Attributes data table for Network analysis... 30

Table 15. Checklist for data mining steps ... 32

Table 16. Summary of results of experiment 5.4... 35

Table 17. Patterns – table of profile parameters in experiment 5.4 ... 36

Table 18. Patterns – confidence interval in experiment 5.4 ... 36

Table 19. Outliers in network of experiment 5.4 ... 38

Table 20. Pattern representatives in D1 for 5.4 ... 40

Table 21. Patterns – table of profile parameters in dataset D2 ... 46

Table 22. Outliers in network of D2 in 5.4 ... 49

Table 23. Pattern representatives in D2 for 5.4 ... 50

Table 24. Patterns – types of attributes (R/C/M) in experiment 5.4 ... 53

Table 25. Finding original record in experiment 5.4 ... 56

Table 26. Typical curves for distance distribution of distances to pattern ... 56

Table 27. Summary of results of experiment 5.4.7 ... 59

Table 28. Patterns – table of profile parameters in experiment 5.4.7 ... 59

Table 29. Patterns – confidence interval in experiment 5.4.7... 60

Table 30. Outliers in network of experiment 5.4.7 ... 62


Table 31. Pattern representatives in D3 for 5.4.7 ... 64

Table 32. Pattern 5 representative from D3 for 5.4.7 ... 64

Table 33. Pattern 3 representative from D3 for 5.4.7 ... 65

Table 34. Patterns – table of profile parameters in experiment 5.4.8 ... 68

Table 35. Patterns – confidence interval in experiment 5.4.8... 69

Table 36. Pattern representatives in D4 for 5.4.8 ... 71


3 Aims of the Thesis

This thesis summarizes the theoretical basis, presents the research hypotheses, defines the approach for the experiments, and documents the experiment or the used method for every research topic. My own contributions to the topics described in the related works are shown below.

The analyzed data sets for the specific experiments and process analyses come from real business cases in SAP systems. The data are anonymized in the first step of their collection, so that no private information is published as a result. The fact that the analyzed data are taken from a real environment brings reality to the experiments, and every success or failure of the used method can be compared with the real business situation (I will do this comparison).

In the area of decision support, I used the approach of SOM clustering and fuzzy logic to support decisions in a multi-project/multi-task environment where similar use cases come to be solved. As a knowledge base of solved use cases can be maintained, the approach could help with evaluating a new use case using the knowledge base of previously solved and evaluated use cases.

Hypothesis H1 is that the described approach can be used for the given class of decisions.

I define a canonical structure of the process log that is usable by several analytic methods used in my research, and I prepare extraction methods on the SAP system side that can provide such logs. The extraction structure is compatible with the one defined in [AA11], but it utilizes some elements commonly used in SAP, which extends the value of the logged data. The modified method of gathering process data from SAP for process mining solves the challenges mentioned in 5.1.

The use of the method of (social) network analysis on SAP process data is based on the approach defined later in this thesis – two approaches are presented, and both of them were experimentally verified. The basic specific approach described in 5.3 will be extended by more types of transforming functions. The more general approach described in 5.4 shows the concept of how the analysis of the behavior of process participants (people, participating objects) can be supported by network analysis and clustering data mining methods.

For the basic approach, I expect (hypothesis H2) that the suggested and applied approach will yield relevant relations and uncover participants' behavior patterns (not visible in the original datasets). The following steps will be used:

- the construction of the social network is defined by a function w
- the function s specifying the relation between actors is defined (it can be defined explicitly using organizational data, or as a relation of variables known within the case or process)
- communities are detected in this constructed network
- more methods can be used for the analysis of this network

The hypothesis for the general approach (hypothesis H3) is that applying data mining and clustering methods to a data set prepared by pre-processing the SAP process log will show non-obvious behavioral aspects of the participating objects, which can be easily visualized, verified, and analyzed in the original environment. This can be used as a tool for managers' decision support.

The method of empirical study is used in this work.


4 Value estimation of use case

Many decisions during project (use case) estimation and planning are based on a manager's previous experience and competency. Evaluated completed projects provide rich knowledge about similar decisions and reality.

This approach shows how to estimate the value of specific parameters of a project utilizing a parameterized use-case model from previously run projects.

The main assumption of this approach is that a database of evaluated use cases exists and is maintained.

We used SOM clustering for grouping similar use cases and fuzzy rules for calculating the searched parameter value.

Approach:

- parametrization of recent use cases, including the result value
- parametrization and adding of the new use case
- clustering the data by SOM (finding the cluster of the new use case)
- processing of the cluster by fuzzy logic with the result value (without the new use case)
- estimating the result value of the new use case by the fuzzy rule of the cluster where the new use case was found
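The SOM clustering step above can be sketched with a toy self-organizing map; the grid size, decay schedules, and the parametrized use-case vectors below are illustrative assumptions, not the configuration actually used in the experiments.

```python
import numpy as np

def train_som(data, grid=(3, 3), iters=500, lr0=0.5, sigma0=1.5, seed=0):
    """Train a tiny SOM; each grid node holds one weight vector."""
    rng = np.random.default_rng(seed)
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])], float)
    weights = rng.random((len(coords), data.shape[1]))
    for t in range(iters):
        x = data[rng.integers(len(data))]                    # pick a random use case
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))    # best-matching unit
        lr = lr0 * np.exp(-t / iters)                        # decaying learning rate
        sigma = sigma0 * np.exp(-t / iters)                  # shrinking neighborhood
        d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
        h = np.exp(-d2 / (2 * sigma ** 2))                   # neighborhood function
        weights += lr * h[:, None] * (x - weights)           # pull nodes toward x
    return weights

def bmu_of(weights, x):
    """Index of the grid node (cluster) a parametrized use case falls into."""
    return int(np.argmin(((weights - x) ** 2).sum(axis=1)))

# hypothetical parametrized use cases: two clearly separated groups
cases = np.array([[1, 1, 1], [1, 1, 0.9], [5, 5, 5], [5, 5, 4.9]], float)
som = train_som(cases)
```

After training, similar use cases map to nearby grid nodes, so the cluster of a new use case is simply the best-matching unit of its parameter vector.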

Fig. 1. Approach of decision support in use-case evaluation

Conclusion

The realized experiment proved that the approach was successful in cases when the best fuzzy rule could be determined; the same rule then provided the correct new value. On the other hand, in one case when all evolved rules featured the same training fitness, only 3 out of 10 evolved rules estimated the new value correctly.


The experiments on the datasets showed the possibility of using the presented approach of machine learning methods for predicting an evaluated use-case parameter based on known sets of evaluated parameters, and of applying it to a new, unevaluated use case from the same domain.

Presented result

Results of several experiments were presented at:

- The International ACM Conference of Emergent Digital EcoSystems (MEDES), 2012 [SK12]

- Databases, Texts, Specifications, and Objects (DATESO), pages 25-37, 2012 [SS12]

- IFSA World Congress NAFIPS Annual Meeting, 2013 [SK13]

- The 4th International Conference on Innovations in Bio-Inspired Computing and Applications, IBICA, pages 37-47, Ostrava 2013 [SS13]


5 Process data analysis

5.1 Used concept of KDD and specific tasks

The data mining methods used in this work always expect a specific data table with a given requested structure (specific to the given method). But this table is never provided automatically by the source information system(s). This chapter describes the common Knowledge Discovery from Data approach for the cycle of producing data for the specific data mining method used and the subsequent analysis. The presented schema of steps (Fig. 2) was inspired by the well-known KDD steps, and we supplied it with a "lifecycle" and practical content with respect to SAP data sources. It is often necessary to come back with feedback to the data mining step and repeat the analysis with specific changed inputs (feature selection or analysis of a detailed subset of the data are often practical reasons), and sometimes it is necessary to come back to the integration phase, when some feature should be verified and the relevant data is not contained in the original data set.

This chapter describes how specific activities are composed into the tasks run during data preparation from SAP systems.

The chapter of each experiment (5.2, 5.3, 5.4) specifies what further activities and decisions on the data are done for the given approach.

Fig. 2. Used concept of Knowledge Discovery from Data

There are several optional checks that can be applied and that can pose challenges (also described in [AA11]) for the final event log creation (I addressed all of them, as described below):

- Correlation challenge: events are grouped per case in this concept. I utilized the fact that the case could be identified from the event's log record; if this correlation were an issue, a specific output would be prepared using a known method (e.g. [FG09]).

- Timestamps challenge: the events must be ordered within each case group; the provided timestamps are used for this ordering. If timestamps were not available (which is not the case in SAP), another method would be used (domain knowledge with some pattern). Another issue can appear when multiple source systems run unsynchronized clocks.

- Snapshots challenge: the whole case should be analyzed, but sometimes the case's lifecycle is longer than the period covered by the log. Uncompleted cases are filtered out from the log so that they do not affect the results for completed processes.

- Scoping challenge: in a specific SAP environment with thousands of tables, specific domain knowledge must be used to scope the source data.

- Granularity challenge: not all SAP processes provide the defined common log structure; when the log has lower-level granularity, a grouping mechanism can be used; if only higher-level granularity is available, the log must be extended by some internal statuses.
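As an illustration of the snapshots challenge, a minimal filter over a canonical log might look like this; the record layout and the event names marking a case's start and end are assumptions for the sketch.

```python
from collections import defaultdict

def keep_completed_cases(log, start_event="CREATED", end_event="POSTED"):
    """Drop all records of cases whose lifecycle is not fully covered by the log."""
    events_per_case = defaultdict(set)
    for rec in log:
        events_per_case[rec["CASE"]].add(rec["EVENT"])
    complete = {case for case, evs in events_per_case.items()
                if start_event in evs and end_event in evs}
    return [rec for rec in log if rec["CASE"] in complete]

# case 2 is a "snapshot": its end falls outside the logged period
log = [
    {"CASE": 1, "EVENT": "CREATED"}, {"CASE": 1, "EVENT": "POSTED"},
    {"CASE": 2, "EVENT": "CREATED"},
]
filtered = keep_completed_cases(log)
```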

5.1.1 Integration

The integration phase aims to combine source data streams from different data sources into one unified database. In many cases SAP runs on one database system and this phase is not needed, but it was needed in some scenarios – a process running in the SAP CRM system or SAP Sales Cloud. The SAP ERP database was selected as the final database.

The core logging data is based on the principle Case-Event-DateTime-Originator (see 5.1.3 for more details).

A Case represents one whole pass of the process – if the process is vendor invoice verification, the Case represents all records related to a specific vendor invoice. An Event represents one step/activity related to that vendor invoice.

If a multi-system workflow is running, IDOBJ+system was used as the unique ID of the case – internally, IDOBJ was extended by the system ID to ensure uniqueness.

Identifying the proper log source for every observed business process is the first important analytical task. The analyst must have both business process and system implementation knowledge to specify the correct source – I always selected standard sources, enabling repeatability of the experiment on other SAP systems.

The following standard logging methods are the most usual for integration in SAP systems running on the SAP NetWeaver platform (such as SAP ECC) or S/4HANA.

5.1.1.1 Data log from SAP workflow process

Data logs from workflow processes are saved from transaction SWIA; alternatively (when more detailed information from workflow containers is needed), an export tool was prepared from the system tables listed in Table 1 (the workflow system uses more than 60 system tables).

Table        Description
SWWWIHEAD    Headers of all workitems
SWWORGTASK   Actual ORG OBJ processing the workitem
SWWCONT      Container values of running workitems
SWWCNTP0     New XML container (the BAPI function SWW_WI_CONTAINER_READ is used for acquiring container values)

Table 1. System tables used for SAP workflow log

5.1.1.2 Data log from business processes without SAP workflow

The data log from a process that does not run as an SAP workflow should be saved based on an analysis of which relevant triggers represent the observed process. Basically, the following standard triggers are used for this purpose:

- change management
- business object event
- status change
- standard application log
- iDOC export

or a special trigger can be created (programmed) if the standard ones are not sufficient.

Change management

A business document in SAP can be activated with Change management – this causes the generation of a "change document" on every defined change (CRUD) of the document.

CDHDR   Change description header
CDPOS   Change description position

Table 2. System tables used for SAP change management trigger

Business object event

A business object event is triggered automatically by the system based on customizing. It can be triggered by a change document, a status change, or a user program. Business object events can be found in the standard SAP table SWFRETLOG. The most common use of business object events is for triggering workflows – in that case the log is saved from the workflow log (see above), but in some cases no workflow is defined and the event can serve as a standard milestone.

Status change

Statuses represent a very standard tool for modeling a business document in specific states. Basically, the system uses "system (Ixxxx)" and "user (Exxxx)" statuses. I prefer to use the system statuses, because they are provided as standard in any SAP system. The OBJNR (ID of the object/document) is used as the basic reference for the status tables used.

JCDS          System and user status – log of all changed values
JEST          System and user status – actual values
JSTO          Status profile data
TJ30, TJ30T   User status + description

Table 3. System tables used for SAP status log

Standard application log

System SAP provides a standard logging subsystem that can also be used by customer code for logging running programs and transactions. There is an application screen for working with this log (transaction SLG1). The application log has a BAPI interface that can be used by customer programs; the logging is saved in the set of tables shown in Table 4.

BALHDR    Application log: log header
BALOBJ    Application log: objects
BALMP     Application log: message parameter
BALHDRP   Application log: log parameter

Table 4. System tables used for SAP application log

iDOC export

In some cases the export of the iDOC structure of a business document can serve as a trigger. It is a very standard process and provides much important information.

EDIDC Control information of iDOC

EDID4 Data records of iDOC

EDIDS Status records of iDOC

Table 5. System tables used for SAP EDI log

Special trigger

In case of a non-standard implementation it is possible to use a non-standard trigger (defined by the implementation). This is possible, but not a recommended way.

5.1.2 Pre-processing SAP data

The pre-processing task contains important steps focused on the selection, cleaning, and extension of the log data.

The selection process selects log records meeting the requested parameters:

- IDOBJ type (i.e. vendor invoice),
- task/activity type (i.e. the set of workflow tasks representing steps in the observed process),
- time period (i.e. the year 2017/2018),
- organization structure (a selected region, if requested).

The cleaning process selects and updates records with the aim to:

- keep only completed cases logged (delete cases without a start or an end),
- solve faulty values in some relevant columns; I typically solved:
  - responsible person – blocked users without representation
  - error status of a workitem

The extension process typically finds more context data for the observed object, data, or process and enriches the dataset with the requested parameters (I used the extension for purchase order type, plant ID, ...).

The anonymization process converts sensitive data in the dataset into numbers from a generated interval, so that no sensitive data exist in future processing. I used a tool for anonymization of the following data from the datasets: username, organization structure, vendor ID.

I use binary evaluation of categorical attributes for some methods. It is run (on request) during the anonymization process – attribute A is anonymized in the first step. Let {A1, ..., An} be the set of values of attribute A, let f(A, k) be the value of attribute A for log record k, and let {VA1, ..., VAn} be the set of anonymized values of attribute A. Then n new columns (attributes) A1, ..., An are created, and the value f(Ai, k) of attribute Ai for a specific log record k is defined as:

f(Ai, k) = 1 if f(A, k) = Ai, otherwise 0.
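The binary evaluation f(Ai, k) above corresponds to one-hot encoding; a minimal sketch follows, where the attribute name and its values are made up for illustration.

```python
def binary_evaluation(records, attr):
    """Replace categorical attribute `attr` by one 0/1 column per observed value."""
    values = sorted({rec[attr] for rec in records})          # {A1, ..., An}
    out = []
    for rec in records:
        new = {k: v for k, v in rec.items() if k != attr}    # drop original column
        for v in values:
            new[f"{attr}_{v}"] = 1 if rec[attr] == v else 0  # f(Ai, k)
        out.append(new)
    return out

rows = [{"ORGID": "X"}, {"ORGID": "Y"}]
encoded = binary_evaluation(rows, "ORGID")
```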

5.1.3 Transformation

We have analyzed the variants of logs generated from SAP for the specific sources (workflow, change documents, ...) and compared them with the mandatory part of the log structure defined by the process mining tool ProM [DM05].

The result is the following final version of the transformed log, which will be used for all subsequent analyses. The same log can also be used as an input for the ProM tool, which we used for the process mining task described in chapter 5.2.

The expected common data structure of the input log table used for process mining was defined by the ProM tool [DM05]; the minimal mandatory set for the data log contains:

- Case – one pass of the process
- Event – one step of the process
- Start time – start time of the task
- End time – end time of the task (mandatory)
- Originator – originator of the particular task

As we prepared the concept of the original log from SAP with knowledge of this mandatory structure and concept (Case-Event-DateTime-Originator), the main part of the transformation utilizes this property.

The following dataset is the minimal set of data collected from SAP using my extraction tools. A "mandatory" note in the "obligation" column means that the value is transformed into a mandatory field of the final log.

Parameter    Description                                                    Obligation
IDOBJ        ID of the SAP object                                           mandatory
IDACTIVITY   ID of the activity                                             mandatory
DATESTART    Date when the activity started                                 optional
TIMESTART    Time when the activity started                                 optional
DATEEND      Date when the activity ended                                   mandatory
TIMEEND      Time when the activity ended                                   mandatory
ROLE         Role of the object processor (creator, accountant, approver)   mandatory
USER         User account of the object processor (real user name)          optional
TRANSACTION  System transaction run by the user during processing of the
             activity (in case of a workflow process it can be represented
             by the TS workflow task)                                       mandatory
DATA         Detailed information about the context of the invoice –
             specifically ORGID, INVOICETYPE, binary evaluation…            optional

Table 6. Set of data collected from SAP

The final log structure is constructed (transformed) from the SAP log using the following transformation function (defined in Table 7).

Final log    SAP log                 Obligation
CASE         IDOBJ                   mandatory
EVENT        ROLE + TRANSACTION      mandatory
END_TIME     DATEEND + TIMEEND       mandatory
ORIGINATOR   USER                    mandatory
START_TIME   DATESTART + TIMESTART   optional
DATA         DATA                    optional

Table 7. Canonical log structure with relation to SAP dataset
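The mapping in Table 7 can be sketched as a per-record transformation; the sample field values below are illustrative (the real extraction tool works directly on SAP tables).

```python
def to_canonical(sap_rec):
    """Transform one SAP extract record (Table 6) into a canonical log row (Table 7)."""
    row = {
        "CASE": sap_rec["IDOBJ"],
        "EVENT": f'{sap_rec["ROLE"]}+{sap_rec["TRANSACTION"]}',
        "END_TIME": f'{sap_rec["DATEEND"]} {sap_rec["TIMEEND"]}',
        "ORIGINATOR": sap_rec["USER"],
    }
    if sap_rec.get("DATESTART"):                  # optional fields are kept only when present
        row["START_TIME"] = f'{sap_rec["DATESTART"]} {sap_rec["TIMESTART"]}'
    if sap_rec.get("DATA"):
        row["DATA"] = sap_rec["DATA"]
    return row

rec = {"IDOBJ": "INV-1", "ROLE": "approver", "TRANSACTION": "MIR4",
       "DATEEND": "02/01/2018", "TIMEEND": "10:15:00", "USER": "U042"}
canon = to_canonical(rec)
```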

It is necessary to work with canonical data type formats (special candidates for the transformation are the Date and Time formats and decimal numbers). System SAP uses the legislative version of the date (e.g. MM/DD/YYYY for Hungary and DD/MM/YYYY for the Czech version) – I transformed dates to the DD/MM/YYYY format. The same legislative differences occur in the decimal format.
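The date normalization can be sketched with strptime/strftime; the per-country format table here is a simplified assumption, not the actual configuration.

```python
from datetime import datetime

# illustrative mapping of jurisdiction to its legislative date format
SAP_DATE_FORMATS = {"HU": "%m/%d/%Y", "CZ": "%d/%m/%Y"}

def normalize_date(value, country):
    """Parse a legislative date format and re-emit it canonically as DD/MM/YYYY."""
    return datetime.strptime(value, SAP_DATE_FORMATS[country]).strftime("%d/%m/%Y")
```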

The previous steps (5.1.1 – 5.1.3) prepared the log for further processing. Every data mining method expects a specific data table as the input for its processing.

Final transformation

The final phase of the transformation prepares the specific data table expected by the selected data mining method. We used two final transformations during the experiments documented in this work.

Transformation to 2-mode graph and to network (approach 5.3)

Transformation to a 2-mode (bipartite) graph and then to a network is the final step for the data mining approach explained in 5.3.

Transformation to Object-Attributes table (approach 5.4)

Transformation to an Object-Attributes table prepares the data table for the data mining approach explained in 5.4 – the assumptions and approach are defined in chapter 5.4.

5.1.4 Data mining and Interpretation

Specific data mining method is used for analyzing data prepared by previous steps. Used method and specifics are explained in chapter given by the specific experiment.

The common approach of the used methods is:

- construction of the network – the network is constructed based on the similarity of the participants' behavior (approaches 5.3 and 5.4 use different methods for network construction – the methods are explained in the referenced chapters)
- network clustering methods are used for identifying clusters in the network
- visualization is available for interpretation of the calculated model
- analysis of the network
- the model of the analyzed network is prepared (patterns, profiles, and detailed information about the profile of every calculated pattern)

5.1.5 Domain based interpretation

Interpretation is the final stage of one analytics wave. It provides one step of validation of the results from data mining. Evaluating patterns, communities, and outliers in the real process and organization context validates the found results.

As we have all information about the source objects and relations, along with knowledge of the original environment, we prepared an interpretation of the received model and its patterns. The interpretation results provide important background for the feedback phase and confirm or refute the hypothesis that the used approach can uncover hidden behavior.

Interpretation can also open new questions about the data to be solved when we have found (understood) some new relations in the source data. These new facts provide the basis for the feedback step.

5.1.6 Qualitative validation

We use a method for decision support based on the results of data mining. It uses the patterns identified in the original dataset. When a new object appears, we can compare this object to all identified patterns and find the best-fitting pattern for it. Then a comparison of attributes can be done, and it can be analyzed whether the behavior of the new object also fits the behavior of the found pattern.
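This best-fit matching can be sketched as follows; this is a minimal illustration assuming each pattern is represented by a representative attribute vector and Euclidean distance is used as the dissimilarity measure – the names and data are illustrative, not from the thesis experiments:

```python
import math

def best_fit_pattern(new_obj, patterns):
    """Return the id of the pattern whose representative vector is
    closest (Euclidean distance) to the new object's attribute vector."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(patterns, key=lambda pid: dist(new_obj, patterns[pid]))

# toy patterns: representative attribute vectors per pattern id
patterns = {"P1": [1.0, 0.0, 5.0], "P2": [0.0, 3.0, 1.0]}
print(best_fit_pattern([0.2, 2.8, 1.1], patterns))  # closest to P2
```

Once the best-fitting pattern is known, the attribute-by-attribute comparison against the pattern profile can follow.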

5.1.7 Feedback

The feedback phase starts a new wave of data mining; we use it for verification of found patterns, for detailed analysis of an identified cluster, or as the next step in the KDD cycle.

5.2 Process mining application for process in SAP

The prepared case study analyses the business process of invoice verification in a SAP system with the aim to identify contexts in which the process is not effective and to provide suggestions for process improvement. The analyzed environment runs a SAP system in five countries (with five different jurisdictions) and processes approx. 30 000 supplier invoices per year.


Data was loaded from SAP using the extract method into the canonical log and processed through the processing steps (anonymization, filtering).

The ProM tool was used for the analysis; the results found are shown in the following table, and several of them are described below.

Table 8. Interesting paths in the process model

Path 1 – 1st most frequented path

This path is used by most process enactments (cases) in the log – almost half of all cases. The process model of this path is depicted in Fig. 3. The process starts with the artificial start event and continues with the Creation and Verification events. Next, there are two repetitions of the Approval event, and the process ends with the Posting event followed by the end event.

Fig. 3. Process model - Path 1

The invoices in this group were found to be invoices connected to a purchase order/contract and with a proper receipt without differences (amount, price). Typically these are invoices for investments, strategic raw material and overheads, with two persons approving the invoice. This proves that the purchasing system is well defined and settled.

Path 2 – 2nd most frequented path

This path contains 1564 cases, almost 29% of all cases. The process model of this path is depicted in Fig. 4. The process starts with the artificial start event and continues with the Creation and Verification events. Next, there are three repetitions of the Approval event, and the process ends with the Posting event followed by the end event.

Fig. 4. Process model - Path 2

Path                                                 Cases   Occurrence (relative)   Number of events
Path 1 – 1st most frequented                          2436   44.912%                  7
Path 2 – 2nd most frequented                          1564   28.835%                  8
Path 3 – 3rd most frequented                           504    9.292%                  6
Path 4 – 4th most frequented                           200    1.862%                  9
Path 5 – 5th most frequented – more Posting events     101    1.862%                  9
Path 6 – most time consuming path                        1    0.018%                 18
Path 7 – least time consuming path                       1    0.018%                 10
Path 8 – without Approval event                          1    0.018%                  5
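The extraction of path variants like those in Table 8 can be sketched as follows; this is a minimal, generic illustration (log rows are invented (case, event) pairs assumed to be in time order, not the real SAP extract):

```python
from collections import Counter

def path_variants(log):
    """Group events per case (log rows are (case_id, event) in time order)
    and count how often each event sequence (path variant) occurs."""
    cases = {}
    for case_id, event in log:
        cases.setdefault(case_id, []).append(event)
    return Counter(tuple(seq) for seq in cases.values())

# toy log: three cases, two of them share the same path variant
log = [(1, "Creation"), (1, "Verification"), (1, "Approval"),
       (1, "Approval"), (1, "Posting"),
       (2, "Creation"), (2, "Verification"), (2, "Approval"),
       (2, "Approval"), (2, "Posting"),
       (3, "Creation"), (3, "Verification"), (3, "Approval"), (3, "Posting")]
for variant, count in path_variants(log).most_common():
    print(count, " -> ".join(variant))
```

Sorting the counter by frequency yields exactly the "1st most frequented, 2nd most frequented, ..." ordering used in the table.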


Most of the invoices from Path 2 come from purchases during the main business process and financial services. The confirmation by three persons for specific invoice types was introduced with the motivation to include some level of approval already during the purchasing process and to reduce the number of approvals during invoice verification. The process is mature and correct; the motivation is to move one step to the purchase phase, where this person is also active.

Path 6 – most time consuming path

This path, which includes only one case, is the most time consuming one. For some reason this case lasted the longest. The path is depicted in Fig. 5.

The process of this path (case) is the following:

1. Start artificial event
2. Two times: Creation and Verification event
3. Approval event
4. Posting event
5. Seven times: Approval event
6. Posting event
7. Approval event
8. Posting event
9. End artificial event

Fig. 5. Process model - Path 6

This is a possible (but not typical) enactment for an invoice without a purchase order/contract; the purchase order must have been prepared and agreed – it is strictly required to have the purchase order number approved before the invoice arrives (otherwise the invoice is not accepted). This case was combined with the issues discussed in Path 5. Another case from this group was an invoice from a key customer created supposedly pursuant to the contract (there was a long discussion about its legitimacy and correctness – but as the partner was a strategic one, the invoice was not turned back and the process was prolonged).

Conclusion

We proved that using the selected process mining method it is possible to analyze SAP log data pre-processed by our tools and to build the process diagram of the running application. We identified the most frequently run variants of the process and visualized them for further work. This method can be used for analyzing the expected behavior of a business process – there are some expectations of how the process runs, and the approach can confirm or alter the view of the running process.

Another visualization method

We also analyzed another visualization method for presenting the business process sequence. This method uses turtle graphics for visualization of process execution (the method was presented in [SS14]).

We defined a mapping between event types and turtle moves. In our approach we control the turtle's behavior using fixed directions assigned to specific events in the process log: we tell the turtle how to turn not according to where the turtle is facing, but according to the sides of the area.


The turtle's state is defined as a pair (P, w), where P = (px; py) represents the coordinates of the turtle's position and vector w = (wx; wy) represents the turtle's heading. The turtle's state can also be written as (px; py; wx; wy). The initial state of the turtle is Pi = (0; 0; 0; 1).

There are four types of events in the examined process – creation, verification, approval and posting. We set up aliases for the event types: Verification is V, Creation is C, Approval is A and, finally, Posting is S.

We assigned the following directions to the specific events:

- event C – turtle goes up (North),
- event V – turtle goes right (East),
- event A – turtle goes down (South),
- event S – turtle goes left (West).

The turtle can be controlled by commands (the turtle knows its actual state P = (px; py; wx; wy) before command execution):

- move(x; y) – moves the turtle to state (x; y; wx; wy) without keeping track,
- turn(x; y) – moves the turtle to state (px; py; wx + x; wy + y),
- forward(D) – moves the turtle D steps in the direction of its heading while keeping track, ending in state (p′x; p′y; wx; wy), where p′x = px + D·cos α, p′y = py + D·sin α, α = arctan(wy/wx).

The interpretation of the process steps in turtle graphics is explained in Table 9:

Process step   Turtle move    Start state        1. turn            2. turn
C              up (North)     (px; py; wx; wy)   turn(-wx; 1-wy)    forward(1)
V              right (East)   (px; py; wx; wy)   turn(1-wx; -wy)    forward(1)
A              down (South)   (px; py; wx; wy)   turn(-wx; -1-wy)   forward(1)
S              left (West)    (px; py; wx; wy)   turn(-1-wx; -wy)   forward(1)

Table 9. Interpretation of process steps in turtle graphics
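The event-to-direction mapping can be sketched in code. This minimal version tracks only the drawn positions: since each event always turns the turtle to a fixed compass direction and then moves one step forward, the combined turn-plus-forward of Table 9 reduces to one unit step per event (the sequence string is illustrative):

```python
# Fixed compass direction per event type: C = North, V = East,
# A = South, S = West; each event moves the turtle one unit step.
DIRS = {"C": (0, 1), "V": (1, 0), "A": (0, -1), "S": (-1, 0)}

def turtle_path(sequence, start=(0, 0)):
    """Return the list of positions visited while drawing the sequence."""
    x, y = start
    path = [(x, y)]
    for event in sequence:
        dx, dy = DIRS[event]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path

# the proper execution CVAAS ends directly below the starting point
print(turtle_path("CVAAS"))
```

The final position lying straight below the start is exactly the "correct sequence" criterion used in the experiments below.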

Experiments and results

The first experiment shows the most used sequences without the time parameter – CVAAS, CVAAAS, CVAS, CVAAAAS, CVAAAAAS – as can be seen in Fig. 6. The starting point is surrounded by a circle; the end points are red. Looking at the figure, we can see that all sequences end in points that lie on the line from the starting point to the bottom. This is the behavior we expect from a correct sequence; the difference is only the distance from the starting point. Another finding is the shape of the picture: the picture is simple, and there is only one step to the right, which means one verification in our case. Accordingly, there is only one posting activity in all sequences. The difference is in the bottom direction – every sequence has its specific number of approval activities. We can see that no deviation was found for the most used sequences.


Fig. 6. Most used sequence of the process visualized by turtle graphics

The next experiments show the deviations from the proper process execution – CVAAS (Fig. 7). On the left side we can see that the sequence visually breaks the rule of single C and V directions; thus creation and verification were performed twice for this type of sequence. Otherwise the sequence is correct. The drawing in the middle shows that there is a problem at the end of the sequence: A and S were performed additionally even though the sequence had already been correctly ended. The third drawing, on the right side, shows an unexpected S activity directly after the first A, then two more A activities, and the final ending.

These three results show that deviations at the beginning, at the end and in the middle of the sequence can be easily recognized by the visualization.

Fig. 7. Topology deviations of the sequence

The performed experiments showed the potential of visualization using turtle graphics. We discussed only a few experiments, but even these show that the approach can be used for visualization and is useful. There are many ways to use and customize the settings to draw the paths.

Presented result

These results were presented in [SKo13] and [SS14].

5.3 Social network analysis of business process in SAP (L1)

We realized the research in the environment of the enterprise information system SAP on the case of the invoice verification process. Participants of the given business process stand in different relationships. We are interested in the relationships that are not explicitly seen in the process logs, but which are detectable by research methods of social networks and community detection in social networks. The work constructs the social network from the process log in the given context and then finds communities in this network.

We analyzed found communities using knowledge of the business process and the environment in which the process operates.

When analyzing the resources involved in running business processes, we focus on the analysis of common features (similarities) shown by the involved resources. We use the process data generated during the running business process as a process log. The similarity of the resources is defined depending on the context. The context is defined as a set of entities from the process which are relevant to the analyzed resource.

The network of relationships among human actors (human networks) is created from the process log and then communities are identified in this network.

Determining the parameters by which the network among human actors is established is an important task. Specifically, important parameters include the relation of resource valuation, the relation of resource similarity and the resources' membership in a community. We analyzed the found communities within the context and within the actual business process – this can serve as a support tool for a manager with business and context knowledge.

5.3.1 Business process and data model

The examined business process of invoice verification is implemented in SAP ERP and SAP DMS, and user activities are controlled by SAP business workflow. Users participate in the invoice verification workflow in several different roles (creator, accountant – completion, approver, and accountant – decision and posting).

The analyzed process is as follows: the accountant initially creates the invoice and verifies it, then sends it to the approvers, and finally, after getting it back, decides about posting or about changing the responsible approver.

5.3.2 Used approach

I. Formal description of transformation process context → network

We use the following simple formalization for the description of our approach. The formalization is based on the fact that we are looking for the relation between resources (required for the process run – we mean actors, people) and entities (describing the process). The relation is then represented by the network.

Let’s have the set A of N actors A={A1,A2,…AN} and the set P of M entities describing the process P={P1,P2,…PM}. The set P is called the process context.

Let function w: A×P→R0+ define the weight of the relation of the actor from A to the entity from P.

Let a=[a1,a2,…,aM], ai=w(A,Pi), be the vector representing actor A in the process context P. For the following description we identify the actor A with its vector a.

Let function s: A×A→R0+ define the similarity of two actors from A in the process context P.

We define the weighted two-mode (affiliation) network G2={A,P} (see Fig. 8). The network describes the relation of actors to the entities. This network can be converted (using the function s) to a one-mode (social) network G={A,E}, where E is the set of edges between actors (vertices of the network) defined by the application of the function s. The network G describes the relations between actors. The network in Fig. 9 was constructed from the network in Fig. 8. The weight of the edge between actors A, B∈A is equal to s(A, B); if this value is zero, there is no edge between the actors.


Fig. 8. Two-mode (affiliation) network
Fig. 9. Constructed network

II. Automatic detection of communities in the network

We use a local algorithm based on local expansion for automatic community (cluster) detection. This algorithm uses the dependency of a vertex as a community membership function. A more detailed description can be found in [KD11].

We work with the terms local community, community core, community boundary and community shell as shown in Fig. 10; communities are constructed from a starting set of vertices (the community base) that affects the community.

Fig. 10. Local community illustration

A local community expansion is an iterative process in which only the base vertices are initially considered to be the community, while the other network vertices (contained in the shell) are gradually examined; the vertices that meet certain criteria (are recognized as members of the community) are then moved to the boundary and progressively expand the community.

Definition 1 (Community base): A community base is a starting set of n vertices appropriately chosen in advance, which by definition belongs to the community and which meets the following criteria:

1. It is a bi-connected sub-graph.

2. At least n-1 vertices have to be dependent on the other base vertices.

Definition 2 (Recognition of a network vertex): An unrecognized network vertex becomes a recognized one, if during the process of local expansion it meets the following criteria:



1. It is adjacent to at least two different community vertices.

2. It is dependent on the other community vertices (dependency D>0.5).

()

()

III. Detection of All Network Communities

Unless stated otherwise, a community base is considered to be two vertices connected by an edge, where at least one of the vertices is dependent on the other. This couple is called the edge base. In this case, every vertex in the community (excluding at most one base vertex) must be dependent on the rest of the community vertices. The purpose of introducing the edge base is the effective detection of more communities.

If we want to detect all the network communities, then it is essential to detect the communities for all edge bases of the vertex and remove the duplicities.
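The all-edge-bases loop with deduplication can be sketched as follows. The local-expansion routine here is a trivial stand-in (base edge plus the common neighbours of its endpoints), not the dependency-based expansion of [KD11], which is only referenced by this summary:

```python
def all_communities(edges, expand):
    """Run the community expansion from every edge base and remove
    duplicate communities; `expand` is the local-expansion routine."""
    found = set()
    for base in edges:
        community = expand(base)
        found.add(frozenset(community))
    return [set(c) for c in found]

# stand-in expansion: the base edge plus all common neighbours
def make_expand(adj):
    def expand(edge):
        u, v = edge
        return {u, v} | (adj[u] & adj[v])
    return expand

adj = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2}, 4: {1, 2}, 5: {6}, 6: {5}}
edges = [(1, 2), (1, 3), (5, 6)]
print(all_communities(edges, make_expand(adj)))
```

Note that the result may contain overlapping and nested communities, which matches the remark below about the crucial feature of the algorithm.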

Remark: Automatically detected communities are bi-connected graphs which can overlap and be nested – this is a crucial feature of the used algorithm.

IV. Data preparation was realized based on the activities described in 5.1.1, 5.1.2 and 5.1.3.

Column        Description                                                     Mandatory
IDOBJ         ID of the invoice                                               Yes
IDACTIVITY    ID of the activity                                              Yes
DATESTART     Date the activity started                                       No
TIMESTART     Time the activity started                                       No
DATEEND       Date the activity ended                                         Yes
TIMEEND       Time the activity ended                                         Yes
ROLE          Role of the invoice processor (creator, accountant, approver)   Yes
USER          User account of the invoice processor                           No
TRANSACTION   System transaction                                              Yes
DATA          Details about context                                           No

Table 10. Log structure for 5.3

The original log contained 14 895 invoices and 72 590 activities. Data was anonymized.

The cleaned and filtered log contained 14 755 invoices and 71 264 activities of 130 users.

5.3.3 Specifics of data preparation phase

The transformation had two steps:

T1: Converting log data into 2-mode graph


The context was defined as the set of invoices. The function w: A×P→R0+ represents the number of activities realized by the user (actor) on the specific invoice.

T2: Transformation into the network

Cosine similarity was used as the similarity function s for the actor vectors a=[a1,a2,…,aM]:

s(a, b) = (Σi ai·bi) / (√(Σi ai²) · √(Σi bi²))

The similarity defines the weights of the edges between the actors. The resulting network has a high density (most vertices are contained in one connected component). To preserve the crucial properties, we reduced the network by an edge threshold of 0.8.
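The T2 step can be sketched as follows (cosine similarity over the actor vectors, keeping only edges above the 0.8 threshold; the toy 2-mode weights are invented):

```python
import math

def cosine(a, b):
    """Cosine similarity of two actor vectors a = [w(A, P1), ..., w(A, PM)]."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def build_network(actors, threshold=0.8):
    """Edges between actors whose vector similarity exceeds the threshold."""
    names, edges = list(actors), []
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            s = cosine(actors[u], actors[v])
            if s > threshold:
                edges.append((u, v, s))
    return edges

# toy 2-mode weights: number of activities per actor on invoices P1..P3
actors = {"A1": [3, 0, 1], "A2": [2, 0, 1], "A3": [0, 4, 0]}
print(build_network(actors))  # only the A1-A2 edge survives the threshold
```

Actors touching disjoint sets of invoices get similarity 0 and thus no edge, which is what makes the thresholded network sparse enough for community detection.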

5.3.4 Data mining analysis, results

16 communities were identified by the algorithm described above. The communities were visualized in the whole network using 2D visualization, as seen in Fig. 11.

Fig. 11. Visualization of detected communities

The visualization uses the following principles: circles represent actors in the network, the radius of a circle encodes the frequency of activities of the specific actor, and an edge and its thickness represent the strength of the tie between actors.

An interpretation of the found communities, marked in the visualization in Fig. 11, in the real environment of the business process and organization is summarized in Table 11 – this summarization proved the relevance of the used approach, as it detected behavior of participant groups that are not connected in the input datasets.

ID       Interpretation
A1       The community contains two users – an accountant and a buyer from one smaller plant – they actually confirm most invoices together (in a conjoint process).
A2       Example of an articulation vertex in the network. These actors are the purchaser and the invoice approver (sales) in a company code with two plants.
A3       Purchase order creator and invoice approver (sales) for one plant.
A4, B1   These are interesting cases where community A4 contains community B1 as a subset. Three members of A4 are the correct part of invoice verification in one plant. The fourth one has identical authorization as one of the set (spare).
B2       This community reacts to verification of a specific invoice type in which these participants take part.
B3       This community contains B1 (where all are from one plant) as a subset. The articulation actor there is a buyer responsible for two plants, and the remaining set contains logistics managers from both plants. The biggest is the accountant responsible for both plants.

Table 11. Interpretation of found communities in 5.3

Table 12. Found communities A1-4, B1-4

5.3.5 Conclusion of L1

The realized experiment analyzed the process log of the invoice verification process, transformed the log into a complex network and then found communities in this network. We identified 16 communities and interpreted them – a reasonable interpretation was found for each of them.

This approach can provide a set of actors that have a similar behavior with a set of invoices.

Although the approach is limited by the defined transformation to the complex network (by the behavior of a user given by the number of commonly touched invoices), more variants of transformations could be prepared to uncover more types of behavior. A manager can get suggestions of persons or groups of persons that can then be analyzed by quantitative methods and prepared for a manual decision.


5.3.6 Presented result

This result was presented in [KK13].

5.4 Network analysis for business process in SAP (L2)

While the approach presented in 5.3 constructs the social network of participants based on a fixed transformation (on one significant behavioral feature), this new approach is more general. It takes into account the whole vector of behavioral features and constructs the network based on a similarity function. This approach provides the possibility to analyze more objects (participants of the business process) and to combine any set of features from the process log.

5.4.1 Business process and data model

Vendor invoices are verified by the workflow process running in SAP ERP. Users participate in 10 different roles (the workflow solves 10 special cases, every case solved by a special role). The workflow gradually runs through specific roles (based on conditions and user decisions) until the invoice is approved or declined.

Analyzed sample: 37 684 invoices (cases), 240 users, 171 831 steps (activities); the sample comes from a different environment than the one used in the analysis described in 5.3.

The original log contains the following data/attributes:

Attribute     Description
CASE          SAP invoice number
EVENT         Identification of the workitem
START_TIME    Start date and time
END_TIME      End date and time
ORIGINATOR    User running the workitem
DATA          Role of user; purchase order number, amount; purchase order type; vendor

Table 13. Source log for Network analysis

5.4.2 Used approach

I. + II. + III. Data preparation was realized based on the activities described in 5.1.1, 5.1.2 and 5.1.3.

Data tables (Objects-Attributes) were prepared for objects participating in approval process: Users table, Vendors table, Invoices table, Purchase orders table.

User-Attribute (D1, D2)   Vendor-Attribute (D3)   Explanation
User             Vendor          User ID / Vendor ID to which the values below refer
ActivitiesNR     ActivitiesNR    Number of activities of the user (vendor's invoices)
TimeTotal        TimeTotal       Total time processed by the user (vendor's invoices)
TimeAverage      TimeAverage     Average time processed by the user (users for vendor's invoices) on one activity
TimeMax          TimeMax         Maximal time processed by the user (users for vendor's invoices) on one activity
TimeMin          TimeMin         Minimal time processed by the user on one activity (for vendor's invoices)
Role             Role            Sum of RoleIDs of all activities of the user (for vendor's invoices)
R1 ... R10       R1 ... R10      Number of occurrences of the user (for vendor's invoices) in R1 ... R10
NumberRoles      NumberRoles     Number of different roles of the user (for vendor's invoices)
NumberInvoice    NumberInvoice   Number of invoices processed by the user (for the vendor)
NumberPO         NumberPO        Number of purchase orders for invoices processed by the user (of the vendor)
NumberVendors    N/A             Number of vendors for invoices processed by the user
N/A              NumberUsers     Number of users who processed invoices of the vendor
AvBusProcess     AvBusProcess    Average of the business process for invoices processed by the user (vendor)
AvApprProces     AvApprProces    Average of the approval process for invoices processed by the user (vendor)

Table 14. User-Attributes/Vendors-Attributes data table for Network analysis
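The aggregation of a few of the attributes from Table 14 can be sketched as follows (only ActivitiesNR, TimeTotal, TimeAverage and NumberInvoice are computed; the log rows and durations are invented):

```python
def user_attributes(log):
    """Aggregate per-user behaviour attributes from log rows
    (user, invoice, duration_in_hours) - a subset of Table 14."""
    stats = {}
    for user, invoice, duration in log:
        s = stats.setdefault(user, {"ActivitiesNR": 0, "TimeTotal": 0.0,
                                    "_invoices": set()})
        s["ActivitiesNR"] += 1
        s["TimeTotal"] += duration
        s["_invoices"].add(invoice)
    for s in stats.values():
        s["TimeAverage"] = s["TimeTotal"] / s["ActivitiesNR"]
        s["NumberInvoice"] = len(s.pop("_invoices"))
    return stats

# toy log rows: (user, invoice, duration in hours)
log = [("u1", "I1", 2.0), ("u1", "I1", 1.0), ("u1", "I2", 3.0),
       ("u2", "I2", 4.0)]
print(user_attributes(log))
```

Each resulting per-user dictionary is one attribute vector; the full Table 14 vectors extend this with roles, purchase orders, vendors and process averages.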

IV. Data mining

The User-Attribute table is used as the source set of vectors for the transformation to a network. The main reason for using a network here is the possibility of visualizing data structures and sub-structures based on the similarity relation (similarity of the vectors from the data source).

The transformation of the original data source into a network and the cluster construction was realized using the algorithm mentioned below. The method used for network construction was presented in [OZ17]; we use the Louvain method of community detection [BG08].

The attributes of the vector are constructed from the behavior of users (or vendors); the whole vector represents the set of evaluated behavioral attributes.

Automatic clustering of the network enables finding the most important clusters (groups) in the network.

The quality of the found clusters is checked by the silhouette of the clusters. The silhouette shows visually how stable the cluster members are with respect to their cluster.
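The silhouette check can be sketched as follows; this is a plain implementation of the standard silhouette coefficient over Euclidean distances, not necessarily the exact routine used in the experiments, and the points are invented:

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient: for each point, a = mean distance to
    its own cluster, b = mean distance to the nearest other cluster,
    s = (b - a) / max(a, b)."""
    def dist(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q is not p]
        if not own:
            continue  # singleton clusters are skipped in this sketch
        a = sum(dist(p, q) for q in own) / len(own)
        b = min(sum(dist(p, q) for q in c) / len(c)
                for k, c in clusters.items() if k != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(round(silhouette(points, [0, 0, 1, 1]), 3))
```

Values near 1 indicate members firmly inside their cluster; values near or below 0 flag unstable members, which is exactly what the visual silhouette shows.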

Measuring network parameters follows – the parameters help to understand the network behavior in some cases.

V. Interpretation

Pattern analysis is done by statistical analysis of the found patterns. Every pattern provides information containing the combination of values of profile parameters (attributes). This mix of values for all patterns provides a model. The model describes the found clusters by attribute values. A representative participant can be found for every pattern – the pattern is then defined by this representative vector of attributes. The analysis is done for participants very similar to the representative on the one side and for typically non-conforming participants of the cluster on the other side. Participants can be distributed by their conformance with the model attributes. We are interested in outliers, as they represent some unique behavior (they can excel or simply differ; they can represent a risk or an opportunity). We use two methods of outlier detection: network outliers (detected as communities with one member) and attribute outliers (detected as outliers of the distribution given by a selected attribute – for example by the quantile method).
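Attribute-outlier detection by quantiles can be sketched with the common IQR rule; this is one possible instantiation of the quantile method mentioned above (the values are invented):

```python
def quantile_outliers(values, k=1.5):
    """Attribute outliers by the IQR rule: values outside
    [Q1 - k*IQR, Q3 + k*IQR] are flagged."""
    s = sorted(values)
    def quantile(q):
        pos = q * (len(s) - 1)           # linear interpolation between ranks
        lo, hi = int(pos), min(int(pos) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)
    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

# toy attribute distribution, e.g. TimeAverage per participant
print(quantile_outliers([12, 13, 12, 14, 13, 12, 95]))  # → [95]
```

A participant flagged this way on a relevant attribute is then inspected in the original business context, as described above.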

A detailed analysis of an interesting cluster is also used – we repeat the clustering for the participants of the selected cluster only (with the same attributes). This avoids the influence of participants from other clusters.

As mentioned, visualization is a very important possibility in networks. We utilize several visualization concepts, as the target of using this approach is to support managers' decisions (visualization is a valuable supporting tool):

- visualization of clusters and relations in the network using the Gephi software tool [HE17],
- visualization of the pattern model,
- distribution of participants inside the cluster.

Interpretation also provides an important confirmation of the analyzed results against the real environment. We always come back to the original business process and compare the analytics results with reality (confirm, find out whether the analyzed result reflects some reasonable situation or constellation).

VI. Feedback

A typical feedback is an impulse for a detailed analysis of a selected cluster (repeating the clustering method for the participants of the selected cluster only). Another impulse is based on the pattern analysis.

5.4.3 Hypothesis

The hypothesis of the presented approach is that, using methods of data mining, clustering of the dataset prepared by preprocessing the SAP process log will show non-obvious behavioral aspects of participants that can be easily visualized, verified and analyzed in the original environment.


5.4.4 Data mining analysis for Users – User-Attribute, results

The approach of the data mining analysis follows the steps shown in the following checklist and is also schematically shown in Fig. 12.

1. User-Attribute table
2. Network construction
3. Pattern recognition
4. Basic parameters of the network
   o Quantitative parameters
   o Degree distribution
   o Modularity
   o Clustering coefficient distribution
   o Centrality distribution
   o Eccentricity distribution
5. Visualization
6. Outliers analysis
7. Pattern analysis
   o Silhouette
   o Pattern representative
   o Identify relevant attributes for the representative
     ▪ Distribution of the relevant attribute inside the pattern
     ▪ Differential analysis: patterns x whole dataset
     ▪ Analysis of the representative and outliers on this distribution
   o Recursive detailed analysis of the pattern → 2
8. Summary, conclusion

Table 15. Checklist for data mining steps

Fig. 12. Data mining tasks
