5.4 Final model of design principles for the definition of data

5.4.2 Elements of the model of design principles for the definition of data

In the following paragraphs, the model of the principles of data definition shall be introduced as the Abstract Solution.

5.4.2.1 Content

Hitz et al. (2019) reveal in their study that the elements describing pure content are the most important components of data governance and of perceived data quality for end users. In the study, the variable content almost exclusively described data governance. The principles, combined into indicators, described the content itself with very high significance. The results therefore also allowed these principles to be taken into account.

Actuality The first principle in this group is the actuality of the data. Data in information is thus perceived as useful only if it reflects exactly the latest state of information. In order to comply with this principle, it is essential to ensure this accuracy of data. For data definition, this principle requires that the accuracy of data must be guaranteed in an instantiated solution in terms of both time and content.

Universality The purpose of this data definition principle is that data within a company should have only one validity. At the same time, data should be available only once, in a universally accessible version. This is the only way to ensure that information can be analysed universally in the same way and that the same conclusions can be drawn. What seems easy to solve is difficult to implement in reality. The reasons lie in the nature of the business.

As shown in Johnston (2014) and in table 5.2, a data state can take on nine different perspectives. A universal view would thus be able to communicate these perspectives without any doubt to anyone who analyses the data. However, data is condensed into information, and the loss of exactly this information must be prevented under all circumstances. Observations in practice have also shown that even experts are currently not very concerned with this problem. The majority of companies adhere to the perspective what we currently assert things are like now, which only applies to the corresponding type of data.

For the implementation of this principle, it is necessary for an instantiated solution first of all to establish the awareness that data sources are very difficult to hold in the temporal context and are exposed to an erosion of usefulness. As Johnston (2014) explains, the storage of data in a bi-temporal form is extremely costly. The technical development of temporal query languages started long ago and is therefore an old demand (Ozsoyoglu & Snodgrass, 1995; Snodgrass & Ahn, 1985). SQL-based temporal query languages have also existed for many years (Böhlen, Gamper, Jensen, & Snodgrass, 2018). An implementation in practice could not be seen in the observed companies and may turn out to be very difficult. Therefore, it is generally recommended to keep the data models extremely simple in order to be able to manage this data. Finally, for this principle it makes sense that the data be made centrally available in a single source of truth.
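
Purely as an illustration of what a bi-temporal record in the sense of Johnston (2014) could look like, the following Python sketch separates valid time from transaction time; all field and function names are hypothetical and not taken from the observed companies.

from dataclasses import dataclass
from datetime import date

@dataclass
class BitemporalFact:
    key: str             # business key of the data item
    value: float         # the asserted value
    valid_from: date     # valid time: when the fact holds in the real world
    valid_to: date       # exclusive end of the valid-time interval
    recorded_from: date  # transaction time: when the assertion entered the system
    recorded_to: date    # exclusive end of the transaction-time interval

def as_of(facts, key, valid_on, known_on):
    """What did we assert on 'known_on' about the state of 'key' on 'valid_on'?"""
    return [f for f in facts
            if f.key == key
            and f.valid_from <= valid_on < f.valid_to
            and f.recorded_from <= known_on < f.recorded_to]

Even this small sketch shows why simple data models and a single source of truth are recommended: every query now has to state two points in time explicitly.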

Credibility This principle refers to the effect that data can have on the external realm. Data must be reliable not only for the internal view but also towards the outside. Outwardly, credibility depends heavily on the reliability of data. Ensuring that information can be presented in such a way that it can be verified at any time is central to preventing damage caused by incorrect information. Reputation damage caused by misinformation can be fatal for an organization. The principle is self-explanatory from the point of view of cyber security.

For an instantiated solution, it is imperative that the data architecture is regularly checked via Governance, Risk and Compliance (GRC). For this principle, extensive implementation approaches prevailed in the observed companies.

Unambiguousness Unambiguousness goes hand in hand with the principle of actuality and shares its problems. The principle of uniqueness obliges the ongoing examination of information for uniqueness. Information has to be defined very accurately. Thus it must be defined centrally which data is not refined into information. This concerns in particular sources which supply the same data and could, under certain circumstances, lead to ambiguity. Due to the large variety of data within an organization, it must be determined in a repository which data is to be sourced for which purpose.

It would be necessary for an instantiated solution to think about a metadata system which could serve as a repository for the whole data architecture.
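
As a minimal sketch of such a metadata system, assuming hypothetical purpose and source names, a repository could record exactly one authoritative source per information purpose:

# Hypothetical repository: one authoritative source per information purpose,
# so the same data cannot be refined ambiguously from competing sources.
metadata_repository = {
    "monthly_revenue_report": {"source": "erp.sales_orders", "owner": "finance"},
    "churn_dashboard": {"source": "crm.customer_events", "owner": "marketing"},
}

def source_for(purpose: str) -> str:
    """Resolve the single source that may be used for a given purpose."""
    entry = metadata_repository.get(purpose)
    if entry is None:
        raise KeyError(f"no authoritative source registered for '{purpose}'")
    return entry["source"]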

Completeness In the study by Hitz et al. (2019), great importance was attached to this principle. The principle can be interpreted in two ways. On the one hand, it can mean that the data one has decided on should be delivered completely. On the other hand, it can also imply that the information needs once defined in a data architecture should also be completely sourced as an entity. The principle should be understood in both ways. Data should thus be completely available to the extent that one can produce information with it. This implies that an analysis is very often already possible without having the entire extent of the data available. At the same time, data may be available in a level of detail which is superfluous for a given piece of information. Thus, the principle can also be understood to mean that data should be available in such a way that it does not overload a given piece of information.

For a solution to be instantiated, this principle, together with the creation of a data architecture with associated metadata, would also be a goal to aim for. In this metadata the information products are to be specified more precisely.
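
A minimal sketch, assuming a hypothetical information product with three required attributes, could express this understanding of completeness as follows:

# Required attributes of a hypothetical information product.
product_spec = {"customer_id", "order_date", "net_amount"}

def is_complete_enough(record: dict) -> bool:
    """Complete 'to the extent that information can be produced': all required
    attributes are present; any additional detail is simply superfluous."""
    return product_spec.issubset(record.keys())

print(is_complete_enough({"customer_id": 1, "order_date": "2024-01-31",
                          "net_amount": 99.0, "shoe_size": 42}))  # True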

Consistency Once data is defined as information that is to be made available on a long-term basis and at any time, it must be ensured that this takes place with the same continuity and consistency. The principle would be met if the information product became a reliable and trustworthy source of information in the long term. The habits of the recipient of the information must also be taken into account. The size and format of the information transfer play an important role. In addition, sequential information must be free of repetition and must reflect actuality. It must be recognizable what is new about the information. In the case of data aggregations, such as key figures, the same information must result after each calculation. Here the same problem of accuracy plays a role as with bi-temporal data storage.

An instantiated solution could use check rules to ensure this consistency. Issues with data temporality would have to be solved in the same way as with the principle of actuality.
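
Such a check rule could, as a sketch under the assumption that the key figure is a simple sum, recompute the aggregation and compare it with the published value:

import math

def key_figure_is_consistent(records, published_value, tolerance=1e-9):
    """Recompute the aggregated key figure and compare it with the published value."""
    recomputed = sum(r["net_amount"] for r in records)
    return math.isclose(recomputed, published_value, abs_tol=tolerance)

rows = [{"net_amount": 10.0}, {"net_amount": 5.5}]
print(key_figure_is_consistent(rows, 15.5))  # True: same result after each calculation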

Interpretability The interpretability of data, or the interpretation of information, can be anticipated in an analysis. When creating a data architecture, the possible scenarios can already be anticipated. The principle of interpretability focuses on the expressiveness of information. Accordingly, only those data are to be condensed into information which also achieve an effect with their expressiveness. In the observed organisations in practice, data is partly collected for the sake of data collection. Conversely, the principle should allow no data which does not become interpretable added value for the company.

In an instantiated solution this would have to be considered in the data architecture. In particular, catalogues about the criticality of data would have to be provided and the costs for an information product to be produced would have to be calculated. The principle would then be met if only data that fulfils the necessary data criticality were sourced for information.
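
A hypothetical criticality catalogue (names and levels invented for this sketch) could act as such a filter:

# Hypothetical catalogue: only data whose criticality reaches the threshold
# is condensed into information at all.
criticality_catalogue = {
    "sensor_raw_noise": 1,      # collected for the sake of collection
    "customer_orders": 4,
    "regulatory_reports": 5,
}

def data_to_source(threshold: int = 3):
    """Return only the data sets that yield an interpretable added value."""
    return [name for name, level in criticality_catalogue.items() if level >= threshold]

print(data_to_source())  # ['customer_orders', 'regulatory_reports']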

Comprehensibility This principle is fundamental to the development of the roles of data quality as presented in Section 2.2.5. Companies that have defined such data governance already use these roles to ensure that information is free of contradictions. Information should be provided in such a way that it is logical and intuitively understandable. It must be clear to the information recipient when the information is plausible. For this purpose, ranges of possible values would have to be known or made available via time-spanning information.

An instantiated solution would have to define a data governance to ensure comprehensibility over time. Information is always context-dependent and requires business domain knowledge. This can be ensured, for example, by applying data stewardship as described in section 2.2.5.
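
As a sketch, assuming hypothetical measures and value ranges maintained by data stewards, plausibility could be made checkable for the information recipient:

# Hypothetical ranges of possible values, maintained by data stewardship.
plausibility_ranges = {
    "return_rate_percent": (0.0, 100.0),
    "delivery_days": (0, 60),
}

def is_plausible(measure: str, value: float) -> bool:
    """The recipient can check intuitively whether a value lies in its known range."""
    low, high = plausibility_ranges[measure]
    return low <= value <= high

print(is_plausible("return_rate_percent", 123.0))  # False: not plausible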

5.4.2.2 Coherence

While content was about which data should be provided and in which form, coherence focuses on the meaning of the data. Data provided perfectly can be as good as intended; if it is not used correctly in a specific business context, it remains of no value. The principles of coherence thus ensure that information can ultimately be used effectively. Interestingly, in the study by Hitz et al. (2019) these principles were regarded as not so important for data governance or for perceived data quality. However, whether data can be used effectively is still a data quality characteristic. Hitz et al. (2019) explain this by the fact that data users are not aware that the correct use of data is just as important as the correct supply.

These two perspectives on data must not be interchanged. Despite this assessment, the study shows that the respondents are nevertheless aware of the indicators of semantics. Principles of coherence must always be understood in the business context.

Relevance The principle of data relevance is central and must be taken into account in data engineering. The principle is so strong that making no information available should be valued higher than making irrelevant information available.

For an instantiated solution, it would mean that data engineering pays special attention to this relevance when designing the data architecture. In the worst case, the provision of irrelevant data would cause data to be completely ignored, even if it were relevant in context. The correct implementation of this principle also results in a reduction of data and a focus on the most essential.

Traceability This principle, too, is a requirement for data preparation that has been disregarded in practice. The knowledge about how the information came about is missing in many systems because there is no versioning. The implementation of this principle does not improve the quality of the information itself. It must be seen as a supplement.

An instantiated solution would have to carry the source information throughout the entire data refinement process. This could, for example, be done via corresponding information structures in the metadata or a binding versioning.
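
A minimal sketch of carrying source information through the refinement process, with an invented refinement step, could look as follows:

def refine(rows, step_name, transform, lineage):
    """Apply one refinement step and record its provenance in the lineage list."""
    result = transform(rows)
    lineage.append({"step": step_name,
                    "input_rows": len(rows),
                    "output_rows": len(result)})
    return result, lineage

raw = [{"amount": 10}, {"amount": -3}, {"amount": 7}]
cleaned, trail = refine(raw, "drop_negative_amounts",
                        lambda rs: [r for r in rs if r["amount"] >= 0], [])
print(trail)  # documents how the information came about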

Contextuality It is data that creates information, and information in turn creates a correct understanding of a context. A context can be explained by the relationships between the objects described by the information. How these relationships and contexts are to be understood must be determined in advance; this is the core of the contextuality principle. It shows the relationship between the information contents that leads to the understanding of the context. This principle also has to do with the temporal nature of data and the associated information and issues.

In an instantiated solution these relations between data and information would have to be documented for the use of data. Metadata could also be maintained here. The temporal discussion would possibly already be solved by other principles.

5.4.2.3 Function

Last but not least, the principles of function must be mentioned. With these principles, too, the indicators in the study of Hitz et al. (2019) drew a clear picture for the variable syntactic.

The principles described the variable very strongly. However, the interviewees did not make any reference to data governance. The principles were not seen as a basis for a well-perceived data quality. This circumstance is also explained in the study by the fact that the recipients of information are not aware of the importance of technically correct processing, because it does not belong to the domain knowledge of data recipients. Nevertheless, there are important principles that have to be adhered to in order to guarantee data governance at all. The finding that the variable does not describe data governance significantly must therefore be regarded as negligible for an abstract solution. During the interviews and discussions with the experts, the essential existence of these principles for the creation of data governance was clearly confirmed.

Preparation The principle of preparation addresses data refinement. From a technical point of view, data should be structured in such a way that it can be processed in the same way. The definition of the structure must therefore be the same. If this is not possible, then the information about the structure would have to be provided with such metadata that automation would still be possible. For the processing of data this is not insignificant; nevertheless, the largest costs arise in the so-called data staging.

For an instantiated solution there are different techniques to solve the problem. Here, too, it would be extremely helpful to define metadata and to deliver it together with data transfers. Different approaches are already widespread today. However, there is still no uniform standard.
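
One such technique, sketched here with invented field names and a deliberately simple payload, is to deliver the structural metadata together with the data transfer so that staging can be automated:

# Hypothetical data transfer: payload plus structural metadata.
transfer = {
    "metadata": {"schema": {"customer_id": "int", "net_amount": "float"},
                 "delimiter": ";"},
    "payload": "4711;19.90\n4712;5.40",
}

_casts = {"int": int, "float": float, "str": str}

def stage(transfer):
    """Parse the payload generically, driven only by the delivered metadata."""
    schema = transfer["metadata"]["schema"]
    delimiter = transfer["metadata"]["delimiter"]
    rows = []
    for line in transfer["payload"].splitlines():
        values = line.split(delimiter)
        rows.append({column: _casts[dtype](value)
                     for (column, dtype), value in zip(schema.items(), values)})
    return rows

print(stage(transfer))  # [{'customer_id': 4711, 'net_amount': 19.9}, ...]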

Extensibility The principle of extensibility was very controversial in the discussion with the experts. The question was whether it should be possible to extend existing data with additional data artificially generated in the process. It is not about merging or mutually enriching data. It is about new data that is created in context. The danger is evident because it becomes difficult for the information recipient to distinguish the real data from the artificial data. In addition, artificial data is only as good as the algorithm that created it, in other words, as good as the person who programmed the algorithm. With this aspect in mind, conspiracy theories will soon be the order of the day. It is forgotten that extending data has always been a technique in the field of data science, for example completing incomplete datasets. Labelling data for the calculation of models also belongs to the normal techniques in the data science area. Nevertheless, the concerns are not to be trivialized.

An instantiated solution would therefore have to be based on the principle that it is shown transparently at all times when and in which way data has been expanded. Here, too, metadata can be used which describes, for example, the functionality of the algorithms.
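
As a sketch of such transparency, assuming a simple mean imputation as the generating algorithm, every artificially created value can be flagged for the information recipient:

from statistics import mean

def impute_missing(rows, column):
    """Fill gaps with the column mean, but mark every generated value transparently."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill_value = mean(observed)
    for r in rows:
        r[column + "_is_artificial"] = r[column] is None  # visible to the recipient
        if r[column] is None:
            r[column] = fill_value
    return rows

print(impute_missing([{"age": 30}, {"age": None}, {"age": 50}], "age"))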

Version control This principle refers to the previous discussion on the determination of the validity of data; see also section 5.1.1. However, the principle does not mean that versioning should exist for all data. Versioning is very complex, and it has to be evaluated where versioning is really necessary.

For the implementation in an instantiated solution one can stick to the already advanced techniques. In the future, one can count on options in database systems that allow versioning at the table level. Within the scope of this dissertation project, however, such a (promised) solution could not be identified.

Historicization The principle of historicization was also discussed in section 5.1.1. This principle is undisputed. Interestingly, the definition is often mistaken for versioning. The principle of historicization guarantees that data along the time axis can be distinguished from each other at any time. What seems completely logical here is in reality difficult. Thus, so-called data cemeteries arise in companies, from which nobody knows who still needs the data at all.

An instantiated solution must ensure that the sequence of data can be distinguished independently of their time stamps. Extensive techniques already exist to help implement this principle.
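
One such technique, shown here as a minimal sketch with invented record contents, is to assign a monotonically increasing sequence number so that the order of states does not depend on the time stamps alone:

from itertools import count

_sequence = count(1)  # monotonically increasing, independent of wall-clock time

def historicize(history, state, timestamp):
    """Append a new state; its order is guaranteed by the sequence number."""
    history.append({"seq": next(_sequence), "timestamp": timestamp, **state})
    return history

h = historicize([], {"status": "ordered"}, "2024-01-01T10:00")
h = historicize(h, {"status": "shipped"}, "2024-01-01T10:00")  # same timestamp,
print(h)                                                       # still distinguishable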

Accessibility The principle of accessibility also speaks for itself. A data architecture must therefore be defined in such a way that data can be found and queried. This is by no means the case in the observed companies. Many companies know about data that should be available, but have no idea how to access this data. Available data must also be accessible; that is what this principle stands for.

An instantiated solution would also have to make this available via a data architecture. Access to data and information would have to be centrally accessible.

Measurability Judgment depends not only on data but also on the ability to evaluate and measure that data correctly. Wrong judgements ultimately lead to wrong decisions. The principle of measurability ensures that data is technically prepared in such a way that the measurement carried out by the information recipient leads every other recipient to the same conclusion. In the companies observed, it had to be established that there were already differences in the scales in use. It must be ensured how and what is measured, and how the creator of the information intended the measurement to take place. This is what this principle stands for.

In an instantiated solution this transparency must be provided centrally. It again concerns the data architecture and the associated metadata.
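
A hypothetical central registry of measures, in which unit, scale and calculation rule are defined once, sketches how this transparency could be provided via the metadata:

# Hypothetical measure registry: defined once, used identically by every recipient.
measure_registry = {
    "on_time_delivery_rate": {
        "unit": "percent",
        "scale": (0, 100),
        "formula": lambda on_time, total: 100.0 * on_time / total,
    },
}

def measure(name, *args):
    """Evaluate a measure exactly as its creator defined it, within its scale."""
    spec = measure_registry[name]
    value = spec["formula"](*args)
    low, high = spec["scale"]
    assert low <= value <= high, f"{name} outside its defined scale"
    return value, spec["unit"]

print(measure("on_time_delivery_rate", 45, 50))  # (90.0, 'percent')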

5.4.3 Answer of the final model to research question and hypotheses

In this subsection, the central question is compared with the design for data definition. The question in section 1.4 was: Which model of principles for the definition of data needs to be considered in order to provide information consistency for the use of decision making? The final model gives a detailed as well as scientifically sound answer to this question. The current problems of the lack of consistency of information could be identified and generalized. By adhering to the validated principles, an instantiated solution can lead to better information consistency as well as better perceived data quality.

The two hypotheses could be tested in the study of Hitz et al. (2019). The first hypothesis, Design principles for data definition are correlated with structures of a data governance, cannot be rejected. In the study, all principles of data definition show a clear relation to each other and correlate extraordinarily strongly with the groups (Hitz et al., 2019). However, the second hypothesis, Design principles for data definition are positively correlated with perceived data quality, which was also examined in the study by Hitz et al. (2019), must be rejected. The reasons for this have already been explained in the study (Hitz et al., 2019).

6 | Conclusions and outlook

This chapter summarizes the key points of the dissertation project. In section 6.1 the importance from the point of view of management is briefly discussed. Section 6.2 describes the contribution to theory. Section 6.3 explains the limitations on which the research is based and gives an outlook on further research that could be carried out.

6.1 Managerial contributions

Anyone looking for pitfalls in data science on the internet can read among the first points that more than 80% of the effort is still spent on data preparation. Finding, assembling, cleansing, and preparing the data for use still seems to be a big problem. It is also discussed that the complexity of data should not be underestimated. Cao (2016) states that data quality is a critical problem in data science and engineering. Data management stands in a strong context to data science and its X-complexities (Cao, 2017). The model presented in this dissertation helps to reduce this complexity. The various studies showed several findings that can be applied in the management context. An important finding is that, given the nature of today's world of work, it is difficult to maintain data governance over time. The desired good, perceived data quality cannot be maintained even at great expense. The reasons for this lie in the semantics of the data, which change continuously during the life cycle of the information and can inevitably lead to misinterpretation. Another important finding lies in the decision-making process itself. According to current evidence, data is not solely responsible for bad decisions based on an inconsistency of information. In all qualitative discussions, cultural aspects were mentioned which play a more important role and according to which people make decisions. This means that decisions could be wrong even if all available information showed a consistent picture.

Mental models are responsible for this and are reflected in the experience of decision-makers.

In the longer term, such mental models would take the place of fact-driven decision-making.

This means that if confidence in the data is lost due to poor data quality, decision-makers rely on their intuition, their mental model. Such a development must be prevented, and the model for the principles of data definition presented in this dissertation contributes to this. With this model, companies are able to reduce the complexity of their data landscape. Applying the model would mean (1) finding, (2) semantically assigning, (3) analysing, (4) automatically processing, (5) including in and designing models, and (6) interpreting data correctly and without doubt. Further, automation will become more likely, which will lead to digital expert systems. Digital expert systems with intelligent human-like bots would be able to prepare