
2. Background

2.2 Semantic web


The simplest mental model for RDF is the graph view, in which nodes represent resources and edges represent links between two resources. As a visual presentation of the data, the graph view is usually easy to understand.

2.2.3 RDFS

RDFS (Resource Description Framework Schema) provides a mechanism for describing RDF resources and the relations between them. It is an extension of the basic RDF vocabulary. The RDFS namespace is identified by the IRI http://www.w3.org/2000/01/rdf-schema#, which is conventionally associated with the prefix rdfs:. [51]

RDF Schema has a class and property system that describes properties in terms of the classes to which they apply. The domain and range mechanism makes it possible to determine characteristics of other resources. "For example, we could define the eg:author property to have a domain of eg:Document and a range of eg:Person" [51]
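The domain and range mechanism can be illustrated with a minimal sketch in Python, in which triples are modelled as plain tuples. The names eg:author, eg:Document, and eg:Person come from the quoted example; the resources eg:report and eg:alice and the helper infer_types are hypothetical, and the sketch assumes a single domain and range per property. A real application would rely on an RDF library instead.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"
RDFS_RANGE = "rdfs:range"

# Schema from the quoted example: eg:author links documents to persons.
schema = {
    ("eg:author", RDFS_DOMAIN, "eg:Document"),
    ("eg:author", RDFS_RANGE, "eg:Person"),
}

# Hypothetical instance data, invented for illustration only.
data = {("eg:report", "eg:author", "eg:alice")}


def infer_types(schema, data):
    """Derive rdf:type triples from rdfs:domain and rdfs:range declarations.

    Assumption: each property has at most one domain and one range.
    """
    domains = {s: o for s, p, o in schema if p == RDFS_DOMAIN}
    ranges = {s: o for s, p, o in schema if p == RDFS_RANGE}
    inferred = set()
    for s, p, o in data:
        if p in domains:
            inferred.add((s, RDF_TYPE, domains[p]))  # subject is in the domain
        if p in ranges:
            inferred.add((o, RDF_TYPE, ranges[p]))  # object is in the range
    return inferred


inferred = infer_types(schema, data)
```

Under these assumptions, the single eg:author statement is enough to conclude that eg:report is an eg:Document and eg:alice is an eg:Person, although neither type was stated explicitly.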

A benefit of RDF Schema is that the list of properties of existing resources can be extended at any time without massive effort. This helps keep the descriptions of resources up to date.

2.2.4 SKOS

SKOS (Simple Knowledge Organization System) is a part of the W3C Semantic Web technology stack and is built upon RDF and RDFS. SKOS is a standard for representing controlled vocabularies, taxonomies, and thesauri. Its main goal is to make the publication of such vocabularies easier and to enable their data to be linked and reused. It can be applied to a wide range of knowledge sources. SKOS concepts can be related and linked to other concepts, which allows cost-efficient development.

The main elements of SKOS are concepts, labels and notations, documentation, semantic relations, collections, and mapping properties. Since SKOS adopts a concept-based approach, relationships are expressed between concepts, and the concepts are associated with lexical labels. SKOS concept schemes are not formal ontologies in the way OWL ontologies are.
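The concept-based approach can be sketched in Python as follows. The concepts ex:animals, ex:mammals, and ex:cats, their labels, and the helper broader_transitive are hypothetical and serve only as an illustration; the dictionary keys mirror the SKOS properties skos:prefLabel and skos:broader.

```python
# A tiny hypothetical concept scheme: each concept has a preferred label
# and a list of directly broader concepts.
concepts = {
    "ex:animals": {"prefLabel": "animals", "broader": []},
    "ex:mammals": {"prefLabel": "mammals", "broader": ["ex:animals"]},
    "ex:cats": {"prefLabel": "cats", "broader": ["ex:mammals"]},
}


def broader_transitive(concept, scheme):
    """Collect every concept reachable by repeatedly following 'broader'."""
    seen = []
    stack = list(scheme[concept]["broader"])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(scheme[current]["broader"])
    return seen


ancestors = broader_transitive("ex:cats", concepts)
labels = [concepts[c]["prefLabel"] for c in ancestors]
# labels == ["mammals", "animals"]
```

The relations hold between the concepts themselves, while the labels are only attached to them, which is the distinction the paragraph above describes.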

2.2.5 SPARQL

SPARQL (SPARQL Protocol and RDF Query Language) is an RDF query language. It became an official W3C Recommendation on 15 January 2008.

SPARQL provides functionality to retrieve and manipulate data stored in the RDF format. The entire database is treated as a set of "subject-predicate-object" triples. Triple patterns are like RDF triples, except that the subject, the predicate, or the object may each be a variable. SPARQL supports aggregation, subqueries, negation, creating values by expressions, and extensible value testing.

Additionally, it provides capabilities for querying graph patterns along with their conjunctions and disjunctions. The results of queries can be result sets or RDF graphs. [52]
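How a single triple pattern is matched against a set of triples can be sketched in Python. This is not a SPARQL engine, only an illustration of the idea of variables in triple patterns; the data, the "?name" variable convention for plain strings, and the helper match are assumptions made for this example.

```python
# Hypothetical data set, modelled as (subject, predicate, object) tuples.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "foaf:name", "Bob"),
]


def is_var(term):
    """Terms starting with '?' are treated as variables in this sketch."""
    return term.startswith("?")


def match(pattern, triples):
    """Return one variable binding (a dict) per triple that fits the pattern."""
    solutions = []
    for triple in triples:
        binding = {}
        for pattern_term, term in zip(pattern, triple):
            if is_var(pattern_term):
                # A repeated variable must bind to the same value everywhere.
                if binding.setdefault(pattern_term, term) != term:
                    break
            elif pattern_term != term:
                break
        else:
            solutions.append(binding)
    return solutions


# Roughly: SELECT ?person ?name WHERE { ?person foaf:name ?name }
names = match(("?person", "foaf:name", "?name"), triples)
```

Each solution is a mapping from variables to values, which corresponds to one row of a SPARQL result set.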

2.3 Linked data

The World Wide Web Consortium provides a set of practices for publishing structured data, called Linked Data. The relationships between entities from different datasets turn a collection of isolated datasets into the Semantic Web. Linked Data is the large-scale integration of data on the Web and is essentially the heart of the Semantic Web. Internationalized Resource Identifiers (IRIs) are used to identify entities when structuring data. Each IRI identifies exactly one entity, and that entity can be of any kind, which is why IRIs are universal. [53]

Linked data is based on semantic web technologies. Structured and linked data become more useful when semantic queries are applied to them. If the structure of the data is regular and well defined, it is easier for tools to reuse the data.

Since the language of websites is HTML, the orientation is towards structuring documents rather than data. The structure of HTML makes the extraction of data complicated.

To address this complication, a variety of microformats have been introduced. The weak point of microformats is that they provide only a small set of attributes, so it is often not possible to express relationships between entities. [54]

Web APIs are another way of making structured data available on the web; they provide simple query access over the HTTP protocol. This approach is more generic than microformats. However, it requires significant effort to integrate the data into an application.

In Linked Data, the issues related to microformats and Web APIs are resolved by the Resource Description Framework (RDF). RDF provides a flexible way to describe things in the world using the building blocks of RDF datasets, which are called triples. A triple consists of a subject, a predicate, and an object. This structure gives RDF the flexibility that microformats lack.

2.3.1 URI

A URI (Uniform Resource Identifier) is an identifier type defined by RFC 3986, which specifies the URI syntax. It was designed to be a simple and extensible identifier. The specification does not restrict which kind of resource can be behind the identifier and does not provide information on how the resource can be accessed; these properties follow from the specifications of the individual protocols. The resource can be an electronic document, a physical object, or a service. What matters most is that the identifier distinguishes one resource from all others.

Example: http://www.w3.org/albert/bertram/marie-claude

A limitation of this identifier type is that it is restricted to a subset of the ASCII character set, so only the letters of the English alphabet can be used directly. [55]

2.3.2 IRI

IRI (Internationalized Resource Identifier) is defined in RFC 3987. Unlike a URI, it allows the use of Unicode characters, so an IRI can include, for example, characters of the Czech alphabet and incorporate words from different natural languages. This makes identifiers easier to create, process, understand, and memorize. The IRI specification is compatible with the older URI specification and complements it. When an IRI is used over HTTP, its non-ASCII characters are encoded as UTF-8 and then percent-encoded, as in a URI. IRIs are used in RDF to publish linked open data. [56]

absolute-IRI = scheme ":" ihier-part [ "?" iquery ]

irelative-ref = irelative-part [ "?" iquery ] [ "#" ifragment ]
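The percent-encoding step can be sketched with the Python standard library. The IRI below is a made-up example under the reserved example.org domain. The sketch assumes that the host name is already ASCII; a non-ASCII host would additionally need IDNA conversion, which is not shown here.

```python
from urllib.parse import quote, unquote

# Hypothetical IRI whose path contains the Czech word "český".
# The two non-ASCII letters are written as escapes: \u010d is "č", \u00fd is "ý".
iri = "http://example.org/knihovna/\u010desk\u00fd-jazyk"

# Keep the URI delimiters untouched; every non-ASCII character is encoded
# as UTF-8 and each resulting byte is written as %XX.
uri = quote(iri, safe=":/?#[]@!$&'()*+,;=%")
# uri == "http://example.org/knihovna/%C4%8Desk%C3%BD-jazyk"

# Decoding the percent-encoded form gives back the original IRI.
restored = unquote(uri)
```

The resulting string contains only ASCII characters and can therefore be used wherever a URI is required.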

2.3.3 List of prefixes

A prefix is a standard mechanism for shortening URIs in some RDF serializations, such as Turtle. Prefixes make RDF data easier to understand, to create and modify manually, and to analyze. Prefix names are only a convention, so they can be chosen freely. However, several common prefixes exist and are used worldwide. The prefixes used in this thesis are listed below.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

PREFIX owl: <http://www.w3.org/2002/07/owl#>

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

PREFIX dbo: <http://dbpedia.org/ontology/>

PREFIX dc: <http://purl.org/dc/terms/>

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
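What a prefix declaration does can be sketched in Python as a simple string substitution. The namespace IRIs are taken from the list above; the helper names expand and shorten are hypothetical and only illustrate the mechanism.

```python
# A subset of the prefixes listed above.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def expand(name, prefixes=PREFIXES):
    """Turn a prefixed name such as 'skos:Concept' into a full IRI."""
    prefix, _, local = name.partition(":")
    if prefix not in prefixes:
        raise KeyError("unknown prefix: " + prefix)
    return prefixes[prefix] + local


def shorten(iri, prefixes=PREFIXES):
    """Turn a full IRI into a prefixed name when a known namespace matches."""
    for prefix, namespace in prefixes.items():
        if iri.startswith(namespace):
            return prefix + ":" + iri[len(namespace):]
    return iri  # no known namespace: leave the IRI unchanged


full = expand("skos:Concept")
short = shorten("http://www.w3.org/2002/07/owl#sameAs")
```

Because the prefix is only an abbreviation, two documents may use different prefix names for the same namespace and still refer to the same IRIs.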

2.4 Linked data principles

To allow a machine or a person to explore the semantic web and to make it grow, semantic web technologies are guided by four principles, which are presented in Berners-Lee 2009 [57]:

1. Use URIs as names for things.

2. Use HTTP URIs so that people can look up those names.

3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).

4. Include links to other URIs, so that they can discover more things.

These principles are not strict rules but rather expectations of behavior that keep data aligned with the standards. Following them makes the data interconnected. Breaking them does not destroy the semantic web, but it reduces how effectively the data can be used.

The first principle states that resources must be identified by URIs. If URIs are not used, the data in question is not part of the semantic web.

The second principle is based on the fact that HTTP name look-up is a powerful set of standards. HTTP URIs are names and not addresses; nevertheless, people tend to invent new URI schemes in order to keep them under separate control. The possibility of looking up those names is a convenient way to manage data.

The third principle concerns the information that can be obtained when a URI is looked up, for example from RDF, RDFS, and OWL ontologies, including the relationships between the properties and classes in an ontology.

The fourth principle helps to connect data into one unbounded web, which makes the available information as complete as possible. The value of the information within a web page is enriched by what it links to. [57]

2.5 Benefits of linked data

The main benefit of linked data is easy and efficient data integration and browsing through complex data. It removes the walls between different formats and different sources.

As a result, updating data becomes easier. Links make it possible to extend knowledge beyond existing facts. Linked Data is sharable, extensible, and easily re-usable.

That is why the linked data approach significantly surpasses current practices and solutions for creating and delivering library data. [58]

The use of Web-friendly identifiers (URIs), which the Linked Data standards rely on, enables multilingual functionality for the data. Identifiers allow multiple descriptions to refer to the same resource. Since resources can be described by different libraries, organizations, and individuals, anyone can contribute their expertise to the description of a resource.

This collective approach increases the value of the data far beyond each contribution taken individually. Globally unique identifiers allow easier citation and easier access to the metadata description. Libraries and memory institutions provide trusted data of long-term cultural importance.

Additionally, linked data aims to eliminate the redundancy of bibliographic descriptions on the web. Clear identifiers and links reduce the number of different names for one entity.

It is not yet obvious how useful the output of Linked Data will be once it is implemented. However, the capabilities for using data will improve as the structure of the data becomes richer. Navigation across library data and searching will become more advanced. The global information graph built from the datasets will also become richer.

The benefit to researchers lies in the re-use of library data. An advantage is that linked data technology enhances the existing Web rather than rebuilding it or creating a new one. A researcher can make a citation simply by copying and pasting URIs. Citation management integrates library data into research documents. As a result, starting from a research document it is possible to find additional data and information just by following links among multiple domain-specific knowledge bases. Moreover, this makes it easier for other researchers to reuse the data or to replicate experiments.

For organizations, linked data reduces infrastructure costs. It may be the first step toward a cost-effective approach to the complex management of cultural information. For small projects or organizations, access to larger datasets may increase their presence on the web. Internal data curation and publishing processes can also be improved by the use of linked data.

Linked data creates an open global pool of shared data, which has a direct impact on librarians and archivists. Redundant effort can be reduced because resources can be used and re-used on a large scale. Identifiers make updates easier, so resources can be kept up to date at lower cost. The focus should shift from re-creating descriptions that have already been created by others to extending domains of local expertise.

For developers, linked data offers many opportunities to provide support to information technology. Developers will not have to use custom software tools for library-specific data formats. By leveraging RDF and the well-known standard protocol HTTP, data is placed in a generically understandable format. [58]

3. Full and Partial Identity

To analyze quasi-equivalent concepts, it is important to understand the nature of equality. This work is built on the assumption of a correspondence between identity and equivalence and, respectively, between quasi-identity and quasi-equivalence. That is why the beginning of this part is dedicated to the concept of identity and its philosophical roots.

3.1 The Concept of Identity

The question of equivalence or quasi-equivalence has its roots deep inside the concept of identity. Over the years, "identity" has had multiple interpretations, and there is still no standard definition of the term. Since the concept comes mainly from philosophy, many philosophers have tried to get closer to the essence of "identity".

Plato was the first to differentiate the senses of the verb "to be". He separated "is" as a copula from "is" as identity, which helps to distinguish one object from another. [20]

Aristotle, on the other hand, defines the numerical sense of identity, which is called the Indiscernibility of Sames:

If x and y are the same in number, then every attribute of the one is an attribute of the other.

There is a related statement, the Indiscernibility of Identicals, which is often called "Leibniz's Law" or "one-half of Leibniz's Law". However, Kenneth T. Barnes in the work "Aristotle on identity and its problems" attributes the principle of the Indiscernibility of Identicals to Aristotle:

If x and y are identical, then every attribute of the one is an attribute of the other.

[21]

The study of Aristotle's views by Nicholas P. White gives its own formulation of the principle of the Indiscernibility of Identicals:

If A and B are identical, then whatever is true of the one is true of the other.

[21]

Although there are some questions regarding the authorship of the Indiscernibility of Identicals, there is no doubt that principles of identity occupied a central place in Leibniz's work and philosophy. Leibniz stated that there cannot be two absolutely similar things, because there cannot be things that have absolutely equal properties. His idea was developed in the concept of substance. [22]

In the article "Identity, Indiscernibility, and Philosophical Claims", a version of Leibniz's principles is presented by the formulas below, where a and b denote individuals and F is a variable that ranges over the properties of the individuals a and b. [24]

∀F(F(a) ↔ F(b)) → a = b

a = b → ∀F(F(a) ↔ F(b))

[24]

The difficulty in establishing that a is equal to b on the basis of their properties lies in finding all the properties of both individuals. This is feasible only under the assumption that the number of properties is limited.
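The role of the finiteness assumption can be illustrated with a short Python sketch. The individuals, their properties, and the helper indiscernible are hypothetical; the point is only that the comparison is decidable once the set of properties is fixed and finite, and that its outcome depends on which properties are included.

```python
def indiscernible(a, b, properties):
    """True if a and b agree on every property in the given finite list."""
    return all(prop(a) == prop(b) for prop in properties)


# Two hypothetical individuals described by a handful of attributes.
chair_1 = {"colour": "brown", "legs": 4, "room": "kitchen"}
chair_2 = {"colour": "brown", "legs": 4, "room": "office"}

# A finite set of properties F over which the comparison is made.
properties = [
    lambda x: x["colour"],
    lambda x: x["legs"],
]

# Restricted to colour and legs, the two chairs cannot be told apart ...
same_on_two = indiscernible(chair_1, chair_2, properties)
# ... but adding one more property is enough to distinguish them.
same_on_three = indiscernible(chair_1, chair_2, properties + [lambda x: x["room"]])
```

Here same_on_two is True and same_on_three is False, so a conclusion of identity drawn from a limited property set holds only relative to that set.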

As an alternative to Leibniz's theory, Kant stated that individuals cannot be specified in terms of a concept of substance. According to Kant, individual objects are bound to space and time, which contrasts "identity" with "existence". He wanted to attribute identity not through the properties of individuals but a priori, in a fundamental way. Kant viewed individuals as objects in their unity and not as sets of properties. In practice, awareness of ourselves as thinking subjects stands for the Kantian sense of identity. As he stated, "The 'I think' must be able to accompany all my representations" [23, §16].

3.2 Identity Problems

This brief survey of the philosophical roots and definitions of identity reveals several identity problems and inconsistencies. This section introduces some of these issues in more detail.

3.2.1 Philosophical Problems

As already briefly mentioned, the main problem of personal identity in philosophy relates to the question of what should be identified as a single individual. From a philosophical point of view, two points need to be clarified before answering that question.

The first one is the disagreement between Leibniz's theory and Kant's theory regarding metaphysical presuppositions such as space and time. Is it possible to say that a chair today and the chair tomorrow are the same chair? Over time, some properties of the chair may change. If someone breaks the chair, even its main purpose may change, as it will be impossible to sit on it. Would it still be the same chair in this case? Philosophy offers no standard and unique answer. The same applies to location. A chair outside a house in the rain and a chair in a warm room will also have slightly different properties, being wet and cold in one case and dry and warm in the other.

The second one is context-dependency. Sometimes a context must be specified before two individuals can be declared identical. Consider, for example, a doctor prescribing a painkiller to a patient. In the medical context, two medicines with the same chemical structure and the same effectiveness are identical: both are the same painkiller. However, in a business or market context this may not hold, as the two medicines may have different prices or be produced by different pharmaceutical companies.

3.2.2 Practical Problems

Some practical problems of identity in ontologies are driven by the philosophical problems mentioned above. Others, however, stem from the nature of ontologies themselves.

Because of the Open World Assumption and the AAA (Anyone can say Anything about Any topic) principle, the amount of data, including individuals and their properties, is continuously growing. That is why the problem of identity in the Web of Data is even more controversial.

Under the Open World Assumption, the absence of an identity statement between two individuals does not mean that they are not identical, unless an owl:differentFrom statement exists.
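The consequence of the Open World Assumption for identity can be sketched in Python as a three-valued answer. The individuals ex:a to ex:d and the helper identity_status are hypothetical, and the sketch looks only at explicitly stated triples; it performs no reasoning such as the transitivity of owl:sameAs.

```python
# Hypothetical statements about four individuals.
triples = [
    ("ex:a", "owl:sameAs", "ex:b"),
    ("ex:a", "owl:differentFrom", "ex:c"),
]


def identity_status(x, y, triples):
    """Classify a pair using explicit statements only (no inference)."""
    pair = frozenset((x, y))
    same = {frozenset((s, o)) for s, p, o in triples if p == "owl:sameAs"}
    different = {frozenset((s, o)) for s, p, o in triples if p == "owl:differentFrom"}
    if pair in same:
        return "same"
    if pair in different:
        return "different"
    # Open world: missing information is not a negative answer.
    return "unknown"


status_ab = identity_status("ex:a", "ex:b", triples)  # "same"
status_ca = identity_status("ex:c", "ex:a", triples)  # "different"
status_bd = identity_status("ex:b", "ex:d", triples)  # "unknown"
```

Under a closed world assumption the last pair would be treated as different; under the open world assumption the question simply remains open.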

From the AAA point of view, there is no guarantee that owl:sameAs links are correct, especially since most of them are generated by automated algorithms. In "Results of the Ontology Alignment Evaluation Initiative 2019", the reported accuracy ranged between 79% and 92% (SPIMBENCH). [25]

In the article "The sameAs Problem: A Survey on Identity Management in the Web of Data", the authors give an example of the matching accuracy problem of algorithms in the case of books. It is common to match books based on the similarity of titles and authors. However, two different editions of one book with different numbers of pages also share the title and the author without being owl:sameAs. [19]
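The book example can be made concrete with a small Python sketch. The two records and both matching functions are invented for illustration and are not the algorithm of any particular matching tool.

```python
# Two hypothetical records describing different editions of one book.
edition_1 = {"title": "Example Title", "author": "A. Author", "pages": 320}
edition_2 = {"title": "example title", "author": "A. Author", "pages": 512}


def naive_match(x, y):
    """Match on title and author only, ignoring case and surrounding spaces."""
    def norm(value):
        return value.strip().lower()
    return norm(x["title"]) == norm(y["title"]) and norm(x["author"]) == norm(y["author"])


def stricter_match(x, y):
    """Additionally require the same number of pages."""
    return naive_match(x, y) and x["pages"] == y["pages"]


would_link = naive_match(edition_1, edition_2)        # True: a false positive
would_link_strict = stricter_match(edition_1, edition_2)  # False
```

The naive matcher would generate an owl:sameAs link between the two editions, whereas a single additional attribute is enough to keep them apart.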

In addition, the same article states that different people may have different opinions about the similarity of two objects. The authors present a study in which three experts analyzed 250 owl:sameAs links. The first expert confirmed only 73 owl:sameAs links, the second 132, and the third 181. The deviation is quite high. [19]

There can be multiple reasons. One of them stems from the philosophical problem of context-dependency. If two different persons evaluate an owl:sameAs connection based on different contexts, the results might be completely different. Another reason may relate to differences in the competence of the modelers.
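The size of the deviation can be quantified with simple arithmetic, shown here as a short Python calculation. The counts are those reported above; the variable names are chosen for this illustration.

```python
total_links = 250
confirmed = {"expert 1": 73, "expert 2": 132, "expert 3": 181}

# Share of links each expert confirmed, in percent.
shares = {name: round(100 * n / total_links, 1) for name, n in confirmed.items()}
# shares == {"expert 1": 29.2, "expert 2": 52.8, "expert 3": 72.4}

# Difference between the most and the least permissive expert,
# in percentage points.
spread = round(max(shares.values()) - min(shares.values()), 1)
# spread == 43.2
```

The experts thus confirmed between 29.2% and 72.4% of the same 250 links, a spread of 43.2 percentage points.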

3.5 Contextual Identity

Contextual identity is closely related to the second philosophical problem, context-dependency. As the practical examples have also shown, the idea of identity is connected to the context of the individuals. According to the standard OWL semantics, an "owl:sameAs statement indicates that two URI references refer to the same thing: the individuals have the same "identity"." [9]

For example, the first data set contains information about a:greg:

a:greg a foaf:Person.

The second data set provides information about a:otherGreg and the identity relation stating that a:greg and a:otherGreg are the same individual:

a:otherGreg a foaf:Person.

a:otherGreg owl:sameAs a:greg.
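Since owl:sameAs is symmetric and transitive, a set of such statements partitions the identifiers into groups that denote the same individual. A minimal Python sketch of this grouping, using a union-find structure, is given below. The identifiers a:greg and a:otherGreg come from the example above; b:gregory and the helper same_as_groups are hypothetical additions.

```python
def same_as_groups(pairs):
    """Group identifiers connected by owl:sameAs statements (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# The statement from the example plus one hypothetical third identifier.
groups = same_as_groups([
    ("a:otherGreg", "a:greg"),
    ("a:greg", "b:gregory"),
])
# groups == [{"a:otherGreg", "a:greg", "b:gregory"}]
```

The merge is unconditional: every property stated about one identifier is treated as holding for all identifiers in the group, regardless of the context in which it was stated.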

However, this example does not consider contexts. "At the moment, the way of encoding contexts on the Web is largely ad hoc, as contexts are often embedded in application
