• Nebyly nalezeny žádné výsledky

Database Applications (NDBI026) SS2021/22

N/A
N/A
Protected

Academic year: 2022

Podíl "Database Applications (NDBI026) SS2021/22"

Copied!
111
0
0

Načítání.... (zobrazit plný text nyní)

Fulltext

(1)

course:

Database Applications (NDBI026)

SS2021/22

RNDr. Michal Kopecký, Ph.D.

Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague

(2)

Full-text Search

Filtering

Disambiguation and Lemmatization

Indexes

Thesauri

Querying

XML Extension

Privileges – Overview

Granting Privileges

Revoking Privileges

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 2

(3)
(4)

Different principles than in case of standard data searches

Not strictly structured data, where each column has its exact in advance known semantics, are searched

The task is to search in texts written in natural language. The same thing can be described differently by different authors.

Different words with the same meaning (Synonyms)

Different collocations

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 4

(5)

Database engines uses their means of

extensibility to allow full-text searches out of the box

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 5

(6)

Standard: SQL/MM 2

SQL Multimedia and Application Packages Part 2: Full-Text

Real implementation done by proprietary extensions

There is nothing like universal best possible solution

Results are evaluated subjectively by users

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 6

(7)

Full-text Search, XML Extension (NDBI026, Lect. 6)

Writer:

Text = sequence of words in natural language.

Each word stands for some idea/imagination of writer.

Ideas represent real subject, activity, etc.

Reader: folows (not necessary exactly the same) mappings from left to right

...

M. Kopecký 7

(8)

Full-text Search, XML Extension (NDBI026, Lect. 6)

Synonymity of words

More words can have the same meaning for the writer

car = automobile

sick = ill

...

M. Kopecký 8

(9)

Homonymy of words

One word can have more different meanings, not one exact meaning

fluke: fish, anchor, …

crown: currency, treetop, jewel, …

class: year of studies, category in set theory, …

Full-text Search, XML Extension (NDBI026, Lect. 6)

...

M. Kopecký 9

(10)

Full-text Search, XML Extension (NDBI026, Lect. 6)

Word meanings need not be exactly the same.

Hierarchical overlapping

animal > horse > Stallion

Associativity among meanings

calculator ~ computer ~ processor

...

M. Kopecký 10

(11)

Full-text Search, XML Extension (NDBI026, Lect. 6)

Mapping between subjects, ideas and words can depend on individual persons – readers and writers.

Two people can assign partly or completely different meaning to given term.

Two people can imagine different thing for the same word.

mother, room, ...

In result, by reading the same text two different readers can obtain different information

Each from other

In comparison with author’s intention

M. Kopecký 11

(12)

Full-text Search, XML Extension (NDBI026, Lect. 6)

Homonymy and ambiguities grow with transition form words/terms to sentences and bigger parts of the text.

Example of English sentence with more grammatically correct meanings (in this case a human reader probably eliminates the nonsense meaning)

In the sentence „Time flies like an arrow“ either flies (fly) or like can be chosen for the predicate, what produces two significantly different meanings.

M. Kopecký 12

(13)

Example of Czech sentence with more

grammatically correct meanings (again, a human reader probably eliminates the nonsense meanings)

see for example Podivné fungování gramatiky, http://www.scienceworld.cz/sw.nsf/lingvistika

Sentence „Ženu holí stroj“ can have depending on the predicate selection – meaning:

„Poháním stroj pomocí hole“ (hnát)

„Žena používá depilační přístroj“ (holit)

„Návod k nekonvenčnímu způsobu oblékání“ (strojit)

… and further

„Oblékej ženu, patřící holím“,

„Poháním stroj, patřící holím“

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 13

(14)

The query, that would return all documents, that are interesting for the user if hard

(impossible) to formulate

Together with correct – relevant – documents are usually returned also incorrect – irrelevant –

documents

Not all relevant documents are returned

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 14

(15)

Problems

Homonyms

Wants the user asking “square” to see documents concerning the geometry (special case of rectangle) or would like to find some plaza or even something other?

Synonyms

Is document about “bins” relevant to query asking “baskets”?

Is document about “trees” relevant to query asking

“connected acyclic graphs”?

Hierarchies

Animal - Mammal - Predator - Bear

Printed media Magazine

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 15

(16)

Strict Boolean logic is not very suitable for full-text search query formulations

Document either complies the query or not

Full-text searches requires sorting answers according to presumed relevancy to the user

It is useful to be able to define the measure of similarity between the query and the document

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 16

(17)

Full-text Search, XML Extension (NDBI026, Lect. 6)

Two TRS’s can (and do) return two different

result for the same query, that can be partly or completely unique.

How to compare quality of those systems?

Documents in the database

Relevant documents Returned

by TRS1

Returned by TRS2

M. Kopecký 17

(18)

Full-text Search, XML Extension (NDBI026, Lect. 6)

Two questioners can suppose another

documents to be relevant for their equally formulated query

How to meet both subjective

expectations of questioners?

Documents in the database

Relevant Returned

docs.

Relevant

M. Kopecký 18

(19)

Full-text Search, XML Extension (NDBI026, Lect. 6)

Quality of result set of documents is usually evaluated according to numbers Nv, Nr, Nrv

Precision

P = Nvr / Nv

Probability of returned document to be relevant to the user

Recall

R = Nvr / Nr

Probability of relevant document to be returned to the user

M. Kopecký 19

(20)

Full-text Search, XML Extension (NDBI026, Lect. 6)

Both coefficients depend on the feeling of the questioner

The same document can fulfill information needs of first questioner while at the same time fail to meet them for the second one.

Each user determines different values of Nr and Nrv coefficients

Both measures P and R depend on them

M. Kopecký 20

(21)

Full-text Search, XML Extension (NDBI026, Lect. 6)

In optimal case

P=R=1

There are all and only relevant documents in the response of the system

Usually

The answer in the first iteration is neither

precise nor complete

0 0 1

1

Optimal answer

Typical initial answer

M. Kopecký 21

(22)

Full-text Search, XML Extension (NDBI026, Lect. 6)

Query tuning

Iterative modification of the query targeted to increase the quality of the response

Theoretically it is

possible to reach the optimum sooner or later …

0 0 1

1

Optimum

R

P

M. Kopecký 22

(23)

Full-text Search, XML Extension (NDBI026, Lect. 6)

… due to (not only) ambiguities both measures depend

indirectly each on the other,

ie. P*R const. < 1

In order to increase P the absolute number of

relevant documents in the response is decreased.

In order to increase R the number of irrelevant

documents rapidly grows.

The probability to reach quality above the limit is

low. 0 0

1

1

Optimum

R

P

M. Kopecký 23

(24)

Full-text Search, XML Extension (NDBI026, Lect. 6)

In time of query formulation the questioner has to guess correct term (words) the author used for expression of given idea

Problems are caused e.g. by

Synonyms (author could use different synonym not remembered by the user)

Overlapping meanings of terms

Colorful poetical hyperboles

M. Kopecký 24

(25)

Full-text Search, XML Extension (NDBI026, Lect. 6)

The problem can be partly suppressed by inclusion of thesaurus, containing

Hierarchies of terms and their meanings

Sets of synonyms

Definitions of associations between terms

Questioner can use it during query formulation

System can use it during query evaluation

M. Kopecký 25

(26)

Full-text Search, XML Extension (NDBI026, Lect. 6)

The user often tends to tune its own query in conservative way

He/she tends to fix terms used in the first iteration (they must be the best because I remembered them immediately) and vary only additional terms at the end of the query

It is useful to support the user to

(semi)automatically eliminate wrong terms and replace them with useful ones, that describe really relevant documents

M. Kopecký 26

(27)

Full-text Search, XML Extension (NDBI026, Lect. 6)

The questioner is usually not able or willing to go through exhaustive number of hits in the response to find out the relevant one

Usually max. 20-50 documents according to their length

Need to not only sort out documents not matching the query but order the answer list according to

supposed relevancy in descendant order – the supposedly best documents at the begining

M. Kopecký 27

(28)

Full-text Search, XML Extension (NDBI026, Lect. 6)

Due to maximal criterion, the user usually tries to increase the Precision of the answer

Small amount of resulting documents in the answer, containing as high ratio of relevant documents as possible

Some problematic domains requires both high precision and recall

Lawyers, especially in territories having case law based on precedents (need to find out as much similar cases as possible)

„better“

Vr.

Rel.

Vr.

Rel.

„worse“

M. Kopecký 28

(29)

SQL databases usually use modifications of Boolean models for document representation

Corresponds well to standard data representation

Relatively easy implementation

Queries are written in form of Boolean formulae, where operands are words or collocations

Many modifications

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 29

(30)

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 30

Filtering

Lemmatization Disambiguation Stoplist

Lemm. text

Indexing List

of Lemmas

Inverted file

List

of Documents Thesaurus

(31)

Filtering

Removes formatting and produces plain ASCII text

Disambiguation

Determines the meaning of the word according to context

„lie on the bed“ … werb “to lie”

„it’s a lie“ … noun “lie”

Lemmatization

Determines the basic form of the word

Often replaces by Stemmer that eliminates prefixes/suffixes and produces the word stem

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 31

(32)

Indexing

Creates auxiliary data that help find documents efficiently – typically lists of lemmas and

documents together with Inverted file

Couples [doc_id, lemma_id]

sorted by lemma_id, doc_id and stored without duplicities

Currently usually more information, for example quintuples

[doc_id, para_nr, sent_nr, word_nr, lemma_id]

sorted by lemma_id, doc_id, para_nr, sent_nr, word_nr

Allows evaluation of proximity constraints for distances of word occurrences in the document

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 32

(33)

Common character columns are too short for text data

Usually are used subtypes of LOB (Large OBject) type

LOBs

Standard types for huge data storage in the database defined in ANS SQL-92 Full

Up to 4GB of data

BLOB … standard binary type

CLOB … character type in standard server character set

NCLOB … character type in national server character set

In Oracle: additional external type BFILE

Read only

Self-standing independent file stored externally in the OS

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 33

(34)

LOBs

In Oracle

Up to (232-1) * STANDARD_BLOCK_SIZE bytes

BLOB … standard binary type

CLOB … character type in standard server character set

NCLOB … character type in national server character set

V MS SQL

Image … binary data up to 2 GB

Text … textual data up to 2 GB in standard server character set

NText … textual data up to 1 GB in national server character set

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 34

(35)

In table columns are stored only descriptors (LOB locators)

The LOB locator points to data stored in self- standing data segment

Table full-scan need not read data of LOBs

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 35

(36)

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 36

(37)

Index creation on textual column

CREATE FULLTEXT INDEX ON table_name [(column_name [TYPE COLUMN

type_column_name]

[LANGUAGE language_term] [,...n])]

KEY INDEX index_name

[ON fulltext_catalog_name]

[WITH

{CHANGE_TRACKING {MANUAL | AUTO | OFF [, NO POPULATION]}}

]

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 37

(38)

Mandatory information

Table name, name of column of following type:

char, nchar, varchar, nvarchar, text, ntext

Name of (as small as possible) unique index over given table in KEY INDEX clause

Options:

Language definition (LANGUAGE)

Way the index will be updated WITH CHANGE_TRACKING

MANUAL … only log of changes is created. Its processing and index synchronization can be done either manually or by scheduled job

AUTO … index will be synchronized immediately

OFF … index will not be synchronized

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 38

(39)

Example:

CREATE UNIQUE INDEX ui_ukJobCand ON JobCandidate(JobCandidateID);

CREATE FULLTEXT CATALOG ft AS DEFAULT;

CREATE FULLTEXT INDEX ON JobCandidate(Resume) KEY INDEX ui_ukJobCand;

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 39

(40)

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 40

(41)

Document filtering

NULL_FILTER for textual documents

TXT, HTML, XML

INSO_FILTER for binary documents

CHARSET_FILTER for conversion of documents to the database character set

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 41

(42)

Types of full-text indexes

CONTEXT

Basic type of index for searching in text documents

Suitable (designed) for larger documents

The index synchronization is by default done explicitly by invocation CTX_DDL.SYNC_INDEX procedure

(similar to WITH CHANGE_TRACKING MANUAL option in MS SQL)

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 42

(43)

Types of full-text indexes

CTXCAT

Suitable for smaller documents and excerpts

Can be combined with other non-textual columns for combined queries (partially on standard columns and partially full-text search)

The index synchronization is by default done

automatically together with changes in the table

(similar to WITH CHANGE_TRACKING AUTO option in MS SQL)

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 43

(44)

Types of full-text indexes

CTXRULE

Build upon set of predefined queries

Is used for document classification according to queries it correspond to

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 44

(45)

Document source origin

NORMAL_DATASTORE

One document is in each row of given column of the table

MULTI_COLUMN_DATASTORE

Document is formed by concatenation of more columns in each row of the table

URL_DATASTORE

Document is stored on the Internet, the column contains only its URL

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 45

(46)

Index creation on textual column CREATE INDEX index_name

ON table_name(column_name) INDEXTYPE IS

ctxsys.{context|ctxcat|ctxrule}

[PARAMETERS

(’param_name param_value …’)];

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 46

(47)

Mandatory information

Name of the table, name of textual column of one of following types:

char, nchar, varchar2, nvarchar2, clob, nclob

Type (implementation) of the domain index

Optional information:

Filter, Datastore

Lexer … splitting text to words with respect to the language, diacritics and its transcription (ö, oe, o), …

Stoplist,

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 47

(48)

Example:

CREATE INDEX myindex ON doc(htmlfile)

INDEXTYPE IS ctxsys.context PARAMETERS(

’datastore ctxsys.default_datastore filter ctxsys.null_filter

section group ctxsys.html_section_group’);

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 48

(49)

Together with new index types the database servers implement new operators for

comparison of documents with queries

Operators return number, an estimate

measure of similarity of document content with given query

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 49

(50)

MS SQL uses

a Boolean operator – a predicate CONTAINS(

column,

’query’[,

LANGUAGE lang ])

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 50

(51)

Oracle uses numerical function – a predicate CONTAINS(

column,

’query’[,

comparison_nr ])

Using auxiliary function

SCORE(comparison_nr) is possible to find out the returned CONTAINS value

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 51

(52)

Query operators

AND, & … ‘mice & cats’

OR, | … ‘mice | cats’

AND NOT, & ! … ‘mice AND NOT cats’

NEAR, ~ … proximity operator

‘mice NEAR cats’

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 52

(53)

DBI026 -DB Aplikace - MFF UK

Query operators

AND, & … ‘mice & cats’

OR, | … ‘mice | cats’

NOT, ~ … ‘mice ~ cats’

NEAR, ; … proximity operator

‘mice ; cats’

NEAR((mice,cats),5)

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 53

(54)

Query operators

Phrases in double quotas

FORMSOF ( INFLECTIONAL , word) the word is searched in all forms

hide, hidden, …

FORMSOF ( THESAURUS , word)

given word is expanded to the list of words with the similar meaning using thesaurus

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 54

(55)

Query operators

Word prefix followed by star wildcard for

searching for all words that begin with given prefix

word WEIGHT (n)

Weighted term, the importance of word occurrence by is multiplied by given constant, used together with a CONTAINSTABLE operator (see later)

ISABOUT (theme [,…])

Documents discussing given theme

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 55

(56)

Query operators

The SQL wildcards _ and % can be used at the beginning and at the end of words

word WEIGHT n, word * n

Weighted term, the importance of word occurrence by is multiplied by given constant

ABOUT ( theme )

Documents discussing given theme … ‘about(politics)’

STEM, $

Searches for occurrences of words with given stem

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 56

(57)

Query operators

query WITHIN PARAGRAPH query WITHIN SENTENCE

Searching in scope of one paragraph or sentence if the document structure can be recognized and processed

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 57

(58)

CONTAINSTABLE(

tab_name,col_name,query[,language][,top_n]

)

Returns table with the schema (KEY,RANK) that – in pseudo column RANK – contains measures of similarity between document in row with given KEY and the given query

SELECT select_list

FROM table AS FT_TBL INNER JOIN

CONTAINSTABLE(table, column, query) AS KEY_TBL ON (FT_TBL.unique_key_column = KEY_TBL.[KEY] )

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 58

(59)

SELECT

C.CategoryName, C.Description,

R.RANK

FROM Categories AS C INNER JOIN

CONTAINSTABLE(

Categories, Description,

'ISABOUT (fishes weight (.8), beers weight (.3) )'

) AS R

ON C.CategoryID = KEY_TBL.KEY ORDER BY R.RANK DESC;

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 59

(60)

SELECT

C.CategoryName , C.Description,

R.RANK

FROM Categories AS C INNER JOIN

CONTAINSTABLE (

Categories, Description,

'("sweet and savory" NEAR sauces) OR

("sweet and savory" NEAR candies)', 10) AS R ON C.CategoryID = R.KEY;

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 60

(61)

SELECT Description, SCORE(1)+SCORE(2) FROM Categories

WHERE

CONTAINS(

Abstract,

’database & search’,1

) > 0 AND

CONTAINS(Description

,

’(object|relational) NEAR database’,2

) > 0 ORDER BY SCORE(1)+SCORE(2) DESC;

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 61

(62)

Thesauri operators

BT(word[,n]) … Include also Broather Term for word (n levels higher). If the word is a homonym include all meanings

NT(word[,n]) … Include also Narrower Term for word (n levels lower).

PT(word) … Replace word by Preferred Term that represent a whole set of word equivalents.

TT(word) … Replace word by Top Term from the meanings hierarchy thesaurus

SYN(word) … Include all SYNonyms of word.

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 62

(63)
(64)

Creation

CREATE [PUBLIC] SYNONYM new_name

FOR

[schema.]old_name[@dbname];

Dropping

DROP [PUBLIC] SYNONYM new_name;

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 64

Everywhere, where the you can use the old_name, you can use new_name

as well

Can be created for almost everything including synonyms

Public synonyms are visible for

everyone without owner prefix

(65)

Creation

CREATE SYNONYM new_name

FOR [[dbname.]schema.]old_name;

Dropping

DROP SYNONYM new_name;

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 65

Everywhere, where the you can use the old_name, you can use new_name

as well

Can be created for almost everything including synonyms

(66)

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 66

(67)

The database application need communicate with other systems and/or partners

Export of materials from IS to an external accounting

B2G (business-to-government)

Reports for government

B2B (business-to-business)

Export/import of data to/from other IS or company

Electronic orders, invoices, etc.

SOA service-oriented architectures

Communication between system components

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 67

(68)

The XML format is heavily used

+ Readable for both humans and computers

+ Structured

+ Easily analyzed by parsers

Huge amount of data

Direct querying needed different approaches and data structures than in case of relational data

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 68

(69)

To communicate with neighborhood it is often needed

To export stored relational data in XML format

To extract data from an XML tree and store data in relational tables

To store and maintain original XML documents obtained from business partners

To search for XML documents that contain needed information

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 69

(70)

Xpath is one of used query languages for XML documents

The query result is a set of nodes (sub-trees) or content of a node or attribute

Document itself can be understood as a tree containing

A root node (not supposed to be a part of document)

Elements

Attributes

Textual nodes (leaves, corresponding to free text between opening and ending tags of nodes)

… (comments, name spaces, instructions for processing)

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 70

(71)

Path is an basic mean for addressing of node(s) in the XML tree

Absolute beginning with the sign ’/’

Relative

Is written as a sequence of steps, delimited by slash signs.

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 71

(72)

XPath query examples:

/books/book/title

List of all title elements that are direct descendants of a book element that is a direct descendant of a books

element.

//title

List of all title elements in the XML tree, regardless on the path they are located

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 72

(73)

XPath query examples:

/books/*/title

There can be any node (element) on the second step of the path

/books/book[1]/author[2]

Second author of the first book in the list of books

/books/book[author=’Pokorný J.’]/title

Titles of all books written by J. Pokorný

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 73

(74)

Xpath axis

self:: - current node

parent::, child:: - direct parent/children of current node

ancestor::, descendant:: - all ancestors/descendants, including indirect ones

descendant-or-self::, ancestor-or-self::

following::, preceding:: - nodes before/after current node and all its predecessors

following-sibling::, preceding-sibling:: - siblings

attribute:: - attributes of current node

child:: is the default axis, it need not to be explicitly stated

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 74

(75)

preceding:: following::

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 75

(76)

Shortened notation

node_name instead of child::node_name

@attr_name instead of attribute::attr_name

’.’ for current node

..’ for parent node

Query examples:

Titles of all books in textual form /books/book/title/text()

First book in the list /books/book[1]

<book><title>…</title><author>…</author></book>

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 76

(77)

The problem is the effectiveness of XPath queries, respectively of all query languages on XML

documents

Data can be stored per partes in the relational database, and then standard means as B-tree indexes, joins, etc. can be used

+ Usable in any relational database

Necessary to know the structure of documents (DTD, XMLSchema)

Data can be natively stored/processed in form of text

New type of indexes are necessary

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 77

(78)

If it is known the document schema, each document can be split to parts and stored in tables using more variants/algorithms

Idea of one of possible approaches:

It is supposed a simple DTD, respectively existence of less strict simplified DTD’ created for original DTD, where

following implication holds:

validity w.r.t. DTD  validity w.r.t. DTD’

Elements declare other elements in their content at most once

Sub-element can be optional (?) or repeatable (*)

Sub-elements can be delimited by commas, not pipes ’|’

The order of sub-elements has no matter

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 78

(79)

For such simple DTD can be created

List of all possible paths in the XML tree

A table

E(DocID,ElemID,ParID,PathID,attributes) for each element E

Columns contain

Document ID

Element ID (primary key)

Parent element ID

Path ID

Columns for all possible attributes of given element

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 79

(80)

<!ELEMENT book (title, author)>

<!ATTLIST book isbn CDATA)>

<!ELEMENT title (#PCDATA)>

<!ELEMENT author (#PCDATA)>

Book(DocID,ElemID,ParID,PathID,ISBN)

Title(DocID,ElemID,ParID,PathID,PCDataID) Author(DocID,ElemID,ParID,PathID,PCDataID) PCData(DocID,ElemID,Data)

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 80

(81)

XPath queries are transformed to SQL.

It is often necessary to join large number of tables.

Its number can be decreased by better splitting data to tables.

For example one table can contain not only one element, but also all sub- elements that cannot repeat.

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 81

Transformation is implemented:

On the client side

In some relational DBMS’s optionally on the server side

XML Schema

ER/RMD

XML Dok.XML Dok. XPath query

SQL query Data in RDB

XML Docs

(82)

<!ELEMENT book (title, author)>

<!ATTLIST book isbn CDATA)>

<!ELEMENT title (#PCDATA)>

<!ELEMENT author (#PCDATA)>

Book(

DocID,ElemID,

ISBN, TitlePCDataID, AuthorPCDataID, ParID,PathID)

PCData(DocID,ElemID,Data)

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 82

(83)

<!ELEMENT book (title, autor*)>

<!ATTLIST book isbn CDATA)>

<!ELEMENT title (#PCDATA)>

<!ELEMENT author (#PCDATA)>

If the author can be repeated, it is necessary to store them in its own table. The title can be merged/embedded into book table.

Book(

DocID,ElemID,ISBN,TitlePCDataID,ParID,PathID) Author(DocID,ElemID,ParID,PathID,PCDataID)

PCData(DocID,ElemID,Data)

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 83

(84)

DataGuide index

Contains the tree of all paths existing in the XML document tree

Each path is stored in the index only once

Index nodes contain meta-information about all nodes on given path

Index can be stored in form of XML document as well

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 84

(85)

XML path index

Similar to DataGuide index

Nodes contain sets of pointers to corresponding nodes in an XML document

Substantially speeds up answers to queries in form of paths without // operator

Helps with evaluation of queries that don’t begin with // but contain it somewhere in the expession.

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 85

(86)

Lore index

VIndex (Value index) – indexes textual values of nodes and attributes

TIndex (Term index) – indexes words in nodes and

attributes (similar to inverted file from full-text search

LIndex holds data for determining parents of nodes

PIndex (Path index) – holds paths and keeps lists of pointers to all nodes on them

Example

/books//author[@country=”CZ”]

VIndex is used to find attributes with country=”CZ”, then parents are visited in bottom-up order uing LIndex and the path is verified

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 86

(87)

Native data type XML

CREATE TABLE XmlTab(

ID NUMERIC

CONSTRAINT XmlTabPk PRIMARY KEY, Doc XML

);

INSERT INTO XmlTab VALUES(

1, ’<a><b>b1</b></a>’

);

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 87

(88)

Native type XMLType, that allows two different ways of storage:

Using CLOB

Document is stored and returned exactly in its original form including whitespaces (formatting)

+ The XMLSchema need not to be known

More complicated partial actualizations

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 88

(89)

Native type XMLType, that allows two different ways of storage:

Using splitting data to relational tables

Returned document equivalent w.r.t. DOM model, i.e.

without original whitespaces (formatting)

The XMLSchema has to ne known + Effective partial actualization

+ Possible indexing using B-trees

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 89

(90)

Native data type XMLType

CREATE TABLE XmlTab(

ID NUMERIC

CONSTRAINT XmlTabPk PRIMARY KEY, Doc XMLType

);

INSERT INTO XmlTab VALUES(

1, ’<a><b>b1</b></a>’

);

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 90

(91)

MS SQL uses for exporting data in XML format syntax SELECT … FOR XML {RAW|AUTO|EXPLICIT|PATH}

[,XMLDATA][,ELEMENTS][BINARY BASE64]

[ROOT(’root_element’)]

RAW generates one element row for each row. Values are stored in attributes. Their names correspond to column names / alias names

<row empno=”1230” ename=”King” deptno=”10” />

<row empno=”4560” ename=”Scott” deptno=”10” />

<row empno=”7800” ename=”King” deptno=”20” />

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 91

(92)

AUTO generates one element for each table and each row.

The structure corresponds to used joins. Element names correspond to table /alias names

<dept deptno=”10”>

<emp empno=”1230” ename=”King” />

<emp empno=”4560” ename=”Scott” />

</dept>

<dept deptno=”20”>

<emp empno=”7800” ename=”King” deptno=”20” />

</dept>

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 92

(93)

The ELEMENTS modifier ensures that column values are stored as sub-elements

<dept>

<deptno>10</deptno>

<emp>

<empno>1230</empno><ename>King</ename>

<empno>4560</empno><ename>Scott</ename>

</emp>

</dept>

<dept deptno=”20”>

...

</dept>

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 93

(94)

To export data to XML the function SYS_XMLGEN can be used. It can get either atomic value or

structured data type

SELECT SYS_XMLGEN(ename) FROM …

<ENAME>SMITH</ENAME>

<ENAME>ALLEN</ENAME>

...

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 94

(95)

SYS_XMLGEN(instance)

SELECT SYS_XMLGEN(

dept_t(deptno,dname, CAST(MULTISET(

select empno, ename from emp e

where e.deptno = d.deptno) AS

emplist_t))).getClobVal() AS deptxml

FROM dept d;

<?xml version="1.0"?>

<ROW DEPTNO="100">

<DNAME>Sports</DNAME>

<EMPLIST>

<EMP_T EMPNO="200">

<ENAME>John</ENAME>

</EMP_T>

<EMP_T>

<ENAME>Jack</ENAME>

</EMP_T>

</EMPLIST>

</ROW>

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 95

(96)

XML can be generated per partes using specialized functions.

XMLElement(

name[,attrs][,child

1

, …]]

)

XMLAttributes(name

1

AS alias

1

[, …])

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 96

(97)

SELECT XMLElement(

emp”,

XMLAttributes(empno, ename AS name), XMLElement(”salary”, sal),

XMLElement(”comm”, comm)

) FROM Emp WHERE empno=1000;

<emp empno=”1000” name=”King”>

<salary>578</salary>

<comm>50</comm>

</emp>

Each row is presented as one instance of XMLType value

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 97

(98)

XMLForest(value1 AS alias1, …)

Generates a value that can have more than one root node, so it is not an XML document

SELECT XMLForest(

empno, deptno as name ) FROM emp;

<empno>1000</empno><deptno>10</deptno>

<empno>2300</empno><deptno>20</deptno>

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 98

(99)

XMLConcat(XML

1

, …)

Concatenates all XML documents, the result

contains more root nodes and so it is not an XML document

XMLSequence(fragment)

Splits fragment to a collection of XML documents, each item corresponds to one root node in the

fragment

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 99

(100)

MS SQL allows create three types of XML index on given columns

First the column has to have primary XML index CREATE PRIMARY XML INDEX XmlIdx

ON XmlTab(Doc)

Then following indexes can be created

Value index, Path index, Property index

CREATE XML INDEX InxName ON XmlTab(Doc)

USING XML INDEX XmlIdx

FOR {VALUE|PATH|PROPERTY}

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 100

(101)

Oracle allows to create following indexes on XMLType column:

Full-text index (CTXSYS.Context) CREATE INDEX XmlFtIdx

ON XmlTab(Doc) INDEXTYPE IS CTXSYS.Context;

XMLIndex (CTXSYS.CTXXPath Oracle 10g)

CREATE INDEX XmlPathIdx

ON XmlTab(Dok) INDEXTYPE IS CTXSYS.CTXXPath

XMLIndex (XDB.XmlIndexOracle 11g)

CREATE INDEX XmlIdx

ON XmlTab(Dok) INDEXTYPE IS XDB.XMLType Indexes paths, element positions, values

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 101

(102)

If the partner send an XML document, it is often necessary to extract needed

information and store it in the database

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 102

(103)

Uses either XPath for element ant their sets

addressing, respectively XQuery for construction of resulting XML document from the source one

(.query, .exist, and .value are case sensitive!)

xmlcolumn.query(XPath)

Returns fragment, containing elements that satisfy the query

select Doc.query(’/book/author’) from T;

xmlcolumn.exist(XPath)

Returns one (1), if exists at least one node that corresponds to the query. Else it returns zero (0).

xml.value(XPath,datatype)

Returns value of (the exactly one) addressed node

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 103

(104)

Uses XPath for element ant their sets addressing

Extract(xml,xpath)

Returns fragment, containing elements that satisfy the query

select Extract(Doc,’/book/author’) from T;

ExistsNode(xml,xpath)

Returns one (1), if exists at least one node that corresponds to the query. Else it returns zero (0).

ExtractValue(xml,xpath)

Returns value of (the exactly one) addressed node

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 104

(105)

SELECT Doc.query(’/book/author’) FROM Books;

<author>Pokorný J.</author>

<author>Král J.</author>

SELECT

Dok.value(’/book[1]/author[1]’,’varchar(30)’) FROM Books;

Pokorný J.

Král J.

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 105

(106)

SELECT Extract(Doc,’/book/author’) FROM Book;

<author>Pokorný J.</author>

<author>Král J.</author>

SELECT ExtractValue(Doc,’/book/author’) FROM Books;

Pokorný J.

Král J.

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 106

(107)

Method xmlcolumn.modify(’XQuery’)

UPDATE Books SET Doc=Doc.modify(

’delete /book[autor=”Kopecký M.”]’

) WHERE …

UPDATE Books SET Doc=Doc.modify(

’insert <author>…</author>

into /book’) WHERE ...

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 107

(108)

Function

UpdateXML(xml,path

1

,val

1

,path

2

,val

2

,…)

UPDATE Books SET

Doc=UpdateXML(Doc,

’/book[title=”DIS ”]/author’,’Pokorný J.’

) WHERE …

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 108

(109)

Using tool „IIS Virtual Directory Management for SQL Server“ it is possible to define a

directory in IIS 5 and higher, and publish xml data on the web

//http://localhost/xml?SELECT * FROM BOOKS FOR XML AUTO&root=book

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 109

(110)

Using XML DB Repository it is possible to

publish data stored in the database in virtual directories

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 110

(111)

HttpUriType

Constructor

HttpUriType(url)

Methods GetCLOB() GetBLOB() GetXML()

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 111

Odkazy

Související dokumenty

A different type of classification of XML benchmarks distinguishes approaches which involve a fixed set of testing data sets (e.g. XML documents, XML queries, XSL transformation

Techniques for data-centric XML documents have one common idea: XML data is stored and processed in a relational or object-relational database system and using a certain method

 The order of execution for triggers of the same type can be defined using sp_settriggerorder procedure..  More types of actualization can be defined for

▪ Column value references to primary key, or candidate key (UNIQUE column) of given table. ▪ Using ON DELETE clause the deletion of master row is

 express your identified types of entities, relationships and their characteristics with constructs offered by the selected conceptual modeling language.  UML:

 After end of data files backup it is necessary to backup redo-log file, before copy of archived. logs it is necessary to archive active

MIE-PDB.16: Advanced Database Systems | Lecture 5: XML Databases: XPath, XQuery | 20...

MIE‐PDB.16: Advanced Database Systems | Lecture 5: XML Databases: XPath, XQuery | 26...