Database Applications (NDBI026) SS2021/22

(1)

course:

Database Applications (NDBI026)

SS2021/22

RNDr. Michal Kopecký, Ph.D.

Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague

(2)



Full-text Search

▪ Filtering

▪ Disambiguation and Lemmatization

▪ Indexes

▪ Thesauri

▪ Querying



XML Extension

▪ Privileges – Overview

▪ Granting Privileges

▪ Revoking Privileges

M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 2

(3)

(4)



Different principles than in case of standard data searches

▪ Not strictly structured data, where each column has its exact in advance known semantics, are searched

▪ The task is to search in texts written in natural language. The same thing can be described differently by different authors.

▪ Different words with the same meaning (Synonyms)

▪ Different collocations

▪ …

(5)



Database engines uses their means of

extensibility to allow full-text searches out of the box

(6)



Standard: SQL/MM 2

SQL Multimedia and Application Packages Part 2: Full-Text



Real implementation done by proprietary extensions

▪ There is nothing like universal best possible solution

▪ Results are evaluated subjectively by users

(7)

Full-text Search, XML Extension (NDBI026, Lect. 6)

 Writer:

▪ Text = sequence of words in natural language.

▪ Each word stands for some idea/imagination of writer.

▪ Ideas represent real subject, activity, etc.

 Reader: folows (not necessary exactly the same) mappings from left to right

...

M. Kopecký 7

(8)

 Synonymity of words

▪ More words can have the same meaning for the writer

▪ car = automobile

▪ sick = ill

...

M. Kopecký 8

(9)

 Homonymy of words

▪ One word can have more different meanings, not one exact meaning

▪ fluke: fish, anchor, …

▪ crown: currency, treetop, jewel, …

▪ class: year of studies, category in set theory, …

...

M. Kopecký 9

(10)

 Word meanings need not be exactly the same.

▪ Hierarchical overlapping

▪ animal > horse > Stallion

▪ Associativity among meanings

▪ calculator ~ computer ~ processor

...

M. Kopecký 10

(11)

 Mapping between subjects, ideas and words can depend on individual persons – readers and writers.

▪ Two people can assign partly or completely different meaning to given term.

▪ Two people can imagine different thing for the same word.

▪ mother, room, ...

 In result, by reading the same text two different readers can obtain different information

▪ Each from other

▪ In comparison with author’s intention

M. Kopecký 11

(12)

 Homonymy and ambiguities grow with transition form words/terms to sentences and bigger parts of the text.

 Example of English sentence with more grammatically correct meanings (in this case a human reader probably eliminates the nonsense meaning)

▪ In the sentence „Time flies like an arrow“ either flies (fly) or like can be chosen for the predicate, what produces two significantly different meanings.

M. Kopecký 12

(13)

 Example of Czech sentence with more

grammatically correct meanings (again, a human reader probably eliminates the nonsense meanings)

▪ see for example Podivné fungování gramatiky, http://www.scienceworld.cz/sw.nsf/lingvistika

▪ Sentence „Ženu holí stroj“ can have – depending on the predicate selection – meaning:

▪ „Poháním stroj pomocí hole“ (hnát)

▪ „Žena používá depilační přístroj“ (holit)

▪ „Návod k nekonvenčnímu způsobu oblékání“ (strojit)

▪ … and further

▪ „Oblékej ženu, patřící holím“,

▪ „Poháním stroj, patřící holím“

(14)



The query, that would return all documents, that are interesting for the user if hard

(impossible) to formulate

▪ Together with correct – relevant – documents are usually returned also incorrect – irrelevant –

documents

▪ Not all relevant documents are returned

(15)



Problems

▪ Homonyms

▪ Wants the user asking “square” to see documents concerning the geometry (special case of rectangle) or would like to find some plaza or even something other?

▪ Synonyms

▪ Is document about “bins” relevant to query asking “baskets”?

▪ Is document about “trees” relevant to query asking

“connected acyclic graphs”?

▪ Hierarchies

▪ Animal - Mammal - Predator - Bear

▪ Printed media – Magazine

(16)



Strict Boolean logic is not very suitable for full-text search query formulations

▪ Document either complies the query or not

▪ Full-text searches requires sorting answers according to presumed relevancy to the user

▪ It is useful to be able to define the measure of similarity between the query and the document

(17)

 Two TRS’s can (and do) return two different

result for the same query, that can be partly or completely unique.

How to compare quality of those systems?

Documents in the database

Relevant documents Returned

by TRS₁

Returned by TRS₂

M. Kopecký 17

(18)

 Two questioners can suppose another

documents to be relevant for their equally formulated query

How to meet both subjective

expectations of questioners?

Documents in the database

Relevant Returned

docs.

Relevant



M. Kopecký 18

(19)

 Quality of result set of documents is usually evaluated according to numbers Nv, Nr, Nrv

▪ Precision

▪ P = Nvr / Nv

▪ Probability of returned document to be relevant to the user

▪ Recall

▪ R = Nvr / Nr

▪ Probability of relevant document to be returned to the user

M. Kopecký 19

(20)



Both coefficients depend on the feeling of the questioner



The same document can fulfill information needs of first questioner while at the same time fail to meet them for the second one.

▪ Each user determines different values of Nr and Nrv coefficients

▪ Both measures P and R depend on them

M. Kopecký 20

(21)

 In optimal case

▪ P=R=1

▪ There are all and only relevant documents in the response of the system

 Usually

▪ The answer in the first iteration is neither

precise nor complete

0 0 1

1

Optimal answer

Typical initial answer

M. Kopecký 21

(22)

 Query tuning

▪ Iterative modification of the query targeted to increase the quality of the response

 Theoretically it is

possible to reach the optimum sooner or later …

0 0 1

1

Optimum

R

P

M. Kopecký 22

(23)

 … due to (not only) ambiguities both measures depend

indirectly each on the other,

ie. P*R  const. < 1

▪ In order to increase P the absolute number of

relevant documents in the response is decreased.

▪ In order to increase R the number of irrelevant

documents rapidly grows.

 The probability to reach quality above the limit is

low. ⁰ ₀

1

Optimum

R

P

M. Kopecký 23

(24)



In time of query formulation the questioner has to guess correct term (words) the author used for expression of given idea

▪ Problems are caused e.g. by

▪ Synonyms (author could use different synonym not remembered by the user)

▪ Overlapping meanings of terms

▪ Colorful poetical hyperboles

▪ …

M. Kopecký 24

(25)



The problem can be partly suppressed by inclusion of thesaurus, containing

▪ Hierarchies of terms and their meanings

▪ Sets of synonyms

▪ Definitions of associations between terms



Questioner can use it during query formulation



System can use it during query evaluation

M. Kopecký 25

(26)

 The user often tends to tune its own query in conservative way

▪ He/she tends to fix terms used in the first iteration (they must be the best because I remembered them immediately) and vary only additional terms at the end of the query

 It is useful to support the user to

(semi)automatically eliminate wrong terms and replace them with useful ones, that describe really relevant documents

M. Kopecký 26

(27)

 The questioner is usually not able or willing to go through exhaustive number of hits in the response to find out the relevant one

 Usually max. 20-50 documents according to their length

Need to not only sort out documents not matching the query but order the answer list according to

supposed relevancy in descendant order – the supposedly best documents at the begining

M. Kopecký 27

(28)

 Due to maximal criterion, the user usually tries to increase the Precision of the answer

▪ Small amount of resulting documents in the answer, containing as high ratio of relevant documents as possible

 Some problematic domains requires both high precision and recall

▪ Lawyers, especially in territories having case law based on precedents (need to find out as much similar cases as possible)

„better“

Vr.

Rel.

Vr.

Rel.

„worse“

M. Kopecký 28

(29)



SQL databases usually use modifications of Boolean models for document representation

▪ Corresponds well to standard data representation

▪ Relatively easy implementation

▪ Queries are written in form of Boolean formulae, where operands are words or collocations

▪ Many modifications

(30)

Filtering

Lemmatization Disambiguation Stoplist

Lemm. text

Indexing List

of Lemmas

Inverted file

List

of Documents Thesaurus

(31)

 Filtering

▪ Removes formatting and produces plain ASCII text

 Disambiguation

▪ Determines the meaning of the word according to context

▪ „lie on the bed“ … werb “to lie”

▪ „it’s a lie“ … noun “lie”

 Lemmatization

▪ Determines the basic form of the word

▪ Often replaces by Stemmer that eliminates prefixes/suffixes and produces the word stem

(32)



Indexing

▪ Creates auxiliary data that help find documents efficiently – typically lists of lemmas and

documents together with Inverted file

▪ Couples [doc_id, lemma_id]

sorted by lemma_id, doc_id and stored without duplicities

▪ Currently usually more information, for example quintuples

[doc_id, para_nr, sent_nr, word_nr, lemma_id]

sorted by lemma_id, doc_id, para_nr, sent_nr, word_nr

▪ Allows evaluation of proximity constraints for distances of word occurrences in the document

(33)



Common character columns are too short for text data

▪ Usually are used subtypes of LOB (Large OBject) type

 LOBs

▪ Standard types for huge data storage in the database defined in ANS SQL-92 Full

▪ Up to 4GB of data

▪ BLOB … standard binary type

▪ CLOB … character type in standard server character set

▪ NCLOB … character type in national server character set

▪ In Oracle: additional external type BFILE

▪ Read only

▪ Self-standing independent file stored externally in the OS

(34)

 LOBs

▪ In Oracle

▪ Up to (2³²-1) * STANDARD_BLOCK_SIZE bytes

▪ BLOB … standard binary type

▪ CLOB … character type in standard server character set

▪ NCLOB … character type in national server character set

▪ V MS SQL

▪ Image … binary data up to 2 GB

▪ Text … textual data up to 2 GB in standard server character set

▪ NText … textual data up to 1 GB in national server character set

(35)



In table columns are stored only descriptors (LOB locators)



The LOB locator points to data stored in self- standing data segment



Table full-scan need not read data of LOBs

(36)

(37)

 Index creation on textual column

CREATE FULLTEXT INDEX ON table_name [(column_name [TYPE COLUMN

type_column_name]

[LANGUAGE language_term] [,...n])]

KEY INDEX index_name

[ON fulltext_catalog_name]

[WITH

{CHANGE_TRACKING {MANUAL | AUTO | OFF [, NO POPULATION]}}

]

(38)

 Mandatory information

▪ Table name, name of column of following type:

char, nchar, varchar, nvarchar, text, ntext

▪ Name of (as small as possible) unique index over given table in KEY INDEX clause

 Options:

▪ Language definition (LANGUAGE)

▪ Way the index will be updated WITH CHANGE_TRACKING

▪ MANUAL … only log of changes is created. Its processing and index synchronization can be done either manually or by scheduled job

▪ AUTO … index will be synchronized immediately

▪ OFF … index will not be synchronized

(39)

 Example:

CREATE UNIQUE INDEX ui_ukJobCand ON JobCandidate(JobCandidateID);

CREATE FULLTEXT CATALOG ft AS DEFAULT;

CREATE FULLTEXT INDEX ON JobCandidate(Resume) KEY INDEX ui_ukJobCand;

(40)

(41)



Document filtering

▪ NULL_FILTER for textual documents

▪ TXT, HTML, XML

▪ INSO_FILTER for binary documents

▪ CHARSET_FILTER for conversion of documents to the database character set

(42)



Types of full-text indexes

▪ CONTEXT

▪ Basic type of index for searching in text documents

▪ Suitable (designed) for larger documents

▪ The index synchronization is by default done explicitly by invocation CTX_DDL.SYNC_INDEX procedure

(similar to WITH CHANGE_TRACKING MANUAL option in MS SQL)

(43)



Types of full-text indexes

▪ CTXCAT

▪ Suitable for smaller documents and excerpts

▪ Can be combined with other non-textual columns for combined queries (partially on standard columns and partially full-text search)

▪ The index synchronization is by default done

automatically together with changes in the table

(similar to WITH CHANGE_TRACKING AUTO option in MS SQL)

(44)



Types of full-text indexes

▪ CTXRULE

▪ Build upon set of predefined queries

▪ Is used for document classification according to queries it correspond to

(45)



Document source origin

▪ NORMAL_DATASTORE

▪ One document is in each row of given column of the table

▪ MULTI_COLUMN_DATASTORE

▪ Document is formed by concatenation of more columns in each row of the table

▪ URL_DATASTORE

▪ Document is stored on the Internet, the column contains only its URL

▪ …

(46)



Index creation on textual column CREATE INDEX index_name

ON table_name(column_name) INDEXTYPE IS

ctxsys.{context|ctxcat|ctxrule}

[PARAMETERS

(’param_name param_value …’)];

(47)



Mandatory information

▪ Name of the table, name of textual column of one of following types:

char, nchar, varchar2, nvarchar2, clob, nclob

▪ Type (implementation) of the domain index



Optional information:

▪ Filter, Datastore

▪ Lexer … splitting text to words with respect to the language, diacritics and its transcription (ö, oe, o), …

▪ Stoplist,

▪ …

(48)



Example:

CREATE INDEX myindex ON doc(htmlfile)

INDEXTYPE IS ctxsys.context PARAMETERS(

’datastore ctxsys.default_datastore filter ctxsys.null_filter

section group ctxsys.html_section_group’);

(49)



Together with new index types the database servers implement new operators for

comparison of documents with queries



Operators return number, an estimate

measure of similarity of document content with given query

(50)



MS SQL uses

a Boolean operator – a predicate CONTAINS(

column,

’query’[,

LANGUAGE lang ])

(51)



Oracle uses numerical function – a predicate CONTAINS(

column,

’query’[,

comparison_nr ])



Using auxiliary function

SCORE(comparison_nr) is possible to find out the returned CONTAINS value

(52)



Query operators

▪ AND, & … ‘mice & cats’

▪ OR, | … ‘mice | cats’

▪ AND NOT, & ! … ‘mice AND NOT cats’

▪ NEAR, ~ … proximity operator

‘mice NEAR cats’

(53)

DBI026 -DB Aplikace - MFF UK



Query operators

▪ AND, & … ‘mice & cats’

▪ OR, | … ‘mice | cats’

▪ NOT, ~ … ‘mice ~ cats’

▪ NEAR, ; … proximity operator

‘mice ; cats’

NEAR((mice,cats),5)

(54)



Query operators

▪ Phrases in double quotas

▪ FORMSOF ( INFLECTIONAL , word) the word is searched in all forms

hide, hidden, …

▪ FORMSOF ( THESAURUS , word)

given word is expanded to the list of words with the similar meaning using thesaurus

(55)



Query operators

▪ Word prefix followed by star wildcard for

searching for all words that begin with given prefix

▪ word WEIGHT (n)

▪ Weighted term, the importance of word occurrence by is multiplied by given constant, used together with a CONTAINSTABLE operator (see later)

▪ ISABOUT (theme [,…])

▪ Documents discussing given theme

(56)



Query operators

▪ The SQL wildcards _ and % can be used at the beginning and at the end of words

▪ word WEIGHT n, word * n

▪ Weighted term, the importance of word occurrence by is multiplied by given constant

▪ ABOUT ( theme )

▪ Documents discussing given theme … ‘about(politics)’

▪ STEM, $

▪ Searches for occurrences of words with given stem

(57)



Query operators

▪ query WITHIN PARAGRAPH query WITHIN SENTENCE

▪ Searching in scope of one paragraph or sentence if the document structure can be recognized and processed

(58)

 CONTAINSTABLE(

tab_name,col_name,query[,language][,top_n]

)

▪ Returns table with the schema (KEY,RANK) that – in pseudo column RANK – contains measures of similarity between document in row with given KEY and the given query

 SELECT select_list

FROM table AS FT_TBL INNER JOIN

CONTAINSTABLE(table, column, query) AS KEY_TBL ON (FT_TBL.unique_key_column = KEY_TBL.[KEY] )

(59)



SELECT

C.CategoryName, C.Description,

R.RANK

FROM Categories AS C INNER JOIN

CONTAINSTABLE(

Categories, Description,

'ISABOUT (fishes weight (.8), beers weight (.3) )'

) AS R

ON C.CategoryID = KEY_TBL.KEY ORDER BY R.RANK DESC;

(60)



SELECT

C.CategoryName , C.Description,

R.RANK

FROM Categories AS C INNER JOIN

CONTAINSTABLE (

Categories, Description,

'("sweet and savory" NEAR sauces) OR

("sweet and savory" NEAR candies)', 10) AS R ON C.CategoryID = R.KEY;

(61)



SELECT Description, SCORE(1)+SCORE(2) FROM Categories

WHERE

CONTAINS(

^Abstract,

’database & search’,1

) > 0 AND

CONTAINS(Description

,

’(object|relational) NEAR database’,2

) > 0 ORDER BY SCORE(1)+SCORE(2) DESC;

(62)



Thesauri operators

▪ BT(word[,n]) … Include also Broather Term for word (n levels higher). If the word is a homonym include all meanings

▪ NT(word[,n]) … Include also Narrower Term for word (n levels lower).

▪ PT(word) … Replace word by Preferred Term that represent a whole set of word equivalents.

▪ TT(word) … Replace word by Top Term from the meanings hierarchy thesaurus

▪ SYN(word) … Include all SYNonyms of word.

(63)

(64)

 Creation

CREATE [PUBLIC] SYNONYM new_name

FOR

[schema.]old_name[@dbname];

 Dropping

DROP [PUBLIC] SYNONYM new_name;

 Everywhere, where the you can use the old_name, you can use new_name

as well

 Can be created for almost everything including synonyms

 Public synonyms are visible for

everyone without owner prefix

(65)



Creation

CREATE SYNONYM new_name

FOR [[dbname.]schema.]old_name;



Dropping

DROP SYNONYM new_name;

 Everywhere, where the you can use the old_name, you can use new_name

as well

 Can be created for almost everything including synonyms

(66)

(67)

 The database application need communicate with other systems and/or partners

▪ Export of materials from IS to an external accounting

▪ B2G (business-to-government)

▪ Reports for government

▪ B2B (business-to-business)

▪ Export/import of data to/from other IS or company

▪ Electronic orders, invoices, etc.

▪ SOA service-oriented architectures

▪ Communication between system components

(68)



The XML format is heavily used

+ Readable for both humans and computers

+ Structured

+ Easily analyzed by parsers

▪ Huge amount of data

▪ Direct querying needed different approaches and data structures than in case of relational data

(69)



To communicate with neighborhood it is often needed

▪ To export stored relational data in XML format

▪ To extract data from an XML tree and store data in relational tables

▪ To store and maintain original XML documents obtained from business partners

▪ To search for XML documents that contain needed information

(70)

 Xpath is one of used query languages for XML documents

 The query result is a set of nodes (sub-trees) or content of a node or attribute

 Document itself can be understood as a tree containing

▪ A root node (not supposed to be a part of document)

▪ Elements

▪ Attributes

▪ Textual nodes (leaves, corresponding to free text between opening and ending tags of nodes)

▪ … (comments, name spaces, instructions for processing)

(71)



Path is an basic mean for addressing of node(s) in the XML tree

▪ Absolute beginning with the sign ’/’

▪ Relative

▪ Is written as a sequence of steps, delimited by slash signs.

(72)



XPath query examples:

▪ /books/book/title

▪ List of all title elements that are direct descendants of a book element that is a direct descendant of a books

element.

▪ //title

▪ List of all title elements in the XML tree, regardless on the path they are located

(73)



XPath query examples:

▪ /books/*/title

▪ There can be any node (element) on the second step of the path

▪ /books/book[1]/author[2]

▪ Second author of the first book in the list of books

▪ /books/book[author=’Pokorný J.’]/title

▪ Titles of all books written by J. Pokorný

(74)

 Xpath axis

▪ self:: - current node

▪ parent::, child:: - direct parent/children of current node

▪ ancestor::, descendant:: - all ancestors/descendants, including indirect ones

▪ descendant-or-self::, ancestor-or-self::

▪ following::, preceding:: - nodes before/after current node and all its predecessors

▪ following-sibling::, preceding-sibling:: - siblings

▪ attribute:: - attributes of current node

 child:: is the default axis, it need not to be explicitly stated

(75)

 preceding:: ^ following::

(76)



Shortened notation

▪ node_name instead of child::node_name

▪ @attr_name instead of attribute::attr_name

▪ ’.’ for current node

▪ ’..’ for parent node



Query examples:

▪ Titles of all books in textual form /books/book/title/text()

▪ First book in the list /books/book[1]

(77)

 The problem is the effectiveness of XPath queries, respectively of all query languages on XML

documents

▪ Data can be stored per partes in the relational database, and then standard means as B-tree indexes, joins, etc. can be used

+ Usable in any relational database

− Necessary to know the structure of documents (DTD, XMLSchema)

▪ Data can be natively stored/processed in form of text

− New type of indexes are necessary

(78)

 If it is known the document schema, each document can be split to parts and stored in tables using more variants/algorithms

 Idea of one of possible approaches:

▪ It is supposed a simple DTD, respectively existence of less strict simplified DTD’ created for original DTD, where

following implication holds:

validity w.r.t. DTD  validity w.r.t. DTD’

▪ Elements declare other elements in their content at most once

▪ Sub-element can be optional (?) or repeatable (*)

▪ Sub-elements can be delimited by commas, not pipes ’|’

▪ The order of sub-elements has no matter

(79)



For such simple DTD can be created

▪ List of all possible paths in the XML tree

▪ A table

E(DocID,ElemID,ParID,PathID,attributes) for each element E

▪ Columns contain

▪ Document ID

▪ Element ID (primary key)

▪ Parent element ID

▪ Path ID

▪ Columns for all possible attributes of given element

(80)

 <!ELEMENT book (title, author)>

<!ATTLIST book isbn CDATA)>

<!ELEMENT title (#PCDATA)>

<!ELEMENT author (#PCDATA)>

 Book(DocID,ElemID,ParID,PathID,ISBN)

Title(DocID,ElemID,ParID,PathID,PCDataID) Author(DocID,ElemID,ParID,PathID,PCDataID) PCData(DocID,ElemID,Data)

(81)

 XPath queries are transformed to SQL.

 It is often necessary to join large number of tables.

Its number can be decreased by better splitting data to tables.

For example one table can contain not only one element, but also all sub- elements that cannot repeat.

• Transformation is implemented:

– On the client side

– In some relational DBMS’s optionally on the server side

XML Schema

ER/RMD

XML Dok.XML Dok. XPath query

SQL query Data in RDB

XML Docs

(82)

 <!ELEMENT book (title, author)>

 Book(

DocID,ElemID,

ISBN, TitlePCDataID, AuthorPCDataID, ParID,PathID)

PCData(DocID,ElemID,Data)

(83)

 <!ELEMENT book (title, autor*)>

 If the author can be repeated, it is necessary to store them in its own table. The title can be merged/embedded into book table.

 Book(

DocID,ElemID,ISBN,TitlePCDataID,ParID,PathID) Author(DocID,ElemID,ParID,PathID,PCDataID)

PCData(DocID,ElemID,Data)

(84)



DataGuide index

▪ Contains the tree of all paths existing in the XML document tree

▪ Each path is stored in the index only once

▪ Index nodes contain meta-information about all nodes on given path

▪ Index can be stored in form of XML document as well

(85)



XML path index

▪ Similar to DataGuide index

▪ Nodes contain sets of pointers to corresponding nodes in an XML document

▪ Substantially speeds up answers to queries in form of paths without // operator

▪ Helps with evaluation of queries that don’t begin with // but contain it somewhere in the expession.

(86)

 Lore index

▪ VIndex (Value index) – indexes textual values of nodes and attributes

▪ TIndex (Term index) – indexes words in nodes and

attributes (similar to inverted file from full-text search

▪ LIndex – holds data for determining parents of nodes

▪ PIndex (Path index) – holds paths and keeps lists of pointers to all nodes on them

 Example

▪ /books//author[@country=”CZ”]

▪ VIndex is used to find attributes with country=”CZ”, then parents are visited in bottom-up order uing LIndex and the path is verified

(87)



Native data type XML

▪ CREATE TABLE XmlTab(

ID NUMERIC

CONSTRAINT XmlTabPk PRIMARY KEY, Doc XML

);

▪ INSERT INTO XmlTab VALUES(

1, ’<a><b>b1</b></a>’

);

(88)



Native type XMLType, that allows two different ways of storage:

▪ Using CLOB

▪ Document is stored and returned exactly in its original form including whitespaces (formatting)

+ The XMLSchema need not to be known

− More complicated partial actualizations

(89)



Native type XMLType, that allows two different ways of storage:

▪ Using splitting data to relational tables

▪ Returned document equivalent w.r.t. DOM model, i.e.

without original whitespaces (formatting)

− The XMLSchema has to ne known + Effective partial actualization

+ Possible indexing using B-trees

(90)



Native data type XMLType

▪ CREATE TABLE XmlTab(

ID NUMERIC

CONSTRAINT XmlTabPk PRIMARY KEY, Doc XMLType

);

▪ INSERT INTO XmlTab VALUES(

1, ’<a><b>b1</b></a>’

);

(91)

 MS SQL uses for exporting data in XML format syntax SELECT … FOR XML {RAW|AUTO|EXPLICIT|PATH}

[,XMLDATA][,ELEMENTS][BINARY BASE64]

[ROOT(’root_element’)]

 RAW generates one element row for each row. Values are stored in attributes. Their names correspond to column names / alias names

 <row empno=”1230” ename=”King” deptno=”10” />

(92)

 AUTO generates one element for each table and each row.

The structure corresponds to used joins. Element names correspond to table /alias names

 <dept deptno=”10”>

</dept>

</dept>

(93)

 The ELEMENTS modifier ensures that column values are stored as sub-elements

 <dept>

<emp>

<empno>4560</empno><ename>Scott</ename>

</emp>

</dept>

...

</dept>

(94)

 To export data to XML the function SYS_XMLGEN can be used. It can get either atomic value or

structured data type

 SELECT SYS_XMLGEN(ename) FROM …

<ENAME>SMITH</ENAME>

<ENAME>ALLEN</ENAME>

...

(95)

 SYS_XMLGEN(instance)

▪ SELECT SYS_XMLGEN(

dept_t(deptno,dname, CAST(MULTISET(

select empno, ename from emp e

where e.deptno = d.deptno) AS

emplist_t))).getClobVal() AS deptxml

FROM dept d;

 <?xml version="1.0"?>

<DNAME>Sports</DNAME>

<EMP_T EMPNO="200">

</EMP_T>

<EMP_T>

</EMP_T>

</EMPLIST>

</ROW>

(96)



XML can be generated per partes using specialized functions.



XMLElement(

name[,attrs][,child

₁

, …]]

)



XMLAttributes(name

₁

AS alias

₁

[, …])

(97)

 SELECT XMLElement(

”emp”,

XMLAttributes(empno, ename AS name), XMLElement(”salary”, sal),

XMLElement(”comm”, comm)

) FROM Emp WHERE empno=1000;

 <emp empno=”1000” name=”King”>

</emp>

 Each row is presented as one instance of XMLType value

(98)

 XMLForest(value1 AS alias1, …)

 Generates a value that can have more than one root node, so it is not an XML document

 SELECT XMLForest(

empno, deptno as name ) FROM emp;

 <empno>1000</empno><deptno>10</deptno>

 <empno>2300</empno><deptno>20</deptno>

 …

(99)



XMLConcat(XML

₁

, …)

▪ Concatenates all XML documents, the result

contains more root nodes and so it is not an XML document



XMLSequence(fragment)

▪ Splits fragment to a collection of XML documents, each item corresponds to one root node in the

fragment

(100)



MS SQL allows create three types of XML index on given columns

▪ First the column has to have primary XML index CREATE PRIMARY XML INDEX XmlIdx

ON XmlTab(Doc)

▪ Then following indexes can be created

▪ Value index, Path index, Property index

CREATE XML INDEX InxName ON XmlTab(Doc)

USING XML INDEX XmlIdx

FOR {VALUE|PATH|PROPERTY}

(101)

 Oracle allows to create following indexes on XMLType column:

▪ Full-text index (CTXSYS.Context) CREATE INDEX XmlFtIdx

ON XmlTab(Doc) INDEXTYPE IS CTXSYS.Context;

▪ XMLIndex (CTXSYS.CTXXPath –Oracle 10g)

CREATE INDEX XmlPathIdx

ON XmlTab(Dok) INDEXTYPE IS CTXSYS.CTXXPath

▪ XMLIndex (XDB.XmlIndex–Oracle 11g)

CREATE INDEX XmlIdx

ON XmlTab(Dok) INDEXTYPE IS ^XDB.XMLType Indexes paths, element positions, values

(102)



If the partner send an XML document, it is often necessary to extract needed

information and store it in the database

(103)

 Uses either XPath for element ant their sets

addressing, respectively XQuery for construction of resulting XML document from the source one

(.query, .exist, and .value are case sensitive!)

 xmlcolumn.query(XPath)

▪ Returns fragment, containing elements that satisfy the query

▪ select Doc.query(’/book/author’) from T;

 xmlcolumn.exist(XPath)

▪ Returns one (1), if exists at least one node that corresponds to the query. Else it returns zero (0).

 xml.value(XPath,datatype)

▪ Returns value of (the exactly one) addressed node

(104)

 Uses XPath for element ant their sets addressing

 Extract(xml,xpath)

▪ Returns fragment, containing elements that satisfy the query

▪ select Extract(Doc,’/book/author’) from T;

 ExistsNode(xml,xpath)

▪ Returns one (1), if exists at least one node that corresponds to the query. Else it returns zero (0).

 ExtractValue(xml,xpath)

▪ Returns value of (the exactly one) addressed node

(105)



SELECT Doc.query(’/book/author’) FROM Books;

▪ <author>Pokorný J.</author>

▪ <author>Král J.</author>



SELECT

Dok.value(’/book[1]/author[1]’,’varchar(30)’) FROM Books;

▪ Pokorný J.

▪ Král J.

(106)



SELECT Extract(Doc,’/book/author’) FROM Book;

▪ <author>Pokorný J.</author>

▪ <author>Král J.</author>



SELECT ExtractValue(Doc,’/book/author’) FROM Books;

▪ Pokorný J.

▪ Král J.

(107)



Method xmlcolumn.modify(’XQuery’)

UPDATE Books SET Doc=Doc.modify(

’delete /book[autor=”Kopecký M.”]’

) WHERE …

UPDATE Books SET Doc=Doc.modify(

’insert <author>…</author>

into /book’) WHERE ...

(108)



Function

UpdateXML(xml,path

₁

,val

₁

,path

₂

,val

₂

,…)



UPDATE Books SET

Doc=UpdateXML(Doc,

’/book[title=”DIS ”]/author’,’Pokorný J.’

) WHERE …

(109)



Using tool „IIS Virtual Directory Management for SQL Server“ it is possible to define a

directory in IIS 5 and higher, and publish xml data on the web



//http://localhost/xml?SELECT * FROM BOOKS FOR XML AUTO&root=book

(110)



Using XML DB Repository it is possible to

publish data stored in the database in virtual directories

(111)



HttpUriType

▪ Constructor

HttpUriType(url)

▪ Methods GetCLOB() GetBLOB() GetXML()