course:
Database Applications (NDBI026)
SS2021/22
RNDr. Michal Kopecký, Ph.D.
Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague
Full-text Search
▪ Filtering
▪ Disambiguation and Lemmatization
▪ Indexes
▪ Thesauri
▪ Querying
XML Extension
▪ Privileges – Overview
▪ Granting Privileges
▪ Revoking Privileges
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 2
Different principles than in case of standard data searches
▪ Not strictly structured data, where each column has its exact in advance known semantics, are searched
▪ The task is to search in texts written in natural language. The same thing can be described differently by different authors.
▪ Different words with the same meaning (Synonyms)
▪ Different collocations
▪ …
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 4
Database engines uses their means of
extensibility to allow full-text searches out of the box
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 5
Standard: SQL/MM 2
SQL Multimedia and Application Packages Part 2: Full-Text
Real implementation done by proprietary extensions
▪ There is nothing like universal best possible solution
▪ Results are evaluated subjectively by users
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 6
Full-text Search, XML Extension (NDBI026, Lect. 6)
Writer:
▪ Text = sequence of words in natural language.
▪ Each word stands for some idea/imagination of writer.
▪ Ideas represent real subject, activity, etc.
Reader: folows (not necessary exactly the same) mappings from left to right
...
M. Kopecký 7
Full-text Search, XML Extension (NDBI026, Lect. 6)
Synonymity of words
▪ More words can have the same meaning for the writer
▪ car = automobile
▪ sick = ill
...
M. Kopecký 8
Homonymy of words
▪ One word can have more different meanings, not one exact meaning
▪ fluke: fish, anchor, …
▪ crown: currency, treetop, jewel, …
▪ class: year of studies, category in set theory, …
Full-text Search, XML Extension (NDBI026, Lect. 6)
...
M. Kopecký 9
Full-text Search, XML Extension (NDBI026, Lect. 6)
Word meanings need not be exactly the same.
▪ Hierarchical overlapping
▪ animal > horse > Stallion
▪ Associativity among meanings
▪ calculator ~ computer ~ processor
...
M. Kopecký 10
Full-text Search, XML Extension (NDBI026, Lect. 6)
Mapping between subjects, ideas and words can depend on individual persons – readers and writers.
▪ Two people can assign partly or completely different meaning to given term.
▪ Two people can imagine different thing for the same word.
▪ mother, room, ...
In result, by reading the same text two different readers can obtain different information
▪ Each from other
▪ In comparison with author’s intention
M. Kopecký 11
Full-text Search, XML Extension (NDBI026, Lect. 6)
Homonymy and ambiguities grow with transition form words/terms to sentences and bigger parts of the text.
Example of English sentence with more grammatically correct meanings (in this case a human reader probably eliminates the nonsense meaning)
▪ In the sentence „Time flies like an arrow“ either flies (fly) or like can be chosen for the predicate, what produces two significantly different meanings.
M. Kopecký 12
Example of Czech sentence with more
grammatically correct meanings (again, a human reader probably eliminates the nonsense meanings)
▪ see for example Podivné fungování gramatiky, http://www.scienceworld.cz/sw.nsf/lingvistika
▪ Sentence „Ženu holí stroj“ can have – depending on the predicate selection – meaning:
▪ „Poháním stroj pomocí hole“ (hnát)
▪ „Žena používá depilační přístroj“ (holit)
▪ „Návod k nekonvenčnímu způsobu oblékání“ (strojit)
▪ … and further
▪ „Oblékej ženu, patřící holím“,
▪ „Poháním stroj, patřící holím“
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 13
The query, that would return all documents, that are interesting for the user if hard
(impossible) to formulate
▪ Together with correct – relevant – documents are usually returned also incorrect – irrelevant –
documents
▪ Not all relevant documents are returned
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 14
Problems
▪ Homonyms
▪ Wants the user asking “square” to see documents concerning the geometry (special case of rectangle) or would like to find some plaza or even something other?
▪ Synonyms
▪ Is document about “bins” relevant to query asking “baskets”?
▪ Is document about “trees” relevant to query asking
“connected acyclic graphs”?
▪ Hierarchies
▪ Animal - Mammal - Predator - Bear
▪ Printed media – Magazine
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 15
Strict Boolean logic is not very suitable for full-text search query formulations
▪ Document either complies the query or not
▪ Full-text searches requires sorting answers according to presumed relevancy to the user
▪ It is useful to be able to define the measure of similarity between the query and the document
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 16
Full-text Search, XML Extension (NDBI026, Lect. 6)
Two TRS’s can (and do) return two different
result for the same query, that can be partly or completely unique.
How to compare quality of those systems?
Documents in the database
Relevant documents Returned
by TRS1
Returned by TRS2
M. Kopecký 17
Full-text Search, XML Extension (NDBI026, Lect. 6)
Two questioners can suppose another
documents to be relevant for their equally formulated query
How to meet both subjective
expectations of questioners?
Documents in the database
Relevant Returned
docs.
Relevant
M. Kopecký 18
Full-text Search, XML Extension (NDBI026, Lect. 6)
Quality of result set of documents is usually evaluated according to numbers Nv, Nr, Nrv
▪ Precision
▪ P = Nvr / Nv
▪ Probability of returned document to be relevant to the user
▪ Recall
▪ R = Nvr / Nr
▪ Probability of relevant document to be returned to the user
M. Kopecký 19
Full-text Search, XML Extension (NDBI026, Lect. 6)
Both coefficients depend on the feeling of the questioner
The same document can fulfill information needs of first questioner while at the same time fail to meet them for the second one.
▪ Each user determines different values of Nr and Nrv coefficients
▪ Both measures P and R depend on them
M. Kopecký 20
Full-text Search, XML Extension (NDBI026, Lect. 6)
In optimal case
▪ P=R=1
▪ There are all and only relevant documents in the response of the system
Usually
▪ The answer in the first iteration is neither
precise nor complete
0 0 1
1
Optimal answer
Typical initial answer
M. Kopecký 21
Full-text Search, XML Extension (NDBI026, Lect. 6)
Query tuning
▪ Iterative modification of the query targeted to increase the quality of the response
Theoretically it is
possible to reach the optimum sooner or later …
0 0 1
1
Optimum
R
P
M. Kopecký 22
Full-text Search, XML Extension (NDBI026, Lect. 6)
… due to (not only) ambiguities both measures depend
indirectly each on the other,
ie. P*R const. < 1
▪ In order to increase P the absolute number of
relevant documents in the response is decreased.
▪ In order to increase R the number of irrelevant
documents rapidly grows.
The probability to reach quality above the limit is
low. 0 0
1
1
Optimum
R
P
M. Kopecký 23
Full-text Search, XML Extension (NDBI026, Lect. 6)
In time of query formulation the questioner has to guess correct term (words) the author used for expression of given idea
▪ Problems are caused e.g. by
▪ Synonyms (author could use different synonym not remembered by the user)
▪ Overlapping meanings of terms
▪ Colorful poetical hyperboles
▪ …
M. Kopecký 24
Full-text Search, XML Extension (NDBI026, Lect. 6)
The problem can be partly suppressed by inclusion of thesaurus, containing
▪ Hierarchies of terms and their meanings
▪ Sets of synonyms
▪ Definitions of associations between terms
Questioner can use it during query formulation
System can use it during query evaluation
M. Kopecký 25
Full-text Search, XML Extension (NDBI026, Lect. 6)
The user often tends to tune its own query in conservative way
▪ He/she tends to fix terms used in the first iteration (they must be the best because I remembered them immediately) and vary only additional terms at the end of the query
It is useful to support the user to
(semi)automatically eliminate wrong terms and replace them with useful ones, that describe really relevant documents
M. Kopecký 26
Full-text Search, XML Extension (NDBI026, Lect. 6)
The questioner is usually not able or willing to go through exhaustive number of hits in the response to find out the relevant one
Usually max. 20-50 documents according to their length
Need to not only sort out documents not matching the query but order the answer list according to
supposed relevancy in descendant order – the supposedly best documents at the begining
M. Kopecký 27
Full-text Search, XML Extension (NDBI026, Lect. 6)
Due to maximal criterion, the user usually tries to increase the Precision of the answer
▪ Small amount of resulting documents in the answer, containing as high ratio of relevant documents as possible
Some problematic domains requires both high precision and recall
▪ Lawyers, especially in territories having case law based on precedents (need to find out as much similar cases as possible)
„better“
Vr.
Rel.
Vr.
Rel.
„worse“
M. Kopecký 28
SQL databases usually use modifications of Boolean models for document representation
▪ Corresponds well to standard data representation
▪ Relatively easy implementation
▪ Queries are written in form of Boolean formulae, where operands are words or collocations
▪ Many modifications
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 29
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 30
Filtering
Lemmatization Disambiguation Stoplist
Lemm. text
Indexing List
of Lemmas
Inverted file
List
of Documents Thesaurus
Filtering
▪ Removes formatting and produces plain ASCII text
Disambiguation
▪ Determines the meaning of the word according to context
▪ „lie on the bed“ … werb “to lie”
▪ „it’s a lie“ … noun “lie”
Lemmatization
▪ Determines the basic form of the word
▪ Often replaces by Stemmer that eliminates prefixes/suffixes and produces the word stem
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 31
Indexing
▪ Creates auxiliary data that help find documents efficiently – typically lists of lemmas and
documents together with Inverted file
▪ Couples [doc_id, lemma_id]
sorted by lemma_id, doc_id and stored without duplicities
▪ Currently usually more information, for example quintuples
[doc_id, para_nr, sent_nr, word_nr, lemma_id]
sorted by lemma_id, doc_id, para_nr, sent_nr, word_nr
▪ Allows evaluation of proximity constraints for distances of word occurrences in the document
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 32
Common character columns are too short for text data
▪ Usually are used subtypes of LOB (Large OBject) type
LOBs
▪ Standard types for huge data storage in the database defined in ANS SQL-92 Full
▪ Up to 4GB of data
▪ BLOB … standard binary type
▪ CLOB … character type in standard server character set
▪ NCLOB … character type in national server character set
▪ In Oracle: additional external type BFILE
▪ Read only
▪ Self-standing independent file stored externally in the OS
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 33
LOBs
▪ In Oracle
▪ Up to (232-1) * STANDARD_BLOCK_SIZE bytes
▪ BLOB … standard binary type
▪ CLOB … character type in standard server character set
▪ NCLOB … character type in national server character set
▪ V MS SQL
▪ Image … binary data up to 2 GB
▪ Text … textual data up to 2 GB in standard server character set
▪ NText … textual data up to 1 GB in national server character set
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 34
In table columns are stored only descriptors (LOB locators)
The LOB locator points to data stored in self- standing data segment
Table full-scan need not read data of LOBs
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 35
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 36
Index creation on textual column
CREATE FULLTEXT INDEX ON table_name [(column_name [TYPE COLUMN
type_column_name]
[LANGUAGE language_term] [,...n])]
KEY INDEX index_name
[ON fulltext_catalog_name]
[WITH
{CHANGE_TRACKING {MANUAL | AUTO | OFF [, NO POPULATION]}}
]
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 37
Mandatory information
▪ Table name, name of column of following type:
char, nchar, varchar, nvarchar, text, ntext
▪ Name of (as small as possible) unique index over given table in KEY INDEX clause
Options:
▪ Language definition (LANGUAGE)
▪ Way the index will be updated WITH CHANGE_TRACKING
▪ MANUAL … only log of changes is created. Its processing and index synchronization can be done either manually or by scheduled job
▪ AUTO … index will be synchronized immediately
▪ OFF … index will not be synchronized
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 38
Example:
CREATE UNIQUE INDEX ui_ukJobCand ON JobCandidate(JobCandidateID);
CREATE FULLTEXT CATALOG ft AS DEFAULT;
CREATE FULLTEXT INDEX ON JobCandidate(Resume) KEY INDEX ui_ukJobCand;
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 39
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 40
Document filtering
▪ NULL_FILTER for textual documents
▪ TXT, HTML, XML
▪ INSO_FILTER for binary documents
▪ CHARSET_FILTER for conversion of documents to the database character set
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 41
Types of full-text indexes
▪ CONTEXT
▪ Basic type of index for searching in text documents
▪ Suitable (designed) for larger documents
▪ The index synchronization is by default done explicitly by invocation CTX_DDL.SYNC_INDEX procedure
(similar to WITH CHANGE_TRACKING MANUAL option in MS SQL)
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 42
Types of full-text indexes
▪ CTXCAT
▪ Suitable for smaller documents and excerpts
▪ Can be combined with other non-textual columns for combined queries (partially on standard columns and partially full-text search)
▪ The index synchronization is by default done
automatically together with changes in the table
(similar to WITH CHANGE_TRACKING AUTO option in MS SQL)
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 43
Types of full-text indexes
▪ CTXRULE
▪ Build upon set of predefined queries
▪ Is used for document classification according to queries it correspond to
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 44
Document source origin
▪ NORMAL_DATASTORE
▪ One document is in each row of given column of the table
▪ MULTI_COLUMN_DATASTORE
▪ Document is formed by concatenation of more columns in each row of the table
▪ URL_DATASTORE
▪ Document is stored on the Internet, the column contains only its URL
▪ …
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 45
Index creation on textual column CREATE INDEX index_name
ON table_name(column_name) INDEXTYPE IS
ctxsys.{context|ctxcat|ctxrule}
[PARAMETERS
(’param_name param_value …’)];
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 46
Mandatory information
▪ Name of the table, name of textual column of one of following types:
char, nchar, varchar2, nvarchar2, clob, nclob
▪ Type (implementation) of the domain index
Optional information:
▪ Filter, Datastore
▪ Lexer … splitting text to words with respect to the language, diacritics and its transcription (ö, oe, o), …
▪ Stoplist,
▪ …
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 47
Example:
CREATE INDEX myindex ON doc(htmlfile)
INDEXTYPE IS ctxsys.context PARAMETERS(
’datastore ctxsys.default_datastore filter ctxsys.null_filter
section group ctxsys.html_section_group’);
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 48
Together with new index types the database servers implement new operators for
comparison of documents with queries
Operators return number, an estimate
measure of similarity of document content with given query
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 49
MS SQL uses
a Boolean operator – a predicate CONTAINS(
column,
’query’[,
LANGUAGE lang ])
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 50
Oracle uses numerical function – a predicate CONTAINS(
column,
’query’[,
comparison_nr ])
Using auxiliary function
SCORE(comparison_nr) is possible to find out the returned CONTAINS value
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 51
Query operators
▪ AND, & … ‘mice & cats’
▪ OR, | … ‘mice | cats’
▪ AND NOT, & ! … ‘mice AND NOT cats’
▪ NEAR, ~ … proximity operator
‘mice NEAR cats’
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 52
DBI026 -DB Aplikace - MFF UK
Query operators
▪ AND, & … ‘mice & cats’
▪ OR, | … ‘mice | cats’
▪ NOT, ~ … ‘mice ~ cats’
▪ NEAR, ; … proximity operator
‘mice ; cats’
NEAR((mice,cats),5)
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 53
Query operators
▪ Phrases in double quotas
▪ FORMSOF ( INFLECTIONAL , word) the word is searched in all forms
hide, hidden, …
▪ FORMSOF ( THESAURUS , word)
given word is expanded to the list of words with the similar meaning using thesaurus
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 54
Query operators
▪ Word prefix followed by star wildcard for
searching for all words that begin with given prefix
▪ word WEIGHT (n)
▪ Weighted term, the importance of word occurrence by is multiplied by given constant, used together with a CONTAINSTABLE operator (see later)
▪ ISABOUT (theme [,…])
▪ Documents discussing given theme
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 55
Query operators
▪ The SQL wildcards _ and % can be used at the beginning and at the end of words
▪ word WEIGHT n, word * n
▪ Weighted term, the importance of word occurrence by is multiplied by given constant
▪ ABOUT ( theme )
▪ Documents discussing given theme … ‘about(politics)’
▪ STEM, $
▪ Searches for occurrences of words with given stem
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 56
Query operators
▪ query WITHIN PARAGRAPH query WITHIN SENTENCE
▪ Searching in scope of one paragraph or sentence if the document structure can be recognized and processed
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 57
CONTAINSTABLE(
tab_name,col_name,query[,language][,top_n]
)
▪ Returns table with the schema (KEY,RANK) that – in pseudo column RANK – contains measures of similarity between document in row with given KEY and the given query
SELECT select_list
FROM table AS FT_TBL INNER JOIN
CONTAINSTABLE(table, column, query) AS KEY_TBL ON (FT_TBL.unique_key_column = KEY_TBL.[KEY] )
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 58
SELECT
C.CategoryName, C.Description,
R.RANK
FROM Categories AS C INNER JOIN
CONTAINSTABLE(
Categories, Description,
'ISABOUT (fishes weight (.8), beers weight (.3) )'
) AS R
ON C.CategoryID = KEY_TBL.KEY ORDER BY R.RANK DESC;
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 59
SELECT
C.CategoryName , C.Description,
R.RANK
FROM Categories AS C INNER JOIN
CONTAINSTABLE (
Categories, Description,
'("sweet and savory" NEAR sauces) OR
("sweet and savory" NEAR candies)', 10) AS R ON C.CategoryID = R.KEY;
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 60
SELECT Description, SCORE(1)+SCORE(2) FROM Categories
WHERE
CONTAINS(
Abstract,’database & search’,1
) > 0 AND
CONTAINS(Description
,’(object|relational) NEAR database’,2
) > 0 ORDER BY SCORE(1)+SCORE(2) DESC;
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 61
Thesauri operators
▪ BT(word[,n]) … Include also Broather Term for word (n levels higher). If the word is a homonym include all meanings
▪ NT(word[,n]) … Include also Narrower Term for word (n levels lower).
▪ PT(word) … Replace word by Preferred Term that represent a whole set of word equivalents.
▪ TT(word) … Replace word by Top Term from the meanings hierarchy thesaurus
▪ SYN(word) … Include all SYNonyms of word.
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 62
Creation
CREATE [PUBLIC] SYNONYM new_name
FOR
[schema.]old_name[@dbname];
Dropping
DROP [PUBLIC] SYNONYM new_name;
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 64
Everywhere, where the you can use the old_name, you can use new_name
as well
Can be created for almost everything including synonyms
Public synonyms are visible for
everyone without owner prefix
Creation
CREATE SYNONYM new_name
FOR [[dbname.]schema.]old_name;
Dropping
DROP SYNONYM new_name;
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 65
Everywhere, where the you can use the old_name, you can use new_name
as well
Can be created for almost everything including synonyms
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 66
The database application need communicate with other systems and/or partners
▪ Export of materials from IS to an external accounting
▪ B2G (business-to-government)
▪ Reports for government
▪ B2B (business-to-business)
▪ Export/import of data to/from other IS or company
▪ Electronic orders, invoices, etc.
▪ SOA service-oriented architectures
▪ Communication between system components
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 67
The XML format is heavily used
+ Readable for both humans and computers
+ Structured
+ Easily analyzed by parsers
▪ Huge amount of data
▪ Direct querying needed different approaches and data structures than in case of relational data
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 68
To communicate with neighborhood it is often needed
▪ To export stored relational data in XML format
▪ To extract data from an XML tree and store data in relational tables
▪ To store and maintain original XML documents obtained from business partners
▪ To search for XML documents that contain needed information
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 69
Xpath is one of used query languages for XML documents
The query result is a set of nodes (sub-trees) or content of a node or attribute
Document itself can be understood as a tree containing
▪ A root node (not supposed to be a part of document)
▪ Elements
▪ Attributes
▪ Textual nodes (leaves, corresponding to free text between opening and ending tags of nodes)
▪ … (comments, name spaces, instructions for processing)
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 70
Path is an basic mean for addressing of node(s) in the XML tree
▪ Absolute beginning with the sign ’/’
▪ Relative
▪ Is written as a sequence of steps, delimited by slash signs.
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 71
XPath query examples:
▪ /books/book/title
▪ List of all title elements that are direct descendants of a book element that is a direct descendant of a books
element.
▪ //title
▪ List of all title elements in the XML tree, regardless on the path they are located
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 72
XPath query examples:
▪ /books/*/title
▪ There can be any node (element) on the second step of the path
▪ /books/book[1]/author[2]
▪ Second author of the first book in the list of books
▪ /books/book[author=’Pokorný J.’]/title
▪ Titles of all books written by J. Pokorný
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 73
Xpath axis
▪ self:: - current node
▪ parent::, child:: - direct parent/children of current node
▪ ancestor::, descendant:: - all ancestors/descendants, including indirect ones
▪ descendant-or-self::, ancestor-or-self::
▪ following::, preceding:: - nodes before/after current node and all its predecessors
▪ following-sibling::, preceding-sibling:: - siblings
▪ attribute:: - attributes of current node
child:: is the default axis, it need not to be explicitly stated
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 74
preceding:: following::
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 75
Shortened notation
▪ node_name instead of child::node_name
▪ @attr_name instead of attribute::attr_name
▪ ’.’ for current node
▪ ’..’ for parent node
Query examples:
▪ Titles of all books in textual form /books/book/title/text()
▪ First book in the list /books/book[1]
<book><title>…</title><author>…</author></book>
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 76
The problem is the effectiveness of XPath queries, respectively of all query languages on XML
documents
▪ Data can be stored per partes in the relational database, and then standard means as B-tree indexes, joins, etc. can be used
+ Usable in any relational database
− Necessary to know the structure of documents (DTD, XMLSchema)
▪ Data can be natively stored/processed in form of text
− New type of indexes are necessary
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 77
If it is known the document schema, each document can be split to parts and stored in tables using more variants/algorithms
Idea of one of possible approaches:
▪ It is supposed a simple DTD, respectively existence of less strict simplified DTD’ created for original DTD, where
following implication holds:
validity w.r.t. DTD validity w.r.t. DTD’
▪ Elements declare other elements in their content at most once
▪ Sub-element can be optional (?) or repeatable (*)
▪ Sub-elements can be delimited by commas, not pipes ’|’
▪ The order of sub-elements has no matter
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 78
For such simple DTD can be created
▪ List of all possible paths in the XML tree
▪ A table
E(DocID,ElemID,ParID,PathID,attributes) for each element E
▪ Columns contain
▪ Document ID
▪ Element ID (primary key)
▪ Parent element ID
▪ Path ID
▪ Columns for all possible attributes of given element
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 79
<!ELEMENT book (title, author)>
<!ATTLIST book isbn CDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
Book(DocID,ElemID,ParID,PathID,ISBN)
Title(DocID,ElemID,ParID,PathID,PCDataID) Author(DocID,ElemID,ParID,PathID,PCDataID) PCData(DocID,ElemID,Data)
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 80
XPath queries are transformed to SQL.
It is often necessary to join large number of tables.
Its number can be decreased by better splitting data to tables.
For example one table can contain not only one element, but also all sub- elements that cannot repeat.
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 81
• Transformation is implemented:
– On the client side
– In some relational DBMS’s optionally on the server side
XML Schema
ER/RMD
XML Dok.XML Dok. XPath query
SQL query Data in RDB
XML Docs
<!ELEMENT book (title, author)>
<!ATTLIST book isbn CDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
Book(
DocID,ElemID,
ISBN, TitlePCDataID, AuthorPCDataID, ParID,PathID)
PCData(DocID,ElemID,Data)
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 82
<!ELEMENT book (title, autor*)>
<!ATTLIST book isbn CDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
If the author can be repeated, it is necessary to store them in its own table. The title can be merged/embedded into book table.
Book(
DocID,ElemID,ISBN,TitlePCDataID,ParID,PathID) Author(DocID,ElemID,ParID,PathID,PCDataID)
PCData(DocID,ElemID,Data)
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 83
DataGuide index
▪ Contains the tree of all paths existing in the XML document tree
▪ Each path is stored in the index only once
▪ Index nodes contain meta-information about all nodes on given path
▪ Index can be stored in form of XML document as well
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 84
XML path index
▪ Similar to DataGuide index
▪ Nodes contain sets of pointers to corresponding nodes in an XML document
▪ Substantially speeds up answers to queries in form of paths without // operator
▪ Helps with evaluation of queries that don’t begin with // but contain it somewhere in the expession.
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 85
Lore index
▪ VIndex (Value index) – indexes textual values of nodes and attributes
▪ TIndex (Term index) – indexes words in nodes and
attributes (similar to inverted file from full-text search
▪ LIndex – holds data for determining parents of nodes
▪ PIndex (Path index) – holds paths and keeps lists of pointers to all nodes on them
Example
▪ /books//author[@country=”CZ”]
▪ VIndex is used to find attributes with country=”CZ”, then parents are visited in bottom-up order uing LIndex and the path is verified
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 86
Native data type XML
▪ CREATE TABLE XmlTab(
ID NUMERIC
CONSTRAINT XmlTabPk PRIMARY KEY, Doc XML
);
▪ INSERT INTO XmlTab VALUES(
1, ’<a><b>b1</b></a>’
);
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 87
Native type XMLType, that allows two different ways of storage:
▪ Using CLOB
▪ Document is stored and returned exactly in its original form including whitespaces (formatting)
+ The XMLSchema need not to be known
− More complicated partial actualizations
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 88
Native type XMLType, that allows two different ways of storage:
▪ Using splitting data to relational tables
▪ Returned document equivalent w.r.t. DOM model, i.e.
without original whitespaces (formatting)
− The XMLSchema has to ne known + Effective partial actualization
+ Possible indexing using B-trees
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 89
Native data type XMLType
▪ CREATE TABLE XmlTab(
ID NUMERIC
CONSTRAINT XmlTabPk PRIMARY KEY, Doc XMLType
);
▪ INSERT INTO XmlTab VALUES(
1, ’<a><b>b1</b></a>’
);
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 90
MS SQL uses for exporting data in XML format syntax SELECT … FOR XML {RAW|AUTO|EXPLICIT|PATH}
[,XMLDATA][,ELEMENTS][BINARY BASE64]
[ROOT(’root_element’)]
RAW generates one element row for each row. Values are stored in attributes. Their names correspond to column names / alias names
<row empno=”1230” ename=”King” deptno=”10” />
<row empno=”4560” ename=”Scott” deptno=”10” />
<row empno=”7800” ename=”King” deptno=”20” />
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 91
AUTO generates one element for each table and each row.
The structure corresponds to used joins. Element names correspond to table /alias names
<dept deptno=”10”>
<emp empno=”1230” ename=”King” />
<emp empno=”4560” ename=”Scott” />
</dept>
<dept deptno=”20”>
<emp empno=”7800” ename=”King” deptno=”20” />
</dept>
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 92
The ELEMENTS modifier ensures that column values are stored as sub-elements
<dept>
<deptno>10</deptno>
<emp>
<empno>1230</empno><ename>King</ename>
<empno>4560</empno><ename>Scott</ename>
</emp>
</dept>
<dept deptno=”20”>
...
</dept>
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 93
To export data to XML the function SYS_XMLGEN can be used. It can get either atomic value or
structured data type
SELECT SYS_XMLGEN(ename) FROM …
<ENAME>SMITH</ENAME>
<ENAME>ALLEN</ENAME>
...
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 94
SYS_XMLGEN(instance)
▪ SELECT SYS_XMLGEN(
dept_t(deptno,dname, CAST(MULTISET(
select empno, ename from emp e
where e.deptno = d.deptno) AS
emplist_t))).getClobVal() AS deptxml
FROM dept d;
<?xml version="1.0"?>
<ROW DEPTNO="100">
<DNAME>Sports</DNAME>
<EMPLIST>
<EMP_T EMPNO="200">
<ENAME>John</ENAME>
</EMP_T>
<EMP_T>
<ENAME>Jack</ENAME>
</EMP_T>
</EMPLIST>
</ROW>
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 95
XML can be generated per partes using specialized functions.
XMLElement(
name[,attrs][,child
1, …]]
)
XMLAttributes(name
1AS alias
1[, …])
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 96
SELECT XMLElement(
”emp”,
XMLAttributes(empno, ename AS name), XMLElement(”salary”, sal),
XMLElement(”comm”, comm)
) FROM Emp WHERE empno=1000;
<emp empno=”1000” name=”King”>
<salary>578</salary>
<comm>50</comm>
</emp>
Each row is presented as one instance of XMLType value
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 97
XMLForest(value1 AS alias1, …)
Generates a value that can have more than one root node, so it is not an XML document
SELECT XMLForest(
empno, deptno as name ) FROM emp;
<empno>1000</empno><deptno>10</deptno>
<empno>2300</empno><deptno>20</deptno>
…
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 98
XMLConcat(XML
1, …)
▪ Concatenates all XML documents, the result
contains more root nodes and so it is not an XML document
XMLSequence(fragment)
▪ Splits fragment to a collection of XML documents, each item corresponds to one root node in the
fragment
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 99
MS SQL allows create three types of XML index on given columns
▪ First the column has to have primary XML index CREATE PRIMARY XML INDEX XmlIdx
ON XmlTab(Doc)
▪ Then following indexes can be created
▪ Value index, Path index, Property index
CREATE XML INDEX InxName ON XmlTab(Doc)
USING XML INDEX XmlIdx
FOR {VALUE|PATH|PROPERTY}
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 100
Oracle allows to create following indexes on XMLType column:
▪ Full-text index (CTXSYS.Context) CREATE INDEX XmlFtIdx
ON XmlTab(Doc) INDEXTYPE IS CTXSYS.Context;
▪ XMLIndex (CTXSYS.CTXXPath –Oracle 10g)
CREATE INDEX XmlPathIdx
ON XmlTab(Dok) INDEXTYPE IS CTXSYS.CTXXPath
▪ XMLIndex (XDB.XmlIndex–Oracle 11g)
CREATE INDEX XmlIdx
ON XmlTab(Dok) INDEXTYPE IS XDB.XMLType Indexes paths, element positions, values
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 101
If the partner send an XML document, it is often necessary to extract needed
information and store it in the database
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 102
Uses either XPath for element ant their sets
addressing, respectively XQuery for construction of resulting XML document from the source one
(.query, .exist, and .value are case sensitive!)
xmlcolumn.query(XPath)
▪ Returns fragment, containing elements that satisfy the query
▪ select Doc.query(’/book/author’) from T;
xmlcolumn.exist(XPath)
▪ Returns one (1), if exists at least one node that corresponds to the query. Else it returns zero (0).
xml.value(XPath,datatype)
▪ Returns value of (the exactly one) addressed node
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 103
Uses XPath for element ant their sets addressing
Extract(xml,xpath)
▪ Returns fragment, containing elements that satisfy the query
▪ select Extract(Doc,’/book/author’) from T;
ExistsNode(xml,xpath)
▪ Returns one (1), if exists at least one node that corresponds to the query. Else it returns zero (0).
ExtractValue(xml,xpath)
▪ Returns value of (the exactly one) addressed node
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 104
SELECT Doc.query(’/book/author’) FROM Books;
▪ <author>Pokorný J.</author>
▪ <author>Král J.</author>
SELECT
Dok.value(’/book[1]/author[1]’,’varchar(30)’) FROM Books;
▪ Pokorný J.
▪ Král J.
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 105
SELECT Extract(Doc,’/book/author’) FROM Book;
▪ <author>Pokorný J.</author>
▪ <author>Král J.</author>
SELECT ExtractValue(Doc,’/book/author’) FROM Books;
▪ Pokorný J.
▪ Král J.
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 106
Method xmlcolumn.modify(’XQuery’)
UPDATE Books SET Doc=Doc.modify(
’delete /book[autor=”Kopecký M.”]’
) WHERE …
UPDATE Books SET Doc=Doc.modify(
’insert <author>…</author>
into /book’) WHERE ...
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 107
Function
UpdateXML(xml,path
1,val
1,path
2,val
2,…)
UPDATE Books SET
Doc=UpdateXML(Doc,
’/book[title=”DIS ”]/author’,’Pokorný J.’
) WHERE …
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 108
Using tool „IIS Virtual Directory Management for SQL Server“ it is possible to define a
directory in IIS 5 and higher, and publish xml data on the web
//http://localhost/xml?SELECT * FROM BOOKS FOR XML AUTO&root=book
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 109
Using XML DB Repository it is possible to
publish data stored in the database in virtual directories
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 110
HttpUriType
▪ Constructor
HttpUriType(url)
▪ Methods GetCLOB() GetBLOB() GetXML()
M. Kopecký Full-text Search, XML Extension (NDBI026, Lect. 6) 111