6. Assignment - XML Technologies - Winter Term 2015 (Release date: Nov 26 - Date due: Dec 02, 8:00 am)
1. Task - XQuery Update

Answer the following queries using the XQuery Update Facility. All queries are based on the addressbook.xml document, which you can find in the repository. The results should be persistent. However, each query may start from the original document as it is stored in the repository.

Example:

delete node fn:doc("addressbook")//address

  1. Replace @id attributes of <address> elements by <myID> child elements. The <myID> element should be the first child. For address0 the result looks like this (abbreviated):
    (: after :)
    <address state="sync">
      <myID>address0</myID>
      <name>Johannes Schmid</name>
      ...
    </address>
    
  2. Write a function local:insert-twitter($pid as xs:string, $tw as xs:string) that takes an address @id and a twitter name and updates the twitter account details: if the address has no twitter entries yet, they are created. If names are already listed, the new one is appended at the end of the list. If the name already exists or the id is not available, nothing happens.
  3. You want to sync your addresses with another device, but with street, code, city and twitternames removed. Only addresses whose @state is 'sync' are relevant for this task. However, your sync tools are not reliable, so you first want to create a test file that shows what would happen. Using the put(document, filename) function, create such a log file containing the original addresses and the ones to be synced, i.e., copies of the original addresses with street, code, city and twitternames removed. The database remains unchanged.
    <org>
      <address id="address0" state="sync">
        <name>Johannes Schmid</name>
        <street>Badstrasse 13</street>
        <code>80327</code>
        <city>80327 Munich</city>
        <country>Germany</country>
        <twitternames>
          <twitter>jo</twitter>
          <twitter>mo</twitter>
          <twitter>momo</twitter>
        </twitternames>
      </address>
    </org>
    <cpy>
      <address id="address0" state="sync">
        <name>Johannes Schmid</name>
        <country>Germany</country>
      </address>
    </cpy>
    ...
    

2. Task - XQuery Full-Text - Evaluating Results: Precision and Recall

  1. A retrieval algorithm returns 10 texts, of which 8 are relevant. It fails to return 5 additional relevant texts. What are the resulting precision and recall? Include the calculation.
  2. Sketch and explain a scenario for each of the two results with:
    • Precision = 1.0, Recall approaches 0.0
    • Recall = 1.0, Precision approaches 0.0

3. Task - XQuery Full-Text - Ranking Documents: TF/IDF

Take a look at the lecture slides about document ranking and the TF-IDF measure. Consider the following three tweets as our input documents with di in D:

  • d1 : "i'm really looking forward to see all of you in berlin. berlin rocks!"
  • d2 : "february is probably the last chance to see them live in concert, this is soo, soo sad :("
  • d3 : "yay! found a really cool live recording of JBT in berlin - best of all: it's free!"

For the query Q=(concert, berlin, live) and t in Q:

  1. Calculate the normalized term frequency (TF) for the given query terms and documents.
  2. Calculate the inverse document frequency (IDF) for all query terms.
  3. Rank the documents with the function: Formula

Discussion of 6. Assignment - XML Technologies - Winter Term 2015
====================
TASK 1 XQuery Update
====================

- Error messages are important!

--------------------
1.1
--------------------
for $idattr in doc("addressbook")//address/@id
return (
   delete node $idattr,
   insert node element myID {string($idattr)} as first into $idattr/..
)

--------------------
1.2
--------------------
declare updating function local:insert-twitter($pid as xs:string, $tw as xs:string) {
  (: unknown id: the loop body never runs, so nothing happens :)
  for $a in doc("addressbook")//address[@id = $pid]
  return
    if (empty($a/twitternames))
    then insert node element twitternames { element twitter { $tw } } into $a
    else if ($a/twitternames/twitter/text() = $tw)
    then ()  (: name is already listed: nothing to do :)
    else insert node element twitter { $tw } as last into $a/twitternames
};
local:insert-twitter("address3", "jo2")

--------------------
1.3
--------------------
let $new :=
  for $e in doc("addressbook")//address[@state = "sync"]
  return
    (: modify a copy, so the database itself remains unchanged :)
    copy $je := $e
    modify delete nodes $je/(street | code | city | twitternames)
    return (element org { $e }, element cpy { $je })
return put(document { $new }, "/synced.xml")


=======================
TASK 2 Precision/Recall
=======================

-----------------------
2.1
-----------------------
Precision = |{relevant documents} intersection {retrieved documents}| / |{retrieved documents}|
Recall = |{relevant documents} intersection {retrieved documents}| / |{relevant documents}|

For the given example:
Precision = 8/(8+2) = 8/10 = 0.8
Recall = 8/(8+5) = 8/13 ≈ 0.62
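
The set-based definitions can be checked with a small Python sketch. The document ids (rel0, irr0, ...) are made up for illustration; only the counts come from the task.

```python
# 13 relevant texts in total: 8 that were returned plus 5 that were missed.
relevant = {f"rel{i}" for i in range(13)}

# The algorithm returned 10 texts: 8 relevant ones and 2 irrelevant ones.
retrieved = {f"rel{i}" for i in range(8)} | {"irr0", "irr1"}

hits = relevant & retrieved                # relevant AND retrieved
precision = len(hits) / len(retrieved)     # 8 / 10
recall = len(hits) / len(relevant)         # 8 / 13

print(precision, round(recall, 2))
```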

-----------------------
2.2
-----------------------
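Precision = 1.0, recall approaches 0.0:
 truePos=1, falsePos=0, falseNeg grows without bound: pure but incomplete result
 (e.g., a single relevant text is returned and all other relevant texts are missed)
 
Recall = 1.0, precision approaches 0.0:
 truePos=1, falsePos grows without bound, falseNeg=0: result resembles complete db content
 (e.g., every text in the db is returned, but only one of them is relevant)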
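
A rough numeric illustration of both limits, written with counts instead of sets. The function names and the values of n are chosen for this sketch only.

```python
def precision(true_pos, false_pos):
    return true_pos / (true_pos + false_pos)

def recall(true_pos, false_neg):
    return true_pos / (true_pos + false_neg)

for n in (10, 1_000, 1_000_000):
    # Scenario 1: one correct hit, nothing wrong returned, n relevant texts missed.
    p1, r1 = precision(1, 0), recall(1, n)
    # Scenario 2: nothing relevant missed, but n irrelevant texts returned as well.
    p2, r2 = precision(1, n), recall(1, 0)
    print(n, (p1, r1), (p2, r2))
```

As n grows, r1 and p2 shrink towards 0 while p1 and r2 stay at 1.0.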
 
 
================================
TASK 3 Ranking Documents: TF/IDF
================================

Term Frequency (Tf) = f(t,d)
Normalized Term Frequency (NTf) = f(t,d) / max{f(w,d) : w in d}
Inverse Document Frequency (IDf) = log(N / (1 + |{d in D : t in d}|)), with N = |D|

--------------------------------
3.1
--------------------------------
For 'concert':
f(concert,d1) = 0 => NTf = 0
f(concert,d2) = 1 => NTf = 1/2 (max{f(w,d2) : w in d2} = 2 for 'soo' or 'is')
f(concert,d3) = 0 => NTf = 0

For 'berlin':
f(berlin,d1) = 2 => NTf = 2/2 = 1 (max{f(w,d1) : w in d1} = 2 for 'berlin')
f(berlin,d2) = 0 => NTf = 0
f(berlin,d3) = 1 => NTf = 1/2 (max{f(w,d3) : w in d3} = 2 for 'of')

For 'live':
f(live,d1) = 0 => NTf = 0
f(live,d2) = 1 => NTf = 1/2 (max{f(w,d2) : w in d2} = 2 for 'soo' or 'is')
f(live,d3) = 1 => NTf = 1/2 (max{f(w,d3) : w in d3} = 2 for 'of')
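
The counts can be reproduced with a short Python sketch. The tokenizer is an assumption (lowercase, words are runs of letters and apostrophes); the lecture slides may define tokens differently.

```python
import re
from collections import Counter

docs = {
    "d1": "i'm really looking forward to see all of you in berlin. berlin rocks!",
    "d2": "february is probably the last chance to see them live in concert, this is soo, soo sad :(",
    "d3": "yay! found a really cool live recording of JBT in berlin - best of all: it's free!",
}
query = ["concert", "berlin", "live"]

def tokenize(text):
    # Assumed tokenization: lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def normalized_tf(term, text):
    counts = Counter(tokenize(text))
    # f(t,d) divided by the frequency of the most frequent word in d
    return counts[term] / max(counts.values())

ntf = {t: {name: normalized_tf(t, text) for name, text in docs.items()} for t in query}
for t in query:
    print(t, ntf[t])
```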


--------------------------------
3.2
--------------------------------
IDf(concert,D) = log(3/(1+1)) = log(3/2)
IDf(berlin,D) = log(3/(1+2)) = log(1) = 0
IDf(live,D) = log(3/(1+2)) = log(1) = 0
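
The same values as a minimal Python sketch. The natural logarithm is an assumption here, since the base is not stated; a different base only rescales the non-zero value.

```python
import math

def idf(n_docs, doc_freq):
    # log(N / (1 + number of documents containing the term))
    return math.log(n_docs / (1 + doc_freq))

# Number of documents (out of 3) that contain each query term.
doc_freq = {"concert": 1, "berlin": 2, "live": 2}

idf_values = {term: idf(3, df) for term, df in doc_freq.items()}
print(idf_values)
```

With only three documents, the "1 +" in the denominator pushes every term that occurs in two of them down to an IDf of 0.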

--------------------------------
3.3
--------------------------------
score(Q,d1) = Tf(concert,d1)*IDf(concert,D) + Tf(berlin,d1)*IDf(berlin,D) + Tf(live,d1)*IDf(live,D)
= 0*log(3/2) + 2*0 + 0*0
= 0

score(Q,d2) = Tf(concert,d2)*IDf(concert,D) + Tf(berlin,d2)*IDf(berlin,D) + Tf(live,d2)*IDf(live,D)
= 1*log(3/2) + 0*0 + 1*0
= log(3/2)

score(Q,d3) = Tf(concert,d3)*IDf(concert,D) + Tf(berlin,d3)*IDf(berlin,D) + Tf(live,d3)*IDf(live,D)
= 0*log(3/2) + 1*0 + 1*0
= 0

Ranking: d2 comes first with score log(3/2); d1 and d3 tie with score 0.
With NTf instead of Tf only score(Q,d2) changes, to 1/2*log(3/2); the ranking stays the same.
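
A small Python sketch of the scoring step, using the raw term frequencies from 3.1 and the document frequencies from 3.2. It assumes the score is the sum of Tf*IDf over the query terms, as in the calculation above, and uses the natural logarithm.

```python
import math

# Raw term frequencies f(t,d) for the query terms, as counted in 3.1.
tf = {
    "d1": {"concert": 0, "berlin": 2, "live": 0},
    "d2": {"concert": 1, "berlin": 0, "live": 1},
    "d3": {"concert": 0, "berlin": 1, "live": 1},
}
# Number of documents containing each query term.
doc_freq = {"concert": 1, "berlin": 2, "live": 2}

idf = {t: math.log(len(tf) / (1 + df)) for t, df in doc_freq.items()}
scores = {d: sum(f * idf[t] for t, f in freqs.items()) for d, freqs in tf.items()}

# Highest score first; documents with equal scores keep their input order.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```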