2. Assignment - XML Technologies - Winter Term 2015 (Release date: Oct 28 - Date due: Nov 4, 8:00 am)
1. Task - XPath on ComicML
The document Calvin & Hobbes - Strip 4 is based on the ComicML DTD as discussed in the tutorial. Express the following queries with XPath:
  1. Print out the names of all characters available in the document.
  2. Print out all text spoken in a “screaming” tone.
  3. Print out the numbers of all scenes in which the character described as “Hobbes, sardonic, stuffed and anthropomorphic Bengal Tiger.” is visible.
  4. Which characters appear in the strip after Hobbes has spoken to Calvin?
Which elements are selected by the following XPath queries? Identify each element by the line number of its start tag.
  1. /child::strip/child::panels/child::panel[attribute::no = "1"]/descendant-or-self::*
  2. /child::strip/child::panels/child::panel/preceding::panel
  3. /child::strip/child::prolog/child::characters/following::*
2. Task - XPath on DBLP
Context:

dblp is a service that provides open bibliographic information on major computer science journals and proceedings.

Raw dblp data can be downloaded in a single XML file. A simple DTD is available. The paper DBLP - Some Lessons Learned documents technical details of this XML file. Since dblp data is updated on a daily basis, please use this local dblp dataset (2015-10-28).
Exercise: Express the following queries with XPath:
  1. Print the number of authors.
  2. Find all publications of
    • 'Marc H. Scholl'
    • 'Michael Grossniklaus'
  3. List joint publications of 'Marc H. Scholl' and 'Michael Grossniklaus'.
  4. List joint publications of 'Marc H. Scholl' and 'Michael Grossniklaus', but only those written without the cooperation of 'Andreas Weiler'.
  5. List all coauthors of 'Donald D. Chamberlin'.
  6. List all publication titles in which 'Donald D. Chamberlin' is the single author.
  7. List all joint publications in which 'Donald D. Chamberlin' is the first author.
  8. Do 'Donald D. Chamberlin' and 'Daniela Florescu' have only joint publications that deal with 'XML' and/or 'XQuery'?
  9. 'Donald D. Chamberlin' and 'Morton M. Astrahan' have 11 joint publications. In which of them is 'Donald D. Chamberlin' listed before 'Morton M. Astrahan', and in which after (in document order of the author elements)?
  10. At a SIGMOD Conference 'Donald D. Chamberlin' spoke about 'XQuery: A Query Language for XML.'. What other publications are listed for Don in that very year?
Remarks:
Preparations:
  • Java 7 is required.
  • Download XPath processor ( BaseX.jar )
    $ wget http://files.basex.org/releases/BaseX.jar
  • Make sure you have ~2.5 GB of disk space available.
  • Download dblp dataset (2015-10-15).
  • Create database from input XML file and start BaseX
    $ ls -l
    -rw-r--r--  1 holu  staff   3.5M Oct 29 11:09 BaseX.jar
    -rw-r--r--  1 holu  staff   8.9K Oct 29 10:51 dblp.dtd
    -rw-r--r--  1 holu  staff   302M Oct 28 22:43 dblp.xml.gz
    -rw-r--r--  1 holu  staff    46B Oct 28 22:39 dblp.xml.gz.md5
    # Create database files in current directory.
    $ touch .basexhome
    # Create database from input file
    # ( ... takes a while ~2GB of XML is indexed etc. ... )
    # 308.38s user 13.53s system 139% cpu 3:51.02 total (on my machine)
    $ java -cp BaseX.jar org.basex.BaseX -c "create database dblp dblp.xml.gz"
    # Print some info about newly created database  
    $ java -cp BaseX.jar org.basex.BaseX -c "open dblp; info database; info index;"
    # Start BaseX GUI to formulate and evaluate XPath queries
    $ java -jar BaseX.jar
    
Discussion of 2. Assignment - XML Technologies - Winter Term 2015
1. Task - XPath on ComicML
The document Calvin & Hobbes - Strip 4 is based on the ComicML DTD as discussed in the tutorial. Express the following queries with XPath:
  1. Print out the names of all characters available in the document.

    Query: calvin-1-1.xq

    doc('http://phobos103.inf.uni-konstanz.de/xml15/xml/calvin-4.xml')/descendant-or-self::character/text()
    Result: calvin-1-1.txt
    Calvin, a precocious, mischievous, and adventurous six-year-old boy.
    Hobbes, sardonic, stuffed and anthropomorphic Bengal Tiger.
    Miss Wormwood, Calvin's world-weary teacher.

  2. Print out all text spoken in a “screaming” tone.

    Query: calvin-1-2.xq

    doc('http://phobos103.inf.uni-konstanz.de/xml15/xml/calvin-4.xml')/descendant-or-self::*[attribute::tone eq "screaming"]/text()
    Result: calvin-1-2.txt
    A STUPID FIELD? YOU'VE GOT THAT NOW! THINK BIG! RICHES! POWER! PRETEND YOU COULD HAVE ANYTHING!

  3. Print out the numbers of all panels in which the character described as “Hobbes, sardonic, stuffed and anthropomorphic Bengal Tiger.” is visible.

    Query: calvin-1-3.xq

    doc('http://phobos103.inf.uni-konstanz.de/xml15/xml/calvin-4.xml')/descendant-or-self::panel[child::scene[contains(attribute::visible, doc('http://phobos103.inf.uni-konstanz.de/xml15/xml/calvin-4.xml')/child::strip/child::prolog/child::characters/child::character[text() eq "Hobbes, sardonic, stuffed and anthropomorphic Bengal Tiger."]/attribute::id/data())]]/attribute::no/data()
    Result: calvin-1-3.txt
    1
    2
    4

  4. Which characters appear in the strip after Hobbes has spoken to Calvin?

    Query: calvin-1-4.xq

    distinct-values(doc('http://phobos103.inf.uni-konstanz.de/xml15/xml/calvin-4.xml')//bubble[@speaker eq 'hobbes' and @to eq 'calvin']/following::scene/@visible/tokenize(data(), ' '))
    Result: calvin-1-4.txt
    calvin
    hobbes

Which elements are selected by the following XPath queries? Identify each element by the line number of its start tag.
  1. Query: calvin-2-1.xq

    doc('http://phobos103.inf.uni-konstanz.de/xml15/xml/calvin-4.xml')/child::strip/child::panels/child::panel[attribute::no = "1"]/descendant-or-self::*
    Result: calvin-2-1.txt

    4 results
    
    <panel no="1">
      <scene visible="calvin hobbes">...</scene>
      <bubbles>
        <bubble speaker="calvin" to="hobbes" tone="question">...</bubble>
      </bubbles>
    </panel>
    
    <scene visible="calvin hobbes">...</scene>
    
    <bubbles>
      <bubble speaker="calvin" to="hobbes" tone="question">...</bubble>
    </bubbles>
    
    <bubble speaker="calvin" to="hobbes" tone="question">...</bubble>
    


  2. Query: calvin-2-2.xq

    doc('http://phobos103.inf.uni-konstanz.de/xml15/xml/calvin-4.xml')/child::strip/child::panels/child::panel/preceding::panel
    Result: calvin-2-2.txt

    3 results
    
    <panel no="1">...</panel>
    <panel no="2">...</panel>
    <panel no="3">...</panel>
    


  3. Query: calvin-2-3.xq

    doc('http://phobos103.inf.uni-konstanz.de/xml15/xml/calvin-4.xml')/child::strip/child::prolog/child::characters/following::*
    Result: calvin-2-3.txt

    18 results
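
    Remark: The behaviour of the following and preceding axes can be checked on a small sample. The sketch below uses a hypothetical, heavily reduced strip (the series element and all attribute values are made up for illustration and are not taken from calvin-4.xml). It suggests why following::* skips descendants and ancestors of the context node, and why the last panel can never be selected by preceding::panel.

```xquery
(: Hypothetical miniature strip; not the real ComicML document. :)
let $strip :=
  <strip>
    <prolog>
      <characters><character id="c"/></characters>
      <series/>
    </prolog>
    <panels>
      <panel no="1"/>
      <panel no="2"/>
    </panels>
  </strip>
return (
  (: Everything after characters in document order, without its
     descendant (character) and its ancestors (prolog, strip). :)
  $strip/child::prolog/child::characters/following::*/name(),
  (: yields: series, panels, panel, panel :)
  '--',
  (: Union of the preceding panels of every panel; the last panel
     precedes no other panel and is therefore missing. :)
  $strip/child::panels/child::panel/preceding::panel/@no/string()
  (: yields: 1 :)
)
```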


2. Task - XPath on DBLP
  1. Print the number of authors.

    Query: dblp-1.xq

    (
      (: Print the number of author elements. :)
      count(//author)
      ,
      (: Print the number of distinct author names. :)
      count(distinct-values(//author)) 
    )
    Result: dblp-1.txt
    10419465
    1652423
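
    Remark: The gap between the two numbers comes from authors with more than one publication. A minimal sketch on a hypothetical two-article sample (author names and titles are invented):

```xquery
(: Hypothetical miniature of the dblp structure. :)
let $dblp :=
  <dblp>
    <article><author>A. One</author><author>B. Two</author><title>T1</title></article>
    <article><author>A. One</author><title>T2</title></article>
  </dblp>
return (
  count($dblp//author),                   (: 3 author elements :)
  count(distinct-values($dblp//author))   (: 2 distinct names :)
)
```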

  2. Find all publications of
    • 'Marc H. Scholl'
    • 'Michael Grossniklaus'

    Query: dblp-2a.xq

    //*[author = 'Marc H. Scholl']
    Result: dblp-2a.txt

    94 results --> (6.18ms)

    Query: dblp-2b.xq

    //*[author = 'Michael Grossniklaus']
    Result: dblp-2b.txt

    63 results --> (3.85ms)

    Remarks:

    $ time zgrep 'Marc H. Scholl' dblp.xml.gz | grep author | wc -l
          94
    zgrep 'Marc H. Scholl' dblp.xml.gz  23.15s user 0.11s system 99% cpu 23.278 total
    grep author  0.00s user 0.00s system 0% cpu 23.277 total
    wc -l  0.00s user 0.00s system 0% cpu 23.277 total

  3. List joint publications of 'Marc H. Scholl' and 'Michael Grossniklaus'.

    Query: dblp-3.xq

    //*[./author = 'Marc H. Scholl'] intersect //*[author = 'Michael Grossniklaus']
    Result: dblp-3.txt

    7 results --> (3.34ms)
  4. List joint publications of 'Marc H. Scholl' and 'Michael Grossniklaus', but only those written without the cooperation of 'Andreas Weiler'.

    Query: dblp-4.xq

    (//*[./author = 'Marc H. Scholl'] intersect //*[author = 'Michael Grossniklaus']) except //*[author = 'Andreas Weiler']
    Result: dblp-4.txt

    2 results --> (2.88ms)
  5. List all coauthors of 'Donald D. Chamberlin'.

    Query: dblp-5.xq

    distinct-values(//*[author = 'Donald D. Chamberlin']/author[. ne 'Donald D. Chamberlin'])
    Result: dblp-5.txt

    57 results --> (3.17ms)
                 
    Morton M. Astrahan
    W. Frank King III
    ...
    Leonard Y. Liu
  6. List all publication titles in which 'Donald D. Chamberlin' is the single author.

    Query: dblp-6.xq

    (//*[count(author) eq 1][./author eq 'Donald D. Chamberlin'])/title
    Result: dblp-6.txt
    <title>2003 SIGMOD Innovations Award Speech.</title>
    <title>On "Human Factors Comparison of a Procedural and a Nonprocedural Query Language".</title>
    <title>Early History of SQL.</title>
    <title>Relational Data-Base Management Systems.</title>
    <title>XQuery: An XML query language.</title>
    <title>Document Convergence in an Interactive Formatting System.</title>
    <title>SQL.</title>
    <title>A Complete Guide to DB2 Universal Database</title>
    <title>Using the New DB2: IBM's Object-Relational Database System.</title>
    <title>A Summary of user Experience with the SQL Data Sublanguage.</title>
    <title>XQuery: A Query Language for XML.</title>
    <title>XQuery: Where Do We Go From Here?</title>
    <title>The "single-assignment" approach to parallel processing.</title>
    <title>Query Languages and XML.</title>
    <title>Home Page</title>

    15 results --> (6.85ms)
  7. List all joint publications in which 'Donald D. Chamberlin' is the first author.

    Query: dblp-7.xq

    //*[count(author) gt 1][(author)[1] eq 'Donald D. Chamberlin']
    Result: dblp-7.txt

    15 results --> (18678.04ms)

    Query: dblp-7w.xq

    //*[count(author) gt 1][author = 'Donald D. Chamberlin']
    Result: dblp-7w.txt

    37 results --> (5.08ms)
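
    Remark: dblp-7w.xq only tests whether the name occurs among the authors at all, whereas dblp-7.xq compares the first author element. A sketch of the difference, assuming a hypothetical sample with made-up keys and one-letter author names:

```xquery
(: Hypothetical sample; keys and names are invented. :)
let $dblp :=
  <dblp>
    <article key="a1"><author>X</author><author>Y</author></article>
    <article key="a2"><author>Y</author><author>X</author></article>
    <article key="a3"><author>X</author></article>
  </dblp>
return (
  (: X must be the first author of a joint publication. :)
  $dblp/*[count(author) gt 1][author[1] eq 'X']/@key/string(),
  (: yields: a1 :)
  '--',
  (: X may appear at any position. :)
  $dblp/*[count(author) gt 1][author = 'X']/@key/string()
  (: yields: a1, a2 :)
)
```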
  8. Do 'Donald D. Chamberlin' and 'Daniela Florescu' have only joint publications that deal with 'XML' and/or 'XQuery'?

    Query: dblp-8.xq

    every $title in //*[(./author = 'Donald D. Chamberlin' and ./author = 'Daniela Florescu')]/title satisfies $title[contains(., 'XML') or contains(., 'XQuery')]
    Result: dblp-8.txt
    true

    Remarks:

    This expression is true if every part element has a discounted attribute (regardless of the values of these attributes):
                  
    every $part in /parts/part satisfies $part/@discounted
                  
    http://www.w3.org/TR/xpath20/ - 3.9 Quantified Expressions
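
    A quantified expression with every is also true when the sequence it ranges over is empty. For dblp-8.xq this would mean that the query returns true even if the two authors had no joint publication at all. A small sketch on a hypothetical parts document, in the spirit of the example from the specification:

```xquery
(: Hypothetical parts list; the element name "nothing" matches no child. :)
let $parts :=
  <parts>
    <part discounted="true"/>
    <part discounted="false"/>
    <part/>
  </parts>
return (
  (: false: the third part has no discounted attribute :)
  every $part in $parts/part satisfies $part/@discounted,
  (: true: only parts that carry the attribute are tested :)
  every $part in $parts/part[@discounted] satisfies $part/@discounted,
  (: true: the range sequence is empty :)
  every $part in $parts/nothing satisfies $part/@discounted
)
```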
                                  

  9. 'Donald D. Chamberlin' and 'Morton M. Astrahan' have 11 joint publications. In which of them is 'Donald D. Chamberlin' listed before 'Morton M. Astrahan', and in which after (in document order of the author elements)?

    Query: dblp-9.xq

    //*[count(author) gt 1][(author[. = 'Donald D. Chamberlin']) << (author[. = 'Morton M. Astrahan'])]
    Result: dblp-9.txt

    3 results --> (22997.12ms)
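
    Remark: The "after" case can presumably be expressed with the node comparison operator >> in place of <<. The sketch below assumes a hypothetical sample with invented keys and one-letter names. Both operators expect at most one node per operand, so it also assumes that a name occurs at most once per publication.

```xquery
(: Hypothetical sample; keys and names are invented. :)
let $dblp :=
  <dblp>
    <article key="p1"><author>C</author><author>A</author></article>
    <article key="p2"><author>A</author><author>B</author><author>C</author></article>
  </dblp>
return (
  (: C is listed before A. :)
  $dblp/*[author[. = 'C'] << author[. = 'A']]/@key/string(),
  (: yields: p1 :)
  (: C is listed after A. :)
  $dblp/*[author[. = 'C'] >> author[. = 'A']]/@key/string()
  (: yields: p2 :)
)
```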
  10. At a SIGMOD Conference 'Donald D. Chamberlin' spoke about 'XQuery: A Query Language for XML.'. What other publications are listed for Don in that very year?

    Query: dblp-10.xq

    ((//*[author = 'Donald D. Chamberlin'])[year = ( (//*[author = 'Donald D. Chamberlin'])[title eq 'XQuery: A Query Language for XML.'])/year])[title ne 'XQuery: A Query Language for XML.']
    Result: dblp-10.txt

    1 results --> (45088.33ms)

    Remark: Solution using XQuery

    Query: dblp-10-xq.xq

    let $title := 'XQuery: A Query Language for XML.'
    let $pubs := //*[author = 'Donald D. Chamberlin']
    let $year := $pubs[title eq $title]/year
    for $pub in $pubs
    where $pub/year eq $year
      and $pub/title ne $title
    return
      $pub
    Result: dblp-10-xq.txt

    1 results --> (3.5ms)


    Question: Why does this XPath expression not yield the desired result?

    Query: dblp-10w.xq

    (: Why does this XPath expression not yield the desired result? :)
    (//*[author = 'Donald D. Chamberlin'])[./year = (.[title eq 'XQuery: A Query Language for XML.'])/year]
    Result: dblp-10w.txt
    <inproceedings mdate="2010-06-07" key="conf/sigmod/Chamberlin03">
      <author>Donald D. Chamberlin</author>
      <title>XQuery: A Query Language for XML.</title>
      <pages>682</pages>
      <year>2003</year>
      <crossref>conf/sigmod/2003</crossref>
      <booktitle>SIGMOD Conference</booktitle>
      <url>db/conf/sigmod/sigmod2003.html#Chamberlin03</url>
      <ee>http://doi.acm.org/10.1145/872757.872877</ee>
    </inproceedings>
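
    Remark: One way to approach the question is to look at the context item. Inside the predicate, "." is the publication currently being tested, so its year is only ever compared with its own year. A sketch on a hypothetical sample (keys, names, titles and years are invented):

```xquery
(: Hypothetical sample of three publications by the same author. :)
let $dblp :=
  <dblp>
    <inproceedings key="k1"><author>D</author><title>Talk</title><year>2003</year></inproceedings>
    <article key="k2"><author>D</author><title>Other</title><year>2003</year></article>
    <article key="k3"><author>D</author><title>Older</title><year>1998</year></article>
  </dblp>
let $pubs := $dblp/*[author = 'D']
return (
  (: "." is the tested publication itself: only the talk survives. :)
  $pubs[year = (.[title eq 'Talk'])/year]/@key/string(),
  (: yields: k1 :)
  '--',
  (: The year of the talk is looked up among all publications. :)
  $pubs[year = $pubs[title eq 'Talk']/year]/@key/string()
  (: yields: k1, k2 :)
)
```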

Additional Questions:
  • What is the difference between eq and =?
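
    As a starting point for this question, the following sketch contrasts the value comparison eq, which compares single items, with the general comparison =, which is existentially quantified over sequences. The sample publication is hypothetical.

```xquery
(: Hypothetical publication with two authors. :)
let $pub := <article><author>A</author><author>B</author></article>
return (
  (: true: at least one author equals 'B' :)
  $pub/author = 'B',
  (: true: exactly one item on each side :)
  $pub/author[1] eq 'A',
  (: $pub/author eq 'B' would raise err:XPTY0004,
     because the left operand has two items. :)
  (: false: no item in the empty sequence equals 'B' :)
  () = 'B',
  (: true: eq with an empty operand returns the empty sequence :)
  empty(() eq 'B')
)
```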