1. Assignment - XML Technologies - Winter Term 2015 (Release date: Oct 22 - Date due: Oct 28, 8:00 am)

Exercise: Comic Markup Language (ComicML)

  • The lecture introduced Markup Languages.
  • The terms descriptive and semantic markup have been discussed.
  • XML dialects seem to be all over the place.
  • Make sure that you have a clear understanding of the italic terms mentioned above.
Task

Invent your own XML dialect to represent the information in one of the comic strips shown below.

Calvin Comic - Doing nothing
Calvin Comic - Childhood is short
Calvin Comic - Kill the messenger
Original versions: Go Comics
Try to find a representation that allows you to (later) answer queries on structure and content, e.g.
  • What panels does Calvin appear in?
  • In which scenes can we see Calvin, but not Hobbes?
  • Give me all strips featuring characters talking about homework.
Discussion of 1. Assignment - XML Technologies - Winter Term 2015
Comic Markup Language (ComicML)
Setup: We want to build a Comic Strip Information System
  • Query a database of (Calvin and Hobbes, Asterix, Dilbert, …) comic strips by content.
  • We want to approach the system with queries like:
    • Find all strips featuring Calvin.
    • Find all panels with Calvin screaming at Hobbes.
    • List all discussions between Calvin and his mom.
Approach: Unless we have next-generation image recognition software available, we have to annotate the comic strips to be able to process the queries above:
[ bitmap ] Calvin - Happy Hobbes
[ annotation ] "Calvin ... Hobbes ... Mom"
Problem: Represent information about a Calvin & Hobbes comic strip.
Stage 1: Use an ASCII Representation
Calvin - Happy Hobbes
Series: Calvin and Hobbes
Author: Bill Watterson

Panel 1: Calvin and Hobbes are walking down the alley.
Calvin: »If you could wish for anything, what would it be?«

Panel 2: Hobbes looking happy with a smile. Calvin sitting next to Hobbes, listening to him.
Hobbes: »A big sunny field to be in.«

Panel 3: Calvin, very excited, is screaming.
Calvin: »A STUPID FIELD? YOU'VE GOT THAT NOW! THINK BIG! RICHES! POWER! PRETEND YOU COULD HAVE ANYTHING!«

Panel 4: Calvin, thoughtful, looking at Hobbes. Hobbes, absolutely satisfied, lying in the field, sleeping.
Calvin: »Actually, it's hard to argue with someone who looks so happy.«
Hobbes: »zzZZZzzz«
  • Structure is represented by special characters (newline, :, »)
  • ASCII Control character sequence 0x0d, 0x0a (CR, LF) divides lines
  • each line contains a character name, then a colon ( : ), then a line of speech (comic-speak: bubble)
  • the contents of each bubble are delimited by » and «
  • Interpreting this structure is possible, but hard for machines.
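To make the last point concrete, here is a minimal sketch of a parser for the Stage 1 layout. The function name `parse_ascii_strip` and the two regular expressions are assumptions made for illustration; they are not part of the assignment. The sketch only works as long as every line follows the conventions listed above exactly, which is what makes this representation fragile.

```python
import re

# Two panels of the Stage 1 text (scene of panel 2 shortened).
ASCII_STRIP = """\
Panel 1: Calvin and Hobbes are walking down the alley.
Calvin: »If you could wish for anything, what would it be?«

Panel 2: Hobbes looking happy with a smile.
Hobbes: »A big sunny field to be in.«
"""

# Assumed line conventions: "Panel <n>: <scene>" and "<name>: »<speech>«".
PANEL_RE = re.compile(r"^Panel (\d+): (.*)$")
BUBBLE_RE = re.compile(r"^([^:]+): »(.*)«$")


def parse_ascii_strip(text):
    """Turn the line-oriented text into a list of panel dictionaries."""
    panels = []
    for line in text.splitlines():  # splitlines() also handles CR LF
        line = line.strip()
        if not line:
            continue
        match = PANEL_RE.match(line)
        if match:
            panels.append({"no": int(match.group(1)),
                           "scene": match.group(2),
                           "bubbles": []})
            continue
        match = BUBBLE_RE.match(line)
        if match and panels:
            # Attach the bubble to the most recently seen panel.
            panels[-1]["bubbles"].append((match.group(1), match.group(2)))
        # Any other line is silently ignored, which hides input errors.
    return panels


panels = parse_ascii_strip(ASCII_STRIP)
```

A speech line that itself contains » or a scene description spanning two lines would already break this parser, and nothing in the format lets us detect that.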
Stage 2: HTML-Style Presentational Markup
  • HTML (W3C http://www.w3.org/MarkUp) defines a number of markup tags, some of which are required to match (<b>...</b>).
    Calvin - Happy Hobbes
    <html>
      <h1>Calvin and Hobbes</h1>
      <h2>Panel 1</h2>
      <ul>
        <li>
          <b>Calvin</b>: <i>If you could wish for anything, what would it be?</i>
        </li>
      </ul>
      <h2>Panel 2</h2>
    </html>
  • Note that HTML tags primarily define presentational markup (heading level, font weight, ...)
  • Presentational markup is primarily meant to instruct rendering engines on how to lay out and pretty-print content.
  • Presentational markup is of limited use for the comic strip finder (tags do not reflect the structure of the comic strip content).
Stage 3: XML-Style Logical Markup
  • We create a set of tags that is customized to represent the content of comics, e.g.:
    <author>Bill Watterson</author>
    <scene>Calvin and Hobbes are walking down the alley.</scene>
  • Tags may be created according to the application needs.
  • New types of queries may require new tag types. No problem for XML.
  • The resulting set of tags forms a new custom markup language (an XML dialect).
  • The XML dialect encodes the semantics and inner structure of content as it is seen from a specific problem domain (e.g., the wish of a publishing house to efficiently search and query their comic assets).
  • All tags appear as properly nested pairs: <t>...<f/>...</t>
<strip>
  <prolog>
    <series>Calvin and Hobbes</series>
    <author>Bill Watterson</author>
    <characters>
      <character>Calvin</character>
      <character>Hobbes</character>
    </characters>
  </prolog>
  <panels>
    <panel>
      <scene>Calvin and Hobbes are walking down the alley.</scene>
      <bubbles>
        <bubble>
          <speaker>Calvin</speaker>
          <speech>If you could wish for anything, what would it be?</speech>
        </bubble>
      </bubbles>
    </panel>
    <panel>
      <scene>Hobbes looking happy with a smile. Calvin sitting next to Hobbes, listening to him.</scene>
      <bubbles>
        <bubble>
          <speaker>Hobbes</speaker>
          <speech>A big sunny field to be in.</speech>
        </bubble>
      </bubbles>
    </panel>
    <panel>
      <scene>Calvin, very excited, is screaming.</scene>
      <bubbles>
        <bubble>
          <speaker>Calvin</speaker>
          <speech>A STUPID FIELD? YOU'VE GOT THAT NOW! THINK BIG! RICHES! POWER! PRETEND YOU COULD HAVE ANYTHING!</speech>
        </bubble>
      </bubbles>
    </panel>
    <panel>
      <scene>Calvin, thoughtful, looking at Hobbes. Hobbes, absolutely satisfied, lying in the field, sleeping.</scene>
      <bubbles>
        <bubble>
          <speaker>Calvin</speaker>
          <speech>Actually, it's hard to argue with someone who looks so happy.</speech>
        </bubble>
        <bubble>
          <speaker>Hobbes</speaker>
          <speech>zzZZZzzz</speech>
        </bubble>
      </bubbles>
    </panel>
  </panels>
</strip>
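With the Stage 3 encoding, a query such as "in which panels does Calvin speak?" becomes a walk over the element tree. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the helper name `panels_where_speaks` is hypothetical and the sample is a shortened copy of the document above (speech of panel 3 abbreviated).

```python
import xml.etree.ElementTree as ET

# Shortened copy of the Stage 3 document (three of the four panels).
STAGE3_XML = """<strip>
  <panels>
    <panel>
      <scene>Calvin and Hobbes are walking down the alley.</scene>
      <bubbles>
        <bubble>
          <speaker>Calvin</speaker>
          <speech>If you could wish for anything, what would it be?</speech>
        </bubble>
      </bubbles>
    </panel>
    <panel>
      <scene>Hobbes looking happy with a smile.</scene>
      <bubbles>
        <bubble>
          <speaker>Hobbes</speaker>
          <speech>A big sunny field to be in.</speech>
        </bubble>
      </bubbles>
    </panel>
    <panel>
      <scene>Calvin, very excited, is screaming.</scene>
      <bubbles>
        <bubble>
          <speaker>Calvin</speaker>
          <speech>A STUPID FIELD?</speech>
        </bubble>
      </bubbles>
    </panel>
  </panels>
</strip>"""


def panels_where_speaks(strip, name):
    """Return the 1-based positions of all panels in which `name` has a bubble."""
    hits = []
    for position, panel in enumerate(strip.iterfind("panels/panel"), start=1):
        speakers = [s.text for s in panel.iterfind("bubbles/bubble/speaker")]
        if name in speakers:
            hits.append(position)
    return hits


strip = ET.fromstring(STAGE3_XML)
calvin_panels = panels_where_speaks(strip, "Calvin")
```

Note that this answers who speaks, not who is visible: at this stage the scene description is still free text, so "appears in" would require searching inside the scene string.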
Stage 4: Full-Featured XML
  • Although fairly simplistic, the previous stage clearly constitutes an improvement.
  • XML comes with a number of additional constructs which allow us to convey even more useful information, e.g.,
    1. Attributes may be used to qualify tags to reduce the number of different tag names (tag soup):
      • Instead of
        <question>If you could wish ... ?</question>
        <scream>A STUPID FIELD?! ...</scream>
      • we could use
        <bubble tone="question">If you could wish ...?</bubble>
        <bubble tone="scream" mood="excited">A STUPID FIELD?! ...?</bubble>
    2. References establish links internal to an XML document:
      • Establish link target:
        <character id="calvin">Calvin, a six-year-old boy, named after the 16th-century theologian John Calvin.</character>
        <character id="hobbes">Hobbes, stuffed and anthropomorphic Bengal Tiger.</character>
      • Reference the target:
        <bubble speaker="hobbes" to="calvin" tone="answer" mood="relaxed">A big sunny field to be in.</bubble>
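References are only useful if every reference points at an existing target. In a DTD this is usually expressed with the ID and IDREF attribute types and checked by a validating parser. Python's `xml.etree.ElementTree` does not validate, so the sketch below performs the check by hand. The function name `dangling_references`, the wrapper layout of the sample, and the second bubble (speaker "susie") are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Link targets and one reference from the text above, plus one
# hypothetical bubble whose speaker was never declared.
SNIPPET = """<strip>
  <characters>
    <character id="calvin">Calvin, a six-year-old boy.</character>
    <character id="hobbes">Hobbes, stuffed and anthropomorphic Bengal Tiger.</character>
  </characters>
  <bubble speaker="hobbes" to="calvin" tone="answer" mood="relaxed">A big sunny field to be in.</bubble>
  <bubble speaker="susie" to="calvin">Hypothetical line.</bubble>
</strip>"""


def dangling_references(root):
    """List (attribute, value) pairs that do not match any declared character id."""
    declared = {c.get("id") for c in root.iter("character")}
    dangling = []
    for bubble in root.iter("bubble"):
        for attr in ("speaker", "to"):
            ref = bubble.get(attr)
            if ref is not None and ref not in declared:
                dangling.append((attr, ref))
    return dangling


root = ET.fromstring(SNIPPET)
problems = dangling_references(root)
```

Here `problems` contains the single undeclared speaker, while the Hobbes bubble resolves cleanly to the two declared characters.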
<strip copyright="Universal Press Syndicate" year="1986">
  <prolog>
    <series url="http://www.gocomics.com/calvinandhobbes/">Calvin and Hobbes</series>
    <author>Bill Watterson</author>
    <genres>
      <genre>humor</genre>
      <genre>family life</genre>
      <genre>politics</genre>
      <genre>satire</genre>
    </genres>
    <characters>
      <character id="calvin">Calvin, a precocious, mischievous, and adventurous six-year-old boy.</character>
      <character id="hobbes">Hobbes, sardonic, stuffed and anthropomorphic Bengal Tiger.</character>
      <character id="wormwood">Miss Wormwood, Calvin's world-weary teacher.</character>
    </characters>
  </prolog>
  <panels length="4">
    <panel no="1">
      <scene visible="calvin hobbes">Calvin and Hobbes are walking down the alley.</scene>
      <bubbles>
        <bubble speaker="calvin" to="hobbes" tone="question">If you could wish for anything, what would it be?</bubble>
      </bubbles>
    </panel>
    <panel no="2">
      <scene visible="calvin hobbes">Hobbes looking happy with a smile. Calvin sitting next to Hobbes, listening to him.</scene>
      <bubbles>
        <bubble speaker="hobbes" to="calvin" tone="answer" mood="relaxed">A big sunny field to be in.</bubble>
      </bubbles>
    </panel>
    <panel no="3">
      <scene visible="calvin">Calvin, very excited, is screaming.</scene>
      <bubbles>
        <bubble speaker="calvin" to="hobbes" tone="screaming" mood="excited">A STUPID FIELD? YOU'VE GOT THAT NOW! THINK BIG! RICHES! POWER! PRETEND YOU COULD HAVE ANYTHING!</bubble>
      </bubbles>
    </panel>
    <panel no="4">
      <scene visible="calvin hobbes">Calvin, thoughtful, looking at Hobbes. Hobbes, absolutely satisfied, lying in the field, sleeping.</scene>
      <bubbles>
        <bubble speaker="calvin" to="calvin" tone="whispering" mood="thoughtful">Actually, it's hard to argue with someone who looks so happy.</bubble>
        <bubble speaker="hobbes" mood="relaxed">zzZZZzzz</bubble>
      </bubbles>
    </panel>
  </panels>
</strip>
  • In addition to pure text content, the XML encodings in Stage 3 and 4 make the document structure available to the XML processor.
  • Tag names and nesting of tags convey the information necessary to implement the comic strip finder queries.
  • We can now answer queries on structure and content, e.g.,
    • What panels does Calvin appear in?
  • Supporting tools: validate the input, so that queries can be expected to work.
  • ComicML.dtd
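As a rough illustration of how the Stage 4 attributes support such queries, the following sketch answers "What panels does Calvin appear in?" and "In which scenes can we see Calvin, but not Hobbes?" from the `visible` attribute, and "Find all panels with Calvin screaming at Hobbes" from the bubble attributes. It uses Python's standard `xml.etree.ElementTree`; the helper names are hypothetical, and the sample keeps only the parts of the Stage 4 document that the queries touch (speech texts abbreviated).

```python
import xml.etree.ElementTree as ET

# Reduced copy of the Stage 4 document: scenes of all four panels,
# bubbles of panels 1 and 3 only.
STAGE4_XML = """<strip>
  <panels length="4">
    <panel no="1">
      <scene visible="calvin hobbes">Calvin and Hobbes are walking down the alley.</scene>
      <bubbles>
        <bubble speaker="calvin" to="hobbes" tone="question">If you could wish ...?</bubble>
      </bubbles>
    </panel>
    <panel no="2">
      <scene visible="calvin hobbes">Hobbes looking happy with a smile.</scene>
    </panel>
    <panel no="3">
      <scene visible="calvin">Calvin, very excited, is screaming.</scene>
      <bubbles>
        <bubble speaker="calvin" to="hobbes" tone="screaming" mood="excited">A STUPID FIELD?</bubble>
      </bubbles>
    </panel>
    <panel no="4">
      <scene visible="calvin hobbes">Calvin, thoughtful, looking at Hobbes.</scene>
    </panel>
  </panels>
</strip>"""


def visible_ids(panel):
    """Return the set of character ids listed in the panel's scene."""
    scene = panel.find("scene")
    if scene is None:
        return set()
    return set(scene.get("visible", "").split())


def panels_showing(strip, present, absent=()):
    """Panel numbers whose scene shows all of `present` and none of `absent`."""
    result = []
    for panel in strip.iterfind("panels/panel"):
        seen = visible_ids(panel)
        if set(present) <= seen and not (set(absent) & seen):
            result.append(panel.get("no"))
    return result


def panels_with_bubble(strip, speaker, to, tone):
    """Panel numbers containing a bubble with the given speaker, addressee and tone."""
    result = []
    for panel in strip.iterfind("panels/panel"):
        for bubble in panel.iterfind("bubbles/bubble"):
            if (bubble.get("speaker"), bubble.get("to"), bubble.get("tone")) == (speaker, to, tone):
                result.append(panel.get("no"))
                break
    return result


strip = ET.fromstring(STAGE4_XML)
with_calvin = panels_showing(strip, ["calvin"])
calvin_alone = panels_showing(strip, ["calvin"], absent=["hobbes"])
screaming = panels_with_bubble(strip, "calvin", "hobbes", "screaming")
```

The same questions could be phrased as XPath or XQuery expressions; the Python version is only meant to show that the answers now follow from tag names, nesting, and attribute values rather than from guessing at text layout.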
XML as a High-Volume Data Format

XML was designed to satisfy the needs of two worlds:

  1. document-centric applications (e.g. Information Retrieval)
    • Information with little structure (e.g. text documents with chapters, sections, ... )
  2. data-centric applications ("traditional" database applications)
    • Very regular data; sometimes, however, we might want some flexibility (e.g. persons without a phone number, or with even two, ... )

While we expect high-volume data in data-centric application domains, you can probably imagine how an XML-based approach to representing document-centric data can ultimately lead to a high-volume data processing challenge as well:

A Complete Comic Strip Database
<comic-strips>
  <calvin-strips>
    <strip date="1985-11-18"> ... </strip>
    ...
    <strip date="1995-12-31"> ... </strip>
  </calvin-strips>
  <dilbert-strips>
    <strip date="1988-07-21"> ... </strip>
    ...
    <strip date="2005-10-12"> ... </strip>
  </dilbert-strips>
</comic-strips>
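A document of this shape can grow far beyond what is comfortable to hold as one in-memory tree. One common workaround, sketched below under the assumption that each series is wrapped in an element whose name ends in `-strips` (as in the sample above), is to stream over the input with `xml.etree.ElementTree.iterparse` and discard each strip's content once it has been processed. The function name `count_strips` and the tiny inline sample are illustrative only.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny stand-in for the complete database; real strips would carry content.
SAMPLE = b"""<comic-strips>
  <calvin-strips>
    <strip date="1985-11-18"/>
    <strip date="1995-12-31"/>
  </calvin-strips>
  <dilbert-strips>
    <strip date="1988-07-21"/>
  </dilbert-strips>
</comic-strips>"""


def count_strips(source):
    """Count <strip> elements per series wrapper while streaming over the input."""
    counts = Counter()
    series = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag != "comic-strips" and elem.tag.endswith("-strips"):
            series = elem.tag  # remember which series we are inside
        elif event == "end" and elem.tag == "strip":
            counts[series] += 1
            # Drop the strip's children, text and attributes; the emptied
            # element itself stays attached to its parent.
            elem.clear()
    return counts


counts = count_strips(io.BytesIO(SAMPLE))
```

Streaming like this works for simple aggregations, but queries that relate distant parts of the document to each other are harder to express this way.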
XML Can Embed Fragments of Binary Data
Encode binary data such that the encoding is guaranteed to never contain the character ’<’. Example: base64 (Characters used: A...Z, a...z, 0...9, +, /, = )
<strip>
  <prolog> ... </prolog>
  <drawing encoding="base64" mime-type="application/pdf">
  JVBERi0xLjMKJcTl8uXrp/Og0MTGCjIgMCBvYmoKPDwgL0xlbmd0aCA0IDA... bGF0ZURlY29kZSA+PgpzdHJlYW0KeNorVAhUKFTQD0gtSk4tKClNzFEoygQ... ...
  </drawing>
  <panels> ... </panels>
</strip>
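The round trip can be sketched with Python's standard `base64` module. The element and attribute names follow the example above; the helper names `embed_drawing` and `extract_drawing` and the few payload bytes are made up for illustration (the payload deliberately contains a `<`).

```python
import base64
import xml.etree.ElementTree as ET


def embed_drawing(strip, payload, mime_type):
    """Append a <drawing> element holding `payload` as base64 text."""
    drawing = ET.SubElement(
        strip, "drawing", {"encoding": "base64", "mime-type": mime_type}
    )
    drawing.text = base64.b64encode(payload).decode("ascii")
    return drawing


def extract_drawing(strip):
    """Decode the base64 text of the first <drawing> child back into bytes."""
    drawing = strip.find("drawing")
    return base64.b64decode(drawing.text)


# Illustrative bytes only, not a complete PDF file.
payload = b"%PDF-1.3\n<< /Length 4 >>"

strip = ET.Element("strip")
embed_drawing(strip, payload, "application/pdf")

# Serialize and re-parse to show that the binary data survives as XML text.
serialized = ET.tostring(strip, encoding="unicode")
restored = extract_drawing(ET.fromstring(serialized))
```

The encoded text is longer than the original bytes (four characters for every three bytes), which is the price paid for staying within the characters listed above.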
Database-Supported XML Processors
Current XML processors (XQuery and XSLT engines such as Saxon) have been built under the assumption that the input XML documents can be processed in main memory.
  • In the course of this lecture we will have a look at database-supported XML processors able to process input XML documents of 1 GB size and beyond (see next assignment).
  • XPath and XQuery will be used to filter, transform, and join XML documents, much like tables in the relational data model.