Markup Languages

This chapter is based on Chapter 2 of Schneider & Perry's book [Schneider 2000], Chapters 4, 5, 20 and 21 of the book by Deitel et al. [Deitel 2001], some examples from Elliotte Rusty Harold's book [Harold 1999], plus additional material from the web.

Introduction

SGML, HTML, and XML are the most important markup languages: SGML because it is the parent language of both HTML and XML, HTML because it is the current language of the web, and XML because it is the future language of the web.

Standard Generalised Markup Language

In the late 1960s, IBM researchers worked on the problem of building a portable system for the interchange and manipulation of legal documents. Their prototype language marked up structural elements, with formatting information kept in separate files, called style sheets. The document structure was defined in yet another file, called a Document Type Definition (DTD). By 1969, the researchers had developed the General Markup Language (GML). After further work worldwide, in 1986, the International Organization for Standardization (ISO) adopted a particular version called the Standard Generalised Markup Language (SGML). It quickly became the business standard for data storage and interchange. SGML has the following advantages.

However, it also has the following disadvantages.

Put bluntly, it is too elaborate for the ever-changing web.

HyperText Markup Language

Tim Berners-Lee and Robert Cailliau, working independently of each other at CERN, invented the HyperText Markup Language (HTML) based on SGML. HTML is one particular SGML DTD that is easier to learn and use than SGML. It is a trimmed-down version of SGML, eliminating SGML features that are rarely needed, but including hyperlinks to link web documents.

Cascading Style Sheets

With earlier versions of HTML, web browsers controlled the appearance (rendering) of every web page. With the advent of Cascading Style Sheets (CSS), the document author can control the way the browser renders the page, or the entire web site for that matter. Style sheets allow document authors to specify the style of their page elements (spacing, margins, etc.) separately from their structure (section headers, body text, etc.), thus allowing greater manageability.

Extensible Markup Language

The Extensible Markup Language (XML) is also a descendant of SGML, representing an industry-wide effort to define which data are displayed (or printed), whereas HTML defines how a page is displayed. XML will overtake HTML because of its ability to describe content. XML has the following advantages.

XML defines a document's structure by marking the start and end (tags) of its logical parts (elements). This is similar to HTML, but also defines record structures for databases and other applications. Figure 1 illustrates an XML-formatted week from my calendar in file calendar.xml.

<?xml version="1.0"?>
<!DOCTYPE calendar SYSTEM "calendar.dtd">

<calendar>
  <year value="2001">
    <date month="01" day="22">
      <event time="1700">
        Eric Hobsbawm's lecture
      </event>
    </date>
    <date month="01" day="23">
      <event time="0930">
        Lewisham Hospital
      </event>
      <event time="1730">
        Quizmaster's Cup
      </event>
    </date>
    <date month="01" day="24">
      <event time="1600">
        Teaching Committee
      </event>
    </date>
    <date month="01" day="25">
      <event time="1800">
        Computer Networking 3
      </event>
    </date>
    <date month="01" day="26">
      <event time="1400">
        School Meeting
      </event>
      <event time="1800">
        Electronic Commerce 3
      </event>
    </date>
  </year>
</calendar>

Figure 1: On-line calendar

The first line is an XML declaration, specifying which version of XML the document conforms to. The second line is a document type declaration, naming the root element (calendar) and the file (calendar.dtd) that defines the document's structure. All XML documents must contain exactly one root element, e.g. <calendar> in this example, containing all other elements. Element <year> is a child element because it is nested inside element <calendar>.
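The root and child relationships can also be inspected by program. The following is a minimal sketch in Python, using the standard library's xml.etree.ElementTree parser on a shortened copy of Figure 1; the function name outline is ours, chosen for illustration.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the calendar in Figure 1 (two dates only).
CALENDAR_XML = """<?xml version="1.0"?>
<calendar>
  <year value="2001">
    <date month="01" day="22">
      <event time="1700">Eric Hobsbawm's lecture</event>
    </date>
    <date month="01" day="23">
      <event time="0930">Lewisham Hospital</event>
      <event time="1730">Quizmaster's Cup</event>
    </date>
  </year>
</calendar>
"""

# fromstring() returns the single root element of the document.
root = ET.fromstring(CALENDAR_XML)


def outline(element, depth=0):
    """Return one indented line per element, showing its tag and attributes."""
    lines = ["  " * depth + element.tag + " " + str(element.attrib)]
    for child in element:  # iterating an element yields its child elements
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(outline(root)))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the document.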

How do we know that the above XML document is valid, i.e. that it follows the structure we intended? (Being well formed only means that the tags are properly nested and closed.) Enter a document model in the form of a Document Type Definition (DTD), which is a hand-me-down from SGML defining the allowed structure. Figure 2 is calendar.dtd.

<?xml version="1.0" encoding="UTF-8"?>

<!ELEMENT calendar (year)*>

<!ELEMENT year (date)*>
<!ATTLIST year value CDATA #REQUIRED>

<!ELEMENT date (event)*>
<!ATTLIST date day   CDATA #REQUIRED
               month CDATA #REQUIRED>

<!ELEMENT event (#PCDATA)>
<!ATTLIST event time CDATA #IMPLIED>

Figure 2: Our calendar's DTD

This archaic structure contains a set of rules or declarations. Each declaration adds a new element, set of attributes, entity, or notation to the language we are describing. Briefly, it states that a <calendar> contains zero or more <year>s, a <year> contains zero or more <date>s, the value attribute of a <year> is mandatory, etc.
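To make the rules in Figure 2 concrete, here is a rough sketch that checks them by hand in Python. It is not a DTD processor: Python's standard library does not validate against DTDs, so the two tables below are our own transcription of Figure 2, and the sketch ignores details such as undeclared attributes and stray text. The names ALLOWED_CHILDREN, REQUIRED_ATTRS and check are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand transcription of Figure 2: which children each element may contain ...
ALLOWED_CHILDREN = {
    "calendar": {"year"},
    "year": {"date"},
    "date": {"event"},
    "event": set(),          # (#PCDATA): text only, no child elements
}
# ... and which attributes are #REQUIRED (event's time is #IMPLIED, so optional).
REQUIRED_ATTRS = {
    "calendar": set(),
    "year": {"value"},
    "date": {"day", "month"},
    "event": set(),
}


def check(element, errors=None):
    """Collect violations of the Figure 2 rules for element and its descendants."""
    errors = [] if errors is None else errors
    if element.tag not in ALLOWED_CHILDREN:
        errors.append(f"unknown element <{element.tag}>")
        return errors
    for name in sorted(REQUIRED_ATTRS[element.tag] - set(element.attrib)):
        errors.append(f"<{element.tag}> is missing attribute '{name}'")
    for child in element:
        if child.tag not in ALLOWED_CHILDREN[element.tag]:
            errors.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            check(child, errors)
    return errors


good = ET.fromstring(
    '<calendar><year value="2001"><date month="01" day="22">'
    '<event time="1700">Lecture</event></date></year></calendar>'
)
bad = ET.fromstring("<calendar><year><event>Lecture</event></year></calendar>")

if __name__ == "__main__":
    print(check(good))   # no violations
    print(check(bad))    # missing attribute and misplaced element
```

A validating parser does the same job directly from the DTD, which is why the rules only need to be written once.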

The format in Figure 1 is consistent with that outlined in [RFC 2445] for the exchange of calendar information between applications, e.g. your PC and your Personal Digital Assistant (PDA). In addition, this format can easily be transformed into another format using a parser and a style sheet. Figure 3 is the result of processing the above XML document with the style sheet xml2html.xsl.

Calendar for 2001

  • 22/01
    1. Eric Hobsbawm's lecture
  • 23/01
    1. Lewisham Hospital
    2. Quizmaster's Cup
  • 24/01
    1. Teaching Committee
  • 25/01
    1. Computer Networking 3
  • 26/01
    1. School Meeting
    2. Electronic Commerce 3

Figure 3: Our calendar in HTML
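The style sheet xml2html.xsl is not reproduced here. As a stand-in, the sketch below performs a similar transformation in Python rather than XSL. The choice of HTML elements (a heading, a bulleted list of dates, a numbered list of events) is an assumption inferred from the appearance of Figure 3, and the function name calendar_to_html is ours.

```python
import xml.etree.ElementTree as ET
from html import escape

# A shortened copy of the calendar in Figure 1 (two dates only).
CALENDAR_XML = """<?xml version="1.0"?>
<calendar>
  <year value="2001">
    <date month="01" day="22">
      <event time="1700">Eric Hobsbawm's lecture</event>
    </date>
    <date month="01" day="23">
      <event time="0930">Lewisham Hospital</event>
      <event time="1730">Quizmaster's Cup</event>
    </date>
  </year>
</calendar>
"""


def calendar_to_html(xml_text):
    """Turn a calendar document into an HTML fragment resembling Figure 3."""
    root = ET.fromstring(xml_text)
    out = []
    for year in root.findall("year"):
        out.append(f"<h1>Calendar for {escape(year.get('value', ''))}</h1>")
        out.append("<ul>")
        for date in year.findall("date"):
            # Figure 3 shows each date as day/month.
            out.append(f"  <li>{date.get('day')}/{date.get('month')}")
            out.append("    <ol>")
            for event in date.findall("event"):
                text = (event.text or "").strip()
                out.append(f"      <li>{escape(text, quote=False)}</li>")
            out.append("    </ol>")
            out.append("  </li>")
        out.append("</ul>")
    return "\n".join(out)


if __name__ == "__main__":
    print(calendar_to_html(CALENDAR_XML))
```

The source document is untouched; only the function decides how it is presented, which is the same division of labour that a style sheet provides.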

On reflection, there are certain disadvantages to the chosen style of markup. It would be easier to remove month as an attribute of <date> and make it a child element of <year>. In addition, an <event> should have start and stop times. However, it is not difficult to provide a parser and style sheet to generate virtually any other markup. This is illustrated in Figure 4.

Figure 4: XSL transformations

Notice that we have introduced yet another document type here, written in the Extensible Stylesheet Language (XSL). Virtually a programming language, XSL supports functions, recursion, and templates. Using a single source document to generate a number of alternative documents is a big win and, of course, each style sheet only needs to be written once.

XML Markup

XML tags are case-sensitive; a start tag and its end tag must match exactly, so mixing cases is an error. XML can use Unicode characters. Unicode is a standard defining the characters for the world's major languages (Klingon is currently undergoing review :-). Markup text is enclosed within angle brackets (< and >). Character data is the text between a start tag and an end tag, e.g. Electronic Commerce 3 in Figure 1. In XML every start tag must have a matching end tag. Consider Figure 5.

HTML                     XML
<img src="image.gif">    <img src="image.gif"></img>    <img src="image.gif"/>

Figure 5: HTML vs XML
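The three forms in Figure 5, and the case-sensitivity rule, can be tried out with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the helper name is_well_formed is ours.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the parser accepts the fragment as a complete XML document."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


html_style = '<img src="image.gif">'          # HTML habit: no end tag
with_end_tag = '<img src="image.gif"></img>'  # XML: explicit end tag
self_closing = '<img src="image.gif"/>'       # XML: concise equivalent
mixed_case = "<Event>School Meeting</event>"  # start and end tags differ in case

# The two XML forms describe the same element: same attributes, no children.
a = ET.fromstring(with_end_tag)
b = ET.fromstring(self_closing)

if __name__ == "__main__":
    for text in (html_style, with_end_tag, self_closing, mixed_case):
        print(is_well_formed(text), text)
    print(a.attrib == b.attrib, len(a) == len(b) == 0)
```

Only the two XML forms are accepted; the HTML form is rejected because its start tag is never closed, and the mixed-case example because the end tag does not match the start tag.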

The middle entry is called an empty element, which can be written more concisely as given on the right. Elements define structure, and may or may not contain content. Attributes describe elements, and attribute values are enclosed in quotes. Figure 6 illustrates another XML application.

<?xml version="1.0"?>
<!-- Connex Train Information Database -->

<schedule date="07/03/01">
  <train route="24">
    <status depart="1602">
      Cancelled
    </status>
  </train>
  <train route="25">
    <status depart="1605" platform="2">
      Delayed waiting for in-bound driver
    </status>
  </train>
  <train route="34">
    <status depart="1628" platform="3">
      About to depart
    </status>
  </train>
</schedule>

Figure 6: Connex SE at Charing Cross

Presumably this <schedule> is updated periodically as conditions change. The route attribute is used as a key into another database containing information about which stations are serviced. This XML document is easily parsed and transformed into whatever format the public information displays require. It could also be used to update their web pages!
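As an illustration of how little code such a transformation needs, here is a sketch in Python that turns the schedule of Figure 6 into lines for a departure board. The ROUTES table is a hypothetical stand-in for the separate route database, with invented placeholder destinations, and the output layout is our own choice.

```python
import xml.etree.ElementTree as ET

# The schedule from Figure 6, with the status text on one line.
SCHEDULE_XML = """<?xml version="1.0"?>
<schedule date="07/03/01">
  <train route="24">
    <status depart="1602">Cancelled</status>
  </train>
  <train route="25">
    <status depart="1605" platform="2">Delayed waiting for in-bound driver</status>
  </train>
  <train route="34">
    <status depart="1628" platform="3">About to depart</status>
  </train>
</schedule>
"""

# Hypothetical route database keyed by the route attribute (placeholder values).
ROUTES = {"24": "Destination A", "25": "Destination B"}


def display_lines(xml_text, routes):
    """Return one display line per train: time, destination, platform, status."""
    root = ET.fromstring(xml_text)
    lines = []
    for train in root.findall("train"):
        status = train.find("status")
        depart = status.get("depart")              # e.g. "1602"
        destination = routes.get(train.get("route"), "unknown route")
        platform = status.get("platform", "-")     # a cancelled train has none
        text = (status.text or "").strip()
        lines.append(
            f"{depart[:2]}:{depart[2:]} | {destination} | platform {platform} | {text}"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(display_lines(SCHEDULE_XML, ROUTES)))
```

The same parsed data could just as easily be written out as HTML for the web pages.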

Custom Markup Languages

XML languages are being developed for many areas of document processing and e-commerce. Some of the more prominent ones are presented below.

Bank Internet Payment System

The Bank Internet Payment System facilitates secure electronic transactions over the Internet. Transactions can be initiated by either the payer or payee, and are secured using digital certificates.

Bean Markup Language (BML)

JavaBeans (also called beans) are software components that can be combined to create Java applications and applets. The Bean Markup Language is used for describing JavaBeans. BML defines how various beans are interconnected.

Chemical Markup Language (CML)

Peter Murray-Rust's Chemical Markup Language is used for representing molecular and chemical information. Figure 7 illustrates the CML document for a water molecule (H2O).

<?xml version="1.0"?>
<CML>
  <MOL TITLE="Water">
    <ATOMS>
      <ARRAY BUILTIN="ELSYM">H O H</ARRAY>
    </ATOMS>
    <BONDS>
      <ARRAY BUILTIN="ATID1">1 2</ARRAY>
      <ARRAY BUILTIN="ATID2">2 3</ARRAY>
      <ARRAY BUILTIN="ORDER">1 1</ARRAY>
    </BONDS>
  </MOL>
</CML>

Figure 7: Water molecule in CML

Unfortunately, this example cannot be displayed in current browsers. However, Figure 8 should give some idea of what it should look like.

Figure 8: Rendered water molecule
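Even without a browser that renders it, the CML in Figure 7 is easy to process. The sketch below reads the atom and bond arrays with Python's standard library and lists the bonds. It assumes that the numbers in the ATID1 and ATID2 arrays are 1-based positions in the ELSYM array; that reading fits the water molecule but is our assumption, not something stated in the figure. The helper name read_arrays is ours.

```python
import xml.etree.ElementTree as ET

# The CML document from Figure 7.
WATER_CML = """<?xml version="1.0"?>
<CML>
  <MOL TITLE="Water">
    <ATOMS>
      <ARRAY BUILTIN="ELSYM">H O H</ARRAY>
    </ATOMS>
    <BONDS>
      <ARRAY BUILTIN="ATID1">1 2</ARRAY>
      <ARRAY BUILTIN="ATID2">2 3</ARRAY>
      <ARRAY BUILTIN="ORDER">1 1</ARRAY>
    </BONDS>
  </MOL>
</CML>
"""


def read_arrays(parent):
    """Map each ARRAY's BUILTIN name to its whitespace-separated values."""
    return {a.get("BUILTIN"): a.text.split() for a in parent.findall("ARRAY")}


mol = ET.fromstring(WATER_CML).find("MOL")
atoms = read_arrays(mol.find("ATOMS"))["ELSYM"]
bonds = read_arrays(mol.find("BONDS"))

# Assumption: atom ids are 1-based indexes into the ELSYM array.
bond_list = [
    (atoms[int(first) - 1], atoms[int(second) - 1], int(order))
    for first, second, order in zip(bonds["ATID1"], bonds["ATID2"], bonds["ORDER"])
]

if __name__ == "__main__":
    print(mol.get("TITLE"), atoms)
    for first, second, order in bond_list:
        print(f"{first}-{second} (order {order})")
```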

Commerce XML

Commerce XML is used for describing catalog data and performing business-to-business electronic transactions that use the data.

Electronic Business XML

Electronic Business XML (ebXML) is the result of an 18-month project by the United Nations to standardise the global exchange of business information. Rather than emphasising business documents, ebXML emphasises business processes.

Extensible Business Reporting Language

The Extensible Business Reporting Language (XBRL) captures existing financial and accounting information standards in XML. Future versions of XBRL will expand to encompass descriptions of information in other areas of business.

XML User Interface Language (XUL)

The XML User Interface Language (XUL, pronounced "zool") is an XML-based language developed by the Mozilla project for describing user interfaces. Cross-platform applications can load the information from a XUL document to create the appropriate user interface.

Geography Markup Language (GML)

The Geography Markup Language describes geographical information for use and reuse by different applications for different purposes. In GML, geographic information is described in terms of features. A feature is composed of properties and geometries. A property contains name, type and value elements. Geometries contain the bulk of geometric data.

LegalXML

In the USA court documents must be filed with a clerk, and the information often must be entered into different document management systems multiple times. With LegalXML, the information in court documents can be described to enable more efficient processing.

Mathematical Markup Language

The Mathematical Markup Language (MathML) was developed for describing mathematical notations and expressions using XML. It allows mathematical expressions to be processed by different applications for different purposes. Figure 9 shows the MathML for the quadratic equation x² + 4x + 4 = 0.

<math>
  <mrow>
    <mrow>
      <msup>
        <mi>x</mi>
        <mn>2</mn>
      </msup>
      <mo>+</mo>
      <mrow>
        <mn>4</mn>
        <mo>&InvisibleTimes;</mo>
        <mi>x</mi>
      </mrow>
      <mo>+</mo>
      <mn>4</mn>
    </mrow>
    <mo>=</mo>
    <mn>0</mn>
  </mrow>
</math>

Figure 9: MathML quadratic equation

The <mi> element is for identifiers, the <mn> element is for numbers, the <mo> element is for operators, etc. The entity &InvisibleTimes; is important - it's invisible when rendered for viewing, spoken when rendered for voice, but indicates multiplication if the equation is being computed!
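To see how a program might treat the invisible multiplication, here is a small sketch in Python that walks the MathML of Figure 9 and writes it out as ordinary infix text. It handles only the five elements used in the figure. Because the parser is given no DTD that defines the entity name, the sketch writes &InvisibleTimes; as the numeric character reference &#x2062; (the same character). The function name to_infix and the use of ^ for a superscript are our choices.

```python
import xml.etree.ElementTree as ET

# Figure 9, with &InvisibleTimes; written as its character reference &#x2062;.
MATHML = """<math>
  <mrow>
    <mrow>
      <msup><mi>x</mi><mn>2</mn></msup>
      <mo>+</mo>
      <mrow><mn>4</mn><mo>&#x2062;</mo><mi>x</mi></mrow>
      <mo>+</mo>
      <mn>4</mn>
    </mrow>
    <mo>=</mo>
    <mn>0</mn>
  </mrow>
</math>"""

INVISIBLE_TIMES = "\u2062"


def to_infix(node):
    """Render a small subset of MathML (math, mrow, msup, mi, mn, mo) as text."""
    text = (node.text or "").strip(" \t\r\n")
    if node.tag in ("mi", "mn"):      # identifiers and numbers: use as they are
        return text
    if node.tag == "mo":              # operators: make the multiplication explicit
        return "*" if text == INVISIBLE_TIMES else text
    if node.tag == "msup":            # superscript: base followed by exponent
        base, exponent = node
        return f"{to_infix(base)}^{to_infix(exponent)}"
    # math and mrow simply group their children in order
    return " ".join(to_infix(child) for child in node)


expression = to_infix(ET.fromstring(MATHML))

if __name__ == "__main__":
    print(expression)
```

A renderer for the screen would skip the invisible operator, while a program that computes with the equation, as here, makes it explicit.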

NewsML

News items exist in many different formats and are presented and received through different means. NewsML is designed to be media independent, so that all news-content formats (e.g. text, photo, etc.) can be described. NewsML also enables tracking and revision of documents over time.

Open eBook Publication Structure

This is the result of a collaboration of companies, the Open eBook Forum, dedicated to electronic text publication. The language is designed to be platform independent, but maintains flexibility and permits document authors to embed platform-specific content as long as a platform-independent alternative is provided.

OpenMath

OpenMath is a standard for describing mathematical content as objects which can be exchanged, manipulated, and displayed by different browsers in different contexts.

Scalable Vector Graphics (SVG)

The Scalable Vector Graphics markup language is a way to describe vector graphics data over the web. Current methods (e.g. GIF, JPEG, PNG) use bitmaps, which have a fixed resolution and cannot be scaled without a loss in image quality. Vector graphics describe graphical information in terms of lines, curves, etc. which can be scaled and printed quite easily. Think PostScript for pictures.

If your browser is Mozilla (Version 0.9 or higher) or Internet Explorer (Version 4.0 or higher), Adobe provide a free plug-in for rendering SVG documents. The plug-in is available at http://www.adobe.com/svg/. A static demonstration is here (1747 bytes) and an animated demonstration is here (2054 bytes). Both are from the Deitel book.
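To give a flavour of the markup without the demonstrations, the following sketch builds a tiny SVG document in Python. The drawing (a circle and a diagonal line) is invented for illustration and is not one of the Deitel examples; the function name make_svg is ours. The shapes are described once, in the coordinates of the viewBox, and only the width and height of the outer element change with the size requested.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def make_svg(width, height):
    """Return an SVG document whose shapes live in a fixed 100 x 100 viewBox."""
    svg = ET.Element("svg", {
        "xmlns": SVG_NS,
        "width": str(width),
        "height": str(height),
        "viewBox": "0 0 100 100",
    })
    # A circle and a line, described geometrically rather than pixel by pixel.
    ET.SubElement(svg, "circle", {
        "cx": "50", "cy": "50", "r": "40", "fill": "none", "stroke": "black",
    })
    ET.SubElement(svg, "line", {
        "x1": "10", "y1": "90", "x2": "90", "y2": "10", "stroke": "black",
    })
    return ET.tostring(svg, encoding="unicode")


small = make_svg(100, 100)
large = make_svg(1000, 1000)

if __name__ == "__main__":
    print(small)
    print(large)
```

The two documents differ only in the width and height attributes, which is why a vector image can be enlarged without the loss of quality that a bitmap suffers.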

Synchronised Multimedia Integration Language (SMIL)

The Synchronised Multimedia Integration Language (pronounced "smile") enables web authors to co-ordinate the presentation of a wide range of multimedia elements. In SMIL, multimedia elements can work together; this enables authors to specify when and how these multimedia elements appear in the document.

Visa XML Invoice Specification

Visa has developed this to enable its business customers to exchange credit-card purchase information between businesses over the Internet in a secure and standardised form. Currently, the specification provides a framework that describes credit-card purchases in the areas of procurement (i.e. business-to-business purchasing) and travel & entertainment (T&E) expenses.

VoxML

Motorola's VoxML is an XML application for the spoken word, in particular for automated telephone response systems. VoxML enables the same data on the web to be served up via the telephone. Figure 10 is an example taken from Elliotte Rusty Harold's book.

<?xml version="1.0"?>
<DIALOG>
  <CLASS NAME="help_top">
    <HELP>Welcome to TIC consumer products division. 
          For shampoo information, say shampoo now. 
    </HELP>
  </CLASS>

  <STEP NAME="init" PARENT="help_top">
    <PROMPT>Welcome to Wonder Shampoo
      <BREAK SIZE="large"/>
       Which color did Wonder Shampoo turn your hair? 
      </PROMPT>
      <INPUT TYPE="OPTIONLIST"> 
        <OPTION NEXT="#green">green</OPTION>
        <OPTION NEXT="#purple">purple</OPTION>
        <OPTION NEXT="#bald">bald</OPTION>
        <OPTION NEXT="#bye">exit</OPTION>
     </INPUT>
  </STEP>

  <STEP NAME="green" PARENT="help_top">
     <PROMPT>
       If Wonder Shampoo turned your hair green and you wish
       to return it to its natural color, simply shampoo seven
       times with three parts soap, seven parts water, four
       parts kerosene, and two parts iguana bile.
     </PROMPT>
     <INPUT TYPE="NONE" NEXT="#bye"/>
  </STEP>

  <STEP NAME="purple" PARENT="help_top">
     <PROMPT>
       If Wonder Shampoo turned your hair purple and you wish
       to return it to its natural color, please walk  
       widdershins around your local cemetery 
       three times while chanting "Surrender Dorothy".
       
     </PROMPT>
     <INPUT TYPE="NONE" NEXT="#bye"/>
  </STEP>

  <STEP NAME="bald" PARENT="help_top">
     <PROMPT>
       If you went bald as a result of using Wonder Shampoo,
       please purchase and apply a three months supply
       of our Magic Hair Growth Formula(TM). Please do not
       consult an attorney as doing so would violate the
       license agreement printed on inside fold of the Wonder 
       Shampoo box in 3 point type which you agreed to
       by opening the package.  
     </PROMPT>
     <INPUT TYPE="NONE" NEXT="#bye"/>
  </STEP>

  <STEP NAME="bye" PARENT="help_top">
    <PROMPT>
     Thank you for visiting TIC Corp. Goodbye. 
    </PROMPT>   
    <INPUT TYPE="NONE" NEXT="#exit"/>
  </STEP>

</DIALOG> 

Figure 10: Wonder shampoo

It's not possible to show a screen shot of this example, because it's not intended for the web. Just pick up your phone!
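Lacking a telephone, the flow of such a dialog can at least be traced by program. The sketch below is a toy interpreter in Python for a shortened version of Figure 10, not an implementation of VoxML. It assumes that NEXT="#name" refers to the STEP with that NAME, and that a target with no matching STEP (here #exit) ends the call; both are our reading of the example. The function name run_dialog and the shortened prompts are ours.

```python
import xml.etree.ElementTree as ET

# A shortened version of the dialog in Figure 10 (prompts abbreviated).
DIALOG_XML = """<DIALOG>
  <STEP NAME="init">
    <PROMPT>Which color did Wonder Shampoo turn your hair?</PROMPT>
    <INPUT TYPE="OPTIONLIST">
      <OPTION NEXT="#green">green</OPTION>
      <OPTION NEXT="#bye">exit</OPTION>
    </INPUT>
  </STEP>
  <STEP NAME="green">
    <PROMPT>Shampoo seven times.</PROMPT>
    <INPUT TYPE="NONE" NEXT="#bye"/>
  </STEP>
  <STEP NAME="bye">
    <PROMPT>Thank you for visiting TIC Corp. Goodbye.</PROMPT>
    <INPUT TYPE="NONE" NEXT="#exit"/>
  </STEP>
</DIALOG>"""


def run_dialog(xml_text, answers):
    """Follow the dialog from the init step and return the prompts spoken."""
    root = ET.fromstring(xml_text)
    steps = {step.get("NAME"): step for step in root.findall("STEP")}
    answers = iter(answers)            # the caller's spoken replies, in order
    spoken = []
    name = "init"
    while name in steps:               # an unknown target such as "exit" ends the call
        step = steps[name]
        prompt = "".join(step.find("PROMPT").itertext())
        spoken.append(" ".join(prompt.split()))
        user_input = step.find("INPUT")
        if user_input.get("TYPE") == "OPTIONLIST":
            options = {
                option.text.strip(): option.get("NEXT")
                for option in user_input.findall("OPTION")
            }
            target = options[next(answers)]   # raises KeyError for an unknown reply
        else:
            target = user_input.get("NEXT")
        name = target.lstrip("#")
    return spoken


if __name__ == "__main__":
    for line in run_dialog(DIALOG_XML, ["green"]):
        print(line)
```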

Wireless Markup Language (WML)

The Wireless Markup Language allows web pages to be displayed on wireless devices such as cellular phones and PDAs. WML works with the Wireless Application Protocol (WAP) to deliver the content.

Conclusions

People are beginning to use XML to store their data. Software applications can use XML to store preferences and virtually any kind of information from chemical formulae to file archives. XML is not the perfect solution for every data storage problem; access times may be slow and documents can be large. Some kinds of data just don't need XML - a raster image is usually a long sequence of binary digits, monolithic, unparsable and huge.

However, XML has great possibilities for programmers. It is well suited to being read, written, and altered by software. Its syntax is straightforward and easy to parse. It's well documented and there are many tools and code libraries available to developers. As an open standard, with support from many popular programming languages, XML may well become the lingua franca for computer communication.

References

  1. Harvey Deitel, Paul Deitel, Tem Nieto, Ted Lin, & Praveen Sadhu, XML How to Program, Prentice Hall, Upper Saddle River, NJ, 2001, ISBN 0-13-028417-3.
  2. Elliotte Rusty Harold, XML Bible, IDG Books Worldwide, Inc., Foster City, CA, 1999, ISBN 0-7645-3236-7.
  3. RFC 2445, Internet Calendaring and Scheduling Core Object Specification (iCalendar), November 1998.
  4. Gary Schneider & James Perry, Electronic Commerce, Course Technology - ITP, 2000, ISBN 0-7600-1179-6.


Last modified: Tue Oct 25 15:19:01 2005