This chapter is based on Chapter 2 of Schneider & Perry's book [Schneider 2000], Chapters 4, 5, 20 and 21 of the Deitel et al. tome [Deitel 2001], some examples from Elliotte Rusty Harold's book [Harold 1999], plus additional material from the web.
SGML, HTML, and XML are the most important markup languages: SGML because it is the parent language of both HTML and XML, HTML because it is the current language of the web, and XML because it is the future language of the web.
In the late 1960s, IBM researchers worked on the problem of building a portable system for the interchange and manipulation of legal documents. Their prototype language marked up structural elements, with formatting information kept in separate files, called style sheets. The document structure was defined in yet another file, called a Document Type Definition (DTD). By 1969, the researchers had developed the Generalised Markup Language (GML). After further work worldwide, in 1986, the International Organization for Standardization (ISO) adopted a particular version called the Standard Generalised Markup Language (SGML). It quickly became the business standard for data storage and interchange. SGML has the following advantages.
However, it also has the following disadvantages.
Put bluntly, it is too elaborate for the ever-changing web.
Tim Berners-Lee and Robert Cailliau, working independently of each other at CERN, invented the HyperText Markup Language (HTML) based on SGML. HTML is one particular SGML DTD that is easier to learn and use than SGML. HTML is a trimmed-down version of SGML, eliminating SGML features that are rarely needed, but including hyperlinks to link web documents.
With earlier versions of HTML, web browsers controlled the appearance (rendering) of every web page. With the advent of Cascading Style Sheets (CSS), the document author can control the way the browser renders the page, or the entire web site for that matter. Style sheets allow document authors to specify the style of their page elements (spacing, margins, etc.) separately from their structure (section headers, body text, etc.), thus allowing greater manageability.
The Extensible Markup Language (XML) is also a descendant of SGML, representing an industry-wide effort to define which data are displayed (or printed), whereas HTML defines how a page is displayed. XML will overtake HTML because of its ability to describe content. XML has the following advantages.
XML defines a document's structure by marking the start and
end (tags) of its logical parts
(elements). This is similar to HTML,
but also defines record structures for databases and other
applications. Figure 1 illustrates an XML-formatted week
from my calendar in file calendar.xml.
<?xml version="1.0"?>
<!DOCTYPE calendar SYSTEM "calendar.dtd">
<calendar>
  <year value="2001">
    <date month="01" day="22">
      <event time="1700">
        Eric Hobsbawm's lecture
      </event>
    </date>
    <date month="01" day="23">
      <event time="0930">
        Lewisham Hospital
      </event>
      <event time="1730">
        Quizmaster's Cup
      </event>
    </date>
    <date month="01" day="24">
      <event time="1600">
        Teaching Committee
      </event>
    </date>
    <date month="01" day="25">
      <event time="1800">
        Computer Networking 3
      </event>
    </date>
    <date month="01" day="26">
      <event time="1400">
        School Meeting
      </event>
      <event time="1800">
        Electronic Commerce 3
      </event>
    </date>
  </year>
</calendar>
The first line is an XML declaration,
specifying which version of XML the document conforms to.
The second line is a document type declaration, which names
the root element and points to the DTD in calendar.dtd.
All XML documents must contain exactly one root
element, e.g. <calendar> in
this example, containing all other elements. Element
<year> is a child
element because it is nested inside element
<calendar>.
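The nesting described above can be inspected with any XML parser. The following is a minimal sketch in Python using the standard library module xml.etree.ElementTree. The shortened copy of the calendar and the helper name describe are assumptions made for illustration, and the DOCTYPE line is left out because this parser does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Shortened copy of Figure 1 (two days only), used for illustration.
CALENDAR_XML = """<?xml version="1.0"?>
<calendar>
  <year value="2001">
    <date month="01" day="22">
      <event time="1700">Eric Hobsbawm's lecture</event>
    </date>
    <date month="01" day="23">
      <event time="0930">Lewisham Hospital</event>
      <event time="1730">Quizmaster's Cup</event>
    </date>
  </year>
</calendar>"""

root = ET.fromstring(CALENDAR_XML)


def describe(element, depth=0):
    """Return one indented line per element, showing its tag and attributes."""
    lines = ["  " * depth + f"{element.tag} {element.attrib}"]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


root_tag = root.tag                            # the single root element
child_tags = [child.tag for child in root]     # its direct children

print("\n".join(describe(root)))
```

The printed outline has one line per element, indented by depth, so the root sits alone at the left margin with its child element one level in.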
How do we know that the above XML document is valid,
i.e. that it follows the intended structure and is not merely
well formed? Enter a document model in the form of a
Document Type Definition (DTD), which
is a hand-me-down from SGML defining the allowed structure.
Figure 2 is calendar.dtd.
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT calendar (year)*>
<!ELEMENT year (date)*>
<!ATTLIST year value CDATA #REQUIRED>
<!ELEMENT date (event)*>
<!ATTLIST date day   CDATA #REQUIRED
               month CDATA #REQUIRED>
<!ELEMENT event (#PCDATA)>
<!ATTLIST event time CDATA #IMPLIED>
This archaic structure contains a set of rules or
declarations. Each declaration adds a
new element, set of attributes, or notation to the language
we are describing. Briefly, it states that a
<calendar> contains zero or more
<year>s, a <year>
contains zero or more <date>s, that
the value attribute of <year> is mandatory, etc.
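To make the rules concrete, here is a rough sketch that checks a document against the same constraints by hand. It is not a DTD validator: Python's standard library has no validating parser, so the two tables below are copied manually from Figure 2, and the names ALLOWED_CHILD, REQUIRED_ATTRS and check are hypothetical. A validating parser would normally do this work directly from calendar.dtd.

```python
import xml.etree.ElementTree as ET

# Hand-copied from the DTD in Figure 2 (an assumption of this sketch):
# the one child tag each element may contain, and its #REQUIRED attributes.
ALLOWED_CHILD = {"calendar": "year", "year": "date", "date": "event", "event": None}
REQUIRED_ATTRS = {"calendar": [], "year": ["value"], "date": ["day", "month"], "event": []}


def check(element, errors=None):
    """Collect violations of the hand-copied rules, depth first."""
    errors = [] if errors is None else errors
    if element.tag not in ALLOWED_CHILD:
        errors.append(f"unknown element <{element.tag}>")
        return errors
    for name in REQUIRED_ATTRS[element.tag]:
        if name not in element.attrib:
            errors.append(f"<{element.tag}> is missing required attribute '{name}'")
    for child in element:
        if child.tag != ALLOWED_CHILD[element.tag]:
            errors.append(f"<{child.tag}> is not allowed inside <{element.tag}>")
        else:
            check(child, errors)
    return errors


good = ET.fromstring(
    '<calendar><year value="2001"><date month="01" day="22">'
    '<event time="1700">Lecture</event></date></year></calendar>'
)
bad = ET.fromstring("<calendar><year><event>Lecture</event></year></calendar>")

good_errors = check(good)
bad_errors = check(bad)
print(good_errors)
print(bad_errors)
```

The second document is well formed but breaks two rules: its <year> has no value attribute, and an <event> appears directly inside <year>.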
The format in Figure 1 is consistent with that outlined in
[RFC 2445]
for the exchange of Calendar
information between applications, e.g. your PC and your
Personal Data Assistant (PDA). In
addition, this format can easily be transformed into another
format using a parser and a
style sheet. Figure 3 is the result of
parsing the above XML document with the style sheet
xml2html.xsl.
[Figure 3: rendered output of the transformation, titled "Calendar for 2001"]
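The style sheet xml2html.xsl is not reproduced in the text, so the sketch below stands in for the idea with a hand-written Python function rather than XSL. It reads a shortened copy of the calendar and emits a small HTML fragment. The function name calendar_to_html and the HTML layout (a heading followed by a list) are assumptions, not the layout the original style sheet produces.

```python
import xml.etree.ElementTree as ET
from html import escape

# Shortened copy of Figure 1, used for illustration.
CALENDAR_XML = """<calendar>
  <year value="2001">
    <date month="01" day="24">
      <event time="1600">Teaching Committee</event>
    </date>
    <date month="01" day="26">
      <event time="1400">School Meeting</event>
    </date>
  </year>
</calendar>"""


def calendar_to_html(xml_text):
    """Turn the calendar markup into a heading plus one list item per event."""
    root = ET.fromstring(xml_text)
    parts = []
    for year in root.findall("year"):
        parts.append(f"<h1>Calendar for {escape(year.get('value', ''))}</h1>")
        parts.append("<ul>")
        for date in year.findall("date"):
            day_month = f"{date.get('day')}/{date.get('month')}"
            for event in date.findall("event"):
                text = (event.text or "").strip()
                time = event.get("time", "")
                parts.append(f"<li>{escape(day_month)} {escape(time)}: {escape(text)}</li>")
        parts.append("</ul>")
    return "\n".join(parts)


html_fragment = calendar_to_html(CALENDAR_XML)
print(html_fragment)
```

A different output format only needs a different function (or style sheet); the source document stays untouched.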
On reflection, there are certain disadvantages to the chosen
style of markup. It would be easier to remove
month as an attribute of
<date> and make it a child element of
<year>. In addition, an
<event> should have start and stop times.
However, it is not difficult to provide a parser and style
sheet to generate virtually any other
markup. This is illustrated in Figure 4.

Notice that we have introduced yet another document type here, written in the Extensible Stylesheet Language (XSL). Virtually a programming language, XSL supports functions, recursion, and templates. Generating a number of alternative documents from a single source document is a big win and, of course, each style sheet only needs to be written once.
XML tags are case-sensitive; a start tag and end tag that
differ in case are an error. XML can use Unicode characters.
Unicode is a
standard defining the characters for the world's major
languages (Klingon is currently undergoing review :-).
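Both points are easy to confirm with a parser. The sketch below, which assumes Python's xml.etree.ElementTree and a hypothetical helper named is_well_formed, parses a tag pair that matches in case, one that does not, and an element whose content uses non-Latin scripts.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


same_case = is_well_formed("<event>School Meeting</event>")
mixed_case = is_well_formed("<event>School Meeting</Event>")

# Element content may use characters from any script that Unicode covers.
greeting = ET.fromstring("<event>Καλημέρα, 世界</event>").text

print(same_case, mixed_case, greeting)
```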
Markup text is enclosed within angle
brackets (< and >).
Character data is the text between a
start tag and an end tag, e.g. Electronic Commerce
3 in Figure 1. In XML all start tags must have an
end tag. Consider Figure 5.
The middle entry is called an empty element, which can be written more concisely as given on the right. Elements define structure, and may or may not contain content. Attributes describe elements, and attribute values are enclosed in quotes. Figure 6 illustrates another XML application.
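Before moving on to Figure 6, the two spellings of an empty element can be compared with a parser. This is a small sketch assuming Python's xml.etree.ElementTree; the element name platform and the helpers summary and parses are hypothetical and chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# <platform> is a hypothetical element used only for this illustration.
long_form = ET.fromstring('<platform number="2"></platform>')
short_form = ET.fromstring('<platform number="2"/>')


def summary(element):
    """Tag, attributes, text content and number of children of an element."""
    return (element.tag, dict(element.attrib), element.text or "", len(element))


def parses(xml_text):
    """Return True if the text is accepted by the parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The explicit start/end pair and the concise form carry the same information.
equivalent = summary(long_form) == summary(short_form)

# A start tag with no end tag at all is rejected.
unclosed_ok = parses('<platform number="2">')

print(equivalent, unclosed_ok)
```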
<?xml version="1.0"?>
<!-- Connex Train Information Database -->
<schedule date="07/03/01">
  <train route="24">
    <status depart="1602">
      Cancelled
    </status>
  </train>
  <train route="25">
    <status depart="1605" platform="2">
      Delayed waiting for in-bound driver
    </status>
  </train>
  <train route="34">
    <status depart="1628" platform="3">
      About to depart
    </status>
  </train>
</schedule>
Presumably this <schedule> is updated
periodically as conditions change. The route
attribute is used as a key into another database containing
information about which stations are serviced. This XML
document is easily parsed and transformed into whatever
format the public information displays require. It could
also be used to update their web pages!
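As a sketch of how such a display feed might be produced, the following Python reads the schedule and uses each route attribute as a key into a second table. The ROUTES dictionary is a hypothetical stand-in for the separate route database, with invented station names, and the output line format is likewise an assumption.

```python
import xml.etree.ElementTree as ET

# Copy of Figure 6 with the status text written inline.
SCHEDULE_XML = """<schedule date="07/03/01">
  <train route="24"><status depart="1602">Cancelled</status></train>
  <train route="25"><status depart="1605" platform="2">Delayed waiting for in-bound driver</status></train>
  <train route="34"><status depart="1628" platform="3">About to depart</status></train>
</schedule>"""

# Hypothetical route database; the station names are invented.
ROUTES = {
    "24": ["Station A", "Station B"],
    "25": ["Station A", "Station C"],
    "34": ["Station D"],
}


def display_lines(xml_text, routes):
    """Build one line of text per train for a public information display."""
    lines = []
    for train in ET.fromstring(xml_text).findall("train"):
        status = train.find("status")
        platform = status.get("platform", "-")   # a cancelled train has no platform
        calling_at = ", ".join(routes.get(train.get("route"), ["unknown"]))
        text = (status.text or "").strip()
        lines.append(f"{status.get('depart')} platform {platform} [{calling_at}] {text}")
    return lines


lines = display_lines(SCHEDULE_XML, ROUTES)
print("\n".join(lines))
```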
XML languages are being developed for many areas of document processing and e-commerce. Some of the more prominent ones are presented below.
The Bank Internet Payment System facilitates secure electronic transactions over the Internet. Transactions can be initiated by either the payer or payee, and are secured using digital certificates.
JavaBeans (also called beans) are software components that can be combined to create Java applications and applets. The Bean Markup Language is used for describing JavaBeans. BML defines how various beans are interconnected.
Peter Murray-Rust's Chemical Markup Language is used for representing molecular and chemical information. Figure 7 illustrates the CML document for a water molecule (H2O).
<?xml version="1.0"?>
<CML>
  <MOL TITLE="Water">
    <ATOMS>
      <ARRAY BUILTIN="ELSYM">H O H</ARRAY>
    </ATOMS>
    <BONDS>
      <ARRAY BUILTIN="ATID1">1 2</ARRAY>
      <ARRAY BUILTIN="ATID2">2 3</ARRAY>
      <ARRAY BUILTIN="ORDER">1 1</ARRAY>
    </BONDS>
  </MOL>
</CML>
Unfortunately, this example cannot be displayed in current browsers. However, Figure 8 should give some idea of what it should look like.
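Independently of how the molecule is drawn, the arrays in Figure 7 can be decoded by a program. The sketch below uses Python's xml.etree.ElementTree and assumes, as the numbering in the figure suggests, that the ATID1 and ATID2 arrays hold 1-based positions in the ELSYM array and that ORDER gives the bond order. The function name read_molecule is hypothetical.

```python
import xml.etree.ElementTree as ET

CML_XML = """<CML>
  <MOL TITLE="Water">
    <ATOMS>
      <ARRAY BUILTIN="ELSYM">H O H</ARRAY>
    </ATOMS>
    <BONDS>
      <ARRAY BUILTIN="ATID1">1 2</ARRAY>
      <ARRAY BUILTIN="ATID2">2 3</ARRAY>
      <ARRAY BUILTIN="ORDER">1 1</ARRAY>
    </BONDS>
  </MOL>
</CML>"""


def read_molecule(xml_text):
    """Return (title, element symbols, bonds as (symbol, symbol, order))."""
    mol = ET.fromstring(xml_text).find("MOL")
    arrays = {a.get("BUILTIN"): a.text.split() for a in mol.iter("ARRAY")}
    symbols = arrays["ELSYM"]
    # Assumption: atom ids are 1-based positions in the ELSYM array.
    bonds = [
        (symbols[int(first) - 1], symbols[int(second) - 1], int(order))
        for first, second, order in zip(arrays["ATID1"], arrays["ATID2"], arrays["ORDER"])
    ]
    return mol.get("TITLE"), symbols, bonds


title, symbols, bonds = read_molecule(CML_XML)
print(title, symbols, bonds)
```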

Commerce XML is used for describing catalog data and performing business-to-business electronic transactions that use the data.
Electronic Business XML is the result of an 18-month project by the United Nations to standardise the global exchange of business information. Rather than emphasising business documents, ebXML emphasises business processes.
The Extensible Business Reporting Language captures existing financial and accounting information standards in XML. Future versions of XBRL will expand to encompass descriptions of information in other areas of business.
The Extensible User Interface Language (pronounced zool) is an XML-based language developed by the Mozilla project for describing user interfaces. Cross-platform applications can load the information from a XUL document to create the appropriate user interface.
The Geography Markup Language describes geographical information for use and reuse by different applications for different purposes. In GML, geographic information is described in terms of features. A feature is composed of properties and geometries. A property contains name, type and value elements. Geometries contain the bulk of geometric data.
In the USA, court documents must be filed with a clerk, and the information often must be entered into different document management systems multiple times. With LegalXML, the information in court documents can be described to enable more efficient processing.
The Mathematical Markup Language was developed for describing mathematical notations and expressions using XML. It allows mathematical expressions to be processed by different applications for different purposes. Figure 9 shows the MathML for the quadratic equation x^2 + 4x + 4 = 0.
<math>
  <mrow>
    <mrow>
      <msup>
        <mi>x</mi>
        <mn>2</mn>
      </msup>
      <mo>+</mo>
      <mrow>
        <mn>4</mn>
        <mo>&InvisibleTimes;</mo>
        <mi>x</mi>
      </mrow>
      <mo>+</mo>
      <mn>4</mn>
    </mrow>
    <mo>=</mo>
    <mn>0</mn>
  </mrow>
</math>
The <mi> element is for identifiers, the
<mn> element is for numbers, the
<mo> element is for operators, etc. The
entity &InvisibleTimes; is important - it's
invisible when rendered for viewing, spoken when rendered
for voice, but indicates multiplication if the equation is
being computed!
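The computational role of the invisible multiplication operator can be sketched in a few lines of Python. The converter below handles only the handful of MathML elements used in Figure 9 and is in no sense a general MathML processor; the name to_python is hypothetical. Because xml.etree.ElementTree cannot resolve the named entity without a DTD, the sketch writes the operator as the character U+2062 directly, which is an assumption about how the entity is defined.

```python
import xml.etree.ElementTree as ET

INVISIBLE_TIMES = "\u2062"   # written directly, since the named entity needs a DTD

MATHML = (
    "<math><mrow><mrow>"
    "<msup><mi>x</mi><mn>2</mn></msup><mo>+</mo>"
    "<mrow><mn>4</mn><mo>\u2062</mo><mi>x</mi></mrow>"
    "<mo>+</mo><mn>4</mn></mrow><mo>=</mo><mn>0</mn></mrow></math>"
)


def to_python(node):
    """Translate the small MathML subset of Figure 9 into Python syntax."""
    if node.tag in ("mi", "mn"):
        return node.text.strip()
    if node.tag == "mo":
        operator = node.text.strip()
        return "*" if operator == INVISIBLE_TIMES else operator
    if node.tag == "msup":
        base, exponent = node
        return f"{to_python(base)}**{to_python(exponent)}"
    # <math> and <mrow> simply group their children.
    return " ".join(to_python(child) for child in node)


expression = to_python(ET.fromstring(MATHML))
left_side = expression.split(" = ")[0]

# eval is acceptable here only because the input is the fixed demo string above.
value_at_minus_two = eval(left_side, {"__builtins__": {}}, {"x": -2})

print(expression)
print(value_at_minus_two)
```

Substituting x = -2 makes the left-hand side vanish, which is consistent with the equation having a repeated root there.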
News items exist in many different formats and are presented and received through different means. NewsML is designed to be media independent, so that all news-content formats (e.g. text, photo, etc.) can be described. NewsML also enables tracking and revision of documents over time.
The Open eBook Publication Structure is the result of a collaboration of companies, the Open eBook Forum, dedicated to electronic text publication. The language is designed to be platform independent, but maintains flexibility and permits document authors to embed platform-specific content as long as a platform-independent alternative is provided.
OpenMath is a standard for describing mathematical content as objects which can be exchanged, manipulated, and displayed by different browsers in different contexts.
The Scalable
Vector Graphics markup language is a way
to describe vector graphics data over the web. Current
methods (e.g. GIF, JPEG,
PNG) use bitmaps, which
have a fixed resolution and cannot be scaled without a loss
in image quality. Vector graphics describe graphical
information in terms of lines, curves, etc. which can be
scaled and printed quite easily. Think PostScript for
pictures.
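The difference from a bitmap can be illustrated by generating the same drawing at two sizes. This is a minimal sketch using Python's xml.etree.ElementTree; the shapes, the function name make_svg and the two pixel sizes are arbitrary choices for illustration. Only the width and height on the root element change, while the shape descriptions stay identical.

```python
import xml.etree.ElementTree as ET


def make_svg(size_px):
    """Build a small SVG drawing to be rendered at size_px by size_px pixels."""
    svg = ET.Element("svg", {
        "xmlns": "http://www.w3.org/2000/svg",
        "width": str(size_px),
        "height": str(size_px),
        "viewBox": "0 0 100 100",    # the drawing's own coordinate system
    })
    ET.SubElement(svg, "circle", {"cx": "50", "cy": "50", "r": "40",
                                  "fill": "none", "stroke": "black"})
    ET.SubElement(svg, "line", {"x1": "10", "y1": "90", "x2": "90", "y2": "10",
                                "stroke": "black"})
    return ET.tostring(svg, encoding="unicode")


small = make_svg(32)
large = make_svg(1024)

print(small)
print(len(small), len(large))
```

The two documents differ only in the requested size, so the large version costs a few extra bytes rather than a thousand times the pixels.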
If your browser is Mozilla (Version 0.9 or higher) or Internet Explorer (Version 4.0 or higher), Adobe provide a free plug-in for rendering SVG documents. The plug-in is available at http://www.adobe.com/svg/. A static demonstration is here (1747 bytes) and an animated demonstration is here (2054 bytes). Both are from the Deitel book.
The Synchronised Multimedia Integration Language (pronounced "smile") enables web authors to co-ordinate the presentation of a wide range of multimedia elements. In SMIL, multimedia elements can work together; this enables authors to specify when and how these multimedia elements appear in the document.
Visa has developed the Visa XML Invoice Specification to enable its business customers to exchange credit-card purchase information between businesses over the Internet in a secure and standardised form. Currently, the specification provides a framework that describes credit-card purchases in the areas of procurement (i.e. business-to-business purchasing) and travel & entertainment (T&E) expenses.
Motorola's VoxML is an XML application for the spoken word, in particular for automated telephone response systems. VoxML enables the same data on the web to be served up via the telephone. Figure 10 is an example taken from Elliotte Rusty Harold's book.
<?xml version="1.0"?>
<DIALOG>
  <CLASS NAME="help_top">
    <HELP>Welcome to TIC consumer products division.
      For shampoo information, say shampoo now.
    </HELP>
  </CLASS>
  <STEP NAME="init" PARENT="help_top">
    <PROMPT>Welcome to Wonder Shampoo
      <BREAK SIZE="large"/>
      Which color did Wonder Shampoo turn your hair?
    </PROMPT>
    <INPUT TYPE="OPTIONLIST">
      <OPTION NEXT="#green">green</OPTION>
      <OPTION NEXT="#purple">purple</OPTION>
      <OPTION NEXT="#bald">bald</OPTION>
      <OPTION NEXT="#bye">exit</OPTION>
    </INPUT>
  </STEP>
  <STEP NAME="green" PARENT="help_top">
    <PROMPT>
      If Wonder Shampoo turned your hair green and you wish
      to return it to its natural color, simply shampoo seven
      times with three parts soap, seven parts water, four
      parts kerosene, and two parts iguana bile.
    </PROMPT>
    <INPUT TYPE="NONE" NEXT="#bye"/>
  </STEP>
  <STEP NAME="purple" PARENT="help_top">
    <PROMPT>
      If Wonder Shampoo turned your hair purple and you wish
      to return it to its natural color, please walk
      widdershins around your local cemetery
      three times while chanting "Surrender Dorothy".
    </PROMPT>
    <INPUT TYPE="NONE" NEXT="#bye"/>
  </STEP>
  <STEP NAME="bald" PARENT="help_top">
    <PROMPT>
      If you went bald as a result of using Wonder Shampoo,
      please purchase and apply a three months supply
      of our Magic Hair Growth Formula(TM). Please do not
      consult an attorney as doing so would violate the
      license agreement printed on inside fold of the Wonder
      Shampoo box in 3 point type which you agreed to
      by opening the package.
    </PROMPT>
    <INPUT TYPE="NONE" NEXT="#bye"/>
  </STEP>
  <STEP NAME="bye" PARENT="help_top">
    <PROMPT>
      Thank you for visiting TIC Corp. Goodbye.
    </PROMPT>
    <INPUT TYPE="NONE" NEXT="#exit"/>
  </STEP>
</DIALOG>
It's not possible to show a screen shot of this example, because it's not intended for the web. Just pick up your phone!
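The flow of the dialog can still be traced on paper, or in code. The sketch below walks a shortened version of Figure 10 (prompts abbreviated, two steps removed) with Python's xml.etree.ElementTree, following each NEXT reference from step to step. It is not a VoxML interpreter: the function name run_dialog is hypothetical, and treating "#exit" as the end of the call is an assumption about the semantics.

```python
import xml.etree.ElementTree as ET

# Shortened version of Figure 10; the prompt texts are abbreviated.
DIALOG_XML = """<DIALOG>
  <STEP NAME="init">
    <PROMPT>Which color did Wonder Shampoo turn your hair?</PROMPT>
    <INPUT TYPE="OPTIONLIST">
      <OPTION NEXT="#green">green</OPTION>
      <OPTION NEXT="#bye">exit</OPTION>
    </INPUT>
  </STEP>
  <STEP NAME="green">
    <PROMPT>Simply shampoo seven times.</PROMPT>
    <INPUT TYPE="NONE" NEXT="#bye"/>
  </STEP>
  <STEP NAME="bye">
    <PROMPT>Thank you for visiting TIC Corp. Goodbye.</PROMPT>
    <INPUT TYPE="NONE" NEXT="#exit"/>
  </STEP>
</DIALOG>"""


def run_dialog(xml_text, answers):
    """Follow NEXT links from step "init" and return the prompts spoken."""
    steps = {s.get("NAME"): s for s in ET.fromstring(xml_text).findall("STEP")}
    answers = iter(answers)
    spoken = []
    current = "init"
    while current in steps:          # assumption: an unknown target such as "exit" ends the call
        step = steps[current]
        prompt = step.find("PROMPT")
        spoken.append(" ".join("".join(prompt.itertext()).split()))
        user_input = step.find("INPUT")
        if user_input.get("TYPE") == "OPTIONLIST":
            choice = next(answers)   # what the caller said
            targets = {o.text.strip(): o.get("NEXT") for o in user_input.findall("OPTION")}
            target = targets[choice]
        else:
            target = user_input.get("NEXT")
        current = target.lstrip("#")
    return spoken


spoken = run_dialog(DIALOG_XML, ["green"])
print("\n".join(spoken))
```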
The Wireless Markup Language allows web pages to be displayed on wireless devices such as cellular phones and PDAs. WML works with the Wireless Application Protocol (WAP) to deliver the content.
People are beginning to use XML to store their data. Software applications can use XML to store preferences and virtually any kind of information from chemical formulae to file archives. XML is not the perfect solution for every data storage problem; access times may be slow and documents can be large. Some kinds of data just don't need XML - a raster image is usually a long sequence of binary digits, monolithic, unparsable and huge.
However, XML has great possibilities for programmers. It is well suited to being read, written, and altered by software. Its syntax is straightforward and easy to parse. It's well documented and there are many tools and code libraries available to developers. As an open standard, with support from many popular programming languages, XML may well become the lingua franca for computer communication.
Last modified: Tue Oct 25 15:19:01 2005