XML - Markup Basics
Objective: Learn basic markup structure and syntax of an XML document.
Learn:
- Structure of an XML document
- Elements
- Syntax rules: elements, characters, entities, comments
- Parsers and display in a browser
- Creating a document
- Common mistakes
- CDATA sections
- Processing instructions
- Key terms like elements, tags, content, attributes, XML declaration (at top), nesting, well formed
Notes
Structure
XML was designed to exchange data, not necessarily to create nice displays. The
structure reflects its data-centric focus.
The basic markup concepts, along with some jargon, are:
- The actual XML file is called a document, so documents contain data/content.
- Later we'll learn about associated files/technologies like data models (DTD and schema) and display instructions (XSL).
- Documents are text files, usually with an .xml extension in the filename.
- Documents contain both
  - markup (tags, elements, attributes etc.)
  - and text content (your data)
- Well formed: XML has many rigid syntax rules, and a document that follows the rules is well formed. The rules are strictly enforced. HTML has loose rules, meaning the browser will try to display something even if the page is erroneous, but in XML a single tiny error makes the entire document bad and nothing is processed/displayed.
- Elements are the heart of a document:
  - The markup language uses tags <> to identify elements.
  - Your data goes between the start & end tags of some element.
  - Unlike HTML, which has a finite set of pre-defined elements that the browser uses to format data, in XML you make up your own elements. So elements in XML are more like "field" names in a database (DB).
  - You can make up as many elements as you want, and the same elements can be repeated, just like the same field appears multiple times, once for each record in a DB.
The text of a basic XML document (note7b_example_Basics.xml) looks like:
<?xml version="1.0"?>
<!-- File Name: note3_basic_example1.xml -->
<BOOK>
  <!-- My 1st xml -->
  <TITLE>XML for Smarties</TITLE>
  <AUTHOR>Ed Van</AUTHOR>
  <PRICE unit="$">5.99</PRICE>
  <TITLE>Advanced XML</TITLE>
  <AUTHOR>Mark Twain</AUTHOR>
  <PRICE unit="$">6.49</PRICE>
  <!-- Note html tags don't display -->
  <img src="image_music.gif"></img>
  <h1>Is this a heading?</h1>
</BOOK>
There are 2 parts:
- the prolog, which goes at the top (everything before the root's start tag)
- the document (or root) element between the tags <BOOK> </BOOK> (where BOOK can be any valid name you want)
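The prolog in this example consists of the XML declaration and a comment, optionally separated by a blank line. All of these are optional:
- The XML declaration specifies the XML version (1.0 is by far the most widely used; a 1.1 revision exists but is rarely needed). Although optional, it is suggested to always have this. If present, it must be the very first thing in the file.
- A blank line is just white space and has no meaning.
- The comment is optional too (remember comments are never required, so you don't need them).
Although the example above has nothing else in its prolog, a prolog can also contain other
items (that we will cover later) like:
- a DTD
- processing instructions, such as one that links a style sheet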
A root element is required. The root is akin to a database
name. In general the root contains all the content, but realize there are two
types of content:
- text or data
- element (or other markup) content
For example
<AUTHOR>Ed Van</AUTHOR>
AUTHOR is the element name and "Ed Van" is the text data.
The root usually does not contain text content (although it can) and instead has
child elements which have data. The root may contain various
optional objects such as:
- elements (start & end tags) with text data (content) in between
- comments
- processing instructions
- CDATA sections
- entity references
- character references
A DTD, by contrast, belongs in the prolog and cannot appear inside the root.
The above example simply contains some child elements, comments and text, but
none of the other options listed (since we cover these later).
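To make the two types of content concrete, here is a minimal sketch that reads a shortened copy of the BOOK example. It assumes Python and its standard xml.etree.ElementTree parser, which these notes do not otherwise use; any XML parser would expose the same structure.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the BOOK example from these notes.
DOC = """<?xml version="1.0"?>
<!-- prolog comment -->
<BOOK>
  <TITLE>XML for Smarties</TITLE>
  <AUTHOR>Ed Van</AUTHOR>
  <PRICE unit="$">5.99</PRICE>
</BOOK>"""

root = ET.fromstring(DOC)

# The root is the single outermost element.
print(root.tag)                      # BOOK

# Element content: the child elements nested inside the root.
child_names = [child.tag for child in root]
print(child_names)                   # ['TITLE', 'AUTHOR', 'PRICE']

# Text content: the character data inside one child element.
print(root.find("AUTHOR").text)      # Ed Van

# The root's own text here is only the white space used for indentation.
print(repr(root.text.strip()))       # ''
```

Note that the prolog (declaration and comment) does not show up as content of the root; only what sits between <BOOK> and </BOOK> does.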
Elements
Elements are the most important part of a document. Some concepts are:
- Elements make up the document structure and contain the content. You make up the element names and type your text content between an element's start and end tags.
- Unlike HTML, which has a finite set of pre-defined elements that the browser uses to format data, in XML you make up your own elements.
- Elements in XML are like "field" names in a database (DB) when they have data; elements that only hold other elements are like table names and record IDs in a DB.
- You can make up as many elements as you want, and the same elements can be repeated, just like the same field appears multiple times, once for each record in a DB.
- Elements are defined by a name (that you make up) and include everything from their start tag <name> through their end tag </name>.
Elements in the above example are:
- BOOK is the root element
- TITLE, AUTHOR, and PRICE are child elements acting like field names in a DB. The data varies for each element but is things like "Ed Van", "Mark Twain", etc. PRICE also has 1 attribute (called unit) whose value is "$"
- <img> and <h1> are also elements and are just there to show that if you use known HTML elements they are still like any other element you make up, and they do NOT have any meaning to the browser (i.e., the browser does not display the image, nor does it apply a heading-1 style).
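The "elements are like fields" idea can be sketched in a few lines. This is only an illustration, assuming Python's standard xml.etree.ElementTree parser; pairing the repeated elements by position is my own simplification, not something the example document guarantees.

```python
import xml.etree.ElementTree as ET

# The child elements of the BOOK example, without the comments.
DOC = """<BOOK>
  <TITLE>XML for Smarties</TITLE>
  <AUTHOR>Ed Van</AUTHOR>
  <PRICE unit="$">5.99</PRICE>
  <TITLE>Advanced XML</TITLE>
  <AUTHOR>Mark Twain</AUTHOR>
  <PRICE unit="$">6.49</PRICE>
  <img src="image_music.gif"></img>
  <h1>Is this a heading?</h1>
</BOOK>"""

root = ET.fromstring(DOC)

# findall returns every direct child with the given name, in document order.
titles = [e.text for e in root.findall("TITLE")]
authors = [e.text for e in root.findall("AUTHOR")]
prices = [float(e.text) for e in root.findall("PRICE")]

# Pair the repeated elements up by position to form "records".
records = list(zip(titles, authors, prices))
for record in records:
    print(record)

# To the parser, <img> and <h1> are ordinary elements with no special meaning.
other = [e.tag for e in root if e.tag not in ("TITLE", "AUTHOR", "PRICE")]
print(other)   # ['img', 'h1']
```

Matching fields by position is fragile. Wrapping each TITLE/AUTHOR/PRICE group in its own container element would make the record boundaries explicit.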
Syntax
Well-formed XML documents have correct syntax, which is different from a document being valid (conforming to a DTD or schema) or making sense. The syntax rules are:
- All elements must have a start and end tag, using the same tag syntax as HTML:
  - <xxx> is the start tag, where xxx is any element name
  - </xxx> is the end tag; its name must match the start name exactly, except that the end tag is prefixed with /, for example:
    <myelement>Text goes here </myelement>
  - an element with no content may use the shorthand <xxx/>, which is equivalent to <xxx></xxx>
- Markup goes in the start tag (between < >): the element name and its attributes go in the start <>, as opposed to character data (see below).
- Character data is the content text or data (like Mark Twain above) between the start & end tags of an element (basically between the start > and the end </). It can contain any characters except a literal < or &, which must be written as entity references (see below). The characters > " and ' are allowed in character data, although > is often escaped as well, and the sequence ]]> is not allowed.
- Tags are case sensitive, so <Title> and <TITLE> are different elements.
- The first element is the root. A document must have 1 and only 1 root. All other elements must be nested within the root element.
- Elements can have sub-elements (children) that must be nested correctly within their parent element as shown below (child tags must end before the parent's end tag).
- An element with text content is like a field name in a DB. An element with no text of its own is like a table name in a DB and is often used as a container to nest other elements inside.
- Elements can have attributes in name=value pairs. There can be multiple name=value pairs separated by spaces as long as the attribute names are unique, for example <file name1="xml" name2="html"> but NOT <file name="xml" name="html">
- Attribute values must be quoted, for example <PRICE unit="$">
- Quotes must be matched exactly but can be single ' or double "
- Attribute values can have quotes embedded in the text by using the other type of quote as the delimiter, for example <book title="It's a Girl">
- Names (for elements or attributes) cannot have spaces and must begin with a letter or _. After the first character they can contain letters, digits, -, _ or period. Avoid names that begin with "xml" (in any mix of upper and lower case), since those are reserved.
- White space is preserved, and CR/LF (returns) are converted to line feeds when passed to the application (although the application, like the browser, may collapse white space).
- Comments: like HTML, <!-- This is a comment --> can be anywhere except inside a tag (i.e., <mytag <!-- illegal comment-->> is not allowed). A comment cannot contain the string -- in its text.
- Entity references are used if you want to embed special characters in content, similar to HTML, using the syntax &entity; for example &lt; for < and &gt; for >. A main use of entities is showing XML examples in XML documents.
- Elements can have just character data, just child elements, or both, or can be empty. In the example above <img> is empty.
- HTML tags don't display by default (note the <img> in the example above does not display an image).
The example below shows nesting:
<root>
  <child> can add content here
    <subchild>your content goes here</subchild>
  </child>
</root>
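The entity-reference rule from the syntax list can be tried out with a small sketch. It assumes Python's standard library (xml.sax.saxutils.escape and xml.etree.ElementTree); the element name note and the sample text are made up for illustration.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw_text = "if a < b & b > c"

# escape() replaces &, < and > with &amp;, &lt; and &gt;.
safe_text = escape(raw_text)
print(safe_text)   # if a &lt; b &amp; b &gt; c

# The escaped text can be placed inside an element and parsed...
element = ET.fromstring("<note>" + safe_text + "</note>")

# ...and the parser hands the original characters back to the application.
print(element.text)   # if a < b & b > c

# Without escaping, the same text is not well formed and the parser rejects it.
try:
    ET.fromstring("<note>" + raw_text + "</note>")
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False
print(unescaped_ok)   # False
```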
Your data can actually be either:
- character data between element start & end tags
- attribute values
The choice is yours. However, later we will learn some technical
details that suggest avoiding attributes and instead making all of your data
element character data. Virtually every XML document has character data, but may
or may not have attributes. Either is okay, but realize attributes cause
problems with some associated technologies; for example, writing a JavaScript
program to process/display XML is more complicated if there are attributes.
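As a rough illustration of the difference, the sketch below stores the same price both ways and reads it back. It assumes Python's standard xml.etree.ElementTree parser; the UNIT and AMOUNT element names are hypothetical and are not part of the BOOK example.

```python
import xml.etree.ElementTree as ET

# Same price stored two ways: unit as an attribute vs. unit as a child element.
with_attribute = ET.fromstring('<PRICE unit="$">5.99</PRICE>')
with_elements = ET.fromstring(
    "<PRICE><UNIT>$</UNIT><AMOUNT>5.99</AMOUNT></PRICE>"
)

# Character data is read from .text; attribute values come from .get().
print(with_attribute.text, with_attribute.get("unit"))   # 5.99 $

# With elements only, every piece of data is read the same way.
print(with_elements.findtext("AMOUNT"), with_elements.findtext("UNIT"))   # 5.99 $
```

With attributes the program needs two access paths (text and attribute lookup); with elements only, one path covers all the data.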
Display
XML can be viewed in several ways:
- using a browser
- using any software that can display XML, like Microsoft Excel
- using a custom program you (or someone else) write in some programming tool like Visual Studio
Of course XML is text, so any text-enabled software including Notepad can display
the document in raw form. However, to display XML as intended you usually use a
- parser: checks if the XML is well formed, and
- style sheet: instructions on how to display
XML may appear 3 different ways in a browser, as described in
w3schools.com/xml/xml_view.asp
- plain xml code
- nicely formatted using some display technique or application program
- an error if not well formed
XML without a style sheet will display in newer versions of browsers like Internet Explorer (IE) because
they have a default XML style sheet, and the Microsoft parser (msxml) is built in
to IE. XML may not display in older browsers.
For now we'll use IE, and later investigate other parsers and how to load parsers using JavaScript. Displaying XML in IE:
- has color coding
- shows a tree structure for elements (with + and - to expand/collapse the tree)
- indicates if there are any syntax errors (if not, the document is well formed)
- does not have any useful formatting
[Figure: display of the above example in IE; the right side shows a collapsed element]
So the default display is not pretty at all. To make XML pretty, you must
create your own style sheet or use some other extra feature. Remember, you make up the elements, so the browser has no idea how to format
the content. To format a display there are various techniques, which we cover later,
like
- cascading style sheets (CSS)
- data binding with HTML
- XSL
- writing a program (JavaScript)
For now you can
see a crude display of XML just by opening it in a browser
(remember to use View | Source to see the actual XML). To test that your browser
works okay with XML, click on
noteX_XmInBrowser.htm
Look at the links below to
see the display of XML.
Creating XML Documents
You can create a document with a text editor or with special software for editing XML. It is
easier to use software than to code manually, and just like web editors there
are various software packages for XML.
Typical steps for a new document are:
- Copy & paste the prolog from an existing document (like the top lines of my example at the top, before the root's start tag)
- Type the start and end tags for the root element
- Type the child elements and content inside the root. For any element it is usually best to type the start tag, then copy-paste it and add / to make the end tag (dropping any attributes, which belong only in the start tag); this avoids typing errors or forgetting the end tag.
- Decide on how you will nest elements and use attributes, if at all
- Save the text file (with .xml extension), then open the file in a browser to see if there are any syntax errors. Also check that the nesting is what you intended (try collapsing and expanding the elements). If there are no errors then you are all done; otherwise correct the errors and repeat this step.
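The "special software" route can be as simple as letting a library write the tags for you. The sketch below is one possible approach, assuming Python's standard xml.etree.ElementTree; the title text is made up to include special characters.

```python
import xml.etree.ElementTree as ET

# Build the tree: root first, then child elements nested inside it.
book = ET.Element("BOOK")
ET.SubElement(book, "TITLE").text = "Tom & Jerry <uncut>"
ET.SubElement(book, "AUTHOR").text = "Ed Van"
ET.SubElement(book, "PRICE", unit="$").text = "5.99"

# Serialize to text; the library writes matching end tags and escapes & and <.
xml_text = ET.tostring(book, encoding="unicode")
print(xml_text)

# Round trip: parsing the generated text succeeds, so it is well formed.
parsed = ET.fromstring(xml_text)
print(parsed.findtext("TITLE"))   # Tom & Jerry <uncut>
```

Because the library generates the markup, forgotten end tags and unescaped special characters cannot happen, which removes the most common typing errors from the manual steps.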
Errors
There are many types of syntax errors you may encounter, since XML has many rigid
rules. If a document has errors, the software (like IE) will indicate the error. Look closely at the line number the browser displays for the error, but realize
that some reported line numbers are actually due to errors in the line(s) before. For example, if
you miss a closing ">", IE may indicate the error is on a line below the one
missing the >. Some common errors are:
- forgetting end tags (like </me>)
- forgetting XML is case sensitive
- using spaces in the name of an element
- forgetting quotes around an attribute value
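The four common errors above can be reproduced with any parser, not only IE. The sketch below assumes Python's standard xml.etree.ElementTree; check_well_formed is a helper name I made up, and the exact error messages depend on the parser.

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_text):
    """Return None if xml_text parses, else (line, column, message)."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        line, column = err.position
        return (line, column, str(err))

# One small document per common error, plus a correct one for comparison.
samples = {
    "missing end tag": "<BOOK>\n<TITLE>XML\n</BOOK>",
    "case mismatch": "<BOOK>\n<Title>XML</TITLE>\n</BOOK>",
    "space in name": "<BOOK>\n<MY TITLE>XML</MY TITLE>\n</BOOK>",
    "unquoted attribute": "<BOOK>\n<PRICE unit=$>5.99</PRICE>\n</BOOK>",
    "correct": "<BOOK>\n<TITLE>XML</TITLE>\n</BOOK>",
}

for label, text in samples.items():
    print(label, "->", check_well_formed(text))
```

In the "missing end tag" sample the end tag is missing on line 2, but the parser only notices the problem when it reaches </BOOK> on line 3, which is the same "error reported on a later line" effect described above.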
CDATA Sections
CDATA is an alternative to using entities like &gt; to show
special characters (< > & ' ") in content. The text inside a CDATA section is not parsed for markup and so
can contain special characters. CDATA is an easier way to show text that has
many special characters. It's especially useful in teaching documents that
contain examples of script
code or XML code, since these by nature have lots of < > & characters. A CDATA section
- can go anywhere character data can (not in markup itself)
- starts with "<![CDATA["
- ends with "]]>"
- cannot contain the string ]]> in the text between the start and end
An example is:
<![CDATA[
function button_onclick() {
  if (myval == 0 && total < 0) alert("error");
}
]]>
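To see that CDATA and entity references are two spellings of the same content, here is a small sketch assuming Python's standard xml.etree.ElementTree; the wrapping code element is a made-up name.

```python
import xml.etree.ElementTree as ET

# The same script line written with a CDATA section...
with_cdata = ET.fromstring(
    '<code><![CDATA[if (myval == 0 && total < 0) alert("error");]]></code>'
)

# ...and with entity references for the special characters.
with_entities = ET.fromstring(
    '<code>if (myval == 0 &amp;&amp; total &lt; 0) alert("error");</code>'
)

# The application receives identical text either way.
print(with_cdata.text)
print(with_cdata.text == with_entities.text)   # True
```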
Processing Instructions
Processing instructions contain information in an XML document that applications
(i.e., software separate from the XML) use to process it. Processing itself is
advanced and we cover it later; however, you should have some notion of what
instructions are since you may see these in examples. The general form is
below, where
target names the application (or kind of processing) the instruction is meant for, and value is whatever
data that application needs. An application that does not recognize the target ignores the instruction.
<?target value?>
Examples you may see with IE are to include a style sheet or to provide
information to your own script like:
- <?xml-stylesheet type="text/xsl" href="MyXslFile.xsl"?>
- <?MyScript Version="2"?>
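An application can read the target and value of each instruction and decide what to do with it. The sketch below assumes Python's standard xml.dom.minidom parser and an empty BOOK root as a stand-in document.

```python
from xml.dom import minidom, Node

# A stand-in document with the two processing instructions from the notes.
DOC = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="MyXslFile.xsl"?>
<?MyScript Version="2"?>
<BOOK/>"""

document = minidom.parseString(DOC)

# Processing instructions in the prolog are children of the document node.
instructions = [
    (node.target, node.data)
    for node in document.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]

for target, data in instructions:
    print(target, "|", data)
```

The XML declaration on the first line looks similar but is not reported as a processing instruction. The parser itself does nothing with the two instructions; acting on them is left to the application.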
Summary
XML document:
- contains many elements marked by start and end tags
- you make up the element names
- your content (data) goes between start and end tags for an element; all
content is associated with an element
- elements can be nested
- need 1 and only 1 root element which usually has no character data but
has all the child elements
- child elements can be deeply nested and usually the lowest nesting has
data so acts like a field name in a DB; higher-level elements without text of their own usually act
as containers like table names or record IDs in a DB
- the markup is similar to HTML except the syntax has strict rules
- without style sheets or other processing instructions, the browser will not format
your display
- can be created with a text editor
- should have an XML declaration at the top
- uses .xml for the file extension
First impressions:
- Displaying an XML document (without a style sheet or other technology) is
usually disappointing for those used to the colorful display of html web
pages.
- XML document is about describing data, not a nicely formatted display
(although there are related technologies like XSL to make a display)
- Document structure often makes more sense from a database perspective,
than from an html display one, since the elements are like fields and the same
group of elements can be repeated to form records. However, elements can be
deeply nested and the same group does not have to be repeated which is unlike
most relational DB's (unless you view the nesting as a one-many relationship).
This course includes both a database and a display/HTML perspective,
which is why I will often mention how XML compares to DB and HTML terms. I
realize students may not have both perspectives, but hopefully you have at
least 1 of them.