XML - Elements and Data Modeling

Objective: More details on elements, XML structure, and data modeling approaches.

Learn:

Notes

Knowing XML syntax alone is not enough to produce a sensible document. XML is often used to represent data, and to that end it must use a logical structure involving elements, attributes, and nesting.

Elements & Attributes

The syntax of elements and attributes was discussed previously. This section describes more of the concepts behind elements and attributes.

Elements:

Elements vs. Attributes:

Data can be stored in elements or in attributes as shown below. The following both work but store data in different ways:

Data stored in elements and attributes:

<person gender="female">
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>

Data stored only in elements:

<person>
  <gender>female</gender>
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>
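As a rough illustration of how the two layouts differ for a program reading them, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The variable names are made up for this example, and Python is just one possible choice of language.

```python
import xml.etree.ElementTree as ET

# The two <person> layouts from the text, held as strings.
with_attribute = """
<person gender="female">
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>
"""

elements_only = """
<person>
  <gender>female</gender>
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>
"""

person_a = ET.fromstring(with_attribute)
person_b = ET.fromstring(elements_only)

# An attribute is read from the element itself with get().
gender_a = person_a.get("gender")

# A child element is looked up by name and its text read with findtext().
gender_b = person_b.findtext("gender")

print(gender_a, gender_b)
```

Both layouts carry the same value, but the code that reads them has to know which layout was chosen.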

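It is often suggested to use elements if the information feels like data, since attributes are harder to maintain: an attribute can hold only a single text value, cannot contain nested elements, and is harder to extend later.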

An attribute can be useful for something like an ID (just as in HTML), which acts as a reference mark or counter, something like <person id="12">, although even that can be a child element instead. If you have many attributes, it may be time to convert some into elements. Recall that elements are necessary, but attributes are optional and may never be needed.

Nesting

Elements can be nested so that related data is grouped together.

Incorrect nesting syntax will result in an error message; illogical nesting, however, will not show an error. For example, the document below likely holds data for two people, but it has a logic error: the second Name/Age pair is not nested under a Person element.

<Data>
  <Person>
    <Name>Ed Van</Name>
    <Age>29</Age>
  </Person>
     <Name>Laura Van</Name>
     <Age>30</Age>
</Data>

The document above is well formed, and to the human eye it looks like data for two people (Ed and Laura), but nothing actually links "Laura Van" to the age 30. It makes more sense as:

<Data>
  <Person>
    <Name>Ed Van</Name>
    <Age>29</Age>
  </Person>
  <Person>
     <Name>Laura Van</Name>
     <Age>30</Age>
  </Person>
</Data>
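To see why the badly nested version matters to a program and not just to a reader, here is a small Python sketch (standard library only, variable names invented for illustration) that parses the flawed document and checks what it actually contains.

```python
import xml.etree.ElementTree as ET

# The flawed document: the second Name/Age pair sits outside any Person.
flawed = """
<Data>
  <Person>
    <Name>Ed Van</Name>
    <Age>29</Age>
  </Person>
  <Name>Laura Van</Name>
  <Age>30</Age>
</Data>
"""

root = ET.fromstring(flawed)  # parses without error: it is well formed

# As far as a program can tell, only one Person record exists.
people = root.findall("Person")

# The second Name and Age are direct children of Data, siblings of Person.
stray = [child.tag for child in root if child.tag != "Person"]

print(len(people), stray)
```

The parser raises no error, yet a program looping over Person elements would never see Laura.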

In the example above, the <Person> element has no character data of its own and acts as a container, like a record in a database, whereas <Name> and <Age> act like fields that hold data. Of course, in a DBMS records do not have names; in XML, elements with nested children behave like records, but every element must have a name.

Other Structure Issues

Nesting is the most important way to give data some structure, but there are other issues. One that comes to mind is consistency in element names. For example:

 <Data>
   <Person>
     <Name>Ed Van</Name>
     <Age>29</Age>
   </Person>
   <Person>
      <aName>Laura Van</aName>
      <AGE>30</AGE>
   </Person>
 </Data>

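The document above has reasonable nesting and is well formed, but the element names are not consistent. It would be better if the second Person used the same <Name> and <Age> element names as the first.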
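The cost of inconsistent names shows up as soon as a program looks elements up by name. A minimal Python sketch (standard library only, names chosen for illustration):

```python
import xml.etree.ElementTree as ET

doc = """
<Data>
  <Person>
    <Name>Ed Van</Name>
    <Age>29</Age>
  </Person>
  <Person>
    <aName>Laura Van</aName>
    <AGE>30</AGE>
  </Person>
</Data>
"""

root = ET.fromstring(doc)

# Element names are case sensitive, so Age and AGE are different names,
# and aName is simply not Name. findtext() returns None when nothing matches.
names = [p.findtext("Name") for p in root.findall("Person")]
ages = [p.findtext("Age") for p in root.findall("Person")]

print(names, ages)
```

The second person is found, but the name and age come back empty because the lookup names do not match.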

Databasing

XML can be considered a database of sorts because it:

  1. is a collection of data
  2. gives structure to data (by nesting elements)
  3. can be sorted and queried
  4. uses schemas
  5. is easily integrated into data-driven programs
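As an informal illustration of points 3 and 5, the sketch below queries and sorts a small two-person document from a Python program. It uses only the standard library; the age threshold and variable names are arbitrary choices for the example, not part of any standard.

```python
import xml.etree.ElementTree as ET

doc = """<Data>
  <Person><Name>Ed Van</Name><Age>29</Age></Person>
  <Person><Name>Laura Van</Name><Age>30</Age></Person>
</Data>"""

root = ET.fromstring(doc)
people = root.findall("Person")

def age_of(person):
    # Element text is always a string, so convert before comparing.
    return int(person.findtext("Age"))

# Query: names of everyone aged 30 or more.
older = [p.findtext("Name") for p in people if age_of(p) >= 30]

# Sort: names ordered by age, highest first.
by_age = [p.findtext("Name") for p in sorted(people, key=age_of, reverse=True)]

print(older, by_age)
```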

But XML is more useful for data exchange and for keeping small data sets than for acting as a large enterprise database because:

  1. XML is a text file, so it is not efficient for storing or querying large data sets. If one piece of data is changed, the entire file must be rewritten, which is okay for small files but not for large ones.
  2. There are no indexes or transaction features to make it efficient and secure.
  3. It is not designed for multi-user access.

Why not use a database instead of XML? It depends on the situation. XML is portable and supports deeply nested structures, making it more suitable than a DBMS for things like:

There is no right or wrong answer about when to use XML. The biggest uses are

But there are other uses it can be put to, though these are not yet common:

It is interesting to note that Microsoft Visual Studio uses the term data "source" rather than database, to show that the data in data-driven applications can come from a database or from any other source, mainly XML. In fact, when a data object is created in it, an XML schema is automatically generated (even if it connects to a database), since the underlying basis for data-driven apps is XML.

Data Modeling

Data modeling has two considerations:

  1. validating a document against the rules of a DTD or schema, which is covered later in the book
  2. how you structure your data using nesting and elements versus attributes

#1 is covered when we talk about DTDs and schemas. #2, however, is rather subjective, and information on it is harder to find in books or on web sites. The idea is to decide how to structure and nest your elements to best represent your data.

Often you follow someone else's schema and do not get to make up the structure. But if you have to invent the structure or schema, you often have many ways to represent the same data. Although most web sites discuss only syntax, some also discuss how to represent data.

Whether to use elements or attributes was discussed above. The other consideration is how you group elements using nesting. As long as you follow the syntax rules for nesting, there is no right or wrong, nor even industry guidelines. It is somewhat akin to structuring a database, except that most databases are relational and have set guidelines. Data in XML can be represented in many ways, like:

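Unrelated data can be stacked in a document. For example, the fragment below has information about employees and inventory (two unrelated groups); in a complete, well-formed document both groups would sit inside a single root element.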

<Employees>
   <Name>Ed</Name>
   <Name>John</Name>
</Employees>
<Inventory>
   <Item>Toyota Hybrid Car of the Year</Item>
   <Item>Gas Guzzlin Al Queda Supporting SUV</Item>
</Inventory>
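A well-formed XML document has exactly one root element, so two top-level groups like these only work as a fragment. The Python sketch below (standard library only) shows a parser rejecting the bare fragment and accepting it once wrapped. The wrapper name Company is made up for this example, and the inventory list is shortened.

```python
import xml.etree.ElementTree as ET

fragment = """
<Employees>
  <Name>Ed</Name>
  <Name>John</Name>
</Employees>
<Inventory>
  <Item>Toyota Hybrid Car of the Year</Item>
</Inventory>
"""

try:
    ET.fromstring(fragment)
    parsed_as_is = True
except ET.ParseError:
    parsed_as_is = False  # two top-level elements are rejected

# Wrapping both groups in one root element (Company is a hypothetical name)
# turns the fragment into a well-formed document.
root = ET.fromstring("<Company>" + fragment + "</Company>")
groups = [child.tag for child in root]

print(parsed_as_is, groups)
```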

Alternatively, a document intended to be readable can be stacked and given some loose, non-relational structure, like the product description below. Instead of using HTML, which only has formatting, XML elements are used to give it some structure that could potentially be processed by an application.

<Product>
<Intro>The <ProductName>Turkey Wrench</ProductName> from <Developer>Full
Fabrication Labs, Inc.</Developer> is <Summary>like a monkey wrench,
but not as big.</Summary>
</Intro>
<Description>
<Para>The turkey wrench, which comes in <i>both right- and left-
handed versions (skyhook optional)</i>, is made of the <b>finest
stainless steel</b>
</Para>
<Para>You can:</Para>
<List>
  <Item><Link URL="Order.html">Order your own turkey wrench</Link></Item>
  <Item><Link URL="Catalog.zip">Download the catalog</Link></Item>
</List>
</Description>
</Product>

Nesting is a way to group data. For example, a well-formed but poor way to structure data would be:

<People>    <!-- in this case the data are not grouped -->
   <Name>Janie</Name><Phone>123</Phone><Age>29</Age>
   <Name>Laura</Name><Phone>456</Phone><Age>32</Age>
</People>

It would be better to add a Person element to act as a container that groups the data for each person:

<People>    <!-- now Person acts as a container to group each person -->
   <Person>
      <Name>Janie</Name><Phone>123</Phone><Age>29</Age>
   </Person>
   <Person>
      <Name>Laura</Name><Phone>456</Phone><Age>32</Age>
   </Person>
</People>
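Once each person has a container, turning the document into records is straightforward. A minimal Python sketch, assuming the grouped layout above and using only the standard library (the variable names are invented):

```python
import xml.etree.ElementTree as ET

grouped = """
<People>
  <Person>
    <Name>Janie</Name><Phone>123</Phone><Age>29</Age>
  </Person>
  <Person>
    <Name>Laura</Name><Phone>456</Phone><Age>32</Age>
  </Person>
</People>
"""

root = ET.fromstring(grouped)

# Each Person becomes one dictionary keyed by child element name.
records = [
    {field.tag: field.text for field in person}
    for person in root.findall("Person")
]

print(records)
```

With the ungrouped version, a program would instead have to guess that every run of three sibling elements belongs to one person.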

Relational data can be represented either by nesting or by setting up relations. The example below, from msdn.microsoft.com, shows orders nested inside each customer element.

<CustomerOrders>
  <Customers>
    <CustomerID>ALFKI</CustomerID>
    <Orders>
      <OrderID>10643</OrderID>
      <CustomerID>ALFKI</CustomerID>
      <OrderDate>1997-08-25</OrderDate>
    </Orders>
    <Orders>
      <OrderID>10692</OrderID>
      <CustomerID>ALFKI</CustomerID>
      <OrderDate>1997-10-03</OrderDate>
    </Orders>
    <CompanyName>Alfreds Futterkiste</CompanyName>
  </Customers>
  <Customers>
    <CustomerID>ANATR</CustomerID>
    <Orders>
      <OrderID>10308</OrderID>
      <CustomerID>ANATR</CustomerID>
      <OrderDate>1996-09-18</OrderDate>
    </Orders>
    <CompanyName>Ana Trujillo Emparedados y helados</CompanyName>
  </Customers>
</CustomerOrders>
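With orders nested inside each customer, the one-to-many relationship can be read straight from the tree. The Python sketch below uses a trimmed copy of the document above (some child elements are left out for brevity) and only the standard library. The name orders_by_customer is just a label for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the nested customer/order document.
doc = """
<CustomerOrders>
  <Customers>
    <CustomerID>ALFKI</CustomerID>
    <Orders><OrderID>10643</OrderID><OrderDate>1997-08-25</OrderDate></Orders>
    <Orders><OrderID>10692</OrderID><OrderDate>1997-10-03</OrderDate></Orders>
  </Customers>
  <Customers>
    <CustomerID>ANATR</CustomerID>
    <Orders><OrderID>10308</OrderID><OrderDate>1996-09-18</OrderDate></Orders>
  </Customers>
</CustomerOrders>
"""

root = ET.fromstring(doc)

# Map each customer ID to the list of order IDs nested beneath it.
orders_by_customer = {
    customer.findtext("CustomerID"): [
        order.findtext("OrderID") for order in customer.findall("Orders")
    ]
    for customer in root.findall("Customers")
}

print(orders_by_customer)
```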

The same data could instead be modeled with one-to-many relationships defined in a schema, but that is for later (see msdn.microsoft.com).

Naming Conventions

Mostly there are no naming conventions or guidelines, other than that some DTD keywords (like PCDATA) are uppercase and some words in the declaration are lowercase (like xml version). But you can name elements and attributes as you wish. I tend to agree with wdvl.internet.com/Authoring, which suggests that mixed case like "LastName" is best. So I prefer lower case or mixed case, and of the two I think mixed case is more readable. I shy away from UPPER case because:

  1. having to use the SHIFT or CapsLock key is just one more thing to remember
  2. UPPER case is not considered as readable as mixed or lower case
  3. I like reserving upper case for keywords

Of course, if your XML is simply used for data exchange and is never read by human eyes, then it does not matter. But one of the goals of XML is to represent data so that it is readable by humans and can be processed by machines.

Summary

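XML syntax is clear, but how you should model data is not. XML structure allows many different ways to model data, and you must consider whether to use elements or attributes, how to group data with nesting, and how to keep element names consistent.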