DOMC <domc.h>

DOMC is a lightweight C implementation of the DOM as specified in the W3C Document Object Model Level 1, Level 2, and Level 2 Events recommendations. The DOM is a popular API for manipulating XML and HTML documents as a tree of nodes in memory. It is the more sophisticated but more memory-intensive alternative to the SAX API.

This implementation is not W3C compliant because it lacks support for namespace functionality, entity references, DOCTYPE nodes, DTD default attribute values, and other peripheral functionality. The DOM_Node type and its associated operations should work well, however, because the functionality that is supported has been tested thoroughly.

The definitive information on the DOM is the collection of W3C recommendations.

DOM_Implementation

The DOM_Implementation interface provides functions for testing the functionality of a DOM implementation as well as for creating DOM_Document and DOM_DocumentType nodes.

DOM_Document

A DOM_Document represents an entire XML document and acts as the root of the DOM tree. Because nodes cannot exist outside of the context of a DOM_Document, this interface provides the factory methods needed to create individual nodes to compose and modify DOM trees. The ownerDocument member of a DOM_Node points to the document from which it was created (except for DOM_DocumentType and DOM_Document nodes, which may have a NULL ownerDocument member). This interface also provides the DOM_Document_getElementsByTagName function for retrieving all elements with a specified name.

To build a document from scratch, use the expression DOM_Implementation_createDocument(NULL, NULL, NULL) to create an empty document, then add new nodes using DOM_Document_createElement, DOM_Document_createComment, etc. together with DOM_Node_appendChild, DOM_Node_insertBefore, or similar. See the DOM_Implementation and DOM_Node interface documentation for details.

Memory Management

The DOM_DocumentLS_load, DOM_DocumentLS_read, and DOM_Document_createXxx functions allocate memory that must at some point be freed with DOM_Document_destroyNode. The DOM_Document_destroyNode function may be used to release nodes of all types, such as DOM_Element, DOM_Text, DOM_Attr, and DOM_Document. All children of a node are freed when the parent is freed. An entire document may be freed with the expression DOM_Document_destroyNode(doc, doc). Beware that freeing a node that is still a descendant of another node leaves the tree with invalid pointers and will cause the program to crash when the ancestor is freed and the node is released a second time.

There are only two other special cases to consider. First, the DOM_Document_destroyNodeList function must be called for each DOM_NodeList returned by DOM_Element_getElementsByTagName and DOM_Document_getElementsByTagName. Second, a DOM_DocumentFragment node cannot be a child of another node. When it is added to the tree, its children are actually moved into the target node, leaving an empty DOM_DocumentFragment. This empty node must be freed with DOM_Document_destroyNode if it will no longer be used.

For completeness, the DOM_DocumentEvent_destroyEvent function must be called to free DOM_Event objects; however, that non-core API is not yet documented here.
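
The ownership rules above can be illustrated with a small stand-alone model. The sketch below does not use DOMC itself: node, node_new, node_append, node_detach, and node_destroy are hypothetical stand-ins that mirror only the parent and sibling links of a DOM_Node. It is meant to show why a node has to be unlinked from its parent before it is freed on its own, and how freeing a parent releases the whole subtree.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for DOM_Node; only the tree links are modelled. */
struct node {
	struct node *parentNode;
	struct node *firstChild;
	struct node *lastChild;
	struct node *previousSibling;
	struct node *nextSibling;
};

/* Allocate an unattached node; returns NULL if out of memory. */
struct node *node_new(void)
{
	return calloc(1, sizeof(struct node));
}

/* Link child as the last child of parent. */
void node_append(struct node *parent, struct node *child)
{
	assert(parent != NULL && child != NULL && child->parentNode == NULL);
	child->parentNode = parent;
	child->previousSibling = parent->lastChild;
	if (parent->lastChild)
		parent->lastChild->nextSibling = child;
	else
		parent->firstChild = child;
	parent->lastChild = child;
}

/* Unlink n from its parent so the parent no longer points at it. */
void node_detach(struct node *n)
{
	struct node *p;
	assert(n != NULL);
	p = n->parentNode;
	if (p == NULL)
		return;
	if (n->previousSibling)
		n->previousSibling->nextSibling = n->nextSibling;
	else
		p->firstChild = n->nextSibling;
	if (n->nextSibling)
		n->nextSibling->previousSibling = n->previousSibling;
	else
		p->lastChild = n->previousSibling;
	n->parentNode = n->previousSibling = n->nextSibling = NULL;
}

/* Free n and all of its descendants; returns the number of nodes freed.
 * The caller must detach n first, mirroring the rule described above. */
int node_destroy(struct node *n)
{
	int freed = 1;
	struct node *c, *next;
	if (n == NULL)
		return 0;
	assert(n->parentNode == NULL);
	for (c = n->firstChild; c != NULL; c = next) {
		next = c->nextSibling;
		c->parentNode = NULL; /* the parent goes away with the child */
		freed += node_destroy(c);
	}
	free(n);
	return freed;
}
```

In this model the safe sequence for removing a subtree is node_detach(n) followed by node_destroy(n). Calling node_destroy on a node that is still attached trips the assertion instead of silently leaving a dangling pointer in the parent.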

DOM_Node

The DOM_Node type is the primary datatype of the Document Object Model. Most of the other DOM interfaces inherit this interface. All DOM_Nodes have nodeName, nodeValue, and nodeType members. The values of these members depend on the node type. For example, a DOM_Element node has a nodeName corresponding to the tag name and a NULL nodeValue.

Only the DOM_Element node type has attributes. All other node types have a NULL attributes member. Child nodes are accessible through the childNodes DOM_NodeList member and through the firstChild, lastChild, previousSibling, and nextSibling members. Not all node types have child nodes.
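
Walking the children of a node through firstChild and nextSibling might look like the following sketch. The struct node type here is a hypothetical stand-in that carries only three of the member names of DOM_Node so that the example compiles without domc.h; count_children and find_child are illustrative helpers and not part of DOMC.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in carrying a few of the member names of DOM_Node. */
struct node {
	const char *nodeName;
	struct node *firstChild;
	struct node *nextSibling;
};

/* Count the direct children of n by following the sibling chain. */
int count_children(const struct node *n)
{
	int count = 0;
	const struct node *c;
	assert(n != NULL);
	for (c = n->firstChild; c != NULL; c = c->nextSibling)
		count++;
	return count;
}

/* Return the first direct child named name, or NULL if there is none. */
const struct node *find_child(const struct node *n, const char *name)
{
	const struct node *c;
	assert(n != NULL && name != NULL);
	for (c = n->firstChild; c != NULL; c = c->nextSibling)
		if (c->nodeName && strcmp(c->nodeName, name) == 0)
			return c;
	return NULL;
}
```

A node without children simply has a NULL firstChild, so both loops fall through immediately and no special case is needed.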

In DOMC, node inheritance is emulated with simple typedef statements and a union that contains all possible subclass attributes. To access an attribute that is specific to a child interface, it may be necessary to go through this union. For example, the systemId of a notation node is currently only accessible through the union, as follows:

  DOM_String *sysid;
  ...
  sysid = node->u.Notation.systemId;
  
Care must be taken when modifying these union members (this is not well defined yet). Attributes accessible through the union that may need to be modified have helper methods to make this less awkward. The DOM_Node_setNodeValue function must be used to set the nodeValue member.

The all-important DOM_Node structure follows although some fields are left out in the interest of brevity. It may be necessary to look at this structure in the domc.h header.

  struct DOM_Node {
  	DOM_String *nodeName;
  	DOM_String *nodeValue;
  	unsigned short nodeType;
  	DOM_Node *parentNode;
  	DOM_NodeList *childNodes;
  	DOM_Node *firstChild;
  	DOM_Node *lastChild;
  	DOM_Node *previousSibling;
  	DOM_Node *nextSibling;
  	DOM_NamedNodeMap *attributes;
  	DOM_Document *ownerDocument;
  	union {
  		struct {
  			DOM_DocumentType *doctype;
  			DOM_Element *documentElement;
  			DOM_String *version;
  			DOM_String *encoding;
  			int standalone;
  		} Document;
  		struct {
  			DOM_NamedNodeMap *entities;
  			DOM_NamedNodeMap *notations;
  			DOM_String *publicId;
  			DOM_String *systemId;
  			DOM_String *internalSubset;
  		} DocumentType;
  		struct {
  			int specified;
  			DOM_Element *ownerElement;
  		} Attr;
  		struct {
  			int length;
  		} CharacterData;
  		struct {
  			DOM_String *publicId;
  			DOM_String *systemId;
  		} Notation;
  		struct {
  			DOM_String *publicId;
  			DOM_String *systemId;
  			DOM_String *notationName;
  		} Entity;
  		struct {
  			DOM_String *target;
  			DOM_String *data;
  		} ProcessingInstruction;
  	} u;
  };
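
Because the union carries no tag of its own, it seems prudent to check nodeType before reading a member of u. The sketch below shows one way to do that. It uses a hypothetical, cut-down stand-in for the structure above so that it compiles without domc.h; the constant names and the notation_system_id helper are assumptions for illustration, with numeric values taken from the W3C DOM Level 1 node type constants.

```c
#include <assert.h>
#include <stddef.h>

/* Numeric values follow the W3C DOM Level 1 node type constants. */
enum { ELEMENT_NODE = 1, ATTRIBUTE_NODE = 2, NOTATION_NODE = 12 };

/* Hypothetical, cut-down stand-in for the DOM_Node structure. */
struct node {
	const char *nodeName;
	unsigned short nodeType;
	union {
		struct {
			int specified;
		} Attr;
		struct {
			const char *publicId;
			const char *systemId;
		} Notation;
	} u;
};

/* Return the systemId of a notation node, or NULL for any other type. */
const char *notation_system_id(const struct node *n)
{
	assert(n != NULL);
	if (n->nodeType != NOTATION_NODE)
		return NULL; /* u.Notation is not the active member */
	return n->u.Notation.systemId;
}
```

Wrapping each union access in a small accessor like this keeps the type check in one place rather than scattering it across callers.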
  

DOM_Element

The DOM_Element interface represents an element in an XML document. The following is a description of each DOM_Node member in the context of a DOM_Element:

nodeName: This DOM_String * corresponds to the tag name of the element. It is read-only and cannot be modified.
nodeValue: This is always NULL.
childNodes: This DOM_NodeList * contains the child nodes of this element.
attributes: This DOM_NamedNodeMap * contains the DOM_Attr attribute nodes of this element.
firstChild: This DOM_Node * points to the first child of this element or NULL if the element currently has no children.
lastChild: This DOM_Node * points to the last child of this element or NULL if the element currently has no children.
previousSibling: This DOM_Node * points to the previous node in the childNodes list of the parent element of this element.
nextSibling: This DOM_Node * points to the next node in the childNodes list of the parent element of this element.

In addition to the functions provided by the DOM_Node interface this interface provides additional functions mainly for manipulating attributes.

The DOM specifications require support for entity references, which may result in the childNodes of an attribute containing a potentially complex subtree of DOM nodes. DOMC currently has very weak support for entity references and as a result attributes will never have children. The default module for loading and storing XML documents uses the Expat XML parser, which expands entity references by default. Expat recently added support for parsing external entities but DOMC does not yet use this functionality.

DOM_NamedNodeMap

The DOM_NamedNodeMap type provides access to an unordered map that permits nodes to be retrieved and set by their nodeName. The attributes member of a DOM_Element node type is a DOM_NamedNodeMap, as are the entities and notations members of the DOM_DocumentType interface.
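
The get-and-set-by-name behavior can be modelled with a few lines of C. This is only a sketch of the idea and says nothing about how DOMC stores its maps internally: struct named_map, map_get, map_set, and MAP_CAPACITY are hypothetical names. As in the W3C description of setNamedItem, storing a node whose name is already present replaces the old node and hands it back to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAP_CAPACITY 8 /* arbitrary limit chosen for the sketch */

/* Hypothetical stand-in for a node that is keyed by its nodeName. */
struct node {
	const char *nodeName;
	const char *nodeValue;
};

struct named_map {
	struct node *items[MAP_CAPACITY];
	int length;
};

/* Look up a node by name; returns NULL if no such node is stored. */
struct node *map_get(const struct named_map *map, const char *name)
{
	int i;
	assert(map != NULL && name != NULL);
	for (i = 0; i < map->length; i++)
		if (strcmp(map->items[i]->nodeName, name) == 0)
			return map->items[i];
	return NULL;
}

/* Store n under its nodeName. On success returns 0 and, if replaced is
 * not NULL, sets *replaced to the node that was displaced (or NULL).
 * Returns -1 without storing n if the map is full. */
int map_set(struct named_map *map, struct node *n, struct node **replaced)
{
	int i;
	assert(map != NULL && n != NULL && n->nodeName != NULL);
	if (replaced)
		*replaced = NULL;
	for (i = 0; i < map->length; i++) {
		if (strcmp(map->items[i]->nodeName, n->nodeName) == 0) {
			if (replaced)
				*replaced = map->items[i];
			map->items[i] = n;
			return 0;
		}
	}
	if (map->length == MAP_CAPACITY)
		return -1;
	map->items[map->length++] = n;
	return 0;
}
```

Returning the displaced node matters for memory management: the caller still owns it and is the only party in a position to free it.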

DOM_NodeList

The DOM_NodeList type provides access to an ordered collection of nodes. The childNodes member of DOM_Node is a DOM_NodeList. The getElementsByTagName functions also return a DOM_NodeList.

The DOM recommendations specify that these lists are live, meaning that modifying the children of a node should be reflected in a list returned by the getElementsByTagName functions. Currently DOMC does not update a DOM_NodeList returned by the getElementsByTagName functions if source nodes are subsequently removed or if a node is added that should be included.
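
The practical consequence is that such a list behaves like a snapshot and has to be queried again after the tree changes. The following stand-alone sketch models that behavior; struct node_list, collect_by_name, and LIST_CAPACITY are hypothetical names, and the code illustrates snapshot semantics in general rather than DOMC's own implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LIST_CAPACITY 16 /* arbitrary limit chosen for the sketch */

/* Hypothetical stand-in carrying a few of the member names of DOM_Node. */
struct node {
	const char *nodeName;
	struct node *firstChild;
	struct node *nextSibling;
};

/* A snapshot list: filled once, never updated behind the caller's back. */
struct node_list {
	struct node *items[LIST_CAPACITY];
	int length;
};

/* Append every descendant of root named name to out, in document order.
 * Matches beyond LIST_CAPACITY are silently dropped. */
void collect_by_name(struct node *root, const char *name,
		struct node_list *out)
{
	struct node *c;
	assert(root != NULL && name != NULL && out != NULL);
	for (c = root->firstChild; c != NULL; c = c->nextSibling) {
		if (strcmp(c->nodeName, name) == 0 && out->length < LIST_CAPACITY)
			out->items[out->length++] = c;
		collect_by_name(c, name, out); /* descend before moving on */
	}
}
```

A list filled before a node is added keeps its old length; only a second call into a fresh list sees the new node. The reverse case is the more dangerous one, since a stale list may still point at a node that has been removed and freed.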

DOM_CharacterData

The DOM_CharacterData interface provides some basic text manipulation functions for the DOM_Text, DOM_Comment, and DOM_CDATASection nodes. DOM_CharacterData nodes cannot be instantiated directly.

Currently all of these functions set DOM_Exception if an error occurs; however, there is no return value with which to detect the error. A future version of DOMC will likely return a value that indicates that an error has occurred.
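
Until then, one workable pattern is to clear the error indicator, make the call, and inspect the indicator afterwards. The sketch below models this with stand-ins so that it compiles without domc.h: dom_exception plays the role of DOM_Exception, text_delete is a hypothetical void function in the style described above, and text_delete_checked is an illustrative wrapper. It assumes that the real DOM_Exception can be reset by assignment, which should be verified against domc.h.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins: DOMC's real error indicator is DOM_Exception. */
enum { ERR_NONE = 0, ERR_INDEX_SIZE = 1 };
int dom_exception = ERR_NONE;

/* Delete count bytes starting at offset from the string in buf.
 * Like the functions described above it returns nothing and reports
 * failure only through the global error indicator. */
void text_delete(char *buf, int offset, int count)
{
	size_t len, off, cnt;
	assert(buf != NULL);
	len = strlen(buf);
	if (offset < 0 || count < 0 || (size_t)offset > len) {
		dom_exception = ERR_INDEX_SIZE;
		return;
	}
	off = (size_t)offset;
	cnt = (size_t)count;
	if (cnt > len - off)
		cnt = len - off; /* clamp to the end of the data */
	/* Shift the tail, including the terminating NUL, over the gap. */
	memmove(buf + off, buf + off + cnt, len - off - cnt + 1);
}

/* Wrapper that turns the side-channel error into a return value:
 * 0 on success, -1 if the call set the error indicator. */
int text_delete_checked(char *buf, int offset, int count)
{
	dom_exception = ERR_NONE; /* clear any stale error first */
	text_delete(buf, offset, count);
	return dom_exception == ERR_NONE ? 0 : -1;
}
```

Clearing the indicator before the call is the important step. Without it, an error left over from an earlier operation would be mistaken for a failure of the current one.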

DOM specifications require that character data is UTF-16 encoded. DOMC does not support UTF-16. The locale-dependent 8-bit encoding is used instead. This permits common char * strings to be used in place of DOM_String *. Many UNIX and Linux systems support the UTF-8 locale. If a DOMC program is running in a UTF-8 locale, the offsets of these string operations refer to characters rather than bytes or individual multibyte sequences. Thus the behavior of these functions should be very similar or identical to that of a DOM implementation that uses UTF-16. Also note that UTF-8 support may be disabled for the sake of installation simplicity. It may be necessary to obtain the source code and rebuild DOMC if i18n support is required.
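
The difference between character offsets and byte offsets is easy to see with a short example. The helpers below are a hypothetical, locale-independent sketch that scans UTF-8 by hand. DOMC itself is described above as relying on the current locale, so this is not its implementation; utf8_byte_offset and utf8_length are illustrative names only.

```c
#include <assert.h>
#include <stddef.h>

/* Return the byte offset of the char_offset-th character in the UTF-8
 * string s, or -1 if the string has fewer characters than that.
 * Assumes s is valid UTF-8; no validation is attempted. */
long utf8_byte_offset(const char *s, long char_offset)
{
	long bytes = 0;
	long chars = 0;
	assert(s != NULL && char_offset >= 0);
	while (chars < char_offset) {
		if (s[bytes] == '\0')
			return -1;
		bytes++;
		/* Skip continuation bytes, which look like 10xxxxxx. */
		while (((unsigned char)s[bytes] & 0xC0) == 0x80)
			bytes++;
		chars++;
	}
	return bytes;
}

/* Count characters rather than bytes in a UTF-8 string. */
long utf8_length(const char *s)
{
	long chars = 0;
	assert(s != NULL);
	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			chars++;
	return chars;
}
```

For the string "h\xc3\xa9llo" (the word "héllo" in UTF-8), the character count is 5 while the byte count is 6, and the third character starts at byte offset 3 rather than 2.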

DOM_Text

The DOM_Text node inherits the structure of the DOM_CharacterData interface. It represents the character data between elements (and, much less frequently, the character data associated with an attribute of an element). The length of the text string may be retrieved with the DOM_CharacterData_getLength function.