Create a quick-and-dirty XML parser
Java Tip 128: Create a quick-and-dirty XML parser
By Steven R. Brandt
Parse valid XML using minimal code
XML is a popular data format for several reasons: it is human readable, self-describing, and portable. Unfortunately, many Java-based XML parsers are very large; for example, Sun Microsystems' jaxp.jar and parser.jar libraries are 1.4 MB each. If you are running with limited memory (for example, in a J2ME (Java 2 Platform, Micro Edition) environment), or bandwidth is at a premium (for example, in an applet), using those large parsers might not be a viable solution.
Those libraries are large partly because they offer a lot of functionality, perhaps more than you require. They validate XML against DTDs (document type definitions), possibly schemas, and more. However, you might already know that your application will receive valid XML. You might also have decided that you want just the UTF-8 character set. In that case, all you really want is event-based processing of XML elements and translation of standard XML entities; in other words, you want a nonvalidating parser.
Note: You can download this article's source code in Resources.
Why not just use SAX?
You could implement SAX (Simple API for XML) interfaces with limited functionality, throwing an exception named NotImplemented when you encountered something unnecessary.
Undoubtedly, you could develop something much smaller than the 1.4 MB jaxp.jar/parser.jar libraries. But instead, you can cut down the code size even more by defining your own classes. In fact, the package we construct here will be considerably smaller than the jar file containing the SAX interface definitions.
Our quick-and-dirty parser is event-based like the SAX parser. Also like the SAX parser, it lets you implement an interface to catch and process events corresponding to attributes and start/end element tags. Hopefully, those of you who have used SAX will find this parser familiar.
Limit XML functionality
Many people want XML's simple, self-describing textual data format. They want to easily pick out elements, attributes and their values, and elements' textual content. With that in mind, let's consider what functionality we need to preserve.
Our simple parsing package has just one class, QDParser, and one interface, DocHandler. The QDParser itself has one public static method, parse(DocHandler,Reader), which we will implement as a finite state machine.
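To make the finite-state-machine idea concrete, here is a minimal sketch rather than the article's actual QDParser. The class name TinyTagScanner, the method scan, and the event strings are all hypothetical. For brevity it reads from a String instead of a Reader, uses modern Java collections, and ignores attributes, comments, entities, and CDATA; it only shows how a few states can drive tag recognition one character at a time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the article's QDParser.
class TinyTagScanner {

    // The machine is either in character data, in a start tag, or in an end tag.
    private enum State { TEXT, START_TAG, END_TAG }

    /** Scans the input and returns events such as "start:a", "text:hi", "end:a". */
    static List<String> scan(String xml) {
        List<String> events = new ArrayList<String>();
        StringBuilder buf = new StringBuilder();
        State state = State.TEXT;
        for (int i = 0; i < xml.length(); i++) {
            char ch = xml.charAt(i);
            switch (state) {
                case TEXT:
                    if (ch == '<') {
                        // A '<' ends the current run of text, if there is one.
                        if (buf.length() > 0) events.add("text:" + buf);
                        buf.setLength(0);
                        state = State.START_TAG;
                    } else {
                        buf.append(ch);
                    }
                    break;
                case START_TAG:
                    if (ch == '/' && buf.length() == 0) {
                        // "</" means this is really an end tag.
                        state = State.END_TAG;
                    } else if (ch == '>') {
                        events.add("start:" + buf);
                        buf.setLength(0);
                        state = State.TEXT;
                    } else {
                        buf.append(ch);
                    }
                    break;
                case END_TAG:
                    if (ch == '>') {
                        events.add("end:" + buf);
                        buf.setLength(0);
                        state = State.TEXT;
                    } else {
                        buf.append(ch);
                    }
                    break;
            }
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(scan("<a><b>hi</b></a>"));
    }
}
```

A fuller machine would add states for attribute names and values, comments, and CDATA sections, and would call handler methods instead of collecting strings.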
Our limited functionality parser treats the DTD and processing instructions simply as comments, so it won't be confused by their presence nor use their content.
Because we won't process DOCTYPE, our parser cannot read custom entity definitions. We will have only the standard ones available: &amp;, &lt;, &gt;, &apos;, and &quot;. If this is a problem, you can insert code to expand custom definitions, as the source code shows. Alternatively, you could preprocess the document, replacing custom entity definitions with their expanded text before handing the document to the QDParser.
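As a rough illustration of what translating the standard entities involves, the sketch below expands the five predefined entities plus decimal and hexadecimal numeric references. The class and method names (EntitySketch, expand, lookup) are hypothetical and this is not the code from QDParser.java; the comment in lookup marks one place where custom definitions could be added.

```java
// Illustrative sketch; not the article's QDParser code.
class EntitySketch {

    /** Replaces entity references in s with the characters they stand for. */
    static String expand(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char ch = s.charAt(i);
            if (ch != '&') {
                out.append(ch);
                i++;
                continue;
            }
            int semi = s.indexOf(';', i);
            if (semi < 0) {
                throw new IllegalArgumentException("Unterminated entity at index " + i);
            }
            out.append(lookup(s.substring(i + 1, semi)));
            i = semi + 1;
        }
        return out.toString();
    }

    /** Maps one entity name (without '&' and ';') to its replacement text. */
    static String lookup(String name) {
        if (name.equals("amp")) return "&";
        if (name.equals("lt")) return "<";
        if (name.equals("gt")) return ">";
        if (name.equals("apos")) return "'";
        if (name.equals("quot")) return "\"";
        if (name.startsWith("#x")) {
            // Hexadecimal character reference, for example &#x41;
            return new String(Character.toChars(Integer.parseInt(name.substring(2), 16)));
        }
        if (name.startsWith("#")) {
            // Decimal character reference, for example &#65;
            return new String(Character.toChars(Integer.parseInt(name.substring(1))));
        }
        // A table of custom entity definitions could be consulted here.
        throw new IllegalArgumentException("Unknown entity: &" + name + ";");
    }

    public static void main(String[] args) {
        System.out.println(expand("1 &lt; 2 &amp;&amp; &#65; == &#x41;"));
    }
}
```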
Our parser also cannot support conditional sections, for example <![INCLUDE[ ... ]]> or <![IGNORE[ ... ]]>. Without the ability to define custom entity definitions in DOCTYPE, we don't really need this functionality anyway. We could process such sections, if any, before the data is sent to our limited-space application.
Because we won't process any attribute declarations, the XML specification requires that we consider all attribute types to be CDATA. Thus, we can simply use java.util.Hashtable instead of org.xml.sax.AttributeList to hold an element's attribute list. We have only name/value information to use in Hashtable, but we don't need a getType() method because it would always return CDATA anyway.
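The following sketch shows one way name/value pairs could be collected into a Hashtable when every attribute is treated as CDATA. It is an assumption-laden illustration, not the article's code: AttrSketch and parse are hypothetical names, the input is just the attribute portion of a start tag, and entity references inside values are left unexpanded.

```java
import java.util.Hashtable;

// Illustrative sketch; not the article's QDParser code.
class AttrSketch {

    /** Parses text such as  id="42" name='x'  into a name/value table. */
    static Hashtable<String, String> parse(String s) {
        Hashtable<String, String> attrs = new Hashtable<String, String>();
        int n = s.length();
        int i = 0;
        while (true) {
            // Skip white space between attributes.
            while (i < n && Character.isWhitespace(s.charAt(i))) i++;
            if (i >= n) break;

            int eq = s.indexOf('=', i);
            if (eq < 0) {
                throw new IllegalArgumentException("Missing '=' after index " + i);
            }
            String name = s.substring(i, eq).trim();

            // Skip white space between '=' and the opening quote.
            i = eq + 1;
            while (i < n && Character.isWhitespace(s.charAt(i))) i++;
            if (i >= n) {
                throw new IllegalArgumentException("Missing value for " + name);
            }

            // The value may be enclosed in single or double quotes.
            char quote = s.charAt(i);
            if (quote != '"' && quote != '\'') {
                throw new IllegalArgumentException("Unquoted value for " + name);
            }
            int close = s.indexOf(quote, i + 1);
            if (close < 0) {
                throw new IllegalArgumentException("Unterminated value for " + name);
            }
            attrs.put(name, s.substring(i + 1, close));
            i = close + 1;
        }
        return attrs;
    }

    public static void main(String[] args) {
        System.out.println(parse("id=\"42\" name='it\"s'").get("name"));
    }
}
```

Because no type information is kept, a handler simply looks values up by name, for example attrs.get("id").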
The lack of attribute declarations has other consequences as well. For example, the parser won't supply default attribute values. In addition, we can't automatically reduce white space using an NMTOKENS declaration. However, we could handle both issues when preparing our XML document, so the extra programming could be excluded from the application using the parser.
In fact, all the missing functionality can be compensated for by preparing the document appropriately. You can offload all the work associated with the missing features (if you want them) from the quick-and-dirty parser to the document preparation step.
Parser functionality
Enough about what the parser cannot do. What can it do?
It recognizes all the elements' start tags and end tags
It lists attributes, where attribute values can be enclosed in single or double quotes
It recognizes the <![CDATA[ ... ]]> construct
It recognizes the standard entities: &amp;, &lt;, &gt;, &quot;, and &apos;, as well as numeric entities
It maps lines ending in \r\n and \r to \n on input, in accordance with the XML Specification, Section 2.11
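The line-ending rule in the last item can be sketched in a few lines. This is only an illustration with hypothetical names (LineEndSketch, normalize); it works on a whole String, whereas a parser reading from a Reader would apply the same rule character by character.

```java
// Illustrative sketch; not the article's QDParser code.
class LineEndSketch {

    /** Maps every "\r\n" pair and every lone "\r" to a single "\n". */
    static String normalize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch == '\r') {
                out.append('\n');
                // Swallow the '\n' of a "\r\n" pair so it is not counted twice.
                if (i + 1 < s.length() && s.charAt(i + 1) == '\n') {
                    i++;
                }
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("one\r\ntwo\rthree\nfour"));
    }
}
```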
The parser does only minimal error checking and throws an Exception if it encounters unexpected syntax, such as unknown entities. Again, however, this parser does not validate; it assumes the XML document it receives is valid.
How to use this package
Using the quick-and-dirty XML parser is simple. First, implement the DocHandler interface. Then parse a file named config.xml:
// MyDocHandler is your own class implementing DocHandler
// FileReader comes from java.io
DocHandler doc = new MyDocHandler();
QDParser.parse(doc, new FileReader("config.xml"));
The source code includes two examples that provide full DocHandler implementations. The first DocHandler, called Reporter, simply reports all events to System.out as it reads them. You can test the Reporter with the sample XML file (config.xml).
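For readers without the download at hand, here is a sketch of what a Reporter-style handler might look like. The interface SketchHandler and its three methods are assumptions made for this illustration; the real DocHandler interface is defined in the article's source code and may declare different or additional methods. The stream is passed in so that the same class can print to System.out or to any other PrintStream.

```java
import java.io.PrintStream;
import java.util.Enumeration;
import java.util.Hashtable;

// Hypothetical callback interface; method names and signatures are assumptions.
interface SketchHandler {
    void startElement(String tag, Hashtable<String, String> attrs) throws Exception;
    void endElement(String tag) throws Exception;
    void text(String str) throws Exception;
}

// Reports every event it receives, one line per event.
class ReporterSketch implements SketchHandler {
    private final PrintStream out;

    ReporterSketch(PrintStream out) {
        this.out = out;
    }

    public void startElement(String tag, Hashtable<String, String> attrs) {
        out.print("start: " + tag + "\n");
        // Hashtable does not preserve the order in which attributes appeared.
        Enumeration<String> names = attrs.keys();
        while (names.hasMoreElements()) {
            String name = names.nextElement();
            out.print("  attr: " + name + "=" + attrs.get(name) + "\n");
        }
    }

    public void endElement(String tag) {
        out.print("end: " + tag + "\n");
    }

    public void text(String str) {
        out.print("text: " + str + "\n");
    }

    public static void main(String[] args) {
        // Simulate the events a parser might fire for <item id="1">hello</item>.
        ReporterSketch reporter = new ReporterSketch(System.out);
        Hashtable<String, String> attrs = new Hashtable<String, String>();
        attrs.put("id", "1");
        reporter.startElement("item", attrs);
        reporter.text("hello");
        reporter.endElement("item");
    }
}
```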
The second and more complex example, Conf, updates fields on an existing data structure that resides in memory. Conf uses the java.lang.reflect package to locate fields and objects described in config.xml. If you run this program, it will print diagnostic information telling you what objects it is updating and how. It prints error messages if the config file asks it to update nonexistent fields.
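The reflection technique can be sketched as follows. This is not the Conf example itself: ReflectSketch, Settings, and setField are hypothetical names, and only public int and String fields are handled. It shows the general pattern of looking a field up by name with java.lang.reflect.Field, converting the text value, and printing a diagnostic or an error.

```java
import java.lang.reflect.Field;

// Illustrative sketch; not the article's Conf example.
class ReflectSketch {

    /** Hypothetical in-memory structure that a config file might update. */
    public static class Settings {
        public int port = 80;
        public String host = "localhost";
    }

    /** Sets the named public field from a text value; returns false on failure. */
    static boolean setField(Object target, String name, String value) {
        try {
            Field field = target.getClass().getField(name);
            Class<?> type = field.getType();
            if (type == int.class) {
                field.setInt(target, Integer.parseInt(value));
            } else if (type == String.class) {
                field.set(target, value);
            } else {
                System.err.println("Unsupported type for field " + name + ": " + type.getName());
                return false;
            }
            System.out.println("Updated " + name + " to " + value);
            return true;
        } catch (NoSuchFieldException e) {
            // The config asked for a field that does not exist.
            System.err.println("No such field: " + name);
            return false;
        } catch (IllegalAccessException e) {
            System.err.println("Cannot access field: " + name);
            return false;
        }
    }

    public static void main(String[] args) {
        Settings settings = new Settings();
        setField(settings, "port", "8080");
        setField(settings, "timeout", "30"); // nonexistent field, reports an error
        System.out.println(settings.host + ":" + settings.port);
    }
}
```

In a handler, a call like setField could be made once per attribute of an element, using the attribute name as the field name.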
Modify this package
You'll likely want to modify this package for your own application. You might add custom entity definitions; line 180 in QDParser.java contains an "Insert custom entity definitions here" comment.
You could also add to the finite state machine's functionality, restoring functionality I have excluded here. If so, the source code's small size should make this task relatively easy.
Keep it small
The QDParser class occupies around 3 KB after you compile and pack it into a jar file. The source code itself, with comments, is just over 300 lines. This should be small enough for most space-constrained applications, yet it retains enough of the XML specification for you to enjoy most of its useful features.