
XML documents on the run, Part 1

By Dennis M. Sosnoski
SAX speeds through XML documents with parse-event streams
One of the oldest approaches to processing XML documents in Java also proves one of the fastest: parse-event streams. That approach became standardized in Java with the SAX (Simple API for XML) interface specification, later revised as SAX2 to include support for XML Namespaces.
Read the whole "XML Documents on the Run" series:
Part 1: SAX speeds through XML documents with parse-event streams
Part 2: Better SAX2 handling and the pull-parser alternative
Part 3: How do SAX2 parsers perform compared to new XMLPull parsers?
Event-stream processing offers other advantages beyond just speed. Because the parser processes the document on the fly, you can handle it as soon as you read its first part. Other approaches generally require you to parse the complete document before you start working with it -- fine if the document comes off a local disk drive, but if the document is sent from another system, parsing the complete document can cause significant delays.
Event-stream processing also eliminates any document size limits. In contrast, approaches that store the document's representation in memory can run out of space with very large documents. Setting a hard limit on a real-world document's size is often difficult, and potentially a major problem in many applications.
A note on the source code
This article features two example source code files, stock.jar and option.jar, both found in a downloadable zip file in Resources. Each jar file includes full example implementations, along with sample documents and test driver programs. To try an example, create a new directory, then extract the jar's files to that directory with jar xvf stock.jar or jar xvf option.jar. The readme.txt file gives instructions for setting up and running the test drivers.
The event view
Parsers with event-stream interfaces deliver a document one piece at a time. Think of the document's text as spread out in time, as it would be if read from a stream. The parser looks for significant document components (start and end tags, character data, and so on) in the text, generating parse events for each.
For example, here's a simple document:

<author>
 <first-name>Dennis</first-name>
 <last-name>Sosnoski</last-name>
</author>

The table shows the parse-event sequence a SAX2 parser would generate for this document (though the parser can divide up the character data reported by characters events differently than I've shown, as I discuss when I get to the actual code).
Parse events for document

Text processed      Parse event
""                  startDocument()
"<author>"          startElement("author")
"\n "               characters("\n ")
"<first-name>"      startElement("first-name")
"Dennis"            characters("Dennis")
"</first-name>"     endElement("first-name")
"\n "               characters("\n ")
"<last-name>"       startElement("last-name")
"Sosnoski"          characters("Sosnoski")
"</last-name>"      endElement("last-name")
"\n"                characters("\n")
"</author>"         endElement("author")

Notice in the table that the parse events include both start of element and end of element notifications -- important information for your program because it lets you track the document's nested structure. Without the end notifications, you couldn't know which elements or character data are part of the content of some earlier element. Also note that the parse events include all the character data in the document, even the whitespace sequences most people would consider unimportant.
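To make the event sequence concrete, here is a minimal sketch of a handler that records each event as a line of text. The class name EventTrace, its nested TraceHandler, and the trace() helper are illustrative names only, not part of the article's downloadable code, and the sketch assumes the parser bundled with the JDK, obtained through JAXP with namespace handling switched on. The parser may split character data into different chunks than the table shows, and it reports an endDocument() event once the document is finished.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class EventTrace {

    /** Records every parse event, in the order the parser reports it. */
    static class TraceHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startDocument() {
            events.add("startDocument()");
        }

        @Override
        public void endDocument() {
            events.add("endDocument()");
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("startElement(\"" + localName + "\")");
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("endElement(\"" + localName + "\")");
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Show newlines as \n so whitespace-only events stay visible.
            String text = new String(ch, start, length).replace("\n", "\\n");
            events.add("characters(\"" + text + "\")");
        }
    }

    /** Parses the given XML text and returns the recorded event list. */
    public static List<String> trace(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so localName carries the element name
        XMLReader reader = factory.newSAXParser().getXMLReader();
        TraceHandler handler = new TraceHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<author>\n <first-name>Dennis</first-name>\n"
                + " <last-name>Sosnoski</last-name>\n</author>";
        for (String event : trace(xml)) {
            System.out.println(event);
        }
    }
}
```

Running main prints one line per event, which you can compare against the table above.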
With the event-driven approach, your application turns control over to the parser, passing it the document (as a stream or URI/URL). The parser reads the document and breaks it into components, calling a method in a handler class supplied by your program to report each event. That isn't the only way of working with parse-event streams (as I'll show in Part 2), but it's the most widely used approach at present.
SAX and SAX2
Most event-stream parsers for XML in Java first used SAX. Unlike most other Internet and Web standards, SAX originally materialized without the official involvement of any sponsoring standards organization. Instead, it developed through a series of discussions, prototypes, and eventual consensus, coordinated by David Megginson on the XML-DEV mailing list.
SAX2 extends the SAX API to include full support for XML Namespaces. It also incorporates fixes to the original SAX interface. Most current parsers implement the SAX2 interface natively, though the original SAX interface is available if desired. New development should probably use the SAX2 interface even if Namespaces are not required, if for no other reason than to avoid deprecated APIs. The example code in this article follows that approach.
Event-driven programming
Enough background material; let's plunge into programming the interface. You first need a parser, in the form of an org.xml.sax.XMLReader instance. These parser instances are serially reusable, meaning you can use one for parsing as many documents as you like, but only one document at a time. Indeed, if you're writing a simple single-threaded application, you can simply use the same instance over and over.
Usually you get the XMLReader by calling the static org.xml.sax.helpers.XMLReaderFactory.createXMLReader() method (you need to have a SAX2 parser implementation in your classpath for this to work, of course; see Resources for a link to the SAX2 project page where you can find a list of parsers supporting SAX2). createXMLReader() lets you specify a particular implementation class, or you can simply use the default one defined by a system property.
Once you have the XMLReader, you can set and check a variety of options for the parser. You can also hook up various handler types for the parse events. Each handler type must implement a particular interface. For your purposes, you'll build on the handy handler base class defined by SAX2, org.xml.sax.helpers.DefaultHandler, which supplies default implementations for the full handler set. By using that as a base class, you can override only the methods you're interested in, while not worrying about the rest.
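As a rough illustration of these steps, the sketch below creates an XMLReader, hooks up a DefaultHandler subclass that overrides a single method, and parses a document held in a string. ElementCounter and countElements() are hypothetical names chosen for this example. Note that recent JDK releases mark XMLReaderFactory as deprecated in favor of the JAXP route, although the call still works, so treat this as a sketch of the approach described here rather than a recommendation.

```java
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class ElementCounter extends DefaultHandler {
    private int count;

    // Only startElement is overridden; DefaultHandler supplies the rest.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        count++;
    }

    public int getCount() {
        return count;
    }

    /** Parses the XML text and returns the number of elements seen. */
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    public static int countElements(String xml) throws Exception {
        // Uses the default parser implementation available at run time.
        XMLReader reader = XMLReaderFactory.createXMLReader();
        ElementCounter handler = new ElementCounter();
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler); // DefaultHandler rethrows fatal errors
        // parse() does not return until the whole document has been processed.
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.getCount();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<author><first-name>Dennis</first-name>"
                + "<last-name>Sosnoski</last-name></author>";
        System.out.println("elements: " + countElements(xml));
    }
}
```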
If you're working with Sun's JAXP (Java API for XML Processing) 1.1 or higher, you can get your SAX2 parser instance through the JAXP API. With this approach, you first call the static javax.xml.parsers.SAXParserFactory.newInstance() method to get a SAXParserFactory instance, then use that instance's newSAXParser() method to get a javax.xml.parsers.SAXParser instance. That gives you an interface for parsing a document using a specified DefaultHandler.
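The JAXP route might look like the following minimal sketch, which collects all character data in a document. JaxpText and textOf() are made-up names for illustration, and the document is read from an in-memory byte stream purely to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class JaxpText {

    /** Returns the concatenated character data of the given XML text. */
    public static String textOf(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser parser = factory.newSAXParser();

        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Character data may arrive in several chunks, so append each one.
                text.append(ch, start, length);
            }
        };

        // SAXParser.parse takes the input and the handler in a single call.
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(textOf("<author><first-name>Dennis</first-name></author>"));
    }
}
```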
Both approaches support a variety of options for the parser type you want to create, including whether or not you want to validate the parsed documents. Let's ignore most of those options (and the whole validation issue) for this introduction to SAX2 parsing, but you can find the full details on the official SAX2 and JAXP sites.
One option I won't ignore is the namespace handling. Directly created SAX2 parsers default to namespace-handling enabled, while those created through JAXP have it disabled by default. This option affects how element names are reported, even if you don't use namespaces in your documents. For the sample code in this article, I assume that namespace handling is enabled. The easiest way to enable it with JAXP is to call the SAXParserFactory.setNamespaceAware() method with a true value before creating your parser.
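The effect of this option on reported names can be seen with a small experiment. The sketch below, using the hypothetical names NamespaceNames and rootNames(), parses the same document with namespace handling on and off and returns the three name arguments passed to startElement for the root element. With namespace handling disabled, SAX2 only guarantees the qualified name; the namespace URI and local name may be empty, and exactly what they contain depends on the parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class NamespaceNames {

    /** Returns "uri|localName|qName" as reported for the root element. */
    public static String rootNames(String xml, boolean namespaceAware) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // JAXP factories start out with namespace handling disabled.
        factory.setNamespaceAware(namespaceAware);

        String[] result = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (result[0] == null) { // keep only the first (root) element
                    result[0] = uri + "|" + localName + "|" + qName;
                }
            }
        };
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return result[0];
    }

    public static void main(String[] args) throws Exception {
        // "urn:example" is an arbitrary namespace URI used only for this demo.
        String xml = "<a:author xmlns:a=\"urn:example\"/>";
        System.out.println("aware:   " + rootNames(xml, true));
        System.out.println("unaware: " + rootNames(xml, false));
    }
}
```

If your handler compares element names against localName, as the examples here do, it only works reliably when namespace handling is enabled.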
So far this doesn't sound too bad, but the interesting part starts when you call the parser with a document. The parser won't return from that call until parsing completes, but in the meantime, it'll call your handler methods for each and every parse event of the types you registered to handle. Your handler code makes sense of the call sequence and interprets it for your application.
Writing event-driven programs, as this handler technique is known, can be difficult. The problem: event streams turn the normal program structure inside out; instead of your program running the operation and requesting what it wants from the document, it hooks to an event stream hose that pushes the document at it, one small piece at a time.
Most applications need more structure than basic event streams provide. If you're working with an event-based parser, you must provide that structure by keeping state information that tracks your location in the document. Your state-information needs depend on the structure level you're working with. Using an event-based approach to handling your documents will be easiest when you work with simple structures within the document.
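One common way to keep that state is a stack of open element names plus a text buffer. The sketch below applies the idea to the small author document shown earlier; AuthorHandler and fullName() are illustrative names, not code from the article's jar files, and the sketch assumes namespace handling is enabled so that localName holds the element name.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class AuthorHandler extends DefaultHandler {
    // State: the names of the currently open elements, innermost first.
    private final Deque<String> open = new ArrayDeque<>();
    // State: character data accumulated since the last start tag.
    private final StringBuilder text = new StringBuilder();
    private String firstName;
    private String lastName;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(localName);
        text.setLength(0); // discard whitespace seen before this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // chunks may be split arbitrarily
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop(); // after this, peek() is the parent element
        if ("author".equals(open.peek())) {
            if ("first-name".equals(localName)) {
                firstName = text.toString();
            } else if ("last-name".equals(localName)) {
                lastName = text.toString();
            }
        }
        text.setLength(0);
    }

    /** Parses the XML text and returns "first-name last-name" of the author. */
    public static String fullName(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        AuthorHandler handler = new AuthorHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.firstName + " " + handler.lastName;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<author>\n <first-name>Dennis</first-name>\n"
                + " <last-name>Sosnoski</last-name>\n</author>";
        System.out.println(fullName(xml));
    }
}
```

The stack is what lets the handler tell a first-name element inside author apart from one that appears elsewhere; for flat documents, a single "current element" field would be enough.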
Watch the market
As an example, we'll work with a document that gives the history of stock trades over some span of time:



SUNW

86.24
500


MSFT

22.26
1000


For each trade, the document above includes the symbol for the traded stock, the time the trade occurred, the price, and the number of shares, all as content of specific elements. The above sample shows only two trades (taking place at some unspecified future date), but you could easily extend it to any number of trades over any time period. In particular, it makes sense to use such a format in a ticker stream that provides a feed of all trades on an exchange during a trading day.
Suppose you want to parse such a stream and track all stock information, including high, low, and last trade prices for the day, along with share and dollar volumes, for each stock traded. An event-stream parser approach should give you what you need -- you can handle each individual stock-trade element as it's received, immediately updating your accumulated inf


 


 