How does XML handle white-space in my documents?
All white-space, including linebreaks, TAB characters, and normal spaces, is passed by the parser unchanged to the application (browser, formatter, viewer, converter, etc). This applies even between ‘structural’ elements where no text can ever appear. If the information is available to the parser (eg from a DTD or Schema), it also identifies the context in which the white-space was found (element content, data content, or mixed content). This means it is the application's responsibility to decide what to do with such space, not the parser's:
* insignificant white-space between structural elements (space which occurs where only element content is allowed, ie between other elements, where text data never occurs) will get passed to the application (in SGML this white-space gets suppressed, which is why you can put all that extra space in HTML documents and not worry about it)
* significant white-space (space which occurs within elements which can contain text and markup mixed together, usually mixed content or PCDATA) will still get passed to the application exactly as under SGML. It is the application's responsibility to handle it correctly.
The parser must inform the application that white-space has occurred in element content, if it can detect it. (Users of SGML will recognize that this information is not in the ESIS, but it is in the Grove.)
My title for
In the example above, the application will receive all the pretty-printing linebreaks, TABs, and spaces between the elements as well as those embedded in the chapter title. It is the function of the application, not the parser, to decide which type of white-space to discard and which to retain. Many XML applications have configurable options to allow programmers or users to control how such white-space is handled.
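As a minimal sketch of this behaviour, the following Python fragment parses a pretty-printed document with the standard library's xml.etree.ElementTree and inspects what the application actually receives. The document string, the element names, and the choice to normalise the title are assumptions made for illustration only; other parsers and applications expose the same white-space through different interfaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical pretty-printed document (names and text are illustrative).
DOC = (
    "<chapter>\n"
    "\t<title>\n"
    "\t\tMy title for\n"
    "\t\tChapter 1.\n"
    "\t</title>\n"
    "\t<para>text</para>\n"
    "</chapter>"
)

root = ET.fromstring(DOC)
title = root.find("title")

# White-space between <chapter> and <title> is delivered, not suppressed.
between_elements = root.text

# White-space embedded in the title is delivered exactly as written.
raw_title = title.text

# The application, not the parser, decides what to keep.
# Here the choice is to collapse runs of white-space in the title.
clean_title = " ".join(raw_title.split())

print(repr(between_elements))
print(repr(raw_title))
print(clean_title)
```

The parser hands over every linebreak and TAB; the last step is purely an application-level decision and could just as well preserve the space.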
Which parts of an XML document are case-sensitive?
All of it, both markup and text. This is significantly different from HTML and most other SGML applications. It was done to allow markup in non-Latin-alphabet languages, and to obviate problems with case-folding in writing systems which are caseless.
* Element type names are case-sensitive: you must follow whatever combination of upper- or lower-case you use to define them (either by first usage or in a DTD or Schema). So you can't say <BODY>…</body>: upper- and lower-case must match; thus <Img/>, <IMG/>, and <img/> are three different element types;
* For well-formed XML documents with no DTD, the first occurrence of an element type name defines the casing;
* Attribute names are also case-sensitive, for example the two width attributes in <PIC width="7in"/> and <PIC WIDTH="6in"/> (if they occurred in the same file) are separate attributes, because of the different case of width and WIDTH;
* Attribute values are also case-sensitive. CDATA values (eg Url="MyFile.SGML") always have been, but NAME types (ID and IDREF attributes, and token list attributes) are now case-sensitive as well;
* All general and parameter entity names (eg &Aacute;), and your data content (text), are case-sensitive as always.
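The rules in the list above can be checked with any conforming parser. The sketch below uses Python's standard xml.etree.ElementTree; the helper name is_well_formed and the sample strings are assumptions for illustration. Note that it places both width and WIDTH on a single element, which is only possible because XML treats them as different attribute names.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Start-tag and end-tag must match in case.
mismatched = is_well_formed("<BODY>text</body>")
matched = is_well_formed("<body>text</body>")

# Attribute names differing only in case are separate attributes.
pic = ET.fromstring('<PIC width="7in" WIDTH="6in"/>')
attrs = dict(pic.attrib)

# Element type names differing only in case are separate element types.
doc = ET.fromstring("<doc><Img/><IMG/><img/></doc>")
names = [child.tag for child in doc]

print(mismatched, matched)
print(attrs)
print(names)
```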
How can I make my existing HTML files work in XML?
Either convert them to conform to some new document type (with or without a DTD or Schema) and write a stylesheet to go with them; or edit them to conform to XHTML. It is necessary to convert existing HTML files because XML does not permit end-tag minimisation (a missing </p>, etc), unquoted attribute values, and a number of other SGML shortcuts which have been normal in most HTML DTDs. However, many HTML authoring tools already produce almost (but not quite) well-formed XML.
You may be able to convert HTML to XHTML using Dave Raggett's HTML Tidy program, which can clean up some of the formatting mess left behind by inadequate HTML editors, and even separate out some of the formatting to a stylesheet, but there is usually still some hand-editing to do.
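To see why conversion is needed, the sketch below feeds two common HTML shortcuts to an XML parser (Python's standard xml.etree.ElementTree) and then a cleaned-up equivalent. The helper name xml_error and the sample fragments are assumptions for illustration. This only tests well-formedness; it does not perform the conversion itself and is not a substitute for a tool such as HTML Tidy.

```python
import xml.etree.ElementTree as ET

def xml_error(text):
    """Return the parser's error message, or None if text is well-formed."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as exc:
        return str(exc)

# End-tag minimisation: the paragraphs are never closed.
omitted_end_tags = "<div><p>First paragraph<p>Second paragraph</div>"

# Unquoted attribute value.
unquoted_value = "<div><img src=logo.png /></div>"

# The same content with every element closed and every value quoted.
cleaned = (
    "<div><p>First paragraph</p><p>Second paragraph</p>"
    '<img src="logo.png" alt="logo"/></div>'
)

for sample in (omitted_end_tags, unquoted_value, cleaned):
    print(xml_error(sample) or "well-formed")
```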
Is there an XML version of HTML?
Yes, the W3C recommends using XHTML, which is ‘a reformulation of HTML 4 in XML 1.0’. This specification defines HTML as an XML application, and provides three DTDs corresponding to the ones defined by HTML 4.* (Strict, Transitional, and Frameset).
The semantics of the elements and their attributes are as defined in the W3C Recommendation for HTML 4. These semantics provide the foundation for future extensibility of XHTML. Compatibility with existing HTML browsers is possible by following a small set of guidelines (see the W3C site).
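Because XHTML is an XML application, an ordinary XML parser can read it. The sketch below parses a small XHTML 1.0 Strict document with Python's standard xml.etree.ElementTree. The document content is an assumption for illustration; the DOCTYPE and namespace are those of XHTML 1.0 Strict. This parser checks well-formedness only: it does not fetch the DTD or validate against it.

```python
import xml.etree.ElementTree as ET

# Illustrative XHTML 1.0 Strict document (content is hypothetical).
XHTML = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
    '  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n'
    "  <head><title>Sample</title></head>\n"
    "  <body><p>Hello<br/>world</p></body>\n"
    "</html>"
)

# ElementTree writes namespaced names as {namespace-uri}local-name.
NS = "{http://www.w3.org/1999/xhtml}"

root = ET.fromstring(XHTML)
page_title = root.find(f"{NS}head/{NS}title").text
body_children = [el.tag for el in root.find(f"{NS}body")]

print(root.tag)
print(page_title)
print(body_children)
```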