XML Interviews Question page7
If XML is just a subset of SGML, can I use XML files directly with existing SGML tools?
Yes, provided you use up-to-date SGML software which knows about the WebSGML Adaptations TC to ISO 8879 (the features needed to support XML, such as the variant form for EMPTY elements; some aspects of the SGML Declaration such as NAMECASE GENERAL NO; multiple attribute token list declarations, etc). An alternative is to use an SGML DTD to let you create a fully-normalised SGML file, but one which does not use empty elements; and then remove the DocType Declaration so it becomes a well-formed DTDless XML file. Most SGML tools now handle XML files well, and provide an option switch between the two standards.
Can XML use non-Latin characters?
Yes, the XML Specification explicitly says XML uses ISO 10646, the international standard character repertoire which covers most known languages. Unicode is an identical repertoire, and the two standards track each other. The spec says (2.2): ‘All XML processors must accept the UTF-8 and UTF-16 encodings of ISO 10646…’. There is a Unicode FAQ at http://www.unicode.org/faq/FAQ.
UTF-8 is an encoding of Unicode into 8-bit characters: the first 128 are the same as ASCII, and higher-order characters are used to encode anything else from Unicode into sequences of between 2 and 6 bytes. UTF-8 in its single-octet form is therefore the same as ISO 646 IRV (ASCII), so you can continue to use ASCII for English or other languages using the Latin alphabet without diacritics. Note that UTF-8 is incompatible with ISO 8859-1 (ISO Latin-1) after code point 127 decimal (the end of ASCII).
UTF-16 is an encoding of Unicode into 16-bit characters, which lets it represent 16 planes. UTF-16 is incompatible with ASCII because it uses two 8-bit bytes per character (four bytes above U+FFFF).
What's a Document Type Definition (DTD) and where do I get one?
A DTD is a description in XML Declaration Syntax of a particular type or class of document. It sets out what names are to be used for the different types of element, where they may occur, and how they all fit together. (A question C.16, Schema does the same thing in XML Document Syntax, and allows more extensive data-checking.) For example, if you want a document type to be able to describe Lists which contain Items, the relevant part of your DTD might contain something like this:
<!ELEMENT List (Item)+>
<!ELEMENT Item (#PCDATA)>
This defines a list as an element type containing one or more items (that's the plus sign); and it defines items as element types containing just plain text (Parsed Character Data or PCDATA). Validators read the DTD before they read your document so that they can identify where every element type ought to come and how each relates to the other, so that applications which need to know this in advance (most editors, search engines, navigators, and databases) can set themselves up correctly.
The example above lets you create lists like:
(The indentation in the example is just for legibility while editing: it is not required by XML.)
A DTD provides applications with advance notice of what names and structures can be used in a particular document type. Using a DTD and a validating editor means you can be certain that all documents of that particular type will be constructed and named in a consistent and conformant manner.
DTDs are not required for processing the tip in question Bwell-formed documents, but they are needed if you want to take advantage of XML's special attribute types like the built-in ID/IDREF cross-reference mechanism; or the use of default attribute values; or references to external non-XML files (‘Notations’); or if you simply want a check on document validity before processing.
There are thousands of DTDs already in existence in all kinds of areas (see the SGML/XML Web pages for pointers). Many of them can be downloaded and used freely; or you can write your own (see the question on creating your own DTD. Old SGML DTDs need to be converted to XML for use with XML systems: read the question on converting SGML DTDs to XML, but most popular SGML DTDs are already available in XML form. The alternatives to a DTD are various forms of question C.16, Schema. These provide more extensive validation features than DTDs, including character data content validation.
Does XML let me make up my own tags?
No, it lets you make up names for your own element types. If you think tags and elements are the same thing you are already in considerable trouble: read the rest of this question carefully.
How do I create my own document type?
Document types usually need a formal description, either a DTD or a Schema. Whilst it is possible to process well-formed XML documents without any such description, trying to create them without one is asking for trouble. A DTD or Schema is used with an XML editor or API interface to guide and control the construction of the document, making sure the right elements go in the right places.
Creating your own document type therefore begins with an analysis of the class of documents you want to describe: reports, invoices, letters, configuration files, credit-card verification requests, or whatever. Once you have the structure correct, you write code to express this formally, using DTD or Schema syntax.