Format overview
Author: Daniel Carrera
An OpenDocument file is a ZIP archive that contains several XML files. The exact files and directories in the archive depend on the content of the document (e.g. images, macros, etc.). A typical document, when unzipped, has the following contents:
content.xml
META-INF/manifest.xml
meta.xml
mimetype
Pictures/
settings.xml
styles.xml
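As a minimal sketch, the listing above can be reproduced with Python's standard zipfile module. The package here is a stand-in built in memory with placeholder payloads, and the names SAMPLE_ENTRIES, build_sample_package and list_package (as well as the picture file name) are made up for this illustration.

```python
import io
import zipfile

# Entry names follow the listing above; the payloads are placeholders.
SAMPLE_ENTRIES = {
    "mimetype": "application/vnd.oasis.opendocument.text",
    "content.xml": "<placeholder/>",
    "styles.xml": "<placeholder/>",
    "meta.xml": "<placeholder/>",
    "settings.xml": "<placeholder/>",
    "META-INF/manifest.xml": "<placeholder/>",
    "Pictures/example.png": "not a real image",  # hypothetical file name
}


def build_sample_package() -> bytes:
    """Build a small stand-in for an OpenDocument package, entirely in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name, payload in SAMPLE_ENTRIES.items():
            package.writestr(name, payload)
    return buffer.getvalue()


def list_package(source) -> list:
    """Return the sorted entry names of a ZIP package (path or file object)."""
    with zipfile.ZipFile(source) as package:
        return sorted(package.namelist())


if __name__ == "__main__":
    for name in list_package(io.BytesIO(build_sample_package())):
        print(name)
```

For a real document, list_package should also accept a path to the file on disk, since zipfile.ZipFile takes either a path or a file object.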
We look at these in turn:
content.xml
This is the most important file. It carries the actual content of the document (except for binary data, like images). The base format is inspired by HTML, and though far more complex, it should be reasonably legible to humans:
<text:h text:style-name="Heading_2">
  This is a title
</text:h>
<text:p text:style-name="Text_body"/>
<text:p text:style-name="Text_body">
  This is a paragraph. The formatting (font, colour, etc.) is
  specified in the Text_body style. The empty text:p tag above
  is a blank paragraph (i.e. an empty line).
</text:p>
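To show how legible the markup is to a program as well, here is a rough sketch that pulls headings and paragraphs out of a fragment like the one above using xml.etree.ElementTree. The sample is wrapped in an office:text element so that the namespace prefixes are declared; the function name extract_blocks is an assumption for this example.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# A shortened version of the fragment above, wrapped so the prefixes resolve.
SAMPLE_CONTENT = f"""
<office:text xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">
  <text:h text:style-name="Heading_2">This is a title</text:h>
  <text:p text:style-name="Text_body"/>
  <text:p text:style-name="Text_body">This is a paragraph.</text:p>
</office:text>
""".strip()


def extract_blocks(xml_source: str) -> list:
    """Return (kind, style name, text) for every heading and paragraph."""
    wanted = {f"{{{TEXT_NS}}}h": "h", f"{{{TEXT_NS}}}p": "p"}
    blocks = []
    for element in ET.fromstring(xml_source).iter():
        if element.tag in wanted:
            style = element.get(f"{{{TEXT_NS}}}style-name")
            text = "".join(element.itertext()).strip()
            blocks.append((wanted[element.tag], style, text))
    return blocks


if __name__ == "__main__":
    for kind, style, text in extract_blocks(SAMPLE_CONTENT):
        print(kind, style, repr(text))
```

The empty text:p element comes back with an empty string, which matches its role as a blank line.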
META-INF/manifest.xml
The manifest file contains a list of all the files in the ZIP archive. The contents might look like this:
<manifest:file-entry manifest:media-type="image/png" manifest:full-path="Pictures/10ECF14403.png"/>
<manifest:file-entry manifest:media-type="text/xml" manifest:full-path="content.xml"/>
<manifest:file-entry manifest:media-type="text/xml" manifest:full-path="styles.xml"/>
The META-INF directory and its manifest follow the layout conventions of JAR archives. This is another example of OpenDocument reusing well-established standards instead of reinventing the wheel.
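The manifest entries shown earlier can be read into a simple path-to-media-type mapping. This is a sketch under the assumption that the entries sit directly inside a manifest:manifest root element; read_manifest is a name chosen for this example.

```python
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"

SAMPLE_MANIFEST = f"""
<manifest:manifest xmlns:manifest="{MANIFEST_NS}">
  <manifest:file-entry manifest:media-type="image/png"
                       manifest:full-path="Pictures/10ECF14403.png"/>
  <manifest:file-entry manifest:media-type="text/xml"
                       manifest:full-path="content.xml"/>
  <manifest:file-entry manifest:media-type="text/xml"
                       manifest:full-path="styles.xml"/>
</manifest:manifest>
""".strip()


def read_manifest(xml_source: str) -> dict:
    """Map each listed path to its declared media type."""
    root = ET.fromstring(xml_source)
    entries = {}
    for entry in root.findall(f"{{{MANIFEST_NS}}}file-entry"):
        path = entry.get(f"{{{MANIFEST_NS}}}full-path")
        entries[path] = entry.get(f"{{{MANIFEST_NS}}}media-type")
    return entries


if __name__ == "__main__":
    for path, media_type in read_manifest(SAMPLE_MANIFEST).items():
        print(f"{path}: {media_type}")
```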
meta.xml
This file contains the document metadata: for example, the author, the "last modified by" name, and the date of last modification. The contents look somewhat like this:
<meta:creation-date>
  2003-09-10T15:31:11
</meta:creation-date>
<dc:creator>Daniel Carrera</dc:creator>
<dc:date>
  2005-06-29T22:02:06
</dc:date>
<dc:language>es-ES</dc:language>
<meta:document-statistic meta:table-count="6" meta:object-count="0"
  meta:page-count="59" meta:paragraph-count="676" meta:image-count="2"
  meta:word-count="16701" meta:character-count="98757"/>
The names of the dc: tags are taken from the Dublin Core metadata standard. The dates follow the ISO 8601 standard.
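Because the dates are ISO 8601 strings, they can be parsed directly by standard library tools. The sketch below reads a few fields from a metadata fragment like the one above; the wrapping office:document-meta and office:meta elements are added so the sample is well-formed, and read_meta is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
DC_NS = "http://purl.org/dc/elements/1.1/"
PREFIXES = {"meta": META_NS, "dc": DC_NS}

SAMPLE_META = f"""
<office:document-meta xmlns:office="{OFFICE_NS}"
                      xmlns:meta="{META_NS}" xmlns:dc="{DC_NS}">
  <office:meta>
    <meta:creation-date>2003-09-10T15:31:11</meta:creation-date>
    <dc:creator>Daniel Carrera</dc:creator>
    <dc:date>2005-06-29T22:02:06</dc:date>
    <dc:language>es-ES</dc:language>
    <meta:document-statistic meta:page-count="59" meta:word-count="16701"/>
  </office:meta>
</office:document-meta>
""".strip()


def read_meta(xml_source: str) -> dict:
    """Extract creator, last-modified date and word count from meta.xml text."""
    root = ET.fromstring(xml_source)
    stats = root.find(".//meta:document-statistic", PREFIXES)
    modified = root.findtext(".//dc:date", namespaces=PREFIXES)
    return {
        "creator": root.findtext(".//dc:creator", namespaces=PREFIXES),
        # Strip whitespace first: the date may sit on its own line in the file.
        "modified": datetime.fromisoformat(modified.strip()),
        "word_count": int(stats.get(f"{{{META_NS}}}word-count")),
    }


if __name__ == "__main__":
    print(read_meta(SAMPLE_META))
```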
mimetype
This is a one-line file containing the MIME type of the document. For a text document that would be:
application/vnd.oasis.opendocument.text
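A program can use this entry to identify a package before parsing any XML. The following is a sketch only: it builds a stand-in package in memory and stores the mimetype entry uncompressed, which is what the OpenDocument packaging rules ask for as far as I know. The helper names build_package and document_kind are assumptions.

```python
import io
import zipfile

ODF_PREFIX = "application/vnd.oasis.opendocument."


def build_package(mimetype: str) -> bytes:
    """Create an in-memory stand-in package with only a few entries."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        # Written first and without compression (assumed packaging convention).
        package.writestr("mimetype", mimetype, compress_type=zipfile.ZIP_STORED)
        package.writestr("content.xml", "<placeholder/>")
    return buffer.getvalue()


def document_kind(data: bytes) -> str:
    """Return the part of the MIME type after the OpenDocument prefix."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        mimetype = package.read("mimetype").decode("ascii").strip()
    if not mimetype.startswith(ODF_PREFIX):
        raise ValueError(f"not an OpenDocument MIME type: {mimetype!r}")
    return mimetype[len(ODF_PREFIX):]


if __name__ == "__main__":
    sample = build_package("application/vnd.oasis.opendocument.text")
    print(document_kind(sample))
```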
Pictures/
This is a directory that contains images in common image formats such as JPEG and PNG. They are referenced from content.xml in a way similar to the <img> tag in HTML:
<draw:image xlink:href="Pictures/1D67595BF2E.png" xlink:type="simple" xlink:show="embed" xlink:actuate="onLoad"/>
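Collecting the image references from a piece of content is then a matter of reading the xlink:href attributes. A minimal sketch, assuming the draw:image element sits inside a draw:frame wrapper (added here so the namespaces are declared); image_paths is a name invented for the example.

```python
import xml.etree.ElementTree as ET

DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"

SAMPLE_FRAGMENT = f"""
<draw:frame xmlns:draw="{DRAW_NS}" xmlns:xlink="{XLINK_NS}">
  <draw:image xlink:href="Pictures/1D67595BF2E.png" xlink:type="simple"
              xlink:show="embed" xlink:actuate="onLoad"/>
</draw:frame>
""".strip()


def image_paths(xml_source: str) -> list:
    """Return the href of every draw:image element, in document order."""
    root = ET.fromstring(xml_source)
    return [
        image.get(f"{{{XLINK_NS}}}href")
        for image in root.iter(f"{{{DRAW_NS}}}image")
    ]


if __name__ == "__main__":
    print(image_paths(SAMPLE_FRAGMENT))
```

Each returned path is relative to the root of the ZIP archive, so it can be passed to the archive reader to fetch the image bytes.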
settings.xml
This file holds settings such as the zoom factor or the cursor position, which are neither content nor layout.
styles.xml
This file contains style information. Styles include things like font size, colour, page width, and any kind of formatting.
OpenDocument provides a strong separation between content (in content.xml) and formatting (in styles.xml). The style types include:
- Paragraph styles.
- Page styles.
- Character styles.
- Frame styles.
- List styles.
In OpenDocument all formatting is done through styles. Even "manual" formatting is implemented through styles (the application dynamically makes new styles as needed).
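To make the idea of "manual" formatting as generated styles more concrete, here is a hedged sketch that reads style definitions and their parents. The sample assumes style:style elements with style:name, style:family and style:parent-style-name attributes; the plain styles wrapper element and the names P1 and T1 are illustrative stand-ins for what an application might generate.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"

# "P1" stands in for a style an application created for manual formatting;
# it inherits from the named style "Text_body".
SAMPLE_STYLES = f"""
<styles xmlns:style="{STYLE_NS}">
  <style:style style:name="Text_body" style:family="paragraph"/>
  <style:style style:name="Heading_2" style:family="paragraph"/>
  <style:style style:name="P1" style:family="paragraph"
               style:parent-style-name="Text_body"/>
  <style:style style:name="T1" style:family="text"/>
</styles>
""".strip()


def read_styles(xml_source: str) -> dict:
    """Map each style name to a (family, parent style name or None) pair."""
    styles = {}
    for style in ET.fromstring(xml_source).iter(f"{{{STYLE_NS}}}style"):
        name = style.get(f"{{{STYLE_NS}}}name")
        family = style.get(f"{{{STYLE_NS}}}family")
        parent = style.get(f"{{{STYLE_NS}}}parent-style-name")
        styles[name] = (family, parent)
    return styles


if __name__ == "__main__":
    for name, (family, parent) in read_styles(SAMPLE_STYLES).items():
        print(name, family, parent)
```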
Why use a ZIP file?
File size
The most obvious benefit is file size. Because they are compressed, OpenDocument files are normally much smaller than equivalent Microsoft .doc files. XML is repetitive text that compresses well, so the longer the document, the greater the benefit.
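The effect of document length on compression can be illustrated with a small experiment. This sketch uses zlib (the same DEFLATE method ZIP archives normally use) on synthetic, highly repetitive markup, so real documents will compress less dramatically; synthetic_content and compression_ratio are hypothetical helper names.

```python
import zlib


def synthetic_content(paragraphs: int) -> bytes:
    """Generate repetitive paragraph markup as a stand-in for content.xml."""
    body = "".join(
        f'<text:p text:style-name="Text_body">Paragraph number {i}.</text:p>'
        for i in range(paragraphs)
    )
    return body.encode("utf-8")


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data)) / len(data)


if __name__ == "__main__":
    for count in (10, 1000):
        ratio = compression_ratio(synthetic_content(count))
        print(f"{count} paragraphs: ratio {ratio:.3f}")
```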
Cleaner XML
You can embed binary data (like images) in a way that keeps the XML code clean and understandable. You can put an image in the Pictures directory and refer to it from inside the document with the syntax:
<draw:image xlink:href="Pictures/12010E7CF14403.png" xlink:type="simple" xlink:show="embed" xlink:actuate="onLoad"/>
If the format didn't use a ZIP file, it would have to embed the binary data directly into the file. That might look somewhat like this:
<image content="AAAAB3NzaC1kc3MAAACBALhIE5ZbPWD
uB44Qo/+DGECA8u1Jl4QdwubYgiweQQX4ZeD6LduuZk+HMW
bfvGpADeOAzS7Aw1nBPbp1F7AKo9LpGBwv/70dX0HE5hm5X
2JKXhzom4M2IPtv9BV7qKXvqdibltAPX6kTWS7Bp/o3krNL
zNsV6zkuMEETFz3Rmt2hAAAAFQCfLFFL0ouPHx3wKtgyeL5
aUO8W+QAAAIBf7MGYYn8ylLOdAs4LX00pQpaAEuwjYalnxy
ZUMpBnBhwjOkY0OH10m3hASu/jnTvbJchm43NK0YyvW1zCa
YGKrUllFrfh4pamr4Ov3zmoL7BUK0zGwowrD6ILd3OroNch
pAetV0YMJ2FkSfPlDuBaddMhymtWoFDLQ9QEzkbaTgAAAIA
sxyNNHT7MN8VaF8GZWLUaq+dl9rj2wgSPYHkDV/EvGqgB0e
GNEXzty3X2GqAg9z10Qj1W2Ua7FnC57kUdSG6B68Ei1Qkv/
N0yGFSZ3xbnP9hxFa2H/2DDPAftuBJT8MUZJHKttXVZ6jJ/
3aXEMBcL8eXw6kWroOR/L7NUxHpvyw"/>
(In reality the embedded data would be much longer.)
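Inline embedding also costs space. Assuming the hypothetical inline format used Base64 (the text above does not name an encoding, so this is a guess based on how the sample looks), every 3 bytes of binary data become 4 characters of text. A small sketch with a made-up inline_size helper:

```python
import base64


def inline_size(binary: bytes) -> int:
    """Number of characters needed to embed the data as Base64 text."""
    return len(base64.b64encode(binary))


if __name__ == "__main__":
    fake_image = bytes(range(256)) * 40  # 10,240 bytes standing in for an image
    encoded = inline_size(fake_image)
    overhead = encoded / len(fake_image) - 1
    print(f"{len(fake_image)} bytes -> {encoded} characters (+{overhead:.0%})")
```

Storing the image as a separate ZIP entry avoids this growth of roughly one third and keeps the XML readable.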