Why PDF is an awful format and your conversions never work

This article does not explain how to do good PDF conversions. It explains why doing them well is nearly impossible.

The PDF format is based on a language called PostScript, developed by Adobe as a device-independent language for printing (meaning any PostScript printer can print any valid PostScript document). PostScript is actually very useful: it makes effects that were previously very complicated not only possible but easy, such as vector-based fonts (fonts that can be drawn at any size without distortion) and text drawn at any angle. PDF inherited all of these features.


But then something weird happened. People took this format, which was designed for printing, and started using it to distribute documents. This works very well if all you're doing is printing them. It's probably the best format for that, since few other formats do as much to guarantee that a document looks the same on every computer. For anything else, it's useless.

Content vs. Appearance
There are two basic ways of describing digital documents: content based (logical) and appearance based (visual). Content based documents describe what the document logically contains, such as where the paragraphs are and where the chapter headings go. They carry no information about where lines or pages end, because that is part of the visual description. A separate file called a stylesheet describes what these things look like, for example "indent paragraphs this much" and "centre the chapter headings".

Appearance based documents describe only what the document looks like. For example: write "Chapter One" at this place at 20pt, write "hello" at position x,y, then "world" at position z,y, and so on. These formats do carry information about line breaks and page breaks.
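To make the difference concrete, here is a minimal sketch of the same text in both styles. The data layouts, the coordinates and the reading_order helper are all made up for illustration; real formats are far more involved.

```python
# Hypothetical, simplified models of the two kinds of document.

# Content based: says what each piece of text *is*, not where it goes.
content_doc = [
    {"type": "heading", "text": "Chapter One"},
    {"type": "paragraph", "text": "hello world"},
]

# Appearance based: says where each piece of text goes, not what it is.
# Each entry is (x, y, font size in points, text); the numbers are invented.
appearance_doc = [
    (72, 720, 20, "Chapter One"),
    (72, 690, 12, "hello"),
    (105, 690, 12, "world"),
]


def reading_order(runs):
    """Recover the text by sorting runs top to bottom, then left to right."""
    # PDF-style coordinates grow upwards, so a larger y is higher on the page.
    ordered = sorted(runs, key=lambda run: (-run[1], run[0]))
    return [text for _, _, _, text in ordered]


print(reading_order(appearance_doc))
```

Note that the appearance based version never says that "Chapter One" is a heading or that "hello" and "world" belong to the same paragraph. All it offers is positions and sizes.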

So where does PDF fall? PDF is fully appearance based. When you open a PDF, the reader has no idea where the paragraphs are; it only knows where to put each word. Only a human can effectively work out where paragraphs end, because we know that they are either indented or separated by a vertical gap.
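A converter has to turn those two human rules (indentation and vertical gaps) into guesswork. The sketch below shows one way that guess might look. The Line class, the margin and line height defaults, and the 1.5 multiplier are assumptions chosen for this example, not values taken from any real converter.

```python
from dataclasses import dataclass


@dataclass
class Line:
    x: float   # left edge of the line
    y: float   # baseline; larger means higher on the page
    text: str


def guess_paragraphs(lines, left_margin=72.0, line_height=14.0):
    """Group lines into paragraphs using indentation and vertical gaps."""
    paragraphs = []
    current = []
    previous = None
    for line in sorted(lines, key=lambda item: -item.y):
        # Rule 1: a line that starts right of the margin opens a paragraph.
        indented = line.x > left_margin + 1.0
        # Rule 2: so does a line that follows an unusually large gap.
        gap = previous is not None and (previous.y - line.y) > 1.5 * line_height
        if current and (indented or gap):
            paragraphs.append(" ".join(current))
            current = []
        current.append(line.text)
        previous = line
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


page = [
    Line(90, 700, "First paragraph, first line,"),
    Line(72, 686, "which wraps onto a second line."),
    Line(90, 672, "Second paragraph, marked by an indent."),
    Line(72, 644, "Third paragraph, marked by a blank line."),
]

for paragraph in guess_paragraphs(page):
    print(paragraph)
```

This handles the tidy page above, but the same rules misfire on block quotes, centred text, poetry, multiple columns, or a paragraph that continues on the next page, which is the heart of the problem.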

At the other end, we have Epub and Mobi. They are fully content based*. When you convert from PDF, the converting program has to guess where the paragraphs are and try to remove the headers and footers from each page. This may look trivial to a human, but it is pretty much impossible for a computer to do reliably.

* They are both based on HTML, which does support some appearance based instructions, such as changing the font size or inserting a line break, but these should be used sparingly.
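The header and footer removal mentioned above is just as much of a guess. One common idea is to drop any text that shows up at the same height on most pages. The sketch below assumes each page is a list of (y, text) tuples and uses an arbitrary 60% threshold; both are assumptions made for illustration.

```python
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that repeat at the same height on most pages."""
    # Count on how many pages each (y, text) pair appears.
    counts = Counter()
    for page in pages:
        for key in set(page):
            counts[key] += 1
    threshold = min_fraction * len(pages)
    repeated = {key for key, seen in counts.items() if seen >= threshold}
    return [[line for line in page if line not in repeated] for page in pages]


pages = [
    [(760, "My Book"), (700, "It was a dark night."), (40, "1")],
    [(760, "My Book"), (700, "The rain fell."), (40, "2")],
    [(760, "My Book"), (700, "Morning came."), (40, "3")],
]

cleaned = strip_repeated_lines(pages)
print(cleaned[0])
```

The running title is removed, but the page numbers survive because their text changes on every page. Catching those needs yet another rule, and every extra rule risks deleting real content.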
