Perfect EPUB

Posted by Andy at 11:15 PM

What is a Perfect EPUB?
A Perfect EPUB is a proofread and edited book that complies with the EPUB standard.

What is an EPUB file?
An EPUB file is a zipped file that contains files that tell an EPUB reader what to display, how, and in what order to display it.

This Tutorial
This tutorial starts from having an OCR copy of a book in plain text format (convert with Calibre to get to here if needed). The justification for this is that it is easier to build up from scratch than it is to edit another implementation of the EPUB standard.

It will take longer to convert this book than it would to read the book normally. How much longer depends on the quality of the OCR (some books are so bad as to be unreadable and need to be rescanned; this tutorial won't help with those).

Needed to follow this tutorial:
- Web Browser (I'm using Firefox)
- Text Editor (Vim is used for this tutorial, it's very powerful and can have a steep learning curve. I'll try to be detailed enough not to lose anyone. It can be downloaded from http://www.vim.org/download.php)
- EpubCheck (Download from http://code.google.com/p/epubcheck/)
- Zip (Unverified download: http://www.willus.com/archive/zip64/. I'm using Linux and this implementation of zip; if someone could check whether it works on Windows with the instructions in this tutorial, it would be greatly appreciated.)
- A Perfect EPUB book (for unchanging files)
- A text file of the book to be worked on.

Useful for using Vim:
death2y uploaded Ultimate Guide to the VI and EX Text Editors (http://bibliotik.org/torrents/29579) which is a great book explaining how to use Vim.

Unchanging files:
- mimetype - First file in the EPUB file, tells readers that this is an EPUB file.
- META-INF/container.xml - Standard directory and file that tells the EPUB reader where the content.opf file is.
- cover.html - Front cover of the book, loads the cover_image.jpg image for the front cover.
- stylesheet.css - This file can change (adding a format element), but mostly it has the elements you use to format your EPUBs, which doesn't change much.

Semi-changing files:
- content.opf - Contains all of the EPUB's meta-data, and a listing of all of the content, stylesheet, and indexing files. Also the display order of those files.
- toc.ncx - A structured listing of the chapters in the book (EPUB readers build their Table of Contents from this file). Can also list sub-chapter items.

Changing files:
- cover_image.jpg - A quality image of the front cover of the book.
- Chapter_xx.html (Prologue.html, Epilogue.html) - The first 9 lines and last 2 lines of these files are mostly the same (the title meta changes), the rest is the content of the book.

Tutorial

Step One: Unzip the Perfect EPUB into a working directory.

Step Two: Delete all of the Chapter_xx.html files and the cover_image.jpg file.

Step Three: Copy text file of the book into the working directory and open it in Vim.

Vim Tip: To save and exit vim, from command mode (press Escape to get out of input mode and into command mode) type ':wq'
(Colon, command signifier.
w, write.
q, quit)
To exit vim without saving, from command mode type ':q!'
To undo the last thing you did press 'u'
To redo the last thing you undid type ':redo'

Step Four: Delete all blank lines in the file.

Vim Tip: To delete all blank lines in a file type ':g /^$/ d'
(Colon, command signifier.
g, global.
/, start of search pattern.
Caret, start of line.
Dollar, end of line.
/, end of search pattern.
d, delete)
English translation: Delete all lines where the start of the line is followed by the end of the line (i.e. blank lines)
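
If you would rather script this step than do it in Vim, the same idea can be sketched in Python. This is only an illustration and not part of the Vim workflow; the function name drop_blank_lines is made up for the example. Like the Vim pattern, it only removes lines that are completely empty, so a line containing just spaces is kept.

```python
def drop_blank_lines(text):
    """Return text without empty lines, like Vim's ':g /^$/ d'."""
    kept = [line for line in text.split("\n") if line != ""]
    return "\n".join(kept)


sample = "Chapter One\n\nIt was a dark night.\n\n\nThe end."
print(drop_blank_lines(sample))
```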

Step Five: Delete all unnecessary lines at the start and end of the file (copyright information, table of contents, etc). Acknowledgments can go either way (you can take them out of the normal display order of the book, but some EPUB readers don't honour that).

Vim Tip: To delete a line type 'dd'
(d, delete.
dd, shorthand for delete line)
To delete a number of lines type '<number>dd'
(<number>, any number)
To delete a range of lines type ':<first line number>,<last line number> d'
(Colon.
<first line number>, where to start deleting.
<last line number>, where to stop deleting.
d, delete)
To get the line number the cursor is on, in command mode type '<ctrl-g>'

If your source file has a lot of ampersands (&) and less-than signs (<), replace them with &amp; and &lt;

Vim Tip: To replace a character or string type ':%s/<string>/<replacement>/g'
(Colon.
Percent, all lines in the file (can be replaced with a line range, or removed to only replace on the current line).
s, substitute.
/, start of search pattern.
<string>, the string to replace.
/, end of search pattern, start of replacement string.
<replacement>, what replaces the string.
/, end of replacement string.
g, globally (don't stop at first occurrence).)
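
The order of the two replacements matters: ampersands have to be done first, otherwise the & inside a freshly written &lt; would itself be escaped. A minimal Python sketch of the same two substitutions (the function name escape_xml_text is just for illustration), to be run on the plain text before any tags are added:

```python
def escape_xml_text(text):
    """Escape the two characters that break XHTML in plain book text."""
    # Ampersands first, otherwise the '&' in '&lt;' would be escaped again.
    return text.replace("&", "&amp;").replace("<", "&lt;")


print(escape_xml_text("Smith & Sons: 3 < 5"))
```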

Step Six: Add a <p> (paragraph) tag around all paragraphs.

Vim Tip: To add something to the start of all lines (the opening <p> tag) type ':%s/^/    <p class="<class name>">/g'
(Colon.
Percent, all lines in the file.
s, substitute.
/, start of search pattern.
Caret, start of line (what we are replacing).
/, end of search pattern, start of replacement text.
'    <p class="<class name>">', the opening tag (what we are replacing the start of the line with; there are four spaces before the <p at the start to indent the line to make it look neater, and the class allows you to change how the paragraph is displayed with the stylesheet; <class name> stands for a paragraph class from your stylesheet.css).
/, end of replacement text.
g, globally (replace all, not just the first occurrence).)
English translation: For all lines in the file, replace the start of the line with the tag, globally.
To add something to the end of all lines (closing the <p> tag) type ':%s~$~</p>~g'
(Colon.
Percent.
s.
~, Tilde, start of search pattern (the separating characters can be any character that doesn't occur in the search pattern or the replacement text; as we want to put '</p>' as the replacement text we can't use the '/' character to separate the expressions).
Dollar, end of line.
~.
'</p>', replacement text (closes the tag).
~.
g.)
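
For comparison, here is a rough Python sketch that does both substitutions in one pass, assuming every line of the text is one paragraph. The class name body_text and the function name wrap_paragraphs are placeholders for the example; use whichever paragraph class your stylesheet.css defines.

```python
def wrap_paragraphs(text, css_class="body_text"):
    """Wrap every line in an indented <p> tag.

    css_class is a hypothetical name; replace it with a class
    that exists in your own stylesheet.css.
    """
    lines = text.split("\n")
    return "\n".join(f'    <p class="{css_class}">{line}</p>' for line in lines)


print(wrap_paragraphs("It was a dark night.\nThe end."))
```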

Sub-step A: Some OCR files/conversions have paragraphs split over two lines. If it's detectable (the file I found like this had a space at the end of each broken line) you can save yourself some work by rejoining them all automatically.

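Vim Tip: Substituting over a line break, after adding the tags: ':%s~ </p>\n    <p class="<class name>">~ ~g' (\n being the special sequence for newline, and <class name> being the paragraph class used in Step Six)
Or if done before adding the tags: ':%s/ $\n/ /g'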
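
The version done before adding the tags translates to a one-line Python sketch. It rests on the same assumption as above, that a line ending in a space continues on the next line; the function name is made up for the example.

```python
def rejoin_broken_lines(text):
    """Join each line that ends in a space with the line after it."""
    # A trailing space before the newline marks a paragraph broken by the OCR.
    return text.replace(" \n", " ")


print(rejoin_broken_lines("It was a \ndark night.\nThe end."))
```

Check a few paragraphs by eye afterwards, since a genuine paragraph that happens to end in a space would be joined to the next one as well.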

Step Seven: Add html headers and footers to the file.

Vim Tip: To jump to a specific line in a file type ':<line number>' e.g. ':1' to go to line 1.
To jump to the end of a file type ':%'
To start inserting into the file:
at the current cursor position 'i'
after the current cursor position 'a'
at the start of the line 'I'
at the end of the line 'A'
make a new line after current line and start on the new line 'o'
make a new line before current line and start on the new line 'O'
To get out of insertion mode and into command mode press Escape.

Header:

<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Chapter One: <Chapter Title></title>
<meta content="http://www.w3.org/1999/xhtml; charset=utf-8" http-equiv="Content-Type"/>
<link href="stylesheet.css" type="text/css" rel="stylesheet"/>
</head>
<body class="chapter">

The white space is lost in the header above, and while it isn't needed, it does make the file look neater (I can't find a [code] format tag that would preserve white space). Some books name their chapters, some don't. Replace <Chapter Title> in the header above if it's relevant, otherwise delete it (leaving <title>Chapter One</title>).

Footer:

</body>
</html>

Step Eight: Save the file and load it in your Web Browser.

Vim Tip: To save the file you're working on type ':w'

Sub-step A: If the OCR file you started with has XML special characters in it (specifically & or <), your web browser should complain and tell you the line number that it occurs on (Firefox does). Jump to the line and delete the offending character if it isn't supposed to be there, otherwise replace it with &amp; or &lt;.
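
A browser is not the only way to find these characters. As a rough sketch (an alternative to the browser check, with a made-up function name), Python's built-in XML parser reports the line and column of the first problem it hits:

```python
import xml.etree.ElementTree as ET


def first_xml_error(xml_text):
    """Return (line, column, message) for the first XML error, or None."""
    try:
        ET.fromstring(xml_text.encode("utf-8"))
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


broken = "<body>\n<p>Smith & Sons</p>\n</body>"
print(first_xml_error(broken))
```

Like the browser, it stops at the first error, so fix that line and run it again until it returns None.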

Step Nine: Read the book in your Web Browser, editing it in Vim as you go.
- The titles of the chapters should be of the class chapter_title instead of the normal paragraph class, e.g. for 'Chapter One: Tragedy'.
- Some text in books is emphasized for different reasons. To emphasize text, surround it in an emphasis tag (standard XHTML provides <em> for this), e.g. <em>emphasis</em>.
- To show that the book has jumped to another time or character some books have chapter breaks. Best shown with an example: '***'

Vim Tip: To jump straight to an error in the text type '/<error>'
(/, start of search pattern.
<error>, a string of characters to search the file for.)
To jump to the next occurrence of the last search press 'n'

Editing Tip: For some books, if you don't have a printed copy of the book, you can use Amazon's 'Look Inside' feature to correct errors in the OCR.

Step Ten: Copy the file to a file called Chapter_1.html (or Prologue.html/Acknowledgments.html/etc if that should be the first file), open Chapter_1.html in Vim and delete everything that isn't Chapter One.

Vim Tip: Search for Chapter Two: '/Chapter Two'
Get what line it is on, and the total lines in the file: 'ctrl-g'
Delete all lines that shouldn't be in the Chapter_1.html file: ':<line number of Chapter Two header>,<total lines - 2> d'
Write the file and quit Vim: ':wq'

Step Eleven: Repeat Step Ten for the other chapters, replacing the <title> tag's contents with the correct chapter title in each file.

Vim Tip: Search for the chapter, e.g. Chapter Two: '/Chapter Two'
Get what line it is on: 'ctrl-g'
Delete all of the chapters before this one: ':9,<line - 1> d'
Search for the next chapter, e.g. Chapter Three: '/Chapter Three'
Get what line it is on, and the total lines in the file: 'ctrl-g'
Delete all the chapters after this one: ':<line number of next chapter header>,<total lines - 2> d'
Write the file and quit Vim: ':wq'
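
Steps Ten and Eleven can also be sketched as a script. The version below is only an illustration under a few assumptions: the file has the 8-line header and 2-line footer shown earlier, and every chapter starts on a line where 'Chapter <something>' comes directly after a tag. The function name and the heading pattern are made up for the example, and the <title> in each output still has to be edited by hand.

```python
import re


def split_chapters(page, header_lines=8, footer_lines=2):
    """Split one tagged XHTML page into one page per chapter.

    Assumes an 8-line header, a 2-line footer, and chapter headings
    matching '>Chapter <word>' (a hypothetical pattern; adjust it
    for your book).
    """
    lines = page.splitlines()
    header = lines[:header_lines]
    footer = lines[-footer_lines:]
    body = lines[header_lines:-footer_lines]

    heading = re.compile(r">Chapter \w+")
    starts = [i for i, line in enumerate(body) if heading.search(line)]

    chapters = []
    for n, start in enumerate(starts):
        # Each chapter runs up to the next heading, or to the end of the body.
        end = starts[n + 1] if n + 1 < len(starts) else len(body)
        chapters.append("\n".join(header + body[start:end] + footer) + "\n")
    return chapters


# Usage sketch (file names follow the tutorial's naming):
# for number, text in enumerate(split_chapters(open("book.html").read()), 1):
#     open(f"Chapter_{number}.html", "w").write(text)
```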

Step Twelve: Open content.opf in Vim.
Meta-data: Open the book's Amazon page, or the publishing information of the printed book
- The <dc:language> tag is for the language of the book, en for English.
- The <dc:subject> tag is for the book's subjects (comma separated).
- The <dc:date> tag can have an event with it: 'published' is the date the book was published; 'edited' is optional and would be today's date (you're editing the file).
- The <dc:contributor> tag is for who contributed to the book. It has a role with it: 'edt' is optional and holds your name or handle (you edited the file).
- The <dc:identifier> tag is to identify the book. Only one is used (the one referenced in the package tag). I tend to include the ISBN and reading order in the series. The one that is needed is the uuid, which should be unique (I use http://www.famkruithof.net/uuid/uuidgen and Version 4).
- The <dc:creator> tag is for the author(s) of the book (a new tag for each author).
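
If you have Python installed, a Version 4 (random) uuid can also be generated locally instead of on a website. A minimal sketch; the urn:uuid: prefix is a common convention for EPUB identifiers and is an assumption here, not something this tutorial requires:

```python
import uuid

# uuid4() produces a random (Version 4) UUID.
book_id = f"urn:uuid:{uuid.uuid4()}"
print(book_id)
```

Whatever value you end up with, use exactly the same string in content.opf and in toc.ncx.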

Manifest: The manifest is a listing of all of the non-required files in the EPUB. Edit the list so all of your chapters (prologue, epilogue, etc) are listed. The entries don't need to be in any order.

Spine: An ordered list of the files in the manifest. Edit this list to include all of your chapters in the correct order. Add 'linear="no"' to any item that should be out of the reading order, e.g. Acknowledgments.
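
For a book with many chapters, typing these entries by hand gets tedious. The sketch below prints <item> lines for the manifest and <itemref> lines for the spine from a list of file names. The function name and the way ids are derived from file names are choices made for this example; any ids work as long as each idref matches an id.

```python
def manifest_and_spine(files, non_linear=()):
    """Build <item> and <itemref> lines for content files, in reading order."""
    items, itemrefs = [], []
    for name in files:
        # Hypothetical id scheme: file name without extension, lower-cased.
        item_id = name.rsplit(".", 1)[0].lower()
        items.append(
            f'<item id="{item_id}" href="{name}" media-type="application/xhtml+xml"/>'
        )
        linear = ' linear="no"' if name in non_linear else ""
        itemrefs.append(f'<itemref idref="{item_id}"{linear}/>')
    return "\n".join(items), "\n".join(itemrefs)


files = ["Prologue.html", "Chapter_1.html", "Acknowledgments.html"]
manifest, spine = manifest_and_spine(files, non_linear={"Acknowledgments.html"})
print(manifest)
print(spine)
```

The stylesheet, cover image and toc.ncx entries already in the manifest have other media types and are left as they are.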

For a full listing of what the opf file is about see http://www.idpf.org/2007/opf/OPF_2.0_final_spec.html

Save and exit Vim.

Step Thirteen: Open toc.ncx with Vim.
Edit the content value of the meta tag with the name 'dtb:uid' to be the same as the uuid from content.opf
Edit the <docTitle> to reflect the title of the book
Edit the rest of the file to reflect the contents of the book
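
The part of toc.ncx that changes most is the list of <navPoint> entries, one per chapter, each with a label, a play order and the file it points to. As a sketch only (the function name and the navpoint-N id scheme are made up for the example), they can be generated like this and pasted into the <navMap> section:

```python
from xml.sax.saxutils import escape


def nav_points(chapters):
    """chapters: list of (label, file name) pairs in reading order."""
    out = []
    for order, (label, src) in enumerate(chapters, start=1):
        out.append(
            f'<navPoint id="navpoint-{order}" playOrder="{order}">\n'
            f"  <navLabel><text>{escape(label)}</text></navLabel>\n"
            f'  <content src="{src}"/>\n'
            "</navPoint>"
        )
    return "\n".join(out)


print(nav_points([("Prologue", "Prologue.html"),
                  ("Chapter One: Tragedy", "Chapter_1.html")]))
```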

Vim Tip: To cut lines press '<number of lines> dd'
To paste lines press 'p'

Save and quit Vim.

Step Fourteen: Get a decent quality cover image and save it as cover_image.jpg (Google Image search works well)

Step Fifteen: Zip the contents of the working directory, with the mimetype file first in the archive.

Linux: 'zip -Xr9D "<Author> - <Title>.perfect.epub" mimetype *'

(naming format is just my preference)

Windows: I've no idea. If someone can say how to zip a file with the mimetype as the first file, it would be appreciated.
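
One approach that should not depend on the operating system is Python's standard zipfile module. The sketch below has not been tested as part of this tutorial; the function name is made up, and it assumes the working directory contains a mimetype file. It writes mimetype first and uncompressed, then adds everything else compressed.

```python
import os
import zipfile


def build_epub(source_dir, epub_path):
    """Zip source_dir into epub_path with mimetype first and uncompressed."""
    with zipfile.ZipFile(epub_path, "w") as epub:
        # The mimetype file must be the first entry and must not be compressed.
        epub.write(os.path.join(source_dir, "mimetype"), "mimetype",
                   compress_type=zipfile.ZIP_STORED)
        for folder, _dirs, files in os.walk(source_dir):
            for name in sorted(files):
                full = os.path.join(folder, name)
                # Zip entries always use forward slashes, even on Windows.
                arcname = os.path.relpath(full, source_dir).replace(os.sep, "/")
                if arcname != "mimetype":
                    epub.write(full, arcname, compress_type=zipfile.ZIP_DEFLATED)


# Usage sketch: write the EPUB outside the working directory.
# build_epub("working_dir", "Author - Title.perfect.epub")
```

Everything found in the working directory is included, so move the original text file out first and keep the output file outside the directory being zipped.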

Step Sixteen: Check the EPUB with EpubCheck and resolve any errors.

'java -jar <path to EpubCheck.jar> <book>.perfect.epub'
