OCR stands for Optical Character Recognition. As we explained earlier, the OCR process turns images of text into machine-readable text. To produce our PDF image/text editions, we must run OCR on the TIF images we produced when we scanned the book.
For this exercise, we will be using ABBYY FineReader. Although we value the contributions of the open-access (OA) community and use their software whenever possible, there is no comparable OCR user interface in the OA world.
Performing the OCR
Loading your images
The first thing you notice when you launch FineReader is a task wizard. Select Other/Open. This should bring up a file browser window.
Browse to your flash drive and select all the images you want to load into FineReader. Before you press Open, we must select the right options. Towards the bottom-right corner of the browser window you will notice a button labelled Options… This should take you to the Options panel. (If you later need to access the Options panel from within the workspace, you can just press Ctrl+Shift+O.)
Make sure the split pages option is checked. This option cuts each of your scans neatly into two pages, so you can work one page at a time. (Notice that this decision will also affect the look of the final PDF.)
Next, select the PDF tab. In the PDF options, set Default paper size to “Keep original image size” and Save mode to “Text under the page image.” You also want to check “Enable Tagged PDF” (this will allow us to add metadata later). Picture quality should be set to “High quality (for printing).” When you are done, the PDF tab should show these four settings.
In the Advanced tab, open the Font Matching options and clear all. Now choose a font that you think you will be comfortable working with while you proof. Some people like to work with serif fonts and some with sans-serif (look up the difference on Wikipedia). I personally think Times New Roman works well for these texts.
Now we’re ready to roll! Press OK to close the Options panel and press Open in the browser window to begin the loading process.
Once the pages are loaded, you will notice that the software automatically performs OCR on the images. After the process is over you will have three columns in the work area: Pages, Image and Text. We will be doing most of our work in the Text column, but it is important to keep the other columns open as we work. At this point you should also make sure that Save as PDF is selected on the top ribbon.
Before we proceed, there is another window we need. Go to View and select Zoom Window/Show zoom window. (You can also open or close the zoom window by pressing Ctrl+F5.) Notice that the menu options for Zoom Window let you dock it on the top or on the bottom. This is a personal choice. I like the top. At this point your screen should show the three columns with the zoom window docked above or below them.
Note: Make sure you save often. To save the project go to File/Save FineReader Document. The file-naming convention for this part of the process is the same as the one we used for our images: [first four letters of the title] + [a number from 1 to 4 depending on your position in the team]. Ex. prec1, prec2, etc. We will change these later when we incorporate all the projects.
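If it helps to see the naming convention spelled out, here is a minimal sketch in Python. The function name `project_name` and the title "Precaution" are made up for illustration, and the choices to lowercase the letters and to skip spaces and punctuation are my assumptions; the convention itself only asks for the first four letters plus your team number.

```python
def project_name(title: str, position: int) -> str:
    """Build a project name: first four letters of the title + team position."""
    if not 1 <= position <= 4:
        raise ValueError("position must be a number from 1 to 4")
    # Assumption: ignore spaces and punctuation, and use lowercase letters.
    letters = [ch for ch in title.lower() if ch.isalpha()]
    if len(letters) < 4:
        raise ValueError("title needs at least four letters")
    return "".join(letters[:4]) + str(position)


# Hypothetical title, used only to show the pattern.
for member in range(1, 5):
    print(project_name("Precaution", member))
```

With that hypothetical title, the four team members would get prec1 through prec4.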
OK, now we are ready to proofread the OCR. By default, FineReader marks all suspect characters by highlighting them in blue. You can navigate through these suspects in three ways:
- By using the Check Spelling function. You can select it from the top right of the Text column/window. This will bring up its own spell-checking window that walks you through each of the suspects. Go through the whole text until there are no more suspects to check.
I find this method slow and not necessarily the most accurate. The positive side is that it allows you to “check off” the suspects as you go (by either ignoring them or correcting them), so that in the end you can see that you didn’t skip a suspect.
- By using the navigation buttons on the top-right corner of the Text column/window.
This method guarantees that you don’t skip a suspect. If you go this route, go from suspect to suspect, making sure the characters in the text match the image in the zoom area (on some occasions a quick look at the larger image can be helpful). Go through the whole text until there are no more suspects to check.
- My favorite, but the one that requires you to be the most alert. Just go from suspect to suspect using your mouse. Click on a suspect and examine it against the Zoom (or the Image). Correct it if necessary; otherwise move on quickly with the mouse to the next one. Sometimes you can use your eyes to go over two or three suspects that are close to each other. Even after you gain experience doing this, make sure you don’t lose your concentration. If you are going too fast, you might miss a comma that is supposed to be a period, etc. Go through the whole text until there are no more suspects to check.
The Proof is in the Details
Here are some things to watch out for while you are proofing:
Notice that FineReader automatically converts end-of-line hyphens to what it calls Optional Hyphens (usually called soft hyphens). Sometimes it doesn’t do a good job with these, so pay special attention to them.
Right now there is an issue with PDF texts that does not allow you to search for words broken by a soft hyphen. To correct this you have to manually erase the optional hyphen and add a line break at the end of the word that had it. To add a line break you must press Shift+Enter. If you simply hit Enter, you are adding a paragraph break, so be careful not to do this unless it is called for. NOTE: You probably want to leave this procedure until after you have corrected everything else.
If you find an optional hyphen that should be a “hard” hyphen, you can make the correction using your keyboard (-).
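To see why a soft hyphen breaks searching, and what the manual fix above amounts to, here is a rough sketch in Python on plain text. It is not a FineReader feature. It assumes the optional hyphen is stored as the Unicode soft hyphen (U+00AD) followed by a line break, which may not match how FineReader keeps it internally; the function name and the sample sentence are made up.

```python
import re

SOFT_HYPHEN = "\u00ad"  # Unicode soft hyphen (assumed encoding of the optional hyphen)

# A soft hyphen at the end of a line, then the rest of the word on the next line.
_BROKEN_WORD = re.compile(SOFT_HYPHEN + r"\n(\S+)[ \t]*\n?")


def rejoin_soft_hyphens(text: str) -> str:
    """Drop each soft hyphen and move the line break to the end of the rejoined word."""
    def fix(match: re.Match) -> str:
        at_end_of_text = match.end() == len(match.string)
        return match.group(1) + ("" if at_end_of_text else "\n")
    return _BROKEN_WORD.sub(fix, text)


# Made-up sample line with a word broken by a soft hyphen.
page = "This is an exam\u00ad\nple of a broken word."
fixed = rejoin_soft_hyphens(page)

print("example" in page)   # the search misses the broken word
print("example" in fixed)  # the search finds the rejoined word
```

The second search succeeds because the word is whole again and the line break now sits after it, which is the same result as deleting the optional hyphen and pressing Shift+Enter after the word.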
Many times you will encounter characters from another language (ä, ç, Œ, etc.) that the OCR fails to recognize because it is focusing on its English dictionary. There are two tried-and-true ways to correct these errors: a) Copy and paste the right character from the Internet (time-consuming); or b) Use Alt + [character code]. Here is a link that has a list with the major character codes. Make sure you use the NumPad to enter the character codes. The negative side of this method is that it can be a pain on most laptops, since the NumPad is overlaid on the keyboard.
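If you don't have a list of codes at hand, a few lines of Python can work them out. This is only a sketch: it assumes a Western European Windows setup, where Alt + 0 + three digits on the NumPad enters the character with that number in the Windows-1252 code page. The list linked above may use a different numbering (for example, shorter codes without the leading zero), and the function name `alt_code` is mine.

```python
def alt_code(char: str) -> str:
    """Return the Alt+0nnn NumPad code for one character (Windows-1252 assumed)."""
    if len(char) != 1:
        raise ValueError("pass exactly one character")
    # Raises UnicodeEncodeError if the character is not in Windows-1252.
    number = char.encode("cp1252")[0]
    return f"Alt+{number:04d}"


for ch in "äçŒ":
    print(ch, alt_code(ch))
```

Characters outside that code page raise an error here, and for those copy and paste is the safer route.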
Every time you correct an error, make a mental note of it. Eventually you might start noticing patterns in the types of errors that are common in your edition. Write these patterns down as a form of documentation of your work. Later, these patterns can be used to make corrections faster and more accurate.
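One simple way to keep that documentation is a running tally of what the OCR read versus what the page actually says. Here is a minimal sketch in Python; the function name and the sample corrections are invented for illustration (only the comma-for-period mix-up comes from the discussion above), so replace them with what you actually find in your edition.

```python
from collections import Counter


def summarize_corrections(corrections):
    """Count (OCR reading, correct reading) pairs, most frequent first."""
    return Counter(corrections).most_common()


# Invented sample log: each entry is (what the OCR read, what it should be).
log = [
    (",", "."),
    ("rn", "m"),
    (",", "."),
    ("a", "ä"),
    (",", "."),
    ("rn", "m"),
]

for (ocr_read, should_be), count in summarize_corrections(log):
    print(f"{ocr_read!r} -> {should_be!r}: {count} time(s)")
```

Sorting by frequency puts the patterns worth watching for at the top of the list.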