[As part of their scanning and OCR/Proofing exercise our students were asked to write a midterm report. They have given us permission to post their reports here under our CC license. The following is the report from team Alex, word for word]
ENAM Project Tango, 1st Project Report
David McNerney, Laura Borgs, Jordan Bolden and Flora Pulce
4 October 2010
1. Please describe what you did in the process of digitizing this work. What challenges arose in the process and how did you solve them?
In digitizing the work we went through several stages in converting the text to a digital copy. The first of these stages was the scanning phase. In the scanning phase we assigned each person in the group an equal number of pages to scan (taking into account the cover, back cover, spine, etc.). The scanning process was a relatively simple one, taking each member of the group approximately half an hour to scan their individual pages, and we easily finished the entirety of our scanning in one week.
After scanning we moved into the second phase of digitizing the text, the OCR. We spent almost three whole weeks completing the OCR process. The first step in the OCR was to convert our scanned images of the text into a readable .pdf document. We did this using the ABBYY Finereader 10 software program. In order to divide the work evenly, each group member was responsible for the same pages he or she had been responsible for in the scanning phase. This led us to create 4 separate Finereader documents, with one group member responsible for each document.
From here we moved into the editing phase, which was easily the most difficult part of the project. Throughout the editing phase we had to use ABBYY Finereader 10 to find and correct any mistakes made in the scanning process. While the software was extremely accurate in converting the image to text, there were a number of mistakes which made the editing job a very tedious one. One of the biggest issues arose with learning to use the OCR software. As undergraduates we had no previous exposure to any form of OCR software, and learning to use the Finereader 10 software proved to be difficult. However, the OCR handout was very helpful, and eventually the group was able to collectively figure out the software and make the necessary edits. Within the editing process itself there arose patterns of errors from the OCR software. One of the biggest issues was with hyphens, especially when a word could not fit on the line. It was explained to us why this issue exists, yet it still proved to be a very tedious part of the editing process.
Once we finished with the editing, all that was left was to compile the complete text into one large document, something that once again was new to us but did not prove too difficult to figure out. In summation, the biggest problem we encountered was a general unfamiliarity with the use of OCR software, which led to a few minor problems, but all in all the process was relatively easy, if tedious.
2. What are the advantages of a digital edition of a work such as the one you digitized? What are the disadvantages? Include in your reflections not only a consideration of the differences between a paper edition and a digital edition of the same work, but some consideration of how the broader landscape of digital databases (e.g. Google Books, JSTOR) changes how these works are used.
Nowadays more and more books are going out of print. The world of literature is becoming more and more digitized, and “Digital Humanities” is a new field of study developing out of this digitization, because before looking for a book in a library, people look up the book on the internet first.
The advantages are obvious: finding a book in an online database gives you access to the book more quickly and easily than searching for it in a library or even using interlibrary loan, which many libraries offer but which takes away valuable time you could rather spend doing more research. Books that are out of print especially should be digitized, since getting access to these kinds of resources is getting harder and harder. Another advantage is that you can search through the PDF file easily and find the paragraphs you are looking for faster.
Taken to an extreme, however, the book could become a dead medium and libraries may someday lose their value. By using a PDF file to search a book for information relevant to one’s research, one might also miss other important background knowledge which one could have gained by skimming through the book to find the fitting information.
3. What did you learn? In answering this question try to consider not simply the specific technologies involved (I learned how to use a scanner; I learned how to use ABBYY Finereader, version 10), but the broader issues: what did you learn about books?
For question three we decided it would make more sense if each of us answered the question as it applied to us in order to address the diversity of what we all learned. What follows is each of our responses.
David: In completing this first project I really learned more about the effort that goes into digitizing a scholarly manuscript. Although it would be a quicker process if we didn’t attempt to preserve the look of the original text, through the project I have gained a greater appreciation for this standard. Going into the project I did not possess a true appreciation for putting up a digitized copy of the book as close to the original as possible. In working directly with the text and digitizing it myself, I definitely feel that I gained a new appreciation for keeping books in their original form (or as close to it as possible) rather than just copying the text into a Word document. Furthermore, through the project I gained a greater appreciation for digital books in general. Prior to this project I much preferred working with a hard copy of the text, but in creating a digitized copy I grew to appreciate that specific form of media, and the advantages that it provides, much more. I feel that this project provides a very good way to increase awareness of the issues around digitizing texts, while at the same time providing good insight into the value of having a digitized text.
Jordan: Honestly, before this project I had never seriously considered the concept of digital humanities or the relevance of the rising field of study to my life beyond my vehement disdain for Kindles and vague condescension toward blogging. I believed in the physical experience reading provides: the few seconds of suspense that turning a page creates, which are completely absent in digital replications. It would have taken nothing short of working within the field in this way to convince me that it is worth it to even pose the question of the significance of digital humanities. But, thrust into the middle of this experiment, I began to see how monumental the question of digital humanities could really be.
This project forced me to re-evaluate my definition of a book, and consider whether the text separate from a physical book should be defined differently. This line of thought is completely new to me, and it opened new avenues with which to regard my studies. Learning to use and working with ABBYY Finereader showed me how much work goes into digitising a book, and the way in which a work can be illuminated by doing so. I may not be sold on the idea that a digitised version of a book is transformed into a new thing altogether, but I can see how useful digitising can be as a tool for preserving literature, making it more widely accessible, and providing a more convenient and effective way to search and use texts for research or study.
Laura: Before Professor McGann’s class I had never heard about Digital Humanities. Literature and the World Wide Web were always two different worlds for me. Of course I searched for books online, but only in order to find them later in a nearby library, where I could flip through the book and gain more knowledge from it than just the knowledge I was looking for in the first place.
While digitizing a book I learned to appreciate online databases for the work that is put into the digitization of a book and for the fact that online databases are making books accessible to everyone who owns or has access to a computer. And to be honest, IT will develop faster, and with more attention from today’s society, than books will. Taken to an extreme, someday there won’t be any more people writing books, because with electronic devices like the Kindle the book itself, with its pages, front and back cover and spine, will lose its function: books will be downloaded and go immediately onto your electronic device, and books will become ebooks. And you can already buy a book without even holding it in your own hands, flipping through the pages and acknowledging the work that was put into these 300-plus pages, because now it is down to a megabyte in size.
Flora: This project was a revolutionary idea, and it will be helpful for researchers in the humanities. It is another effort to bring together the traditional way of passing on knowledge (through books) and a more modern approach to learning and sharing academic knowledge (the wonderful tool that is the internet). But it was not really interesting to be a part of it in a practical way, because basically we just read instructions from a blog and reproduced them. We did not really face hard situations or problems, so we had little initiative. I guess in a way that is a good thing: it proved that everything happened according to plan. But still, all of the scanning, OCR, and correction were really mechanical, and I felt it did not have a lot in common with being in the humanities and learning literature. We just skimmed through the book without taking a closer look at it. Though I spent almost a month using it, I have barely any idea of what it is about. I am not, in general, very interested in digital editions. I am old-fashioned, and I need a book to feel that I am learning, so in my opinion, though a digital edition is much more useful and easy to work with, it will not replace a paper edition. To sum up, it is a job that has to be done, but it is not an interesting one.