We’re nearing completion of the first of our two digital workshops: scanning and OCR’ing a scholarly monograph.
As Alex noted, the process of scanning itself seemed to go pretty smoothly. With the work shared by four people, it was not particularly onerous. Students in my group, unlike those in Alex’s, did not work in pairs, but they had no complaints about scanning.
In introducing and discussing the project I tried to bring our group’s attention to the broader issues involved. What, I asked them, about this book would you be interested in preserving? How would you describe this book?
They skipped over what I assumed would be the obvious answers (the title, the text, the author) and began immediately talking about the typography and the binding. Their vocabulary for describing these properties of the book was limited (as is my own, which a hardcore bibliographer would find rather basic).
I encouraged them to think about how search changes the way we can use a book. I also began to point to the possibilities of large-scale text databases. How does the availability of such resources change the sort of questions we are inclined to ask of the works we read?
(I left off pressing them on questions about what metadata should be attached to a digital edition; that, I think, we’ll address soon enough.)
While I didn’t draw the comparison explicitly for my students, I see this focus on materiality, on how the book as a physical object mediates the information it communicates, as consistent with what students are doing in their seminar meetings, where much of their time is spent reciting poetry. Prof. McGann has written about the value and importance of recitation as an alternative or supplement to that most traditional of English major seminar activities: interpretation. Focusing on the material properties of the book, I like to think, stems from a similar set of theoretical concerns about how material form mediates meaning and purpose.
So far, so good.
The process of OCR’ing the book, however, proved challenging in other ways. Alex and I remain frustrated by the problem of how to handle soft hyphens. And the process itself is, by its very nature, more vexing and less pleasant than scanning: while most people were able to scan their pages within an hour or so, the process of OCR’ing took around four hours on average. And those four hours are spent squinting at a screen, comparing a PDF to the recognized text to see whether that period came through as a period or a comma. One student described the experience as “mind numbing.” Another, no doubt with some slight exaggeration, described it as “soul crushing.”
This complaint alone does not necessarily count as a demerit to the project. There is no a priori reason why this sort of time-consuming, “mind numbing” labor might not have educational value. At least some experience of how the sausage gets made seems undeniably valuable. Once you’ve digitized a text, you understand the world of digitized texts in a different way.
And yet I wonder whether the diminishing marginal educational value of each additional hour spent doing the OCR can justify the labor involved. Reciting Whitman’s “Song of Myself” has certain educational benefits; reciting Dickinson’s verse aloud only compounds those benefits. Scanning and OCR’ing 10 pages of text has certain educational benefits as well. I’m less sure, however, that scanning and OCR’ing the next 10 pages carries the same value.