The Sapheos Project
Dr. Randall Cream, Dr. Song Wang, Jarrell Waggoner, Jun Zhou
Center for Digital Humanities and Department of Computer Science
University of South Carolina
Updated August 2nd, 2010
The Sapheos Project is an NEH-funded project that uses advanced digital image processing techniques to analyze historical documents automatically. The system links document images with text encoded in TEI XML and collates large sets of images that may vary in quality and size. Compared to other solutions, it is a pure software approach that requires minimal user intervention.
Since September 2009, the Center for Digital Humanities, together with the Department of Computer Science at the University of South Carolina, has been exploring possible prototypes for this system. In this paper, we first outline the steps and methods we have adopted and then discuss our preliminary results.
Linking images with text involves segmenting the document image and recording the segmentation information (x-y coordinate values) in TEI XML. Here the XML file serves not only as a container for the result but also as a guide to the segmentation procedure. For example, <hi rend="face(ornamental)"> in the XML indicates a large ornamental letter image that may span multiple lines. <lg> and <l> indicate that the content is a poem, so an indented line should be treated as a line rather than as a new paragraph, which is how it would be classified if paragraphs were detected by leading indentation alone. There are too many such cases to cover all of them in this project. We set out to segment paragraphs [code] out of the document image [Figure 1]; the result is written to a file for later integration into the result XML file.
Figure 1
This figure shows the result of stanza (paragraph) segmentation.
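As an illustration only, the following Python sketch shows one simple way stanza boxes could be found and recorded as coordinates. It is not the project's Matlab code. It assumes a binarized page given as rows of 0/1 values (1 = ink), splits the page wherever enough consecutive blank rows occur, and writes each box as a TEI-style <zone> element with ulx, uly, lrx and lry attributes. The function names, the min_gap parameter, and the choice of <zone> as the container are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def _bounding_box(image, top, bottom):
    """Tight box around the ink between rows `top` and `bottom` (inclusive)."""
    cols = [x for y in range(top, bottom + 1)
            for x, value in enumerate(image[y]) if value]
    return (min(cols), top, max(cols), bottom)


def segment_blocks(image, min_gap=2):
    """Split a binary page image into vertical blocks (stanzas).

    A block ends once at least `min_gap` consecutive blank rows follow it.
    Returns (ulx, uly, lrx, lry) boxes in inclusive pixel coordinates.
    """
    boxes, start, last_ink, blank_run = [], None, None, 0
    for y, row in enumerate(image):
        if any(row):
            if start is None:
                start = y
            last_ink = y
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gap:
                boxes.append(_bounding_box(image, start, last_ink))
                start = None
    if start is not None:  # page ends while a block is still open
        boxes.append(_bounding_box(image, start, last_ink))
    return boxes


def boxes_to_zones(boxes):
    """Serialize boxes as <zone> elements inside a <surface> element."""
    surface = ET.Element("surface")
    for n, (ulx, uly, lrx, lry) in enumerate(boxes, start=1):
        ET.SubElement(surface, "zone", {
            "n": str(n),
            "ulx": str(ulx), "uly": str(uly),
            "lrx": str(lrx), "lry": str(lry),
        })
    return ET.tostring(surface, encoding="unicode")
```

A single blank row (ordinary line spacing) does not split a block when min_gap is 2, so only the larger gaps between stanzas produce new boxes. Real page images would need binarization and noise removal first.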
Automatic image collation, or in more technical terms automatic image registration, is the centerpiece of our project. With two similar images, the target and the template, placed side by side, we select about 70 to 100 pairs of matching points that are evenly distributed across the images [Figure 2]. We register the target image to the template image and perform a thin-plate-spline transformation [code]. Once the matching points, or landmarks, are precisely selected, the target can be warped accurately onto the template, and the variants show up as blurs on the collated image [Figure 3].
Figure 2
This figure demonstrates the program's ability to deliberately deform a curved image to an acceptable standard, which allows users to work with images that were not necessarily scanned to a predetermined standard. The process tolerates variations due to lighting or even page curvature. The crosshairs mark the anchor points within the text around which the standardization takes place.
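To make the thin-plate-spline step concrete, here is a minimal Python sketch of fitting such a transformation from landmark pairs. It is an illustration under stated assumptions, not the project's Matlab implementation: it uses the standard radial basis U(r) = r^2 log r with an affine term, solves the resulting linear system by plain Gaussian elimination, and applies no regularization. The names fit_tps and warp are hypothetical.

```python
import math


def _u(r2):
    """TPS radial basis U(r) = r^2 log r, written in terms of r^2."""
    return 0.0 if r2 == 0 else 0.5 * r2 * math.log(r2)


def _solve(a, b):
    """Solve a @ x = b (b has several columns) by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(a)
    m = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate landmarks (for example, all collinear)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [[m[i][n + j] / m[i][i] for j in range(len(b[0]))] for i in range(n)]


def fit_tps(src, dst):
    """Return a function that maps target-image points onto the template.

    src: landmark coordinates (x, y) in the target image
    dst: the matching landmark coordinates in the template image
    """
    n = len(src)
    size = n + 3
    a = [[0.0] * size for _ in range(size)]
    b = [[0.0, 0.0] for _ in range(size)]
    for i, (xi, yi) in enumerate(src):
        for j, (xj, yj) in enumerate(src):
            a[i][j] = _u((xi - xj) ** 2 + (yi - yj) ** 2)
        a[i][n:] = [1.0, xi, yi]                          # affine part
        a[n][i], a[n + 1][i], a[n + 2][i] = 1.0, xi, yi   # side conditions
        b[i] = [float(dst[i][0]), float(dst[i][1])]
    coef = _solve(a, b)  # rows 0..n-1: bending weights, rows n..n+2: affine

    def warp(x, y):
        out = []
        for k in range(2):  # k = 0 gives the new x, k = 1 the new y
            v = coef[n][k] + coef[n + 1][k] * x + coef[n + 2][k] * y
            for (xi, yi), c in zip(src, coef):
                v += c[k] * _u((x - xi) ** 2 + (y - yi) ** 2)
            out.append(v)
        return tuple(out)

    return warp
```

Because the spline interpolates, every target landmark lands exactly on its template partner, and the rest of the image is bent smoothly between them. This is why a single badly placed landmark can distort a whole neighborhood, and why landmark selection matters so much.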
In the following, we focus on the problem of how to select matching landmarks automatically and precisely. Initial landmarks are selected by the popular scale-invariant feature transform (SIFT) algorithm [SIFT Package]. It is fast and robust to scaling, rotation, illumination changes, and local geometric distortion. The algorithm identifies about 1000 matches for each image pair, but most of these matches are either outliers or redundant. We have developed our own scoring system to obtain the final matches [CODE]. First, we remove duplicate matches based on their x-y coordinates. Second, we run a regression test across all non-duplicate matches, which scores each match by the effect it has on the resulting thin-plate-spline transformation matrix; a certain number of the poorest-scoring matches are then removed [CODE]. Third, we run a local filter on the remaining matches. The local filter classifies matches with the k-nearest neighbors (k-NN) algorithm: any match that falls beyond the threshold score when compared with its nearest neighbors is removed [CODE]. Finally, a 3x6 grid is applied to each page. Within each cell, matches are sorted locally and low-scoring matches are discarded. This last step helps ensure that the matches are evenly distributed.
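The filtering stages can be sketched in Python as follows. This is a simplified illustration, not the project's scoring system. It takes each match's score as given instead of computing it from the regression test, and it assumes that a higher score is better. It reads the local filter as a check of each match's displacement against the median displacement of its k nearest neighbors, and it reads the 3x6 grid as 3 columns by 6 rows. The Match tuple and the parameters k, threshold and per_cell are hypothetical.

```python
import math
from collections import namedtuple
from statistics import median

# (x1, y1): point in the target image; (x2, y2): point in the template image.
Match = namedtuple("Match", "x1 y1 x2 y2 score")


def dedupe(matches):
    """Keep one match per pair of pixel coordinates (the best-scoring one)."""
    best = {}
    for m in matches:
        key = (round(m.x1), round(m.y1), round(m.x2), round(m.y2))
        if key not in best or m.score > best[key].score:
            best[key] = m
    return list(best.values())


def knn_filter(matches, k=3, threshold=10.0):
    """Drop matches whose displacement disagrees with their k nearest neighbors."""
    kept = []
    for m in matches:
        neighbors = sorted(
            (o for o in matches if o is not m),
            key=lambda o: math.hypot(o.x2 - m.x2, o.y2 - m.y2),
        )[:k]
        if not neighbors:
            kept.append(m)
            continue
        dx = median(o.x1 - o.x2 for o in neighbors)
        dy = median(o.y1 - o.y2 for o in neighbors)
        deviation = math.hypot((m.x1 - m.x2) - dx, (m.y1 - m.y2) - dy)
        if deviation <= threshold:
            kept.append(m)
    return kept


def grid_select(matches, width, height, cols=3, rows=6, per_cell=5):
    """Keep at most `per_cell` best-scoring matches in each grid cell."""
    cells = {}
    for m in matches:
        col = min(int(m.x2 * cols / width), cols - 1)
        row = min(int(m.y2 * rows / height), rows - 1)
        cells.setdefault((row, col), []).append(m)
    kept = []
    for cell in cells.values():
        kept.extend(sorted(cell, key=lambda m: m.score, reverse=True)[:per_cell])
    return kept
```

With 18 cells, keeping about four or five matches per cell yields a landmark count in the range of 70 to 100 mentioned earlier, and it prevents densely textured regions from dominating the set.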
The software is implemented in Matlab, a high-level technical computing language. It was tested on eight different copies of “The Faerie Queene”, printed in Europe in the sixteenth century, with support from the Spenser Archive project. The book images were taken with a camera held perpendicular to the book. They are naturally warped, especially those close to the center.
To get a clean result, we preprocessed the images before collation: cropping the boundaries, scaling, rotating, and splitting two-page images into single-page images. These steps are all done automatically with traditional image processing methods.
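As one example of such a traditional method, the Python sketch below splits a two-page image at the gutter. It is a guess at a plausible approach, not the project's code. It assumes a binarized image given as rows of 0/1 values (1 = ink) in which the gutter is the emptiest column near the horizontal middle. In real photographs the gutter is often a dark shadow, so the criterion would need to be adapted. The function name and the search_fraction parameter are hypothetical.

```python
def split_spread(image, search_fraction=0.2):
    """Split a two-page binary image at the emptiest column near the middle.

    Only columns within `search_fraction` of the page width around the
    horizontal center are considered. Ties go to the column nearest the center.
    Returns (left_page, right_page) as lists of rows.
    """
    width = len(image[0])
    mid = width // 2
    half_band = max(1, int(width * search_fraction / 2))
    lo = max(1, mid - half_band)
    hi = min(width - 1, mid + half_band)

    def ink(x):
        # Number of ink pixels in column x.
        return sum(row[x] for row in image)

    gutter = min(range(lo, hi + 1), key=lambda x: (ink(x), abs(x - mid)))
    left = [row[:gutter] for row in image]
    right = [row[gutter:] for row in image]
    return left, right
```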
We randomly selected 284 pages from the books and performed a two-page collation. These pages contain 9 variants in total. Our test results show that all 9 variants are identified correctly [Figure 3]. However, there are also a large number of false positives [Figure 4], which significantly affect the performance of our software. These false positives are due either to limitations of the SIFT algorithm or to page quality. SIFT tends to find fewer features along the edges, which leads to too few matches being selected there. Poor image quality appears to be the more important cause of false positives: ink bleeding, bad warping, handwriting, blurred characters, water marks, fire marks, even unequal spacing between similar characters, and so on. All these issues are quite common among books of this period. Although it is impossible to eliminate false positives entirely, we are continuing to refine our methods to deal with these issues better. New issues may come up as we try more images. Although we have made big strides, many challenges remain.
Figure 3
This figure shows a variant in the first line. The text is shifted by about two characters between the two copies.
Figure 4
This figure shows 3 false positives at the edge of the page (SIFT limitation) and 2 false positives in the middle (quite possibly due to strong ink bleed on one image).