dighistjoe: Critique of HathiTrust

HathiTrust is a federation of over sixty major research institutions and libraries, working to preserve the ‘cultural record’ for the long-term future. The HathiTrust Digital Library (HTDL) is a digitised version of the partners’ physical collections, which allows them to be viewed by both the students of the partner institutions and the public.[1] The HTDL currently holds over 10.1 million volumes, making up 453 terabytes of digitised sources, and is growing daily.[2]

To help users navigate this vast digital library, the HTDL comes equipped with several ways of searching, which rely on different data capture methods. The first is the ‘Catalog Search’ bar, which searches information about the item the user is looking for. Items are marked up with a number of XML tags, and the search matches the user’s criteria against these tags; an advanced search option narrows the results down to something more specific. XML is used as a wrapper to hold metadata about objects in HathiTrust, both for preserving those objects and for providing access to them. I feel this method of searching works reasonably well, and once the site has generated a list of results, these can be further refined using a selection of different XML tags.
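To illustrate the general idea of searching on tagged metadata, here is a minimal Python sketch. It is my own illustration, not HathiTrust’s actual implementation: the record layout, the tag names (`title`, `author`, `date`, `language`) and the function `catalog_search` are all assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata records. The tag names are illustrative only
# and are NOT HathiTrust's real schema.
SAMPLE = """
<records>
  <record id="a1">
    <title>A History of England</title>
    <author>Smith, John</author>
    <date>1850</date>
    <language>eng</language>
  </record>
  <record id="a2">
    <title>Histoire de France</title>
    <author>Michelet, Jules</author>
    <date>1833</date>
    <language>fre</language>
  </record>
  <record id="a3">
    <title>A History of Scotland</title>
    <author>Smith, John</author>
    <date>1861</date>
    <language>eng</language>
  </record>
</records>
"""


def catalog_search(xml_text, **criteria):
    """Return the ids of records whose tagged fields contain every given value."""
    root = ET.fromstring(xml_text.strip())
    hits = []
    for record in root.iter("record"):
        # A record matches only if each requested tag contains its search value.
        if all(
            value.lower() in (record.findtext(tag) or "").lower()
            for tag, value in criteria.items()
        ):
            hits.append(record.get("id"))
    return hits


broad = catalog_search(SAMPLE, title="history")
refined = catalog_search(SAMPLE, title="history", date="1861")
```

Here `broad` matches two records, and adding a second tag in `refined` narrows the list to one, which is roughly what refining a results list by further tags amounts to.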

The next method of searching ignores XML tagging and searches for results directly in the text; it does this by using OCR (optical character recognition). When experimenting with in-text searching on HathiTrust, I found it to be quite accurate most of the time (see figures 3 & 4). However, like most OCR software, it falls foul of some common OCR errors. For example, on numerous occasions “the” is read as “th3” (see figures 5 & 6). I feel it would be unfair to blame this solely on HathiTrust’s underlying OCR software. OCR is rarely 100% accurate, and poor image quality doesn’t help. Figure 6 is a prime example of why the OCR can easily become confused, and the differences between figures 4 & 6 illustrate the inconsistency in image quality within the HTDL. Once again, this varying image quality cannot be blamed on the website itself, but rather on the fact that the sources are PDFs of the scanned originals. This means that the quality of the images on HathiTrust relies on the condition of the original. I feel figure 6 makes it clear that both the condition of the original source and the font used in it will inevitably affect the accuracy of the OCR.
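A small Python sketch shows why an error like “th3” matters for in-text searching: an exact search for “the” silently misses the misread word, whereas an approximate match can still find it. This is only an illustration under my own assumptions; the sample sentence, the function names and the 0.6 similarity threshold are invented for the example and say nothing about how HathiTrust’s search actually works.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    """Split text into lower-case words, keeping digits so 'th3' survives."""
    return re.findall(r"[A-Za-z0-9]+", text.lower())


def exact_hits(text, term):
    """Words that match the search term exactly."""
    return [t for t in tokens(text) if t == term.lower()]


def fuzzy_hits(text, term, threshold=0.6):
    """Words whose similarity to the term is at least `threshold` (an assumed value)."""
    term = term.lower()
    return [
        t for t in tokens(text)
        if SequenceMatcher(None, term, t).ratio() >= threshold
    ]


# Invented sample of OCR output with "the" misread twice as "th3".
ocr_text = "Th3 king and the queen left th3 castle"

exact = exact_hits(ocr_text, "the")
fuzzy = fuzzy_hits(ocr_text, "the")
```

The exact search finds only the one correctly read “the”, while the fuzzy search also recovers both misread occurrences. The trade-off is that a looser threshold would start to return unrelated words as well.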

Though I think HathiTrust’s visual design works very well in terms of accessibility, I still think the overall layout leaves a lot to be desired. The homepage is extremely direct, which I think makes the site very user-friendly: with two immediate search bars, the site presents the user with direct access to exactly what they came for. The problem I have with HathiTrust’s visual design is that it is simply not eye-catching enough. With no images and no decent navigation bar, I feel that the homepage could be quite unappealing to a first-time user, which in turn could leave the user reluctant to access the site.

Ultimately, I think that HathiTrust’s ambition of preserving cultural works for future generations is a fantastic use of the ever-growing collection of online historical resources. The site contains a wealth of historical resources on a broad range of subjects. I feel that the method of accessing the material is completely appropriate to the site: once the user has searched for their desired item, they will more than likely be met with not only what they were searching for, but also a host of other useful source materials.

Figure 1: ‘Catalog Search’ bar. Found on http://www.hathitrust.org/