HathiTrust is a federation of over sixty major research institutions and
libraries, working to preserve the ‘cultural record’ for the long-term future.
The HathiTrust Digital Library (HTDL) is a digitised version of the partners’
physical collections, which allows them to be viewed by both the partners’
students and the public.[1] The HTDL currently holds over 10.1 million volumes,
amounting to 453 terabytes of digitised sources, and is growing daily.[2]

To help users navigate this vast digital library, the HTDL comes equipped with
several ways of searching, which use different data capture methods. The first
is the ‘Catalog Search’ bar, which searches information about the item the user
is looking for rather than the text inside it. It can identify items from given
search criteria because each item is marked with a number of XML tags, and an
advanced search option narrows the results down to something more specific. XML
is used as a wrapper to hold metadata about objects in HathiTrust, both for
preserving those objects and for providing access to them. I feel this method
of searching works reasonably well, and once the site has generated a list of
results, these can be further refined by a selection of different XML tags.
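
As a rough illustration of how tag-based searching and refining can work, here is a minimal Python sketch. The record structure, the element names and the `search` function are all invented for this example; they are not HathiTrust's actual metadata schema or code.

```python
import xml.etree.ElementTree as ET

# Invented sample records. The element names are illustrative only and
# do not reflect HathiTrust's real metadata format.
SAMPLE = """
<records>
  <record>
    <title>A History of Printing</title>
    <language>English</language>
    <year>1901</year>
  </record>
  <record>
    <title>A History of Libraries</title>
    <language>German</language>
    <year>1925</year>
  </record>
  <record>
    <title>Notes on Bookbinding</title>
    <language>English</language>
    <year>1925</year>
  </record>
</records>
"""


def search(xml_text, **criteria):
    """Return the titles of records whose tagged fields contain every criterion.

    Each keyword argument names a tag and gives the text that tag must
    contain (case-insensitive). Adding more criteria narrows the results.
    """
    root = ET.fromstring(xml_text.strip())
    matches = []
    for record in root.iter("record"):
        if all(
            value.lower() in (record.findtext(tag) or "").lower()
            for tag, value in criteria.items()
        ):
            matches.append(record.findtext("title"))
    return matches


broad = search(SAMPLE, title="history")
refined = search(SAMPLE, title="history", language="german")
```

Here `broad` matches both history titles, while `refined` keeps only the German one, which mirrors the way a result list can be narrowed by choosing further tags.
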

The next method of searching ignores XML tagging and looks for results directly
in the text of each volume; it does this by using optical character recognition
(OCR). When experimenting with in-text searching on HathiTrust, I found it to
be quite accurate most of the time (see figures 3 & 4). However, like most OCR
software, it falls foul of some common OCR errors. For example, on numerous
occasions “the” is read as “th3” (see figures 5 & 6). I feel it would be unfair
to blame this solely on HathiTrust’s underlying OCR software. OCR is rarely
100% accurate, and poor image quality makes matters worse. Figure 6 is a prime
example of why the OCR can so easily be confused. The differences between
figures 4 & 6 illustrate the inconsistency in image quality within the HTDL.
Once again, this cannot be blamed on the website itself, but rather on the fact
that the sources are scans of the physical originals. This means that the
quality of the images on HathiTrust relies on the condition of the original. I
feel figure 6 makes it clear that both the condition of the original source and
the font used in it will inevitably affect the accuracy of the OCR.
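
To show why a misreading such as “th3” matters for searching, and one way a search tool could work around it, here is a small Python sketch. It is my own illustration, not a description of how HathiTrust handles OCR output; the table of digit-for-letter confusions and both function names are assumptions made for the example.

```python
import re

# Assumed look-alike confusions (digit read in place of a letter).
# A real OCR engine's error profile will differ.
CONFUSIONS = str.maketrans({"3": "e", "0": "o", "1": "l", "5": "s"})


def normalise_token(token):
    """Swap look-alike digits for letters, but only in tokens that mix
    letters and digits, so that genuine numbers such as years are left alone."""
    has_letter = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    if has_letter and has_digit:
        return token.translate(CONFUSIONS)
    return token


def find_word(text, word):
    """Return the token positions where `word` occurs after normalisation."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    target = word.lower()
    return [
        i for i, token in enumerate(tokens)
        if normalise_token(token).lower() == target
    ]


ocr_text = "Th3 end of the war in 1945"
positions = find_word(ocr_text, "the")
```

A plain search for “the” would miss the first word of `ocr_text`, whereas the normalised search finds both occurrences. The trade-off is that genuine mixed tokens (a shelf mark like “B12”, say) would also be rewritten, so a rule like this would need care before being used on real data.
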

Though I think HathiTrust’s visual design works very well in terms of
accessibility, I still think the overall layout leaves a lot to be desired. The
home page is extremely direct, which I think makes the site very user-friendly:
with two immediate search bars, it gives the user direct access to exactly what
they came for. The problem I have with HathiTrust’s visual design is that it is
simply not eye-catching enough. With no images and no decent navigation bar, I
feel that the homepage could be quite unappealing to a first-time user, which
in turn could leave the user reluctant to explore the site.

Ultimately, I think that HathiTrust’s ambition of preserving cultural works for
future generations is a fantastic use of the ever-growing collection of online
historical resources. The site contains a wealth of historical material on a
broad range of subjects. I feel that the methods for accessing that material
are completely appropriate to the site: once the user has searched for their
desired item, they will more than likely be met with not only what they were
searching for, but also a host of other useful source materials.

Figure 2: Advanced Catalog Search Options. Found on: http://catalog.hathitrust.org/Search/Advanced

Figure 3: OCR search result. Found on: http://babel.hathitrust.org/cgi/pt/search?id=mdp.39015012114313&view=image&seq=11&q1=%22hitler+thought%22

Figure 4: OCR result in text. Found on: http://babel.hathitrust.org/cgi/pt?id=mdp.39015012114313;view=image;seq=34;q1=%22hitler%20thought%22;start=1;size=10;page=search;num=20

Figure 5: OCR search result. Found on: http://babel.hathitrust.org/cgi/pt/search?id=mdp.39015068572810&view=image&seq=3&q1=TH3

Figure 6: OCR result in text. Found on: http://babel.hathitrust.org/cgi/pt?id=mdp.39015068572810;view=image;seq=22;q1=TH3;start=1;size=10;page=search;num=12