Thomas Crombez

Bridging the Gap between Scholarly Editing and Mass Digitization


A hybrid model for online scholarly publishing projects
(This is the edited version of a paper presented at the TEI Conference and Members’ Meeting in Ann Arbor, November 2009. Slides of the original presentation may be found here)

In this essay, I would like to highlight the differences between mass digitization efforts such as Google Books, and small-scale scholarly editing projects, most of which use one of a number of standards developed by the academic text-editing community (for example, the Text Encoding Initiative or TEI).

Since the inception of large-scale book digitization projects from 2004 onwards, it has become clear that such initiatives start from a very different point of view than academic digitization efforts. Quantity and commercial opportunity, rather than editorial accuracy, seem to be the primary motives. Still, both are partly engaged in the same project – the digitization of humanity’s printed past. Should they remain separate kinds of efforts, one commercial and the other academic?

My aim is not to theorize about the wider implications of both kinds of digital publishing, but to demonstrate how they may co-exist. Such partnerships, I will argue, consist in maximizing the links that can be constructed between the two.

More particularly, it will be made clear that automatic text analysis could play a crucial role in the process of bridging the gap between scholarly editing projects and mass digitization undertakings. This would require the integration of three distinct areas of knowledge:

  • scholarly editing;
  • mass digitization projects, meaning million-volume online collections of texts which are not scholarly edited (examples include the Gutenberg Project, the Million Books Project, Google Books, the Internet Text Archive);
  • automatic text analysis, i.e., instruments from language technology, also known as Natural Language Processing or Computational Linguistics, that may help in linking these two very different kinds of document collections.

My argument will close with an elaborate discussion of two sample projects that may serve as ‘proofs of concept.’

Mass Digitization Projects

The most well-known example of mass digitization is probably Google Books, which was launched in 2004. Of the millions of books digitized by the company so far (in cooperation with a growing number of academic and public libraries in the US and in Europe) only a limited number are in the public domain, and hence available for so-called ‘full view.’ The viewing of books that are still in print is restricted to a limited number of pages, or even short ‘snippets’ for works whose rights holder has not yet been identified. (Google Books 2010; Wikipedia 2010)

The distinguishing characteristic of projects such as Google Books is their massive size, spanning millions of volumes. However, when it comes to collections of digital documents, ‘size’ is not merely a linear measure of information. A quote from a Google engineer from 2005, recorded by George Dyson, succinctly sums up the aims of the search company:

We are not scanning all those books to be read by people. We are scanning them to be read by an AI. (Dyson 2005)

What exactly do ‘AI’ or ‘metadata extraction’ mean for the end-user of a massive digital library? The most illustrative example is provided by the current interface of Google Books. Not only does it present the obvious metadata concerning a book (author, publisher, year of publication, number of pages), but also a wealth of additional information:

  • Chapter endings and beginnings are automatically detected, in order to create a table of contents that links to the correct pages.
  • A list of references to this book from various web pages is compiled automatically, and also a list of references to this book from other books in Google Books.
  • ‘Popular passages’ are extracted from the book, i.e., passages that are frequently quoted in other books.
  • A list of ‘common terms and phrases’ for the book is provided, namely, keywords and names that the artificial intelligence engine believes are characteristic of this work. As with corresponding features on websites such as Amazon.com (which uses ‘SIPs’ or Statistically Improbable Phrases to characterize a book), the list of characteristic terms seems to be composed using a measure such as TF-IDF, or Term Frequency multiplied by Inverse Document Frequency. This metric results in a selection of words or combinations of words that are relatively rare in the collection as a whole, but that occur significantly more often in this particular book.

All of this information is extracted automatically from the book’s textual contents. The various applications of AI or machine learning that are used (for example, the TF-IDF metric) may be grouped under the term automatic text analysis.
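
By way of illustration, the following is a minimal Python sketch of TF-IDF weighting (Python is also the language used in the projects discussed below). It is not the metric Google actually applies, which is not publicly documented, and the toy ‘books’ are invented:

    import math
    from collections import Counter

    def tf_idf(documents):
        # Compute a TF-IDF score for every term in every document.
        # `documents` is a list of token lists; the result is one
        # {term: score} dictionary per document.
        doc_freq = Counter()              # number of documents containing each term
        for tokens in documents:
            doc_freq.update(set(tokens))
        n_docs = len(documents)

        scores = []
        for tokens in documents:
            term_freq = Counter(tokens)
            scores.append({
                term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
                for term, count in term_freq.items()
            })
        return scores

    # Toy example: terms that are frequent in one 'book' but rare in the
    # collection as a whole receive the highest weights; words that occur
    # everywhere (such as 'the') receive a weight of zero.
    books = [
        "the history of the flemish theatre".split(),
        "the internet and the digital library".split(),
        "the poetics of the flemish avant garde theatre".split(),
    ]
    for profile in tf_idf(books):
        print(sorted(profile, key=profile.get, reverse=True)[:3])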

These ‘extras,’ I believe, are already changing the way humanities scholars do research today. The collection is enriched with semantic markup that is generated by means of the collection itself. This might be compared to how the catalogue of a traditional print library, or a particular ordering of the shelves (e.g., according to discipline or subject), adds information to the library’s collection.

Such automatically extracted metadata constitute the crucial innovation of mass digitization projects. Compared to the ‘markup’ or organizational structure that sets apart a print library from a pile of books, the main differences introduced by this sort of markup may be summed up as follows. Automatically generated metadata carry much more information about the contents of the works than a traditional catalogue or subject index. Moreover, the metadata can easily be updated as new items are added. And finally, the number of interconnections between books (all of which are immediately available to the user) vastly increases.

The massive ‘size’ of a digital library, then, is not merely to be conceived as a linear measure of information – the depth and scope of the available metadata is actually increased exponentially.

Still, massive text collections also feature a number of important drawbacks. The resistance of academics and other readers to Google Books grew rapidly after the launch. One of the reasons was that the textual content of the books was often inaccurate. Such errors are mainly due to shortcomings in either the available materials (books that are badly printed, or stained) or in the OCR software (Optical Character Recognition) that is employed to ‘read’ the books.

Deficient text seriously endangers the potential of mass digitization projects. It leads to search errors (both words that are in the text but are not found, and words that are found although they are not in the text) and to erroneous classifications or incorrect metadata.

In 2009, following the Google Books Settlement Conference at UC Berkeley, Geoff Nunberg wrote a blog post in which he described Google Books as a ‘metadata train wreck,’ precisely because its systems try to deduce so much metadata about the books in its collection through a purely algorithmic procedure. Amongst the many examples he cites, the following might illustrate the core of the problem:

More mysterious is the entry for a book called The Mosaic Navigator: The essential guide to the Internet Interface, which is dated 1939 and attributed to Sigmund Freud and Katherine Jones. My guess is that this is connected to Jones’ having been the translator of Freud’s Moses and Monotheism, which must have somehow triggered the other sense of the word mosaic, though the details of the process leave me baffled. (Nunberg 2009)

Because language technology fundamentally works by generalizing from a limited amount of textual data, errors such as these will be endemic, certainly in the first phase of the project, especially when speed and size are valued above data quality and consistency. That trade-off is precisely what sets mass digitization apart from scholarly editing. Errors in the text of books and their metadata have technological causes, but derive from a principled starting point: the choice not to use a rigorous editing procedure (according to scholarly standards) in order to save time. Moreover, the editing procedure is not publicly documented, for instance, through an editorial declaration.

Scholarly Editing Projects

Mass digitization projects are unlike digital scholarly editing projects on virtually every point. (A broad range of examples may be found on the ‘Projects’ page of the TEI website, tei-c.org/Activities/Projects.) First, the amount of data in editing projects is extremely small when compared to initiatives such as Google Books. Even the largest academic digitization projects include at most a few hundred or a few thousand works (for example, the Oxford Text Archive, the Digital Library for Dutch Language and Literature (dbnl.org), or the Deutsches Textarchiv).

Due to the smaller number of documents, using language technology to automatically extract metadata will also be less successful. Applying a metric for computing characteristic words such as TF-IDF depends substantially on the number of words available; with too little data the results are easily skewed.

Scholarly editing projects have very different characteristics. The textual data, although often also generated by OCR, is almost always checked and double-checked by hand, so errors are rare. The same holds for the metadata, which is manually entered, often in great detail. Finally, the editing procedure is carefully documented. Standards such as TEI reserve specific portions of the document encoding format for declaring editorial procedures, and encourage editors to use them. (TEI Consortium 2007, 2.3)

Crossing the Divide

So far, my story has followed the established dividing line between these two very different kinds of projects: one based on small-scale but careful and rigorous editorship, the other on bringing together very large quantities of swiftly gathered ‘junk text.’

However, this divide may soon become outdated. The texts of Google Books and the Internet Archive may be inaccurate sometimes, but the quality of both scanning equipment and OCR software has been steadily improving. Metadata is sometimes poor, but is becoming richer and more fine-grained as the collections expand.

As I see it, the question is not when to cross the divide between scholarly publication projects and mass digitization projects, but how. In that scenario it is crucial to investigate what opportunities and threats may present themselves.

A Hybrid Model for Interchange

I propose a hybrid model of possible interactions between TEI projects and mass digitization projects. The model consists of three distinct components.

The first component is the Core: a carefully curated but relatively small collection of reliable textual data, for instance generated by a scholarly editing project, whose documents will typically be strongly interrelated (e.g., from the same author, genre, and/or period). The documents feature extensive metadata and are edited according to an accepted standard such as TEI-XML.

The second component is the Cloud: not a single resource but a conglomerate of large and publicly available repositories of text or data, which should preferably be queryable in an automatic way (more on this below). Examples of text repositories include large-scale digitization projects such as those discussed above, but possibly also other scholarly editing projects that are thematically or chronologically associated with the Core texts. Examples of data repositories are lexical reference works (WordNet, Wordnik, WordHoard), online encyclopedias (Wikipedia and its machine-readable variant DBpedia.org), scholarly databases (JSTOR), bibliographical catalogues (WorldCat), and subject headings (Library of Congress Subject Headings).

Connecting the Core with the Cloud is the bridging component of the model: the Interface. The Interface comprises a website or a number of web services that enable researchers to visualize and annotate the contents of the Core. It provides metadata that complement the markup produced during the scholarly editing of the Core documents. These metadata are generated by constructing queries based on either the raw textual data or the layers of markup, which may first be pre-processed by means of automatic text analysis. The queries are then sent off to the Cloud, and the returned information is added as metadata to the document and immediately visualized for the user.

In other words, the crux of the model is to query, analyze and re-format the data in the Cloud, in order to stream relevant information to a specialized audience of researchers studying a particular set of Core texts.

The Programming Scholar

One element in the model sketched above is vital to its success: the repositories that constitute the Cloud must allow programmatic access to their resources by means of an API or Application Programming Interface. Put briefly, an API allows web programmers to query a given web resource in the background, serving up its results in a different form to the users of their own web applications.

A travel website, for instance, may query both a weather service and an airline fares service when a user searches for information on a given location, in order to deliver all of these results (together with information from its own resources) on a single page. This practice is often known as a ‘mash-up.’

Some of the text and data repositories mentioned above already provide programmatic access to their resources through an API. It may be noted that, at the time of writing, there is a well-documented public API available for Google Books, but not for the Internet Text Archive.

If no API is available for a given resource, and it is unlikely that one will be made available in the future, one may resort to querying the website through a general search API (provided by search engines such as Google or Bing), for instance in order to present search results automatically collected from Wikipedia or another resource next to the Core documents of your editing project. More examples will be discussed in detail in the ‘Proofs of Concept’ section below.
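
As a minimal sketch of this approach, the snippet below issues a site-restricted query through Google’s Custom Search JSON API, one general search API that is currently available. The credentials are placeholders, and the projects described below used the search APIs available at the time, not this one:

    import json
    import urllib.parse
    import urllib.request

    # Placeholders: supply your own Custom Search API key and search engine id.
    API_KEY = "..."
    ENGINE_ID = "..."

    def site_search(term, site="nl.wikipedia.org", limit=5):
        # Return (title, url) pairs for pages on `site` that mention `term`.
        params = urllib.parse.urlencode({
            "key": API_KEY,
            "cx": ENGINE_ID,
            "q": f"{term} site:{site}",
        })
        url = "https://www.googleapis.com/customsearch/v1?" + params
        with urllib.request.urlopen(url) as response:
            results = json.load(response)
        return [(item["title"], item["link"]) for item in results.get("items", [])][:limit]

    # e.g. site_search("J.O. de Gruyter") could feed a sidebar of external
    # links next to a Core document that mentions this director.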

If we take the model one step further, it is possible to envision a future kind of digital editing project, whose essential feature would be the combination of manually added markup and metadata with automatically extracted metadata. Imagine the wealth of information for describing your documents that would be available in a system that is both rigorously edited and automatically analyzed using:

  • topic detection
  • genre detection
  • keyword extraction
  • named entity recognition
  • automatic translation of foreign terms
  • automatic summarization
  • automatic authorship attribution of disputed documents

All of these technologies (combined with the initially added markup) could lead to a very extensive method for profiling documents. If all of the metadata describing a document are fed into a document profile, these may be compared internally (e.g., for searching related texts in the Core) or used for external purposes (e.g., for use in a larger text profiling project, such as authorship attribution research). Document profiles could be employed to draw up a typical profile of the texts written by a certain author, or written in a certain genre.

Proofs of Concept

I would like to illustrate this model by means of two recent publishing projects in which I was involved. Both were small-scale scholarly editing projects collecting documents on Flemish theatre. The texts stem from various periods of the twentieth century. Access to both projects is provided through the website corpustoneelkritiek.org. (Although the source texts are mainly in Dutch, the interface is entirely in English.)

Corpus Toneelkritiek Interbellum (Flemish Theatre Reviews 1919-1939) is a collection of short documents on theatre, chiefly reviews and essays, from the inter-war period. It includes 686 documents which were marked up according to the TEI P5 Guidelines. Mainly semantic elements are used, such as <persName>, <orgName>, and <title>, to indicate where a personal name, an organization’s name, or a title occurs. (CTI 2009)

For instance, in the following passage the <persName> element was used to encode different spellings of the Flemish director J.O. de Gruyter’s name. Using the @key attribute, these variant spellings all refer to a single entity (here abbreviated JODG) in a database of names. Furthermore, all occurrences of personal names are typed using the @type attribute to specify the person’s main profession (such as writer, director, poet, composer, or philosopher).

De herinnering aan <persName key="JODG" type="director">dr J.O. de Gruyter</persName>, die zich ook in Nederlandsche tooneelkringen grooten naam verwierf, blijft levendig. Telkens verschijnen er artikelen aan hem en zijn levenswerk gewijd. In dit najaar zal het licht zien, samengesteld door Emmanuel de Bom, een “De Gruyter-Boek”. Dit bevat bijdragen van verschillende specialiteiten en herinneringen van <persName key="JODG" type="director">De Gruyters</persName> vrienden.

[The memory of Dr J.O. de Gruyter, who also gained great renown in Dutch theatre circles, remains alive. Time and again, articles devoted to him and his life’s work appear. This autumn a “De Gruyter Book”, compiled by Emmanuel de Bom, will be published. It contains contributions from various specialists and reminiscences from De Gruyter’s friends.]

The corpus is made available online through the project’s website corpustoneelkritiek.org/cti. For this purpose, all XML source files were converted to static HTML using the XML parsing modules from the Python programming language. Additionally, a simple search engine was added, again written in Python and based on the search engine described in Toby Segaran’s book Programming Collective Intelligence (Segaran 2007).

However, the website does not merely present the text of the digitized documents. The manually added metadata, encoded in the markup, is processed during the generation of the HTML files in order to enrich the visualization of the documents with meta-information. Most importantly, document metadata was collected in document profiles, which were then mutually compared to determine internal links. This was again done by parsing and analyzing the content of the XML files using Python.
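
As an indication of what this processing step looks like, the following sketch parses a small TEI-like fragment (modelled on the encoding described above, not a verbatim extract from the corpus) and collects the @key and @type values of every <persName> element:

    import xml.etree.ElementTree as ET

    # A TEI-like fragment modelled on the encoding described above
    # (not a verbatim extract from the corpus).
    FRAGMENT = """
    <p xmlns="http://www.tei-c.org/ns/1.0">
      De herinnering aan <persName key="JODG" type="director">dr J.O. de Gruyter</persName>
      blijft levendig. Dit bevat herinneringen van
      <persName key="JODG" type="director">De Gruyters</persName> vrienden.
    </p>
    """

    TEI_NS = "{http://www.tei-c.org/ns/1.0}"

    def personal_names(xml_string):
        # Collect (key, type, spelling) triples for every persName element.
        root = ET.fromstring(xml_string)
        return [
            (el.get("key"), el.get("type"), " ".join(el.itertext()).strip())
            for el in root.iter(TEI_NS + "persName")
        ]

    for key, role, spelling in personal_names(FRAGMENT):
        print(key, role, spelling)
    # Both variant spellings resolve to the same entity key, JODG.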

Document Profiles

A document profile contains the author of the text, its year of publication, its month of publication, and the full list of personal names, organization names, and titles mentioned. Additionally, terms that are supposed to characterize the document are added using automatic text analysis. After reading each file’s textual contents, highly frequent function words are removed by comparing the text with a dictionary of Dutch stop words. Any single word (or ‘unigram’) that occurs four times or more in the remaining text is considered to be ‘characteristic’ for this document and added to the profile. Any combination of two words (or ‘bigram’) that occurs two times or more is also added.
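
A minimal sketch of this profiling step follows. The stop-word list is abbreviated and the implementation details of the actual project may differ, but the thresholds are those described above:

    from collections import Counter

    # Abbreviated stop-word list; the project used a full dictionary of Dutch stop words.
    DUTCH_STOP_WORDS = {"de", "het", "een", "en", "van", "in", "die", "dat", "er", "aan"}

    def document_profile(text, author, year, month, names, orgs, titles):
        # Manual metadata plus automatically selected 'characteristic' terms.
        tokens = [w for w in text.lower().split() if w not in DUTCH_STOP_WORDS]
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))

        profile = {author, year, month, *names, *orgs, *titles}
        profile.update(w for w, n in unigrams.items() if n >= 4)   # unigrams occurring 4+ times
        profile.update(b for b, n in bigrams.items() if n >= 2)    # bigrams occurring 2+ times
        return profile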

I should immediately add that this is quite a rudimentary method for computing document keywords. More sophisticated methods, such as TF-IDF, will be discussed in the second case study. However, the method was sufficiently effective for the purposes of this corpus, since most texts are of similar length (on average 1120 words). Hence, it was possible to establish an absolute threshold for both unigram and bigram frequency (four and two occurrences, respectively).

With the document profiles for each corpus text at hand, documents may be compared in order to detect relations. In this case, two documents were considered to be ‘related’ if more than half of their document profiles overlapped. This overlap was further normalized by taking into account the length of the document profiles, so that exceptionally long texts would not be prioritized because of their higher number of keywords.
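
The comparison itself can be sketched as follows; the normalization shown here (dividing by the size of the smaller profile) is one reasonable reading of the description above, not necessarily the exact formula used in the project:

    def overlap(profile_a, profile_b):
        # Normalized overlap between two document profiles (0.0 - 1.0).
        shared = profile_a & profile_b
        return len(shared) / min(len(profile_a), len(profile_b))

    def related_documents(target, profiles, threshold=0.5, limit=10):
        # Suggest up to `limit` documents whose profile overlap exceeds the threshold.
        scored = [
            (overlap(profiles[target], profile), doc_id)
            for doc_id, profile in profiles.items()
            if doc_id != target
        ]
        return sorted(
            [(score, doc_id) for score, doc_id in scored if score > threshold],
            reverse=True,
        )[:limit]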

On the project website, ten possibly related texts are suggested to the user who is consulting a corpus document. The interface specifies on which overlapping elements the suggestion is based, e.g., whether a text is considered to be related because there is a thematic relation, because it was written by the same author, or because it was written in the same period. Furthermore, a sidebar next to the text lists all personal names, organizations (e.g., theatre companies), locations (e.g., playhouses), and titles that have been marked up in that specific document.

Advanced Search

So far, the extra information that the web interface added to the visualization of TEI-encoded documents was strictly derived from the corpus itself. Next, I will discuss an example of how this may be connected to external text and data repositories.

All names and organizations presented in the sidebar of documents on the CTI website are clickable. Each hyperlink points to an ‘advanced search’ function: when an item from the sidebar is clicked, quotations mentioning that item are collected from:

  1. the corpus itself
  2. Google Books
  3. the Dutch version of Wikipedia (nl.wikipedia.org)
  4. the Digital Library of Dutch Language and Literature (dbnl.org, the largest online collection of scholarly editions of Dutch books)

While the rest of the website consists of static HTML files derived from the XML source files, the advanced search functionality depends on Python web programming. Queries generated from the list of names in the sidebar are sent to the Google Book Search API, to collect titles and extracts from books mentioning the query, and to the Yahoo Web Search API, to collect pages from the Dutch Wikipedia and from DBNL. The APIs generally return their results as RSS feeds, i.e., as XML, which can be conveniently parsed in Python using its XML parsing modules or Mark Pilgrim’s Universal Feed Parser (feedparser.org).
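
Schematically, the flow looks as follows. The feed URLs below are placeholders standing in for the book-search and web-search endpoints, since the exact APIs used at the time are not reproduced here:

    import urllib.parse
    import feedparser  # Mark Pilgrim's Universal Feed Parser

    # Placeholder endpoints standing in for the book-search and web-search APIs.
    FEED_ENDPOINTS = [
        "https://example.org/booksearch/feed?q={query}",
        "https://example.org/websearch/feed?q={query}+site:nl.wikipedia.org",
        "https://example.org/websearch/feed?q={query}+site:dbnl.org",
    ]

    def advanced_search(name):
        # Collect (source, title, link) triples for a name from several feeds.
        results = []
        for template in FEED_ENDPOINTS:
            url = template.format(query=urllib.parse.quote_plus(name))
            feed = feedparser.parse(url)      # the RSS/Atom feed is parsed into entries
            for entry in feed.entries:
                results.append((feed.feed.get("title", url), entry.title, entry.link))
        return results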

The final result is a remarkably efficient ‘research engine.’ During an advanced search query, information is harvested from so many sources that the page starts to function as an ‘improvised encyclopedia.’ Because both the page’s information and the presumed user’s background and interests are quite narrowly circumscribed, search results are almost always relevant. The suggestions of related texts, on the other hand, serve as an easy tool to browse the collection (instead of the inherently meaningless search box that is offered on the landing page of many web resources).

Named Entity Detection

A different approach was followed for the construction of Pieter T’Jonck Theatre and Dance Reviews, a research website collecting 1455 reviews by the Belgian critic Pieter T’Jonck (from 1985 up to the present). For this project, the collection of documents was too large for the editorial team to mark up manually. Only basic metadata were added manually by the editors, such as the publication platform, the date of publication, and the performance under review.

In order to enrich the available metadata for each document, the reviews were analyzed by an automatic linguistic parser to extract all named entities. For this purpose, a Dutch-language version of MBSP (Memory-Based Shallow Parser) was used. MBSP was developed at the universities of Tilburg and Antwerp (clips.ua.ac.be/pages/MBSP) and automatically applies various levels of linguistic annotation.
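
The parser’s output is not reproduced here. As a rough indication of the enrichment step, the sketch below uses a naive capitalization-based stand-in for the named entity extraction and merges its output with the manually added metadata; the sample review values are invented:

    import re

    def naive_named_entities(text):
        # Very naive stand-in for the shallow parser: sequences of capitalized tokens.
        # In the actual project these entities come from MBSP's linguistic annotation.
        return set(re.findall(r"[A-Z][\w'-]+(?:\s+[A-Z][\w'-]+)*", text))

    def enrich(review):
        # Merge the manually added metadata with automatically extracted entities.
        review = dict(review)
        review["entities"] = sorted(naive_named_entities(review["text"]))
        return review

    # Invented sample values, not taken from the corpus.
    review = {
        "platform": "newspaper X",
        "date": "1995-03-12",
        "performance": "Rosas danst Rosas",
        "text": "Anne Teresa De Keersmaeker toont met Rosas danst Rosas een strakke choreografie.",
    }
    print(enrich(review)["entities"])
    # ['Anne Teresa De Keersmaeker', 'Rosas']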

On the project website, the named entities extracted by MBSP are presented to users consulting a document, together with the manually added metadata such as the date of publication and the performance reviewed in that document. Again, these metadata serve as the basis for new queries using an advanced search functionality analogous to that described above.

Closing Remarks

The case studies discussed in the previous section will, I hope, throw sufficient light on the possibilities of the hybrid model that I have introduced in this essay. It must be added that these case studies only scratch the surface. For instance, although external resources are queried to enrich the research experience of the user, these queries do not contain any ‘semantic’ information.

For instance, the advanced search functionality does not query Wikipedia for persons corresponding to a certain name, but merely for Wikipedia pages mentioning that name. More sophisticated systems that use semantic web technology (such as the Linked Data initiative) go far beyond these restrictions and provide much more controlled searching. On the other hand, semantic descriptors are only available for a handful of resources. Using publicly available APIs and repositories (such as Wikipedia) has the advantage of searching the widest possible array of sources.

I would like to conclude with a statement by Daniel J. Cohen from a discussion on ‘the promise of digital history,’ published in the September 2008 issue of the Journal of American History:

Digital history and the abundance it tries to address make many historical arguments seem anecdotal rather than comprehensive. Hypotheses based on a limited number of examples, as many dissertations and books still are, seem flimsier when you can scan millions of books at Google to find counterexamples. (Cohen et al. 2008)

Although I fully agree with Cohen that digital repositories open up new vistas of research which require new methodological training for many humanities scholars, the hidden question that remains unanswered is how we will cope with the challenges of digital abundance. Not only will more and ever vaster repositories be needed, but also adequate tools and interfaces to query, browse, and interconnect the information they hold.

List of Works Cited

Cohen, Daniel J., Michael Frisch, Patrick Gallagher, Steven Mintz, Kristen Sword, Amy Murrell Taylor, William G. Thomas III and William J. Turkel. 2008. “Interchange: The Promise of Digital History.” Journal of American History 95.2. http://www.journalofamericanhistory.org/issues/952/interchange/index.html

CTI. 2009. Corpus Toneelkritiek Interbellum. http://www.corpustoneelkritiek.org/cti

Dyson, George. 2005. “Turing’s Cathedral: A visit to Google on the occasion of the 60th anniversary of John von Neumann’s proposal for a digital computer.” Edge: The Third Culture. October 24. http://www.edge.org/3rd_culture/dyson05/dyson05_index.html

Google Books. 2010. “History of Google Books.” Accessed August 25. http://books.google.com/intl/en/googlebooks/history.html

Nunberg, Geoff. 2009. “Google Books: A Metadata Train Wreck.” Language Log. August 29. http://languagelog.ldc.upenn.edu/nll/?p=1701.

Segaran, Toby. 2007. Programming Collective Intelligence: Building Smart Web 2.0 Applications. Sebastopol, CA: O’Reilly.

TEI Consortium. 2007. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Last updated July 6, 2010.

Wikipedia. 2010. “Google Books.” Accessed August 25. http://en.wikipedia.org/wiki/Google_Books
