Thomas Crombez

A Naive Bayesian in the auction house, pt. 1

In Digital History Hacks on March 15, 2010 at 10:52 pm

The title of this post refers to a fascinating series from the summer of 2008 on William Turkel’s blog Digital History Hacks. He used the digitized archive of court records from the Old Bailey (London’s central criminal court) in order to demonstrate the new avenues of research made possible through digital and computational tools. Reading this must have been one of those moments of insight which truly tempted me into the territory of digital history.

More particularly, Turkel showed how you may train an automatic classifying program (known as a “Naive Bayes Classifier” for mathematical reasons) to identify the type of crime based solely on the transcript of the court session.
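
To make the idea concrete, here is a minimal sketch of such a classifier in plain Python. It is not Turkel's code. The tokenizer, the add-one smoothing, the function names and the four toy "trials" are all assumptions invented for illustration. The classifier counts how often each word occurs under each label and then picks the label that makes the words of a new text most probable.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: lowercase and split on whitespace; real transcripts need more care.
    return text.lower().split()

def train(labelled_docs):
    """Count documents and words per label from (text, label) pairs."""
    doc_counts = Counter()               # number of documents per label
    word_counts = defaultdict(Counter)   # word frequencies per label
    vocabulary = set()
    for text, label in labelled_docs:
        doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return doc_counts, word_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        # Log prior: how common is this label in the training data?
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)

# Invented toy data, NOT taken from the Old Bailey records.
TOY_TRIALS = [
    ("the prisoner struck him with a stick and wounded his head", "assault"),
    ("he beat the prosecutor and struck him violently", "assault"),
    ("the prisoner stole a watch from my pocket", "theft"),
    ("she took two silver spoons from the shop and stole a coat", "theft"),
]

model = train(TOY_TRIALS)
print(classify("he struck me with a stick", model))
```

The "naive" part is the assumption that each word contributes independently to the score, which is why the per-word log-probabilities are simply added up.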

In the conclusion of his series, Turkel sums up various reasons why machine learning can prove a valuable tool for historical research. The real eye-opener for me was his demonstration of how much you can learn by scrutinizing the errors made by a classifier that does not perform especially well. In the example of the Old Bailey, Turkel trained a classifier to identify cases of assault amongst nineteenth-century court records. The classifier is not particularly accurate, but accuracy is not the main point here. It gets interesting when you study the cases that the classifier falsely identified as belonging to the assault category. One example is particularly revealing of the kind of “fuzzy” searching that machine learning may lead to:

In other words, about 96% of the learner’s false positive “errors” in this case were other kinds of assault. What of the trials classified as “miscellaneous – other”? One was this trial, where 44 year old William Blackburn was found guilty of “unlawfully and maliciously administering to Hannah Mary Turner 6 drachms of tincture of cantharides, with intent to excite, &c.” I understand that this case probably doesn’t fit the definition of assault used by either Blackburn’s contemporaries or by the person who coded the file. Nevertheless, it is not completely unrelated to the idea of an assault, and is exactly the kind of source that a historian could use to shed light on gender relations, sexuality, or other topics.

Of the advantages Turkel lists for automatic classifiers in research on large digitized collections, learning from false positives proves particularly important, “giving you a way of finding interesting things just beyond the boundaries of your categories.”
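
The inspection of false positives can be sketched in a few lines of Python. Everything here is hypothetical: the record layout (trial id, label assigned by the human coder, label predicted by the classifier), the ids and the labels are invented, and this is not Turkel's own procedure. The point is only to show how the "errors" can be pulled out and tallied by the category the coder actually assigned.

```python
from collections import Counter

def false_positives(records, target):
    """Return records predicted as `target` whose coded label is something else."""
    return [r for r in records if r[2] == target and r[1] != target]

# Hypothetical (trial_id, coded_label, predicted_label) triples.
RECORDS = [
    ("t1", "assault", "assault"),
    ("t2", "wounding", "assault"),
    ("t3", "miscellaneous-other", "assault"),
    ("t4", "theft", "theft"),
    ("t5", "wounding", "assault"),
    ("t6", "assault", "theft"),
]

fps = false_positives(RECORDS, "assault")

# Tally the false positives by the category the human coder assigned.
tally = Counter(coded for _, coded, _ in fps)
for coded, count in tally.most_common():
    print(coded, count)
```

The rare categories at the bottom of such a tally are the ones worth reading by hand, since they sit just outside the category the classifier was trained on.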

For a long time I had wanted to move beyond digitizing and making available online the corpora that interest me (such as Corpus Toneelkritiek Interbellum), and to start exploring computational tools such as machine learning. Recently, just such an occasion turned up when PhD student Dries Lyna asked me for some tools to analyze the ca. 5000 auction lots he digitized from catalogues of auctions in eighteenth-century Antwerp and Brussels. More particularly, Dries is interested in the evolving literary style of these catalogues. His starting point is the distinct impression amongst art historians of this period that

during the second half of the eighteenth century paintings’ descriptions in catalogues in general became longer, more detailed and increasingly precise (…). Not surprisingly, longer accounts were accompanied by a growing richness of vocabulary and often more florid use of language.

Using the digitized records from more than forty catalogues, we were able to show how the art vocabulary actually evolved during the second half of the eighteenth century. Not only did the vocabulary richness greatly increase, but discourse also became much more specific. In the 1740s the generic term tableau (painting) is used almost indiscriminately, while the next decade sees a decrease in the use of this word and an increase in more specialized terms such as paysage (landscape) and portrait.
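
Measurements of this kind can be sketched as follows. This is an illustration under assumptions, not the analysis we ran: the type-token ratio is only one possible measure of vocabulary richness, the function names are made up, and the four lots with their years are invented toy data.

```python
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: lowercase and split on whitespace.
    return text.lower().split()

def tokens_by_decade(lots):
    """Pool the tokens of (year, description) pairs per decade."""
    pooled = defaultdict(list)
    for year, description in lots:
        pooled[year // 10 * 10].extend(tokenize(description))
    return pooled

def type_token_ratio(tokens):
    """Distinct words divided by total words: a crude richness measure."""
    return len(set(tokens)) / len(tokens)

def relative_frequency(tokens, term):
    """Share of all tokens that are `term`."""
    return Counter(tokens)[term] / len(tokens)

# Invented toy lots, NOT taken from the digitized catalogues.
LOTS = [
    (1762, "un tableau"),
    (1765, "un tableau"),
    (1771, "un paysage avec figures"),
    (1778, "un portrait"),
]

pooled = tokens_by_decade(LOTS)
for decade in sorted(pooled):
    tokens = pooled[decade]
    print(decade, type_token_ratio(tokens), relative_frequency(tokens, "tableau"))
```

Note that the type-token ratio falls as a text gets longer, so decades with very different amounts of text should be compared on samples of equal size.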

We were both quite excited about these results. But Dries also wanted to know if there was a correlation between the style of the descriptions and the price the paintings eventually fetched during the auction. In other words, was the evolving language also a performative tool in the rapidly maturing early modern marketplace for artworks?
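
One simple way to start probing such a question is to compute a correlation coefficient between some measurable feature of a description and the hammer price. The sketch below uses description length in words as a stand-in for "style" and Pearson's r as the measure. Both choices are assumptions made for illustration, and the three lots and their prices are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Invented (description, price) pairs, NOT real auction data.
TOY_LOTS = [
    ("un tableau", 12.0),
    ("un paysage avec figures", 18.0),
    ("un portrait d'homme vu de face", 30.0),
]

lengths = [len(description.split()) for description, _ in TOY_LOTS]
prices = [price for _, price in TOY_LOTS]
r = pearson(lengths, prices)
print(round(r, 3))
```

A value near 1 or -1 indicates a strong linear relationship and a value near 0 indicates none. With real data one would also have to control for the painter, the subject and the size of the work before reading anything into the coefficient.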

More on that soon…


  1. Thomas, I’m happy to send you all of the source code for that project if you’d like to reuse some of it for the auction records. Just e-mail me. I am also continuing to work with the Old Bailey folks and some other colleagues on machine learning and text mining. You can find our new stuff at http://criminalintent.org Best, Bill
