<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Doing Digital History</title>
	<atom:link href="http://doingdigitalhistory.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://doingdigitalhistory.wordpress.com</link>
	<description></description>
	<lastBuildDate>Fri, 04 Mar 2011 20:06:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='doingdigitalhistory.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Doing Digital History</title>
		<link>http://doingdigitalhistory.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://doingdigitalhistory.wordpress.com/osd.xml" title="Doing Digital History" />
	<atom:link rel='hub' href='http://doingdigitalhistory.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Bridging the Gap between Scholarly Editing and Mass Digitization</title>
		<link>http://doingdigitalhistory.wordpress.com/2011/03/04/bridging-the-gap-between-scholarly-editing-and-mass-digitization/</link>
		<comments>http://doingdigitalhistory.wordpress.com/2011/03/04/bridging-the-gap-between-scholarly-editing-and-mass-digitization/#comments</comments>
		<pubDate>Fri, 04 Mar 2011 20:04:41 +0000</pubDate>
		<dc:creator>tcrombez</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://doingdigitalhistory.wordpress.com/?p=195</guid>
		<description><![CDATA[A hybrid model for online scholarly publishing projects (This is the edited version of a paper presented at the TEI Conference and Members&#8217; Meeting in Ann Arbor, November 2009. Slides of the original presentation may be found here) In this essay, I would like to highlight the differences between mass digitization efforts such as Google [...]]]></description>
			<content:encoded><![CDATA[<p><strong>A hybrid model for online scholarly publishing projects</strong><br />
<em>(This is the edited version of a paper presented at the TEI Conference and Members&#8217; Meeting in Ann Arbor, November 2009. Slides of the original presentation may be found <a href="http://www.tei-c.org/Vault/MembersMeetings/2009/files/Crombez_Between_the_Folds_TEI_2009.pdf">here</a>)</em></p>
<p>In this essay, I would like to highlight the differences between mass digitization efforts such as Google Books, and small-scale scholarly editing projects, most of which use one of a number of standards developed by the academic text-editing community (for example, the Text Encoding Initiative or TEI).<span id="more-195"></span></p>
<p>Since large-scale book digitization projects began in 2004, it has become clear that such initiatives start from a very different point of view than academic digitization efforts. Quantity and commercial opportunity, rather than editorial accuracy, seem to be the primary motives. Still, both are partly engaged in the same project – the digitization of humanity’s printed past. Should they remain separate kinds of efforts, one commercial and the other academic?</p>
<p>My aim is not to theorize about the wider implications of both kinds of digital publishing, but to demonstrate how they may co-exist. Such co-existence, I will argue, consists in maximizing the links that can be constructed between the two.</p>
<p>More particularly, it will be made clear that <em>automatic text analysis </em>could play a crucial role in the process of bridging the gap between scholarly editing projects and mass digitization undertakings. This would require the integration of three distinct areas of knowledge:</p>
<ul>
<li><strong>scholarly editing</strong>;</li>
<li><strong>mass digitization projects</strong>, meaning million-volume online collections of texts which are <em>not </em>scholarly edited (examples include Project Gutenberg, the Million Book Project, Google Books, the Internet Text Archive);</li>
<li><strong>automatic text analysis</strong>, i.e., instruments from language technology, also known as Natural Language Processing or Computational Linguistics, that may help in linking these two very different kinds of document collections.</li>
</ul>
<p>My argument will close with a detailed discussion of two sample projects that may serve as ‘proofs of concept.’</p>
<h2>Mass Digitization Projects</h2>
<p>The best-known example of mass digitization is probably Google Books, which was launched in 2004. Of the millions of books digitized by the company so far (in cooperation with a growing number of academic and public libraries in the US and in Europe), only a limited number are in the public domain, and hence available in so-called ‘full view.’ The viewing of books that are still in print is restricted to a limited number of pages, or even to short ‘snippets’ for works whose owner has not yet been identified (Google Books 2010; Wikipedia 2010).</p>
<p>The distinguishing characteristic of projects such as Google Books is their massive size, spanning millions of volumes. However, when it comes to collections of digital documents, ‘size’ is not merely a linear measure of information. A quote from a Google engineer from 2005, recorded by George Dyson, succinctly sums up the aims of the search company:</p>
<blockquote><p>We are not scanning all those books to be read by people. We are scanning them to be read by an AI. (Dyson 2005)</p></blockquote>
<p>What exactly does ‘AI’ or ‘metadata extraction’ mean for the end-user of a massive digital library? The most illustrative example is provided by the current interface of Google Books. Not only does it present the evident metadata concerning a book (author, publisher, year of publication, number of pages), but also a wealth of additional information:</p>
<ul>
<li>Chapter endings and beginnings are automatically detected, in order to create a table of contents that links to the correct pages.</li>
<li>A list of references to this book from various web pages is compiled automatically, and also a list of references to this book from other books in Google Books.</li>
<li>‘Popular passages’ are extracted from the book, i.e., passages that are frequently quoted in other books.</li>
<li>A list of ‘common terms and phrases’ for the book is provided, namely, keywords and names that the artificial intelligence engine believes are characteristic of this work. As with corresponding features on websites such as Amazon.com (which uses ‘SIPs’ or Statistically Improbable Phrases to characterize a book), the list of characteristic terms seems to be composed using a term-weighting measure such as TF-IDF, or Term Frequency multiplied by Inverse Document Frequency. This metric selects words or combinations of words that are relatively rare in the collection as a whole, but that occur significantly more often in this particular book.</li>
</ul>
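<p>As a rough illustration of the TF-IDF measure mentioned in the list above, the following Python sketch scores the terms of a toy collection. It is not a description of how Google Books computes its ‘common terms and phrases’; the function name <em>tf_idf</em>, the unsmoothed logarithmic weighting and the three sample ‘books’ are assumptions made for the sake of the example.</p>

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document (a document is a list of tokens)."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        # Term frequency (relative) multiplied by inverse document frequency.
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy collection, invented for illustration only.
books = [
    "the whale and the sea and the ship".split(),
    "the theatre and the stage".split(),
    "the sea and the storm".split(),
]
scores = tf_idf(books)
# The two highest-scoring terms of the first 'book'.
top_terms = sorted(scores[0], key=scores[0].get, reverse=True)[:2]
```

<p>Words that occur in every document (such as ‘the’) receive a score of zero, while words found only in the first text rise to the top of its list.</p>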
<p>All of this information is extracted automatically from the book’s textual contents. The various applications of AI or machine learning that are used (for example, the TF-IDF metric) may be grouped under the term of <em>automatic text analysis.</em></p>
<p>These ‘extras,’ I believe, are already changing the way humanities scholars do research today. Semantic markup, generated <em>by means of the collection itself</em>, is added to the collection. This might be compared to how a catalogue of a traditional print library, or a particular ordering of the shelves (e.g., according to discipline or subject), adds information to the library’s collection.</p>
<p>Such automatically extracted metadata constitute the crucial innovation of mass digitization projects. Compared to the ‘markup’ or organizational structure that sets a print library apart from a pile of books, this sort of markup introduces three main differences. Automatically generated metadata carry much more information about the contents of the works than a traditional catalogue or subject index. Moreover, the metadata can be easily updated as new items are added. And finally, the number of interconnections between books (all of which are immediately available to the user) vastly increases.</p>
<p>The massive ‘size’ of a digital library, then, is not merely to be conceived as a linear measure of information: the depth and scope of the available metadata grow far faster than the number of volumes.</p>
<p>Still, massive text collections also feature a number of important drawbacks. The resistance of academics and other readers to Google Books grew rapidly after the launch. One of the reasons was that the textual content of the books was often inaccurate. Such errors are mainly due to shortcomings in either the available materials (books that are badly printed, or stained) or in the OCR software (Optical Character Recognition) that is employed to ‘read’ the books.</p>
<p>Deficient text seriously endangers the potential of mass digitization projects. It leads to search errors (both in the sense of words that are not found while they are in the text, and words found while they are not in the text) and to erroneous classifications or incorrect metadata.</p>
<p>In 2009, following the Google Books Settlement Conference at UC Berkeley, Geoff Nunberg wrote a blog post where he described Google Books as a ‘metadata train wreck,’ precisely because its systems try to deduce so many metadata about the books in its collection through a purely algorithmic procedure. Amongst the many examples he cites, the following might illustrate the core of the problem:</p>
<blockquote><p>More mysterious is the entry for a book called <em>The Mosaic Navigator: The essential guide to the Internet Interface</em>, which is dated 1939 and attributed to Sigmund Freud and Katherine Jones. My guess is that this is connected to Jones’ having been the translator of Freud’s <em>Moses and Monotheism</em>, which must have somehow triggered the other sense of the word <em>mosaic</em>, though the details of the process leave me baffled. (Nunberg 2009)</p></blockquote>
<p>Because language technology fundamentally works by generalizing from a limited amount of textual data, errors such as these will be endemic, certainly in the first phase of the project, and especially when speed and size are valued above data quality and consistency, which is precisely what sets mass digitization apart from scholarly editing. Errors in the text of books and their metadata have technological causes, but they also derive from a choice of principle: the decision not to use a rigorous editing procedure (according to scholarly standards) in order to save time. Moreover, the editing procedure is not publicly documented, for instance, through an editorial declaration.</p>
<h2>Scholarly Editing Projects</h2>
<p>Mass digitization projects are unlike digital scholarly editing projects on virtually every point. (A broad range of examples may be found on the ‘Projects’ page of the TEI website, <a href="http://www.tei-c.org/Activities/Projects">tei-c.org/Activities/Projects</a>.) First, the amount of data in editing projects is extremely small when compared to initiatives such as Google Books. Even the largest academic digitization projects include at most a few hundred or a few thousand works (for example, the <a href="http://ota.ahds.ac.uk/">Oxford Text Archive</a>, the Digital Library for Dutch Language and Literature (<a href="http://www.dbnl.org">dbnl.org</a>), or the <a href="http://www.deutschestextarchiv.de">Deutsches Textarchiv</a>).</p>
<p>Due to the smaller number of documents, using language technology to automatically extract metadata will also be less successful. Applying a metric for computing characteristic words such as TF-IDF depends substantially on the number of words available; otherwise the results may easily be skewed.</p>
<p>Scholarly projects are set apart by quite different characteristics. The textual data, although often also generated by OCR, is almost always checked and double-checked by hand, so errors are rare. The same holds for metadata, which is manually entered, often in great detail. Finally, the editing procedure is carefully documented. Standards such as TEI reserve specific portions of the document encoding format for declaring editorial procedures, and encourage editors to use them. (TEI Consortium 2007, 2.3)</p>
<h2>Crossing the Divide</h2>
<p>So far, my story has followed the established dividing line between these two very different kinds of projects: one based on small-scale but careful and rigorous editorship, the other on bringing together very large quantities of swiftly gathered ‘junk text.’</p>
<p>However, this divide may soon become outdated. The texts of Google Books and the Internet Archive may be inaccurate sometimes, but the quality of both scanning equipment and OCR software has been steadily improving. Metadata is sometimes poor, but is becoming richer and more fine-grained as the collections expand.</p>
<p>As I see it, the question is not <em>when </em>to cross the divide between scholarly publication projects and mass digitization projects, but <em>how.</em> In that scenario it is crucial to investigate what opportunities and threats may present themselves.</p>
<h2>A Hybrid Model for Interchange</h2>
<p>I propose a hybrid model of possible interactions between TEI projects and mass digitization projects. The model consists of three distinct components.</p>
<p>At the <em>core </em>of the model is a carefully curated but relatively small collection of reliable textual data, for instance generated by a scholarly editing project, which will probably be strongly interrelated (e.g., from the same author, genre, and/or period). The documents feature extensive metadata and are edited according to an accepted standard such as TEI-XML.</p>
<p>Next is the <em>cloud</em>, not a single resource but a conglomerate of large and publicly available repositories of text or data, which should preferably be open to automatic querying (more on this below). Examples of text repositories include large-scale digitization projects such as those discussed above, but possibly also other scholarly editing projects that are thematically or chronologically associated with the Core texts. Examples of data repositories are lexical reference works (WordNet, Wordnik, WordHoard), online encyclopedias (Wikipedia and its machine-readable variant DBpedia.org), scholarly databases (JSTOR), bibliographical catalogues (WorldCat), and subject headings (Library of Congress Subject Headings).</p>
<p>Connecting the core with the cloud is the bridging component of the model, namely the <em>interface</em>. The interface includes a website or a number of web services that enable researchers to visualize and annotate the contents of the Core. It provides metadata that complement the markup produced during the scholarly editing process of the Core documents. This metadata is generated by constructing queries based on either the raw textual data or the layers of markup, which may first be pre-processed by means of automatic text analysis. The queries are then sent off to the Cloud, and the returned information is added as metadata to the document and immediately visualized to the user.</p>
<p>In other words, the crux of the model is to query, analyze and re-format the data in the Cloud, in order to stream relevant information to a specialized audience of researchers studying a particular set of Core texts.</p>
<h2>The Programming Scholar</h2>
<p>One element in the model sketched above is vital to its success: the repositories that constitute the Cloud must allow programmatic access to their resources by means of an API or Application Programming Interface. Put briefly, an API allows web programmers to query a given web resource in the background, serving up its results in a different form to the users of their own web applications.</p>
<p>A travel website, for instance, may query both a weather service and an airline fares service when a user searches for information on a given location, in order to deliver all of these results (together with information from its own resources) on a single page. This practice is often known as a ‘mash-up.’</p>
<p>Some of the text and data repositories mentioned above already provide programmatic access to their resources through an API. It may be noted that, at the time of writing, there is a well-documented public API available for Google Books, but not for the Internet Text Archive.</p>
<p>If no API is available for a given resource, and it is unlikely that one will be made available in the future, one may resort to querying the website through a general search API (provided by search engines such as Google or Bing). This makes it possible, for instance, to present search results automatically collected from Wikipedia or another resource next to the Core documents of your editing project. More examples will be discussed in detail in the ‘Proofs of Concept’ section below.</p>
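<p>A minimal sketch of this fallback, in Python, might build a query that restricts a general search to a single website. The endpoint address, the parameter name <em>q</em> and the function name <em>site_query_url</em> are hypothetical; an actual search API will define its own address, parameters and authentication.</p>

```python
from urllib.parse import urlencode

# Hypothetical endpoint: a stand-in for whichever general search API is used.
SEARCH_ENDPOINT = "https://search.example.org/api"


def site_query_url(term, site):
    """Build a search URL that restricts results for `term` to one website."""
    # The quoted phrase plus a `site:` restriction is common search-engine
    # syntax; the parameter name `q` is an assumption about the API.
    params = {"q": f'"{term}" site:{site}'}
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = site_query_url("J.O. de Gruyter", "nl.wikipedia.org")
```

<p>The resulting address can then be fetched in the background, and the returned results displayed next to the Core document that triggered the query.</p>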
<p>Taking the model one step further, it is possible to envision a future kind of digital editing project, whose essential feature would be the combination of manually added markup and metadata with automatically extracted metadata. Imagine the wealth of information to describe your documents that would be available in a system that is both rigorously edited <em>and </em>automatically analyzed using:</p>
<ul>
<li>topic detection</li>
<li>genre detection</li>
<li>keyword extraction</li>
<li>named entity recognition</li>
<li>automatic translation of foreign terms</li>
<li>automatic summarization</li>
<li>automatic authorship attribution of disputed documents</li>
</ul>
<p>All of these technologies (combined with the initially added markup) could lead to a very extensive method for profiling documents. If all of the metadata describing a document are fed into a document profile, these may be compared internally (e.g., for searching related texts in the Core) or used for external purposes (e.g., for use in a larger text profiling project, such as authorship attribution research). Document profiles could be employed to draw up a typical profile of the texts written by a certain author, or written in a certain genre.</p>
<h2>Proofs of Concept</h2>
<p>I would like to illustrate this model by means of two recent publishing projects in which I was involved. Both were small-scale scholarly editing projects collecting documents on Flemish theatre. The texts stem from various periods of the twentieth century. Access to both projects is provided through the website <a href="http://www.corpustoneelkritiek.org">corpustoneelkritiek.org</a>. (Although the source texts are mainly in Dutch, the interface is entirely in English.)</p>
<p><em>Corpus Toneelkritiek Interbellum </em>(Flemish Theatre Reviews 1919-1939) is a collection of short documents on theatre, chiefly reviews and essays, from the inter-war period. It includes 686 documents which were marked up according to the TEI P5 Guidelines. Mainly semantic elements are used, such as &lt;persName&gt;, &lt;orgName&gt; and &lt;title&gt;, to indicate where a personal name, an organization’s name, or a title occurs. (CTI 2009)</p>
<p>For instance, in the following passage the &lt;persName&gt; element was used to encode different spellings of the Flemish director J.O. de Gruyter’s name. Using the @key attribute, these variant spellings all refer to a single entity (here abbreviated JODG) in a database of names. Furthermore, all occurrences of personal names are typed using the @type attribute to specify the person’s main profession (such as writer, director, poet, composer, or philosopher).</p>
<blockquote><p>De herinnering aan dr J.O. de Gruyter, die zich ook in Nederlandsche tooneelkringen grooten naam verwierf, blijft levendig. Telkens verschijnen er artikelen aan hem en zijn levenswerk gewijd. In dit najaar zal het licht zien, samengesteld door Emmanuel de Bom, een &#8220;De Gruyter-Boek&#8221;. Dit bevat bijdragen van verschillende specialiteiten en herinneringen van De Gruyters vrienden.</p></blockquote>
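<p>The way variant spellings resolve to a single entity can be sketched in Python with the standard library’s XML module. To keep the example self-contained, a tiny paragraph is built in memory rather than read from a corpus file; the surface forms are taken from the passage above, but the attribute values and the function name <em>variants_by_key</em> are assumptions, not the project’s actual encoding or code.</p>

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Build a tiny TEI-like paragraph in memory. The "JODG" key follows the text;
# the type value "director" and the flat structure are assumptions.
paragraph = ET.Element("p")
for surface_form in ["dr J.O. de Gruyter", "De Gruyter", "De Gruyters"]:
    name = ET.SubElement(paragraph, "persName", key="JODG", type="director")
    name.text = surface_form


def variants_by_key(root):
    """Group the spellings found in persName elements by their key attribute."""
    variants = defaultdict(set)
    for element in root.iter("persName"):
        variants[element.get("key")].add(element.text)
    return variants


names = variants_by_key(paragraph)
```

<p>Real TEI P5 files place their elements in the TEI namespace, so a parser working on actual corpus files would have to qualify the element name accordingly.</p>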
<p>The corpus is made available online through the project’s website <a href="http://www.corpustoneelkritiek.org/cti">corpustoneelkritiek.org/cti</a>. For this purpose, all XML source files were converted to static HTML using the XML parsing modules from the Python programming language. Additionally, a simple search engine was added, again written in Python and based on the search engine described in Toby Segaran’s book <em>Programming Collective Intelligence</em> (Segaran 2007).</p>
<p>However, the website does not merely present the text of the digitized documents. The manually added metadata, encoded in the markup, is processed during the generation of the HTML files in order to enrich the visualization of the documents with meta-information. Most importantly, document metadata was collected in <em>document profiles</em>, which were then mutually compared to determine internal links. This was again done by parsing and analyzing the content of the XML files using Python.</p>
<h2>Document Profiles</h2>
<p>A document profile contains the author of the text, its year of publication, its month of publication, and the full list of personal names, organization names, and titles mentioned. Additionally, terms that are supposed to characterize the document are added using automatic text analysis. After reading each file’s textual contents, highly frequent function words are removed by comparing the text with a dictionary of Dutch stop words. Any single word (or ‘unigram’) that occurs four times or more in the remaining text is considered to be ‘characteristic’ for this document and added to the profile. Any combination of two words (or ‘bigram’) that occurs two times or more is also added.</p>
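<p>The keyword step described above can be sketched as follows. This is an illustration rather than the project’s code: the tokenization is deliberately naive, the stop word list is a tiny stand-in for a full Dutch dictionary, bigrams are counted after stop word removal (an assumption), and the sample sentence is invented.</p>

```python
from collections import Counter

# Tiny illustrative stop word list; the project used a full Dutch dictionary.
STOP_WORDS = {"de", "het", "een", "en", "van", "in", "is", "op"}


def characteristic_terms(text, min_unigram=4, min_bigram=2):
    """Return the unigrams and bigrams that reach the frequency thresholds."""
    # Naive tokenization (lowercase, split on whitespace, trim punctuation).
    tokens = [t.strip('.,;:!?"()') for t in text.lower().split()]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    unigrams = Counter(tokens)
    # Adjacent pairs in the remaining text count as bigrams.
    bigrams = Counter(zip(tokens, tokens[1:]))
    terms = {word for word, n in unigrams.items() if n >= min_unigram}
    terms |= {" ".join(pair) for pair, n in bigrams.items() if n >= min_bigram}
    return terms


# Invented sample text, for illustration only.
sample = ("Het tooneel in Antwerpen. Het Vlaamsche tooneel en het "
          "Vlaamsche tooneel van De Gruyter. Tooneel!")
profile_terms = characteristic_terms(sample)
```

<p>In this sample, ‘tooneel’ reaches the unigram threshold of four occurrences and ‘vlaamsche tooneel’ the bigram threshold of two, so both would be added to the document profile.</p>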
<p>I should immediately add that this is a quite rudimentary method for computing document keywords. More sophisticated methods, such as TF-IDF, will be discussed in the second case study. However, the method was sufficiently effective for the purpose of this corpus, since most texts are of similar length (on average 1120 words). Hence, it was possible to establish an absolute threshold for both unigram and bigram frequency (four and two occurrences, respectively).</p>
<p>With the document profiles for each corpus text at hand, documents may be compared in order to detect relations. In this case, two documents were considered to be ‘related’ if more than half of their document profiles overlapped. This overlap was further normalized by taking into account the length of the document profiles, so that exceptionally long texts would not be prioritized because of their higher number of keywords.</p>
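<p>One possible reading of this comparison is sketched below. Profiles are represented as plain Python sets, and the overlap is divided by the size of the larger profile so that a long profile gains no advantage; this particular normalization, the function names and the sample profiles are all assumptions, since the exact formula used on the project website is not specified here.</p>

```python
def profile_overlap(profile_a, profile_b):
    """Share of items two profiles have in common, relative to the larger one."""
    if not profile_a or not profile_b:
        return 0.0
    shared = profile_a.intersection(profile_b)
    # Assumed normalization: divide by the larger profile, so that a very
    # long profile is not favoured merely because it contains more items.
    return len(shared) / max(len(profile_a), len(profile_b))


def related(profile_a, profile_b, threshold=0.5):
    """Two documents count as related if more than half of their profiles overlap."""
    return profile_overlap(profile_a, profile_b) > threshold


# Invented profiles mixing metadata and keywords, for illustration only.
review_1 = {"author:X", "year:1925", "JODG", "vlaamsche tooneel"}
review_2 = {"author:X", "year:1925", "JODG", "antwerpen"}
review_3 = {"author:Y", "year:1931", "JODG"}
```

<p>Here the first two reviews share three of four items and would be linked, whereas the first and third share only the name key and would not.</p>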
<p>On the project website, ten possibly related texts are suggested to the user who is consulting a corpus document. The interface specifies on which overlapping elements the suggestion is based, e.g., if a text is considered to be related because there is a thematic relation, or because it was written by the same author, or in the same period. Furthermore, a sidebar next to the text lists all personal names, organizations (e.g., theatre companies), locations (e.g., playhouses), and titles that have been marked up in that specific document.</p>
<h2>Advanced Search</h2>
<p>So far, the extra information that the web interface added to the visualization of TEI-encoded documents was strictly derived from the corpus itself. Next, I will discuss an example of how this may be connected to external text and data repositories.</p>
<p>All names and organizations presented in the sidebar of documents on the CTI website are clickable. Clicking an item launches an ‘advanced search’ that collects quotations mentioning that specific name from:</p>
<ol>
<li>the corpus itself</li>
<li>Google Books</li>
<li>the Dutch version of Wikipedia (<a href="http://nl.wikipedia.org">nl.wikipedia.org</a>)</li>
<li>the Digital Library of Dutch Language and Literature (<a href="http://www.dbnl.org">dbnl.org</a>, the largest online collection of scholarly editions of Dutch books)</li>
</ol>
<p>While the rest of the website consists of static HTML files derived from the XML source files, the advanced search functionality depends on Python web programming. Queries generated from the list of names in the sidebar are sent to the Google Book Search API, to collect titles and extracts from books mentioning the query, and to the Yahoo Web Search API, to collect pages from the Dutch Wikipedia and from DBNL. The APIs generally return their results as RSS feeds, i.e., in an XML format, which can be conveniently parsed in Python using XML parsing modules, or Mark Pilgrim’s Universal Feed Parser (<a href="http://www.feedparser.org">feedparser.org</a>).</p>
<p>The final result is a remarkably efficient ‘research engine.’ During an advanced search query, information is harvested from so many sources that the page starts to function as an ‘improvised encyclopedia.’ Because both the page’s information and the presumed user’s background and interests are quite narrowly circumscribed, search results are almost always relevant. The suggestions of related texts, on the other hand, serve as an easy tool to browse the collection (instead of the inherently meaningless search box that is offered on the landing page of many web resources).</p>
<h2>Named Entity Detection</h2>
<p>A different plan was followed for the construction of <em>Pieter T’Jonck Theatre and Dance Reviews</em>, a research website collecting 1455 reviews by Belgian critic Pieter T’Jonck (from 1985 up to the present). For this project, the collection of documents was too large for the editorial team to mark up manually. Only basic metadata were added manually by the editors, such as the publication platform, date of publication, and the performance under review.</p>
<p>In order to enrich the available metadata for each document, the reviews were analyzed by an automatic linguistic parser to extract all named entities. For this purpose, a Dutch-language version of MBSP (Memory-Based Shallow Parser) was used. MBSP was developed at the universities of Tilburg and Antwerp (<a href="http://www.clips.ua.ac.be/pages/MBSP">clips.ua.ac.be/pages/MBSP</a>) and automatically applies various levels of linguistic annotation.</p>
<p>On the project website, the named entities extracted by MBSP are presented to users consulting a document, together with the manually added metadata such as the date of publication and the performance reviewed in that document. Again, these metadata serve as the basis for new queries using an advanced search functionality analogous to that described above.</p>
<h2>Closing Remarks</h2>
<p>The case studies discussed in the previous section will hopefully throw sufficient light on the possibilities of the hybrid model that I have introduced in this essay. It must be added that the case studies show only the tip of the iceberg. For instance, although external resources are queried to enrich the research experience of the user, these queries do not contain any ‘semantic’ information.</p>
<p>The advanced search functionality does not query Wikipedia for <em>persons </em>corresponding to a certain name, but merely for Wikipedia pages mentioning that name. More sophisticated systems that use semantic web technology (such as the Linked Data initiative) go far beyond these restrictions and provide much more controlled searching. On the other hand, semantic descriptors are only available for a handful of resources. Using publicly available APIs and repositories (such as Wikipedia) has the advantage of searching the widest possible array of sources.</p>
<p>I would like to conclude with a statement by Daniel J. Cohen from a discussion on ‘the promise of digital history,’ published in the September 2008 issue of the <em>Journal of American History</em>:</p>
<blockquote><p>Digital history and the abundance it tries to address make many historical arguments seem anecdotal rather than comprehensive. Hypotheses based on a limited number of examples, as many dissertations and books still are, seem flimsier when you can scan millions of books at Google to find counterexamples. (Cohen et al. 2008)</p></blockquote>
<p>Although I fully agree with Cohen that digital repositories open up new vistas of research which require new methodological training for many humanities scholars, the question that remains unanswered is how we will cope with the challenges of digital abundance. What will be needed is not only ever larger repositories, but also adequate tools and interfaces to query, browse, and interconnect the information they hold.</p>
<h2>List of Works Cited</h2>
<p>Cohen, Daniel J., Michael Frisch, Patrick Gallagher, Steven Mintz, Kristen Sword, Amy Murrell Taylor, William G. Thomas III and William J. Turkel. 2008. “Interchange: The Promise of Digital History.” <em>Journal of American History </em>95.2. <a href="http://www.journalofamericanhistory.org/issues/952/interchange/index.html">http://www.journalofamericanhistory.org/issues/952/interchange/index.html</a></p>
<p>CTI. 2009. <em>Corpus Toneelkritiek Interbellum. </em><a href="http://www.corpustoneelkritiek.org/cti">http://www.corpustoneelkritiek.org/cti</a></p>
<p>Dyson, George. 2005. “Turing’s Cathedral: A visit to Google on the occasion of the 60th anniversary of John von Neumann’s proposal for a digital computer.” <em>Edge: The Third Culture. </em>October 24. <a href="http://www.edge.org/3rd_culture/dyson05/dyson05_index.html">http://www.edge.org/3rd_culture/dyson05/dyson05_index.html</a></p>
<p>Google Books. 2010. “History of Google Books.” Accessed August 25. <a href="http://books.google.com/intl/en/googlebooks/history.html">http://books.google.com/intl/en/googlebooks/history.html</a></p>
<p>Nunberg, Geoff. 2009. “Google Books: A Metadata Train Wreck.” <em>Language Log</em>. August 29. <a href="http://languagelog.ldc.upenn.edu/nll/?p=1701">http://languagelog.ldc.upenn.edu/nll/?p=1701</a>.</p>
<p>Segaran, Toby. 2007. <em>Programming Collective Intelligence: Building Smart Web 2.0 Applications. </em>Sebastopol, CA: O’Reilly.</p>
<p>TEI Consortium. 2007. <em>TEI P5: Guidelines for Electronic Text Encoding and Interchange. </em>Last updated July 6, 2010.</p>
<p>Wikipedia. 2010. “Google Books.” Accessed August 25. <a href="http://en.wikipedia.org/wiki/Google_Books">http://en.wikipedia.org/wiki/Google_Books</a></p>
]]></content:encoded>
			<wfw:commentRss>http://doingdigitalhistory.wordpress.com/2011/03/04/bridging-the-gap-between-scholarly-editing-and-mass-digitization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/396daa600873248b80b14c9769ff987d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tcrombez</media:title>
		</media:content>
	</item>
		<item>
		<title>A new era of plagiarism?</title>
		<link>http://doingdigitalhistory.wordpress.com/2010/08/02/a-new-era-of-plagiarism/</link>
		<comments>http://doingdigitalhistory.wordpress.com/2010/08/02/a-new-era-of-plagiarism/#comments</comments>
		<pubDate>Mon, 02 Aug 2010 20:15:48 +0000</pubDate>
		<dc:creator>tcrombez</dc:creator>
				<category><![CDATA[The Teacher Is a Geek]]></category>
		<category><![CDATA[New York Times]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[plagiarism detection]]></category>

		<guid isPermaLink="false">http://doingdigitalhistory.wordpress.com/?p=159</guid>
		<description><![CDATA[There is a terrific article on the New York Times website on the changed attitudes of today&#8217;s students towards plagiarism. It develops a number of points I vaguely hinted at in my earlier post on &#8216;DIY plagiarism detection.&#8217; The most disquieting observation is that students simply no longer realize that they are plagiarizing another author&#8217;s [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=doingdigitalhistory.wordpress.com&amp;blog=12132892&amp;post=159&amp;subd=doingdigitalhistory&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>There is a terrific <a href="http://www.nytimes.com/2010/08/02/education/02cheat.html?_r=1&amp;hp=&amp;pagewanted=all">article on the <em>New York Times</em> website</a> on the changed attitudes of today&#8217;s students towards plagiarism. It develops a number of points I vaguely hinted at in my earlier post on &#8216;<a href="http://doingdigitalhistory.wordpress.com/2010/02/06/diy-plagiarism-detection/">DIY plagiarism detection</a>.&#8217; The most disquieting observation is that students simply no longer <em>realize</em> that they are plagiarizing another author&#8217;s text, or, even more perplexing, that they do not grasp that text is never just nameless &#8216;copy&#8217; or <em>texte trouvé,</em> but always <em>someone</em>&#8217;s writing. (Even if that someone is an anonymous collective, as in the case of most Wikipedia articles.)<span id="more-159"></span>It is just one technology, the internet, that has made this enormous change of attitude possible in the space of a mere twenty years. In the 1990s, &#8216;plagiarizing&#8217; still meant physically copying text from a (printed) source into your own text. So you were holding not just a <em>text</em> but a <em>document</em> &#8212; it was a book or a journal article or a videotape that had at least one, and usually several, mentions of the author&#8217;s name on it. Merely consulting the document &#8212; i.e., holding the object in your hands &#8212; meant becoming conscious of its authorship.</p>
<p>(And the other conditions of its production: publishing house, printing quality &#8230;)</p>
<p>Nowadays, a huge amount of &#8216;text&#8217; is indeed, as Teresa Fishman of Clemson University rightly remarks, &#8220;hanging out there.&#8221;</p>
<blockquote><p>Now we have a whole generation of students who’ve grown up with  information that just seems to be hanging out there in cyberspace and  doesn’t seem to have an author (&#8230;) It’s possible to believe this information is just out there for anyone to take.</p></blockquote>
<p>It&#8217;s an incredible change of mind, and it is happening as we write.</p>
]]></content:encoded>
			<wfw:commentRss>http://doingdigitalhistory.wordpress.com/2010/08/02/a-new-era-of-plagiarism/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/396daa600873248b80b14c9769ff987d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tcrombez</media:title>
		</media:content>

		<media:content url="http://img.zemanta.com/zemified_e.png?x-id=7bcd6b78-e83b-4107-8366-00290d48062e" medium="image">
			<media:title type="html">Enhanced by Zemanta</media:title>
		</media:content>
	</item>
		<item>
		<title>I&#8217;m on Day of Digital Humanities 2010!</title>
		<link>http://doingdigitalhistory.wordpress.com/2010/03/18/im-on-day-of-digital-humanities-2010/</link>
		<comments>http://doingdigitalhistory.wordpress.com/2010/03/18/im-on-day-of-digital-humanities-2010/#comments</comments>
		<pubDate>Thu, 18 Mar 2010 08:41:57 +0000</pubDate>
		<dc:creator>tcrombez</dc:creator>
				<category><![CDATA[Reflections on Digital History]]></category>
		<category><![CDATA[Day of DH]]></category>
		<category><![CDATA[digital humanities]]></category>

		<guid isPermaLink="false">http://doingdigitalhistory.wordpress.com/?p=146</guid>
		<description><![CDATA[Today, it&#8217;s blogging time for digital humanists all over the globe. I&#8217;m happy to participate in &#8220;Day of Digital Humanities 2010,&#8221; an initiative hosted at the University of Alberta in Canada. According to the organizers, A Day in the Life of the Digital Humanities (Day of DH) is a community publication project that will bring [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=doingdigitalhistory.wordpress.com&amp;blog=12132892&amp;post=146&amp;subd=doingdigitalhistory&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Today, it&#8217;s blogging time for digital humanists all over the globe. I&#8217;m happy to participate in &#8220;<a href="http://ra.tapor.ualberta.ca/~dayofdh2010/">Day of Digital Humanities 2010</a>,&#8221; an initiative hosted at the University of Alberta in Canada.<span id="more-146"></span></p>
<p>According to the organizers,</p>
<blockquote><p>A Day in the Life of the Digital Humanities (Day of DH) is a  community publication project that will bring together <a title="List of Day of DH Participants" href="http://tapor.ualberta.ca/taporwiki/index.php/List_of_Day_of_DH_Participants">digital  humanists from around the world</a> to document what they do on one day,  March 18th. The goal of the project is to create a web site that weaves  together the journals of the participants into a picture that answers  the question, “Just what do computing humanists really do?” Participants  will document their day through photographs and commentary in a  blog-like journal. The collection of these journals with links, tags,  and comments will make up the final work which will be published online.</p></blockquote>
<p>I think it&#8217;s a wonderful project &#8212; not merely for promoting the digital humanities, but also for gaining insight into the working methods &amp; daily routines of fellow scholars &amp; teachers. I expect to learn a lot today, and in the following week, when we&#8217;re expected to comment on the others&#8217; posts &amp; link to them.</p>
<p>So, if you&#8217;d like to eavesdrop on my personal activities, fly to <a href="http://ra.tapor.ualberta.ca/~dayofdh2010/thomascrombez/">http://ra.tapor.ualberta.ca/~dayofdh2010/thomascrombez/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://doingdigitalhistory.wordpress.com/2010/03/18/im-on-day-of-digital-humanities-2010/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/396daa600873248b80b14c9769ff987d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tcrombez</media:title>
		</media:content>
	</item>
		<item>
		<title>A Naive Bayesian in the auction house, pt. 1</title>
		<link>http://doingdigitalhistory.wordpress.com/2010/03/15/a-naive-bayesian-in-the-auction-house/</link>
		<comments>http://doingdigitalhistory.wordpress.com/2010/03/15/a-naive-bayesian-in-the-auction-house/#comments</comments>
		<pubDate>Mon, 15 Mar 2010 22:52:37 +0000</pubDate>
		<dc:creator>tcrombez</dc:creator>
				<category><![CDATA[Digital History Hacks]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Naive Bayes classifier]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://doingdigitalhistory.wordpress.com/?p=21</guid>
		<description><![CDATA[The title of this post refers to a fascinating series from the summer of 2008 on William Turkel&#8217;s blog Digital History Hacks. He used the digitized archive of court records from the Old Bailey (London&#8217;s central criminal court) in order to demonstrate the new avenues of research made possible through digital and computational tools. Reading [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=doingdigitalhistory.wordpress.com&amp;blog=12132892&amp;post=21&amp;subd=doingdigitalhistory&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>The title of this post refers to a fascinating series from the summer of 2008 on William Turkel&#8217;s blog <a href="http://digitalhistoryhacks.blogspot.com/">Digital History Hacks</a>. He used the digitized archive of court records from the <a href="http://www.oldbaileyonline.org/">Old Bailey</a> (London&#8217;s central criminal court) in order to demonstrate the new avenues of research made possible through digital and computational tools. Reading this must have been one of those moments of insight which truly tempted me into the territory of <a class="zem_slink" title="Digital history" rel="wikipedia" href="http://en.wikipedia.org/wiki/Digital_history">digital history</a>. <span id="fullpost"> </span></p>
<p><span id="more-21"></span>More particularly, Turkel showed how you may train an automatic classifying program (known as a &#8220;<a class="zem_slink" title="Naive Bayes classifier" rel="wikipedia" href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">Naive Bayes Classifier</a>&#8221; <a href="http://digitalhistoryhacks.blogspot.com/2008/06/naive-bayesian-in-old-bailey-part-7.html">for mathematical reasons</a>) to identify the type of crime based solely on the transcript of the court session.</p>
<div class="zemanta-img zemanta-action-dragged" style="display:block;margin:1em;">
<div class="wp-caption alignright" style="width: 244px"><a href="http://commons.wikipedia.org/wiki/Image:Old_Bailey_Microcosm_edited.jpg"><img class=" " title="A trial at the Old Bailey in London as drawn b..." src="http://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Old_Bailey_Microcosm_edited.jpg/300px-Old_Bailey_Microcosm_edited.jpg" alt="A trial at the Old Bailey in London as drawn b..." width="234" height="177" /></a><p class="wp-caption-text">Image via Wikipedia</p></div>
</div>
<p>In the conclusion of his series, Turkel sums up various reasons <a href="http://digitalhistoryhacks.blogspot.com/2008/07/naive-bayesian-in-old-bailey-part-14.html">why machine learning can prove a valuable tool for historical research</a>. The real eye-opener for me was when he demonstrated the advantages of scrutinizing the errors made by a classifier that does not perform especially well. In the example of the Old Bailey, Turkel trained a classifier to identify cases of assault amongst nineteenth-century court records. The accuracy of the classifier is not particularly good, but accuracy is not the main point here. It gets interesting when you study the cases that the classifier wrongly assigned to the assault category. One example is particularly revealing of the kind of &#8220;fuzzy&#8221; searching that machine learning may lead to:</p>
<blockquote><p><span id="fullpost">In other words, about 96% of the learner&#8217;s false positive &#8220;errors&#8221; in  this case were other kinds of assault.  What of the trials classified as  &#8220;miscellaneous &#8211; other&#8221;?  One was <a href="http://www.oldbaileyonline.org/browse.jsp?id=t18360919-2166&amp;div=t18360919-2166">this  trial</a>, where 44 year old William Blackburn was found guilty of  &#8220;unlawfully and maliciously administering to Hannah Mary Turner 6  drachms of tincture of cantharides, with intent to excite, &amp;c.&#8221;  I  understand that this case probably doesn&#8217;t fit the definition of assault  used by either Blackburn&#8217;s contemporaries or by the person who coded  the file.  Nevertheless, it is not completely unrelated to the idea of  an assault, and is exactly the kind of source that a historian could use  to shed light on gender relations, sexuality, or other topics.</span></p></blockquote>
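<p>The workflow described here (train a classifier, then read what it gets wrong) can be sketched in a few lines of Python. This is a toy illustration and not Turkel&#8217;s code: the hand-written <code>NaiveBayes</code> class, the invented one-sentence &#8220;transcripts&#8221; and their labels are all assumptions made for the sake of the example.</p>
<pre><code class="language-python">import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a transcript and split it into word tokens."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # word counts, kept per label
        self.doc_counts = Counter()              # number of documents per label
        self.vocabulary = set()

    def train(self, documents):
        for text, label in documents:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def log_score(self, text, label):
        # log P(label) plus the sum of log P(word | label) over the tokens
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        denominator = sum(self.word_counts[label].values()) + len(self.vocabulary)
        for word in tokenize(text):
            score += math.log((self.word_counts[label][word] + 1) / denominator)
        return score

    def classify(self, text):
        # Only labels seen during training can ever be predicted.
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Invented training data: (transcript, category assigned by a human coder).
training = [
    ("the prisoner struck him with a stick and wounded him", "assault"),
    ("he beat and kicked the prosecutor in the street", "assault"),
    ("she stole a silver watch from the shop", "theft"),
    ("the prisoner took two shirts and a coat from my house", "theft"),
]

# Invented held-out trials, including one coded as neither assault nor theft.
held_out = [
    ("he struck the constable and wounded him", "assault"),
    ("she took a watch and a coat from the shop", "theft"),
    ("he administered a tincture to her and she was struck with illness",
     "miscellaneous"),
]

classifier = NaiveBayes()
classifier.train(training)

# False positives: trials the classifier calls assault although the coder did not.
false_positives = [
    (text, label)
    for text, label in held_out
    if label != "assault" and classifier.classify(text) == "assault"
]

for text, label in false_positives:
    print(label, "|", text)
</code></pre>
<p>The list of false positives is the interesting output: it is a reading list of trials that sit just outside the category as the coder defined it.</p>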
<p><span id="fullpost">In the concluding post, Turkel sums up the advantages of automatic classifiers for research into large digitized collections. Learning from false positives proves particularly important, &#8220;giving you a way of  finding interesting things just beyond the boundaries of your  categories.&#8221;</span><span id="fullpost"><br />
</span><br />
<span id="fullpost">For a long time I had wanted to move beyond digitizing and making available online the corpora that interest me (such as <a href="http://www.corpustoneelkritiek.org/cti">Corpus Toneelkritiek Interbellum</a>), and to start exploring computational tools such as machine learning. Recently, just such an occasion turned up when PhD student Dries Lyna asked me for some tools to analyze the ca. 5000 auction lots he digitized from catalogues of auctions in eighteenth-century Antwerp and Brussels. More particularly, Dries is interested in the <em>evolving literary style</em> of these catalogues. His starting point is the distinct impression amongst art historians of this period that </span></p>
<blockquote><p><span id="fullpost">during the second half of the eighteenth century paintings&#8217; descriptions in catalogues in general became longer, more detailed and increasingly precise (&#8230;). Not surprisingly, longer accounts were accompanied by a growing richness of vocabulary and often more florid use of language. </span></p></blockquote>
<p><span id="fullpost">Using the digitized records from more than forty catalogues, we were able to show how the art vocabulary actually evolved during the second half of the eighteenth century. Not only did the vocabulary richness greatly increase, but discourse also became much more specific. In the 1740s the generic term <em>tableau</em> (painting) is used almost indiscriminately, while the next decade sees a decrease in the use of this word and an increase in more specialized terms such as <em>paysage</em> (landscape) and <em>portrait</em>.</span></p>
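<p>For readers who want to try this kind of measurement, here is a minimal sketch in Python. It is not the code we used, and the handful of French descriptions is invented. It assumes that &#8220;vocabulary richness&#8221; is measured as a type-token ratio (distinct words divided by total words) and that the use of a term is measured as its share of all words in a period.</p>
<pre><code class="language-python">from collections import Counter, defaultdict

# Invented (period, description) pairs standing in for digitized auction lots.
lots = [
    ("earlier", "un tableau"),
    ("earlier", "un tableau"),
    ("earlier", "un grand tableau"),
    ("later", "un paysage avec des figures"),
    ("later", "un portrait d'un homme"),
    ("later", "un tableau"),
]


def tokens_by_period(records):
    """Collect all word tokens of the descriptions, grouped by period."""
    grouped = defaultdict(list)
    for period, description in records:
        grouped[period].extend(description.lower().split())
    return grouped


def type_token_ratio(tokens):
    """Distinct words divided by total words: a crude richness measure."""
    return len(set(tokens)) / len(tokens)


def relative_frequency(tokens, term):
    """Share of all tokens in a period that are the given term."""
    return Counter(tokens)[term] / len(tokens)


grouped = tokens_by_period(lots)
for period in sorted(grouped):
    tokens = grouped[period]
    print(
        period,
        round(type_token_ratio(tokens), 2),
        round(relative_frequency(tokens, "tableau"), 2),
    )
</code></pre>
<p>One caveat: the type-token ratio depends on how much text each period contains, so on a real corpus the periods should be compared on samples of equal size.</p>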
<p><span id="fullpost">We were both quite excited about these results. But Dries also wanted to know if there was a correlation between the <em>style </em>of the descriptions and the <em>price </em>the paintings eventually fetched during the auction. In other words, was the evolving language also a performative tool in the rapidly maturing late-modern marketplace for artworks?</span></p>
<p>More on that soon&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://doingdigitalhistory.wordpress.com/2010/03/15/a-naive-bayesian-in-the-auction-house/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/396daa600873248b80b14c9769ff987d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tcrombez</media:title>
		</media:content>

		<media:content url="http://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Old_Bailey_Microcosm_edited.jpg/300px-Old_Bailey_Microcosm_edited.jpg" medium="image">
			<media:title type="html">A trial at the Old Bailey in London as drawn b...</media:title>
		</media:content>

		<media:content url="http://img.zemanta.com/reblog_e.png?x-id=c188f937-1a2a-4804-8a10-1305ba67824a" medium="image">
			<media:title type="html">Reblog this post [with Zemanta]</media:title>
		</media:content>
	</item>
		<item>
		<title>Publishing scholarly projects using Google Sites, pt. 3</title>
		<link>http://doingdigitalhistory.wordpress.com/2010/03/06/publishing-scholarly-projects-using-google-sites-pt-3/</link>
		<comments>http://doingdigitalhistory.wordpress.com/2010/03/06/publishing-scholarly-projects-using-google-sites-pt-3/#comments</comments>
		<pubDate>Sat, 06 Mar 2010 20:09:42 +0000</pubDate>
		<dc:creator>tcrombez</dc:creator>
				<category><![CDATA[Scholarly Publishing]]></category>
		<category><![CDATA[Google Sites]]></category>
		<category><![CDATA[HTML]]></category>
		<category><![CDATA[TEI]]></category>
		<category><![CDATA[web server]]></category>

		<guid isPermaLink="false">http://doingdigitalhistory.wordpress.com/?p=123</guid>
		<description><![CDATA[On the TEI mailing list, Martin Mueller recently raised some incisive comments on my previous posts detailing the process of publishing a scholarly editing project on Google Sites (see pt. 1 and pt. 2). His first criticism concerns the limitations of HTML, which de facto becomes the &#8220;target for all indexing and searching&#8221; in my [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=doingdigitalhistory.wordpress.com&amp;blog=12132892&amp;post=123&amp;subd=doingdigitalhistory&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>On the <a href="http://listserv.brown.edu/archives/cgi-bin/wa?A2=ind1003&amp;L=TEI-L&amp;T=0&amp;O=D&amp;P=8836">TEI mailing list</a>, Martin Mueller recently raised some incisive comments on my previous posts detailing the process of publishing a scholarly editing project on <a class="zem_slink" title="Google Sites" rel="crunchbase" href="http://www.crunchbase.com/product/google-sites">Google Sites</a> (see <a href="http://doingdigitalhistory.wordpress.com/2010/02/13/publishing-scholarly-projects-using-google-sites/">pt. 1</a> and <a href="http://doingdigitalhistory.wordpress.com/2010/02/13/publishing-scholarly-websites-using-google-sites-pt-2-automating-the-process/">pt. 2</a>).<span id="more-123"></span></p>
<div class="zemanta-img zemanta-action-dragged" style="display:block;float:right;margin:1em;">
<div class="wp-caption alignright" style="width: 223px"><a href="http://commons.wikipedia.org/wiki/Image:HTML.svg"><img class=" " title="A graphical depiction of a very simple html d..." src="http://upload.wikimedia.org/wikipedia/commons/thumb/8/84/HTML.svg/266px-HTML.svg.png" alt="A graphical depiction of a very simple html d..." width="213" height="248" /></a><p class="wp-caption-text">Image via Wikipedia</p></div>
</div>
<p>His first criticism concerns the limitations of HTML, which de facto becomes the &#8220;target for all indexing and searching&#8221; in my approach, negating some of the intricacies of your TEI markup.</p>
<blockquote><p>One disadvantage of this approach, however, is that the html transformation becomes the target for all indexing and searching, and since that transformation is intrinsically lossy you give up a lot. If  the goal is to display a site for reading and global searches, this does not matter. But if you had, say, an archive of several hundred plays from Shakespeare&#8217;s day and wanted to have your students look for differences in lexical uses by verse or prose, you&#8217;d be stuck.</p></blockquote>
<p>The second point is about the relevance of my central problem &#8212; escaping the burdens of hosting. Here, he contrasts my situation with the much more comfortable one at other research institutions:</p>
<blockquote><p>I appreciate the advantages of Google Sites in terms of getting you out of the business of babysitting your web server. But at my university and generally at American <a class="zem_slink" title="University" rel="wikipedia" href="http://en.wikipedia.org/wiki/University">research universities</a> that is not really where the shoe pinches.  At my university we are moving very rapidly into virtual server environments of one kind or another. If I think of myself as the owner/editor of a project, I&#8217;m likely to get help from my Academic Technologies group with getting some application environment installed, maintained, and updated if it runs on a standard platform. For instance, if I had an eXist or a Django/Berkeley implementation, I could count on its being installed and maintained on a Linux server. But I&#8217;d be largely on my own with regard to everything that happens &#8216;inside&#8217; my application. And that would be OK provided there is adequate documentation for how to work with TEI documents. I look forward to Lou Burnard&#8217;s Demonstrator.</p></blockquote>
<p>On the first point, Mueller is absolutely right. Converting to HTML eliminates a lot of the editorial complexity that may have been the central impetus in starting a TEI editing project in the first place.</p>
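<p>A small sketch in Python may make the loss concrete. It is an illustration under stated assumptions, not the transformation I actually use: the two invented sentences, the helper names and the deliberately naive <code>to_html</code> function are all made up for the example, and the TEI namespace is left out for brevity. The TEI elements <code>l</code> (line of verse) and <code>p</code> (prose paragraph) both end up as an HTML <code>p</code>, after which Mueller&#8217;s verse-versus-prose question can no longer be asked.</p>
<pre><code class="language-python">import xml.etree.ElementTree as ET


def build_speech():
    """Build a tiny TEI-like speech: one verse line and one prose paragraph."""
    speech = ET.Element("sp")
    verse = ET.SubElement(speech, "l")  # TEI: a line of verse
    verse.text = "these words are spoken in verse"
    prose = ET.SubElement(speech, "p")  # TEI: a prose paragraph
    prose.text = "these words are spoken in plain prose"
    return speech


def to_html(speech):
    """A deliberately naive transformation: every child becomes an HTML p."""
    div = ET.Element("div")
    for child in speech:
        paragraph = ET.SubElement(div, "p")
        paragraph.text = child.text
    return div


def words_by_tag(root):
    """Count words per element name among the direct children of root."""
    counts = {}
    for child in root:
        counts[child.tag] = counts.get(child.tag, 0) + len(child.text.split())
    return counts


tei = build_speech()
html = to_html(tei)

print(words_by_tag(tei))   # verse and prose are still counted separately
print(words_by_tag(html))  # only one undifferentiated bucket remains
print(ET.tostring(html, encoding="unicode"))
</code></pre>
<p>A less naive stylesheet could of course carry the distinction over as a <code>class</code> attribute, but a generic search box would still not know what to do with it.</p>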
<p>However, rich markup is not where the charm of the Google Sites solution lies: its charm is nothing but speed &amp; simplicity. Here we touch on the second point. I can easily imagine projects like my own that are small-scale and do not have grand IT facilities at hand. In my case, as in the case of other projects I&#8217;m acquainted with, I&#8217;m basically my own IT guy.</p>
<p>And that&#8217;s where the convenience of Google Sites comes in. Suppose you have recently started a relatively small editing project &#8212; not small in terms of documents or richness of markup, but small in funding. You figure that there will probably be some funding opportunities further down the road, so the project could turn into a conventional book publication or into a full-blown, rich web app for doing research on those documents. But for the moment, those opportunities have not materialized. Yet you <em>do</em> want to have a preliminary publication, for advertising your work in the scholarly community or just making the edition searchable for students and colleagues. In this case, Google Sites will do a great job.</p>
<p>In this usage scenario, it does not matter very much that some of your markup richness gets lost in the transformation process to HTML, nor that the search engine is basically just a Google box. What counts is that your research becomes visible, on a platform that is swiftly deployed and costs nothing to maintain.</p>
]]></content:encoded>
			<wfw:commentRss>http://doingdigitalhistory.wordpress.com/2010/03/06/publishing-scholarly-projects-using-google-sites-pt-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/396daa600873248b80b14c9769ff987d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tcrombez</media:title>
		</media:content>

		<media:content url="http://upload.wikimedia.org/wikipedia/commons/thumb/8/84/HTML.svg/266px-HTML.svg.png" medium="image">
			<media:title type="html">A graphical despiction of a very simple html d...</media:title>
		</media:content>

		<media:content url="http://img.zemanta.com/reblog_e.png?x-id=34b000d6-3121-4485-adeb-ddf550576dd7" medium="image">
			<media:title type="html">Reblog this post [with Zemanta]</media:title>
		</media:content>
	</item>
		<item>
		<title>The Past&#8217;s Digital Presence, through the lens of Twitter</title>
		<link>http://doingdigitalhistory.wordpress.com/2010/03/04/the-pasts-digital-presence-through-the-lens-of-twitter/</link>
		<comments>http://doingdigitalhistory.wordpress.com/2010/03/04/the-pasts-digital-presence-through-the-lens-of-twitter/#comments</comments>
		<pubDate>Thu, 04 Mar 2010 21:56:33 +0000</pubDate>
		<dc:creator>tcrombez</dc:creator>
				<category><![CDATA[Reflections on Digital History]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[Twitter]]></category>

		<guid isPermaLink="false">http://doingdigitalhistory.wordpress.com/?p=74</guid>
		<description><![CDATA[My fascination for the backchannel at academic conferences &#8212; you know, that bunch of nerds twittering away behind their laptop screens while you present that laboriously written presentation with utmost theatrical skill &#8212; first started at TEI &#8217;09 in Ann Arbor. (I wanted to insert a reference to #tei09, but Twitter is really ridiculously short-memoried, [...]]]></description>
			<content:encoded><![CDATA[<p>My fascination for <a href="http://en.wikipedia.org/wiki/Backchannel">the backchannel</a> at academic conferences &#8212; you know, that bunch of nerds twittering away behind their laptop screens while you present that laboriously written presentation with utmost theatrical skill &#8212; first started at <a href="http://www.lib.umich.edu/spo/teimeeting09/">TEI &#8217;09</a> in Ann Arbor. (I wanted to insert a reference to #tei09, but <a class="zem_slink" title="Twitter" rel="crunchbase" href="http://www.crunchbase.com/product/twitter">Twitter</a> is really ridiculously short-memoried, and starts deleting old tweets after I believe one week or so.)<span id="more-74"></span></p>
<p>When a discussion on a difficult paper is in full swing, it really helps to follow the conversation when two or three more knowledgeable people are simultaneously sending the key phrases through the air on Twitter.</p>
<div class="zemanta-img zemanta-action-dragged" style="display:block;margin:1em;">
<div>
<dl class="wp-caption alignright">
<dt class="wp-caption-dt"><a href="http://www.crunchbase.com/product/twitter"><img title="Image representing Twitter as depicted in Crun..." src="http://www.crunchbase.com/assets/images/resized/0000/2755/2755v30-max-250x250.png" alt="Image representing Twitter as depicted in Crun..." width="220" height="61" /></a></dt>
<dd class="wp-caption-dd zemanta-img-attribution">Image via <a href="http://www.crunchbase.com">CrunchBase</a></dd>
</dl>
</div>
</div>
<p>The real-time magic works even more beautifully when a nice conference you could not afford to visit is streamed live to your living room through tweets. It first happened to me the other day, when a grad conference with the magnificent title <a href="http://digitalhumanities.yale.edu/pdp/">The Past&#8217;s Digital Presence</a> took place at Yale. I did not attend any of the talks in person (nor am I well acquainted with the speakers) but simply by tuning in to the conference&#8217;s hash tag <a href="http://twitter.com/#search?q=%23pdp2010">#pdp2010</a>, I could get the gist of the talks &#8212; and pick up some nice one-liners on Digital History in the process. More on those below.</p>
<p>I guess following a conference through Twitter might be compared to attending a sports game through the radio.</p>
<p>Radio and TV reporters have long realized that the medium is the message, and that their reporting, imperfectly though it may represent the event, also <em>adds </em>a significant dimension. The speaker&#8217;s intonation and the many, many background stories (on cyclists, I have a terrible love/hate relationship with <a href="http://www.facebook.com/pages/Michel-Wuyts/28130431454?v=info">Michel Wuyts</a>, who can tell you off the top of his head what color of underwear was worn by any biker at any major racing event) &#8212; they all add tons of drama, narrative, and heroics to the game.</p>
<p>So, what does the Twitter medium add to the game of conferencing?</p>
<p>Well, first: the medium is the <em><a href="http://en.wikipedia.org/wiki/The_Medium_is_the_Massage">massage</a>, </em>too. I happen to pick up North American daytime tweets at about 8pm, happily enjoying the comforts of a reclined seat &amp; some good Belgian Trappist beer. I don&#8217;t know why, but it sure helps me digest academic conferences better.</p>
<p>Seriously, even if Twitter undoubtedly subtracts an awful lot from the live event, it also adds something. That something must be <strong>semantic</strong>, I would guess. In a way, Twitter may be considered an &#8220;auto-summarization service&#8221; for academic conferences, and that is surely because the tweets are very high-quality.</p>
<p>Speaking of high-quality tweets, let&#8217;s move over to the actual subject of this post.</p>
<p>Some tweets on #pdp2010 caught my attention, because they succinctly expressed some of the crucial questions in Digital Humanities today.</p>
<p>(On the subject of an <a class="zem_slink" title="Ivy League" rel="wikipedia" href="http://en.wikipedia.org/wiki/Ivy_League">Ivy League</a> university such as Yale hosting a digital humanities conference, there&#8217;s been ample discussion on the Humanist mailing list. It all started out with <a href="http://lists.digitalhumanities.org/pipermail/humanist/2010-February/001072.html">this fascinating post</a> by Willard McCarty.)</p>
<p>I&#8217;m not sure whose statements these are, or who twittered them, but I&#8217;ll just quote them as I received them &#8212; anonymous infobits, whose main reason for existence is to be re-tweeted. (Are tweets the materialization, at last, of Dawkins&#8217; ill-famed <a href="http://en.wikipedia.org/wiki/Memes"><em>memes</em></a>?)</p>
<blockquote><p>We haven&#8217;t yet dealt with the issues of reliability and bias (as  historians) of digital primary sources. #pdp2010</p>
<p>Digital resources (humanities in particular) are often &#8220;link  graveyards,&#8221; growing collections of broken interfaces #pdp2010</p>
<p>Important comment from a digital librarian-descriptive metadata is key.  Scanned images are not the entirety of the situation</p>
<p>We have got to talk to our users more &#8211; there&#8217;s no excuse for scholars  not knowing about the existence of descriptive metadata #pdp2010</p></blockquote>
<p>All of these statements could be the starting point for interesting discussions. However, for now I&#8217;d like to leave them in the air for a moment (just as they kept floating around my head for a few days) and focus instead on one comment that is particularly dear to me. Apparently it came from a talk by April Merleaux:</p>
<blockquote><p>Merleaux: If u don&#8217;t program, your work will always be at the mercy of those who do</p></blockquote>
<p>I believe that is quite true, and it&#8217;s also the main reason why I started investing so much time in learning new technical skills (back in 2008, when I wanted to set up my first online digital document collection). If you do not grasp some core concepts from information science, web programming, or state-of-the-art natural language processing (as it is used by search companies whose interfaces we use daily), you will never fully understand what possibilities digital tools may hold for your own research.</p>
<p>There&#8217;s a variety of different opinions on this subject, however. <a href="http://listserv.brown.edu/archives/cgi-bin/wa?A2=ind1003&amp;L=TEI-L&amp;T=0&amp;F=&amp;S=&amp;P=5094">Recently, on the TEI mailing list</a>, Martin Holmes commented (this is from a discussion on eXist, a tool for indexing XML source files):</p>
<blockquote><p>It&#8217;s fair to say that to use eXist effectively, you have to learn quite a lot. Our researchers (faculty, RAs, etc.) never have anything to do with it. We (the technical support) set up eXist, write the <a class="zem_slink" title="XQuery" rel="wikipedia" href="http://en.wikipedia.org/wiki/XQuery">XQuery</a> and all the other web application logic, and so on. (&#8230;)</p>
<p>So I think people like you need people like me, and vice versa. Sometimes there are people who happily live in both camps &#8212; active faculty researchers who are also quite comfortable programming their own web applications &#8212; but these are few and far between. But that&#8217;s one of the things that makes digital humanities an inspiring enterprise: it&#8217;s almost always more collaborative than traditional humanities research.</p></blockquote>
<p>Holmes is certainly right, but he is advocating one &#8216;role model&#8217; of digital humanities &#8212; that of <strong>collaboration</strong> between technically skilled programmers &amp; academically skilled scholars. These kinds of collaborations, however, always need some time to grow and (more importantly) a well-funded research framework to take place in.</p>
<p>I don&#8217;t have the resources for such a framework, and I do not feel like spending the next few years of my academic life obtaining the funds for it. But I <em>do </em>want to do research in digital humanities.</p>
<p>That&#8217;s why I prefer the DIY attitude that (I believe) Merleaux represents. It&#8217;s tempting (because it feels flattering) to identify with Holmes&#8217; rare individuals &#8212; &#8220;active faculty researchers who are also quite comfortable programming their own web applications&#8221; &#8212; but actually our predicament is a little less comfortable. We have to keep up with our &#8216;traditional&#8217; specializations in the humanities-at-large, <em>and </em>do the programming ourselves.</p>
<p>The medium, then, not merely is the message, no longer is the massage &#8212; the medium is the <a href="http://en.wikipedia.org/wiki/Bricolage">bricolage</a>.</p>
<p>Some endnotes:</p>
<ul>
<li>Podcasts from the talks at PDP 2010 may be found at Jana Remy&#8217;s blog (who was also one of the excellent &amp; magnanimous twitterers for PDP, thanxalot!) <a href="http://makinghistorypodcast.com">makinghistorypodcast.com</a></li>
</ul>
<ul>
<li>Douglas Knox commented on the Humanist mailing list:</li>
</ul>
<blockquote><p>Date: Wed, 3 Mar 2010 08:45:51 -0600<br />
From: Douglas Knox</p></blockquote>
<blockquote><p>What I thought I glimpsed between the tweets about PDP2010 was nascent home-grown theory arising out of methodological reflection within historically oriented disciplines. Digital challenges to presumptions about research, evidence, analysis, communication, and audience certainly call for this reflection throughout the humanities, not just in humanities departments but in libraries, archives, museums, and publishing enterprises driven by an intellectual mission. The grad students who came together for PDP recognize the necessity of thinking about, and historicizing, the role of libraries, archives, their own collecting and publishing, and, not least, the dark matter of missing information, in the production of knowledge about the past.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://doingdigitalhistory.wordpress.com/2010/03/04/the-pasts-digital-presence-through-the-lens-of-twitter/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/396daa600873248b80b14c9769ff987d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tcrombez</media:title>
		</media:content>

		<media:content url="http://www.crunchbase.com/assets/images/resized/0000/2755/2755v30-max-250x250.png" medium="image">
			<media:title type="html">Image representing Twitter as depicted in Crun...</media:title>
		</media:content>

		<media:content url="http://img.zemanta.com/reblog_e.png?x-id=432eb6b3-d20e-436f-a787-458a94e4bcc5" medium="image">
			<media:title type="html">Reblog this post [with Zemanta]</media:title>
		</media:content>
	</item>
		<item>
		<title>Publishing scholarly projects using Google Sites, pt. 2</title>
		<link>http://doingdigitalhistory.wordpress.com/2010/02/13/publishing-scholarly-websites-using-google-sites-pt-2-automating-the-process/</link>
		<comments>http://doingdigitalhistory.wordpress.com/2010/02/13/publishing-scholarly-websites-using-google-sites-pt-2-automating-the-process/#comments</comments>
		<pubDate>Sat, 13 Feb 2010 22:47:00 +0000</pubDate>
		<dc:creator>tcrombez</dc:creator>
				<category><![CDATA[Scholarly Publishing]]></category>
		<category><![CDATA[Google Data API]]></category>
		<category><![CDATA[Google Sites]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://doingdigitalhistory.wordpress.com/?p=25</guid>
		<description><![CDATA[In my previous post on scholarly publishing projects, I summed up the advantages of using Google Sites to make documents available online. But how to automate that process for a huge number of documents? For example, how to build a fully-searchable website for hundreds of Victorian letters, or scores of descriptions of performance events? In [...]]]></description>
			<content:encoded><![CDATA[<p>In my <a href="http://doingdigitalhistory.wordpress.com/2010/02/13/publishing-scholarly-projects-using-google-sites/">previous post on scholarly publishing projects</a>, I summed up the advantages of using Google Sites to make documents available online. But how to automate that process for a huge number of documents? For example, how to build a fully-searchable website for hundreds of Victorian letters, or scores of descriptions of performance events?</p>
<p><span id="more-25"></span>In my earlier publishing projects (e.g., the SARMA project for <a href="http://corpustoneelkritiek.org/ptj">collecting dance &amp; theatre reviews</a> by Pieter T&#8217;Jonck) there seemed to be no other option but to <span style="text-decoration:line-through;">master the power</span> overcome the restrictions of HTML and CSS, throw a website together, add a search engine, and host the whole bunch on your own webserver or your university&#8217;s or a hosting company&#8217;s. That means lots of work, and between three and five new areas of expertise you need to explore if you&#8217;re coming from a traditional humanities background, as I am (I studied Philosophy and Theatre Studies, and those are still the main subjects of my teaching assignments).</p>
<p>That&#8217;s how I got into maintaining my own webserver, and although it&#8217;s certainly been a rich and rewarding process (insert your favorite funny accent to pronounce these words) it is getting on my nerves, too.</p>
<p>Still, other systems seemed like either <em>still </em>more work to do (such as learning the ropes of a <a title="Content management system" rel="wikipedia" href="http://en.wikipedia.org/wiki/Content_management_system">Content Management System</a>) or too limited in possibilities. That&#8217;s how I first thought of Google Sites (or similar systems).</p>
<p>What changed my mind was the <a href="http://googledataapis.blogspot.com/2009/09/new-data-api-for-google-sites.html">release of the Google Data API for Google Sites</a> in September 2009. That&#8217;s quite a mouthful. The Data API is Google&#8217;s backdoor for programmers. Using the &#8220;<a class="zem_slink" title="Application programming interface" rel="wikipedia" href="http://en.wikipedia.org/wiki/Application_programming_interface">Application Programming Interface</a>,&#8221; you may automate the sending of emails through GMail, automatically update events in Google Calendar, upload movies to YouTube, request maps from Google Maps etc. etc. Basically, anything you can do through one of Google&#8217;s services by mouseclicks, you can automate through the Data API. Moreover, there&#8217;s not only an API client for serious Java people, but also one for the queen of playful code &#8212; Python!</p>
<p>It&#8217;s a godsend, and I love every inch of it.</p>
<p>(Well, except for that one stupid little inch that makes it impossible to publish the PDF files in your Google <a class="zem_slink" title="Google Docs" rel="crunchbase" href="http://www.crunchbase.com/product/google-docs">Docs</a> account automatically, but more on that in a future post.)</p>
<p>Using the Google Data API, you can automatically post HTML-formatted documents to a Google-hosted website, where they are instantly made available online <span style="text-decoration:underline;">and</span> fully indexed for site searching.</p>
<p>Read that sentence again. It&#8217;s crazy, if only because of this: although there is an upper limit to the amount of stuff you can freely post on a Google Site, <em>this limit only holds for page attachments</em>. If it&#8217;s simple HTML pages you are posting, <strong>there is no upper limit</strong>.</p>
<p>To give a concrete example: this is the workflow I followed for the current version of <a href="http://sites.google.com/site/belgiumishappening">Belgium is Happening</a>, an online register of post-war performance events.</p>
<ol>
<li>Students <strong>collected information</strong> on performance events through research in books, journals, and archival records. They recorded this info in a gigantic spreadsheet on which they could work simultaneously (i.e., collaborative editing &#8212; I used Google Docs but there are other systems, too). Each row of the spreadsheet holds one event, detailing its date, place, participants, and a short description.</li>
<li>A first script <strong>pulled in the data</strong> from the Google spreadsheet through the Google Data API for Google Docs. (If they put <em>Google </em>in the name of one more new service they devise, I&#8217;ll have them sued for artificially limiting the vocabulary richness of my blogposts.) Note that this step isn&#8217;t strictly necessary &#8212; the script could also read in the data from a local data file.</li>
<li>A second script <strong>formatted the event data </strong>as HTML, logged in to the Google Sites API (using my normal Google account details), selected the first event, <strong>created a new page</strong> on the Google Site for the project, <strong>posted the HTML</strong> to the new page, and then continued to do so for the next 1,399 events.</li>
</ol>
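<p>The steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the scripts used for the project: it reads the rows from a local CSV export (the option mentioned in step 2), the column names in <code>FIELDS</code> are guesses based on the description of the spreadsheet, and <code>post_page</code> is a hypothetical stand-in for whatever call actually creates a page through the Sites API.</p>

```python
import csv
import html
import io

# Assumed column names, based on the spreadsheet described in step 1.
FIELDS = ("date", "place", "participants", "description")


def read_events(csv_file):
    """Yield one dict per spreadsheet row from a CSV export."""
    yield from csv.DictReader(csv_file)


def event_to_html(event):
    """Format one event as an HTML fragment, escaping the cell text."""
    parts = []
    for field in FIELDS:
        value = html.escape(event.get(field) or "")
        parts.append("<p><strong>%s:</strong> %s</p>" % (field.capitalize(), value))
    return "\n".join(parts)


def publish_all(events, post_page):
    """Call post_page(title, html) once per event; return the page count.

    post_page is a hypothetical callable that wraps the real upload.
    """
    count = 0
    for event in events:
        title = "%s, %s" % (event.get("date", ""), event.get("place", ""))
        post_page(title, event_to_html(event))
        count += 1
    return count


if __name__ == "__main__":
    # Made-up placeholder row, only to show the shape of the data.
    sample = io.StringIO(
        "date,place,participants,description\n"
        "1960-01-01,Sample City,Artist A & Artist B,Placeholder description\n"
    )
    # Print instead of posting, so the sketch runs without any account.
    publish_all(read_events(sample), lambda title, body: print(title, body, sep="\n"))
```

<p>Keeping the upload behind a single function argument means the formatting can be checked locally before anything is sent to the live site.</p>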
<p>And that&#8217;s how I posted no fewer than 1,400 separate pages on the site of Belgium is Happening, <em>and </em>made all of those events fully searchable at the same time. Running not on my server, not on my time, but on Google&#8217;s servers &#8212; i.e., much more efficiently than I could ever aspire to do. Check out a sample of events here: <a href="http://sites.google.com/site/belgiumishappening/home/events">belgiumishappening/home/events</a></p>
<p>(To be perfectly honest, what you see there is already a bit of a hack &#8212; I really wanted to offer visitors a random selection of events, which is impossible in the current setup of Google Sites. You can present an overview of subpages, but since all my events are under one &#8216;parent&#8217; page, this list would be very large and unwieldy. So I cheated a bit, more on this later&#8230;.)</p>
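<p>For what it is worth, here is one conceivable way to fake a random selection. It is not necessarily the hack used on the site, which is left for a later post. The idea is to pick the sample offline and post the resulting list as an ordinary page. The function name and the <code>(title, url)</code> pairs are assumptions made for this sketch.</p>

```python
import html
import random


def random_event_list(events, k=10, seed=None):
    """Return an HTML list linking to k randomly chosen (title, url) pairs."""
    rng = random.Random(seed)  # a fixed seed makes the choice repeatable
    chosen = rng.sample(events, min(k, len(events)))
    items = [
        '<li><a href="%s">%s</a></li>' % (html.escape(url, quote=True), html.escape(title))
        for title, url in chosen
    ]
    return "<ul>\n%s\n</ul>" % "\n".join(items)
```

<p>Re-running such a script from time to time and re-posting the page would refresh the selection, at the cost of it being random per update and not per visit.</p>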
<p>I first wanted to share the Python scripts I used at the end of this post, but they&#8217;re really too tailor-made for my project to be informative for others. However, I strongly suggest reading the code for the sample applications that are provided with the <a href="http://code.google.com/p/gdata-python-client/"><span style="font-family:Courier New;color:blue;">gdata</span></a> Python client &#8212; all the basics are there.</p>
]]></content:encoded>
			<wfw:commentRss>http://doingdigitalhistory.wordpress.com/2010/02/13/publishing-scholarly-websites-using-google-sites-pt-2-automating-the-process/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/396daa600873248b80b14c9769ff987d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tcrombez</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1001.png" medium="image" />

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1011.png" medium="image">
			<media:title type="html">Add to Facebook</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1021.png" medium="image">
			<media:title type="html">Add to Digg</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1031.png" medium="image">
			<media:title type="html">Add to Del.icio.us</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1041.png" medium="image">
			<media:title type="html">Add to Stumbleupon</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1051.png" medium="image">
			<media:title type="html">Add to Reddit</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1061.png" medium="image">
			<media:title type="html">Add to Blinklist</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1071.png" medium="image">
			<media:title type="html">Add to Twitter</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1081.png" medium="image">
			<media:title type="html">Add to Technorati</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1091.png" medium="image">
			<media:title type="html">Add to Yahoo Buzz</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1101.png" medium="image">
			<media:title type="html">Add to Newsvine</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1111.png" medium="image" />

		<media:content url="http://img.zemanta.com/reblog_e.png?x-id=44aa3f16-1834-4026-8dc6-fcd2a243c291" medium="image">
			<media:title type="html">Reblog this post [with Zemanta]</media:title>
		</media:content>
	</item>
		<item>
		<title>Publishing scholarly projects using Google Sites</title>
		<link>http://doingdigitalhistory.wordpress.com/2010/02/13/publishing-scholarly-projects-using-google-sites/</link>
		<comments>http://doingdigitalhistory.wordpress.com/2010/02/13/publishing-scholarly-projects-using-google-sites/#comments</comments>
		<pubDate>Sat, 13 Feb 2010 13:18:00 +0000</pubDate>
		<dc:creator>tcrombez</dc:creator>
				<category><![CDATA[Scholarly Publishing]]></category>
		<category><![CDATA[Google Data API]]></category>
		<category><![CDATA[Google Sites]]></category>
		<category><![CDATA[search engines]]></category>
		<category><![CDATA[TEI]]></category>
		<category><![CDATA[web server]]></category>
		<category><![CDATA[XML]]></category>
		<category><![CDATA[XSLT]]></category>

		<guid isPermaLink="false">http://doingdigitalhistory.wordpress.com/2010/02/13/publishing-scholarly-projects-using-google-sites</guid>
		<description><![CDATA[So, you finally got that beautiful scholarly editing project on the rails. You have begun to painstakingly edit those [insert crazy number] letters from late Renaissance artists. Or those [insert second crazy number] medieval prose manuscripts. Or, as in my case, you have collected data on a humungous amount of performance events from twentieth-century theatre, [...]]]></description>
			<content:encoded><![CDATA[<p>So, you finally got that beautiful scholarly editing project on the rails. You have begun to painstakingly edit those<em> [insert crazy number]</em> letters from late Renaissance artists. Or those <em>[insert second crazy number]</em> medieval prose manuscripts.</p>
<p><span id="more-24"></span>Or, as in my case, you have collected data on a humungous amount of performance events from twentieth-century theatre, letting various groups of students work together using a Google Spreadsheet. (A project going by the name of <a href="http://sites.google.com/site/belgiumishappening">Belgium is Happening</a>, more on this below &amp; in upcoming posts.)</p>
<p>It goes without saying that for the <em>editing</em> phase of those projects the <a href="http://www.tei-c.org/">Text Encoding Initiative</a> has been a tremendous boon.</p>
<p>But how about <em>publishing</em> your results? For some projects, &#8216;going print&#8217; is no longer a viable option. The materials you have collected may simply be too large or cumbersome for print. Besides, a scholarly publication (at least in my experience) takes at least half as much time for negotiations with publishers, going through another editorial process, &amp;<em>c.</em>, as the edition itself took to complete.</p>
<p>So, you want to do an online scholarly publishing project.</p>
<p>When I started my first project in the summer of 2008 &#8212; the <a href="http://www.corpustoneelkritiek.org/cti">Corpus Toneelkritiek Interbellum</a>, a corpus of Dutch theatre reviews from the interwar period &#8212; the natural choice for publishing my TEI files online seemed to be the formatting language <a class="zem_slink" title="XSL Transformations" rel="wikipedia" href="http://en.wikipedia.org/wiki/XSL_Transformations">XSLT</a>. For a TEI beginner, it was the obvious way to go, as it is used in many <a href="http://www.tei-c.org/Activities/Projects/index.xml">projects</a> for the presentation of an edition.</p>
<p>XSLT is basically a powerful formatting language (and, in some ways, also a rudimentary programming language) that tells the web browser how to sort, lay out, and present the intricacies of your XML edition for the benefit of the illiterate masses out there who prefer to do their reading without angle brackets or namespace declarations.</p>
<p>XSLT has the distinct advantage of a ridiculously simple publishing model. If you simply connect a stylesheet to an XML document on your web <a class="zem_slink" title="Web server" rel="wikipedia" href="http://en.wikipedia.org/wiki/Web_server">server</a>, the browser will pick up both, and render your XML documents accordingly. At least, that&#8217;s what is supposed to happen <em>after </em>you have managed to write or customize your own XSLT stylesheet.</p>
<p>For the sake of clarity, I did not reach that stage of my initiation.</p>
<p>As it happened, I was also picking up some basic Python scripting skills at the time. And what I saw in Python &#8212; delicious pseudo-code featuring inoffensive commands such as <span style="font-family:'Courier New',Courier,monospace;"><span style="color:blue;"><strong>open()</strong></span></span> &#8212; looked quite different from the verbose and swollen style of your average XSLT stylesheet.</p>
<p>I decided to convert my source files into run-of-the-mill HTML using Python&#8217;s quite powerful XML parsing capabilities. And then publish them, first as static files on a general-purpose server at our university, later on as part of a <a href="http://www.corpustoneelkritiek.org/">dynamic (and again Python-driven) website on my own web server</a>.</p>
<p>However, there are certain distinct disadvantages &amp; upper limits to these approaches. Either way &#8212; running your website on a university server, a private hosting solution, or your own server &#8212; you are basically into self-publishing. Will you use an established platform aka CMS (Content Management System, e.g., <a class="zem_slink" title="WordPress" rel="homepage" href="http://wordpress.org/">WordPress</a> or <a href="http://www.drupal.org/">Drupal</a>) or do you prefer to grow your own HTML/CSS? What is the most advantageous and flexible place to host it? If you run your own server, when does it need to be updated? Do you really need that latest Apache update? If you are doing a dynamic website, will the database continue to behave as it does today? When to update your database software? Is it possible that your website will one day attract a lot of traffic, necessitating more than one server? What search engine do you use for your collection of texts? Do you simply plug in a Google search box, or do you want some more searching power for your users? If so, what software do you choose?</p>
<p>Ah, the joys of server maintenance.</p>
<p>While looking for alternative models of publishing online, I stumbled upon <a href="http://www.google.com/sites/overview.html">Google Sites</a>. It is part of the Google Apps suite that is marketed to companies, schools, and groups. Using Google Sites, you design a basic website and populate it with content. Plus, every page you add is automatically indexed for powerful and lightning-fast site searching.</p>
<p>At first, that was exactly how I used it. It was a good choice for <a href="http://sites.google.com/site/kunstfilosofiesite">course background materials</a> &amp; wikis &#8212; i.e., as a high-tech replacement for the idiotic constraints of your average Blackboard installation.</p>
<p>But then I realised the full potential of the feature allowing you to edit the HTML contents of any page. If you simply convert your scholarly project&#8217;s source files to HTML (using any of the above-mentioned methods) and insert them into a Google Site, it could grow to be the ideal publication platform for your project:</p>
<div class="zemanta-img" style="display:block;margin:1em;">
<div class="wp-caption alignright" style="width: 250px"><a href="http://commons.wikipedia.org/wiki/Image:Google%E2%80%99s_First_Production_Server.jpg"><img class=" " title="Google's  first production server rack, circa 1999" src="http://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Google%E2%80%99s_First_Production_Server.jpg/300px-Google%E2%80%99s_First_Production_Server.jpg" alt="Google's first production server rack, circa 1999" width="240" height="398" /></a><p class="wp-caption-text">Image    via Wikipedia</p></div>
</div>
<ul>
<li><strong>Styling</strong> is applied automatically to your text. No more fiddling about with CSS stylesheets, just choose a general theme for your site, insert the HTML, and you&#8217;re done. If you want to apply additional styling, you can use the WYSIWYG editor or the HTML editor that <a href="http://siteshelp.kccloudsolutions.com/step-by-step-guides/usingthestyleattributewithgooglesites">accepts in-line CSS styling</a>, too.</li>
<li>Your project is running on <strong>Google&#8217;s servers</strong>. <em>They</em> are doing the maintenance, and since they run one of the world&#8217;s largest data farms, it should remain accessible even when under heavy load. (The Google approach to hosting is dealt with from a different perspective in a Campfire One video on <a href="http://www.youtube.com/watch?v=3Ztr-HhWX1c">Google App Engine</a>.)</li>
<li>Your project&#8217;s texts are immediately indexed and <strong>fully searchable</strong>. Anyone who has dipped into the difficulties of configuring your own search engine (certainly when coming from a non-IT background) should be happy about this.</li>
<li>If you decide to make your website public, it seems quite reasonable to expect that its contents are also swiftly added to the <strong>main Google search index</strong>, so your editing efforts may become more visible.</li>
</ul>
<p>The only point is &#8230; how are you going to upload those <em>[refer to crazy number 1] </em>of HTML documents to your Google Site?<br />
More on that soon, in an upcoming post.</p>
]]></content:encoded>
			<wfw:commentRss>http://doingdigitalhistory.wordpress.com/2010/02/13/publishing-scholarly-projects-using-google-sites/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/396daa600873248b80b14c9769ff987d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tcrombez</media:title>
		</media:content>

		<media:content url="http://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Google%E2%80%99s_First_Production_Server.jpg/300px-Google%E2%80%99s_First_Production_Server.jpg" medium="image">
			<media:title type="html">Google's  first production server rack, circa 1999</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1004.png" medium="image" />

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1014.png" medium="image">
			<media:title type="html">Add to Facebook</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1024.png" medium="image">
			<media:title type="html">Add to Digg</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1034.png" medium="image">
			<media:title type="html">Add to Del.icio.us</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1044.png" medium="image">
			<media:title type="html">Add to Stumbleupon</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1054.png" medium="image">
			<media:title type="html">Add to Reddit</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1064.png" medium="image">
			<media:title type="html">Add to Blinklist</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1074.png" medium="image">
			<media:title type="html">Add to Twitter</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1084.png" medium="image">
			<media:title type="html">Add to Technorati</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1094.png" medium="image">
			<media:title type="html">Add to Yahoo Buzz</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1104.png" medium="image">
			<media:title type="html">Add to Newsvine</media:title>
		</media:content>

		<media:content url="http://getsocialserver.files.wordpress.com/2009/08/gs1114.png" medium="image" />

		<media:content url="http://img.zemanta.com/reblog_e.png?x-id=f92427f5-c7db-44c4-ab83-bc6b921a259c" medium="image">
			<media:title type="html">Reblog this post [with Zemanta]</media:title>
		</media:content>
	</item>
		<item>
		<title>DIY Plagiarism Detection</title>
		<link>http://doingdigitalhistory.wordpress.com/2010/02/06/diy-plagiarism-detection/</link>
		<comments>http://doingdigitalhistory.wordpress.com/2010/02/06/diy-plagiarism-detection/#comments</comments>
		<pubDate>Sat, 06 Feb 2010 22:06:00 +0000</pubDate>
		<dc:creator>tcrombez</dc:creator>
				<category><![CDATA[The Teacher Is a Geek]]></category>
		<category><![CDATA[APIs]]></category>
		<category><![CDATA[ngrams]]></category>
		<category><![CDATA[plagiarism detection]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[search engines]]></category>

		<guid isPermaLink="false">http://doingdigitalhistory.wordpress.com/2010/02/06/diy-plagiarism-detection</guid>
		<description><![CDATA[Sure, many elaborate plagiarism detection services are available online, many of them free of charge &#38; some which have beautiful interfaces (I&#8217;m looking at you, gorgeous Urkund demo). Most of them seem to connect to huge back-end databases of text from journals, books, and similar library materials. However, for most beginning students, the obvious sources [...]]]></description>
			<content:encoded><![CDATA[<p>Sure, many elaborate <a href="http://en.wikipedia.org/wiki/Plagiarism_detection">plagiarism detection</a> services are available online, many of them free of charge &amp; some of which have beautiful interfaces (I&#8217;m looking at you, <a href="https://secure.urkund.com/view/2154474-482453-351207">gorgeous Urkund demo</a>). Most of them seem to connect to huge back-end databases of text from journals, books, and similar library materials.</p>
<p><span id="more-23"></span>However, for most beginning students, the obvious sources to scavenge are simply found through Google. They do not bother with scholarly databases or real books anyway. So what better way to detect their scams than Google itself?</p>
<p>As it stands, the collective intelligence a.k.a. Wikipedia has a not so positive opinion on plagiarism detection through search engines:</p>
<blockquote><p>Although it can easily detect blatant cases, it is less  effective when the plagiarizer has mixed multiple small fragments from  different sources, and will not return any relevant results if the  search engine has not indexed the original source or sources. Also,  considerable effort is required to investigate each suspected case.</p></blockquote>
<p>Still, I had the feeling I could make search engine-powered plagiarism detection work, even if it was only to help out with my own grading duties. There are some terrific <a href="http://www.programmableweb.com/apitag/?q=search">Web search APIs</a> available right now. (An API or <a class="zem_slink" title="Application programming interface" rel="wikipedia" href="http://en.wikipedia.org/wiki/Application_programming_interface">Application Programming Interface</a> is like a &#8216;back door&#8217; for programmers to query a web application&#8217;s data and programs &#8212; not through a <em>user </em>interface (such as a website) but through a <em>programming</em> interface, i.e., enabling automated queries. There&#8217;s a great tutorial on APIs available from <a href="http://www.profhacker.com/2009/08/31/working-with-apis-part-1/">ProfHacker</a>.)</p>
<p>Web search APIs make it quite easy for a beginning programmer to do some automated  googling. I particularly like <a href="http://developer.yahoo.com/search/web/V1/webSearch.html">Yahoo Web Search</a>, or the <a href="http://code.google.com/apis/ajaxsearch/">Google AJAX Search</a> API.</p>
<div class="zemanta-img zemanta-action-dragged" style="display:block;margin:1em;">
<div class="wp-caption alignright" style="width: 220px"><a href="http://en.wikipedia.org/wiki/Image:Python_logo.svg"><img class=" " title="CPython" src="http://upload.wikimedia.org/wikipedia/en/thumb/0/06/Python_logo.svg/300px-Python_logo.svg.png" alt="CPython" width="210" height="50" /></a><p class="wp-caption-text">Image via Wikipedia</p></div>
</div>
<p>Here&#8217;s what I concocted with my basic <a class="zem_slink" title="Python (programming language)" rel="homepage" href="http://www.python.org/">Python</a> scripting skills. (Full details on the script are at the end of this post.)</p>
<p>My DIY plagiarism detector takes a simple text file as input, then automatically strips punctuation, and divides the text into <a href="http://en.wikipedia.org/wiki/N-gram">ngrams</a>. Simply put, an ngram is any sequence of n words you can extract from a given sentence. So for &#8220;My name is Mark&#8221; the bigrams would be<br />
<span style="font-family:Courier New;">&#8220;My name&#8221; &#8212; &#8220;name is&#8221; &#8212; &#8220;is Mark&#8221;</span><br />
and the trigrams would be<br />
<span style="font-family:Courier New;">&#8220;My name is&#8221; &#8212; &#8220;name is Mark&#8221;</span></p>
<p>Next, the script sends each of these &#8220;sentence chunks&#8221; to a search engine, and asks for hits that correspond to a <strong>phrase search</strong>, or a verbatim search of that exact sequence of words. Given that punctuation has been stripped, and the search engine itself disregards capitals, any available website containing that same word sequence (including the vast repository of Google Books, for instance) will be retrieved.</p>
<p>If my plagiarism detector finds many consecutive hits, it&#8217;s a clear sign that the student has either quoted that passage, or has been engaging in some questionable copy-paste writing. I usually run the script at the same time I am &#8220;manually&#8221; reading and grading the paper, to see if any stolen goods turn up.</p>
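<p>To make the idea of a run of hits concrete, here is a small sketch under stated assumptions. The helper names <code>phrase_query</code> and <code>longest_hit_run</code> are hypothetical, and <code>fake_search</code> merely stands in for a real search API call by looking inside one known source text. Wrapping a chunk in double quotes is the usual way to request an exact-phrase search.</p>

```python
def phrase_query(chunk):
    """Wrap a chunk in double quotes to ask for an exact-phrase match."""
    return '"' + chunk + '"'


def longest_hit_run(chunks, has_hits):
    """Return the length of the longest run of consecutive chunks with hits.

    has_hits is any callable that takes a query string and returns
    True or False; in real use it would call a search API.
    """
    longest = current = 0
    for chunk in chunks:
        if has_hits(phrase_query(chunk)):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# Stand-in for a search engine: an "index" holding one known source text.
SOURCE = "the quick brown fox jumps over the lazy dog"


def fake_search(query):
    return query.strip('"') in SOURCE


chunks = ["the quick brown", "quick brown fox", "brown fox leaps", "fox leaps over"]
print(longest_hit_run(chunks, fake_search))
```

<p>Passing the search function in as an argument keeps the counting logic independent of whichever search service happens to be available.</p>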
<p>Some observations on my recent grading experiences.</p>
<ul>
<li><strong>Eight </strong>or <strong>nine </strong>consecutive words seems like a good window for detecting plagiarism. It means you won&#8217;t be scoring too many false positives, i.e., hits for sequences of words which are just frequent turns of phrase instead of illegitimate copy/paste.</li>
<li><strong>Many students &#8216;plagiarize&#8217; </strong>to a certain degree. Even the good ones. They may not copy whole paragraphs from the net, but they do snatch an occasional phrase. Is writing slowly turning into &#8220;montage&#8221;?</li>
<li>The Wikipedia criticism that search engines are less effective &#8220;if the  search engine has<strong> not indexed the original source</strong> or sources&#8221; has become quite obsolete, if only for the increasing number of print sources indexed through Google Book Search or similar mass digitization projects.</li>
<li>The Wikipedia criticism that the search engine method &#8220;is less  effective when the plagiarizer has <strong>mixed multiple small fragments</strong> from  different sources&#8221; still holds very much true. Some students take an inspiring chunk of text and start &#8216;remixing&#8217; it in a way that is very close to plagiarism, yet is more contrived than simple copy-pasting. I find it hard to judge them harshly, since some of them are obviously unaware that what they&#8217;re doing is fraudulent.</li>
<li><strong>A critical attitude toward <span style="text-decoration:underline;"><em>all</em></span> sources</strong>, be they written, oral, or digital, is undoubtedly one of the most important virtues an academic education should convey today.</li>
</ul>
<p>Now, for those who want the full gory details of the script, read on &#8230;</p>
<p>The script is written in <a href="http://www.python.org/">Python</a> and defines a number of functions that are combined in a final function <span style="font-family:Courier;">GPlagFile()</span> for sending a whole file to Google for plagiarism detection. Some of the functions were snatched from the magnificent introduction to using Python in the humanities that William J. Turkel and Alan MacEachern wrote (<a href="http://niche-canada.org/programming-historian/1ed">The Programming Historian</a>). You can find the script <a href="http://www.zombrec.be/plagiarism.py">here</a>.</p>
<p>(Please note that you will have to download one additional Python module &#8212; namely <a href="http://pypi.python.org/pypi/simplejson"><span style="font-family:Courier;color:blue;">simplejson</span></a> &#8212; and that you may have to replace <span style="font-family:Courier;color:blue;">www.example.com</span> with a website of your own to identify yourself to the Google API.)</p>
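<p>For readers wondering what the JSON module is for: the search API answers with a JSON document, which has to be turned into Python data before the hits can be counted. The sketch below uses the standard library's <code>json</code> module, whose <code>loads()</code> call works like the one in simplejson. The sample payload and its field names are made up for illustration and are an assumption, not a documented contract; <code>estimated_hits</code> is a hypothetical helper.</p>

```python
import json

# Made-up sample response; the field names are assumptions for illustration.
SAMPLE = (
    '{"responseData": {"cursor": {"estimatedResultCount": "3"}, '
    '"results": [{"url": "http://www.example.com/a"}]}}'
)


def estimated_hits(raw):
    """Return the estimated hit count from a JSON response, or 0 if absent."""
    data = json.loads(raw)
    cursor = (data.get("responseData") or {}).get("cursor") or {}
    return int(cursor.get("estimatedResultCount", 0))


print(estimated_hits(SAMPLE))
print(estimated_hits('{"responseData": null}'))
```

<p>Treating a missing or empty field as zero hits means a malformed or empty answer does not stop the run halfway through a paper.</p>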
]]></content:encoded>
			<wfw:commentRss>http://doingdigitalhistory.wordpress.com/2010/02/06/diy-plagiarism-detection/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/396daa600873248b80b14c9769ff987d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tcrombez</media:title>
		</media:content>

		<media:content url="http://upload.wikimedia.org/wikipedia/en/thumb/0/06/Python_logo.svg/300px-Python_logo.svg.png" medium="image">
			<media:title type="html">CPython</media:title>
		</media:content>

	</item>
		<item>
		<title>A new digital humanities blog</title>
		<link>http://doingdigitalhistory.wordpress.com/2010/01/16/a-new-digital-humanities-blog/</link>
		<comments>http://doingdigitalhistory.wordpress.com/2010/01/16/a-new-digital-humanities-blog/#comments</comments>
		<pubDate>Sat, 16 Jan 2010 16:55:00 +0000</pubDate>
		<dc:creator>tcrombez</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[digital corpora]]></category>
		<category><![CDATA[digital history]]></category>

		<guid isPermaLink="false">http://doingdigitalhistory.wordpress.com/2010/01/16/a-new-digital-humanities-blog</guid>
		<description><![CDATA[This blog is written &#38; presented by Thomas Crombez. I&#8217;m a researcher at the University of Antwerp (Belgium). I also teach Philosophy of Art and Theatre History at the Royal Academy of Fine Arts in Antwerp. Since 2008, my research in theatre history has increasingly focused on doing &#8220;digital history&#8221;. What does it mean to [...]]]></description>
			<content:encoded><![CDATA[<p>This blog is written &amp; presented by Thomas Crombez. I&#8217;m a researcher at the University of Antwerp (Belgium). I also teach Philosophy of Art and Theatre History at the Royal Academy of Fine Arts in Antwerp.</p>
<p>Since 2008, my research in theatre history has increasingly focused on doing &#8220;digital history&#8221;.</p>
<p><span id="more-19"></span>What does it mean to have a digital corpus of source texts at our disposal, instead of a traditional library or archive? How can we apply techniques from rapidly advancing fields such as artificial intelligence, Natural Language Processing, and computational linguistics to such &#8216;traditional&#8217; disciplines as literature and the arts?</p>
<p>My current projects focus on making large collections of source documents accessible through &#8220;research websites,&#8221; as I call them for lack of a better term. Examples include <a href="http://www.corpustoneelkritiek.org/cti">Corpus Toneelkritiek Interbellum</a>, a collection of Flemish theatre reviews from the interwar period, and the collected reviews of dance and theatre critic <a href="http://www.corpustoneelkritiek.org/ptj">Pieter T&#8217;Jonck</a> from 1983 to 2008. (All documents are in Dutch, but the interface and background documents are in English.)</p>
]]></content:encoded>
			<wfw:commentRss>http://doingdigitalhistory.wordpress.com/2010/01/16/a-new-digital-humanities-blog/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/396daa600873248b80b14c9769ff987d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tcrombez</media:title>
		</media:content>

	</item>
	</channel>
</rss>
