Thomas Crombez

DIY Plagiarism Detection

In The Teacher Is a Geek on February 6, 2010 at 10:06 pm

Sure, many elaborate plagiarism detection services are available online, many of them free of charge, and some with beautiful interfaces (I’m looking at you, gorgeous Urkund demo). Most of them seem to connect to huge back-end databases of text from journals, books, and similar library materials.

However, for most beginning students, the obvious sources to scavenge are simply found through Google. They do not bother with scholarly databases or real books anyway. So what better way to detect their scams than Google itself?

As it stands, the collective intelligence a.k.a. Wikipedia has a not so positive opinion on plagiarism detection through search engines:

Although it can easily detect blatant cases, it is less effective when the plagiarizer has mixed multiple small fragments from different sources, and will not return any relevant results if the search engine has not indexed the original source or sources. Also, considerable effort is required to investigate each suspected case.

Still, I had the feeling I could make search engine-powered plagiarism detection work, even if it was only to help out with my own grading duties. There are some terrific Web search APIs available right now. (An API, or Application Programming Interface, is like a ‘back door’ for programmers to query a web application’s data and programs: not through a user interface (such as a website) but through a programming interface, i.e., one that enables automated queries. There’s a great tutorial on APIs available from ProfHacker.)

Web search APIs make it quite easy for a beginning programmer to do some automated googling. I particularly like Yahoo Web Search, or the Google AJAX Search API.

Here’s what I concocted with my basic Python scripting skills. (Full details on the script at the end of this post)

My DIY plagiarism detector takes a simple text file as input, then automatically strips punctuation, and divides the text into ngrams. Simply put, an ngram is any sequence of n words you can extract from a given sentence. So for “My name is Mark” the bigrams would be
“My name” — “name is” — “is Mark”
and the trigrams would be
“My name is” — “name is Mark”
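
As a minimal sketch of that definition in Python 3 (the function name `ngrams` is illustrative and not taken from the actual script), the chunks can be produced by sliding a window of n words across the text:

```python
def ngrams(text, n):
    """Return every run of n consecutive words in text, in order."""
    words = text.split()
    # A text shorter than n words yields an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(ngrams("My name is Mark", 2))  # ['My name', 'name is', 'is Mark']
print(ngrams("My name is Mark", 3))  # ['My name is', 'name is Mark']
```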

Next, the script sends each of these “sentence chunks” to a search engine, and asks for hits that correspond to a phrase search, or a verbatim search of that exact sequence of words. Given that punctuation has been stripped, and the search engine itself disregards capitals, any available website containing that same word sequence (including the vast repository of Google Books, for instance) will be retrieved.
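
The preparation step might look roughly like the following Python 3 sketch. It is an assumption about how such a step could be written, not the script's own code: the helper names `strip_punctuation` and `phrase_query` are hypothetical, only ASCII punctuation is removed, and no particular search API is called. Wrapping the chunk in double quotes is what turns an ordinary query into a phrase search.

```python
import string
from urllib.parse import quote_plus


def strip_punctuation(text):
    """Remove ASCII punctuation and collapse runs of whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.translate(table).split())


def phrase_query(chunk):
    """Wrap a chunk in double quotes and URL-encode it for a query string."""
    return quote_plus('"' + chunk + '"')


clean = strip_punctuation("My name, is Mark!")
print(clean)                # My name is Mark
print(phrase_query(clean))  # %22My+name+is+Mark%22
```

Note that deleting punctuation outright also merges hyphenated words and contractions ("copy-paste" becomes "copypaste"), so replacing punctuation with spaces may be the better choice for some texts.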

If my plagiarism detector finds many consecutive hits, it’s a clear sign that the student has either quoted that passage, or has been engaging in some questionable copy-paste writing. I usually run the script at the same time I am “manually” reading and grading the paper, to see if any stolen goods turn up.
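
One way to make "many consecutive hits" concrete is to look for runs of adjacent chunks that all return a hit. The Python 3 sketch below is illustrative only: `flag_runs`, `has_hit`, and the `min_run` threshold are hypothetical names, and the search call is replaced by a stand-in lookup so the example runs without any network access.

```python
def flag_runs(chunks, has_hit, min_run=2):
    """Return (start, end) index pairs for runs of at least min_run
    consecutive chunks for which has_hit(chunk) is true."""
    runs, start = [], None
    for i, chunk in enumerate(chunks):
        if has_hit(chunk):
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the text.
    if start is not None and len(chunks) - start >= min_run:
        runs.append((start, len(chunks) - 1))
    return runs


# Stand-in for a real search call: pretend these chunks were found online.
known = {"b c d", "c d e", "d e f"}
chunks = ["a b c", "b c d", "c d e", "d e f", "e f g", "f g h"]
print(flag_runs(chunks, lambda c: c in known))  # [(1, 3)]
```

An isolated hit is ignored here, which fits the observation that single matches are often just common turns of phrase.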

Some observations on my recent grading experiences.

  • Eight or nine consecutive words seems like a good window for detecting plagiarism. It means you won’t be scoring too many false positives, i.e., hits for sequences of words which are just frequent turns of phrase instead of illegitimate copy/paste.
  • Many students ‘plagiarize’ to a certain degree. Even the good ones. They may not copy whole paragraphs from the net, but they do snatch an occasional phrase. Is writing slowly turning into “montage”?
  • The Wikipedia criticism that search engines are less effective “if the search engine has not indexed the original source or sources” has become largely obsolete, if only because of the increasing number of print sources indexed through Google Book Search or similar mass digitization projects.
  • The Wikipedia criticism that the search engine method “is less effective when the plagiarizer has mixed multiple small fragments from different sources” still holds very much true. Some students take an inspiring chunk of text and start ‘remixing’ it in a way that is very close to plagiarism, yet is more contrived than simple copy-pasting. I find it hard to judge them harshly, since some of them are obviously unaware that what they’re doing is fraudulent.
  • A critical attitude toward all sources, be they written, oral, or digital, is undoubtedly one of the most important virtues an academic education should convey today.

Now, for those who want the full gory details of the script, read on …

The script is written in Python and defines a number of functions that are combined in a final function GPlagFile() for sending a whole file to Google for plagiarism detection. Some of the functions were snatched from the magnificent introduction to using Python in the humanities that William J. Turkel and Alan MacEachern wrote (The Programming Historian). You can find the script here.

(Please note that you will have to download one additional Python module — namely simplejson — and that you may have to replace http://www.example.com with a website of your own to identify yourself to the Google API.)

  1. […] this: Off-the-rack anti-plagiarism software isn’t good enough for Thomas Crombez, so he built his own: However, for most beginning students, the obvious sources to scavenge are simply found through […]

  2. I like the approach you’re taking. I grabbed the script and fussed about with it a little, and made a few improvements in GPlagFile to make the output a little more informative. Instead of just printing out the possibly plagiarized string (s), I print the string and the possible source. Then, after the whole text has been run through, I output a summary of how many hits corresponded to each url in order to give you a good sense of which source(s) are closest to the text. Here are my mods (starting at the “for s in n:” loop)

    # Python 2 syntax, like the original script.
    # Assumes sourcedict = {} was initialized before this loop.
    for s in n:
        outcome = GPlag(s,encode=True)
        if outcome != []:
            print '"'+ s + '" matched content from '+outcome[0]+' ('+outcome[1]+') : '+outcome[2]
            print
            sys.stdout.flush()
            # Count the hits per source URL.
            if sourcedict.has_key(outcome[1]):
                sourcedict[outcome[1]] = int(sourcedict[outcome[1]])+1
            else:
                sourcedict[outcome[1]] = 1
    print
    print '---------'
    print 'Summary of possibly-used sources:'
    for url in sourcedict.keys():
        print str(int(sourcedict[url])) + ' possible matches from '+str(url)

    Hope this is useful to you. -Dave

    • I believe the summary you propose is an especially good idea (i.e. the ratio of hits to URLs). I was also toying with the idea of how to ‘stitch’ together all possibly plagiarized strings. Thanks for your input!

  3. […] the subject of lies, and the lying liars who tell them: what about students? DIY Plagiarism Detection Software (hint: it uses teh Googles)  (h/t to […]

  4. […] If SafeAssign is not your ticket, perhaps a simple internet search might work for you! Internet searching is a good option for detecting cut-and-paste plagiarism you suspect was obtained from a web source. While this is a good option for many people, there are a few limitations on a typical web search of which you should be aware. For example, most web searches have a 10-32 word search limit. If you enter text that is longer than this into the search box, your search engine will truncate the entry and only look for the first 10-32 words! Another limitation to be aware of is that web searches are based on semantics (http://en.wikipedia.org/wiki/Semantics) and not on structure. Semantic searching will aid you in finding textual plagiarism, but it will not aid you in finding cases of plagiarism where a student has duplicated the structure of another’s work. Furthermore, web searches work best on contiguous blocks of text. According to Thomas Crombez, of the blog Doing Digital History, “The Wikipedia criticism that the search engine method “is less effective when the plagiarizer has mixed multiple small fragments from different sources” still holds very much true. Some students take an inspiring chunk of text and start ‘remixing’ it in a way that is very close to plagiarism, yet is more contrived than simple copy-pasting. I find it hard to judge them harshly, since some of them are obviously unaware that what they’re doing is fraudulent.” (https://doingdigitalhistory.wordpress.com/2010/02/06/diy-plagiarism-detection/) […]

  5. […] one method is to put a few questionable sentences into Google without quotation marks around them. Here’s another method which I want to figure out how to use. Maybe we should run politicians’ speeches through it. […]

  6. […] towards plagiarism. It develops a number of points I vaguely hinted at in my earlier post on ‘DIY plagiarism detection.’ The most disquieting observation is that students simply no longer realize that they are […]
