Sure, many elaborate plagiarism detection services are available online, many of them free of charge, and some with beautiful interfaces (I’m looking at you, gorgeous Urkund demo). Most of them seem to connect to huge back-end databases of text from journals, books, and similar library materials.
However, for most beginning students, the obvious sources to scavenge are simply found through Google. They do not bother with scholarly databases or real books anyway. So what better way to detect their scams than Google itself?
As it stands, the collective intelligence a.k.a. Wikipedia has a not so positive opinion on plagiarism detection through search engines:
Although it can easily detect blatant cases, it is less effective when the plagiarizer has mixed multiple small fragments from different sources, and will not return any relevant results if the search engine has not indexed the original source or sources. Also, considerable effort is required to investigate each suspected case.
Still, I had the feeling I could make search engine-powered plagiarism detection work, even if it was only to help out with my own grading duties. There are some terrific Web search APIs available right now. (An API, or Application Programming Interface, is like a ‘back door’ that lets programmers query a web application’s data and programs, not through a user interface such as a website but through a programming interface that enables automated queries. There’s a great tutorial on APIs available from ProfHacker.)
Here’s what I concocted with my basic Python scripting skills. (Full details on the script are at the end of this post.)
My DIY plagiarism detector takes a simple text file as input, then automatically strips punctuation, and divides the text into ngrams. Simply put, an ngram is any sequence of n words you can extract from a given sentence. So for “My name is Mark” the bigrams would be
“My name” — “name is” — “is Mark”
and the trigrams would be
“My name is” — “name is Mark”
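For readers who like to see this in code, here is a minimal sketch of the punctuation stripping and ngram splitting described above. It is not the original script: the function names `strip_punctuation` and `ngrams` are my own, and the regular expression used to drop punctuation is just one reasonable assumption about what "stripping" means.

```python
import re


def strip_punctuation(text):
    """Drop every character that is neither a word character nor whitespace.

    Assumption: punctuation is deleted rather than replaced by a space,
    so "won't" becomes "wont". The original script may behave differently.
    """
    cleaned = re.sub(r"[^\w\s]", "", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(cleaned.split())


def ngrams(text, n):
    """Return every sequence of n consecutive words as a single string."""
    words = strip_punctuation(text).split()
    # If the text has fewer than n words, the range is empty and so is the result.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(ngrams("My name is Mark", 2))  # ['My name', 'name is', 'is Mark']
print(ngrams("My name is Mark", 3))  # ['My name is', 'name is Mark']
```

A sentence of w words yields w - n + 1 ngrams, so a long paper produces a lot of chunks; that matters once each chunk becomes a search query.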
Next, the script sends each of these “sentence chunks” to a search engine, and asks for hits that correspond to a phrase search, or a verbatim search of that exact sequence of words. Given that punctuation has been stripped, and the search engine itself disregards capitals, any available website containing that same word sequence (including the vast repository of Google Books, for instance) will be retrieved.
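To give an idea of what such a phrase search looks like as an automated query, here is a small sketch that wraps a chunk in double quotes and URL-encodes it. The endpoint `https://search.example.com/api` and the parameter name `q` are placeholders I made up for illustration; a real search API will have its own address, parameters, and authentication, so check its documentation before sending anything.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only; substitute the address
# of whatever search API you actually have access to.
SEARCH_ENDPOINT = "https://search.example.com/api"


def phrase_query_url(chunk, endpoint=SEARCH_ENDPOINT):
    """Build a URL that asks for an exact-phrase match of the given chunk.

    Wrapping the words in double quotes is the usual way to request a
    verbatim search; the parameter name "q" is an assumption.
    """
    quoted = '"{}"'.format(chunk)
    return endpoint + "?" + urlencode({"q": quoted})


print(phrase_query_url("name is Mark"))
# https://search.example.com/api?q=%22name+is+Mark%22
```

The sketch only builds the URL and deliberately does not send it. Fetching the result and counting the hits depends entirely on the API you use.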
If my plagiarism detector finds many consecutive hits, it’s a clear sign that the student has either quoted that passage, or has been engaging in some questionable copy-paste writing. I usually run the script at the same time I am “manually” reading and grading the paper, to see if any stolen goods turn up.
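Spotting those runs of hits can be automated too. The sketch below assumes you have already collected one True/False value per chunk, in reading order, saying whether the search engine found that chunk. The function name `hit_runs` and the threshold of three hits in a row are my own choices for illustration, not something taken from the original script.

```python
def hit_runs(hits, min_run=3):
    """Return (first_index, last_index) pairs for runs of consecutive hits.

    hits    -- list of booleans, one per chunk, in the order the chunks
               appear in the paper
    min_run -- shortest run worth reporting (an arbitrary assumption)
    """
    runs = []
    start = None  # index where the current run began, if we are inside one
    for i, hit in enumerate(hits):
        if hit and start is None:
            start = i
        elif not hit and start is not None:
            if i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    # A run may extend to the very last chunk.
    if start is not None and len(hits) - start >= min_run:
        runs.append((start, len(hits) - 1))
    return runs


hits = [False, True, True, True, False, True, False, True, True, True, True]
print(hit_runs(hits))  # [(1, 3), (7, 10)]
```

The isolated hit at position 5 is ignored, which is the point: a single matching chunk may be a common turn of phrase, while several in a row suggest a copied passage that deserves a closer manual look.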
Some observations on my recent grading experiences.
- A window of eight or nine consecutive words seems to work well for detecting plagiarism. It means you won’t be scoring too many false positives, i.e., hits for sequences of words which are just frequent turns of phrase instead of illegitimate copy/paste.
- Many students ‘plagiarize’ to a certain degree. Even the good ones. They may not copy whole paragraphs from the net, but they do snatch an occasional phrase. Is writing slowly turning into “montage”?
- The Wikipedia criticism that search engines are less effective “if the search engine has not indexed the original source or sources” has become largely obsolete, if only because of the increasing number of print sources indexed through Google Book Search or similar mass digitization projects.
- The Wikipedia criticism that the search engine method “is less effective when the plagiarizer has mixed multiple small fragments from different sources” still very much holds true. Some students take an inspiring chunk of text and start ‘remixing’ it in a way that is very close to plagiarism, yet is more contrived than simple copy-pasting. I find it hard to judge them harshly, since some of them are obviously unaware that what they’re doing is fraudulent.
- A critical attitude toward all sources, be they written, oral, or digital, is undoubtedly one of the most important virtues an academic education should convey today.
Now, for those who want the full gory details of the script, read on …
The script is written in Python and defines a number of functions that are combined in a final function, GPlagFile(), which sends a whole file to Google for plagiarism detection. Some of the functions were snatched from the magnificent introduction to using Python in the humanities that William J. Turkel and Alan MacEachern wrote (The Programming Historian). You can find the script here.
(Please note that you will have to download one additional Python module — namely simplejson — and that you may have to replace http://www.example.com with a website of your own to identify yourself to the Google API.)