Our experiment with Mechanical Turk went better than we expected. We approved 85% of submissions automatically and ultimately approved 98.3% of them, beating our projections of 80% automatic approval and 90-95% overall.
We would (and probably will) use Mechanical Turk again for similar projects.
Background: What Is Mechanical Turk?
Computers can’t do everything. Consider the following sentences:
John saw Jane at the store. She said, “Hi.”
You can tell right away that Jane is the person speaking. A program would have to perform serious computational gymnastics to reach the same conclusion.
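To see why, here's a naive Perl heuristic (purely illustrative, not anything we actually shipped) that attributes the quotation to the first proper name it finds. It confidently picks the wrong person:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Naive heuristic (illustrative only): attribute the quotation to the
# first capitalized name in the passage.
my $text    = 'John saw Jane at the store. She said, "Hi."';
my ($guess) = $text =~ /\b([A-Z][a-z]+)\b/;   # grabs "John"
print "Naive guess: $guess\n";                # wrong -- the speaker is Jane
# Getting "Jane" right means resolving the pronoun "She", which requires
# knowing genders and tracking referents across sentences.
```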
Amazon invented Mechanical Turk for situations just like the one above: to solve problems that humans find easy but computers find hard.
We’re always looking to improve the quality of the ESV text database, especially as we prepare an OSIS version. (OSIS is an XML format for the Bible.) OSIS allows us to indicate who speaks each direct quotation.
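For example, OSIS records the speaker in a `who` attribute on the `q` element. A simplified fragment might look something like this (a sketch for illustration, not our actual output):

```xml
<!-- Simplified sketch of OSIS speaker attribution: the who attribute
     on each q element names the speaker of that quotation. -->
<verse osisID="Gen.3.13">Then the LORD God said to the woman,
  <q who="God">What is this that you have done?</q>
  The woman said, <q who="Eve">The serpent deceived me, and I ate.</q></verse>
```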
We considered Mechanical Turk as we pondered the best way to build this database. For this first foray, we decided to use it to verify what we had created ourselves: an internal database.
The ESV text contains about 7,100 quotations. Of those, 5,700 are top-level quotations (as opposed to quotes within quotes). Of those, we’ve already identified the quotes spoken by Jesus (indicated by red-letter text). Scratch another 643 quotes. So we have about 5,000 quotations to look at. No problem.
We had someone from Crossway go through each quotation using a web application we developed. This person spent about eight hours on the project over the course of a week and identified about 3,100 quotes whose speaker is named explicitly in the surrounding text.
Next we wrote a few Perl scripts (including this module) to handle the uploading. We uploaded the 3,100 quotes over several hours on Tuesday, June 13, 2006, to give our blog readers plenty of time to respond to our invitation to participate.
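The upload loop itself was simple. The sketch below captures its shape; the `create_hit()` helper is a placeholder for the real API wrapper, and the file layout and reward amount are assumptions for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the upload loop. create_hit() stands in for the real
# Mechanical Turk API call; its name and arguments are placeholders,
# not the actual module's interface.
sub create_hit {
    my %args = @_;
    print "Would create HIT: $args{title} (reward \$$args{reward})\n";
}

open my $fh, '<', 'quotes.tsv' or die "Can't open quotes.tsv: $!";
while (my $line = <$fh>) {
    chomp $line;
    # One quote per line: reference, tab, passage with the quotation
    # marked [QUOTE BEGINS HERE].
    my ($ref, $passage) = split /\t/, $line, 2;
    create_hit(
        title    => "Who is speaking in $ref?",
        question => $passage,
        reward   => '0.02',   # placeholder; roughly $75 across 3,100 HITs
    );
    sleep 5;                  # throttle: one HIT every five seconds
}
close $fh;
```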
We then wrote a program to compare the answers submitted by Mechanical Turk workers to our existing answers. About 85% of them were exact matches, so we approved those without even looking at them. We reviewed the remaining 500 or so by hand and only had to reject 54 of them for being wrong.
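Here's a sketch of what that comparison pass might look like. The file names and tab-separated layout are assumptions, not our actual data format:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compare worker answers against our internal database: exact matches
# are approved automatically, everything else is queued for hand review.
my %ours;
open my $db, '<', 'internal_answers.tsv' or die "Can't open: $!";
while (<$db>) {
    chomp;
    my ($quote_id, $speaker) = split /\t/;
    $ours{$quote_id} = $speaker;
}
close $db;

my ($approved, $flagged) = (0, 0);
open my $subs, '<', 'turk_answers.tsv' or die "Can't open: $!";
while (<$subs>) {
    chomp;
    my ($quote_id, $answer) = split /\t/;
    if (defined $ours{$quote_id} && $answer eq $ours{$quote_id}) {
        $approved++;    # exact match: approve without review
    } else {
        $flagged++;     # mismatch: look at it by hand
    }
}
close $subs;
print "Auto-approved $approved; flagged $flagged for hand review\n";
```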
What Went Well

- Inexpensive. We got a database for about $75 that, as far as we can tell, no one has created before for the Bible.
- Fast. We uploaded one HIT every five seconds over six hours. Workers performed these HITs almost as fast as they were uploaded. Seventy-eight workers participated.
- High quality. We only rejected 1.7% of submissions, an excellent figure by any standard.
What Could Have Been Better

- No developer sandbox. We had to upload funds and grab a HIT ourselves to make sure everything worked OK. We would have liked a place to test our programs without having to expose them to the world (and without having to pay).
- Funds have to come from a bank account. We had to get special authorization to withdraw funds, and it took a week after initiating the transfer for the funds to show up in our Mechanical Turk account. We would’ve preferred to pay with a credit card, even if that meant buying $20 blocks of Mechanical Turk credits at a time.
- Limited formatting options. We would’ve liked to be able to put the quotation in bold; instead, we had to indicate it with brackets: [QUOTE BEGINS HERE]. We would prefer to use XHTML, as the limited formatting restricts the type of application we can develop with Mechanical Turk. Some enterprising individuals have worked around this limitation by asking people to visit a different website, answer the questions there, and enter a code from the other website. It works, but it’s not ideal.
We would use Mechanical Turk again, especially if Amazon relaxes the formatting restrictions to allow more interactive applications.
Creating metadata for the Bible is often detail-oriented and labor-intensive. Mechanical Turk presents a new and helpful way to spread the work inexpensively among many people (“crowdsourcing”). We estimate that Mechanical Turk cut our costs by about 60% for a comparable-quality result.
Update: The Amazon Web Services Blog has written about our experiment. We should add that we did find six instances in which our database was wrong and the Turkers were right.