Program Analyzes Text of Shakespeare's Plays

The works of William Shakespeare have entertained students and captivated scholars for centuries, but professors Michael Witmore of the English department and Jonathan Hope of Strathclyde University in the United Kingdom have taken the study of the Bard to a new level. Using a text analysis computer program called Docuscope, developed at CMU, they have made a preliminary analysis of the works of Shakespeare to see if the program can distinguish between the comedies, tragedies, and histories based purely upon a statistical study of certain words. They found that, indeed, the program could distinguish between the comedies and histories, while tragedies fell in between. They will present their discovery tomorrow at noon in a lecture in Scaife Hall.
The basis of this work is that when humans read text, they look for the most important and salient ideas and ignore the rest as "background noise."
"What's in the background may not be noise at all, but the 'soundtrack' that guides you (perhaps unconsciously) through your experience of a text and helps it make sense to you as a certain kind of dramatic experience," said Witmore. Computers, by contrast, "read" text in the same linear fashion but treat every word as equally important. A statistical analysis of the words in a novel or play can therefore lead to a greater understanding of the author.
"The point is that a computer can count anything, and will if told to do so. But a good computer program counts things that we, as humans, have deemed (on the basis of our limited experiences) to be significant. Counting things with this kind of program is like being able to generalize a form of subjectivity to the point that it can be applied uniformly over a massive amount of instances. It's the invariance of Docuscope's categories that makes its results interesting, not their 'objectivity,'" said Witmore.
Docuscope users define certain rhetorical features found in the text to categorize words. They must "pre-identify" these features as being important for describing a given genre. For example, the First Person feature covers first-person and possessive pronouns. These features can be redefined to improve the analysis of a text. After Docuscope reads in a set of works, it performs a statistical analysis of the plays and ranks them by how frequently the words in each feature appear. Docuscope shows the results in a graphical interface, listing the statistics for each category in box plots and showing the distribution across all of the works considered. Docuscope will also highlight and display the actual words it counts for a chosen feature, showing what it uses in the statistical counts.
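The counting-and-ranking step described above can be illustrated with a minimal Python sketch. It is not Docuscope's implementation: the `CATEGORIES` word list, the function names, and the two sample sentences are all hypothetical stand-ins chosen for illustration, and the real program's dictionaries are far larger.

```python
import re

# Hypothetical category: a tiny stand-in word list, NOT Docuscope's actual dictionary.
CATEGORIES = {
    "FirstPerson": {"i", "me", "my", "mine", "myself", "we", "us", "our", "ours"},
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def category_rates(text, categories=CATEGORIES):
    """Return, for each category, the number of matching words per 1,000 tokens."""
    tokens = tokenize(text)
    total = len(tokens)
    rates = {}
    for name, words in categories.items():
        hits = sum(1 for token in tokens if token in words)
        rates[name] = 1000.0 * hits / total if total else 0.0
    return rates

def rank_by_category(texts, category):
    """Sort (title, rate) pairs from the highest rate to the lowest."""
    scored = [(title, category_rates(body)[category]) for title, body in texts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up sample sentences, used only to exercise the functions.
SAMPLES = {
    "sample_a": "I will tell my tale, and we shall hear it.",
    "sample_b": "The king rode to the field with his army.",
}

if __name__ == "__main__":
    for title, rate in rank_by_category(SAMPLES, "FirstPerson"):
        print(f"{title}: {rate:.1f} per 1,000 words")
```

Normalizing to a rate per 1,000 words, rather than using raw counts, is a design choice made here so that long and short texts can be compared on the same scale.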
Witmore and Hope used the current form of Docuscope, which was not designed for early modern text, to study the first published collection of Shakespeare's plays. Docuscope and a separate statistical analysis program compared the results for each play, used that information to sort the plays into groups, and indicated which features were most important in defining those groups. The programs effectively separated the comedies and histories from each other based on purely quantitative word counts.
"Genre has not been thought of as something with a statistically significant linguistic difference," The Times of London reported Hope as saying. Some anomalies did appear. For example, the plays The Comedy of Errors, A Midsummer Night's Dream, and The Tempest appear as histories, though Docuscope did place them together in a separate subsection. The Comedy of Errors, for example, was written early in Shakespeare's career, perhaps before he fully developed his writing style. Henry VIII appeared with the comedies, but historians note it for its qualitative differences from the other histories.
Witmore and Hope labeled each play as a member of a group (comedy, history, or tragedy) and then used Docuscope to see how it would distinguish between the groups based on rhetorical features such as Interacting, Notifying, and Linear Guidance. The program found statistical differences between the comedies and histories in the Interacting category, which explains why they were separated into two groups. The comedies feature more interaction between the characters, while the histories tend to feature longer speeches.
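A comparison of this kind can be sketched as averaging a category's rate within each labeled group. The example below is only an illustration under stated assumptions: the play identifiers and the numbers are placeholders rather than results from the study, and the function name `group_means` is hypothetical. The article does not say which statistical method the researchers actually used.

```python
from statistics import mean

def group_means(rates, labels):
    """Average the per-play rates within each labeled group."""
    groups = {}
    for title, rate in rates.items():
        groups.setdefault(labels[title], []).append(rate)
    return {group: mean(values) for group, values in groups.items()}

# Placeholder values for a hypothetical "Interacting" rate per 1,000 words.
# These are NOT measurements from Shakespeare's plays.
RATES = {"play_1": 12.0, "play_2": 10.0, "play_3": 5.0, "play_4": 7.0}
LABELS = {
    "play_1": "comedy",
    "play_2": "comedy",
    "play_3": "history",
    "play_4": "history",
}

if __name__ == "__main__":
    for group, average in group_means(RATES, LABELS).items():
        print(f"{group}: {average:.1f}")
```

A difference in group averages alone does not establish statistical significance; a real analysis would also need to account for the variation within each group.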
The use of Docuscope suggests that researchers can expand the definition of genres such as history and comedy from a purely qualitative one to one that is also empirical and statistical. It raises the possibility that genres such as tragedy are not as distinct from other genres as has been assumed. The program allows people to study the role of words in plays that normally go unnoticed, raising new questions about the relation between genre, cultural identities, thought, and literary styles, which Witmore and Hope plan to take up in future work.
"The Docuscope tool helps especially as an aid, not a substitute, to human readers of the texts. The application Witmore and Hope are making to Shakespeare scholarship through it are very much in that spirit.... The tool has given them some ways of 'seeing' the corpus that supplements what they know and that can advance scholarly interpretations," wrote David Kaufer, head of the CMU English department and a co-developer of Docuscope, in an e-mail.