A few months ago, Harvard’s J.B. Michel and Erez Lieberman-Aiden, along with colleagues from Google and elsewhere, made quite a splash with their cover article in Science on “Quantitative Analysis of Culture Using Millions of Digitized Books.” Along with the article, they released a new tool called the Google N-gram Viewer, which plots word occurrences from the Google Books corpus. Suddenly, humanities scholars around the world were talking about n-gram this and n-gram that. N-gram graphs were used to trace everything from the rise and fall of political philosophies to the popularity of TV and movie stars.
Before that article came out, how many people outside the fields of linguistics or natural language processing knew what an “n-gram” was? I’m guessing that linguists and computer scientists working in Natural Language Processing/Computational Linguistics (NLP/CL) must be quite amused to see what was previously a fairly technical term in their discipline suddenly become a trending topic on social networks. N-gram t-shirts anyone?
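For readers encountering the term for the first time: an n-gram is simply a contiguous sequence of n items (usually words) drawn from a text, and counting how often each one appears is the basic operation behind tools like the N-gram Viewer. A minimal sketch in Python (purely illustrative, not the Google tool's actual implementation):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A toy "corpus": in practice this would be millions of digitized books.
tokens = "the quick brown fox jumps over the lazy dog".split()

bigrams = ngrams(tokens, 2)          # n = 2: word pairs
counts = Counter(bigrams)            # frequency of each bigram
```

Plotting those counts per publication year across a large corpus is, in essence, what an n-gram graph shows.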
But I think all this talk about n-grams is a good thing. Increasingly, the terminologies and techniques of NLP/CL are infiltrating the humanities in a big way and will become an important part of humanities research. Just as applied methods from computer science have become an integral part of research in biology, physics, and practically every other science, I believe we will see applied NLP/CL become a major driver for text-based humanities research. And just as today’s scientists must have a reasonably good understanding of these computer science-driven methods used in their home discipline, tomorrow’s humanists working with large bodies of text will need a firm grasp on methods that are currently within NLP/CL’s domain.
In his essay "A Whirlwind Tour of Automated Language Processing for the Humanities and Social Sciences," part of a CLIR/NEH report that we funded in 2008, the University of Maryland’s Douglas Oard does a terrific job laying out the various “tribes” of the NLP/CL world. The tribes include experts in document recognition and retrieval, speech processing, summarization, machine translation, and information retrieval. Many of these experts combine this work with probability theory, which is necessary for getting results at very large scale. As the digital mountain of books, newspapers, and journals that humanists study continues to grow, research will become more reliant on these computational language processing techniques. As Doug notes, “…humanities scholars are going to need to learn a bit of probability theory.” (Or, as Doug recently noted to me by e-mail, at least their grad students will.)
NEH co-sponsors a major international grant competition, the Digging into Data Challenge, which is designed to investigate new research methods for the humanities and social sciences that use very large data sources. When the first round of applications came in back in 2009, it was quite eye-opening to see the importance of NLP/CL. Even though the research topics ranged across the humanities and social sciences (including history, literature, philosophy, economics, political science, etc.), a high percentage of the projects harnessed techniques from NLP/CL for their research. Collaborators from the NLP/CL community were on many of the research teams, bringing their expertise in language processing to attack problems in other domains.
I think Doug is right. Humanities scholars will need to learn a bit about probability theory – and n-grams and many other bits borrowed from the NLP/CL world. I don’t think every humanist needs to be a computational linguist – there is a role here for interdisciplinary teams (or, as Doug succinctly puts it, we need to “make friends”). There is also a role – actually, I’d argue a huge opportunity – for professional digital humanists who can help lead these interdisciplinary teams and be the “glue people” (to borrow a great term from Francine Berman) between disciplines. These digital humanities scholars will be the ones with training both in a humanities or social science discipline and in natural language processing or computational linguistics. By being a “double threat,” they will be able to work across fields and bring the right people together to tackle the right problems.
[Note: The conference is now over] J.B. Michel and Erez Lieberman-Aiden will be discussing their Google n-gram paper at the upcoming Digging into Data Challenge conference, along with the eight winning projects from our 2009 competition. If you’d like to join us, please register on our website. The conference will be June 9-10 in Washington, DC. Attendance is free.
[Note: The competition is now closed] If you are a researcher interested in participating in the 2011 round of the Digging into Data Challenge, please see our competition website. The competition is jointly sponsored by eight international funders: NEH, NSF, IMLS, SSHRC, JISC, ESRC, AHRC, and NWO. The deadline to apply is June 16th.