"Text Mining"--Digging through Digital Archives

(December 20, 2010)

Searching through a sizable fraction of civilization’s written records before digital technology was like searching a beach for buried coins before the metal detector: a formidably time-consuming and onerous burden with a dismayingly small chance of success.

In recent years, however, the growing ability to digitize even antique documents by converting the text into a form that is machine-readable—thereby generating a database that can be sifted by computer for specific terms or themes—has made it possible to discover patterns and trends that were obscured by the sheer volume of text.

In one recent, heavily publicized study, published on December 16, 2010 in the online journal Scienceexpress, researchers at Harvard University and colleagues conducted analyses of more than five million books in English (“roughly 4 percent of all books ever published,” according to the researchers) printed between 1800 and 2000. By tracking the frequency of word occurrence, the researchers concluded, among other things, that:

The word “men” appeared vastly more often than “women” until the 1970s. The two briefly reached parity in the 1980s and now occur at roughly the same frequency.
The date “1880” dropped to half its peak use in books by 1912. But the year “1973” was down to half its maximum by 1983, declining about three times as fast as “1880” did. This suggests that “We are forgetting our past faster with each passing year,” the researchers write.
Inventions in the period 1800–1840 took an average of more than 66 years until they began to show up fairly frequently in printed text. But those from 1880–1920 took only 27 years, indicating that “cultural adoption of technology has become more rapid,” the article says.
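
The frequency tracking behind findings like these can be illustrated with a short Python sketch. It is not the researchers' code: the tiny corpus is invented, the function names `yearly_frequency` and `first_year_at_half_peak` are hypothetical, and splitting text on whitespace is a simplifying assumption. The sketch computes a term's share of all words printed in each year, then finds the first year after the peak in which that share has fallen to half the peak or lower.

```python
from collections import Counter


def yearly_frequency(corpus, term):
    """Return {year: occurrences of term / total words printed that year}.

    corpus is an iterable of (year, text) pairs. Lowercasing and splitting
    on whitespace is a simplifying assumption, not a real tokenizer.
    """
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(term.lower())
    return {year: hits[year] / totals[year] for year in totals if totals[year]}


def first_year_at_half_peak(freq_by_year):
    """Return (peak_year, first later year with frequency <= half the peak).

    The second element is None if the frequency never falls that far.
    """
    peak_year = max(freq_by_year, key=freq_by_year.get)
    half = freq_by_year[peak_year] / 2
    for year in sorted(freq_by_year):
        if year > peak_year and freq_by_year[year] <= half:
            return peak_year, year
    return peak_year, None


# Invented toy corpus, for illustration only (not data from the study).
toy_corpus = [
    (1880, "in 1880 the fair opened and 1880 was remembered"),
    (1881, "people still spoke of 1880 often"),
    (1882, "few now spoke of 1880 in the papers or at home anymore"),
    (1883, "nothing was printed about that earlier year"),
]

freq = yearly_frequency(toy_corpus, "1880")
peak_year, half_year = first_year_at_half_peak(freq)
```

In this toy data, `half_year - peak_year` is the number of years the term takes to fall to half its peak frequency. Dividing by the total words printed each year matters because the volume of published text changes over time, so raw counts alone would be misleading.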

That study is only one of many current efforts to take advantage of the digitized riches now available. Another is an ambitious project to determine whether, as is widely supposed, people in Victorian England dramatically changed their views on science, religion and progress toward the end of the 19th century.

Through a grant awarded by NEH in January of 2008, George Mason University’s Center for History and New Media (CHNM) and its Director Daniel Cohen have been exploring ways to use what is called “text mining” to locate documents of interest in the ocean of texts online, to extract and synthesize information from those texts, and to analyze large-scale patterns across those texts.

Nadina Gardner, Director of the Division of Preservation and Access at NEH, explains that the GMU researchers “have been developing sophisticated search, extraction, and analysis tools to allow historians to use ‘micro’ methods, such as word-frequency analysis, and ‘macro’ methods, such as document analysis, that will allow researchers to be excited, instead of overwhelmed, by the prospect of access to such enormous numbers of digitized texts.”
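
The contrast between the two scales can be sketched in Python. This is only an illustration, not the GMU tools: it assumes that "document analysis" can be stood in for by comparing whole documents through their word counts, and the three short texts and the function names `word_counts` and `cosine_similarity` are invented for the example.

```python
import math
from collections import Counter


def word_counts(text):
    """'Micro' view: how often each word occurs within a single document."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """'Macro' view: compare two documents by the angle between their
    word-count vectors (1.0 = same proportions, 0.0 = no shared words)."""
    dot = sum(a[word] * b[word] for word in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented miniature documents, for illustration only.
docs = {
    "sermon": "faith and science and faith",
    "lecture": "science and progress and science",
    "ledger": "wool coal iron",
}
counts = {name: word_counts(text) for name, text in docs.items()}

# Micro: frequency of one word inside one document.
faith_in_sermon = counts["sermon"]["faith"]

# Macro: which documents resemble each other overall.
sermon_vs_lecture = cosine_similarity(counts["sermon"], counts["lecture"])
sermon_vs_ledger = cosine_similarity(counts["sermon"], counts["ledger"])
```

The same word counts serve both scales: a historian can inspect them directly for a single text, or feed them into a comparison across a collection far too large to read.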

Gardner’s division awarded a grant of $300,000 to the CHNM project, which has since received further support through a Digging into Data award of $100,000 from NEH’s Office of Digital Humanities (ODH). It is one of only 12 international projects selected by Google in 2010 for additional support.

“Another major corpus that lends itself to text mining,” says Ralph Canevali of Preservation and Access, is Chronicling America (http://chroniclingamerica.loc.gov/), a digital resource of historically significant newspapers published between 1836 and 1922 that will eventually include all states and U.S. territories. “At present, Chronicling America provides free access to more than 3 million pages of newspapers published in 22 states and the District of Columbia,” Canevali says. “This ongoing effort is sponsored jointly by the National Endowment for the Humanities and the Library of Congress as part of the National Digital Newspaper Program.”

The Scienceexpress paper “speaks directly to an area the Office of Digital Humanities has been focusing on the past few years,” says ODH Director Brett Bobley. “The question at the heart of this research is ‘scale’: that is, what can we do now that we have massive quantities of digital materials? Over the past few years, we’ve seen the objects of traditional humanities study (that is, books, newspapers, journals, images, music) digitized on an unprecedented scale. Now scholars have access to far more materials than they could ever read in a lifetime. So it raises the question of how this massive scale affects the way we do research. Might there be new computationally based tools and methods to allow the scholar to analyze these large digital collections to pinpoint trends or raise new questions that couldn’t be seen before?”

To address this question, the Office of Digital Humanities created a partnership with three other leading research agencies: the Joint Information Systems Committee (JISC) from the United Kingdom, the National Science Foundation (NSF) from the United States, and the Social Sciences and Humanities Research Council (SSHRC) from Canada. Together, the four agencies launched the Digging into Data Challenge.

“Under this innovative grant program,” Bobley says, “we challenged international teams to pursue the cutting edge of humanities and social science research that utilizes very large digital collections. During the first round of the competition, we awarded grants to eight teams, doing a wide variety of innovative work. For example, one team explored how to computationally analyze digitized music, another explored how to analyze images and maps, another explored linguistic analysis of the spoken word, and one looked at letters written by key figures during the Enlightenment.”

In addition to the Digging into Data Challenge, Bobley says, ODH “also has a grant program called Digital Humanities Start-Up Grants. This program makes small seed grants to fund innovative research that takes advantage of technology. Under this program, to cite a few examples, we have funded a project exploring intellectual history by computationally analyzing the text of many editions of the Encyclopedia Britannica, a project to analyze Old English literature, and a project to analyze digitized newspapers.”

The New York Times
http://www.nytimes.com/2010/12/17/books/17words.html

The Wall Street Journal
http://online.wsj.com/article/SB1000142405274870407380457602374184992200...

The Boston Globe
http://www.boston.com/news/science/articles/2010/12/17/harvard_google_jo...

The Chronicle of Higher Education
http://chronicle.com/article/Scholars-Elicit-a-Cultural/125731/

Media Contacts:

Curt Suplee: