Feature

Reading in the Age of Google

By Gregory Crane | HUMANITIES, September/October 2005 | Volume 26, Number 5

Modern information technology began when humans started storing language in physical media--scrolls, stone tablets, steles--that existed separately from their brains. By Plato's time in the fourth century BCE, writing was already thousands of years old and subject to some philosophical criticism.

Tellingly, in the Phaedrus, Plato has Socrates comment that written words are like statues that may imitate life but have no life of their own. He stresses the inert quality of written language: "Writing says one single thing--whatever it may be--the very same thing forever."

We may ask the written word what it means, but it cannot answer or even hear our questions. For at least four thousand years we have been writing ever more numerous books and assembling these in ever more sophisticated libraries. But in the end, like Akkadian-speaking schoolchildren struggling to learn Sumerian, we could only turn to other humans and the reading aids they created, if we could not understand the written words before us.

Twenty years ago, Marvin Minsky, a proponent of artificial intelligence, responded to this ancient challenge by imagining a future in which people would find it hard to conceive of a library where the books did not talk to each other.

The grand vision of artificial intelligence remains elusive, but simpler approaches have already allowed books to converse with one another and to adapt themselves to the needs of individual human readers. Millions of people have experienced early versions of this idea through e-commerce.

When I query Google for "Plato Phaedrus," for instance, I receive not only links to a number of online translations of the dialogue but also two sponsored links, one to a used-book service and another to a company that sells term papers. A slight variation on this query ("Plato's Phaedrus") elicits a link to Google Print and a digitized copy of a translation published by Purdue University Press. As more works are digitized and become available online, the ability of these documents to interact and learn from each other will grow as well.

Amazon.com has already digitized one hundred thousand books, and the Google Print Library Project has set out to digitize the contents of five major libraries, including Harvard's, which contains more than ten million books in five hundred languages.

Google, Amazon, and other companies mine data, analyzing our queries and making inferences about our goals, using as much information as they have to help us spend money.

Many of these same techniques can, however, help us learn. In the Perseus Digital Library, we already have the beginnings of new reading environments that help us understand complex documents in a variety of languages.

For people studying Plato, for example, the Perseus Digital Library can assemble a range of materials relevant to the Phaedrus, including a Greek text, English translation, and a list of documents that comment on the opening section of the dialogue. The reader can customize the display by explicitly asking for original source text in Greek, choosing a translation font, and making other decisions about what should be displayed. The reader can ask a question about a particular Greek word, and the system can personalize its response: it recognizes that the reader is looking at a dialogue of Plato and highlights all citations to Plato in the online lexicon entry.

These electronic actions are simple in nature but profound in their implications. Many different books are, in effect, having a conversation among themselves and deciding how best to serve the human reader. The library asks its collection what it knows about a particular work, and then assembles what is, in fact, a virtual book, combining materials derived from multiple print sources into a single integrated display. The Greek edition of the Phaedrus recognizes not only that it is being read but also what word in what section the reader is asking about. It can also tell the lexicon that it is being read. The lexicon then adapts itself to the reader, shifting its appearance to suit the needs of individual readers better.

Customization supports changes that we ask for specifically; personalization supports adaptations that a system makes automatically. Customization can build on extensive information, as we have found with language students. For example, a student studying a particular Roman text can tell the Perseus Digital Library which chapters in which edition of Wheelock's Latin she has studied. The text queries the textbook as to what vocabulary it includes in those chapters and then reports to the reader which words in a new passage of text she has and has not encountered before. Over time, a learning profile evolves, allowing a person to track what has been learned and to develop better ways to work with new documents.
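The textbook lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Perseus Digital Library's actual code: the chapter word lists, the name `CHAPTER_VOCAB`, and the function `split_by_familiarity` are all hypothetical, and the passage is assumed to be already reduced to dictionary forms.

```python
# Hypothetical data: the vocabulary each textbook chapter introduces.
# These word lists are illustrative only, not taken from any real textbook.
CHAPTER_VOCAB = {
    1: {"amo", "puella", "et"},
    2: {"video", "magnus", "sed"},
    3: {"rex", "bellum", "in"},
}


def split_by_familiarity(passage_lemmas, chapters_studied, chapter_vocab=CHAPTER_VOCAB):
    """Split a passage's words into those the student has and has not met."""
    # Gather every word introduced in the chapters the student has studied.
    known_vocab = set()
    for chapter in chapters_studied:
        known_vocab |= chapter_vocab.get(chapter, set())

    seen, unseen = [], []
    for lemma in passage_lemmas:
        target = seen if lemma in known_vocab else unseen
        if lemma not in target:  # report each word once, in reading order
            target.append(lemma)
    return seen, unseen


# A student who has studied chapters 1 and 2 opens a new passage.
seen, unseen = split_by_familiarity(["rex", "puella", "et", "bellum", "amo"], [1, 2])
print("already studied:", seen)
print("new to the reader:", unseen)
```

A real system would also need to map each inflected form in the passage to its dictionary entry before the comparison; that step is assumed away here.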

Personalization begins with systems trying to anticipate the readers' needs. At the Perseus Digital Library, machine learning and data mining also have been put to use to discover what questions people have previously asked when they encounter a text. By studying the queries people have made regarding a widely read passage by Ovid, we discovered that readers generally followed a small number of patterns. Once a reader asked about three or four words in a passage, we could predict most of the words about which they would ask next.
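One very simple way to make such a prediction is to compare the current reader's questions with those of earlier readers and to suggest the words that similar readers went on to ask about. The sketch below is a stand-in for illustration, not the method actually used at Perseus; the session data and the function `predict_next_words` are hypothetical.

```python
from collections import Counter


def predict_next_words(past_sessions, asked_so_far, top_n=3):
    """Rank words the current reader has not yet asked about.

    Each earlier session votes for its remaining words, weighted by how
    many words it shares with the current reader's questions.
    """
    asked = set(asked_so_far)
    scores = Counter()
    for session in past_sessions:
        session_words = set(session)
        overlap = len(asked & session_words)
        if overlap == 0:
            continue  # this earlier reader shows no similarity
        for word in session_words - asked:
            scores[word] += overlap
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


# Hypothetical logs: the words each earlier reader looked up in one passage.
past_sessions = [
    ["fert", "animus", "mutatas", "formas"],
    ["fert", "animus", "dicere"],
    ["corpora", "coeptis"],
    ["animus", "mutatas", "formas"],
]

suggestions = predict_next_words(past_sessions, ["fert", "animus"])
print(suggestions)
```

Weighting by overlap means that readers who asked about the same words count for more, which loosely mirrors the observation that a few shared questions are enough to place a reader in a pattern.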

Such personalization points to a world where readers using the same materials can identify the parts that are most interesting or most challenging to them. The historian may find pointers to discussions of treaties and international law, while a linguist may get information about the meaning of the subjunctive in a particular passage.

The implications extend beyond the reading of academic texts. As I wrote this, my eighteen-year-old son was familiarizing himself with a new cell phone that includes a color screen and keyboard and provides Web access, while my fourteen-year-old son was studying three-dimensional graphics.

We are educating a generation that assumes it can ask questions about almost any topic and examine evidence in almost any medium. This generation also has high expectations regarding the availability of materials online and easy means of discovering their content. In response, our libraries are shedding their physical constraints in an effort to meet the expanding needs of a larger and more technologically savvy audience.

We are currently in the incunabulum stage of a long-term shift to electronic publication. Just as early books looked like manuscripts, electronic publications imitate many aspects of printed works. Electronic publications have a few obvious extensions such as linked citations and additional images that would not fit in the print version. But the electronic pages still feel familiar just as the U.S. Navy's first steamship did after it was retrofitted with a complete set of masts and sails.

Habits and preexisting structures constrain the direction of new technology. Even so, there are elements that are already beginning to drive humanities publishing beyond the familiar framework. Larger and broader audiences, documents that learn, the reuse of material for multiple purposes, and decentralized production will all play some role in the making of dictionaries, encyclopedias, translations, commentaries, and other tools that structure information in the humanities.

A networked world challenges humanists to reimagine their potential audience. No longer trapped in a tiny network of exclusive academic libraries, ideas can circulate more quickly across the globe.

Broadening the audience through electronic publishing raises many questions. Which institutions should sustain this information? Are the humanities well served with a subscription model, where academic work is mainly available in scholarly journals? Might this model exclude those without access to a university library? If the humanities community chooses to pursue open publication and engage with a broader subset of our fellow citizens, will scholars express themselves in the same way? To what extent will scholars' choices of topics evolve in response to different audiences?

In a print world, publications drift out of date the moment that their authors finish writing. In an electronic world, however, documents can perform more intelligent functions. Rather than creating static content, authors can now prepare data that will drive automated systems, which can then in turn scan much larger bodies of material in order to update information.

The production of traditional reference works may be better suited to such an automated electronic model. A lexicographer, for example, may collect hundreds of instances of a word's use and the different meanings associated with it, but will only have space for a few examples in the print lexicon. This larger set of examples could instead be used with a word-sense disambiguation system.

The word-sense disambiguation system can not only support the lexicographer at work but can be applied to much larger projects. The dictionary entry becomes part of a system to which subsequent editors can add and that can continue organizing data long after the initial project is complete. The same techniques can be used to generate biographies, event descriptions, and other summaries.
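To make the idea concrete, here is a minimal sketch of word-sense disambiguation driven by a lexicographer's collected examples. It assumes a crude word-overlap score, which is a deliberate simplification rather than the technique any particular project uses; the example sentences and the names `tokenize`, `disambiguate`, and `SENSE_EXAMPLES` are hypothetical.

```python
def tokenize(text):
    """Lowercase a sentence and return its set of words without punctuation."""
    return {w.strip(".,;:!?\"'").lower() for w in text.split()} - {""}


def disambiguate(context, sense_examples):
    """Pick the sense whose example sentences share the most words with the context.

    Returns None when no example shares any word with the context.
    """
    context_words = tokenize(context)
    best_sense, best_score = None, 0
    for sense, examples in sense_examples.items():
        score = sum(len(context_words & tokenize(ex)) for ex in examples)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Hypothetical example sentences a lexicographer might have collected for "bank".
SENSE_EXAMPLES = {
    "financial institution": [
        "she deposited money at the bank",
        "the bank approved the loan",
    ],
    "river edge": [
        "they fished from the river bank",
        "the boat drifted toward the muddy bank",
    ],
}

print(disambiguate("He opened an account and deposited his money", SENSE_EXAMPLES))
print(disambiguate("The boat reached the river", SENSE_EXAMPLES))
```

The more examples the lexicographer supplies for each sense, the more contexts such a system can classify, which is why the examples that never fit in the print lexicon still have value.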

The making of all these works will partly depend on granularity: the extent to which reference works can be broken down into smaller chunks and put together for purposes other than the one originally intended.

Humanists tend to write articles and monographs, linear publications of roughly fifteen to thirty pages or one hundred and fifty to three hundred pages, respectively. Digital environments support and may strongly favor publications that are much larger or much smaller than these. Emerging technologies for summarization, question answering, information retrieval, text mining, and visualization now can provide more immediate support for researchers pondering nineteenth-century reviews of Melville, the general reader of Moby-Dick imagining Melville's Pequod, and the visitor to New Bedford visualizing Ishmael's town. But such services work better when they are able to recombine smaller units of information from a variety of sources. While the article and monograph forms are best suited to some scholarly arguments, successful electronic publications require a shift to more small-scale contributions that can be recombined in novel ways. Ideally, articles and monographs should be structured to make them easier to analyze automatically.

Historically, large research projects have depended upon contributions from a wide range of collaborators, often from non-traditional backgrounds.

Among the more famous examples was the creation of the Oxford English Dictionary, which took seventy years to complete. Thousands of people helped Professor James Murray, including William Chester Minor, an inmate at an asylum for the criminally insane.

The electronic world lends itself to decentralized production with many collaborators. The most important development for the humanities this century may prove to be Wikipedia and the rise of community-driven projects. Wikipedia is an extreme case whose success has so far shocked skeptical scholars. Anyone can contribute to Wikipedia, and classic Wikipedia articles do not have single authors. Rather, Wikipedia seems to demonstrate that if a general consensus exists about what a document should look like, and the consenting community can contribute, documents will develop into useful tools of research.

The Wikipedia articles where open production has not worked have generally been on controversial topics such as Israel or the biographies of Kerry and Bush in the 2004 election. While this model of production clearly abandons the distinctive single author's voice, it may better capture the goals of reference works than more professional, centralized efforts.

Wikipedia is immensely interesting for another reason: the relationships between the pieces of data. The 533,000 articles available in May 2005 contained twenty million links to other Wikipedia articles. Of these, fifteen million were disambiguating links, meaning they connected ambiguous terms such as "Springfield" to their referents "Springfield, MA" or "Springfield, IL." A survey of two hundred randomly selected links revealed only two inaccuracies.

These links demonstrate that communities are willing and able to produce immense amounts of very precise data. Such user input can improve the results of automated processes and can serve as a foundation for authors creating their own reference articles.

As staggering as some changes have been over the past twenty-five years, it is difficult to predict what we will be reading in fifteen, ten, or even five years' time. Subsequent developments may be even more dramatic as old ways of doing things dissolve and a new generation, immersed in electronic information from childhood, takes its place.

The goals we pursue--the hunger for ideas, the desire to understand more, the delight in reasoned, evidence-based debate--will continue to find new modes of expression. Reading has been in flux since writing began to emerge four thousand years ago. The increasing mechanization of print facilitated a shift from intensive reading, where readers repeatedly studied a few texts such as the Bible, Vergil's Aeneid, or Shakespeare's plays, to extensive reading, where readers moved through one novel after another. This shift had many effects, not least of which was laying the foundation for modern democratic society. The restless, question-driven, active reading in the age of Google may lead to a shift that is just as dramatic.

About the Author

Gregory Crane is editor in chief of the Perseus Digital Library (www.perseus.tufts.edu) at Tufts University. NEH has provided $860,000 in funding to the project. The project received additional funding from the Digital Library Initiative.