An independent council that studies critical issues for libraries and archives has concluded that the advanced computational techniques that are necessary to extract useful data from huge collections of digital music, newspapers, books or other cultural heritage materials will require humanities scholars to fundamentally alter their definition of research.
A Council on Library and Information Resources (CLIR) study of projects funded under the “Digging Into Data Challenge” found that the issues and opportunities presented by “big data” in the humanities and social sciences require basic changes in academic methods and scholarly practices. The “Digging” challenge is an international grant competition sponsored by the National Endowment for the Humanities (NEH) and seven other research agencies from the United States, Canada, the Netherlands and the United Kingdom.
Based on two years of interviews, site visits and focus groups, the CLIR report, One Culture: Computationally Intensive Research in the Humanities and Social Sciences, describes a field facing simultaneous challenges. Some are logistical: how does a team of ten or more researchers from two or more countries stay in touch, on schedule, and on budget? Others are academic: how does a large interdisciplinary team get scholarly credit, and how does such a team publish its work?
The Digging into Data Challenge was launched in 2009 in part to answer a question posed by Tufts University Professor Gregory Crane: What Do You Do with a Million Books? International teams were awarded grants to work with vast data collections of materials ranging from images of American quilts to fifteenth-century manuscripts, seventeenth-century maps, conversations recorded in kitchens, news broadcasts, court transcripts, digitized music, and texts in many languages.
The study produced recommendations that the authors described as “urgent, pointed, and even disruptive.” The recommendations are available at http://www.clir.org/pubs/reports/pub151
The report was co-authored by Charles Henry, President of CLIR, and Christa Williford, CLIR program officer, who agreed to answer questions from NEH’s Brett Bobley, Director of the Office of Digital Humanities.
BOBLEY: There is an interesting tension between traditional research focused on a thesis (e.g. “How was Lincoln influenced by his Southern upbringing?”) vs. research whose goal is to develop new “methods” (e.g. “How can I develop a new text-mining algorithm to examine Lincoln’s writings to look for Southern influences?”). You examined a lot of projects that explored both of these types of research. How do you see “research” being redefined? What are some examples from your report?
CLIR: This tension you mention was definitely an undercurrent in a lot of our conversations with investigators, and it is true that on the surface a lot of the projects seemed to emphasize the methodological questions about how you access, analyze, and interpret the data rather than the subject-based questions related to what the data represent or suggest. The two linguistics projects, for example, Mining a Year of Speech and Harvesting Speech Datasets from the Web, focused largely on how to use a computer to search for and retrieve snippets of speech within large digital archives of recorded speech. The researchers recognized that if they could achieve this reliably, the method would be a completely new way of testing linguistic theories, not to mention opening up audio data for investigation across the humanities and social sciences in an unprecedented fashion. In these cases, the benefits to the “traditional” scholar seemed clear, but the technical challenges are still so tricky—demanding advanced expertise in computer science in areas such as machine learning—that developing the method has to take precedence at this point. The researchers working on Railroads and the Making of Modern America, by way of contrast, needed to get much more specific about what “traditional” questions they wanted to ask. For example: How many people did railroad industries employ in the nineteenth century, and where did these people live? In order to approach this question, the team had to select the most appropriate kinds of data to consider (such as digitized newspapers, census records, and railroad company records), to bring it together in a reliable, useful way so that factors such as geographic locations and occupations could be matched across the different data types, then to look at it all to decide what it really means. In this case, the technical approach is very much driven by and customized to a specific “traditional” question.
In the end, we concluded that the “traditional” versus the “technical” dichotomy in e-research is a false one. We make the distinction now because the two demand very different kinds of training and expertise. But classifying the Digging into Data projects as one or the other just didn’t work. It may be inconvenient and confusing, but both kinds of questions are critical, and they are interdependent in fundamental ways that demand that the craft of research in the digital age becomes a collaborative enterprise.
BOBLEY: I’d like to turn to another topic that you address in the report: collaboration across disciplines. In the sciences, it is common to publish a paper with many authors, often from different disciplines. But this hasn’t been the norm in the humanities, where the archetypal researcher is the solo scholar. What did you see in these eight Digging into Data projects you studied in terms of team size and composition? And what are some of the implications for scholarly publishing (and, ultimately, promotion and tenure)?
CLIR: The smallest and simplest team we looked at had three participants from similar backgrounds while the largest had well over 50 people lending varying amounts of very different kinds of expertise. Undergraduate and graduate students made important contributions, as did both junior and senior scholars. These larger teams face a big problem with tracking what these individual contributions are and providing proper citation and acknowledgement in project outcomes. One of the project teams, Digging into Image Data to Answer Authorship Related Questions, addressed this by establishing a memorandum of understanding that gave explicit directions about citation and joint authorship, as well as clarifying things like intellectual property rights and how hardware, software, and data would be shared among participants. For larger multidisciplinary initiatives, particularly ones that cross legal jurisdictions, these kinds of agreements will become more and more necessary in the future. For example, in the Digging into Image Data as well as several of the other Digging projects, the teams had to reconcile differing intellectual property rules in the U.S. and U.K.
As for implications for scholarly publishing, obviously jointly authored papers will become more the norm, but even for single authors publishing about their e-research it will become increasingly important to acknowledge the creators of the data and tools they use. There is a lot that publishers can do to promote citation of data and tools that will enable the pioneer data managers and tool developers to be properly credited for their work’s “downstream” impacts. Promotion and tenure committees will need to go beyond merely counting articles, papers, monographs, and citations, and also consider the contributions of scholars who devote their time to producing digital data and metadata, to developing tools and methods, or to coordinating complex, multidisciplinary initiatives involving students and colleagues from around the world.
BOBLEY: When I first proposed the Digging into Data Challenge, I suggested that the basic objects that humanists typically study – books, music, newspapers, art – were precisely the things we’ve been madly digitizing for the past twenty years. To borrow from historian Roy Rosenzweig, I suggested we were entering an “age of abundance” – one in which the availability of massive amounts of information will be both a boon and a major challenge for scholars. Do you see the research practices used by the Digging into Data teams foreshadowing how a lot of humanities and social science research will be conducted in the future? Or is “computationally intensive” research more of a niche?
CLIR: The “age of abundance” is definitely already here, as far as digital information goes. As you note, this abundance can be both a blessing and a curse; we are only now beginning to understand the potential and the challenges it poses. Our expectations have shifted so rapidly. We in the academy are no longer awestruck about the wealth and diversity of our online digital cultural heritage—a wealth beyond our wildest dreams only a generation ago; we are more often frustrated by the even greater wealth and diversity of digital information that is inaccessible, imperfectly represented, or poorly documented. It is vitally important that we remain just as conscious of the limitations of digital data in the production of knowledge as we are of its benefits. Dealing with the limitations of “big data” will require an investment of time, attention, and talent that is proportionately great. For these reasons, if computationally intensive research remains a niche, we will be depriving our descendants of the rich intellectual heritage we could leave them if we deal with our digital data in a rigorous way. The research practices used by the Digging into Data teams are still very much in development. There are a lot of unanswered questions about the reliability of data, the integration of different kinds of data, and the accuracy of investigative tools that deserve the attention of as many of the smartest people as we can possibly engage.
BOBLEY: Are universities, scholarly societies, faculty, and libraries ready for this next step?
CLIR: It is hard to say. The problems are so big; “readiness” may be an unrealistic expectation. However, there are definite signs of a growing consciousness of the challenges. It has been gratifying to see the sentiments expressed in our report echoed in other venues and publications—to give two examples, the Board on Research Data and Information is doing some great work these days, and the forthcoming issue of New Media and Society, based around the theme of Scholarly Communication, has a lot to say about the human factors affecting our experience of scholarship today that is pertinent; it’s also great to see data-intensive research and collaborative approaches to promoting it become popular topics at conferences across professional and scholarly disciplines. While the conversations about these complex, interrelated issues are still in their early stages, at least they are now under way.
BOBLEY: I wanted to get your take on the role of libraries and archives in computationally intensive research. Libraries have, of course, always played a critical role in the research process. But I think it is fair to say that in the past, their role was often to connect the researcher to individual items in their collections – that is, help them find books or journals or maps or other single items relevant to their work. But many of the Digging projects are exploring content “at scale” rather than on an item level: searching thousands or millions of records rather than a handful. How does this change the role of the library and the research librarian? How does this change how the next generation of librarians is trained?
CLIR: Libraries and librarians are still going to have the role of connecting people to relevant content—whether this content is broken into discrete items or is part of a web of data. But when scholars are doing research “at scale,” there is a lot more that goes into making judgments about relevance. You’re not just interested in whether your chosen resource pertains to a particular subject; you need to know how it’s structured and organized, you need to know what kinds of computer algorithms will help you pull out the information you want, and you’ll want to have realistic expectations about how reliable those results will be. The knowledge needed to make these decisions can be highly specialized. The general training that librarians are typically given is still pertinent—information-seeking behavior, for example, or theories about the organization of information that are fundamental to library cataloging and metadata. But to be effective in the research context, the librarian’s expertise is going to have to be a lot more specific, and the librarian more directly involved with the researcher at the outset of a project, helping them put together a set of questions, data, and tools that will move a discipline forward, rather than waiting until a scholar shows up at the library with a problem. In addition, when it comes to data and research tools, libraries are going to have to accept and share responsibility for managing them, while also dealing with the wealth of non-digital resources that researchers would like to digitize and analyze at scale. These are big responsibilities and require people to make smart decisions about ways to use scarce resources.
You can imagine that all of this has big implications for library training programs. First, this training will need a lot more flexibility built in to allow students to develop specialties in dealing with particular types of data or analytical tools. Second, it will need to prepare those students not only to work collaboratively (which has long been the case in libraries), but also to work collaboratively with others from very different disciplinary backgrounds. Last, but certainly not least, it will need to develop students who are highly social, curious, and engaged learners who are capable of performing well in long-term research partnerships, who are able to anticipate problems and devise creative solutions, and who are effective teachers and team leaders. This may seem a high bar, but we’ll need to reach it. What is more, our graduate programs in other disciplines need to incorporate much of this as well.
This report, issued on June 12, 2012, is available at http://www.clir.org/pubs/reports/pub151