NEH Grantees Win Best Paper Award at JCDL

June 24, 2010

Congratulations to David Bamman, Alison Babeu, and Greg Crane of Tufts University's Perseus Digital Library, who just won "Best Paper" at the Joint Conference on Digital Libraries (JCDL), held this year at the University of Queensland in Brisbane, Australia. JCDL is sponsored by both ACM and IEEE.

Their paper, "Transferring Structural Markup Across Translations Using Multilingual Alignment and Projection," results from research funded by two NEH grants: "Large-Scale Learning and the Automatic Analysis of Historical Texts" (NEH/DOE Humanities High Performance Computing, HH-5001-09) and "The Dynamic Lexicon: Cyberinfrastructure and the Automatic Analysis of Historical Languages" (Preservation and Access Research & Development, PR-50013-08). According to the team, the paper illustrates a technique that is central to their Digging into Data Challenge project, "Towards Dynamic Variorum Editions" (HJ-50013-10).

The full paper is available (open access) on the Perseus Digital Library website:

Transferring Structural Markup Across Translations Using Multilingual Alignment and Projection

David Bamman, Alison Babeu and Gregory Crane

Abstract. We present here a method for automatically projecting structural information across translations, including canonical citation structure (such as chapters and sections), speaker information, quotations, markup for people and places, and any other element in TEI-compliant XML that delimits spans of text that are linguistically symmetrical in two languages. We evaluate this technique on two datasets, one containing perfectly transcribed texts and one containing errorful OCR, and achieve an accuracy rate of 88.2% projecting 13,023 XML tags from source documents to their transcribed translations, with an 83.6% accuracy rate when projecting to texts containing uncorrected OCR. This approach has the potential to allow a highly granular multilingual digital library to be bootstrapped by applying the knowledge contained in a small, heavily curated collection to a much larger but unstructured one.
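
For readers curious about what "projecting" markup through an alignment can look like, here is a minimal Python sketch. It is not the authors' implementation. It assumes a word alignment is already available as pairs of (source token index, target token index), and the function name `project_spans`, the span format, and the sample sentence are all hypothetical choices made for illustration. Each tagged span in the source is mapped to the smallest target span that covers every token aligned to it.

```python
from typing import Dict, List, Tuple

# (tag name, first token index, last token index), both indices inclusive
Span = Tuple[str, int, int]


def project_spans(spans: List[Span], alignment: List[Tuple[int, int]]) -> List[Span]:
    """Project tagged source spans onto target tokens via a word alignment.

    Illustrative assumption: `alignment` holds (source_index, target_index)
    pairs produced by some external word aligner. Spans whose tokens have
    no aligned target token are dropped rather than guessed.
    """
    # Index the alignment by source token for quick lookup.
    src_to_tgt: Dict[int, List[int]] = {}
    for src, tgt in alignment:
        src_to_tgt.setdefault(src, []).append(tgt)

    projected: List[Span] = []
    for tag, start, end in spans:
        # Collect every target token aligned to any token inside the span.
        targets = [t for s in range(start, end + 1) for t in src_to_tgt.get(s, [])]
        if targets:
            # Use the smallest contiguous target span covering those tokens.
            projected.append((tag, min(targets), max(targets)))
    return projected


# Hypothetical example: a Latin sentence and an English translation.
source = ["Gallia", "est", "omnis", "divisa", "in", "partes", "tres"]
target = ["All", "Gaul", "is", "divided", "into", "three", "parts"]
alignment = [(0, 1), (1, 2), (2, 0), (3, 3), (4, 4), (5, 6), (6, 5)]

# A place-name tag on "Gallia" and a segment tag on "in partes tres".
spans = [("placeName", 0, 0), ("seg", 4, 6)]

for tag, start, end in project_spans(spans, alignment):
    print(tag, target[start:end + 1])
```

Taking the minimum and maximum aligned positions keeps each projected element contiguous, which XML requires, at the cost of occasionally including unaligned words that fall inside the span.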