Race and Ethnicity Keyword Thesaurus for Chronicling America: A New Tool on EDSITEment

September 21, 2022

Samantha Gilmore, PhD student in the English department at the University of Nebraska-Lincoln, completed this project for her NEH Pathways Internship in the Division of Preservation and Access during the summer of 2022.

We are pleased to introduce a new media resource on NEH’s EDSITEment platform, Race and Ethnicity Keyword Thesaurus, to aid and encourage researchers of all levels searching the almost 20 million pages of historical newspaper content in Chronicling America. We have created this resource because the language we use to talk about race and ethnicity evolves over time. To access the histories of these words and the people they described in the eighteenth, nineteenth, and twentieth centuries, a researcher must use the language of the editors, publishers, and writers of these times.

Chinese-American in his home in Flatbush, New York, New York
Photo caption

"Chinese-American in his home in Flatbush, New York, New York," LC-DIG-fsa-8d21942 (digital file from original neg.) LC-USW3-007282-E (b&w film nitrate neg.)

Library of Congress Prints and Photographs Division Washington, D.C. 20540 USA

The National Digital Newspaper Program (NDNP) has expanded significantly in recent years, and this new thesaurus assists in the accessing of Chronicling America’s vast holdings. NDNP began in 2012 to allow for the selection of titles in a small number of Western European languages; beginning in 2016, the Library of Congress expanded the language capabilities of Chronicling America to allow submission in more than 500 languages from across the globe. Currently, Chronicling America includes 22 languages other than English. In 2016, NEH and the Library of Congress also extended the date range of NDNP to include newspapers in the public domain published between 1690 and 1963. By expanding the temporal, geographical, and linguistic reach of Chronicling America as well as by including more newspapers from underrepresented communities, Chronicling America may encompass a fuller range of subject matters that are critically important to U.S. history. This, of course, includes more newspapers addressing the diverse racial and ethnic makeup of our nation.

Newspapers use the language of their time, and their pages often contain biased, offensive, and outdated words and images that are now considered to be harmful. To prepare users for the potentially harmful content they will encounter when consulting the thesaurus we have included a “Harmful Language Warning” pop-up that appears on pages containing offensive or derogatory content.

This pop-up reminds users to take a physical and mental pause when reading these words to consider the ways this language has been used to oppress communities of people. The terms themselves detail the complicated and painful relationship that the country has had with understanding race and ethnicity, and they remind us that language holds power in the ways it can support or dismantle systems of oppression. Searching for these keywords helps reveal the prejudice found within the newspapers in Chronicling America and also traces the growth and evolution of our conversations about race throughout the history of our country.

Launching from a main search guide, seven umbrella groups are listed, and then, within each group, there are keyword entries that describe terms that could be used to search Chronicling America.

Every keyword entry includes similar terms, definitions, and information about past usage, including examples of these terms being used in Chronicling America newspapers. For example, searching the phrase “African American” in Chronicling America produces only 671 results, with the earliest result occurring in 1814. This search produces zero results from the eighteenth century, so almost 40 years (1777-1814) of print is missed by using this phrase. If you search the term “Negro,” however, 2,686,588 results from May 29, 1777 to December 31, 1963, are returned. While “African American” is a phrase we are familiar with in our current time, these results, when compared, show how infrequently it was used in previous centuries. By searching Chronicling America only for terms that we would use today, we overlook millions of articles and countless stories remain unfound.

Each entry also encourages users to take advantage of a relatively unique feature of Chronicling America: a user can see the way in which the computer “reads” the newspaper image through Optical Character Recognition (OCR) software. Proprietary newspaper databases do not usually make their underlying OCR visible to users, so we cannot know what we are searching and what we are not searching. What in common parlance is referred to as “dirty OCR” is the equivalent of dirty laundry for proprietary companies: they do not want users to see it because they do not want us to know the mistakes and omissions it contains. Because profit does not motivate the production of Chronicling America, it can make the messiness of newspaper digitization more transparent. That means a user can view the “dirty OCR” for their search. For each thesaurus entry, we have included a section called “OCR Considerations, or ‘How the Computer Sees it.’” Historical newspapers found in Chronicling America are often digitized from microform copies of the original pages, which makes for some messy images full of smudges, blurs, and imperfections. These imperfections can cause some words to be read differently, that is, a smudge on a “c” might be read as an “e,” so a search for “Hispanic” would not include any results in which OCR has read the word as “Hispanie.”

Including these variants in searches helps users to maximize their search results within Chronicling America’s collection.

Ultimately, it is our hope that the Race and Ethnicity Keyword Thesaurus will help users uncover lesser known histories in Chronicling America, as well as reflect on the ways that language has changed throughout U.S. history.