DHC Weekly 4/26: Google Ngram…friend or foe?

Last week, we looked at a set of corpora that, among other things, let you ask how language changes over time. Today I’m going to talk about a similar tool, the Google N-gram Viewer.

The N-gram Viewer has been around for a while, and not without controversy, which I hope to lay out for anyone interested in using it. It allows you to search for a word or phrase in an extremely large corpus of books spanning 400 years of publishing history (something like 4% of all books published, as Google tells it). The result is rather like the “chart” function in Davies’ corpora: a graph of a word’s popularity over four centuries.

Theoretically, there should be a lot to learn here about the popularity of certain terms and concepts over time. As with my example of tobacco, whose presence in early modern texts rose as it rose in dominance as a crop, the frequency with which a term occurs within printed discourse is a reasonably good indicator of its cultural importance. When you open up the N-gram Viewer, it helpfully supplies the search results for three culturally iconic figures: Frankenstein, Sherlock Holmes, and Albert Einstein, as if to invite you to stage little cross-century popularity races.

There are issues, however, with taking the N-gram Viewer at its word. Unlike Davies’ corpora, the N-gram Viewer does not allow you to access the source texts behind its results. There is therefore no way to see the context in which the searched word or phrase turned up, and no way to assess whether the data is useful or applicable. The type of text a word appears in is important, as is having a sense of what kinds of books make up the corpus and whether that makeup itself changes over time. Suppose, for example, that the early English books in the corpus are all philosophical in nature, while the more contemporary books include far more scientific texts, as scientific discourses and disciplines have proliferated in the contemporary era. The results for a word like “courage,” unlikely to be used in science writing, may then trend distressingly downwards over time. That decline would reflect not a culture that increasingly has no use for the brave, but an inconsistent set of data skewing the results.
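To make the worry about corpus makeup concrete, here is a minimal sketch in Python. Every number in it is invented for illustration (the per-genre rates, the genre shares, and the two years are all assumptions, not anything drawn from Google’s data). It holds the rate of “courage” fixed within each genre and changes only the mix of genres in the corpus.

```python
# Hypothetical rate of "courage" per million words in each genre.
# These figures are invented; they stay the same in both periods.
RATE_PER_MILLION = {"philosophy": 40.0, "science": 2.0}

# Hypothetical share of the corpus taken up by each genre (also invented).
CORPUS_MIX = {
    1700: {"philosophy": 0.9, "science": 0.1},
    2000: {"philosophy": 0.3, "science": 0.7},
}


def overall_rate(mix, rates):
    """Corpus-wide rate: the genre rates weighted by each genre's share."""
    return sum(share * rates[genre] for genre, share in mix.items())


rates_by_year = {
    year: overall_rate(mix, RATE_PER_MILLION) for year, mix in CORPUS_MIX.items()
}

for year, rate in sorted(rates_by_year.items()):
    print(f"{year}: {rate:.1f} per million words")
```

With these made-up inputs the corpus-wide rate falls from 36.2 to 13.4 per million words, even though no genre uses the word any less than it did before. A graph of the aggregate alone cannot distinguish that from a real decline in usage.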

I am also somewhat skeptical of the implication that the sheer number of instances of a word in print is at all indicative of its cultural capital, especially across four centuries, and especially when limited to a corpus of books. Some people and things are written about extensively in newspapers, but less so in books. Some might be written about extensively and in large numbers but by a relatively small, super-engaged constituency.

Google N-gram Viewer is certainly more extensive in its corpus than the corpora I wrote about last week, but with its millions of texts come some tradeoffs that, I suspect, make the tool less than useful for anything other than putting in things you’re mildly curious about. Like hey, check it out:

Sylvia’s on the rise again!
