Researchers find evolving dialects using Twitter

Credit: Adelaide Cole/Art Staff

When one thinks of dialects of a language, one usually thinks of spoken variations that evolve from factors such as geographic location or social class. However, Twitter has provided two Carnegie Mellon researchers and their advising professors with the opportunity to study distinct dialects evolving from the 140-character updates of frequent Twitter users.

Because Twitter is a conversational social medium, users are more likely to spell words as they pronounce them. That gave Jacob Eisenstein, a postdoctoral fellow, and Brendan O’Connor, a doctoral student, both in the Machine Learning Department, the chance to study the phenomenon. The other two co-authors of their paper are Eric P. Xing, an associate professor of machine learning, and Noah A. Smith, an assistant professor at the Language Technologies Institute.

Eisenstein and O’Connor provided background on the project and its objectives. “[This could tell us] what language variations mean, and if this is a random phenomenon that rises and peters out, or is there some conscious or subconscious desire that people have to distinguish their communities by the way [they] speak, and is that inherent to language?” Eisenstein said.

Over the course of a week last March, Eisenstein and O’Connor gathered tweets from 9,500 users who updated at least 40 times a day, followed no more than 1,000 users, and had no more than 1,000 followers. With a base of 380,000 total tweets (or approximately 15 percent of public tweets), the two then filtered out messages containing URLs, which could be a source of spam, and tweets sent from outside the contiguous United States.
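The filtering steps described above can be sketched in a few lines of Python. This is only an illustration, not the researchers' code: the `Tweet` record, the `keep_tweet` and `filter_tweets` helpers, the URL pattern, and the latitude/longitude bounding box are all assumptions made for the example, and the 1,000 limit is read here as applying both to followers and to followed accounts.

```python
import re
from dataclasses import dataclass

# Rough bounding box for the contiguous United States. This is an assumed
# approximation: it also admits slivers of Canada and Mexico.
LAT_MIN, LAT_MAX = 24.5, 49.5
LON_MIN, LON_MAX = -125.0, -66.9

# Assumed heuristic for spotting links in a message.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


@dataclass
class Tweet:
    """Hypothetical record; the field names are not from the study."""
    user: str
    text: str
    lat: float
    lon: float
    followers: int  # accounts following the author
    following: int  # accounts the author follows


def keep_tweet(tweet, max_connections=1000):
    """Return True if the tweet passes all three illustrative filters."""
    # Drop authors with very large audiences or follow lists.
    if tweet.followers > max_connections or tweet.following > max_connections:
        return False
    # Drop messages containing links, which could be spam.
    if URL_PATTERN.search(tweet.text):
        return False
    # Keep only messages geotagged inside the bounding box.
    return LAT_MIN <= tweet.lat <= LAT_MAX and LON_MIN <= tweet.lon <= LON_MAX


def filter_tweets(tweets):
    """Apply keep_tweet to every tweet and return the survivors."""
    return [t for t in tweets if keep_tweet(t)]
```

Calling `filter_tweets` on a list of `Tweet` records returns only the messages that satisfy all three conditions. A real pipeline would need a proper geographic boundary rather than a rectangle.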

After gathering data for a week and sorting it according to geographical and subject markers, the team was able to create a rough breakdown of words unique to particular cities and regions of the United States, such as New York City, Northern California, Atlanta, and New England. For instance, New York users often tweeted “suttin” instead of “something,” whereas tweets containing the words “hella” and “koo” often originated in Northern California.
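One simple way to see how a word can mark a region is to compare how often it appears inside that region with how often it appears everywhere else. The sketch below does this with a smoothed frequency ratio. It is not the statistical model the team used; the `distinctive_words` function, its parameters, and the toy tweets are invented purely for illustration.

```python
from collections import Counter


def distinctive_words(tweets_by_region, region, top_n=3, smoothing=1.0):
    """Rank words by how much more often they occur in `region` than elsewhere.

    `tweets_by_region` maps a region name to a list of tweet texts.
    """
    inside = Counter()
    outside = Counter()
    for name, texts in tweets_by_region.items():
        target = inside if name == region else outside
        for text in texts:
            target.update(text.lower().split())

    vocab_size = len(set(inside) | set(outside))
    # Add-one style smoothing so unseen words never cause division by zero.
    inside_total = sum(inside.values()) + smoothing * vocab_size
    outside_total = sum(outside.values()) + smoothing * vocab_size

    def ratio(word):
        p_in = (inside[word] + smoothing) / inside_total
        p_out = (outside[word] + smoothing) / outside_total
        return p_in / p_out

    # Highest ratio first; ties are broken alphabetically.
    return sorted(inside, key=lambda w: (-ratio(w), w))[:top_n]


# Made-up example tweets, not data from the study.
toy = {
    "New York": ["i need suttin to eat", "suttin is wrong"],
    "Northern California": ["that is hella koo", "hella tired i need to eat"],
}
print(distinctive_words(toy, "New York", top_n=1))
print(distinctive_words(toy, "Northern California", top_n=1))
```

On the toy data, the top-ranked word is "suttin" for New York and "hella" for Northern California. A ratio like this is easily fooled by rare words and short-lived topics, which is part of why more careful modeling is needed on real data.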

To sift through the tweets, the pair used mathematical models and statistical distributions to uncover the regional dialects. One question they considered was how dialects change over time. For instance, does a dialect start in urban centers and expand outward, or does it move from city to city? Such change comes from the upper working class and moves up, Eisenstein said, and could provide an entry point for changes in language. While children learn how to write from school, they learn how to speak from their parents.

“When you look at Twitter as a whole, it’s a lot more diverse with respect to age,” Eisenstein said. “It’s not a bunch of teenagers, but when you [first] post from cell phones and [second] post a billion messages, that may be what gives us a bias to have sort of younger people in our data.”

O’Connor added, “Because Twitter is a very informal form of writing, it’s better at getting these changes in casual language than other forms of writing, because you can spell things how you speak, and no one tells [you] how to use Twitter. You can use things like vowel changes or small stuff that don’t make it into formal writing that can quite conceivably appear in Twitter.”

Besides accounting for high-usage words, the team also had to consider which parts of their data reflected a sustained change over time.

In reference to an acronym found on the East Coast, Eisenstein explained, “Here is a word that only occurs in New York or only in Pennsylvania, and you have to ask yourself, is this a stable variation...? When we look at this a year from now, is it still going to be the case that we only see it in this part of the country, or will it spread to the whole country, or will it completely disappear off the map?”

Eisenstein also accounted for chance events that do not relate to a region’s dialect. In one instance, the Oklahoma City Thunder played two games against the Sacramento Kings, which explains why “thunder” appeared as a dialect marker for the Northern California region.

“If you did a more longitudinal sample over the course of the year, presumably you wouldn’t see that,” he said. Another marker that was exclusive to the Lake Erie region was “Bieber.” Whether this is an anomaly or not is yet to be determined.

Change on Twitter happens a lot faster than in other media, which could allow researchers to study language change in real time. While studies of language change on Twitter are still in their early stages, such findings could provide further insight into linguistics and the evolution of culture.