Language Detection of Social Media Data
When you go to a website or open an email in another language, you'll often see a message identifying the language and offering to translate it.
In some cases you'll think, "Wow, how did they know?!" And in others, it's more like, "Why does Google think my message is in Malaysian?"
Many tools have been developed to detect the language of a piece of text. However, these tend to fail on short texts as well as social media text. With the rise of platforms such as Twitter and Facebook, it's important to be able to accurately detect the language of generally unstructured and noisy data.
From the start, we used the Python guess_language package to filter out English posts. It worked well on longer texts such as blog posts but tended to fail on short posts, especially tweets. As a result, we decided to develop our own language detection package trained on social media data.
Using the Amazon Mechanical Turk system we built, we gathered over 15,000 labeled posts. We also used the Twitter API to gather over a million tweets. We used a Naive Bayes model to build a classifier. Since we did not have enough labeled foreign-language data (especially before we built our Mechanical Turk system), we chose to use an Expectation-Maximization algorithm to cluster the data into language classes. We trained the classifier on the set of n-grams extracted from each post.
Algorithm Details
The Naive Bayes Model
Our language model is constructed as follows:
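One way to write this model is the standard Naive Bayes posterior. The notation here is an assumption chosen for illustration: \ell ranges over the candidate languages, g_1, ..., g_k are the n-grams of the post, and z is the normalization constant.

```latex
\hat{\ell} = \arg\max_{\ell} P(\ell \mid g_1, \dots, g_k)
           = \arg\max_{\ell} \frac{1}{z}\, P(\ell)\, P(g_1, \dots, g_k \mid \ell)
```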
That is, to determine the language of a post, we pick the language that has the highest probability given the n-grams in the post. In this equation, z is a normalization constant equal to the probability of the evidence, which in this case is the n-grams.
We can expand z = P(n-grams) using the Law of Total Probability:
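In the same assumed notation (\ell' ranges over all candidate languages and g_1, ..., g_k are the n-grams of the post), the expansion would read:

```latex
z = P(g_1, \dots, g_k) = \sum_{\ell'} P(\ell')\, P(g_1, \dots, g_k \mid \ell')
```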
Because the Naive Bayes model assumes the n-grams are conditionally independent given the language, we can rewrite the original equation as follows:
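Under that independence assumption, the joint likelihood factors into a product over the individual n-grams. A sketch in assumed notation (\ell is a language, g_1, ..., g_k are the n-grams of the post, z is the normalization constant):

```latex
P(\ell \mid g_1, \dots, g_k) = \frac{1}{z}\, P(\ell) \prod_{i=1}^{k} P(g_i \mid \ell)
```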
In the equation above, the unnormalized product becomes very small, so to avoid floating-point underflow we perform our calculations as a sum of logs:
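Taking logarithms turns the product into a sum. In assumed notation (\ell is a language, g_1, ..., g_k are the n-grams of the post, z is the normalization constant), this can be written as:

```latex
\log P(\ell \mid g_1, \dots, g_k) = \log P(\ell) + \sum_{i=1}^{k} \log P(g_i \mid \ell) - \log z
```

Since \log z is the same for every language, it can be left out when only the most likely language is needed.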
Constructing the N-grams
In this case, by n-gram we mean an n-letter slice of a word. We use non-overlapping n-grams from each word (including an additional space at the end of each word).
For example, trigram construction for the string "hello she loves cats" becomes ["hel", "lo ", "she", "lov", "es ", "cat"].
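The construction can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the function name word_ngrams is hypothetical, and dropping leftover slices shorter than n characters is an assumption inferred from the example (where "she " contributes only "she" and "cats " only "cat").

```python
def word_ngrams(text, n=3):
    """Return non-overlapping n-letter slices of each word in text.

    Each word gets one trailing space before slicing. Leftover slices
    shorter than n characters are dropped (an assumption inferred from
    the worked example, not a documented rule).
    """
    grams = []
    for word in text.split():
        padded = word + " "
        # Step through the padded word n characters at a time (no overlap).
        for start in range(0, len(padded) - n + 1, n):
            grams.append(padded[start:start + n])
    return grams


print(word_ngrams("hello she loves cats"))
# ['hel', 'lo ', 'she', 'lov', 'es ', 'cat']
```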
Expectation-Maximization
We use the following algorithm for calculating the final probabilities:
1) Assign random language probabilities to each of the posts in the training data.
2) M-step: using the current per-post probabilities (initially those from step 1), we recalculate the prior P(language) and the likelihood P(post | language), applying the normalization.
3) E-step: using the probabilities calculated in step 2, we recalculate the posterior probabilities, P(language | post).
4) Repeat the M and E steps until convergence.
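The loop above can be sketched as follows. This is a simplified, unsupervised illustration under stated assumptions rather than our actual implementation: the function name em_naive_bayes, the additive smoothing constant alpha, the tolerance tol, and the iteration cap are all hypothetical choices. Each post is assumed to be already converted to a list of n-grams, and the posterior is computed in log space to avoid underflow.

```python
import math
import random
from collections import defaultdict


def em_naive_bayes(posts, n_classes, max_iters=50, alpha=0.1, tol=1e-6, seed=0):
    """Soft-cluster posts (lists of n-grams) into n_classes language classes.

    Returns one list of class probabilities per post.
    """
    rng = random.Random(seed)
    vocab = {g for post in posts for g in post}

    # Step 1: random, normalized class probabilities for every post.
    resp = []
    for _ in posts:
        weights = [rng.random() + 1e-9 for _ in range(n_classes)]
        total = sum(weights)
        resp.append([w / total for w in weights])

    for _ in range(max_iters):
        # Re-estimate the smoothed prior P(language) from the soft assignments.
        mass = [sum(r[c] for r in resp) for c in range(n_classes)]
        prior_denom = len(posts) + alpha * n_classes
        log_prior = [math.log((m + alpha) / prior_denom) for m in mass]

        # Re-estimate the smoothed likelihoods P(n-gram | language).
        counts = [defaultdict(float) for _ in range(n_classes)]
        for post, r in zip(posts, resp):
            for c in range(n_classes):
                for g in post:
                    counts[c][g] += r[c]
        log_like = []
        for c in range(n_classes):
            denom = sum(counts[c].values()) + alpha * len(vocab)
            log_like.append(
                {g: math.log((counts[c][g] + alpha) / denom) for g in vocab}
            )

        # Recompute the posterior P(language | post) as a sum of logs,
        # then normalize after subtracting the largest score.
        new_resp = []
        for post in posts:
            scores = [
                log_prior[c] + sum(log_like[c][g] for g in post)
                for c in range(n_classes)
            ]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            z = sum(exps)
            new_resp.append([e / z for e in exps])

        # Stop once the posteriors no longer change noticeably.
        change = max(
            (abs(a - b) for old, new in zip(resp, new_resp) for a, b in zip(old, new)),
            default=0.0,
        )
        resp = new_resp
        if change < tol:
            break
    return resp
```

Because the starting probabilities are random, the resulting classes are unlabeled clusters. Mapping each cluster to a named language would need a small amount of labeled data.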
Results
We experimented with several variations, including the size of the n-grams, purely supervised training (with the small amount of labeled data we had), and a combination of the supervised and Expectation-Maximization approaches. In the end, we tested our classifier against guess_language and Google's Chromium language detector.
In a set of 4000 tweets (half English, half non-English), the performance of the three methods was as follows:
Demo
We built a demo that classifies tweets and compares the three methods. The tweets are retrieved from the Buzzient Post API. Check it out here!






