All my past analytics experience and learning have focused on numerical analysis, including spatial analytics. But since I was vacationing in Nigeria, with plenty of time to spare and a huge appetite for data, I decided to start learning some text mining.
Since I am new to this area, I kept things to a simple text mining exercise. Read on!
I downloaded one week's worth of tweets sent to the Twitter handles of three of the biggest network providers in Nigeria: MTN, Airtel, and Glo.
The three networks had different numbers of tweets and of terms (words) that were scored:
MTN = 986 tweets, 2542 terms included in the sentiment aggregation
Glo = 2368 tweets, 6242 terms included in the sentiment aggregation
Airtel = 1518 tweets, 3554 terms included in the sentiment aggregation
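Since the tweet and term counts were imbalanced across the three providers, I aggregated sentiment scores as averages rather than totals, so that the results are comparable between networks.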
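To see why averages help here, below is a minimal Python sketch (not the Excel workflow used in this post). It assumes each scored term carries a value of -1, 0, or +1; the provider names and scores are toy values invented for illustration, not the collected data.

```python
from statistics import mean

# Hypothetical per-term scores (-1 negative, 0 neutral, +1 positive).
# Toy values only; ProviderB simply has twice as many scored terms.
scores = {
    "ProviderA": [-1, -1, 0, 1],
    "ProviderB": [-1, -1, 0, 1, -1, -1, 0, 1],
}

def aggregate(scores_by_provider):
    """Return (sum, mean) of the term scores for each provider."""
    return {
        name: (sum(vals), mean(vals))
        for name, vals in scores_by_provider.items()
    }

result = aggregate(scores)
for name, (total, avg) in result.items():
    print(f"{name}: sum={total}, mean={avg:.2f}")
```

Both toy providers have the same mix of sentiment, yet the sums differ only because one has more scored terms. The means agree, which is the property needed when comparing providers with different tweet volumes.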
Derive the general sentiment toward each network provider from tweets directed at the provider’s Twitter handle.
KNIME: Super powerful, gorgeous, efficient, and speedy platform for all things data science. I used KNIME to retrieve the data, prepare it for text mining, and identify the top 50 words present in the body of tweets for each network provider.
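For readers who want a code equivalent of the "top 50 words" step, here is a rough Python sketch. It is not what KNIME does internally; the tokenizer, the tiny stop-word list, and the sample tweets are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "is", "my", "a", "to", "and", "for", "your"}

def top_terms(tweets, n=50):
    """Return the n most frequent non-stop-word terms across the tweets."""
    counts = Counter()
    for tweet in tweets:
        # Lowercase, then keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", tweet.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

# Made-up example tweets, not taken from the collected data.
sample = [
    "My network is bad today",
    "Thanks for the quick fix",
    "Network bad again, data not working",
]
top = top_terms(sample, n=3)
print(top)
```

Running `top_terms` once per provider with `n=50` would give one top-50 list per network, which could then be merged into a single list of unique words to score.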
Excel: Does this old faithful really need an introduction? 😉 I used Excel to perform sentiment scoring, aggregate sentiments, and visualize the data.
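The scoring itself amounts to a lookup against a hand-built list of term sentiments. A minimal Python sketch of that idea follows, not the actual spreadsheet. The -1/+1 coding is an assumption, the two lexicon entries are terms this post labels as negative and positive, and the sample tweets are invented.

```python
import re

# Hand-assigned term sentiments; the numeric coding is an assumption.
LEXICON = {"bad": -1, "thanks": 1}

def score_terms(tweets, lexicon):
    """Collect the score of every lexicon term occurrence in the tweets."""
    scores = []
    for tweet in tweets:
        for token in re.findall(r"[a-z']+", tweet.lower()):
            if token in lexicon:
                scores.append(lexicon[token])
    return scores

# Invented sample tweets for illustration only.
sample = ["Network is bad", "Thanks for the help", "bad bad service"]
term_scores = score_terms(sample, LEXICON)
average = sum(term_scores) / len(term_scores) if term_scores else None
print(len(term_scores), "terms scored, average sentiment:", average)
```

Words that are not in the lexicon are simply skipped, so the number of scored terms depends on how many words the list covers.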
It appears that Airtel is the least problematic mobile network provider in Nigeria! 🥂
Now, since I was analyzing tweets sent to the networks' accounts, the tweets were likely to skew negative. Indeed, many of them were customer reports of service and network problems.
All in all, Airtel emerged with the least negative sentiment score of the three. The resulting ranking, Airtel > MTN > Glo, is consistent with the findings of an evaluation of the three networks. Here is that article.
The primary limitation of this analysis is that, although sufficient tweets and words were analyzed, the data covers only one week. Another improvement would be to include more words on the sentiment scoring list. For this analysis I had 80, based on the unique words found in the lists of top 50 words per network provider. If I were to consider the top 100 words per network provider, I would have more words in my sentiment list, which should lead to more accurate aggregated sentiment scores.
Furthermore, there is the human factor. Since I assigned sentiments to the terms myself, judgement calls were involved. Some of the terms had obvious sentiments. For example, “bad” has a negative sentiment, while “thanks” is positive. For more ambiguous terms like “dey”, I had to consider how the term was used in the context of several tweets. I classified “dey” as negative in light of the tweets I saw, yet some might argue that it is neutral. If this analysis were repeated, it would be beneficial to have multiple coders assign sentiments to the words, and then take the consensus.
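One simple way to take a consensus is a majority vote, with ties flagged for discussion. The Python sketch below is only an illustration of that idea. The three coders and their votes are hypothetical, and the majority rule is an assumption rather than something done in this analysis.

```python
from collections import Counter

def consensus(labels):
    """Return the majority label, or None when there is no clear winner."""
    ranked = Counter(labels).most_common()
    if not ranked:
        return None  # no labels supplied
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # top labels tie; flag the term for discussion
    return ranked[0][0]

# Hypothetical votes from three coders; the votes are invented.
votes = {
    "bad": ["negative", "negative", "negative"],
    "thanks": ["positive", "positive", "positive"],
    "dey": ["negative", "neutral", "negative"],
    "network": ["negative", "neutral", "positive"],
}
agreed = {term: consensus(v) for term, v in votes.items()}
print(agreed)
```

Terms that come back as `None` are the ambiguous ones, where the coders would need to look at example tweets together before settling on a label.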
While Airtel appears to have the best sentiment in customers' tweets, this result should not be taken as definitive, given the limitations identified above.
This project was my first time dabbling in any form of text analytics. For my next project, I will employ some ML techniques, or use an established dictionary of word sentiments. I might even employ some Python, as it has great libraries for sentiment analysis.
Although the sentiment scoring in this project was manual, it has helped me gain an understanding of some key steps in text mining. I hope it has helped you too!
Feel free to discuss in the comments section.
I’ll see you in the next one 😉