Introduction
Cyber-bullying has become prevalent in today's social-media-driven world, yet awareness about it is not widespread. Because victims usually have no escape from their bullies, cyber-bullying can be even more devastating than traditional bullying. It is also often hard to distinguish ordinary negative interactions from genuine cyber-bullying.
Keeping this in mind, we wanted to create a program that helps detect cyber-bullying on an Instagram account given only a username.
Relevance
In India, nearly 40% of people have never heard of cyber-bullying, and a majority think that current anti-cyber-bullying measures are insufficient. 45% of parents report that their children have been cyber-bullied. Of all the ways in which people can be bullied online, social media is the most common and also the most personal.
Although the nature of the bullying changes from platform to platform, its effect does not. We picked Instagram as our target social media because we felt the scope for cyber-bullying is highest there: it is a platform for sharing photos, and photos attract more vitriol than plain text posts.
Instagram is widely used by today's teenagers; gone are the days when Facebook was dominant and email was a big deal. Given this widespread use, we feel most online bullying now happens on Instagram, which is why our experiments focus on it.
This is not the first time machine learning has been used for cyber-bullying detection, but we can confidently say it is one of the first attempts to report a confidence percentage alongside each cyber-bullying prediction.
Methodology
Our methodology for predicting cyber-bullying against a person is to train a machine learning model on a cyber-bullying data-set, using Twitter-based GloVe vector embeddings as features. We briefly explain the procedure below.
Cyber Bullying Dataset
The data-set we used was an already-curated collection of cyber-bullying content/conversations from online social media, released as part of a challenge named EV Hacks. Though the data-set is small, its format is comprehensive and accounts for all types of cyber-bullying.
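For concreteness, here is roughly how such a data-set might be loaded. This is a minimal sketch: the file name and column names are placeholders, not the actual EV Hacks format.

```python
import pandas as pd

# Hypothetical layout: the curated data-set as a CSV with a free-text
# column and a binary bullying label (names below are assumptions).
df = pd.read_csv("cyberbullying_dataset.csv")

texts = df["text"].astype(str).tolist()   # raw comments/conversations
labels = df["label"].to_numpy()           # 1 = bullying, 0 = not bullying
```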
The Model
The model we used was a Support Vector Machine (SVM) trained on the above-stated cyber-bullying data-set, using GloVe vector embeddings as features.
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
The Twitter GloVe embeddings are a standard set of word vectors trained on language from Twitter. In a nutshell, in this embedding space we can expect 'u' to sit close to 'you' and 'yu'.
In essence, GloVe maps each word to a vector, and words with related meanings end up with statistically related vectors.
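To make this concrete, here is a minimal sketch of loading the pre-trained Twitter GloVe vectors (available from the Stanford NLP site) and averaging them into one fixed-length vector per text. Averaging is one common pooling choice; the exact pooling is an implementation detail.

```python
import numpy as np

# Each line of the GloVe file is: word v1 v2 ... v100
embeddings = {}
with open("glove.twitter.27B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

def text_to_vector(text, dim=100):
    """Average the GloVe vectors of known words: a simple, common way
    to turn a whole comment into a single fixed-length feature vector."""
    words = text.lower().split()
    vecs = [embeddings[w] for w in words if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)
```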
Use of SVMs
As you might have gathered by now, the pipeline is: take the word embeddings of the words in a text, turn them into feature vectors, plug those vectors into a machine learning model, and finally predict whether the text is a bullying text or not.
The main question many machine learning enthusiasts might ask is: why an SVM? To that, we can frankly say that this model gave the highest accuracy, so we chose it 🙂.
To explain this more formally: an SVM cuts the space of data points with a hyperplane that best separates the two classes. If the points are not separable in the original space, a kernel projects them into another space where they are. Given the small data-set available to us, this was the best bargain we could get.
We found the training accuracy to be nearly 76%.
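A rough sketch of the training step with scikit-learn is shown below. The kernel choice and hyper-parameters here are illustrative rather than the exact settings; it builds on the texts, labels, and text_to_vector from the earlier sketches.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# One averaged GloVe vector per training text (see the sketch above).
X = np.vstack([text_to_vector(t) for t in texts])
y = labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# probability=True lets us rank posts/comments by bullying probability
# later; the RBF kernel plays the role of the "projection into another
# space" described above.
clf = SVC(kernel="rbf", C=1.0, probability=True)
clf.fit(X_train, y_train)

print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))
```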
Final Setup
Our final program is an interface which, when provided with an Instagram handle, scrapes the 4 most recent posts made by that handle and 100 comments across these posts. The scraping happens in real time.
We then apply our bully-detection model, rank the posts by predicted probability, and display the 2 most bullied posts along with the 3 most bullied comments on each.
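This step maps to code roughly as follows. It is a sketch under assumptions: we use the open-source instaloader library as one plausible scraper (the post above does not name a scraping tool), and rank_bullied, the 0.5 threshold, and the per-post comment cap are our own illustrative choices. It reuses clf and text_to_vector from the earlier sketches.

```python
import itertools
import instaloader  # one possible scraping library, not a confirmed choice

def rank_bullied(handle, clf, n_posts=4, n_comments=100):
    """Scrape a handle's recent posts, score every comment with the SVM,
    and return the 2 most bullied posts with their top 3 bully comments."""
    loader = instaloader.Instaloader()
    profile = instaloader.Profile.from_username(loader.context, handle)

    ranked_posts = []
    for post in itertools.islice(profile.get_posts(), n_posts):
        comments = list(itertools.islice(post.get_comments(), n_comments))
        # Probability of the "bullying" class for each comment text.
        scored = sorted(
            ((clf.predict_proba(text_to_vector(c.text).reshape(1, -1))[0, 1],
              c.text)
             for c in comments),
            key=lambda pair: pair[0], reverse=True)
        # Fraction of comments the model flags as bullying on this post.
        bully_frac = sum(p >= 0.5 for p, _ in scored) / max(len(scored), 1)
        ranked_posts.append((bully_frac, post.shortcode, scored[:3]))

    ranked_posts.sort(key=lambda item: item[0], reverse=True)
    return ranked_posts[:2]
```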
Results
As the saying goes, the proof of the pudding is in the eating: our model, though built from fairly simple algorithms, performs decently well in real-life settings.
Running the above method on public Instagram handles, here are results for a few famous accounts:
For each handle, we first show a bar chart of the percentage of bullied posts among the total posts, followed by the flagged comments themselves.
- Figure: Bully percentage of the most recent posts of Narendra Modi
- Figure: Top bully comments of NaMo
- Figure: Bully percentage of Barack Obama
- Figure: Top bullied comments of Barack Obama
- Figure: Our software run on @ikamalhaasan
- Figure: Our software run on @urvashirautela, showing sexist comments
- Figure: Our software run on @deepakkalal
Future Scope
This work can be extended using other techniques, such as:
- Filtering and identifying obscene language using data-sets such as the Profane Lexicon database (a minimal sketch of this idea follows the list).
- Bullying is often very specific to a particular location; we therefore wish to first group different locations and then identify the terms and phrases in the local language that denote bullying.
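As a rough illustration of the first idea, a lexicon-based filter could be as simple as the following. The lexicon file name and its one-word-per-line format are placeholders for something like the Profane Lexicon database mentioned above.

```python
# Hypothetical extension: flag comments containing words from a
# profanity lexicon (file name/format below are assumptions).
with open("profane_lexicon.txt", encoding="utf-8") as f:
    PROFANE = {line.strip().lower() for line in f if line.strip()}

def contains_profanity(text):
    """True if any whitespace-separated token appears in the lexicon."""
    return any(token in PROFANE for token in text.lower().split())
```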