InstaBully

Introduction

Cyber-bullying has become prevalent in today's social-media-driven world, yet awareness of it is not very widespread. Because victims usually have no escape from their bullies, it can be even more devastating than traditional bullying. It is also sometimes hard to distinguish simple negative interactions from cyber-bullying.
With this in mind, we wanted to create a program that helps detect cyber-bullying on an Instagram account given only a username.

Relevance

In India, nearly 40% of people have never heard of cyber-bullying. Furthermore, a majority of people think that current measures against cyber-bullying are insufficient, and 45% of parents say that their children have been cyber-bullied. Of all the ways in which people can be bullied online, social media is the most common and also the most personal.

Although the nature of the bullying changes from platform to platform, the effect does not. We picked Instagram as our target platform because we felt that the scope for cyber-bullying is highest there, especially as it is a platform for sharing photos with others, and photos attract more vitriol than plain text posts.

We focus our experiments on Instagram, which is widely used by today's teenagers. Gone are the days when Facebook was dominant and emails were a big deal. Given how widely Instagram is used, we feel that most of the bullying happens there.
This is not the first time machine learning has been used for cyber-bullying detection, but we can confidently say that this is one of the first attempts that gives a high confidence percentage when predicting cyber-bullying.

Methodology

Our approach to predicting cyber-bullying against a person is to use a machine learning model trained on a cyber-bullying data-set, with Twitter-based GloVe vector embeddings as the input features. We briefly explain the procedure as follows:

Cyber Bullying Dataset

The data-set we used was an already curated data-set of cyber-bullying content and conversations from online social media. It was part of a challenge named EV Hacks. Though the data-set is small, it has a comprehensive format, with all types of cyber-bullying accounted for.

The Model

The model we used was a Support Vector Machine (SVM) trained on the cyber-bullying data-set described above, using GloVe vector embeddings.

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

The GloVe embeddings for Twitter are a standard set of word embeddings trained on Twitter text, so they capture the language of the Twitter world. In a nutshell, in this embedding space we can expect 'u' to lie close to 'you' and 'yu'.

Essentially, GloVe maps each word to a numeric vector, such that words connected in meaning have vectors that are statistically related to one another.
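
To make this concrete, here is a minimal sketch of how closeness between word vectors can be measured with cosine similarity, and how a whole comment could be turned into one vector by averaging the vectors of its words. The three-dimensional vectors are made-up toy values, not real GloVe entries, and the helper names `cosine` and `embed_text` are hypothetical. Averaging is just one common pooling choice, assumed here for illustration.

```python
import math

# Toy 3-dimensional vectors standing in for GloVe entries (made-up values).
TOY_VECTORS = {
    "you": [0.9, 0.1, 0.0],
    "u": [0.8, 0.2, 0.1],
    "yu": [0.85, 0.15, 0.05],
    "ugly": [0.0, 0.9, 0.4],
    "nice": [0.1, -0.8, 0.5],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed_text(text, vectors):
    """Average the vectors of the known words; unknown words are skipped."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return None  # no word of the text is in the vocabulary
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# 'u' should be closer to 'you' than to an unrelated word.
print(cosine(TOY_VECTORS["u"], TOY_VECTORS["you"]))
print(cosine(TOY_VECTORS["u"], TOY_VECTORS["ugly"]))
# One fixed-length vector for a whole comment.
print(embed_text("u ugly", TOY_VECTORS))
```

With real embeddings, `TOY_VECTORS` would be replaced by a dictionary loaded from the pre-trained Twitter GloVe files, and the averaged vector would be what gets passed to the classifier.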

Use of SVMs

As you might have understood by now, we first look up the word embeddings for the words in the data-set, plug the resulting vector values into a machine learning model, and finally predict whether a particular text is a bullying text or not.

The main question, as many machine learning enthusiasts might ask, is: why an SVM? To that, we can frankly say that this model gave the highest accuracy, so we chose it 🙂.

But to explain this more formally, we go back to the definition of support vector machines. An SVM looks for a hyperplane that separates the data-points of the two classes, ideally with the largest possible margin. If the points are not separable in the original space, a kernel is used to implicitly project them into another, typically higher-dimensional, space in which they may be. So, this was the best bargain we could get from the small data-set that was available.


We found the training accuracy to be nearly 76%.
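
As an illustration of the idea, here is a minimal sketch of a linear SVM trained by full-batch subgradient descent on the regularised hinge loss, in pure Python. It is not the model used in this project: the two-dimensional points are made-up stand-ins for comment embeddings, the function names and hyperparameters are illustrative assumptions, and no kernel is used.

```python
def train_linear_svm(X, y, lr=0.1, reg=0.01, epochs=200):
    """Full-batch subgradient descent on the L2-regularised hinge loss."""
    dim = len(X[0])
    n = len(X)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [reg * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin or misclassified
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (bullying) or -1 (not bullying) for one feature vector."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


def accuracy(w, b, X, y):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for xi, yi in zip(X, y) if predict(w, b, xi) == yi)
    return correct / len(X)


# Toy, linearly separable data: +1 = bullying, -1 = not bullying.
X_train = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
y_train = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X_train, y_train)
print("training accuracy:", accuracy(w, b, X_train, y_train))
print("prediction for [4, 3]:", predict(w, b, [4, 3]))
```

The toy data is perfectly separable, which real comment embeddings generally are not. That is where a kernel, or a tolerance for some misclassified points, comes in.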

Final Setup

Our final program consists of an interface which, when provided with an Instagram handle, scrapes the 4 most recent posts made by that handle and 100 comments on these 4 posts. The scraping happens in real time.
We then apply our bully-detection model, rank these posts (using the predicted probabilities), and display the 2 most bullied posts along with the 3 most bullied comments on these posts.
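
The ranking step can be sketched as follows. This is only an illustration under stated assumptions: we assume each comment already has a bullying probability from the model, and we score a post by the mean probability of its comments, which is one plausible way to rank and not necessarily the exact rule used in the program. The function name `rank_posts`, the post ids, and the probabilities are all made up.

```python
def rank_posts(posts, top_posts=2, top_comments=3):
    """Rank posts by mean bullying probability of their comments.

    `posts` maps a post id to a list of (comment, probability) pairs.
    Returns a list of (post_id, mean_probability, worst_comments).
    """
    scored = []
    for post_id, comments in posts.items():
        if not comments:
            continue  # a post without comments cannot be scored
        mean_prob = sum(p for _, p in comments) / len(comments)
        worst = sorted(comments, key=lambda c: c[1], reverse=True)
        scored.append((post_id, mean_prob, worst[:top_comments]))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_posts]


# Made-up probabilities for the comments on 4 posts.
toy_posts = {
    "post_a": [("comment 1", 0.05), ("comment 2", 0.10), ("comment 3", 0.30)],
    "post_b": [("comment 1", 0.90), ("comment 2", 0.80),
               ("comment 3", 0.20), ("comment 4", 0.70)],
    "post_c": [("comment 1", 0.40), ("comment 2", 0.10)],
    "post_d": [("comment 1", 0.60), ("comment 2", 0.50), ("comment 3", 0.55)],
}

for post_id, mean_prob, worst in rank_posts(toy_posts):
    print(post_id, round(mean_prob, 2), [text for text, _ in worst])
```

Other scoring rules, such as the share of comments above a probability threshold, would slot into the same structure by changing how `mean_prob` is computed.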

Results

As the saying goes, the proof of the pudding is in the eating. Similarly, although our model uses fairly simple algorithms, it performs decently well in real-life settings.
Running the above method on public Instagram handles, here are the results for a few famous ones:

For each handle, we first show a bar chart of the percentage of bullied posts among all posts, followed by the comments themselves.

Bully Percentage of most recent posts of Narendra Modi

Top bully comments of NaMo.

Bully percentage of Barack Obama.

Top bullied comments of Barack Obama.

Our software run on @ikamalhaasan

Our software run on @urvashirautela, showing sexist comments.

Our software run on @deepakkalal

Future Scope

This work can be extended with other techniques, such as:
  • Filtering and identifying obscene language using data-sets such as the Profane Lexicon database.
  • Bullying is usually very specific to a particular location; we therefore wish to first group different locations and then identify the terms and phrases in each location's language that denote bullying.
