
InstaBully

Introduction

Cyber bullying has become prevalent in today's social-media-driven world. Awareness of it, however, is not very widespread. Because victims usually have no escape from their bullies, cyber bullying can be even more devastating than traditional bullying. It is also sometimes hard to distinguish simple negative interactions from cyber bullying.
Keeping this in mind, we wanted to create a program that helps detect cyber bullying on an Instagram account given only its username.

Relevance

In India, nearly 40% of people have never heard of cyber bullying. Furthermore, a majority of people think that current measures against cyber bullying are insufficient, and 45% of parents say that their children have been cyber-bullied. Of all the ways in which people can be bullied online, social media is the most common and also the most personal.

Although the nature of the bullying changes from platform to platform, its effect does not. We picked Instagram as our target because we felt that the scope for cyber bullying is highest there: it is a platform for sharing photos with others, and photos attract more vitriol than text-only posts.

We focus our experiments on Instagram, a platform widely used by today's teenagers. Gone are the days when Facebook was dominant and emails were a big deal. Given how widely Instagram is used, we feel that most of the bullying takes place there.
This is not the first time machine learning has been used for cyber bullying detection, but we can confidently say that this is one of the first attempts that gives a high confidence percentage when predicting cyber bullying.

Methodology

Our approach to predicting cyber bullying against a person is a machine learning model trained on a cyber bullying data-set using Twitter-based GloVe vector embeddings. We briefly explain the procedure as follows:

Cyber Bullying Dataset

The data-set we used is an already curated collection of cyber bullying content and conversations from online social media. It was part of a challenge named EV Hacks. Though the data-set is small, it has a comprehensive format that accounts for all types of cyber bullying.

The Model

The model we used is a Support Vector Machine (SVM) trained on the cyber bullying data-set described above, using GloVe vector embeddings.

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

The GloVe embeddings for Twitter are a standard set of pre-trained word embeddings built from Twitter text, so they reflect the language used there. In a nutshell, in this embedding space we can expect 'u' to lie close to 'you' and 'yu'.

Basically, GloVe maps each word to a numeric vector, such that words with some connection in meaning end up with vectors that are statistically related to one another.
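
As a minimal sketch of how a comment could be turned into a fixed-length feature vector, the snippet below parses text in GloVe's plain-text format (a word followed by its vector components) and averages the vectors of the words in a comment. The tiny three-dimensional vectors and the names load_glove and embed_comment are made up for illustration, and averaging is only one common way to combine word vectors, assumed here for the example.

```python
import io

# Toy stand-in for a GloVe text file: each line is "<word> <v1> <v2> ...".
# These 3-dimensional values are invented for illustration only.
TOY_GLOVE = """\
you 0.9 0.1 0.0
u 0.8 0.2 0.0
ugly -0.7 0.6 0.1
"""

def load_glove(handle):
    """Parse GloVe-format text into a dict mapping word -> list of floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def embed_comment(text, vectors, dim):
    """Average the vectors of known words; unknown words are skipped."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(column) / len(known) for column in zip(*known)]

vectors = load_glove(io.StringIO(TOY_GLOVE))
features = embed_comment("u ugly", vectors, dim=3)
```

With real embeddings, the same function would be pointed at a downloaded Twitter GloVe file instead of the toy string, and the resulting feature vectors would be what the classifier is trained on.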

Use of SVMs

As you might have understood by now, we first looked up the word embeddings for the words in the data-set, plugged those vector values into a machine learning model, and finally predicted whether a particular text is bullying or not.

The main question here, as many machine learning enthusiasts might ask, is why an SVM. To that, we can frankly say that this model gave the highest accuracy, so we chose it 🙂.

But to explain this formally, we turn to the definition of support vector machines. An SVM cuts the space of data-points with a hyperplane that separates the classes with the widest possible margin. If the points are not separable in the original space, we project them into another space using a kernel. So this was the best bargain we could get from the small data-set that was available.
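
To illustrate the idea of projecting points into another space, here is a small sketch with XOR-style points, which no straight line can separate in two dimensions. Appending the product of the two coordinates as a third coordinate makes them separable by a plane. The hand-written feature_map stands in for what a kernel does implicitly, and the weights are chosen by hand rather than learned, so this is an illustration of the principle under those assumptions, not our trained model.

```python
# XOR-style toy data: label +1 when the coordinates share a sign, -1 otherwise.
points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [1, 1, -1, -1]

def feature_map(point):
    """Lift a 2-D point to 3-D by appending the product of its coordinates."""
    x1, x2 = point
    return (x1, x2, x1 * x2)

def linear_predict(weights, bias, x):
    """Return the side of the hyperplane w.x + b = 0 on which x falls."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1

# In the lifted space, the plane "third coordinate = 0" separates the classes.
# These weights are picked by hand; an SVM would learn them from the data.
weights, bias = (0.0, 0.0, 1.0), 0.0
predictions = [linear_predict(weights, bias, feature_map(p)) for p in points]
```

A kernel SVM reaches the same effect without ever building the lifted coordinates explicitly, since it only needs inner products between points in the new space.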


We found the training accuracy to be nearly 76%.

Final Setup

Our final program consists of an interface which, when provided with an Instagram handle, scrapes the 4 most recent posts made by that handle and 100 comments on these 4 posts. The scraping happens in real time.
We then apply our bully-detection model, rank these posts (using the predicted probabilities), and display the 2 most bullied posts along with the 3 most bullied comments on these posts.
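
The ranking step can be sketched roughly as follows. The Post class, the 0.5 threshold, and the choice to rank posts by the share of comments flagged as bullying are all assumptions made for this example, and the probabilities are invented values standing in for the classifier's output.

```python
from dataclasses import dataclass

@dataclass
class Post:
    post_id: str
    comments: list  # comment texts
    probs: list     # hypothetical per-comment bullying probabilities

def bully_share(post, threshold=0.5):
    """Fraction of a post's comments whose probability exceeds the threshold."""
    if not post.probs:
        return 0.0
    return sum(p > threshold for p in post.probs) / len(post.probs)

def top_bullied(posts, n_posts=2, n_comments=3):
    """Return the n_posts most bullied posts with their n_comments worst comments."""
    ranked = sorted(posts, key=bully_share, reverse=True)[:n_posts]
    report = []
    for post in ranked:
        # Sort (probability, text) pairs so the most likely bullying comes first.
        worst = sorted(zip(post.probs, post.comments), reverse=True)[:n_comments]
        report.append((post.post_id, [text for _, text in worst]))
    return report

posts = [
    Post("a", ["nice", "loser", "great"], [0.1, 0.9, 0.2]),
    Post("b", ["idiot", "ugly", "ok", "hate you"], [0.8, 0.7, 0.3, 0.95]),
    Post("c", ["lovely"], [0.05]),
]
report = top_bullied(posts)
```

Here post "b" ranks first because three of its four comments cross the threshold, followed by post "a" with one of three.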

Results

As the saying goes, the proof of the pudding is in the eating. Although our model uses pretty simple, unsophisticated algorithms, it performs decently well in real-life settings.
Running the above method on public Instagram handles, here are the results for a few famous accounts:

Each result shows first a bar chart of the percentage of bullied posts among the total posts, followed by the comments themselves.

Bully percentage of the most recent posts of Narendra Modi.

Top bullied comments of Narendra Modi (NaMo).

Bully percentage of Barack Obama.

Top bullied comments of Barack Obama.

Our software run on @ikamalhaasan.

Our software run on @urvashirautela, showing sexist comments.

Our software run on @deepakkalal.

Future Scope

This work can be extended with other techniques, such as:
  • Filtering and identifying obscene language using data-sets such as the Profane Lexicon database.
  • Bullying is usually very specific to a particular location, so we wish to first group different locations and then identify the terms and phrases that denote bullying in the language of each location.
