InstaBully

Introduction

Cyber-bullying has become prevalent in today's social-media-driven world, yet awareness of it is not very widespread. Because victims usually have no escape from their bullies online, it can be even more devastating than traditional bullying. It is also sometimes hard to distinguish simple negative interactions from cyber-bullying.
Keeping this in mind, we wanted to create a program that, given only a username, helps detect cyber-bullying on Instagram accounts.

Relevance

In India, nearly 40% of people have never heard of cyber-bullying. Furthermore, a majority of people think that current measures against cyber-bullying are insufficient, and 45% of parents say that their children have been cyber-bullied. Of all the ways in which people can be bullied online, social media is the most common and also the most personal.

Although the nature of the bullying changes from platform to platform, the effect does not. We picked Instagram as our target platform because we felt that the scope for cyber-bullying is highest there: it is a platform for sharing photos with others, and photos attract more vitriol than text-only posts.

We also focused our experiments on Instagram because it is widely used by today's teenagers. Gone are the days when Facebook was dominant and emails were a big deal. Given how widely Instagram is used, we feel that most of the bullying takes place there.

This is not the first time machine learning has been used for cyber-bullying detection, but we believe this is among the first attempts that predict cyber-bullying with a high confidence percentage.

Methodology

Our approach to predicting cyber-bullying against a person is to use a machine learning model trained on a cyber-bullying data-set, with Twitter-based GloVe vector embeddings as features. We briefly explain the procedure as follows:

Cyber Bullying Dataset

The data-set we used is an already curated data-set of cyber-bullying content and conversations from online social media. It was released as part of a challenge named EV Hacks. Though the data-set is small, it has a comprehensive format that accounts for all types of cyber-bullying.

The Model

The model we used is a Support Vector Machine (SVM) trained on the cyber-bullying data-set described above, using GloVe vector embeddings.

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

The GloVe embeddings for Twitter are a standard set of pre-trained word embeddings that reflect the language used on Twitter. In a nutshell, in this embedding space we can expect 'u' to lie close to 'you' and 'yu'.

Basically, GloVe maps each word to a numeric vector, such that words with related meanings or usage end up with vectors that are close to one another.
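To make the idea of "nearby" vectors concrete, here is a minimal sketch. The 3-dimensional vectors in TOY_EMBEDDINGS are invented for illustration (real GloVe Twitter vectors come in 25 to 200 dimensions and are loaded from a file), and averaging word vectors to represent a whole comment is an assumption on our part, one common way to turn word embeddings into a fixed-size input for a classifier.

```python
import math

# Invented toy vectors, NOT real GloVe values; they only mimic the idea
# that 'u' and 'you' should sit close together in the embedding space.
TOY_EMBEDDINGS = {
    "you": [0.9, 0.1, 0.0],
    "u": [0.8, 0.2, 0.0],
    "are": [0.1, 0.9, 0.1],
    "ugly": [0.0, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed_text(text, embeddings, dim=3):
    """Average the vectors of the known words; unknown words are skipped."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# 'u' is far more similar to 'you' than 'ugly' is.
print(cosine_similarity(TOY_EMBEDDINGS["you"], TOY_EMBEDDINGS["u"]))
print(cosine_similarity(TOY_EMBEDDINGS["you"], TOY_EMBEDDINGS["ugly"]))
print(embed_text("u are ugly", TOY_EMBEDDINGS))
```

The averaged vector is what would be handed to the classifier described in the next section.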

Use of SVMs

As you might have understood by now, we first look up the word embeddings for the words in the data-set, plug the resulting vectors into a machine learning model, and finally predict whether or not a particular text is bullying.

The main question here, as many machine learning enthusiasts might ask, is: why an SVM? To that, we can frankly say that this model gave the highest accuracy, so we chose it 🙂.

But to explain this formally, we go back to the definition of a support vector machine. An SVM cuts the space of data-points with a hyperplane that separates the two classes with the largest possible margin. If the points are not separable in the original space, a kernel implicitly maps them into a higher-dimensional space where a separating hyperplane may exist. So, this was the best bargain we could get from the small data-set that was available.
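The hyperplane idea can be sketched in a few lines of plain Python. This is not our training code: it is a from-scratch linear SVM (no kernel) trained by sub-gradient descent on the hinge loss, and the 2-dimensional points, labels, and hyper-parameters (lam, lr, epochs) are invented for illustration. In practice one would use a library implementation on the comment embeddings.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Minimise lam/2 * ||w||^2 + hinge loss by sub-gradient descent.

    labels must be +1 (bullying) or -1 (not bullying).
    Returns the hyperplane as (w, b): the decision rule is sign(w.x + b).
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin or misclassified: move the hyperplane.
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Safely classified: only shrink w (the regulariser).
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Invented, linearly separable toy data standing in for comment embeddings.
POINTS = [[2.0, 2.0], [3.0, 2.5], [2.5, 3.0],
          [-2.0, -2.0], [-3.0, -2.5], [-2.5, -3.0]]
LABELS = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(POINTS, LABELS)
print(predict(w, b, [4.0, 4.0]), predict(w, b, [-4.0, -4.0]))
```

A kernel SVM replaces the dot product w.x with a kernel function, which is what allows it to separate data that no straight hyperplane can split in the original space.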

We found the training accuracy to be nearly 76%.

Final Setup

Our final program consists of an interface which, when provided with an Instagram handle, scrapes the 4 most recent posts made by that handle and 100 comments on these 4 posts. The scraping happens in real time.

Then, after applying our bully-detection model, we rank these posts (using the predicted probabilities) and display the 2 most bullied posts along with the 3 most bullied comments on these posts.
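The ranking step can be sketched as follows. The data layout (a list of dicts with an 'id' and a list of (comment, probability) pairs) and the choice to score a post by the mean bullying probability of its comments are assumptions made for this illustration; the post ids, comments, and probabilities are invented.

```python
def rank_posts(posts, top_posts=2, top_comments=3):
    """Return the most bullied posts, each with its most bullied comments.

    posts: list of dicts {"id": str, "comments": [(text, probability), ...]}
    where probability is the model's bullying probability for that comment.
    """
    scored = []
    for post in posts:
        comments = post["comments"]
        # Assumed post score: mean bullying probability over its comments.
        score = sum(p for _, p in comments) / len(comments) if comments else 0.0
        worst = sorted(comments, key=lambda c: c[1], reverse=True)[:top_comments]
        scored.append({"id": post["id"], "score": score, "top_comments": worst})
    scored.sort(key=lambda s: s["score"], reverse=True)
    return scored[:top_posts]


# Invented example: 4 scraped posts with model probabilities per comment.
EXAMPLE_POSTS = [
    {"id": "post_a", "comments": [("nice pic", 0.05), ("love it", 0.10)]},
    {"id": "post_b", "comments": [("you are pathetic", 0.90), ("delete this", 0.60),
                                  ("ok", 0.20), ("loser", 0.80)]},
    {"id": "post_c", "comments": [("ugh", 0.40), ("go away", 0.70)]},
    {"id": "post_d", "comments": []},
]

for entry in rank_posts(EXAMPLE_POSTS):
    print(entry["id"], round(entry["score"], 3), entry["top_comments"])
```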

Results

As the common saying goes, the proof of the pudding is in the eating. Similarly, our model, though it uses fairly simple algorithms, performs decently well in real-life settings.

Running the above method on public Instagram handles, here are the results for a few famous handles:

Each result shows first a bar chart of the percentage of bullied posts among the total posts, followed by the comments themselves.

Bully Percentage of most recent posts of Narendra Modi

Top bully comments of NaMo.

Bully percentage of Barack Obama.

Top bullied comments of Barack Obama.

Our software run on @ikamalhaasan

Our software run on @urvashirautela, showing sexist comments.

Our software run on @deepakkalal

Future Scope

This work can be extended using other techniques, such as:

Filtration and identification of obscene language using data-sets such as the Profane Lexicon database.
Bullying is usually very specific to a particular location; thus, we wish to first group different locations and then identify the terms and phrases that denote bullying in each location's language.
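As a rough sketch of what such a lexicon-based filter could look like: the word list below is a tiny invented stand-in (not the Profane Lexicon database), and keying lexicons by a locale string is a hypothetical design for handling location-specific terms.

```python
import re

# Hypothetical per-locale lexicons; a real system would load curated lists.
LEXICONS_BY_LOCALE = {
    "en": {"idiot", "loser", "stupid"},
}


def flag_terms(comment, locale="en"):
    """Return the lexicon terms found in the comment, in order of appearance."""
    lexicon = LEXICONS_BY_LOCALE.get(locale, set())
    tokens = re.findall(r"[a-z']+", comment.lower())
    return [token for token in tokens if token in lexicon]


print(flag_terms("You are such a LOSER, idiot!"))
print(flag_terms("Great photo"))
```

Such a filter could run alongside the SVM, for example to highlight which words in a flagged comment are offensive.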


IIIT-H | Big Data and Policing - Spring 2019 | Projects
