Identifying Inhuman Humans

Prologue

In today's world, going through a day without checking social media notifications is nearly unthinkable. Social media has changed the way we connect and interact with the world. Platforms like Twitter operate on a simple principle: open to all, with no bias towards anyone, almost like ideal equality.

The fact that anyone can write about anything to anyone, whether sitting in the comfort of home, struggling at the office or commuting, has both a good and a bad side.

Terror groups like ISIS have been active on Twitter since 2014, when they captured parts of Syria. They use Twitter to spread hatred, radicalize people and recruit new so-called 'soldiers'. Astonishing news came to light in December 2014: an ordinary software engineer in Bangalore, Shammi Witness, was found to be involved with ISIS, helping them radicalize and recruit people. What is astonishing is that he was doing so openly, without hiding his identity or using coded tweets.

The reach of such platforms is worldwide, so one can influence the masses through them. Cases like Shammi Witness are often overlooked or never come to light.

With this project, we tried to apply our technical knowledge to tackle this problem.

Dataset

This project was started with the aim of not only analytics but also dataset creation. Dataset creation may not be considered fancy work, but it is the heart of all analytics and machine learning. Thus, the initial weeks were spent on data collection.

We collected real-time data over a span of 1 to 1.5 months. The data comes mainly from Twitter hashtags like '#ISIS', '#Jihaad', '#ISIL', etc.

Data collection was done using the Twitter API. Given the rate limits, we were able to collect 45K tweets with 20 dimensions, i.e. we effectively ended up with a 45,000 x 20 matrix. This is a considerable size for analysis, so from this point onwards we moved our focus from data collection, preprocessing, cleaning and formatting towards analytics.

Below, you can get a glimpse of the data. Notice how clean and well formatted it is. Justice is done to this step!

Some features:

Date Time
Location
Geo Tag
Tweet ID
Language
Hashtags
User Mentions
Retweet Count
Tweet Favourite Count
Device
User ID
User Name
Screen Name
Active Since
Tweet Count
Verification Status
Followers Count
Following Count
URL
Full Text
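
To make the 20 columns concrete, here is a minimal sketch of how one raw tweet could be flattened into a row with the features listed above. It is not our actual collection script. It assumes the tweet arrives as a dictionary in the Twitter API v1.1 JSON layout, and the function name `flatten_tweet`, the column names, and the choice of the user's profile URL for the 'URL' column are all assumptions made for illustration.

```python
from typing import Any, Dict


def flatten_tweet(tweet: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one tweet dict (Twitter API v1.1 layout assumed) into a 20-column row."""
    user = tweet.get("user", {})
    entities = tweet.get("entities", {})
    return {
        "date_time": tweet.get("created_at"),
        "location": user.get("location"),
        "geo_tag": tweet.get("coordinates"),
        "tweet_id": tweet.get("id_str"),
        "language": tweet.get("lang"),
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "user_mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "retweet_count": tweet.get("retweet_count", 0),
        "favourite_count": tweet.get("favorite_count", 0),
        "device": tweet.get("source"),
        "user_id": user.get("id_str"),
        "user_name": user.get("name"),
        "screen_name": user.get("screen_name"),
        "active_since": user.get("created_at"),
        "tweet_count": user.get("statuses_count"),
        "verified": user.get("verified"),
        "followers_count": user.get("followers_count"),
        "following_count": user.get("friends_count"),
        # Assumption: 'URL' is taken to mean the URL on the user's profile.
        "url": user.get("url"),
        # Extended-mode responses use 'full_text'; fall back to 'text' otherwise.
        "full_text": tweet.get("full_text") or tweet.get("text"),
    }
```

Applying such a function to every collected tweet and stacking the rows would give a table with one row per tweet and 20 columns.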

Language Plot

When talking about Syria and the Islamic State, the first thing that comes to mind is 'Wouldn't there be a language problem while analysing the tweets?'. We had the same doubt, so to burst this bubble, the first thing we plotted (literally) was the language of the tweets. Fortunately, a surprisingly large share of the tweets were in English. Phew!

This did take the burden off our shoulders, or did it? What lies ahead is a bunch of plots, each signifying one thing or another. You still with us?

Don't worry, we have tried to keep the rest of the read interesting.
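
The numbers behind a language plot like this can be computed with a simple frequency count. The sketch below is illustrative only: it assumes each tweet's language is available as a short code such as 'en' or 'ar', and `language_share` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, Iterable


def language_share(langs: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of tweets per language code, most common first."""
    counts = Counter(langs)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.most_common()}


# Toy example, not our real data: three English tweets and one Arabic tweet.
print(language_share(["en", "en", "ar", "en"]))
```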

Day Plot

Okay, so let's begin.

Let's take a simple metric and see if anything interesting comes out. This was the mindset we had when we plotted the thing which you can see on the right.

Interestingly, we found that ISIS supporters and opponents aren't biased towards any day of the week. Saturday witnessed as many tweets as Wednesday.
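
Counting tweets per weekday is straightforward once the timestamp is parsed. The following is a sketch under the assumption that timestamps come in the Twitter API v1.1 `created_at` format (for example 'Wed Oct 10 20:19:24 +0000 2018') and that weekday abbreviations are parsed in an English locale. `tweets_per_weekday` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable

# Assumed timestamp layout: 'Wed Oct 10 20:19:24 +0000 2018'
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def tweets_per_weekday(timestamps: Iterable[str]) -> Counter:
    """Count tweets per day of the week (in the timestamp's own time zone)."""
    counts = Counter()
    for ts in timestamps:
        parsed = datetime.strptime(ts, TWITTER_TIME_FORMAT)
        counts[WEEKDAYS[parsed.weekday()]] += 1
    return counts
```

Plotting the resulting counts as a bar chart gives the kind of day-wise comparison discussed here.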

Enough fun and games, let's get serious, shall we?

Device Used Plot

Now we do some real analysis. The plot on the left shows the devices used against the number of tweets posted from each. It can be inferred from the plot that the Twitter Web Client is predominantly used for tweeting. Other devices like iPhones, Android phones and even BlackBerry were used in some cases.

A naive assumption can be made that ISIS has some people with a certain amount of technical knowledge. Over time, we have seen a rise in the numbers for 'Android'.
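
For the device counts, the client name first has to be extracted from the raw field. In the Twitter API v1.1 layout the `source` field is an HTML anchor such as `<a href="..." rel="nofollow">Twitter Web Client</a>`. The sketch below assumes that layout; `device_name` and `device_counts` are hypothetical helper names, and a simple tag-stripping regex is used rather than a full HTML parser.

```python
import re
from collections import Counter
from typing import Iterable, Optional

_TAG = re.compile(r"<[^>]+>")


def device_name(source: Optional[str]) -> str:
    """Strip the HTML anchor around the client name; 'Unknown' if empty."""
    return _TAG.sub("", source or "").strip() or "Unknown"


def device_counts(sources: Iterable[Optional[str]]) -> Counter:
    """Count tweets per client name."""
    return Counter(device_name(s) for s in sources)
```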

Location Plot

The most common value among the locations from which tweets were posted is CtrlSec, a hacker group that also goes by the name Anonymous. They tweet against ISIS and help get accounts suspended by reporting to Twitter the ISIS accounts that are used for radicalization and recruitment.

One more interesting thing pops out here: the count of tweets from Syria is far lower than from many other places such as Israel and the USA (Washington DC). Hence, we can conclude that ISIS is not a local problem of Syria; it is a global problem.

Let's look at a few other analyses...

User Activity Plots

This plot tells us how many users are tweeting how much about ISIS. We restricted the graph to a certain tweet count for the sake of simplicity, as plotting beyond that would not yield any better inference.

From this plot we can infer that only a few users produce most of the content, in our case tweets. The rest of the users are not very active and post only rarely: as we can see, 18,000 users posted only one tweet. This is consistent with the 'Power Law', often summarized as the 80/20 rule, under which roughly 20% of users generate most of the data while the remaining 80% mostly view it instead of generating anything new.

We will go further into this with the next plot...

This plot shows the most active users. Beyond these top users, a few are moderately active, and the rest tweet very little. They may be active in reading, liking, sharing, etc., but they don't post tweets that actively.
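
One way to check this kind of skew numerically is to count tweets per user and then ask what share of all tweets comes from the most active fraction of users. The sketch below is an illustration on the assumption that each tweet carries a user ID. The function names and the default 20% cut-off are choices made for the example.

```python
from collections import Counter
from typing import Hashable, Iterable


def activity_distribution(user_ids: Iterable[Hashable]) -> Counter:
    """Map 'tweets posted' -> 'number of users who posted that many'."""
    per_user = Counter(user_ids)
    return Counter(per_user.values())


def top_share(user_ids: Iterable[Hashable], fraction: float = 0.2) -> float:
    """Share of all tweets produced by the most active `fraction` of users."""
    per_user = sorted(Counter(user_ids).values(), reverse=True)
    if not per_user:
        return 0.0
    k = max(1, round(len(per_user) * fraction))  # at least one user
    return sum(per_user[:k]) / sum(per_user)


# Toy example, not our real data: one heavy user and four one-off users.
ids = ["a"] * 6 + ["b", "c", "d", "e"]
print(activity_distribution(ids))
print(top_share(ids))
```

A value of `top_share` close to 1 means a small group of users dominates the conversation.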

Sentiment Score

Here we analysed how active a user is on Twitter, irrespective of the ISIS subject, and what the sentiment of their tweets on the ISIS subject is.

Users who were tweeting against the ISIS

Users who were tweeting pro ISIS

Age Plot

Here we present an analysis by age group: which age group talks more about 'ISIS'. Users are counted whenever they talk about 'ISIS', whether in favour or against.

According to our analysis, people in the 25-34 age group are the most interested in talking about 'ISIS', followed by people in the 22-24 age group. From this we can infer that the youth are keener to know about global problems.

Gender Plot

While analysing the data, it is important to check whether the topic is of equal interest to males and females, and here we do so for the 'ISIS' subject.

What we found is that the 'ISIS' subject is less important among females than among males: of all the users, more than 60% are male and less than 20% are female.

Hashtag Word Cloud

Tweet Word Cloud

User Mention Word Cloud

ML for PRO ISIS Tweets

We want to predict whether a given tweet is in favour of ISIS or not. The tweets collected from Twitter total nearly 35 thousand, including both retweets and original tweets. Out of this dataset, we took nearly 10 thousand tweets and annotated each one as being in favour of ISIS or not. We then prepared a prediction model consisting of the following layers: an LSTM layer with 64 nodes, a Dense layer with 256 nodes and ReLU activation, a Dropout layer with probability 0.5 and, at the end, a single-node output layer with sigmoid activation that produces the prediction. The loss function is Binary Cross Entropy.

Accuracy: With this model we were able to obtain a result of 91.2%.
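
To show how the output of such a model is scored, here is a small pure-Python sketch of the sigmoid output, the binary cross-entropy loss and the accuracy metric. It does not reproduce the LSTM model itself. The 0.5 decision threshold and the clipping constant `eps` are assumptions made for the example, not values taken from our training setup.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true: Sequence[int], y_prob: Sequence[float],
                         eps: float = 1e-7) -> float:
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def accuracy(y_true: Sequence[int], y_prob: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of tweets whose thresholded prediction matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)
```

Here a label of 1 stands for a tweet annotated as pro-ISIS and 0 for any other tweet. The reported accuracy is this kind of fraction.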

Link For ML Model


IIIT-H | Big Data and Policing - Spring 2019 | Projects