
Identifying Inhuman Humans

Prologue

In today's world, going through a day without checking social media notifications is nearly unthinkable. Social media has changed the way we connect and interact with the world. Platforms like Twitter operate on a simple principle: open to all, with no bias towards anyone, almost like ideal equality.

The fact that anyone can write about anything to anyone, whether sitting in the comfort of home, struggling at the office or commuting, has both a good and a bad side to it.

Terror groups like ISIS have been active on Twitter since 2014, when they captured parts of Syria. They use Twitter to spread hatred, radicalize people and recruit new so-called 'soldiers'. Astonishing news came to light in December 2014: an ordinary software engineer in Bangalore, Shammi Witness, turned out to be involved with ISIS, helping them by radicalizing and recruiting people. The astonishing part is that he was doing so openly, without hiding his identity and without using coded tweets.

The reach of such platforms is worldwide, so they can be used to influence the masses. Cases like Shammi Witness are often overlooked or never come to light.

With this project we tried to make use of our technical knowledge and apply it to tackle this problem.

Dataset

This project was started with the aim of not only analytics but also dataset creation. Dataset creation may not be considered fancy work, but it is the heart of all analytics and machine learning. Thus, the initial weeks were spent on data collection.

We collected real-time data over a span of 1 to 1.5 months. The data comes mainly from Twitter hashtags such as '#ISIS', '#Jihaad', '#ISIL', etc.

Data collection was done using the Twitter API. Considering the rate limiting, we were able to collect 45K tweets with 20 dimensions, i.e. we effectively ended up with a 45,000 x 20 matrix. That is a considerable size for analysis, so from this point onwards we moved our focus from data collection, preprocessing, cleaning and formatting towards analytics.

Below, you can get a glimpse of the data. Notice how clean and well formatted it is. Justice is done to this step!

Some features:
  • Date Time
  • Location
  • Geo Tag
  • Tweet ID
  • Language
  • Hashtags
  • User Mentions
  • Retweet Count
  • Tweet Favourite Count
  • Device
  • User ID
  • User Name
  • Screen Name
  • Active Since
  • Tweet Count
  • Verification Status
  • Followers Count
  • Following Count
  • URL 
  • Full Text
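
As a rough sketch of how one tweet object can become one row of the 45,000 x 20 matrix: the snippet below assumes the JSON layout of Twitter's v1.1 REST API (fields such as `created_at`, `entities` and `user`). The function name `flatten_tweet` and the column names are ours, chosen for illustration, and only a subset of the 20 features is shown.

```python
def flatten_tweet(tweet):
    """Turn one tweet dict (Twitter API v1.1 layout, assumed) into a flat row."""
    user = tweet.get("user", {})
    entities = tweet.get("entities", {})
    return {
        "date_time": tweet.get("created_at"),
        "tweet_id": tweet.get("id_str"),
        "language": tweet.get("lang"),
        # Hashtags and mentions arrive as lists of small objects; keep only the text.
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "user_mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "retweet_count": tweet.get("retweet_count", 0),
        "favourite_count": tweet.get("favorite_count", 0),
        "device": tweet.get("source"),
        "user_id": user.get("id_str"),
        "screen_name": user.get("screen_name"),
        "followers_count": user.get("followers_count", 0),
        "verified": user.get("verified", False),
        # Extended-mode responses use "full_text"; older ones use "text".
        "full_text": tweet.get("full_text") or tweet.get("text", ""),
    }
```

Calling `flatten_tweet` on every collected tweet and stacking the resulting dicts gives a table with one row per tweet and one column per feature.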

Language Plot


When talking about Syria and the Islamic State, the first thing that comes to mind is: 'Wouldn't there be a language problem while analysing the tweets?' We had the same doubt, so to burst this bubble, the first thing we plotted (literally) was the language of the tweets. Fortunately, a surprisingly large share of the tweets were in English. Phew!



This did take the burden off our shoulders, or did it? What lies ahead is a bunch of plots, each signifying one thing or another. You still with us?

Don't worry, we have tried to make the rest of the read interesting.

Day Plot

Okay, so let's begin.

Let's take a simple metric and see if anything interesting comes out. This was the mindset we had when we plotted the thing which you can see on the right.



Interestingly, we found that neither ISIS supporters nor anti-ISIS users are biased towards any particular day of the week. Saturday witnessed as many tweets as Wednesday.
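
For readers who want to reproduce a count like this, here is a minimal sketch. It assumes the 'Date Time' column still holds the raw timestamp string in the Twitter v1.1 style (for example "Wed Oct 10 20:19:24 +0000 2018"); the helper name `tweets_per_weekday` is hypothetical.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout, e.g. "Wed Oct 10 20:19:24 +0000 2018".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def tweets_per_weekday(timestamps):
    """Count tweets per weekday name from raw timestamp strings."""
    counts = Counter()
    for ts in timestamps:
        day = datetime.strptime(ts, TWITTER_TIME_FORMAT).strftime("%A")
        counts[day] += 1
    return counts
```

The resulting counter can be passed straight to any bar-plot routine, one bar per weekday.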

Enough fun and games, let's get serious, shall we?

Device Used Plot

Now we do some real analysis. The plot on the left shows the devices used against the number of tweets sent from each of them. It can be inferred from the plot that Twitter Web Client is predominantly used for tweeting. Other devices like iPhones, Android phones and even BlackBerry were used in some cases.



A naive assumption can be made that ISIS has some people with a certain amount of technical knowledge. Over time, we have seen a rise in the numbers for 'Android'.
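
A small sketch of how the device label can be pulled out before counting. It assumes the 'Device' column holds the raw `source` value, which in the Twitter v1.1 API is an HTML anchor such as `<a href="..." rel="nofollow">Twitter Web Client</a>`; the helper names are hypothetical.

```python
import re
from collections import Counter

# Captures the text between the opening anchor tag and its closing </a>.
_SOURCE_LABEL = re.compile(r">([^<]*)</a>")

def device_name(source):
    """Extract the client label from a tweet's `source` HTML snippet."""
    match = _SOURCE_LABEL.search(source or "")
    if match:
        return match.group(1)
    # Fall back to the raw value if it is already plain text.
    return source or "unknown"

def device_counts(sources):
    """Count tweets per client label."""
    return Counter(device_name(s) for s in sources)
```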

Location Plot

Across most of the countries from which tweets were posted, the tweets come from CtrlSec, a hacker group that also goes by the name Anonymous. They tweet against ISIS and help get Twitter accounts suspended by reporting to Twitter the ISIS accounts that are used for radicalization and recruitment.



One more interesting thing that pops out here is that the count of tweets from Syria is far lower than from many other countries such as Israel and the USA (Washington DC). This suggests that ISIS is not a local problem of Syria but a global one.

Let's look at a few other analyses.

User Activity Plots

This plot tells us how many users are tweeting how much about ISIS. We restricted the graph to a certain tweet count for the sake of simplicity, as plotting beyond that would not lead to any better inference.



From this plot we can infer that only a few users produce most of the content (tweets, in our case), while the rest are not very active and post only rarely: 18,000 users had posted only one tweet. This is consistent with the 'Power Law' pattern, often summed up as the 80/20 rule, in which roughly 20% of users generate most of the data while the remaining 80% mostly view it instead of generating anything new.
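
One way to check this kind of concentration numerically is sketched below. It assumes a list with one user ID per tweet; the function name `activity_summary` and the 20% cut-off parameter are our own choices for illustration.

```python
from collections import Counter

def activity_summary(user_ids, top_fraction=0.2):
    """Summarise how unevenly tweets are spread across users.

    `user_ids` holds one entry per tweet (the author of that tweet).
    Returns (histogram, top_share): the histogram maps a tweet count to the
    number of users with exactly that many tweets, and top_share is the
    fraction of all tweets written by the most active `top_fraction` of users.
    """
    user_ids = list(user_ids)
    if not user_ids:
        return Counter(), 0.0
    per_user = Counter(user_ids)             # tweets per user
    histogram = Counter(per_user.values())   # users per tweet count
    ranked = sorted(per_user.values(), reverse=True)
    top_n = max(1, int(len(ranked) * top_fraction))
    top_share = sum(ranked[:top_n]) / len(user_ids)
    return histogram, top_share
```

Here `histogram[1]` is the number of users with exactly one tweet, and a `top_share` close to 1 means a small group of users writes nearly everything.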

We will go further into this with the next plot.


In this plot we get to know the most active users. There are four highly active users; among the rest, a few are moderately active and the others tweet very little. They may be active in reading, liking, sharing, etc., but they do not post tweets that actively.

Sentiment Score

Here we analysed how active a user is on Twitter, irrespective of the ISIS subject, and what the sentiment of their tweets on the ISIS subject is.

Users who were tweeting against the ISIS
Users who were tweeting pro ISIS

Age Plot

Here we present an analysis by age group: which age groups talk more about 'ISIS'. Whether a user is in favour or against, if they are talking about 'ISIS', we count them.


According to our analysis, people in the 25-34 age group are the most interested in talking about 'ISIS', followed by the 22-24 age group. From this we can infer that young people are keener to engage with global problems.

Gender Plot

While analysing the data, it is important to check whether the topic is of equal interest to males and females, and here we do so for the 'ISIS' subject.



What we found is that the 'ISIS' subject draws less interest from females than from males: among all the users, more than 60% are male and less than 20% are female.


Hashtag Word Cloud




Tweet Word Cloud





User Mention Word Cloud



ML for PRO ISIS Tweets

We want to predict whether a given tweet is in favour of ISIS or not. The data collected from Twitter totals nearly 35 thousand tweets, including retweets. Out of this dataset, we annotated nearly 10 thousand tweets as being in favour of ISIS or not. We then prepared a prediction model consisting of the following layers: an LSTM layer with 64 nodes, a Dense layer with 256 nodes and ReLU activation, a Dropout layer with probability 0.5, and finally a single-node output layer with sigmoid activation. The model is trained with binary cross-entropy as the loss function.

Accuracy: With this model we were able to obtain an accuracy of 91.2%.
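
To make the last layer and the loss concrete, here is a plain-Python sketch of the sigmoid output, the binary cross-entropy loss and an accuracy computation. This is not our training code: the function names are illustrative, and the 0.5 decision threshold used for accuracy is an assumption.

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability in (0, 1), in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of tweets whose thresholded prediction matches the label."""
    hits = sum(1 for y, p in zip(y_true, y_prob) if (p >= threshold) == bool(y))
    return hits / len(y_true)
```

A tweet labelled 1 (pro ISIS) with a predicted probability of 0.9 contributes a small loss, while the same tweet predicted at 0.1 contributes a large one, which is what pushes the network towards confident, correct outputs.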

Link For ML Model

