PDA REPORT

Analysing United States Flight Delays Using MapReduce, Hive and Pig

Vinay Kumar Anamalagundam

x17148146

Data Analytics, School of Computing,

National College of Ireland,

Abstract— Flight delay analysis is extremely helpful for finding flaws and predicting likely delay cases in advance, and it helps airlines make effective decisions from the data available to them. Analysing very large airport records is not easy in conventional analytical tools such as Python or R. A Hadoop distributed environment is better suited to processing massive historical data. This project was implemented in Hadoop to find patterns in the data using MapReduce, Hive and Pig. The data was loaded into HDFS through MySQL, with Sqoop used to transfer it between MySQL and the other data stores. The Hadoop Distributed File System (HDFS) held the data for fast processing by the MapReduce jobs.

Keywords— Hadoop, MapReduce, HBase, Pig, Hive, R,

MySQL, Sqoop.

I. INTRODUCTION

Airlines have to deal with many things to provide services around the world. The most common problem every airline faces is delays, which sometimes impose high additional costs on airline operators. The four causes of airline delays are weather, NAS, carrier and security. The top three occur very often, so airlines have to anticipate every possible case in advance. Resource allocation can also cause conflicts, which affect airlines in many ways, such as lost reputation and lost revenue. According to a survey, the predicted recovery cost exceeds 1.25 billion pennies, which costs airlines around 81 euros per minute of delay. The number of flights has been gradually decreasing every year. According to a survey conducted by Eurocontrol, the number of flights in Europe was 10 million in 2008 and had decreased drastically to 9.5 million by 2012. Eurocontrol expects an increase of 1.2 million flights in 2019. The robustness of schedules has therefore become a primary challenge for airlines. The dataset used here gives worthwhile insights into these delays.

In this report, Section II gives a short overview of related work, Section III describes the methodology, and Section IV explains the results. Finally, the paper ends with a conclusion.

II. RELATED WORK

With the number of flights decreasing because of delays, airlines face the problem of keeping schedules robust and of managing resources to reschedule at unfortunate times, such as during extreme weather conditions and security issues. This has opened up an area of research. Many researchers have proposed approaches to delay estimation for robust scheduling using advanced data mining algorithms. Others have analysed airline delay networks, which provided key insights for managing operators and workers so as to reduce human-caused delays. For regular daily operations, where a decision support system must cope with dynamic and stochastic conditions, daytime trends have been proposed using an analysis of covariance. In-depth research has also modelled historical data, such as cancellation reasons and average arrival and carrier delays, to extract valuable information.

III. METHODOLOGY

This section describes the methodology of the project. The project started from the idea of analysing airlines, for which I found a dataset on the web with about 1.5 million observations and 29 columns. I also scraped supplemental data from a website using RStudio for further analysis. The next major step was preprocessing, which took most of the time: cleaning null and NA values, removing unwanted columns and transforming the data. After preprocessing, I loaded the dataset into a MySQL database and copied it into HDFS. HDFS served as the data source for the MapReduce queries. The output files were then stored in HBase, and the insights were visualized using R.
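The preprocessing step described above can be sketched as follows. This is a minimal Python illustration and not the code used in the project. The column names follow the public 2008 airline on-time file, and the choice of which columns to keep, the missing-value markers and the helper names `clean_rows` and `clean_csv` are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical choices for illustration: the report does not state which
# columns were kept or how missing values were encoded.
KEEP_COLUMNS = ["Month", "UniqueCarrier", "Origin", "Dest", "ArrDelay", "Cancelled"]
NUMERIC_COLUMNS = ["Month", "ArrDelay", "Cancelled"]
MISSING_MARKERS = {"", "NA", "NULL"}


def clean_rows(rows):
    """Yield cleaned records, skipping rows with missing values in kept columns."""
    for row in rows:
        # Remove unwanted columns by keeping only the selected ones.
        record = {name: (row.get(name) or "").strip() for name in KEEP_COLUMNS}
        # Drop rows where any kept column is null or NA.
        if any(value in MISSING_MARKERS for value in record.values()):
            continue
        # Transform numeric fields from text to integers.
        try:
            for name in NUMERIC_COLUMNS:
                record[name] = int(float(record[name]))
        except (ValueError, OverflowError):
            continue  # skip rows whose numeric fields cannot be parsed
        yield record


def clean_csv(text):
    """Parse CSV text with a header row and return the cleaned records as a list."""
    return list(clean_rows(csv.DictReader(io.StringIO(text))))


if __name__ == "__main__":
    # Three made-up rows: one clean, one with an NA delay, one with an
    # empty value in a column that is not kept.
    sample = (
        "Month,UniqueCarrier,Origin,Dest,ArrDelay,Cancelled,TailNum\n"
        "1,WN,IAD,TPA,-14,0,N712SW\n"
        "1,WN,IAD,TPA,NA,1,N772SW\n"
        "2,AA,JFK,LAX,34,0,\n"
    )
    for record in clean_csv(sample):
        print(record)
```

In this sketch a row is dropped only when a kept column is missing, so an empty value in a discarded column (such as the tail number in the third sample row) does not cost a record. The cleaned records could then be written back to CSV for loading into MySQL.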

A. Dataset

The data was collected from the Statistical Computing & Statistical Graphics and World Airport Codes websites. The first dataset, collected from Statistical Computing for 2008, is large, with around 1.5 million rows and 29 columns. The second dataset was scraped and contains 6 columns that supplement the first. The attributes of the two datasets are described as follows: