Analysing United States Flight Delays Using
MapReduce, Hive and Pig
Vinay Kumar Anamalagundam
x17148146
Data Analytics, School of Computing,
National College of Ireland,
Abstract— Flight delay analysis is extremely helpful in
finding flaws and predicting likely delay cases in
advance. It helps airlines make effective decisions by
processing the available data. Analysing huge volumes of
airport records is not easy in analytical tools such as
Python and R. A Hadoop distributed environment is better
suited to processing massive historical data. This project
was implemented in Hadoop to find patterns in big data
using MapReduce, Hive and Pig. The data was first loaded
into MySQL, and Sqoop was used to transfer it between
MySQL and the Hadoop Distributed File System (HDFS).
HDFS then served as the data source for fast MapReduce
processing.
Keywords— Hadoop, MapReduce, HBase, Pig, Hive, R,
MySQL, Sqoop.
I. INTRODUCTION
Airlines have to manage many things to provide services
around the world. One major problem common to every airline
is delays, which can impose high additional costs on airline
operators. The four causes of airline delays are weather,
NAS, carrier and security. The top three of these occur very
often, so airlines have to anticipate every possible case in
advance. Resource allocation can also cause conflicts that
affect airlines in many ways, such as lost reputation and
revenue. According to the survey, the predicted recovery cost
exceeds 1.25 billion pennies, which costs airlines around
81 Euros per minute of delay. The number of flights has been
gradually decreasing every year. According to a survey
conducted by Eurocontrol, about 10 million flights took place
in Europe in 2008, and by 2012 the number had fallen to
9.5 million. Eurocontrol expects an increase of 1.2 million
flights in 2019. Schedule robustness has therefore become a
primary challenge for airlines. The dataset used here offers
worthwhile insights for analysing delays.
In this report, Section II gives a short overview of
related work. Section III describes the methodology, and the
results are explained in Section IV. Finally, the paper ends
with a conclusion.
II. RELATED WORK
With the number of flights decreasing because of delays,
airlines face the challenge of keeping schedules robust and
of managing resources to reschedule flights at unfortunate
times, such as during extreme weather conditions and security
issues. This has opened up an area of research. Many
researchers have proposed approaches to delay estimation for
robust scheduling using advanced data mining algorithms.
Some researchers have also analysed airline delay networks,
which provided key insights for managing operators and
workers so as to eliminate delays caused by human factors.
For day-to-day operational analysis, where handling dynamic
and stochastic behaviour is the key challenge for decision
support systems, daytime trends have been proposed using an
analysis of covariance. In-depth research has also modelled
historical data, such as cancellation reasons, average
arrival delays and carrier delays, to extract valuable
information.
III. METHODOLOGY
This section discusses the methodology of the project. The
project started from the idea of analysing airlines, and the
dataset, found on the website, has about 1.5 million
observations and 29 columns. I also scraped supplemental data
from a website using RStudio for further analysis. The next
big step was preprocessing, which took most of the time:
cleaning null and NA values, removing unwanted columns and
transforming the data. After preprocessing, I moved the
dataset into a MySQL database and copied it into HDFS. HDFS
served as the data source for the MapReduce queries. Later,
the output files were stored in HBase and the insights were
visualized using R.
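The preprocessing and aggregation steps described above can be
illustrated with a minimal Python sketch. It is not the code used
in this project: it runs on a single machine rather than on
Hadoop, and the column names (UniqueCarrier, Origin, ArrDelay),
the set of missing-value markers and the sample rows are all
assumptions made for illustration. It drops unwanted columns,
skips rows with missing values, and then averages arrival delay
per carrier in a map step followed by a reduce step.

```python
import csv
import io
from collections import defaultdict

# Assumed column names and missing-value markers (illustrative only).
KEEP_COLUMNS = ["UniqueCarrier", "Origin", "ArrDelay"]
MISSING = {"", "NA", "NULL"}


def clean_rows(lines):
    """Yield rows restricted to KEEP_COLUMNS, skipping rows with missing values."""
    for row in csv.DictReader(lines):
        kept = {name: (row.get(name) or "").strip() for name in KEEP_COLUMNS}
        if any(value in MISSING for value in kept.values()):
            continue
        yield kept


def map_delay(row):
    """Map step: emit a (carrier, arrival delay in minutes) pair."""
    return row["UniqueCarrier"], float(row["ArrDelay"])


def reduce_mean(pairs):
    """Reduce step: average the emitted values for each key."""
    totals = defaultdict(lambda: [0.0, 0])
    for key, value in pairs:
        totals[key][0] += value
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def average_delay_by_carrier(text):
    """Clean the CSV text, then run the map and reduce steps over it."""
    rows = clean_rows(io.StringIO(text))
    return reduce_mean(map_delay(row) for row in rows)


# Made-up sample rows; the second row has a missing ArrDelay and is dropped.
SAMPLE = """UniqueCarrier,Origin,Dest,ArrDelay
WN,IAD,TPA,10
WN,IND,BWI,NA
AA,JFK,LAX,40
AA,JFK,SFO,-10
"""

if __name__ == "__main__":
    print(average_delay_by_carrier(SAMPLE))
```

On the sample rows, average_delay_by_carrier(SAMPLE) returns
{"WN": 10.0, "AA": 15.0}. In a Hadoop job the map and reduce
functions would run on separate nodes over HDFS blocks; here they
are chained in one process only to show the data flow.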
A. Dataset
The data was collected from the Statistical Computing &
Statistical Graphics and World Airport Codes websites. The
first dataset, collected from Statistical Computing for 2008,
is large, with around 1.5 million rows and 29 columns. The
second dataset was scraped and contains 6 columns of data
that supplement the first. The attributes present in these
two datasets are described as follows: