Evaluation and Analysis of the Impact of Airport Delays

1 Source: http://hadoop.apache.org/ accessed 16 Feb 2018.
2 Source: https://spark.apache.org/docs/2.1.0/index.html accessed 16 Feb 2018.

ABSTRACT

For years, flight delays have had negative effects on passengers, airlines, and airports.

It is now possible to predict whether a flight will be delayed based on the statistics of past flights. This paper focuses on passenger satisfaction, unlike most previous research, which is concerned with airlines and airports. In this work, a new Dynamic Double Delay Flight Predicting Web (3DFPW) model is created to give a passenger the predicted delay status, and its probability, at the origin and destination airports for a given airline through a website, even before booking a ticket. Most previous studies focused on either departure delay or arrival delay only; this work addresses both at the same time. Spark is used as a cluster computing layer over a Hadoop cluster and is driven through the SparkR library from R.

This work answers two questions. First, which classification algorithm in SparkR MLlib is the best to use? Second, which SparkR caching level gives the best performance and robustness, and why?

Keywords: Machine Learning, Big Data, SparkR, Caching, Flight Delay, R, Classification, Prediction, Naïve Bayes.

1- INTRODUCTION

Delays in air travel can be very expensive for both passengers and airlines. While many delays are due to weather or mechanical failures and are unpredictable, it may be possible to predict that a flight will be delayed based on the statistics of past flights. Flight delays have adverse effects, especially economic ones, on passengers, airlines, and airports. Delay estimates can improve the tactical and operational decisions of airport and airline managers and can warn passengers so that they can rearrange their plans 1.

The passenger pays for the flight and is therefore the client: a passenger who is dissatisfied with an airport or airline will stop using it and will fly only with those that satisfy him. Airlines and airports therefore compete to offer flights that satisfy passengers, which is why this paper focuses on passenger satisfaction, whereas most previous research is concerned with airlines and airports. A new Dynamic Double Delay Flight Predicting Web (3DFPW) model is proposed to let a passenger know the predicted status of a flight, and its probability, before booking a ticket. Most previous studies focused on either departure delay or arrival delay only. This work covers both at the same time to give the passenger full information about delays at the origin and destination airports. The 3DFPW model is built on big data machine learning prediction algorithms, so it needs to learn from many years of historical flights to make good predictions.

Therefore, a Hadoop cluster is used to store the big data and to support fast prediction. Spark is used as a cluster computing layer over Hadoop and is driven through the SparkR library from RStudio. SparkR is a distributed system. It is simpler than Hadoop and easier to read.

The algorithms built on this system are fast and scalable because the data is loaded into Spark memory. SparkR can run faster on projects with large data files that require parallel solutions 2. Implementing the 3DFPW model on a commodity cluster requires answering two questions. First, which classification algorithm in SparkR MLlib is the best to use? Second, which SparkR caching level gives the best performance and robustness, and why? This work answers these questions.

The rest of this paper is organized as follows. Section 2 gives background on the technologies used in the 3DFPW model. Section 3 briefly reviews related work on flight delay. Section 4 presents the methods and design. Section 5 presents results and discussion. Section 6 presents the conclusion and future work.

2- BACKGROUND

Machine learning is a field of research that explores the development of algorithms that can learn from data and make predictions based on it. Work on flight systems increasingly uses machine learning methods 1. Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.1 Apache Spark is a fast and general cluster computing system.

It offers high-level APIs in Java, Scala, Python, and R, as well as an optimized engine that supports general execution graphs. It also supports a number of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.2

R is an open source programming language and software environment widely used for statistical computation in data-intensive roles such as data mining and statistics.3 RStudio is an integrated development environment (IDE) for R. It includes a console that supports direct code execution, a syntax-highlighting editor, and tools for plotting, history, debugging, and workspace management.4 SparkR is an R package that provides a lightweight front end for using Apache Spark from R.

3 Source: https://www.r-project.org/about.html accessed 16 Feb 2018.
4 Source: https://www.rstudio.com/ accessed 16 Feb 2018.

Best Caching Storage Technique Using Sparkr for Big Data
Ahmed Elsayed, College of Computing and Information Technology, AAST, Egypt
Prof. Dr. Mohamed Shaheen, College of Computing and Information Technology, AAST, Egypt
Prof. Dr. Osama Badawy, College of Computing and Information Technology, AAST, Egypt

SparkR provides a distributed data frame implementation that supports operations such as selection, filtering, and aggregation (similar to R data frames and dplyr) but on large datasets. SparkR also supports distributed machine learning using MLlib.5 A SparkDataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood.

SparkDataFrames can be constructed from a wide array of sources, such as structured data files, Hive tables, external databases, or existing local R data frames.5 Shiny is an R package that makes it easy to build interactive web applications directly from R. Applications can be hosted standalone on a web page, embedded in R Markdown documents, or used to build dashboards, and they can be extended with CSS themes, HTML widgets, and JavaScript actions.4 Apache Zeppelin is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, and more.6

3- RELATED WORK

Flight delays impose significant costs on passengers, airlines, and society. Such high costs motivate the analysis and prediction of air traffic delays and the development of better mechanisms for handling delays. Predicting flight delays has been the topic of several previous efforts. In 2017, Sternberg, Soares, Carvalho & Ogasawara developed a taxonomy and classified flight delay prediction models by their detailed components, based on previous research.

That work contributes an analysis of these models from a data science perspective and is based on arrival delay 1. In 2003, Mazzeo examined the hypothesis that the market power enjoyed by dominant airlines allows them to provide lower service quality in the form of more flight delays, based on arrival delay 3. In 2017, Yi Ding performed regression and ordinal classification tasks using multiple linear regression to predict delay, implemented the model, and compared it with the Naïve Bayes and C4.5 approaches, based on arrival delay 4.

In 2016, Ugwu, Ntuk & Ekaete observed that the airline carrier had the highest impact on predicting on-time and delayed flight status. Their research therefore aimed to predict on-time and delayed status using the interpretability of a decision tree model for flight delays; the accuracy of the system is 74.3%, based on departure delay 5. In 2006, Tu, Ball & Jank estimated flight departure delay distributions, focusing exclusively on downstream delays caused by factors such as weather conditions and estimates of airport surface congestion, among others. Specifically, they built a model that responds to changes in real-time parameter measurements, based on departure delay 6.

In 2017, van Montfort & van den Berg used two measures of delay: the delay in minutes beyond schedule and whether the delay exceeded 15 minutes. Their results suggest that the larger an airline's nationwide size, the shorter and less frequent its delays. This result seems robust to the choice of specification, controls, and variable set-up. Larger airlines have more resources, and the efficient use of these may decrease delays, based on arrival delay 7. In 2017, Cole & Donoghue trained a logistic regression model to predict whether a flight will be delayed by more than 15 minutes, based on departure delay 8.

In 2016, Venkataraman et al. found results in line with previous studies that measured the importance of caching in Spark: the benefits come not only from using faster storage media but also from avoiding the CPU time spent decompressing data and parsing CSV files. Caching helps achieve the low latencies that make SparkR suitable for interactive query processing from the R shell, and caching the data can improve performance by 10x to 30x for this workload 9.

4- METHODS AND DESIGN

4.1- System Components

Hadoop Cluster Specs (version 2.6 on one namenode and 5 datanodes):
1 Master: Processor: AMD Phenom(TM) 8600B, Cores: 3, Memory: 8 GB, Hard disk: 120 GB, Network card: Gigabit, OS: Linux (Ubuntu 14), System type: 64-bit.
5 Slaves: Processor: Intel Core 2 Duo CPU E8400 3.00GHz, Cores: 2, Memory: 4 GB, Hard disk: 40 GB, Network card: Gigabit, OS: Linux (Ubuntu 14), System type: 64-bit.
6 Machines: connected together on 1 Gigabit switch, speed approximately 600 Mbit/s.

Spark Cluster Specs: 1 driver and 6 workers on 6 machines with 13 cores in total; memory: 19.6 GB total, 14.9 GB used; Spark master at spark://hdmaster:7077; Spark version 2.1.0 installed on the same machines as the Hadoop cluster, in standalone mode.

Dataset: The data were obtained from the Bureau of Transportation Statistics, a federal agency of the United States of America.7 The dataset consists of records of all US domestic flights of major carriers, the airline on-time performance dataset, downloaded as CSV files. It contains details of the arrival and departure of all commercial flights in the US from October 1987 to April 2008.

This is an extensive dataset: nearly 123 million records in total and 12 gigabytes of uncompressed data.8

Variable descriptions (29 variables):
Year: 1987-2008
Month: 1-12
DayofMonth: 1-31
DayOfWeek: 1 (Monday) to 7 (Sunday)
DepTime: actual departure time
CRSDepTime: scheduled departure time
ArrTime: actual arrival time
CRSArrTime: scheduled arrival time
UniqueCarrier: unique carrier code (lookup CSV 7)
FlightNum: flight number
TailNum: plane tail number
ActualElapsedTime: in minutes
CRSElapsedTime: in minutes
AirTime: in minutes
ArrDelay: arrival delay, in minutes
DepDelay: departure delay, in minutes
Origin: origin IATA airport code (lookup CSV 7)
Dest: destination IATA airport code (lookup CSV 7)
Distance: in miles
TaxiIn: taxi-in time, in minutes
TaxiOut: taxi-out time, in minutes
Cancelled: was the flight cancelled?
CancellationCode: reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
Diverted: 1 = yes, 0 = no
CarrierDelay: in minutes
WeatherDelay: in minutes
NASDelay: in minutes
SecurityDelay: in minutes
LateAircraftDelay: in minutes 8

5 Source: https://spark.apache.org/docs/2.1.0/sparkr.html accessed 16 Feb 2018.
6 Source: https://zeppelin.apache.org/ accessed 16 Feb 2018.

Figure 1: 3DFPW model

4.2- Dataset Preparing and Preprocessing

SparkR initiating: SparkR is initialized by loading the SparkR library, specifying the Spark cluster IP and port, and starting a new session from R and RStudio.
Reading dataset: a CSV file is read from the Hadoop cluster and converted to a SparkR DataFrame whose partitions are distributed across all Spark cluster machines and cores; the full dataset has 123534969 rows and is 12 gigabytes.
Preprocessing: SparkR SQL makes it easy to preprocess and prepare the dataset. This work focuses on both departure delay and arrival delay.

The dataset contains many attributes, some of which are irrelevant; the irrelevant attributes were pruned during extensive preprocessing. The resulting data was partitioned into training and test sets.

SQL select statement: using a select statement, new columns were created from the dataset to make the data more meaningful to an ordinary passenger who wants to know whether a chosen flight will be on time or delayed, as follows.
Month: a Month column was created from the Month column, with the months converted into nominal names (Jan, Feb, Mar, etc.).
WeekDay: a WeekDay column was created from the DayOfWeek column, with the day numbers converted into short day names (1='Mo', 2='Tu', etc.).
UniqueCarrier: taken from the dataset; it is a unique code for each airline company.

Origin: origin airport code.
Dest: destination airport code.
CRSDepTime: a CRSDepTime column was created from the dataset's CRSDepTime column, with the times grouped into three short meaningful names: times between 0001 and 1159 into Morning ('MO_01_to_12'), times between 1200 and 1759 into Afternoon ('AN_12_to_18'), and times between 1800 and 2359 into Night ('NI_18_to_24').9 Canceled flights have no actual DepTime, so CRSDepTime (the scheduled departure time) was used instead of DepTime.
CRSArrTime: treated the same way as CRSDepTime.
Canceled (0/1): canceled flights were considered delayed flights.
Class: the dataset has no class column, so a class was built according to the U.S. Department of Transportation Federal Aviation Administration (FAA) Air Traffic Organization policy on delays to instrument flight rules (IFR) traffic: airborne delays are reported for all aircraft which incur a delay of 15 minutes or more.11
Ontime binary class: 'yes' if the departure delay is less than 15 minutes; 'no' if the departure delay is 15 minutes or more or the flight is canceled.
Criteria: records with a wrong CRSDepTime were discarded; only rows with CRSDepTime less than 2401 were selected, and the same rule was applied to CRSArrTime.

4.3- Dynamic Double Delay Flight Predicting Web (3DFPW) model

Criteria: the selected range of years is 1989 to 2008 (19 years), nearly 112M rows.
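The column derivations and the 15-minute class rule of Section 4.2 can be illustrated with a short sketch. It is written in plain Python rather than SparkR SQL, so it is not the authors' select statement, only a minimal illustration of the same rules. The function names (time_bucket, ontime_label, prepare_row), the weekday codes after 'Tu', and the choice to leave times outside the three stated ranges (such as 0000 or 2400) unassigned are assumptions made for this example.

```python
from typing import Optional

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# 1 = Monday ... 7 = Sunday; only 'Mo' and 'Tu' are given in the text,
# the remaining codes are assumed.
WEEKDAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]


def time_bucket(hhmm: int) -> Optional[str]:
    """Map a scheduled time in hhmm form to one of the three named buckets."""
    if 1 <= hhmm <= 1159:
        return "MO_01_to_12"
    if 1200 <= hhmm <= 1759:
        return "AN_12_to_18"
    if 1800 <= hhmm <= 2359:
        return "NI_18_to_24"
    return None  # assumption: values outside the stated ranges stay unassigned


def ontime_label(delay_minutes: Optional[float], cancelled: int) -> str:
    """Return 'no' for cancelled flights or delays of 15 minutes or more."""
    if cancelled == 1:
        return "no"
    return "no" if delay_minutes >= 15 else "yes"


def prepare_row(row: dict) -> Optional[dict]:
    """Derive the model columns for one flight record, or None if filtered out."""
    if row["CRSDepTime"] >= 2401:  # criteria: keep only CRSDepTime below 2401
        return None
    return {
        "Month": MONTHS[row["Month"] - 1],
        "WeekDay": WEEKDAYS[row["DayOfWeek"] - 1],
        "UniqueCarrier": row["UniqueCarrier"],
        "Origin": row["Origin"],
        "Dest": row["Dest"],
        "Cancelled": row["Cancelled"],
        "CRSDepTime": time_bucket(row["CRSDepTime"]),
        "ontime": ontime_label(row.get("DepDelay"), row["Cancelled"]),
    }
```

The arrival-delay iteration would apply the same bucketing to CRSArrTime and the same 15-minute rule to ArrDelay.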

Splitting Dataset: after preprocessing, the dataset resulting from the SQL select statement was separated into training and test sets: 80% for training (89585408 rows) and 20% for testing (22390987 rows). The training and test sets are cached as Spark DataFrames on the cluster.
Naïve Bayes Algorithm: the SparkR Naïve Bayes model was run on the training set with the ontime column as the class, using the columns (Month, WeekDay, UniqueCarrier, Origin, Dest, Cancelled, and CRSDepTime) for departure delay in iteration 1 and (Month, WeekDay, UniqueCarrier, Origin, Dest, Cancelled, and CRSArrTime) for arrival delay in iteration 2. Both iterations use the same select statement and the same criteria, and each was run from beginning to end independently of the other.
Prediction: after learning from the training set, prediction was run on the test set to obtain the class prediction for arbitrary combinations of the feature columns, again for each iteration separately.
Confusion matrix: the R confusion matrix library is compatible only with a local R data frame, which lives on a single machine; it cannot work with a Spark DataFrame on the cluster and cannot read data larger than the machine's RAM. Therefore, in this work a confusion matrix was written in R to read from big data and produce results such as accuracy, recall, precision, and F-score, based on related research 2.

7 Source: https://www.transtats.bts.gov/Fields.asp?Table_ID=236 accessed 17 Feb 2018.
8 Source: http://stat-computing.org/dataexpo/2009/the-data.html accessed 17 Feb 2018.
9 Source: https://www.fluentu.com/blog/english/how-to-tell-time-inenglish/ accessed 17 Feb 2018.
10 Source: http://spark.apache.org/docs/latest/rdd-programmingguide.html accessed 19 Feb 2018.
11 Source: https://www.faa.gov/documentlibrary/media/order/7210.55fbasic.pdf accessed 17 Feb 2018.
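A confusion matrix over data too large for one machine reduces to counting (actual, predicted) label pairs, which can be done in a single pass or as a grouped count on the cluster, and then deriving the metrics from those counts. The sketch below shows that arithmetic in plain Python. It is not the authors' R code; the function names and the choice of 'yes' as the positive class are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def confusion_counts(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count (actual, predicted) label pairs in one streaming pass."""
    return Counter(pairs)


def metrics(counts: Counter, positive: str = "yes") -> Dict[str, float]:
    """Derive accuracy, precision, recall and F-score from the pair counts."""
    tp = counts[(positive, positive)]
    fp = sum(n for (a, p), n in counts.items() if a != positive and p == positive)
    fn = sum(n for (a, p), n in counts.items() if a == positive and p != positive)
    total = sum(counts.values())
    correct = sum(n for (a, p), n in counts.items() if a == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {
        "accuracy": correct / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }
```

Because only the pair counts are kept, memory use is independent of the number of rows, which is the property the single-machine R library lacks.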

Shiny: as shown in Figure 1, interacting online with passengers requires a website that can work dynamically with R and SparkR machine learning to achieve the goal of the 3DFPW big data model; Shiny was used for this. The Shiny file has two sections, UI and server. The output of the model's select statement is used as the dataset. The Naïve Bayes algorithm was run with the full dataset as the training set, without splitting it into training and test sets.
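To make concrete what training Naïve Bayes on categorical flight columns and scoring one new combination of values involves, here is a minimal sketch in plain Python. It is a generic categorical Naïve Bayes with add-one smoothing, not SparkR's implementation and not the authors' code; the function names, the smoothing value, and the toy rows at the bottom are all invented for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def train_nb(rows: List[Dict[str, str]], label_col: str, smoothing: float = 1.0) -> dict:
    """Collect class counts and per-class value counts for each feature."""
    class_counts = Counter(r[label_col] for r in rows)
    features = [c for c in rows[0] if c != label_col]
    value_counts = defaultdict(Counter)  # (feature, class) -> value counts
    vocab = defaultdict(set)             # feature -> distinct values seen
    for r in rows:
        for f in features:
            value_counts[(f, r[label_col])][r[f]] += 1
            vocab[f].add(r[f])
    return {"classes": class_counts, "features": features, "counts": value_counts,
            "vocab": vocab, "smoothing": smoothing, "n": len(rows)}


def predict_proba(model: dict, row: Dict[str, str]) -> Dict[str, float]:
    """Return the posterior probability of each class for one input row."""
    log_scores = {}
    for c, n_c in model["classes"].items():
        score = math.log(n_c / model["n"])  # class prior
        for f in model["features"]:
            seen = model["counts"][(f, c)][row[f]]
            k = len(model["vocab"][f])
            score += math.log((seen + model["smoothing"])
                              / (n_c + model["smoothing"] * k))
        log_scores[c] = score
    top = max(log_scores.values())       # subtract the max for numerical stability
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


# Toy rows, invented for illustration only (not taken from the dataset).
toy_rows = [
    {"UniqueCarrier": "AA", "CRSDepTime": "MO_01_to_12", "ontime": "yes"},
    {"UniqueCarrier": "AA", "CRSDepTime": "MO_01_to_12", "ontime": "yes"},
    {"UniqueCarrier": "AA", "CRSDepTime": "NI_18_to_24", "ontime": "yes"},
    {"UniqueCarrier": "UA", "CRSDepTime": "NI_18_to_24", "ontime": "no"},
    {"UniqueCarrier": "UA", "CRSDepTime": "NI_18_to_24", "ontime": "no"},
]
toy_model = train_nb(toy_rows, "ontime")
print(predict_proba(toy_model, {"UniqueCarrier": "AA", "CRSDepTime": "MO_01_to_12"}))
```

Training needs only counts, which is one reason a model of this kind can be fitted in a single pass over a very large table and then queried cheaply for one row at a time.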

The DepDelay iteration and the ArrDelay iteration were both run, in that order. Likewise, both predictions are made in the server section from the input data entered by the passenger in the UI section. The input data serves as the test set for both predictions.

4.4- Classification Algorithms Comparison

To choose the best classification algorithm for implementing the 3DFPW model, three classification algorithms from the standard SparkR MLlib were tested and compared: Naïve Bayes (NB), Random Forest (RF), and Gradient Boosted Tree (GBT). Accuracy was also compared with related research 5 to further validate the process.
Criteria: January 2004 instances were selected (583944 rows).

Splitting Dataset: the output of the same SQL select statement used in the 3DFPW model was separated into training and test sets: 70% for training (407761 rows) and 30% for testing (176183 rows). The training and test sets are cached as Spark DataFrames on the cluster.
Columns: the three classification algorithms were run on the training set with the ontime column as the class, using the columns (ontime, Month, WeekDay, UniqueCarrier, Origin, Dest, Cancelled, and CRSDepTime).
Related research 5: the author of that work used the same criteria and the same columns as this paper's model, with a C4.5 algorithm.

4.5- Persisting Performance Evaluation

One of the most important capabilities in Spark is persisting (caching) a dataset in memory across operations.

When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use. You can mark an RDD to be persisted using the persist() or cache() methods. The first time it is computed in an action, it is kept in memory on the nodes.
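The compute-once, reuse-later behaviour described above can be mimicked with a toy model in plain Python. This is not Spark's API: the ToyDataset class, its methods, and the computations counter are hypothetical names introduced only to show how persisting avoids recomputation across actions, and how a lost partition can be rebuilt from the function that produced it.

```python
from typing import Callable, Dict, List


class ToyDataset:
    """A toy stand-in for a partitioned dataset; not Spark's API."""

    def __init__(self, n_partitions: int, compute: Callable[[int], List[int]]):
        self.n_partitions = n_partitions
        self._compute = compute       # how to (re)build partition i
        self._persisted = False
        self._store: Dict[int, List[int]] = {}
        self.computations = 0         # number of partition computations run

    def persist(self) -> "ToyDataset":
        """Mark the dataset so computed partitions are kept for reuse."""
        self._persisted = True
        return self

    def _partition(self, i: int) -> List[int]:
        if i in self._store:          # reuse a stored partition
            return self._store[i]
        self.computations += 1
        data = self._compute(i)
        if self._persisted:           # stored on first computation only
            self._store[i] = data
        return data

    def lose_partition(self, i: int) -> None:
        """Simulate losing a stored partition, e.g. when a node dies."""
        self._store.pop(i, None)

    def count(self) -> int:           # an "action"
        return sum(len(self._partition(i)) for i in range(self.n_partitions))

    def total(self) -> int:           # another "action"
        return sum(sum(self._partition(i)) for i in range(self.n_partitions))
```

In this toy, two actions on a persisted dataset compute each partition once, while the same two actions on an unpersisted one compute every partition twice; after lose_partition, only the missing partition is recomputed.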

Spark's cache is fault-tolerant: if any partition of an RDD is lost, it is automatically recomputed using the transformations that originally created it.10 The caching storage levels are Memory_Only, Memory_And_Disk, Disk_Only, Memory_Only_Ser, and Memory_And_Disk_Ser.10 To choose the best caching level in both the best case (fully functional Hadoop and Spark clusters) and the worst case (a low number of Spark cores or dead cores), the caching storage levels had to be tested.
Performance evaluation: the test measured the processing time of the NB algorithm on all columns of the dataset (29 variables) and over the whole range of years (21 years), in order to maximize the load on the algorithm. Each caching level was tested individually in five stages: the first stage runs 6 Hadoop datanodes and 13 Spark cores (executors), the second stage runs 5 Hadoop datanodes and 11 Spark cores, and so on until the last stage with 2 Hadoop datanodes and 5 Spark cores. The cause was then identified by decreasing or increasing the number of nodes and the number of rows for a variety of machine learning algorithms.

5- RESULTS AND DISCUSSIONS

As shown in Figures 2 and 3, some airlines were selected as samples to compare the class label with the predicted class and illustrate the difference. As a result of the 3DFPW model, and as shown in Tables 1 and 2, the true positives for DepDelay are better than the true positives for ArrDelay.

Figure 2: Actual flight status against the carriers
Figure 3: Prediction flight status against the carriers
Table 1: DepDelay
Table 2: ArrDelay
Table 3: the test for DepDelay and ArrDelay prediction

As shown in Table 3, the model time in both iterations is nearly 2.5 minutes; however, the accuracy for DepDelay (82%) is better than the accuracy for ArrDelay (78%). Likewise, precision and F-score are better for DepDelay. The a priori probabilities for the DepDelay iteration are 0.82 for on-time and 0.18 for delayed, and for the ArrDelay iteration 0.78 for on-time and 0.22 for delayed. The whole dataset is used as the training set to take advantage of all available knowledge and to increase the chance of correctly predicting the incoming data from the passenger, which takes the form of a single row; this row is treated as the test set for the prediction process in both iterations. Prediction time is almost the same in the two iterations.

Consequently, the results of iteration 1 are better. Once the passenger enters a combination of flight data and presses the 'predict ontime status' button, the status (on-time or delayed) and its probability are displayed in the browser for both delays, as shown in Figure 4.

In answer to the first question, and as shown in Table 4, the NB model has the shortest time (8 seconds) and the highest accuracy (79.8%). RF and GBT have the same accuracy (79.6%) and a higher precision (79.3%). RF takes 10 times as long as NB. GBT is the slowest model at 505 seconds, 63 times as long as NB. When a full range of years, or even more than two years, was selected for RF and GBT on the current cluster hardware configuration, the algorithms could not complete and threw errors about connections and executors.

However, NB handled a range of 19 years well. The SparkR NB algorithm has better accuracy (79.8%) than the C4.5 algorithm (74.3%) of the related work 5. Prediction time is almost the same in all tests and is calculated for one row only. Therefore, NB is the best algorithm to use in the 3DFPW model because of its accuracy and its time.

In answer to the second question, Table 5 shows the results of running the Naive Bayes algorithm over the full dataset (123M rows). In the first stage, Memory_Only, Disk_Only, and Memory_And_Disk take almost the same time, while Memory_Only_Ser, Memory_And_Disk_Ser, and Uncached take almost the same time as one another, 3.3 times as long as the first three caching levels. Memory_Only_Ser and Memory_And_Disk_Ser were therefore dropped from further tests as caching levels, because their time is almost the same as the uncached time. In the second stage, Memory_Only, Disk_Only, and Memory_And_Disk again take almost the same time, and the uncached time is 3 times as long. In the third stage, Memory_Only and Memory_And_Disk take almost the same time, but they take 1.9 times as long as Disk_Only and 2.6 times as long as they did in the second stage, whereas Disk_Only takes just 1.4 times as long as it did in the second stage. This indicates an overload when using Memory_Only and Memory_And_Disk, because processing the dataset reached the memory limit of the cluster. The uncached time is 1.4 times that of Memory_Only and Memory_And_Disk and 2.6 times that of Disk_Only.

In the fourth stage, Memory_Only was no longer available: it exceeded the cluster memory limit, raised errors, and did not complete the processing. Memory_And_Disk takes 2.2 times as long as Disk_Only and 1.5 times its own third-stage time; Disk_Only takes 1.3 times its third-stage time; and the uncached time is 1.2 times that of Memory_And_Disk and 2.7 times that of Disk_Only. In the fifth stage, Memory_And_Disk also became unavailable, as Memory_Only had in the fourth stage, and the uncached run did not complete either. As shown in Figure 6, the only caching level that survived the worst case was Disk_Only. Consequently, Disk_Only had the best time in all stages and is the best caching level to use in the 3DFPW model, giving the best performance and robustness. The next step was to find out why Disk_Only was the best caching storage when the Naive Bayes algorithm reached the overload limit.

Figure 4: 3DFPW model webpage
Table 4: Performance classification comparison
Table 5: Persisting performance comparison (time in minutes)
Figure 6: Persisting performance comparison (time in minutes)

A test on a variety of other algorithms was therefore needed. Running ML algorithms on halves and quarters of the dataset showed that if the portion of the dataset does not overload the ML algorithm and cluster configuration, the three caching storages (memory, memory and disk, disk) take the same time. The ML algorithms therefore had to be run at the overload limit, defined as the point at which all three caching storages (memory, memory and disk, disk) still complete successfully, without any failure, while taking their longest time.

This means that each algorithm had to be run many times, decreasing or increasing the number of nodes and the number of rows, to reach its overload limit. The results, shown in Table 6, are as follows.
Logit reached the overload limit on 6 nodes with 43.8M rows, above which it fails; Memory_Only has the best caching storage time.
Naive Bayes reached the overload limit on 4 nodes with 123.4M rows; Disk_Only has the best caching storage time.
Random forest reached the overload limit on 6 nodes with 3.5M rows, above which it fails; Memory_And_Disk has the worst caching storage time.
K-means reached the overload limit on 5 nodes with 61.7M rows, above which it fails; the three caching storages take almost the same time.
Consequently, the best caching storage depends on the ML technique and how it accesses the data when it is overloaded.

Some limitations of SparkR version 2.1.0 were observed during the experiments in this paper. For instance, R's confusion matrix does not support Spark DataFrames, so it was implemented programmatically instead. The ggplot2 library does not support Spark DataFrames, so Apache Zeppelin was used to make the bar charts. Likewise, the plot library does not support Spark DataFrames. Some well-known classification algorithms, such as C4.5 (decision tree), are not supported in SparkR, although they are supported in PySpark and Scala.

6- CONCLUSION

Flight delays are a hot topic for passengers. This research introduces a model that presents the predicted departure delay and arrival delay status to the passenger at the same time through a website, unlike most previous studies, which focused on either departure delay or arrival delay only. The experiments in this paper showed that predicting departure delay is more accurate than predicting arrival delay, although both are used in the 3DFPW model. After testing RF, GBT, and NB, and comparing with the C4.5 results of related research, the NB classification algorithm was the best in SparkR MLlib. Disk_Only had the best time and robustness in all test stages of the Naive Bayes algorithm and is the best caching level to use in the 3DFPW model for best performance. Pushing a variety of machine learning algorithms to their overload limits, in order to understand why Disk_Only is the best caching storage for Naive Bayes, showed that the best caching storage depends on the ML technique and how it accesses the data when it is overloaded.

In future work, offering the passenger alternatives drawn from the top ten on-time carriers and airports will be considered, as will using Spark from PySpark instead of SparkR for greater efficiency, flexibility, and wider use.

REFERENCES

1 Alice Sternberg, Jorge Soares, Diego Carvalho, Eduardo Ogasawara. A Review on Flight Delay Prediction. arXiv:1703.06118v1 cs.CY. 2017.
2 Udeh Tochukwu Livinus, Rachid Chelouah, and Houcine Senoussi. Recommender System in Big Data Environment. IJCSI, ISSN (Online): 1694-0784. 2016.
3 Michael J. Mazzeo. Competition and Service Quality in the U.S. Airline Industry. Kluwer Academic Publishers. 2003.
4 Yi Ding. Predicting flight delay based on multiple linear regression. Earth and Environmental Science. 10.1088/1755-1315/81/1/012198. 2017.
5 C. Ugwu, Ntuk, Ekaete. Dynamic Decision Tree Based Ensembled Learning Model to Forecast Flight Status. European Centre for Research Training and Development UK. Vol.4, No.6, pp.15-24. 2016.
6 Yufeng Tu, Michael Ball, Wolfgang Jank. Estimating Flight Departure Delay Distributions: A Statistical Approach With Long-term Trend and Short-term Pattern. Robert H. Smith School Research Paper No. RHS 06-034. 2006.
7 Joep van Montfort & Vincent A.C. van den Berg. The total size of an airline and the quality of its flights. 2017.
8 Scott Cole, Thomas Donoghue. Predicting departure delays of US domestic flights. 2017.
9 Shivaram Venkataraman, Zongheng Yang, Davies Liu, Eric Liang, Hossein Falaki, Xiangrui Meng, Reynold Xin, Ali Ghodsi, Michael Franklin, Ion Stoica, Matei Zaharia (AMPLab UC Berkeley, Databricks Inc., MIT CSAIL). SparkR: Scaling R Programs with Spark. SIGMOD, San Francisco, CA, USA. ACM. ISBN 978-1-4503-3531-7/16/06. 2016.

Table 6: Persisting performance comparison in overload status
