Walking On Air: How Machine Learning Can Help Make Your Next Flight Happier

Armand Sarkisian
6 min read · May 8, 2021
Photo by Artur Tumasjan on Unsplash

When it comes to flying, there are seemingly infinite ways for a trip to leave us feeling dissatisfied. From crying babies to delays, lost bags to rude gate attendants, the variables that make up our complete flying experience are myriad, and any combination of bad factors can lead to an ultimately uncomfortable journey.

But what if there were a way to quantify the totality of our experience and learn which variables factor most heavily into customer dissatisfaction? This is a question that airlines have been asking for decades as they continue their quest for customer satisfaction and, ultimately, loyalty.

To answer this, I analyzed a dataset of over 130,000 customer interviews that were taken upon the passenger’s arrival at the gate. These interviews contained information such as the length of any delays, customer gender and age, as well as how they rated aspects of their flight such as cleanliness and seat comfort. At the end of the interview, customers were asked a simple yes-or-no question: were you overall satisfied with your flight?

While I have already done some preliminary exploration with this dataset, in this article I am broadening my scope in an attempt to build a model that can predict satisfaction with over 95% accuracy — solely based on how a passenger responded to the interview!

To do this, I first performed some additional data exploration to examine anything I may have missed in my first analysis. Next, I built 7 different models in Python, each leveraging a unique algorithm. Lastly, I compared the results of these models, choosing the best 3 candidates for my final model based on metrics such as accuracy. From there, I fine-tuned the algorithm, ensuring an optimal balance between accuracy and adaptability to new data.

Data Exploration

Although I already knew this dataset fairly intimately from my aforementioned analysis, there were some aspects of the data related to modeling that I wanted to examine. Primarily, I wanted to see whether any of the variables displayed high correlation with each other. This was partially out of curiosity, but also because highly correlated features (multicollinearity) can sometimes muddy the waters and reduce accuracy down the line. This is not always the case, but if my models turned out to be uncooperative, I wanted some information on hand for diagnostics.

This heat map shows correlation between each of the variables (without regard to satisfaction).

From this heat map, I could clearly see that there was no unusual covariance. Any high correlation (such as between the two delay variables or between cleanliness and food & drink quality) made sense.
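
For anyone looking to reproduce this step, a heat map like this takes only a few lines. The sketch below assumes the interviews are loaded into a pandas DataFrame with a satisfaction column; the file name is hypothetical:

# A minimal sketch of the correlation heat map; the file name and the
# "satisfaction" column name are assumptions about how the data is stored.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("airline_satisfaction.csv")  # hypothetical file name

# Pairwise correlation between the numeric features,
# leaving the satisfaction label out of the matrix.
corr = df.drop(columns=["satisfaction"]).corr(numeric_only=True)

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation between survey features")
plt.tight_layout()
plt.show()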

However, something I quickly began to appreciate was the sheer size of this dataset. With over 130,000 interviews (each with 22 features), it was clear that computing power would have to be a consideration for modeling. This is an interesting problem that I have not yet encountered this semester.

Preliminary Modeling

After preparing the data and splitting it into training and testing sets, I began considering which models to try. Since my goal was to classify a binary variable (satisfaction), I knew that a simple linear regression would not suffice. Instead, I used the following 6 models, ranked below by initial accuracy:

Given just basic parameters, the 6 models still managed to achieve an adequate 80%+ accuracy, with the top models already performing near my 95% goal. However, there was a clear drop-off in accuracy with the decision tree and logistic regression, which is to be expected given the sophistication of the models they were up against.
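
For reference, the comparison itself can be set up with a loop like the one sketched below. It assumes the features and the satisfaction label are already stored in X and y, and it only includes the classifiers named in this article rather than the full slate of six:

# A sketch of the baseline comparison, using default ("basic") parameters.
# X and y are assumed to hold the 22 features and the yes/no label.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(random_state=42),
    "Gradient boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_test, model.predict(X_test)):.3f}")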

Furthermore, despite this disparity in accuracy, the confusion matrices all looked nearly identical to the naked eye.
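
Producing those matrices takes only a couple of lines once the models above are fitted; a quick sketch:

# A sketch of the confusion-matrix check, reusing the fitted models above.
from sklearn.metrics import confusion_matrix

for name, model in models.items():
    print(name)
    # rows = actual class, columns = predicted class
    print(confusion_matrix(y_test, model.predict(X_test)))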

Clearly, there is still work to be done to identify the strongest model.

At this point, you may be asking why I didn't test a support-vector machine (SVM) or even a neural network. In fact, not only did I test these, but roughly 80% of my time spent modeling was devoted to getting them to work! Unfortunately, due to the size of my dataset (130k+ interviews), I simply didn't have the computing power to build these models. I tried paring my dataset down to 10k randomly selected points, but even this took over 2 hours to execute. At that point, I decided it wasn't worth compromising the integrity of my dataset, and I was satisfied enough with the performance of my top 4 models to move on.
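
For the curious, the subsampling attempt looked roughly like the sketch below, assuming X_train and y_train are a pandas DataFrame and Series sharing an index:

# A sketch of the pared-down SVM experiment on 10,000 random interviews.
from sklearn.svm import SVC

subset = X_train.sample(n=10_000, random_state=42)
svm = SVC(kernel="rbf")                      # default RBF kernel
svm.fit(subset, y_train.loc[subset.index])   # even this run took hours
print(svm.score(X_test, y_test))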

Selecting the Winner

At this point, I decided to drop the gradient boosting classifier (GBC); this was mainly due to its similarity to random forests, especially in terms of algorithm and hyperparameters. I simply understood random forests better, and felt I could improve that model more than a GBC. With the 3 remaining models, I decided to give each of them a few opportunities to prove their potential given different parameters.

In total, I tried 3 different scenarios for each model, which are documented in my code. In summary, my random forest performed the best at a staggering 96.5% accuracy, 0.3% higher than my initial model. This was achieved by increasing both the number of estimators and the maximum number of features considered at each split.
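
These scenarios were simply different parameter combinations scored against the test set; the values below are illustrative rather than the exact set documented in my code:

# A sketch of the hand-picked scenarios; only the 250-estimator,
# 10-feature combination reflects the winning configuration.
from sklearn.ensemble import RandomForestClassifier

scenarios = [
    {"n_estimators": 100, "max_features": "sqrt"},
    {"n_estimators": 250, "max_features": 10},
    {"n_estimators": 500, "max_features": 15},
]

for params in scenarios:
    rf = RandomForestClassifier(random_state=42, **params)
    rf.fit(X_train, y_train)
    print(params, rf.score(X_test, y_test))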

Despite this success, my process was still based only on human-selected parameters. I wanted to be sure I had found the optimal combination, so I used a grid search to work through each possibility. After two hours of execution, I found that my hand-tuned model actually outperformed the best combination the grid search had found! I had truly stumbled upon a great iteration of this random forest, and after some minor fine-tuning, arrived at this formula:

RandomForestClassifier(n_estimators=250, max_features=10, random_state=42)
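
For reference, a grid search like this can be written in a few lines with scikit-learn's GridSearchCV; the parameter grid below is a sketch rather than the exact one I ran:

# A sketch of the grid search; the candidate values are illustrative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 250, 500],
    "max_features": [5, 10, 15],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)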

Because a random forest bootstraps the training data for every tree (and can report an out-of-bag error estimate), a degree of validation is already built into the model. Thus, I felt confident moving forward to prepare my model for future data.
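
Scikit-learn exposes this built-in check as an out-of-bag score, which can be requested directly on the chosen configuration; a minimal sketch:

# A sketch of the out-of-bag check: each training sample is scored using
# only the trees that never saw it in their bootstrap samples.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=250, max_features=10, oob_score=True, random_state=42
)
rf.fit(X_train, y_train)
print(rf.oob_score_)   # accuracy on out-of-bag samples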

Model Simplification

Now that I had a strong model, I wanted to ensure that it could successfully predict new data and avoid any overfitting issues. To do this, I used 100 fewer estimators (150 to 50), 3 fewer maximum features (7 to 4), and set the minimum samples required for a split at 300. Given the sheer volume of data, this minimum was necessary to avoid leaves that simply harbor outliers.
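
In scikit-learn terms, the simplified forest looks roughly like this, using the reduced values described above:

# A sketch of the simplified model: fewer trees, fewer candidate features
# per split, and a 300-sample floor before a node may split.
from sklearn.ensemble import RandomForestClassifier

simple_rf = RandomForestClassifier(
    n_estimators=50,
    max_features=4,
    min_samples_split=300,
    random_state=42,
)
simple_rf.fit(X_train, y_train)
print(simple_rf.score(X_test, y_test))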

Even given these modifications, my model still achieved 95.1% accuracy, only 1.4 percentage points below my best model's benchmark.

At this point, I was satisfied with my final random forest model, and was able to analyze it to uncover what it found to be its five most important variables:

My random forest’s top 5 variables, ranked by importance.

Given the results of my initial analysis, it is not surprising that the model picked up on the importance of inflight wi-fi and an easy online boarding process. It was also not surprising to see business travel and business-class seats make an appearance. In the future, I may rerun the model using only the 14 variables that passengers ranked from 1–5, such as cleanliness and seat comfort; these may be more actionable for an airline than knowing whether a passenger held a business or economy ticket.
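
For completeness, pulling these importances out of the fitted forest takes only a couple of lines, assuming X_train is a pandas DataFrame whose columns are the survey features:

# A sketch of the feature-importance ranking, reusing the simplified
# forest from the sketch above.
import pandas as pd

importances = pd.Series(simple_rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(5))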

Conclusions

Overall, I was very satisfied with how my model turned out. Given my random forest's strong performance from the start, the biggest challenge of this project was simply finding room for improvement. This ultimately pushed me to learn new methods for cross validation, performance analysis, and hyperparameter tuning. I am confident that my model is now prepared to take on new data while avoiding the effects of overfitting.
