Introduction
In this article I simulate how air passenger information can be related to hotel booking preferences. I'll show which type of AI I used for the model, which inputs I used, and the overall structure of the code, and I'll share the results.
Applicability
This approach and its results could be applied in many ways: predicting when, how, and to whom to promote certain accommodation facilities, evaluating whether a destination has all the facilities it requires or needs more, or adapting prices to demand.
Problem statement
Assuming that I have some information about passengers, I would like to find relationships between those inputs and the passengers' booking preferences. In particular, I would like to assess what type of hotel they will book and what type of room they will require.
Approach
To address the problem described above, I asked ChatGPT to generate 100,000 records containing:
- The airline transporting the passenger (e.g. Delta Airlines, Lufthansa)
- The country of origin
- The hotel budget (cost per night)
- The age of the passenger
- The gender of the passenger
- The hotel type (Budget, Mid-Range, Luxury)
- The hotel stars
- The arrival day
- The purpose of the travel (Leisure, Business, Conference)
- The passenger’s company type (Public, Private, Self-Employed)
- The number of nights reserved
- The room type (Single, Double, Suite)
- The number of people in the room
ChatGPT generated the data for me with some clear patterns and produced a CSV file that I could download. The patterns are sometimes too obvious, but they are good enough for a demonstration. From there, I just had to analyze the file, looking for correlations and trying to predict them with the help of AI. The AI selected for this purpose is KMeans.
What Is KMeans Clustering?
KMeans is one of the most popular and widely used unsupervised machine learning algorithms. It’s used when you want to discover hidden groupings or segments in your data without having labeled outcomes.
The Intuition Behind KMeans
Imagine you have a dataset of thousands of travelers, and you want to group them into distinct types (e.g., business, leisure, family). You don’t know the labels in advance — but you believe there are natural patterns.
That’s exactly what KMeans does: it tries to split your data into k distinct clusters where each point belongs to the cluster with the closest mean (center).
🔄 How It Works (Step by Step)
1. Choose k (the number of clusters you want).
2. Initialize k centroids randomly in your data space.
3. Assign each point to the nearest centroid → these are your temporary clusters.
4. Recalculate the centroids as the mean of the points in each cluster.
5. Repeat steps 3 and 4 until the clusters stop changing significantly.
This is why it’s called K-Means — it groups by minimizing the distance to the mean of each cluster.
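The steps above can be sketched in a few lines of plain Python. This is a minimal illustration on made-up toy data, not the code behind the results in this article: the function name `kmeans`, the toy `travelers` points, and the choice to pick the initial centroids from among the data points are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd-style KMeans on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 4: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Toy data: (age, people in room) for two visibly separate groups.
travelers = [(25, 3), (27, 3), (29, 3), (50, 1), (52, 1), (54, 1)]
centroids, labels = kmeans(travelers, k=2)
print(sorted(centroids))
print(labels)
```

On this toy data the two centroids settle on the mean of each group. Real data with overlapping groups can converge to different solutions depending on the starting centroids, which is why library implementations usually run several initializations.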
Code set up
Once the AI had been selected, I set up the code following the structure shown in Figure 1. The code performs three functions:
- Data preparation
  - The code reads the data from the CSV file
- KMeans training
  - The AI is trained on the data selected from the CSV file
  - Only the most influential values are considered
- Validation
  - 100 samples are selected to verify whether the AI model can make accurate predictions
  - The success rate for hotel type and room type is calculated
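KMeans itself only returns cluster numbers, so one extra step is needed before a cluster can be read as a "predicted hotel type". A common way to do this, assumed here because the article does not spell it out, is to give each cluster the label that appears most often among its training members. The sketch below shows that idea; the function name `majority_label_per_cluster` and all the sample values are hypothetical.

```python
from collections import Counter, defaultdict


def majority_label_per_cluster(cluster_ids, labels):
    """Map each cluster id to the label seen most often among its members."""
    votes = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, labels):
        votes[cluster_id][label] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}


# Hypothetical training output: cluster id per passenger and the known hotel type.
train_clusters = [0, 0, 0, 1, 1, 2, 2, 2]
train_hotel = ["Luxury", "Luxury", "Mid-Range", "Budget", "Budget",
               "Mid-Range", "Mid-Range", "Luxury"]
cluster_to_hotel = majority_label_per_cluster(train_clusters, train_hotel)

# Validation: each held-out passenger receives the label of their cluster.
val_clusters = [0, 1, 2, 0]
predicted = [cluster_to_hotel[c] for c in val_clusters]
print(cluster_to_hotel)
print(predicted)
```

The same mapping can be built a second time for the room type, so that one clustering yields both predictions.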

Figure 1 Code structure
Results
Before discussing the results, I would like to share some details about the KMeans model. The parameters used for the model are:
- n_clusters=9
- init='k-means++'
- n_init=10
- max_iter=1000
- random_state=10
- algorithm='lloyd'
Given the number of inputs, these parameters provide a good basis for training and for predictions. With KMeans configured, I selected the input values. The most influential ones turned out to be:
- The country of origin
- The age of the passenger
- The purpose of the travel (Leisure, Business, Conference)
- The number of people in the room
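KMeans works on distances between numbers, so text inputs such as country and purpose have to be converted before training. The article does not say how this was done. The sketch below shows one common option, one-hot encoding plus simple scaling, under stated assumptions: the category lists, the `encode` helper, and the `max_age` and `max_people` divisors are all hypothetical.

```python
# Hypothetical category lists; the real dataset may contain more values.
COUNTRIES = ["Canada", "Germany", "India", "USA"]
PURPOSES = ["Leisure", "Business", "Conference"]


def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1.0 if value == c else 0.0 for c in categories]


def encode(passenger, max_age=100.0, max_people=4.0):
    """Turn one passenger record into a numeric feature vector.

    Age and room occupancy are divided by assumed maxima so that they
    sit on roughly the same 0-1 scale as the one-hot columns.
    """
    return (
        one_hot(passenger["Country_of_Origin"], COUNTRIES)
        + one_hot(passenger["Purpose_of_Travel"], PURPOSES)
        + [passenger["Age"] / max_age, passenger["People_in_Room"] / max_people]
    )


row = {"Age": 25, "Purpose_of_Travel": "Leisure",
       "Country_of_Origin": "India", "People_in_Room": 3}
print(encode(row))
```

Keeping all columns on a similar scale matters because a column with large values would otherwise dominate the distance calculation.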
After experimenting with the other inputs too, I found that they add no value and actually lower the accuracy of the predictions. Using too many inputs is not helpful, especially when they are not relevant: every extra dimension adds noise to the distances that KMeans relies on, a problem known as the curse of dimensionality. With the four inputs and the KMeans setup above, the overall success scores are:
✅ Hotel Type Accuracy on 100 cases: 90.0%
✅ Room Type Accuracy on 100 cases: 98.0%
This means an almost perfect prediction of the room type and a very good prediction of the hotel type. Below is an extract of the results. The only error in it is the hotel type predicted for a passenger from Germany who is traveling to a conference and sharing a room with two other people: the prediction is a luxury hotel, but the actual selection is mid-range.
| Age | Purpose_of_Travel | Country_of_Origin | People_in_Room | Actual_Hotel_Type | Predicted_Hotel_Type | Actual_Room_Type | Predicted_Room_Type | Hotel_Correct | Room_Correct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 25 | Leisure | India | 3 | Luxury | Luxury | Suite | Suite | TRUE | TRUE |
| 32 | Conference | Germany | 3 | Mid-range | Luxury | Suite | Suite | FALSE | TRUE |
| 36 | Business | Germany | 1 | Budget | Budget | Single | Single | TRUE | TRUE |
| 38 | Leisure | India | 3 | Luxury | Luxury | Suite | Suite | TRUE | TRUE |
| 34 | Business | India | 1 | Budget | Budget | Single | Single | TRUE | TRUE |
| 25 | Business | Canada | 1 | Budget | Budget | Single | Single | TRUE | TRUE |
| 51 | Business | Germany | 1 | Budget | Budget | Single | Single | TRUE | TRUE |
| 35 | Leisure | USA | 3 | Luxury | Luxury | Suite | Suite | TRUE | TRUE |
| 32 | Leisure | Germany | 3 | Luxury | Luxury | Suite | Suite | TRUE | TRUE |
| 29 | Leisure | Canada | 3 | Luxury | Luxury | Suite | Suite | TRUE | TRUE |
Table 1 Extract of KMeans predictions
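The success rate is simply the share of rows where the prediction matches the actual booking. As a small worked example, the snippet below recomputes it for the ten rows of Table 1 only, using the Hotel_Correct and Room_Correct columns. It covers the extract, not the full 100 validation cases, and the helper name `accuracy_percent` is made up for the illustration.

```python
# Hotel_Correct and Room_Correct columns copied from the ten rows of Table 1.
hotel_correct = [True, False, True, True, True, True, True, True, True, True]
room_correct = [True] * 10


def accuracy_percent(flags):
    """Share of True flags, expressed as a percentage."""
    return 100.0 * sum(flags) / len(flags)


print(f"Hotel type accuracy on extract: {accuracy_percent(hotel_correct):.1f}%")
print(f"Room type accuracy on extract: {accuracy_percent(room_correct):.1f}%")
```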
Conclusions
Using KMeans to detect patterns was a good choice, as it achieved very good overall results. Getting there required some tuning, and some of the inputs were excluded to avoid the curse of dimensionality. Once the AI was set up, the code ran very fast, producing the results in less than one minute.
Given the ease of implementation and the results achieved, this approach could bring a lot of value in several areas. The main thing to consider is the quality of the data, as not every dataset will give the same results. Tuning and expertise will be required to achieve results as good as those in this example.
Copyright
Author: Simone Togni
Platform: aisciencetalk.blog