Air Traffic Analysis

Vitor dos Santos
Aug 9, 2021

Hello, everyone! Today, I’m going to perform an extensive network analysis of air traffic, applying different techniques to reveal patterns and relationships between the connections of the network through some basic metrics and images.

In this article, I will use Brazilian air traffic during 2019 and 2020 as a case study, but the same ideas can be applied to any other country. By considering these two years, we will also get an idea of the influence of the COVID-19 pandemic on Brazilian air traffic. So, let’s dive in!

Database

The database used in this article is provided by the Brazilian National Civil Aviation Agency (ANAC) and is available on its official website. Another public database with flight and airport information for many countries is available at OpenFlights. However, that one does not break the information down by year, which is why we will use the database from ANAC.

In this article we will use three datasets: two cover all the flights within Brazil in 2019 and 2020, and a third has information on all airports in Brazil.

From the two flights datasets, we only need three pieces of information: the OACI codes (OACI is the Portuguese acronym for ICAO) of the Source and Destination airports, which together define a flight route, and the number of flights between the two airports. From the airports dataset, we want the OACI code of each airport (which will be used to link it with the flights datasets), the airport name, city, state, latitude, and longitude.

To perform network analysis, we will build two graphs, one related to all flights in 2019 and another with flights from 2020. The nodes of the graphs will be the airports and the edges will be the flights between two airports.

Pre-Processing

After downloading the datasets, it is necessary to pre-process them. To do this, I used Python, Pandas and a Colab notebook, which you can take a look at in this repository.

Initially, I started working on the routes datasets. First, only the columns with the OACI codes of the Source and Destination airports and the number of departures were selected. In addition, a filter was applied to keep only airports that are inside Brazil.

Next, the columns were renamed, and the type of the number-of-flights column was set to int64. However, several rows shared the same route, because the original dataset also distinguishes between companies and months when describing a flight. Therefore, these rows were grouped by route and the number of flights was summed.

Finally, I also dropped all the routes that had fewer than 15 flights in the entire year, since they do not represent a regular flight. The figure below shows the first 5 rows of the 2019 dataset (right) and the 2020 dataset (left).

Figure 1: Example of Flights DataFrame.
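The routes pre-processing described above can be sketched with Pandas as follows. This is only an illustration, not the code from the notebook: every column name (`origin_oaci`, `destination_oaci`, `origin_country`, `destination_country`, `departures`) is a hypothetical placeholder for the real ANAC headers, and both the country-based filter and the replacement of missing counts with zero are assumptions.

```python
import pandas as pd


def preprocess_routes(raw: pd.DataFrame, min_flights: int = 15) -> pd.DataFrame:
    """Reduce a raw flights table to one row per (source, target) route.

    All input column names are hypothetical placeholders.
    """
    # Keep only rows whose two endpoints are inside Brazil (assumed filter).
    domestic = raw[
        (raw["origin_country"] == "BRASIL")
        & (raw["destination_country"] == "BRASIL")
    ]
    routes = (
        # Select and rename the three columns the graph needs.
        domestic[["origin_oaci", "destination_oaci", "departures"]]
        .rename(
            columns={
                "origin_oaci": "source",
                "destination_oaci": "target",
                "departures": "flights",
            }
        )
        # Assumption: missing counts become 0 so the cast to int64 succeeds.
        .assign(flights=lambda df: df["flights"].fillna(0).astype("int64"))
        # Collapse the per-company, per-month rows into one row per route.
        .groupby(["source", "target"], as_index=False)["flights"]
        .sum()
    )
    # Drop routes with fewer than `min_flights` flights in the year.
    return routes[routes["flights"] >= min_flights].reset_index(drop=True)
```

Calling the function once per year would give the two flights DataFrames, each with `source`, `target` and `flights` columns.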

In the next step, I worked on the airports dataset. First, I kept only the useful information: the OACI code, name, city, state, latitude, and longitude. Then, the airports that did not have a flight in the analyzed year were removed from the dataset, using the information from the two flights datasets.

In this way, two airport DataFrames were created: one with the airports that appear in the 2019 flights and another with the airports that appear in the 2020 flights. The figure below shows 5 samples from the 2019 airport DataFrame.

Figure 2: Example of the Airport DataFrame.
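A minimal sketch of this airport filtering step, assuming the routes DataFrame has `source` and `target` columns and the airports DataFrame has an `oaci` column (all hypothetical names, chosen for illustration only):

```python
import pandas as pd


def airports_with_flights(airports: pd.DataFrame, routes: pd.DataFrame) -> pd.DataFrame:
    """Keep only the airports that appear as source or target of some route."""
    # Every OACI code that is used by at least one route of the year.
    used_codes = set(routes["source"]) | set(routes["target"])
    # Hypothetical column names for the six fields kept from the airports file.
    columns = ["oaci", "name", "city", "state", "latitude", "longitude"]
    kept = airports.loc[airports["oaci"].isin(used_codes), columns]
    return kept.reset_index(drop=True)
```

Applying it with the 2019 routes and then with the 2020 routes would produce the two airport DataFrames.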

Graph Creation

Now, we will create two Graphs using the pre-processed DataFrames: one with 2019 flight information (G2019) and another with 2020 flight information (G2020). In both graphs, the nodes are the airports and the flights are the edges. In this article, we will use the NetworkX library to create and analyze these Graphs. The figure below shows an example of the information stored in one node.

Figure 3: Information from the node called SBSG.

A problem that occurred when analyzing the Graphs is that some nodes did not have any information. This happened because some airport OACI codes existed in the flights datasets but did not appear in the airports dataset. Therefore, to be able to use some functions of the graph, I decided to remove these empty nodes from the Graphs.

The figure below gives a first look at the connections of both Graphs, where the position of each node is given by the latitude and longitude of the airport. It is interesting to notice how both images resemble the map of Brazil.
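The graph creation and the removal of empty nodes could look roughly like the sketch below. It assumes an undirected graph and the hypothetical column names `source`, `target`, `flights` and `oaci`; the article does not say whether the original graphs are directed, so treat that choice as an assumption.

```python
import networkx as nx
import pandas as pd


def build_flight_graph(routes: pd.DataFrame, airports: pd.DataFrame) -> nx.Graph:
    """Build an airport graph and drop nodes that have no airport record."""
    # Assumption: undirected graph, with the flight count stored on each edge.
    graph = nx.from_pandas_edgelist(routes, "source", "target", edge_attr="flights")
    # Attach name, city, state, latitude and longitude to each known airport.
    attributes = airports.set_index("oaci").to_dict("index")
    nx.set_node_attributes(graph, attributes)
    # Nodes with an empty attribute dict are codes missing from the airports table.
    empty_nodes = [node for node, data in graph.nodes(data=True) if not data]
    graph.remove_nodes_from(empty_nodes)
    return graph
```

Note that in an undirected graph the two directions of a route share one edge, so the `flights` attribute keeps only the value of the last row read for that pair.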

Figure 4: Graph plot from G2019 (left) and G2020 (right).
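One possible way to obtain a map-like plot is to pass the coordinates of each airport as node positions. The sketch below assumes each node carries `latitude` and `longitude` attributes (hypothetical attribute names), and the drawing options are arbitrary choices.

```python
import networkx as nx


def geographic_positions(graph: nx.Graph) -> dict:
    """Map each node to (longitude, latitude) so the plot resembles a map."""
    return {
        node: (data["longitude"], data["latitude"])
        for node, data in graph.nodes(data=True)
    }


def draw_flight_graph(graph: nx.Graph) -> None:
    """Draw the graph with every airport placed at its coordinates."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    nx.draw(graph, pos=geographic_positions(graph), node_size=20, width=0.3)
    plt.show()
```

Longitude goes on the x-axis and latitude on the y-axis, which is why the result follows the outline of the country.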

Basic Graph Information

Now that we have both graphs created, let’s analyze them. First, we will compare the Eccentricity, the Diameter, the Periphery, the Radius, and the Center of both Graphs. Here are the definitions of these metrics.

  • The Eccentricity of a node is the maximum shortest-path distance from that node to any other node in the graph.
  • The Diameter is the maximum Eccentricity, and the Periphery is the set of nodes whose Eccentricity equals the Diameter.
  • The Radius is the minimum Eccentricity, and the Center is the set of nodes whose Eccentricity equals the Radius.

If a node has a high Eccentricity, it is more isolated: from it, at least one other node can only be reached by passing through many intermediate vertices. Eccentricity captures this worst case, while the number of hops to a specific destination depends on that destination.
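To make these definitions concrete, here is a small sketch on a toy network of five hypothetical airports connected in a chain. The NetworkX functions used here require a connected graph, so on real data it may be necessary to restrict the analysis to a connected component first.

```python
import networkx as nx

# Toy network: five airports in a chain, A - B - C - D - E.
toy = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")])

eccentricity = nx.eccentricity(toy)  # {'A': 4, 'B': 3, 'C': 2, 'D': 3, 'E': 4}
diameter = nx.diameter(toy)          # 4, the largest eccentricity
periphery = nx.periphery(toy)        # nodes A and E, whose eccentricity is 4
radius = nx.radius(toy)              # 2, the smallest eccentricity
center = nx.center(toy)              # node C, whose eccentricity is 2
```

The two ends of the chain are the most isolated airports, while the middle airport reaches everything in at most two hops.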

The image below shows the Eccentricity value for each node in the G2019 graph (left) and the G2020 graph (right). The color bar on the right relates the color of each node to its Eccentricity value.

The dark red nodes have an Eccentricity value of eight, meaning that from them it can take up to eight hops to reach another airport in the country. Comparing both graphs, G2019 nodes have lower Eccentricity, mainly in the Southwest and Northeast regions, while G2020 has more red nodes in general. Therefore, we can infer that it was easier to get anywhere in 2019 than in 2020. One likely reason for this behavior is the COVID-19 pandemic, during which fewer flights were made in 2020.

The diameter of both graphs is 8 and the periphery is:

G2019: Eirunepé/Amazonas (SWEI); Cirilo Queiróz/Minas Gerais (SNAR)
G2020: Trombetas/Pará (SBTB); Francisco Lacerda Júnior/Paraná (SSCP); Paulo Abdala/Paraná (SSFB); André Antônio Maggi/Mato Grosso (SWBG); Canarana/Mato Grosso (SWEK); São Félix do Araguaia/Mato Grosso (SWFX)

The radius of both graphs is 4 and the center is:

G2019: Tancredo Neves/Minas Gerais (Confins/SBCF)
G2020: Maestro Wilson Fonseca/Pará (Santarém/SBSN)

It is interesting to observe that SBSN is not a particularly large or well-known airport. However, it has connections with important airports, such as Brasília, Manaus, and Belém, which gives it short paths to every other airport. That’s why it is the center of G2020.

Centrality Distribution

Let’s perform a bivariate analysis using centrality metrics. More precisely, we will analyze the relationship between Degree x Closeness, and Degree x Betweenness of the graphs.

  • The Degree centrality represents the number of connections of a node.
  • The Closeness centrality is the inverse of the average shortest-path distance from a node to all other vertices. It is a way of detecting nodes that are able to spread information very efficiently through a graph.
  • The Betweenness centrality measures how often a node lies on the shortest paths between other pairs of nodes. It is a way of detecting the amount of influence a node has over the flow of information in a graph.

A node with a high Closeness value is very close to all the other nodes. A node with a high Betweenness value, on the other hand, lies on a large share of the shortest paths, meaning that it frequently sits between other nodes.
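The three metrics can be illustrated on a toy hub-and-spoke network, sketched below. The graph is hypothetical; the Degree is taken as the raw number of connections, while the Closeness and Betweenness values are the normalized ones returned by NetworkX.

```python
import networkx as nx

# Toy hub-and-spoke network: node 0 is the hub, nodes 1 to 4 are the spokes.
toy = nx.star_graph(4)

degree = dict(toy.degree())                   # raw number of connections per node
closeness = nx.closeness_centrality(toy)      # (n - 1) / sum of shortest-path distances
betweenness = nx.betweenness_centrality(toy)  # normalized share of shortest paths
```

In this network the hub has the maximum value of all three metrics: it is one hop from every spoke, and every shortest path between two spokes goes through it.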

The image below shows the relationship between Degree (x-axis) and Closeness (y-axis) for both Graphs.

In G2019, we can infer the existence of a positive correlation between Degree and Closeness, meaning that the more connections a node has, the closer it is to any other node.

In G2020, two nodes have a high Degree (over 40) but a lower Closeness value than other nodes with a Degree of around 5. Besides, the node with the third highest Closeness value has only the 14th highest Degree value. Thus, the correlation between these two variables is weak, although lower Degree nodes still tend to have lower Closeness values.

Now, let’s analyze the relationship between Degree (x-axis) and Betweenness (y-axis) for both Graphs.

In G2019, there is not much correlation between the two variables. The node with the highest Betweenness has a low Degree, and there are some nodes with a high Degree but very low Betweenness values, meaning that having many connections does not imply that a node lies on many shortest paths.

In G2020, we can infer that there is some positive correlation between these variables, apart from some outliers. The node with the highest Betweenness value is also the one with the highest Degree.
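The correlations above were read from the scatter plots. One way to put a number on that visual impression, which is not part of the original analysis, is to compute a Pearson correlation coefficient between the metrics. The helper below is a hypothetical sketch using NumPy.

```python
import networkx as nx
import numpy as np


def centrality_correlations(graph: nx.Graph) -> dict:
    """Pearson correlation of Degree with Closeness and with Betweenness."""
    nodes = list(graph.nodes())
    closeness_by_node = nx.closeness_centrality(graph)
    betweenness_by_node = nx.betweenness_centrality(graph)
    # Build aligned arrays so that position i refers to the same node everywhere.
    degree = np.array([graph.degree(node) for node in nodes], dtype=float)
    closeness = np.array([closeness_by_node[node] for node in nodes])
    betweenness = np.array([betweenness_by_node[node] for node in nodes])
    return {
        "degree_closeness": float(np.corrcoef(degree, closeness)[0, 1]),
        "degree_betweenness": float(np.corrcoef(degree, betweenness)[0, 1]),
    }
```

Values close to 1 would support a positive correlation, and values close to 0 would indicate little linear relationship. A single coefficient can hide outliers, so it complements the plots rather than replacing them.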

Core Decomposition

A K-core of a network is the largest subset of its nodes in which every node has at least K connections to other nodes of the same subset. So, for example, in a 2-core, every node has at least 2 connections to other nodes of the 2-core. This metric helps to identify tightly interlinked groups within a network.

Another important concept in the core decomposition is the K-shell: the subset of nodes that are removed when we go from the K-core to the (K+1)-core. So, for example, if the 2-core of a network has 10 nodes and the 3-core of the same network has 6 nodes, the 2-shell is the subset of 4 nodes that were removed when going from the 2-core to the 3-core.

In the G2019 graph, there are 10 cores and the max-core is the 10-core. Thus, every node in the 10-core has at least 10 connections to other nodes of the 10-core. On the other hand, in the G2020 graph, there are 9 cores and the max-core is the 9-core. The image below highlights the max-core, the (max-1)-shell, and the (max-2)-shell of both graphs.
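The K-core and K-shell definitions can be checked on a small hypothetical network with the core functions of NetworkX, as in the sketch below.

```python
import networkx as nx

# Toy network: four fully connected airports plus two less connected ones.
toy = nx.complete_graph(4)            # nodes 0 to 3, each linked to the other three
toy.add_edges_from([(4, 0), (4, 1)])  # node 4 has two links into that group
toy.add_edge(5, 4)                    # node 5 is only linked to node 4

core_number = nx.core_number(toy)  # for each node, the largest k whose k-core contains it
max_core = nx.k_core(toy)          # with no k given, returns the max-core (the 3-core here)
two_shell = nx.k_shell(toy, k=2)   # nodes in the 2-core but not in the 3-core
one_shell = nx.k_shell(toy, k=1)   # nodes in the 1-core but not in the 2-core
```

Node 5 falls out first because it has a single connection. Once it is gone, node 4 is left with only two connections, so it belongs to the 2-shell, and the four fully connected nodes form the max-core.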

The max-core of both Graphs contains the main Brazilian airports, the ones that have the most flight routes. The K-core and K-shell images provide an interesting visualization of these airports.

The G2020 graph has more nodes in the North and Northeast regions than G2019, while there was not much change in the South and Southeast regions. Besides, the max-core of G2019 is of a higher order than that of G2020, which means that there were more connections (flight routes) between the main airports in 2019 than in 2020.

Another interesting analysis that can be done using the k-core is to plot a histogram that counts the number of nodes with each Degree value, together with the Probability Density Function. Let’s show this plot and analyze it.

The image below shows the number of nodes in the max-core (10-core) that have a specific Degree value. The left y-axis shows the count and the right y-axis shows the probability. This plot was made using the G2019 graph.

In this plot, there is a probability of approximately 6% that a node has a Degree of 10 or 12, and a probability of around 14% that a node has a Degree of 11.

Besides, the shape of the PDF shows that the node Degrees are mainly concentrated around 11 and 15.

Finally, the image below shows the same plot considering the max-core of the graph G2020.

In this case, the Degree distribution is much more uniform between 9 and 15, as can be seen from the histogram and the shape of the PDF.
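The counts and probabilities behind such a histogram could be computed as in the sketch below. It assumes that the Degree of each node is counted inside the max-core subgraph only, which is a guess based on the Degree ranges reported above. It returns empirical frequencies and does not reproduce the smoothed PDF curve.

```python
from collections import Counter

import networkx as nx


def max_core_degree_distribution(graph: nx.Graph) -> dict:
    """Return {degree: (count, probability)} for the nodes of the max-core.

    Assumption: degrees are counted inside the max-core subgraph only.
    """
    core = nx.k_core(graph)  # with no k given, this is the max-core
    counts = Counter(degree for _, degree in core.degree())
    total = core.number_of_nodes()
    return {
        degree: (count, count / total)
        for degree, count in sorted(counts.items())
    }
```

Each probability is simply the count for that Degree divided by the number of nodes in the max-core, so the probabilities add up to 1.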

Conclusion

This article presented an extensive analysis of an air traffic system using network analysis, taking the Brazilian flights in 2019 and 2020 as a case study.

Initially, it was necessary to pre-process the original datasets to build DataFrames that would be useful in our application. Next, we created our Graphs using NetworkX, on which many analyses were made.

Then, some basic definitions and information related to the Graphs were presented, such as Eccentricity, Diameter, and Radius. With this information, it was possible to get a basic understanding of how our network is connected and which airports are “harder to reach”. After that, we made a deeper analysis of the network using Degree, Closeness, and Betweenness, which gave some insights into the airports that are closer to every other airport or that have a high influence on shortest paths.

Finally, the k-core and k-shell definitions were used to visualize the airports that are most densely connected to each other.

Vitor dos Santos

PhD student in Computer Science at Dublin City University. Interested in Computer Vision, Deep Learning and Data Science.