Busiest air routes

Introduction

This is a simple exercise to extract data from a Wikipedia page (List of busiest passenger air routes) and perform a basic data visualisation.

Why am I doing this? I somehow really got into aviation in the past two years. Due to the COVID travel restrictions, I have spent days on YouTube watching planes take off and land. In a moment of self-indulgence, I would also like to add that the Airbus A380 is such a beautiful engineering marvel, and it is sad that COVID hastened the end of its production.

Airbus A380 at Hong Kong airport, 2018

Extracting data

We could, in theory, copy and paste the data into an Excel sheet and then import it. However, we will try something a bit fancier and use the rvest package to scrape the table directly.

The only downside of extracting the data this way is that if the webpage is updated, our code might break. Hence, I will also save a copy of this data on my GitHub.

suppressPackageStartupMessages({
  library(xml2)
  library(rvest)
  library(tidyverse)
})
webpage = xml2::read_html("https://en.wikipedia.org/wiki/List_of_busiest_passenger_air_routes")

raw_tbl = webpage %>% 
  html_element("table") %>% 
  html_table()

raw_tbl
## # A tibble: 50 x 7
##     Rank `Airport 1`  `Airport 2`    `Distance (km)` `2018[1]` `2017[2]` Type   
##    <int> <chr>        <chr>                    <int> <chr>     <chr>     <chr>  
##  1     1 Jeju         Seoul-Gimpo                449 14,107,4… 13,460,3… Domest…
##  2     2 Sapporo      Tokyo-Haneda               835 9,698,639 8,726,502 Domest…
##  3     3 Sydney       Melbourne                  705 9,245,392 9,090,941 Domest…
##  4     4 Fukuoka      Tokyo-Haneda               889 8,762,547 7,864,000 Domest…
##  5     5 Mumbai       Delhi                     1150 7,392,155 7,129,943 Domest…
##  6     6 Hanoi        Ho Chi Minh C…            1171 6,867,114 6,769,823 Domest…
##  7     7 Beijing      Shanghai-Hong…            1081 6,518,997 6,833,684 Domest…
##  8     8 Hong Kong    Taipei-Taoyuan             802 6,476,268 6,719,030 Intern…
##  9     9 Tokyo-Haneda Naha                      1573 5,829,712 5,269,481 Domest…
## 10    10 Jakarta      Surabaya                   700 5,649,046 5,271,304 Domest…
## # … with 40 more rows

# We are only interested in the 2018 passenger numbers
subset_tbl = raw_tbl %>% 
  dplyr::transmute(
    rank = Rank, 
    airport1 = `Airport 1`, 
    airport2 = `Airport 2`, 
    distance = `Distance (km)`, 
    passengers = `2018[1]` %>% str_remove_all(",") %>% as.integer(), 
    type = Type)

subset_tbl
## # A tibble: 50 x 6
##     rank airport1     airport2          distance passengers type         
##    <int> <chr>        <chr>                <int>      <int> <chr>        
##  1     1 Jeju         Seoul-Gimpo            449   14107414 Domestic     
##  2     2 Sapporo      Tokyo-Haneda           835    9698639 Domestic     
##  3     3 Sydney       Melbourne              705    9245392 Domestic     
##  4     4 Fukuoka      Tokyo-Haneda           889    8762547 Domestic     
##  5     5 Mumbai       Delhi                 1150    7392155 Domestic     
##  6     6 Hanoi        Ho Chi Minh City      1171    6867114 Domestic     
##  7     7 Beijing      Shanghai-Hongqiao     1081    6518997 Domestic     
##  8     8 Hong Kong    Taipei-Taoyuan         802    6476268 International
##  9     9 Tokyo-Haneda Naha                  1573    5829712 Domestic     
## 10    10 Jakarta      Surabaya               700    5649046 Domestic     
## # … with 40 more rows
write_csv(x = subset_tbl, file = "raw_airports_data.csv")

Notice that most of the “airports” are actually just city names. We will use these to look up the longitude and latitude information. However, there are some exceptions like “Tokyo-Haneda”, where “Haneda” is one of the two international airports serving the city of Tokyo. We will need to clean up these exceptions so that every entry is a plain city name.

clean_tbl = subset_tbl %>% 
  dplyr::mutate(
    # Keep only the city name, i.e. everything before the first "-"
    city1 = stringr::str_remove(airport1, "-.*$"),
    city2 = stringr::str_remove(airport2, "-.*$"))

clean_tbl
## # A tibble: 50 x 8
##     rank airport1   airport2     distance passengers type      city1   city2    
##    <int> <chr>      <chr>           <int>      <int> <chr>     <chr>   <chr>    
##  1     1 Jeju       Seoul-Gimpo       449   14107414 Domestic  Jeju    Seoul    
##  2     2 Sapporo    Tokyo-Haneda      835    9698639 Domestic  Sapporo Tokyo    
##  3     3 Sydney     Melbourne         705    9245392 Domestic  Sydney  Melbourne
##  4     4 Fukuoka    Tokyo-Haneda      889    8762547 Domestic  Fukuoka Tokyo    
##  5     5 Mumbai     Delhi            1150    7392155 Domestic  Mumbai  Delhi    
##  6     6 Hanoi      Ho Chi Minh…     1171    6867114 Domestic  Hanoi   Ho Chi M…
##  7     7 Beijing    Shanghai-Ho…     1081    6518997 Domestic  Beijing Shanghai 
##  8     8 Hong Kong  Taipei-Taoy…      802    6476268 Internat… Hong K… Taipei   
##  9     9 Tokyo-Han… Naha             1573    5829712 Domestic  Tokyo   Naha     
## 10    10 Jakarta    Surabaya          700    5649046 Domestic  Jakarta Surabaya 
## # … with 40 more rows

Getting locations for the cities

Google Maps API

There are many ways of getting location information for cities. In the past, I have found the most reliable one to be ggmap, which uses the Google Maps API, but this means you must set up a Google Cloud Platform billing account (which unfortunately requires a credit card). See this documentation. Once a project is set up on the Google Cloud Platform, you will need to enable the Google Maps API by searching for it in the top search bar. An API key is required too; see the documentation for ggmap::register_google for more information.
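
As a minimal sketch, registering the key in R for the current session looks like this ("YOUR_API_KEY" below is a placeholder for your own key):

library(ggmap)
# Register the Google Maps API key for this session; add write = TRUE to
# save it to your .Renviron so it persists across sessions
ggmap::register_google(key = "YOUR_API_KEY")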

Aside: is all this worth it? In my experience, absolutely! Google Maps is very smart and tends to handle complexities that you didn’t think of. For example, if you are interested in the city of Sydney, Google Maps will understand that to mean the city of Sydney in Australia, not the city in Nova Scotia, Canada (I don’t know how they do this, but my guess is that they return the most relevant results, because, well, they are Google). Google Cloud also offers free credits for most of their basic services, so one can take advantage of these without incurring substantial costs.

To ensure reproducibility, I will use the code below to download the coordinates for all the cities, save them as a CSV file and make that file available on GitHub.

library(ggmap)
# ggmap::geocode("Sydney, Australia", output = "latlon", source = "google")
all_cities = c(clean_tbl$city1, clean_tbl$city2) %>% unique()
all_geocode = ggmap::geocode(location = all_cities, output = "latlon")
city_tbl = tibble(
  city = all_cities,
  lon = all_geocode$lon,
  lat = all_geocode$lat)

readr::write_csv(x = city_tbl, file = "./city_tbl.csv")

Alternatively, if you don’t want to register for Google’s billing, you could use tidygeocoder’s geocode function to get the latitude/longitude information via OpenStreetMap, which doesn’t require registration. In my experience, though, it can be slower than Google Maps.

tidygeocoder location extractions

A small example:

city_tbl = tibble(
  city = c(clean_tbl$city1, clean_tbl$city2) %>% unique()) %>% 
  tidygeocoder::geocode(city, method = 'osm', lat = latitude, long = longitude)

Simple maps location extractions

You could also use the data provided at https://simplemaps.com/data/world-cities and perform a join to get the location information. The downloaded data looks quite tidy, with ASCII-encoded city names included as well, and I was quite impressed with its quality. A rough sketch of this approach follows.
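
Assuming you have downloaded the free worldcities.csv file (whose columns include city, lat and lng), a minimal sketch looks like this:

worldcities = readr::read_csv("worldcities.csv")

# Keep one row per city name; city names are not unique worldwide, so in
# practice you may need to disambiguate duplicates (e.g. by country) first
city_tbl = worldcities %>% 
  dplyr::select(city, lon = lng, lat) %>% 
  dplyr::distinct(city, .keep_all = TRUE)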

Joining and visualising the data

Once we have the location information, we can simply join the data as follows:

city_tbl = readr::read_csv(file = "https://gist.githubusercontent.com/kevinwang09/006d00ee7a43778171e7fc2fd409cdd6/raw/f12ec9deecdd99b093d0c819e21b125a7f7d4afd/city_tbl.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   city = col_character(),
##   lon = col_double(),
##   lat = col_double()
## )
joined_data = clean_tbl %>% 
  left_join(city_tbl, by = c("city1" = "city")) %>% 
  left_join(city_tbl, by = c("city2" = "city"), suffix = c("1", "2"))
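
Because left_join fills unmatched cities with NA coordinates, it is worth a quick sanity check that every city found a match; a minimal sketch:

# Routes where either endpoint failed to match a geocoded city
# (this should return zero rows)
joined_data %>% 
  dplyr::filter(is.na(lon1) | is.na(lon2)) %>% 
  dplyr::select(rank, city1, city2)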

I really like plotly’s globe visualisation, because you can click and drag the globe to rotate it.

Using this visualisation, we can see that most of the busiest routes in the world are concentrated around Asia. What surprised me a few years ago is that Australia, despite its small population, also had a couple of routes make it onto this list, with Sydney–Melbourne ranked third. In my experience, on a good day there could be two planes at Sydney airport flying to Melbourne only 10 minutes apart, which I must admit was a rare luxury that I never fully appreciated.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
geo <- list(
  scope = 'world',
  # the orthographic projection renders a 3D globe that can be dragged to rotate
  projection = list(type = 'orthographic'),
  showland = TRUE,
  landcolor = toRGB("gray95"),
  countrycolor = toRGB("gray80")
)

fig <- plot_geo(color = I("red"))

fig <- fig %>% 
  add_markers(
    data = joined_data, x = ~lon1, y = ~lat1, text = ~city1,
    hoverinfo = "text", alpha = 0.5) %>% 
  add_markers(
    data = joined_data, x = ~lon2, y = ~lat2, text = ~city2,
    hoverinfo = "text", alpha = 0.5) %>% 
  add_segments(
    data = group_by(joined_data, rank),
    x = ~lon1, xend = ~lon2,
    y = ~lat1, yend = ~lat2,
    alpha = 0.3, size = I(1), 
    hoverinfo = "none") %>% 
  layout(
    title = 'Busiest air routes in the world',
    geo = geo, 
    showlegend = FALSE, 
    height = 800)

fig