Posted on
image: vesselfinder.net

image: vesselfinder.net


Introduction


Shipping plays such a key role in our everyday lives, but unless you work in the field, you hardly take the time to wonder how products reach you and where they come from. So in this post, we are going to investigate the world’s busiest container ports in the world using R and finish up with plotting some interactive plots from the Plotly and Leaflet JavaScript libraries.

My inspiration for this post came from a tutorial I did from Sharpsightlabs while learning how to use dplyr and ggplot2. The tutorial showed me how to use rvest to scrape an HTML table from Wikipedia pages, and it is how we are going to get the data for our analysis.



Data Preparation


Load Packages

# Required packages
library(tidyverse)
library(forcats)
library(stringr)
library(purrr)
library(rvest)
library(ggmap)
library(plotly)
library(leaflet)
library(htmltools)
library(htmlwidgets)


Scrape Data from Web

First, we need to read in the html page from Wikipedia.

You can find the data used in the Analysis at the following link.

# Read in HTML from wiki table
url <- "https://en.wikipedia.org/wiki/List_of_busiest_container_ports"
html <- read_html(url)

Now we have a bunch of messy HTML of the entire Wikipedia page which is rather hard to work with. So we are going to now want to find out which section of HTML (World’s Busiest Container Ports table) that we want to keep and throw out the rest.

We can do this by using rvest::html_nodes() and specifying that we want to split on “tables.” We can then take this output and convert it to tables using rvest::html_table()

# Read HTML table data
data <- html %>% 
  html_nodes("table") %>% 
  html_table(fill=TRUE)

The individual html_nodes returned in list format, and the section that we are looking for is the second element of the list data[[1]]. Let’s make that element a data frame and name it df and rename Jurisdiction as economy.

# Create data frame for second node
df <- tbl_df(data[[1]]) %>% 
  rename(economy = Jurisdiction)

That is much easier to work with than raw HTML, however the data are far from tidy. As you can see we have many columns for the different years instead of having one column Year. Luckily it it easy to change this using the tidyr package.

Withtidyr::gather(), we can take all the column names that include Year and Volume data and gather them into only two columns. We do this by first specifying the key = year and value = volume for all the Year columns.

# Tidy data
df <- df %>% 
  gather(year, volume, 4:15)
## Classes 'tbl_df', 'tbl' and 'data.frame':    600 obs. of  5 variables:
##  $ Rank   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Port   : chr  "Shanghai" "Singapore" "Shenzhen" "Ningbo-Zhoushan" ...
##  $ economy: chr  "China" "Singapore" "China" "China" ...
##  $ year   : chr  "2015[1]" "2015[1]" "2015[1]" "2015[1]" ...
##  $ volume : chr  "36,516" "30,922" "24,142" "20,636" ...

This is a stylistic preference of mine. I like to convert all column names to lower case because I find it easier to reference during the data cleaning process. After cleaning the data, we will then capitalize each variable, so that they look good in the final output.

# Convert all colnames to lower
colnames(df) <- colnames(df) %>%
  tolower()

Now that we have a tidy data frame, let’s move on to cleaning it.


Clean Data

This first thing we want to do is address the NA values present in the data.

Let’s have a look using purrr::map(). I highly recommend learning how to use functional programming with the purrr package. It is applicable basically anytime you think to write a loop. A good place to start is the Iteration Chapter from the free online book R for Data Science by Garrett Grolemund and Hadley Wickham.

# Check for NAs
df %>% 
  map(is.na) %>% 
  map_int(sum)
##    rank    port economy    year  volume 
##       0       0       0       0       5

It appears we have 5 NAs in the volume column. We should inspect these values.

# Inspect NA rows
df %>% 
  filter(is.na(volume))
## # A tibble: 5 x 4
##    rank      port      economy volume
##   <int>     <chr>        <chr>  <chr>
## 1    38   Piraeus       Greece   <NA>
## 2    41 Vancouver       Canada   <NA>
## 3    45    Durban South Africa   <NA>
## 4    46     Colón       Panama   <NA>
## 5    49 Melbourne    Australia   <NA>

Given the poor ranking of these 5 ports, the lack of information for year 2004 is probably because they were out of the Top 50 at the time. We will remove these values from the data.

# Remove NA values
df <- df %>% 
  filter(!is.na(volume)) 

You may have noticed that year and volume are both character vectors when they should be integers. This has to do with how R read in the data from the HTML file. We should look at each of these variables individually and take the necessary steps to convert them into their proper types.

We will first look at year:

levels(as.factor(df$year))[1:4]
## [1] "2004[12]" "2005[11]" "2006[10]" "2007[9]"

As you can see we have a number in brackets attached to the end of our year. This is the way that tidyr read in the column names. The best way to reformat year is by first using regular expressions to locate the unwanted characters and remove them using stringr::str_replace(). After that we can convert them into integers.

df <- df %>% 
  mutate(
   year = str_replace(year, "\\[\\d+\\]", "") %>% 
     as.integer()
  )

Everything looks in order. Now onto volume. We will start by filtering out these "N/A" and blank spaces.

df <- df %>% 
  filter(!volume %in% c("N/A", ""))

We can’t quite change the type of Volume until we remove some other non-numeric characters first, notably the comma separating the thousands place. Let’s test this out and see if we can convert it successfully.

df %>% 
  mutate(
    volume = str_replace(volume, ",", "") %>% 
     as.integer()
  )
## Warning in function_list[[k]](value): NAs introduced by coercion
## # A tibble: 3 x 5
##    rank      port   economy  year volume
##   <int>     <chr>     <chr> <int>  <int>
## 1     1  Shanghai     China  2015  36516
## 2     2 Singapore Singapore  2015  30922
## 3     3  Shenzhen     China  2015  24142

It looks like we have received a warning. Maybe we are not done here. The output says that there were NAs introduced let’s see where they are:

df %>% 
  mutate(
    volume = str_replace(volume, ",", "") %>% 
     as.integer()
  ) %>% 
  filter(is.na(volume))
## # A tibble: 1 x 5
##    rank  port economy  year volume
##   <int> <chr>   <chr> <int>  <int>
## 1    46 Colón  Panama  2009     NA

It looks like Colón, Panama in the year 2009 is the only data point that is giving us trouble. Let’s see what this entry looks like in our original data frame:

df %>% 
  filter(
    port == "Colón", 
    year == 2009
    )
## # A tibble: 1 x 5
##    rank  port economy  year    volume
##   <int> <chr>   <chr> <int>     <chr>
## 1    46 Colón  Panama  2009 1,856[13]

So our problem has to do with these brackets at the end of the volume value. Let’s try again while also specifying the removal of these brackets.

df <- df %>% 
  mutate(
    volume = str_replace(volume, ",", "") %>%
      str_replace(., "\\[\\d+\\]", "") %>% 
      as.integer()
  )

It appears to have run with no errors! Let’s move on to converting the character vectors port and economy into factors.

# Convert character variables to factors
df <- df %>% 
  mutate(
    port = as.factor(str_trim(port)),
    economy = as.factor(str_trim(economy))
    )

It appears to have run with no errors. We will now have a look at the port factor if you are unsure whether it succeeded.

The values look logical, but some of them are rather long. We are now going to recode the factors using forcats::fct_recode so that the values are not so long for certain ports.

# Create new variable with shorter port names
df <- df %>% 
  mutate(
    port = fct_recode(port ,
      "Saigon" = "Ho Chi Minh City (Saigon)",
      "New York" = "New York and New Jersey",
      "Jakarta" = "Tanjung Priok (Jakarta)",
      "Bremen" = "Bremen/Bremerhaven",
      "Istanbul" = "Ambarli (Istanbul)",
      "Tangiers" = "Tanger-Med (Tangiers)",
      "Dubai" = "Jebel Ali (Dubai)",
      "Ningbo/Z-shan" = "Ningbo-Zhoushan"
      )
    )

There isn’t a variable for the respective continent of each port. Let’s create a variable continent using forcats::fct_collapse() and then see the counts for each continent using table().

# Create continent variable using fct_collapse
df <- df %>%
  mutate(
    continent = fct_collapse(economy,
      `South America` = c("Brazil","Panama"),
      `North America` = c("Canada","United States"),
      Asia = c(
        "Japan","China","India","Indonesia", "Hong Kong SAR",
        "Malaysia","Philippines","Saudi Arabia",
        "Singapore","South Korea", "Taiwan",
        "Thailand","United Arab Emirates","Vietnam"
        ),
      Europe = c(
        "Belgium","Germany","Italy","Netherlands", 
        "Greece", "Spain","Turkey","United Kingdom"
        ),
      Africa = c("Egypt","Morocco", "South Africa")
      )
    )

Interesting. It looks like the Asia and Europe dominate the World shipping economy with North America coming in at number 3.

Now we have a tidy data set, but before we move on to some more visualizations and analysis, let’s captitalize are column names, so that they look nice in our plots.

# Capitalize column headers
capFirst <- function(s) {
  paste0(
    toupper(substring(s, 1, 1)),
    substring(s, 2)
    )
}

colnames(df) <- colnames(df) %>%
  capFirst()



Analyze & Visualize Data


Plot Top Continents

In this section we are going to have a look at the top continents and their ports. To do this we are going to write a function that will create a plot for the top four continents.

First we will make a list of continents that will be passed into our function:

# List of continents to be passed to function
cont.list <- list(
  Asia = "Asia",
  Europe = "Europe",
  North_America = "North America", 
  South_America = "South America"
  )

Now we will create a function named plotly.fun() which will allow us to plot each continent without rewriting the code by passing our continent list that we made in the last code chunk.

plotly.fun <- function(cont.list, df) {
  # Match the name of continent with continent list
  df <- df %>% 
    filter(Continent == cont.list)
  
  # Extract title for subplots
  title = df$Continent[1]
  
  # Hovertext html
  h <- with(df, paste0(
        'Port: ', Port,
        '<br>Country: ', Economy,
        '<br>Year: ', Year,
        '<br>Volume: ', Volume
        )
  )
  
  # Create Plot
  plt <- df %>% 
    group_by(Economy, Port) %>% 
    # Plotly object
    plot_ly(
      x = ~Year, y = ~Volume,
      color = ~Port,
      type = "scatter",
      mode = "lines+markers",
      colors = "Set1",
      hoverinfo = 'text',
      text = h
      ) %>% 
    # Edit plotly layout
    layout(
      showlegend = FALSE,
      title = "The Crash of 2008 Rocked the World's Shipping Economy",
      xaxis = list(title = ""),
      yaxis = list(title = title),
      margin = "pad"
      )
  # Return plt
  plt
}

The map function takes the cont.list that we created as it’s first argument, passes each continent into the plotly.fun() which then takes our data frame (df) as the second argument. The output is a list with each element of that list a plotly plot of a continent.

# Create plots for each Port by Continent
continent.plotlys <- map(cont.list, plotly.fun, df)

Let’s plot out each of these plots using plotly::subplot()

# Plot Continent plots in a 3x2 grid
subplot(
  continent.plotlys[[1]],
  continent.plotlys[[2]],
  continent.plotlys[[3]], 
  continent.plotlys[[4]],
  nrows = 4, titleY = T, titleX = T
  ) %>% 
  layout(
    showlegend = FALSE,
    margin = "t"
  )

Figure 1: Hover over points to see more information

When analyzing these plots, three things really stood out to me:

  • It appears that the 2008 financial crisis had a huge impact on every continent.

  • Panama’s port Colón took a huge hit in 2010. After doing some research, I discovered that in Dec 8, 2010 Panama experienced historic flooding that forced the closure of the Panama Canal for the first time in 21 years. Interestingly Panama’s other port Balboa experienced an increase in traffic. Colón is on the Atlantic side of the Panama Canal while Balboa is on the Pacific. Perhaps the Pacific side was unaffected during the closure of the Canal.


Plot Top 10 Ports

We have already scene how the ports perform grouped by continent, but now let’s have a look at how each country performs by grouping by Economy.

First let’s group by country and summarise the average volume. Then we will create a variable named Rank, so that we can filter out the top 10 economies.

# Find overall Top 10 over the last 10 years
top.10.overall <- df %>% 
  group_by(Economy) %>% 
  summarise(Volume = sum(Volume)) %>%
  arrange(desc(Volume)) %>% 
  top_n(10) %>% 
  select(Economy)
## Selecting by Volume

Now let’s take the list of the Top 10 economies and join it with our original data set so that we can compare Volume. We will also create a rank column that ranks the top 10 for each year.

# Select only the top 10 countries from the original data set
top.10 <- top.10.overall %>% 
  left_join(df) %>% 
  group_by(Year, Economy) %>% 
  summarise(Volume = sum(Volume)) %>% 
  arrange(Year, desc(Volume))

Great now we will see how these economies match up amongst each other.

# Hovertext html
h <- with(top.10, paste0(
      'Country: ', Economy,
      '<br>Year: ', Year,
      '<br>Volume: ', Volume
      ))

# Plot Top 10 countries
top.10 %>% 
  group_by(Economy) %>% 
  # plotly object
  plot_ly(
    x = ~Year, y = ~Volume,
    color = ~Economy,
    type = "scatter",
    mode = "lines+markers",
    marker = list(size = 8, symbol = "diamond"),
    colors = "Set1",
    hoverinfo = "text",
    text = h
    ) %>% 
  # layout option
  layout(
    legend = list(orientation = 'h'),
    title = "China Competes on a Different Level",
    titlefont = list(size=20),
    xaxis = list(title = ""),
    showlegend = FALSE,
    margin = "t"
    )

Figure 2: Hover over points to see more information. You can also zoom in on the plot by clicking and dragging a window on the desktop on the plot.

As you can see China is by far the most dominant shipping economy in the world with more than the next to countries combined.


Plot Asian Ports

It appears that the Asian economies are the largest in scale. Let’s subset the data to compare Asian economies on a map. We are also going to exclude Saudi Arabia and UAE for mapping purposes and calculate three new variables startYear, endYear, and n (number of years), so that we are able to calculate annual growth for each port.

# Asian data frame
asia.1 <- df %>% 
  filter(
    Continent == "Asia",
    !Economy %in% c("Saudi Arabia", "United Arab Emirates")
    ) %>% 
  group_by(Port) %>% 
  summarise(
    startYear = min(Year),
    endYear = max(Year),
    n = endYear - startYear
  )

This bit is a little complicated. We are going to create another asia data frame by joining the data frame that we just made (asia.1) with the original data frame (df) so that we can get the volume value for each asian economy. We are actually making two joins here: 1. for when the Year (from df) equals startYear (from asia.1) and 2. when Year equals endYear. Phew. Now the easy part. We are going to mutate (make new column) for Annual Growth by plugging in the variables that we have already made.

# Perform joins from original tibble
asia.2 <- asia.1 %>% 
  # join start year Volume 
  left_join(
    df %>% select(Year, Port, Volume), 
    by = c("startYear" = "Year", "Port" = "Port")
    ) %>%
  rename(start.Vol = Volume) %>% 
  # join end Year Volume with Port Label
  left_join(
    df %>% select(Year, Port, Volume),
    by = c("endYear" = "Year", "Port" = "Port")
    ) %>%
  rename(end.Vol = Volume) %>% 
  # calculate annual growth for each port
  mutate(
    `Annual Growth` = round(((end.Vol/start.Vol)^1/n)*100,2)
)
## # A tibble: 3 x 4
##        Port start.Vol end.Vol `Annual Growth`
##      <fctr>     <int>   <int>           <dbl>
## 1     Busan     11430   19469           15.48
## 2    Dalian      2211    9591           39.44
## 3 Guangzhou      3308   17097           46.99

It worked. Now let’s add some coordinates so that we can plot this information on the data on the map. We are going to use ggmap::geocode(), which connects with Data Science Kit’s API to retrieve place information.

# Retrieve Longitude and Latitude for each port
asia.codes <- geocode(as.character(asia.2$Port), source = "dsk")

After you have the asia codes loaded, attach them to the asia.2 data frame and output a new data frame named asia.

# Bind Longitude and Latitude to asia tibble 
asia <- asia.2 %>% 
  bind_cols(asia.codes) %>% 
  rename(Lon = lon, Lat = lat, Port = Port)

For our plot let’s create a new column called Category which will color code the different Ports.

asia <- asia %>% 
  mutate(
    Rank = min_rank(desc(end.Vol)),
    Category = ifelse(
      Rank <= 10, "top10",ifelse(
      Rank <= 20 & Rank > 10, "top20", "top30"))
  )

We will set the Category accordingly: top10 = green, top20 = orange, top30 = red.

# Colors to be passed to the Ranking factor
top10.factor <- colorFactor(
  c("green", "orange", "red"),
  domain = c("top10", "top20", "top30")
  )

Here is some custom HTML that we will insert into our leaflet plot so that the hovertext looks nice.

# HTML to be inserted into Leaflet plot (labels)
hovertext <- mapply(
  function(Port, Rank, Growth, Vol) {
    htmltools::HTML(
      sprintf(
        "<div style='font-size:12px;width:130px;float:left'>
            <span style='color:#007ba7;font-size:18px;font-weight:bold'>%s</span><br/>
            <div style='width:95%%'>
              <span style='float:left;color:#007ba7'>Rank</span>
              <span style='float:right;color:#007ba7'=300px'>Growth</span>
              <br/>
              <span style='color:#5a5a5a;float:left'>%s</span>
              <span style='color:#5a5a5a;float:right'> %s %%</span><br clear='all'/>
              </div>
              <span style='color:#5a5a5a;font-size:11px'><em>Volume (2015): %s</em></span>
        </div>",
        htmlEscape(Port), htmlEscape(Rank), htmlEscape(Growth), htmlEscape(Vol)
      )
    )
  },
  as.factor(asia$Port), asia$Rank, asia$`Annual Growth`, asia$end.Vol,
  SIMPLIFY = F
  )

Let’s finally see what it looks like on a leaflet() map!

# Leaflet plot
asia %>% 
  leaflet(width="100%") %>%
  setView(
    lng = 119.2125,
    lat = 22.80447,
    zoom = 3
    ) %>% 
  addProviderTiles(providers$Esri.WorldTopoMap) %>%
  addCircleMarkers(
    lng = ~Lon, 
    lat = ~Lat,
    radius = ~ifelse(
      Category == "top10", 10, ifelse(
      Category == "top20", 7, 4)),
    color = ~top10.factor(Category),
    fillOpacity = 0.6,
    label = hovertext)

Figure 3: Two finger scroll to zoom in and out

I found it interesting to go port by port and see how the sea meets the land. Most large cities are built by the water to trade, and in this map some of all the largest ports, like Shanghai and Hong Kong, are also the largest cities in their respective countries.



Conclusion


With this project, I hope you learned a lot. We went through a lot of data manipulation and preparation, which is usually the meat of any project that I have worked on. We also saw how to work with a variety of packages, notably the rvest package for extracting HTML data and the javascript plotting packages plotly and leaflet for interactive plots. If anything was unclear, feel free to leave a comment below!



Acknowledgements


Thanks to all the hard work put in to make my life easier! Wouldn’t have been possible without the help of:


comments powered by Disqus