Posted on
Screenshot of Dashboard

Screenshot of Dashboard


Introduction


European Football is something that has really grown on me while living in Hamburg. I have had the pleasure of a going to multiple HSV and St. Pauli FC games, and I must say that the atmosphere is nothing short of amazing. I wanted to dive deeper into the subject and have a look at the different leagues around Europe and their respective teams. To see the live dashboard click here

If you like the dashboard and want to learn how I built it, follow along and I will show exactly how I gathered, manipulated, and analyzed the data that you see on the dashboard. The source code of the dashboard itself can be found in the top-corner of the dashboard.



Data Preparation


Load Packages

# Required packages
library(tidyverse)
library(RSQLite)
library(DBI)
library(ggmap)
library(plotly)
library(leaflet)
library(htmltools)


Import Data

First we are going to need to download the SQLite database that is needed for our analysis at https://www.kaggle.com/hugomathien/soccer. After you have the database downloaded make sure to put it in your working directory.

Using the DBI and RSQLite packages we are going to establish a connection with the database with the following commands:

con <- dbConnect(
  SQLite(), 
  dbname="database.sqlite"
  )

Now that we are connected to the database let’s have a look at the different table names that are present in con.

dbListTables(con)
## [1] "Country"           "League"            "Match"            
## [4] "Player"            "Player_Attributes" "Team"             
## [7] "Team_Attributes"   "sqlite_sequence"

You can look into the different tables to see what data is available or reference the kaggle page where there are descriptions for each table. However, for the sake of this analysis we are only going to be using the Country, League, Match, and Team tables.

In order to bring these tables into R and convert them into data frames we are going to use simple SQL commands in addition to chained dplyr commands. You could create more complex SQL commands in order to select the specific columns that you want; however, I prefer to use dplyr’s query commands as opposed to SQL’s.

Let’s import these four tables into R.

country <- con %>% 
  dbGetQuery("select * from Country") %>% 
  as_tibble() %>% 
  rename(country_id = id, country = name)

league <- con %>% 
  dbGetQuery("select * from League") %>% 
  as_tibble() %>% 
  rename(league = name, league_id = id)

teams <- con %>% 
  dbGetQuery("select * from Team") %>% 
  as_tibble() %>% 
  rename(team_id = id) %>% 
  select(team_id, team_api_id, team_long_name)

matches <- con %>% 
  dbGetQuery("select * from Match") %>% 
  as_tibble() %>% 
  rename(match_id = id) %>%
  # Ignore Player and Betting Info
  select(1:11)

Have a look at each of these data frames. You will notice that they are all connected by their respective IDs. The important thing now is to bring all this information together into one master data frame.


Data Manipulation

The hierarchy that connects the ID codes to one another is:

  • Country >> League >> Matches >> Teams

Connecting the Country and League data frames together seems straightforward. So let’s first join those two data frames using the dplyr::left_join().

# Join country with league
df <- country %>% 
  left_join(
    league, 
    by = c("country_id" = "country_id")
    )
## # A tibble: 5 x 4
##   country_id country league_id                 league
##        <int>   <chr>     <int>                  <chr>
## 1          1 Belgium         1 Belgium Jupiler League
## 2       1729 England      1729 England Premier League
## 3       4769  France      4769         France Ligue 1
## 4       7809 Germany      7809  Germany 1. Bundesliga
## 5      10257   Italy     10257          Italy Serie A

The tables appear to have succesfully joined. Now let’s try to join our newly created data frame, df, with the matches data frame.

# Join df with matches
df <- df %>% 
  left_join(
    matches,
    by = c("league_id" = "league_id")
    )

View the newly joined data frame in your console. It is a tad large to view here.

Now we need to attach teams to our data frame. Joining teams will be more challenging because the newly joined data from matches has a home team and away team that are uniquely identified by respective API codes.

## # A tibble: 5 x 4
##   home_team_api_id away_team_api_id home_team_goal away_team_goal
##              <int>            <int>          <int>          <int>
## 1             9987             9993              1              1
## 2            10000             9994              0              0
## 3             9984             8635              0              3
## 4             9991             9998              5              0
## 5             7947             9985              1              3

Therefore we are going to need to make two joins with teams, one for home and one for away.

df <- df %>% 
  # join home team
  left_join(
    teams, 
    by = c("home_team_api_id" = "team_api_id")
    ) %>%
  rename(home_team = team_long_name) %>% 
  # join away team
  left_join(
    teams,
    by = c("away_team_api_id" = "team_api_id")
    ) %>% 
  rename(away_team = team_long_name)

Great! Now we have each of the four data frames joined, but with many unnecessary columns, in particular the ids columns. So let’s select only the columns that we want to keep going forward.

df <- df %>% 
    select(
      country, league, season, stage,
      date, home_team, away_team,
      home_team_goal, away_team_goal
      )

Our data frame tells us how many goals were scored by the home and away teams, but says nothing about the result of the match. Thus we are going to create new variables named home_result and away_result that calculate the earned points for each match.

The point scale goes as follows:

  • 3 points for wins
  • 1 point for draws
  • 0 points for losses
df <- df %>%
    mutate(
      home_result = ifelse(
        home_team_goal > away_team_goal, 3, ifelse(
        home_team_goal == away_team_goal, 1, 0)),
      away_result = ifelse(
        away_team_goal > home_team_goal, 3, ifelse(
        away_team_goal == home_team_goal, 1, 0))
      )

This data frame is good and all if we want to look at each match individually, but what we are really after is seeing how team’s performed each season.

In order to do this we are going to need to dplyr::summarise() both the home and away results dpylr::group_by() the country, league, season, and team.

# Sum the number of home game points per season
hometeamResults <- df %>% 
  group_by(country, league, season, home_team) %>% 
  summarise(
    home_points = sum(home_result),
    home_games = n()
    )

# Sum the number of away game points per season
awayteamResults <- df %>% 
  group_by(country, league, season, away_team) %>% 
  summarise(
    away_points = sum(away_result),
    away_games = n()
    )

We can now combine the home and away results into a new data frame named teamResults. We are also going to make a couple new variables called total_points, total_games, and year.

# Combine home and away results
teamResults <- bind_cols(
  hometeamResults,
  away_points = awayteamResults$away_points,
  away_games = awayteamResults$away_games
  ) %>% 
  # Rename home_team to team
  rename(team = home_team) %>% 
  mutate(
    # Calculate total points
    total_points = home_points + away_points,
    # Calculate total games
    total_games = home_games + away_points,
    # Calculate a year variable to make plotting easier
    year = as.integer(
      stringr::str_extract(season, "\\d{4}"))
    ) %>% 
  # Remove home_points and away_points columns
  select(-c(home_points, away_points, home_games, away_games))

This data frame gives us all the teams that played in at least one season. In order to get an idea of who the top teams are, we are going to narrow our scope to only the teams that played in the highest division for the past 8 years (sorry St. Pauli).

# Find teams that played in the league for 8 seasons
topTeams <- teamResults %>% 
  group_by(team) %>% 
  count(team, sort = TRUE) %>%
  filter(n == 8)

# Make the data frame column a character vector
topTeams <- as.character(topTeams$team)

Now that we have a list of top teams, apply it to our teamResults data frame so that we only see results from the top teams. Let’s save our new results to a new data frame named top_team_results. We will also exclude Belgium in this step because they only have one team listed.

# Filter only top teams
top_team_results <- teamResults %>% 
  filter(
    team %in% topTeams,
    country != "Belgium"
    )

Our data frame should now be structured like this:

## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  728 obs. of  6 variables:
##  $ country     : chr  "England" "England" "England" "England" ...
##  $ league      : chr  "England Premier League" "England Premier League" "England Premier League" "England Premier League" ...
##  $ season      : chr  "2008/2009" "2008/2009" "2008/2009" "2008/2009" ...
##  $ team        : chr  "Arsenal" "Aston Villa" "Chelsea" "Everton" ...
##  $ total_points: num  72 62 83 63 86 50 90 45 36 51 ...
##  $ year        : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...

This data frame can now be used to create the plotly plot that is on the dashboard. You can check out the source code of the plot by going to the dashboard and clicking the Source Code button in the top right corner.



Analyze & Visualize Data


Home Advantage Analysis

We all hear about how important playing at home is, but really how important is it? and is it more important in some countries as opposed to others? In this part of the analysis we will take a look at just how much on average home results differ from away results.

Calculating Average Result

In order to calculate the metrics we need, we have to:

  1. Group by country and team
  2. Sum the total home and away points
  3. Sum the total home and away games
  4. Divide home and away points by respective games
top_pts_game <- top_team_results %>% 
  filter(country != "Belgium") %>% 
  # step 1
  group_by(country, team) %>% 
  summarise(
    Points = sum(total_points),
    Games = sum(total_games),
    # step 2
    home_points = sum(home_points),
    away_points = sum(away_points),
    # step 3
    home_games = sum(home_games),
    away_games = sum(away_games),
    # step 4
    `Overall Result` = round(Points/Games, 3),
    `Home Result` = round(home_points/home_games, 3),
    `Away Result` = round(away_points/away_games, 3)
  ) %>% 
  rename(
    Country = country, Club = team
  ) %>% 
  select(Country, Club, `Home Result`, `Away Result`, `Overall Result`)
## # A tibble: 3 x 4
##          Club `Home Result` `Away Result` `Overall Result`
##         <chr>         <dbl>         <dbl>            <dbl>
## 1     Arsenal         2.138         1.697            1.918
## 2 Aston Villa         1.217         1.059            1.138
## 3     Chelsea         2.217         1.717            1.967

It looks like the calculation worked!

Now let’s see which country has the largest difference in home and away results:

# Largest Difference
top_pts_game %>% 
  group_by(Country) %>% 
  summarise(
    `Home Result` = mean(`Home Result`) %>% 
      round(2),
    `Away Result` = mean(`Away Result`) %>% 
      round(2),
    Diff = `Home Result` - `Away Result`
  ) %>% 
  arrange(desc(Diff))
## # A tibble: 10 x 4
##        Country `Home Result` `Away Result`  Diff
##          <chr>         <dbl>         <dbl> <dbl>
##  1       Spain          1.99          1.34  0.65
##  2      Poland          1.79          1.21  0.58
##  3 Netherlands          1.91          1.33  0.58
##  4       Italy          1.91          1.34  0.57
##  5      France          1.82          1.28  0.54
##  6     England          1.90          1.37  0.53
##  7     Germany          1.79          1.27  0.52
##  8 Switzerland          1.82          1.33  0.49
##  9    Portugal          1.78          1.30  0.48
## 10    Scotland          1.68          1.39  0.29

It looks like it is really hard to come away with a win on the road in Spain, while the following 8 teams are more or less similar, and Scotland appears to have a slight difference in average result.


Adding Map Coordinates to our Data

In order to create the map that is in the dashboard, we need to gather geocode data for each of our clubs.

First, let’s select the country and team columns from top_team_results and save our results to a data frame named location_df.

# Select country and team from top_team_results
location_df <- top_team_results %>%
  select(country, team)

We can group and summarise the variables so that we don’t have duplicates of the teams. Then we will simply remove the n column.

# Group country and team to remove duplicate teams
location_df <- location_df[, 3:4] %>% 
  group_by(country, team) %>% 
  summarise(n = n()) %>% 
  select(-n)

Now let’s call geocode() from the ggmap package to get the coordinates for each club. We are going to query the google API by using both the club name and the country using the with() and paste0() functions.

Example query: geocode("Arsenal, England")

# Call API to get geocode info
coords <- with(location_df,
  paste0(team, ", ", country))

We now have the longitude and latitude for each club. Let’s now combine the data frames into a data frame named team_locations.

# Bind Columns of location_df and coords
team_locations <- bind_cols(
  country = location_df$country,
  team = location_df$team,
  lon = coords$lon, 
  lat = coords$lat
  )
## # A tibble: 4 x 4
##   country        team       lon      lat
##     <chr>       <chr>     <dbl>    <dbl>
## 1 England Aston Villa -1.885272 52.49598
## 2 England     Chelsea -0.174936 51.48509
## 3 England     Everton -2.966368 53.42571
## 4 England   Liverpool -2.991573 53.40837

With our newly created team_locations data frame, you are now ready to plot the clubs on a leaflet map. To see how I created the map for the dashboard, check out the source code of the map by going to the dashboard and clicking the Source Code button in the top right corner.



Acknowledgments


Data Source: https://www.kaggle.com/hugomathien/soccer


comments powered by Disqus