image: irishsportsdaily.com


Introduction


I drew inspiration for this project from a variety of sources. First, from the movie Moneyball, based on the book by Michael Lewis. Then, from Analytics Edge, a course offered by edX and MIT, where we went step by step through how the Oakland A’s used statistics to gain a competitive edge over their opponents.

After I completed the course, I thought: why not do it for lacrosse, a sport I have played since I was five? While researching whether such an analysis had been done before, I stumbled upon an insightful article written in 2011 by Michael Mauboussin. I borrowed insights that he described in the article, in addition to exploring different metrics.

In this post you will see how I use R to manipulate and analyze the NCAA lacrosse data. Feel free to use the code and build upon it! If you are not familiar with the backstory of Moneyball, have a look at the wiki article. This project will be broken up into multiple parts and updated periodically over time.



Data Preparation


Required Packages

# load packages
library(tidyverse)
library(broom)
library(plotly)
library(highcharter)
library(fmsb)

Import Data

I collected the data from the NCAA’s website. Unfortunately, I was not able to scrape the tabular data with rvest like I was able to in other analyses because the table is dynamic. If anyone has an idea on how to specify which data to scrape from a dynamic table (i.e., setting the date, entry, etc. filters), please reach out!

I have posted the data that I used in this analysis here, so you don’t have to go through the hassle of copying & pasting like I did. The data consist of aggregate regular season team data from 2014 to 2017.

# Path to data
path <- "~/Documents/github/diving4data/data/Lacrosse_Pt1/"

# Read csv into teams
teams <- read_csv(paste0(path, "teams.csv"))

# Inspect structure of data frame
str(teams, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    253 obs. of  20 variables:
##  $ Year     : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ Team     : chr  "Loyola Maryland" "Denver" "Duke" "Bryant" ...
##  $ Conf     : chr  "Patriot" "Big East" "Atlantic Coast" "Northeast" ...
##  $ Won      : int  15 14 13 15 11 11 12 12 11 11 ...
##  $ Lost     : int  1 2 3 4 3 3 4 4 4 4 ...
##  $ Win_Pct  : num  0.938 0.875 0.813 0.789 0.786 0.786 0.75 0.75 0.733 0.733 ...
##  $ Games    : int  16 16 16 19 14 14 16 16 15 15 ...
##  $ Goals    : int  213 215 232 198 163 148 187 181 199 197 ...
##  $ Shots    : int  659 570 659 692 563 500 657 491 551 612 ...
##  $ ShotPct  : num  0.323 0.377 0.352 0.286 0.29 0.296 0.285 0.369 0.361 0.322 ...
##  $ GPG      : num  13.3 13.4 14.5 10.4 11.6 ...
##  $ SPG      : num  41.2 35.6 41.2 36.4 40.2 ...
##  $ GA       : int  114 135 152 141 97 123 160 135 177 145 ...
##  $ GAPG     : num  7.13 8.44 9.5 7.42 6.93 8.79 10 8.44 11.8 9.67 ...
##  $ GBPG     : num  35.8 29.2 35.7 35.1 36.1 ...
##  $ FO_Pct   : num  0.525 0.485 0.577 0.678 0.646 0.484 0.606 0.476 0.467 0.517 ...
##  $ RPI      : num  0.643 0.652 0.732 0.615 0.68 ...
##  $ Playoffs : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ Champion : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ proj.Wins: num  15.9 14.4 14.4 14.1 12.3 9 10.2 11.7 9.3 11.7 ...

As you can see, there are a lot of metrics to tackle. The names that are not so obvious:

  • GPG: Goals per Game
  • SPG: Shots per Game
  • GAPG: Goals Against per Game
  • GBPG: Groundballs per Game
  • FO_Pct: Faceoff Percentage
  • RPI: Rating Percentage Index, a strength-of-schedule metric

Lastly, I want to discuss proj.Wins, which I calculated based on the article from Mauboussin. The calculation turns out to be a fairly accurate predictor of Wins. Here is how the metric was calculated in R:

# Projected Wins (assign the result back so the new column is kept)
teams <- teams %>% 
  mutate(
    proj.Wins = round((0.5 + 0.08 * (GPG - GAPG)) * Games, 1)
  )

And here is the reasoning behind Mauboussin’s calculation:

In plain words, this equation says that goal differential (goals for minus goals against per game) for a season times a constant (α – .08 gives a best fit) added to .500 and then multiplied by the number of games played predicts actual wins (with error term ε).

This calculation is intuitive because lacrosse is a combination of how efficient your offense is at converting possessions into goals, and how effective your defense is at preventing the other team from scoring on their possessions.
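As a quick sanity check on the formula, here is a worked example with made-up numbers (not from the data set): a team scoring 13 goals per game while conceding 8 over a 16-game season projects to about 14 wins.

```r
# Hypothetical example of the proj.Wins formula (numbers are made up)
GPG <- 13; GAPG <- 8; Games <- 16
round((0.5 + 0.08 * (GPG - GAPG)) * Games, 1)
## [1] 14.4
```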



Analyzing & Visualizing Data


Predicting Wins

Before we go on with our analysis, let’s split our data into train and test, so we can make predictions on the 2017 season later.

# Split Data
train <- teams %>% 
  filter(Year != 2017)

test <- teams %>%
  filter(Year == 2017)


Projected Wins vs. Actual Wins

Now that the data is split, let’s take a look at how our projected wins variable performs against actual wins in a plotly scatter plot. In addition, let’s plot whether a team made the playoffs by making use of the color argument (blue = made playoffs).

# Scatter Plot
plt.1 <- train %>% 
  plot_ly(x=~proj.Wins) %>% 
  add_markers(
    y=~Won, 
    color=~factor(Playoffs),
    colors="RdYlBu", 
    hoverinfo="text",
    text = ~paste0(
      "<b>",Team,"</b>", 
      "<br> Year: ", Year,
      '<br> Actual: ', Won,
      '<br> Projected: ', proj.Wins
      )
    ) %>% 
  layout(
    title = "Actual Wins vs. Projected Wins (2014-2016)",
    xaxis = list(title = "Projected Wins"),
    yaxis = list(title = "Actual Wins"),
    margin = list(t = 100, r = 100, b=50),
    autosize= TRUE,
    showlegend = FALSE
  ) 
plt.1

Figure 1: If you didn’t take my word for it, you can definitely see now that the proj.Wins calculation captures a strong linear relationship. In addition, one can see that there is a threshold around the 10-win mark where the majority of teams make the playoffs.

Now that we know the relationship is linear, let’s fit a regression named mod.1 and check its results with the summary() function.

# Model 1: Actual Wins vs. Projected
mod.1 <- lm(Won ~ proj.Wins, data=train)
summary(mod.1)
## 
## Call:
## lm(formula = Won ~ proj.Wins, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4390 -0.7536 -0.0599  0.6095  3.2789 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.72165    0.20233   3.567 0.000459 ***
## proj.Wins    0.88705    0.02375  37.345  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.227 on 187 degrees of freedom
## Multiple R-squared:  0.8818, Adjusted R-squared:  0.8811 
## F-statistic:  1395 on 1 and 187 DF,  p-value: < 2.2e-16

The first thing to point out here is that proj.Wins is indeed statistically significant (p-value < 0.05) in predicting actual Wins. Another important result is the adjusted R-squared of 0.8811, which means about 88.11% of the variance in Wins can be explained by our proj.Wins variable.
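Since we held out the 2017 season earlier, mod.1 can also be checked out of sample. A minimal sketch (my own addition, not shown in the original post) using predict() on the test split from above:

```r
# Out-of-sample check: predict 2017 wins from proj.Wins
preds <- predict(mod.1, newdata = test)

# Root mean squared error against the actual 2017 win totals
sqrt(mean((test$Won - preds)^2))
```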

Let’s now add our regression line to the former plot plt.1 and annotate the adjusted R-squared.

# Call plt.1 object
plt.1 %>% 
  # add regression line
  add_lines(
    y = fitted(mod.1), 
    name = "Regression", 
    hoverinfo = "none"
    ) %>% 
  # add R-Square value
  add_annotations(
    x = 5,
    y = 15,
    xref = "x",
    yref = "y",
    text = "Adjusted R2: 0.88",
    xanchor = 'right',
    showarrow = FALSE,
    font = list(color = '#347703', size = 14)
  )

Figure 2: Our calculation of proj.Wins shows us that our initial hypothesis was correct: In order to win, we need to maximize goals for (Offensive Efficiency) and minimize goals against (Defensive Efficiency).

This shows only the big picture of what makes a team successful, but it does not tell us what leads a team to win games. Thus we need to look individually at what contributes to a good offense (efficient possessions that end in goals) and a good defense (vice versa).

Predicting the nominal number of wins for a team is all well and good, but not every team plays the same number of games. So from now on we are going to concentrate on the variables that affect Win_Pct.


Offense vs. Defense

We now know that a good offense and defense contribute to a successful season, but which side of the field is more important? I know what I want to say… but let’s see what the data tells us.

Figure 3: Goals (top), Goals Against (bottom)

It turns out that both Goals per Game and Goals Against per Game are significant variables when predicting Win Percentage. However, Goals per Game explains 62% of the variance in the data, while Goals Against per Game explains 52%. You can see this in the sparsity of the data in the second plot as Goals Against moves toward zero: some teams have very high win percentages while others have low ones.
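The code behind Figure 3 isn’t shown in the post; a minimal sketch of the two simple regressions, assuming the train split from above (the expected values come from the discussion, not from running this exact code):

```r
# Simple regressions behind Figure 3 (sketch; assumes `train` from above)
mod.gpg  <- lm(Win_Pct ~ GPG,  data = train)
mod.gapg <- lm(Win_Pct ~ GAPG, data = train)

summary(mod.gpg)$adj.r.squared    # about 0.62 per the discussion above
summary(mod.gapg)$adj.r.squared   # about 0.52
```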



Offensive Efficiency


To improve our team’s offense we need goals, which rely on efficient shooting and winning possessions. Therefore, the variables that we are going to analyze in order to maximize goals are:

  • Shots
  • Shot Percentage
  • Ground Balls
  • Face Offs


Correlation between Variables

Let’s take a look at a correlation matrix, making use of hchart(cor()) from the highcharter package.

# Correlation Plot
cor.off <- train %>% 
  select(
    GPG, ShotPct, SPG, 
    GBPG, FO_Pct
    )
hchart(cor(cor.off))

Figure 4: Hover over box to see correlation value

The correlation plot gives us a very interesting result. As we hoped, each of these variables is positively correlated with Goals per Game! I was nervous, however, that they would be strongly correlated with each other, which would cause problems of multicollinearity.

I am concerned about some of the variable combinations, like Shots per Game and Groundballs per Game, which have a positive correlation of 0.66. It is a judgment call whether to keep both variables based on their correlation, but there is yet another calculation, the Variance Inflation Factor (VIF), which quantifies the severity of multicollinearity.


Multiple Regression for Goals per Game

Let’s first run a regression including all the offensive variables to see if we notice anything fishy.

# Goals per Game Regression
mod.off.1 <- lm(
  GPG ~ ShotPct + SPG + GBPG + FO_Pct,
  data = train
)
# Summary of Regression
summary(mod.off.1)
## 
## Call:
## lm(formula = GPG ~ ShotPct + SPG + GBPG + FO_Pct, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45272 -0.07740 -0.02608  0.05392  0.83646 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.705417   0.133736 -72.571  < 2e-16 ***
## ShotPct     33.978390   0.362454  93.745  < 2e-16 ***
## SPG          0.272781   0.004447  61.343  < 2e-16 ***
## GBPG         0.020608   0.005906   3.490 0.000606 ***
## FO_Pct      -0.179432   0.203699  -0.881 0.379538    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.18 on 184 degrees of freedom
## Multiple R-squared:  0.9927, Adjusted R-squared:  0.9925 
## F-statistic:  6257 on 4 and 184 DF,  p-value: < 2.2e-16

WOW! We are explaining 99% of the variance in the data with our model; however, there does indeed appear to be multicollinearity at play. Doesn’t it seem odd that the coefficient estimate of FO_Pct is negative (-0.179) and not significant? That seems illogical given that there was a positive correlation between Goals per Game and Faceoff Percentage. This is a key sign of the dreaded multicollinearity, but it is something we can fix!


Testing for Multicollinearity

We will make use of VIF() from the fmsb package, which takes a linear model as input and calculates the VIF. If the VIF is greater than 10, multicollinearity is strongly suggested.

# Calculate VIF
VIF(mod.off.1)
## [1] 137.0273

YIKES! A VIF value of 137?! Time to take a closer look at the variables and how they relate to Goals per Game. We can run simple regressions for each variable rather easily with the help of the purrr package and tidy() from the broom package.
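If I understand fmsb’s implementation correctly, VIF() applied to a fitted model is simply 1 / (1 - R²), so the 137 follows directly from mod.off.1’s multiple R-squared of 0.9927:

```r
# VIF from the model's multiple R-squared
# (matches VIF(mod.off.1) above up to rounding of r2)
r2 <- 0.9927
1 / (1 - r2)
## [1] 136.9863
```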


Simple Regressions for Goals per Game

# List of simple regression summaries
summaries <- train %>% 
  # exclude GPG
  select(ShotPct, SPG, GBPG, FO_Pct) %>%
  map(~lm(train$GPG ~ .x, data = train)) %>% 
  map(summary)

# Extract slope estimates (2nd element of each 2x4 coefficient matrix)
coef.vals <- summaries %>% 
  map("coefficients") %>% 
  map_dbl(2) %>% 
  tidy %>%
  rename(estimate = x)

# Extract slope p-values (8th element, since the matrix is stored column-major)
p.vals <- summaries %>% 
  map("coefficients") %>% 
  map_dbl(8) %>% 
  tidy %>% 
  rename(p.value = x)

# Extract Adjusted R-Square
r2.vals <- summaries %>% 
  map_dbl("adj.r.squared") %>% 
  tidy %>% 
  rename(r.squared = x)

# Summary Table
GPG.mods <- coef.vals %>% 
  left_join(p.vals) %>% 
  left_join(r2.vals)

Output:

GPG.mods
## # A tibble: 4 x 4
##     names   estimate      p.value r.squared
##     <chr>      <dbl>        <dbl>     <dbl>
## 1 ShotPct 43.3394316 1.177670e-56 0.7389885
## 2     SPG  0.4022866 9.062225e-37 0.5746208
## 3    GBPG  0.4128869 1.239577e-28 0.4807331
## 4  FO_Pct 11.5939121 1.436926e-10 0.1935384

As we expected, it appears that Shot Percentage is the most significant offensive variable in explaining Goals per Game. I find it interesting that in this case the quality of shots is more important than the quantity of shots.
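For what it’s worth, the same summary table can be built a bit more compactly with reformulate() and map_dfr(). This is my own sketch, not the post’s original code, and it assumes the train data frame from above:

```r
# Alternative construction of the summary table (sketch; assumes `train`)
GPG.mods.alt <- c("ShotPct", "SPG", "GBPG", "FO_Pct") %>%
  purrr::set_names() %>%
  purrr::map_dfr(function(v) {
    # Fit GPG ~ <predictor> for each variable in turn
    fit <- lm(reformulate(v, response = "GPG"), data = train)
    s   <- summary(fit)
    tibble::tibble(
      estimate  = coef(fit)[[2]],        # slope on the predictor
      p.value   = s$coefficients[2, 4],  # slope p-value
      r.squared = s$adj.r.squared
    )
  }, .id = "names")
```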



To Be Continued

In the next Moneyball post, we will take a look at the defensive side of the field, where variables like caused turnovers, clears, and saves contribute to the efficiency of the defense and number of goals conceded.


