I drew inspiration for this project from a variety of sources. First, from the movie Moneyball, based on the book by Michael Lewis. Then, from Analytics Edge, a course offered by edX and MIT, where we went step by step through how the Oakland A’s used statistics to gain a competitive edge over opponents.
After I completed the course, I thought: why not do it for lacrosse, a sport I have played since I was five? As I researched whether such an analysis had been done before, I stumbled upon an insightful article written in 2011 by Michael Mauboussin. I borrowed insights he described in that article, in addition to exploring some different metrics.
You will be able to see how I use R to manipulate and analyze the NCAA lacrosse data. Feel free to use the code and build upon it! If you are not familiar with the background story of Moneyball, have a look at the Wikipedia article. This project will be broken up into multiple parts and updated periodically.
```r
# Load packages
library(tidyverse)
library(broom)
library(plotly)
library(highcharter)
library(fmsb)
```
I collected the data from the NCAA’s website. Unfortunately, I was not able to scrape the tabular data with rvest like I have in other analyses, because the table is dynamic. If anyone has an idea of how to specify which data you want to scrape from a dynamic table (i.e., specifying the date, entry, and other filters), please reach out!
I have posted the data that I used in this analysis here, so you don’t have to go through the hassle of copying & pasting like I did. The data consist of aggregate regular season team data from 2014 to 2017.
```r
# Path to data
path <- "~/Documents/github/diving4data/data/Lacrosse_Pt1/"

# Read csv into teams
teams <- read_csv(paste0(path, "teams.csv"))

# Inspect structure of data frame
str(teams, give.attr = FALSE)
```
```
## Classes 'tbl_df', 'tbl' and 'data.frame': 253 obs. of 20 variables:
##  $ Year     : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ Team     : chr  "Loyola Maryland" "Denver" "Duke" "Bryant" ...
##  $ Conf     : chr  "Patriot" "Big East" "Atlantic Coast" "Northeast" ...
##  $ Won      : int  15 14 13 15 11 11 12 12 11 11 ...
##  $ Lost     : int  1 2 3 4 3 3 4 4 4 4 ...
##  $ Win_Pct  : num  0.938 0.875 0.813 0.789 0.786 0.786 0.75 0.75 0.733 0.733 ...
##  $ Games    : int  16 16 16 19 14 14 16 16 15 15 ...
##  $ Goals    : int  213 215 232 198 163 148 187 181 199 197 ...
##  $ Shots    : int  659 570 659 692 563 500 657 491 551 612 ...
##  $ ShotPct  : num  0.323 0.377 0.352 0.286 0.29 0.296 0.285 0.369 0.361 0.322 ...
##  $ GPG      : num  13.3 13.4 14.5 10.4 11.6 ...
##  $ SPG      : num  41.2 35.6 41.2 36.4 40.2 ...
##  $ GA       : int  114 135 152 141 97 123 160 135 177 145 ...
##  $ GAPG     : num  7.13 8.44 9.5 7.42 6.93 8.79 10 8.44 11.8 9.67 ...
##  $ GBPG     : num  35.8 29.2 35.7 35.1 36.1 ...
##  $ FO_Pct   : num  0.525 0.485 0.577 0.678 0.646 0.484 0.606 0.476 0.467 0.517 ...
##  $ RPI      : num  0.643 0.652 0.732 0.615 0.68 ...
##  $ Playoffs : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ Champion : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ proj.Wins: num  15.9 14.4 14.4 14.1 12.3 9 10.2 11.7 9.3 11.7 ...
```
As you can see, there are a lot of metrics to tackle. Here are the names that are not so obvious:
- GPG: Goals per Game
- SPG: Shots per Game
- GAPG: Goals Against per Game
- GBPG: Ground Balls per Game
- RPI: Rating Percentage Index (a strength-of-schedule metric)
Lastly, I want to discuss proj.Wins, which was calculated based on Mauboussin’s article. The calculation turns out to be a fairly accurate predictor of wins. Here is how the metric was calculated in R:
```r
# Projected Wins
teams %>%
  mutate(proj.Wins = round((.5 + .08 * (GPG - GAPG)) * Games, 1))
```
And here is the reasoning behind Mauboussin’s calculation:
In plain words, this equation says that goal differential (goals for minus goals against, per game) for a season, times a constant (α = .08 gives a best fit), added to .500 and then multiplied by the number of games played, predicts actual wins (with error term ε).
This calculation is intuitive because lacrosse comes down to how efficiently your offense converts possessions into goals, and how effectively your defense prevents the other team from scoring on their possessions.
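To make the formula concrete, here is a quick worked example with made-up numbers (the team and its stats are hypothetical, purely for illustration):

```r
# Hypothetical team, numbers invented for illustration
gpg   <- 12   # goals scored per game
gapg  <- 9    # goals allowed per game
games <- 15   # games played

# Mauboussin's projection: (.500 + .08 * goal differential) * games
proj <- round((.5 + .08 * (gpg - gapg)) * games, 1)
proj
## [1] 11.1
```

A +3 goal differential per game bumps the expected winning percentage from .500 to .740, which over 15 games projects to about 11 wins.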
Analyzing & Visualizing Data
Before we go on with our analysis, let’s split our data into train and test sets, so we can make predictions on the 2017 season later.
```r
# Split data
train <- teams %>% filter(Year != 2017)
test  <- teams %>% filter(Year == 2017)
```
Projected Wins vs. Actual Wins
Now that the data are split, let’s take a look at how our projected wins variable performs against actual wins in a plotly scatter plot. In addition, let’s plot whether or not a team made the playoffs by making use of the color argument. (Blue = Made Playoffs)
```r
# Scatter plot
plt.1 <- train %>%
  plot_ly(x = ~proj.Wins) %>%
  add_markers(
    y = ~Won,
    color = ~factor(Playoffs),
    colors = "RdYlBu",
    hoverinfo = "text",
    text = ~paste0(
      "<b>", Team, "</b>",
      "<br> Year: ", Year,
      "<br> Actual: ", Won,
      "<br> Projected: ", proj.Wins
    )
  ) %>%
  layout(
    title = "Actual Wins vs. Projected Wins (2014-2016)",
    xaxis = list(title = "Projected Wins"),
    yaxis = list(title = "Actual Wins"),
    margin = list(t = 100, r = 100, b = 50),
    autosize = TRUE,
    showlegend = FALSE
  )

plt.1
```
Now that we know the relationship is linear, let’s fit a regression named mod.1 and check out its results with summary():
```r
# Model 1: Actual Wins vs. Projected Wins
mod.1 <- lm(Won ~ proj.Wins, data = train)
summary(mod.1)
```
```
## 
## Call:
## lm(formula = Won ~ proj.Wins, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4390 -0.7536 -0.0599  0.6095  3.2789 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.72165    0.20233   3.567 0.000459 ***
## proj.Wins    0.88705    0.02375  37.345  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.227 on 187 degrees of freedom
## Multiple R-squared:  0.8818, Adjusted R-squared:  0.8811 
## F-statistic:  1395 on 1 and 187 DF,  p-value: < 2.2e-16
```
The first thing to point out here is that proj.Wins is indeed statistically significant (p-value < 0.05) in predicting actual wins. Another important result is the adjusted R-squared of 0.8811, which means about 88.11% of the variance in the data can be explained by our proj.Wins variable.
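As an aside, since broom is already loaded, the same summary statistics can be pulled out programmatically with glance() (a small sketch, not part of the original workflow):

```r
# One-row tidy summary of the model: R-squared, F-statistic, p-value, etc.
glance(mod.1)

# Just the adjusted R-squared
glance(mod.1)$adj.r.squared   # ~0.8811 on the training data
```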
Let’s now add our regression line to the earlier plot, plt.1, and annotate the adjusted R-squared.
```r
# Call plt.1 object
plt.1 %>%
  # Add regression line
  add_lines(
    y = fitted(mod.1),
    name = "Regression",
    hoverinfo = "none"
  ) %>%
  # Add R-squared value
  add_annotations(
    x = 5, y = 15,
    xref = "x", yref = "y",
    text = "Adjusted R2: 0.88",
    xanchor = "right",
    showarrow = FALSE,
    font = list(color = '#347703', size = 14)
  )
```
This shows only the big picture of what makes a team successful, but it does not tell us what leads a team to win games. Thus we need to look individually at what contributes to a good offense (efficient possessions that end in goals) and a good defense (the reverse).
Predicting the nominal number of wins is great and all, but not every team plays the same number of games. So from now on we are going to concentrate on the variables that affect Win_Pct.
Offense vs. Defense
We now know that a good offense and a good defense contribute to a successful season, but which side of the field is more important? I know what I want to say… but let’s see what the data tell us.
It turns out that both Goals per Game and Goals Against per Game are significant variables when predicting win percentage. However, Goals per Game explains 62% of the variance in the data, while Goals Against per Game explains only 52%. You can see this in the sparsity of the data in the second plot as Goals Against moves toward zero; some teams have very high win percentages while others have low ones.
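The adjusted R-squared figures quoted above come from simple regressions along these lines (a sketch of the fits; the exact code isn’t shown in the original):

```r
# Simple regressions of win percentage on each side of the field
mod.gpg  <- lm(Win_Pct ~ GPG,  data = train)   # offense
mod.gapg <- lm(Win_Pct ~ GAPG, data = train)   # defense

summary(mod.gpg)$adj.r.squared    # ~0.62, per the discussion above
summary(mod.gapg)$adj.r.squared   # ~0.52, per the discussion above
```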
To improve our team’s offense we need goals, which rely on efficient shooting and on having possessions. Therefore, the variables that we are going to analyze in order to maximize goals are:
- Shot Percentage
- Ground Balls
- Face Offs
Correlation between Variables
Let’s take a look at a correlation matrix, making use of hchart(cor()) from the highcharter package:
```r
# Correlation plot
cor.off <- train %>%
  select(GPG, ShotPct, SPG, GBPG, FO_Pct)

hchart(cor(cor.off))
```
The correlation plot gives us a very interesting result. As we hoped, each of these variables is positively correlated with Goals per Game! I was nervous, however, that they would be strongly correlated with each other, which would cause problems of multicollinearity.
I am concerned about some of the variable combinations, like Shots per Game and Ground Balls per Game, which have a positive correlation of 0.66. It is a judgment call whether to keep both variables based on their correlation, but there is another calculation, the Variance Inflation Factor (VIF), which quantifies the severity of multicollinearity.
Multiple Regression for Goals per Game
Let’s first run a regression including all the offensive variables to see if we notice anything fishy.
```r
# Goals per Game regression
mod.off.1 <- lm(
  GPG ~ ShotPct + SPG + GBPG + FO_Pct,
  data = train
)

# Summary of regression
summary(mod.off.1)
```
```
## 
## Call:
## lm(formula = GPG ~ ShotPct + SPG + GBPG + FO_Pct, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45272 -0.07740 -0.02608  0.05392  0.83646 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.705417   0.133736 -72.571  < 2e-16 ***
## ShotPct     33.978390   0.362454  93.745  < 2e-16 ***
## SPG          0.272781   0.004447  61.343  < 2e-16 ***
## GBPG         0.020608   0.005906   3.490 0.000606 ***
## FO_Pct      -0.179432   0.203699  -0.881 0.379538    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.18 on 184 degrees of freedom
## Multiple R-squared:  0.9927, Adjusted R-squared:  0.9925 
## F-statistic:  6257 on 4 and 184 DF,  p-value: < 2.2e-16
```
WOW! We are explaining over 99% of the variance in the data with our model; however, there does indeed appear to be multicollinearity at play. Doesn’t it seem odd that the coefficient estimate for FO_Pct is negative (-0.179) and not significant, even though Face-Off Percentage had a positive correlation with Goals per Game? This is a key sign of the dreaded multicollinearity, but it is something we can fix!
Testing for Multicollinearity
We will make use of VIF() from the fmsb package, which takes a linear model as input and calculates the VIF. If the VIF is greater than 10, multicollinearity is strongly suggested.
```r
# Calculate VIF
VIF(mod.off.1)
```
```
## [1] 137.0273
```
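For intuition: fmsb’s VIF() on a fitted model is equivalent to 1 / (1 - R²), so the huge value follows directly from the model’s R-squared of 0.9927 (a quick cross-check, not part of the original workflow):

```r
# VIF of the full model equals 1 / (1 - R^2)
r2 <- summary(mod.off.1)$r.squared   # 0.9927 from the output above
1 / (1 - r2)                         # ~137, matching VIF(mod.off.1)
```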
YIKES! A VIF value of 137?! Time to take a closer look at the variables and how they relate to Goals per Game. We can run simple regressions for each variable rather easily with the help of the purrr package and tidy() from the broom package.
Simple Regressions for Goals per Game
```r
# List of simple regression summaries
summaries <- train %>%
  # Exclude GPG (the response)
  select(ShotPct, SPG, GBPG, FO_Pct) %>%
  map(~ lm(train$GPG ~ .x, data = train)) %>%
  map(summary)

# Extract estimate (coefficient) values
coef.vals <- summaries %>%
  map("coefficients") %>%
  map_dbl(2) %>%
  tidy() %>%
  rename(estimate = x)

# Extract p-values
p.vals <- summaries %>%
  map("coefficients") %>%
  map_dbl(8) %>%
  tidy() %>%
  rename(p.value = x)

# Extract adjusted R-squared
r2.vals <- summaries %>%
  map_dbl("adj.r.squared") %>%
  tidy() %>%
  rename(r.squared = x)

# Summary table
GPG.mods <- coef.vals %>%
  left_join(p.vals) %>%
  left_join(r2.vals)

GPG.mods
```
```
## # A tibble: 4 x 4
##     names   estimate      p.value r.squared
##     <chr>      <dbl>        <dbl>     <dbl>
## 1 ShotPct 43.3394316 1.177670e-56 0.7389885
## 2     SPG  0.4022866 9.062225e-37 0.5746208
## 3    GBPG  0.4128869 1.239577e-28 0.4807331
## 4  FO_Pct 11.5939121 1.436926e-10 0.1935384
```
It appears, as we expected, that Shot Percentage is the most significant offensive variable in explaining Goals per Game. I find it interesting that, in this case, the quality of shots is more important than the quantity of shots.
To Be Continued
In the next Moneyball post, we will take a look at the defensive side of the field, where variables like caused turnovers, clears, and saves contribute to the efficiency of the defense and number of goals conceded.
Michael Mauboussin (2011): Analysis of Division I Men’s Lacrosse Statistics (article source).
broom: David Robinson (2013). broom: Convert statistical analysis objects from R into tidy data frames. https://cran.r-project.org/web/packages/broom/index.html
fmsb: Minato Nakazawa (2016). fmsb::VIF(): Calculate variance inflation factor (VIF) from the result of lm. https://cran.r-project.org/web/packages/fmsb/index.html
highcharter: Joshua Kunst (2017). highcharter: a wrapper for the ‘Highcharts’ library including shortcut functions to plot R objects. http://jkunst.com/highcharter/
plotly: Plotly (2014). plotly: Easily translate ‘ggplot2’ graphs to an interactive web-based version and/or create custom web-based visualizations directly from R. https://github.com/ropensci/plotly
tidyverse: RStudio, Inc. (2016). tidyverse: a set of packages that work in harmony because they share common data representations and ‘API’ design. https://github.com/tidyverse/tidyverse