# Moneyball Lacrosse

**Introduction**

I drew inspiration for this project from a variety of sources. First, from the movie Moneyball, based on the book by Michael Lewis. Then, from Analytics Edge, a course offered by edX and MIT, where we went step-by-step through how the Oakland A’s used statistics to gain a competitive edge over their opponents.

After I completed the course, I thought: why not do it for lacrosse, a sport I have played since I was five? As I researched whether such an analysis had been done before, I stumbled upon an insightful article written in 2011 by Michael Mauboussin. I borrowed insights he described in the article, in addition to exploring different metrics.

You will be able to see how I use R to manipulate and analyze the NCAA Lacrosse data. Feel free to use the code and build upon it! If you are not familiar with the background story of Moneyball, have a look at the wiki article. This project will be broken up into multiple parts and periodically updated over time.

**Data Preparation**

### Required Packages

```
# load packages
library(tidyverse)
library(broom)
library(plotly)
library(highcharter)
library(fmsb)
```

### Import Data

I collected the data from the NCAA’s website. Unfortunately, I was not able to scrape the tabular data with `rvest` like I was able to in other analyses, because the table is dynamic. If anyone has an idea on how to specify which data you want to scrape from a dynamic table (i.e. specifying the date, entry, etc. filters), please reach out!

I have posted the data that I used in this analysis here, so you don’t have to go through the hassle of copying & pasting like I did. The data consist of aggregate regular season team data from 2014 to 2017.

```
# Path to data
path <- "~/Documents/github/diving4data/data/Lacrosse_Pt1/"
# Read csv into teams
teams <- read_csv(paste0(path, "teams.csv"))
# Inspect structure of data frame
str(teams, give.attr = FALSE)
```

```
## Classes 'tbl_df', 'tbl' and 'data.frame': 253 obs. of 20 variables:
## $ Year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ Team : chr "Loyola Maryland" "Denver" "Duke" "Bryant" ...
## $ Conf : chr "Patriot" "Big East" "Atlantic Coast" "Northeast" ...
## $ Won : int 15 14 13 15 11 11 12 12 11 11 ...
## $ Lost : int 1 2 3 4 3 3 4 4 4 4 ...
## $ Win_Pct : num 0.938 0.875 0.813 0.789 0.786 0.786 0.75 0.75 0.733 0.733 ...
## $ Games : int 16 16 16 19 14 14 16 16 15 15 ...
## $ Goals : int 213 215 232 198 163 148 187 181 199 197 ...
## $ Shots : int 659 570 659 692 563 500 657 491 551 612 ...
## $ ShotPct : num 0.323 0.377 0.352 0.286 0.29 0.296 0.285 0.369 0.361 0.322 ...
## $ GPG : num 13.3 13.4 14.5 10.4 11.6 ...
## $ SPG : num 41.2 35.6 41.2 36.4 40.2 ...
## $ GA : int 114 135 152 141 97 123 160 135 177 145 ...
## $ GAPG : num 7.13 8.44 9.5 7.42 6.93 8.79 10 8.44 11.8 9.67 ...
## $ GBPG : num 35.8 29.2 35.7 35.1 36.1 ...
## $ FO_Pct : num 0.525 0.485 0.577 0.678 0.646 0.484 0.606 0.476 0.467 0.517 ...
## $ RPI : num 0.643 0.652 0.732 0.615 0.68 ...
## $ Playoffs : int 1 1 1 1 1 1 1 0 1 1 ...
## $ Champion : int 0 0 1 0 0 0 0 0 0 0 ...
## $ proj.Wins: num 15.9 14.4 14.4 14.1 12.3 9 10.2 11.7 9.3 11.7 ...
```

As you can see, there are a lot of metrics to tackle. The names that are not so obvious:

- `GPG`: Goals per Game
- `SPG`: Shots per Game
- `GAPG`: Goals Against per Game
- `GBPG`: Groundballs per Game
- `FO_Pct`: Faceoff Percentage
- `RPI`: Strength of Schedule Metric

Lastly, I want to discuss `proj.Wins`, which was calculated based on the article from Mauboussin. The calculation turns out to be a fairly accurate predictor of wins. Here is how the metric was calculated in R:

```
# Projected Wins
teams <- teams %>%
  mutate(
    proj.Wins = round((.5 + .08 * (GPG - GAPG)) * Games, 1)
  )
```

And here is the reasoning behind Mauboussin’s calculation:

> In plain words, this equation says that goal differential (goals for minus goals against per game) for a season times a constant (α – .08 gives a best fit) added to .500 and then multiplied by the number of games played predicts actual wins (with error term ε).

This calculation is intuitive because lacrosse is a combination of how efficient your offense is at converting possessions into goals, and how effective your defense is at preventing the other team from scoring on their possessions.
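To make the formula concrete, here is the calculation by hand for the first row of the data (Loyola Maryland, 2014: 213 goals for, 114 against, over 16 games), using nothing but base R:

```
# Projected wins by hand for Loyola Maryland, 2014
GPG  <- 213 / 16   # goals per game
GAPG <- 114 / 16   # goals against per game
proj <- round((0.5 + 0.08 * (GPG - GAPG)) * 16, 1)
proj
## [1] 15.9
```

which matches the `proj.Wins` value of 15.9 in the data above.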

**Analyzing & Visualizing Data**

### Predicting Wins

Before we go on with our analysis, let’s split our data into `train` and `test`, so we can make predictions on the 2017 season later.

```
# Split Data
train <- teams %>%
  filter(Year != 2017)
test <- teams %>%
  filter(Year == 2017)
```

#### Projected Wins vs. Actual Wins

Now that the data is split, let’s take a look at how our projected wins variable performs against actual wins in a plotly scatter plot. In addition, let’s indicate whether a team made the playoffs by making use of the `color` argument (blue = made playoffs).

```
# Scatter Plot
plt.1 <- train %>%
  plot_ly(x = ~proj.Wins) %>%
  add_markers(
    y = ~Won,
    color = ~factor(Playoffs),
    colors = "RdYlBu",
    hoverinfo = "text",
    text = ~paste0(
      "<b>", Team, "</b>",
      "<br> Year: ", Year,
      "<br> Actual: ", Won,
      "<br> Projected: ", proj.Wins
    )
  ) %>%
  layout(
    title = "Actual Wins vs. Projected Wins (2014-2016)",
    xaxis = list(title = "Projected Wins"),
    yaxis = list(title = "Actual Wins"),
    margin = list(t = 100, r = 100, b = 50),
    autosize = TRUE,
    showlegend = FALSE
  )
plt.1
```

Now that we know the relationship is linear, let’s fit a regression named `mod.1` and check out its results with the `summary()` function.

```
# Model 1: Actual Wins vs. Projected
mod.1 <- lm(Won ~ proj.Wins, data=train)
summary(mod.1)
```

```
##
## Call:
## lm(formula = Won ~ proj.Wins, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4390 -0.7536 -0.0599 0.6095 3.2789
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.72165 0.20233 3.567 0.000459 ***
## proj.Wins 0.88705 0.02375 37.345 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.227 on 187 degrees of freedom
## Multiple R-squared: 0.8818, Adjusted R-squared: 0.8811
## F-statistic: 1395 on 1 and 187 DF, p-value: < 2.2e-16
```

The first thing to point out here is that *proj.Wins* is indeed statistically significant (p-value < 0.05) in predicting actual wins. Another important result is the adjusted R-squared of 0.8811, which means about 88.11% of the variance in actual wins can be explained by our *proj.Wins* variable.
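As a side note, the adjusted R-squared does not have to be read off the printout; it can be pulled from the summary object directly (on our model, `summary(mod.1)$adj.r.squared`), which is handy when annotating plots. A self-contained toy illustration:

```
# Extract adjusted R-squared programmatically (toy model so the snippet
# runs on its own; use summary(mod.1)$adj.r.squared on the real data)
set.seed(1)
toy <- data.frame(x = 1:20)
toy$y <- 2 * toy$x + rnorm(20)
fit <- lm(y ~ x, data = toy)
adj.r2 <- summary(fit)$adj.r.squared
round(adj.r2, 2)
```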

Let’s now add our regression line to our former plot `plt.1` and annotate the adjusted R-squared.

```
# Call plt.1 object
plt.1 %>%
  # add regression line
  add_lines(
    y = fitted(mod.1),
    name = "Regression",
    hoverinfo = "none"
  ) %>%
  # add R-Square value
  add_annotations(
    x = 5,
    y = 15,
    xref = "x",
    yref = "y",
    text = "Adjusted R2: 0.88",
    xanchor = 'right',
    showarrow = FALSE,
    font = list(color = '#347703', size = 14)
  )
```

This shows only the big picture of what makes a team successful, but it does not tell us what leads a team to win games. Thus we need to look individually at what contributes to a good offense (efficient possessions that end in goals) and a good defense (vice versa).

Predicting the nominal number of wins for a team is great and all, but not every team plays the same number of games. So from now on we are going to concentrate on the variables that affect *Win_Pct*.

### Offense vs. Defense

We now know that a good offense and defense contribute to a successful season, but which side of the field is *more* important? I know what I want to say… but let’s see what the data tell us.

It turns out that both *Goals per Game* and *Goals Against per Game* are significant variables when predicting win percentage. However, *Goals per Game* explains **62%** of the variance in the data, while *Goals Against per Game* explains **52%**. You can see this in the sparsity of the data in the second plot as goals against moves toward zero; some teams have very high win percentages while others have low ones.
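For reference, those percentages are the adjusted R-squared values from simple regressions of *Win_Pct* on each variable. A sketch of the approach, shown on synthetic stand-in data so the snippet runs on its own (fit on `train` to reproduce the 62% and 52% figures):

```
# Adjusted R-squared of Win_Pct ~ GPG and of Win_Pct ~ GAPG
# (synthetic stand-in data; swap `demo` for `train` to get the real numbers)
set.seed(42)
demo <- data.frame(GPG = runif(100, 6, 15), GAPG = runif(100, 6, 15))
demo$Win_Pct <- pmin(pmax(
  0.5 + 0.06 * (demo$GPG - demo$GAPG) + rnorm(100, 0, 0.1), 0), 1)

r2.gpg  <- summary(lm(Win_Pct ~ GPG,  data = demo))$adj.r.squared
r2.gapg <- summary(lm(Win_Pct ~ GAPG, data = demo))$adj.r.squared
c(GPG = r2.gpg, GAPG = r2.gapg)
```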

**Offensive Efficiency**

To improve our team’s offense we need goals, which rely on efficient shooting and having possessions. Therefore, the variables we are going to analyze in order to maximize goals are:

- Shots
- Shot Percentage
- Ground Balls
- Face Offs

### Correlation between Variables

Let’s take a look at a correlation matrix, making use of `hchart(cor())` from the `highcharter` package.

```
# Correlation Plot
cor.off <- train %>%
  select(
    GPG, ShotPct, SPG,
    GBPG, FO_Pct
  )
hchart(cor(cor.off))
```

The correlation plot gives us a very interesting result. As we hoped, each of these variables is positively correlated with goals per game! I was nervous, however, that they would be highly correlated with each other, which would cause problems of multicollinearity.

I am concerned about some of the variable combinations, like shots per game and groundballs per game, which have a positive correlation of 0.66. It is a judgment call whether to keep both variables based on their correlation, but there is another calculation, the Variance Inflation Factor (VIF), which quantifies the severity of multicollinearity.

### Multiple Regression for Goals per Game

Let’s first run a regression including all the offensive variables to see if we notice anything fishy.

```
# Goals per Game Regression
mod.off.1 <- lm(
  GPG ~ ShotPct + SPG + GBPG + FO_Pct,
  data = train
)
# Summary of Regression
summary(mod.off.1)
```

```
##
## Call:
## lm(formula = GPG ~ ShotPct + SPG + GBPG + FO_Pct, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45272 -0.07740 -0.02608 0.05392 0.83646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.705417 0.133736 -72.571 < 2e-16 ***
## ShotPct 33.978390 0.362454 93.745 < 2e-16 ***
## SPG 0.272781 0.004447 61.343 < 2e-16 ***
## GBPG 0.020608 0.005906 3.490 0.000606 ***
## FO_Pct -0.179432 0.203699 -0.881 0.379538
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.18 on 184 degrees of freedom
## Multiple R-squared: 0.9927, Adjusted R-squared: 0.9925
## F-statistic: 6257 on 4 and 184 DF, p-value: < 2.2e-16
```

WOW! We are explaining **99%** of the variance in the data with our model; however, there does indeed appear to be multicollinearity at play. Doesn’t it seem odd that the coefficient estimate of *FO_Pct* is negative (-0.179) and not significant? That seems illogical given the positive correlation between goals per game and faceoffs. This is a key sign of the dreaded multicollinearity, but it is something we can fix!

### Testing for Multicollinearity

We will make use of `VIF()` from the `fmsb` package, which takes a linear model as input and calculates the VIF. If the VIF calculation is greater than 10, multicollinearity is strongly suggested.

```
# Calculate VIF
VIF(mod.off.1)
```

`## [1] 137.0273`
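For the curious, `fmsb::VIF()` boils down to VIF = 1 / (1 − R²) of the model it is handed, so 1 / (1 − 0.9927) ≈ 137 for `mod.off.1`. A self-contained illustration of the rule of thumb on deliberately collinear synthetic data (not the lacrosse data):

```
# VIF = 1 / (1 - R^2); two nearly collinear predictors blow past the
# rule-of-thumb threshold of 10 (synthetic data for illustration)
set.seed(7)
x1  <- rnorm(100)
x2  <- x1 + rnorm(100, sd = 0.1)   # x2 is x1 plus a little noise
fit <- lm(x1 ~ x2)
vif <- 1 / (1 - summary(fit)$r.squared)
vif > 10
## [1] TRUE
```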

YIKES! A VIF value of **137**?! Time to take a closer look at the variables and how they relate to goals per game. We can run simple regressions for each variable rather easily with the help of the `purrr` package and `tidy()` from the `broom` package.

### Simple Regressions for Goals per Game

```
# List of simple regression summaries
summaries <- train %>%
  # exclude GPG
  select(ShotPct, SPG, GBPG, FO_Pct) %>%
  map(~lm(train$GPG ~ .x, data = train)) %>%
  map(summary)
# Extract Estimate (Coef) values
coef.vals <- summaries %>%
  map("coefficients") %>%
  map_dbl(2) %>%
  tidy %>%
  rename(estimate = x)
# Extract P-values
p.vals <- summaries %>%
  map("coefficients") %>%
  map_dbl(8) %>%
  tidy %>%
  rename(p.value = x)
# Extract Adjusted R-Square
r2.vals <- summaries %>%
  map_dbl("adj.r.squared") %>%
  tidy %>%
  rename(r.squared = x)
# Summary Table
GPG.mods <- coef.vals %>%
  left_join(p.vals) %>%
  left_join(r2.vals)
```

Output of `GPG.mods`:

```
## # A tibble: 4 x 4
## names estimate p.value r.squared
## <chr> <dbl> <dbl> <dbl>
## 1 ShotPct 43.3394316 1.177670e-56 0.7389885
## 2 SPG 0.4022866 9.062225e-37 0.5746208
## 3 GBPG 0.4128869 1.239577e-28 0.4807331
## 4 FO_Pct 11.5939121 1.436926e-10 0.1935384
```

As we expected, shot percentage is the most significant offensive variable in explaining goals per game. I find it interesting that in this case the quality of shots is more important than the quantity of shots.

*To Be Continued*

*In the next Moneyball post, we will take a look at the defensive side of the field, where variables like caused turnovers, clears, and saves contribute to the efficiency of the defense and number of goals conceded.*

**Acknowledgments**

- Michael Mauboussin: *Analysis of Division I Men’s Lacrosse Statistics*
- *broom*: David Robinson (2013). broom: Convert statistical analysis objects from R into tidy data frames. https://cran.r-project.org/web/packages/broom/index.html
- *fmsb*: Minato Nakazawa (2016). fmsb::VIF(): Calculate variance inflation factor (VIF) from the result of lm. https://cran.r-project.org/web/packages/fmsb/index.html
- *highcharter*: Joshua Kunst (2017). highcharter: a wrapper for the ‘Highcharts’ library including shortcut functions to plot R objects. http://jkunst.com/highcharter/
- *plotly*: Plotly (2014). plotly: Easily translate ‘ggplot2’ graphs to an interactive web-based version and/or create custom web-based visualizations directly from R. https://github.com/ropensci/plotly
- *tidyverse*: RStudio and Inc. (2016). tidyverse: a set of packages that work in harmony because they share common data representations and ‘API’ design. https://github.com/tidyverse/tidyverse