By Mason Kirchner
In honor of this weekend’s NFL Draft, I decided to make a spiritual successor to my recent article that tried to predict how well incoming NFL rookies would perform in fantasy football. Feel free to read that to get an idea of the dataset used in this project, but note that the method is slightly different here. Instead of trying to predict how well an incoming rookie will perform in the NFL, I decided to try to predict which round the player will be selected in.
In order to get more out of my data, I introduced squared terms and Principal Component Analysis. The hope is that by squaring each of the individual statistics, such as yards, some sort of inflection point can be more easily identified in the classification problem. Principal Component Analysis then projects the expanded feature set onto a smaller number of uncorrelated components that capture most of its variance.
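Here’s a minimal sketch of what that feature step might look like in scikit-learn. The array X and the 95% variance cutoff are assumptions for illustration, not values from my actual model:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def expand_and_reduce(X, n_components=0.95):
    # Append a squared copy of every raw statistic (e.g., yards -> yards^2)
    # so the classifiers can pick up on non-linear inflection points.
    X_sq = np.hstack([X, X ** 2])

    # Standardize before PCA so no single stat dominates the components.
    X_std = StandardScaler().fit_transform(X_sq)

    # Project onto orthogonal components capturing ~95% of the variance.
    pca = PCA(n_components=n_components)
    return pca.fit_transform(X_std)
```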
This time around, instead of using purely random forests, I decided to use a voting ensemble method. This video sums up ensembles nicely, but the strength of using this over just random forests is that it allows me to combine multiple models to get better predictions. In addition to voting, I used bagging, which is explained in this video, to further strengthen each of the individual models. The three methods used in the voting ensemble are K Nearest Neighbors, Random Forests, and Extra Trees (extremely randomized forests), all of which were bagged. Other methods, such as SVM and AdaBoost, were tested, but didn’t perform as well as the three that I settled on.
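A rough sketch of that setup in scikit-learn looks like the following. The hyperparameters here are placeholders, not my tuned values:

```python
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              ExtraTreesClassifier, VotingClassifier)
from sklearn.neighbors import KNeighborsClassifier

# Each base model is wrapped in a bagging meta-estimator, mirroring the
# "all of which were bagged" setup described above.
bagged_knn = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                               n_estimators=10, random_state=0)
bagged_rf = BaggingClassifier(RandomForestClassifier(n_estimators=100),
                              n_estimators=10, random_state=0)
bagged_et = BaggingClassifier(ExtraTreesClassifier(n_estimators=100),
                              n_estimators=10, random_state=0)

# Soft voting averages the predicted class probabilities across models,
# which is what produces the per-round probability table below.
ensemble = VotingClassifier(
    estimators=[('knn', bagged_knn), ('rf', bagged_rf), ('et', bagged_et)],
    voting='soft')
```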
It’s worth noting that the best individual model (the bagged random forest) only had .516 mean accuracy in cross-validation, and that the voting ensemble came in at .506. This can be attributed to there being eight different classes: one for each of the seven rounds, plus undrafted status. With this many possible classes, correctly predicting an average of around 50% of the samples is alright in my book.
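Those numbers come from k-fold cross-validation, which you could reproduce with something like the sketch below, assuming the X from the feature step above and a label vector y; the 5-fold split is my assumption here:

```python
from sklearn.model_selection import cross_val_score

# y holds the class for each player: round 1-7 or 'undrafted'.
for name, model in [('bagged RF', bagged_rf), ('voting ensemble', ensemble)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f'{name}: {scores.mean():.3f} mean accuracy')
```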
Without further ado, here’s the complete table of round probabilities for players that went to the 2017 Combine (and Joe Mixon):
The table itself is sorted from least to greatest on the probability that the player goes undrafted. You can download the Excel file if you wish to sort it yourself.
We need to be careful in our analysis of these numbers not to assume that the highest predicted probability for each player is what is most likely to occur.
For example, Samaje Perine is given a .38 predicted probability of being undrafted, which means a .62 probability of being drafted somewhere. These predicted probabilities are more useful to us if we treat them as a probability distribution over where a player might be drafted. Adding up his predicted probabilities for rounds four through six, we find that Perine has a .407 predicted probability of going somewhere in that range.
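In code, that’s just a column sum. This sketch assumes a pandas DataFrame `probs` indexed by player name with one column per round plus an 'Undrafted' column; the exact column names are illustrative:

```python
import pandas as pd

# Sum the middle-round probabilities for a single player.
perine = probs.loc['Samaje Perine']
p_mid_rounds = perine[['Round 4', 'Round 5', 'Round 6']].sum()
print(f'P(rounds 4-6) = {p_mid_rounds:.3f}')  # ~.407 per the table above
```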
The model still could use plenty of help. As mentioned previously, it only managed .506 mean accuracy during cross-validation. There are some notable players that the model doesn’t like, such as Leonard Fournette, who will likely be the first RB whose name is called Thursday night. Alvin Kamara is another potential first- or second-rounder being snubbed by the model. Their poor scores likely stem from the fact that both Fournette and Kamara missed games due to injury this past season.
Hopefully you enjoyed this exercise, and enjoy the NFL Draft as well.