
Predicting NBA Players' Points per Game

- Sesha Tipirneni

The goal of this project is to predict the points a player will average in a season and to build an accurate tool a team could use when deciding which players to trade for or pick up.

Introduction

Since I was a child, basketball has been my favorite sport to both play and watch. I watched the National Basketball Association (NBA) every day and was very passionate about my hometown team, the Washington Wizards. Growing up with friends who also liked the NBA, we constantly discussed our predictions for the upcoming seasons and how certain players would perform in the future. With the growth of technology and data science over the years, NBA teams try to develop new models to analyze their existing players and predict each player's and team's future performance in order to put together the best possible roster. A similar approach was captured well in the movie "Moneyball", in which an MLB team, the Oakland A's, assembled a record-breaking team on a budget using statistical analysis.

3-pointers have become more common in NBA games of late, and teams are shifting their shot selection and offensive schemes to match. Moreover, the game is changing, with a heavy focus on offense, and teams are constantly scrambling to get the players who can consistently score the most points. This is the main reason why I thought it would be interesting to examine what may affect a player's ability to score. In this project, I will analyze what affects the points a player averages per game for a season and will attempt to predict a player's average points per game in a season from various parameters.

Data Collection/Curation

I will be using a dataset from Kaggle, and I renamed the CSV file for convenience. I will name the data season_stats and store it in a pandas dataframe for parsing and further processing. This dataset has 53 different statistics and pieces of information about the players, so I decided it is sufficient to complete my desired analysis. The file is formatted as Comma Separated Values (CSV), so I will be parsing it using the "csv" package in Python.
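As a rough illustration of the loading step, here is a minimal sketch using only the "csv" package. The sample rows, the column names (Year, Player, Tm, G, PTS), and the file name in the comment are assumptions made for the example, not the actual Kaggle file, and a list of dictionaries stands in for the pandas dataframe.

```python
import csv
import io

# Hypothetical three-row stand-in for the renamed Kaggle file. The real file
# has 53 columns; the column names used here are assumptions.
SAMPLE_CSV = """Year,Player,Tm,G,PTS
2017,Player A,WAS,78,1805
2017,Player B,BOS,60,420
2016,Player C,LAL,70,900
"""


def load_season_stats(handle):
    """Read CSV rows into a list of dicts, one dict per player-season."""
    return list(csv.DictReader(handle))


# With a real file, pass open("season_stats.csv", newline="") instead of
# the in-memory StringIO (the file name is hypothetical).
season_stats = load_season_stats(io.StringIO(SAMPLE_CSV))
print(len(season_stats), season_stats[0]["Player"])
```

Every value comes back as a string, so numeric columns still need to be converted before any arithmetic.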

Data Management/Representation

Now that I have the data I need, I have to organize it so that it is easier to work with, and also see if anything insightful turns up during the process. Below are the steps I will take to prepare the data.

  1. Remove any unnecessary columns.
  2. Keep only the rows from 2017, since I am working with the 2016-2017 season only.
  3. Rename the columns. A glossary of basketball terms can be found here: Basketball Reference
  4. Create new columns for the average points, assists, rebounds, steals, and blocks per game for every player.
  5. Rearrange the columns.
  6. Deal with any flaws in the dataset.
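A few of these steps can be sketched in plain Python. This is only an illustration under stated assumptions: the rows are made up, the field names (Year, Player, G, PTS, AST, TRB) are modeled on Basketball Reference abbreviations, and skipping incomplete rows is just one possible way to deal with flaws in the data.

```python
# Hypothetical raw rows; the field names are assumptions, not the real schema.
raw_rows = [
    {"Year": "2017", "Player": "Player A", "G": "80", "PTS": "2000", "AST": "400", "TRB": "320"},
    {"Year": "2017", "Player": "Player B", "G": "0", "PTS": "0", "AST": "0", "TRB": "0"},
    {"Year": "2016", "Player": "Player C", "G": "70", "PTS": "700", "AST": "140", "TRB": "210"},
    {"Year": "2017", "Player": "Player D", "G": "50", "PTS": "", "AST": "100", "TRB": "150"},
]


def clean_season(rows, year="2017"):
    """Keep one season, drop incomplete rows, and add per-game averages."""
    cleaned = []
    for row in rows:
        if row["Year"] != year:
            continue  # keep only the requested season
        totals = [row["G"], row["PTS"], row["AST"], row["TRB"]]
        if any(value == "" for value in totals):
            continue  # one way to handle flaws: skip rows with missing totals
        games = int(row["G"])
        if games == 0:
            continue  # avoid dividing by zero for players with no games
        cleaned.append({
            "Player": row["Player"],
            "G": games,
            "PPG": round(int(row["PTS"]) / games, 1),
            "APG": round(int(row["AST"]) / games, 1),
            "RPG": round(int(row["TRB"]) / games, 1),
        })
    return cleaned


season_2017 = clean_season(raw_rows)
print(season_2017)
```

Steals and blocks per game would follow the same pattern as the three averages shown.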

Exploratory Data Analysis

During the Exploratory Data Analysis, we will visualize our cleaned data to see if there are any interesting correlations that we can draw.

Since a player does a lot more than just score during any NBA game, I will look at whether other aspects of defense and offense, such as rebounding and assisting, are correlated with the points a player averages in a season.
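Alongside the scatter plots, the strength of a relationship can be put into a single number with the Pearson correlation coefficient. The sketch below is a plain-Python version of that formula; the helper name pearson_r and the five sample values are made up for illustration and are not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-game averages for five players (not real data).
apg = [1.0, 2.0, 3.0, 4.0, 5.0]
ppg = [4.0, 9.0, 8.0, 15.0, 14.0]
print(round(pearson_r(apg, ppg), 3))
```

A value near 1 means a strong positive linear relationship, a value near 0 means little or no linear relationship, and a value near -1 means a strong negative one.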

Average Assists per game vs Average Points per game

The plot above shows that there is a slight positive correlation between the assists and points a player averages per game. This makes sense: a player who averages more points per game tends to be a better basketball player in general, and so tends to have more assists as well. However, there is a dominant cluster at 10 points or fewer and 2 assists or fewer. This could be because players who do not score much probably do not have the ball as much as others during the game, and so do not average many assists either.

Average Rebounds per game vs Average Points per game

The plot above also shows a slight positive correlation between the rebounds and points a player averages per game, but there are some interesting outliers. The same reasoning that explains why a player's APG tends to rise with their PPG applies here, but the plot shows that this is not always the case, as numerous players averaged a high RPG and a low PPG and vice versa. The many outliers could be because rebounds are relatively hard to get and less controllable than other statistics, since where the ball ends up after a missed shot is largely random.

Field Goals vs Average Points per game

Field goal percentage is the number of made shots divided by the total number of shot attempts. Since a player has to shoot in order to score in an NBA game, I thought it would be important to see if there is a relation between the two. If a player has a relatively high field goal percentage, does that mean that their points per game average is also very high? Another aspect to consider is field goal attempts. As a player shoots more, does their average PPG go up or down? To see if there is any relation between these two statistics and the average PPG of a player, I will be creating scatterplots below.

From the "Field Goal Attempts vs Average Points per Game" plot, there seems to be a strong positive correlation between field goal attempts and a player's average PPG. There are not many outliers, and the plot makes a strong case that when a player's field goal attempts go up, the player's average PPG goes up as well, and vice versa. This could be because a player who shoots less has fewer chances to score than someone who shoots more.

On the other hand, in the "Field Goal Percentage vs Average Points per Game" plot, there seems to be no relationship between field goal percentage and a player's average PPG. Most of the field goal percentages seem to be in the 0.4 to 0.6 (40% to 60%) region, as there is a large cluster there. This tells us that regardless of how many points a player averages per game, the field goal percentage tends to fall between 0.4 and 0.6.

Shooting Efficiency vs Average Points per game

Shooting efficiency describes how effectively a player turns shot attempts into points. More specifically, true shooting percentage is an advanced statistic that measures a player's efficiency at shooting the ball while accounting for 2-pointers, 3-pointers, and free throws. The formula used to calculate it is Points / (2 * (Field Goal Attempts + 0.44 * Free Throw Attempts)). If a player has a relatively high true shooting percentage, does that mean that their average PPG is also very high? To see if there is a relation between the two, I will be creating a scatter plot below.
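To make the two shooting measures concrete: field goal percentage is made shots over attempts, and true shooting percentage is commonly defined as points divided by twice the sum of field goal attempts and 0.44 times free throw attempts. The sketch below computes both; the function names and the season totals are made up for illustration.

```python
def field_goal_pct(fg_made, fg_attempts):
    """Made field goals divided by attempted field goals (None if no attempts)."""
    return fg_made / fg_attempts if fg_attempts else None


def true_shooting_pct(points, fg_attempts, ft_attempts):
    """TS% = PTS / (2 * (FGA + 0.44 * FTA)), or None if there were no attempts."""
    true_attempts = fg_attempts + 0.44 * ft_attempts
    return points / (2 * true_attempts) if true_attempts else None


# Made-up season totals for one player (not real data).
print(field_goal_pct(400, 1000))
print(round(true_shooting_pct(1000, 800, 250), 3))
```

Both functions return a fraction between 0 and 1 rather than a percentage, which matches how the field goal percentage axis reads in the plots (for example 0.4 to 0.6).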

From the plot above, there seems to be no relationship between true shooting percentage and a player's average PPG. Most of the true shooting percentages seem to be in the 40% to 60% region, as there is a large cluster there. This tells us that, for the most part, regardless of how many points a player averages, their true shooting percentage tends to fall between 40% and 60%.

Free Throws vs Average Points per game

Free throws are another way a player can get points, so they must be included when analyzing players' scoring averages. The question I will dive into is whether a higher free throw percentage goes along with a higher average PPG. Since a high free throw percentage tends to mean a player shoots the ball well, is it related to how many points a player averages?

There seems to be a very slight positive correlation between free-throw percentage and a player's average PPG. However, most of the players are in a cluster around 60-80% and 5-10 points. Since most players shoot around 60-80%, you cannot say based on this plot that a player's PPG will increase as their free-throw percentage increases, because many players had very high free-throw percentages but relatively low average PPG.

2-point & 3-point vs Average PPG

Having analyzed the field goal percentage, we will now dig even deeper to see whether the main forms of scoring, 2-pointers and 3-pointers, have an effect on a player's average PPG. I will look not only for a relationship between a player's 2-point and 3-point percentages and their average PPG, but also for a relationship with the number of 2-point and 3-point attempts. This is because so far I have mostly tested whether a player's efficiency has a relationship with their average PPG, but not whether just shooting more in general, regardless of accuracy, has any correlation with their average PPG.

These two plots give us an interesting insight into what possibly affects a player's average PPG. It is clear there is a positive relationship between a player's 2-point attempts and average PPG, with a large cluster around 200-400 2-point attempts and 0-10 points. This makes sense, since a player who shoots more would naturally be likely to score more points than someone who shoots less. On the other hand, there is no real relationship between a player's 2-point percentage and their average PPG. It seems that regardless of how high or low a player's average PPG is, their 2-point percentage tends to fall in the region between 40% and 60%. This is similar to what we saw when comparing field goal percentages and average PPG, which makes sense since most field goals taken by players are in fact 2-pointers.

As seen above, there is also a positive relationship between 3-point attempts and a player's average PPG; however, it is not as clearly positive as it was for 2-point attempts. There is a large cluster around 0-100 attempts and 0-15 points. This shows that a player does not necessarily score as many 3-pointers as 2-pointers, so they can shoot fewer 3-pointers and still have a high PPG average. Similar to the 2-point percentage vs average PPG plot, there does not seem to be a relationship between 3-point percentage and average PPG: most players, regardless of the points they average, shoot 3-pointers at around 20-40%. This is lower than what we saw with 2-point percentages, probably because a 3-pointer is taken further from the basket and is therefore harder to make.

Analyzing Non-Skill Based Attributes

So far I have analyzed only statistics related to the skill of players. I will now dig into whether other factors, such as what team a player is on or what position they play, show any significant trends or relationships.

Team vs Average PPG

First, let's see if there is a relationship between the team a player is on and the average points per game the player has for a season. In the NBA, certain teams such as the Boston Celtics and Los Angeles Lakers have been relatively good for most of their years in the league, and perhaps well-known franchises have a higher average PPG for their players than other teams simply because those organizations tend to have better players who score more.

The first interesting thing is that most of the teams have outliers, and many of them have several. While the averages are roughly the same for each team, at around 5-10 points, the ranges of average PPG vary greatly among the teams. Because the ranges show no pattern, there does not seem to be a relationship between the team a player is on and the player's average PPG. However, this plot does not give us enough information about whether there are certain "big-time" franchises that have a higher PPG than others, and we might have to consider that later on.

Games Played/Started vs Average PPG

In the NBA, a player who starts the game tends to play more minutes than the players who come in later. This leads to the question: do players who start or play more games tend to have a higher or lower average PPG? Below I will investigate this through plots.

From these two graphs, you can see that there is a slight positive relationship between the number of games a player starts or plays and their average PPG. This makes sense: the more a player is in the game, the more chances they have to score, so players who play more games tend to average more points than players who play fewer. However, this is definitely not always the case, as there are numerous outliers and notable clusters. For example, in the "Total Games Started vs Average PPG" plot, there is a large cluster around 0-20 games started and 0-10 points, and numerous players who started many more games (60-80) averaged the same PPG. This tells us that although there is a positive relationship in general, it is very weak. There are other factors that could affect this, such as the position of the player, which we will investigate next.

Position vs Average PPG

As you can see above, the averages across the positions are more or less the same, but the ranges differ from one another. The positions that tend to have shorter players, such as 'Point Guard' and 'Shooting Guard', tend to have more players who average a higher PPG than the centers and power forwards. This could imply that shorter players generally tend to average more points than taller players. This could be explained by how taller players tend to do other things on the basketball court, such as rebounding and blocking, while shorter players tend to focus on scoring and facilitating. However, there is definitely not enough information in this graph to prove my theory, since there are centers who average well above the average PPG of point guards.
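The comparison of averages and ranges across groups can also be summarized numerically. Here is a minimal sketch, assuming made-up (position, PPG) pairs and a hypothetical helper name; the same function would work for teams by swapping in team abbreviations as the group key.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up (position, PPG) pairs; not real data.
players = [
    ("PG", 22.0), ("PG", 6.0),
    ("SG", 18.0), ("SG", 8.0),
    ("C", 10.0), ("C", 7.0), ("C", 13.0),
]


def summarize_by_group(pairs):
    """Return the median, mean, and range of the values in each group."""
    grouped = defaultdict(list)
    for group, value in pairs:
        grouped[group].append(value)
    return {
        group: {
            "median": median(values),
            "mean": mean(values),
            "range": max(values) - min(values),
        }
        for group, values in grouped.items()
    }


summary = summarize_by_group(players)
for position, stats in summary.items():
    print(position, stats)
```

In this toy data the point guards and centers have similar central values but very different ranges, which is the kind of pattern described above.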

Machine Learning

Now comes the most interesting part, where we use all the data to predict the average PPG for every NBA player in the 2016-2017 season. The goal now is to find the independent variables that will help us calculate a player's average PPG. We did the exploratory data analysis above to analyze how average PPG trends with players' attributes. The task at hand is to find which of these attributes have the strongest correlation with a player's average PPG so that we can build a predictive model.

Since we have numerous variables, we will be using Linear Regression. We will create a correlation matrix to determine the strongest features within the data set. I will choose a threshold of 0.7: if an attribute has a correlation coefficient greater than 0.7 with the 'PPG' column in the table, we will choose that attribute as one of the independent variables for the model, since a coefficient above 0.7 most likely indicates a strong correlation with the average PPG.

I will then split the dataset into training and testing sets, with 70% of the data used for training and 30% for testing.
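The threshold rule can be sketched as follows. This is a plain-Python illustration rather than the exact code used for the analysis: the column names and values are made up, and the helpers pearson_r and select_features are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, threshold=0.7):
    """Return the columns whose correlation with the target exceeds threshold."""
    target_values = columns[target]
    selected = []
    for name, values in columns.items():
        if name == target:
            continue  # the target is never its own predictor
        # Only positive correlations above the threshold are kept, as in the text.
        if pearson_r(values, target_values) > threshold:
            selected.append(name)
    return selected


# Made-up columns for five players (not real data).
columns = {
    "PPG": [4.0, 9.0, 8.0, 15.0, 14.0],
    "FGA": [300, 700, 650, 1200, 1100],
    "FG%": [0.48, 0.44, 0.51, 0.45, 0.47],
}
print(select_features(columns, "PPG"))
```

Comparing the absolute value of the coefficient instead would also pick up strongly negative relationships, which is a design choice worth considering.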
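A 70/30 split amounts to shuffling the rows and cutting the list. The sketch below shows one way to do it with the standard library; the function name, the seed, and the integer stand-in rows are assumptions for illustration.

```python
import random


def split_train_test(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Ten integers stand in for ten player rows.
rows = list(range(10))
train_rows, test_rows = split_train_test(rows)
print(len(train_rows), len(test_rows))
```

Shuffling before the cut matters because the raw file may be ordered (for example alphabetically or by team), which would otherwise bias the test set.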

The plot above shows the average PPG that my model predicted for the 30% of players in the test set (around 140 players) alongside their actual average PPG for the season. As you can see above, the R-squared is around 90.6%, which is high, and many of the blue (test values) and red (predicted values) points are either close to each other or overlapping. In other words, about 90% of the variability in the average PPG data is explained by the independent variables in the regression model. This means that the model I created predicts the average PPG pretty well given that these variables are known: Total Games Started, Total Minutes Played, Offensive Win Shares (OWS), Win Shares (WS), Total Field Goals, Total Field Goal Attempts, 3P Attempts, 2P Field Goals Made, 2P Attempts, Free Throws Made, Free Throw Attempts, Turnovers. The Mean Absolute Error was found to be about 1.26 points.
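For reference, both metrics are simple to compute by hand. R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, and the Mean Absolute Error is the average absolute difference between actual and predicted values. The numbers below are made up and do not come from the model.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up actual and predicted PPG values for four players.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
print(r_squared(actual, predicted))
print(mean_absolute_error(actual, predicted))
```

The MAE is in the same units as the target, so an MAE of 1.26 means the predictions miss by about 1.26 points per game on average.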

While this model predicts the average PPG very well from those variables, I also want to predict the average PPG based on my observations in the Exploratory Data Analysis. The variables that I noticed showed some correlation with the average PPG were 2-point attempts, 3-point attempts, games started, and field goal attempts. I will now use these as the independent variables, create a linear regression model, and test how accurate it is. As before, I will split the dataset into training and testing sets, with 70% of the data used for training and 30% for testing.
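To show what fitting a linear regression on several variables involves, here is a minimal sketch that solves the ordinary least squares normal equations in plain Python. It is not the code used for the models in this project; the function names and the tiny two-feature dataset are made up, and a library implementation would be the practical choice for real data.

```python
def fit_linear_regression(features, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    Returns [intercept, coef_1, ..., coef_k].
    """
    rows = [[1.0] + [float(x) for x in row] for row in features]  # intercept column
    k = len(rows[0])
    # Build the augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:  # crude check, adequate for a sketch
            raise ValueError("features are collinear; cannot solve")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def predict(coefficients, row):
    """Apply the fitted intercept and coefficients to one feature row."""
    return coefficients[0] + sum(c * x for c, x in zip(coefficients[1:], row))


# Made-up data generated exactly from y = 1 + 2*x1 + 0.5*x2.
features = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]]
targets = [4.0, 5.5, 9.0, 10.5, 14.0]
coefficients = fit_linear_regression(features, targets)
print([round(c, 6) for c in coefficients])
print(round(predict(coefficients, [6, 5]), 6))
```

The collinearity check is relevant here: attempts-based features such as field goal attempts and 2-point plus 3-point attempts overlap heavily, and nearly redundant features make the fitted coefficients unstable.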

As you can see above, the R-squared was lower this time, at around 87%. While this is lower, 87% is definitely not bad, and it means that about 87% of the variability in the average PPG data is explained by the independent variables in the regression model. As you can see on the plot above, the red dots (predicted) are not that far off from the blue dots (test). Overall, both models seemed to be a decent fit: there were some nearly perfect predictions and others that were slightly inaccurate. I also calculated the Mean Absolute Error to see how far off our predictions were on average. The MAE was 1.56 points. While this does not seem like much, every point matters when calculating the average points per game for an entire season, such as when trying to predict the top scorers of a season. If nearly every player's prediction were 1.56 points above or below their actual average, it would make a drastic difference in who the top scorers of the NBA appear to be. Although we can say that overall this model would guess a player's average PPG decently, it is not conclusive enough for us to guarantee accurate predictions from the variables that we chose.

This model could be improved by looking at factors not observed here, such as the age or the weight of the player.

Conclusion

In the end, we were able to build a predictive model to estimate what a player's points per game average would be if we knew certain variables. The takeaway is that although we have built a good model for this goal, there are many other factors in sports that we cannot control and that must be considered. For example, injuries, fitness levels, and motivation are aspects of a player that we cannot foresee or measure, and they definitely have an impact on how much a player scores per game. While basketball in general is a simple game of scoring more than your opponent, one can see that there are many complexities to the game, especially at the professional level of the NBA.

If you are not a basketball fan, I hope that I have sparked an interest in basketball while also walking you through the steps of Data Science. From gathering, organizing, and analyzing the data to building a predictive model, every step was informative about the NBA and the relationships between different statistics in the game. I mentioned earlier that the NBA has changed in the past 10 years, with games becoming more and more high scoring every year, and if you would like to read about the change and learn more you can do so here.

Thank you for taking the time to read my analysis of what affects a player's average points per game in the NBA. As a lifelong basketball fan, I hope you had as much fun reading this as I did writing it.