If you ask 50 basketball fans who would be in their Greatest Of All Time (GOAT) list, you will get 50 very "passionate" responses... but, let's be honest, 50 opinions based purely on gut.
So we thought we would take a big-data approach to GOAT - take out all the emotion, the geography, the alumni-connection and make it only about the numbers.
And below is our journey....
Like our previous analytics deconstructions, our architecture resides in the cloud - AWS to be precise. And we do that for 2 main reasons.
1) speed + performance
For these analyses, we have only a few days to turn things around, and we want to be able to scale up to larger machines if we have to. Since we don't know in advance how large the data sets will be or whether we will need to add other data sets, it's nice to know that in the cloud, reconfiguring the instance is as simple as one click.
2) automation
Play-by-play (PBP) data is generated after every game. Building this analytics dataflow in AWS means that CloudWatch/Lambda can listen for new PBP data in S3 and automatically start and run the Xcalar dataflow. We don't have to manually ingest the data and run the dataflows, which saves time and headaches.
The play-by-play (PBP) data came from NBAStuffer. PBP data from the last 15 years came as CSV files, which we uploaded into our NBA S3 bucket.
After each game/evening, new PBP data is also uploaded into this S3 bucket.
To operationalize the "ingestion" of new PBP data into the dataflow, Xcalar has integrated an AWS CloudWatch/Lambda listener that automatically starts the dataflow every time a new data file is dropped into the bucket.
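As a rough illustration of what such a listener might look like, here is a minimal Python sketch of a Lambda handler. It assumes the function is triggered by a standard S3 "object created" event notification. The `start_dataflow` function is a hypothetical placeholder, since the actual call that starts the Xcalar dataflow is not shown in this post.

```python
from urllib.parse import unquote_plus


def new_pbp_files(event):
    """Return (bucket, key) pairs for CSV objects named in an S3 event notification."""
    files = []
    for record in event.get("Records", []):
        s3_info = record.get("s3", {})
        bucket = s3_info.get("bucket", {}).get("name")
        # Object keys arrive URL-encoded in S3 notifications (spaces become '+').
        key = unquote_plus(s3_info.get("object", {}).get("key", ""))
        if bucket and key.lower().endswith(".csv"):
            files.append((bucket, key))
    return files


def start_dataflow(bucket, key):
    """Hypothetical placeholder: the real call that starts the dataflow is not shown here."""
    print(f"would start dataflow for s3://{bucket}/{key}")


def lambda_handler(event, context):
    """Entry point Lambda invokes when S3 reports a newly created object."""
    files = new_pbp_files(event)
    for bucket, key in files:
        start_dataflow(bucket, key)
    return {"started": len(files)}
```

Filtering on the `.csv` suffix is our own assumption here; it simply keeps stray non-data files in the bucket from kicking off a run.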
After last night's games, we have 2.8 GB of PBP data representing 9 million records. (PBP data is also wide: 40+ columns/fields.)
Getting GOAT Criteria - Offense
So we interviewed 50 basketball fans and we got their "algorithms" for GOAT. The key criteria included:
- 3 point percentages by season/all time
- Field goal percentages by season/all time
- Free throws by season/all time
- Total points by season/all time
- Total points in a single game ever
- Q4 total points by season/all time
- Playoff average points by season/all time
- Assists by season/all time
- Blocks by season/all time
- Steals by season/all time
- Turnovers by season/all time
- Rebounds by season/all time
- # of championships
So we have to build the analytics behind these 24 criteria (11 of the 13 items above count twice, once by season and once all time). Below is the OFFENSE PBP dataflow. The results of the dataflow are represented by the yellow nodes on the very right, each of which is associated with one of the OFFENSE criteria listed above (e.g. FG, 3 pt percentage, etc.)
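To give a feel for what one of these criteria boils down to, here is a small Python sketch of a field goal percentage by player and season. This is not the Xcalar dataflow itself, and the column names (`event_type`, `player`, `season`, `result`) and their values are assumptions for illustration, not the real PBP schema.

```python
from collections import defaultdict


def field_goal_pct(rows):
    """Return {(player, season): made / attempted} over field goal attempts."""
    made = defaultdict(int)
    attempts = defaultdict(int)
    for row in rows:
        # Column names and values are assumed, not taken from the real PBP files.
        if row["event_type"] != "shot":
            continue
        key = (row["player"], row["season"])
        attempts[key] += 1
        if row["result"] == "made":
            made[key] += 1
    # Only players with at least one attempt appear, so there is no division by zero.
    return {key: made[key] / attempts[key] for key in attempts}
```

The all-time version of the same criterion would group by player alone instead of by player and season.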
Getting GOAT Criteria - Defense
A second dataflow, "DEFENSE", was created to generate the analytics for the Defense criteria. To start this dataflow, a LINK-IN node was created on the very left, which is a live link to the cleansed 9M-record original PBP data (in the OFFENSE dataflow).
The advantage of the LINK-IN node is that if/when the data underlying this node is changed, it will dynamically update the DEFENSE (or any other) dataflows that reference it. This allows for easy and clean collaboration across multiple dataflows.
Using Python to normalize data
Each PBP data set was given a "data set name" in the format of either
- YEAR-YEAR Regular Season or
- YEAR Playoff
For example:
- 2018-2019 Regular Season
- 2019 Playoff
This makes perfect sense to a basketball follower, but... to big data these names look like separate seasons. So we built a quick Python function to NORMALIZE this data so that
- "2018-2019" = "2018-2019 Regular Season" AND "2019 Playoff" PBP data.
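The exact function we used is not reproduced here, but a minimal sketch of this kind of normalization might look like the following. It assumes the data set names follow exactly the two formats above, and that a playoff year always belongs to the season that started the year before.

```python
import re

# Patterns for the two data set name formats described above.
REGULAR = re.compile(r"^(\d{4})-(\d{4}) Regular Season$")
PLAYOFF = re.compile(r"^(\d{4}) Playoffs?$")


def normalize_season(data_set_name):
    """Map a data set name to a single 'YYYY-YYYY' season label."""
    name = data_set_name.strip()
    match = REGULAR.match(name)
    if match:
        return f"{match.group(1)}-{match.group(2)}"
    match = PLAYOFF.match(name)
    if match:
        # Assumption: the playoffs close out the season that began the previous year.
        end_year = int(match.group(1))
        return f"{end_year - 1}-{end_year}"
    raise ValueError(f"unrecognized data set name: {data_set_name!r}")
```

With this, both `normalize_season("2018-2019 Regular Season")` and `normalize_season("2019 Playoff")` return `"2018-2019"`, so the two files land in the same season.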
With Xcalar, a data engineer can seamlessly integrate SQL, Python and visual modeling nodes all in a single data flow.
Putting this all together with weightings
So once we had all 24 criteria generated across 3 dataflows (offense, defense, championship), we wanted to add weightings, as not all criteria carry equal weight.
For this we used Xcalar parameters and integrated that into the SQL and Python code.
As you can see, Omar, our resident NBA pundit, gave CHAMPIONSHIPS played, FIELD GOAL PERCENTAGES and TOTAL POINTS ALL TIME the highest weightings. The weightings were then applied to the criteria via a SQL statement (below)
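For readers who think in Python rather than SQL, the same idea can be sketched as a weighted sum. This is only an illustration and not our actual SQL: the function names are hypothetical, and it assumes each criterion has already been scaled to a comparable range before the weights are applied.

```python
def goat_score(criteria, weights):
    """Weighted sum of one player's criteria values; missing criteria count as zero."""
    return sum(weight * criteria.get(name, 0.0) for name, weight in weights.items())


def rank_players(players, weights):
    """Return player names ordered from highest to lowest weighted score."""
    return sorted(
        players,
        key=lambda name: goat_score(players[name], weights),
        reverse=True,
    )
```

Here `players` maps a player name to a dict of criterion values, and `weights` maps criterion names to the pundit's weightings. Changing a weighting means changing one number and re-ranking.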
Xcalar NBA GOAT (at least for the last 15 years)
- Kevin Durant
- LeBron James
- Dirk Nowitzki
- James Harden
- Carmelo Anthony
- Stephen Curry
- Klay Thompson
- Dwyane Wade
- Kobe Bryant
- Ray Allen
Given the weightings, we are not too surprised by the results. Nowitzki and Allen were moderate surprises but, in retrospect, deserved. However... our own Omar the Pundit had a special reaction to the inclusion of Melo (too lively for this staid blog)... but the data doesn't lie 🙂