Using Big Data to Determine NBA #GOAT

If you ask 50 basketball fans who would be on their Greatest Of All Time (GOAT) list, you will get 50 very "passionate" responses... but, let's be honest, 50 opinions based purely on gut.

So we thought we would take a big-data approach to GOAT - take out the emotion, the geography and the alumni connections, and make it only about the numbers.

And below is our journey...

TL;DR? Jump to Xcalar NBA GOAT

Missed our NBA analytics webinar? Watch the video


Like our previous analytics deconstructions, our architecture resides in the cloud - AWS to be precise. And we do that for 2 main reasons.

1) speed + performance

For these analyses, we have a few days to turn things around, and we want to be able to spin up larger machines if we have to. Not knowing how large the data sets will be, or whether we will need to add other data sets, it's nice to know that in the cloud, configuring the instance is as simple as one click.

2) operationalizing

Play-by-play (PBP) data is generated after every game. Building this analytics data flow in AWS means that Cloudwatch/Lambda can listen for new PBP data in S3 and automatically start and run the Xcalar dataflow. We don't have to manually ingest and run the dataflows - that saves time and headaches.
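As a sketch, such a listener might look like the following Lambda handler. The `start_dataflow` helper and the bucket name are hypothetical stand-ins for the actual Xcalar API call; only the S3 event parsing reflects the real AWS payload shape.

```python
import json
import urllib.parse

def start_dataflow(bucket: str, key: str) -> str:
    """Hypothetical stand-in for kicking off the Xcalar PBP dataflow.
    A real deployment would call the Xcalar API here."""
    return f"started PBP dataflow for s3://{bucket}/{key}"

def lambda_handler(event, context):
    """Triggered by S3 ObjectCreated events on the NBA PBP bucket."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 object keys arrive URL-encoded in the event payload
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        results.append(start_dataflow(bucket, key))
    return {"statusCode": 200, "body": json.dumps(results)}
```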

Deployed in the cloud for speed, performance and operationalization.


The play-by-play (PBP) data came from NBAStuffer. PBP data from the last 15 years came in CSV form. These files were uploaded into our NBA S3 bucket.

After each game/evening, new PBP data is also uploaded into this S3 bucket.

PBP files are dropped into AWS S3 where they are automatically ingested into Xcalar dataflows.

To operationalize the ingestion of new PBP data into the dataflow, Xcalar has integrated an AWS CloudWatch/Lambda listener that automatically starts the dataflow every time a new data file is dropped into the bucket.

After last night's games' PBP, we have 2.8 GB of data representing 9 million records. (PBP data is also wide - 40+ columns/fields.)

Getting GOAT Criteria - Offense

So we interviewed 50 basketball fans and got their "algorithms" for GOAT. The key criteria included:

  1. 3 point percentages by season/all time
  2. Field goal percentages by season/all time
  3. Free throw by season/all time
  4. Total points by season/all time
  5. Total points in a single game ever
  6. Q4 total points by season/all time
  7. Playoff average points by season/all time
  8. Assists by season/all time
  9. Blocks by season/all time
  10. Steals by season/all time
  11. Turnovers by season/all time
  12. Rebounds by season/all time
  13. # of championships

So we had to build the analytics behind these 24 criteria (each "by season/all time" item counts as two). Below is the OFFENSE PBP dataflow - the results of the dataflow are represented by the yellow nodes on the far right, each associated with one of the OFFENSE criteria listed above (e.g. field goal percentage, 3-point percentage, etc.)
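To illustrate the kind of aggregation one of those yellow result nodes performs, here is a minimal Python sketch computing all-time field goal percentage from PBP rows. The column names (`player`, `event_type`, `result`) are assumptions for illustration, not the actual NBAStuffer schema.

```python
from collections import defaultdict

def field_goal_pct(pbp_rows):
    """Compute all-time FG% per player from play-by-play rows.
    Column names here are illustrative assumptions, not the
    real NBAStuffer schema."""
    made = defaultdict(int)
    attempts = defaultdict(int)
    for row in pbp_rows:
        if row["event_type"] == "shot":       # a field goal attempt
            attempts[row["player"]] += 1
            if row["result"] == "made":
                made[row["player"]] += 1
    return {p: made[p] / attempts[p] for p in attempts}
```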

Getting GOAT Criteria - Defense

To generate the analytics for the defense criteria, a second dataflow, DEFENSE, was created. To start this dataflow, a LINK-IN node was created on the far left, which is a live link to the cleansed 9-million-record original PBP data (in the OFFENSE dataflow).

The advantage of the LINK-IN node is that if/when the data underlying this node is changed, it will dynamically update the DEFENSE (or any other) dataflows that reference it. This allows for easy and clean collaboration across multiple dataflows.

Dataflows to build DEFENSE PBP criteria.

Using Python to normalize data

Each PBP data set was given a name in the format of either

  • YEAR-YEAR Regular Season or
  • YEAR Playoff

for example

  • 2018-2019 Regular Season
  • 2019 Playoff

This makes perfect sense to a basketball follower, but to big data these look like separate seasons. So we wrote a quick Python function to NORMALIZE this data so that

  • "2018-2019" =  "2018-2019 Regular Season" AND "2019 Playoff" PBP data.

The power of Xcalar is its ability to seamlessly integrate Python with SQL in a single dataflow.

With Xcalar, a data engineer can seamlessly integrate SQL, Python and visual modeling nodes all in a single data flow.


So once we had all 24 criteria generated across 3 dataflows (offense, defense, championship), we then wanted to put in weightings, as not all criteria are valued equally.

For this we used Xcalar parameters and integrated that into the SQL and Python code.

As you can see, Omar, our resident NBA pundit, placed CHAMPIONSHIPS, FIELD GOAL PERCENTAGE and TOTAL POINTS ALL TIME as the highest weightings. The weightings were then applied to the criteria via a SQL statement (below).

Parameters are indicated by <GOAT weightings>
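As an illustration of folding parameterized weightings into SQL, here is a minimal sketch using SQLite. The table, column names, and normalized criteria values are hypothetical, and the real dataflow substitutes Xcalar parameters where this sketch uses Python named placeholders.

```python
import sqlite3

# Hypothetical per-player criteria table; in the real dataflow these
# columns would come from the OFFENSE/DEFENSE result nodes, already
# normalized to a common scale so the weights are comparable.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE goat_criteria (
    player TEXT, fg_pct REAL, playoff_pts REAL, total_pts REAL)""")
conn.executemany("INSERT INTO goat_criteria VALUES (?, ?, ?, ?)", [
    ("Player A", 0.50, 0.80, 0.90),
    ("Player B", 0.55, 0.70, 0.85),
])

# Weightings supplied as parameters, like Xcalar's <GOAT weightings>
weights = {"w_fg": 6, "w_playoff": 6, "w_total": 6}
rows = conn.execute("""
    SELECT player,
           :w_fg * fg_pct + :w_playoff * playoff_pts
               + :w_total * total_pts AS goat_score
    FROM goat_criteria
    ORDER BY goat_score DESC
""", weights).fetchall()
```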



So with the following weightings

  • All Time Field Goal weight = 6
  • All Time Total Playoff Points = 6
  • All Time Total Points = 6
  • All Time 3 Pt weight = 5
  • All Time Q4 Points weight = 5
  • All Time Assists = 4
  • All Time Free Throws = 4

we'd like to present to you the...

Xcalar NBA GOAT (at least for the last 15 years)

  1. Kevin Durant
  2. LeBron James
  3. Dirk Nowitzki
  4. James Harden
  5. Carmelo Anthony
  6. Stephen Curry
  7. Klay Thompson
  8. Dwyane Wade
  9. Kobe Bryant
  10. Ray Allen

Given the weightings, we are not too surprised by the results. Nowitzki and Allen were moderate surprises but, in retrospect, deserved. However... our own Omar the Pundit had a special reaction to the inclusion of Melo (too lively for this staid blog)... but the data doesn't lie 🙂

Agree or disagree?

Have a bone to pick with Omar (who provided the weightings)?

Drop us a note (below) if you want to put in your GOAT Criteria Weights!

Take Your Shot at #BigData Analytics

Sign-up for your own Xcalar instance for free.
