Here's How I Used Data and Probability to Pick a Top 4% NCAA Bracket

I don’t know a thing about college basketball. Here’s how I beat 96% of all brackets posted to ESPN.

Yeah yeah, 611k others still beat mine. I only posted one!

Introduction

My picks and the results

Hearing that the odds of picking a perfect bracket are 1 in 9.2 quintillion, I was interested in creating one to compete against others and wondered whether I could somehow get an edge by using past data. I’ve tried it before with the lotto (I know, I know, those drawings are independent and past data is not representative of future drawings…because I still use coupons at Subway), but this time, knowing that certain schools consistently perform better than others would help a lot. Having no knowledge of the teams and the players, the best I could do was look at their previous games.

Dataset

Searching for historical data was difficult but I found what I needed on www.spreadsheetsports.com (historical game data) and www.rotoguru2.com (historical seed statistics).

The game data covered 2013-2018 and included every game played during each season, EXCLUDING March Madness games.

Info included:

Date
Team
Team Location (Home or Away)
Team Score
Opponent
Opponent Score
Opponent Location (Home or Away)
Neutral Site (Affirmative if both teams are Away)
Team Result (Win or Loss)
Team Differential (The current per game scoring differential at the time of last update)
Opponent Differential (The current per game scoring differential for the opponent at the time of the last update)

It was nice that each game was listed from each team’s perspective in the Team column, meaning every game had two rows, one per team. This way I only had to filter on the Team column to get all historical info for a team, instead of having to split out each game myself.
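As a rough illustration of why the two-rows-per-game layout is convenient, here is a minimal Python sketch that builds each team's record by reading only the Team side of every row. The team names and scores are made up, and the tuple layout is my own simplification of the columns listed above, not the actual spreadsheet.

```python
from collections import defaultdict

# Made-up rows in the two-rows-per-game layout:
# (team, team_score, opponent, opponent_score)
rows = [
    ("Team A", 70, "Team B", 65),
    ("Team B", 65, "Team A", 70),
    ("Team A", 60, "Team C", 68),
    ("Team C", 68, "Team A", 60),
]

def summarize(rows):
    """Return per-team games, wins, win percentage, and average points."""
    totals = defaultdict(lambda: {"games": 0, "wins": 0, "points": 0})
    for team, team_score, _opponent, opponent_score in rows:
        t = totals[team]
        t["games"] += 1
        t["wins"] += int(team_score > opponent_score)
        t["points"] += team_score
    return {
        team: {
            "games": t["games"],
            "wins": t["wins"],
            "win_pct": t["wins"] / t["games"],
            "avg_points": t["points"] / t["games"],
        }
        for team, t in totals.items()
    }

summary = summarize(rows)
print(summary["Team A"])
```

Because every game already appears once per team, no row ever has to be flipped or duplicated before grouping.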

Historical Seed Statistics showed how each seed performed in past matchups. Here’s an example of a table taken from their site.

We know the 1 v 16 line isn’t 100% anymore. Only 0.6% of brackets on CBS Sports picked UMBC to win, and I think 100% of them were accidental.

Weaknesses

Team names were not consistent. Many with the word “State” were listed twice like “Alabama St” and “Alabama State”. Other examples include “Dallas Chr” and “Dallas Christian”, “Albany NY” and “Albany (NY)”, and sometimes just “Colorado” where I didn’t know if they were “Colorado Christian”, “Colorado Col”, “Colorado State”, or if they were a separate school on its own. I did my best to manually fix the names so they could get properly grouped together, but I’m sure there are still some mistakes.
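I fixed the names by hand, but the same cleanup could be sketched in Python with a small alias table. The table below and the choice of which spelling counts as canonical are assumptions for illustration. Ambiguous names like “Colorado” are deliberately left untouched, since guessing would just hide the problem.

```python
import re

# Hypothetical alias table: cleaned-up key -> canonical name.
# Which spelling is "canonical" is an arbitrary choice made for this sketch.
ALIASES = {
    "alabama st": "Alabama State",
    "dallas chr": "Dallas Christian",
    "albany ny": "Albany (NY)",
}

def normalize_team(name):
    """Map known spelling variants to one name; leave everything else as-is."""
    key = re.sub(r"[^a-z0-9 ]", "", name.lower())  # drop punctuation
    key = re.sub(r"\s+", " ", key).strip()         # collapse whitespace
    return ALIASES.get(key, name.strip())

for raw in ["Alabama St", "Alabama State", "Albany NY", "Albany (NY)", "Colorado"]:
    print(raw, "->", normalize_team(raw))
```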

Players, coaches, and funding change every year so this historical data is only somewhat representative of the current environment. I assumed it would capture the propensity for schools to do well if they’ve done well in recent years.

There were 1,142 schools listed, but only 451 had at least 50 games played, and only 266 had at least 100. Schools with fewer games had less credible calculated measures.

Unfortunately, I couldn’t find free AND easy-to-use records of prior March Madness tournaments.

Methodology

I first scored each team based on its historical performance. Taking into account win percentage, average points, and relative experience (more games played = more experience), I gave each team an index to use as its historical performance score. Then I expertly made up a variance statistic based on each team’s win percentage in Home, Away, and Neutral games and expertly decided to use it as the standard deviation. It’s a pretty weak calculation, but it seemed to make sense as a stand-in for SD. Looking back, I could have improved this part by using the variance in their scores to expertly generate another fake SD.

The basic premise is to use the performance index as the mu and my fake SD as the sigma of a normal distribution, randomly generate a number from it, and compare that to the number randomly generated for the other team. The higher of the two wins and moves on to the next round, where the formulas randomly generate new numbers, all the way up to the championship.
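The real thing lived in Excel formulas, but the draw-and-compare premise can be sketched in a few lines of Python. This is only an illustration: the four teams and their mu/SD values are placeholders I made up, and the function names are hypothetical.

```python
import random

def simulate_game(team_a, team_b, rng):
    """Draw one number per team from N(mu, sd); the higher draw wins."""
    draw_a = rng.gauss(team_a["mu"], team_a["sd"])
    draw_b = rng.gauss(team_b["mu"], team_b["sd"])
    return team_a if draw_a > draw_b else team_b

def simulate_bracket(teams, rng):
    """Redraw every round until one team is left.

    `teams` is listed in bracket order; its length must be a power of two.
    """
    while len(teams) > 1:
        teams = [simulate_game(teams[i], teams[i + 1], rng)
                 for i in range(0, len(teams), 2)]
    return teams[0]

# Made-up index values for four placeholder teams.
field = [
    {"name": "Team A", "mu": 100.0, "sd": 20.0},
    {"name": "Team B", "mu": 85.0, "sd": 25.0},
    {"name": "Team C", "mu": 95.0, "sd": 15.0},
    {"name": "Team D", "mu": 90.0, "sd": 30.0},
]

rng = random.Random(2018)  # fixed seed so reruns give the same bracket
champion = simulate_bracket(field, rng)
print(champion["name"])
```

Changing the seed plays the same role as refreshing the spreadsheet: every run produces a different, equally plausible bracket.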

Here’s an example:

Round 1 Cincinnati (seed 2) vs Georgia State (seed 15). Cincinnati won.

Here are their stats:

Stats	Cincinnati	Georgia State
Wins	131	110
Losses	35	52
Avg Points/Game	70.69	73.22
Performance Index (mean)	102.77	90.67
Performance Index (SD)	19.65	25.44

Cincinnati has a higher performance score due to their higher win ratio. They also appear to be more consistent with their wins among home and away games, leading to a tighter SD. Looking at the graphs, it’s clear that Cincinnati has the advantage as their probability of scoring higher in the range is greater than Georgia State’s. Since I am randomly drawing from these distributions and comparing them, there is still a chance that Georgia State generates a higher number than Cincinnati…so…

Sigh…yes, the sum of independent normally distributed random variables is also normally distributed, ~N(mu1+mu2, SD1^2+SD2^2). The difference works too: ~N(mu1-mu2, SD1^2+SD2^2). Above is the cumulative distribution of Cincinnati’s distribution minus Georgia State’s. Our new distribution has mean 12.1 and SD 32.15 (the square root of 19.65^2 + 25.44^2). If the number generated is less than 0, Georgia State wins, and they have a 35.33% chance of doing so.
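The upset probability can be checked without any random draws. Here is a short Python sketch using the standard library's `statistics.NormalDist` and the index values from the stats table. It assumes the two teams' draws are independent, which is how the spreadsheet generated them.

```python
from math import sqrt
from statistics import NormalDist

# Performance index (mean, SD) from the stats table.
cincinnati = NormalDist(mu=102.77, sigma=19.65)
georgia_state = NormalDist(mu=90.67, sigma=25.44)

# Difference of two independent normals: means subtract, variances add.
diff = NormalDist(
    mu=cincinnati.mean - georgia_state.mean,
    sigma=sqrt(cincinnati.variance + georgia_state.variance),
)

# Georgia State wins when (Cincinnati draw - Georgia State draw) < 0.
upset_probability = diff.cdf(0)
print(f"mean={diff.mean:.2f} sd={diff.stdev:.2f} P(upset)={upset_probability:.2%}")
```

This should land on the same mean, SD, and roughly 35% upset chance quoted above.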

The number of points I earned in each round, totaling 1150. The first round has 32 games, and each correct pick is worth 10 points. Each later round has half as many games, so the points per correct pick double. This score is what’s used to compare brackets.
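The scoring rule is easy to sketch in Python. The six-round layout is my assumption, based on a 64-team bracket with 32 first-round games, and the function names are hypothetical.

```python
GAMES_PER_ROUND = [32, 16, 8, 4, 2, 1]  # assumed 64-team bracket, six rounds

def points_per_pick(round_number):
    """10 points in round 1, doubling in each round after that."""
    return 10 * 2 ** (round_number - 1)

def bracket_score(correct_picks):
    """Total score from the number of correct picks in each round, in order."""
    return sum(n * points_per_pick(r)
               for r, n in enumerate(correct_picks, start=1))

# Every round is worth the same 320 points, so a perfect bracket scores 1920.
max_score = bracket_score(GAMES_PER_ROUND)
print(max_score, f"{1150 / max_score:.1%}")
```

Since halving the games and doubling the points cancel out, a single championship pick carries as much weight as the entire first round.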

I made additional adjustments, like considering “grudges” where the teams have played each other before, as well as integrating historical seed matchup results. These adjustments can greatly alter the distributions, so I weighted the components to prevent any particular piece from having too large an effect.

Everything was written with formulas in Excel, so whenever I refreshed the spreadsheet, it generated new numbers and formed a new bracket. I just kept punching F9 until I got a bracket that scientifically looked nice, then entered it into my ESPN account.

It’s admittedly not a sophisticated approach. All the little nonsense calculations give a false sense of precision and a significant portion of it is actually garbage.

But idk man…

2 Comments

Glo April 5, 2018 at 11:51 pm

Were you rewarded monetarily for your efforts??

1. Anthony Ip April 6, 2018 at 12:47 am
  
  no, just opportunity to boast in the office
  

Posted by Anthony Ip
