I have a plan for when the baseball season finally kicks off. Don't tell anyone, ok? I'm going to try to Beat The Streak. The streak is one of baseball's most seemingly unbeatable records: a hit in 56 consecutive games, set by Joe DiMaggio in 1941. The only player to come anywhere near this record in my lifetime was Jimmy Rollins in 2005 and 2006, a streak carried over from one season to the next that some baseball purists would mark with an asterisk.
For those of us who mostly interact with the great game of baseball through computer screens, MGM has given fans a chance to put together a 57-game streak of their own and win $5.6 million. MGM has even made it much easier for fans. You don't have to choose one player you think will hit in 57 straight games; you can choose a different MLB player each day. The only requirement is that the player you've chosen gets a hit on the day you choose him. Choose correctly and you add one to your tally. Choose incorrectly and you go back to zero. Your goal is to pick one player each day who gets a hit, 57 times in a row.
In case you're confused, let's go through an example.
Day 1: I choose Manny Machado, he gets a hit! Hit tally = 1
Day 2: I choose Nolan Arenado, he gets a hit! Hit tally = 2
Day 3: I choose Trea Turner, 0-4, no hits. Hit tally = 0
It's harder than you think. No one has succeeded in 16 years, but some have come close. I've played for a few years now and I don't think I've ever gotten past 15. Now, rather than going in blind, here's my plan to use machine learning and predictive analytics to Beat, The, Streaaakkk!!
* Hopefully baseball comes back and MGM continues this contest so that all this work is actually useful.
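To make the rules and the odds concrete, here's a rough sketch in Python. It replays the three-day example and then estimates the chance of 57 correct picks in a row. The per-pick success rates are hypothetical placeholders, not measured values, and the calculation assumes each day's pick is independent of the others, which is a simplification.

```python
def streak_after(results):
    """Return the running hit tally after each daily pick.

    `results` is a list of booleans: True if that day's pick got a hit.
    """
    tally = 0
    history = []
    for got_hit in results:
        # A hit extends the streak; a hitless day resets it to zero.
        tally = tally + 1 if got_hit else 0
        history.append(tally)
    return history


def chance_of_streak(p_hit, length=57):
    """Chance of `length` correct picks in a row, assuming independent days."""
    return p_hit ** length


# Machado hits, Arenado hits, Turner goes 0-4.
print(streak_after([True, True, False]))  # [1, 2, 0]

# Hypothetical per-pick success rates, not measured values.
for p in (0.70, 0.80, 0.90):
    print(f"p = {p:.2f}: {chance_of_streak(p):.2e}")
```

Under that independence assumption, even picking correctly 80% of the time works out to roughly a 3-in-a-million shot at 57 straight.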
Step 1.
What I need to do is... wait... this is a contest in which I could potentially win $5.6 million. Do you really think I'm going to spell it all out for you right here, right now? In truth, predicting a hit in a baseball game is difficult. How difficult? A few mathematically minded individuals posted on Reddit about just that. So, knowing that the chances of building a successful and sound predictive model to win this contest are, let's say, slim... what the heck! Hopefully if you're reading this and you see some flaws or have suggestions, you can point them out and we can try to "Beat, The, Streeeaakk!" together.
Step 2.
I've downloaded as much batted ball data as I can from Baseball Savant. With 40,000 events and 89 individual features, I've certainly got enough to get started.
Step 3.
Designate and create the target column. I'm trying to predict a hit. A hit can be a single, double, triple or home run. Anything else won't do. In my data, I have two columns, "events" and "description", that need to be investigated.
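A quick way to investigate those two columns is to count the distinct values in each. Here's a minimal sketch in plain Python. The sample rows are made up for illustration, and the event and description labels are assumed to follow the Baseball Savant export's naming, so check them against your own file.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample standing in for the Baseball Savant export.
SAMPLE_CSV = """events,description
single,hit_into_play
field_out,hit_into_play
strikeout,swinging_strike
home_run,hit_into_play
walk,ball
field_out,hit_into_play
"""


def value_counts(rows, column):
    """Count how often each distinct value appears in `column`."""
    return Counter(row[column] for row in rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Print each column's values from most to least common.
for column in ("events", "description"):
    print(column)
    for value, count in value_counts(rows, column).most_common():
        print(f"  {value}: {count}")
```

On the real file, you would swap the in-memory sample for the downloaded CSV. The counts show which labels in "events" need to map to a hit and which don't.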
Step 3 cont.
I need to turn the 'events' column into a hit/non-hit categorical column that will be used as my target during modeling. Here's a function built to label these occurrences as a hit/non-hit.
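A minimal sketch of such a labeler, not necessarily the exact function used here. It assumes the four hit types appear in 'events' as single, double, triple and home_run. The helper names (label_hit, add_hit_column) and the 'hit' column name are hypothetical.

```python
# Event labels that count as a hit; assumed spellings, check them against your data.
HIT_EVENTS = {"single", "double", "triple", "home_run"}


def label_hit(event):
    """Return 1 if the event is a hit, otherwise 0."""
    return 1 if event in HIT_EVENTS else 0


def add_hit_column(rows, source="events", target="hit"):
    """Return copies of the rows with a 0/1 target column added."""
    return [{**row, target: label_hit(row.get(source))} for row in rows]


sample = [
    {"events": "single"},
    {"events": "field_out"},
    {"events": "home_run"},
    {"events": "strikeout"},
    {"events": None},  # missing event, treated as a non-hit
]

labeled = add_hit_column(sample)
print([row["hit"] for row in labeled])  # [1, 0, 1, 0, 0]
```

If the data lives in a pandas DataFrame named df (an assumed name), the same idea can be written as df["hit"] = df["events"].isin(HIT_EVENTS).astype(int).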
Step 4.
Gathering this data was pretty easy. Now I have to clean it up and do some exploring. Tune in to my next post showcasing interesting visualizations and findings!