Andrew Hall


DDD EA – Session 4 – Data Science for fun and profit

This was live blogged during DDD EA, so it may be full of typos and issues.

It’s by Gary Short (@garyshort).

Data science is maths and therefore dull, so we will learn about data science in the context of horse racing.

Naive probability is the type of probability that most people know. It’s the number of winning outcomes over the total number of outcomes.
Each heads-or-tails flip, for example, would be 1/2, and the flips are independent because one coin flip does not affect another.
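
As a minimal sketch of the naive calculation (the function name `naive_probability` is made up for illustration and is not from the talk), the coin flip example could be written like this:

```python
from fractions import Fraction

def naive_probability(winning_outcomes: int, total_outcomes: int) -> Fraction:
    """Number of winning outcomes over the total number of outcomes."""
    return Fraction(winning_outcomes, total_outcomes)

# A fair coin: one winning side out of two possible sides.
p_heads = naive_probability(1, 2)

# Independent flips do not affect each other, so the chance of two heads
# in a row is the product of the two individual probabilities.
p_two_heads = p_heads * p_heads

print(p_heads, p_two_heads)  # 1/2 1/4
```

Using `Fraction` keeps the answers as exact ratios, which matches how the "winning over total" rule is usually stated.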

But this doesn’t always work. The chance of there being life on the moon is not 50/50 just because there are 2 possible answers. We have to take what we already know into account, and this is called conditional probability.

A guess is picking which side of chance will occur.
A prediction is selecting an outcome based on known conditions.

In horse racing, all things being equal, how can we predict a winner?
With 8 equal runners, each horse has only a 12.5% (1 in 8) chance of winning – not 50/50 (win/lose).

So if we need to do conditional probability, we need to build a model based on the known values, because you cannot model the whole real world.
This is the same as building a scale model of anything else: we can then test how closely it reflects the real world.

Because we have so much data these days (lots of historical horse racing results, for example), we can check our model against races from all time to test it.

Average is the simplest model. For example, if you want to predict the height of next year’s class, you can take the average from last year. In horse racing, we could average the finishing position over the last 10 races. But this is wrong.
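
To make the "average of the last 10 races" idea concrete, here is a small sketch. The finishing positions are invented for illustration, and this is the naive model that the notes go on to say is wrong, not a recommendation:

```python
# Hypothetical finishing positions for one horse over its last 10 races.
last_10_positions = [3, 1, 4, 2, 6, 2, 5, 1, 3, 3]

# The "simplest model": predict the next result as the plain average.
average_position = sum(last_10_positions) / len(last_10_positions)

print(average_position)  # 3.0
```

The arithmetic works, but it treats "3rd" as if it were a measurement, which is the problem with this model.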

There are 2 types of data: quantitative data and qualitative data.
Quantitative data are fixed numbers.
Qualitative data is categorical, based on rules, and not necessarily repeatable.

Sometimes qualitative data pretends to be quantitative by looking like numbers. Finishing 3rd doesn’t mean the horse was 3 times worse than the winner. Finishing position is actually a category, not a number.

Standard deviation is an indication of how much the data varies. If you have a large standard deviation, the data is spread widely around the average, with values far out at either end, and the average is therefore a less accurate predictor.
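
A quick sketch of why the spread matters, using two invented horses that share the same average finishing position (the data is hypothetical, and `pstdev` is the population standard deviation from Python's standard library):

```python
from statistics import mean, pstdev

# Hypothetical finishing positions for two horses with the same average.
consistent_horse = [3, 3, 2, 4, 3, 3]
erratic_horse = [1, 6, 1, 5, 4, 1]

for name, positions in [("consistent", consistent_horse),
                        ("erratic", erratic_horse)]:
    # Same mean for both, but very different spread around it.
    print(name, mean(positions), round(pstdev(positions), 2))
```

Both horses average 3rd, but the erratic horse's much larger standard deviation shows that the average says far less about where it will finish next time.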

Categorical information needs to be assigned scores instead of using the arbitrary category names (even if they are numbers).
For example, give 3 points for 1st, 2 for 2nd, 1 for 3rd and nothing after that.
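
The scoring rule above could be sketched as a simple lookup. The names `POINTS` and `score` and the list of recent positions are made up for illustration:

```python
# Points per finishing position, as described above; any other place scores 0.
POINTS = {1: 3, 2: 2, 3: 1}

def score(position: int) -> int:
    """Convert a finishing position (a category) into a points score."""
    return POINTS.get(position, 0)

# Hypothetical recent finishing positions for one horse.
recent_positions = [1, 4, 2, 3, 7]
total_score = sum(score(p) for p in recent_positions)

print(total_score)  # 3 + 0 + 2 + 1 + 0 = 6
```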

We then unify the scores to a scale of 0 to 1, because we need to determine how close we are to something happening: 1 is definite and 0 is never. Unifying the numbers this way gives us a clearer representation of likelihood.
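
One possible reading of "unifying to 1" is to divide each horse's score by the total, so every value lies between 0 and 1 and they all add up to 1. That interpretation, the function name `unify` and the scores below are assumptions for illustration, not something confirmed in the talk:

```python
def unify(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores so each lies between 0 and 1 and together they sum to 1."""
    total = sum(scores.values())
    if total == 0:
        # Nothing scored at all, so there is nothing to share out.
        return {name: 0.0 for name in scores}
    return {name: value / total for name, value in scores.items()}

# Hypothetical total points per horse.
scores = {"Horse A": 6, "Horse B": 3, "Horse C": 1, "Horse D": 0}
unified = unify(scores)

print(unified)
```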

If we have lots of different pieces of data, and a likelihood for each of those individual things, how do we combine these into a total probability?
We can just unify these values as we did earlier. We can now also weight the items we think are more valuable by giving them some extra value.
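
A minimal sketch of one way to combine several 0 to 1 values with weights is a weighted average. The factor names, the values and the weights are all invented here, and a weighted average is only one possible choice of combining rule:

```python
def combine(factors: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of 0-1 factor values; the result stays between 0 and 1."""
    total_weight = sum(weights[name] for name in factors)
    weighted_sum = sum(factors[name] * weights[name] for name in factors)
    return weighted_sum / total_weight

# Hypothetical 0-1 values for one horse.
factors = {"recent_form": 0.6, "jockey": 0.4, "going": 0.8}
# Hypothetical weights: recent form is judged to matter twice as much.
weights = {"recent_form": 2.0, "jockey": 1.0, "going": 1.0}

combined = combine(factors, weights)
print(round(combined, 3))
```

Dividing by the total weight is what keeps the combined figure on the same 0 to 1 scale as the inputs.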

This is more art than science though. One person’s opinion on weighting might be different to another’s, and you will need to test each of these in turn, which could take a lot of time.
How can we work out which variables are actually valuable?
We can use the coefficient of correlation to determine this. Excel has this functionality built in. The coefficient runs from -1 to 1, and we are looking for a number above 0.6 to show correlation. A negative value shows inverse correlation. This gives us a good starting point.
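
For anyone without Excel to hand, the coefficient of correlation (Pearson's r) can be sketched in a few lines. The function name `pearson` and the sample data are made up for illustration:

```python
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

# Hypothetical variable (xs) against hypothetical race scores (ys).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson(xs, ys)
print(round(r, 2))                                # 0.77, above the 0.6 rule of thumb
print(round(pearson(xs, [-y for y in ys]), 2))    # -0.77, an inverse correlation
```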

Confounding variables are variables that cause the effects that we measure. If we only look at the effects, then we are missing a key point. E.g. ice cream sales and drownings are correlated, but the confounding variable is temperature – it’s the variable that we haven’t thought of.

If things are correlated, we can use linear regression. Excel has this functionality built in. This will create a formula to predict one variable based on another. However, for more complicated problems, linear regression doesn’t work: a probability cannot be over 1 or below 0, but a straight line can give values outside that range. We can, however, use logistic regression to give us values in the correct range.
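
A rough sketch of the range problem, assuming invented data (a made-up form score against whether the horse won). The line is fitted by ordinary least squares. The `logistic` function shown is only the squashing function that logistic regression is built on, not a fitted logistic regression model:

```python
from math import exp

def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def logistic(z: float) -> float:
    """Map any real number into the range 0 to 1."""
    return 1 / (1 + exp(-z))

# Hypothetical data: form score (x) and whether the horse won (1) or not (0).
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

slope, intercept = fit_line(xs, ys)
line_at_0 = slope * 0 + intercept
line_at_10 = slope * 10 + intercept

print(line_at_0, line_at_10)                    # below 0 and above 1
print(logistic(-5), logistic(0), logistic(5))   # always between 0 and 1
```

The straight line happily predicts a "probability" below 0 or above 1 once you move away from the data, while the logistic function cannot leave that range.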

If the relationship isn’t linear, we can use Bayesian theory.
