And by “GP”, I of course mean guinea pigs!
Ok fine: it actually means a statistical method called a Gaussian Process. By placing fairly strong assumptions on the form of the data, you can do lots of useful things like extrapolate what will happen in the future, or even figure out the type of some unknown observation.
Regulars on this blog may remember that I’ve made a few previous stabs at quantifying my running progress. They turned out reasonably ok, but I was hard-pressed to do anything more sophisticated than a simple linear regression…mainly because I couldn’t figure out how to properly weight each run. But I digress!
Enter Gaussian Processes. They rely on pretty strong assumptions: typically, you assert that any collection of your observations is a joint draw from a multivariate Gaussian distribution whose covariance structure (the "kernel") you choose. The nice thing is, your data don’t have to be linear. And running data is anything but linear.
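To make that assumption concrete, here's a minimal NumPy sketch of my own (it's illustrative, not the code from this project): with a squared-exponential kernel, function values at nearby inputs are highly correlated, and a "function" is just one draw from a big multivariate Gaussian.

```python
import numpy as np

def squared_exponential(x1, x2, length_scale=1.0):
    # Covariance is near 1 for nearby inputs and decays toward 0 with distance.
    diff = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

# Under the GP assumption, the function values at these 50 points are a single
# draw from a 50-dimensional Gaussian whose covariance is the kernel matrix.
x = np.linspace(0.0, 10.0, 50)
K = squared_exponential(x, x)
rng = np.random.default_rng(0)
sample = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)))
```

(The tiny `1e-8` jitter on the diagonal is just a standard numerical-stability trick.)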
So, with some guidance from the GP examples on scikit-learn, I wrote up a few scripts to automate the process.
- Download the data from Garmin Connect. This took some trial-and-error, since GC doesn’t exactly go out of its way to make accessing your data easy. I took a lot of hints from this guy’s Ruby script and got something working in reasonably short order.
- Parse the data for distances and times. This was mostly in place as a result of my previous work. I had to tweak things a little bit so that I could process one file at a time, but otherwise it was pretty simple.
- Feed the data into a Gaussian Process. This didn’t take a whole lot of code, but it did take a while to fully grasp what was going on. Or, more truthfully, to grasp enough of what was going on to use it somewhat correctly.
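For the parsing step, Garmin's exports are TCX files (an XML dialect). The `parse_run` helper below is my own minimal sketch rather than the actual script from step 2, and it assumes each `Lap` element carries `DistanceMeters` and `TotalTimeSeconds` children, which is how TCX lap summaries are laid out.

```python
import xml.etree.ElementTree as ET

# Namespace used by Garmin's TCX exports.
TCX_NS = {'tcx': 'http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2'}

def parse_run(tcx_text):
    """Sum distance (meters) and time (seconds) over every lap in a TCX file."""
    root = ET.fromstring(tcx_text)
    meters = seconds = 0.0
    for lap in root.iter('{%s}Lap' % TCX_NS['tcx']):
        meters += float(lap.find('tcx:DistanceMeters', TCX_NS).text)
        seconds += float(lap.find('tcx:TotalTimeSeconds', TCX_NS).text)
    return meters, seconds

# A toy one-lap file: roughly a 5-mile run in 45 minutes.
sample_tcx = (
    '<TrainingCenterDatabase xmlns='
    '"http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">'
    '<Activities><Activity Sport="Running"><Lap>'
    '<TotalTimeSeconds>2700</TotalTimeSeconds>'
    '<DistanceMeters>8046.7</DistanceMeters>'
    '</Lap></Activity></Activities></TrainingCenterDatabase>')
meters, seconds = parse_run(sample_tcx)
```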
Here’s a snippet of the GP code:
```python
# Loop through the sorted arrays, generating a graph.
X = np.atleast_2d(np.linspace(0, numRuns, numRuns, endpoint=False)).T
y = np.zeros(np.size(timestamps))
dy = 0.5 + 1.0 * np.random.random(y.shape)

for i in range(0, np.size(sortInd)):
    ind = sortInd[i]
    d = running.metersToMiles(distances[ind])
    s = running.secondsToMinutes(splits[ind])
    y[i] = running.averagePace(d, s)

process = gp.GaussianProcess(corr='squared_exponential',
                             nugget=(dy / y) ** 2,
                             theta0=1e-1, thetaL=1e-3, thetaU=1,
                             random_start=100)
process.fit(X, y)

# Set up a prediction.
x = np.atleast_2d(np.linspace(0, numRuns, numRuns * 10)).T
y_pred, MSE = process.predict(x, eval_MSE=True)
```
The loop goes through each run, converting the distances and durations into the correct units, and calculating the overall average pace. Herein lies one of my major assumptions: that a 20-mile run with an average pace of 9 minutes/mile “counts” just as much as a 5-mile run with an average pace of 9 minutes/mile. That’s simply not true; the latter is orders of magnitude easier. But as my understanding is still somewhat limited, that was an assumption I had to live with for the time being.
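To see why those two runs come out identical, here's a toy stand-in for the `running.averagePace` helper (the real module's signature may differ; this is just the arithmetic):

```python
def average_pace(miles, minutes):
    # The only feature the GP ever sees: minutes per mile.
    return minutes / miles

long_run = average_pace(20.0, 180.0)   # 20 miles in 3 hours
short_run = average_pace(5.0, 45.0)    # 5 miles in 45 minutes
# Both collapse to 9.0 min/mile, so the model treats them as interchangeable.
```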
The GP itself is computed in essentially the two lines following the loop: one call constructs the process and one fits it. The last two lines then predict the underlying distribution my paces are drawn from.
Put another way: all things being equal, it’s what my pace would be if I were a robot and never had a bad day or a really good one.
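For the curious: in the old scikit-learn API, the `MSE` returned by `predict(x, eval_MSE=True)` is the predictive variance at each point, and the 95% confidence band is the mean plus or minus 1.96 standard deviations. A quick sketch with placeholder numbers (the real `y_pred` and `MSE` come out of the fit above):

```python
import numpy as np

# Placeholder values standing in for the outputs of process.predict(...).
y_pred = np.array([8.8, 9.0, 9.2])   # predicted pace, minutes per mile
MSE = np.array([0.04, 0.09, 0.04])   # predictive variance at each point

sigma = np.sqrt(MSE)                 # predictive standard deviation
lower = y_pred - 1.96 * sigma        # bottom of the 95% confidence band
upper = y_pred + 1.96 * sigma        # top of the 95% confidence band
```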
Here’s what it spits out:
It’s kind of neat. But it’s still not perfect. Aside from the aforementioned weighting problem, there’s also the issue that the 95% confidence band doesn’t seem to encompass much of the data at all. That implies a very tight distribution, which clearly isn’t the case; a band that narrow usually means the noise term (the nugget) is underestimated, so it’s possible some of my code is wrong there.
As for the running implications, I posted about this on my running blog, so head over there to read about it.
If you’re interested in the code, you can either have a look at its GitHub home, or just be patient: this worked so well and smoothly that I’m thinking of turning it into a web service where folks can punch in their Garmin Connect login (only if they trust me, obviously) and get a graph like this in return. Shweet!