And by “GP”, I of course mean *guinea pigs!*

Ok fine: it actually means a statistical method called a Gaussian Process. By placing fairly strong assumptions on the form of the data, you can do lots of useful things like extrapolate what will happen in the future, or even figure out the type of some unknown observation.

Regulars on this blog may remember that I’ve made a few previous stabs at quantifying my running progress. They turned out reasonably ok, but I was hard-pressed to do anything more sophisticated than a simple linear regression…mainly because I couldn’t figure out how to properly weight each run. But I digress!

Enter Gaussian Processes. They rely on pretty strong assumptions: you assert up front how smooth your data are, by choosing a covariance (kernel) function that says how strongly nearby observations should resemble each other. The nice thing is, your data don’t have to be linear. And running data is anything but linear.
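To make that concrete: the kernel used later in the post is the squared-exponential, which assumes runs close together in my history should have similar paces. A minimal sketch of it (the function name and the theta value here are mine, not from the actual code):

```python
import numpy as np

# Squared-exponential covariance between two runs, indexed by their
# position in the run history. theta controls how quickly the assumed
# similarity decays as runs get farther apart.
def squared_exponential(x1, x2, theta = 0.1):
    return np.exp(-theta * (x1 - x2) ** 2)
```

Identical inputs give a covariance of 1, decaying toward 0 for runs far apart in the history.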

So, with some guidance from the GP examples on scikit-learn, I wrote up a few scripts to automate the process.

**Download the data from Garmin Connect.** This took some trial and error, since GC doesn’t exactly go out of its way to make accessing your data easy. I took a lot of hints from this guy’s ruby script and got something working in reasonably short order.

**Parse the data for distances and times.** This was mostly in place as a result of my previous work. I had to tweak things a little bit so that I could process one file at a time, but otherwise it was pretty simple.

**Feed the data into a Gaussian Process.** This didn’t take a whole lot of code, but it did take a while to fully grasp what was going on. Or, more truthfully, to grasp what was going on *enough* to use it *somewhat* correctly.
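For reference, the parsing step might look roughly like this hypothetical sketch, which pulls per-lap totals out of the TCX files Garmin devices export (the function name is mine; the namespace is the standard TCX v2 one):

```python
import xml.etree.ElementTree as ET

# Standard namespace for Garmin TCX v2 files.
TCX_NS = "{http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2}"

def parse_run(tcx_text):
    """Sum distance (meters) and time (seconds) over the laps of one TCX file."""
    root = ET.fromstring(tcx_text)
    laps = list(root.iter(TCX_NS + "Lap"))
    meters = sum(float(lap.find(TCX_NS + "DistanceMeters").text) for lap in laps)
    seconds = sum(float(lap.find(TCX_NS + "TotalTimeSeconds").text) for lap in laps)
    return meters, seconds
```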

Here’s a snippet of the GP code:

```python
# Loop through the sorted arrays, generating a graph.
X = np.atleast_2d(np.linspace(0, numRuns, numRuns, endpoint = False)).T
y = np.zeros(np.size(timestamps))
dy = 0.5 + 1.0 * np.random.random(y.shape)
for i in range(0, np.size(sortInd)):
    ind = sortInd[i]
    d = running.metersToMiles(distances[ind])
    s = running.secondsToMinutes(splits[ind])
    y[i] = running.averagePace(d, s)

process = gp.GaussianProcess(corr = 'squared_exponential',
                             nugget = (dy / y) ** 2,
                             theta0 = 1e-1, thetaL = 1e-3, thetaU = 1,
                             random_start = 100)
process.fit(X, y)

# Set up a prediction.
x = np.atleast_2d(np.linspace(0, numRuns, numRuns * 10)).T
y_pred, MSE = process.predict(x, eval_MSE = True)
```

The loop goes through each run, converting the distances and durations into the correct units, and calculating the overall average pace. Herein lies one of my major assumptions: that a 20-mile run with an average pace of 9 minutes/mile “counts” just as much as a 5-mile run with an average pace of 9 minutes/mile. That’s simply not true; the latter is orders of magnitude easier. But as my understanding is still somewhat limited, that was an assumption I had to live with for the time being.
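The unit conversions in the loop are simple enough to sketch; these are my guesses at what the `running` helpers do, not the actual module:

```python
METERS_PER_MILE = 1609.344

def metersToMiles(meters):
    return meters / METERS_PER_MILE

def secondsToMinutes(seconds):
    return seconds / 60.0

def averagePace(miles, minutes):
    """Average pace in minutes per mile."""
    return minutes / miles
```

So a run of exactly one mile (1609.344 m) in 540 seconds works out to a 9:00/mile pace.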

The GP itself is computed in, essentially, the two lines following the loop. The last two lines then predict the underlying distribution that my paces are drawn from.

Put another way: all things being equal, it’s what my pace would be if I were a robot and never had a bad day or a really good one.
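If the scikit-learn call feels like a black box, here’s a from-scratch sketch (numpy only) of the posterior mean it computes; the hand-picked `theta` and `noise` values stand in for the hyperparameters sklearn would actually fit:

```python
import numpy as np

def gp_predict(X_train, y_train, X_test, theta = 0.1, noise = 0.1):
    """Posterior mean of a zero-mean GP with a squared-exponential kernel.
    theta and noise are stand-ins for the fitted hyperparameters."""
    def k(a, b):
        return np.exp(-theta * (a[:, None] - b[None, :]) ** 2)
    # Training covariance, plus observation noise on the diagonal.
    K = k(X_train, X_train) + noise * np.eye(len(X_train))
    alpha = np.linalg.solve(K, y_train)   # weight assigned to each training run
    return k(X_test, X_train).dot(alpha)  # smoothed prediction at the test points
```

With `noise > 0` the prediction smooths through the observations rather than interpolating them exactly, which is the behavior you want for noisy pace data.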

Here’s what it spits out:

It’s kind of neat. But it’s still not perfect. Aside from the aforementioned weighting problem, there’s also the issue that the 95% confidence band doesn’t seem to encompass much of the data at all. That implies a very tight distribution, which clearly isn’t the case. So it’s possible some of my code is wrong there.
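For what it’s worth, the band comes from the predictive variance; a sketch of the arithmetic, assuming `MSE` is the per-point variance that `predict(..., eval_MSE = True)` returns:

```python
import numpy as np

def confidence_band(y_pred, MSE):
    """Approximate 95% band: mean +/- 1.96 predictive standard deviations."""
    sigma = np.sqrt(MSE)
    return y_pred - 1.96 * sigma, y_pred + 1.96 * sigma
```

If the band hugs the mean while many points fall outside it, one suspect is the nugget (observation noise) being set too small, so the model thinks the paces are far less noisy than they really are.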

As for the running implications, I posted about this on my running blog, so head over there to read about it.

If you’re interested in the code, you can either have a look at its github home, or just be patient: this worked so well and smoothly that I’m thinking of turning it into a web service where folks can punch in their Garmin Connect login (only if they trust me, obviously) and get a graph like this in return. Shweet!

Did you take into account grade and terrain? I’ve noticed running on unpaved ground or even grass makes quite a noticeable difference in my gait, and even more so when going up and down hills. Of course that’s just me running to catch the train, so nowhere near what you go through 😛

I had an idea for a poor-man’s terrain mapper a while back. It’s a little gadget you could wear on your arm with a controller attached to your palm with maybe 3-4 buttons and can be operated with just the fingers on that hand.

When you’re leaving asphalt onto grass, you can hit the “grass” button and it will turn on the “grass” timer and take note of the coordinates. When you’re leaving grass, you can hit the grass button again and it will stop timing and note the coordinates again.

Same for gravel, sand etc… so I guess 4 buttons should be enough. I like to tinker with electronics so this just sorta popped into my head while I was waiting for the train one day.
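(A toy version of that toggle logic, in case anyone wants to prototype the gadget in software first; the class and method names are made up, and it’s simplified to assume you stop one surface before starting another:)

```python
class TerrainTimer:
    """Each surface button toggles a timer: first press starts it, second
    press stops it and adds the elapsed time to that surface's total."""
    def __init__(self):
        self.totals = {}
        self.active = None  # (surface, start_time) or None

    def press(self, surface, now):
        if self.active and self.active[0] == surface:
            # Second press of the same button: stop timing that surface.
            name, start = self.active
            self.totals[name] = self.totals.get(name, 0.0) + (now - start)
            self.active = None
        else:
            # First press: start timing this surface.
            self.active = (surface, now)
```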

If the GPS doesn’t take precise measurements of grade, maybe you can add a 5th button for change in grade up and down. Of course all this is more stuff to keep in mind while trying to focus on your pace. 😛

I would love to find a way to take the terrain into account. A gadget like you mentioned would be a wonderful way of discretizing the level of difficulty; as it is, you can kind of infer that from the average pace, but not really, considering the different kinds of runs, the kind of day you’re having, the length of the run, etc. The GPS actually records extremely detailed elevation information, but at the moment I’m just discarding it. I figure distance is the all-powerful metric that I need to figure out how to include first, then I’ll start adding all the other fancy features 😛