Functional PCA of the Golden Cheetah Power Duration Data

With 2,445 athlete seasons (inclusion criteria: at least 100 power files per season and PD data extending past 2 hours), it makes sense to let the data speak without a predetermined narrative. One tool that works without a priori assumptions is principal component analysis. The basic idea is to start with the data mean, then explain as much of the variability as possible in as few components as possible. For time-series data (which is sort of the case for PD data) you can use functions to achieve the same goal; i.e. start with no assumptions, find the mean function, then explain the variability with a minimal number of orthogonal functions.
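Written out, the idea in the paragraph above is the standard truncated expansion used in functional PCA. The symbols here are my own notation, not anything taken from the analysis itself: X_i(t) is the i-th PD curve, \mu(t) the mean function, \phi_k(t) the k-th eigenfunction, and \xi_{ik} the score of curve i on eigenfunction k.

```latex
X_i(t) \approx \mu(t) + \sum_{k=1}^{K} \xi_{ik}\,\phi_k(t),
\qquad
\int \phi_j(t)\,\phi_k(t)\,dt = \delta_{jk},
\qquad
\xi_{ik} = \int \bigl(X_i(t) - \mu(t)\bigr)\,\phi_k(t)\,dt
```

Keeping K small is what "as few components as possible" means in practice.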

Fortunately R has a package, fdapace, that makes this quite easy once you figure out how to format the data. So here we go:
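On the formatting step: as I understand it, fdapace's FPCA() wants two parallel lists, one vector of observation times per curve (Lt) and one vector of observed values per curve (Ly), and the package ships a MakeFPCAInputs() helper for building them from long-format data. Below is a minimal, language-agnostic sketch of that reshaping in Python. The record layout, the IDs and the power values are all made up for illustration; this is not the code used for the analysis.

```python
from collections import defaultdict

# Hypothetical long-format records: (athlete_season_id, duration_s, best_power_w).
# The values are invented purely to show the reshaping step.
records = [
    ("a1", 5, 900.0), ("a1", 60, 480.0), ("a1", 1200, 300.0),
    ("a2", 60, 520.0), ("a2", 5, 1100.0), ("a2", 1200, 280.0),
]

def to_curve_lists(rows):
    """Group long-format rows into one duration vector and one power vector per curve."""
    grouped = defaultdict(list)
    for curve_id, duration, power in rows:
        grouped[curve_id].append((duration, power))
    ids = sorted(grouped)
    # Sort each curve by duration so the two vectors line up point for point.
    lt = [[d for d, _ in sorted(grouped[i])] for i in ids]
    ly = [[p for _, p in sorted(grouped[i])] for i in ids]
    return ids, lt, ly

ids, Lt, Ly = to_curve_lists(records)
```

Each element of Lt and Ly then corresponds to one athlete season, which is the one-curve-per-list-entry shape described above.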

(edited to add figure with transparency)

Start with a mess of data.

Run your functional PCA.

The scree plot shows that 90% of the variability can be explained by eigenfunction 1. That’s a lot explained. Eigenfunction 2 explains an additional 5%, and by the time we get to eigenfunction 3 it’s down to explaining 2.5% of the variability (and I would argue it could be dropped).
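For anyone wondering where the scree-plot percentages come from: each one is an eigenvalue divided by the sum of all eigenvalues. Here is a small sketch of that arithmetic. The eigenvalues are placeholders chosen only to reproduce the 90% / 5% / 2.5% split quoted above, with everything else lumped into a single remainder term; they are not the fitted values.

```python
def fraction_of_variance_explained(eigenvalues):
    """Return per-component and cumulative fractions of variance explained."""
    total = sum(eigenvalues)
    fve = [lam / total for lam in eigenvalues]
    cumulative = []
    running = 0.0
    for f in fve:
        running += f
        cumulative.append(running)
    return fve, cumulative

# Placeholder eigenvalues (NOT the fitted ones): three components plus a lumped remainder.
eigenvalues = [90.0, 5.0, 2.5, 2.5]
fve, cum_fve = fraction_of_variance_explained(eigenvalues)

for k, (f, c) in enumerate(zip(fve, cum_fve), start=1):
    print(f"component {k}: {100 * f:.1f}% (cumulative {100 * c:.1f}%)")
```

The cumulative column is what you read off when deciding how many eigenfunctions to keep.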

The mean function is straightforward: it is the mean of all the data, and it looks believable. In the final box, lower right, are the first 3 eigenfunctions.

Zooming in on these, the black line is eigenfunction 1. It is positive everywhere, so it acts like a “gain” function and is somewhat proportional to power. The red line is eigenfunction 2, and it captures the anti-correlation between all-out sprint and endurance ability, sort of a twitchedness function. Eigenfunction 3 is a bit of a weird one that is tied to sprint endurance, or lack thereof.

To visualize that more easily, we can take the mean function and add/subtract a multiple of each eigenfunction. Here is eigenfunction 1, showing that 90% of the variability in 2,445 PD curves is a simple “gain” function, i.e. most of the difference is just that some people are better than others.

Eigenfunction 2 shows the trade-off between sprint and endurance ability. Basically it’s the “are you a sprinter or are you everyone else” function.

And last, eigenfunction 3, which I would argue could be dropped, is maybe a sprint endurance function.
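The add/subtract plots above boil down to mean ± scale × eigenfunction, evaluated point by point. Here is a minimal sketch with synthetic stand-ins: the mean curve, the two eigenfunction shapes and the scale values are all invented (and the shapes are not normalized), so only the qualitative behaviour carries over, not the numbers.

```python
import math

# Synthetic stand-ins, not the fitted curves from the analysis.
durations = [1, 5, 15, 60, 300, 1200, 3600, 7200]            # seconds
mean_power = [250.0 + 750.0 / math.sqrt(d) for d in durations]

# "Gain"-like shape: positive at every duration.
phi_gain = [0.3 + 0.5 / math.sqrt(d) for d in durations]
# "Twitchedness"-like shape: positive for short efforts, negative for long ones.
phi_twitch = [1.0 / math.sqrt(d) - 0.1 for d in durations]

def shifted_curves(mean_curve, eigenfunction, scale):
    """Return (mean + scale * eigenfunction, mean - scale * eigenfunction)."""
    plus = [m + scale * e for m, e in zip(mean_curve, eigenfunction)]
    minus = [m - scale * e for m, e in zip(mean_curve, eigenfunction)]
    return plus, minus

gain_plus, gain_minus = shifted_curves(mean_power, phi_gain, scale=100.0)
twitch_plus, twitch_minus = shifted_curves(mean_power, phi_twitch, scale=100.0)
```

With an all-positive shape the "plus" curve sits above the mean at every duration, which is why it reads as a gain. With a sign-changing shape the "plus" curve is above the mean for sprints and below it for long efforts, which is the sprinter vs everyone else trade-off.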

So what does this mean?

My take is that out to 2 hours, 95% of performance variability can be explained empirically with just 2 parameters (which is incredibly parsimonious): a gain function that indicates overall ability, and a twitchedness function that indicates sprint vs endurance phenotype. I still need to cross-validate this claim, but it makes sense that such a simple parameterization (i.e. a sufficiently stiff model) should be very robust. I think that robustness is key to dealing with PD data, which is generally fairly shitty due to submax spans and serial auto-correlation.

The way you would apply this is to run the FPCA on your reference data set, then introduce your new PD data set of interest and see how strongly (and whether positively or negatively) it loads on eigenfunction 1 (ability) and eigenfunction 2 (twitchedness). From there it would be easy to interpret whether gains/losses are due to changes in ability or twitchedness.
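A rough sketch of that loading step: subtract the reference mean from the new curve, then integrate the remainder against each eigenfunction. Everything below is a toy. The grid is a generic 0 to 1 axis standing in for (log) duration, the mean and eigenfunctions are simple orthonormal shapes I picked so the answer is known in advance, and plain trapezoid integration is my assumption here, not necessarily what fdapace does internally.

```python
import math

def trapezoid(values, grid):
    """Trapezoid-rule integral of values over grid."""
    return sum(0.5 * (values[i] + values[i + 1]) * (grid[i + 1] - grid[i])
               for i in range(len(grid) - 1))

def score(curve, mean_curve, eigenfunction, grid):
    """Loading of a curve on one eigenfunction: integral of (curve - mean) * phi."""
    integrand = [(x - m) * p for x, m, p in zip(curve, mean_curve, eigenfunction)]
    return trapezoid(integrand, grid)

# Toy reference fit (hypothetical shapes, orthonormal on [0, 1]).
n = 200
grid = [i / n for i in range(n + 1)]
mean_curve = [400.0 - 150.0 * t for t in grid]
phi1 = [1.0 for _ in grid]                                      # all positive: "ability"
phi2 = [math.sqrt(2.0) * math.cos(math.pi * t) for t in grid]   # sign change: "twitchedness"

# A new curve built with known loadings of +30 and -12.
new_curve = [m + 30.0 * p1 - 12.0 * p2
             for m, p1, p2 in zip(mean_curve, phi1, phi2)]

xi1 = score(new_curve, mean_curve, phi1, grid)   # loading on ability
xi2 = score(new_curve, mean_curve, phi2, grid)   # loading on twitchedness
```

Comparing the two loadings between, say, last season and this season is what would tell you whether a change came from ability or from twitchedness.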