Part 2: Functional Principal Component Analysis of the Golden Cheetah Power Duration Data

After the first post on FPCA of the Golden Cheetah Open Data, Dan Connelly (@djconnel) pointed out that, since FPCA uses basis functions, the fit would improve after taking the log of power. Going back through the initial attempt, there was heteroskedasticity in the residuals, with errors increasing at long durations. Sure enough, things improved after taking the log, so here is the updated post. The cleaned-up code will be put up by Mark Liversedge, in keeping with the open data collaborative approach.
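To make the log point concrete, here is a minimal sketch with made-up numbers (not taken from the GC data). Two hypothetical riders have the same curve shape, with one 10% stronger at every duration. In watts the gap changes with duration, but on the log scale it is a constant offset, which is the kind of difference an additive set of basis functions can capture with a single score.

```python
import math

# Hypothetical mean maximal power values (watts), for illustration only.
durations_s = [1, 60, 300, 1200, 7200]
rider_a = [1000.0, 500.0, 350.0, 300.0, 250.0]
rider_b = [p * 1.10 for p in rider_a]  # same shape, 10% stronger everywhere

# Gap between the riders on the raw (watt) scale: differs at every duration.
watt_gap = [b - a for a, b in zip(rider_a, rider_b)]

# Gap on the log scale: the same constant, log(1.10), at every duration.
log_gap = [math.log(b) - math.log(a) for a, b in zip(rider_a, rider_b)]

for d, w, g in zip(durations_s, watt_gap, log_gap):
    print(f"{d:>5} s  gap = {w:6.1f} W  log gap = {g:.4f}")
```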

Getting back to the post at hand, the GC data has 2,445 data sets with at least 100 power files and mean maximal power (MMP) extending past 7,200 seconds. Only these data sets were used for the analysis; all others were excluded. I ran the analysis in R using the fdapace package. The purpose of functional principal component analysis is to describe time series data with the fewest possible independent functions.
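For readers who want to see the idea stripped down, below is a rough sketch of a discretized version of FPCA in plain Python. It is not what fdapace does internally (fdapace smooths the mean and covariance and handles sparse or irregular sampling); it assumes every curve is log power sampled on the same duration grid, and it only extracts the first component by power iteration. All names and the toy data are hypothetical.

```python
import math

def mean_curve(curves):
    """Pointwise mean across curves sampled on one shared duration grid."""
    n = len(curves)
    return [sum(col) / n for col in zip(*curves)]

def covariance(curves, mu):
    """Sample covariance matrix between grid points (n - 1 denominator)."""
    n, size = len(curves), len(mu)
    centered = [[x - m for x, m in zip(curve, mu)] for curve in curves]
    return [[sum(row[j] * row[k] for row in centered) / (n - 1)
             for k in range(size)] for j in range(size)]

def leading_component(cov, iterations=200):
    """Largest eigenvalue and unit eigenvector of cov via power iteration.

    Assumes cov is not all zeros and the start vector is not orthogonal
    to the leading eigenvector.
    """
    size = len(cov)
    v = [1.0 / math.sqrt(size)] * size
    for _ in range(iterations):
        w = [sum(cov[j][k] * v[k] for k in range(size)) for j in range(size)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[j] * cov[j][k] * v[k]
                     for j in range(size) for k in range(size))
    return eigenvalue, v

def project_scores(curves, mu, phi):
    """Score of each curve: projection of the centered curve onto phi."""
    return [sum((x - m) * p for x, m, p in zip(curve, mu, phi))
            for curve in curves]

# Toy log-power curves built from a known mean, component, and scores.
toy_mean = [6.9, 6.2, 5.8, 5.6]
toy_phi = [0.8, 0.4, 0.4, 0.2]          # unit length by construction
toy_scores = [-0.2, -0.1, 0.1, 0.2]
log_curves = [[m + s * p for m, p in zip(toy_mean, toy_phi)]
              for s in toy_scores]

mu = mean_curve(log_curves)
eigenvalue, pc1 = leading_component(covariance(log_curves, mu))
pc1_scores = project_scores(log_curves, mu, pc1)
```

On this toy data the recovered component and scores match the ones used to build the curves, which is the whole point: each curve is summarized by one number per component.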

The results of the analysis:

The scree plot shows that PC1 (the first function) accounts for about 75% of the variability in the data. PC2 is still notable at 15% of the variability, and PC3 falls off to 2.5%. For our purposes we stop after PC3, because from there the improvements in fit are trivial and the functions become less intuitive or interpretable in terms of what they represent performance-wise.
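As a quick arithmetic check on where to stop, the approximate shares quoted above can be accumulated. This is only a sketch using the rounded percentages from the scree plot, and `components_needed` is a hypothetical helper, not an fdapace function.

```python
from itertools import accumulate

# Approximate fraction of variance explained, as read off the scree plot.
fve = {"PC1": 0.75, "PC2": 0.15, "PC3": 0.025}

# Running total after each component is added.
cumulative = dict(zip(fve, accumulate(fve.values())))

def components_needed(threshold):
    """Number of leading components whose cumulative share reaches threshold.

    Returns None if the listed components never reach it.
    """
    for count, share in enumerate(cumulative.values(), start=1):
        if share >= threshold:
            return count
    return None

for name, share in cumulative.items():
    print(f"through {name}: {share:.1%}")
```

With these rounded figures the first two components cover roughly 90% of the variability, and PC3 adds only about 2.5 percentage points on top of that.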

The mean function has the familiar sigmoid shape of the power duration curve in cyclists. This mean function is the starting point for the fits, to which scaled versions of PC1, PC2, and PC3 are added or subtracted to optimize the fit.
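In code, a fit is just the mean function plus each PC multiplied by the athlete's score for that PC. The sketch below assumes the model lives on the log-power scale (as in this updated analysis) and exponentiates to get back to watts. The mean curve, component, and score are invented for illustration.

```python
import math

def reconstruct_power(mean_log_power, components, scores):
    """Fitted power (watts) = exp(mean + sum of score_k * component_k)."""
    log_fit = list(mean_log_power)
    for score, phi in zip(scores, components):
        for j, value in enumerate(phi):
            log_fit[j] += score * value
    return [math.exp(v) for v in log_fit]

# Hypothetical mean curve (watts) at four durations, stored as log power.
mean_watts = [1000.0, 500.0, 350.0, 300.0]
mean_log_power = [math.log(p) for p in mean_watts]

# Toy first component: a flat shift across all durations.
toy_components = [[0.5, 0.5, 0.5, 0.5]]

average_rider = reconstruct_power(mean_log_power, toy_components, [0.0])
stronger_rider = reconstruct_power(mean_log_power, toy_components, [0.2])
```

A score of zero returns the mean curve, and because the shift is added on the log scale, a positive score on this flat toy component multiplies power by the same factor at every duration.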

PC1, PC2, and PC3 may or may not be intuitive at first glance, so modes of variation plots are generated to help visualize how each one changes the fit.
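A modes of variation plot is usually built by shifting the mean function along one PC by a multiple of the square root of that PC's eigenvalue. The sketch below assumes that usual definition, with k ranging over a few values such as -2 to 2; the numbers are hypothetical and the exact range used for the plots here is not something I am asserting.

```python
import math

def mode_of_variation(mean_curve, phi, eigenvalue, k):
    """Curve at k standard deviations along one PC: mean + k*sqrt(lambda)*phi."""
    shift = k * math.sqrt(eigenvalue)
    return [m + shift * p for m, p in zip(mean_curve, phi)]

# Hypothetical mean log-power curve, component, and eigenvalue.
toy_mean = [6.9, 6.2, 5.8, 5.5]
toy_phi = [0.5, 0.5, 0.5, 0.5]
toy_eigenvalue = 0.04

# One curve per k; plotting these together gives the mode of variation.
bands = {k: mode_of_variation(toy_mean, toy_phi, toy_eigenvalue, k)
         for k in (-2, -1, 0, 1, 2)}
```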

PC1 essentially raises or lowers the curve, with a slightly greater effect at shorter than at longer durations. PC1 can be understood as overall ability: a higher PC1 indicates superior performance ability and a lower PC1 indicates lower performance ability.

PC2 captures the anticorrelation between sprint and endurance ability, or "twitchedness" (the ratio between fast twitch and slow twitch motor units). Higher PC2 values indicate greater slow twitch and endurance ability, while lower values indicate greater fast twitch and sprint ability.

PC3 appears to describe what might be called sprint endurance. Higher values indicate poorer sprint endurance (possibly anaerobic glycolysis or W’) relative to all-out sprint and pure endurance abilities, while lower values indicate superior sprint endurance relative to those abilities.

Internal validation of the fits looks very good:

Note that the mean percent residual is near zero and flat, with no weird hops or kinks. The blue line is 2 standard deviations of the percent residuals of model fits using PC1, PC2, and PC3, while the red line uses just PC1 and PC2.
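For anyone reproducing this check, here is a rough sketch of how the mean and 2 standard deviation lines could be computed at each duration. It assumes percent residual means 100 × (observed − fitted) / observed, which is my assumption about the definition, and the residual values are invented.

```python
import statistics

def percent_residuals(observed, fitted):
    """Percent residual at each duration, relative to the observed power."""
    return [100.0 * (o - f) / o for o, f in zip(observed, fitted)]

def summarize_by_duration(residual_rows):
    """Mean and 2 * sample SD of percent residuals at each duration.

    residual_rows holds one list per athlete, all on the same duration grid.
    """
    columns = list(zip(*residual_rows))
    means = [statistics.mean(col) for col in columns]
    two_sd = [2.0 * statistics.stdev(col) for col in columns]
    return means, two_sd

# One hypothetical athlete at two durations.
example = percent_residuals([400.0, 300.0], [396.0, 303.0])

# Three hypothetical athletes, percent residuals at two durations.
rows = [[1.0, -1.0], [-1.0, 1.0], [0.0, 0.0]]
means, two_sd = summarize_by_duration(rows)
```

A flat mean near zero says the model is not systematically high or low at any duration, while the 2 SD line shows how tightly individual fits cluster around the data.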

An outlier plot:

And how do the fits actually look on the individual level? Below are 120 fits selected at random in R to give you some sense of how the individual fits look:

These fits look fantastic, particularly considering that the model describes a span from 1 to 7,200 s with just 3 fitted parameters (and would work nearly as well with just 2 fitted parameters).

Some limitations of this approach (among others): we are not explicitly specifying a physiologic model, so the model will at times produce physiologically implausible fits (for example, a span where power increases at longer durations) for some poor quality data. There isn’t really a way to prevent that, other than perhaps throwing a warning for non-monotonic predictions. Also, the quality of the model depends on the data set used to generate it, so it may not generalize well to athlete phenotypes that are not represented in the reference data set. Likewise, interpretation of the PC values is relative to the reference data set. For example, you can say an athlete is in the 90th percentile for PC1, but not that the athlete has a critical power of 300 watts. The other thing I should have done here is reserve a subset of the data for cross-validation, or write a loop to run a leave-one-out cross-validation.
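The non-monotonic warning mentioned above could look something like the sketch below. The function names and the optional tolerance are hypothetical choices; the check simply flags any span where predicted power rises as duration gets longer.

```python
import warnings

def non_monotonic_spans(durations_s, predicted_power, tolerance=0.0):
    """Return (start, end) duration pairs where predicted power increases.

    tolerance (watts) lets tiny numerical wiggles pass without being flagged.
    """
    spans = []
    for i in range(1, len(predicted_power)):
        if predicted_power[i] > predicted_power[i - 1] + tolerance:
            spans.append((durations_s[i - 1], durations_s[i]))
    return spans

def warn_if_non_monotonic(durations_s, predicted_power, tolerance=0.0):
    """Emit a warning when a fitted curve is physiologically implausible."""
    spans = non_monotonic_spans(durations_s, predicted_power, tolerance)
    if spans:
        warnings.warn(f"Predicted power increases with duration over: {spans}")
    return spans

# Hypothetical fitted curves at four durations.
grid = [1, 60, 300, 1200]
plausible = [1000.0, 500.0, 350.0, 300.0]
implausible = [1000.0, 500.0, 510.0, 300.0]

plausible_spans = non_monotonic_spans(grid, plausible)
implausible_spans = non_monotonic_spans(grid, implausible)
```

This does not fix a bad fit; it only makes it visible so the underlying data can be inspected.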