Part 3: Reference ranges for the FPCA model of Golden Cheetah open data

This is the third post on applying functional principal component analysis to the open GC project power duration data. Part 2 explains the approach and the resulting model. The purpose of this post is to make the model and data useful.

The model itself outputs a score for each principal component. That score is hard to interpret in isolation, other than maybe looking at change over time for a given athlete. One way to make the score easier to interpret is to convert it to a percentile. The first step was to look at the distributions of the scores and see if they were normal. On the first principal component (PC1) the scores are skewed a little. I did try some basic transformations, but they didn't get the data looking sufficiently more normal to warrant the added complexity of converting (in my opinion anyway). PC2 and PC3 looked fine.

So to get a percentile we can convert the PC score to a z-score, so that the mean is 0 and the standard deviation is 1. From there the pnorm function in R will give the percentile of the z-score. The percentile is still relative to the reference data set, but it is intuitive.

In terms of the scores, PC1 gives us the overall ability compared to the mean function. PC2 gives the ratio of type 1 (endurance) to type 2 (sprint) dominance. For ease of interpretation, below I report both type 1 and type 2 separately even though one can be solved from the other: type1 = 1 - type2. PC3 is similarly a ratio, which I interpret as anaerobic endurance vs sprint and endurance, and which seems most straightforward to report as a single anaerobic value.
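The z-score and percentile step described above can be sketched in a few lines. This is a rough illustration rather than the code behind the post: it is written in Python, with `NormalDist().cdf` from the standard library playing the role of R's pnorm, and the function name `percentile`, the reference scores, and the athlete's score are all made up for the example. It also assumes that a higher PC2 score means more type 2 (sprint) dominance. The actual sign depends on how the component comes out of the model.

```python
from statistics import NormalDist, mean, stdev


def percentile(score, reference_scores):
    """Percentile (0 to 1) of one PC score relative to a reference set.

    The score is standardized to a z-score using the reference mean and
    sample standard deviation, then passed through the standard normal
    CDF (the equivalent of pnorm(z) in R).
    """
    mu = mean(reference_scores)
    sigma = stdev(reference_scores)
    z = (score - mu) / sigma
    return NormalDist().cdf(z)


# Hypothetical reference PC2 scores and one athlete's PC2 score.
reference_pc2 = [-2.0, -1.0, 0.0, 1.0, 2.0]
athlete_pc2 = 1.0

# Assumption: higher PC2 means more sprint (type 2) dominance.
type2 = percentile(athlete_pc2, reference_pc2)
type1 = 1 - type2  # the two values are complementary by construction

print(f"type 1: {type1:.2f}, type 2: {type2:.2f}")
```

The same function would apply to PC1 (overall ability) and PC3 (anaerobic). Because it leans on the normal CDF, the slight skew in PC1 noted above means the PC1 percentiles are only approximate.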

To see how this looks, below are sets of 3 randomly selected fits with the power duration data in light gray circles, the fit line in red, and the mean reference function in black. Next to each PD plot is a corresponding radar chart showing overall ability and then the balance in relative abilities.

Note that the radar chart is a bit imperfect in a couple of ways. The first is that it mixes the overall ability percentile with ratio percentiles. What I mean by that is that type 1 and type 2 show the individual's relative balance given their overall ability, rather than their specific ability versus others. For example, an athlete may be 0.99 on type 2 yet still be a terrible sprinter if their overall ability is very low.

The other issue is treating anaerobic as independent from type 1 and 2. For most people this shouldn't matter, as the magnitude of scores on PC3 is much lower than on PC2. But for outliers with a very high PC3 (anaerobic), it should squeeze in both type 1 and 2 on the radar chart a little, and vice versa. I think this issue could be corrected with some math, but since PC2 accounts for 15% of the variability and PC3 only 2.5%, the correction would likely be trivial in cases other than extreme outliers on PC3. For most people the model and percentile values should be quite usable as is.

(update: it looks like we should be able to implement the model in GC as an Rchart) 