Influence of skin pigmentation on the accuracy and data quality of photoplethysmographic heart rate measurement during exercise

Experimental design

The present study is a criterion validity study. All procedures were approved by The University of Alabama’s institutional review board, and all participants provided written informed consent prior to commencing the study.

Participants

We performed an a priori power analysis in G*Power 3.1.9.6 (Faul et al. 2007). To detect a 5-bpm difference in HR between the criterion and photoplethysmographic devices for the proposed study design, an effect size equal to d = 0.18 was estimated using means and standard deviations from our pilot data (n = 12) (Mulholland et al. 2024). The magnitude of this small effect size is in agreement with those reported by Sañudo et al. (2019), which ranged from 0.05 to 0.27. We first used a repeated measures model, with two groups (representing two devices—criterion and test) and six measurements (representing six exercise intensities), assuming a high correlation among repeated measures (r = 0.9), an alpha level of 0.05, and a power level of 0.8. An effect size of 0.2 produced an estimated sample size of 30 participants, and similarly, power analyses for linear multiple regression (alpha level = 0.05, power = 0.8, predictors = 2) would require 52 observations.

Healthy men and women aged 18–59 y who were free of cardiovascular, metabolic, and renal disease and who did not need medical clearance according to the American College of Sports Medicine pre-participation screening algorithm (American College of Sports Medicine 2021) were recruited to participate. Exclusion criteria included the presence of hypertension [resting systolic blood pressure (BP) ≥ 130 mm Hg or diastolic BP ≥ 80 mm Hg, currently taking antihypertensive medication, or having been told by a medical provider they have high BP on ≥ 2 separate occasions]; tachycardia (resting HR > 100 bpm); current cigarette smoking or nicotine use, or quit < 6 months prior; the use of medication that alters HR, skin blood flow, sweat rate, or metabolic responses to exercise; and tattoos or scars at any wearable device measurement site (wrist and upper arm). To ensure adequate sample distribution across the spectrum of skin tones, we set an a priori target that at least 40% of included participants were self-reported persons of color and had an average ITA° of 10° or less (dark skin) at the device placement sites. ITA° is highly correlated (R² = 0.96) with total melanin content of the epidermis (Del Bino et al. 2015). ITA° was calculated from colorimeter measurements using Eq. (1),

$$ITA^\circ =\arctan\left(\frac{L^{*}-50}{b^{*}}\right)\times \frac{180}{\pi}$$

(1)

where L* is the luminance value and b* is the yellow-blue component (Ly et al. 2020).
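
As a worked illustration of Eq. (1), the following Python sketch computes ITA° after averaging repeated colorimeter readings of each term. The function names and the example readings are hypothetical and are not taken from the study data; the study's own analyses were run in R.

```python
import math
from statistics import mean


def ita_degrees(l_star: float, b_star: float) -> float:
    """Individual typology angle (degrees) from CIELAB L* and b*, per Eq. (1)."""
    if b_star == 0:
        raise ValueError("b* must be non-zero for the arctan ratio in Eq. (1)")
    return math.atan((l_star - 50.0) / b_star) * 180.0 / math.pi


def ita_from_readings(l_readings, b_readings) -> float:
    """Average each term across repeated readings, then apply Eq. (1)."""
    return ita_degrees(mean(l_readings), mean(b_readings))


# Hypothetical readings at one site, for illustration only
example_ita = ita_from_readings([60.2, 59.8, 60.0], [10.1, 9.9, 10.0])
print(round(example_ita, 1))
```

With these illustrative readings the averaged ratio (L* − 50)/b* is approximately 1, so the angle is approximately 45°. Lower L* values give lower ITA°, which is why darker skin corresponds to smaller angles.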

Participants were instructed to arrive well rested, hydrated, and having refrained from ingesting non-prescription drugs on the day of testing. In addition, participants refrained from consuming alcohol and caffeine, from participating in strenuous exercise for ≥ 24 h prior to testing, and from using artificial tanning lotion or similar products for ≥ 1 week prior to study participation.

Experimental procedures

Upon arrival at the laboratory, participants completed a self-reported physical activity history, medical history, and a 24-h history questionnaire. Women self-reported the first day of previous menses and were tested during the follicular phase of the menstrual cycle; menstrual cycle phase was not expected to influence study outcomes (Stone et al. 2021).

After resting quietly for 5 min in a seated position, resting BP was measured according to the procedures outlined by the American Heart Association (Whelton et al. 2018). Following BP measurement, a colorimeter (observer = 2°, illuminant = C; Chromameter CR-400, Konica Minolta, Ramsey, NJ) was used to measure skin pigmentation at the volar (inner) upper arm for standardization (Pershing et al. 2008) and at each photoplethysmographic device measurement site—the posterior wrist and lateral upper arm—following the procedures outlined in Ly et al. (2020). Prior to measurement, each site was cleaned with an alcohol swab, and excessive body hair was removed if necessary. Colorimeter measurements were taken in triplicate at each site by the same researcher for all participants, with the colorimeter held perpendicular to the skin, and the average value for each term was used for all calculations and analyses.

Next, participants provided a urine void that was used to measure urine specific gravity (USG) with a refractometer (PAL-10S, Atago, Tokyo, Japan); USG ≤ 1.020 was considered adequately hydrated (Sawka et al. 2007). After the urine sample was provided, nude body mass was measured using a digital scale (BWB-800, Tanita Corporation, Tokyo, Japan). Standard clothing (tank top and cycling shorts) was provided for all participants to wear during exercise. Height was measured using a stadiometer (model 213, seca, Hamburg, Germany) and body composition was estimated using the 7-site sum of skinfolds (Lange skinfold caliper, Beta Technology, Inc., Santa Cruz, CA) (Jackson and Pollock 1985).

Participants were then outfitted with a chest-strap HR monitor (H10, Polar, Finland) for the criterion measurement of HR (Gilgen-Ammann et al. 2019). Next, participants were instrumented according to manufacturers’ instructions with the wearable devices to be tested: BAND V2 (SlateSafety, Atlanta, GA) on the lateral upper non-dominant arm, the vivosmart 5 (Garmin, Kansas City, MO) on one wrist, and the Apple Watch Series 8 (Apple Inc., Cupertino, CA) on the other wrist; device placements between the dominant and non-dominant wrists were counterbalanced and the counterbalanced placement orders were randomly assigned. These particular watches were chosen because they were, at the beginning of the study in 2023, the newest models of popular brands for fitness trackers at different price points (Apple Watch Series 8, $399; Garmin vivosmart 5, $149 at the time of purchase in 2023). The SlateSafety BAND V2, also the newest model available at the onset of data collection, was included because of its marketed use (occupational monitoring), where potential bias in its HR measurement is important to identify from a health and safety standpoint.

Participants then entered a room maintained at an ambient temperature of 22.4 ± 0.1 °C and a relative humidity of 41% ± 4%. The participant was seated for 5 min while baseline measures for criterion HR and data from all wearable devices were recorded. Next, participants mounted the cycle ergometer (Excalibur Sport, Lode, Netherlands) and completed a graded protocol. Cycling was chosen as the mode of exercise to minimize the influence of movement artifact on HR measurement. The protocol consisted of four 10-min stages, for a total of 40 min of continuous exercise. The four stages were completed at 50%, 60%, 70%, and 80% of age-predicted maximal HR (HRmax) (Tanaka et al. 2001); these stages corresponded to very light (< 57% HRmax), light (57%–63% HRmax), moderate (64%–76% HRmax), and vigorous (> 76% HRmax) intensity exercise, respectively (American College of Sports Medicine 2021). The purpose of the graded protocol was to be able to compare device performance across a range of exercise intensities and absolute HR values.
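
The stage targets can be sketched in Python as follows. This assumes the commonly cited form of the Tanaka et al. (2001) equation, HRmax = 208 − 0.7 × age, which the text cites but does not write out; the function names are hypothetical, and the boundary between "light" and "moderate" for non-integer percentages is an assumption.

```python
STAGE_FRACTIONS = (0.50, 0.60, 0.70, 0.80)  # the four 10-min stages


def tanaka_hrmax(age_years: float) -> float:
    """Age-predicted maximal HR (bpm), assuming HRmax = 208 - 0.7 * age."""
    return 208.0 - 0.7 * age_years


def stage_targets(age_years: float) -> list:
    """Target HR (bpm) for each stage, rounded to 0.1 bpm."""
    hrmax = tanaka_hrmax(age_years)
    return [round(f * hrmax, 1) for f in STAGE_FRACTIONS]


def intensity_label(pct_hrmax: float) -> str:
    """Intensity category from %HRmax, using the ranges quoted in the text."""
    if pct_hrmax < 57:
        return "very light"
    if pct_hrmax < 64:  # text gives 57%-63%; values between 63 and 64 assumed light
        return "light"
    if pct_hrmax <= 76:
        return "moderate"
    return "vigorous"


# Hypothetical 30-year-old participant
print(stage_targets(30))
print([intensity_label(100 * f) for f in STAGE_FRACTIONS])
```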

During exercise, criterion HR and data from all wearable devices were recorded continuously. Work rate was adjusted so that target HR was achieved within 2 min of beginning each stage and further adjusted to maintain target HR throughout each stage. Oxygen uptake (V̇O₂) was measured for 2 min starting at the 5th min of each stage using indirect calorimetry (TrueOne 2400, PARVOMedics, Salt Lake City, UT), then converted to metabolic equivalents (METs; 1 MET = 3.5 mL·kg⁻¹·min⁻¹). During the last min of each stage, rating of perceived exertion (RPE) was collected (Borg 1982). After 40 min, participants stopped cycling and moved to a chair where they remained seated for 10 min to measure responses during post-exercise recovery. Following the recovery period, participants were de-instrumented.
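
The MET conversion is a single division, shown here as a minimal Python sketch. The function name and the example oxygen uptake value are hypothetical.

```python
ML_PER_KG_PER_MIN_PER_MET = 3.5  # 1 MET = 3.5 mL/kg/min, as stated in the text


def vo2_to_mets(vo2_ml_kg_min: float) -> float:
    """Convert relative oxygen uptake (mL/kg/min) to metabolic equivalents."""
    return vo2_ml_kg_min / ML_PER_KG_PER_MIN_PER_MET


# Hypothetical stage average of 21.0 mL/kg/min
print(vo2_to_mets(21.0))
```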

Device setup and data cleaning

Criterion HR from the Polar H10 monitor was recorded using the Polar Beat app (version 3.5.6) and each session was exported from the diary on the Polar Flow website. Polar records HR once per second. For the Apple (watchOS version 9.6) and Garmin (software versions 3.02, 3.13, 3.14, 3.22) watches, an indoor cycling workout was recorded for the duration of the protocol, started prior to resting measurements and ended after the 10-min recovery. Apple HR data were synced and exported from the iPhone Health App (iOS version 16.5.1). Apple records HR approximately once every 5 s during a workout. The Garmin workout was downloaded from the device using the Garmin Connect app (version 4.69.1.5) and then exported as a .tcx file from the Garmin Connect website. An open-source .exe script was used to extract time and HR data for each workout (available from http://www.wartnaby.org/running/file_converters/index.html). Garmin records HR every 2 to 15 s during a workout. SlateSafety data were downloaded from their online BioTrac platform after each session; the SlateSafety BAND V2 (software versions 1.6.1, 1.6.2, 1.6.3) records HR once every 10 s. Throughout the study, Garmin (3 updates) and SlateSafety (2 updates) pushed firmware/software updates to their respective devices/platforms with no option to opt out. Updates were not expected to impact study results.

Prior to analysis, raw data files were checked for HR values < 50 bpm or > 200 bpm. Any instances of likely erroneous measurements were manually inspected and removed if appropriate. All HR data were then averaged into 30-s epochs to account for differences in sampling frequency. Missing data rates were counted as the number of 30-s epochs that were not calculated because there were no data available. Each device was synced with a wireless internet or cellular network-enabled iPhone immediately prior to each data collection session. Time stamps were exported with HR data and pooled in the same 30-s epochs as HR data. Summarized data from all devices were then matched using time stamps. Outliers were identified as values beyond 3.29 × SD of the mean error of HR (Field 2024) and were removed prior to analysis.
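
The cleaning steps can be sketched in Python as below. This is an illustration under stated assumptions, not the study's pipeline: the function names are hypothetical, samples are assumed to be (elapsed seconds, HR) pairs, and the outlier rule is read as dropping errors more than 3.29 sample SDs from the mean error.

```python
from statistics import mean, stdev

EPOCH_S = 30  # epoch length used in the text


def flag_out_of_range(samples, low=50, high=200):
    """Return (time_s, hr) samples to inspect manually (HR < low or > high)."""
    return [(t, hr) for t, hr in samples if hr < low or hr > high]


def epoch_means(samples, duration_s, epoch_s=EPOCH_S):
    """Average samples into fixed epochs; None marks an epoch with no data."""
    n_epochs = duration_s // epoch_s
    buckets = [[] for _ in range(n_epochs)]
    for t, hr in samples:
        idx = int(t // epoch_s)
        if 0 <= idx < n_epochs:
            buckets[idx].append(hr)
    return [mean(b) if b else None for b in buckets]


def count_missing(epochs):
    """Number of epochs that could not be calculated."""
    return sum(e is None for e in epochs)


def drop_outliers(errors, z_crit=3.29):
    """Keep errors within z_crit sample SDs of the mean error (assumed rule)."""
    m, s = mean(errors), stdev(errors)
    return [e for e in errors if abs(e - m) <= z_crit * s]


# Hypothetical device trace with a gap between 20 s and 70 s
trace = [(0, 70), (10, 72), (20, 74), (70, 80), (80, 82)]
epochs = epoch_means(trace, duration_s=90)
print(epochs, count_missing(epochs))
```

Averaging every device into the same 30-s grid is what allows a 1-s criterion signal to be compared against devices that report every 5 to 15 s.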

Data analysis

All statistical analyses were completed using R Statistical Software version 4.4.1 (R Core Team 2024). Descriptive statistics (mean ± SD) were generated for all indicated outcome measures. HR data were evaluated at six intensities: rest, very light, light, moderate, vigorous, and recovery. The magnitude of error at each intensity was characterized as mean absolute error of HR measurement (MAEHR), the absolute value of the difference between mean criterion HR and mean device HR for a 30-s epoch. Paired samples t tests were used to confirm target HR was achieved by the criterion HR measure during exercise. Overall device performance (HR measurement by each device versus criterion HR measurement) was described using a repeated measures correlation coefficient (rrm) (Bakdash and Marusich 2017) and Bland–Altman agreement analysis for repeated measures (Bland and Altman 2007) with modified Bland–Altman plots (Krouwer 2008). Proportional bias was calculated as the trend (Pearson’s r) between the difference in HR measurements and criterion HR from the modified Bland–Altman plots (Krouwer 2008). For this study, 95% limits of agreement (LoA; equal to bias ± 1.96 × SD) ≤ 5 bpm were considered “excellent,” > 5 to ≤ 10 bpm “good,” and > 10 bpm “unacceptable” agreement; these thresholds were chosen because they represent 5% error at a HR of 100 bpm and 200 bpm, respectively. In addition, the American National Standards Institute (2002) requires experimental devices intended for use in a medical setting to be within ± 10% or ± 5 bpm (whichever is greater) of a criterion measure for HR, and ± 5% is a commonly used threshold for wearable devices measuring step-related metrics that has also been applied to HR measurement (Fokkema et al. 2017; Shcherbina et al. 2017).
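
The agreement metrics can be illustrated with a short Python sketch. It uses the simple form of the limits of agreement (bias ± 1.96 × SD of the paired differences) and does not implement the repeated-measures adjustment of Bland and Altman (2007) used in the study; the function names and example values are hypothetical.

```python
from statistics import mean, stdev


def mean_absolute_error(criterion, device):
    """Mean of |device - criterion| across matched 30-s epochs."""
    return mean([abs(d - c) for c, d in zip(criterion, device)])


def limits_of_agreement(criterion, device):
    """Return (bias, lower LoA, upper LoA) from paired differences.

    Simple version: ignores clustering of epochs within participants.
    """
    diffs = [d - c for c, d in zip(criterion, device)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def within_ansi_tolerance(criterion_hr, device_hr):
    """True if device HR is within +/-10% or +/-5 bpm, whichever is greater."""
    tolerance = max(0.10 * criterion_hr, 5.0)
    return abs(device_hr - criterion_hr) <= tolerance


# Hypothetical matched epochs (bpm)
criterion = [100, 110, 120, 130]
device = [102, 109, 123, 130]
print(mean_absolute_error(criterion, device))
print(limits_of_agreement(criterion, device))
```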

A linear mixed-effects model (Bates et al. 2015) was used to determine whether skin pigmentation (as ITA°) predicts MAEHR; a multilevel modeling approach was employed to account for repeated measures within participants. MAEHR for each device was evaluated in a separate model. The intraclass correlation coefficient (ICC) was calculated to describe the percentage of variance existing at the participant level (i.e., between-person variance). Model building was completed using a forward entry method to evaluate whether 1) ITA° or 2) criterion HR explained additional variance in MAEHR. A likelihood-ratio test was performed to determine whether each predictor should be kept in the model. The marginal R² value was calculated to describe the proportion of variance explained by the fixed effects (coefficients) in each model. Normality was assessed by visual inspection of Q-Q plots, and heteroscedasticity was assessed by visual inspection of residuals vs. fitted plots. Variance inflation factor (VIF) statistics were calculated to test for multicollinearity, with VIF < 5 considered acceptable. Squared semi-partial correlations (sr²) were calculated for significant predictor variables to describe the percentage of model variance uniquely explained by each predictor. To compare physiological responses and MAEHR (within each device) across intensities, a one-way analysis of variance was used; in the event of a significant omnibus test, pairwise comparisons with a Bonferroni α correction were performed. If sphericity was violated, the Greenhouse–Geisser correction was applied. All statistical tests used an α level of 0.05.
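
Two of the quantities above reduce to short formulas once a model has been fitted. The Python sketch below assumes the variance components and log-likelihoods have already been obtained from maximum-likelihood fits (the study fitted its models in R), and that the two nested models differ by exactly one fixed-effect parameter, so the test statistic is referred to a chi-square distribution with 1 degree of freedom. All names and numbers are hypothetical.

```python
import math


def icc(var_between: float, var_within: float) -> float:
    """Share of total variance at the participant (between-person) level."""
    return var_between / (var_between + var_within)


def likelihood_ratio_test_1df(loglik_reduced: float, loglik_full: float):
    """Likelihood-ratio statistic and p-value for one added parameter.

    For 1 degree of freedom, the chi-square upper tail probability equals
    erfc(sqrt(x / 2)).
    """
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical values: null model vs. model with ITA added as a predictor
print(icc(var_between=3.0, var_within=1.0))
print(likelihood_ratio_test_1df(loglik_reduced=-100.0, loglik_full=-98.0))
```

Under the forward entry approach described above, a predictor would be retained when the p-value from this comparison falls below the α level of 0.05.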
