Journal Pre-proof Development of the CFQ-R-8D: Estimating Utilities From the Cystic Fibrosis Questionnaire-Revised

Abstract


Introduction
Cystic fibrosis (CF) affects >80 000 people worldwide and is the most common life-threatening autosomal recessive disorder in populations of northern European ancestry, with an overall incidence of 1 in 3500 in European countries. 1-3 CF is associated with reduced life expectancy, with median survival estimates of 41 to 57 years across North America and Europe, depending on genotype and sex. 4,5 CF affects multiple organ systems, but for most people with CF, the largest health impact is progressive respiratory impairment. 3 Symptoms commonly include cough, shortness of breath, chest congestion, lack of energy, sinus discharge, and difficulty sleeping. 2 People with CF also commonly report psychological symptoms. 6,7
Health-related quality of life (HRQOL) is a multidimensional concept that includes dimensions related to physical, mental, and social functioning. 8-11 Many clinical and demographic characteristics have been associated with HRQOL in CF, including occurrence of pulmonary exacerbations, disease severity, sex, and socioeconomic status. 8-11
The Cystic Fibrosis Questionnaire-Revised (CFQ-R) is a validated instrument widely used to assess HRQOL in CF studies and clinical trials. 12,13 The CFQ-R has 12 dimensions: Physical Functioning, Emotional Functioning, Social Functioning/School Functioning, Body Image, Eating Problems, Treatment Burden, Respiratory Symptoms, Digestive Symptoms, Vitality, Health Perceptions, Weight, and Role Functioning. 12,13 Three versions of the CFQ-R have been developed: 1 for parents/caregivers to proxy-report for children aged 6 to 13 years, 1 that can be interviewer-administered for children aged 6 to 11 years or self-completed for children aged 12 or 13 years, and an adolescent/adult version for those aged ≥14 years. 12-14 Although there is some overlap in the items and dimensions included in each version, the total number of items differs among the 3 versions. 13 Because it has the broadest target population, the adolescent/adult version is the focus of this study.

Cost-effectiveness analyses, a framework used by some health technology appraisal agencies to evaluate novel healthcare interventions, require measures of HRQOL in the form of health state utilities to generate quality-adjusted life-years (QALYs), which combine the value of HRQOL with the length of life into a single index number. Health state utilities are typically generated from preference-based measures that use preference-elicitation techniques such as time trade-off (TTO), standard gamble, or discrete choice experiment to assign a value, anchored at 0 for dead and 1 for perfect health, to health states described by the underlying classification system. 15 The National Institute for Health and Care Excellence (NICE) strongly prefers utilities generated using the EQ-5D, 16,17 a generic preference-based measure of HRQOL, comprising 5 dimensions (mobility, self-care ability, ability to undertake usual activities, pain and discomfort, and anxiety and depression). 18 There are 2 versions of the EQ-5D: 1 with 3 severity levels in each dimension (EQ-5D-3L) and 1 with 5 levels (EQ-5D-5L). 19 The EQ-5D-3L lacks sensitivity to meaningful differences in lung function and HRQOL among people with CF, with individuals self-reporting mean utility of 0.923 for mild and 0.870 for severe lung function impairment, 20 which are higher than UK (0.856) and US (0.867) population norms. 21 Although the EQ-5D-5L was developed to increase sensitivity, 19 it has also been shown to lack sensitivity to changes in lung function among people with CF during pulmonary exacerbations. 22 Relatedly, a mapping study found that the respiratory dimension of the CFQ-R was not a significant predictor of EQ-5D-3L utility, 23 and utilities estimated from mapping to the EQ-5D-3L showed limited ability to discriminate between groups classified based on lung function in a disease largely characterized by respiratory symptoms.
Given this observed lack of sensitivity of EQ-5D in CF, an alternative approach to estimating utilities is required. Utilities generated from disease-specific measures that are sensitive to change, such as the CFQ-R, have the potential to effectively capture disease-relevant concepts. However, since the CFQ-R is not preference-based, it cannot be used directly to generate utilities. Here we derive the first preference-based scoring algorithm to generate utilities from CFQ-R data.

Methods
The study was conducted in 5 stages using methods previously described to estimate a preference-based measure from the Short Form-36 dimension survey (SF-36) 24,25 and other condition-specific, preference-based measures from existing patient-reported measures of HRQOL. 26 The 5 stages were: (1) assessing the dimensional structure of the CFQ-R using factor and Rasch analyses; (2) identifying suitable items for the health-state classification system using classical psychometric analyses; (3) using clinical and participant input to assess the face validity of the CFQ-R items and dimensions selected in stage 2; (4) valuation of the health states by members of the general public; and (5) developing the scoring algorithm for the classification system using regression modeling. The first 3 stages used existing clinical trial data (described below), while the latter 2 stages used primary data collected for this study. Rasch analysis was conducted using RUMM2030 27 ; all other analyses were conducted using Stata 14.2. 28

CFQ-R
The CFQ-R (adult and adolescent version) includes 50 items assessing 12 dimensions scored on 4-point Likert scales, including frequency (always to never), intensity (a great deal to not at all), difficulty (a lot of difficulty to no difficulty), and true-false (very true to very false).  [29][30][31] In brief, the trials enrolled participants aged ≥12 years with CF homozygous for the F508del-CFTR mutation or heterozygous for F508del-CFTR and a residual function mutation, who were randomized to active treatment versus placebo. The primary outcome was ppFEV1, a measure of lung function. Only participants who were administered the CFQ-R adult and adolescent version (ie, those aged ≥14 years) were included in this analysis. Three trials had a 24-week intervention period and were used for the main analysis; the EXPAND trial, a crossover trial with 2 intervention periods of 8 weeks, was used to replicate the main-item selection analysis.
All analyses were conducted by analysts blinded to treatment assignment to ensure item selection was driven by item performance independent of treatment effect. Data included clinical outcomes and other patient-reported measures, such as percent predicted forced expiratory volume in 1 second (ppFEV1), number of pulmonary exacerbations, and the patient-reported Cystic Fibrosis Respiratory Symptom Diary (CFRSD). 32

Assessment of the dimensional structure
The dimensional structure of the CFQ-R was assessed using exploratory factor analysis with the principal-components method and Rasch analysis 33 to identify potential health dimensions and their associated items. Factor analysis can be used to identify dimensional structures, while Rasch models allow unidimensional estimates of item location and ability to be made.
Results from factor analysis were assessed based on eigenvalues >1 (including review of scree plots), assessment of contribution of items to each factor and whether they contributed to >1 factor (range 0 to 1 with higher values indicating greater contribution), and assessment of measurement error based on uniqueness, where a value >0.6 indicated that an item may reflect other information not captured in the dimension (see Supplemental Methods for further details). Rasch analysis was undertaken for identified items in each factor to assess whether all items fit, based on assessment of the residuals to identify potential divergence and assessment of local dependency (ie, where >1 item measured the same construct in the factor; see Supplemental Methods for further details). Items excluded at this stage were those that did not contribute to the identified factors or that showed evidence of local dependency or divergence in the Rasch analysis. Items that were optional and those relating to general health were also excluded.
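The retention criteria described above can be illustrated with a minimal Python sketch. This is not the Stata analysis used in the study; it assumes item responses are held in a respondents-by-items array, and the function names and simulated data are hypothetical. Loadings follow the principal-components factor method (eigenvectors scaled by the square root of their eigenvalues), and uniqueness is taken as 1 minus the sum of squared loadings on the retained factors.

```python
import numpy as np


def retained_factors(responses: np.ndarray) -> tuple[np.ndarray, int]:
    """Eigenvalues of the item correlation matrix (descending) and the
    number of factors with eigenvalue > 1."""
    corr = np.corrcoef(responses, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]
    return eigenvalues, int((eigenvalues > 1.0).sum())


def uniqueness(responses: np.ndarray, n_factors: int) -> np.ndarray:
    """Per-item uniqueness: 1 minus the communality on the retained factors."""
    corr = np.corrcoef(responses, rowvar=False)
    values, vectors = np.linalg.eigh(corr)
    top = np.argsort(values)[::-1][:n_factors]
    loadings = vectors[:, top] * np.sqrt(values[top])
    return 1.0 - (loadings ** 2).sum(axis=1)


# Simulated data (hypothetical): 6 items driven by 2 latent dimensions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
noise = 0.5 * rng.normal(size=(500, 6))
items = np.repeat(latent, 3, axis=1) + noise  # items 0-2 load on factor 1, 3-5 on factor 2

eigenvalues, n_retained = retained_factors(items)
item_uniqueness = uniqueness(items, n_retained)
flagged = item_uniqueness > 0.6  # items that may reflect information outside the dimension
```

In this simulated example each item is dominated by a single latent dimension, so no item would be expected to exceed the 0.6 uniqueness threshold.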

Item selection
To identify items best representing each dimension, a combination of classical psychometric analysis and Rasch analysis was used (see Supplemental Methods for further details on item selection methods). Classical psychometric criteria were applied to each item in the CFQ-R using the following metrics: level of missing data, distribution of response across categories (floor and ceiling effects), correlation of item to its own dimension, and responsiveness (standardized response mean [SRM]) to change over time based on improved ppFEV1. 34 Rasch analysis was used to assess the performance of individual items. Items that did not fit the model, did not cover the full range of severity, had disordered response choices, or suffered from differential item functioning were candidates for exclusion from the health state classification system. 34 Item wording was required to be suitable for TTO valuation (eg, responses such as "somewhat true" vs "somewhat false" were not concrete enough, and items that combine concepts, such as rating walking function by level of tiredness, were not sufficiently independent).
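Two of the classical criteria listed above lend themselves to a short worked sketch in Python. The standardized response mean is computed here as the mean change divided by the standard deviation of change, which is the conventional definition; restricting the input to participants with improved ppFEV1, as described in the text, is left to the caller. Function names and the example values are hypothetical.

```python
import numpy as np


def standardized_response_mean(baseline, follow_up) -> float:
    """SRM = mean of the change scores / SD of the change scores."""
    change = np.asarray(follow_up, dtype=float) - np.asarray(baseline, dtype=float)
    return float(change.mean() / change.std(ddof=1))


def floor_ceiling(responses, lowest, highest) -> tuple[float, float]:
    """Proportion of responses in the lowest and highest response categories."""
    r = np.asarray(responses)
    return float((r == lowest).mean()), float((r == highest).mean())


# Hypothetical item scores (1 to 4) for 4 participants with improved ppFEV1.
srm = standardized_response_mean([1, 2, 2, 3], [2, 3, 2, 4])
floor_share, ceiling_share = floor_ceiling([1, 4, 4, 2], lowest=1, highest=4)
```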

Assessment of face validity of the classification system
The face validity of the proposed items and dimensions was assessed in interviews with clinicians and individuals with CF to ensure that selected items and dimensions were important, were relevant, and represented dimensions that may change following an effective treatment. Four clinicians practicing in Australia, Canada, the UK, and the US and 5 individuals with CF (2 from the UK, 2 from Australia, and 1 from Canada) participated in the validation process.

Health state selection and valuation
Not all possible health states from the classification system could be valued due to the many possible combinations of items; therefore, a subset of health states was valued and used to model the utilities for the complete classification system. An orthogonal array was generated using IBM SPSS statistics version 21, which selected 32 health states for valuation, including the best state ("full health"). The "full health" state was anchored on 1 as the combination of items in the CFQ-R classification system in which no problems were recorded in any dimension, leaving 31 states to be valued. The worst health state was valued by all participants. Each state was valued by multiple respondents; however, asking respondents to value all states would be excessively burdensome. Therefore, the states were allocated to 4 sets containing mild, moderate, and severe health states, with each respondent valuing 1 set of 8 or 9 health states. To avoid bias, no reference to CF was made in the interview. Once the health states for valuation were selected, cognitive debriefing interviews were conducted with 5 members of the UK general population to evaluate face validity of the states and understanding of the task.
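The text does not describe how states were assigned to the 4 sets. The Python sketch below shows one way such an allocation could be made, under stated assumptions: the worst state is counted among the 31 valued states, severity is summarized by a single hypothetical score per state, and the remaining states are dealt round-robin in severity order so that each set spans mild to severe states. Under these assumptions the set sizes come out at 8 or 9, consistent with the description above.

```python
def allocate_states(severity_by_state: dict, worst_state, n_sets: int = 4) -> list[list]:
    """Deal states into n_sets in severity order, then add the worst state to every set."""
    others = sorted(
        (s for s in severity_by_state if s != worst_state),
        key=severity_by_state.get,
    )
    sets = [[] for _ in range(n_sets)]
    for i, state in enumerate(others):
        sets[i % n_sets].append(state)  # round-robin keeps severity mixed within each set
    for block in sets:
        block.append(worst_state)  # the worst state is valued by all participants
    return sets


# Hypothetical severity scores for 31 states; state 30 plays the role of the worst state.
severity = {state_id: state_id for state_id in range(31)}
state_sets = allocate_states(severity, worst_state=30)
set_sizes = sorted(len(block) for block in state_sets)
```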
The valuation sample was recruited from a UK general population research panel, aiming to reflect the most recent UK census population demographics. 35 The interviews were conducted face-to-face by 20 trained interviewers across 5 regions of the UK (Birmingham, Glasgow, Manchester, London, Swansea) in 2018. At the start of the interview, participants were shown the items and an example health state. Afterward, participants completed visual analog scale (VAS) tasks and TTO tasks, first for 2 practice health states, then for the assigned CFQ-R health state set. If the interviewer felt a participant did not understand or engage with the practice task, the participant did not continue to the main study health state exercise. For the VAS task, participants were asked to rate the presented health states, plus the best state and "dead," from 0 (very worst or least preferred) to 100 (very best) to familiarize themselves with the states they were valuing. Participants then completed the TTO, a standardized interview method for valuing health states to generate utility estimates. 36,37 The method was designed to determine the point at which participants considered 10 years in the target health state to be equivalent (ie, they were indifferent) to the prospect of x years in full health. Time in full health was varied between high and low values, changing by 6-month intervals, until this point of indifference was reached. If a participant indicated that they believed that being dead was preferable to any time living in a health state, the interviewer switched to a lead-time TTO exercise, 38 which asked participants whether they would prefer to live for 10 years in full health followed by 10 years in a health state or to live for x years of full health (where x<10). 
This lead-time procedure allowed the participant to trade more years of life to determine how much worse than dead they considered the health state to be and to estimate a utility below zero (worse than dead). Participants also completed sociodemographic information, experience of illness (their own or family and friends), and EQ-5D-5L (scored using UK population weights 39 ).
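The arithmetic linking the point of indifference to a utility can be sketched in Python as follows. The formulas are the conventional ones implied by the task descriptions above and are stated here as assumptions, since the text does not give them explicitly: for states better than dead, indifference between 10 years in the state and x years in full health gives u = x/10; for the lead-time task, indifference between (10 years in full health plus 10 years in the state) and x years in full health gives 10 + 10u = x, so u = (x - 10)/10. The function name is hypothetical.

```python
def tto_utility(x_years: float, worse_than_dead: bool = False,
                horizon: float = 10.0, lead_time: float = 10.0) -> float:
    """Convert the indifference point x (years in full health) to a utility."""
    if worse_than_dead:
        # Lead-time TTO: lead_time + horizon * u = x, with x < lead_time.
        if not 0.0 <= x_years < lead_time:
            raise ValueError("lead-time TTO expects 0 <= x < lead_time")
        return (x_years - lead_time) / horizon
    # Conventional TTO: horizon * u = x.
    if not 0.0 <= x_years <= horizon:
        raise ValueError("conventional TTO expects 0 <= x <= horizon")
    return x_years / horizon


u_better = tto_utility(7.5)                       # indifferent at 7.5 years in full health
u_worse = tto_utility(4.0, worse_than_dead=True)  # indifferent at 4 years in the lead-time task
```

With 6-month steps, the values obtainable under these assumptions lie on a grid of 0.05 between -1 and 1.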
Prior to conducting the valuation study, the study protocol was reviewed and determined exempt from ethical review requirements by an independent review board in the US (Western Institutional Review Board); however, informed consent was still collected prior to interview participation.

Development of scoring algorithm
Data were reviewed prior to analysis and flagged for exclusion if responses were considered to reflect either a lack of understanding or engagement or if there were inconsistencies (ie, more severe states were given higher utilities than milder states). Responses were flagged if
To produce utilities for every health state defined by the classification system, the utilities were modeled using regression analysis. The standard specification was:
had an impact on the standard errors but not the coefficients. Random Effects (RE) Tobit models were estimated to take into account differences at the individual level, as these models appropriately dealt with the structure of the data in which each respondent had multiple observations. 24 Mean-level models using Tobit with robust standard errors clustered on participants were also estimated, as they reduced the impact of outliers in health state utilities present in individual-level data. Tobit models that accounted for heteroscedasticity were also estimated, as TTO data typically have larger variance for more severe states. A test for heteroscedasticity in the linear model confirmed this. Inconsistent coefficients for adjacent severity levels of a dimension (for example, moving from "sometimes" to "often" experiencing a health problem), where health deterioration leads to a higher utility, were contrary to expectations. To address this issue, models were also estimated that merged inconsistent adjacent severity levels so that a health deterioration leads to a lower utility score.
Performance of regression models was assessed using the number of significant (P<0.05) and nonsignificant coefficients, the consistency of the coefficients with the classification system, and mean absolute error (MAE) at the health-state level. MAE was generated using the difference between observed and predicted utilities at the health-state level, and models with a lower MAE were preferred. The Akaike information criterion (AIC) and Bayesian information criterion (BIC) were also examined, with lower values indicating a more

Selection of CFQ-R classification system
Baseline participant demographics and clinical characteristics are shown in Table 1. The 3 participant data sets used for the main analysis (EVOLVE, TRAFFIC, and TRANSPORT) were similar in terms of age, sex distribution, and ppFEV1, while EXPAND had a higher mean age and slightly higher percentage of female participants.
As summarized in Table 2
In summary, the psychometric analysis found that the level of missing data in the data sets was low; therefore, this criterion was not used for item selection. Evidence of ceiling effects was observed for some items, but none had floor effects. Most items (all but 1) had strong
Of the 12 items identified for consideration, 9 were selected for clinician and participant validation. The Role Functioning "impact on daily activities" item (role36) was selected over the "goals" item (role37), and Vitality "exhaustion" (vital11) was selected over "tiredness" (vital9) due to the more concrete concepts referenced in the former items, despite the role36 item not having the strongest psychometric performance of all role items. Furthermore, the Treatment Burden item was removed due to conceptual overlap with the Role Functioning item. As worry (emot7) and sadness (emot12) were both considered relevant and had the same response options, these items were combined to represent the Emotion dimension (in line with the EQ-5D Anxiety/Depression dimension). Finally, the response option wording for the Body Image item (body26; very true, somewhat true, somewhat false, very false) was judged to be conceptually complex and was therefore dichotomized to a true or false response.

Assessment of face validity
The selection of the 9 out of 12 items outlined above was endorsed by all individuals with CF and clinicians, and all proposed items in the classification outlined in Supplementary Table 14 (see example health state in Figure 1) were considered valid and relevant by CF clinicians and participants interviewed. Because cough and breathing difficulties were judged to be relatively independent, these items were treated as independent concepts for the valuation,

Valuation results
Following cognitive debriefing interviews (n = 5) to check the understanding and interpretability of the health states, a total of 400 TTO interviews were conducted. Of the total sample, 14 TTO interviews were excluded for valuing all states as identical (but not at full health); 38 respondents were excluded for valuing the worst health state as equivalent to their highest TTO value; and 7 respondents were excluded for valuing all health states as worse than being dead. The main analyses focused on 345 respondents with robustness analyses using the full sample; data for the practice states were not analyzed.
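The three exclusion rules reported above can be expressed as a small Python sketch. It assumes each respondent's TTO utilities for the main-study states are available as a list together with the value given to the worst state; the function and key names are hypothetical. The rules can overlap (for example, a respondent who values all states identically also values the worst state at their highest value), so the counts for individual rules need not sum to the number of respondents excluded.

```python
def exclusion_flags(values: list[float], worst_value: float) -> dict[str, bool]:
    """Flag one respondent's TTO responses against the three exclusion rules."""
    return {
        # All states valued as identical, but not at full health (utility 1).
        "all_identical": len(set(values)) == 1 and values[0] != 1.0,
        # Worst state valued as equal to the respondent's highest TTO value.
        "worst_equals_highest": worst_value == max(values),
        # Every state valued as worse than being dead (utility below 0).
        "all_worse_than_dead": all(v < 0 for v in values),
    }


def is_excluded(values: list[float], worst_value: float) -> bool:
    """A respondent is excluded from the main analysis if any rule applies."""
    return any(exclusion_flags(values, worst_value).values())


# Hypothetical respondents.
flat = exclusion_flags([0.5, 0.5, 0.5], worst_value=0.5)
negative = exclusion_flags([-0.2, -0.5, -0.1], worst_value=-0.5)
ordered = exclusion_flags([0.9, 0.6, 0.3], worst_value=0.3)
```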
The analysis sample did not meaningfully differ, based on measured characteristics, from the overall sample and was comparable to the most recent UK population census data in terms of sex, age, and ethnicity (Table 3).
Table 4 shows estimates of preference weights based on Tobit, RE Tobit, and mean-level models using Tobit with and without accounting for heteroscedasticity. Models with a constant term were also tested, but the constants were not statistically significant and did not improve the model fit statistics. The coefficients were all positive as expected, indicating that less than full health resulted in an increase in disutility. Regression coefficients were logically consistent in most dimensions (ie, disutility values increased as severity increased), but where the levels within a dimension were disordered (eg, levels 3 and 4 for Role Functioning in the RE Tobit model), levels were combined to generate a logically consistent (ordered) model.
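To make the structure of such a scoring algorithm concrete, the Python sketch below shows how disutility coefficients from a model without a constant translate into a utility, and how disordered adjacent levels can be detected before they are merged and the model re-estimated. All coefficient values are hypothetical and chosen only to reproduce the kind of disordering described above; they are not the estimates in Table 4, and only 2 dimensions are shown for brevity.

```python
def disordered_pairs(decrements: list[float]) -> list[tuple[int, int]]:
    """Adjacent level pairs (1-indexed) where the more severe level has the smaller disutility."""
    return [
        (k + 1, k + 2)
        for k in range(len(decrements) - 1)
        if decrements[k + 1] < decrements[k]
    ]


def predict_utility(state: dict[str, int], decrements: dict[str, list[float]]) -> float:
    """Utility = 1 minus the sum of the disutilities for the level reported in each dimension."""
    return 1.0 - sum(decrements[dim][level - 1] for dim, level in state.items())


# HYPOTHETICAL disutilities per level (level 1 = no problems, disutility 0).
hypothetical = {
    "Role Functioning": [0.0, 0.02, 0.06, 0.05],  # levels 3 and 4 are disordered
    "Cough": [0.0, 0.01, 0.03, 0.08],
}

to_merge = disordered_pairs(hypothetical["Role Functioning"])
example_utility = predict_utility({"Role Functioning": 2, "Cough": 4}, hypothetical)
```

In an ordered model, each flagged pair would share a single coefficient, so that a deterioration within a dimension can never raise the predicted utility.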

All models better predicted mean disutility for each health state at the more severe end than at the milder end, indicating a relationship between error and predictive ability (Figure 2).
MAEs ranged from 0.025 to 0.039. Across the 31 health states, 5 to 12 had MAE >0.05, and only 1 was above 0.1, indicating small levels of error at the health-state level and thus good predictive ability (Table 4). Results based on the full sample had more nonsignificant coefficients (3 compared to 1 in the Tobit heteroscedastic model; Supplementary Table 16), although this was reversed for other models (eg, Tobit model), and there was also some evidence of slightly more inconsistencies.
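The health-state-level error summary reported above can be computed as in the following Python sketch, which assumes one observed mean utility and one model prediction per health state. The function name and the example numbers are hypothetical and are not taken from Table 4.

```python
import numpy as np


def health_state_errors(observed_mean, predicted) -> tuple[float, int, int]:
    """MAE across health states, plus counts of states with absolute error > 0.05 and > 0.10."""
    errors = np.abs(np.asarray(observed_mean, dtype=float) - np.asarray(predicted, dtype=float))
    return float(errors.mean()), int((errors > 0.05).sum()), int((errors > 0.10).sum())


# Hypothetical observed and predicted mean utilities for 3 health states.
mae, n_above_005, n_above_010 = health_state_errors([0.60, 0.50, 0.40], [0.62, 0.44, 0.52])
```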

Discussion
Here we describe the development of the CFQ-R-8D, a novel preference-based scoring algorithm that allows utilities to be estimated for use in economic evaluations based on patient-reported CFQ-R data. The

In TTO interviews, most of the states valued were considered to be better than dead, with only 5.5% of the TTO values below 0 and 6% at 0. These results are comparable to the recent EQ-5D-5L England valuation, where 5.1% were valued as worse than dead, and the US valuation where 5.1% were valued at 0. 40,41 Unlike the EuroQol Valuation Technology (EQ-VT) protocol used to value the EQ-5D-5L, 41 not all participants were shown both the worse than dead and better than dead procedure; participants were only shown the worse than dead procedure if their preferences took them there. It is unknown whether this may have impacted responses, but it is difficult to reason why knowing how a health state is valued as worse than dead would affect participants' responses in the better than or worse than dead TTO choice.
Overall, a few outliers were seen in the TTO data, but at the health-state level, mean TTO values did not vary widely, with most values between 0.5 and 0.6. This lack of variability may reflect the mix of severity levels across dimensions in each health state that was valued.
It may also reflect the challenge of valuing 8 dimensions using TTO; however, international protocols for longer and more complex approaches have been successfully implemented (eg, EORTC-QLU-C10D). 42
The Tobit models that were estimated all had coefficients with the expected sign and most were statistically significant at the 10% level, but all had inconsistencies. The Tobit heteroscedastic-ordered model was selected as it addressed the problem of heteroscedasticity and only had 1 inconsistency. The values ranged from 0.236 to 1, which is a smaller range compared to the UK EQ-5D-3L (−0.594 to 1). 18
Prior to development of the CFQ-R-8D, a study was undertaken to estimate utilities in CF based on a mapping algorithm linking CFQ-R data to the 3-level version of the EQ-5D (EQ-5D-3L). 23 Evidence from this study suggested that the EQ-5D-3L may not be sensitive to meaningful changes in health status in the CF population. The core respiratory dimension of the CFQ-R was not a significant predictor of EQ-5D-3L utility and thus not included in the mapping algorithm. Perhaps not surprisingly, utilities estimated from the algorithm showed limited ability to discriminate between groups classified based on lung function. The CFQ-R-8D reflects a broad range of CF-specific health dimensions included in the CFQ-R, which is a well-validated and widely used measure to evaluate treatment benefit in CF. While the CFQ-R-8D uses a subset of the CFQ-R items in the scoring algorithm, as is generally necessary for scoring algorithms, this specificity does not suggest that the full CFQ-R should not be administered. Capturing the full impact of CF on HRQOL provides important evidence outside economic evaluation.
Limitations of the current work should be highlighted. Four trial data sets were used to select items for the classification system, and while the use of multiple data sets and larger combined sample size was advantageous in this context, all 4 samples included clinical trial participants for whom severity of CF may have been different from that in a typical CF population due to study inclusion criteria. In addition, most items demonstrated only small SRMs, the notable exceptions being the respiratory items, and thus may not reflect their performance in other CF populations. As such, the items selected here ideally should be validated in another setting, such as a registry or observational study. The TTO sample was drawn from the UK population to reflect UK societal values as recommended by agencies, including NICE. 43 To use this classification system in another country, it may be desirable to repeat the TTO valuation and algorithm estimation with a local population; however, UK valuations for utility measures may be acceptable where local weights are unavailable. 44 Interviewers were trained, but there were no specific built-in interviewer quality checks during the data collection process, and the number of interviewers and variability in number of interviews conducted may have impacted data quality. Notably, 5 interviewers did not record values below zero, but as they equally did not record many values at zero (

Conclusion
The CFQ-R-8D allows direct estimation of CF-specific utilities from the CFQ-R, a well-validated measure that is used widely in CF clinical trials and clinical practice, thus enabling utilities for use in cost-effectiveness analyses to be generated from any existing or future CFQ-R data set. The ability to adequately capture the HRQOL in this population using a metric suitable for economic evaluation is essential to demonstrating the potential benefit and value of new CF treatments. An evaluation of the psychometric performance of the CFQ-R-8D compared with the generic EQ-5D-3L and SF-6 dimension survey is ongoing. 45