ICU dyspnoea assessment tools: Norwegian translation and inter-rater reliability
Summary
Background: Dyspnoea is a common and stressful symptom in critically ill patients, and healthcare professionals tend to underestimate both the incidence and degree of this symptom. Currently, there is no Norwegian symptom assessment scale to aid in evaluating dyspnoea in intensive care unit (ICU) patients without the ability to self-report. The Intensive Care-Respiratory Distress Observation Scale (IC-RDOS) and the Mechanical Ventilation-Respiratory Distress Observation Scale (MV-RDOS) are symptom assessment scales for identifying dyspnoea in non-mechanically ventilated and mechanically ventilated patients, respectively.
Objective: To translate the IC-RDOS and MV-RDOS into Norwegian and evaluate the inter-rater reliability of the two scales.
Method: We applied a prospective observational design to evaluate the inter-rater reliability and measurement errors of IC-RDOS and MV-RDOS. The translation process focused on linguistic and cultural adaptation following international guidelines. Data for the inter-rater reliability analysis were collected in four ICUs in Norway. Two ICU nurses performed 110 assessments, and inter-rater reliability was evaluated using the intraclass correlation coefficient (ICC) for the total score and continuous items, while Gwet’s agreement coefficient 1 was applied for the dichotomous items.
Results: IC-RDOS and MV-RDOS were translated into Norwegian. The total score of the scales, used in the evaluation of dyspnoea, showed a high ICC for both scales: ICC = 0.98 for IC-RDOS and ICC = 0.92 for MV-RDOS.
Conclusion: IC-RDOS and MV-RDOS had ‘very good’ inter-rater reliability. Based on the findings in this project, the scales may be ready for implementation in Norwegian ICUs.
Cite the article
Martinsen P, Haug R, Bådsvik S, Vinje H, Hofsø K. ICU dyspnoea assessment tools: Norwegian translation and inter-rater reliability. Sykepleien Forskning. 2026;21(105043):e-105043. DOI: 10.4220/Sykepleienf.2026.105043en
Introduction
In Norway, approximately 18,000 people need intensive care treatment annually (1). Critically ill patients are exposed to several sources of discomfort in the intensive care unit (ICU) (2). Pain, thirst, anxiety, dyspnoea and inadequate sleep are the five most common and stressful symptoms ICU patients can experience (3).
While pain has traditionally been the primary focus of symptom research, dyspnoea has received increasing attention in recent years (2, 4). Like pain, dyspnoea can be experienced by the patient despite analgosedation and should be evaluated daily (3). Dyspnoea is defined as ‘a subjective experience of breathing discomfort that consists of qualitatively distinct sensations that vary in intensity’ (5, p. 436).
Studies have found that 34–55% of adult ICU patients report significant dyspnoea (6–8). A recent multi-centre study that also included patients from our site found that one in three patients self-reported difficulty breathing during the first seven days of their ICU stay (9). In addition to being a stressful symptom for the ICU patient, dyspnoea can negatively impact treatment and result in failed or delayed weaning from mechanical ventilation (10, 11). Furthermore, ICU-related dyspnoea is associated with post-traumatic stress disorder (8).
The trend in recent decades has been for patients to remain more awake in the ICU, be mobilised earlier and receive lower doses of sedatives. The prevalence of dyspnoea may increase when patients receive less sedation or are ventilated with lower tidal volumes; however, many ICU patients are unable to communicate their symptoms (11–13).
Factors such as delirium, use of sedatives and endotracheal tubes leave many patients unable to communicate, which makes dyspnoea evaluation challenging (12). These factors, combined with the feeling of not being able to breathe as much as the body requires, may lead to anxiety for patients (4, 7, 14). Despite its impact, dyspnoea is often underestimated and underreported by healthcare professionals (6, 14, 15).
Haugdahl et al. enrolled 100 ICU patients and found that nurses and physicians underestimated dyspnoea in 56% and 48% of patients, respectively (15). These findings highlight the need for validated tools to identify dyspnoea in non-communicative ICU patients (16).
The Intensive Care-Respiratory Distress Observation Scale (IC-RDOS) and the Mechanical Ventilation-Respiratory Distress Observation Scale (MV-RDOS) are two hetero-evaluation symptom assessment scales intended to aid healthcare professionals in identifying dyspnoea in non-communicative critically ill patients (17, 18).
Both scales are meant to be used with ICU patients who are unable to self-report: IC-RDOS for patients not receiving mechanical ventilation (MV) and MV-RDOS for patients receiving MV. IC-RDOS and MV-RDOS are based on Campbell’s work with the original RDOS to identify dyspnoea in palliative patients (17, 19).
Both scales consist of five items. Four items are shared: heart rate, use of neck muscles during inspiration, abdominal paradox during inspiration, and facial expression of fear. The fifth item differs between the scales: use of supplemental oxygen applies only to IC-RDOS, and respiratory rate applies only to MV-RDOS. Together, these items constitute a total score to assess dyspnoea. Patient self-report using a visual analogue scale (D-VAS), an alternative to a numeric rating scale (NRS) for alert and verbal patients, is the gold standard for assessing dyspnoea in the ICU (11).
Both IC-RDOS and MV-RDOS demonstrate acceptable to excellent predictive value for identifying dyspnoea on a visual analogue scale (D-VAS ≥ 4) with area under the curve values of 0.83 and 0.78, respectively (16, 17). IC-RDOS is recommended to identify dyspnoea in spontaneously breathing patients, and a score ≥ 2.4 predicts D-VAS ≥ 4 with a sensitivity and specificity of 72% (17).
MV-RDOS is designed for mechanically ventilated patients, and a score of ≥ 2.6 predicts dyspnoea with D-VAS ≥ 4 with a specificity of 57% and a sensitivity of 94% (16). Comparable metrics have been shown in a recent study (20). The scales have been used in recent studies, including to predict weaning outcomes and to investigate an association with mortality (10, 20).
Like other symptom assessment scales, both instruments are used as surrogates for self-reporting and should only be used when patients are unable to communicate (12). This is because dyspnoea is a subjective symptom for which self-reporting is the gold standard (5, 11).
Currently, there is no validated method to evaluate the dyspnoea of ICU patients without the ability to self-report in Norwegian ICUs. The need for a more systematic approach to identifying and managing dyspnoea in our ICUs has become evident. We believe that a Norwegian adaptation of IC-RDOS and MV-RDOS could support healthcare professionals in making an evidence-based evaluation of dyspnoea in ICU patients without the ability to self-report.
Aim of the project
The overall aim of the project was to translate both IC-RDOS and MV-RDOS into Norwegian and evaluate the inter-rater reliability (IRR) of the Norwegian versions.
Method
To ensure the correct method and design, the study design checklist for patient-reported outcome measurement instruments from COnsensus‐based Standards for the selection of health Measurement INstruments (COSMIN) was chosen as a guideline for the IRR testing (22). STrengthening the Reporting of OBservational studies in Epidemiology (STROBE) was applied to ensure appropriate reporting (23).
The project was conducted as a quality improvement project involving two phases. First, the translation process of IC-RDOS and MV-RDOS was performed (24). We then applied a prospective observational design to evaluate the IRR and measurement errors of the scales. A quality improvement project was chosen to provide a more systematic approach to evaluating dyspnoea in ICU patients who are unable to self-report in our ICUs (25).
To translate IC-RDOS and MV-RDOS, we used an internationally recognised guideline that emphasises the importance of a linguistically and culturally adapted translation suited to the setting in which the scale is to be used (24).
Setting and population
Data for the reliability analysis were collected in four ICUs in a referral university hospital in Norway, including both medical and surgical patients. Two of the units have ten beds, and the other two have six beds; all four are categorised as level three units (1). Level three units have the capacity to manage advanced organ-supportive therapy for multiple organ systems, including invasive mechanical ventilation (26).
The inclusion criteria for this project were patients over the age of 18 years, admitted to the ICU, and defined as ICU patients (ICU stay > 24 hours, use of vasoactive medication, or in need of mechanical ventilation) (1). Patients receiving muscle relaxants, declared brain dead, or able to self-report dyspnoea were excluded. We chose to include patients with various scores on the Richmond Agitation Sedation Scale (RASS) to ensure representation of all non-communicative ICU patients.
Prior to the data collection, we set a maximum threshold of 10% of mechanically ventilated patients (five patients) to be unresponsive, defined as RASS of −5, to ensure the instrument was tested on heavily sedated patients, despite it being likely that these patients have physical signs of dyspnoea. Some patients were assessed twice with a minimum of 48 hours between assessments to ensure independence.
Data collection
Patient ratings were performed using the Norwegian versions of IC-RDOS and MV-RDOS by two ICU nurses with six and eight years of ICU experience. This approach was chosen to minimise external factors influencing measurement errors. Dyspnoea assessment was performed simultaneously and independently, with no discussion or access to view each other’s results until the project was completed.
Both raters participated in the translation of the scales. Descriptive clinical characteristics (i.e. Simplified Acute Physiology Score (SAPS) and RASS) of the patients were collected from the electronic health record. We included patients from November 2021 to March 2022.
Sample size
We estimated the sample size using power analysis, a priori, to be 50 assessments per scale with two raters. The sample size for Cohen’s kappa was estimated using the Cicchetti and Fleiss (27) formula n = 2k², where n is the sample size and k is the number of items in the scale.
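As a worked check of the rule n = 2k², the short sketch below reproduces the arithmetic. It assumes k = 5, since each scale has five items; the function name is hypothetical and is not taken from the project's analysis code.

```python
def kappa_sample_size(n_items: int) -> int:
    """Rule-of-thumb sample size n = 2 * k**2, with k the number of items."""
    if n_items < 1:
        raise ValueError("n_items must be a positive integer")
    return 2 * n_items ** 2


# Assumption: k = 5, because IC-RDOS and MV-RDOS each consist of five items.
print(kappa_sample_size(5))  # 50
```

With five items the rule gives 50 assessments per scale, which matches the a priori estimate stated above.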
A parallel power analysis for intraclass correlation coefficient (ICC) indicated a smaller required sample. Consequently, we chose the power analysis of Cohen’s kappa to increase the strength of the project. The intention was to use Cohen’s kappa in the analysis, but after the data collection was completed, we changed this analysis to Gwet’s agreement coefficient 1 (AC1) due to a better fit for the data collected. Therefore, the sample size was based on Cohen’s kappa instead of Gwet’s AC1.
A sample size of 50 assessments per scale is, however, in accordance with the recommendations of COSMIN (22). During the project, we increased the sample size of MV-RDOS to 60 due to the low frequency of some of the dichotomous traits.
Statistical analysis
Descriptive statistics were used to summarise the patients’ characteristics: continuous variables are reported as median with interquartile range (IQR) and categorical variables as frequencies (%) and counts. The IRR was analysed separately for each scale. For the total score and continuous items, we used a two-way random, average score, absolute agreement ICC (2, k) (28, 29). Normality of the residuals was assessed with Q-Q plots. Gwet’s AC1 was chosen for the dichotomous items due to its robustness against trait prevalence (30).
We used Altman’s scale to benchmark both Gwet’s AC1 and the ICC. A Gwet’s AC1/ICC of 0.20 or below is considered ‘poor’, 0.21–0.40 ‘fair’, 0.41–0.60 ‘moderate’, 0.61–0.80 ‘good’, and 0.81–1.00 ‘very good’ (31). The measurement errors are presented with 95% upper and lower limits of agreement (LoA) with confidence intervals for the total score and the continuous items (32).
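For readers unfamiliar with the ICC (2, k), the following is a minimal sketch of how a two-way random, absolute agreement, average score ICC can be computed from the mean squares of a two-way layout (the Shrout and Fleiss formulation). It is written in Python for illustration only and is not the R code used in this project; the function name and the example scores are hypothetical.

```python
def icc_2k(ratings):
    """ICC(2,k): two-way random effects, absolute agreement, average of k raters.

    ratings: list of rows, one row per assessment and one column per rater.
    """
    n = len(ratings)        # number of assessments
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for assessments (rows), raters (columns) and residual
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)                # between-assessment mean square
    jms = ss_cols / (k - 1)                # between-rater mean square
    ems = ss_err / ((n - 1) * (k - 1))     # residual mean square
    return (bms - ems) / (bms + (jms - ems) / n)


# Hypothetical total scores from two raters on five assessments
scores = [[1.2, 1.4], [3.0, 2.8], [5.1, 5.3], [2.2, 2.2], [4.0, 3.7]]
print(round(icc_2k(scores), 3))
```

Because the coefficient measures absolute agreement, a constant offset between the raters lowers it even when their scores are perfectly correlated.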
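The benchmark can be expressed as a simple lookup. The sketch below is illustrative only; the function name is hypothetical, and assigning values that fall between two published bands (for example 0.205) to the higher band is an assumption made here.

```python
def altman_label(coefficient: float) -> str:
    """Map an ICC or Gwet's AC1 value to Altman's benchmark category."""
    if coefficient <= 0.20:
        return "poor"
    if coefficient <= 0.40:
        return "fair"
    if coefficient <= 0.60:
        return "moderate"
    if coefficient <= 0.80:
        return "good"
    return "very good"


# Coefficients reported in this article
for value in (0.98, 0.92, 0.87, 0.77):
    print(value, altman_label(value))
```

Applied to the coefficients reported in the Results, this yields ‘very good’ for 0.98, 0.92 and 0.87, and ‘good’ for 0.77.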
In addition to total percentage agreement, we also calculated positive and negative agreement of the dichotomous items (33). Systematic differences between raters were tested using the paired t-test and McNemar’s test (30).
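Positive and negative agreement can be derived from the 2 × 2 table of the two raters' scores on a dichotomous item. The sketch below uses the standard definitions of these proportions of specific agreement; the function name and the cell counts are hypothetical and are not data from this project.

```python
def specific_agreement(both_yes, a_only, b_only, both_no):
    """Return (total, positive, negative) agreement for two raters.

    both_yes / both_no: assessments where the raters agree the sign is
    present / absent; a_only / b_only: only one rater scored the sign.
    """
    n = both_yes + a_only + b_only + both_no
    disagree = a_only + b_only
    total = (both_yes + both_no) / n
    pos_den = 2 * both_yes + disagree
    neg_den = 2 * both_no + disagree
    positive = 2 * both_yes / pos_den if pos_den else float("nan")
    negative = 2 * both_no / neg_den if neg_den else float("nan")
    return total, positive, negative


# Hypothetical counts for a rarely observed sign in 60 assessments
total, positive, negative = specific_agreement(3, 2, 1, 54)
print(round(total, 3), round(positive, 3), round(negative, 3))  # 0.95 0.667 0.973
```

With a rare sign, total and negative agreement can both be high while positive agreement is clearly lower, which is why the two are reported separately.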
The data were analysed using R version 4.1 (R Foundation for Statistical Computing, Austria, 2021), with a significance level at 0.05 and a 95% confidence interval (CI) where applicable.
Ethical considerations
This quality improvement project was first presented to the Regional Committees for Medical Research Ethics in Southeast Norway (reference number 2021/325676) but fell outside their mandate. The data protection office at the hospital approved the project, and the need for consent was waived (reference number 21/19413).
Results
Translation of IC-RDOS and MV-RDOS
We describe each step of the translation process in Table 1, which constitutes the final report of the process. We want to emphasise the choice to use medical terminology over layman’s terms to describe the items in the present project as a cultural adaptation. For example, ‘Use of neck muscles during inspiration’ was translated to ‘Use of accessory muscles during inspiration’, as the instrument will be used by healthcare professionals.
Furthermore, it is standard practice in Norway not to translate the name or abbreviation of scales used in the ICU. The translated versions of the IC-RDOS and MV-RDOS are presented in Figure 1. The back-translated version was accepted by the developer of the scales without disagreement.

Patient characteristics
This project enrolled 81 ICU patients. The median patient age was 60 (IQR 51, 69), and 64% were male. The largest group of included patients were acute surgical cases (42%). The most frequent reason for admission was ‘respiratory’ (24%).
We performed 50 and 60 pairs of assessments for IC-RDOS and MV-RDOS, respectively. The median RASS at the time of the assessment was −1 (IQR −2, 0) for the non-mechanically ventilated patients and −3 (IQR −4, −2) for the mechanically ventilated patients, indicating drowsy and moderately sedated, respectively. The patient characteristics are summarised in Table 2.
Reliability of IC-RDOS
The total score of IC-RDOS had an ICC of 0.98 (95% CI 0.96 to 0.99), indicating ‘very good’ IRR. The mean difference between the raters for IC-RDOS was 0.08 (t = 0.21, df = 49, p = 0.98), with LoA ranging from −1.03 to 1.19.
Reliability of MV-RDOS
The total score of MV-RDOS had an ICC of 0.92 (95% CI 0.86 to 0.95), indicating ‘very good’ IRR. The mean difference between the raters for MV-RDOS was 0.003 (t = 0.01, df = 59, p = 0.99), with LoA ranging from −1.77 to 1.77.
For both scales, the item with the lowest degree of IRR was ‘use of neck muscles during inspiration’, with an AC1 of 0.87 (95% CI 0.74 to 1) for IC-RDOS and 0.77 (95% CI 0.61 to 0.93) for MV-RDOS. All ICC and Gwet’s AC1 values, LoAs, and percentage agreements are presented in Table 3. Measurement errors for the total scores are illustrated in Figure 2.
Discussion
In this project, we translated IC-RDOS and MV-RDOS (Figure 1) and evaluated the Norwegian versions’ inter-rater reliability. The main finding of this project is that the total scores of both instruments demonstrated ‘very good’ IRR. With IC-RDOS in non-mechanically ventilated patients, all dichotomous items except ‘use of neck muscles during inspiration’ had total agreement. The total score of IC-RDOS had a higher ICC than that of MV-RDOS. All dichotomous items without complete agreement had greater negative than positive agreement.
To our knowledge, only one project has investigated IRR in IC-RDOS (17). Persichini et al. (17) found moderate consistency for the total score and variable IRR across the individual items. However, their study included multiple raters, which may have contributed to lower agreement than in our project, which had only two raters.
In our project, we found that the total score had a ‘very good’ IRR for both scales, but IC-RDOS had a slightly higher ICC than MV-RDOS. The differences in the ICC for the total scores of the two scales may be explained by an increase in subjectivity in the interpretation of the items. It is also possible that these behavioural signs are more subtle in mechanically ventilated patients than in non-mechanically ventilated patients due to a lower RASS or use of sedation. While this project did not explore the underlying differences in ICC, it is a known challenge to identify behavioural signs in mechanically ventilated patients (34).
The item with the lowest degree of agreement on both scales was ‘use of neck muscles during inspiration’. Nevertheless, the item still had ‘good’ and ‘very good’ agreement. The raters may have slightly different perceptions of what the item characterises. Alternatively, the item may be inherently more difficult to interpret than the other items on the scales. It is important to consider these findings in the further implementation of the scales as extra focus may be needed for this item.
Previous studies have found varying degrees of agreement for this item (17, 35). To address this, we designed a user guide to aid clinicians in using the scales correctly and thus minimise these events during future use. However, this item does not affect the total score in such a way as to make the scales less reliable.
A facial expression of fear is characterised by wide-open eyes with visible irises (36). We observed a lower IRR for this item in mechanically ventilated patients. These patients may not have been able to show the expression, or may have shown it more subtly, because they were more sedated than non-mechanically ventilated patients (34). The subtle display of the sign may have led to a greater variation between the raters in this project.
Facial expressions can be a powerful predictor of dyspnoea (17), but they were uncommon in our sample. In mechanically ventilated patients, ‘facial expression of fear’ had the highest discrepancy between positive and negative agreement. This discrepancy may be influenced by the low prevalence of the sign in mechanically ventilated patients and was the reason we applied Gwet’s AC1, which is resilient to trait prevalence (30). Gwet’s AC1 favours the overall agreement, but the low agreement in the actual presence of ‘facial expression of fear’ must, therefore, be taken into consideration in future implementation processes.
The measurement errors of IC-RDOS and MV-RDOS have not been examined in previous studies. To clarify, if both raters are in complete agreement except on one dichotomous item, the total score will deviate by as much as two points. In this project, the LoAs of the total scores of IC-RDOS and MV-RDOS are below two, which indicates that most of the differences in the total score can be explained by a difference in no more than one dichotomous item.
In fact, the raters disagreed on two dichotomous items in only one assessment. We were not able to follow the recommendation to set a desired LoA prior to data analysis as our project is the first to describe the measurement error (37). Our findings offer a base for future studies to compare the measurement errors of the scales.
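To make the limits of agreement discussed above concrete, the following minimal sketch computes them under the conventional Bland-Altman definition (mean difference ± 1.96 standard deviations of the paired differences). Whether the project's analysis used exactly this definition is an assumption; the function name and the example scores are hypothetical.

```python
from statistics import mean, stdev


def limits_of_agreement(rater_a, rater_b):
    """Return (lower LoA, mean difference, upper LoA) for paired scores."""
    diffs = [a - b for a, b in zip(rater_a, rater_b)]
    d_mean = mean(diffs)
    d_sd = stdev(diffs)  # sample standard deviation of the differences
    return d_mean - 1.96 * d_sd, d_mean, d_mean + 1.96 * d_sd


# Hypothetical total scores from two raters on the same four assessments
rater_a = [2.0, 3.0, 4.0, 5.0]
rater_b = [2.0, 2.0, 4.0, 6.0]
lower, bias, upper = limits_of_agreement(rater_a, rater_b)
print(round(lower, 2), round(bias, 2), round(upper, 2))
```

Under this definition the limits are symmetric around the mean difference, which is consistent with the reported IC-RDOS limits of −1.03 and 1.19 around a mean difference of 0.08.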
Following data collection, we observed clear differences between total percent agreement and Cohen’s kappa, explained by the kappa paradox (33). This prompted a search for alternative methods. Cohen’s kappa assumes that the raters evaluate all observations independently from the other evaluations, which was not the case in our population. The dichotomous items would have been falsely deemed to have low IRR explained by low prevalence (38). In such cases, Gwet’s AC1 provides a more stable estimate (30).
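The kappa paradox can be illustrated with a small numerical sketch. The counts below are hypothetical and chosen only to mimic a rarely observed sign; the formulas are the standard two-rater definitions of Cohen's kappa and Gwet's AC1 for a dichotomous item, and the function names are illustrative.

```python
def cohen_kappa(both_yes, a_only, b_only, both_no):
    """Cohen's kappa for two raters and a dichotomous item."""
    n = both_yes + a_only + b_only + both_no
    p_obs = (both_yes + both_no) / n
    a_yes = (both_yes + a_only) / n   # proportion scored present by rater A
    b_yes = (both_yes + b_only) / n   # proportion scored present by rater B
    p_exp = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return (p_obs - p_exp) / (1 - p_exp)


def gwet_ac1(both_yes, a_only, b_only, both_no):
    """Gwet's AC1 for two raters and a dichotomous item."""
    n = both_yes + a_only + b_only + both_no
    p_obs = (both_yes + both_no) / n
    # Mean prevalence of the sign across the two raters
    prevalence = ((both_yes + a_only) / n + (both_yes + b_only) / n) / 2
    p_exp = 2 * prevalence * (1 - prevalence)
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical table: 57 of 60 assessments agree, but the sign is rare
counts = (1, 2, 1, 56)
print(round(cohen_kappa(*counts), 3))  # 0.375
print(round(gwet_ac1(*counts), 3))     # 0.946
```

With 95% observed agreement, kappa lands in Altman's ‘fair’ band while AC1 lands in ‘very good’, because the chance-agreement term in kappa grows towards one when almost all scores fall in the same category.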
We chose to use Gwet’s AC1 in the evaluation of the dichotomous items, which is an advantage of this project due to a more precise presentation of the results (38). If we had solely relied on Cohen’s kappa, it might have led to delayed or failed implementation of a validated method for evaluating dyspnoea in ICU patients without the ability to self-report in Norway.
The patients included in this project do not fully represent the general ICU population. The first consideration relates to our inclusion of only patients from a referral hospital. This could have resulted in the inclusion of more severely ill patients or patients with rare conditions. A recent review demonstrated that patients receiving non-invasive ventilation (NIV) self-reported both high prevalence and intensity of dyspnoea (39).
In addition, out of the 60 assessments performed on mechanically ventilated patients, only 4 (7%) involved NIV. As a result, we cannot say that the Norwegian version of the scale has been fully tested on patients receiving NIV. A final consideration is that the assessments were done on median day 6 and day 10 in the ICU for IC-RDOS and MV-RDOS, respectively. This differs from our ICUs in general, which have a median length of stay of 3 days, indicating that the instruments have been tested in patients with a prolonged ICU length of stay (1).
The median SAPS II score of 51 indicates that the included patients were severely ill, reflecting the patient population at the hospital. Less severely ill ICU patients are more likely to be able to self-report their symptoms and consequently were excluded from the project. These factors combined point out that our sample of patients were severely ill ICU patients, but we argue that the scales still can be applicable to less severely ill ICU patients. Still, we have demonstrated the possible use of the translated version of IC-RDOS and MV-RDOS in Norwegian ICUs.
A recent statement from the European Society of Intensive Care Medicine recommends the use of IC-RDOS and MV-RDOS in the evaluation of dyspnoea in ICU patients who cannot self-report (10).
Providing a Norwegian version of the scales will make it easier for clinicians to assess dyspnoea in ICU patients unable to self-report. Introducing these scales into clinical practice may promote a common language in the assessment of dyspnoea, hopefully resulting in improved patient care and reduced symptom burden. Further, being able to measure dyspnoea in a valid and reliable manner can facilitate future research on the factors associated with dyspnoea in ICU patients.
Limitations
We chose not to perform dyspnoea evaluations on all non-communicative patients to ensure validity for a wider patient selection. Including all non-communicative patients may have biased the results in favour of deeply sedated patients, resulting in items such as ‘facial expression of fear’ having a lower prevalence. The raters gathered background information before every assessment.
In retrospect, knowing the background information before the assessment may have introduced confirmation bias, and this information therefore should have been collected after the assessments. The sample size was based on Cohen’s kappa instead of Gwet’s AC1. However, the sample sizes of 50 and 60 assessments are more than the recommended threshold (24).
Both raters were involved in the translation process. However, we believe the behavioural signs would have shown a higher prevalence or a systematic difference between the raters if confirmation bias had been introduced. In future clinical use, a lower IRR must be expected due to a greater diversity of raters.
Further research
Regarding the validity and reliability of the Norwegian version of IC-RDOS and MV-RDOS in a larger population of ICU patients, further research is warranted after implementation with a greater diversity of raters. Power analysis should be based on Gwet’s AC1 due to the expected low prevalence in the dichotomous items.
Conclusion
In this article, we presented a Norwegian version of the IC-RDOS and MV-RDOS and evaluated the scales’ IRR. The primary result indicates that the total score of the scales demonstrated ‘very good’ IRR. Further research regarding the Norwegian version of IC-RDOS and MV-RDOS should focus on a wider range of psychometric properties.
Dyspnoea is a subjective experience and is often underestimated by clinicians. We believe that IC-RDOS and MV-RDOS can contribute to both increased awareness and evidence-based assessment of dyspnoea experienced by ICU patients.
Based on the findings of this project, the scales may be suitable for implementation in Norwegian ICUs. Using the IC-RDOS and MV-RDOS in clinical practice is likely to increase awareness and improve symptom management.
*Peder Sebastian Martinsen and Runa Austad Haug share first authorship.
The authors declare no conflicts of interest.
Open access CC BY 4.0