Intraclass Correlation Coefficient (ICC) Calculator

Comprehensive Guide to ICC Calculation

Module A: Introduction & Importance

The Intraclass Correlation Coefficient (ICC) is a statistical measure that quantifies the reliability and agreement between measurements taken by different raters or methods on the same subjects. ICC is particularly valuable in:

Clinical research: Assessing the consistency of diagnostic tests or measurements across different clinicians
Psychometrics: Evaluating the reliability of psychological assessments and surveys
Biomechanics: Determining the repeatability of movement analysis measurements
Quality control: Verifying the consistency of manufacturing processes or product measurements

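ICC values usually fall between 0 and 1 (negative estimates are possible and indicate no reliability). A common interpretation scale is:

Below 0.50: Poor reliability
0.50-0.75: Moderate reliability
0.75-0.90: Good reliability
Above 0.90: Excellent reliability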
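The scale above can be applied programmatically. The sketch below is only an illustration: the function name `interpret_icc` is hypothetical, and assigning a boundary value (such as exactly 0.75) to the higher band is an assumption, since the listed ranges do not say which band owns the boundary.

```python
def interpret_icc(icc: float) -> str:
    """Map an ICC estimate to the reliability labels of the scale above.

    Assumption: a value sitting exactly on a boundary (0.50, 0.75, 0.90)
    is assigned to the higher band.
    """
    if icc < 0.50:
        return "Poor reliability"
    if icc < 0.75:
        return "Moderate reliability"
    if icc < 0.90:
        return "Good reliability"
    return "Excellent reliability"


print(interpret_icc(0.872))  # Good reliability
print(interpret_icc(0.42))   # Poor reliability
```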

Figure: ICC reliability scale with color-coded ranges from poor to excellent reliability.

Module B: How to Use This Calculator

Follow these steps to calculate ICC using our interactive tool:

Enter basic parameters: Input the number of subjects and ratings per subject in the first two fields
Select ICC model: Choose between one-way random effects, two-way random effects, or two-way mixed effects based on your study design
Choose ICC type: Select either single measures (for individual ratings) or average measures (for mean ratings)
Specify analysis type: Decide between consistency (relative agreement) or absolute agreement
Input ANOVA results: Enter the Mean Square values from your ANOVA table (MSB, MSW, MSR, MSE)
Calculate: Click the “Calculate ICC” button or let the tool compute automatically
Interpret results: Review the ICC value, confidence interval, and interpretation

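Pro Tip: Match the coefficient to your question. In the Shrout and Fleiss notation, ICC(3,1) (two-way mixed effects, single measures) is a consistency coefficient and ignores systematic differences between raters. When those differences matter, as in most clinical research applications, choose an absolute-agreement form such as ICC(2,1).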
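If you have raw ratings rather than an ANOVA table, the mean squares requested in the "Input ANOVA results" step can be derived directly. The following is a minimal sketch for a complete table (every rater scores every subject, no missing values). The function name `anova_mean_squares` and the example ratings are illustrative assumptions, and MSR is taken to be the mean square for raters.

```python
def anova_mean_squares(ratings):
    """Return MSB, MSW, MSR, MSE for a complete subjects-by-raters table.

    ratings: list of rows, one row per subject, one column per rater.
    """
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    subject_means = [sum(row) / k for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Partition the total sum of squares
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_between = k * sum((m - grand) ** 2 for m in subject_means)
    ss_within = ss_total - ss_between        # within-subject variation
    ss_raters = n * sum((m - grand) ** 2 for m in rater_means)
    ss_error = ss_within - ss_raters         # residual after removing rater effect

    return {
        "MSB": ss_between / (n - 1),
        "MSW": ss_within / (n * (k - 1)),
        "MSR": ss_raters / (k - 1),
        "MSE": ss_error / ((n - 1) * (k - 1)),
    }


# Illustrative data: 6 subjects (rows) scored by 4 raters (columns)
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
for name, value in anova_mean_squares(example).items():
    print(name, round(value, 2))
```

For this table the sketch gives MSB ≈ 11.24, MSW ≈ 6.26, MSR ≈ 32.49, and MSE ≈ 1.02, which can then be typed into the calculator fields.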

Module C: Formula & Methodology

The ICC calculation depends on several factors including the ANOVA model, whether you’re using single or average measures, and whether you’re assessing consistency or absolute agreement. Here are the core formulas:

1. One-Way Random Effects Model (ICC(1,1))

For single measures:

ICC = (MSB - MSW) / (MSB + (k-1)*MSW)

2. Two-Way Random Effects Model (ICC(2,1))

For single measures, absolute agreement:

ICC = (MSB - MSE) / (MSB + (k-1)*MSE + k*(MSR - MSE)/n)

3. Two-Way Mixed Effects Model (ICC(3,1))

For single measures, consistency:

ICC = (MSB - MSE) / (MSB + (k-1)*MSE)

Where:

MSB = Mean Square Between subjects
MSW = Mean Square Within subjects (one-way model)
MSR = Mean Square for Raters
MSE = Mean Square Error
k = Number of raters
n = Number of subjects
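The single-measures formulas can be collected into one small function. This is a sketch under stated assumptions rather than the calculator's actual code: the function name and the `model` labels are hypothetical, and the consistency coefficient uses the error mean square (MSE) in both numerator and denominator, following the standard Shrout and Fleiss definition of ICC(3,1).

```python
def icc_from_mean_squares(model, k, msb, n=None, msw=None, msr=None, mse=None):
    """Single-measures ICC from ANOVA mean squares.

    model: "one-way"             -> ICC(1,1), needs msw
           "two-way-agreement"   -> ICC(2,1), needs mse, msr, n
           "two-way-consistency" -> ICC(3,1), needs mse
    k: number of raters; n: number of subjects.
    """
    if model == "one-way":
        return (msb - msw) / (msb + (k - 1) * msw)
    if model == "two-way-agreement":
        return (msb - mse) / (msb + (k - 1) * mse + k * (msr - mse) / n)
    if model == "two-way-consistency":
        return (msb - mse) / (msb + (k - 1) * mse)
    raise ValueError(f"unknown model: {model!r}")


# Illustrative mean squares for 6 subjects and 4 raters
ms = dict(msb=11.24, msw=6.26, msr=32.49, mse=1.02)
icc_1 = icc_from_mean_squares("one-way", k=4, msb=ms["msb"], msw=ms["msw"])
icc_2 = icc_from_mean_squares("two-way-agreement", k=4, n=6,
                              msb=ms["msb"], msr=ms["msr"], mse=ms["mse"])
icc_3 = icc_from_mean_squares("two-way-consistency", k=4,
                              msb=ms["msb"], mse=ms["mse"])
print(round(icc_1, 2), round(icc_2, 2), round(icc_3, 2))
```

With these inputs the three coefficients come out at roughly 0.17, 0.29, and 0.71. The spread shows how strongly the choice of model affects the result when raters differ systematically (here MSR is much larger than MSE).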

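The calculator derives 95% confidence intervals with Fisher’s z-transformation, which normalizes the sampling distribution of the ICC before the interval is computed. Most statistical packages instead report exact intervals based on the F distribution, so their bounds can differ slightly from the ones shown here.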

Module D: Real-World Examples

Case Study 1: Physical Therapy Assessment

Scenario: 15 physical therapists evaluated 20 patients’ range of motion using a goniometer. Each patient was assessed by 3 different therapists.

ANOVA Results: MSB = 48.2, MSW = 10.5, MSR = 3.8, MSE = 6.7

ICC Model: Two-way mixed effects, single measures, consistency (ICC(3,1))

Calculated ICC: (48.2 - 6.7) / (48.2 + 2 × 6.7) = 41.5 / 61.6 ≈ 0.67

Interpretation: Moderate inter-rater reliability. The goniometer measurements are reasonably consistent across therapists, but a single therapist’s reading still carries noticeable measurement error.

Case Study 2: Psychological Survey Validation

Scenario: 50 participants completed a new anxiety scale. Each participant was rated by 2 clinical psychologists using the scale.

ANOVA Results: MSB = 32.1, MSW = 8.4, MSR = 2.1, MSE = 6.3

ICC Model: Two-way random effects, single measures, absolute agreement (ICC(2,1))

Calculated ICC: (32.1 - 6.3) / (32.1 + 6.3 + 2 × (2.1 - 6.3) / 50) = 25.8 / 38.23 ≈ 0.67

Interpretation: Moderate reliability, suggesting room for improvement in the scale’s consistency between raters.

Case Study 3: Radiological Image Analysis

Scenario: 8 radiologists evaluated 50 MRI scans for tumor size measurements. Each scan was independently measured by all 8 radiologists.

ANOVA Results: MSB = 125.3, MSW = 42.8, MSR = 18.6, MSE = 24.2

ICC Model: Two-way random effects, average measures, absolute agreement (ICC(2,8))

Calculated ICC: (125.3 - 24.2) / (125.3 + (18.6 - 24.2) / 50) = 101.1 / 125.19 ≈ 0.81

Interpretation: Good reliability for the mean of the 8 radiologists’ measurements. The corresponding single-rater ICC(2,1) is only about 0.34, so an individual radiologist’s measurement is far less dependable than the average.

Module E: Data & Statistics

Comparison of ICC Models and Their Applications

ICC Model	Notation	Description	When to Use	Key Consideration
One-Way Random	ICC(1,1), ICC(1,k)	Each subject rated by different raters randomly selected from population	When raters are randomly sampled and you want to generalize to entire rater population	Most conservative estimate of reliability
Two-Way Random	ICC(2,1), ICC(2,k)	Each subject rated by same set of raters randomly selected from population	When same raters evaluate all subjects and you want to generalize to rater population	Accounts for systematic differences between raters
Two-Way Mixed	ICC(3,1), ICC(3,k)	Each subject rated by same fixed set of raters	When using specific raters who are the only ones of interest (not generalizing)	Most liberal estimate of reliability

ICC Interpretation Guidelines by Field

Field of Study	Poor	Moderate	Good	Excellent	Source
Clinical Medicine	<0.40	0.40-0.59	0.60-0.74	≥0.75	NCBI Guidelines
Psychometrics	<0.50	0.50-0.69	0.70-0.84	≥0.85	APA Standards
Biomechanics	<0.60	0.60-0.74	0.75-0.89	≥0.90	ISB Recommendations
Educational Testing	<0.70	0.70-0.79	0.80-0.89	≥0.90	ETS Standards

Module F: Expert Tips

Designing Your Study for Optimal ICC Calculation

Sample size matters: Aim for at least 30 subjects and 3-5 raters for stable ICC estimates. Small samples produce unstable estimates with wide confidence intervals, so a high ICC from a handful of subjects may not replicate.
Rater training: Ensure all raters are properly trained and calibrated before data collection to minimize systematic differences.
Blinding: Keep raters blinded to each other’s scores and to previous ratings of the same subject to prevent bias.
Randomization: Randomize the order of subject evaluation to control for order effects.
Pilot testing: Conduct a pilot study with 5-10 subjects to identify potential issues with your measurement protocol.

Common Pitfalls to Avoid

Ignoring model assumptions: ICC calculations assume normality of measurements and homogeneity of variance. Check these assumptions with appropriate tests.
Using wrong ICC type: Selecting ICC(1,1) when you should use ICC(3,k) can lead to misleadingly low reliability estimates.
Overinterpreting high ICC: An ICC of 0.9 doesn’t mean perfect agreement – examine the actual measurement differences.
Neglecting confidence intervals: Always report CIs. An ICC of 0.75 with CI [0.65, 0.83] is more informative than just 0.75.
Pooling heterogeneous groups: Calculating ICC across groups with different variances (e.g., healthy and diseased) can inflate reliability estimates.

Advanced Considerations

Generalizability theory: For complex designs, consider G-theory which extends ICC to multiple facets (raters, items, occasions).
Missing data: Use multiple imputation or maximum likelihood methods rather than complete-case analysis.
Non-normal data: For ordinal data, consider weighted kappa or a polychoric-based ICC instead of the standard ANOVA-based ICC.
Software validation: Cross-validate your ICC calculations using at least two different statistical packages.
Longitudinal designs: For test-retest reliability, ensure appropriate time intervals between measurements.

Module G: Interactive FAQ

What’s the difference between ICC and Pearson correlation?

While both measure relationships between variables, Pearson correlation assesses the linear relationship between two continuous variables (e.g., height and weight), while ICC specifically evaluates the consistency or absolute agreement between measurements of the same underlying quantity by different raters or methods.

Key differences:

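Pearson r ranges from -1 to 1; ICC usually falls between 0 and 1, although estimates can be negative
Pearson measures association; ICC measures agreement/reliability
Pearson is unaffected by a shift or rescaling of one rater’s scores; absolute-agreement ICC is lowered by it
Absolute-agreement ICC penalizes systematic differences between raters; Pearson and consistency ICC do not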

Use Pearson when comparing distinct variables; use ICC when comparing multiple measurements of the same construct.
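A small numerical illustration makes the distinction concrete. In the sketch below, the second rater scores every subject exactly 2 points higher than the first. The data and helper names are hypothetical, and the ICC helpers implement the two-way single-measures formulas for exactly two raters.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def two_rater_icc(x, y):
    """Return (consistency ICC, absolute-agreement ICC) for two raters."""
    n, k = len(x), 2
    grand = (sum(x) + sum(y)) / (n * k)
    subject_means = [(a + b) / 2 for a, b in zip(x, y)]
    rater_means = (sum(x) / n, sum(y) / n)

    msb = k * sum((m - grand) ** 2 for m in subject_means) / (n - 1)
    msr = n * sum((m - grand) ** 2 for m in rater_means) / (k - 1)
    # Residual: what is left after removing subject and rater effects
    sse = sum(
        (score - s_mean - r_mean + grand) ** 2
        for a, b, s_mean in zip(x, y, subject_means)
        for score, r_mean in ((a, rater_means[0]), (b, rater_means[1]))
    )
    mse = sse / ((n - 1) * (k - 1))

    consistency = (msb - mse) / (msb + (k - 1) * mse)
    agreement = (msb - mse) / (msb + (k - 1) * mse + k * (msr - mse) / n)
    return consistency, agreement


rater_a = [1.0, 2.0, 3.0, 4.0, 5.0]
rater_b = [score + 2.0 for score in rater_a]  # same ranking, constant +2 offset

r = pearson_r(rater_a, rater_b)
icc_consistency, icc_agreement = two_rater_icc(rater_a, rater_b)
print(round(r, 3), round(icc_consistency, 3), round(icc_agreement, 3))
# 1.0 1.0 0.556
```

Pearson r and the consistency ICC are both 1.0 because the two raters rank and space the subjects identically, while the absolute-agreement ICC drops to about 0.56 because the raters never report the same value.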

How many raters do I need for a reliable ICC estimate?

The number of raters affects both the ICC value and the precision of its estimate. General recommendations:

Minimum: 2 raters (but provides limited information)
Recommended: 3-5 raters for most applications
High-stakes decisions: 5-10 raters for critical measurements

More raters generally:

Increase the ICC value (especially for average measures)
Narrow the confidence intervals
Provide more stable estimates

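An average-measures ICC based on k raters is higher than the single-rater ICC from the same data. The Spearman-Brown prophecy formula estimates how reliability changes with the number of raters, where ICC₁ is the single-rater value and ICC_k is the reliability of the mean of k raters: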

ICC_k = (k × ICC₁) / (1 + (k-1) × ICC₁)
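The Spearman-Brown formula above, and its algebraic inverse, are short enough to sketch directly. The function names are hypothetical, and the example values are illustrative rather than taken from a real study.

```python
def spearman_brown(icc_single: float, k: int) -> float:
    """Projected reliability of the mean of k raters from a single-rater ICC."""
    return (k * icc_single) / (1 + (k - 1) * icc_single)


def single_from_average(icc_average: float, k: int) -> float:
    """Inverse of the formula: single-rater ICC implied by a k-rater average ICC."""
    return icc_average / (k - (k - 1) * icc_average)


projected = spearman_brown(0.60, k=3)
print(round(projected, 3))                          # 0.818
print(round(single_from_average(projected, 3), 3))  # 0.6
```

Here a single-rater ICC of 0.60 projects to about 0.82 when three raters' scores are averaged, which is why the number of raters should always be reported alongside an average-measures coefficient.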

Can ICC be negative? What does that mean?

Yes, ICC can theoretically be negative, though this is rare in practice. A negative ICC occurs when:

The between-subject variability (MSB) is smaller than the within-subject variability (MSW or MSE), so the numerator of the ICC formula is negative
Subjects are very similar to one another, so true between-subject differences are swamped by measurement noise
Raters rank subjects in opposite directions (e.g., the subjects one rater scores highest are the ones another scores lowest)

Interpretation of negative ICC:

ICC < 0: No reliability; measurements are worse than random
ICC = 0: No reliability; measurements are random
0 < ICC < 0.5: Poor reliability
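To see how a negative estimate arises numerically, the one-way single-measures formula can be evaluated with a between-subjects mean square that is smaller than the within-subjects mean square. The values below are made up for illustration, and the function name is hypothetical.

```python
def icc_one_way_single(msb: float, msw: float, k: int) -> float:
    """ICC(1,1) = (MSB - MSW) / (MSB + (k - 1) * MSW)."""
    return (msb - msw) / (msb + (k - 1) * msw)


# MSB < MSW: subjects differ less from each other than raters differ within a subject
print(icc_one_way_single(msb=4.0, msw=6.0, k=3))  # -0.125

# MSB == MSW: the estimate is exactly zero
print(icc_one_way_single(msb=6.0, msw=6.0, k=3))  # 0.0
```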

If you get a negative ICC:

Check for data entry errors
Examine your measurement protocol for systematic issues
Consider whether your raters need additional training
Verify that your ANOVA model is correctly specified

How does ICC relate to Cronbach’s alpha?

Both ICC and Cronbach’s alpha measure reliability, but they’re used in different contexts:

Feature	ICC	Cronbach’s Alpha
Purpose	Inter-rater reliability	Internal consistency
Data Structure	Multiple raters measuring same subjects	Multiple items measuring same construct
ANOVA Based	Yes	No (based on item covariances)
Range	Usually 0 to 1; can be negative	Usually 0 to 1; can be negative when items are negatively correlated
When to Use	When different raters measure same subjects	When multiple test items measure same latent construct

For complete, balanced data, Cronbach’s alpha is mathematically identical to ICC(3,k), the two-way mixed-effects, average-measures consistency coefficient, with items playing the role of raters. This requires:

All raters measure all subjects
No missing data

For most practical purposes, they serve different reliability assessment needs.

What’s the minimum acceptable ICC for publication?

The minimum acceptable ICC depends on your field and the stakes of your measurements:

General Guidelines by Context:

Exploratory research: ≥0.60 (moderate reliability)
Confirmatory research: ≥0.70 (good reliability)
Clinical decision-making: ≥0.80 (good to excellent)
Diagnostic tests: ≥0.90 (excellent reliability)

Journal Requirements:

Many top journals in medicine and psychology require:

ICC ≥ 0.70 for primary outcome measures
ICC ≥ 0.80 for diagnostic instruments
Confidence intervals reported for all ICC values
Justification if ICC < 0.70 is reported

Regulatory Standards:

For FDA submissions and clinical trials:

Primary endpoints typically require ICC ≥ 0.80
Safety measurements often require ICC ≥ 0.90
Full reporting of measurement error (SEM, MDC) alongside ICC

Important Note: Always check the author guidelines of your target journal and the standards of your specific field. Some specialized areas (like radiology or forensic analysis) may have stricter requirements.

How do I calculate ICC in R, Python, and SPSS?

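R (using the psych, irr, and lme4 packages):

# your_data: one row per subject, one column per rater (wide format)

# For ICC(1,1): first row of the psych results table
library(psych)
icc_table <- ICC(your_data)$results
ICC1 <- icc_table$ICC[1]

# For ICC(2,1): two-way model, absolute agreement, single measures
# (use type = "consistency" for the ICC(3,1) point estimate)
library(irr)
icc_result <- icc(your_data, model = "twoway", type = "agreement", unit = "single")

# For detailed ANOVA-based ICC
# your_data_long: hypothetical long-format copy with columns score, subject, rater
library(lme4)
model <- lmer(score ~ 1 + (1|subject) + (1|rater), data = your_data_long)
VarCorr(model)  # Extract variance components for manual ICC calculation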

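Python (using pingouin):

import pingouin as pg

# df is in long format: one row per (subject, rater) pair
# A single call returns all six forms: ICC1, ICC2, ICC3, ICC1k, ICC2k, ICC3k
icc = pg.intraclass_corr(data=df, targets='subject', raters='rater',
                         ratings='score').round(3)

# Select the form you need by its Type label
icc1 = icc.loc[icc['Type'] == 'ICC1', 'ICC'].item()
icc2 = icc.loc[icc['Type'] == 'ICC2', 'ICC'].item()
icc3 = icc.loc[icc['Type'] == 'ICC3', 'ICC'].item()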

SPSS:

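Go to Analyze → Scale → Reliability Analysis
Move your rater variables (one column per rater) to the "Items" box
Click "Statistics" and check "Intraclass correlation coefficient"
Select the Model (Two-Way Mixed, Two-Way Random, or One-Way Random) and the Type (Consistency or Absolute Agreement)
Click "Continue" then "OK"; the output table reports both single-measures and average-measures ICCs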

Pro Tip: Always verify your software's ICC calculation method against the theoretical formulas, as different packages may use slightly different computational approaches.

What are the limitations of ICC?

While ICC is a powerful reliability metric, it has several important limitations:

Statistical Limitations:

Assumes normality: ICC calculations assume normally distributed measurements, which may not hold for ordinal or skewed data
Sensitive to outliers: Extreme values can disproportionately influence ICC estimates
Sample size dependent: Small samples can produce unstable ICC values with wide confidence intervals
Fixed vs random effects: Choosing the wrong model (fixed when should be random or vice versa) can lead to incorrect conclusions

Interpretation Challenges:

High ICC ≠ perfect agreement: An ICC of 0.9 still allows for meaningful differences between measurements
Context-dependent: What's "good" in one field may be "poor" in another
Masking systematic bias: A consistency ICC can be high even if raters systematically differ (e.g., one always scores 10% higher); only absolute-agreement forms penalize this
Ignores measurement error: Doesn't directly inform you about the absolute size of measurement errors

Practical Considerations:

Resource intensive: Requires multiple raters and subjects, which can be expensive
Not for all designs: Assumes each subject is rated by multiple raters, which isn't always feasible
Static measure: Doesn't account for learning effects or rater drift over time
Limited diagnostic value: A low ICC doesn't tell you why reliability is poor or how to improve it

Alternatives and Complements:

Consider using these alongside ICC:

Bland-Altman plots: For visualizing agreement and systematic bias
Standard Error of Measurement (SEM): For understanding absolute measurement error
Smallest Detectable Change (SDC): For determining meaningful individual changes
Kappa statistics: For categorical data
Generalizability theory: For complex multi-facet designs
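Two of these complements follow directly from the ICC and the standard deviation of the scores. The sketch below uses the commonly cited conventions SEM = SD × √(1 − ICC) and SDC = 1.96 × √2 × SEM (95% level). These formulas, the function names, and the example numbers are assumptions introduced here for illustration and are not taken from the calculator.

```python
import math


def standard_error_of_measurement(sd: float, icc: float) -> float:
    """SEM = SD * sqrt(1 - ICC), in the same units as the measurement."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("this formula expects an ICC between 0 and 1")
    return sd * math.sqrt(1.0 - icc)


def smallest_detectable_change(sd: float, icc: float, z: float = 1.96) -> float:
    """SDC = z * sqrt(2) * SEM; z = 1.96 corresponds to a 95% level."""
    return z * math.sqrt(2.0) * standard_error_of_measurement(sd, icc)


# Hypothetical example: scores with SD = 5.0 units and ICC = 0.84
sem = standard_error_of_measurement(5.0, 0.84)
sdc = smallest_detectable_change(5.0, 0.84)
print(round(sem, 2), round(sdc, 2))  # 2.0 5.54
```

Under these assumptions, a change smaller than about 5.5 units in an individual could not be distinguished from measurement error, even though an ICC of 0.84 falls in the "good" band.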
