Multiclass G-Mean Calculator

Calculate the geometric mean (G-Mean) of per-class recall for multiclass classification


Introduction & Importance of Multiclass G-Mean

The geometric mean (G-Mean) for multiclass classification is a performance metric that summarizes per-class recall in a single balanced score, which is particularly useful on imbalanced datasets. Unlike overall accuracy, which is dominated by the largest classes, G-Mean gives every class's recall equal weight, so it remains informative in real-world scenarios where class imbalance is common.

[Figure: multiclass classification evaluation metrics and the G-Mean calculation process]

In machine learning applications ranging from medical diagnosis to fraud detection, G-Mean offers several advantages:

Balanced Evaluation: Provides equal importance to both majority and minority classes
Robustness: Less inflated by a few very high class scores than the arithmetic mean, while remaining sensitive to very low ones
Interpretability: Directly relates to the product of the class-wise recall values
Comparability: Allows fair comparison between models on imbalanced datasets

How to Use This Calculator

Follow these step-by-step instructions to calculate the multiclass G-Mean for your classification model:

Select Number of Classes: Choose how many classes your classification problem contains (2-6 classes supported)
Enter Class Metrics: For each class, input either:
- True Positives (TP) and False Negatives (FN) -or-
- Recall/Sensitivity values directly
Calculate: Click the “Calculate G-Mean” button to compute the result
Interpret Results: Review the G-Mean value and interpretation guidance
Visual Analysis: Examine the chart showing class-wise performance distribution

Pro Tip: For imbalanced datasets, focus on improving the recall of minority classes to boost your overall G-Mean score. Even small improvements in underrepresented classes can significantly impact the geometric mean.

Formula & Methodology

The multiclass G-Mean is calculated using the following mathematical formulation:

$$\text{G-Mean} = \left( \prod_{i=1}^{n} \text{Recall}_i \right)^{1/n}$$

Where:

n = number of classes
Recall_i = recall/sensitivity for class i (TP_i / (TP_i + FN_i))
∏ = product operator (multiplication of all terms)

The calculation process involves these key steps:

Recall Calculation: For each class, compute recall as TP/(TP+FN)
Product Computation: Multiply all class recall values together
Root Extraction: Take the nth root of the product (where n = number of classes)
Range Check: Because every recall lies between 0 and 1, the result also lies between 0 and 1 without further normalization

Unlike arithmetic mean which sums values and divides by count, geometric mean multiplies values and takes the root. This makes G-Mean particularly sensitive to low values – if any class has poor recall, the overall G-Mean will be significantly reduced, which is desirable for identifying weak points in multiclass classification.
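The steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming per-class counts are supplied as (TP, FN) pairs; the function name `gmean_from_counts` and the sample counts are hypothetical and not taken from the calculator itself.

```python
import math

def gmean_from_counts(counts):
    """Multiclass G-Mean from a list of (TP, FN) pairs, one pair per class."""
    recalls = []
    for tp, fn in counts:
        total = tp + fn
        if total == 0:
            raise ValueError("each class needs at least one actual instance")
        recalls.append(tp / total)           # Recall_i = TP_i / (TP_i + FN_i)
    product = math.prod(recalls)             # multiply all class recalls
    return product ** (1.0 / len(recalls))   # n-th root of the product

# Hypothetical three-class problem with recalls 0.8, 0.9 and 0.5
example = [(80, 20), (45, 5), (30, 30)]
print(round(gmean_from_counts(example), 4))
```

A class with no actual instances has an undefined recall, so the sketch raises an error rather than guessing a value; a real tool might prefer to skip such a class instead.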

Real-World Examples

Example 1: Medical Diagnosis (3 Classes)

A cancer detection system classifying tumors as benign, malignant, or uncertain with the following per-class results:

Class	TP	FN	Recall
Benign	450	50	0.90
Malignant	180	20	0.90
Uncertain	90	60	0.60

Calculation: G-Mean = (0.90 × 0.90 × 0.60)^(1/3) = 0.486^(1/3) ≈ 0.7862

Interpretation: The uncertain class is dragging down overall performance. Focus on improving recall for uncertain cases through better feature engineering or class-specific model tuning.

Example 2: Fraud Detection (4 Classes)

A financial fraud detection system with these performance metrics:

Fraud Type	TP	FN	Recall
Credit Card	2,450	50	0.98
Identity Theft	850	150	0.85
Money Laundering	45	55	0.45
Account Takeover	320	80	0.80

Calculation: G-Mean = (0.98 × 0.85 × 0.45 × 0.80)^(1/4) = 0.29988^(1/4) ≈ 0.7400

Interpretation: Money laundering detection is the critical weakness. The G-Mean reveals that improving this class should be the top priority, even though other classes perform well.

Example 3: Customer Segmentation (5 Classes)

An e-commerce customer classification model with these results:

Segment	TP	FN	Recall
High Value	1,200	300	0.80
Medium Value	3,500	500	0.88
Low Value	8,000	2,000	0.80
Churn Risk	450	50	0.90
New Customers	950	50	0.95

Calculation: G-Mean = (0.80 × 0.88 × 0.80 × 0.90 × 0.95)^(1/5) = 0.481536^(1/5) ≈ 0.8640

Interpretation: The model performs consistently across segments. The G-Mean confirms there are no severe imbalances, though low-value customer recall could be improved for better marketing efficiency.

Data & Statistics

Comparison of Evaluation Metrics for Imbalanced Datasets

Metric	Balanced Data	Imbalanced Data (10:1)	Imbalanced Data (100:1)	Strengths	Weaknesses
Accuracy	95%	91%	99%	Easy to understand	Misleading for imbalanced data
F1 Score	0.95	0.45	0.09	Balances precision/recall	Still biased toward majority class
G-Mean	0.95	0.63	0.30	Sensitive to minority classes	Can be too strict
AUC-ROC	0.98	0.87	0.85	Threshold-independent	Can be optimistic
MCC	0.90	0.42	0.10	Considers all confusion matrix elements	Hard to interpret

This comparison demonstrates why G-Mean is particularly valuable for imbalanced multiclass problems. While accuracy remains high even with severe imbalance, G-Mean dramatically drops, properly reflecting the model’s struggle with minority classes.
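The gap between accuracy and G-Mean is easiest to see with a degenerate model. The sketch below assumes a hypothetical 100:1 dataset and a classifier that always predicts the majority class; the labels and the helper name `accuracy_and_gmean` are illustrative only.

```python
import math

def accuracy_and_gmean(y_true, y_pred):
    """Return (accuracy, G-Mean) for two equal-length label sequences."""
    classes = sorted(set(y_true))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    recalls = []
    for c in classes:
        actual = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / actual)        # recall of class c
    gmean = math.prod(recalls) ** (1.0 / len(recalls))
    return correct / len(y_true), gmean

# 100:1 imbalance and a model that always predicts the majority class
y_true = ["majority"] * 1000 + ["minority"] * 10
y_pred = ["majority"] * 1010
acc, gmean = accuracy_and_gmean(y_true, y_pred)
print(f"accuracy={acc:.3f}, G-Mean={gmean:.3f}")  # accuracy=0.990, G-Mean=0.000
```

Accuracy rewards the model for the 1,000 easy cases, while the zero recall on the minority class drives the G-Mean to 0.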

G-Mean Benchmarks by Industry

Industry/Application	Excellent	Good	Fair	Poor	Typical Class Imbalance
Medical Diagnosis	92%	85%	75%	65%	1:10 to 1:100
Fraud Detection	88%	78%	68%	55%	1:100 to 1:1000
Manufacturing QA	95%	90%	82%	70%	1:5 to 1:50
Customer Segmentation	85%	75%	65%	55%	1:2 to 1:20
Network Intrusion	90%	80%	70%	60%	1:50 to 1:500

These benchmarks help contextualize your G-Mean results. For instance, a G-Mean of 0.75 falls below the fair level for manufacturing quality assurance (82%) but close to the good level for fraud detection (78%), where class imbalance can reach 1:1000.

Expert Tips for Improving Multiclass G-Mean

Model Optimization Strategies

Class Weighting: Implement class-weighted loss functions to give more importance to minority classes during training
- In scikit-learn: class_weight='balanced'
- In TensorFlow/Keras: pass class_weight={0: 1, 1: 10, 2: 5} to model.fit()
Resampling Techniques: Apply appropriate sampling methods
- Oversampling minority classes (SMOTE, ADASYN)
- Undersampling majority classes (Random, Tomek links)
- Hybrid approaches (SMOTE + ENN)
Threshold Adjustment: Optimize decision thresholds per-class rather than using default 0.5
- Use precision-recall curves to find optimal thresholds
- Implement cost-sensitive learning
Feature Engineering: Create features that better distinguish minority classes
- Interaction terms between rare features
- Class-specific feature transformations
- Anomaly detection features for rare classes
Ensemble Methods: Leverage ensemble techniques that naturally handle imbalance
- Balanced Random Forest
- Easy Ensemble
- RUSBoost
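To make the class-weighting strategy above concrete, the following sketch reproduces the heuristic that scikit-learn documents for class_weight='balanced', namely n_samples / (n_classes * class_count). The function name `balanced_class_weights` and the label counts are hypothetical; this is a standard-library illustration, not the library's own code.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Hypothetical 3-class label vector with a 90:9:1 distribution
labels = [0] * 900 + [1] * 90 + [2] * 10
weights = balanced_class_weights(labels)
for cls, w in sorted(weights.items()):
    print(cls, round(w, 3))
```

The rarest class receives the largest weight, so each of its errors costs more during training. A dictionary of this shape can typically be passed wherever a framework accepts explicit per-class weights.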

Evaluation Best Practices

Stratified Cross-Validation: Always use stratified k-fold CV to maintain class distribution in splits
Multiple Metrics: Track G-Mean alongside precision, recall, and F1 for comprehensive evaluation
Confidence Intervals: Calculate confidence intervals for G-Mean to assess statistical significance
Class-Specific Analysis: Examine which classes contribute most to low G-Mean scores
Business Context: Consider class importance when interpreting G-Mean (not all classes are equally critical)
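The idea behind stratified splitting can be sketched without any library. The version below is a simplified illustration, assuming labels are given as a plain list; `stratified_folds` is a hypothetical helper, and in practice a tested implementation such as scikit-learn's StratifiedKFold is the usual choice.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# Hypothetical 5:1 imbalance: every fold should hold 10 "a" and 2 "b" samples
labels = ["a"] * 50 + ["b"] * 10
folds = stratified_folds(labels, k=5)
for fold in folds:
    print(sum(labels[i] == "b" for i in fold), "minority samples of", len(fold))
```

Because every fold contains minority samples, the per-class recall, and therefore the G-Mean, is defined in each validation split.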

Common Pitfalls to Avoid

Ignoring Class Distribution: Assuming default metrics work for imbalanced data
Overfitting to Minority Classes: Sacrificing majority class performance too much
Data Leakage: Applying resampling before the train-test split, which lets copies of test samples leak into the training data
Neglecting Feature Scaling: Some algorithms (like SVM) are sensitive to feature scales
Using Inappropriate Metrics: Relying on accuracy for imbalanced multiclass problems
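The resampling pitfall is easier to avoid when the order of operations is explicit. The sketch below assumes simple random oversampling and a hypothetical dataset of (feature, label) pairs; `random_oversample` is an illustrative helper, not a library function.

```python
import random

def random_oversample(rows, seed=0):
    """Duplicate minority rows at random until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for features, label in rows:
        by_class.setdefault(label, []).append((features, label))
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

# Hypothetical data with a 9:1 imbalance
neg = [(i, "neg") for i in range(90)]
pos = [(i, "pos") for i in range(90, 100)]

# 1. Split first, so the test set keeps the original class distribution
train = neg[:72] + pos[:8]
test = neg[72:] + pos[8:]

# 2. Resample the training portion only; the test rows are never duplicated
train_balanced = random_oversample(train)
print(len(train_balanced), "training rows,", len(test), "untouched test rows")
```

Had the oversampling been applied before the split, duplicates of the same minority row could land on both sides and inflate the measured recall.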

Interactive FAQ

What’s the difference between G-Mean and F1 score for multiclass problems?

While both metrics consider class imbalance, they differ fundamentally in their calculation and interpretation:

F1 Score: Harmonic mean of precision and recall, computed per class. For multiclass problems it is aggregated as a macro, micro, or weighted average.
G-Mean: Geometric mean of recall values (multiplicative relationship). Always considers all classes equally regardless of size.

Key difference: G-Mean is more sensitive to poor performance in any single class. For example, if one class has 0 recall, G-Mean becomes 0 regardless of other classes’ performance, while macro F1 would only drop by that class’s share of the average.

Use F1 when you care about the balance between precision and recall. Use G-Mean when you need to ensure no class performs poorly, regardless of its size.
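A small numeric comparison may help here. The sketch assumes a hypothetical 3-class confusion matrix (rows are actual classes, columns are predicted classes) in which the third class is never recognized; the helper `per_class_scores` is illustrative.

```python
import math

def per_class_scores(cm):
    """Per-class (recall, F1) from a square confusion matrix (rows = actual)."""
    n = len(cm)
    scores = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp
        fp = sum(cm[r][i] for r in range(n)) - tp
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        scores.append((recall, f1))
    return scores

# Hypothetical matrix: class 2 is always misclassified
cm = [
    [90, 10, 0],
    [5, 95, 0],
    [10, 10, 0],
]
scores = per_class_scores(cm)
macro_f1 = sum(f1 for _, f1 in scores) / len(scores)
gmean = math.prod(r for r, _ in scores) ** (1.0 / len(scores))
print(f"macro F1={macro_f1:.3f}, G-Mean={gmean:.3f}")
```

Macro F1 stays well above zero because two of the three classes score highly, whereas the single zero recall collapses the G-Mean to 0.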

How does G-Mean handle classes with zero recall?

G-Mean has a critical property regarding zero values: if any class has zero recall (meaning the model failed to identify any instances of that class), the entire G-Mean becomes zero. This mathematical property comes from the geometric mean’s definition as the nth root of the product of values.

Practical implications:

Strength: Forces attention to classes that are completely missed
Weakness: Can be overly punitive in cases where near-zero recall might be acceptable
Solution: Consider adding a small epsilon value (e.g., 1e-10) to avoid numerical instability while maintaining the metric’s spirit

This behavior makes G-Mean particularly valuable for applications where missing any class is catastrophic (e.g., medical diagnosis where failing to detect a rare but serious condition is unacceptable).
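One way to apply the epsilon idea is sketched below. Note the assumption: the epsilon is used as a floor that replaces zero recalls, rather than being added to every recall, so that perfect recalls never exceed 1. The function name `gmean_with_floor` is hypothetical.

```python
def gmean_with_floor(recalls, epsilon=1e-10):
    """G-Mean in which recalls below epsilon are replaced by epsilon."""
    floored = [max(r, epsilon) for r in recalls]
    product = 1.0
    for r in floored:
        product *= r
    return product ** (1.0 / len(floored))

recalls = [0.9, 0.9, 0.0]
corrected = gmean_with_floor(recalls)             # small but non-zero
uncorrected = gmean_with_floor(recalls, 0.0)      # exactly 0.0
print(corrected, uncorrected)
```

With such a small floor the score stays very close to zero, so the missed class is still flagged, but models that miss one class can now be ranked against each other.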

Can G-Mean be used for binary classification?

Yes, G-Mean can absolutely be used for binary classification, where it becomes particularly elegant. For two classes, the G-Mean is simply the square root of the product of the two class recalls:

G-Mean = √(Recall_positive × Recall_negative)

In binary cases, G-Mean is equivalent to:

The geometric mean of sensitivity and specificity
The square root of the product of true positive rate and true negative rate

Many researchers prefer G-Mean over plain accuracy for imbalanced binary problems because it gives equal weight to both classes and is not subject to the “accuracy paradox”, where high accuracy masks poor minority class performance.
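For the binary case, the formula can be sketched directly from the four confusion-matrix counts. The function name `binary_gmean` and the counts are hypothetical.

```python
import math

def binary_gmean(tp, fn, tn, fp):
    """G-Mean for two classes: sqrt(sensitivity * specificity)."""
    sensitivity = tp / (tp + fn)   # recall of the positive class (TPR)
    specificity = tn / (tn + fp)   # recall of the negative class (TNR)
    return math.sqrt(sensitivity * specificity)

# Hypothetical counts: 40 of 50 positives and 900 of 1000 negatives recovered
result = binary_gmean(tp=40, fn=10, tn=900, fp=100)
print(round(result, 4))
```

Here sensitivity is 0.80 and specificity is 0.90, so the result is the square root of 0.72.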

How should I interpret different G-Mean value ranges?

G-Mean interpretation depends on your specific problem domain, but here’s a general guideline for multiclass problems:

G-Mean Range	Interpretation	Recommended Action
0.90 – 1.00	Excellent performance across all classes	Model is well-balanced; focus on deployment
0.80 – 0.90	Good overall performance	Identify and improve weakest 1-2 classes
0.70 – 0.80	Fair performance with notable imbalances	Apply class-specific optimization techniques
0.60 – 0.70	Poor performance – significant class imbalance	Major model redesign needed; consider resampling
Below 0.60	Very poor – model fails on multiple classes	Re-evaluate feature selection and algorithm choice

Remember that these are general guidelines. In domains like fraud detection where class imbalance might be 1:1000, a G-Mean of 0.7 might be considered acceptable or even good, while in balanced medical diagnosis, 0.7 would be unacceptable.

What are the limitations of G-Mean?

While G-Mean is a powerful metric for imbalanced multiclass problems, it has several important limitations:

Sensitivity to Zero Values: As mentioned, any zero recall makes G-Mean zero, which can be overly punitive in some scenarios.
Ignores Precision: G-Mean is built only from recall (true positive rate), so a class’s false positives are not measured directly; they appear only as false negatives of other classes.
Equal Class Importance: Treats all classes equally, which may not align with business priorities.
Numerical Instability: Can underflow with many classes or very small values.
Threshold Dependency: Like all recall-based metrics, it depends on classification thresholds.
No Probability Information: Doesn’t consider confidence scores, only hard classifications.
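The numerical-instability point can be addressed by working in log space, since the geometric mean equals the exponential of the mean log-recall. The sketch below is a minimal illustration with a deliberately extreme, hypothetical input; `gmean_log_space` is an assumed name.

```python
import math

def gmean_log_space(recalls):
    """G-Mean computed as exp(mean(log(recall))) to avoid product underflow."""
    if any(r == 0 for r in recalls):
        return 0.0                      # log(0) is undefined; the G-Mean is 0
    log_sum = math.fsum(math.log(r) for r in recalls)
    return math.exp(log_sum / len(recalls))

# Deliberately extreme case: 400 classes, each with recall 0.1
recalls = [0.1] * 400
naive = math.prod(recalls) ** (1.0 / len(recalls))   # product underflows to 0.0
stable = gmean_log_space(recalls)                     # stays at about 0.1
print(naive, stable)
```

The naive product falls below the smallest representable double and becomes 0.0, which would wrongly suggest that some class has zero recall.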

Best practice: Use G-Mean alongside other metrics like:

Macro F1-score (for precision-recall balance)
MCC (Matthews Correlation Coefficient for overall quality)
Class-specific precision (to understand false positive rates)

Are there alternatives to G-Mean for multiclass imbalance?

Several alternative metrics address multiclass imbalance, each with different strengths:

Metric	Formula	When to Use	Limitations
Macro F1	Mean of per-class F1 scores	When you need balance between precision and recall	Can be dominated by classes with more false positives
MCC	Correlation between observed and predicted	When you want a single metric considering all confusion matrix elements	Hard to interpret; ranges from -1 to 1
Cohen’s Kappa	Agreement adjusted for chance	When you need to account for random agreement	Less intuitive for imbalanced data
Balanced Accuracy	Mean of per-class recalls	When you want simple recall average	Arithmetic mean, so one very low recall can be masked; equals G-Mean only when all class recalls are equal
Fβ Score	Weighted harmonic mean (adjustable β)	When you need to emphasize recall over precision (β>1)	Requires choosing β parameter

Recommendation: For most imbalanced multiclass problems, track G-Mean alongside MCC and macro F1 for comprehensive evaluation. The choice depends on whether you prioritize:

Recall focus: G-Mean or Balanced Accuracy
Precision-recall balance: Macro F1
Overall correlation: MCC
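The two recall-focused options can be compared on the same inputs. This sketch assumes hypothetical recall vectors; balanced accuracy is the arithmetic mean of the recalls and G-Mean is their geometric mean.

```python
import math

def balanced_accuracy(recalls):
    """Arithmetic mean of the per-class recalls."""
    return sum(recalls) / len(recalls)

def gmean(recalls):
    """Geometric mean of the per-class recalls."""
    return math.prod(recalls) ** (1.0 / len(recalls))

# Hypothetical recall vectors: equal recalls, a lopsided binary case, three classes
cases = ([0.8, 0.8], [0.99, 0.20], [0.9, 0.9, 0.1])
for recalls in cases:
    print(recalls, round(balanced_accuracy(recalls), 3), round(gmean(recalls), 3))
```

The geometric mean never exceeds the arithmetic mean, and the gap widens as the recalls become more uneven, which is why G-Mean penalizes a single weak class more heavily.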

How can I calculate confidence intervals for G-Mean?

Confidence intervals for G-Mean are usually estimated by bootstrapping or another resampling technique, because the metric is a non-linear function of the class recalls. Here’s a practical approach:

Bootstrap Sampling: Create B bootstrap samples (typically B=1000) by resampling your test set with replacement
Calculate G-Mean: Compute G-Mean for each bootstrap sample
Sort Results: Sort the B G-Mean values in ascending order
Determine Interval: For a 95% CI, take the 2.5th and 97.5th percentiles (approximately the 25th and 975th sorted values for B=1000)

Python implementation example:

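import numpy as np
from sklearn.metrics import recall_score
from sklearn.utils import resample

def calculate_gmean(y_true, y_pred, labels=None):
    # Per-class recall, then the n-th root of the product
    recalls = recall_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

def bootstrap_gmean_ci(y_true, y_pred, n_bootstraps=1000, ci=95, random_state=0):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    labels = np.unique(y_true)  # score every class, even if a resample misses it
    rng = np.random.RandomState(random_state)
    gmeans = []
    for _ in range(n_bootstraps):
        # Resample (true, predicted) pairs together, with replacement
        y_true_resampled, y_pred_resampled = resample(y_true, y_pred, random_state=rng)
        # Calculate G-Mean for this sample
        gmeans.append(calculate_gmean(y_true_resampled, y_pred_resampled, labels=labels))

    # np.percentile sorts internally, so no explicit sort is needed
    lower = (100 - ci) / 2
    upper = 100 - lower
    return np.percentile(gmeans, lower), np.percentile(gmeans, upper)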

Alternative approaches include:

Delta Method: For large samples, but requires complex derivative calculations
Bayesian Methods: If you have prior distributions for your recall estimates
Normal Approximation: For very large samples where CLT might apply

Confidence intervals help assess whether observed differences in G-Mean between models are statistically significant, which is crucial for proper model comparison.


[Figure: G-Mean compared with other multiclass evaluation metrics, with their formulations and appropriate use cases]
