Multiclass G-Mean Calculator
Calculate the geometric mean (G-Mean) of per-class recall for multiclass classification
Introduction & Importance of Multiclass G-Mean
The geometric mean (G-Mean) for multiclass classification is a performance metric that summarizes how well a model recognizes every class, which matters most when dealing with imbalanced datasets. Because it is computed from per-class recall rather than from overall counts, the majority class cannot dominate the score, making G-Mean an essential tool for evaluating classification models in real-world scenarios where class imbalance is common.
In machine learning applications ranging from medical diagnosis to fraud detection, G-Mean offers several advantages:
- Balanced Evaluation: Provides equal importance to both majority and minority classes
- Robustness to Class Size: Class frequencies do not enter the formula, so a large majority class cannot mask a poorly recognized minority class
- Interpretability: Directly relates to the product of the class-wise recall values
- Comparability: Allows fair comparison between models on imbalanced datasets
How to Use This Calculator
Follow these step-by-step instructions to calculate the multiclass G-Mean for your classification model:
- Select Number of Classes: Choose how many classes your classification problem contains (2-6 classes supported)
- Enter Class Metrics: For each class, input either:
  - True Positives (TP) and False Negatives (FN), or
  - Recall/Sensitivity values directly
- Calculate: Click the “Calculate G-Mean” button to compute the result
- Interpret Results: Review the G-Mean value and interpretation guidance
- Visual Analysis: Examine the chart showing class-wise performance distribution
Pro Tip: For imbalanced datasets, focus on improving the recall of minority classes to boost your overall G-Mean score. Even small improvements in underrepresented classes can significantly impact the geometric mean.
Formula & Methodology
The multiclass G-Mean is calculated using the following mathematical formulation:
$$\text{G-Mean} = \prod_{i=1}^{n} \left(\text{Recall}_i\right)^{1/n}$$
Where:
- $n$ = number of classes
- $\text{Recall}_i$ = recall/sensitivity for class $i$, that is $TP_i / (TP_i + FN_i)$
- $\prod$ = product operator (multiplication of all terms)
The calculation process involves these key steps:
- Recall Calculation: For each class, compute recall as TP/(TP+FN)
- Product Computation: Multiply all class recall values together
- Root Extraction: Take the nth root of the product (where n = number of classes)
- Range Check: Confirm the result falls between 0 and 1, which it always does when every recall value is itself between 0 and 1
Unlike the arithmetic mean, which sums values and divides by the count, the geometric mean multiplies values and takes the root. This makes G-Mean particularly sensitive to low values: if any class has poor recall, the overall G-Mean drops sharply, which is desirable for identifying weak points in multiclass classification.
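The calculation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual implementation: the function names `recall` and `g_mean` and the per-class counts are assumptions made up for the example.

```python
import math


def recall(tp, fn):
    """Recall for one class: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("class has no positive samples")
    return tp / (tp + fn)


def g_mean(recalls):
    """n-th root of the product of the per-class recall values."""
    if not recalls:
        raise ValueError("at least one class is required")
    return math.prod(recalls) ** (1.0 / len(recalls))


# Hypothetical per-class (TP, FN) counts, chosen only for illustration
counts = [(90, 10), (80, 20), (10, 90)]
recalls = [recall(tp, fn) for tp, fn in counts]

arithmetic = sum(recalls) / len(recalls)
geometric = g_mean(recalls)

print(f"recalls:         {recalls}")
print(f"arithmetic mean: {arithmetic:.4f}")
print(f"G-Mean:          {geometric:.4f}")
```

With one weak class in the mix, the geometric mean comes out well below the arithmetic mean of the same recall values, which is the sensitivity to low values described above.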
Real-World Examples
Example 1: Medical Diagnosis (3 Classes)
A cancer detection system classifying tumors as benign, malignant, or uncertain with the following per-class results:
| Class | TP | FN | Recall |
|---|---|---|---|
| Benign | 450 | 50 | 0.90 |
| Malignant | 180 | 20 | 0.90 |
| Uncertain | 90 | 60 | 0.60 |
Calculation: G-Mean = $(0.90 \times 0.90 \times 0.60)^{1/3} = 0.486^{1/3} \approx 0.7862$
Interpretation: The uncertain class is dragging down overall performance. Focus on improving recall for uncertain cases through better feature engineering or class-specific model tuning.
Example 2: Fraud Detection (4 Classes)
A financial fraud detection system with these performance metrics:
| Fraud Type | TP | FN | Recall |
|---|---|---|---|
| Credit Card | 2,450 | 50 | 0.98 |
| Identity Theft | 850 | 150 | 0.85 |
| Money Laundering | 45 | 55 | 0.45 |
| Account Takeover | 320 | 80 | 0.80 |
Calculation: G-Mean = $(0.98 \times 0.85 \times 0.45 \times 0.80)^{1/4} = 0.29988^{1/4} \approx 0.7400$
Interpretation: Money laundering detection is the critical weakness. The G-Mean reveals that improving this class should be the top priority, even though other classes perform well.
Example 3: Customer Segmentation (5 Classes)
An e-commerce customer classification model with these results:
| Segment | TP | FN | Recall |
|---|---|---|---|
| High Value | 1,200 | 300 | 0.80 |
| Medium Value | 3,500 | 500 | 0.88 |
| Low Value | 8,000 | 2,000 | 0.80 |
| Churn Risk | 450 | 50 | 0.90 |
| New Customers | 950 | 50 | 0.95 |
Calculation: G-Mean = $(0.80 \times 0.88 \times 0.80 \times 0.90 \times 0.95)^{1/5} \approx 0.8640$
Interpretation: The model performs consistently across segments. The G-Mean confirms there are no severe imbalances, though recall for the high-value and low-value segments (both 0.80) could be improved for better marketing efficiency.
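To see why the weakest class deserves attention first, a small what-if analysis can help. The sketch below takes the recall values from the fraud-detection table in Example 2 and asks how much the G-Mean would rise if a single class gained 0.10 recall. The 0.10 step is an arbitrary assumption for illustration.

```python
import math


def g_mean(recalls):
    """n-th root of the product of the per-class recall values."""
    return math.prod(recalls) ** (1.0 / len(recalls))


# Recall values from the fraud-detection table in Example 2
recalls = {
    "Credit Card": 0.98,
    "Identity Theft": 0.85,
    "Money Laundering": 0.45,
    "Account Takeover": 0.80,
}

baseline = g_mean(list(recalls.values()))
print(f"baseline G-Mean: {baseline:.4f}")

# Hypothetical what-if: add 0.10 recall to one class at a time (capped at 1.0)
gains = {}
for name in recalls:
    trial = dict(recalls)
    trial[name] = min(1.0, trial[name] + 0.10)
    gains[name] = g_mean(list(trial.values())) - baseline
    print(f"+0.10 on {name:<17} -> G-Mean gain {gains[name]:+.4f}")

best = max(gains, key=gains.get)
print(f"largest gain: {best}")
```

Because the metric is multiplicative, the same absolute improvement is worth more on a class with low recall than on one that is already near 1.0.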
Data & Statistics
Comparison of Evaluation Metrics for Imbalanced Datasets
| Metric | Balanced Data | Imbalanced Data (10:1) | Imbalanced Data (100:1) | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Accuracy | 95% | 91% | 99% | Easy to understand | Misleading for imbalanced data |
| F1 Score | 0.95 | 0.45 | 0.09 | Balances precision/recall | Still biased toward majority class |
| G-Mean | 0.95 | 0.63 | 0.30 | Sensitive to minority classes | Can be too strict |
| AUC-ROC | 0.98 | 0.87 | 0.85 | Threshold-independent | Can be optimistic |
| MCC | 0.90 | 0.42 | 0.10 | Considers all confusion matrix elements | Hard to interpret |
This comparison demonstrates why G-Mean is particularly valuable for imbalanced multiclass problems. While accuracy remains high even with severe imbalance, G-Mean dramatically drops, properly reflecting the model’s struggle with minority classes.
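The gap between accuracy and G-Mean is easiest to see with a degenerate model. The following sketch uses a synthetic 100:1 data set and a model that always predicts the majority class. The data and the helper names are assumptions made up for this illustration and are unrelated to the figures in the table above.

```python
import math


def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label present in y_true."""
    recalls = {}
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls[label] = hits / total
    return recalls


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recall values."""
    values = list(per_class_recall(y_true, y_pred).values())
    return math.prod(values) ** (1.0 / len(values))


# Synthetic 100:1 data set: 1000 majority samples (0) and 10 minority samples (1)
y_true = [0] * 1000 + [1] * 10
# A degenerate model that always predicts the majority class
y_pred = [0] * len(y_true)

acc = accuracy(y_true, y_pred)
gm = g_mean(y_true, y_pred)
print(f"accuracy: {acc:.4f}")
print(f"G-Mean:   {gm:.4f}")
```

Accuracy stays close to 1 because almost every sample belongs to the majority class, while the G-Mean collapses to 0 because the minority class is never detected.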
G-Mean Benchmarks by Industry
| Industry/Application | Excellent (>0.90) | Good (0.80-0.90) | Fair (0.70-0.80) | Poor (<0.70) | Typical Class Imbalance |
|---|---|---|---|---|---|
| Medical Diagnosis | 92% | 85% | 75% | 65% | 1:10 to 1:100 |
| Fraud Detection | 88% | 78% | 68% | 55% | 1:100 to 1:1000 |
| Manufacturing QA | 95% | 90% | 82% | 70% | 1:5 to 1:50 |
| Customer Segmentation | 85% | 75% | 65% | 55% | 1:2 to 1:20 |
| Network Intrusion | 90% | 80% | 70% | 60% | 1:50 to 1:500 |
These benchmarks help contextualize your G-Mean results. For instance, a G-Mean of 0.75 falls below the fair level listed for manufacturing quality assurance (82%) but sits close to the good level listed for fraud detection (78%), where class imbalance can reach 1:1000.
Expert Tips for Improving Multiclass G-Mean
Model Optimization Strategies
- Class Weighting: Implement class-weighted loss functions to give more importance to minority classes during training
  - In scikit-learn: class_weight='balanced'
  - In TensorFlow: class_weight={0:1, 1:10, 2:5}
- Resampling Techniques: Apply appropriate sampling methods
  - Oversampling minority classes (SMOTE, ADASYN)
  - Undersampling majority classes (Random, Tomek links)
  - Hybrid approaches (SMOTE + ENN)
- Threshold Adjustment: Optimize decision thresholds per class rather than relying on the default decision rule (a 0.5 cutoff in binary problems, the highest-probability class in multiclass ones)
  - Use precision-recall curves to find optimal thresholds
  - Implement cost-sensitive learning
- Feature Engineering: Create features that better distinguish minority classes
  - Interaction terms between rare features
  - Class-specific feature transformations
  - Anomaly detection features for rare classes
- Ensemble Methods: Leverage ensemble techniques that naturally handle imbalance
  - Balanced Random Forest
  - Easy Ensemble
  - RUSBoost
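As an illustration of the class-weighting strategy above, the sketch below computes inverse-frequency weights by hand using `n_samples / (n_classes * count)`, which is the formula scikit-learn documents for `class_weight='balanced'`. The helper name `balanced_class_weights` and the label counts are assumptions for this example.

```python
from collections import Counter


def balanced_class_weights(y):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in sorted(counts.items())
    }


# Hypothetical imbalanced label vector: 80 / 15 / 5 samples per class
y_train = [0] * 80 + [1] * 15 + [2] * 5

weights = balanced_class_weights(y_train)
for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

A dictionary of this shape (class index mapped to weight) is the same form as the TensorFlow example in the list. Rarer classes receive larger weights, so their errors count for more in the loss.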
Evaluation Best Practices
- Stratified Cross-Validation: Always use stratified k-fold CV to maintain class distribution in splits
- Multiple Metrics: Track G-Mean alongside precision, recall, and F1 for comprehensive evaluation
- Confidence Intervals: Calculate confidence intervals for G-Mean to assess statistical significance
- Class-Specific Analysis: Examine which classes contribute most to low G-Mean scores
- Business Context: Consider class importance when interpreting G-Mean (not all classes are equally critical)
Common Pitfalls to Avoid
- Ignoring Class Distribution: Assuming default metrics work for imbalanced data
- Overfitting to Minority Classes: Sacrificing majority class performance too much
- Data Leakage: Applying resampling before the train-test split, so that copies of test samples (or points synthesized from them) end up in the training set
- Neglecting Feature Scaling: Some algorithms (like SVM) are sensitive to feature scales
- Using Inappropriate Metrics: Relying on accuracy for imbalanced multiclass problems
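The data-leakage pitfall above comes down to the order of operations: split first, then resample only the training portion. The sketch below shows that order with a simple random oversampler written in plain Python. The helper names, the sample IDs standing in for feature vectors, and the 25% test fraction are all assumptions for illustration.

```python
import random
from collections import Counter


def train_test_split_indices(n, test_fraction, rng):
    """Shuffle indices 0..n-1 and return (train_indices, test_indices)."""
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


def random_oversample(samples, labels, rng):
    """Duplicate minority samples at random until every class matches the largest."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(items) for items in by_class.values())
    out_x, out_y = [], []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_x.extend(items + extra)
        out_y.extend([y] * target)
    return out_x, out_y


rng = random.Random(0)
# Hypothetical data: sample IDs stand in for feature vectors
X = list(range(220))
y = [0] * 200 + [1] * 20

# 1) Split first, so that no test sample can be duplicated into training
train_idx, test_idx = train_test_split_indices(len(X), 0.25, rng)
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_test, y_test = [X[i] for i in test_idx], [y[i] for i in test_idx]

# 2) Oversample the training portion only; the test set keeps its distribution
X_res, y_res = random_oversample(X_train, y_train, rng)

print("train before:", dict(Counter(y_train)))
print("train after: ", dict(Counter(y_res)))
print("test:        ", dict(Counter(y_test)))
print("overlap with test:", len(set(X_res) & set(X_test)))
```

Reversing the two steps would let duplicated minority samples land on both sides of the split, which inflates minority-class recall and therefore the measured G-Mean.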
Frequently Asked Questions
What’s the difference between G-Mean and F1 score for multiclass problems?
While both metrics consider class imbalance, they differ fundamentally in their calculation and interpretation:
- F1 Score: Harmonic mean of precision and recall, computed per class. For multiclass problems it is aggregated as a macro, micro, or weighted average.
- G-Mean: Geometric mean of recall values (multiplicative relationship). Always considers all classes equally regardless of size.
Key difference: G-Mean is more sensitive to poor performance in any single class. For example, if one class has 0 recall, G-Mean becomes 0 regardless of other classes’ performance, while macro F1 only loses that class’s share of the average.
Use F1 when you care about the balance between precision and recall. Use G-Mean when you need to ensure no class performs poorly, regardless of its size.
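The contrast can be made concrete with a confusion matrix in which one class is never predicted. The sketch below derives per-class recall, macro F1, and G-Mean from such a matrix. The counts are invented for illustration.

```python
import math

# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class
cm = [
    [90, 10, 0],
    [5, 95, 0],
    [6, 4, 0],  # the third class is never predicted
]

n = len(cm)
recalls, f1s = [], []
for k in range(n):
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(cm[r][k] for r in range(n)) - tp
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    recalls.append(recall)
    f1s.append(f1)

macro_f1 = sum(f1s) / n
g_mean = math.prod(recalls) ** (1.0 / n)

print(f"per-class recall: {[round(r, 3) for r in recalls]}")
print(f"macro F1: {macro_f1:.4f}")
print(f"G-Mean:   {g_mean:.4f}")
```

Macro F1 remains well above zero because two of the three classes are handled well, whereas the G-Mean is exactly zero.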
How does G-Mean handle classes with zero recall?
G-Mean has a critical property regarding zero values: if any class has zero recall (meaning the model failed to identify any instances of that class), the entire G-Mean becomes zero. This mathematical property comes from the geometric mean’s definition as the nth root of the product of values.
Practical implications:
- Strength: Forces attention to classes that are completely missed
- Weakness: Can be overly punitive in cases where near-zero recall might be acceptable
- Solution: Consider adding a small epsilon value (e.g., 1e-10) to zero recalls so that the score can still rank models that miss a class, and report the adjustment, since it changes the metric
This behavior makes G-Mean particularly valuable for applications where missing any class is catastrophic (e.g., medical diagnosis where failing to detect a rare but serious condition is unacceptable).
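The effect of an epsilon adjustment can be sketched as follows. The `correction` argument is a hypothetical parameter introduced for this example, and the two recall vectors are invented. The point is only to show how the substituted value changes what the score can distinguish.

```python
import math


def g_mean(recalls, correction=0.0):
    """Geometric mean of recalls; zero recalls are replaced by `correction`."""
    adjusted = [r if r > 0 else correction for r in recalls]
    return math.prod(adjusted) ** (1.0 / len(adjusted))


model_a = [0.95, 0.90, 0.0]  # misses the third class, strong elsewhere
model_b = [0.40, 0.35, 0.0]  # misses the third class, weak elsewhere

for eps in (0.0, 1e-10, 1e-3):
    a, b = g_mean(model_a, eps), g_mean(model_b, eps)
    print(f"correction={eps:g}: model A={a:.6f}, model B={b:.6f}")
```

Without a correction both models score 0 and cannot be ranked. With one, the stronger model scores higher, but the absolute values depend heavily on the chosen epsilon, so it should be fixed in advance and reported with the result.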
Can G-Mean be used for binary classification?
Yes, G-Mean can absolutely be used for binary classification, where it becomes particularly elegant. For two classes, the G-Mean is simply the square root of the product of the two class recalls:
$$\text{G-Mean} = \sqrt{\text{Recall}_{\text{positive}} \times \text{Recall}_{\text{negative}}}$$
In binary cases, G-Mean is equivalent to:
- The geometric mean of sensitivity and specificity
- The square root of the product of true positive rate and true negative rate
Many researchers prefer G-Mean over ROC AUC for imbalanced binary problems because it gives equal weight to both classes and isn’t affected by the “accuracy paradox” where high accuracy can mask poor minority class performance.
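For the binary case, the formula reduces to a couple of lines. The function name `binary_g_mean` and the confusion-matrix counts below are assumptions for illustration.

```python
import math


def binary_g_mean(tp, fn, tn, fp):
    """Square root of sensitivity (TPR) times specificity (TNR)."""
    sensitivity = tp / (tp + fn)  # recall of the positive class
    specificity = tn / (tn + fp)  # recall of the negative class
    return math.sqrt(sensitivity * specificity)


# Hypothetical confusion-matrix counts
score = binary_g_mean(tp=40, fn=10, tn=900, fp=100)
print(f"G-Mean: {score:.4f}")
```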
How should I interpret different G-Mean value ranges?
G-Mean interpretation depends on your specific problem domain, but here’s a general guideline for multiclass problems:
| G-Mean Range | Interpretation | Recommended Action |
|---|---|---|
| 0.90 – 1.00 | Excellent performance across all classes | Model is well-balanced; focus on deployment |
| 0.80 – 0.90 | Good overall performance | Identify and improve weakest 1-2 classes |
| 0.70 – 0.80 | Fair performance with notable imbalances | Apply class-specific optimization techniques |
| 0.60 – 0.70 | Poor performance – significant class imbalance | Major model redesign needed; consider resampling |
| Below 0.60 | Very poor – model fails on multiple classes | Re-evaluate feature selection and algorithm choice |
Remember that these are general guidelines. In domains like fraud detection where class imbalance might be 1:1000, a G-Mean of 0.7 might be considered excellent, while in balanced medical diagnosis, 0.7 would be unacceptable.
What are the limitations of G-Mean?
While G-Mean is a powerful metric for imbalanced multiclass problems, it has several important limitations:
- Sensitivity to Zero Values: As mentioned, any zero recall makes G-Mean zero, which can be overly punitive in some scenarios.
- Ignores Precision: G-Mean focuses only on recall (true positive rate), completely ignoring false positives.
- Equal Class Importance: Treats all classes equally, which may not align with business priorities.
- Numerical Instability: Can underflow with many classes or very small values.
- Threshold Dependency: Like all recall-based metrics, it depends on classification thresholds.
- No Probability Information: Doesn’t consider confidence scores, only hard classifications.
Best practice: Use G-Mean alongside other metrics like:
- Macro F1-score (for precision-recall balance)
- MCC (Matthews Correlation Coefficient for overall quality)
- Class-specific precision (to understand false positive rates)
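The numerical-instability limitation listed above can usually be sidestepped by working in log space: average the logarithms of the recall values and exponentiate the result. The sketch below compares that approach with the naive product on a deliberately extreme, hypothetical case. The function name `g_mean_log` is an assumption.

```python
import math


def g_mean_log(recalls):
    """Geometric mean via exp(mean(log r)); returns 0.0 if any recall is 0."""
    if any(r == 0 for r in recalls):
        return 0.0
    return math.exp(sum(math.log(r) for r in recalls) / len(recalls))


# Hypothetical extreme case: 2000 classes, each with recall 0.3
recalls = [0.3] * 2000

# The raw product is far below the smallest representable float, so it becomes 0.0
naive = math.prod(recalls) ** (1.0 / len(recalls))
stable = g_mean_log(recalls)

print(f"naive:  {naive}")
print(f"stable: {stable:.4f}")
```

The log-space version recovers the expected value because it never forms the tiny intermediate product. A genuine zero recall is still handled explicitly, since the logarithm of zero is undefined.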
Are there alternatives to G-Mean for multiclass imbalance?
Several alternative metrics address multiclass imbalance, each with different strengths:
| Metric | Formula | When to Use | Limitations |
|---|---|---|---|
| Macro F1 | Mean of per-class F1 scores | When you need balance between precision and recall | Can be dominated by classes with more false positives |
| MCC | Correlation between observed and predicted | When you want a single metric considering all confusion matrix elements | Hard to interpret; ranges from -1 to 1 |
| Cohen’s Kappa | Agreement adjusted for chance | When you need to account for random agreement | Less intuitive for imbalanced data |
| Balanced Accuracy | Arithmetic mean of per-class recalls | When you want simple recall average | Penalizes a single weak class less than G-Mean; equals G-Mean only when all class recalls are equal |
| Fβ Score | Weighted harmonic mean (adjustable β) | When you need to emphasize recall over precision (β>1) | Requires choosing β parameter |
Recommendation: For most imbalanced multiclass problems, track G-Mean alongside MCC and macro F1 for comprehensive evaluation. The choice depends on whether you prioritize:
- Recall focus: G-Mean or Balanced Accuracy
- Precision-recall balance: Macro F1
- Overall correlation: MCC
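The two recall-focused options are related by the inequality of arithmetic and geometric means. Since both are averages of the same per-class recall values, the following holds for any number of classes $n$, with equality only when every class has the same recall:

```latex
\text{G-Mean}
  = \Bigl(\prod_{i=1}^{n} \text{Recall}_i\Bigr)^{1/n}
  \;\le\;
  \frac{1}{n}\sum_{i=1}^{n} \text{Recall}_i
  = \text{Balanced Accuracy}
```

A large gap between the two values is therefore a quick signal that recall is uneven across classes.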
How can I calculate confidence intervals for G-Mean?
Calculating confidence intervals for G-Mean is usually done with bootstrapping or other resampling techniques, because its non-linear form has no simple closed-form interval. Here’s a practical approach:
- Bootstrap Sampling: Create B bootstrap samples (typically B=1000) by resampling your test set with replacement
- Calculate G-Mean: Compute G-Mean for each bootstrap sample
- Sort Results: Sort the B G-Mean values in ascending order
- Determine Interval: For 95% CI, use the 25th and 975th values (for B=1000)
Python implementation example:
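
    import numpy as np
    from sklearn.metrics import recall_score
    from sklearn.utils import resample

    def calculate_gmean(y_true, y_pred, labels=None):
        # Geometric mean of the per-class recalls (0 for classes absent from y_true)
        recalls = recall_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
        return float(np.prod(recalls) ** (1.0 / len(recalls)))

    def bootstrap_gmean_ci(y_true, y_pred, n_bootstraps=1000, ci=95, random_state=None):
        rng = np.random.RandomState(random_state)
        labels = np.unique(y_true)  # keep the class set fixed across resamples
        gmeans = []
        for _ in range(n_bootstraps):
            # Resample (y_true, y_pred) pairs together, with replacement
            y_true_b, y_pred_b = resample(y_true, y_pred, random_state=rng)
            gmeans.append(calculate_gmean(y_true_b, y_pred_b, labels=labels))
        # Percentile interval, e.g. the 2.5th and 97.5th percentiles for ci=95
        lower = (100 - ci) / 2
        upper = 100 - lower
        return np.percentile(gmeans, lower), np.percentile(gmeans, upper)
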
Alternative approaches include:
- Delta Method: For large samples, but requires complex derivative calculations
- Bayesian Methods: If you have prior distributions for your recall estimates
- Normal Approximation: For very large samples where CLT might apply
Confidence intervals help assess whether observed differences in G-Mean between models are statistically significant, which is crucial for proper model comparison.