Machine-Learning/Python/Python-for-ML-and-AI/Evaluating Multi-Class Classification Models in Python.md at main · xbeat/Machine-Learning

Evaluating Multi-Class Classification Models in Python

Slide 1: Introduction to Multi-Class Classification Evaluation Metrics

Multi-class classification is a task where we predict one of several possible classes for each input. Evaluating the performance of such models requires specialized metrics. This presentation will cover key evaluation metrics for multi-class classification, including accuracy, confusion matrix, precision, recall, F1-score, and more advanced measures.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load iris dataset as an example
iris = load_iris()
X, y = iris.data, iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a simple SVM classifier
clf = SVC(kernel='linear', random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Slide 2: Accuracy: The Simplest Metric

Accuracy is the ratio of correct predictions to the total number of predictions. While simple, it can be misleading for imbalanced datasets.

from sklearn.metrics import accuracy_score

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Manual calculation
correct_predictions = sum(y_test == y_pred)
total_predictions = len(y_test)
manual_accuracy = correct_predictions / total_predictions
print(f"Manual Accuracy: {manual_accuracy:.2f}")

Slide 3: Confusion Matrix: The Foundation of Many Metrics

A confusion matrix shows the counts of correct and incorrect predictions for each class, providing a detailed breakdown of model performance.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import seaborn as sns

# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# Print confusion matrix
print("Confusion Matrix:")
print(cm)

Slide 4: Precision: Measure of Exactness

Precision is the ratio of true positive predictions to the total number of positive predictions for a specific class. It answers the question: "Of all instances predicted as positive, how many are actually positive?"

from sklearn.metrics import precision_score

# Calculate precision for each class
precision = precision_score(y_test, y_pred, average=None)

for i, p in enumerate(precision):
    print(f"Precision for class {i}: {p:.2f}")

# Calculate macro-averaged precision
macro_precision = precision_score(y_test, y_pred, average='macro')
print(f"Macro-averaged precision: {macro_precision:.2f}")

Slide 5: Recall: Measure of Completeness

Recall is the ratio of true positive predictions to the total number of actual positive instances for a specific class. It answers the question: "Of all actual positive instances, how many were correctly identified?"

from sklearn.metrics import recall_score

# Calculate recall for each class
recall = recall_score(y_test, y_pred, average=None)

for i, r in enumerate(recall):
    print(f"Recall for class {i}: {r:.2f}")

# Calculate macro-averaged recall
macro_recall = recall_score(y_test, y_pred, average='macro')
print(f"Macro-averaged recall: {macro_recall:.2f}")

Slide 6: F1-Score: Harmonic Mean of Precision and Recall

The F1-score is the harmonic mean of precision and recall, providing a single score that balances both metrics. It's particularly useful when you have an uneven class distribution.

from sklearn.metrics import f1_score

# Calculate F1-score for each class
f1 = f1_score(y_test, y_pred, average=None)

for i, f in enumerate(f1):
    print(f"F1-score for class {i}: {f:.2f}")

# Calculate macro-averaged F1-score
macro_f1 = f1_score(y_test, y_pred, average='macro')
print(f"Macro-averaged F1-score: {macro_f1:.2f}")

Slide 7: Macro vs. Micro Averaging

Macro averaging calculates the metric independently for each class and then takes the average, while micro averaging calculates the metric globally by counting the total true positives, false negatives, and false positives.

from sklearn.metrics import precision_recall_fscore_support

# Calculate macro and micro averaged metrics
macro_p, macro_r, macro_f, _ = precision_recall_fscore_support(y_test, y_pred, average='macro')
micro_p, micro_r, micro_f, _ = precision_recall_fscore_support(y_test, y_pred, average='micro')

print(f"Macro-averaged - Precision: {macro_p:.2f}, Recall: {macro_r:.2f}, F1-score: {macro_f:.2f}")
print(f"Micro-averaged - Precision: {micro_p:.2f}, Recall: {micro_r:.2f}, F1-score: {micro_f:.2f}")

Slide 8: Cohen's Kappa: Agreement Beyond Chance

Cohen's Kappa measures the agreement between two raters, considering the possibility of agreement occurring by chance. In classification, it compares the observed accuracy with the expected accuracy.

from sklearn.metrics import cohen_kappa_score

# Calculate Cohen's Kappa
kappa = cohen_kappa_score(y_test, y_pred)
print(f"Cohen's Kappa: {kappa:.2f}")

# Interpret Kappa
if kappa < 0:
    interpretation = "Poor agreement"
elif kappa < 0.20:
    interpretation = "Slight agreement"
elif kappa < 0.40:
    interpretation = "Fair agreement"
elif kappa < 0.60:
    interpretation = "Moderate agreement"
elif kappa < 0.80:
    interpretation = "Substantial agreement"
else:
    interpretation = "Almost perfect agreement"

print(f"Interpretation: {interpretation}")

Slide 9: Matthews Correlation Coefficient (MCC)

MCC is a balanced measure that can be used even if the classes are of very different sizes. It returns a value between -1 and +1, where +1 represents a perfect prediction, 0 no better than random prediction, and -1 indicates total disagreement.

from sklearn.metrics import matthews_corrcoef

# Calculate Matthews Correlation Coefficient
mcc = matthews_corrcoef(y_test, y_pred)
print(f"Matthews Correlation Coefficient: {mcc:.2f}")

# Interpret MCC
if mcc > 0.7:
    interpretation = "Strong positive relationship"
elif mcc > 0.4:
    interpretation = "Moderate positive relationship"
elif mcc > 0:
    interpretation = "Weak positive relationship"
elif mcc == 0:
    interpretation = "No relationship"
else:
    interpretation = "Negative relationship"

print(f"Interpretation: {interpretation}")

Slide 10: Log Loss (Cross-Entropy Loss)

Log Loss measures the performance of a classification model where the prediction is a probability value between 0 and 1. It increases as the predicted probability diverges from the actual label.

from sklearn.metrics import log_loss
import numpy as np

# Get probability predictions
y_pred_proba = clf.predict_proba(X_test)

# Calculate Log Loss
logloss = log_loss(y_test, y_pred_proba)
print(f"Log Loss: {logloss:.4f}")

# Demonstrate impact of confidence on Log Loss
correct_confident = np.array([[0.05, 0.05, 0.9],   # High confidence, correct
                              [0.05, 0.05, 0.9]])  # High confidence, correct
correct_unsure = np.array([[0.3, 0.3, 0.4],        # Low confidence, correct
                           [0.3, 0.3, 0.4]])       # Low confidence, correct
y_true = [2, 2]  # True labels

print(f"Log Loss (High Confidence): {log_loss(y_true, correct_confident):.4f}")
print(f"Log Loss (Low Confidence): {log_loss(y_true, correct_unsure):.4f}")

Slide 11: ROC AUC for Multi-Class

The Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) can be extended to multi-class problems using one-vs-rest or one-vs-one approaches.

from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# Binarize the output for multi-class ROC AUC
y_test_bin = label_binarize(y_test, classes=np.unique(y))
y_pred_proba = clf.predict_proba(X_test)

# Calculate ROC AUC for each class
roc_auc = roc_auc_score(y_test_bin, y_pred_proba, multi_class='ovr', average='macro')
print(f"Macro-averaged ROC AUC: {roc_auc:.2f}")

# Calculate ROC AUC for each class separately
for i in range(len(np.unique(y))):
    roc_auc = roc_auc_score(y_test_bin[:, i], y_pred_proba[:, i])
    print(f"ROC AUC for class {i}: {roc_auc:.2f}")

Slide 12: Real-Life Example: Handwritten Digit Recognition

Let's evaluate a multi-class classifier for recognizing handwritten digits (0-9) using the MNIST dataset.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load the digits dataset
digits = load_digits()
X, y = digits.data, digits.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Print classification report
print(classification_report(y_test, y_pred))

Slide 13: Real-Life Example: Plant Species Classification

Let's evaluate a multi-class classifier for identifying plant species based on leaf characteristics using the Iris dataset.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train an SVM classifier
clf = SVC(kernel='rbf', random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")

# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix for Iris Species Classification')
plt.show()

Slide 14: Additional Resources

For more information on multi-class classification evaluation metrics, consider exploring the following resources:

"A Survey of Multi-Class Classification Methods" by Carlos N. Silla Jr. and Alex A. Freitas ArXiv URL: https://arxiv.org/abs/1105.0710
"The Foundations of Cost-Sensitive Learning" by Charles Elkan ArXiv URL: https://arxiv.org/abs/1302.3175
"On the Use of the Confusion Matrix for Improving Classification Accuracy" by Nitesh V. Chawla ArXiv URL: https://arxiv.org/abs/1802.07170

These papers provide in-depth discussions on various aspects of multi-class classification evaluation and can help deepen your understanding of the topic.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Evaluating Multi-Class Classification Models in Python

Uh oh!

FilesExpand file tree

Evaluating Multi-Class Classification Models in Python.md

Latest commit

History

Evaluating Multi-Class Classification Models in Python.md

File metadata and controls

Evaluating Multi-Class Classification Models in Python