Skip to content

Latest commit

 

History

History
521 lines (400 loc) · 17.3 KB

File metadata and controls

521 lines (400 loc) · 17.3 KB

Choosing the Right Classification Algorithm in Python

Slide 1: Introduction to Classification Algorithms

Classification is a fundamental task in machine learning where we predict the category of input data. Choosing the right algorithm is crucial for accurate results. This presentation will guide you through various classification algorithms, their strengths, and how to implement them using Python.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# Load iris dataset
iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

# Plot the dataset
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu)
plt.xlabel('Petal length')
plt.ylabel('Petal width')
plt.title('Iris Dataset')
plt.show()

Slide 2: Understanding the Problem

Before choosing a classification algorithm, it's essential to understand your data and problem. Consider factors like the number of classes, dataset size, feature dimensionality, and whether the problem is linear or non-linear.

import pandas as pd

# Load and examine the dataset
data = pd.read_csv('your_dataset.csv')
print(data.head())
print(data.info())
print(data.describe())

# Check class distribution
print(data['target'].value_counts(normalize=True))

# Correlation matrix
correlation_matrix = data.corr()
plt.figure(figsize=(10, 8))
plt.imshow(correlation_matrix, cmap='coolwarm')
plt.colorbar()
plt.xticks(range(len(correlation_matrix.columns)), correlation_matrix.columns, rotation=90)
plt.yticks(range(len(correlation_matrix.columns)), correlation_matrix.columns)
plt.title('Feature Correlation Matrix')
plt.show()

Slide 3: Logistic Regression

Logistic Regression is a simple yet effective algorithm for binary classification problems. It works well for linearly separable classes and provides probabilistic outputs.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Make predictions
y_pred = log_reg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Slide 4: Decision Trees

Decision Trees are versatile algorithms that can handle both binary and multi-class classification problems. They are easy to interpret and can capture non-linear relationships in the data.

from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree

# Train the model
dt_classifier = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_classifier.fit(X_train, y_train)

# Visualize the tree
plt.figure(figsize=(20,10))
plot_tree(dt_classifier, feature_names=iris.feature_names[2:], 
          class_names=iris.target_names, filled=True, rounded=True)
plt.show()

# Make predictions
y_pred = dt_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Slide 5: Random Forests

Random Forests are ensemble learning methods that combine multiple decision trees to create a more robust and accurate model. They work well for high-dimensional data and can handle both linear and non-linear problems.

from sklearn.ensemble import RandomForestClassifier

# Train the model
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Make predictions
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Feature importance
importances = rf_classifier.feature_importances_
feature_names = iris.feature_names[2:]
for name, importance in zip(feature_names, importances):
    print(f"{name}: {importance:.4f}")

Slide 6: Support Vector Machines (SVM)

SVMs are powerful algorithms that work well for both linear and non-linear classification problems. They are particularly effective in high-dimensional spaces and when there's a clear margin of separation between classes.

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model
svm_classifier = SVC(kernel='rbf', C=1.0, random_state=42)
svm_classifier.fit(X_train_scaled, y_train)

# Make predictions
y_pred = svm_classifier.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Slide 7: K-Nearest Neighbors (KNN)

KNN is a simple, intuitive algorithm that classifies a data point based on the majority class of its k nearest neighbors. It works well for low-dimensional spaces and when decision boundaries are irregular.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {'n_neighbors': [3, 5, 7, 9, 11],
              'weights': ['uniform', 'distance'],
              'metric': ['euclidean', 'manhattan']}

# Perform grid search
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best model
best_knn = grid_search.best_estimator_
print(f"Best parameters: {grid_search.best_params_}")

# Make predictions
y_pred = best_knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Slide 8: Naive Bayes

Naive Bayes classifiers are based on Bayes' theorem and assume feature independence. They work well for text classification and when the independence assumption holds reasonably well.

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
import seaborn as sns

# Train the model
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

# Make predictions
y_pred = nb_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

Slide 9: Neural Networks

Neural Networks, particularly Multi-Layer Perceptrons (MLPs), can handle complex, non-linear classification problems. They work well with large datasets and can capture intricate patterns in the data.

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model
mlp_classifier = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=1000, random_state=42)
mlp_classifier.fit(X_train_scaled, y_train)

# Make predictions
y_pred = mlp_classifier.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Learning curve
plt.plot(mlp_classifier.loss_curve_)
plt.title('MLP Learning Curve')
plt.xlabel('Iterations')
plt.ylabel('Loss')
plt.show()

Slide 10: Gradient Boosting Machines

Gradient Boosting Machines, like XGBoost and LightGBM, are powerful ensemble methods that often achieve state-of-the-art performance on various classification tasks.

from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

# Train the model
xgb_classifier = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
xgb_classifier.fit(X_train, y_train)

# Cross-validation
cv_scores = cross_val_score(xgb_classifier, X, y, cv=5)
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean():.2f} (+/- {cv_scores.std() * 2:.2f})")

# Feature importance
feature_importance = xgb_classifier.feature_importances_
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.barh(pos, feature_importance[sorted_idx])
plt.yticks(pos, np.array(iris.feature_names)[sorted_idx])
plt.title('Feature Importance in XGBoost')
plt.show()

Slide 11: Real-life Example: Spam Email Classification

Let's apply our knowledge to a practical problem: classifying emails as spam or not spam. We'll use a bag-of-words approach with logistic regression.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Sample data (replace with your own dataset)
emails = ["Get rich quick!", "Meeting at 3 PM", "Buy now, limited offer", "Project deadline tomorrow"]
labels = [1, 0, 1, 0]  # 1 for spam, 0 for not spam

# Create a pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LogisticRegression())
])

# Train the model
pipeline.fit(emails, labels)

# Predict new emails
new_emails = ["Free money, claim now!", "Don't forget the team lunch"]
predictions = pipeline.predict(new_emails)

for email, prediction in zip(new_emails, predictions):
    print(f"Email: {email}")
    print(f"Prediction: {'Spam' if prediction == 1 else 'Not Spam'}\n")

Slide 12: Real-life Example: Image Classification

Another common application of classification algorithms is image recognition. Let's use a simple CNN to classify handwritten digits from the MNIST dataset.

import tensorflow as tf
from tensorflow.keras import layers, models

# Load and preprocess the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255

# Build the CNN model
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

# Compile and train the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = model.fit(train_images, train_labels, epochs=5, validation_split=0.2)

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=0)
print(f"Test accuracy: {test_acc:.4f}")

# Plot training history
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

Slide 13: Model Evaluation and Selection

When choosing a classification algorithm, it's crucial to evaluate and compare different models. Use techniques like cross-validation, ROC curves, and confusion matrices to assess performance.

from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier

# List of classifiers
classifiers = [
    LogisticRegression(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    SVC(probability=True),
    KNeighborsClassifier(),
    GaussianNB(),
    MLPClassifier()
]

# Cross-validation
for clf in classifiers:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{clf.__class__.__name__}: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

# ROC curve for multi-class
y_bin = label_binarize(y, classes=[0, 1, 2])
n_classes = y_bin.shape[1]

# Train a multi-class classifier
classifier = OneVsRestClassifier(RandomForestClassifier())
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_bin[y_test], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Plot ROC curves
plt.figure()
for i in range(n_classes):
    plt.plot(fpr[i], tpr[i], label=f'ROC curve (class {i}) (area = {roc_auc[i]:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

Slide 14: Hyperparameter Tuning

Fine-tuning your chosen algorithm can significantly improve performance. Use techniques like Grid Search, Random Search, or Bayesian Optimization to find the best hyperparameters.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier

# Define the parameter space
param_dist = {
    'n_estimators': randint(10, 200),
    'max_depth': randint(1, 20),
    'min_samples_split': randint(2, 11),
    'min_samples_leaf': randint(1, 11),
    'max_features': uniform(0, 1)
}

# Create a random forest classifier
rf = RandomForestClassifier(random_state=42)

# Random search
random_search = RandomizedSearchCV(rf, param_distributions=param_dist, 
                                   n_iter=100, cv=5, random_state=42)
random_search.fit(X_train, y_train)

# Print best parameters and score
print("Best parameters:", random_search.best_params_)
print("Best cross-validation score:", random_search.best_score_)

# Evaluate on test set
best_model = random_search.best_estimator_
test_score = best_model.score(X_test, y_test)
print("Test set score with best model:", test_score)

Slide 15: Handling Imbalanced Datasets

In many real-world classification problems, class distributions are imbalanced. This can lead to poor performance on minority classes. Techniques like oversampling, undersampling, or SMOTE can help address this issue.

from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report

# Assume we have an imbalanced dataset X_imb, y_imb
# Apply SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_imb, y_imb)

# Train a classifier on the resampled data
clf = RandomForestClassifier(random_state=42)
clf.fit(X_resampled, y_resampled)

# Make predictions and print classification report
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

Slide 16: Ensemble Methods

Ensemble methods combine multiple models to create a more robust classifier. Techniques like bagging, boosting, and stacking can often outperform individual models.

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Create base classifiers
clf1 = LogisticRegression(random_state=42)
clf2 = DecisionTreeClassifier(random_state=42)
clf3 = SVC(probability=True, random_state=42)

# Create voting classifier
voting_clf = VotingClassifier(
    estimators=[('lr', clf1), ('dt', clf2), ('svc', clf3)],
    voting='soft'
)

# Fit the ensemble
voting_clf.fit(X_train, y_train)

# Evaluate
ensemble_score = voting_clf.score(X_test, y_test)
print(f"Ensemble Score: {ensemble_score:.4f}")

# Compare with individual classifiers
for clf, label in zip([clf1, clf2, clf3, voting_clf], 
                      ['Logistic Regression', 'Decision Tree', 'SVC', 'Ensemble']):
    scores = cross_val_score(clf, X, y, scoring='accuracy', cv=5)
    print(f"{label}: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

Slide 17: Feature Selection and Engineering

Choosing the right features is crucial for building effective classification models. Feature selection helps identify the most relevant features, while feature engineering can create new, more informative features.

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import PolynomialFeatures

# Feature selection
selector = SelectKBest(f_classif, k=2)
X_selected = selector.fit_transform(X, y)

# Print selected feature indices
selected_features = selector.get_support(indices=True)
print("Selected features:", selected_features)

# Feature engineering: polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print("Original feature shape:", X.shape)
print("Polynomial feature shape:", X_poly.shape)

# Train a model with engineered features
clf = LogisticRegression()
clf.fit(X_poly, y)
poly_score = clf.score(poly.transform(X_test), y_test)
print(f"Score with polynomial features: {poly_score:.4f}")

Slide 18: Additional Resources

For further exploration of classification algorithms and machine learning techniques, consider the following resources:

  1. "Pattern Recognition and Machine Learning" by Christopher Bishop ArXiv: https://arxiv.org/abs/2303.10566
  2. "The Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman ArXiv: https://arxiv.org/abs/2305.10786
  3. "Introduction to Machine Learning with Python" by Andreas C. Müller and Sarah Guido
  4. Scikit-learn documentation: https://scikit-learn.org/stable/
  5. "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville ArXiv: https://arxiv.org/abs/2303.10742

These resources provide in-depth coverage of various classification algorithms, their theoretical foundations, and practical implementations.