Skip to content

Latest commit

 

History

History
350 lines (258 loc) · 11.9 KB

File metadata and controls

350 lines (258 loc) · 11.9 KB

Essential Data Visualization Plots for Data Scientists Using Python

Slide 1: Introduction to Essential Data Science Plots

11 Essential Plots for Data Scientists

This presentation covers 11 key plots that data scientists frequently use in their work. We'll explore each plot's purpose, interpretation, and provide Python code to create them. These visualizations are crucial for data exploration, model evaluation, and communicating insights effectively.

Slide 2: KS Plot (Kolmogorov-Smirnov Plot)

KS Plot: Comparing Distributions

The Kolmogorov-Smirnov (KS) plot is used to compare two cumulative distribution functions. It's particularly useful in binary classification problems to assess the model's ability to separate classes. The KS statistic represents the maximum distance between the two distributions.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generate sample data
class_0 = np.random.normal(0, 1, 1000)
class_1 = np.random.normal(2, 1, 1000)

# Calculate KS statistic and p-value
ks_statistic, p_value = stats.ks_2samp(class_0, class_1)

# Create KS plot
plt.figure(figsize=(10, 6))
plt.hist(class_0, bins=50, density=True, alpha=0.5, label='Class 0')
plt.hist(class_1, bins=50, density=True, alpha=0.5, label='Class 1')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title(f'KS Plot (KS Statistic: {ks_statistic:.3f}, p-value: {p_value:.3e})')
plt.legend()
plt.show()

Slide 3: SHAP Plot

SHAP Plot: Explaining Model Predictions

SHAP (SHapley Additive exPlanations) plots help interpret machine learning models by showing the impact of each feature on the model's output. They provide both global feature importance and local explanations for individual predictions.

import shap
import xgboost as xgb
from sklearn.datasets import load_boston

# Load data and train a model
X, y = load_boston(return_X_y=True)
model = xgb.XGBRegressor().fit(X, y)

# Calculate SHAP values
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Create SHAP summary plot
shap.summary_plot(shap_values, X, plot_type="bar")

Slide 4: ROC Curve

ROC Curve: Evaluating Binary Classifiers

The Receiver Operating Characteristic (ROC) curve visualizes the performance of a binary classifier across various thresholds. It plots the True Positive Rate against the False Positive Rate. The Area Under the Curve (AUC) summarizes the model's performance in a single metric.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Generate and split data
X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model and make predictions
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Calculate ROC curve and AUC
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

Slide 5: Precision-Recall Curve

Precision-Recall Curve: Balancing Precision and Recall

The Precision-Recall curve shows the tradeoff between precision and recall for different thresholds. It's particularly useful for imbalanced datasets where ROC curves might be overly optimistic. The Area Under the Precision-Recall Curve (AUC-PR) summarizes the model's performance.

from sklearn.metrics import precision_recall_curve, average_precision_score

# Calculate precision-recall curve
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
avg_precision = average_precision_score(y_test, y_pred_proba)

# Plot precision-recall curve
plt.figure()
plt.step(recall, precision, color='b', alpha=0.2, where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title(f'Precision-Recall Curve: AP={avg_precision:.2f}')
plt.show()

Slide 6: QQ Plot

QQ Plot: Assessing Normality

Quantile-Quantile (QQ) plots compare the distribution of a dataset to a theoretical distribution, typically the normal distribution. They help assess whether a dataset follows a particular distribution and identify deviations from normality.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Generate sample data
data = np.random.normal(0, 1, 1000)

# Create QQ plot
fig = sm.qqplot(data, line='45')
plt.title('QQ Plot')
plt.show()

Slide 7: Cumulative Explained Variance Plot

Cumulative Explained Variance Plot: Dimensionality Reduction

This plot shows the cumulative proportion of variance explained by principal components in Principal Component Analysis (PCA). It helps determine the number of components to retain while preserving most of the data's variance.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load data and perform PCA
digits = load_digits()
pca = PCA().fit(digits.data)

# Calculate cumulative explained variance ratio
cumulative_variance_ratio = np.cumsum(pca.explained_variance_ratio_)

# Plot cumulative explained variance
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cumulative_variance_ratio) + 1), cumulative_variance_ratio, 'bo-')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Cumulative Explained Variance Plot')
plt.grid(True)
plt.show()

Slide 8: Elbow Curve

Elbow Curve: Determining Optimal Clusters

The Elbow curve helps determine the optimal number of clusters in clustering algorithms like K-means. It plots the within-cluster sum of squares (WCSS) against the number of clusters. The "elbow" point suggests the optimal number of clusters.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Calculate WCSS for different numbers of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plot Elbow curve
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), wcss, 'bo-')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('Elbow Curve')
plt.show()

Slide 9: Silhouette Curve

Silhouette Curve: Evaluating Cluster Quality

The Silhouette curve visualizes how well each data point fits into its assigned cluster. It helps evaluate the quality of clustering and compare different clustering algorithms or parameters. Higher silhouette scores indicate better-defined clusters.

from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

# Assuming X is your data and you've chosen n_clusters
n_clusters = 3
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(X)

# Compute the silhouette scores
silhouette_avg = silhouette_score(X, cluster_labels)
sample_silhouette_values = silhouette_samples(X, cluster_labels)

# Plot
fig, ax1 = plt.subplots(1, 1)
fig.set_size_inches(10, 7)
ax1.set_xlim([-0.1, 1])
ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

y_lower = 10
for i in range(n_clusters):
    ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
    ith_cluster_silhouette_values.sort()
    
    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i
    
    color = cm.nipy_spectral(float(i) / n_clusters)
    ax1.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values,
                      facecolor=color, edgecolor=color, alpha=0.7)
    
    ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
    y_lower = y_upper + 10

ax1.set_title("The silhouette plot for the various clusters.")
ax1.set_xlabel("The silhouette coefficient values")
ax1.set_ylabel("Cluster label")

ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
ax1.set_yticks([])
ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

plt.show()

Slide 10: Gini Impurity and Entropy Plot

Gini Impurity and Entropy: Decision Tree Splitting Criteria

This plot compares Gini impurity and entropy, two common splitting criteria used in decision trees. Both measure the impurity or disorder in a set of class labels, helping determine the best feature to split on at each node.

import numpy as np
import matplotlib.pyplot as plt

def gini(p):
    return p * (1 - p) + (1 - p) * (1 - (1 - p))

def entropy(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p = np.linspace(0, 1, 100)
gini_values = [gini(pi) for pi in p]
entropy_values = [entropy(pi) if pi not in [0, 1] else 0 for pi in p]

plt.figure(figsize=(10, 6))
plt.plot(p, gini_values, label='Gini Impurity')
plt.plot(p, entropy_values, label='Entropy')
plt.xlabel('Probability of Class 1')
plt.ylabel('Impurity Measure')
plt.title('Gini Impurity vs Entropy')
plt.legend()
plt.grid(True)
plt.show()

Slide 11: Bias-Variance Tradeoff Plot

Bias-Variance Tradeoff: Balancing Model Complexity

The bias-variance tradeoff plot illustrates how model complexity affects bias and variance. It helps in understanding overfitting and underfitting, guiding the selection of an optimal model complexity that minimizes both bias and variance.

import numpy as np
import matplotlib.pyplot as plt

# Simulate bias-variance tradeoff
model_complexity = np.linspace(1, 10, 100)
bias = 1 / model_complexity
variance = np.exp(-1 + model_complexity / 10) - 1
total_error = bias + variance

plt.figure(figsize=(10, 6))
plt.plot(model_complexity, bias, label='Bias')
plt.plot(model_complexity, variance, label='Variance')
plt.plot(model_complexity, total_error, label='Total Error')
plt.xlabel('Model Complexity')
plt.ylabel('Error')
plt.title('Bias-Variance Tradeoff')
plt.legend()
plt.grid(True)
plt.show()

Slide 12: PDP (Partial Dependence Plot)

PDP: Visualizing Feature Impact on Predictions

Partial Dependence Plots (PDPs) show the marginal effect of a feature on the predicted outcome of a machine learning model. They help understand how a feature influences predictions while accounting for the average effects of other features.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence, plot_partial_dependence

# Generate sample data
X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=42)

# Train a model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Create PDP plot
fig, ax = plt.subplots(figsize=(10, 6))
plot_partial_dependence(model, X, [0, 1], ax=ax)
plt.tight_layout()
plt.show()

Slide 13: Additional Resources

Further Reading and Resources

For more in-depth information on these plots and advanced data visualization techniques, consider exploring the following resources:

  1. "Interpretable Machine Learning" by Christoph Molnar ArXiv: https://arxiv.org/abs/1901.04592
  2. "Visualizing Statistical Models: Removing the Blindfold" by Hadley Wickham et al. ArXiv: https://arxiv.org/abs/1409.7533
  3. "A Survey on Visualization for Machine Learning" by Jianping Kelvin Li et al. ArXiv: https://arxiv.org/abs/2007.01067

These papers provide comprehensive overviews and advanced techniques in data visualization for machine learning and statistics.