Slide 1: Docstring Structure and Basic Usage
The foundational element of Python documentation is the docstring, which must be the first statement after a definition. It uses triple quotes for multi-line strings and provides essential information about the purpose, parameters, and return values of functions, classes, or modules.
def calculate_fibonacci(n: int) -> int:
"""
Calculate the nth number in the Fibonacci sequence.
Args:
n (int): Position in Fibonacci sequence (must be >= 0)
Returns:
int: The nth Fibonacci number
Raises:
ValueError: If n is negative
"""
if n < 0:
raise ValueError("Position must be non-negative")
if n <= 1:
return n
return calculate_fibonacci(n-1) + calculate_fibonacci(n-2)
# Example usage
result = calculate_fibonacci(10)
print(f"10th Fibonacci number: {result}") # Output: 55Slide 2: Google Style Docstrings
Google style docstrings provide a clean, readable format that's become increasingly popular in the Python community. This format uses indentation and section headers to organize information, making it particularly suitable for complex functions.
def process_data(data_frame, columns=None, aggregation='mean'):
"""
Process a pandas DataFrame using specified columns and aggregation method.
Args:
data_frame (pd.DataFrame): Input DataFrame to process
columns (list, optional): List of column names. Defaults to None
aggregation (str, optional): Aggregation method. Defaults to 'mean'
Returns:
pd.DataFrame: Processed DataFrame with aggregated results
Example:
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> process_data(df, columns=['A'], aggregation='sum')
"""
if columns is None:
columns = data_frame.columns
return data_frame[columns].agg(aggregation)Slide 3: NumPy Style Docstrings
NumPy style documentation is particularly well-suited for scientific computing and data analysis functions. It uses a structured format with sections separated by dashes and provides detailed mathematical descriptions when needed.
def compute_correlation(x: np.ndarray, y: np.ndarray) -> float:
"""
Compute Pearson correlation coefficient between two arrays.
Parameters
----------
x : numpy.ndarray
First input array
y : numpy.ndarray
Second input array
Returns
-------
float
Correlation coefficient between x and y
Notes
-----
The correlation coefficient is calculated as:
$$r = \frac{\sum(x - \bar{x})(y - \bar{y})}{\sqrt{\sum(x - \bar{x})^2\sum(y - \bar{y})^2}}$$
"""
return np.corrcoef(x, y)[0, 1]Slide 4: Class Docstrings
A comprehensive class docstring should describe the class purpose, attributes, and behavior. It should also include examples demonstrating typical usage patterns and any important implementation details.
class DataProcessor:
"""
A class for processing and transforming data sets.
This class provides methods for common data preprocessing tasks
including normalization, encoding, and handling missing values.
Attributes:
data (pd.DataFrame): The input dataset
features (list): List of feature columns
target (str): Name of target variable
Methods:
normalize(): Normalize numerical features
encode_categorical(): Encode categorical variables
handle_missing(): Handle missing values
"""
def __init__(self, data, features, target):
self.data = data
self.features = features
self.target = targetSlide 5: Module Level Docstrings
Module level docstrings provide high-level documentation about the purpose, dependencies, and usage of a Python module. They should be placed at the beginning of the file and include comprehensive information about the module's functionality.
"""
Data Processing Utilities
========================
This module provides utilities for processing large datasets efficiently.
Key Features:
- Parallel data processing
- Memory-efficient operations
- Progress tracking
- Error handling and logging
Dependencies:
- numpy>=1.20.0
- pandas>=1.3.0
- dask>=2022.1.0
Example:
>>> from data_utils import DataProcessor
>>> processor = DataProcessor(data_path='data.csv')
>>> result = processor.process()
"""
import numpy as np
import pandas as pd
import dask.dataframe as ddSlide 6: Property Docstrings
Property docstrings require special attention as they document both attribute-like access and potential computations. They should clearly indicate the property's purpose, computation method, and any caching behavior.
class DataSet:
"""Main dataset container with property documentation examples."""
def __init__(self, data):
self._data = data
self._cached_stats = None
@property
def statistics(self):
"""
Calculate and cache descriptive statistics of the dataset.
The property computes mean, median, and standard deviation.
Results are cached after first access for performance.
Returns:
dict: Statistical measures of the dataset
Keys: 'mean', 'median', 'std'
"""
if self._cached_stats is None:
self._cached_stats = {
'mean': np.mean(self._data),
'median': np.median(self._data),
'std': np.std(self._data)
}
return self._cached_statsSlide 7: Real-World Example - Machine Learning Pipeline
A practical example demonstrating comprehensive docstring usage in a machine learning pipeline, including data preprocessing, model training, and evaluation components.
class MLPipeline:
"""
End-to-end machine learning pipeline with comprehensive documentation.
This pipeline handles:
1. Data preprocessing
2. Feature engineering
3. Model training
4. Evaluation
"""
def preprocess_data(self, df: pd.DataFrame) -> pd.DataFrame:
"""
Preprocess raw data for model training.
Args:
df (pd.DataFrame): Raw input data
Returns:
pd.DataFrame: Cleaned and preprocessed data
Example:
>>> pipeline = MLPipeline()
>>> processed_df = pipeline.preprocess_data(raw_df)
"""
# Implementation of preprocessing steps
return cleaned_df
def train_model(self, X: np.ndarray, y: np.ndarray) -> Any:
"""
Train machine learning model using preprocessed data.
Args:
X (np.ndarray): Feature matrix
y (np.ndarray): Target values
Returns:
model: Trained model instance
"""
# Model training implementation
return trained_modelSlide 8: Advanced Exception Documentation
Proper documentation of exceptions is crucial for API design. This example shows how to document complex exception handling scenarios with clear guidance for users.
def validate_input_data(data: dict) -> bool:
"""
Validate input data against schema requirements.
Args:
data (dict): Input data dictionary
Returns:
bool: True if validation passes
Raises:
ValueError: If required fields are missing
Error message includes the specific missing fields
TypeError: If field types don't match schema
Error message includes field name and expected type
ValidationError: If business logic validation fails
Includes detailed validation failure description
"""
if not isinstance(data, dict):
raise TypeError(f"Expected dict, got {type(data)}")
required_fields = ['id', 'name', 'value']
missing = [f for f in required_fields if f not in data]
if missing:
raise ValueError(f"Missing required fields: {missing}")
# Additional validation logic
return TrueSlide 9: Documentation Generator Integration
Docstrings should be written to work seamlessly with documentation generators. This example shows how to structure docstrings for optimal Sphinx integration.
class DataAnalyzer:
"""
Analyze datasets using statistical methods.
.. note::
This class requires NumPy >= 1.20.0
.. warning::
Not thread-safe due to caching behavior
Examples:
>>> analyzer = DataAnalyzer()
>>> result = analyzer.analyze([1, 2, 3])
"""
def analyze(self, data: list) -> dict:
"""
Perform statistical analysis on input data.
:param data: Input data for analysis
:type data: list
:return: Analysis results
:rtype: dict
.. seealso:: :func:`DataAnalyzer.summarize`
"""
return {'mean': sum(data) / len(data)}Slide 10: Type Hints in Docstrings
Modern Python documentation combines type hints with docstrings to provide comprehensive type information while maintaining backward compatibility and detailed descriptions of complex types.
from typing import List, Dict, Optional, Union
def process_time_series(
data: List[float],
window_size: int,
aggregation_func: Optional[callable] = None
) -> Dict[str, Union[float, List[float]]]:
"""
Process time series data with sliding window aggregation.
Args:
data: List of numerical time series values
window_size: Size of the sliding window for aggregation
aggregation_func: Custom aggregation function, defaults to mean
Must accept List[float] and return float
Returns:
Dictionary containing:
- 'processed': List[float] - Processed series
- 'stats': float - Aggregate statistic
"""
if aggregation_func is None:
aggregation_func = lambda x: sum(x) / len(x)
results = []
for i in range(len(data) - window_size + 1):
window = data[i:i + window_size]
results.append(aggregation_func(window))
return {
'processed': results,
'stats': aggregation_func(results)
}Slide 11: Docstrings for Async Functions
Special considerations are needed when documenting asynchronous functions, including concurrent execution behavior and potential timing issues.
async def fetch_data_batch(
urls: List[str],
timeout: float = 30.0
) -> List[Dict[str, any]]:
"""
Asynchronously fetch data from multiple URLs.
This coroutine manages concurrent HTTP requests with timeout
and error handling for each URL in the batch.
Args:
urls: List of URLs to fetch
timeout: Request timeout in seconds
Returns:
List of response dictionaries containing:
- 'url': Original URL
- 'data': Response data or None if failed
- 'error': Error message if failed
Note:
- Uses aiohttp for concurrent requests
- Maintains connection pool
- Implements exponential backoff
"""
async with aiohttp.ClientSession() as session:
tasks = [
asyncio.create_task(fetch_url(session, url, timeout))
for url in urls
]
return await asyncio.gather(*tasks, return_exceptions=True)Slide 12: Internal Implementation Documentation
Documentation for internal implementations requires special attention to implementation details while maintaining clarity about private nature and usage restrictions.
class _DataValidator:
"""
Internal data validation implementation.
This class is not part of the public API and should not be
used directly. It implements the core validation logic used
by public-facing validation methods.
Warning:
This is an internal class that may change without notice.
Do not use directly.
Implementation Notes:
- Uses caching to optimize repeated validations
- Thread-safe through lock mechanisms
- Implements validation chains pattern
"""
def __init__(self):
self._cache = {}
self._lock = threading.Lock()
def _validate_internal(self, data: Any) -> bool:
"""
Internal validation method with caching.
Args:
data: Data to validate
Returns:
bool: Validation result
Note:
This method is not protected against recursion.
Maximum validation depth is controlled by caller.
"""
cache_key = hash(str(data))
with self._lock:
if cache_key in self._cache:
return self._cache[cache_key]
result = self._perform_validation(data)
self._cache[cache_key] = result
return resultSlide 13: Mathematical Documentation
Complex mathematical operations require detailed documentation including formulas, variable definitions, and implementation considerations.
def kalman_filter(
measurements: np.ndarray,
initial_state: float,
measurement_variance: float,
process_variance: float
) -> np.ndarray:
"""
Implement a 1D Kalman filter for time series smoothing.
The implementation follows these equations:
Prediction step:
$$x_{t|t-1} = x_{t-1|t-1}$$
$$P_{t|t-1} = P_{t-1|t-1} + Q$$
Update step:
$$K_t = P_{t|t-1}/(P_{t|t-1} + R)$$
$$x_{t|t} = x_{t|t-1} + K_t(z_t - x_{t|t-1})$$
$$P_{t|t} = (1 - K_t)P_{t|t-1}$$
Args:
measurements: Array of noisy measurements
initial_state: Initial state estimate
measurement_variance: Measurement noise (R)
process_variance: Process noise (Q)
Returns:
Array of filtered state estimates
"""
n = len(measurements)
filtered_states = np.zeros(n)
prediction = initial_state
prediction_variance = 1.0
for t in range(n):
# Prediction step
prediction_variance += process_variance
# Update step
kalman_gain = prediction_variance / (prediction_variance + measurement_variance)
prediction = prediction + kalman_gain * (measurements[t] - prediction)
prediction_variance = (1 - kalman_gain) * prediction_variance
filtered_states[t] = prediction
return filtered_statesSlide 14: Additional Resources
- arXiv:1904.02610 - "Automated Python Documentation Generation: A Survey of Tools and Techniques"
- arXiv:2007.15287 - "Best Practices for Scientific Software Documentation"
- https://peps.python.org/pep-0257/ - Python PEP 257 Docstring Conventions
- https://google.github.io/styleguide/pyguide.html - Google Python Style Guide
- https://numpydoc.readthedocs.io/en/latest/format.html - NumPy Documentation Guide