Skip to content

vadimtyuryaev/Beyond-NLS-Inference

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

33 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Beyond Non-Linear Least Squares: Enhancing Epidemiological Inference with Chi-Square Fitting and Bayesian Informed Generalized Profiling

Robust Inference in Epidemiological Dynamics: A Computational Benchmarking Framework

MATLAB Linux Static Badge Static Badge


Intervention Measures Simulation


πŸ“Œ Abstract & Overview

This repository contains a simulation study benchmarking the structural identifiability and forecasting accuracy of parameter estimation frameworks in SIR compartmental epidemiological model. The study utilizes synthetic data with Poisson observation noise to evaluate performance under two regimes: correct specification (constant transmission parameters) and model misspecification (time-varying transmission modeling non-pharmaceutical interventions).

While the Non-Linear Least Squares (NLS) estimator coincides with the Maximum Likelihood Estimator (MLE) under Gaussian homoscedastic assumptions, providing asymptotic efficiency for correctly specified models, epidemiological incidence data are inherently discrete with variance scaling with the mean (heteroscedasticity). Consequently, this framework evaluates Chi-Square (ChiSq) minimization to address the distributional violation, and semi-parametric Generalized Profiling (GP) to address structural misspecification. Our results indicate that while standard parametric models excel in ideal scenarios, model misspecification demands a paradigm shift toward the spline-based GP framework, which provides significantly more accurate forecasting and robust parameter recovery under realistic, time-varying conditions.


βš™οΈ Methodological Framework

The repository implements two fundamentally different computational approaches to parameter estimation:

1. The Standard ODE Approach (NLS and Chi-Square fitting)

  • Methodology: This framework assumes the data generation process aligns exactly with the deterministic SIR equations. Parameter estimation is formulated as a constrained non-linear optimization problem solved via the Trust-Region-Reflective (TRR) algorithm.
  • Mechanism: In each iteration, the solver performs numerical forward integration of the differential equations via Runge-Kutta 4(5) method to map the current candidate parameters to a theoretical trajectory. Based on the residual gradients, the TRR algorithm proposes parameter updates by minimizing a local quadratic approximation of the objective function, restricting the step size to a "trust region" where this approximation remains valid. Constrained optimization is employed to strictly enforce biological plausibility, ensuring that all estimated parameters, such as transmission rates and population counts, remain non-negative and physically meaningful throughout the convergence process.
  • Limitations: This method enforces a rigid structural constraint. If the true transmission rate $\beta$ varies over time (e.g., due to intervention measures), the estimator converges to a "biased average" that fits neither the past nor the future accurately.

2. The Spline-Based Approach (Generalized Profiling)

  • Methodology: Unlike standard methods that solve the Initial Value Problem (IVP) sequentially using iterative time-stepping algorithms (e.g., Runge-Kutta), this approach approximates state trajectories globally using a linear combination of B-spline basis functions.
  • Mechanism: Explicit forward integration is replaced by a bilevel optimization framework that treats the differential equations as "soft" algebraic constraints.
    • Inner Loop: Fits the B-splines to the observed data subject to an ODE penalty, effectively smoothing the trajectory while allowing for structural deviations.
    • Outer Loop: Optimizes the structural parameters ($\beta, \gamma$) by minimizing the error of this "profiled" trajectory returned by the inner loop.
  • Advantage: This formulation allows for admissible deviations from the deterministic trajectory. By algebraically solving the inverse ODE problem using the spline's derivative at the forecasting boundary $t_{end}$, the method recovers the effective instantaneous transmission rate $\beta(t_{end})$, ignoring historical bias. Critically, if the fixed-parameter ODEs diverge from reality (structural misspecification), the optimization minimizes the objective function by favoring empirical data adherence over model compliance. This flexibility allows the framework to capture the true underlying dynamics and generate accurate forecasts, even when the theoretical model structure is violated.

πŸ”¬ Experimental Design: Estimation & Forecasting Strategies

The study evaluates four primary estimation frameworks. Because Generalized Profiling allows for structural decoupling, the GP methods generate two distinct forecasting types, resulting in six total strategies evaluated.

Forecasting Methodologies

The distinction between "Standard" and "Hybrid" projections is central to the analysis of model misspecification:

  • Standard Projection (Global History):

    • Workflow: Estimates global parameter set $\hat{\theta} = (\hat{\beta}, \hat{\gamma})$ that minimizes error over the full historical window $[0, t_{end}]$. The forecast is generated by integrating the SIR ODEs from the original initial conditions $S(t_0), I(t_0), R(t_0)$ forward to the forecast horizon.
    • Implication: The trajectory is rigidly constrained by the historical average. If transmission changed during the window (e.g., intervention), $\hat{\beta}$ becomes a "biased average," causing the forecast to miss the recent trend.
  • Hybrid Projection (Instantaneous / Adaptive):

    • Workflow: This approach leverages the flexibility of the GP splines. Instead of using the global average, it calculates the instantaneous effective transmission rate $\beta(t_{end})$ algebraically from the spline's derivative and state estimates at the final data point. The forecast is generated by integrating the ODEs forward starting from $t_{end}$ (not $t_0$), using the spline-smoothed states $S(t_{end}), I(t_{end}), R(t_{end})$ as the new initial conditions.
    • Implication: This ensures $C^1$ continuity at the forecasting boundary. By projecting the current dynamics rather than the historical average, this method is robust to structural breaks in the transmission rate.|

Strategy Summary

Strategy ID Estimation Framework Objective Function Forecasting Logic Theoretical Rationale
1. NLS Standard ODE Fit Sum of Squared Errors (SSE) Standard Projection: Forward integration from $t(0)$ using global $\hat{\beta}, \hat{\gamma}$. Assumes homoscedastic Gaussian errors. Theoretically optimal only for constant variance.
2. ChiSq Standard ODE Fit Weighted SSE (Chi-Square) Standard Projection: Forward integration from $t(0)$ using global $\hat{\beta}, \hat{\gamma}$. Approximates Maximum Likelihood for Poisson data (Variance $\approx$ Mean).
3. GP+NLS Generalized Profiling Outer: Profiled SSE
Inner: Penalized SSE
Standard Projection: Forward integration from $t(0)$ using global $\hat{\beta}, \hat{\gamma}$. Utilizes spline smoothing for state estimation but relies on global parameters for projection.
4. GP+ChiSq Generalized Profiling Outer: Profiled Weighted SSE
Inner: Penalized Weighted SSE
Standard Projection: Forward integration from $t(0)$ using global $\hat{\beta}, \hat{\gamma}$. Utilizes spline smoothing for state estimation but relies on global parameters for projection.
5. GP+NLS (Proj.) Generalized Profiling Outer: Profiled SSE
Inner: Penalized SSE
Hybrid Projection: Forward integration from $t_{end}$ using instantaneous $\beta(t_{end})$ and spline states at $t_{end}$. Adaptive. Decouples the forecast from the historical average of the parameter.
6. GP+ChiSq (Proj.) Generalized Profiling Outer: Profiled Weighted SSE
Inner: Penalized Weighted SSE
Hybrid Projection: Forward integration from $t_{end}$ using instantaneous $\beta(t_{end})$ and spline states at $t_{end}$. Adaptive. Decouples the forecast from the historical average of the parameter.

πŸ’» System Architecture & Requirements

This codebase is architected for High-Performance Computing (HPC) environments. While fully compatible with Windows, it is "Linux-Native" by design to facilitate headless execution on cluster nodes.

Prerequisites

  • MATLAB: Version R2021a or later (Recommended: R2024a).
    • Note: Codebase utilizes arguments validation blocks and tiledlayout introduced in R2019b.
  • Required Toolboxes:
    • Optimization Toolbox: Required for core solvers and Bayesian optimization (lsqnonlin, fmincon, bayesopt).
    • Global Optimization Toolbox: Required for the MultiStart solver class and stochastic search drivers.
    • Statistics and Machine Learning Toolbox: Required for statistical metrics and hypothesis testing (normpdf, chi2pdf, anova1, multcompare, boxplot).
    • Parallel Computing Toolbox: Required for parfor loops and parpool management during simulations.

Operating System Compatibility

  • Linux (Recommended): The workflow is optimized for HPC environments. Scripts utilize fileparts and fullfile for agnostic path handling, but the logging examples assume a Bash shell (nohup).
  • Windows: Fully compatible.
    • Execution: Run scripts interactively via the MATLAB GUI or use PowerShell for background execution (the nohup commands in the examples are Linux-specific).

▢️ Execution & Reproducibility

Please execute commands from the base directory of the repository.

1. Running the Simulation Study (Headless/HPC)

The primary driver run_sir_simulation_study.m executes the Monte Carlo simulations. The following command runs the full study (250 simulations per scenario) in the background using 32 parallel workers.

# 1. Create directory for console logs
mkdir -p logs

# 2. Execute Simulation Driver
# Note: 'addpath' ensures dependencies are loaded before execution.
nohup matlab -batch "addpath('main_script'); run_sir_simulation_study('NumSims', 250, 'UseParallel', true, 'MaxWorkers', 32)" > logs/sim_run_$(date +%F).log 2>&1 &

# 3. Monitor Progress
tail -f logs/sim_run_*.log

2. Post-Processing & Analysis

Simulation artifacts are saved to data/run_YYYYMMDD_HHMMSS. The analysis scripts automatically locate the most recent dataset to generate figures.

A. Performance Metrics & Visualization Generates boxplots of RMSE/sMAPE and parameter bias tables.

nohup matlab -batch "addpath(genpath('src')); run('src/analysis/analyze_sir_simulation_study_results.m')" > logs/analysis_$(date +%F).log 2>&1 &

B. Statistical Hypothesis Testing Performs One-Way ANOVA and Tukey-Kramer post-hoc analysis.

nohup matlab -batch "addpath(genpath('src')); run('src/analysis/perform_statistical_analysis.m')" > logs/stats_$(date +%F).log 2>&1 &

To exactly reproduce the results presented in the study, check the logs folder for the specific seed used.


πŸ“ Repository Structure

Repo_Root/
β”œβ”€β”€ data/                    # [Artifacts] Stores .mat simulation files
β”œβ”€β”€ figures/                 # [Artifacts] Stores generated compartmental plots (Hosted on Google Drive)
β”‚   β”œβ”€β”€ compartmental_plots/ #   - Individual fit visualizations (per simulation)
β”‚   └── post_simulation/     #   - Aggregated analysis (Boxplots, ANOVA results)
β”œβ”€β”€ logs/                    # [Logs] Console outputs from batch jobs
β”œβ”€β”€ main_script/
β”‚   └── run_sir_simulation_study.m  # <--- ENTRY POINT: Main Simulation Driver
└── src/
    β”œβ”€β”€ analysis/            # Scripts for metric aggregation and hypothesis testing
    β”œβ”€β”€ models/              # ODE equations and forward integration definitions
    β”œβ”€β”€ optimization/        # Optimization wrappers (Inner/Outer residuals, Bayesian optimization, splines)
    └── plotting/            # Compartmental plots generator (Saves plots after each simulations)

Due to GitHub storage constraints, high-resolution figure sets are hosted on Google Drive.

πŸ“œ Citation & License

If you utilize this codebase for your research, please cite this repository:

Tyuryaev, V., Jankowski, H., & Heffernan, J.M. Beyond Non-Linear Least Squares: Enhancing Epidemiological Inference with Chi-Square Fitting and Bayesian Informed Generalized Profiling. GitHub repository, 2026. Available at: https://github.com/vadimtyuryaev/Beyond-NLS-Inference

This project is licensed under the MIT License. See the LICENSE file for details.

About

A computational benchmarking framework for epidemiological inference. Evaluates spline-based Generalized Profiling and Chi-Square fitting against standard NLS to overcome structural misspecification and improve forecasting in time-varying SIR models.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages