Beyond Non-Linear Least Squares: Enhancing Epidemiological Inference with Chi-Square Fitting and Bayesian Informed Generalized Profiling
Robust Inference in Epidemiological Dynamics: A Computational Benchmarking Framework
This repository contains a simulation study benchmarking the structural identifiability and forecasting accuracy of parameter estimation frameworks in SIR compartmental epidemiological model. The study utilizes synthetic data with Poisson observation noise to evaluate performance under two regimes: correct specification (constant transmission parameters) and model misspecification (time-varying transmission modeling non-pharmaceutical interventions).
While the Non-Linear Least Squares (NLS) estimator coincides with the Maximum Likelihood Estimator (MLE) under Gaussian homoscedastic assumptions, providing asymptotic efficiency for correctly specified models, epidemiological incidence data are inherently discrete with variance scaling with the mean (heteroscedasticity). Consequently, this framework evaluates Chi-Square (ChiSq) minimization to address the distributional violation, and semi-parametric Generalized Profiling (GP) to address structural misspecification. Our results indicate that while standard parametric models excel in ideal scenarios, model misspecification demands a paradigm shift toward the spline-based GP framework, which provides significantly more accurate forecasting and robust parameter recovery under realistic, time-varying conditions.
The repository implements two fundamentally different computational approaches to parameter estimation:
- Methodology: This framework assumes the data generation process aligns exactly with the deterministic SIR equations. Parameter estimation is formulated as a constrained non-linear optimization problem solved via the Trust-Region-Reflective (TRR) algorithm.
- Mechanism: In each iteration, the solver performs numerical forward integration of the differential equations via Runge-Kutta 4(5) method to map the current candidate parameters to a theoretical trajectory. Based on the residual gradients, the TRR algorithm proposes parameter updates by minimizing a local quadratic approximation of the objective function, restricting the step size to a "trust region" where this approximation remains valid. Constrained optimization is employed to strictly enforce biological plausibility, ensuring that all estimated parameters, such as transmission rates and population counts, remain non-negative and physically meaningful throughout the convergence process.
-
Limitations: This method enforces a rigid structural constraint. If the true transmission rate
$\beta$ varies over time (e.g., due to intervention measures), the estimator converges to a "biased average" that fits neither the past nor the future accurately.
- Methodology: Unlike standard methods that solve the Initial Value Problem (IVP) sequentially using iterative time-stepping algorithms (e.g., Runge-Kutta), this approach approximates state trajectories globally using a linear combination of B-spline basis functions.
-
Mechanism: Explicit forward integration is replaced by a bilevel optimization framework that treats the differential equations as "soft" algebraic constraints.
- Inner Loop: Fits the B-splines to the observed data subject to an ODE penalty, effectively smoothing the trajectory while allowing for structural deviations.
-
Outer Loop: Optimizes the structural parameters (
$\beta, \gamma$ ) by minimizing the error of this "profiled" trajectory returned by the inner loop.
-
Advantage: This formulation allows for admissible deviations from the deterministic trajectory. By algebraically solving the inverse ODE problem using the spline's derivative at the forecasting boundary
$t_{end}$ , the method recovers the effective instantaneous transmission rate$\beta(t_{end})$ , ignoring historical bias. Critically, if the fixed-parameter ODEs diverge from reality (structural misspecification), the optimization minimizes the objective function by favoring empirical data adherence over model compliance. This flexibility allows the framework to capture the true underlying dynamics and generate accurate forecasts, even when the theoretical model structure is violated.
The study evaluates four primary estimation frameworks. Because Generalized Profiling allows for structural decoupling, the GP methods generate two distinct forecasting types, resulting in six total strategies evaluated.
The distinction between "Standard" and "Hybrid" projections is central to the analysis of model misspecification:
-
Standard Projection (Global History):
-
Workflow: Estimates global parameter set
$\hat{\theta} = (\hat{\beta}, \hat{\gamma})$ that minimizes error over the full historical window$[0, t_{end}]$ . The forecast is generated by integrating the SIR ODEs from the original initial conditions$S(t_0), I(t_0), R(t_0)$ forward to the forecast horizon. -
Implication: The trajectory is rigidly constrained by the historical average. If transmission changed during the window (e.g., intervention),
$\hat{\beta}$ becomes a "biased average," causing the forecast to miss the recent trend.
-
Workflow: Estimates global parameter set
-
Hybrid Projection (Instantaneous / Adaptive):
-
Workflow: This approach leverages the flexibility of the GP splines. Instead of using the global average, it calculates the instantaneous effective transmission rate
$\beta(t_{end})$ algebraically from the spline's derivative and state estimates at the final data point. The forecast is generated by integrating the ODEs forward starting from$t_{end}$ (not$t_0$ ), using the spline-smoothed states$S(t_{end}), I(t_{end}), R(t_{end})$ as the new initial conditions. -
Implication: This ensures
$C^1$ continuity at the forecasting boundary. By projecting the current dynamics rather than the historical average, this method is robust to structural breaks in the transmission rate.|
-
Workflow: This approach leverages the flexibility of the GP splines. Instead of using the global average, it calculates the instantaneous effective transmission rate
| Strategy ID | Estimation Framework | Objective Function | Forecasting Logic | Theoretical Rationale |
|---|---|---|---|---|
| 1. NLS | Standard ODE Fit | Sum of Squared Errors (SSE) |
Standard Projection: Forward integration from |
Assumes homoscedastic Gaussian errors. Theoretically optimal only for constant variance. |
| 2. ChiSq | Standard ODE Fit | Weighted SSE (Chi-Square) |
Standard Projection: Forward integration from |
Approximates Maximum Likelihood for Poisson data (Variance |
| 3. GP+NLS | Generalized Profiling |
Outer: Profiled SSE Inner: Penalized SSE |
Standard Projection: Forward integration from |
Utilizes spline smoothing for state estimation but relies on global parameters for projection. |
| 4. GP+ChiSq | Generalized Profiling |
Outer: Profiled Weighted SSE Inner: Penalized Weighted SSE |
Standard Projection: Forward integration from |
Utilizes spline smoothing for state estimation but relies on global parameters for projection. |
| 5. GP+NLS (Proj.) | Generalized Profiling |
Outer: Profiled SSE Inner: Penalized SSE |
Hybrid Projection: Forward integration from |
Adaptive. Decouples the forecast from the historical average of the parameter. |
| 6. GP+ChiSq (Proj.) | Generalized Profiling |
Outer: Profiled Weighted SSE Inner: Penalized Weighted SSE |
Hybrid Projection: Forward integration from |
Adaptive. Decouples the forecast from the historical average of the parameter. |
This codebase is architected for High-Performance Computing (HPC) environments. While fully compatible with Windows, it is "Linux-Native" by design to facilitate headless execution on cluster nodes.
- MATLAB: Version R2021a or later (Recommended: R2024a).
- Note: Codebase utilizes
argumentsvalidation blocks andtiledlayoutintroduced in R2019b.
- Note: Codebase utilizes
- Required Toolboxes:
- Optimization Toolbox: Required for core solvers and Bayesian optimization (
lsqnonlin,fmincon,bayesopt). - Global Optimization Toolbox: Required for the
MultiStartsolver class and stochastic search drivers. - Statistics and Machine Learning Toolbox: Required for statistical metrics and hypothesis testing (
normpdf,chi2pdf,anova1,multcompare,boxplot). - Parallel Computing Toolbox: Required for
parforloops andparpoolmanagement during simulations.
- Optimization Toolbox: Required for core solvers and Bayesian optimization (
- Linux (Recommended): The workflow is optimized for HPC environments. Scripts utilize
filepartsandfullfilefor agnostic path handling, but the logging examples assume a Bash shell (nohup). - Windows: Fully compatible.
- Execution: Run scripts interactively via the MATLAB GUI or use PowerShell for background execution (the
nohupcommands in the examples are Linux-specific).
- Execution: Run scripts interactively via the MATLAB GUI or use PowerShell for background execution (the
Please execute commands from the base directory of the repository.
The primary driver run_sir_simulation_study.m executes the Monte Carlo simulations. The following command runs the full study (250 simulations per scenario) in the background using 32 parallel workers.
# 1. Create directory for console logs
mkdir -p logs
# 2. Execute Simulation Driver
# Note: 'addpath' ensures dependencies are loaded before execution.
nohup matlab -batch "addpath('main_script'); run_sir_simulation_study('NumSims', 250, 'UseParallel', true, 'MaxWorkers', 32)" > logs/sim_run_$(date +%F).log 2>&1 &
# 3. Monitor Progress
tail -f logs/sim_run_*.logSimulation artifacts are saved to data/run_YYYYMMDD_HHMMSS. The analysis scripts automatically locate the most recent dataset to generate figures.
A. Performance Metrics & Visualization Generates boxplots of RMSE/sMAPE and parameter bias tables.
nohup matlab -batch "addpath(genpath('src')); run('src/analysis/analyze_sir_simulation_study_results.m')" > logs/analysis_$(date +%F).log 2>&1 &B. Statistical Hypothesis Testing Performs One-Way ANOVA and Tukey-Kramer post-hoc analysis.
nohup matlab -batch "addpath(genpath('src')); run('src/analysis/perform_statistical_analysis.m')" > logs/stats_$(date +%F).log 2>&1 &To exactly reproduce the results presented in the study, check the logs folder for the specific seed used.
Repo_Root/
βββ data/ # [Artifacts] Stores .mat simulation files
βββ figures/ # [Artifacts] Stores generated compartmental plots (Hosted on Google Drive)
β βββ compartmental_plots/ # - Individual fit visualizations (per simulation)
β βββ post_simulation/ # - Aggregated analysis (Boxplots, ANOVA results)
βββ logs/ # [Logs] Console outputs from batch jobs
βββ main_script/
β βββ run_sir_simulation_study.m # <--- ENTRY POINT: Main Simulation Driver
βββ src/
βββ analysis/ # Scripts for metric aggregation and hypothesis testing
βββ models/ # ODE equations and forward integration definitions
βββ optimization/ # Optimization wrappers (Inner/Outer residuals, Bayesian optimization, splines)
βββ plotting/ # Compartmental plots generator (Saves plots after each simulations)
Due to GitHub storage constraints, high-resolution figure sets are hosted on Google Drive.
If you utilize this codebase for your research, please cite this repository:
Tyuryaev, V., Jankowski, H., & Heffernan, J.M. Beyond Non-Linear Least Squares: Enhancing Epidemiological Inference with Chi-Square Fitting and Bayesian Informed Generalized Profiling. GitHub repository, 2026. Available at: https://github.com/vadimtyuryaev/Beyond-NLS-Inference
This project is licensed under the MIT License. See the LICENSE file for details.
