vp314 · etontackett · Mar 2, 2026 · Mar 3, 2026 · Mar 3, 2026 · Mar 17, 2026
diff --git a/Project.toml b/Project.toml
@@ -1,7 +1,15 @@
 name = "RidgeRegression"
 uuid = "739161c8-60e1-4c49-8f89-ff30998444b1"
-authors = ["Vivak Patel <vp314@users.noreply.github.com>"]
 version = "0.1.0"
+authors = ["Eton Tackett <etont@icloud.com>", "Vivak Patel <vp314@users.noreply.github.com>"]
+
+[deps]
+CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
+DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
+Downloads = "f43a241f-c20a-4ad4-852c-f6b1247861c6"
 
 [compat]
+CSV = "0.10.15"
+DataFrames = "1.8.1"
+Downloads = "1.7.0"
 julia = "1.12.4"
diff --git a/docs/make.jl b/docs/make.jl
@@ -14,6 +14,7 @@ makedocs(;
     ),
     pages=[
         "Home" => "index.md",
+        "Design" => "design.md",
     ],
 )
 

diff --git a/docs/src/design.md b/docs/src/design.md
@@ -0,0 +1,198 @@
+# Motivation and Background
+Many modern science problems involve regression problems with extremely large numbers of
+predictors. Genome-wide association studies (GWAS), for example, try to identify genetic
+variants associated with a disease phenotype using hundreds of thousands or millions of
+genomic features. In such settings, traditional least squares methods fail. Penalized Least Squares (PLS) extends ordinary least squares (OLS)
+regression by adding a penalty term to shrink parameter estimates. Ridge regression, an
+approach within PLS, adds a regularization term, producing a regularized estimator. 
+
+Mathematically, ridge regression estimates the regression coefficients by solving the penalized least squares problem
+```math
+\hat{\boldsymbol{\beta}} =
+\arg\min_{\boldsymbol{\beta}}
+\left(
+\| \mathbf{y} - X\boldsymbol{\beta} \|_2^2
++
+\lambda \| \boldsymbol{\beta} \|_2^2
+\right)
+```
+where $\lambda > 0$ is a regularization parameter that controls the strength of the penalty.
+
+The purpose of ridge regression is to stabilize regression estimates when the predictors are
+highly correlated or the design matrix $X$ is nearly singular. Ridge regression modifies the
+least squares objective by adding a penalty on the squared $\ell_2$-norm of the coefficient
+vector. The estimator is obtained by minimizing a penalized least squares objective in which
+large coefficient values are discouraged through the penalty term $\lambda
+\|\boldsymbol{\beta}\|_2^2$. This penalty shrinks the estimated coefficients toward the
+origin, which reduces the variance of the estimator and mitigates the effects of
+multicollinearity.
+
+There are many numerical algorithms available to compute ridge regression estimates
+including direct methods, Krylov subspace methods, gradient-based optimization, coordinate
+descent, and stochastic gradient descent. These algorithms differ in their computational
+costs and numerical stability. 
+
+The goal of this experiment is to investigate the performance of these algorithms when we
+vary the structure and scale of the regression problem. To do this, we consider the linear
+model $\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ where the matrix ${X}$
+may be constructed with varying dimensions, sparsity patterns, and conditioning properties.
+# Questions
+The primary goal of this experiment is to compare numerical algorithms for computing ridge
+regression estimates under various conditions. In particular, we aim to address the
+following questions:
+
+1. How does the performance of ridge regression algorithms change as the structural and numerical properties of the regression problem vary?
+
+2. Which ridge regression algorithm provides the best balance between numerical stability and computational cost across these problem regimes?
+
+# Experimental Units
+A block is defined as a collection of experimental units that share common characteristics
+which may influence the response. Blocking is introduced to control for
+variation and to enable meaningful comparisons of algorithm performance across
+different problem regimes.
+
+In this experiment, blocks are defined by combinations of three factors: dimensional
+regime, matrix sparsity, and ridge penalty magnitude. 
+
+The experimental units are the datasets. Each dataset consists of a design matrix $X$ and
+response vector $y$, with the regularization parameter $\lambda$ determined by the
+blocking structure through the choice of $\epsilon$.
+
+Datasets will be grouped according to their dimensional regime, characterized as $p \ll n$,
+$p ≈ n$, and $p \gg n$. These regimes correspond to fundamentally different geometric
+properties of the design matrix, including rank behavior, conditioning, and the stability of
+the normal equations. All of which impact the performance of numerical algorithms.
+
+In addition to dimensional block, the strength of the ridge penalty will be incorporated as
+a secondary blocking factor. The ridge estimator is $\hat{\beta_R} = (X^\top X + \lambda
+I)^{-1}X^\top y$. The matrix conditioning number is defined as $\kappa(A) =
+\frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$. In the context of ridge regression, the
+regularization parameter ${\lambda}$, can impact the conditioning number. Let $X = U\Sigma
+V^\top$ be the SVD of $X$, with singular values $\sigma_1,\dots,\sigma_p$.
+
+Then
+```math
+X^\top X = V \Sigma^\top \Sigma V^\top
+= V \,\mathrm{diag}(\sigma_1^2,\dots,\sigma_p^2)\, V^\top .
+```
+
+Adding the ridge term gives
+
+```math
+X^\top X + \lambda I
+=
+V \,\mathrm{diag}(\sigma_1^2+\lambda,\dots,\sigma_p^2+\lambda)\, V^\top .
+```
+
+```math
+\kappa(X^\top X+\lambda I)
+=
+\frac{\sigma_{\max}^2+\lambda}{\sigma_{\min}^2+\lambda}.
+```
+
+Because the performance of numerical algorithms is strongly influenced by the conditioning
+of the system they solve, the ridge penalty effectively creates regression problems with
+different numerical difficulty. This provides a way to assess how algorithm performance,
+convergence behavior, and computational cost depend on the numerical stability of the
+problem. 
+Recall that,
+```math
+\kappa(X^\top X + \lambda I)
+=
+\frac{\sigma_{\max}^2 + \lambda}{\sigma_{\min}^2 + \lambda},
+```
+Where $\sigma_{\min}$ and $\sigma_{\max}$ denote the smallest and largest
+singular values of $X$. Given a target condition number of the form, $1+\epsilon$ with $\epsilon > 0$, we solve for $\lambda$ by setting
+```math
+\frac{\sigma_{\max}^2 + \lambda}{\sigma_{\min}^2 + \lambda} = 1+\epsilon,
+```
+Which implies
+```math
+\sigma_{\max}^2 + \lambda
+=
+(1+\epsilon)(\sigma_{\min}^2 + \lambda).
+```
+Rearranging gives
+```math
+\sigma_{\max}^2 - (1+\epsilon)\,\sigma_{\min}^2 = (\epsilon)\lambda,
+```
+and therefore
+```math
+\lambda = \frac{\sigma_{\max}^2 - (1+\epsilon)\,\sigma_{\min}^2}{\epsilon}.
+```
+In this experiment, the ridge penalty parameter is not chosen directly, but is instead
+determined through a target level of conditioning. Because the ideal condition number
+of 1 cannot be achieved in practice, we instead aim for values of the form $1 + \epsilon$,
+where $\epsilon > 0$.
+
+A strong regularization regime corresponds to small values of $\epsilon$, resulting
+in a well-conditioned system with condition number close to 1. Moderate
+regularization corresponds to intermediate values of $\epsilon$, where the condition
+number is reduced but not minimal. Weak regularization corresponds to large values
+of $\epsilon$, where the system remains relatively ill-conditioned and close to the
+unregularized case.
+
+If $X$ is rank deficient, then $\sigma _{min}=0$, then 
+```math
+\lambda =\frac{\sigma _{max}^2}{\epsilon}
+```
+So our method is robust in the case where $X$ is rank deficient.
+
+Another blocking factor that will be considered is how sparse or dense the matrix $X$ is.
+Many algorithms behave differently depending on whether the matrix is sparse or dense. In
+ridge regression, there are many operations involving $X$ including matrix-matrix products
+and matrix-vector products. A dense matrix leads to high computational cost whereas a sparse
+matrix we can significantly reduce the cost. As such, different algorithms may perform
+better depending on the sparsity structure of X, making matrix sparsity a relevant blocking
+factor when comparing algorithm behavior and computational efficiency.
+
+The total number of block combinations is determined by the product of the number of levels
+in each blocking factor, denoted b. For example, if the experiment includes three
+dimensional regimes, two sparsity levels, and two regularization strengths, then there are
+$3 * 2 * 2 = 12$ block combinations. We will also denote r to be the number of replicated
+datasets in each block. Here, we mean the number datasets within a block. The total number
+of experimental units is then ${b * r}$.
+
+
+| Blocking System | Factor | Blocks |
+|:----------------|:-------|:-------|
+| Dataset | Dimensional regime | $(p \ll n)$, $(p \approx n)$, $(p \gg n)$|
+| Ridge Penalty | Magnitude of ${\lambda}$ relative to the spectral scale of $X^\top X$ | Strong: $\kappa \approx 10$, Moderate: $\kappa \approx 10^3$, Weak: $\kappa \approx 10^6$, with $\lambda$ computed from the corresponding $\epsilon$. |
+| Matrix Sparsity| Density of non-zero values in $X$ | Sparse (< 10% non-zero), Moderate (10%-50% non-zero), Dense (> 50% non-zero)|
+# Treatments
+
+The treatments are the ridge regression solution methods:
+
+- Gradient-based optimization
+- Stochastic gradient descent
+- Direct Methods
+    - Golub Kahan Bidiagonalization
+
+ Since each experimental unit will recieves all t treatments, the total number of algorithm
+ runs in the experiment is ${t * b * r}$. For this experiment, ${t=3}$. To ensure fair
+ comparison between algorithms, each treatment will be applied under a fixed wall time
+ constraint. Each algorithm will be run for a maximum of two hours per experimental unit. 
+# Observational Units and Measurements
+
+The observational units are each algorithm-dataset pair. For each combination we will observe the following 
+
+| Column Name | Data Type | Description |
+|:---|:---|:---|
+| `dataset_id` | Positive Integer | Identifier for the generated dataset (experimental unit). |
+| `dimensional_regime` | String | Relationship between predictors and observations: `p << n`, `p ≈ n`, or `p >> n`. |
+| `sparsity_level` | String | Density of the matrix `X`: `Sparse`, `Moderate`, or `Dense`. |
+| `lambda_level` | String | Relative magnitude of the ridge penalty parameter `λ`: `Weak`, `Moderate`, or  `Strong`. |
+| `algorithm` | String | Ridge regression solution method used: `GradientDescent`, `SGD`, or `DirectMethod`. |
+| `runtime_seconds` | Positive Floating-point | Time required for the algorithm to compute a solution. |
+| `iterations` | Positive Integer | Number of iterations performed by the algorithm (`NA` for direct methods). |
+| `step_size` | Positive Floating-point | Step size used in gradient descent or SGD (`NA` for direct methods). |
+| `batch_size` | Positive Integer | Number of samples used per SGD update (`NA` for direct methods and gradient descent). |
+| `number_of_epochs` | Positive Integer | Number epochs per observation (`NA` for direct methods). |
+
+
+
+The collected measurements will be written to a CSV file. Each row in the file corresponds
+to a single algorithm–dataset pair, which forms the observational unit of the experiment.
+The columns represent the recorded measurements. After the experiment, the resulting CSV
+file should contain ${Algorithms∗Datasets}$ number of rows and each row will contain exactly
+10 columns.
diff --git a/test/Project.toml b/test/Project.toml
@@ -1,2 +1,5 @@
 [deps]
+CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
+DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
 Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
+LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
-Original file line number
+Diff line change
@@ Expand Up / @@ -14,6 +14,7 @@ makedocs(; @@
         ),
         pages=[
             "Home" => "index.md",
+            "Design" => "design.md",
         ],
     )
@@ Expand Down @@