diff --git a/Project.toml b/Project.toml index 86ea0a5..427e8aa 100644 --- a/Project.toml +++ b/Project.toml @@ -1,7 +1,15 @@ name = "RidgeRegression" uuid = "739161c8-60e1-4c49-8f89-ff30998444b1" -authors = ["Vivak Patel "] version = "0.1.0" +authors = ["Eton Tackett ", "Vivak Patel "] + +[deps] +CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b" +DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0" +Downloads = "f43a241f-c20a-4ad4-852c-f6b1247861c6" [compat] +CSV = "0.10.15" +DataFrames = "1.8.1" +Downloads = "1.7.0" julia = "1.12.4" diff --git a/docs/make.jl b/docs/make.jl index a1097bb..d42cfbe 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -14,6 +14,7 @@ makedocs(; ), pages=[ "Home" => "index.md", + "Design" => "design.md", ], ) diff --git a/docs/src/design.md b/docs/src/design.md new file mode 100644 index 0000000..cc3c80e --- /dev/null +++ b/docs/src/design.md @@ -0,0 +1,198 @@ +# Motivation and Background +Many modern science problems involve regression problems with extremely large numbers of +predictors. Genome-wide association studies (GWAS), for example, try to identify genetic +variants associated with a disease phenotype using hundreds of thousands or millions of +genomic features. In such settings, traditional least squares methods fail. Penalized Least Squares (PLS) extends ordinary least squares (OLS) +regression by adding a penalty term to shrink parameter estimates. Ridge regression, an +approach within PLS, adds a regularization term, producing a regularized estimator. + +Mathematically, ridge regression estimates the regression coefficients by solving the penalized least squares problem +```math +\hat{\boldsymbol{\beta}} = +\arg\min_{\boldsymbol{\beta}} +\left( +\| \mathbf{y} - X\boldsymbol{\beta} \|_2^2 ++ +\lambda \| \boldsymbol{\beta} \|_2^2 +\right) +``` +where $\lambda > 0$ is a regularization parameter that controls the strength of the penalty. + +The purpose of ridge regression is to stabilize regression estimates when the predictors are +highly correlated or the design matrix $X$ is nearly singular. Ridge regression modifies the +least squares objective by adding a penalty on the squared $\ell_2$-norm of the coefficient +vector. The estimator is obtained by minimizing a penalized least squares objective in which +large coefficient values are discouraged through the penalty term $\lambda +\|\boldsymbol{\beta}\|_2^2$. This penalty shrinks the estimated coefficients toward the +origin, which reduces the variance of the estimator and mitigates the effects of +multicollinearity. + +There are many numerical algorithms available to compute ridge regression estimates +including direct methods, Krylov subspace methods, gradient-based optimization, coordinate +descent, and stochastic gradient descent. These algorithms differ in their computational +costs and numerical stability. + +The goal of this experiment is to investigate the performance of these algorithms when we +vary the structure and scale of the regression problem. To do this, we consider the linear +model $\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ where the matrix ${X}$ +may be constructed with varying dimensions, sparsity patterns, and conditioning properties. +# Questions +The primary goal of this experiment is to compare numerical algorithms for computing ridge +regression estimates under various conditions. In particular, we aim to address the +following questions: + +1. How does the performance of ridge regression algorithms change as the structural and numerical properties of the regression problem vary? + +2. Which ridge regression algorithm provides the best balance between numerical stability and computational cost across these problem regimes? + +# Experimental Units +A block is defined as a collection of experimental units that share common characteristics +which may influence the response. Blocking is introduced to control for +variation and to enable meaningful comparisons of algorithm performance across +different problem regimes. + +In this experiment, blocks are defined by combinations of three factors: dimensional +regime, matrix sparsity, and ridge penalty magnitude. + +The experimental units are the datasets. Each dataset consists of a design matrix $X$ and +response vector $y$, with the regularization parameter $\lambda$ determined by the +blocking structure through the choice of $\epsilon$. + +Datasets will be grouped according to their dimensional regime, characterized as $p \ll n$, +$p ≈ n$, and $p \gg n$. These regimes correspond to fundamentally different geometric +properties of the design matrix, including rank behavior, conditioning, and the stability of +the normal equations. All of which impact the performance of numerical algorithms. + +In addition to dimensional block, the strength of the ridge penalty will be incorporated as +a secondary blocking factor. The ridge estimator is $\hat{\beta_R} = (X^\top X + \lambda +I)^{-1}X^\top y$. The matrix conditioning number is defined as $\kappa(A) = +\frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$. In the context of ridge regression, the +regularization parameter ${\lambda}$, can impact the conditioning number. Let $X = U\Sigma +V^\top$ be the SVD of $X$, with singular values $\sigma_1,\dots,\sigma_p$. + +Then +```math +X^\top X = V \Sigma^\top \Sigma V^\top += V \,\mathrm{diag}(\sigma_1^2,\dots,\sigma_p^2)\, V^\top . +``` + +Adding the ridge term gives + +```math +X^\top X + \lambda I += +V \,\mathrm{diag}(\sigma_1^2+\lambda,\dots,\sigma_p^2+\lambda)\, V^\top . +``` + +```math +\kappa(X^\top X+\lambda I) += +\frac{\sigma_{\max}^2+\lambda}{\sigma_{\min}^2+\lambda}. +``` + +Because the performance of numerical algorithms is strongly influenced by the conditioning +of the system they solve, the ridge penalty effectively creates regression problems with +different numerical difficulty. This provides a way to assess how algorithm performance, +convergence behavior, and computational cost depend on the numerical stability of the +problem. +Recall that, +```math +\kappa(X^\top X + \lambda I) += +\frac{\sigma_{\max}^2 + \lambda}{\sigma_{\min}^2 + \lambda}, +``` +Where $\sigma_{\min}$ and $\sigma_{\max}$ denote the smallest and largest +singular values of $X$. Given a target condition number of the form, $1+\epsilon$ with $\epsilon > 0$, we solve for $\lambda$ by setting +```math +\frac{\sigma_{\max}^2 + \lambda}{\sigma_{\min}^2 + \lambda} = 1+\epsilon, +``` +Which implies +```math +\sigma_{\max}^2 + \lambda += +(1+\epsilon)(\sigma_{\min}^2 + \lambda). +``` +Rearranging gives +```math +\sigma_{\max}^2 - (1+\epsilon)\,\sigma_{\min}^2 = (\epsilon)\lambda, +``` +and therefore +```math +\lambda = \frac{\sigma_{\max}^2 - (1+\epsilon)\,\sigma_{\min}^2}{\epsilon}. +``` +In this experiment, the ridge penalty parameter is not chosen directly, but is instead +determined through a target level of conditioning. Because the ideal condition number +of 1 cannot be achieved in practice, we instead aim for values of the form $1 + \epsilon$, +where $\epsilon > 0$. + +A strong regularization regime corresponds to small values of $\epsilon$, resulting +in a well-conditioned system with condition number close to 1. Moderate +regularization corresponds to intermediate values of $\epsilon$, where the condition +number is reduced but not minimal. Weak regularization corresponds to large values +of $\epsilon$, where the system remains relatively ill-conditioned and close to the +unregularized case. + +If $X$ is rank deficient, then $\sigma _{min}=0$, then +```math +\lambda =\frac{\sigma _{max}^2}{\epsilon} +``` +So our method is robust in the case where $X$ is rank deficient. + +Another blocking factor that will be considered is how sparse or dense the matrix $X$ is. +Many algorithms behave differently depending on whether the matrix is sparse or dense. In +ridge regression, there are many operations involving $X$ including matrix-matrix products +and matrix-vector products. A dense matrix leads to high computational cost whereas a sparse +matrix we can significantly reduce the cost. As such, different algorithms may perform +better depending on the sparsity structure of X, making matrix sparsity a relevant blocking +factor when comparing algorithm behavior and computational efficiency. + +The total number of block combinations is determined by the product of the number of levels +in each blocking factor, denoted b. For example, if the experiment includes three +dimensional regimes, two sparsity levels, and two regularization strengths, then there are +$3 * 2 * 2 = 12$ block combinations. We will also denote r to be the number of replicated +datasets in each block. Here, we mean the number datasets within a block. The total number +of experimental units is then ${b * r}$. + + +| Blocking System | Factor | Blocks | +|:----------------|:-------|:-------| +| Dataset | Dimensional regime | $(p \ll n)$, $(p \approx n)$, $(p \gg n)$| +| Ridge Penalty | Magnitude of ${\lambda}$ relative to the spectral scale of $X^\top X$ | Strong: $\kappa \approx 10$, Moderate: $\kappa \approx 10^3$, Weak: $\kappa \approx 10^6$, with $\lambda$ computed from the corresponding $\epsilon$. | +| Matrix Sparsity| Density of non-zero values in $X$ | Sparse (< 10% non-zero), Moderate (10%-50% non-zero), Dense (> 50% non-zero)| +# Treatments + +The treatments are the ridge regression solution methods: + +- Gradient-based optimization +- Stochastic gradient descent +- Direct Methods + - Golub Kahan Bidiagonalization + + Since each experimental unit will recieves all t treatments, the total number of algorithm + runs in the experiment is ${t * b * r}$. For this experiment, ${t=3}$. To ensure fair + comparison between algorithms, each treatment will be applied under a fixed wall time + constraint. Each algorithm will be run for a maximum of two hours per experimental unit. +# Observational Units and Measurements + +The observational units are each algorithm-dataset pair. For each combination we will observe the following + +| Column Name | Data Type | Description | +|:---|:---|:---| +| `dataset_id` | Positive Integer | Identifier for the generated dataset (experimental unit). | +| `dimensional_regime` | String | Relationship between predictors and observations: `p << n`, `p ≈ n`, or `p >> n`. | +| `sparsity_level` | String | Density of the matrix `X`: `Sparse`, `Moderate`, or `Dense`. | +| `lambda_level` | String | Relative magnitude of the ridge penalty parameter `λ`: `Weak`, `Moderate`, or `Strong`. | +| `algorithm` | String | Ridge regression solution method used: `GradientDescent`, `SGD`, or `DirectMethod`. | +| `runtime_seconds` | Positive Floating-point | Time required for the algorithm to compute a solution. | +| `iterations` | Positive Integer | Number of iterations performed by the algorithm (`NA` for direct methods). | +| `step_size` | Positive Floating-point | Step size used in gradient descent or SGD (`NA` for direct methods). | +| `batch_size` | Positive Integer | Number of samples used per SGD update (`NA` for direct methods and gradient descent). | +| `number_of_epochs` | Positive Integer | Number epochs per observation (`NA` for direct methods). | + + + +The collected measurements will be written to a CSV file. Each row in the file corresponds +to a single algorithm–dataset pair, which forms the observational unit of the experiment. +The columns represent the recorded measurements. After the experiment, the resulting CSV +file should contain ${Algorithms∗Datasets}$ number of rows and each row will contain exactly +10 columns. \ No newline at end of file diff --git a/test/Project.toml b/test/Project.toml index 0c36332..73141b0 100644 --- a/test/Project.toml +++ b/test/Project.toml @@ -1,2 +1,5 @@ [deps] +CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b" +DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0" Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40" +LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e" \ No newline at end of file