-
Notifications
You must be signed in to change notification settings - Fork 0
Experimental Design #6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
etontackett
wants to merge
35
commits into
main
Choose a base branch
from
Design
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+211
−1
Open
Changes from all commits
Commits
Show all changes
35 commits
Select commit
Hold shift + click to select a range
7dc2968
Add dataset utilities and tests
EtonT471 47fc954
Adding dataset_tests.jl
EtonT471 787a88d
Small changes to design.md
EtonT471 2dd2295
March 16 Updates
EtonT471 173cdd1
Ridge Regression file
EtonT471 042d5d6
dataset.jl small update
EtonT471 70cfa67
Updated Experimental Units and Treatments Sections
EtonT471 9b8c48c
Small changes
EtonT471 6c923ab
Bidiagonalization Stuff
EtonT471 6727a90
Adding Linear Algebra to Project.toml
EtonT471 358e881
Design Branch
EtonT471 2ad65a9
Updated design 3/19
EtonT471 fc2595f
3/19 edits
EtonT471 788665c
Update experimental design document
EtonT471 698aff7
Remove code files from design branch
EtonT471 f904351
Compiling issues
EtonT471 49c8a81
Attempt 1
EtonT471 8529e34
Minor adjustment
EtonT471 3734a46
restoring .jl files
EtonT471 48edb29
moving to src
EtonT471 8c2e5bc
Put source file back into the folder.
vp314 8192f16
Delete docs/src/RidgeRegression.jl
vp314 8a6525e
Source folder re-added
EtonT471 dc10590
Fixed math notation
vp314 6b893ef
recent changes
EtonT471 a605e40
Updated design.md
EtonT471 d9a63e7
Remove RidgeRegression.jl changes from Design branch
EtonT471 c29b983
fixing compiling
EtonT471 0ba580c
Fixing dollar signs
EtonT471 e080193
XtopX change
EtonT471 26a6198
Update 4/13
EtonT471 cf1e17e
4/14 Update
EtonT471 9ca4815
4/28 updates
EtonT471 2c641dd
Fixed compiling issue
EtonT471 06f1c06
More compiling fixed
EtonT471 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,7 +1,15 @@ | ||
| name = "RidgeRegression" | ||
| uuid = "739161c8-60e1-4c49-8f89-ff30998444b1" | ||
| authors = ["Vivak Patel <vp314@users.noreply.github.com>"] | ||
| version = "0.1.0" | ||
| authors = ["Eton Tackett <etont@icloud.com>", "Vivak Patel <vp314@users.noreply.github.com>"] | ||
|
|
||
| [deps] | ||
| CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b" | ||
| DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0" | ||
| Downloads = "f43a241f-c20a-4ad4-852c-f6b1247861c6" | ||
|
|
||
| [compat] | ||
| CSV = "0.10.15" | ||
| DataFrames = "1.8.1" | ||
| Downloads = "1.7.0" | ||
| julia = "1.12.4" |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -14,6 +14,7 @@ makedocs(; | |
| ), | ||
| pages=[ | ||
| "Home" => "index.md", | ||
| "Design" => "design.md", | ||
| ], | ||
| ) | ||
|
|
||
|
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,198 @@ | ||
| # Motivation and Background | ||
| Many modern science problems involve regression problems with extremely large numbers of | ||
| predictors. Genome-wide association studies (GWAS), for example, try to identify genetic | ||
| variants associated with a disease phenotype using hundreds of thousands or millions of | ||
| genomic features. In such settings, traditional least squares methods fail. Penalized Least Squares (PLS) extends ordinary least squares (OLS) | ||
| regression by adding a penalty term to shrink parameter estimates. Ridge regression, an | ||
| approach within PLS, adds a regularization term, producing a regularized estimator. | ||
|
|
||
| Mathematically, ridge regression estimates the regression coefficients by solving the penalized least squares problem | ||
| ```math | ||
| \hat{\boldsymbol{\beta}} = | ||
| \arg\min_{\boldsymbol{\beta}} | ||
| \left( | ||
| \| \mathbf{y} - X\boldsymbol{\beta} \|_2^2 | ||
| + | ||
| \lambda \| \boldsymbol{\beta} \|_2^2 | ||
| \right) | ||
|
Comment on lines
+11
to
+17
Owner
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You should indicate which norms you are using by using a subscript |
||
| ``` | ||
| where $\lambda > 0$ is a regularization parameter that controls the strength of the penalty. | ||
|
|
||
| The purpose of ridge regression is to stabilize regression estimates when the predictors are | ||
| highly correlated or the design matrix $X$ is nearly singular. Ridge regression modifies the | ||
| least squares objective by adding a penalty on the squared $\ell_2$-norm of the coefficient | ||
| vector. The estimator is obtained by minimizing a penalized least squares objective in which | ||
| large coefficient values are discouraged through the penalty term $\lambda | ||
| \|\boldsymbol{\beta}\|_2^2$. This penalty shrinks the estimated coefficients toward the | ||
| origin, which reduces the variance of the estimator and mitigates the effects of | ||
| multicollinearity. | ||
|
|
||
| There are many numerical algorithms available to compute ridge regression estimates | ||
| including direct methods, Krylov subspace methods, gradient-based optimization, coordinate | ||
| descent, and stochastic gradient descent. These algorithms differ in their computational | ||
| costs and numerical stability. | ||
|
|
||
| The goal of this experiment is to investigate the performance of these algorithms when we | ||
| vary the structure and scale of the regression problem. To do this, we consider the linear | ||
| model $\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ where the matrix ${X}$ | ||
| may be constructed with varying dimensions, sparsity patterns, and conditioning properties. | ||
| # Questions | ||
| The primary goal of this experiment is to compare numerical algorithms for computing ridge | ||
| regression estimates under various conditions. In particular, we aim to address the | ||
| following questions: | ||
|
|
||
| 1. How does the performance of ridge regression algorithms change as the structural and numerical properties of the regression problem vary? | ||
|
|
||
| 2. Which ridge regression algorithm provides the best balance between numerical stability and computational cost across these problem regimes? | ||
|
|
||
| # Experimental Units | ||
| A block is defined as a collection of experimental units that share common characteristics | ||
| which may influence the response. Blocking is introduced to control for | ||
| variation and to enable meaningful comparisons of algorithm performance across | ||
| different problem regimes. | ||
|
|
||
| In this experiment, blocks are defined by combinations of three factors: dimensional | ||
| regime, matrix sparsity, and ridge penalty magnitude. | ||
|
|
||
| The experimental units are the datasets. Each dataset consists of a design matrix $X$ and | ||
| response vector $y$, with the regularization parameter $\lambda$ determined by the | ||
| blocking structure through the choice of $\epsilon$. | ||
|
|
||
| Datasets will be grouped according to their dimensional regime, characterized as $p \ll n$, | ||
| $p ≈ n$, and $p \gg n$. These regimes correspond to fundamentally different geometric | ||
| properties of the design matrix, including rank behavior, conditioning, and the stability of | ||
| the normal equations. All of which impact the performance of numerical algorithms. | ||
|
|
||
| In addition to dimensional block, the strength of the ridge penalty will be incorporated as | ||
| a secondary blocking factor. The ridge estimator is $\hat{\beta_R} = (X^\top X + \lambda | ||
| I)^{-1}X^\top y$. The matrix conditioning number is defined as $\kappa(A) = | ||
| \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$. In the context of ridge regression, the | ||
| regularization parameter ${\lambda}$, can impact the conditioning number. Let $X = U\Sigma | ||
| V^\top$ be the SVD of $X$, with singular values $\sigma_1,\dots,\sigma_p$. | ||
|
|
||
| Then | ||
| ```math | ||
| X^\top X = V \Sigma^\top \Sigma V^\top | ||
| = V \,\mathrm{diag}(\sigma_1^2,\dots,\sigma_p^2)\, V^\top . | ||
| ``` | ||
|
|
||
| Adding the ridge term gives | ||
|
|
||
| ```math | ||
| X^\top X + \lambda I | ||
| = | ||
| V \,\mathrm{diag}(\sigma_1^2+\lambda,\dots,\sigma_p^2+\lambda)\, V^\top . | ||
| ``` | ||
|
|
||
| ```math | ||
| \kappa(X^\top X+\lambda I) | ||
| = | ||
| \frac{\sigma_{\max}^2+\lambda}{\sigma_{\min}^2+\lambda}. | ||
| ``` | ||
|
|
||
| Because the performance of numerical algorithms is strongly influenced by the conditioning | ||
| of the system they solve, the ridge penalty effectively creates regression problems with | ||
| different numerical difficulty. This provides a way to assess how algorithm performance, | ||
| convergence behavior, and computational cost depend on the numerical stability of the | ||
| problem. | ||
| Recall that, | ||
| ```math | ||
| \kappa(X^\top X + \lambda I) | ||
| = | ||
| \frac{\sigma_{\max}^2 + \lambda}{\sigma_{\min}^2 + \lambda}, | ||
| ``` | ||
| Where $\sigma_{\min}$ and $\sigma_{\max}$ denote the smallest and largest | ||
| singular values of $X$. Given a target condition number of the form, $1+\epsilon$ with $\epsilon > 0$, we solve for $\lambda$ by setting | ||
| ```math | ||
| \frac{\sigma_{\max}^2 + \lambda}{\sigma_{\min}^2 + \lambda} = 1+\epsilon, | ||
| ``` | ||
| Which implies | ||
| ```math | ||
| \sigma_{\max}^2 + \lambda | ||
| = | ||
| (1+\epsilon)(\sigma_{\min}^2 + \lambda). | ||
| ``` | ||
| Rearranging gives | ||
| ```math | ||
| \sigma_{\max}^2 - (1+\epsilon)\,\sigma_{\min}^2 = (\epsilon)\lambda, | ||
| ``` | ||
| and therefore | ||
| ```math | ||
| \lambda = \frac{\sigma_{\max}^2 - (1+\epsilon)\,\sigma_{\min}^2}{\epsilon}. | ||
| ``` | ||
| In this experiment, the ridge penalty parameter is not chosen directly, but is instead | ||
| determined through a target level of conditioning. Because the ideal condition number | ||
| of 1 cannot be achieved in practice, we instead aim for values of the form $1 + \epsilon$, | ||
| where $\epsilon > 0$. | ||
|
|
||
| A strong regularization regime corresponds to small values of $\epsilon$, resulting | ||
| in a well-conditioned system with condition number close to 1. Moderate | ||
| regularization corresponds to intermediate values of $\epsilon$, where the condition | ||
| number is reduced but not minimal. Weak regularization corresponds to large values | ||
| of $\epsilon$, where the system remains relatively ill-conditioned and close to the | ||
| unregularized case. | ||
|
|
||
| If $X$ is rank deficient, then $\sigma _{min}=0$, then | ||
| ```math | ||
| \lambda =\frac{\sigma _{max}^2}{\epsilon} | ||
| ``` | ||
| So our method is robust in the case where $X$ is rank deficient. | ||
|
|
||
| Another blocking factor that will be considered is how sparse or dense the matrix $X$ is. | ||
| Many algorithms behave differently depending on whether the matrix is sparse or dense. In | ||
| ridge regression, there are many operations involving $X$ including matrix-matrix products | ||
| and matrix-vector products. A dense matrix leads to high computational cost whereas a sparse | ||
| matrix we can significantly reduce the cost. As such, different algorithms may perform | ||
| better depending on the sparsity structure of X, making matrix sparsity a relevant blocking | ||
| factor when comparing algorithm behavior and computational efficiency. | ||
|
|
||
| The total number of block combinations is determined by the product of the number of levels | ||
| in each blocking factor, denoted b. For example, if the experiment includes three | ||
| dimensional regimes, two sparsity levels, and two regularization strengths, then there are | ||
| $3 * 2 * 2 = 12$ block combinations. We will also denote r to be the number of replicated | ||
| datasets in each block. Here, we mean the number datasets within a block. The total number | ||
| of experimental units is then ${b * r}$. | ||
|
|
||
|
|
||
| | Blocking System | Factor | Blocks | | ||
| |:----------------|:-------|:-------| | ||
| | Dataset | Dimensional regime | $(p \ll n)$, $(p \approx n)$, $(p \gg n)$| | ||
| | Ridge Penalty | Magnitude of ${\lambda}$ relative to the spectral scale of $X^\top X$ | Strong: $\kappa \approx 10$, Moderate: $\kappa \approx 10^3$, Weak: $\kappa \approx 10^6$, with $\lambda$ computed from the corresponding $\epsilon$. | | ||
| | Matrix Sparsity| Density of non-zero values in $X$ | Sparse (< 10% non-zero), Moderate (10%-50% non-zero), Dense (> 50% non-zero)| | ||
| # Treatments | ||
|
|
||
| The treatments are the ridge regression solution methods: | ||
|
|
||
| - Gradient-based optimization | ||
| - Stochastic gradient descent | ||
| - Direct Methods | ||
| - Golub Kahan Bidiagonalization | ||
|
|
||
| Since each experimental unit will recieves all t treatments, the total number of algorithm | ||
| runs in the experiment is ${t * b * r}$. For this experiment, ${t=3}$. To ensure fair | ||
| comparison between algorithms, each treatment will be applied under a fixed wall time | ||
| constraint. Each algorithm will be run for a maximum of two hours per experimental unit. | ||
| # Observational Units and Measurements | ||
|
|
||
| The observational units are each algorithm-dataset pair. For each combination we will observe the following | ||
|
|
||
| | Column Name | Data Type | Description | | ||
| |:---|:---|:---| | ||
| | `dataset_id` | Positive Integer | Identifier for the generated dataset (experimental unit). | | ||
| | `dimensional_regime` | String | Relationship between predictors and observations: `p << n`, `p ≈ n`, or `p >> n`. | | ||
| | `sparsity_level` | String | Density of the matrix `X`: `Sparse`, `Moderate`, or `Dense`. | | ||
| | `lambda_level` | String | Relative magnitude of the ridge penalty parameter `λ`: `Weak`, `Moderate`, or `Strong`. | | ||
| | `algorithm` | String | Ridge regression solution method used: `GradientDescent`, `SGD`, or `DirectMethod`. | | ||
| | `runtime_seconds` | Positive Floating-point | Time required for the algorithm to compute a solution. | | ||
| | `iterations` | Positive Integer | Number of iterations performed by the algorithm (`NA` for direct methods). | | ||
| | `step_size` | Positive Floating-point | Step size used in gradient descent or SGD (`NA` for direct methods). | | ||
| | `batch_size` | Positive Integer | Number of samples used per SGD update (`NA` for direct methods and gradient descent). | | ||
| | `number_of_epochs` | Positive Integer | Number epochs per observation (`NA` for direct methods). | | ||
|
|
||
|
|
||
|
|
||
| The collected measurements will be written to a CSV file. Each row in the file corresponds | ||
| to a single algorithm–dataset pair, which forms the observational unit of the experiment. | ||
| The columns represent the recorded measurements. After the experiment, the resulting CSV | ||
| file should contain ${Algorithms∗Datasets}$ number of rows and each row will contain exactly | ||
| 10 columns. | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,2 +1,5 @@ | ||
| [deps] | ||
| CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b" | ||
| DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0" | ||
| Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40" | ||
| LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e" |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.