Dataset.jl and Corresponding Tests #3
Changes from 3 commits
7dc2968
47fc954
787a88d
2dd2295
173cdd1
042d5d6
70cfa67
9b8c48c
bf9707b
214e74d
a87c905
9e1ad08
13fc5bf
5cbeb0b
121fbc0
049bacc
fa3a24a
e1b0106
17a8d48
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,7 +1,15 @@ | ||
| name = "RidgeRegression" | ||
| uuid = "739161c8-60e1-4c49-8f89-ff30998444b1" | ||
| authors = ["Vivak Patel <vp314@users.noreply.github.com>"] | ||
| version = "0.1.0" | ||
| authors = ["Eton Tackett <etont@icloud.com>", "Vivak Patel <vp314@users.noreply.github.com>"] | ||
|
|
||
| [deps] | ||
| CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b" | ||
| DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0" | ||
| Downloads = "f43a241f-c20a-4ad4-852c-f6b1247861c6" | ||
|
|
||
| [compat] | ||
| CSV = "0.10.15" | ||
| DataFrames = "1.8.1" | ||
| Downloads = "1.7.0" | ||
| julia = "1.12.4" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -14,6 +14,7 @@ makedocs(; | |
| ), | ||
| pages=[ | ||
| "Home" => "index.md", | ||
| "Design" => "design.md", | ||
| ], | ||
| ) | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,32 @@ | ||
| # Motivation and Background | ||
| Many modern applications, such as genome-wide association studies (GWAS), involve regression problems with a large number of predictors. In these settings ordinary least squares (OLS) can become unstable because of noise and ill-conditioning. Penalized least squares (PLS) extends OLS by adding a penalty term that shrinks the parameter estimates. The goal is to select the model with the best tradeoff between goodness of fit and model complexity. Ridge regression, one approach within PLS, adds an L2 (squared Euclidean norm) penalty on the coefficients. | ||
|
|
||
| # Questions | ||
| Key question: | ||
| Which ridge regression algorithm provides the best balance between: | ||
| - Numerical stability | ||
| - Computational aspects (GPU/CPU, runtime, etc.) | ||
| - Predictive accuracy | ||
| # Experimental Units | ||
| The experimental units are the datasets under fixed penalty weights. Because the statistical behavior of ridge regression algorithms depends strongly on the dimensional structure of the problem, a blocking procedure will be used. Datasets will be grouped according to their dimensional regime, characterized as p >> n, p ≈ n, and p << n. These regimes correspond to fundamentally different geometric properties of the design matrix, including rank behavior, conditioning, and the stability of the normal equations. | ||
|
|
||
| In addition to dimensional regime, matrix conditioning will be incorporated as a secondary blocking factor. The condition number of the design matrix quantifies the sensitivity of the regression problem to perturbations in the data and directly affects the numerical stability and convergence behavior of ridge solution methods. Ill-conditioned matrices tend to cause slow convergence of iterative methods and amplify errors, while well-conditioned matrices tend to produce stable and rapidly convergent behavior. | ||
|
|
||
| | Blocking System | Factor | Blocks | | ||
| |:----------------|:-------|:-------| | ||
| | Dataset | Dimensional regime (\(p/n\)) | $(p \ll n)$, $(p \approx n)$, $(p \gg n)$| | ||
| | Matrix conditioning | Condition number of \( X \) or \( X^T X \) | Low, Medium, High | | ||
| # Treatments | ||
| The treatments are the ridge regression solution methods: | ||
| - Gradient descent | ||
| - Stochastic gradient descent | ||
| - Closed-form solutions | ||
| # Observational Units and Measurements | ||
| The observational units are the algorithm-dataset pairs. For each pair we will observe the following: | ||
| | Measurement System | Factor | Measurements | | ||
| |:--------------------------|:--------------------------|:-------------| | ||
| | Predictive Performance | Prediction error | Training MSE, Test MSE, RMSE, R² | | ||
| | Estimation Accuracy | Parameter recovery | ‖β̂ − β_true‖₂² (if known) | ||
| | Computational Performance | Efficiency | Runtime (seconds), Iterations to convergence | | ||
| | Numerical Stability | Solution accuracy | Perturbation sensitivity | | ||
| | Model Complexity | Coefficient magnitude | ‖β̂‖₂ | |
|
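For reference, the ridge estimate described in the design page minimizes ‖y − Xβ‖₂² + λ‖β‖₂², and the closed-form treatment solves (XᵀX + λI)β = Xᵀy. The following is a minimal sketch of that solve and of the two condition numbers named in the blocking table. The synthetic design matrix, the true coefficients, and the penalty weight `λ = 0.1` are illustrative assumptions and do not come from this PR.

```julia
using LinearAlgebra

# Hypothetical problem sizes and penalty weight (assumptions for illustration only).
n, p, λ = 50, 5, 0.1

# Deterministic synthetic design matrix and response, so the sketch is reproducible.
X = [sin(i * j) for i in 1:n, j in 1:p]
beta_true = collect(1.0:p)
y = X * beta_true

# Closed-form ridge solution: solve (X'X + λI) β = X'y rather than forming an inverse.
beta_hat = (X'X + λ * I) \ (X'y)

# Condition numbers that could serve as the secondary blocking factor.
cond_X = cond(X)
cond_ridge = cond(X'X + λ * I)
```

Solving the linear system with `\` avoids an explicit matrix inverse, which is the usual choice when numerical stability is one of the quantities being measured.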
Owner
All dependencies should appear in the Project.toml file. You should activate the package environment and then "add ..." your dependencies to ensure compatibility and correct environment for the package.
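One way to follow this suggestion is sketched below with the `Pkg` API. It assumes the commands are run from the package root (the directory that contains Project.toml) and that network access is available.

```julia
using Pkg

# Activate the package's own environment (assumes the current directory is the package root).
Pkg.activate(".")

# Adding packages through Pkg records them under [deps] in Project.toml.
Pkg.add(["CSV", "DataFrames", "Downloads"])
```

The equivalent in the Pkg REPL mode is `activate .` followed by `add CSV DataFrames Downloads`.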
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,132 @@ | ||
| using CSV | ||
| using DataFrames | ||
| using Downloads | ||
|
|
||
| export Dataset, csv_dataset | ||
|
Owner
In Julia, we put using/import statements in the main source file. We do the same for export statements.
||
|
|
||
| """ | ||
| Dataset(name, X, y) | ||
|
|
||
| Contains datasets for ridge regression experiments. | ||
|
|
||
| # Fields | ||
| - `name::String`: Name of dataset | ||
| - `X::Matrix{Float64}`: Matrix of variables/features | ||
| - `y::Vector{Float64}`: Target vector | ||
|
|
||
| # Throws | ||
| - `ArgumentError`: If rows in `X` does not equal length of `y`. | ||
|
Comment on lines +1 to +29
Owner
There should be documentation for the struct being created and then there should be documentation for the constructor in the same docstring.
||
|
|
||
| # Notes | ||
|
Owner
Notes should be admonitions. See Documenter.jl's documentation on admonitions.
||
| Used as the experimental unit for ridge regression experiments. | ||
| """ | ||
| struct Dataset | ||
| name::String | ||
| X::Matrix{Float64} | ||
| y::Vector{Float64} | ||
|
|
||
| function Dataset(name::String, X::AbstractMatrix, y::AbstractVector) | ||
| size(X, 1) == length(y) || | ||
| throw(ArgumentError("X and y must have same number of rows")) | ||
|
|
||
| new(name, Matrix{Float64}(X), Vector{Float64}(y)) | ||
|
Owner
If you are interested in looking at sparse design matrices, this functionality precludes that as any matrix would be converted to
||
| end | ||
| end | ||
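The review comments on this struct (one docstring covering both the type and its constructor, notes written as admonitions, and the sparse-matrix concern) could be addressed together along the following lines. This is only a sketch under stated assumptions: the type parameters `M` and `V` are hypothetical additions, and storing `X` without conversion changes the behavior of the PR, whose tests expect `d.X isa Matrix{Float64}`.

```julia
using SparseArrays

"""
    Dataset{M,V}

Container for one experimental unit in the ridge regression experiments.

# Fields
- `name::String`: Name of the dataset.
- `X::M`: Design matrix, kept in the matrix type that was passed in.
- `y::V`: Response vector.

# Constructor

    Dataset(name, X, y)

Check that `X` and `y` have the same number of rows and wrap them without copying.

# Throws
- `ArgumentError`: If `size(X, 1) != length(y)`.

!!! note
    Used as the experimental unit for ridge regression experiments.
"""
struct Dataset{M<:AbstractMatrix{<:Real},V<:AbstractVector{<:Real}}
    name::String
    X::M
    y::V

    function Dataset(
        name::String, X::M, y::V
    ) where {M<:AbstractMatrix{<:Real},V<:AbstractVector{<:Real}}
        size(X, 1) == length(y) ||
            throw(ArgumentError("X and y must have same number of rows"))
        return new{M,V}(name, X, y)
    end
end

# A sparse design matrix keeps its sparse storage instead of being densified.
d = Dataset("sparse_toy", sparse([1.0 0.0; 0.0 2.0]), [10.0, 20.0])
```

Whether to preserve the input type or convert to a common element type is a design decision for the package; the sketch only shows that a parametric struct leaves both options open.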
|
|
||
| """ | ||
| one_hot_encode(Xdf::DataFrame; drop_first=true) | ||
|
|
||
| One-hot encode categorical (string-like) columns of `Xdf`. | ||
|
|
||
| # Arguments | ||
| - `Xdf::DataFrame`: Feature table; numeric columns are passed through | ||
| and all other columns are one-hot encoded. | ||
|
|
||
| # Keyword Arguments | ||
| - `drop_first::Bool=true`: If `true`, drop the first dummy column for | ||
| each categorical feature to avoid multicollinearity. | ||
|
Owner
The indenting here is not consistent with the indent level in line 51. Please check the BlueStyle guide to understand how much indentation is needed.
||
|
|
||
| # Returns | ||
| A `Matrix{Float64}` with one column per numeric feature and per retained dummy column. | ||
| """ | ||
| function one_hot_encode(Xdf::DataFrame; drop_first::Bool = true)::Matrix{Float64} | ||
|
vp314 marked this conversation as resolved.
Outdated
Owner
Maybe this function should focus on one-hot encoding a specific column provided to the function rather than an entire data frame as we do not always know which columns should be one-hot encoded just from their type. Think of categorical data that is saved in the data set as integers rather than as words.
||
| n = nrow(Xdf) | ||
| cols = Vector{Vector{Float64}}() | ||
|
|
||
| for name in names(Xdf) | ||
| col = Xdf[!, name] | ||
|
Owner
Maybe move this inside the first if statement on line 75
||
| if eltype(col) <: Real | ||
| push!(cols, Float64.(col)) | ||
| continue | ||
| end | ||
|
|
||
| scol = string.(col) | ||
| lv = unique(scol) | ||
| ind = scol .== permutedims(lv) | ||
|
|
||
| println("Variable: $name") | ||
|
Owner
We should not have print statements inside of code.
||
| for (j, level) in enumerate(lv) | ||
| println(" Dummy column (before drop) $j → $name = $level") | ||
| end | ||
|
|
||
| if drop_first && size(ind, 2) > 1 | ||
| ind = ind[:, 2:end] | ||
| end | ||
|
|
||
| for j in 1:size(ind, 2) | ||
| push!(cols, Float64.(ind[:, j])) | ||
| end | ||
| end | ||
|
|
||
| p = length(cols) | ||
| X = Matrix{Float64}(undef, n, p) | ||
| for j in 1:p | ||
| X[:, j] = cols[j] | ||
| end | ||
|
|
||
| return Matrix{Float64}(X) | ||
|
Owner
You should have an intercept column (column of 1s) prepended to X. I would do this higher up. Probably around Line 68
||
|
|
||
| end | ||
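A per-column variant, as suggested in the review, might look like the sketch below. The function name `one_hot_column` and its return shape are hypothetical and not part of the PR. It takes a single column chosen by the caller, so integer-coded categories are handled, it prints nothing, and the usage lines show one way an intercept column could be prepended afterwards.

```julia
"""
    one_hot_column(col; drop_first=true) -> (levels, indicators)

One-hot encode a single column. Hypothetical helper for illustration only.
"""
function one_hot_column(col::AbstractVector; drop_first::Bool=true)
    lv = unique(col)
    # Drop the first level only when more than one level exists.
    if drop_first && length(lv) > 1
        lv = lv[2:end]
    end
    # Comparing the column against a row of levels broadcasts to an n × k Bool matrix.
    indicators = Float64.(col .== permutedims(lv))
    return lv, indicators
end

codes = [3, 1, 3, 2]                 # categorical data stored as integers
lv, Z = one_hot_column(codes)

# Prepend an intercept column of ones to the encoded features.
X = hcat(ones(length(codes)), Z)
```

Returning the retained levels alongside the indicator matrix gives the caller the column-to-level mapping that the original code printed to standard output.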
| """ | ||
| csv_dataset(path_or_url; target_col, name="csv_dataset") | ||
|
Owner
This is a bad function name.
||
|
|
||
| Load a dataset from a CSV file or URL. | ||
|
|
||
| # Arguments | ||
| - `path_or_url::String` | ||
| Local file path or web URL that has CSV data. | ||
|
|
||
| - `target_col` | ||
| Column index OR column name containing the response variable. | ||
|
|
||
| - `name::String` | ||
| Dataset name. | ||
|
|
||
| # Returns | ||
| `Dataset` | ||
|
Owner
Need to abide by the style guide as you have done above.
||
| """ | ||
| function csv_dataset(path_or_url::String; | ||
|
Owner
This does not follow BlueStyle
||
| target_col, | ||
| name::String = "csv_dataset" | ||
| ) | ||
|
|
||
| filepath = | ||
| startswith(path_or_url, "http") ? | ||
| Downloads.download(path_or_url) : | ||
| path_or_url | ||
|
|
||
| df = DataFrame(CSV.File(filepath)) | ||
| df = dropmissing(df) | ||
| Xdf = select(df, DataFrames.Not(target_col)) | ||
|
|
||
| y = target_col isa Int ? | ||
| df[:, target_col] : | ||
| df[:, Symbol(target_col)] | ||
|
|
||
|
|
||
| X = one_hot_encode(Xdf; drop_first = true) | ||
|
|
||
|
|
||
|
|
||
| return Dataset(name, Matrix{Float64}(X), Vector{Float64}(y)) | ||
| end | ||
|
etontackett marked this conversation as resolved.
Owner
Individual test files should be wrapped as their own modules.
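A test file wrapped in its own module could be laid out roughly as follows. The module name `DatasetTests` and the stand-in helper `row_count_matches` are hypothetical; in the package itself the module would presumably load the package under test instead of defining a helper.

```julia
module DatasetTests  # hypothetical module name

using Test

# Stand-in for the package code so the sketch runs on its own.
# In the real test file this would be replaced by loading the package.
row_count_matches(X, y) = size(X, 1) == length(y)

@testset "Dataset dimension check" begin
    @test row_count_matches([1 2; 3 4], [10, 20])
    @test !row_count_matches([1 2; 3 4], [1, 2, 3])
end

end # module DatasetTests
```

Wrapping each file this way keeps helper names and `using` statements from leaking between test files.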
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,66 @@ | ||
| using Test | ||
| using DataFrames | ||
| using CSV | ||
|
|
||
| include("../src/dataset.jl") | ||
| @testset "Dataset" begin | ||
| X = [1 2; 3 4] | ||
| y = [10, 20] | ||
| d = Dataset("toy", X, y) | ||
|
|
||
| @test d.name == "toy" | ||
| @test d.X isa Matrix{Float64} | ||
| @test d.y isa Vector{Float64} | ||
| @test size(d.X) == (2, 2) | ||
| @test length(d.y) == 2 | ||
| @test d.X[1, 1] == 1.0 | ||
| @test d.y[2] == 20.0 | ||
|
|
||
| @test_throws ArgumentError Dataset("bad", X, [1, 2, 3]) | ||
| end | ||
|
|
||
| @testset "one_hot_encode" begin | ||
| df = DataFrame( | ||
| A = ["red", "blue", "red", "green"], | ||
| B = [1, 2, 3, 4], | ||
| C = ["small", "large", "medium", "small"] | ||
| ) | ||
|
|
||
| X = redirect_stdout(devnull) do | ||
| one_hot_encode(df; drop_first = true) | ||
| end | ||
|
|
||
| @test size(X) == (4, 5) | ||
| @test X[:, 3] == [1.0, 2.0, 3.0, 4.0] | ||
| @test all(x -> x == 0.0 || x == 1.0, X[:, [1,2,4,5]]) | ||
| @test all(vec(sum(X[:, 1:2]; dims=2)) .<= 1) | ||
| @test all(vec(sum(X[:, 4:5]; dims=2)) .<= 1) | ||
| end | ||
|
|
||
| @testset "csv_dataset" begin | ||
| tmp = tempname() * ".csv" | ||
| df = DataFrame( | ||
| a = [1.0, 2.0, missing, 4.0], | ||
| b = ["x", "y", "y", "x"], | ||
| y = [10.0, 20.0, 30.0, 40.0] | ||
| ) | ||
| CSV.write(tmp, df) | ||
|
|
||
| d = redirect_stdout(devnull) do | ||
| csv_dataset(tmp; target_col=:y, name="tmp") | ||
| end | ||
|
|
||
| @test d.name == "tmp" | ||
| @test d.X isa Matrix{Float64} | ||
| @test d.y isa Vector{Float64} | ||
|
|
||
| @test length(d.y) == 3 | ||
| @test size(d.X, 1) == 3 | ||
| @test d.y == [10.0, 20.0, 40.0] | ||
|
|
||
| d2 = redirect_stdout(devnull) do | ||
| csv_dataset(tmp; target_col=3, name="tmp2") | ||
| end | ||
| @test d2.y == [10.0, 20.0, 40.0] | ||
| @test size(d2.X, 1) == 3 | ||
| end |
This is an update to the pages of the documentation (i.e., the manual). So you should go through and check that you are doing all the things required for manual pages updates. Also, PRs should be more focused.