feat: add ml/strided/dkmeans-init-plus-plus#12312
Conversation
---
type: pre_commit_static_analysis_report
description: Results of running static analysis checks when committing changes.
report:
- task: lint_filenames
status: passed
- task: lint_editorconfig
status: passed
- task: lint_markdown_pkg_readmes
status: na
- task: lint_markdown_docs
status: na
- task: lint_markdown
status: na
- task: lint_package_json
status: na
- task: lint_repl_help
status: na
- task: lint_javascript_src
status: passed
- task: lint_javascript_cli
status: na
- task: lint_javascript_examples
status: na
- task: lint_javascript_tests
status: na
- task: lint_javascript_benchmarks
status: na
- task: lint_python
status: na
- task: lint_r
status: na
- task: lint_c_src
status: na
- task: lint_c_examples
status: na
- task: lint_c_benchmarks
status: na
- task: lint_c_tests_fixtures
status: na
- task: lint_shell
status: na
- task: lint_typescript_declarations
status: passed
- task: lint_typescript_tests
status: na
- task: lint_license_headers
status: passed
---
| * Initializes centroids by performing the k-means++ initialization procedure. | ||
| * | ||
| * ## Method | ||
| * | ||
| * The k-means++ algorithm for choosing initial centroids is as follows: | ||
| * | ||
| * 1. Select a data point uniformly at random from a data set \\( X \\). This data point is first centroid and denoted \\( c_0 \\). | ||
| * | ||
| * 2. Compute the distance from each data point to \\( c_0 \\). Denote the distance between \\( c_j \\) and data point \\( m \\) as \\( d(x_m, c_j) \\). | ||
| * | ||
| * 3. Select the next centroid, \\( c_1 \\), at random from \\( X \\) with probability | ||
| * | ||
| * ```tex | ||
| * \frac{d^2(x_m, c_0)}{\sum_{j=0}^{n-1} d^2(x_j, c_0)} | ||
| * ``` | ||
| * | ||
| * where \\( n \\) is the number of data points. | ||
| * | ||
| * 4. To choose centroid \\( j \\), | ||
| * | ||
| * a. Compute the distances from each data point to each centroid and assign each data point to its closest centroid. | ||
| * | ||
| * b. For \\( i = 0,\ldots,n-1 \\) and \\( p = 0,\ldots,j-2 \\), select centroid \\( j \\) at random from \\( X \\) with probability | ||
| * | ||
| * ```tex | ||
| * \frac{d^2(x_i, c_p)}{\sum_{\{h; x_h \exits C_p\}} d^2(x_h, c_p)} | ||
| * ``` | ||
| * | ||
| * where \\( C_p \\) is the set of all data points closest to centroid \\( c_p \\) and \\( x_i \\) belongs to \\( c_p \\). | ||
| * | ||
| * Stated more plainly, select each subsequent centroid with a probability proportional to the distance from the centroid to the closest centroid already chosen. | ||
| * | ||
| * 5. Repeat step `4` until \\( k \\) centroids have been chosen. | ||
| * | ||
| * ## References | ||
| * | ||
| * - Arthur, David, and Sergei Vassilvitskii. 2007. "K-means++: The Advantages of Careful Seeding." In _Proceedings of the Eighteenth Annual Acm-Siam Symposium on Discrete Algorithms_, 1027–35. SODA '07. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. <http://dl.acm.org/citation.cfm?id=1283383.1283494>. | ||
| * |
There was a problem hiding this comment.
I am not sure on which all files I should write this description, please guide me on this.
There was a problem hiding this comment.
Left a comment. Only the "base" implementation.
| * ]); | ||
| * | ||
| * var v = dkmeansInitPlusPlus( 'row-major', k, M, N, out, 2, xbuf, 2, 'sqeuclidean', 3, 44 ); | ||
| * // returns <Float64Array>[0,0,1,-1,1,1] |
There was a problem hiding this comment.
I hope this return value makes sense. As the user already has the strides it should be easy for them to access it, also the user is expected to pass in an empty initialized out array which will be filled with the flat centroids array.
There was a problem hiding this comment.
Yeah, I am not sure how to interpret this return value. E.g., what does -1 correspond to?
There was a problem hiding this comment.
<Float64Array>[0,0,1,-1,1,1] means [ [ 0,0 ] ,[ 1,-1 ], [ 1,1 ] ] as the user already has LDO = 2.
-1 is just one feature value of a centroid from the centroids array.
There was a problem hiding this comment.
Right, right. The output array contains the initial centroids. In which case, you are missing decimal points. Reading this, I kept thinking that the values referred to indices. Having the decimals would have reinforced that these are values from the list of data points.
| // Create a scratch array for storing cumulative probabilities: | ||
| probs = new Float64Array( M ); | ||
|
|
||
| // 2-5. For each data point, compute the distances to each centroid, find the closest centroid, and, based on the distance to the closest centroid, assign a probability to the data point to be chosen as centroid `c_j`... |
There was a problem hiding this comment.
Please note that all these comments in this implementation is directly referenced from ml/incr/kmeans/lib/init_kmeansplusplus
---
type: pre_commit_static_analysis_report
description: Results of running static analysis checks when committing changes.
report:
- task: lint_filenames
status: passed
- task: lint_editorconfig
status: passed
- task: lint_markdown_pkg_readmes
status: na
- task: lint_markdown_docs
status: na
- task: lint_markdown
status: na
- task: lint_package_json
status: na
- task: lint_repl_help
status: na
- task: lint_javascript_src
status: passed
- task: lint_javascript_cli
status: na
- task: lint_javascript_examples
status: na
- task: lint_javascript_tests
status: na
- task: lint_javascript_benchmarks
status: na
- task: lint_python
status: na
- task: lint_r
status: na
- task: lint_c_src
status: na
- task: lint_c_examples
status: na
- task: lint_c_benchmarks
status: na
- task: lint_c_tests_fixtures
status: na
- task: lint_shell
status: na
- task: lint_typescript_declarations
status: passed
- task: lint_typescript_tests
status: na
- task: lint_license_headers
status: passed
---
---
type: pre_commit_static_analysis_report
description: Results of running static analysis checks when committing changes.
report:
- task: lint_filenames
status: passed
- task: lint_editorconfig
status: passed
- task: lint_markdown_pkg_readmes
status: na
- task: lint_markdown_docs
status: na
- task: lint_markdown
status: na
- task: lint_package_json
status: na
- task: lint_repl_help
status: na
- task: lint_javascript_src
status: na
- task: lint_javascript_cli
status: na
- task: lint_javascript_examples
status: na
- task: lint_javascript_tests
status: passed
- task: lint_javascript_benchmarks
status: na
- task: lint_python
status: na
- task: lint_r
status: na
- task: lint_c_src
status: na
- task: lint_c_examples
status: na
- task: lint_c_benchmarks
status: na
- task: lint_c_tests_fixtures
status: na
- task: lint_shell
status: na
- task: lint_typescript_declarations
status: passed
- task: lint_typescript_tests
status: na
- task: lint_license_headers
status: passed
---
---
type: pre_commit_static_analysis_report
description: Results of running static analysis checks when committing changes.
report:
- task: lint_filenames
status: passed
- task: lint_editorconfig
status: passed
- task: lint_markdown_pkg_readmes
status: na
- task: lint_markdown_docs
status: na
- task: lint_markdown
status: na
- task: lint_package_json
status: na
- task: lint_repl_help
status: na
- task: lint_javascript_src
status: na
- task: lint_javascript_cli
status: na
- task: lint_javascript_examples
status: na
- task: lint_javascript_tests
status: na
- task: lint_javascript_benchmarks
status: passed
- task: lint_python
status: na
- task: lint_r
status: na
- task: lint_c_src
status: na
- task: lint_c_examples
status: na
- task: lint_c_benchmarks
status: na
- task: lint_c_tests_fixtures
status: na
- task: lint_shell
status: na
- task: lint_typescript_declarations
status: passed
- task: lint_typescript_tests
status: na
- task: lint_license_headers
status: passed
---
|
/stdlib merge |
---
type: pre_commit_static_analysis_report
description: Results of running static analysis checks when committing changes.
report:
- task: lint_filenames
status: passed
- task: lint_editorconfig
status: passed
- task: lint_markdown_pkg_readmes
status: na
- task: lint_markdown_docs
status: na
- task: lint_markdown
status: na
- task: lint_package_json
status: na
- task: lint_repl_help
status: na
- task: lint_javascript_src
status: passed
- task: lint_javascript_cli
status: na
- task: lint_javascript_examples
status: na
- task: lint_javascript_tests
status: na
- task: lint_javascript_benchmarks
status: na
- task: lint_python
status: na
- task: lint_r
status: na
- task: lint_c_src
status: na
- task: lint_c_examples
status: na
- task: lint_c_benchmarks
status: na
- task: lint_c_tests_fixtures
status: na
- task: lint_shell
status: na
- task: lint_typescript_declarations
status: passed
- task: lint_typescript_tests
status: na
- task: lint_license_headers
status: passed
---
| if ( trials < 1 ) { | ||
| throw new TypeError( format( 'invalid argument. Thirteenth argument must be a valid trials (>=1). Value: `%s`.', trials ) ); | ||
| } |
There was a problem hiding this comment.
Would it be better to just return NaN?
Coverage Report
The above coverage report was generated for the changes in this PR. |
| if ( k < 1 || M < 1 || N < 1) { | ||
| return NaN; | ||
| } |
There was a problem hiding this comment.
Is this necessary?
- As it is a low-level kernel, we can expect user to pass in valid parameters.
- But if this check is not kept, for invalid params the API return
<Float64Array>[ NaN, NaN, NaN, ..., NaN ].
| expected = new Float64Array( data.expected ); | ||
| for ( i = 0; i < expected.length; i++ ) { | ||
| t.strictEqual( isAlmostSameValue( out[ i ], expected[ i ], 0 ), true, 'returns expected value' ); | ||
| } |
There was a problem hiding this comment.
Planning on using @stdlib/assert/is-same-float64array here, am I good to go?
cc: @kgryte
| * | ||
| * - Arthur, David, and Sergei Vassilvitskii. 2007. "K-means++: The Advantages of Careful Seeding." In _Proceedings of the Eighteenth Annual Acm-Siam Symposium on Discrete Algorithms_, 1027–35. SODA '07. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. <http://dl.acm.org/citation.cfm?id=1283383.1283494>. | ||
| * | ||
| * @param {string} order - storage layout |
There was a problem hiding this comment.
Your parameter order is odd and does not match conventions elsewhere in the project.
Instead,
f( order, M, N, k, trials, metric, X, LDX, out, LDO[, options] )
There was a problem hiding this comment.
where options corresponds to PRNG options, similar to what is done in https://github.com/stdlib-js/stdlib/tree/develop/lib/node_modules/%40stdlib/random/strided/gamma.
| /** | ||
| * Initializes centroids by performing the k-means++ initialization procedure on double-precision floating-point data points. | ||
| * | ||
| * ## Method |
There was a problem hiding this comment.
You need only include this extended Method section in the base.js file. No need to copy for all interfaces.
| if ( trials < 1 ) { | ||
| throw new TypeError( format( 'invalid argument. Thirteenth argument must be a valid trials (>=1). Value: `%s`.', trials ) ); | ||
| } | ||
| if ( k < 1 || M < 1 || N < 1) { |
There was a problem hiding this comment.
I think you are mixing concerns here. Similar to other packages, such as stats/strided, I would just early return here. If M or N is less than 1, we are dealing with an "empty" array, so nothing to be done. Similarly, if k < 1, no clusters to find.
There was a problem hiding this comment.
One could make the argument that we could throw if k < 1, as k < 1 is likely a user error.
There was a problem hiding this comment.
Okay, but I think we should consider the case where k > M (M : number of data points) as we cant sample k unique centroids from M data points if k > M. Either we could throw here or let the implementation sample duplicate centroids.
There was a problem hiding this comment.
True. We should probably throw when k > M.
| randi = randint({ | ||
| 'seed': rand() | ||
| }); | ||
| rand = rand.normalized; |
There was a problem hiding this comment.
This is not what we want. Rather than a seed, a user should have more control over the PRNG. Meaning, this should be pushed up the stack into userland. It made (somewhat) more sense in the incr/kmeans package as (a) we didn't provide direct access to an initialization API and (b) incremental computation is likely to be more a one-off (i.e., we're not interested in rerunning initialization multiple times for different numbers of clusters, as we have some idea going in what the number of clusters will be), and thus, working with PRNG state is less a concern.
This stated, the incr/kmeans package needs to be updated, as we've moved, as a general practice, toward allowing greater PRNG control in packages needing to generate random variates, so I would not consider incr/kmeans an authoritative reference for PRNG handling.
So, where does this leave us? In this implementation, we are currently using multiple PRNGs: one for selecting random data points (discrete-uniform) and another for sampling a cumulative distribution (mt19937). We should
- Add an
optionsargument with PRNG options. Similar torandom/strided/discrete-uniform, aprngoption should be a PRNG which returns random integers (e.g.,minstd,mt19937, etc). - Similar to
random/strided/gamma(and in turnrandom/base/gamma), if no PRNG options are provided, we should use our own internal PRNG, similar to the above. This PRNG should not be seeded. We don't need to seeduniformordiscrete-uniformin this scenario either. - If PRNG options are provided, we should pass those options down so that
uniformanddiscrete-uniformare instances which are instantiated correctly. Userandom/base/uniformwith range[0,1-FLOAT64_EPS], rather thanrand.normalized.
When we add the C implementation, we'll have additional questions to answer. Namely, how to pass down PRNG state into the native layer. In this scenario, if a user provides options.prng, we'll likely need to stay in JS. If a user provides a seed or state option, this becomes more doable. In C, we may need to always pass in a PRNG instance in order to avoid dynamically allocating new state all the time.
| dhash = new Float64Array( M ); | ||
| for ( i = 0; i < M; i++ ) { | ||
| dhash[ i ] = PINF; // squared distance | ||
| } | ||
| // Create a scratch array for storing cumulative probabilities: | ||
| probs = new Float64Array( M ); |
There was a problem hiding this comment.
We need to avoid dynamic memory allocation within strided implementations. Instead, we need to support a workspace array. So the signature becomes
f( order, M, N, k, trials, metric, X, LDX, out, LDO, workspace1, strideW1, workspace2, strideW2[, options] )where workspace1 is assumed to be a Float64Array and have 2*M elements to hold both dhash and probs and workspace2 is assumed to be an Int(32|64)Array and have k elements to hold the indices of centroid candidates. Because we need to support Int64Arrays (which will be added to stdlib shortly), we need to use accessors for getting and setting centroid candidate elements (e.g., use array/base/accessors).
Why do we want to avoid workspace allocation?
- We want to mirror our eventual C implementation, where we don't want to be dynamically allocating memory within the implementation, as then we need to a means to expose an error state if and when
mallocfails. - In our eventual top-level API, we may support operating on "stacks" of data. In this scenario, we want to allocate workspace memory once and then reuse that memory for each dataset within the stack. Otherwise, if memory is dynamically allocated, each operation on a dataset allocates fresh memory, which then needs to be garbage collected, etc.
type: pre_commit_static_analysis_report
description: Results of running static analysis checks when committing changes. report:
Resolves None.
Description
This pull request:
ml/strided/dkmeans-init-plus-plus.Related Issues
This pull request has the following related issues:
Questions
No.
Other
No.
Checklist
AI Assistance
If you answered "yes" above, how did you use AI assistance?
Disclosure
I used Claude Code to compare the implementation with
ml/incr/kmeans/lib/init_kmeansplusplusand existing strided blas implementations, but the proposed changes were fully authored manually by myself.@stdlib-js/reviewers