Skip to content

feat: add ml/strided/dkmeans-init-plus-plus#12312

Draft
nakul-krishnakumar wants to merge 6 commits into
stdlib-js:developfrom
nakul-krishnakumar:ml-strided-dkmeanspp
Draft

feat: add ml/strided/dkmeans-init-plus-plus#12312
nakul-krishnakumar wants to merge 6 commits into
stdlib-js:developfrom
nakul-krishnakumar:ml-strided-dkmeanspp

Conversation

@nakul-krishnakumar
Copy link
Copy Markdown
Member


type: pre_commit_static_analysis_report
description: Results of running static analysis checks when committing changes. report:

  • task: lint_filenames status: passed
  • task: lint_editorconfig status: passed
  • task: lint_markdown_pkg_readmes status: na
  • task: lint_markdown_docs status: na
  • task: lint_markdown status: na
  • task: lint_package_json status: na
  • task: lint_repl_help status: na
  • task: lint_javascript_src status: passed
  • task: lint_javascript_cli status: na
  • task: lint_javascript_examples status: na
  • task: lint_javascript_tests status: na
  • task: lint_javascript_benchmarks status: na
  • task: lint_python status: na
  • task: lint_r status: na
  • task: lint_c_src status: na
  • task: lint_c_examples status: na
  • task: lint_c_benchmarks status: na
  • task: lint_c_tests_fixtures status: na
  • task: lint_shell status: na
  • task: lint_typescript_declarations status: passed
  • task: lint_typescript_tests status: na
  • task: lint_license_headers status: passed ---

Resolves None.

Description

What is the purpose of this pull request?

This pull request:

  • Adds ml/strided/dkmeans-init-plus-plus.

Related Issues

Does this pull request have any related issues?

This pull request has the following related issues:

  • None.

Questions

Any questions for reviewers of this pull request?

No.

Other

Any other information relevant to this pull request? This may include screenshots, references, and/or implementation notes.

No.

Checklist

Please ensure the following tasks are completed before submitting this pull request.

AI Assistance

When authoring the changes proposed in this PR, did you use any kind of AI assistance?

  • Yes
  • No

If you answered "yes" above, how did you use AI assistance?

  • Code generation (e.g., when writing an implementation or fixing a bug)
  • Test/benchmark generation
  • Documentation (including examples)
  • Research and understanding

Disclosure

If you answered "yes" to using AI assistance, please provide a short disclosure indicating how you used AI assistance. This helps reviewers determine how much scrutiny to apply when reviewing your contribution. Example disclosures: "This PR was written primarily by Claude Code." or "I consulted ChatGPT to understand the codebase, but the proposed changes were fully authored manually by myself.".

I used Claude Code to compare the implementation with ml/incr/kmeans/lib/init_kmeansplusplus and existing strided blas implementations, but the proposed changes were fully authored manually by myself.


@stdlib-js/reviewers

---
type: pre_commit_static_analysis_report
description: Results of running static analysis checks when committing changes.
report:
  - task: lint_filenames
    status: passed
  - task: lint_editorconfig
    status: passed
  - task: lint_markdown_pkg_readmes
    status: na
  - task: lint_markdown_docs
    status: na
  - task: lint_markdown
    status: na
  - task: lint_package_json
    status: na
  - task: lint_repl_help
    status: na
  - task: lint_javascript_src
    status: passed
  - task: lint_javascript_cli
    status: na
  - task: lint_javascript_examples
    status: na
  - task: lint_javascript_tests
    status: na
  - task: lint_javascript_benchmarks
    status: na
  - task: lint_python
    status: na
  - task: lint_r
    status: na
  - task: lint_c_src
    status: na
  - task: lint_c_examples
    status: na
  - task: lint_c_benchmarks
    status: na
  - task: lint_c_tests_fixtures
    status: na
  - task: lint_shell
    status: na
  - task: lint_typescript_declarations
    status: passed
  - task: lint_typescript_tests
    status: na
  - task: lint_license_headers
    status: passed
---
@nakul-krishnakumar nakul-krishnakumar requested a review from a team May 27, 2026 07:35
@nakul-krishnakumar nakul-krishnakumar marked this pull request as draft May 27, 2026 07:35
@stdlib-bot stdlib-bot added Needs Review A pull request which needs code review. and removed Needs Review A pull request which needs code review. labels May 27, 2026
Comment on lines +34 to +71
* Initializes centroids by performing the k-means++ initialization procedure.
*
* ## Method
*
* The k-means++ algorithm for choosing initial centroids is as follows:
*
* 1. Select a data point uniformly at random from a data set \\( X \\). This data point is first centroid and denoted \\( c_0 \\).
*
* 2. Compute the distance from each data point to \\( c_0 \\). Denote the distance between \\( c_j \\) and data point \\( m \\) as \\( d(x_m, c_j) \\).
*
* 3. Select the next centroid, \\( c_1 \\), at random from \\( X \\) with probability
*
* ```tex
* \frac{d^2(x_m, c_0)}{\sum_{j=0}^{n-1} d^2(x_j, c_0)}
* ```
*
* where \\( n \\) is the number of data points.
*
* 4. To choose centroid \\( j \\),
*
* a. Compute the distances from each data point to each centroid and assign each data point to its closest centroid.
*
* b. For \\( i = 0,\ldots,n-1 \\) and \\( p = 0,\ldots,j-2 \\), select centroid \\( j \\) at random from \\( X \\) with probability
*
* ```tex
* \frac{d^2(x_i, c_p)}{\sum_{\{h; x_h \exits C_p\}} d^2(x_h, c_p)}
* ```
*
* where \\( C_p \\) is the set of all data points closest to centroid \\( c_p \\) and \\( x_i \\) belongs to \\( c_p \\).
*
* Stated more plainly, select each subsequent centroid with a probability proportional to the distance from the centroid to the closest centroid already chosen.
*
* 5. Repeat step `4` until \\( k \\) centroids have been chosen.
*
* ## References
*
* - Arthur, David, and Sergei Vassilvitskii. 2007. "K-means++: The Advantages of Careful Seeding." In _Proceedings of the Eighteenth Annual Acm-Siam Symposium on Discrete Algorithms_, 1027–35. SODA '07. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. <http://dl.acm.org/citation.cfm?id=1283383.1283494>.
*
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure on which all files I should write this description, please guide me on this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a comment. Only the "base" implementation.

* ]);
*
* var v = dkmeansInitPlusPlus( 'row-major', k, M, N, out, 2, xbuf, 2, 'sqeuclidean', 3, 44 );
* // returns <Float64Array>[0,0,1,-1,1,1]
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope this return value makes sense. As the user already has the strides it should be easy for them to access it, also the user is expected to pass in an empty initialized out array which will be filled with the flat centroids array.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I am not sure how to interpret this return value. E.g., what does -1 correspond to?

Copy link
Copy Markdown
Member Author

@nakul-krishnakumar nakul-krishnakumar Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<Float64Array>[0,0,1,-1,1,1] means [ [ 0,0 ] ,[ 1,-1 ], [ 1,1 ] ] as the user already has LDO = 2.
-1 is just one feature value of a centroid from the centroids array.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, right. The output array contains the initial centroids. In which case, you are missing decimal points. Reading this, I kept thinking that the values referred to indices. Having the decimals would have reinforced that these are values from the list of data points.

// Create a scratch array for storing cumulative probabilities:
probs = new Float64Array( M );

// 2-5. For each data point, compute the distances to each centroid, find the closest centroid, and, based on the distance to the closest centroid, assign a probability to the data point to be chosen as centroid `c_j`...
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that all these comments in this implementation is directly referenced from ml/incr/kmeans/lib/init_kmeansplusplus

---
type: pre_commit_static_analysis_report
description: Results of running static analysis checks when committing changes.
report:
  - task: lint_filenames
    status: passed
  - task: lint_editorconfig
    status: passed
  - task: lint_markdown_pkg_readmes
    status: na
  - task: lint_markdown_docs
    status: na
  - task: lint_markdown
    status: na
  - task: lint_package_json
    status: na
  - task: lint_repl_help
    status: na
  - task: lint_javascript_src
    status: passed
  - task: lint_javascript_cli
    status: na
  - task: lint_javascript_examples
    status: na
  - task: lint_javascript_tests
    status: na
  - task: lint_javascript_benchmarks
    status: na
  - task: lint_python
    status: na
  - task: lint_r
    status: na
  - task: lint_c_src
    status: na
  - task: lint_c_examples
    status: na
  - task: lint_c_benchmarks
    status: na
  - task: lint_c_tests_fixtures
    status: na
  - task: lint_shell
    status: na
  - task: lint_typescript_declarations
    status: passed
  - task: lint_typescript_tests
    status: na
  - task: lint_license_headers
    status: passed
---
---
type: pre_commit_static_analysis_report
description: Results of running static analysis checks when committing changes.
report:
  - task: lint_filenames
    status: passed
  - task: lint_editorconfig
    status: passed
  - task: lint_markdown_pkg_readmes
    status: na
  - task: lint_markdown_docs
    status: na
  - task: lint_markdown
    status: na
  - task: lint_package_json
    status: na
  - task: lint_repl_help
    status: na
  - task: lint_javascript_src
    status: na
  - task: lint_javascript_cli
    status: na
  - task: lint_javascript_examples
    status: na
  - task: lint_javascript_tests
    status: passed
  - task: lint_javascript_benchmarks
    status: na
  - task: lint_python
    status: na
  - task: lint_r
    status: na
  - task: lint_c_src
    status: na
  - task: lint_c_examples
    status: na
  - task: lint_c_benchmarks
    status: na
  - task: lint_c_tests_fixtures
    status: na
  - task: lint_shell
    status: na
  - task: lint_typescript_declarations
    status: passed
  - task: lint_typescript_tests
    status: na
  - task: lint_license_headers
    status: passed
---
---
type: pre_commit_static_analysis_report
description: Results of running static analysis checks when committing changes.
report:
  - task: lint_filenames
    status: passed
  - task: lint_editorconfig
    status: passed
  - task: lint_markdown_pkg_readmes
    status: na
  - task: lint_markdown_docs
    status: na
  - task: lint_markdown
    status: na
  - task: lint_package_json
    status: na
  - task: lint_repl_help
    status: na
  - task: lint_javascript_src
    status: na
  - task: lint_javascript_cli
    status: na
  - task: lint_javascript_examples
    status: na
  - task: lint_javascript_tests
    status: na
  - task: lint_javascript_benchmarks
    status: passed
  - task: lint_python
    status: na
  - task: lint_r
    status: na
  - task: lint_c_src
    status: na
  - task: lint_c_examples
    status: na
  - task: lint_c_benchmarks
    status: na
  - task: lint_c_tests_fixtures
    status: na
  - task: lint_shell
    status: na
  - task: lint_typescript_declarations
    status: passed
  - task: lint_typescript_tests
    status: na
  - task: lint_license_headers
    status: passed
---
@nakul-krishnakumar
Copy link
Copy Markdown
Member Author

/stdlib merge

@stdlib-bot stdlib-bot added the bot: In Progress Pull request is currently awaiting automation. label May 27, 2026
@stdlib-bot stdlib-bot removed the bot: In Progress Pull request is currently awaiting automation. label May 27, 2026
---
type: pre_commit_static_analysis_report
description: Results of running static analysis checks when committing changes.
report:
  - task: lint_filenames
    status: passed
  - task: lint_editorconfig
    status: passed
  - task: lint_markdown_pkg_readmes
    status: na
  - task: lint_markdown_docs
    status: na
  - task: lint_markdown
    status: na
  - task: lint_package_json
    status: na
  - task: lint_repl_help
    status: na
  - task: lint_javascript_src
    status: passed
  - task: lint_javascript_cli
    status: na
  - task: lint_javascript_examples
    status: na
  - task: lint_javascript_tests
    status: na
  - task: lint_javascript_benchmarks
    status: na
  - task: lint_python
    status: na
  - task: lint_r
    status: na
  - task: lint_c_src
    status: na
  - task: lint_c_examples
    status: na
  - task: lint_c_benchmarks
    status: na
  - task: lint_c_tests_fixtures
    status: na
  - task: lint_shell
    status: na
  - task: lint_typescript_declarations
    status: passed
  - task: lint_typescript_tests
    status: na
  - task: lint_license_headers
    status: passed
---
Comment on lines +135 to +137
if ( trials < 1 ) {
throw new TypeError( format( 'invalid argument. Thirteenth argument must be a valid trials (>=1). Value: `%s`.', trials ) );
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to just return NaN?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throwing is fine.

@stdlib-bot
Copy link
Copy Markdown
Contributor

Coverage Report

Package Statements Branches Functions Lines
ml/strided/dkmeans-init-plus-plus $\color{red}519/537$
$\color{green}+96.65%$
$\color{red}38/44$
$\color{green}+86.36%$
$\color{green}2/2$
$\color{green}+100.00%$
$\color{red}519/537$
$\color{green}+96.65%$

The above coverage report was generated for the changes in this PR.

Comment on lines +138 to +140
if ( k < 1 || M < 1 || N < 1) {
return NaN;
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary?

  • As it is a low-level kernel, we can expect user to pass in valid parameters.
  • But if this check is not kept, for invalid params the API return <Float64Array>[ NaN, NaN, NaN, ..., NaN ].

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Answered above.

Comment on lines +111 to +114
expected = new Float64Array( data.expected );
for ( i = 0; i < expected.length; i++ ) {
t.strictEqual( isAlmostSameValue( out[ i ], expected[ i ], 0 ), true, 'returns expected value' );
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Planning on using @stdlib/assert/is-same-float64array here, am I good to go?
cc: @kgryte

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is fine.

*
* - Arthur, David, and Sergei Vassilvitskii. 2007. "K-means++: The Advantages of Careful Seeding." In _Proceedings of the Eighteenth Annual Acm-Siam Symposium on Discrete Algorithms_, 1027–35. SODA '07. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. <http://dl.acm.org/citation.cfm?id=1283383.1283494>.
*
* @param {string} order - storage layout
Copy link
Copy Markdown
Member

@kgryte kgryte Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your parameter order is odd and does not match conventions elsewhere in the project.

Instead,

f( order, M, N, k, trials, metric, X, LDX, out, LDO[, options] )

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where options corresponds to PRNG options, similar to what is done in https://github.com/stdlib-js/stdlib/tree/develop/lib/node_modules/%40stdlib/random/strided/gamma.

/**
* Initializes centroids by performing the k-means++ initialization procedure on double-precision floating-point data points.
*
* ## Method
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need only include this extended Method section in the base.js file. No need to copy for all interfaces.

if ( trials < 1 ) {
throw new TypeError( format( 'invalid argument. Thirteenth argument must be a valid trials (>=1). Value: `%s`.', trials ) );
}
if ( k < 1 || M < 1 || N < 1) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are mixing concerns here. Similar to other packages, such as stats/strided, I would just early return here. If M or N is less than 1, we are dealing with an "empty" array, so nothing to be done. Similarly, if k < 1, no clusters to find.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One could make the argument that we could throw if k < 1, as k < 1 is likely a user error.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, but I think we should consider the case where k > M (M : number of data points) as we cant sample k unique centroids from M data points if k > M. Either we could throw here or let the implementation sample duplicate centroids.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. We should probably throw when k > M.

randi = randint({
'seed': rand()
});
rand = rand.normalized;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not what we want. Rather than a seed, a user should have more control over the PRNG. Meaning, this should be pushed up the stack into userland. It made (somewhat) more sense in the incr/kmeans package as (a) we didn't provide direct access to an initialization API and (b) incremental computation is likely to be more a one-off (i.e., we're not interested in rerunning initialization multiple times for different numbers of clusters, as we have some idea going in what the number of clusters will be), and thus, working with PRNG state is less a concern.

This stated, the incr/kmeans package needs to be updated, as we've moved, as a general practice, toward allowing greater PRNG control in packages needing to generate random variates, so I would not consider incr/kmeans an authoritative reference for PRNG handling.

So, where does this leave us? In this implementation, we are currently using multiple PRNGs: one for selecting random data points (discrete-uniform) and another for sampling a cumulative distribution (mt19937). We should

  1. Add an options argument with PRNG options. Similar to random/strided/discrete-uniform, a prng option should be a PRNG which returns random integers (e.g., minstd, mt19937, etc).
  2. Similar to random/strided/gamma (and in turn random/base/gamma), if no PRNG options are provided, we should use our own internal PRNG, similar to the above. This PRNG should not be seeded. We don't need to seed uniform or discrete-uniform in this scenario either.
  3. If PRNG options are provided, we should pass those options down so that uniform and discrete-uniform are instances which are instantiated correctly. Use random/base/uniform with range [0,1-FLOAT64_EPS], rather than rand.normalized.

When we add the C implementation, we'll have additional questions to answer. Namely, how to pass down PRNG state into the native layer. In this scenario, if a user provides options.prng, we'll likely need to stay in JS. If a user provides a seed or state option, this becomes more doable. In C, we may need to always pass in a PRNG instance in order to avoid dynamically allocating new state all the time.

Comment on lines +169 to +174
dhash = new Float64Array( M );
for ( i = 0; i < M; i++ ) {
dhash[ i ] = PINF; // squared distance
}
// Create a scratch array for storing cumulative probabilities:
probs = new Float64Array( M );
Copy link
Copy Markdown
Member

@kgryte kgryte Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to avoid dynamic memory allocation within strided implementations. Instead, we need to support a workspace array. So the signature becomes

f( order, M, N, k, trials, metric, X, LDX, out, LDO, workspace1, strideW1, workspace2, strideW2[, options] )

where workspace1 is assumed to be a Float64Array and have 2*M elements to hold both dhash and probs and workspace2 is assumed to be an Int(32|64)Array and have k elements to hold the indices of centroid candidates. Because we need to support Int64Arrays (which will be added to stdlib shortly), we need to use accessors for getting and setting centroid candidate elements (e.g., use array/base/accessors).

Why do we want to avoid workspace allocation?

  1. We want to mirror our eventual C implementation, where we don't want to be dynamically allocating memory within the implementation, as then we need to a means to expose an error state if and when malloc fails.
  2. In our eventual top-level API, we may support operating on "stacks" of data. In this scenario, we want to allocate workspace memory once and then reuse that memory for each dataset within the stack. Otherwise, if memory is dynamically allocated, each operation on a dataset allocates fresh memory, which then needs to be garbage collected, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants