feat: add `ml/strided/dkmeans-init-plus-plus` by nakul-krishnakumar · Pull Request #12312 · stdlib-js/stdlib

nakul-krishnakumar · 2026-05-27T07:35:32Z

type: pre_commit_static_analysis_report
description: Results of running static analysis checks when committing changes. report:

task: lint_filenames status: passed
task: lint_editorconfig status: passed
task: lint_markdown_pkg_readmes status: na
task: lint_markdown_docs status: na
task: lint_markdown status: na
task: lint_package_json status: na
task: lint_repl_help status: na
task: lint_javascript_src status: passed
task: lint_javascript_cli status: na
task: lint_javascript_examples status: na
task: lint_javascript_tests status: na
task: lint_javascript_benchmarks status: na
task: lint_python status: na
task: lint_r status: na
task: lint_c_src status: na
task: lint_c_examples status: na
task: lint_c_benchmarks status: na
task: lint_c_tests_fixtures status: na
task: lint_shell status: na
task: lint_typescript_declarations status: passed
task: lint_typescript_tests status: na
task: lint_license_headers status: passed ---

Resolves None.

Description

What is the purpose of this pull request?

This pull request:

Adds ml/strided/dkmeans-init-plus-plus.

Related Issues

Does this pull request have any related issues?

This pull request has the following related issues:

None.

Questions

Any questions for reviewers of this pull request?

No.

Other

Any other information relevant to this pull request? This may include screenshots, references, and/or implementation notes.

No.

Checklist

Please ensure the following tasks are completed before submitting this pull request.

Read, understood, and followed the contributing guidelines.

AI Assistance

When authoring the changes proposed in this PR, did you use any kind of AI assistance?

Yes
No

If you answered "yes" above, how did you use AI assistance?

Code generation (e.g., when writing an implementation or fixing a bug)
Test/benchmark generation
Documentation (including examples)
Research and understanding

Disclosure

If you answered "yes" to using AI assistance, please provide a short disclosure indicating how you used AI assistance. This helps reviewers determine how much scrutiny to apply when reviewing your contribution. Example disclosures: "This PR was written primarily by Claude Code." or "I consulted ChatGPT to understand the codebase, but the proposed changes were fully authored manually by myself.".

I used Claude Code to compare the implementation with ml/incr/kmeans/lib/init_kmeansplusplus and existing strided blas implementations, but the proposed changes were fully authored manually by myself.

@stdlib-js/reviewers

--- type: pre_commit_static_analysis_report description: Results of running static analysis checks when committing changes. report: - task: lint_filenames status: passed - task: lint_editorconfig status: passed - task: lint_markdown_pkg_readmes status: na - task: lint_markdown_docs status: na - task: lint_markdown status: na - task: lint_package_json status: na - task: lint_repl_help status: na - task: lint_javascript_src status: passed - task: lint_javascript_cli status: na - task: lint_javascript_examples status: na - task: lint_javascript_tests status: na - task: lint_javascript_benchmarks status: na - task: lint_python status: na - task: lint_r status: na - task: lint_c_src status: na - task: lint_c_examples status: na - task: lint_c_benchmarks status: na - task: lint_c_tests_fixtures status: na - task: lint_shell status: na - task: lint_typescript_declarations status: passed - task: lint_typescript_tests status: na - task: lint_license_headers status: passed ---

nakul-krishnakumar · 2026-05-27T07:36:48Z

+* Initializes centroids by performing the k-means++ initialization procedure.
+*
+* ## Method
+*
+* The k-means++ algorithm for choosing initial centroids is as follows:
+*
+* 1.  Select a data point uniformly at random from a data set \\( X \\). This data point is first centroid and denoted \\( c_0 \\).
+*
+* 2.  Compute the distance from each data point to \\( c_0 \\). Denote the distance between \\( c_j \\) and data point \\( m \\) as \\( d(x_m, c_j) \\).
+*
+* 3.  Select the next centroid, \\( c_1 \\), at random from \\( X \\) with probability
+*
+*     ```tex
+*     \frac{d^2(x_m, c_0)}{\sum_{j=0}^{n-1} d^2(x_j, c_0)}
+*     ```
+*
+*     where \\( n \\) is the number of data points.
+*
+* 4.  To choose centroid \\( j \\),
+*
+*     a.   Compute the distances from each data point to each centroid and assign each data point to its closest centroid.
+*
+*     b.   For \\( i = 0,\ldots,n-1 \\) and \\( p = 0,\ldots,j-2 \\), select centroid \\( j \\) at random from \\( X \\) with probability
+*
+*          ```tex
+*          \frac{d^2(x_i, c_p)}{\sum_{\{h; x_h \exits C_p\}} d^2(x_h, c_p)}
+*          ```
+*
+*          where \\( C_p \\) is the set of all data points closest to centroid \\( c_p \\) and \\( x_i \\) belongs to \\( c_p \\).
+*
+*          Stated more plainly, select each subsequent centroid with a probability proportional to the distance from the centroid to the closest centroid already chosen.
+*
+* 5.  Repeat step `4` until \\( k \\) centroids have been chosen.
+*
+* ## References
+*
+* -   Arthur, David, and Sergei Vassilvitskii. 2007. "K-means++: The Advantages of Careful Seeding." In _Proceedings of the Eighteenth Annual Acm-Siam Symposium on Discrete Algorithms_, 1027–35. SODA '07. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. <http://dl.acm.org/citation.cfm?id=1283383.1283494>.
+*


I am not sure on which all files I should write this description, please guide me on this.

Left a comment. Only the "base" implementation.

nakul-krishnakumar · 2026-05-27T07:38:15Z

+* ]);
+*
+* var v = dkmeansInitPlusPlus( 'row-major', k, M, N, out, 2, xbuf, 2, 'sqeuclidean', 3, 44 );
+* // returns <Float64Array>[0,0,1,-1,1,1]


I hope this return value makes sense. As the user already has the strides it should be easy for them to access it, also the user is expected to pass in an empty initialized out array which will be filled with the flat centroids array.

Yeah, I am not sure how to interpret this return value. E.g., what does -1 correspond to?

<Float64Array>[0,0,1,-1,1,1] means [ [ 0,0 ] ,[ 1,-1 ], [ 1,1 ] ] as the user already has LDO = 2.
-1 is just one feature value of a centroid from the centroids array.

Right, right. The output array contains the initial centroids. In which case, you are missing decimal points. Reading this, I kept thinking that the values referred to indices. Having the decimals would have reinforced that these are values from the list of data points.

nakul-krishnakumar · 2026-05-27T07:40:33Z

+	// Create a scratch array for storing cumulative probabilities:
+	probs = new Float64Array( M );
+
+	// 2-5. For each data point, compute the distances to each centroid, find the closest centroid, and, based on the distance to the closest centroid, assign a probability to the data point to be chosen as centroid `c_j`...


Please note that all these comments in this implementation is directly referenced from ml/incr/kmeans/lib/init_kmeansplusplus

--- type: pre_commit_static_analysis_report description: Results of running static analysis checks when committing changes. report: - task: lint_filenames status: passed - task: lint_editorconfig status: passed - task: lint_markdown_pkg_readmes status: na - task: lint_markdown_docs status: na - task: lint_markdown status: na - task: lint_package_json status: na - task: lint_repl_help status: na - task: lint_javascript_src status: passed - task: lint_javascript_cli status: na - task: lint_javascript_examples status: na - task: lint_javascript_tests status: na - task: lint_javascript_benchmarks status: na - task: lint_python status: na - task: lint_r status: na - task: lint_c_src status: na - task: lint_c_examples status: na - task: lint_c_benchmarks status: na - task: lint_c_tests_fixtures status: na - task: lint_shell status: na - task: lint_typescript_declarations status: passed - task: lint_typescript_tests status: na - task: lint_license_headers status: passed ---

--- type: pre_commit_static_analysis_report description: Results of running static analysis checks when committing changes. report: - task: lint_filenames status: passed - task: lint_editorconfig status: passed - task: lint_markdown_pkg_readmes status: na - task: lint_markdown_docs status: na - task: lint_markdown status: na - task: lint_package_json status: na - task: lint_repl_help status: na - task: lint_javascript_src status: na - task: lint_javascript_cli status: na - task: lint_javascript_examples status: na - task: lint_javascript_tests status: passed - task: lint_javascript_benchmarks status: na - task: lint_python status: na - task: lint_r status: na - task: lint_c_src status: na - task: lint_c_examples status: na - task: lint_c_benchmarks status: na - task: lint_c_tests_fixtures status: na - task: lint_shell status: na - task: lint_typescript_declarations status: passed - task: lint_typescript_tests status: na - task: lint_license_headers status: passed ---

--- type: pre_commit_static_analysis_report description: Results of running static analysis checks when committing changes. report: - task: lint_filenames status: passed - task: lint_editorconfig status: passed - task: lint_markdown_pkg_readmes status: na - task: lint_markdown_docs status: na - task: lint_markdown status: na - task: lint_package_json status: na - task: lint_repl_help status: na - task: lint_javascript_src status: na - task: lint_javascript_cli status: na - task: lint_javascript_examples status: na - task: lint_javascript_tests status: na - task: lint_javascript_benchmarks status: passed - task: lint_python status: na - task: lint_r status: na - task: lint_c_src status: na - task: lint_c_examples status: na - task: lint_c_benchmarks status: na - task: lint_c_tests_fixtures status: na - task: lint_shell status: na - task: lint_typescript_declarations status: passed - task: lint_typescript_tests status: na - task: lint_license_headers status: passed ---

nakul-krishnakumar · 2026-05-27T17:11:20Z

/stdlib merge

…nspp

--- type: pre_commit_static_analysis_report description: Results of running static analysis checks when committing changes. report: - task: lint_filenames status: passed - task: lint_editorconfig status: passed - task: lint_markdown_pkg_readmes status: na - task: lint_markdown_docs status: na - task: lint_markdown status: na - task: lint_package_json status: na - task: lint_repl_help status: na - task: lint_javascript_src status: passed - task: lint_javascript_cli status: na - task: lint_javascript_examples status: na - task: lint_javascript_tests status: na - task: lint_javascript_benchmarks status: na - task: lint_python status: na - task: lint_r status: na - task: lint_c_src status: na - task: lint_c_examples status: na - task: lint_c_benchmarks status: na - task: lint_c_tests_fixtures status: na - task: lint_shell status: na - task: lint_typescript_declarations status: passed - task: lint_typescript_tests status: na - task: lint_license_headers status: passed ---

nakul-krishnakumar · 2026-05-27T17:45:56Z

+	if ( trials < 1 ) {
+		throw new TypeError( format( 'invalid argument. Thirteenth argument must be a valid trials (>=1). Value: `%s`.', trials ) );
+	}


Would it be better to just return NaN?

Throwing is fine.

stdlib-bot · 2026-05-27T17:46:28Z

Coverage Report

Package	Statements	Branches	Functions	Lines
ml/strided/dkmeans-init-plus-plus	$\color{red}519/537$ $\color{green}+96.65%$	$\color{red}38/44$ $\color{green}+86.36%$	$\color{green}2/2$ $\color{green}+100.00%$	$\color{red}519/537$ $\color{green}+96.65%$

The above coverage report was generated for the changes in this PR.

nakul-krishnakumar · 2026-05-27T17:48:07Z

+	if ( k < 1 || M < 1 || N < 1) {
+		return NaN;
+	}


Is this necessary?

As it is a low-level kernel, we can expect user to pass in valid parameters.

But if this check is not kept, for invalid params the API return <Float64Array>[ NaN, NaN, NaN, ..., NaN ].

Answered above.

nakul-krishnakumar · 2026-05-30T13:32:38Z

+	expected = new Float64Array( data.expected );
+	for ( i = 0; i < expected.length; i++ ) {
+		t.strictEqual( isAlmostSameValue( out[ i ], expected[ i ], 0 ), true, 'returns expected value' );
+	}


Planning on using @stdlib/assert/is-same-float64array here, am I good to go?
cc: @kgryte

Yes, that is fine.

kgryte · 2026-06-01T00:28:56Z

+*
+* -   Arthur, David, and Sergei Vassilvitskii. 2007. "K-means++: The Advantages of Careful Seeding." In _Proceedings of the Eighteenth Annual Acm-Siam Symposium on Discrete Algorithms_, 1027–35. SODA '07. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. <http://dl.acm.org/citation.cfm?id=1283383.1283494>.
+*
+* @param {string} order - storage layout


Your parameter order is odd and does not match conventions elsewhere in the project.

Instead,

f( order, M, N, k, trials, metric, X, LDX, out, LDO[, options] )

where options corresponds to PRNG options, similar to what is done in https://github.com/stdlib-js/stdlib/tree/develop/lib/node_modules/%40stdlib/random/strided/gamma.

kgryte · 2026-06-01T00:30:23Z

+/**
+* Initializes centroids by performing the k-means++ initialization procedure on double-precision floating-point data points.
+*
+* ## Method


You need only include this extended Method section in the base.js file. No need to copy for all interfaces.

kgryte · 2026-06-01T01:01:05Z

+	if ( trials < 1 ) {
+		throw new TypeError( format( 'invalid argument. Thirteenth argument must be a valid trials (>=1). Value: `%s`.', trials ) );
+	}
+	if ( k < 1 || M < 1 || N < 1) {


I think you are mixing concerns here. Similar to other packages, such as stats/strided, I would just early return here. If M or N is less than 1, we are dealing with an "empty" array, so nothing to be done. Similarly, if k < 1, no clusters to find.

One could make the argument that we could throw if k < 1, as k < 1 is likely a user error.

Okay, but I think we should consider the case where k > M (M : number of data points) as we cant sample k unique centroids from M data points if k > M. Either we could throw here or let the implementation sample duplicate centroids.

True. We should probably throw when k > M.

kgryte · 2026-06-01T01:22:23Z

+	randi = randint({
+		'seed': rand()
+	});
+	rand = rand.normalized;


This is not what we want. Rather than a seed, a user should have more control over the PRNG. Meaning, this should be pushed up the stack into userland. It made (somewhat) more sense in the incr/kmeans package as (a) we didn't provide direct access to an initialization API and (b) incremental computation is likely to be more a one-off (i.e., we're not interested in rerunning initialization multiple times for different numbers of clusters, as we have some idea going in what the number of clusters will be), and thus, working with PRNG state is less a concern.

This stated, the incr/kmeans package needs to be updated, as we've moved, as a general practice, toward allowing greater PRNG control in packages needing to generate random variates, so I would not consider incr/kmeans an authoritative reference for PRNG handling.

So, where does this leave us? In this implementation, we are currently using multiple PRNGs: one for selecting random data points (discrete-uniform) and another for sampling a cumulative distribution (mt19937). We should

Add an options argument with PRNG options. Similar to random/strided/discrete-uniform, a prng option should be a PRNG which returns random integers (e.g., minstd, mt19937, etc).

Similar to random/strided/gamma (and in turn random/base/gamma), if no PRNG options are provided, we should use our own internal PRNG, similar to the above. This PRNG should not be seeded. We don't need to seed uniform or discrete-uniform in this scenario either.

If PRNG options are provided, we should pass those options down so that uniform and discrete-uniform are instances which are instantiated correctly. Use random/base/uniform with range [0,1-FLOAT64_EPS], rather than rand.normalized.

When we add the C implementation, we'll have additional questions to answer. Namely, how to pass down PRNG state into the native layer. In this scenario, if a user provides options.prng, we'll likely need to stay in JS. If a user provides a seed or state option, this becomes more doable. In C, we may need to always pass in a PRNG instance in order to avoid dynamically allocating new state all the time.

kgryte · 2026-06-01T01:37:18Z

+	dhash = new Float64Array( M );
+	for ( i = 0; i < M; i++ ) {
+		dhash[ i ] = PINF; // squared distance
+	}
+	// Create a scratch array for storing cumulative probabilities:
+	probs = new Float64Array( M );


We need to avoid dynamic memory allocation within strided implementations. Instead, we need to support a workspace array. So the signature becomes

f( order, M, N, k, trials, metric, X, LDX, out, LDO, workspace1, strideW1, workspace2, strideW2[, options] )

where workspace1 is assumed to be a Float64Array and have 2*M elements to hold both dhash and probs and workspace2 is assumed to be an Int(32|64)Array and have k elements to hold the indices of centroid candidates. Because we need to support Int64Arrays (which will be added to stdlib shortly), we need to use accessors for getting and setting centroid candidate elements (e.g., use array/base/accessors).

Why do we want to avoid workspace allocation?

We want to mirror our eventual C implementation, where we don't want to be dynamically allocating memory within the implementation, as then we need to a means to expose an error state if and when malloc fails.

In our eventual top-level API, we may support operating on "stacks" of data. In this scenario, we want to allocate workspace memory once and then reuse that memory for each dataset within the stack. Otherwise, if memory is dynamically allocated, each operation on a dataset allocates fresh memory, which then needs to be garbage collected, etc.

nakul-krishnakumar requested a review from a team May 27, 2026 07:35

nakul-krishnakumar marked this pull request as draft May 27, 2026 07:35

stdlib-bot added Needs Review A pull request which needs code review. and removed Needs Review A pull request which needs code review. labels May 27, 2026

nakul-krishnakumar commented May 27, 2026

View reviewed changes

nakul-krishnakumar added 3 commits May 27, 2026 13:15

nakul-krishnakumar mentioned this pull request May 27, 2026

Weekly Standup (2026-05-25) stdlib-js/meetings#127

Open

stdlib-bot added the bot: In Progress Pull request is currently awaiting automation. label May 27, 2026

Merge remote-tracking branch 'upstream/develop' into ml-strided-dkmea…

b75757a

…nspp

stdlib-bot removed the bot: In Progress Pull request is currently awaiting automation. label May 27, 2026

nakul-krishnakumar commented May 27, 2026

View reviewed changes

nakul-krishnakumar commented May 30, 2026

View reviewed changes

kgryte reviewed Jun 1, 2026

View reviewed changes

Uh oh!

Conversation

nakul-krishnakumar commented May 27, 2026

Description

Related Issues

Questions

Other

Checklist

AI Assistance

Disclosure

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nakul-krishnakumar Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nakul-krishnakumar commented May 27, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stdlib-bot commented May 27, 2026

Coverage Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kgryte Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kgryte Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nakul-krishnakumar Jun 1, 2026 •

edited

Loading

kgryte Jun 1, 2026 •

edited

Loading

kgryte Jun 1, 2026 •

edited

Loading