Commit c90e4d1

Merge pull request #36 from Arkoniak/hamerly_algorithm
added Hamerly docstring
2 parents 72d8fac + 654ac4d commit c90e4d1

1 file changed: +31 -17 lines changed


src/hamerly.jl

Lines changed: 31 additions & 17 deletions
@@ -1,5 +1,19 @@
 """
-TODO: Hamerly description
+    Hamerly()
+
+Hamerly algorithm implementation, based on "Hamerly, Greg. (2010). Making k-means Even Faster.
+Proceedings of the 2010 SIAM International Conference on Data Mining. 130-140. 10.1137/1.9781611972801.12."
+
+This algorithm provides much faster convergence than Lloyd algorithm with relatively small increase in
+memory footprint. It is especially suitable for low to medium dimensional input data.
+
+It can be used directly in `kmeans` function
+
+```julia
+X = rand(30, 100_000)   # 100_000 random points in 30 dimensions
+
+kmeans(Hamerly(), X, 3) # 3 clusters, Hamerly algorithm
+```
 """
 struct Hamerly <: AbstractKMeansAlg end

@@ -261,29 +275,29 @@ end
     chunk_update_bounds!(containers, r1, r2, pr1, pr2, r, idx)

 Updates upper and lower bounds of point distance to the centers, with regard to the centers movement.
-Since bounds are squared distances, `sqrt` is used to make corresponding estimation, unlike
-the original paper, where usual metric is used.
-
-Using notation from original paper, `u` is upper bound and `a` is `labels`, so
-
-`u[i] -> u[i] + p[a[i]]`
-
-then squared distance is
-
-`u[i]^2 -> (u[i] + p[a[i]])^2 = u[i]^2 + 2 p[a[i]] u[i] + p[a[i]]^2`
-
-Taking into account that in our notation `p^2 -> p`, `u^2 -> ub` we obtain
-
-`ub[i] -> ub[i] + 2 sqrt(p[a[i]] ub[i]) + p[a[i]]`
-
-The same applies to the lower bounds.
 """
 function chunk_update_bounds!(containers, r1, r2, pr1, pr2, r, idx)
     p = containers.p
     ub = containers.ub
     lb = containers.lb
     labels = containers.labels

+    # Since bounds are squared distances, `sqrt` is used to make corresponding estimation, unlike
+    # the original paper, where usual metric is used.
+    #
+    # Using notation from original paper, `u` is upper bound and `a` is `labels`, so
+    #
+    # `u[i] -> u[i] + p[a[i]]`
+    #
+    # then squared distance is
+    #
+    # `u[i]^2 -> (u[i] + p[a[i]])^2 = u[i]^2 + 2 p[a[i]] u[i] + p[a[i]]^2`
+    #
+    # Taking into account that in our notation `p^2 -> p`, `u^2 -> ub` we obtain
+    #
+    # `ub[i] -> ub[i] + 2 sqrt(p[a[i]] ub[i]) + p[a[i]]`
+    #
+    # The same applies to the lower bounds.
     @inbounds for i in r
         label = labels[i]
         ub[i] += 2*sqrt(abs(ub[i] * p[label])) + p[label]
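
The squared-distance update on the last line of the hunk can be checked numerically. The following is a minimal sketch, not code from this commit: `update_upper_sq` and the values of `u` and `d` are hypothetical names and numbers chosen for illustration. It assumes, as the moved comment states, that both the upper bound and the center movement are stored as squared distances.

```julia
# Hypothetical helper mirroring the update expression in the hunk:
# `ub` is a squared upper bound, `p_sq` is the squared distance
# that the assigned center has moved.
update_upper_sq(ub, p_sq) = ub + 2 * sqrt(abs(ub * p_sq)) + p_sq

u = 3.0   # upper bound in the ordinary (non-squared) metric, illustrative value
d = 0.5   # center movement in the ordinary metric, illustrative value

# Apply the squared form to the squared inputs.
ub_new = update_upper_sq(u^2, d^2)

# Squaring the paper's update `u -> u + d` should give the same number.
@assert ub_new ≈ (u + d)^2
```

Storing squared bounds avoids taking a square root for every point-to-center distance, and the update is the one place where a single `sqrt` per point is still needed.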
