|
1 | 1 | """
|
2 |
| - TODO: Hamerly description |
| 2 | + Hamerly() |
| 3 | +
|
| 4 | +Hamerly algorithm implementation, based on "Hamerly, Greg. (2010). Making k-means Even Faster. |
| 5 | + Proceedings of the 2010 SIAM International Conference on Data Mining. 130-140. 10.1137/1.9781611972801.12." |
| 6 | +
|
| 7 | +This algorithm provides much faster convergence than the Lloyd algorithm with a relatively small increase in |
| 8 | +memory footprint. It is especially suitable for low to medium dimensional input data. |
| 9 | +
|
| 10 | +It can be used directly in the `kmeans` function: |
| 11 | +
|
| 12 | +```julia |
| 13 | +X = rand(30, 100_000) # 100_000 random points in 30 dimensions |
| 14 | +
|
| 15 | +kmeans(Hamerly(), X, 3) # 3 clusters, Hamerly algorithm |
| 16 | +``` |
3 | 17 | """
|
4 | 18 | struct Hamerly <: AbstractKMeansAlg end
|
5 | 19 |
|
@@ -261,29 +275,29 @@ end
|
261 | 275 | chunk_update_bounds!(containers, r1, r2, pr1, pr2, r, idx)
|
262 | 276 |
|
263 | 277 | Updates upper and lower bounds of point distance to the centers, with regard to the centers movement.
|
264 |
| -Since bounds are squred distance, `sqrt` is used to make corresponding estimation, unlike |
265 |
| -the original paper, where usual metric is used. |
266 |
| -
|
267 |
| -Using notation from original paper, `u` is upper bound and `a` is `labels`, so |
268 |
| -
|
269 |
| -`u[i] -> u[i] + p[a[i]]` |
270 |
| -
|
271 |
| -then squared distance is |
272 |
| -
|
273 |
| -`u[i]^2 -> (u[i] + p[a[i]])^2 = u[i]^2 + 2 p[a[i]] u[i] + p[a[i]]^2` |
274 |
| -
|
275 |
| -Taking into account that in our noations `p^2 -> p`, `u^2 -> ub` we obtain |
276 |
| -
|
277 |
| -`ub[i] -> ub[i] + 2 sqrt(p[a[i]] ub[i]) + p[a[i]]` |
278 |
| -
|
279 |
| -The same applies to the lower bounds. |
280 | 278 | """
|
281 | 279 | function chunk_update_bounds!(containers, r1, r2, pr1, pr2, r, idx)
|
282 | 280 | p = containers.p
|
283 | 281 | ub = containers.ub
|
284 | 282 | lb = containers.lb
|
285 | 283 | labels = containers.labels
|
286 | 284 |
|
| 285 | + # Since the bounds are squared distances, `sqrt` is used to make the corresponding estimate, unlike |
| 286 | + # the original paper, where the usual metric is used. |
| 287 | + # |
| 288 | + # Using notation from original paper, `u` is upper bound and `a` is `labels`, so |
| 289 | + # |
| 290 | + # `u[i] -> u[i] + p[a[i]]` |
| 291 | + # |
| 292 | + # then squared distance is |
| 293 | + # |
| 294 | + # `u[i]^2 -> (u[i] + p[a[i]])^2 = u[i]^2 + 2 p[a[i]] u[i] + p[a[i]]^2` |
| 295 | + # |
| 296 | + # Taking into account that in our notation `p^2 -> p`, `u^2 -> ub` we obtain |
| 297 | + # |
| 298 | + # `ub[i] -> ub[i] + 2 sqrt(p[a[i]] ub[i]) + p[a[i]]` |
| 299 | + # |
| 300 | + # The same applies to the lower bounds. |
287 | 301 | @inbounds for i in r
|
288 | 302 | label = labels[i]
|
289 | 303 | ub[i] += 2*sqrt(abs(ub[i] * p[label])) + p[label]
|
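The comment block moved into `chunk_update_bounds!` derives the squared-distance form of the upper bound update. The minimal sketch below checks that derivation numerically. The helper names `update_upper` and `update_upper_sq` are hypothetical and are not part of the package; only the formula itself is taken from the comment.

```julia
# Hypothetical helpers for illustration only; not part of the package.

# Paper notation: plain (non-squared) upper bound u and center movement δ.
update_upper(u, δ) = u + δ

# Squared form from the comment, with ub = u^2 and p = δ^2:
# ub -> ub + 2 sqrt(p * ub) + p
update_upper_sq(ub, p) = ub + 2 * sqrt(ub * p) + p

u, δ = 3.0, 0.5        # plain upper bound and center movement
ub, p = u^2, δ^2       # the same quantities stored as squares

# Both forms describe the same bound: (u + δ)^2 == ub + 2 sqrt(p * ub) + p
@assert update_upper_sq(ub, p) ≈ update_upper(u, δ)^2
```

Storing squared bounds avoids taking a square root of every point-to-center distance, at the cost of one `sqrt` per bound update.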