Commit a54432a

small bug fixes & TODO requests
1 parent cf3bd18 commit a54432a

3 files changed: +72 -173 lines changed

docs/src/index.md

Lines changed: 6 additions & 4 deletions
@@ -52,7 +52,7 @@ git checkout experimental
 
 
 ## Pending Features
-- [ ] Implementation of Hamerly implementation.
+- [X] Implementation of [Hamerly implementation](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster).
 - [ ] Full Implementation of Triangle inequality based on [Elkan C. (2003) "Using the Triangle Inequality to Accelerate
 K-Means"](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf).
 - [ ] Implementation of [Geometric methods to accelerate k-means algorithm](http://cs.baylor.edu/~hamerly/papers/sdm2016_rysavy_hamerly.pdf).
@@ -81,10 +81,10 @@ results = kmeans(X, 3; n_threads=1, max_iters=300)
 The main design goal is to offer all available variations of the KMeans algorithm to end users as composable elements. By default, Lloyd's implementation is used but users can specify different variations of the KMeans clustering algorithm via this interface
 
 ```julia
-some_results = kmeans([algo], data_matrix, k; kwargs)
+some_results = kmeans([algo], input_matrix, k; kwargs)
 
 # example
-r = kmeans(Lloyd(), X, 4) # same result as the default
+r = kmeans(Lloyd(), X, 3) # same result as the default
 ```
 
 ### Supported KMeans algorithm variations.
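
Note on the hunk above: the `kmeans([algo], input_matrix, k; kwargs)` convention, with Lloyd as the default when `algo` is omitted, is the kind of interface Julia's multiple dispatch makes straightforward. The following is only a minimal sketch of that pattern, not the package's code. All names in it (`AbstractKMeansAlg`, `LloydSketch`, `HamerlySketch`, `run_kmeans`) are hypothetical, and the methods return a placeholder instead of clustering anything.

```julia
# Hypothetical types and function names; they only mirror the calling convention.
abstract type AbstractKMeansAlg end
struct LloydSketch <: AbstractKMeansAlg end
struct HamerlySketch <: AbstractKMeansAlg end

# One method per variation, selected by the type of the first argument.
# The bodies are placeholders: they report which method ran, nothing more.
run_kmeans(::LloydSketch, X::AbstractMatrix, k::Integer; kwargs...) =
    (alg = :lloyd, k = k, n = size(X, 2))
run_kmeans(::HamerlySketch, X::AbstractMatrix, k::Integer; kwargs...) =
    (alg = :hamerly, k = k, n = size(X, 2))

# Omitting the algorithm forwards to the Lloyd method, so both call styles agree.
run_kmeans(X::AbstractMatrix, k::Integer; kwargs...) =
    run_kmeans(LloydSketch(), X, k; kwargs...)

X = rand(2, 100)                           # 2 features, 100 observations (columns)
r_default = run_kmeans(X, 3)               # no algorithm given
r_lloyd = run_kmeans(LloydSketch(), X, 3)  # explicit algorithm
r_hamerly = run_kmeans(HamerlySketch(), X, 3)
```

Under this pattern, adding a new variation would only require a new type and one new method; existing call sites stay unchanged.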
@@ -143,6 +143,8 @@ Currently, this package is benchmarked against similar implementation in both Py
 Currently, the benchmark speed tests are based on the search for optimal number of clusters using the [Elbow Method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)) since this is a practical use case for most practioners employing the K-Means algorithm.
 
 
+<!-- Insert Benchmark Plot Right Below -->
+
 
 | Package | Language | Input Data | Execution Time |
 |:-----------------:|:--------:|:---------------------------------:|:--------------:|
@@ -161,7 +163,7 @@ Ultimately, we see this package as potentially the one stop shop for everything
 
 Detailed contribution guidelines will be added in upcoming releases.
 
-<!--- Insert Contribution Guidelines --->
+<!--- Insert Contribution Guidelines Below --->
 
 ```@index
 ```
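
Note on the Elbow Method benchmark mentioned in `docs/src/index.md`: the benchmarked expression in the notebook is a sweep of the form `[kmeans(X, i).totalcost for i = 2:10]`, which collects one total cost per candidate number of clusters. The sketch below reproduces that sweep with a small stand-in Lloyd iteration that uses only the standard library. The function `sketch_totalcost`, its keyword defaults, and the column-per-observation layout are assumptions made for illustration; this is not the ParallelKMeans or Clustering.jl implementation.

```julia
using Random

# Stand-in Lloyd iteration returning the total within-cluster squared distance.
# Assumed layout: X is d×n with one observation per column.
function sketch_totalcost(X::AbstractMatrix{<:Real}, k::Integer;
                          max_iters::Integer = 300, tol::Real = 1e-6,
                          rng::AbstractRNG = MersenneTwister(42))
    d, n = size(X)
    centers = Float64.(X[:, randperm(rng, n)[1:k]])  # k distinct columns as seeds
    labels = ones(Int, n)
    cost, prev_cost = Inf, Inf
    for _ in 1:max_iters
        # Assignment step: nearest center per point, accumulating the cost.
        cost = 0.0
        for j in 1:n
            best_d = Inf
            for c in 1:k
                dist = 0.0
                for i in 1:d
                    dist += (X[i, j] - centers[i, c])^2
                end
                if dist < best_d
                    best_d = dist
                    labels[j] = c
                end
            end
            cost += best_d
        end
        # Stop once the cost no longer changes by more than `tol`.
        abs(prev_cost - cost) < tol && break
        prev_cost = cost
        # Update step: move each non-empty center to the mean of its points.
        sums = zeros(d, k)
        counts = zeros(Int, k)
        for j in 1:n
            counts[labels[j]] += 1
            for i in 1:d
                sums[i, labels[j]] += X[i, j]
            end
        end
        for c in 1:k, i in 1:d
            counts[c] > 0 && (centers[i, c] = sums[i, c] / counts[c])
        end
    end
    return cost
end

# Elbow sweep: one total cost per candidate k, same shape as the benchmarked expression.
X = rand(MersenneTwister(1), 2, 500)
costs = [sketch_totalcost(X, k) for k in 2:10]
```

Plotting `costs` against `2:10` and looking for the bend in the curve is the usual way to read the result; the benchmark times the whole sweep because that is the workload a practitioner would actually run.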

extras/ClusteringJL & ParallelKMeans Benchmarks Final.ipynb

Lines changed: 45 additions & 167 deletions
@@ -31,7 +31,16 @@
 "cell_type": "code",
 "execution_count": 2,
 "metadata": {},
-"outputs": [],
+"outputs": [
+{
+"name": "stderr",
+"output_type": "stream",
+"text": [
+"┌ Info: Precompiling ParallelKMeans [42b8e9d4-006b-409a-8472-7f34b3fb58af]\n",
+"└ @ Base loading.jl:1260\n"
+]
+}
+],
 "source": [
 "# Load Packages\n",
 "using Clustering\n",
@@ -99,164 +108,38 @@
 },
 {
 "cell_type": "code",
-"execution_count": 7,
+"execution_count": null,
 "metadata": {},
-"outputs": [
-{
-"data": {
-"text/plain": [
-"BenchmarkTools.Trial: \n",
-" memory estimate: 29.06 GiB\n",
-" allocs estimate: 27820\n",
-" --------------\n",
-" minimum time: 486.203 s (0.57% GC)\n",
-" median time: 620.239 s (0.53% GC)\n",
-" mean time: 604.342 s (0.55% GC)\n",
-" maximum time: 681.707 s (0.53% GC)\n",
-" --------------\n",
-" samples: 7\n",
-" evals/sample: 1"
-]
-},
-"execution_count": 7,
-"metadata": {},
-"output_type": "execute_result"
-}
-],
+"outputs": [],
 "source": [
-"@benchmark [Clustering.kmeans(X_1m, i; tol=1e-6, maxiter=1000).totalcost for i = 2:10] samples=7 seconds=6000"
+"@btime [Clustering.kmeans(X_1m, i; tol=1e-6, maxiter=1000).totalcost for i = 2:10] "
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 8,
+"execution_count": null,
 "metadata": {},
-"outputs": [
-{
-"data": {
-"text/plain": [
-"BenchmarkTools.Trial: \n",
-" memory estimate: 2.39 GiB\n",
-" allocs estimate: 22563\n",
-" --------------\n",
-" minimum time: 38.106 s (0.58% GC)\n",
-" median time: 42.316 s (0.55% GC)\n",
-" mean time: 42.721 s (0.54% GC)\n",
-" maximum time: 48.713 s (0.48% GC)\n",
-" --------------\n",
-" samples: 7\n",
-" evals/sample: 1"
-]
-},
-"execution_count": 8,
-"metadata": {},
-"output_type": "execute_result"
-}
-],
+"outputs": [],
 "source": [
-"@benchmark [Clustering.kmeans(X_100k, i; tol=1e-6, maxiter=1000).totalcost for i = 2:10] samples=7 seconds=3000"
+"@btime [Clustering.kmeans(X_100k, i; tol=1e-6, maxiter=1000).totalcost for i = 2:10] "
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 9,
+"execution_count": null,
 "metadata": {},
-"outputs": [
-{
-"ename": "InterruptException",
-"evalue": "InterruptException:",
-"output_type": "error",
-"traceback": [
-"InterruptException:",
-"",
-"Stacktrace:",
-" [1] Array at ./boot.jl:407 [inlined]",
-" [2] Array at ./boot.jl:415 [inlined]",
-" [3] similar at ./array.jl:361 [inlined]",
-" [4] similar at ./abstractarray.jl:634 [inlined]",
-" [5] reducedim_initarray at ./reducedim.jl:92 [inlined]",
-" [6] reducedim_initarray at ./reducedim.jl:93 [inlined]",
-" [7] reducedim_init at ./reducedim.jl:172 [inlined]",
-" [8] _mapreduce_dim at ./reducedim.jl:317 [inlined]",
-" [9] #mapreduce#580 at ./reducedim.jl:307 [inlined]",
-" [10] _sum at ./reducedim.jl:679 [inlined]",
-" [11] #sum#584 at ./reducedim.jl:653 [inlined]",
-" [12] _pairwise!(::Array{Float64,2}, ::Distances.SqEuclidean, ::Array{Float64,2}, ::Array{Float64,2}) at /Users/mysterio/.julia/packages/Distances/jwhuc/src/metrics.jl:563",
-" [13] pairwise!(::Array{Float64,2}, ::Distances.SqEuclidean, ::Array{Float64,2}, ::Array{Float64,2}; dims::Int64) at /Users/mysterio/.julia/packages/Distances/jwhuc/src/generic.jl:166",
-" [14] _kmeans!(::Array{Float64,2}, ::Nothing, ::Array{Float64,2}, ::Int64, ::Float64, ::Int64, ::Distances.SqEuclidean) at /Users/mysterio/.julia/packages/Clustering/uj53P/src/kmeans.jl:169",
-" [15] kmeans!(::Array{Float64,2}, ::Array{Float64,2}; weights::Nothing, maxiter::Int64, tol::Float64, display::Symbol, distance::Distances.SqEuclidean) at /Users/mysterio/.julia/packages/Clustering/uj53P/src/kmeans.jl:70",
-" [16] kmeans(::Array{Float64,2}, ::Int64; weights::Nothing, init::Symbol, maxiter::Int64, tol::Float64, display::Symbol, distance::Distances.SqEuclidean) at /Users/mysterio/.julia/packages/Clustering/uj53P/src/kmeans.jl:112",
-" [17] (::var\"#9#11\")(::Int64) at ./none:0",
-" [18] iterate at ./generator.jl:47 [inlined]",
-" [19] collect_to!(::Array{Float64,1}, ::Base.Generator{UnitRange{Int64},var\"#9#11\"}, ::Int64, ::Int64) at ./array.jl:710",
-" [20] collect_to_with_first!(::Array{Float64,1}, ::Float64, ::Base.Generator{UnitRange{Int64},var\"#9#11\"}, ::Int64) at ./array.jl:689",
-" [21] collect(::Base.Generator{UnitRange{Int64},var\"#9#11\"}) at ./array.jl:670",
-" [22] ##core#264() at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:371",
-" [23] ##sample#265(::BenchmarkTools.Parameters) at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:379",
-" [24] sample at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:394 [inlined]",
-" [25] _lineartrial(::BenchmarkTools.Benchmark{Symbol(\"##benchmark#263\")}, ::BenchmarkTools.Parameters; maxevals::Int64, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:133",
-" [26] _lineartrial(::BenchmarkTools.Benchmark{Symbol(\"##benchmark#263\")}, ::BenchmarkTools.Parameters) at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:125",
-" [27] #invokelatest#1 at ./essentials.jl:712 [inlined]",
-" [28] invokelatest at ./essentials.jl:711 [inlined]",
-" [29] #lineartrial#38 at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:33 [inlined]",
-" [30] lineartrial at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:33 [inlined]",
-" [31] tune!(::BenchmarkTools.Benchmark{Symbol(\"##benchmark#263\")}, ::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, verbose::Bool, pad::String, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:209",
-" [32] tune! at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:208 [inlined] (repeats 2 times)",
-" [33] top-level scope at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:288",
-" [34] top-level scope at In[9]:1"
-]
-}
-],
+"outputs": [],
 "source": [
-"@benchmark [Clustering.kmeans(X_10k, i; tol=1e-6, maxiter=1000).totalcost for i = 2:10] samples=7 seconds=1200"
+"@btime [Clustering.kmeans(X_10k, i; tol=1e-6, maxiter=1000).totalcost for i = 2:10] "
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 10,
+"execution_count": null,
 "metadata": {},
-"outputs": [
-{
-"ename": "InterruptException",
-"evalue": "InterruptException:",
-"output_type": "error",
-"traceback": [
-"InterruptException:",
-"",
-"Stacktrace:",
-" [1] Weights at /Users/mysterio/.julia/packages/StatsBase/Q9jSr/src/weights.jl:13 [inlined] (repeats 2 times)",
-" [2] Weights at /Users/mysterio/.julia/packages/StatsBase/Q9jSr/src/weights.jl:16 [inlined]",
-" [3] weights at /Users/mysterio/.julia/packages/StatsBase/Q9jSr/src/weights.jl:76 [inlined]",
-" [4] wsample(::Random._GLOBAL_RNG, ::UnitRange{Int64}, ::Array{Float64,1}) at /Users/mysterio/.julia/packages/StatsBase/Q9jSr/src/sampling.jl:829",
-" [5] wsample at /Users/mysterio/.julia/packages/StatsBase/Q9jSr/src/sampling.jl:830 [inlined]",
-" [6] initseeds!(::Array{Int64,1}, ::KmppAlg, ::Array{Float64,2}, ::Distances.SqEuclidean) at /Users/mysterio/.julia/packages/Clustering/uj53P/src/seeding.jl:176",
-" [7] initseeds! at /Users/mysterio/.julia/packages/Clustering/uj53P/src/seeding.jl:161 [inlined]",
-" [8] initseeds at /Users/mysterio/.julia/packages/Clustering/uj53P/src/seeding.jl:42 [inlined]",
-" [9] initseeds(::Symbol, ::Array{Float64,2}, ::Int64) at /Users/mysterio/.julia/packages/Clustering/uj53P/src/seeding.jl:74",
-" [10] kmeans(::Array{Float64,2}, ::Int64; weights::Nothing, init::Symbol, maxiter::Int64, tol::Float64, display::Symbol, distance::Distances.SqEuclidean) at /Users/mysterio/.julia/packages/Clustering/uj53P/src/kmeans.jl:109",
-" [11] (::var\"#12#14\")(::Int64) at ./none:0",
-" [12] iterate at ./generator.jl:47 [inlined]",
-" [13] collect_to!(::Array{Float64,1}, ::Base.Generator{UnitRange{Int64},var\"#12#14\"}, ::Int64, ::Int64) at ./array.jl:710",
-" [14] collect_to_with_first!(::Array{Float64,1}, ::Float64, ::Base.Generator{UnitRange{Int64},var\"#12#14\"}, ::Int64) at ./array.jl:689",
-" [15] collect(::Base.Generator{UnitRange{Int64},var\"#12#14\"}) at ./array.jl:670",
-" [16] ##core#268() at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:371",
-" [17] ##sample#269(::BenchmarkTools.Parameters) at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:379",
-" [18] sample at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:394 [inlined]",
-" [19] _lineartrial(::BenchmarkTools.Benchmark{Symbol(\"##benchmark#267\")}, ::BenchmarkTools.Parameters; maxevals::Int64, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:133",
-" [20] _lineartrial(::BenchmarkTools.Benchmark{Symbol(\"##benchmark#267\")}, ::BenchmarkTools.Parameters) at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:125",
-" [21] #invokelatest#1 at ./essentials.jl:712 [inlined]",
-" [22] invokelatest at ./essentials.jl:711 [inlined]",
-" [23] #lineartrial#38 at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:33 [inlined]",
-" [24] lineartrial at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:33 [inlined]",
-" [25] tune!(::BenchmarkTools.Benchmark{Symbol(\"##benchmark#267\")}, ::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, verbose::Bool, pad::String, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:209",
-" [26] tune! at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:208 [inlined] (repeats 2 times)",
-" [27] top-level scope at /Users/mysterio/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:288",
-" [28] top-level scope at In[10]:1"
-]
-}
-],
+"outputs": [],
 "source": [
-"@benchmark [Clustering.kmeans(X_1k, i; tol=1e-6, maxiter=1000).totalcost for i = 2:10] samples=7 seconds=1200"
+"@btime [Clustering.kmeans(X_1k, i; tol=1e-6, maxiter=1000).totalcost for i = 2:10] "
 ]
 },
 {
@@ -286,9 +169,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"@benchmark [ParallelKMeans.kmeans(Lloyd(), X_1m, i;\n",
-" tol=1e-6, max_iters=1000, verbose=false).totalcost \n",
-" for i = 2:10] samples=7 seconds=2000"
+"@btime [ParallelKMeans.kmeans(Lloyd(), X_1m, i; tol=1e-6, max_iters=1000, verbose=false).totalcost for i = 2:10]"
 ]
 },
 {
@@ -297,9 +178,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"@benchmark [ParallelKMeans.kmeans(Lloyd(), X_100k, i;\n",
-" tol=1e-6, max_iters=1000, verbose=false).totalcost \n",
-" for i = 2:10] samples=7 seconds=1200"
+"@btime [ParallelKMeans.kmeans(Lloyd(), X_100k, i; tol=1e-6, max_iters=1000, verbose=false).totalcost for i = 2:10]"
 ]
 },
 {
@@ -308,9 +187,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"@benchmark [ParallelKMeans.kmeans(Lloyd(), X_10k, i;\n",
-" tol=1e-6, max_iters=1000, verbose=false).totalcost \n",
-" for i = 2:10] samples=7 seconds=1200"
+"@btime [ParallelKMeans.kmeans(Lloyd(), X_10k, i; tol=1e-6, max_iters=1000, verbose=false).totalcost for i = 2:10]"
 ]
 },
 {
@@ -319,9 +196,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"@benchmark [ParallelKMeans.kmeans(Lloyd(), X_1k, i;\n",
-" tol=1e-6, max_iters=1000, verbose=false).totalcost \n",
-" for i = 2:10] samples=7 seconds=1200"
+"@btime [ParallelKMeans.kmeans(Lloyd(), X_1k, i; tol=1e-6, max_iters=1000, verbose=false).totalcost for i = 2:10]"
 ]
 },
 {
@@ -332,39 +207,39 @@
 ]
 },
 {
-"cell_type": "raw",
+"cell_type": "code",
+"execution_count": null,
 "metadata": {},
+"outputs": [],
 "source": [
-"@benchmark [ParallelKMeans.kmeans(Hamerly(), X_1m, i;\n",
-" tol=1e-6, max_iters=1000, verbose=false).totalcost \n",
-" for i = 2:10] samples=7 seconds=1200"
+"@btime [ParallelKMeans.kmeans(Hamerly(), X_1m, i; tol=1e-6, max_iters=1000, verbose=false).totalcost for i = 2:10]"
 ]
 },
 {
-"cell_type": "raw",
+"cell_type": "code",
+"execution_count": null,
 "metadata": {},
+"outputs": [],
 "source": [
-"@benchmark [ParallelKMeans.kmeans(Hamerly(), X_100k, i;\n",
-" tol=1e-6, max_iters=1000, verbose=false).totalcost \n",
-" for i = 2:10] samples=7 seconds=1200"
+"@btime [ParallelKMeans.kmeans(Hamerly(), X_100k, i; tol=1e-6, max_iters=1000, verbose=false).totalcost for i = 2:10]"
 ]
 },
 {
-"cell_type": "raw",
+"cell_type": "code",
+"execution_count": null,
 "metadata": {},
+"outputs": [],
 "source": [
-"@benchmark [ParallelKMeans.kmeans(Hamerly(), X_10k, i;\n",
-" tol=1e-6, max_iters=1000, verbose=false).totalcost \n",
-" for i = 2:10] samples=7 seconds=1200"
+"@btime [ParallelKMeans.kmeans(Hamerly(), X_10k, i; tol=1e-6, max_iters=1000, verbose=false).totalcost for i = 2:10]"
 ]
 },
 {
-"cell_type": "raw",
+"cell_type": "code",
+"execution_count": null,
 "metadata": {},
+"outputs": [],
 "source": [
-"@benchmark [ParallelKMeans.kmeans(Hamerly(), X_1k, i;\n",
-" tol=1e-6, max_iters=1000, verbose=false).totalcost \n",
-" for i = 2:10] samples=7 seconds=1200"
+"@benchmark [ParallelKMeans.kmeans(Hamerly(), X_1k, i; tol=1e-6, max_iters=1000, verbose=false).totalcost for i = 2:10] samples=7 seconds=1200"
 ]
 }
 ],
@@ -379,6 +254,9 @@
 "mimetype": "application/julia",
 "name": "julia",
 "version": "1.4.0"
+},
+"nteract": {
+"version": "0.22.4"
 }
 },
 "nbformat": 4,
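
Note on the notebook changes: most cells move from `@benchmark ... samples=7 seconds=N`, whose removed outputs reported minimum, median, mean and maximum times, to `@btime`, which prints only the minimum time and the allocations. The sketch below is a rough stand-in for that idea using only Base's `@elapsed`; it is not how BenchmarkTools measures, and `time_samples` and `work` are hypothetical names introduced for illustration.

```julia
# Hypothetical helper, not part of BenchmarkTools: run `f` a fixed number of
# times with Base's @elapsed and summarise the wall-clock samples in seconds.
function time_samples(f; samples::Integer = 7)
    f()                                         # warm-up so compilation is not timed
    times = sort([@elapsed(f()) for _ in 1:samples])
    mid = div(samples + 1, 2)                   # middle sample (lower median if even)
    return (min_s = times[1], median_s = times[mid], max_s = times[end])
end

# Hypothetical workload standing in for the k-means sweep.
work() = sum(abs2, rand(1_000))

stats = time_samples(work; samples = 7)
println("minimum: ", stats.min_s, " s, median: ", stats.median_s, " s")
```

The minimum is the least noisy of these summaries, which is one reason a single-number report tends to use it; the median and maximum are still useful when run-to-run variation matters.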
