CPU gap when doing k-means with Spark
- I am working with Spark 1.2.0.
- My feature vectors have 350 dimensions.
- The data set has 24k vectors.
- The problem described below happens with the kmeans|| algorithm; I have switched to kmeans-random for now, but I would like to know why kmeans|| doesn't work.
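For anyone who wants to reproduce the switch, here is a minimal sketch of selecting the initialization mode through PySpark's MLlib API. The helper names `kmeans_params` and `train_kmeans` and the `max_iterations=20` default are assumptions for illustration, not taken from the job described here. What this post calls kmeans-random presumably corresponds to the mode string "random".

```python
# Initialization mode strings accepted by MLlib's KMeans.
VALID_MODES = ("random", "k-means||")


def kmeans_params(k, mode="random", max_iterations=20):
    """Build keyword arguments for KMeans.train (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError("unknown initialization mode: %r" % (mode,))
    return {"k": k, "maxIterations": max_iterations, "initializationMode": mode}


def train_kmeans(rdd, k, mode="random"):
    """Train on an RDD of feature vectors; requires PySpark to be installed."""
    # Imported lazily so kmeans_params can be used without Spark available.
    from pyspark.mllib.clustering import KMeans
    return KMeans.train(rdd, **kmeans_params(k, mode))
```

Calling `train_kmeans(rdd, 100, "k-means||")` and `train_kmeans(rdd, 100, "random")` on the same RDD is one way to compare the two modes side by side.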
When I call KMeans.train with k=100, I observe a CPU usage gap after Spark has done several collectAsMap calls. As marked in red in the image, there are 8 cores, and only 1 core is working while the other 7 are at rest during the gap.
If I raise k to 200, the gap increases.
I want to know why there is a gap and how to avoid it, because my work requires me to set k=5000 on a larger data set. With my current settings, the job never ends...
I have tried my approach in both Windows and Linux (both 64-bit) environments, and I observe the same behavior.
If you want, I can give you the code and sample data.
Have you checked the web UI, especially the GC times? One CPU up while the others are down could be a stop-the-world garbage collection.
You might want to try enabling the parallel GC, and check the section on GC tuning in the Spark documentation.
Other than that, collectAsMap returns the data to the master/driver, so the bigger the data gets, the longer the single driver process will take to process it. You should also try increasing spark.driver.memory.
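As a rough sketch of the suggestions above, the helper below assembles a `spark-submit` command line that raises the driver memory and enables the parallel collector with GC logging. The `4g` value, the script name `kmeans_job.py`, and the helper name `spark_submit_command` are assumptions for illustration. Driver memory is passed on the command line because, in client mode, the driver JVM has already started by the time application code could set it through SparkConf.

```python
import shlex

# JVM flags: use the parallel collector and log each collection.
GC_OPTS = "-XX:+UseParallelGC -verbose:gc"


def spark_submit_command(app, driver_memory="4g", gc_opts=GC_OPTS):
    """Return a spark-submit argument list (hypothetical helper)."""
    return [
        "spark-submit",
        "--driver-memory", driver_memory,          # same setting as spark.driver.memory
        "--driver-java-options", gc_opts,          # GC flags for the driver JVM
        "--conf", "spark.executor.extraJavaOptions=" + gc_opts,  # and for executors
        app,
    ]


if __name__ == "__main__":
    cmd = spark_submit_command("kmeans_job.py")
    # Print a shell-safe version of the command for copy and paste.
    print(" ".join(shlex.quote(part) for part in cmd))
```

With GC logging on, the collection pauses in the driver output can be compared against the GC Time column in the web UI to see whether they line up with the gap.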