CPU gap when doing k-means with Spark
- I am working with Spark 1.2.0.
- My feature vectors have 350 dimensions.
- The data set contains 24k vectors.
- The problem described below happens with the kmeans|| algorithm; I have switched to kmeans-random for now, but I would like to know why kmeans|| doesn't work.
When I call KMeans.train with k=100, I observe a CPU usage gap after Spark has done several collectAsMap calls. As marked in red in the image, there are 8 cores, and only 1 core is working while the other 7 are at rest during this gap.
If I raise k to 200, the gap increases.
I want to know why this gap occurs and how to avoid it, because my work requires me to set k=5000 with a much larger data set. With my current settings, the job never ends...
I have tried my approach in both Windows and Linux (both 64-bit) environments, and I observe the same behavior.
If you want, I can give you the code and sample data.

Have you checked the web UI, especially the GC times? One CPU up while the others are down may be a stop-the-world garbage collection.
You might want to try enabling parallel GC, and check the section on GC tuning in the Spark documentation.
Other than that, collectAsMap returns the data to the master/driver, so the bigger the data gets, the longer the single driver process will take to process it. You should also try increasing spark.driver.memory.
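As a rough sketch of what that could look like, the fragment below turns on the parallel collector and GC logging through conf/spark-defaults.conf. It assumes a HotSpot JVM from the Java 7/8 era (the PrintGC* flags were removed in later JVMs), and it sets the options for both the executors and the driver, since it is not clear from the question which JVM is pausing.

```
# conf/spark-defaults.conf (sketch; the flag selection is an assumption, adjust to your JVM)
spark.executor.extraJavaOptions  -XX:+UseParallelGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.driver.extraJavaOptions    -XX:+UseParallelGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
```

With logging enabled, long pauses in the GC output that line up with the gap would support the stop-the-world explanation. If no such pauses show up, GC is probably not the cause.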
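For illustration, driver memory can be raised in conf/spark-defaults.conf as sketched below. The 4g value is only a placeholder, not a recommendation derived from the question's data.

```
# conf/spark-defaults.conf (sketch; 4g is an assumed placeholder value)
spark.driver.memory  4g
```

In client mode this has to be in effect before the driver JVM starts, so set it here or with spark-submit's --driver-memory option. Setting it through SparkConf inside the application is too late.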