CPU gap when doing k-means with Spark

- August 15, 2013

I am working with Spark 1.2.0.
My feature vectors have 350 dimensions.
The data set has 24k vectors.
The problem described below happens with the kmeans|| algorithm; I have switched to kmeans-random for now, but I want to know why kmeans|| doesn't work.
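For readers who want to try the same switch, here is a minimal sketch of selecting the initialization mode explicitly. It assumes the PySpark MLlib API (`pyspark.mllib.clustering.KMeans.train` with its `initializationMode` keyword); the question does not say which language binding was used. In MLlib the mode strings are `"random"` and `"k-means||"`, which is presumably what "kmeans-random" refers to. The helper name `train_kmeans` and the default of 20 iterations are illustrative only.

```python
# Initialization modes accepted by MLlib's KMeans (string values).
INIT_RANDOM = "random"
INIT_PARALLEL = "k-means||"


def train_kmeans(vectors_rdd, k, init_mode=INIT_RANDOM, max_iterations=20):
    """Train an MLlib k-means model with an explicit initialization mode.

    vectors_rdd is assumed to be an RDD of equal-length numeric vectors.
    train_kmeans is a hypothetical helper, not part of Spark.
    """
    if init_mode not in (INIT_RANDOM, INIT_PARALLEL):
        raise ValueError("unknown initialization mode: %r" % (init_mode,))
    # Imported lazily so the argument check above works without Spark installed.
    from pyspark.mllib.clustering import KMeans
    return KMeans.train(vectors_rdd, k,
                        maxIterations=max_iterations,
                        initializationMode=init_mode)
```

With a live SparkContext, `train_kmeans(rdd, 100)` would use random initialization, and `train_kmeans(rdd, 100, INIT_PARALLEL)` would use k-means||, which makes it easy to compare the two runs in the web UI.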

When I call KMeans.train with k=100, I observe a CPU usage gap after Spark has done several collectAsMap calls. As marked in red in the image, there are 8 cores, and only 1 core is working while the other 7 are at rest during the gap.

If I raise k to 200, the gap increases.

I want to know why the gap occurs and how to avoid it, because my work requires me to set k=5000 with a larger data set. With my current settings, the job never ends...

I have tried this approach in both Windows and Linux (both 64-bit) environments, and I observe the same behavior.

If you want, I can give you the code and sample data.

[Image: CPU usage graph, with the gap marked in red]

Have you checked the web UI, especially the GC times? One CPU up while the others are down could indicate a stop-the-world garbage collection.

You might want to try enabling parallel GC, and check the section on GC tuning in the Spark documentation.
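As a rough sketch of what that could look like, the helper below builds `spark-submit` arguments that request a parallel collector and turn on GC logging. It assumes that "parallel GC" means the HotSpot flag `-XX:+UseParallelGC`, and that the logging flags are the ones accepted by Java 8 and earlier JVMs (the generation Spark 1.2.0 runs on). `--driver-java-options` and `spark.executor.extraJavaOptions` are standard Spark settings; the name `gc_conf_args` is made up for this example.

```python
# JVM flags: a parallel (multi-threaded) collector plus GC logging.
# Assumption: a Java 8 or earlier HotSpot JVM, where these flags are accepted.
GC_FLAGS = [
    "-XX:+UseParallelGC",      # parallel collector
    "-verbose:gc",             # log every collection
    "-XX:+PrintGCDetails",     # include per-generation details
    "-XX:+PrintGCTimeStamps",  # timestamp each log entry
]


def gc_conf_args(flags=GC_FLAGS):
    """Return spark-submit arguments that pass the GC flags to the JVMs.

    gc_conf_args is a hypothetical helper, not part of Spark.
    """
    joined = " ".join(flags)
    return [
        # The driver JVM is already running when application code executes,
        # so its options are passed on the command line.
        "--driver-java-options", joined,
        "--conf", "spark.executor.extraJavaOptions=" + joined,
    ]
```

The GC log lines then show how long each pause lasted, which can be lined up against the gap in the CPU graph.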

Other than that, collectAsMap returns data to the master/driver, so the bigger the data gets, the longer the single driver process will take to process it. You should also try increasing spark.driver.memory.
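A small sketch of raising the driver memory at launch time follows. In client mode the driver JVM has already started by the time application code runs, so the setting is normally given to `spark-submit` (via `--driver-memory`) or in `spark-defaults.conf` rather than set on a `SparkConf` inside the job. The values `4g` and `local[8]` and the script name `kmeans_job.py` are placeholders, not taken from the question.

```python
def submit_command(app_path, driver_memory="4g", master="local[8]"):
    """Build a spark-submit command line with an explicit driver heap size.

    submit_command is a hypothetical helper; the defaults are placeholders.
    """
    return [
        "spark-submit",
        "--master", master,
        "--driver-memory", driver_memory,  # same effect as spark.driver.memory
        app_path,
    ]


# Example: the list can be handed to subprocess.call(...) to launch the job.
command = submit_command("kmeans_job.py", driver_memory="8g")
```

Whether more heap shrinks the gap depends on whether the driver is actually short of memory; the GC times in the web UI are the way to check.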
