CPU gap when doing k-means with Spark


  • I am working with Spark 1.2.0.
  • My feature vectors have 350 dimensions.
  • The data set has 24k vectors.
  • The problem described below happens with the kmeans|| algorithm; I have switched to kmeans-random for now, but I would like to know why kmeans|| doesn't work.

When I call KMeans.train with k=100, I observe a CPU usage gap after Spark has done several collectAsMap calls. As marked in red in the image, there are 8 cores, and only 1 core is working while the other 7 are at rest during the gap.
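For readers who want to reproduce the setup, the call described above might look roughly like the following PySpark sketch. It is not the asker's code, which was not posted: the helper names `kmeans_kwargs` and `train_kmeans` are hypothetical, the `max_iterations` value is arbitrary, and the sketch assumes `pyspark.mllib.clustering.KMeans.train` with its `k`, `maxIterations`, and `initializationMode` keyword arguments.

```python
# Hypothetical helpers for illustration; the asker's actual code was not posted.
VALID_INIT_MODES = ("k-means||", "random")


def kmeans_kwargs(k, init_mode="k-means||", max_iterations=20):
    """Build keyword arguments for pyspark.mllib.clustering.KMeans.train.

    max_iterations=20 is an arbitrary illustrative value, not taken from the question.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if init_mode not in VALID_INIT_MODES:
        raise ValueError("init_mode must be one of %r" % (VALID_INIT_MODES,))
    return {
        "k": k,
        "maxIterations": max_iterations,
        "initializationMode": init_mode,
    }


def train_kmeans(vectors_rdd, k, init_mode="k-means||"):
    """Train k-means on an RDD of feature vectors (here: 24k vectors, 350 dims)."""
    # Imported lazily so kmeans_kwargs can be used without Spark installed.
    from pyspark.mllib.clustering import KMeans

    return KMeans.train(vectors_rdd, **kmeans_kwargs(k, init_mode))
```

Under these assumptions, `train_kmeans(rdd, 100)` corresponds to the k=100 run with kmeans|| initialization, and `train_kmeans(rdd, 100, init_mode="random")` corresponds to the kmeans-random workaround mentioned in the list above.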

If I raise k to 200, the gap increases.

I want to know why there is this gap and how to avoid it, because my work requires me to set k=5000 with a larger data set. With my current settings, the job never ends...

I have tried my approach in both Windows and Linux (both 64-bit) environments, and I observe the same behavior.

If you want, I can give you the code and sample data.

[Image: CPU usage, with the gap marked in red]

Have you checked the web UI, in particular the GC times? One CPU being up while all the others are down could be a stop-the-world garbage collection.

You might want to try enabling parallel GC, and check the section on GC tuning in the Spark documentation.
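One way to act on this is to pass GC flags to the JVMs at submit time. The sketch below only assembles a `spark-submit` argument list. `gc_java_options` and `submit_args` are hypothetical helper names, `kmeans_job.py` is a placeholder script name, and the logging flags assume a Java 7/8 JVM (newer JDKs use `-Xlog:gc*` instead). In local mode the work runs inside the driver JVM, so the flags are passed through `--driver-java-options` as well as `spark.executor.extraJavaOptions`.

```python
import shlex


def gc_java_options(log_gc=True):
    """Return JVM flags selecting the parallel collector, optionally with GC logging."""
    flags = ["-XX:+UseParallelGC"]
    if log_gc:
        # Java 7/8 style GC logging flags; an assumption about the JVM in use.
        flags += ["-verbose:gc", "-XX:+PrintGCDetails", "-XX:+PrintGCTimeStamps"]
    return " ".join(flags)


def submit_args(script, log_gc=True):
    """Assemble a spark-submit argument list that applies the GC flags."""
    opts = gc_java_options(log_gc)
    return [
        "spark-submit",
        "--driver-java-options", opts,                        # driver JVM (covers local mode)
        "--conf", "spark.executor.extraJavaOptions=" + opts,  # executor JVMs
        script,
    ]


if __name__ == "__main__":
    # Print a shell-safe version of the command for copy and paste.
    print(" ".join(shlex.quote(arg) for arg in submit_args("kmeans_job.py")))
```

With GC logging enabled, the pause times in the logs can be compared against the length of the CPU gap to see whether garbage collection really accounts for it.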

Other than that, collectAsMap returns data to the master/driver, so the bigger the data gets, the longer the single driver process will take to process it. You should try increasing spark.driver.memory.
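To get a feel for how much data reaches the driver as k grows, here is a rough back-of-the-envelope sketch. It assumes each collected result holds about one dense vector of 8-byte doubles per cluster center and ignores JVM object overhead, so the numbers should be read as a lower bound. `collected_bytes` and `driver_memory_args` are hypothetical helpers, `kmeans_job.py` is a placeholder script name, and `4g` is only an example value.

```python
BYTES_PER_DOUBLE = 8


def collected_bytes(k, dims):
    """Lower-bound size of k dense vectors of `dims` doubles (assumed payload)."""
    return k * dims * BYTES_PER_DOUBLE


def driver_memory_args(script, driver_memory="4g"):
    """Assemble a spark-submit argument list that raises the driver heap."""
    return ["spark-submit", "--driver-memory", driver_memory, script]


if __name__ == "__main__":
    for k in (100, 200, 5000):
        # 350 is the feature dimension given in the question.
        print(k, collected_bytes(k, 350))
    print(" ".join(driver_memory_args("kmeans_job.py")))
```

The driver heap is fixed when the driver JVM starts, so in client or local mode it is generally passed on the command line (or in `spark-defaults.conf`) rather than set from inside the application.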

