I am trying to calculate the mean of a certain column and save it as a new column. The following snippet shows how I do this:

import org.apache.spark.sql.functions.{avg, lit}

df = df.withColumn("avg_colname", lit(df.select(avg("colname").as("temp")).first().getAs[Double]("temp")))

In total, there are eight columns to be calculated this way. On a small 3-node cluster (submitted via "spark-submit"), the code takes much longer than on a single machine (run via "spark-shell"): several minutes versus a few seconds.

Why does the code run slower on a cluster than on a single machine, and how can the snippet above be improved?


1 Answer

You could rewrite your code using window functions:

import org.apache.spark.sql.functions.avg

df = df.withColumn("avg_colname", avg("colname").over()) // empty window: the average over all rows
// add other columns
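
The same pattern extends to all eight columns by folding over a list of names. A minimal sketch, assuming hypothetical column names (substitute your own):

import org.apache.spark.sql.functions.avg

// Hypothetical names; replace with your 8 actual columns.
val cols = Seq("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8")

// Each avg(...).over() uses an empty window, i.e. the average over all rows.
df = cols.foldLeft(df)((acc, c) => acc.withColumn(s"avg_$c", avg(c).over()))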

Or, alternatively, by joining the averages back onto the DataFrame:

import org.apache.spark.sql.functions.{avg, broadcast}

df = df.crossJoin(broadcast(
  df.agg(
    avg("colname").as("avg_colname")
    // add other columns
  )
))

The two approaches give the same result, but they work differently: the window function moves all the data into a single partition, while the second approach uses partial aggregation and scales better for large datasets.
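
With eight columns, the second approach is best done with a single aggregation and a single join, so only one extra job runs. A sketch, reusing the hypothetical cols list from above:

import org.apache.spark.sql.functions.{avg, broadcast}

// One pass computes all 8 averages; the one-row result is broadcast to the executors.
val avgExprs = cols.map(c => avg(c).as(s"avg_$c"))
df = df.crossJoin(broadcast(df.agg(avgExprs.head, avgExprs.tail: _*)))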

You may also try caching the initial DataFrame and check whether that helps. Note as well that if your data is small, distributed computing only adds overhead and makes things slower; if you know the data is small, the best option may be one machine and one partition.
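
A minimal sketch of both suggestions:

// Cache the input so the repeated aggregations reuse it instead of recomputing the source.
df.cache()

// If the data is known to be small, collapse it to a single partition.
df = df.coalesce(1)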

