I am trying to calculate the mean of a certain column and save it as a new column. The following snippet shows how I do this:

import org.apache.spark.sql.functions.{avg, lit}

df = df.withColumn("avg_colname", lit(df.select(avg("colname").as("temp")).first().getAs[Double]("temp")))

In total, there are eight columns to be calculated this way. On a small 3-node cluster (submitted via "spark-submit"), the code takes much longer than on a single machine (run via "spark-shell"): several minutes versus a few seconds.

Why does the code run slower on a cluster than on a single machine, and how can the snippet above be improved?


1 Answer

You could rewrite your code using window functions:

import org.apache.spark.sql.functions.avg

df = df.withColumn("avg_colname", avg("colname").over()) // empty window: the average over all rows
// add other columns
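
The same pattern extends to all eight columns by folding over a list of names. A minimal sketch, assuming hypothetical column names (substitute your own):

import org.apache.spark.sql.functions.avg

// Hypothetical names; replace with your 8 actual columns.
val cols = Seq("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8")

// Each avg(...).over() uses an empty window, i.e. the average over all rows.
df = cols.foldLeft(df)((acc, c) => acc.withColumn(s"avg_$c", avg(c).over()))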

Or, alternatively, by joining the averages back onto the DataFrame:

import org.apache.spark.sql.functions.{avg, broadcast}

df = df.crossJoin(broadcast(
  df.agg(
    avg("colname").as("avg_colname")
    // add other columns
  )
))

The two approaches give the same result, but they work differently: the window function moves all the data into a single partition, while the second approach uses partial aggregation and scales better for large datasets.
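
With eight columns, the second approach is best done with a single aggregation and a single join, so only one extra job runs. A sketch, reusing the hypothetical cols list from above:

import org.apache.spark.sql.functions.{avg, broadcast}

// One pass computes all 8 averages; the one-row result is broadcast to the executors.
val avgExprs = cols.map(c => avg(c).as(s"avg_$c"))
df = df.crossJoin(broadcast(df.agg(avgExprs.head, avgExprs.tail: _*)))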

You may also try caching the initial DataFrame and check whether that helps. Note as well that if your data is small, distributed computing only adds overhead and makes things slower; if you know the data is small, the best option may be one machine and one partition.
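
A minimal sketch of both suggestions:

// Cache the input so the repeated aggregations reuse it instead of recomputing the source.
df.cache()

// If the data is known to be small, collapse it to a single partition.
df = df.coalesce(1)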

