I am trying to calculate the mean of a column and store it as a new column. Here is the snippet I am using:
df = df.withColumn("avg_colname", lit(df.select(avg("colname").as("temp")).first().getAs[Double]("temp")))
In total, there are 8 columns computed this way, each with its own `select(...).first()` call. On a small 3-node cluster launched with `spark-submit`, execution takes much longer than running the same code locally in `spark-shell` (several minutes vs. a few seconds).
Why does the code run slower on a cluster than on a single machine, and how can the snippet above be improved?
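For reference, one alternative I have been considering is computing all eight means in a single aggregation (one Spark job) instead of eight separate actions, then attaching them as literal columns. This is only a sketch; the column names `col1` ... `col8` are placeholders for my real columns:

```scala
import org.apache.spark.sql.functions.{avg, lit}

// Placeholder names for the 8 columns whose means are needed.
val cols = Seq("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8")

// One job computes all eight averages at once; first() returns a single Row.
val means = df.select(cols.map(c => avg(c).as(c)): _*).first()

// Attach each mean as a constant column, e.g. "avg_col1", "avg_col2", ...
val withMeans = cols.foldLeft(df) { (acc, c) =>
  acc.withColumn(s"avg_$c", lit(means.getAs[Double](c)))
}
```

Would this be the idiomatic way to avoid the repeated `first()` round-trips, or is there a better approach?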