Apache Spark - Group by key and create lists of values in a PySpark SQL DataFrame

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I have a Spark DataFrame that looks like this:

a | b | c
5 | 2 | 1
5 | 4 | 3
2 | 4 | 2
2 | 3 | 7

I want to group by column a, collect the values from column b into a list, and drop column c. The output DataFrame would be:

a | b_list
5 | (2,4)
2 | (4,3)

How would I go about doing this with a PySpark SQL DataFrame?

Thank you! :)

1 Answer

深蓝 · Answer 1 · 2021-10-23T21:35:16+0000

Here are the steps to get that DataFrame: build the sample data, then group by a and aggregate b with collect_list.

>>> from pyspark.sql import functions as F
>>>
>>> d = [{'a': 5, 'b': 2, 'c':1}, {'a': 5, 'b': 4, 'c':3}, {'a': 2, 'b': 4, 'c':2}, {'a': 2, 'b': 3,'c':7}]
>>> df = spark.createDataFrame(d)
>>> df.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  5|  2|  1|
|  5|  4|  3|
|  2|  4|  2|
|  2|  3|  7|
+---+---+---+

>>> # Group by a, collect b into a list, and name the result b_list as in the question.
>>> df1 = df.groupBy('a').agg(F.collect_list("b").alias("b_list"))
>>> df1.show()
+---+------+
|  a|b_list|
+---+------+
|  5|[2, 4]|
|  2|[4, 3]|
+---+------+
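
To make explicit what the aggregation above computes, here is a minimal plain-Python sketch of the same grouping on the sample rows. It is not Spark code, and group_b_by_a is a hypothetical helper introduced only for illustration; it is meant as a mental model and a way to sanity-check the expected result on small data.

```python
from collections import defaultdict

# Sample rows from the question; column c is present but ignored.
rows = [
    {"a": 5, "b": 2, "c": 1},
    {"a": 5, "b": 4, "c": 3},
    {"a": 2, "b": 4, "c": 2},
    {"a": 2, "b": 3, "c": 7},
]


def group_b_by_a(rows):
    """Hypothetical helper: collect the b values for each distinct a."""
    groups = defaultdict(list)
    for row in rows:
        # Append b to the list kept for this row's key a.
        groups[row["a"]].append(row["b"])
    return dict(groups)


b_lists = group_b_by_a(rows)
print(b_lists)  # {5: [2, 4], 2: [4, 3]}
```

One difference worth keeping in mind: this sketch preserves input order within each list, whereas Spark's collect_list is documented as non-deterministic, because the order of collected values depends on row order after a shuffle. If the order matters, it is safer to sort the resulting list explicitly.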
