Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

I have created a DataFrame, ordersDF. Its schema is below.

root
 |-- order_id: long (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_customer_id: long (nullable = true)
 |-- order_status: string (nullable = true)
 

In some places we use 'order_id', in others order_id, and in others ordersDF.order_id. It is really confusing when to use which one. For example:

1) ordersDF.select(order_id).show()  -- NameError: name 'order_id' is not defined
   ordersDF.where('order_id==9').show()  -- No error here

2) ordersDF.select('order_id').show()  -- No error here

3) ordersDF.select(ordersDF.order_id).show()  -- No error here

4) ordersDF.where('ordersDF.order_id==9').show()  -- AnalysisException: cannot resolve '`ordersDF.order_id`' given input columns: [order_customer_id, order_date, order_id, order_status]; line 1 pos 0;
Question from: https://stackoverflow.com/questions/65914467/spark-dataframe-clarification-on-select

1 Answer

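From what I understand, you are confused about how to pass a column.
Either use DF.ColumnName without quotes (ordersDF.order_id), or pass the column name as a quoted string ('order_id'). Both forms work in select. A bare order_id fails because Python looks for a variable with that name and finds none. The string 'ordersDF.order_id==9' fails because Spark parses it as a SQL expression, and the Python variable name ordersDF is not a table or alias that Spark knows about.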
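
To see why the unquoted form fails before Spark is even involved, here is a minimal pure-Python sketch. The select function below is a hypothetical stand-in, not Spark's DataFrame.select; it only reports the type of what it receives.

```python
def select(*cols):
    """Hypothetical stand-in for DataFrame.select: reports the argument types."""
    return [type(c).__name__ for c in cols]


# A quoted name is a string literal, so Python passes it straight through.
received = select("order_id")

# An unquoted name is a Python variable lookup. No variable called order_id
# exists here, so Python raises NameError while evaluating the argument,
# before select() is ever called.
try:
    select(order_id)
    error = None
except NameError as exc:
    error = str(exc)

print(received)
print(error)
```

The same thing happens with ordersDF.select(order_id): the error comes from Python, not from Spark.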

This should solve your problem.

