Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

According to python's GIL we cannot use threading in CPU bound processes so my question is how does Apache Spark utilize python in multi-core environment?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
856 views
Welcome To Ask or Share your Answers For Others

1 Answer

Multi-threading python issues are separated from Apache Spark internals. Parallelism on Spark is dealt with inside the JVM.

enter image description here

And the reason is that in the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.

Py4J is only used on the driver for local communication between the Python and Java SparkContext objects; large data transfers are performed through a different mechanism.

RDD transformations in Python are mapped to transformations on PythonRDD objects in Java. On remote worker machines, PythonRDD objects launch Python sub-processes and communicate with them using pipes, sending the user's code and the data to be processed.

PS: I'm not sure if this actually answers your question completely.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

548k questions

547k answers

4 comments

86.3k users

...