Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

I have set up Spark in standalone mode with 1 master and 2 workers. I launched a Spark application (a Java jar) using spark-submit and, as expected, the application runs and produces output on both worker machines.

I can find part files on both worker machines.

The Spark job is as follows:

  1. I have a text file containing the numbers 1 to 100 on worker node 1, at /Users/abc/spark/a.txt

  2. I have another text file containing the numbers 101 to 200 on the other worker node, at the same path and with the same file name.

My Spark job reads the data from the text file, multiplies each number by 2, and then saves the output.

The problem is what I found when I checked the output on both worker nodes.

Spark ignored the numbers 50-60 in the text file on worker node 1. It only considered 1-49 and 61-100, and produced output on the same machine covering 2-98 and 122-200 in 8 part files.

On the 2nd worker node it considered only the numbers 150-160. It produced output on that node covering 300-320 in 2 part files.

I'm not sure why Spark ignored the rest of the input data.

Am I storing the data in the wrong format? Or is it because I'm not using HDFS?
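One possible reading of these results, offered as an assumption rather than a confirmed diagnosis: the Spark programming guide states that a local filesystem path must be accessible at the same path on every worker node, which in practice means the same file content everywhere. If the job is planned as byte ranges over a single file and each task then reads its range from whatever copy of a.txt exists on the node it lands on, two different files at the same path would each contribute only the ranges assigned to their node. The plain-Java sketch below imitates that behavior without Spark. The class name, the helper names, the split count, and the task-to-node assignment are all hypothetical, and the rule that a line belongs to the range containing its first byte is a simplification of how real input splits handle line boundaries.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {

    // Builds "lo\n(lo+1)\n...hi\n", standing in for a.txt on one node.
    static String numbers(int lo, int hi) {
        StringBuilder sb = new StringBuilder();
        for (int i = lo; i <= hi; i++) {
            sb.append(i).append('\n');
        }
        return sb.toString();
    }

    // One task: doubles every number whose line starts inside [start, end)
    // of the file that is local to the node running the task.
    static List<Integer> runTask(String localFile, int start, int end) {
        List<Integer> out = new ArrayList<>();
        int pos = 0;
        while (pos < localFile.length()) {
            int nl = localFile.indexOf('\n', pos);
            if (nl < 0) {
                nl = localFile.length();
            }
            if (pos >= start && pos < end && nl > pos) {
                out.add(2 * Integer.parseInt(localFile.substring(pos, nl)));
            }
            pos = nl + 1;
        }
        return out;
    }

    // Plans owner.length equal byte ranges over plannedLength, then runs
    // task i against the local file of node owner[i].
    static List<Integer> runJob(String[] localFiles, int[] owner, int plannedLength) {
        int n = owner.length;
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            int start = (int) ((long) plannedLength * i / n);
            int end = (int) ((long) plannedLength * (i + 1) / n);
            out.addAll(runTask(localFiles[owner[i]], start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String node1 = numbers(1, 100);   // a.txt on worker node 1
        String node2 = numbers(101, 200); // a.txt on worker node 2
        String[] files = {node1, node2};

        // Hypothetical scheduling: 10 tasks, 8 on node 1 and 2 on node 2.
        int[] mixed = {0, 0, 0, 0, 1, 1, 0, 0, 0, 0};
        // Same plan, but every task sees the same file content.
        int[] sameFile = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

        System.out.println("different files: " + runJob(files, mixed, node1.length()));
        System.out.println("identical files: " + runJob(files, sameFile, node1.length()));
    }
}
```

In this sketch the run with different files drops a block of numbers from the middle of node 1's file and picks up only a small slice of node 2's file, while the run where every task sees identical content covers all 100 numbers. The exact ranges depend on the assumed split count and scheduling, so they are not expected to match the numbers reported above. If this reading is right, copying the same complete file to both workers, or reading from a shared store such as HDFS or a network mount, would be the usual way to avoid it.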

Question from: https://stackoverflow.com/questions/65651022/incorrect-data-read-in-spark-standalone-cluster-mode

Fight dragons for too long and you become a dragon yourself; gaze into the abyss for too long and the abyss will gaze back…
957 views

1 Answer

Waiting for answers
