I have set up Spark in standalone mode with 1 master and 2 workers. I launched a Spark application (a Java jar) using spark-submit, and as expected the application runs and produces output on both worker machines.
I can find part files on both worker machines.
The Spark job is as follows:
I have a text file containing the numbers 1 to 100 on worker node 1, at /Users/abc/spark/a.txt
I have another text file containing the numbers 101 to 200 on the other worker node, at the same path and with the same file name.
My Spark job reads the data with textFile, maps each number to that number multiplied by 2, and then saves the output.
The problem is what I found when I checked the output on both worker nodes.
On worker node 1, Spark ignored the numbers 50-60 in the text file. It only considered 1-49 and 61-100, and wrote the output on the same machine, 2-98 and 122-200, in 8 part files.
On worker node 2, it considered only the numbers 150-160, and wrote the output on that machine, 300-320, in 2 part files.
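To make the gap concrete, the arithmetic above can be checked with a small plain-Java sketch. This is not the Spark job itself and uses no Spark API; the class name `MissingInputCheck` and its helper methods are hypothetical. It assumes that every value inside the reported output ranges is actually present in the part files.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class MissingInputCheck {

    // Same per-record transformation the job is described as applying.
    static int transform(int n) {
        return n * 2;
    }

    // Output values expected for the inputs from..to (inclusive).
    static Set<Integer> doubledRange(int from, int to) {
        return IntStream.rangeClosed(from, to)
                .map(MissingInputCheck::transform)
                .boxed()
                .collect(Collectors.toCollection(TreeSet::new));
    }

    // Inputs in from..to whose doubled value does not appear in the observed output.
    static List<Integer> missingInputs(int from, int to, Set<Integer> observed) {
        return IntStream.rangeClosed(from, to)
                .filter(n -> !observed.contains(transform(n)))
                .boxed()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumption: the reported ranges are complete (2-98 and 122-200 on node 1).
        Set<Integer> worker1 = doubledRange(1, 49);
        worker1.addAll(doubledRange(61, 100));
        // Assumption: 300-320 on node 2 is complete as well.
        Set<Integer> worker2 = doubledRange(150, 160);

        List<Integer> missing1 = missingInputs(1, 100, worker1);
        List<Integer> missing2 = missingInputs(101, 200, worker2);

        // prints "worker 1: 89 outputs, 11 inputs missing"
        System.out.println("worker 1: " + worker1.size() + " outputs, "
                + missing1.size() + " inputs missing");
        // prints "worker 2: 11 outputs, 89 inputs missing"
        System.out.println("worker 2: " + worker2.size() + " outputs, "
                + missing2.size() + " inputs missing");
    }
}
```

Under that assumption, the two nodes together hold 89 + 11 = 100 output values, although the two input files contain 200 numbers in total.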
I am not sure why Spark ignored the rest of the input data.
Am I storing the data in the wrong format? Or is it because I'm not using HDFS?
Question from: https://stackoverflow.com/questions/65651022/incorrect-data-read-in-spark-standalone-cluster-mode