I want to pull data from about 1,500 remote Oracle tables with Spark, and I'd like a multi-threaded application where each thread picks up one table (or maybe 10 tables) and launches a Spark job to read from its respective tables.
From the official Spark docs https://spark.apache.org/docs/latest/job-scheduling.html it's clear that this can work:
> ...cluster managers that Spark runs on provide facilities for scheduling across applications. Second, within each Spark application, multiple "jobs" (Spark actions) may be running concurrently if they were submitted by different threads. This is common if your application is serving requests over the network. Spark includes a fair scheduler to schedule resources within each SparkContext.
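Based on those docs, I was planning to turn on the fair scheduler with a single conf setting before submitting jobs from multiple threads (assuming I'm reading the config docs right — app name is made up):

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# FAIR scheduling lets jobs submitted from different threads share
# executors round-robin instead of queuing FIFO behind each other.
conf = SparkConf().set("spark.scheduler.mode", "FAIR")
spark = (SparkSession.builder
         .config(conf=conf)
         .appName("multi-table-pull")  # placeholder name
         .getOrCreate())
```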
However, as you might have noticed in this SO post, Concurrent job Execution in Spark, that similar question has no accepted answer, and the most upvoted answer starts with:
> This is not really in the spirit of Spark
- Everyone knows it's not in the "spirit" of Spark.
- Who cares what the "spirit" of Spark is? That doesn't actually mean anything.
Has anyone gotten something like this to work? Did you have to do anything special? I just want some pointers before I waste a lot of work hours prototyping. I would really appreciate any help on this!
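For concreteness, here is a rough sketch of the threading pattern I have in mind. Everything here is a placeholder (table names, batch size, thread count), and the per-table reader is injected as a function so the threading part can be shown on its own; in the real thing it would be a `spark.read.jdbc(url=..., table=..., properties=...)` followed by a write:

```python
from concurrent.futures import ThreadPoolExecutor

def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def pull_batch(tables, read_table):
    # Each call runs in its own thread, so each Spark action it triggers
    # becomes a separate concurrent job under FAIR scheduling.
    return [read_table(t) for t in tables]

def pull_all(tables, read_table, batch_size=10, max_threads=8):
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(pull_batch, b, read_table)
                   for b in batches(tables, batch_size)]
        results = []
        for f in futures:
            results.extend(f.result())  # re-raises any per-batch failure
        return results

# Stand-in for the real Spark read, e.g.:
#   spark.read.jdbc(url=jdbc_url, table=name, properties=props)
def fake_read(name):
    return f"pulled {name}"

tables = [f"SCHEMA.T{i:04d}" for i in range(1500)]  # made-up table names
out = pull_all(tables, fake_read)
print(len(out))  # 1500
```

The one Spark-specific caveat I'm aware of is that all threads share a single SparkSession/SparkContext, which the docs say is thread-safe for submitting jobs.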