I am trying to replicate the solution given here https://www.cloudera.com/documentation/enterprise/5-7-x/topics/spark_python.html to import external packages in pyspark, but it is failing.
My code:
spark_distro.py
from pyspark import SparkContext, SparkConf

def import_my_special_package(x):
    # Import on the worker so the shipped module is resolved there
    from external_package import external
    return external.fun(x)

conf = SparkConf()
sc = SparkContext(conf=conf)
int_rdd = sc.parallelize([1, 2, 3, 4])
int_rdd.map(lambda x: import_my_special_package(x)).collect()
external_package.py
class external:
    def __init__(self, x):
        self.x = x

    @staticmethod
    def fun(x):
        return x * 3
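For reference, the tripling logic itself works outside Spark; this minimal sanity check mirrors the class above (no Spark needed), so the failure is purely about shipping the module to the executors:

```python
# Standalone check of the external_package.py helper logic.
# `external.fun` is expected to triple its input.
class external:
    @staticmethod
    def fun(x):
        return x * 3

print([external.fun(x) for x in [1, 2, 3, 4]])  # expected: [3, 6, 9, 12]
```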
spark submit command:
spark-submit \
    --master yarn \
    /path to script/spark_distro.py \
    --py-files /path to script/external_package.py \
    1000
Actual error:
vs = list(itertools.islice(iterator, batch))
File "/home/gsurapur/pyspark_examples/spark_distro.py", line 13, in <lambda>
File "/home/gsurapur/pyspark_examples/spark_distro.py", line 6, in import_my_special_package
ImportError: No module named external_package
Expected output:
[3, 6, 9, 12]
I also tried the sc.addPyFile option, and it fails with the same ImportError.