When two separate PySpark applications each instantiate a HiveContext instead of a SQLContext, one of the two fails with the error:
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. ', JavaObject id=o34039))
The other application terminates successfully.
I am using Spark 1.6 from the Python API and want to use some DataFrame functions that are only supported with a HiveContext (e.g. collect_set). I've had the same issue on 1.5.2 and earlier.
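For context, here is a rough plain-Python sketch of what collect_set computes per group, namely the distinct values seen for each key. It does not use Spark at all; the helper name collect_set_by_key and the sample rows are made up for illustration and are not part of my application.

```python
from collections import defaultdict


def collect_set_by_key(rows):
    """Group (key, value) pairs and keep only the distinct values per key."""
    grouped = defaultdict(set)
    for key, value in rows:
        grouped[key].add(value)
    return dict(grouped)


# Hypothetical sample rows, not taken from /tmp/data.parquet.
rows = [("a", 1), ("a", 1), ("a", 2), ("b", 3)]
result = collect_set_by_key(rows)
```

In Spark the same aggregation would be expressed through a group-by on the DataFrame, which is why I need the HiveContext in 1.6.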
This is enough to reproduce:
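import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext  # HiveContext, not SQLContext, is what fails
conf = SparkConf()
sc = SparkContext(conf=conf)
sq = HiveContext(sc)
data_source = '/tmp/data.parquet'
# The error is raised here when a second instance is already running
df = sq.read.parquet(data_source)
# Keep the application alive long enough to start the second instance
time.sleep(60)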
The sleep is just to keep the script running while I start the other process.
If I have two instances of this script running, the above error shows up when reading the Parquet file. When I replace HiveContext with SQLContext, everything's fine.
Does anyone know why that is?