OK, after a few days of testing I finally found the solution. Before pasting the code, let me summarize what I found:
- Those $folder$ files are created by Hadoop. Apache Hadoop creates them when it creates a folder in an S3 bucket. Source 1
They are actually directory markers of the form path + /. Source 2
- To change the behavior, you need to change the Hadoop S3 write configuration in the Spark context. Read this and this and this
- Read about S3, S3a and S3n here and here
- Thanks to @stevel's comment here
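If the bucket already contains markers, it can help to tell them apart from real data when looking at a key listing. This is only a minimal sketch: it assumes the marker keys end in `_$folder$`, which is how they usually show up in a listing, and `split_folder_markers` and the sample keys are hypothetical names made up for illustration.

```python
def split_folder_markers(keys, suffix="_$folder$"):
    """Split a list of S3 keys into (marker_keys, data_keys).

    The suffix is an assumption about how the markers are named;
    pass a different one if your listing looks different.
    """
    markers = [key for key in keys if key.endswith(suffix)]
    data = [key for key in keys if not key.endswith(suffix)]
    return markers, data


# Hypothetical listing, only to show the shape of the result.
sample_keys = [
    "YY_$folder$",
    "YY/DDD=1_$folder$",
    "YY/DDD=1/part-00000.parquet",
]
marker_keys, data_keys = split_folder_markers(sample_keys)
```

The helper only classifies keys that were already listed; it does not talk to S3 or delete anything.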
The solution is to set the following on the Hadoop configuration of the Spark context:
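from pyspark import SparkContext

sc = SparkContext()
# Hadoop configuration used by this Spark context
hadoop_conf = sc._jsc.hadoopConfiguration()
# Route s3:// paths through the S3A filesystem implementation
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")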
To avoid creating the _SUCCESS files, you also need to set the following:
hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
Make sure you use the s3:// URI when writing to the S3 bucket, for example:
myDF.write.mode("overwrite").parquet('s3://XXX/YY', partitionBy=['DDD'])
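Putting the two settings in one place makes it harder to forget one of them. The following is just a sketch of that idea: the keys and values are the ones from this answer, while `S3_WRITE_SETTINGS`, `apply_s3_write_settings` and `RecordingConf` are hypothetical names I made up. `RecordingConf` is a stand-in so the helper can be tried without a Spark installation; it is not a Hadoop or Spark class.

```python
# Settings taken from this answer; keys and values are unchanged.
S3_WRITE_SETTINGS = {
    "fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "mapreduce.fileoutputcommitter.marksuccessfuljobs": "false",
}


def apply_s3_write_settings(hadoop_conf, settings=S3_WRITE_SETTINGS):
    """Call hadoop_conf.set(key, value) for every setting.

    Works with any object exposing set(key, value).
    Returns the list of (key, value) pairs that were applied.
    """
    applied = []
    for key, value in settings.items():
        hadoop_conf.set(key, value)
        applied.append((key, value))
    return applied


class RecordingConf:
    """Hypothetical stand-in for a Hadoop configuration object."""

    def __init__(self):
        self.values = {}

    def set(self, key, value):
        self.values[key] = value

    def get(self, key):
        return self.values.get(key)


conf = RecordingConf()
applied = apply_s3_write_settings(conf)
```

In a real job you would pass the object returned by sc._jsc.hadoopConfiguration() instead of RecordingConf(), and do so before the first write.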