Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

Having some issues with the way that Spark is interpreting columns for parquet.

I have an Oracle source with confirmed schema (df.schema() method):

root
  |-- LM_PERSON_ID: decimal(15,0) (nullable = true)
  |-- LM_BIRTHDATE: timestamp (nullable = true)
  |-- LM_COMM_METHOD: string (nullable = true)
  |-- LM_SOURCE_IND: string (nullable = true)
  |-- DATASET_ID: decimal(38,0) (nullable = true)
  |-- RECORD_ID: decimal(38,0) (nullable = true)

Which is then saved as Parquet - df.write().parquet() method - with corresponding message type (determined by Spark):

  message spark_schema {
    optional int64 LM_PERSON_ID (DECIMAL(15,0));
    optional int96 LM_BIRTHDATE;
    optional binary LM_COMM_METHOD (UTF8);
    optional binary LM_SOURCE_IND (UTF8);
    optional fixed_len_byte_array(16) DATASET_ID (DECIMAL(38,0));
    optional fixed_len_byte_array(16) RECORD_ID (DECIMAL(38,0));
}

My application then generates the table DDL using a HashMap for type conversion, for example:

CREATE EXTERNAL TABLE IF NOT EXISTS 
ELM_PS_LM_PERSON (
LM_PERSON_ID DECIMAL(15,0)
,LM_BIRTHDATE TIMESTAMP
,LM_COMM_METHOD STRING
,LM_SOURCE_IND STRING
,DATASET_ID DECIMAL(38,0)
,RECORD_ID DECIMAL(38,0)
) PARTITIONED BY (edi_business_day STRING) STORED AS PARQUET LOCATION '<PATH>'

My issue is that the table will fail to be read by Impala because it will not accept LM_PERSON_ID as a decimal field. The table will only read the parquet file if this column is set to BIGINT.

Query 8d437faf6323f0bb:b7ba295d028c8fbe: 0% Complete (0 out of 1)
File 'hdfs:dev/ELM/ELM_PS_LM_PERSON/part-00000-fcdbd3a5-9c93-490e-a124-c2a327a17a17.snappy.parquet' has an incompatible Parquet schema for column 'rbdshid1.elm_ps_lm_person_2.lm_person_id'. 
Column type: DOUBLE, Parquet schema:
optional int64 LM_PERSON_ID [i:0 d:1 r:0]

How do I know when to substitute a Decimal field for BIGINT?

The parquet message type is logged but not accessible?

Two decimal fields are converted to fixed_len_byte_array(16), LM_PERSON_ID is converted to int64

The only resolution I can think of is to create the table, test if it returns, if not drop and substitute decimal fields to BIGINT one by one, testing each time.

What am I missing here? Can I enforce a schema for the parquet file for decimal?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
399 views
Welcome To Ask or Share your Answers For Others

1 Answer

From Configuration section of Parquet Files in the official documentation of Apache Spark:

spark.sql.parquet.writeLegacyFormat (default: false)

If true, data will be written in a way of Spark 1.4 and earlier. For example, decimal values will be written in Apache Parquet's fixed-length byte array format, which other systems such as Apache Hive and Apache Impala use. If false, the newer format in Parquet will be used. For example, decimals will be written in int-based format. If Parquet output is intended for use with systems that do not support this newer format, set to true.

Answer Given Before Official Docs Got Updated

The very similar SPARK-20297 Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala was quite lately (20/Apr/17 01:59) resolved as Not A Problem.

The main point is to use spark.sql.parquet.writeLegacyFormat property and write a parquet metadata in a legacy format (which I don't see described in the official documentation under Configuration and reported as an improvement in SPARK-20937).

Data written by Spark is readable by Hive and Impala when spark.sql.parquet.writeLegacyFormat is enabled.

It does follow the newer standard - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal and I missed the documentation. Wouldn't it be then bugs in Impala or Hive?

The int32/int64 options were present in the original version of the decimal spec, they just weren't widely implemented: https://github.com/Parquet/parquet-format/commit/b2836e591da8216cfca47075baee2c9a7b0b9289 . So its not a new/old version thing, it was just an alternative representation that many systems didn't implement.

This SPARK-10400 can also be a quite helpful reading (about the history of spark.sql.parquet.writeLegacyFormat property):

We introduced SQL option "spark.sql.parquet.followParquetFormatSpec" while working on implementing Parquet backwards-compatibility rules in SPARK-6777. It indicates whether we should use legacy Parquet format adopted by Spark 1.4 and prior versions or the standard format defined in parquet-format spec. However, the name of this option is somewhat confusing, because it's not super intuitive why we shouldn't follow the spec. Would be nice to rename it to "spark.sql.parquet.writeLegacyFormat" and invert its default value (they have opposite meanings). Note that this option is not "public" (isPublic is false).


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

548k questions

547k answers

4 comments

86.3k users

...