ORC Files

Apache ORC is a columnar format with advanced features such as native Zstandard compression, bloom filters, and columnar encryption.

ORC Implementation

Spark supports two ORC implementations (native and hive), controlled by spark.sql.orc.impl. The two implementations share most functionality but have different design goals.

  • The native implementation is designed to follow Spark’s data source behavior, like Parquet.
  • The hive implementation is designed to follow Hive’s behavior and uses Hive SerDe.

For example, historically the native implementation handled CHAR/VARCHAR with Spark’s native String type, while the hive implementation handled it via Hive’s CHAR/VARCHAR types, so query results differed. Since Spark 3.1.0, SPARK-33480 removes this difference by supporting CHAR/VARCHAR on the Spark side.
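For instance, the active implementation can be switched per session with a SET statement before querying ORC data. This is a minimal sketch using standard Spark SQL configuration syntax:

```sql
-- Use the Hive ORC implementation for this session
SET spark.sql.orc.impl=hive;

-- Switch back to the default native implementation
SET spark.sql.orc.impl=native;
```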

Vectorized Reader

The native implementation supports a vectorized ORC reader and has been the default ORC implementation since Spark 2.3. The vectorized reader is used for native ORC tables (e.g., those created using the clause USING ORC) when spark.sql.orc.impl is set to native and spark.sql.orc.enableVectorizedReader is set to true.

For Hive ORC SerDe tables (e.g., those created using the clause USING HIVE OPTIONS (fileFormat 'ORC')), the vectorized reader is used when spark.sql.hive.convertMetastoreOrc is also set to true; it is turned on by default.
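Putting the settings above together, the vectorized reader can be explicitly enabled per session. The values below are the documented defaults, so this sketch is only needed after they have been changed:

```sql
-- Select the native implementation and enable vectorized decoding
SET spark.sql.orc.impl=native;
SET spark.sql.orc.enableVectorizedReader=true;

-- Additionally required for Hive ORC SerDe tables
SET spark.sql.hive.convertMetastoreOrc=true;
```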

Schema Merging

Like Protocol Buffers, Avro, and Thrift, ORC also supports schema evolution. Users can start with a simple schema and gradually add more columns as needed. In this way, users may end up with multiple ORC files whose schemas are different but mutually compatible. The ORC data source is able to automatically detect this case and merge the schemas of all these files.

Since schema merging is a relatively expensive operation and is not a necessity in most cases, it is turned off by default. You may enable it by

  1. setting data source option mergeSchema to true when reading ORC files, or
  2. setting the global SQL option spark.sql.orc.mergeSchema to true.
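As a sketch, option 1 can be expressed in SQL through a temporary view over an ORC path; the path and view name here are hypothetical:

```sql
-- Per-read: enable schema merging via the data source option
CREATE TEMPORARY VIEW people
USING ORC
OPTIONS (
  path '/data/people_orc',  -- hypothetical location of the ORC part-files
  mergeSchema 'true'
);

-- Or globally (option 2), for all ORC reads in this session
SET spark.sql.orc.mergeSchema=true;
```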

Zstandard

Spark supports both Hadoop 2 and 3. Since Spark 3.2, you can take advantage of Zstandard compression in ORC files on both Hadoop versions. Please see Zstandard for the benefits.

```sql
CREATE TABLE compressed (
  key STRING,
  value STRING
)
USING ORC
OPTIONS (
  compression 'zstd'
)
```
Bloom Filters

You can control bloom filters and dictionary encodings for ORC data sources. The following ORC example creates a bloom filter and uses dictionary encoding only for favorite_color. To find more detailed information about the extra ORC options, visit the official Apache ORC website.

```sql
CREATE TABLE users_with_options (
  name STRING,
  favorite_color STRING,
  favorite_numbers array<integer>
)
USING ORC
OPTIONS (
  orc.bloom.filter.columns 'favorite_color',
  orc.dictionary.key.threshold '1.0',
  orc.column.encoding.direct 'name'
)
```

Columnar Encryption

Since Spark 3.2, columnar encryption is supported for ORC tables with Apache ORC 1.6. The following example uses Hadoop KMS as a key provider at the given location. Please visit Apache Hadoop KMS for details.

```sql
CREATE TABLE encrypted (
  ssn STRING,
  email STRING,
  name STRING
)
USING ORC
OPTIONS (
  hadoop.security.key.provider.path "kms://http@localhost:9600/kms",
  orc.key.provider "hadoop",
  orc.encrypt "pii:ssn,email",
  orc.mask "nullify:ssn;sha256:email"
)
```

Hive metastore ORC table conversion

When reading from and inserting into Hive metastore ORC tables, Spark SQL will try to use its own ORC support instead of the Hive SerDe for better performance. For CTAS statements, only non-partitioned Hive metastore ORC tables are converted. This behavior is controlled by the spark.sql.hive.convertMetastoreOrc configuration, and it is turned on by default.
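To opt out of this conversion and force the Hive SerDe path (for example, when exact Hive compatibility matters more than performance), the configuration can be disabled per session:

```sql
-- Fall back to the Hive SerDe for Hive metastore ORC tables
SET spark.sql.hive.convertMetastoreOrc=false;
```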

Configuration

| Property Name | Default | Meaning | Since Version |
| --- | --- | --- | --- |
| spark.sql.orc.impl | native | The name of the ORC implementation. It can be one of native and hive. native means the native ORC support. hive means the ORC library in Hive. | 2.3.0 |
| spark.sql.orc.enableVectorizedReader | true | Enables vectorized ORC decoding in the native implementation. If false, a non-vectorized ORC reader is used in the native implementation. For the hive implementation, this is ignored. | 2.3.0 |
| spark.sql.orc.columnarReaderBatchSize | 4096 | The number of rows to include in an ORC vectorized reader batch. The number should be carefully chosen to minimize overhead and avoid OOMs when reading data. | 2.4.0 |
| spark.sql.orc.columnarWriterBatchSize | 1024 | The number of rows to include in an ORC vectorized writer batch. The number should be carefully chosen to minimize overhead and avoid OOMs when writing data. | 3.4.0 |
| spark.sql.orc.enableNestedColumnVectorizedReader | true | Enables vectorized ORC decoding in the native implementation for nested data types (array, map and struct). If spark.sql.orc.enableVectorizedReader is set to false, this is ignored. | 3.2.0 |
| spark.sql.orc.filterPushdown | true | When true, enable filter pushdown for ORC files. | 1.4.0 |
| spark.sql.orc.aggregatePushdown | false | If true, aggregates will be pushed down to ORC for optimization. Supports MIN, MAX and COUNT as aggregate expressions. For MIN/MAX, boolean, integer, float and date types are supported. For COUNT, all data types are supported. If statistics are missing from any ORC file footer, an exception will be thrown. | 3.3.0 |
| spark.sql.orc.mergeSchema | false | When true, the ORC data source merges schemas collected from all data files; otherwise the schema is picked from a random data file. | 3.0.0 |
| spark.sql.hive.convertMetastoreOrc | true | When set to false, Spark SQL will use the Hive SerDe for ORC tables instead of the built-in support. | 2.0.0 |

Data Source Option

Data source options of ORC can be set via the .option/.options methods of DataFrameReader, DataFrameWriter, DataStreamReader and DataStreamWriter, or via the OPTIONS clause of CREATE TABLE ... USING.

| Property Name | Default | Meaning | Scope |
| --- | --- | --- | --- |
| mergeSchema | false | Sets whether we should merge schemas collected from all ORC part-files. This will override spark.sql.orc.mergeSchema. The default value is specified in spark.sql.orc.mergeSchema. | read |
| compression | snappy | Compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, snappy, zlib, lzo, zstd and lz4). This will override orc.compress and spark.sql.orc.compression.codec. | write |

Other generic options can be found in Generic File Source Options.