@yaooqinn yaooqinn commented Sep 2, 2025

What changes were proposed in this pull request?

This PR introduces a configuration, spark.sql.hive.convertMetastoreAsNullable, that sets nullable to true for all fields when Hive parquet/orc tables are converted to data source relations. This fixes a data-correctness issue.
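To make "set nullable to true for all fields" concrete, here is a minimal sketch of a recursive schema relaxation. It is an illustration only and not necessarily the code in this patch: the NullableSchema object and its asNullable helper are hypothetical names, and the snippet assumes spark-sql is on the classpath.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper, written for illustration; not taken from the patch.
object NullableSchema {
  // Recursively mark struct fields, array elements, and map values as nullable.
  def asNullable(dt: DataType): DataType = dt match {
    case StructType(fields) =>
      StructType(fields.map { f =>
        f.copy(dataType = asNullable(f.dataType), nullable = true)
      })
    case ArrayType(elementType, _) =>
      ArrayType(asNullable(elementType), containsNull = true)
    case MapType(keyType, valueType, _) =>
      MapType(asNullable(keyType), asNullable(valueType), valueContainsNull = true)
    case other => other // atomic types carry no nullability flag of their own
  }
}

// Example schema declared NOT NULL in the metastore (made-up columns).
val hiveSchema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("tags", ArrayType(StringType, containsNull = false), nullable = false)))

val relaxed = NullableSchema.asNullable(hiveSchema).asInstanceOf[StructType]
```

With a schema relaxed this way, readers can no longer assume that a column is free of nulls, which is the assumption the description identifies as unsafe for these tables.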

Why are the changes needed?

Hive tables are generally schema-on-read, and Hive DDL statements executed through Spark are not always Hive-compatible.

We have a production case that returns incorrect values when convertMetastoreParquet=true: for a null record, it sometimes returns 5 and sometimes 0. No other values have been observed so far.

I tried four ways to read the table and its data:

  1. convertMetastoreParquet=true, vectorized -- negative (wrong values)
  2. convertMetastoreParquet=true, row-based -- negative (wrong values)
  3. convertMetastoreParquet=false -- positive (correct values)
  4. scan on the table path, partition path, or files -- positive (correct values)
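The four read paths above could be exercised roughly as follows. This is a sketch under assumptions: the table name and file path passed in are placeholders, the ReadScenarios object is hypothetical, and spark-sql with Hive support is assumed to be available. The two configuration keys are existing Spark SQL settings.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical harness for illustration; not part of the patch.
object ReadScenarios {
  val convert = "spark.sql.hive.convertMetastoreParquet"
  val vectorized = "spark.sql.parquet.enableVectorizedReader"

  // Scenarios 1 to 3 read through the metastore table with different settings.
  val viaMetastore: Seq[(String, Map[String, String])] = Seq(
    "1. converted, vectorized" -> Map(convert -> "true", vectorized -> "true"),
    "2. converted, row-based" -> Map(convert -> "true", vectorized -> "false"),
    "3. not converted" -> Map(convert -> "false"))

  // Apply the session settings, then read the Hive table by name.
  def readTable(spark: SparkSession, confs: Map[String, String], table: String): DataFrame = {
    confs.foreach { case (k, v) => spark.conf.set(k, v) }
    spark.table(table)
  }

  // Scenario 4 bypasses the metastore schema and scans the files directly.
  def readFiles(spark: SparkSession, path: String): DataFrame =
    spark.read.parquet(path)
}
```

Comparing the suspect column across the four resulting DataFrames is one way to see whether the wrong values depend on the metastore schema rather than on the files themselves.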

It turns out that the incorrect table schema led to wrong results.

So this PR adds a new configuration that falls back to Hive's handling of nullability, unconditionally treating every field as nullable.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unfortunately, I have not yet found a way to reproduce this in a unit test, so I built a package and tested it online with our production data.

Before this PR

image

After this PR

image

Was this patch authored or co-authored using generative AI tooling?

no

@yaooqinn yaooqinn marked this pull request as draft September 2, 2025 04:21
@github-actions github-actions bot added the SQL label Sep 2, 2025
@yaooqinn yaooqinn changed the title [WIP][SPARK-53450] Nulls are filled after converting hive table scan to logical relation [WIP][SPARK-53450] Nulls are filled unexpectedly after converting hive table scan to logical relation Sep 2, 2025
@yaooqinn yaooqinn marked this pull request as ready for review September 2, 2025 06:40
@yaooqinn yaooqinn self-assigned this Sep 2, 2025
@yaooqinn yaooqinn requested a review from cloud-fan September 2, 2025 06:40
@yaooqinn yaooqinn changed the title [WIP][SPARK-53450] Nulls are filled unexpectedly after converting hive table scan to logical relation [SPARK-53450] Nulls are filled unexpectedly after converting hive table scan to logical relation Sep 2, 2025

@peter-toth peter-toth left a comment

LGTM, minor nit.

if (shouldInfer) {
val tableName = relation.tableMeta.identifier.unquotedString

We can drop the relation. prefix here for code consistency.

Member Author

Nice catch! Thank you, @peter-toth

@yaooqinn yaooqinn closed this in 166895a Sep 2, 2025

yaooqinn commented Sep 2, 2025

Merged to master, thank you @peter-toth

@yaooqinn yaooqinn deleted the SPARK-53450 branch September 5, 2025 09:58