[SPARK-53450] Nulls are filled unexpectedly after converting hive table scan to logical relation #52193
What changes were proposed in this pull request?
This PR introduces a configuration, spark.sql.hive.convertMetastoreAsNullable, that sets nullable to true for all fields when converting Hive Parquet/ORC tables to data source relations, in order to fix a data-correctness issue.
Why are the changes needed?
Hive tables are generally schema-on-read, and Hive DDLs executed through Spark are not always Hive-compatible.
We have a production case that returns incorrect values when convertMetastoreParquet=true: for a null record, it sometimes returns 5 and sometimes 0; no other values have been observed so far.
I tried four ways of reading the table and its data.
It turned out that the incorrect table schema led to the wrong results.
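One plausible way a wrong non-nullable declaration can surface arbitrary values for a null record is sketched below. This is an illustration under stated assumptions, not the code path identified in this PR: the Column class and read function are hypothetical, and the premise is that a reader which trusts the declared schema skips the null check and returns whatever is left in the value slot.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Column:
    """Hypothetical column batch: raw value slots plus a per-row null mask."""
    values: List[int]     # slot contents are undefined where is_null is True
    is_null: List[bool]


def read(col: Column, row: int, nullable: bool) -> Optional[int]:
    # Assumed behavior, for illustration only: the null mask is consulted
    # only when the schema declares the field nullable.
    if nullable and col.is_null[row]:
        return None
    return col.values[row]


# Row 1 is null in the data; its value slot still holds a leftover 5.
col = Column(values=[7, 5], is_null=[False, True])

wrong = read(col, 1, nullable=False)  # schema claims NOT NULL: leftover value
right = read(col, 1, nullable=True)   # schema relaxed to nullable: None
```

In this sketch the value returned for the null row depends entirely on what the slot happened to contain, which would be consistent with seeing different values such as 5 or 0 across runs.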
So this PR adds a new configuration that blindly treats every field as nullable, falling back to how Hive handles nullability.
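A minimal sketch of what "set nullable to true for all fields" means for a schema with nested fields. The Field and Struct classes and the as_nullable helper are hypothetical stand-ins written for this example; they are not Spark's classes and not the implementation in this PR.

```python
from dataclasses import dataclass, replace
from typing import Tuple, Union


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field."""
    name: str
    dtype: Union[str, "Struct"]
    nullable: bool


@dataclass(frozen=True)
class Struct:
    """Hypothetical stand-in for a struct type: an ordered tuple of fields."""
    fields: Tuple[Field, ...]


def as_nullable(schema: Struct) -> Struct:
    """Return a copy of the schema with every field, nested ones included, nullable."""
    relaxed = []
    for f in schema.fields:
        # Recurse into nested structs so inner fields are relaxed as well.
        dtype = as_nullable(f.dtype) if isinstance(f.dtype, Struct) else f.dtype
        relaxed.append(replace(f, dtype=dtype, nullable=True))
    return Struct(tuple(relaxed))


# Metastore schema that (possibly incorrectly) declares every field NOT NULL.
schema = Struct((
    Field("id", "bigint", nullable=False),
    Field("info", Struct((Field("score", "int", nullable=False),)), nullable=False),
))
relaxed = as_nullable(schema)
```

The original schema is left unchanged; only the copy handed to the reader is relaxed, so a declared NOT NULL constraint that the data does not honor can no longer cause null checks to be skipped.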
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unfortunately, I have not yet found a reproducible unit test that covers this, so I built a package and tested it online against our production data.
Before this PR
After this PR
Was this patch authored or co-authored using generative AI tooling?
no