Description
Apache Iceberg version
main (development)
Please describe the bug 🐞
Environment
- iceberg-go: v0.3.1
- Catalog: REST catalog (standard Iceberg REST API)
- Go: 1.21+
- OS: Linux/Windows (reproducible across)
Summary
In v0.3.0, table.NewAddSchemaUpdate(*Schema, lastColumnID, initial) allowed callers to ensure the table’s last-assigned-field-id stayed monotonic even when deleting the column that previously had the highest field ID.
In v0.3.1, the API changed to table.NewAddSchemaUpdate(*Schema) (no lastColumnID). When we only delete the highest-ID column(s) and add no new columns, the library appears to derive last-assigned-field-id from the new schema’s max field ID, which decreases. The REST catalog then rejects the commit with:
invalid_metadata: The specified metadata is not valid
Deleting a first/middle column works; deleting the tail (max-ID) column fails.
Note: This repro excludes partition/sort references (i.e., we are not deleting a column referenced by the default spec or sort order).
Steps to Reproduce
- Start with a table whose current schema has fields, e.g.: a (id=1), b (id=2), c (id=3). No partition/sort references to c.
- Build a new schema that removes c and keeps a/b with the same field IDs (we don’t touch IDs).
- Submit two updates in one commit:
  - AddSchema using table.NewAddSchemaUpdate(newSchema)
  - SetCurrentSchema using table.NewSetCurrentSchemaUpdate(newSchemaID)
  (We obtain newSchemaID by running b := table.MetadataBuilderFromBase(meta); id, _ := b.AddSchema(newSchema).)
- Include concurrency requirements (optional but recommended):
  - AssertTableUUID(meta.UUID())
  - AssertLastAssignedFieldID(oldLastID), where oldLastID is 3 in this example.
- CommitTable(...) → fails with invalid_metadata.
Minimal code sketch (v0.3.1 style):
meta := tbl.Metadata()
oldLast := highestID(tbl.Schema()) // returns 3 in the example

// Build new schema that keeps a(id=1), b(id=2) only (delete c(id=3))
newSchema := buildSchemaKeepAB(tbl.Schema()) // preserves existing IDs

// Precompute new schema-id
b := table.MetadataBuilderFromBase(meta)
newSchemaID, err := b.AddSchema(newSchema)
if err != nil {
    panic(err)
}

// Prepare updates (v0.3.1 API)
add := table.NewAddSchemaUpdate(newSchema) // no lastColumnID parameter anymore
set := table.NewSetCurrentSchemaUpdate(newSchemaID)

reqs := []table.Requirement{
    table.AssertTableUUID(meta.UUID()),
    table.AssertLastAssignedFieldID(int(oldLast)), // oldLast == 3
}

_, _, err = cat.CommitTable(ctx, tbl, reqs, []table.Update{add, set})
// => invalid_metadata when only deleting tail/highest-ID columns
Expected Behavior
- Deleting columns (including the highest-ID column) should be allowed as long as:
  - We do not change existing field IDs of the kept columns.
  - last-assigned-field-id does not decrease (i.e., remains the previous value).
- In v0.3.0, passing lastColumnID=oldLast ensured monotonicity and commits succeeded.
Actual Behavior
- With v0.3.1, NewAddSchemaUpdate cannot accept lastColumnID.
- When we only delete the max-ID column and add no new columns, the commit is rejected with invalid_metadata, apparently because the derived last-assigned-field-id regresses to the new schema’s max ID.
Analysis
- Iceberg requires last-assigned-field-id to be monotonic (never decreases).
- In the “delete-tail-columns only” scenario, the current last-assigned-field-id is the old max (e.g., 3). The new schema’s max becomes smaller (e.g., 2). If the client or server infers the counter from the new schema’s max, it violates monotonicity → invalid_metadata.
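To make the suspected regression concrete, here is a minimal, stdlib-only sketch. It does not use iceberg-go; maxFieldID and checkMonotonic are hypothetical helpers that only model the rule described above, under the assumption that the counter is inferred from the new schema's max field ID.

```go
package main

import "fmt"

// maxFieldID returns the largest ID in ids, or 0 for an empty slice.
// Hypothetical helper: models "derive the counter from the schema's max ID".
func maxFieldID(ids []int) int {
	m := 0
	for _, id := range ids {
		if id > m {
			m = id
		}
	}
	return m
}

// checkMonotonic returns an error if the proposed counter is lower than
// the previous one. Hypothetical helper: models the monotonicity rule only.
func checkMonotonic(prevLast, proposedLast int) error {
	if proposedLast < prevLast {
		return fmt.Errorf("last-assigned-field-id would decrease from %d to %d",
			prevLast, proposedLast)
	}
	return nil
}

func main() {
	prevLast := 3 // current counter for a(id=1), b(id=2), c(id=3)

	// Deleting the tail column c: the derived counter drops to 2.
	fmt.Println(checkMonotonic(prevLast, maxFieldID([]int{1, 2})))

	// Deleting the middle column b: the derived counter stays at 3.
	fmt.Println(checkMonotonic(prevLast, maxFieldID([]int{1, 3})))
}
```

Only the tail-column case yields an error in this model, which matches the observation that deleting a first or middle column works.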
Workarounds
- Add a sentinel (dummy) column in the same update with ID = oldLast + 1 (e.g., __compat_padding_...), nullable, never used. This keeps the new schema’s max ≥ old max. Or, more practically, add a real new column in the same change so max ID increases.
- (Less ideal) Maintain a fork that restores the older API (NewAddSchemaUpdate(schema, lastColumnID, initial)) or custom-craft the REST payload to set last-column-id = oldLast.
- Of course, still ensure you’re not deleting a field referenced by the partition spec or sort order (not the case in this repro).
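The sentinel-column workaround could look roughly like the sketch below. It is stdlib-only: Field is a hypothetical, simplified stand-in for a schema column (not the iceberg-go type), withPadding is a hypothetical helper, and the column name passed in main is a placeholder rather than a recommended name.

```go
package main

import "fmt"

// Field is a hypothetical, simplified stand-in for a schema column.
type Field struct {
	ID       int
	Name     string
	Required bool
}

// withPadding returns a copy of kept plus one optional (nullable) sentinel
// column whose ID is oldLast+1, so the schema's max ID does not drop below
// the previous counter.
func withPadding(kept []Field, oldLast int, name string) []Field {
	out := make([]Field, 0, len(kept)+1)
	out = append(out, kept...)
	out = append(out, Field{ID: oldLast + 1, Name: name, Required: false})
	return out
}

func main() {
	// Kept columns after deleting c(id=3); existing IDs are untouched.
	kept := []Field{
		{ID: 1, Name: "a", Required: true},
		{ID: 2, Name: "b", Required: true},
	}
	// "__padding_example" is a placeholder name chosen for this sketch.
	for _, f := range withPadding(kept, 3, "__padding_example") {
		fmt.Printf("%d %s required=%v\n", f.ID, f.Name, f.Required)
	}
}
```

The sentinel takes a fresh ID (oldLast + 1) rather than reusing the deleted column's ID, so no previously assigned field ID is recycled.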
Proposal
- API / behavior options:
  - Re-introduce a way to set lastColumnID (or an equivalent parameter) on AddSchema in the Go client; or
  - Have the client compute last-assigned-field-id as max(oldLastID, max(newSchema.FieldIDs)) so it never regresses; or
  - Provide a dedicated update or requirement to explicitly set/preserve last-assigned-field-id without requiring a dummy column.
- Docs: Clarify in v0.3.1 migration notes that callers must ensure the counter doesn’t regress when deleting the highest-ID column, and suggest recommended patterns.
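The second option, computing the counter as max(oldLastID, max(newSchema.FieldIDs)), can be sketched as follows. This is a stdlib-only illustration, not the library's implementation; nextLastColumnID is a hypothetical name, and the field IDs are passed as a plain slice.

```go
package main

import "fmt"

// nextLastColumnID returns the larger of the previous counter and the
// highest field ID in the new schema, so the result never decreases.
// Hypothetical helper for illustration only.
func nextLastColumnID(oldLast int, newSchemaIDs []int) int {
	next := oldLast
	for _, id := range newSchemaIDs {
		if id > next {
			next = id
		}
	}
	return next
}

func main() {
	fmt.Println(nextLastColumnID(3, []int{1, 2}))    // tail column deleted
	fmt.Println(nextLastColumnID(3, []int{1, 2, 4})) // tail deleted, new column added
}
```

With this rule, deleting only the tail column leaves the counter at its previous value, and adding a column advances it as before.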
Additional Context
- The same flow succeeds if we delete a middle/first column (the new schema’s max ID stays the same).
- The same flow succeeds if we add at least one new column (the new schema’s max ID increases).
Happy to provide a tiny repro program if needed. Thanks!