When a DELETE, UPDATE, or MERGE command on an Iceberg table fails, Presto should always clean up the new files that the command created and uploaded to object storage, regardless of where the failure occurs. These files were never committed in a new snapshot, so they are garbage that will never be used.
If the MERGE command fails during the MergeWriterOperator execution, then the method MergeWriterOperator.abort() will clean up the new data files stored in the object storage during the command execution.
The TableFinishOperator creates the new deletion files to record deleted or updated rows and stores them in object storage. The process is the same for DELETE, UPDATE, and MERGE commands.
If the DELETE, UPDATE or MERGE commands fail during the TableFinishOperator execution, Presto does not delete the new data files stored in object storage. As a result, after a command failure, garbage remains in object storage.
Expected Behavior
Presto should delete the new files stored in the object storage during the command execution.
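The expected behavior can be illustrated with a minimal, self-contained sketch. It is not Presto code: the `WrittenFiles` tracker, the `runFailingCommand` method, and the file names are hypothetical, and a local temporary directory stands in for object storage. The idea is only that every file written before the commit is registered, and all registered files are deleted if the command fails before the snapshot is committed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

final class UncommittedFileCleanup {
    // Hypothetical tracker for files written before the commit point.
    static final class WrittenFiles {
        private final List<Path> paths = new ArrayList<>();

        void register(Path path) {
            paths.add(path);
        }

        // Best-effort removal of every registered file.
        void deleteAll() {
            for (Path path : paths) {
                try {
                    Files.deleteIfExists(path);
                } catch (IOException e) {
                    // A real implementation would log this and continue.
                }
            }
            paths.clear();
        }
    }

    // Simulates a command that writes a deletion file and then fails before commit.
    // Returns the number of files left in the data directory afterwards.
    static long runFailingCommand(Path dataDir) throws IOException {
        WrittenFiles written = new WrittenFiles();
        try {
            Path deleteFile = Files.createFile(dataDir.resolve("delete_file_example.parquet"));
            written.register(deleteFile);
            throw new RuntimeException("FORCED FAILURE DURING DELETE COMMAND EXECUTION");
        } catch (RuntimeException e) {
            // Nothing was committed, so every registered file is garbage.
            written.deleteAll();
        }
        try (Stream<Path> files = Files.list(dataDir)) {
            return files.count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dataDir = Files.createTempDirectory("mytest-data");
        Files.createFile(dataDir.resolve("committed.parquet"));
        // Only the previously committed file should remain.
        System.out.println(runFailingCommand(dataDir));
    }
}
```

In this sketch the cleanup lives in the same place that observes the failure, which is the property the issue asks for: the failure path in TableFinishOperator would need an equivalent hook.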
Current Behavior
Presto leaves files in the object storage that will never be used.
Possible Solution
Currently, the only workaround to clean up these files is to execute the iceberg.system.remove_orphan_files() procedure.
Steps to Reproduce
- Simulate an error while executing the DELETE, UPDATE or MERGE command. To do so, add a throw new RuntimeException() in the methods finishWrite() and finishDeleteWithOutput() of the com.facebook.presto.iceberg.IcebergAbstractMetadata Java class.
- Run Presto.
- Create an Iceberg table:
CREATE TABLE iceberg.default.mytest(a int, b int)
WITH (
location = 'hdfs:///user/presto/warehouse/mytest/'
);
INSERT INTO iceberg.default.mytest VALUES(1, 0), (2, 0);
- List the data files in the object storage:
$ ls /user/presto/warehouse/mytest/data/
Found 1 items
-rw-rw-rw- 3 presto hadoop hdfs:///user/presto/warehouse/mytest/data/7f3ff2fa-35b3-46b5-83fa-e836050dd08d.parquet
- Delete some rows in mytest table:
DELETE FROM iceberg.default.mytest WHERE a > 1
The command will return the error: FORCED FAILURE DURING DELETE COMMAND EXECUTION
- List the data files in the object storage:
$ ls /user/presto/warehouse/mytest/data/
Found 2 items
-rw-rw-rw- 3 presto hadoop hdfs:///user/presto/warehouse/mytest/data/7f3ff2fa-35b3-46b5-83fa-e836050dd08d.parquet
-rw-rw-rw- 3 presto hadoop hdfs:///user/presto/warehouse/mytest/data/delete_file_e09d21c4-72fe-4b84-a3dc-9a7de687074e.parquet
The delete_file_e09d21c4-72fe-4b84-a3dc-9a7de687074e.parquet file should not be there. Presto must delete this file when the DELETE command fails.
- Update some rows in mytest table:
UPDATE iceberg.default.mytest SET a = 1111 WHERE a > 1
The command will return the error: FORCED FAILURE DURING UPDATE OR MERGE COMMAND EXECUTION
- List the data files in the object storage:
$ ls /user/presto/warehouse/mytest/data/
Found 4 items
-rw-rw-rw- 3 presto hadoop hdfs:///user/presto/warehouse/mytest/data/7f3ff2fa-35b3-46b5-83fa-e836050dd08d.parquet
-rw-rw-rw- 3 presto hadoop hdfs:///user/presto/warehouse/mytest/data/delete_file_e09d21c4-72fe-4b84-a3dc-9a7de687074e.parquet
-rw-rw-rw- 3 presto hadoop hdfs:///user/presto/warehouse/mytest/data/f031d070-c05e-4296-a6ff-78c61abd750d.parquet
-rw-rw-rw- 3 presto hadoop hdfs:///user/presto/warehouse/mytest/data/delete_file_29f5dd84-22af-489d-b873-a95cbf5402a1.parquet
The delete_file_29f5dd84-22af-489d-b873-a95cbf5402a1.parquet and f031d070-c05e-4296-a6ff-78c61abd750d.parquet files should not be there. Presto must delete these files when the UPDATE command fails.
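The leftover files in the listings above can be identified by subtracting the files referenced by committed snapshots from the files present in the data directory. The sketch below is illustrative only: `findOrphans` is a hypothetical helper, the file names are copied from the listings in this report, and the committed set is assumed to contain just the file written by the initial INSERT.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class OrphanFileCheck {
    // Returns the listed files that no committed snapshot references.
    static Set<String> findOrphans(Collection<String> listed, Collection<String> committed) {
        Set<String> orphans = new TreeSet<>(listed);
        orphans.removeAll(committed);
        return orphans;
    }

    public static void main(String[] args) {
        // Files present in the data directory after the failed DELETE and UPDATE.
        List<String> listed = Arrays.asList(
                "7f3ff2fa-35b3-46b5-83fa-e836050dd08d.parquet",
                "delete_file_e09d21c4-72fe-4b84-a3dc-9a7de687074e.parquet",
                "f031d070-c05e-4296-a6ff-78c61abd750d.parquet",
                "delete_file_29f5dd84-22af-489d-b873-a95cbf5402a1.parquet");
        // Assumption: only the file from the initial INSERT was ever committed.
        List<String> committed = Arrays.asList(
                "7f3ff2fa-35b3-46b5-83fa-e836050dd08d.parquet");

        for (String orphan : findOrphans(listed, committed)) {
            System.out.println(orphan);
        }
    }
}
```

With these inputs, the three files that the report says should not exist are the ones reported as orphans.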