The arrow spec leaves the solution of duplicate field names to implementors.
The C++'s solution: ignore or raise error, the Java's solution: ignore, append, replace or raise error. Both use ignore as the default. Here is the references:
-
https://github.com/apache/arrow/blob/57376d28cf433bed95f19fa44c1e90a780ba54e8/cpp/src/arrow/type.cc
-
https://github.com/apache/arrow/blob/25c736d48dc289f457e74d15d05db65f6d539447/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractStructVector.java
I'm not expert at database or data science, but as far as I know, in the traditional RDBMS domain, it's unusual to allow duplicate field names. Further more, in the data analysis domain, perhaps it's usual to normalize/clean various kind of bad/dirty data interactively with tools like pandas?
Back to the problem, I have an example: given duplicate field names A A A B B, the user who knows actual data MAY choose to: replace first A with second A and append third A, and ignore second B. Or the duplication was just mistake?
Quote from [~nevi_me]: "I also prefer raising an error by default, as that'll make users aware very quickly". Is not acceptable if we silently append/ignore/replace duplicate fields, resulting unexpected results that user does not aware at all.
If we choose to support replace, ignore or append, at least we must let user control the exact behavior. For IPC data, perhaps custom metadata (for file, message and field) is the only choice. I suggest just record this problem here, keep raising error until it's really necessary to support other solutions.
Reporter: Qingyou Meng / @mqy
Note: This issue was originally created as ARROW-11178. Please see the migration documentation for further details.
The arrow spec leaves the solution of
duplicate field namesto implementors.The C++'s solution: ignore or raise error, the Java's solution: ignore, append, replace or raise error. Both use ignore as the default. Here is the references:
https://github.com/apache/arrow/blob/57376d28cf433bed95f19fa44c1e90a780ba54e8/cpp/src/arrow/type.cc
https://github.com/apache/arrow/blob/25c736d48dc289f457e74d15d05db65f6d539447/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractStructVector.java
I'm not expert at database or data science, but as far as I know, in the traditional RDBMS domain, it's unusual to allow duplicate field names. Further more, in the data analysis domain, perhaps it's usual to normalize/clean various kind of bad/dirty data interactively with tools like
pandas?Back to the problem, I have an example: given duplicate field names A A A B B, the user who knows actual data MAY choose to: replace first A with second A and append third A, and ignore second B. Or the duplication was just mistake?
Quote from [~nevi_me]: "I also prefer raising an error by default, as that'll make users aware very quickly". Is not acceptable if we silently append/ignore/replace duplicate fields, resulting unexpected results that user does not aware at all.
If we choose to support
replace,ignoreorappend, at least we must let user control the exact behavior. For IPC data, perhaps custom metadata (for file, message and field) is the only choice. I suggest just record this problem here, keep raising error until it's really necessary to support other solutions.Reporter: Qingyou Meng / @mqy
Note: This issue was originally created as ARROW-11178. Please see the migration documentation for further details.