Describe the bug, including details regarding any error messages, version, and platform.
Description
The utf8_is_digit kernel in pyarrow.compute does not fully replicate Python's str.isdigit() behavior, especially with certain Unicode digit characters.
For example, the character '³' (U+00B3 SUPERSCRIPT THREE) returns True with Python’s str.isdigit() but returns False when passed to pyarrow.compute.utf8_is_digit.
This divergence leads to downstream inconsistencies, particularly in pandas when using StringDtype(storage="pyarrow").
Reproduction
import pyarrow as pa
import pyarrow.compute as pc
arr = pa.array(['3', '٣', '५', '123', '³'])
print(pc.utf8_is_digit(arr).to_pylist())
Output:
[True, True, True, True, False] # <-- '³' incorrectly returns False
Expected Output (matches str.isdigit()):
[True, True, True, True, True]
Notes
- The issue seems to stem from the implementation of IsDigitUnicode::PredicateCharacterAll, which does not include characters in the Unicode "No" (Number, Other) category, such as superscript digits (³, ², etc.).
- Python's behavior can be verified as:
print("³".isdigit()) # True
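To make the category difference concrete, here is a small standard-library sketch (not part of the original report; the names `samples` and `report` are purely illustrative). It uses only Python's `unicodedata` module and does not touch pyarrow. For each sample character it shows the Unicode general category next to what `str.isdecimal()` and `str.isdigit()` report.

```python
import unicodedata

# Characters from the reproduction above: ASCII, Arabic-Indic,
# Devanagari, and the superscript three that triggers the mismatch.
samples = ['3', '٣', '५', '³']

# For each character, record its Unicode general category and the
# result of Python's own string predicates.
report = {
    ch: {
        "category": unicodedata.category(ch),
        "isdecimal": ch.isdecimal(),
        "isdigit": ch.isdigit(),
    }
    for ch in samples
}

for ch, info in report.items():
    print(f"U+{ord(ch):04X}", info)
```

The first three characters are in category "Nd" and satisfy both predicates. '³' is in category "No": it fails `isdecimal()` but passes `isdigit()`, which is the case where the kernel's result differs.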
Impact
This affects pandas string operations like .str.isdigit() when using pyarrow storage. With Python-backed string storage the result matches str.isdigit(), but with pyarrow-backed storage it returns False for characters like '³'.
System Info
Tested with:
- PyArrow 20.0.0 (pip-installed)
- PyArrow main 0.1.dev17578+g218c886
- Python 3.12
- Debian-based Linux (Ubuntu)
Component(s)
Python