Describe the bug, including details regarding any error messages, version, and platform.
Code example
Issue description
As you can see, the dtype of the data buffer is the same as the dtype of the column. This is only correct for integers and floats. Categoricals, strings, and datetime types have some integer type as their physical representation. The data buffer should have this physical data type associated with it.
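To make the distinction concrete, here is a minimal sketch of the mapping from a column dtype to the dtype the data buffer would be expected to report. The helper `physical_data_buffer_dtype` is hypothetical and not part of any library. The dtype tuple layout (kind, bit width, Arrow C format string, endianness) and the `DtypeKind` values follow the interchange protocol; the choice of signed integers for datetimes and categorical codes, and of `uint8` bytes for strings, is an assumption made for illustration.

```python
from enum import IntEnum
from typing import Tuple


class DtypeKind(IntEnum):
    """Dtype kinds with the values used by the dataframe interchange protocol."""

    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23


# (kind, bit width, Arrow C format string, endianness)
Dtype = Tuple[DtypeKind, int, str, str]

# Arrow C format strings for signed integers, keyed by bit width.
_INT_FORMATS = {8: "c", 16: "s", 32: "i", 64: "l"}


def physical_data_buffer_dtype(column_dtype: Dtype) -> Dtype:
    """Hypothetical helper: dtype the data buffer should report for a column."""
    kind, bit_width, _fmt, endianness = column_dtype
    if kind in (DtypeKind.INT, DtypeKind.UINT, DtypeKind.FLOAT):
        # Logical and physical representation coincide.
        return column_dtype
    if kind in (DtypeKind.DATETIME, DtypeKind.CATEGORICAL):
        # Assumption: stored as signed integers of the column's bit width.
        return (DtypeKind.INT, bit_width, _INT_FORMATS[bit_width], endianness)
    if kind == DtypeKind.STRING:
        # Assumption: the data buffer holds raw UTF-8 bytes.
        return (DtypeKind.UINT, 8, "C", endianness)
    # Other kinds (e.g. BOOL) are outside the scope of this sketch.
    raise NotImplementedError(f"kind {kind!r} is not covered by this sketch")


# A microsecond timestamp column is physically a buffer of 64-bit integers.
print(physical_data_buffer_dtype((DtypeKind.DATETIME, 64, "tsu:", "=")))
```

With this split, the column dtype keeps the logical meaning (timestamp, string, categorical) while the buffer dtype only describes the bytes.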
The dtype of the Column object should provide information on how to interpret the various buffers. The dtype associated with each buffer should be the dtype of the actual data in that buffer. This is the second part of the issue: the implementation of from_dataframe is incorrect - it should use the column dtype rather than the data buffer dtype.
Fix
Fixing the get_buffers implementation should be relatively simple. However, this will break any from_dataframe implementation (also from other libraries) that relies on the data buffer having the column dtype.
So fixing this should ideally go in three steps:
Fix the from_dataframe implementation to use the column dtype rather than the data buffer dtype to interpret the buffers.
[…] from_dataframe implementation. See "BUG: Interchange object data buffer has the wrong dtype / from_dataframe incorrect" (pandas-dev/pandas#54781) for the pandas issue.
Tagging @AlenkaF as I know you've been working on the protocol for pyarrow.
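The first step can be illustrated with a small consumer-side sketch. It is not the pyarrow or pandas implementation: `interpret_data_buffer` is a hypothetical helper that only handles 64-bit buffers, and it assumes native byte order and a timezone-naive microsecond timestamp (Arrow format string `tsu:`). The point is that the decision on how to convert the values is taken from the column dtype, so it keeps working once the data buffer reports a plain integer dtype.

```python
import struct
from datetime import datetime, timedelta

# DtypeKind values from the interchange protocol.
INT_KIND = 0
DATETIME_KIND = 22


def interpret_data_buffer(raw: bytes, column_dtype: tuple) -> list:
    """Hypothetical helper: decode a 64-bit data buffer using the column dtype."""
    kind, bit_width, fmt, _endianness = column_dtype
    if bit_width != 64:
        raise NotImplementedError("this sketch only handles 64-bit values")

    # Physically, the buffer is a run of native-endian signed 64-bit integers.
    values = [v for (v,) in struct.iter_unpack("=q", raw)]

    if kind == INT_KIND:
        return values
    if kind == DATETIME_KIND and fmt == "tsu:":
        # The column dtype, not the buffer dtype, says these are microseconds
        # since the Unix epoch.
        epoch = datetime(1970, 1, 1)
        return [epoch + timedelta(microseconds=v) for v in values]
    raise NotImplementedError(f"dtype {column_dtype!r} is not covered by this sketch")


# The same bytes, read through two different column dtypes.
raw = struct.pack("=2q", 0, 1_000_000)
print(interpret_data_buffer(raw, (INT_KIND, 64, "l", "=")))
print(interpret_data_buffer(raw, (DATETIME_KIND, 64, "tsu:", "=")))
```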
Component(s)
Python