[FEEDBACK] User experience with arrow strings #10139
Comments
Should we pin the issue? Makes it more visible as it gets pushed down. |
Please let me know if I should open a separate ticket for this, but it appears the conversion to PyArrow strings is a bit too greedy and is picking up other objects as well:
In [10]: import dask.dataframe as ddf
In [11]: import pandas as pd
In [12]: df = pd.DataFrame({'lists': pd.Series([['a', 'b'], ['c', 'd']])})
In [13]: df
Out[13]:
lists
0 [a, b]
1 [c, d]
In [14]: df.dtypes
Out[14]:
lists object
dtype: object
In [15]: ddf.from_pandas(df, npartitions=1).compute()
Out[15]:
lists
0 ['a', 'b']
1 ['c', 'd']
In [16]: ddf.from_pandas(df, npartitions=1).compute().dtypes
Out[16]:
lists string[pyarrow]
dtype: object
While in most cases users should probably be avoiding the object type when possible, I personally feel it's a regression to assume all objects are strings. |
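For anyone wanting to check their own data before the conversion kicks in, here is a minimal sketch that lists object-dtype columns whose values are not all strings. It uses only pandas (`pandas.api.types.infer_dtype`); the helper name `object_columns_not_strings` and the `names` column are made up for illustration and are not part of Dask or pandas.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Same shape of data as in the report above, plus a genuine string column.
df = pd.DataFrame({
    "lists": pd.Series([["a", "b"], ["c", "d"]], dtype=object),
    "names": pd.Series(["x", "y"], dtype=object),
})


def object_columns_not_strings(frame):
    """Return object-dtype columns whose values are not all strings.

    Hypothetical helper for illustration only.
    """
    return [
        col
        for col in frame.columns
        if frame[col].dtype == object
        and infer_dtype(frame[col], skipna=True) != "string"
    ]


suspect = object_columns_not_strings(df)
print(suspect)
```

Columns reported this way are the ones that would be affected if every object column were treated as a string column.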
@aiudirog Thank you for your feedback, yes, this is a known problem. Since Dask (unlike pandas) is lazy, it doesn't load and parse your data until it's time to read it. So it has no way to determine if columns marked with For people that do store complex data in a column, it would be advisable to set I hope this helps. |
We ran into a small problem with the new dask arrow strings.
|
We also saw some kind of issue in Dask-SQL related to this change. It popped up when we were trying to wrap up a few fixes and make a release. Unfortunately don't have more details since we haven't had any time to investigate yet. So we disabled it for now (dask-contrib/dask-sql#1206). Just wanted to add that data point here as well. |
@GenevieveBuckley I don't think that I am following completely. The option is only pulled if the correct Arrow version is installed, so different forms of behaviour are expected. Is there anything else that is not working as expected that we should be aware of? |
@phofl Trying to clarify: I see how different behaviors are expected depending on whether the correct arrow version is installed. Our problem was exactly about that, namely that So in a way the problem we had was not related to the direct use of arrow strings, but the current mismatch between the pip and conda dask distributions. |
Ah got it. @jrbourbeau can provide more context on the packaging requirements |
The Edit: Also noting that |
Edit: if useful, here is an example stacktrace:
|
pandas does support it, you can do:
|
@phofl Good point! I thought it wouldn't be possible for Pandas to load a Parquet file with that dtype, but I just tested it, and it works. Note: The intended limitation of I tried converting to |
Looked into this a bit more, and it definitely seems there is a bug somewhere -- either in Dask or in PyArrow. The error I got in my real application was
I can reproduce the error like this:
import pyarrow as pa
import pandas as pd
dtype = pd.ArrowDtype(pa.large_string())
other_dtype = pd.StringDtype("pyarrow")
df = pd.DataFrame({'foo': ['bar', 'baz'], 'bar': ['baz', 'quux']})
df['foo'] = df.foo.astype(dtype)
df['bar'] = df.foo.astype(other_dtype)
# Lightly modified from https://github.com/dask/dask/blob/b2f11d026d2c6f806036c050ff5dbd59d6ceb6ec/dask/dataframe/backends.py#L232-L240
def default_types_mapper(pyarrow_dtype: pa.DataType) -> object:
    # Avoid converting strings from `string[pyarrow]` to
    # `string[python]` if we have *any* `string[pyarrow]`
    if (
        pyarrow_dtype in {pa.large_string(), pa.string()}
        and pd.StringDtype("pyarrow") in df.dtypes.values
    ):
        return pd.StringDtype("pyarrow")
    return None
# Emulating https://github.com/dask/dask/blob/b2f11d026d2c6f806036c050ff5dbd59d6ceb6ec/dask/dataframe/backends.py#L243
pa.Table.from_pandas(df).to_pandas(types_mapper=default_types_mapper)
It seems that the failure depends on there being other strings in the DataFrame that are not |
This is an issue where you can report your experience with arrow strings. Please let us know what works well, but also what doesn't. We'll help as best we can.