can we reject dtype inference for numpy object arrays #3077
Comments
👍 to this. However, we should recognize that a user may still pass |
Yes, under my proposal for pre-numpy 2 variable length string data, it would still be numpy I have a second suggestion: we should also remove the alias "str" for "variable length string".
@d-v-b Just to be very clear about what you are proposing: in future releases of If I have understood this correctly, then I am also in favour.
I am (weakly) against this. I think the fact that Strings, UTF-8 etc are notoriously confusing for folks who have no experience with them. As such, I think we should be putting sensible defaults here i.e. if a user puts in 'str', then we should default to variable length string. If we don't default, this can become quite confusing for users (e.g. they then need to understand things like UTF-8), and it can lead to them creating dtypes that they don't actually want (see Arrow again, and what folks seem to want in practice).
I'm not sure if this is correct e.g. the |
numpy arrays with dtype `"O"` are ambiguous, in the sense that they could contain values that zarr should store as:

Unlike the object dtype, every other numpy dtype has a simple mapping to a zarr metadata representation. For these dtypes (e.g., `int8`, `int16`, etc.), a user can provide a numpy array and we can automatically pick the right zarr data type representation from that array. But for the object dtype, this is not possible. Extra information is needed to resolve a zarr data type for object dtype arrays.

In zarr-python 2, we used an optional `object_codec` keyword argument to array creation routines. If a user provided `dtype=np.dtype('O')` or equivalent without an `object_codec`, then zarr-python 2 would error.

I don't want to use this exact pattern today, because `object_codec` is not really well-defined, and this extra parameter, used only for numpy object dtypes, would greatly complicate the dtype inference for all the other dtypes. Here is my alternative proposal: we refuse to do any dtype inference for numpy object dtypes. Instead, the user must provide an explicit zarr dtype that is compatible with the numpy object dtype.

e.g.: `create_array(..., dtype=np.dtype('O'))` would raise an informative exception, guiding the user to do this instead: `create_array(..., dtype=zarr.dtypes.VariableLengthString())`, or `create_array(..., dtype='numpy.variable_length_string')`
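A rough sketch of the check being proposed, to make the behaviour concrete. This is not zarr-python's implementation: `resolve_dtype` and the `VariableLengthString` class below are hypothetical stand-ins defined only for this illustration, and the error wording is an assumption.

```python
import numpy as np


class VariableLengthString:
    """Hypothetical stand-in for an explicit zarr data type (not zarr's class)."""


def resolve_dtype(dtype):
    """Resolve a dtype request, refusing to guess for numpy object dtypes.

    Hypothetical helper: explicit data types pass through unchanged, ordinary
    numpy dtypes are inferred, and the object dtype raises an error.
    """
    # An explicit, unambiguous data type needs no inference.
    if isinstance(dtype, VariableLengthString):
        return dtype

    np_dtype = np.dtype(dtype)

    # kind == "O" marks the numpy object dtype, which is ambiguous.
    if np_dtype.kind == "O":
        raise TypeError(
            "Cannot infer a zarr data type from a numpy object dtype. "
            "Pass an explicit data type instead, for example "
            "VariableLengthString()."
        )

    # Every other numpy dtype maps directly.
    return np_dtype
```

With this sketch, `resolve_dtype("int8")` returns `np.dtype("int8")`, while `resolve_dtype(np.dtype("O"))` raises a `TypeError` that points the user to an explicit data type.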
Thoughts on this pattern?