can we reject dtype inference for numpy object arrays #3077

Open
d-v-b opened this issue May 21, 2025 · 3 comments
Labels
enhancement New features or improvements

@d-v-b
Contributor

d-v-b commented May 21, 2025

numpy arrays with dtype "O" are ambiguous, in the sense that they could contain values that zarr should store as:

  • variable-length strings
  • variable-length arrays
  • arbitrary python objects
  • etc

Unlike the object dtype, every other numpy dtype has a simple mapping to a zarr metadata representation. For these dtypes (e.g., int8, int16, etc), a user can provide a numpy array and we can automatically pick the right zarr data type representation from that array. But for the object dtype, this is not possible. Extra information is needed to resolve a zarr data type for object dtype arrays.
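A minimal illustration of the ambiguity: two object arrays can have identical dtypes while calling for entirely different zarr representations.

```python
import numpy as np

# both arrays have dtype object ("O")...
strings = np.array(["a", "bb", "ccc"], dtype=object)
ragged = np.array([[1], [2, 3], [4, 5, 6]], dtype=object)

# ...so the dtype alone cannot tell us whether to store
# variable-length strings or variable-length arrays
print(strings.dtype == ragged.dtype)  # True
print(strings.dtype)                  # object
```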

In zarr-python 2, we used an optional object_codec keyword argument in array creation routines. If a user provided dtype=np.dtype('O') (or equivalent) without an object_codec, then zarr-python 2 would raise an error.

I don't want to use this exact pattern today, because object_codec is not well-defined, and this extra parameter, used only for numpy object dtypes, would greatly complicate dtype inference for all the other dtypes. Here is my alternative proposal: we refuse to do any dtype inference for numpy object dtypes. Instead, the user must provide an explicit zarr dtype that is compatible with the numpy object dtype.

e.g.:
create_array(..., dtype=np.dtype('O')) would raise an informative exception, guiding the user to do this instead:
create_array(..., dtype=zarr.dtypes.VariableLengthString()), or create_array(..., dtype='numpy.variable_length_string')
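A rough sketch of what the strict resolution step could look like (resolve_zarr_dtype and its error message are hypothetical, not actual zarr-python API):

```python
import numpy as np

def resolve_zarr_dtype(dtype):
    """Hypothetical sketch: strict dtype resolution at array creation time."""
    dtype = np.dtype(dtype)
    if dtype == np.dtype('O'):
        # refuse to guess: the object dtype is ambiguous
        raise TypeError(
            "zarr cannot infer a data type from the numpy object dtype. "
            "Pass an explicit zarr dtype instead, e.g. "
            "zarr.dtypes.VariableLengthString()."
        )
    # every non-object numpy dtype maps directly to a zarr representation
    return dtype

print(resolve_zarr_dtype('int8'))  # int8
```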

Thoughts on this pattern?

@rabernat
Contributor

we refuse to do any dtype inference for numpy object dtypes. Instead, the user must provide an explicit zarr dtype that is compatible with the numpy object dtype.

👍 to this.

However, we should recognize that a user may still pass O type data when writing to an array, and may receive it back after decoding.

@d-v-b
Contributor Author

d-v-b commented May 22, 2025

However, we should recognize that a user may still pass O type data when writing to an array, and may receive it back after decoding.

Yes, under my proposal, for pre-numpy-2 variable-length string data it would still be numpy O in and numpy O out once the array has been created. The main change would be zarr becoming stricter about what the dtype parameter can be at array creation time.

I have a second suggestion: we should also remove the alias "str" for "variable length string".

  • It's ambiguous because we support multiple string data types
  • a more explicit alternative exists (directly creating an instance of VariableLengthString(), or using the name of that data type)
  • it's contrary to numpy, which interprets "str" as UTF-32
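A quick illustration of the numpy behaviour referred to in the last point: numpy resolves "str" to a fixed-width, 4-bytes-per-code-point unicode dtype, not a variable-length string.

```python
import numpy as np

a = np.array(["abc"])
print(a.dtype)           # <U3: fixed-width unicode
print(a.dtype.itemsize)  # 12 bytes, i.e. 4 bytes (UTF-32) per code point
```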

@d-v-b d-v-b added this to the 3.1.0 milestone May 23, 2025
@nenb

nenb commented May 27, 2025

Here is my alternative proposal: we refuse to do any dtype inference for numpy object dtypes.

@d-v-b Just to be very clear about what you are proposing: in future releases of zarr-python, when users try to create a v3 Zarr array, they will not be allowed to pass the 'O' dtype (if they do, it will raise an informative error)? I understand that the 'O' dtype may still be present internally (e.g. if numpy < 2 is installed), but we will not allow users to pass it. Is this correct?

If I have understood this correctly, then I am also in favour.

I have a second suggestion: we should also remove the alias "str" for "variable length string".

I am (weakly) against this. I think the fact that zarr-python supports multiple string data types is a legacy of numpy. If I look at the Arrow data types, I believe they have only one string dtype (ignoring the difference between Utf8 and Large Utf8 for now; I will open another issue related to this shortly). The fact that they have only one string dtype suggests to me that, in practice, this is all that is really required.

Strings, UTF-8, etc. are notoriously confusing for folks who have no experience with them. As such, I think we should provide sensible defaults here, i.e. if a user puts in 'str', we should default to a variable-length string. If we don't default, this can become quite confusing for users (e.g. they then need to understand things like UTF-8), and it can lead to them creating dtypes that they don't actually want (see Arrow again, and what folks seem to want in practice).

Unlike the object dtype, every other numpy dtype has a simple mapping to a zarr metadata representation.

I'm not sure this is correct, e.g. the numpy void dtype, and bit-packed types like int2, int4, etc. I don't want to distract from the main message of this issue, though, so I will write this up in a separate thread shortly.
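For example, the void dtype carries raw, untyped bytes with no further element semantics, so a direct zarr representation is not obvious:

```python
import numpy as np

# two elements of 8 untyped bytes each; kind 'V' says nothing
# about how the bytes should be interpreted
v = np.zeros(2, dtype='V8')
print(v.dtype.kind, v.dtype.itemsize)  # V 8
```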
