can we reject dtype inference for numpy object arrays #3077

Open
d-v-b opened this issue May 21, 2025 · 3 comments
Labels
enhancement New features or improvements

@d-v-b
Contributor

d-v-b commented May 21, 2025

numpy arrays with dtype "O" are ambiguous, in the sense that they could contain values that zarr should store as:

  • variable-length strings
  • variable-length arrays
  • arbitrary python objects
  • etc

Unlike the object dtype, every other numpy dtype has a simple mapping to a zarr metadata representation. For these dtypes (e.g., int8, int16, etc), a user can provide a numpy array and we can automatically pick the right zarr data type representation from that array. But for the object dtype, this is not possible. Extra information is needed to resolve a zarr data type for object dtype arrays.
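A minimal illustration of the ambiguity: two object arrays can have identical dtypes while calling for entirely different zarr representations.

```python
import numpy as np

# both arrays have dtype object ("O")...
strings = np.array(["a", "bb", "ccc"], dtype=object)
ragged = np.array([[1], [2, 3], [4, 5, 6]], dtype=object)

# ...so the dtype alone cannot tell us whether to store
# variable-length strings or variable-length arrays
print(strings.dtype == ragged.dtype)  # True
print(strings.dtype)                  # object
```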

In zarr-python 2, we used an optional object_codec keyword argument in array creation routines. If a user provided dtype=np.dtype('O') (or equivalent) without an object_codec, then zarr-python 2 would raise an error.

I don't want to use this exact pattern today, because object_codec is not well-defined, and this extra parameter, used only for numpy object dtypes, would greatly complicate dtype inference for all the other dtypes. Here is my alternative proposal: we refuse to do any dtype inference for numpy object dtypes. Instead, the user must provide an explicit zarr dtype that is compatible with the numpy object dtype.

e.g.:
create_array(..., dtype=np.dtype('O')) would raise an informative exception, guiding the user to do this instead:
create_array(..., dtype=zarr.dtypes.VariableLengthString()), or create_array(..., dtype='numpy.variable_length_string')
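A rough sketch of what the strict resolution step could look like (resolve_zarr_dtype and its error message are hypothetical, not actual zarr-python API):

```python
import numpy as np

def resolve_zarr_dtype(dtype):
    """Hypothetical sketch: strict dtype resolution at array creation time."""
    dtype = np.dtype(dtype)
    if dtype == np.dtype('O'):
        # refuse to guess: the object dtype is ambiguous
        raise TypeError(
            "zarr cannot infer a data type from the numpy object dtype. "
            "Pass an explicit zarr dtype instead, e.g. "
            "zarr.dtypes.VariableLengthString()."
        )
    # every non-object numpy dtype maps directly to a zarr representation
    return dtype

print(resolve_zarr_dtype('int8'))  # int8
```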

Thoughts on this pattern?

@rabernat
Contributor

we refuse to do any dtype inference for numpy object dtypes. Instead, the user must provide an explicit zarr dtype that is compatible with the numpy object dtype.

👍 to this.

However, we should recognize that a user may still pass O type data when writing to an array, and may receive it back after decoding.

@d-v-b
Contributor Author

d-v-b commented May 22, 2025

However, we should recognize that a user may still pass O type data when writing to an array, and may receive it back after decoding.

Yes, under my proposal, for pre-numpy-2 variable-length string data it would still be numpy O in and numpy O out once the array has been created. The main change would be zarr becoming stricter about what the dtype parameter can be at array creation time.

I have a second suggestion: we should also remove the alias "str" for "variable length string".

  • It's ambiguous because we support multiple string data types
  • a more explicit alternative exists (directly creating an instance of VariableLengthString(), or using the name of that data type)
  • it's contrary to numpy, which interprets "str" as UTF-32
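A quick illustration of the numpy behaviour referred to in the last point: numpy resolves "str" to a fixed-width, 4-bytes-per-code-point unicode dtype, not a variable-length string.

```python
import numpy as np

a = np.array(["abc"])
print(a.dtype)           # <U3: fixed-width unicode
print(a.dtype.itemsize)  # 12 bytes, i.e. 4 bytes (UTF-32) per code point
```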

@d-v-b d-v-b added this to the 3.1.0 milestone May 23, 2025
@nenb

nenb commented May 27, 2025

Here is my alternative proposal: we refuse to do any dtype inference for numpy object dtypes.

@d-v-b Just to be very clear about what you are proposing: in future releases of zarr-python, when users try to create a v3 Zarr array, they will not be allowed to pass the 'O' dtype (if they do, it will raise an informative error)? I understand that the 'O' dtype may still be present internally (e.g. if numpy < 2 is installed), but we will not allow users to pass it. Is this correct?

If I have understood this correctly, then I am also in favour.

I have a second suggestion: we should also remove the alias "str" for "variable length string".

I am (weakly) against this. I think the fact that zarr-python supports multiple string data types is a legacy of numpy. If I look at the Arrow data types, I believe they have only one string dtype (ignoring the difference between Utf8 and Large Utf8 for now; I will open another issue related to this shortly). The fact that they have only one string dtype suggests to me that, in practice, this is all that is really required.

Strings, UTF-8, etc. are notoriously confusing for folks who have no experience with them. As such, I think we should provide sensible defaults here, i.e. if a user puts in 'str', we should default to a variable-length string. If we don't default, this can become quite confusing for users (e.g. they then need to understand things like UTF-8), and it can lead to them creating dtypes that they don't actually want (see Arrow again, and what folks seem to want in practice).

Unlike the object dtype, every other numpy dtype has a simple mapping to a zarr metadata representation.

I'm not sure this is correct, e.g. the numpy void dtype, and bit-packed types like int2, int4, etc. I don't want to distract from the main message of this issue, though, so I will write this up in a separate thread shortly.
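For example, the void dtype carries raw, untyped bytes with no further element semantics, so a direct zarr representation is not obvious:

```python
import numpy as np

# two elements of 8 untyped bytes each; kind 'V' says nothing
# about how the bytes should be interpreted
v = np.zeros(2, dtype='V8')
print(v.dtype.kind, v.dtype.itemsize)  # V 8
```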
