Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Added ICU charset conversion implementation #64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Added ICU charset conversion implementation #64
Changes from 7 commits
8809cd5
10defe7
827c164
362f7e9
8b638bb
4e4cbfd
eafadfb
8abbee4
9b06bf1
340d545
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think these
err = U_ZERO_ERROR
assignments are necessary, because we check the value oferr
after every call to an ICU function and throw an exceptionif (U_FAILURE(err))
. Therefore, we know that if we get past anyif (U_FAILURE(err)) { ... }
block, we know thaterr
indicates success.Strictly speaking, it might not be equal to
U_ZERO_ERROR = 0
, because negativeerr
values are used for warnings andU_ZERO_ERROR
is only used when there is no warnings. But warnings don't represent errors, so they will not cause ICU functions to exit if they see one - only errors do, seeUErrorCode
API docs:Note that
U_FAILURE
is defined like this -icu4c/source/common/unicode/utypes.h:713-717
:Bottom line, I'd remove them because they are unnecessary (to be clear, we still need to initialize the variable as
UErrorCode err = U_ZERO_ERROR;
at the beginning, but then we don't have to worry about it anymore).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is
ucnv_getMaxCharSize(conv)
guaranteed to be sufficient here? I'm not sure, because the documentation ofucnv_getMaxCharSize()
says the following:Whereas here we use it for a target buffer for conversion to Unicode (UTF-16), which is the opposite.
The examples of returned values also illustrate the possible problem nicely:
SBCS apparently stands for single-byte character set (I've never heard this acronym before). This makes sense: if we take ISO-8859-1 as an example, then any
UChar
(UTF-16 character, also an unsigned 16-bit integer) valid in this encoding will be converted to just a single byte. However, obviously no character in this charset maps to just 1 byte in UTF-16 (in fact, none does).It seems that this mismatch in meaning wasn't caught by our tests because you're also using
UCNV_GET_MAX_BYTES_FOR_STRING
incorrectly (in a way that doesn't agree with the documentation):So the first parameter of the macro is
length
, which should be number ofUChar
s, i.e. UTF-16 code units or unsigned 16-bit integers. However, you passsrc.length()
, which is the number of bytes.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the docs for
ucnv_toUChars()
:So we should probably use this instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, after thinking about it for a while, I've convinced myself that this calculated capacity should (coincidentally) always be enough, even if the calculation is logically incorrect and should definitely be fixed.
In theory, the only scenario in which it could underallocate memory would be for a SBCS (single-byte character set) with some character outside the BMP (Basic Multilingual Plane) - then
src.length()
would be1
(assuming that a 1-character string with that one character is converted) anducnv_getMaxCharSize(conv)
would also be1
(since it is a SBCS), souniStrCapacity
would be1
and thus theuniStr
array would only have space for 1UChar
(16-bit code unit). But since all Unicode characters outside the BMP (i.e. supplementary characters, U+10000 and greater) require 2UChars
to be represented in UTF-16 (see https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF), this would lead to an error.However, such SBCS seemingly doesn't exist, because BMP pretty much covers all the common characters that any sane (real-world) SBCS would use. The most popular supplementary characters (U+10000 and above) are emojis, but of course there is no real-world SBCS with emojis, since all SBCSs predate emojis by a lot.
So underallocation isn't a problem here, but overallocation by a certain factor is quite likely. For example, if we're decoding a UTF-8 string, then
ucnv_getMaxCharSize(conv)
returns 3 - seeucnv_getMaxCharSize()
docs:So
uniStr
will be allocated with a capacity of at least3 * src.length()
, while2 * src.length()
is already guaranteed be enough. As mentioned above, each character of any encoding could be converted into a surrogate pair (a pair of two 16-bit code units, i.e. 2UChars
) in the worst case, but it doesn't get any worse.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only place where
illegal_seq_in_encoding
is thrown when running our test suite, and the only place where it actually makes sense to throw it. It makes no sense when handling potential errors fromucnv_setToUCallBack
,ucnv_setFromUCallBack
orucnv_fromUChars
- if these fail, it doesn't indicate an illegal sequence in the input string (instead, it would indicate a bug in ourbytes_to_str
implementation, as none of these should normally fail regardless of the user input).Therefore,
throw illegal_seq_in_encoding
should only remain here andthrow bytes_to_str_error
should be used everywhere else.