Missing documentation of UTF8PROC_DECOMPOSE, UTF8PROC_COMPOSE flags in utf8proc_decompose_char #290
Comments
Usually we call it twice, once to get the buffer size and once to do the decomposition, to be safe. I agree that in principle there should be an upper bound, but unfortunately it may depend on the Unicode version. I'm not sure what the current upper bound is, but it could be computed easily and put in the docs (with a test to make sure that it doesn't need to be changed in future versions). The problem with documenting the current upper bound, however, is that updating the Unicode version could then break binary compatibility, even if the API doesn't otherwise change. So the safest thing is to pick some reasonable upper bound for the buffer size, but always explicitly check the return value to see whether you need a bigger buffer.
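The "reasonable bound plus explicit check" approach can be sketched as follows. This is only an illustration under stated assumptions: `decompose_stub`, its toy table, `kHint`, and `append_decomposed` are hypothetical names, not part of utf8proc. The stub mimics the convention of returning the required number of code points when the buffer is too small; the real `utf8proc_decompose_char` takes additional arguments (options and a boundclass pointer) that are left out here so the sketch compiles without the library.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in, NOT the utf8proc API. It returns the number of
// code points the result needs and fills dst only when that fits in bufsize.
long decompose_stub(char32_t cp, char32_t* dst, long bufsize) {
    static const char32_t ffi[]  = {U'f', U'f', U'i'};              // toy entry for U+FB03
    static const char32_t wide[] = {U'a', U'b', U'c', U'd', U'e'};  // toy entry that exceeds the hint
    const char32_t* src = &cp;  // default: the code point maps to itself
    long n = 1;
    if (cp == 0xFB03) { src = ffi; n = 3; }
    else if (cp == 0xE000) { src = wide; n = 5; }
    if (n <= bufsize) {
        for (long i = 0; i < n; ++i) dst[i] = src[i];
    }
    return n;
}

// Assumed size hint; treat it as a fast path, never as a guarantee.
constexpr std::size_t kHint = 4;

// Appends the decomposition of cp to out, using a stack buffer when the
// hint suffices and a heap buffer of the reported size otherwise.
void append_decomposed(char32_t cp, std::vector<char32_t>& out) {
    std::array<char32_t, kHint> small{};
    const long hint = static_cast<long>(kHint);
    const long n = decompose_stub(cp, small.data(), hint);
    if (n < 0) throw std::runtime_error("decomposition failed");
    if (n <= hint) {
        out.insert(out.end(), small.begin(), small.begin() + n);
        return;
    }
    // The hint was too small: retry with exactly the size that was reported.
    std::vector<char32_t> big(static_cast<std::size_t>(n));
    const long m = decompose_stub(cp, big.data(), n);
    if (m != n) throw std::runtime_error("unexpected size on retry");
    out.insert(out.end(), big.begin(), big.end());
}
```

With this shape, the common case avoids a temporary heap allocation, and a future increase of the true upper bound only costs a retry instead of a buffer overflow.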
I think so, yes: all of the decompositions are already backwards compatible.
Hmmmm....UTF-8 should not be related at all in this transformation. It's the decomposition that, for example, converts the small ligature |
OK, I read the edited answer. If you compute the value, please update the docs or this issue. Sorry, I'm a newbie user of utf8proc, but I am happy I could use it for the task and integrate it nicely in C++, without spurious heap allocations.
I've computed the value: it's currently 4 chars. I have a PR to add a note to the documentation (while commenting that the value may increase in future versions), and a test to make sure that the hint remains current: #291
Based on the actual utf8proc_NFKC implementation, I successfully wrote an NFKC normalization C++ function that operates directly on UTF-32 code points.
This is more convenient for me to use instead of utf8proc_NFKC, since I already have the vector of char32_t codepoints, which I also need to further postprocess after the normalization. The only problem I found is that UTF8PROC_DECOMPOSE and UTF8PROC_COMPOSE are not documented as accepted flags in utf8proc_decompose_char, but one of the two is necessary to perform the desired transformation. Considering that the function has 'decompose' in the name, that is even more confusing (I got it working just by trial and error and a bit of luck).
If you wouldn't mind also clarifying a couple of other things:
What size does utf8proc_decompose_char require for the dst buffer (I guess that there exists a static max value)?
UTF8PROC_STABLE may currently be unused in the utf8proc code, correct?