Missing documentation of `UTF8PROC_DECOMPOSE`, `UTF8PROC_COMPOSE` flags in `utf8proc_decompose_char` #290

ceztko · 2025-03-16T22:03:39Z

Based on the actual utf8proc_NFKC implementation, I successfully wrote an NFKC normalization function in C++ that operates directly on UTF-32 code points:

bool tryNormalizeNFKC(const vector<char32_t>& codePoints, vector<char32_t>& normalized)
{
    normalized.clear();
    normalized.reserve(codePoints.size());

    char32_t buff[8];
    utf8proc_ssize_t rc;
    int lastBoundClass = 0; // boundary state expected by utf8proc_decompose_char
    for (size_t i = 0; i < codePoints.size(); i++)
    {
        // NOTE: UTF8PROC_DECOMPOSE is undocumented for utf8proc_decompose_char but it's necessary
        rc = utf8proc_decompose_char(codePoints[i], (utf8proc_int32_t*)buff, std::size(buff),
            (utf8proc_option_t)(UTF8PROC_DECOMPOSE | UTF8PROC_COMPAT), &lastBoundClass);
        // rc < 0 is an error; rc > buffer size means the buffer was too small
        if (rc < 0 || (size_t)rc > std::size(buff))
            goto Fail;

        normalized.insert(normalized.end(), buff, buff + rc);
    }

    rc = utf8proc_normalize_utf32((utf8proc_int32_t*)normalized.data(),
        (utf8proc_ssize_t)normalized.size(), (utf8proc_option_t)(UTF8PROC_COMPOSE | UTF8PROC_STABLE));

    if (rc < 0)
        goto Fail;

    normalized.resize((size_t)rc);
    return true;

Fail:
    normalized.clear();
    return false;
}

This is more convenient for me than utf8proc_NFKC, since I already have the vector of char32_t code points, which I also need to post-process after normalization. The only problem I found is that neither UTF8PROC_DECOMPOSE nor UTF8PROC_COMPOSE is documented as an accepted flag for utf8proc_decompose_char, yet one of the two is necessary to perform the desired transformation. Since the function has 'decompose' in its name, this is even more confusing (I only got it working through trial and error and a bit of luck).

If you don't mind, could you also clarify a couple of other things:

- What's the maximum size I need for the dst buffer of utf8proc_decompose_char (I guess that a static maximum exists)?
- I noticed UTF8PROC_STABLE may currently be unused in the utf8proc code, correct?


stevengj · 2025-03-19T16:20:52Z

> What's the maximum size I need for the dst buffer of utf8proc_decompose_char

~~4 bytes. This should really be documented explicitly, but it's intrinsic to the UTF-8 encoding.~~ Sorry, I was thinking of encoding.

Usually we call it twice to be safe: once to get the buffer size and once to do the decomposition. I agree that in principle there should be an upper bound, but unfortunately it may depend on the Unicode version. I'm not sure what the current upper bound is, but it could easily be computed and put in the docs (with a test to make sure that it doesn't need to be changed in future versions).

The problem with documenting the current upper bound, however, is that updating the Unicode version may then potentially break binary compatibility, even if the API doesn't otherwise change.

So the safest thing is to set some reasonable upper bound on the buffer size, but always explicitly check for an error return to see if you need a bigger buffer.
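A minimal sketch of that pattern, assuming a callable `decompose(cp, dst, bufsize)` that follows the same return convention as utf8proc_decompose_char (negative on error, otherwise the number of code points needed, which can exceed `bufsize`). The helper name `appendDecomposed` and the stack size of 8 are hypothetical choices for illustration, not part of utf8proc:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of utf8proc). `decompose(cp, dst, bufsize)` is
// assumed to return a negative value on error, otherwise the number of code
// points needed, which may be larger than `bufsize` if the buffer was too small.
template <typename DecomposeFn>
bool appendDecomposed(std::int32_t cp, DecomposeFn&& decompose,
                      std::vector<std::int32_t>& out)
{
    constexpr std::ptrdiff_t kStackSize = 8; // a guess, not a guaranteed bound
    std::int32_t stackBuf[kStackSize];

    const std::ptrdiff_t rc = decompose(cp, stackBuf, kStackSize);
    if (rc < 0)
        return false; // genuine error reported by the callable
    if (rc <= kStackSize) {
        // Common case: the result fit in the stack buffer.
        out.insert(out.end(), stackBuf, stackBuf + rc);
        return true;
    }

    // The stack buffer was too small: rc is the required size, so retry once
    // with a heap buffer of exactly that size.
    assert(rc > kStackSize);
    std::vector<std::int32_t> heapBuf(static_cast<std::size_t>(rc));
    const std::ptrdiff_t rc2 = decompose(cp, heapBuf.data(), rc);
    if (rc2 < 0 || rc2 > rc)
        return false;
    out.insert(out.end(), heapBuf.begin(), heapBuf.begin() + rc2);
    return true;
}
```

With utf8proc, the callable could be a lambda that forwards to `utf8proc_decompose_char(cp, dst, n, options, &lastBoundClass)`. The heap allocation only happens for code points whose decomposition exceeds the stack buffer, so the usual path stays allocation-free.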

> I noticed UTF8PROC_STABLE may currently be unused in the utf8proc code, correct?

I think so, yes — all of the decompositions are already backwards compatible.

ceztko · 2025-03-19T16:35:25Z

> 4 bytes. This should really be documented explicitly, but it's intrinsic to the UTF-8 encoding.

Hmm... UTF-8 should not be involved at all in this transformation. It's the decomposition that, for example, converts the small ligature fl (single code point \ufb02) into the two code points f and l. This is intrinsic to Unicode, not UTF-8, and it depends on the scripts of actual natural languages. Do you happen to remember the maximum number of code points that a single code point can decompose to?

ceztko · 2025-03-19T16:45:04Z

OK, I read the edited answer. If you compute the value, please update the docs or this issue. Sorry, I'm a new user of utf8proc, but I'm happy I could use it for this task and integrate it nicely in C++ without spurious heap allocations.

stevengj · 2025-03-19T16:50:13Z

I've computed the value: it's currently 4 chars. I have a PR to add a note to the documentation (while commenting that the value may increase in future versions), and a test to make sure that the hint remains current: #291
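For readers who want to check the bound against their own build, a rough sketch of such a scan follows. It is not the code from the PR: `maxDecompositionLength` is a hypothetical helper, and `decompose(cp, dst, bufsize)` is assumed to return the required number of code points even when `bufsize` is too small, as utf8proc_decompose_char is described as doing above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of utf8proc): scans every Unicode scalar value
// and returns the longest decomposition length reported by `decompose`.
template <typename DecomposeFn>
std::ptrdiff_t maxDecompositionLength(DecomposeFn&& decompose)
{
    std::int32_t scratch[1]; // deliberately tiny: only the return value matters
    std::ptrdiff_t longest = 0;
    for (std::int32_t cp = 0; cp <= 0x10FFFF; ++cp) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            continue; // surrogates are not valid scalar values
        const std::ptrdiff_t rc = decompose(cp, scratch, 1);
        if (rc > longest)
            longest = rc; // negative (error) returns never exceed `longest`
    }
    assert(longest >= 0);
    return longest;
}
```

With utf8proc, the callable would forward to `utf8proc_decompose_char` with the same options used for the real normalization, since the result may differ between canonical and compatibility decomposition. Comparing the returned value with a hard-coded buffer size in a unit test would flag a Unicode update that makes the buffer too small.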

stevengj mentioned this issue Mar 19, 2025

check max size of utf8proc_decompose_char buffer #291

Open
