Simple Unicode-aware class in C++

A quick tour of the different Unicode encoding forms, with an easy-to-extend C++ class example

If you just want the full source code click here.

Unicode basics

Unicode associates each possible character (letter, number, punctuation, emoji, etc…) with a unique number known as a codepoint.

Characters are the abstract representations of the smallest components of written language that have semantic value.

Unicode 16.0.0 / 2.2.3 Characters, Not Glyphs

The set of all codepoints is known as the codespace, and Unicode defines 1114112 codepoints in total, ranging from U+0000 to U+10FFFF.

Notationally, a codepoint is denoted by U+n, where n is four to six hexadecimal digits, omitting leading zeros, unless the code point would have fewer than four hexadecimal digits—for example, U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.

The codespace is divided into 17 sections or ranges, known as planes; each plane has 65536 codepoints. The first plane (Plane 0) is known as the BMP or Basic Multilingual Plane; it contains the English letters, and its first 128 codepoints (0 to 127) match ASCII.

The last four hexadecimal digits in each codepoint indicate a character’s position inside a plane. The remaining digits indicate the plane. For example, U+23456 is found at location 0x3456 in Plane 2. For more details see Unicode 16.0.0 / 2.9 Details of Allocation.

In this post we’re interested in encoding Unicode codepoints, which for our purposes means representing a codepoint in binary for actual use in computer systems. Let's start with a couple of observations.

Bits required

We can represent any codepoint in the codespace using 21 bits

2^{21} = 2097152

This would leave us with 983040 unused numbers.

2097152 - 1114112 = 983040

We can also encode the entire codespace using any number of bits greater than 21, but we end up with many more unused numbers (which is sometimes preferable; see the UTF-32 encoding below).

Alternatively, we can combine a 16-bit range with a 20-bit range to cover the Unicode codespace exactly (see the UTF-16 encoding below).

2^{16} + 2^{20} = 65536 + 1048576 = 1114112

Fixed vs variable width encoding

When representing a codepoint in memory we need to define how many bits will be used per character. Fixed-width means that all codepoints are encoded using the same number of bits; variable-width means that different codepoints can be encoded with different numbers of bits.

A specific encoding form like UTF-8, UTF-16 or UTF-32 defines the number of bits used for a given codepoint in terms of its code unit, which is the minimal unit of storage in that encoding form. For example, we say:

The code point U+1F92F: 🤯

  • Requires 4 code units to be encoded in UTF-8 (where code unit is 8 bits)
  • Requires 2 code units to be encoded in UTF-16 (where code unit is 16 bits)
  • Requires 1 code unit to be encoded in UTF-32 (where code unit is 32 bits)

The code point U+0041: A

  • Requires 1 code unit to be encoded in UTF-8 (where code unit is 8 bits)
  • Requires 1 code unit to be encoded in UTF-16 (where code unit is 16 bits)
  • Requires 1 code unit to be encoded in UTF-32 (where code unit is 32 bits)
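
As a quick check, here is a minimal sketch (hypothetical variable names, assuming a C++20 compiler, where u8 string literals produce char8_t) that prints the number of code units each encoding form uses for these two codepoints:

#include <cstdio>
#include <string>

int main()
{
    // U+1F92F (🤯) written with the three standard literal prefixes
    std::u8string  Mind8  = u8"\U0001F92F"; // 4 code units of 8 bits
    std::u16string Mind16 = u"\U0001F92F";  // 2 code units of 16 bits (a surrogate pair)
    std::u32string Mind32 = U"\U0001F92F";  // 1 code unit of 32 bits

    std::printf("U+1F92F -> %zu, %zu, %zu code units\n",
        Mind8.size(), Mind16.size(), Mind32.size()); // 4, 2, 1

    // U+0041 (A) fits in a single code unit in every encoding form
    std::printf("U+0041  -> %zu, %zu, %zu code units\n",
        std::u8string(u8"A").size(), std::u16string(u"A").size(), std::u32string(U"A").size()); // 1, 1, 1
}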

UTF 32

UTF-32 is a fixed-width encoding: each codepoint is directly mapped to a single 32-bit code unit.

uint32_t LetterA = 0x0041;

It can be useful when working with Unicode text whose characters need to be indexed easily or require other per-codepoint processing:

uint32_t RandomUserEmoji = UserInputEmojis[4];

See Unicode 16.0.0 / 3.9.1 UTF-32

UTF 16

UTF-16 is a variable-width encoding: a codepoint is represented using one code unit (16 bits) if the character to be encoded is in the BMP, or two code units otherwise.

Codepoints outside of the BMP are encoded using a surrogate pair, which is a pair of code units whose values are in the range [0xD800, 0xDFFF]; together, this pair is associated uniquely with a Unicode scalar value in the range [U+10000, U+10FFFF].

Figure from Wikipedia Unicode Plane article

The 6 highest bits of a code unit in a surrogate pair denote which element of the pair the code unit is; observe the following bit pattern for an illustration:

// Possible numeric ranges of first element of the surrogate pair
// First 6 bits are 0b1101'10
From 0xD800 to 0xD8FF = From 0b1101100000000000 to 0b1101100011111111
From 0xD900 to 0xD9FF = From 0b1101100100000000 to 0b1101100111111111
From 0xDA00 to 0xDAFF = From 0b1101101000000000 to 0b1101101011111111
From 0xDB00 to 0xDBFF = From 0b1101101100000000 to 0b1101101111111111

// Possible numeric ranges of second element in the surrogate pair
// First 6 bits are 0b1101'11
From 0xDC00 to 0xDCFF = From 0b1101110000000000 to 0b1101110011111111
From 0xDD00 to 0xDDFF = From 0b1101110100000000 to 0b1101110111111111
From 0xDE00 to 0xDEFF = From 0b1101111000000000 to 0b1101111011111111
From 0xDF00 to 0xDFFF = From 0b1101111100000000 to 0b1101111111111111

From the above bit pattern we can also note that there are 10 free bits per pair element that can be used to encode the codepoint, for a total of 20 free bits (which can represent 1048576 distinct values).
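
For illustration, here is a minimal sketch of how those 20 free bits are recombined into a codepoint (CodepointFromSurrogatePair is a hypothetical helper, not part of the class below); the 0x10000 offset maps the 20-bit value back into the range [U+10000, U+10FFFF]:

// Minimal sketch: recombine a surrogate pair into the codepoint it represents.
// HighSurrogate must be in [0xD800, 0xDBFF] and LowSurrogate in [0xDC00, 0xDFFF].
char32_t CodepointFromSurrogatePair(char16_t HighSurrogate, char16_t LowSurrogate)
{
    char32_t High = (HighSurrogate - 0xD800) & 0x3FF; // 10 free bits of the first element
    char32_t Low  = (LowSurrogate  - 0xDC00) & 0x3FF; // 10 free bits of the second element
    return 0x10000 + ((High << 10) | Low);            // Offset back into [U+10000, U+10FFFF]
}

For example, the pair (0xD83E, 0xDD2F) yields 0x1F92F, the 🤯 codepoint from earlier.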

See Unicode 16.0.0 / 3.9.2 UTF-16

UTF 8

UTF-8 is a variable-width encoding: the high bits of each code unit indicate which part of a code unit sequence the byte belongs to. For example:

0b0xxx'xxxx

Code units that represent an ASCII character, which is a number in the range [0, 127]

0b10xx'xxxx

For code units that are part of a code unit sequence but are not the leading (first) byte of the sequence

0b110x'xxxx

For code units that are the first byte in a code unit sequence of 2 bytes

0b1110'xxxx

For code units that are the first byte in a code unit sequence of 3 bytes

0b1111'0xxx

For code units that are the first byte in a code unit sequence of 4 bytes
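
As a worked example, U+1F92F (🤯) needs a 4-byte sequence: its 21 bits, 0b0'0001'1111'1001'0010'1111, are split as 000 + 011111 + 100100 + 101111 and placed into the x positions of the patterns above, giving the bytes 0b1111'0000, 0b1001'1111, 0b1010'0100 and 0b1010'1111, i.e. 0xF0 0x9F 0xA4 0xAF.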

See Unicode 16.0.0 / 3.9.3 UTF-8

Code

The following code aligns with the UTF-8 Everywhere manifesto, which suggests using UTF-8 internally in your program and converting only when communicating with APIs that expect other encodings.

Let's start by detecting the size of a UTF-8 code unit sequence. Note that there are faster ways to implement this, but here we're aiming for a simple and easy-to-understand approach.

enum class EUtf8SequenceSize : uint8_t
{
    One, Two, Three, Four, Invalid
};

[[nodiscard]] EUtf8SequenceSize SizeOfUtf8Sequence(const unsigned char& Utf8Char)
{
    // https://unicode.org/mail-arch/unicode-ml/y2003-m02/att-0467/01-The_Algorithm_to_Valide_an_UTF-8_String
    if (Utf8Char <= 0x7f) // 0b0111'1111
    {
        // ASCII
        return EUtf8SequenceSize::One;
    }
    else if (Utf8Char <= 0xbf) // 0b1011'1111
    {
        // Not a leading UTF8 byte, possibly something went wrong reading previous sequences
        return EUtf8SequenceSize::Invalid; 
    }
    else if (Utf8Char <= 0xdf) // 0b1101'1111
    {
        // Leading byte in a two byte sequence
        return EUtf8SequenceSize::Two;
    }
    else if (Utf8Char <= 0xef) // 0b1110'1111
    {
        // Leading byte in a three byte sequence
        return EUtf8SequenceSize::Three;
    }
    else if (Utf8Char <= 0xf7) // 0b1111'0111
    {
        // Leading byte in a four byte sequence
        return EUtf8SequenceSize::Four;
    }

    // Unicode 3.1 ruled out the five and six octet UTF-8 sequences as illegal, although
    // previous standards / specifications such as Unicode 3.0 and RFC 2279 allowed them.
    // Therefore, we need to make sure those values are not treated as valid UTF-8.
    return EUtf8SequenceSize::Invalid;
}
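
A quick sanity check of the function, using bytes from the 🤯 example above (a minimal sketch with a hypothetical TestSizeOfUtf8Sequence helper, assuming the enum and function above are in scope):

#include <cassert>

void TestSizeOfUtf8Sequence()
{
    assert(SizeOfUtf8Sequence(0x41) == EUtf8SequenceSize::One);     // 'A', plain ASCII
    assert(SizeOfUtf8Sequence(0xF0) == EUtf8SequenceSize::Four);    // Leading byte of U+1F92F
    assert(SizeOfUtf8Sequence(0x9F) == EUtf8SequenceSize::Invalid); // Continuation byte, never a leading byte
    assert(SizeOfUtf8Sequence(0xF8) == EUtf8SequenceSize::Invalid); // Would start a 5-byte sequence, illegal
}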

Now, a function to convert a UTF-8 sequence into a Unicode codepoint (encoded as UTF-32).

[[nodiscard]] char32_t NextCodepointFromUtf8Sequence(const unsigned char*& Utf8Sequence)
{
    if (*Utf8Sequence == 0)
    {
        return 0;
    }

    EUtf8SequenceSize NumOfBytes = SizeOfUtf8Sequence(*Utf8Sequence);
    if (NumOfBytes == EUtf8SequenceSize::Invalid)
    {
        return 0; // End processing
    }

    unsigned char FirstByte = *Utf8Sequence;
    if (NumOfBytes == EUtf8SequenceSize::One)
    {
        ++Utf8Sequence; // Point to the start of the next UTF8 sequence
        return FirstByte;
    }

    unsigned char SecondByte = *(++Utf8Sequence);
    if (SecondByte == 0)
    {
        return 0;
    }

    if (NumOfBytes == EUtf8SequenceSize::Two)
    {
        ++Utf8Sequence; // Point to the start of the next UTF8 sequence
        return
            ((FirstByte      & 0b0001'1111) << 6)       |
             (SecondByte     & 0b0011'1111);
    }

    unsigned char ThirdByte = *(++Utf8Sequence);
    if (ThirdByte == 0)
    {
        return 0;
    }

    if (NumOfBytes == EUtf8SequenceSize::Three)
    {
        ++Utf8Sequence; // Point to the start of the next UTF8 sequence
        return
            ((FirstByte      & 0b0000'1111) << 12)      |
            ((SecondByte     & 0b0011'1111) << 6)       |
             (ThirdByte      & 0b0011'1111);
    }

    unsigned char FourthByte = *(++Utf8Sequence);
    if (FourthByte == 0)
    {
        return 0;
    }

    ++Utf8Sequence; // Point to the start of the next UTF8 sequence
    return
        ((FirstByte         & 0b0000'0111) << 18)   |
        ((SecondByte        & 0b0011'1111) << 12)   |
        ((ThirdByte         & 0b0011'1111) << 6)    |
         (FourthByte        & 0b0011'1111);
}
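
A minimal usage sketch with a hand-written buffer holding "A🤯" (hypothetical names, assuming the function above is in scope):

// "A" followed by the 4-byte UTF-8 sequence for U+1F92F (🤯) and the null terminator
const unsigned char Buffer[] = { 0x41, 0xF0, 0x9F, 0xA4, 0xAF, 0x00 };
const unsigned char* Cursor = Buffer;

char32_t First  = NextCodepointFromUtf8Sequence(Cursor); // 0x41, Cursor now points at 0xF0
char32_t Second = NextCodepointFromUtf8Sequence(Cursor); // 0x1F92F, Cursor now points at the terminator
char32_t End    = NextCodepointFromUtf8Sequence(Cursor); // 0, end of the string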

The above functions can easily be used to create utility methods for our Utf8String class. Below is a function that calculates how many codepoints the UTF-8 string has.


int32_t Utf8String::CodePointsLen() const
{
    if (Len() == 0)
    {
        return 0;
    }

    int32_t TotalCodePoints = 0;
    const unsigned char* Utf8Str = GetRawData();
    while (NextCodepointFromUtf8Sequence(Utf8Str))
    {
        ++TotalCodePoints;
    }
    return TotalCodePoints;
}
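
For example, with the "A🤯" buffer from the earlier sketch, a Utf8String would report 2 codepoints even though it occupies 5 bytes (assuming Len() returns the byte length).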

A function that converts from UTF-8 to UTF-32:


std::u32string Utf8String::ToUtf32() const
{
    std::u32string Utf32Output;
    if (Len() == 0)
    {
        return Utf32Output;
    }

    const unsigned char* Utf8Str = GetRawData();
    while (*Utf8Str != 0)
    {
        char32_t UnicodeCodePoint = NextCodepointFromUtf8Sequence(Utf8Str);
        Utf32Output.push_back(UnicodeCodePoint);
    }

    return Utf32Output;
}

A function that converts from UTF-8 to UTF-16:

std::u16string Utf8String::ToUtf16() const
{
    std::u16string Utf16Output;
    if (Len() == 0)
    {
        return Utf16Output;
    }

    const unsigned char* Utf8Str = GetRawData();
    while (*Utf8Str != 0)
    {
        char32_t UnicodeCodePoint = NextCodepointFromUtf8Sequence(Utf8Str);
        if (UnicodeCodePoint < 0x1'0000) // 0b0001'0000'0000'0000'0000
        {
            Utf16Output.push_back(static_cast<char16_t>(UnicodeCodePoint));
        }
        else
        {
            UnicodeCodePoint -= 0x1'0000;
            char16_t HighSurrogate = static_cast<char16_t>(0xd800 + ((UnicodeCodePoint >> 10) & 0x3FF)); // 0x3FF == 0b0011'1111'1111
            char16_t LowSurrogate = static_cast<char16_t>(0xdc00 + (UnicodeCodePoint & 0x3FF));
            Utf16Output.push_back(HighSurrogate);
            Utf16Output.push_back(LowSurrogate);
        }
    }

    return Utf16Output;
}
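
Walking U+1F92F through the surrogate branch above: 0x1F92F - 0x10000 = 0xF92F; the high 10 bits (0x3E) give a HighSurrogate of 0xD800 + 0x3E = 0xD83E, and the low 10 bits (0x12F) give a LowSurrogate of 0xDC00 + 0x12F = 0xDD2F, so 🤯 is encoded in UTF-16 as the pair 0xD83E 0xDD2F.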

See full source code here.

Understanding the wchar mess

The size of wchar_t differs depending on the platform: on Windows it's 2 bytes (16 bits) and on most Unix systems it's 4 bytes (32 bits). Why?

In the beginning, 7 bits were used to encode ASCII characters (numbers from 0 to 127). Then multiple working groups tried to define rules for the use of one extra bit (8 bits in total) to represent characters outside of the ASCII range.

ISO produced ISO-8859-1, ISO-8859-2, and 13 more standards; Microsoft created Windows-1252, a.k.a. CP-1252 (and a bunch of other code pages), and referred to them as "ANSI code pages", which is technically incorrect because they never became ANSI standards. In the words of Cathy Wissink:

The term "ANSI" as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community. The source of this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft, which became ISO Standard 8859-1.

However, in adding code points to the range reserved for control codes in the ISO standard, the Windows code page 1252 and subsequent Windows code pages originally based on the ISO 8859-x series deviated from ISO. To this day, it is not uncommon to have the development community, both within and outside of Microsoft, confuse the 8859-1 code page with Windows 1252, as well as see “ANSI” or “A” used to signify Windows code page support."

Unicode WinXP, Cathy Wissink

Then people noted that if they wanted to encode all possible characters in a unified way, more than 8 bits were required. How many? 16 bits, according to ISO (the first draft of ISO/IEC 10646) and the people from Xerox and Apple (Unicode 1.0).

In 1990 two initiatives for a unified character set existed: ISO/IEC 10646 and Unicode. ISO defined an encoding form known as UCS-2, a fixed-width encoding of 16 bits. In 1991 ISO and Unicode agreed to synchronize their codepoint assignments, and UCS-2 became the standard way to encode Unicode codepoints.

Then the cool guys at Microsoft decided to support Unicode, and in the process they introduced the data type wchar_t with a size of 2 bytes; the "w" stands for wide, highlighting the fact that this is a character wider than 1 byte.

Time passed, and the smart guys at ISO and Unicode noted again that they needed more than 16 bits to encode all the characters of the world. The solution was the introduction of the surrogate pairs of UTF-16, specified in Unicode 2.0 in 1996. UTF-32 (UCS-4) also emerged, and Unix systems decided to make their wchar_t 4 bytes instead of 2 for easier Unicode support.

Credits

Written by Romualdo Villalobos

All rights reserved.