We use cookies (including Google cookies) to personalize ads and analyze traffic. By continuing to use our site, you accept our Privacy Policy.

UTF-8 Validation

Difficulty: Medium


Problem Description

Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).

A character in UTF-8 can be from 1 to 4 bytes long, subjected to the following rules:

  1. For a 1-byte character, the first bit is a 0, followed by its Unicode code.
  2. For an n-bytes character, the first n bits are all one's, the n + 1 bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.

This is how the UTF-8 encoding would work:

x denotes a bit in the binary form of a byte that may be either 0 or 1.

Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.

Key Insights

  • UTF-8 encoding can have characters of varying lengths (1 to 4 bytes).
  • The first byte of a multi-byte character indicates how many bytes are involved.
  • Continuation bytes must start with the bits 10.
  • Validating the sequence requires careful tracking of expected continuation bytes.

Space and Time Complexity

Time Complexity: O(n)
Space Complexity: O(1)

Solution

To determine if the given data is a valid UTF-8 encoding, we can iterate through the array of integers and check each byte according to the UTF-8 rules. We will maintain a count of how many continuation bytes we are expecting based on the first byte's value. If we encounter a continuation byte, we will verify if it starts with 10. If any byte does not meet the expected conditions, we will return false.


Code Solutions

Number of Bytes   |        UTF-8 Octet Sequence
                       |              (binary)
   --------------------+-----------------------------------------
        1              |   0xxxxxxx
        2              |   110xxxxx 10xxxxxx
        3              |   1110xxxx 10xxxxxx 10xxxxxx
        4              |   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
← Back to All Questions