Problem Description
Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e., it translates to a sequence of valid UTF-8 encoded characters).
A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:
- For a 1-byte character, the first bit is a 0, followed by its Unicode code.
- For an n-byte character, the first n bits are all ones, the n + 1 bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.
This is how the UTF-8 encoding works, where x denotes a bit in the binary form of a byte that may be either 0 or 1:

| Number of Bytes | UTF-8 Octet Sequence (binary) |
|---|---|
| 1 | 0xxxxxxx |
| 2 | 110xxxxx 10xxxxxx |
| 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
Note: The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data, so each integer represents only 1 byte of data.
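To illustrate the rules above, a small helper (my own sketch, not part of the problem statement) can classify a byte by counting its leading 1-bits, which is exactly what determines a character's length:

```python
def leading_ones(byte: int) -> int:
    """Count the leading 1-bits of the least significant 8 bits of byte."""
    byte &= 0xFF            # only the low 8 bits carry data
    count = 0
    mask = 0b10000000       # start from the most significant bit
    while mask and (byte & mask):
        count += 1
        mask >>= 1
    return count

# 0xxxxxxx -> 0 leading ones: a 1-byte character
# 110xxxxx -> 2 leading ones: starts a 2-byte character
# 10xxxxxx -> 1 leading one:  a continuation byte, never a first byte
```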
Key Insights
- UTF-8 encoding can have characters of varying lengths (1 to 4 bytes).
- The first byte of a multi-byte character indicates how many bytes are involved.
- Continuation bytes must start with the bits 10.
- Validating the sequence requires careful tracking of the expected continuation bytes.
Space and Time Complexity
Time Complexity: O(n), where n is the number of integers in data
Space Complexity: O(1)
Solution
To determine whether the given data is a valid UTF-8 encoding, we can iterate through the array of integers and check each byte against the UTF-8 rules. We maintain a count of how many continuation bytes we are expecting based on the first byte's value. Each expected continuation byte must start with 10. If any byte fails its check, or the array ends while continuation bytes are still expected, we return false.
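The approach above can be sketched as follows (the function name valid_utf8 is my own choice):

```python
from typing import List

def valid_utf8(data: List[int]) -> bool:
    """Return True if data is a valid sequence of UTF-8 encoded characters."""
    expected = 0  # continuation bytes still owed by the current character
    for num in data:
        byte = num & 0xFF  # only the least significant 8 bits matter
        if expected == 0:
            # Classify the first byte of a character by its leading 1-bits.
            if byte >> 7 == 0b0:        # 0xxxxxxx: 1-byte character
                continue
            elif byte >> 5 == 0b110:    # 110xxxxx: 2-byte character
                expected = 1
            elif byte >> 4 == 0b1110:   # 1110xxxx: 3-byte character
                expected = 2
            elif byte >> 3 == 0b11110:  # 11110xxx: 4-byte character
                expected = 3
            else:                       # 10xxxxxx or 11111xxx: invalid start
                return False
        else:
            # Every continuation byte must look like 10xxxxxx.
            if byte >> 6 != 0b10:
                return False
            expected -= 1
    return expected == 0  # no character may be left incomplete
```

Note the final check: a sequence that ends while continuation bytes are still expected (e.g. a lone 110xxxxx byte) is invalid, which is why we return expected == 0 rather than True.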