Class UnicodeUtil


  • public final class UnicodeUtil
    extends Object
    Utility class for encoding Java's UTF-16 char[] into a UTF-8 byte[] without always allocating a new byte[], as String.getBytes("UTF-8") does.
    NOTE: This API is for internal purposes only and might change in incompatible ways in the next release.
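    The buffer-reuse idea described above can be sketched with plain JDK code. This is a minimal illustration, not the Lucene implementation: the class ReusableUtf8Encoder and its fields bytes and length are hypothetical names, and replacing unpaired surrogates with U+FFFD is an assumption made for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch: encodes UTF-16 into a byte[] that is reused across calls. */
final class ReusableUtf8Encoder {
    /** Reused between calls; replaced only when it is too small. */
    byte[] bytes = new byte[16];
    /** Number of valid bytes in {@code bytes} after the last call. */
    int length;

    void encode(char[] source, int offset, int len) {
        // Worst case is 3 bytes per UTF-16 code unit (a surrogate pair is 4 bytes for 2 units).
        int max = len * 3;
        if (bytes.length < max) {
            bytes = new byte[max];
        }
        int upto = 0;
        final int end = offset + len;
        for (int i = offset; i < end; i++) {
            final char c = source[i];
            if (c < 0x80) {
                bytes[upto++] = (byte) c;
            } else if (c < 0x800) {
                bytes[upto++] = (byte) (0xC0 | (c >> 6));
                bytes[upto++] = (byte) (0x80 | (c & 0x3F));
            } else if (Character.isHighSurrogate(c) && i + 1 < end
                    && Character.isLowSurrogate(source[i + 1])) {
                // Valid surrogate pair: combine into one code point, emit 4 bytes.
                final int cp = Character.toCodePoint(c, source[++i]);
                bytes[upto++] = (byte) (0xF0 | (cp >> 18));
                bytes[upto++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                bytes[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                bytes[upto++] = (byte) (0x80 | (cp & 0x3F));
            } else if (Character.isSurrogate(c)) {
                // Assumption of this sketch: unpaired surrogate becomes U+FFFD (EF BF BD).
                bytes[upto++] = (byte) 0xEF;
                bytes[upto++] = (byte) 0xBF;
                bytes[upto++] = (byte) 0xBD;
            } else {
                bytes[upto++] = (byte) (0xE0 | (c >> 12));
                bytes[upto++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                bytes[upto++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        length = upto;
    }

    public static void main(String[] args) {
        ReusableUtf8Encoder enc = new ReusableUtf8Encoder();
        String text = "h\u00E9llo";
        enc.encode(text.toCharArray(), 0, text.length());
        byte[] viaJdk = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(Arrays.copyOf(enc.bytes, enc.length), viaJdk)); // true
    }
}
```

    For well-formed input the sketch produces the same bytes as String.getBytes, but a second call with a shorter input writes into the same array instead of allocating a new one.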
    • Method Detail

      • UTF16toUTF8WithHash

        public static int UTF16toUTF8WithHash​(char[] source,
                                              int offset,
                                              int length,
                                              BytesRef result)
        Encode characters from a char[] source, starting at offset for length chars. Returns a hash of the resulting bytes. After encoding, result.offset will always be 0.
      • UTF16toUTF8

        public static void UTF16toUTF8​(char[] source,
                                       int offset,
                                       UnicodeUtil.UTF8Result result)
        Encode characters from a char[] source, starting at offset and stopping when the character 0xffff is seen. The encoded bytes are written to result.
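        The sentinel-terminated convention can be illustrated with a short JDK-only sketch. The helper lengthUntilSentinel and the class SentinelScan are hypothetical names; the sketch assumes the 0xffff terminator itself is not encoded, which the description implies but does not state outright.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of scanning a char[] up to a 0xffff terminator. */
final class SentinelScan {
    /**
     * Returns the number of chars from offset up to, but not including, the first 0xffff.
     * Throws ArrayIndexOutOfBoundsException if no terminator follows offset.
     */
    static int lengthUntilSentinel(char[] source, int offset) {
        int i = offset;
        while (source[i] != 0xffff) {
            i++;
        }
        return i - offset;
    }

    public static void main(String[] args) {
        char[] buf = {'a', 'b', 'c', (char) 0xffff, 'x'};
        int len = lengthUntilSentinel(buf, 1);
        // Encode only the chars before the terminator ("bc").
        byte[] utf8 = new String(buf, 1, len).getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 2
    }
}
```

        The caller is responsible for making sure a terminator is present; without one, the scan runs off the end of the array.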
      • UTF16toUTF8

        public static void UTF16toUTF8​(char[] source,
                                       int offset,
                                       int length,
                                       UnicodeUtil.UTF8Result result)
        Encode characters from a char[] source, starting at offset for length chars. The encoded bytes are written to result.
      • UTF16toUTF8

        public static void UTF16toUTF8​(String s,
                                       int offset,
                                       int length,
                                       UnicodeUtil.UTF8Result result)
        Encode characters from this String, starting at offset for length characters. The encoded bytes are written to result.
      • UTF16toUTF8

        public static void UTF16toUTF8​(CharSequence s,
                                       int offset,
                                       int length,
                                       BytesRef result)
        Encode characters from this CharSequence, starting at offset for length characters. After encoding, result.offset will always be 0.
      • UTF16toUTF8

        public static void UTF16toUTF8​(char[] source,
                                       int offset,
                                       int length,
                                       BytesRef result)
        Encode characters from a char[] source, starting at offset for length chars. After encoding, result.offset will always be 0.
      • UTF8toUTF16

        public static void UTF8toUTF16​(byte[] utf8,
                                       int offset,
                                       int length,
                                       UnicodeUtil.UTF16Result result)
        Convert UTF-8 bytes into UTF-16 characters. If offset is non-zero, conversion starts at that position in utf8, reusing the results of the previous call for everything before offset.
      • newString

        public static String newString​(int[] codePoints,
                                       int offset,
                                       int count)
        Cover for the JDK 1.5 API. Creates a String from an array of code points.
        Parameters:
        codePoints - The code point array
        offset - The start of the text in the code point array
        count - The number of code points
        Returns:
        a String representing the count code points starting at offset
        Throws:
        IllegalArgumentException - If an invalid code point is encountered
        IndexOutOfBoundsException - If offset or count is out of bounds.
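        As a point of comparison, the JDK constructor String(int[] codePoints, int offset, int count) has the same parameter list and the same documented exceptions. The sketch below uses only that JDK constructor and assumes newString behaves the same way; CodePointStringDemo and fromCodePoints are hypothetical names.

```java
/** Hypothetical sketch using the JDK's own code point constructor. */
final class CodePointStringDemo {
    /** Builds a String from count code points starting at offset. */
    static String fromCodePoints(int[] codePoints, int offset, int count) {
        return new String(codePoints, offset, count);
    }

    public static void main(String[] args) {
        int[] cps = {0x48, 0x1F600, 0x69, 0x21};
        // Takes two code points: U+1F600 (a surrogate pair in UTF-16) and 'i'.
        String s = fromCodePoints(cps, 1, 2);
        System.out.println(s.length());                      // 3
        System.out.println(s.codePointCount(0, s.length())); // 2
    }
}
```

        Note that count is a number of code points, not an end index, and that the resulting String can be longer than count chars when supplementary code points are present.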
      • UTF8toUTF16

        public static void UTF8toUTF16​(byte[] utf8,
                                       int offset,
                                       int length,
                                       CharsRef chars)
        Interprets the given byte array as UTF-8 and converts it to UTF-16. The CharsRef is grown if it does not have enough space for the worst case, in which every byte becomes one UTF-16 code unit.

        NOTE: Full characters are read, even if this reads past the length passed (which can result in an ArrayIndexOutOfBoundsException if invalid UTF-8 is passed). Explicit checks for valid UTF-8 are not performed.
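        The worst-case sizing and the lack of validation can be seen in a minimal decoder sketch. This is not the Lucene code: Utf8DecodeSketch and decode are hypothetical names, and a plain char[] stands in for CharsRef. Because every UTF-8 sequence yields no more chars than it has bytes, an output array of length chars is always large enough.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of an unchecked UTF-8 to UTF-16 decoder. */
final class Utf8DecodeSketch {
    /**
     * Decodes utf8[offset, offset + length) into out and returns the number of chars written.
     * Assumes well-formed UTF-8; a truncated sequence is read past the given length.
     */
    static int decode(byte[] utf8, int offset, int length, char[] out) {
        int upto = 0;
        int i = offset;
        final int end = offset + length;
        while (i < end) {
            final int b = utf8[i++] & 0xFF;
            if (b < 0x80) {
                out[upto++] = (char) b;                       // 1 byte  -> 1 char
            } else if (b < 0xE0) {
                out[upto++] = (char) (((b & 0x1F) << 6)       // 2 bytes -> 1 char
                        | (utf8[i++] & 0x3F));
            } else if (b < 0xF0) {
                out[upto++] = (char) (((b & 0x0F) << 12)      // 3 bytes -> 1 char
                        | ((utf8[i] & 0x3F) << 6)
                        | (utf8[i + 1] & 0x3F));
                i += 2;
            } else {
                final int cp = ((b & 0x07) << 18)             // 4 bytes -> 2 chars
                        | ((utf8[i] & 0x3F) << 12)
                        | ((utf8[i + 1] & 0x3F) << 6)
                        | (utf8[i + 2] & 0x3F);
                i += 3;
                upto += Character.toChars(cp, out, upto);
            }
        }
        return upto;
    }

    public static void main(String[] args) {
        byte[] utf8 = "a\u00E9\u20AC".getBytes(StandardCharsets.UTF_8);
        char[] out = new char[utf8.length]; // worst case: one char per byte
        int n = decode(utf8, 0, utf8.length, out);
        System.out.println(new String(out, 0, n).equals("a\u00E9\u20AC")); // true
    }
}
```

        Passing a sequence that is cut off at the end of the array makes the sketch index past the array and throw ArrayIndexOutOfBoundsException, which is the failure mode the note above warns about.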

      • validUTF16String

        public static boolean validUTF16String​(CharSequence s)