Package org.apache.lucene.util
Class UnicodeUtil
java.lang.Object
    org.apache.lucene.util.UnicodeUtil
public final class UnicodeUtil extends Object
Encodes Java's UTF-16 char[] into UTF-8 byte[] without always allocating a new byte[], as String.getBytes("UTF-8") does.
NOTE: This API is for internal purposes only and might change in incompatible ways in the next release.
-
-
Nested Class Summary
Modifier and Type | Class | Description
static class | UnicodeUtil.UTF16Result | Holds decoded UTF16 code units.
static class | UnicodeUtil.UTF8Result | Holds decoded UTF8 code units.
-
Field Summary
Modifier and Type | Field | Description
static int | UNI_REPLACEMENT_CHAR |
static int | UNI_SUR_HIGH_END |
static int | UNI_SUR_HIGH_START |
static int | UNI_SUR_LOW_END |
static int | UNI_SUR_LOW_START |
-
Method Summary
Modifier and Type | Method | Description
static String | newString(int[] codePoints, int offset, int count) | Cover JDK 1.5 API.
static void | UTF16toUTF8(char[] source, int offset, int length, BytesRef result) | Encode characters from a char[] source, starting at offset for length chars.
static void | UTF16toUTF8(char[] source, int offset, int length, UnicodeUtil.UTF8Result result) | Encode characters from a char[] source, starting at offset for length chars.
static void | UTF16toUTF8(char[] source, int offset, UnicodeUtil.UTF8Result result) | Encode characters from a char[] source, starting at offset and stopping when the character 0xffff is seen.
static void | UTF16toUTF8(CharSequence s, int offset, int length, BytesRef result) | Encode characters from this CharSequence, starting at offset for length characters.
static void | UTF16toUTF8(String s, int offset, int length, UnicodeUtil.UTF8Result result) | Encode characters from this String, starting at offset for length characters.
static int | UTF16toUTF8WithHash(char[] source, int offset, int length, BytesRef result) | Encode characters from a char[] source, starting at offset for length chars.
static void | UTF8toUTF16(byte[] utf8, int offset, int length, CharsRef chars) | Interprets the given byte array as UTF-8 and converts to UTF-16.
static void | UTF8toUTF16(byte[] utf8, int offset, int length, UnicodeUtil.UTF16Result result) | Convert UTF8 bytes into UTF16 characters.
static void | UTF8toUTF16(BytesRef bytesRef, CharsRef chars) | Utility method for UTF8toUTF16(byte[], int, int, CharsRef)
static boolean | validUTF16String(CharSequence s) |
-
-
-
Field Detail
-
UNI_SUR_HIGH_START
public static final int UNI_SUR_HIGH_START
- See Also:
- Constant Field Values
-
UNI_SUR_HIGH_END
public static final int UNI_SUR_HIGH_END
- See Also:
- Constant Field Values
-
UNI_SUR_LOW_START
public static final int UNI_SUR_LOW_START
- See Also:
- Constant Field Values
-
UNI_SUR_LOW_END
public static final int UNI_SUR_LOW_END
- See Also:
- Constant Field Values
-
UNI_REPLACEMENT_CHAR
public static final int UNI_REPLACEMENT_CHAR
- See Also:
- Constant Field Values
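The names of these constants suggest the Unicode surrogate ranges and the replacement character. This page does not list their values, so the sketch below assumes the standard Unicode values (0xD800 to 0xDBFF for high surrogates, 0xDC00 to 0xDFFF for low surrogates, 0xFFFD for the replacement character). The class and method names are hypothetical and not part of UnicodeUtil.

```java
public class SurrogateRanges {
    // Assumed values: standard Unicode surrogate ranges and U+FFFD.
    // The actual constant values are not shown on this page.
    public static final int SUR_HIGH_START = 0xD800;
    public static final int SUR_HIGH_END = 0xDBFF;
    public static final int SUR_LOW_START = 0xDC00;
    public static final int SUR_LOW_END = 0xDFFF;
    public static final int REPLACEMENT_CHAR = 0xFFFD;

    public static boolean isHighSurrogate(int c) {
        return c >= SUR_HIGH_START && c <= SUR_HIGH_END;
    }

    public static boolean isLowSurrogate(int c) {
        return c >= SUR_LOW_START && c <= SUR_LOW_END;
    }

    // Combines a valid high/low surrogate pair into a single code point.
    public static int toCodePoint(char high, char low) {
        return ((high - SUR_HIGH_START) << 10) + (low - SUR_LOW_START) + 0x10000;
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F600));
        int cp = toCodePoint(s.charAt(0), s.charAt(1));
        System.out.println(Integer.toHexString(cp)); // prints "1f600"
    }
}
```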
-
-
Method Detail
-
UTF16toUTF8WithHash
public static int UTF16toUTF8WithHash(char[] source, int offset, int length, BytesRef result)
Encode characters from a char[] source, starting at offset for length chars. Returns a hash of the resulting bytes. After encoding, result.offset will always be 0.
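The idea of returning a hash of the encoded bytes can be sketched with the JDK alone. The hash function is not specified on this page, so the polynomial 31 * h + b below is an assumption. String.getBytes stands in for the encoder and, unlike the documented method, allocates a new array on every call.

```java
import java.nio.charset.StandardCharsets;

public class Utf8HashSketch {
    // Hypothetical hash over a byte range (31 * h + b); the real function may differ.
    public static int hashBytes(byte[] bytes, int offset, int length) {
        int h = 0;
        for (int i = offset; i < offset + length; i++) {
            h = 31 * h + bytes[i];
        }
        return h;
    }

    // Encodes a char[] range to UTF-8, then hashes the resulting bytes.
    public static int encodeWithHash(char[] source, int offset, int length) {
        byte[] utf8 = new String(source, offset, length).getBytes(StandardCharsets.UTF_8);
        return hashBytes(utf8, 0, utf8.length);
    }

    public static void main(String[] args) {
        char[] term = "ab".toCharArray();
        System.out.println(encodeWithHash(term, 0, term.length)); // prints 3105
    }
}
```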
-
UTF16toUTF8
public static void UTF16toUTF8(char[] source, int offset, UnicodeUtil.UTF8Result result)
Encode characters from a char[] source, starting at offset and stopping when the character 0xffff is seen. The method returns no value; the encoded bytes are written to result.
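The sentinel convention can be illustrated by a scan that finds how many chars precede 0xffff. This is only a sketch of the convention; the helper name is hypothetical, and the scan throws ArrayIndexOutOfBoundsException if no sentinel is present.

```java
public class SentinelScanSketch {
    // Returns the number of chars from offset up to, but not including, the 0xFFFF sentinel.
    public static int lengthUntilSentinel(char[] source, int offset) {
        int i = offset;
        while (source[i] != 0xFFFF) {
            i++;
        }
        return i - offset;
    }

    public static void main(String[] args) {
        char[] buffer = {'a', 'b', 'c', '\uFFFF', 'x'};
        System.out.println(lengthUntilSentinel(buffer, 1)); // prints 2
    }
}
```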
-
UTF16toUTF8
public static void UTF16toUTF8(char[] source, int offset, int length, UnicodeUtil.UTF8Result result)
Encode characters from a char[] source, starting at offset for length chars. The method returns no value; the encoded bytes are written to result.
-
UTF16toUTF8
public static void UTF16toUTF8(String s, int offset, int length, UnicodeUtil.UTF8Result result)
Encode characters from this String, starting at offset for length characters. The method returns no value; the encoded bytes are written to result.
-
UTF16toUTF8
public static void UTF16toUTF8(CharSequence s, int offset, int length, BytesRef result)
Encode characters from this CharSequence, starting at offset for length characters. After encoding, result.offset will always be 0.
-
UTF16toUTF8
public static void UTF16toUTF8(char[] source, int offset, int length, BytesRef result)
Encode characters from a char[] source, starting at offset for length chars. After encoding, result.offset will always be 0.
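A minimal sketch of encoding into a reusable buffer, using only the JDK. ReusableBytes is a hypothetical stand-in for a result holder such as BytesRef, and replacing unpaired surrogates with U+FFFD is an assumption, not something this page states. The buffer grows only when it is too small, so repeated calls can avoid allocation.

```java
public class Utf16ToUtf8Sketch {
    /** Hypothetical reusable result holder; not the real BytesRef. */
    public static final class ReusableBytes {
        public byte[] bytes = new byte[16];
        public int offset;
        public int length;
    }

    public static void encode(char[] source, int offset, int length, ReusableBytes result) {
        int maxLen = length * 3; // worst case: 3 bytes per UTF-16 code unit
        if (result.bytes.length < maxLen) {
            result.bytes = new byte[maxLen]; // allocate only when the buffer is too small
        }
        byte[] out = result.bytes;
        int upto = 0;
        int end = offset + length;
        for (int i = offset; i < end; i++) {
            int code = source[i];
            if (code < 0x80) {
                out[upto++] = (byte) code;
            } else if (code < 0x800) {
                out[upto++] = (byte) (0xC0 | (code >> 6));
                out[upto++] = (byte) (0x80 | (code & 0x3F));
            } else if (code < 0xD800 || code > 0xDFFF) {
                out[upto++] = (byte) (0xE0 | (code >> 12));
                out[upto++] = (byte) (0x80 | ((code >> 6) & 0x3F));
                out[upto++] = (byte) (0x80 | (code & 0x3F));
            } else {
                // Surrogate: try to combine a high surrogate with the following low surrogate.
                if (code < 0xDC00 && i + 1 < end) {
                    int low = source[i + 1];
                    if (low >= 0xDC00 && low <= 0xDFFF) {
                        int cp = ((code - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
                        i++;
                        out[upto++] = (byte) (0xF0 | (cp >> 18));
                        out[upto++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                        out[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                        out[upto++] = (byte) (0x80 | (cp & 0x3F));
                        continue;
                    }
                }
                // Assumption: an unpaired surrogate becomes U+FFFD (EF BF BD).
                out[upto++] = (byte) 0xEF;
                out[upto++] = (byte) 0xBF;
                out[upto++] = (byte) 0xBD;
            }
        }
        result.offset = 0; // mirrors the documented "result.offset will always be 0"
        result.length = upto;
    }

    public static void main(String[] args) {
        ReusableBytes result = new ReusableBytes();
        char[] text = "a\u00e9\u20ac".toCharArray();
        encode(text, 0, text.length, result);
        System.out.println(result.length); // prints 6
    }
}
```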
-
UTF8toUTF16
public static void UTF8toUTF16(byte[] utf8, int offset, int length, UnicodeUtil.UTF16Result result)
Convert UTF8 bytes into UTF16 characters. If offset is non-zero, conversion starts at that starting point in utf8, re-using the results from the previous call up until offset.
-
newString
public static String newString(int[] codePoints, int offset, int count)
Cover JDK 1.5 API. Create a String from an array of codePoints.
- Parameters:
codePoints - The code array
offset - The start of the text in the code point array
count - The number of code points
- Returns:
- a String representing count code points starting at offset
- Throws:
IllegalArgumentException - If an invalid code point is encountered
IndexOutOfBoundsException - If the offset or count are out of bounds.
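The JDK constructor String(int[] codePoints, int offset, int count) has the same parameter list and documented exceptions, so it serves as an illustration of the expected behaviour. The wrapper name below is hypothetical; whether newString delegates to this constructor is not stated on this page.

```java
public class NewStringSketch {
    // Builds a String from count code points starting at offset, using the JDK constructor.
    public static String fromCodePoints(int[] codePoints, int offset, int count) {
        return new String(codePoints, offset, count);
    }

    public static void main(String[] args) {
        int[] codePoints = {0x48, 0x69, 0x1F600};
        String s = fromCodePoints(codePoints, 1, 2);
        // One BMP char plus one surrogate pair gives three UTF-16 code units.
        System.out.println(s.length()); // prints 3
    }
}
```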
-
UTF8toUTF16
public static void UTF8toUTF16(byte[] utf8, int offset, int length, CharsRef chars)
Interprets the given byte array as UTF-8 and converts to UTF-16. The CharsRef will be extended if it doesn't provide enough space to hold the worst case of each byte becoming a UTF-16 code unit.
NOTE: Full characters are read, even if this reads past the length passed (and can result in an ArrayIndexOutOfBoundsException if invalid UTF-8 is passed). Explicit checks for valid UTF-8 are not performed.
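A minimal decoding sketch using only the JDK, under the same assumptions as the description: the output buffer is sized for the worst case of one UTF-16 code unit per byte, whole characters are read based on the lead byte, and no validation is performed. ReusableChars is a hypothetical stand-in for CharsRef, and invalid or truncated input can throw ArrayIndexOutOfBoundsException here as well.

```java
public class Utf8ToUtf16Sketch {
    /** Hypothetical reusable result holder; not the real CharsRef. */
    public static final class ReusableChars {
        public char[] chars = new char[16];
        public int offset;
        public int length;
    }

    public static void decode(byte[] utf8, int offset, int length, ReusableChars result) {
        // Worst case: every byte becomes one UTF-16 code unit.
        if (result.chars.length < length) {
            result.chars = new char[length];
        }
        char[] out = result.chars;
        int upto = 0;
        int i = offset;
        int limit = offset + length;
        while (i < limit) {
            int b = utf8[i++] & 0xFF;
            if (b < 0xC0) {
                // Single byte (continuation bytes are not checked for validity).
                out[upto++] = (char) b;
            } else if (b < 0xE0) {
                // Two-byte sequence.
                out[upto++] = (char) (((b & 0x1F) << 6) | (utf8[i++] & 0x3F));
            } else if (b < 0xF0) {
                // Three-byte sequence.
                out[upto++] = (char) (((b & 0x0F) << 12)
                        | ((utf8[i] & 0x3F) << 6)
                        | (utf8[i + 1] & 0x3F));
                i += 2;
            } else {
                // Four-byte sequence: emit a surrogate pair.
                int cp = ((b & 0x07) << 18)
                        | ((utf8[i] & 0x3F) << 12)
                        | ((utf8[i + 1] & 0x3F) << 6)
                        | (utf8[i + 2] & 0x3F);
                i += 3;
                cp -= 0x10000;
                out[upto++] = (char) (0xD800 + (cp >> 10));
                out[upto++] = (char) (0xDC00 + (cp & 0x3FF));
            }
        }
        result.offset = 0;
        result.length = upto;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0x61, (byte) 0xC3, (byte) 0xA9};
        ReusableChars result = new ReusableChars();
        decode(utf8, 0, utf8.length, result);
        System.out.println(result.length); // prints 2
    }
}
```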
-
UTF8toUTF16
public static void UTF8toUTF16(BytesRef bytesRef, CharsRef chars)
Utility method for UTF8toUTF16(byte[], int, int, CharsRef)
- See Also:
UTF8toUTF16(byte[], int, int, CharsRef)
-
validUTF16String
public static boolean validUTF16String(CharSequence s)
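This page gives no description for this method. Going by the name alone, one plausible reading is a check that every surrogate in the sequence is properly paired. The sketch below implements that reading with JDK calls only; it is an assumption about the intent, and the class and method names are hypothetical.

```java
public class ValidUtf16Sketch {
    // Returns true if every high surrogate is immediately followed by a low surrogate
    // and no low surrogate appears on its own.
    public static boolean isWellFormed(CharSequence s) {
        int n = s.length();
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < n && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a valid pair
                } else {
                    return false; // high surrogate without a following low surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate without a preceding high surrogate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("abc"));    // prints true
        System.out.println(isWellFormed("\uD83D")); // prints false
    }
}
```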
-
-