In our last “Data Collection” weekly meeting, there was a question about
how UTF-8 is supported in cross-platform environments and between Java
and C libraries. Here is the follow-up.
Brief introduction to UTF-8
- UTF stands for Unicode Transformation Format
- UTF uses bit-shifting techniques to encode Unicode characters as byte values
- In UTF-8, each Unicode character is represented in between 1 and 4 bytes
For the complete UTF-8 encoding and decoding rules, see the specification at
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
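The bit-shifting mentioned above can be sketched in a few lines of Java. This is only an illustration, not code from any of our libraries: the class name Utf8EncodeSketch and the method encode are made up for this note, and the result is checked against the JDK's own encoder rather than taken on trust.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only.
final class Utf8EncodeSketch {

    /** Encodes one Unicode code point (surrogates excluded) as standard UTF-8. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + cp);
        }
        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx (identical to 7-bit ASCII)
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        // 'A', e-acute, euro sign, and one character outside the BMP.
        int[] samples = { 'A', 0xE9, 0x20AC, 0x1F600 };
        for (int cp : samples) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %d byte(s), matches JDK: %b%n",
                    cp, mine.length, Arrays.equals(mine, jdk));
        }
    }
}
```

The four branches correspond to the 1, 2, 3 and 4 byte forms; the leading bits of the first byte tell a decoder how many continuation bytes follow.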
UTF-8 in Java and JNI
- In JNI, the UTF-8 strings passed to native (C) code are always 0-terminated
- UTF-8 is upwards-compatible with 7-bit ASCII
Ref: http://java.sun.com/docs/books/tutorial/native1.1/implementing/string.html
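But the JVM and JNI use “modified UTF-8” strings, not standard UTF-8:
- The null character (U+0000) is encoded as two bytes (0xC0 0x80) instead of a single 0 byte, so an encoded string never contains an embedded 0
- There is no four-byte form: a character above U+FFFF is encoded as a surrogate pair, three bytes per surrogate (six bytes in total)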
Ref: http://java.sun.com/j2se/1.5.0/docs/guide/jni/spec/types.html
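The two differences can be seen from plain Java without writing any native code. The sketch below assumes that DataOutputStream.writeUTF is an acceptable stand-in for the JNI encoding (the JDK documents it as writing modified UTF-8, preceded by a 2-byte length that is stripped here); the class and method names are made up for this note.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class for illustration only.
final class ModifiedUtf8Demo {

    /** Standard UTF-8, as produced by the JDK charset encoder. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Modified UTF-8 via writeUTF, with the 2-byte length prefix removed. */
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            // writeUTF rejects strings whose encoded form exceeds 65535 bytes.
            throw new UncheckedIOException(e);
        }
    }

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\0";
        String beyondBmp = new String(Character.toChars(0x1F600));

        System.out.println(hex(standardUtf8(nul)));       // 00
        System.out.println(hex(modifiedUtf8(nul)));       // C0 80
        System.out.println(hex(standardUtf8(beyondBmp))); // F0 9F 98 80
        System.out.println(hex(modifiedUtf8(beyondBmp))); // ED A0 BD ED B8 80
    }
}
```

For any string made only of characters U+0001 to U+FFFF (surrogates excluded), the two methods return the same bytes.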
Recommendation:
- It is the right approach to adopt UTF-8, as it is the universal and widely accepted choice for a cross-platform data format.
- But we do need to handle the modified UTF-8 case when the engine is running on a JVM:
  - We want to simply expose this as a public attribute/property of the engine so that the client can choose to build the appropriate format of UTF-8 stream.
  - This is expected to be a rare case: if the data are expected to be in ASCII format, the two encodings are byte-for-byte identical (apart from the null character) and the client can ignore this exception.
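As a starting point for that discussion, here is one possible shape for the property, sketched from the client's side. Everything in it is an assumption: Engine, Utf8Flavor, getUtf8Flavor, differsBetweenFlavors and encodeFor are hypothetical names, not our engine's actual API, and DataOutputStream.writeUTF is used only as a convenient way to produce modified UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; none of these names exist in the engine today.
final class EngineEncodingSketch {

    /** The UTF-8 flavor an engine expects on its input stream. */
    enum Utf8Flavor { STANDARD, MODIFIED }

    /** The engine exposes its flavor as a read-only property. */
    interface Engine {
        Utf8Flavor getUtf8Flavor();
    }

    /**
     * True if s contains U+0000 or a surrogate. These are the only
     * characters for which standard and modified UTF-8 differ.
     */
    static boolean differsBetweenFlavors(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == 0 || Character.isSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    /** Builds the byte stream in whichever flavor the engine asks for. */
    static byte[] encodeFor(Engine engine, String s) {
        if (engine.getUtf8Flavor() == Utf8Flavor.STANDARD || !differsBetweenFlavors(s)) {
            // Common case (e.g. ASCII data): both flavors give the same bytes.
            return s.getBytes(StandardCharsets.UTF_8);
        }
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);   // modified UTF-8 + 2-byte length
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            // Limitation of this sketch: writeUTF caps the encoded size at 65535 bytes.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Engine jvmEngine = () -> Utf8Flavor.MODIFIED;
        System.out.println(differsBetweenFlavors("plain ASCII data"));
        System.out.println(encodeFor(jvmEngine, "a\0b").length);
    }
}
```

With this shape, a client whose data are pure ASCII never reaches the modified branch, which matches the expectation that the special case is rare.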
We can discuss this issue further at our next weekly meeting.