[hyades-dev] UTF-8 as the data exchange format to and from the data collection engine

 

In our last “Data Collection” weekly meeting, a question came up about how UTF-8 is supported in cross-platform environments and between Java and C libraries. Here is the follow-up.

 

Brief introduction to UTF-8

  • UTF stands for Unicode Transformation Format
  • UTF-8 uses bit-shifting to pack the bits of each Unicode code point into byte values
  • In UTF-8, each Unicode character is encoded as 1 to 4 bytes

For the complete UTF-8 encoding and decoding rules, see the spec at

                 http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
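
As a rough illustration of the bit-shifting described above, here is a minimal Java sketch that encodes a single code point by hand and compares the result with the JDK's own UTF-8 encoder. The class and method names (Utf8Lengths, encode) are made up for this example and are not part of any engine API. The sketch does not reject surrogate code points, which a real encoder would have to do.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Lengths {
    // Hand-rolled standard UTF-8 encoding of one code point.
    // Illustration only: no validation of surrogates or out-of-range values.
    public static byte[] encode(int cp) {
        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx (identical to 7-bit ASCII)
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        if (cp < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    public static void main(String[] args) {
        // One sample per encoded length: 'A', e-acute, euro sign, an emoji.
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %d byte(s), matches JDK: %b%n",
                    cp, mine.length, Arrays.equals(mine, jdk));
        }
    }
}
```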

 

UTF-8 in Java and JNI

  • In JNI, UTF-8 strings are always 0-terminated
  • UTF-8 is upwards-compatible with 7-bit ASCII

Ref: http://java.sun.com/docs/books/tutorial/native1.1/implementing/string.html

 

However, the JVM's native interface (JNI) uses “Modified UTF-8” strings, not standard UTF-8

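  • The null character (U+0000) is encoded as two bytes (0xC0 0x80) rather than a single 0 byte
  • There is no four-byte form: a supplementary character is encoded as its two surrogates, three bytes each (six bytes in total)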

Ref: http://java.sun.com/j2se/1.5.0/docs/guide/jni/spec/types.html
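
The two differences can be observed from plain Java without writing any native code, because DataOutputStream.writeUTF also emits modified UTF-8 (preceded by a two-byte length). The sketch below is only a demonstration under that assumption; the class and method names are invented for this example, and writeUTF itself is limited to strings whose encoded form fits in 65535 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    // Standard UTF-8, as produced by the JDK charset encoder.
    public static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8, obtained through DataOutputStream.writeUTF.
    // writeUTF prefixes the data with a 2-byte length, which is dropped here.
    public static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream, except for over-long strings.
            throw new UncheckedIOException(e);
        }
        byte[] withLength = buf.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    public static void main(String[] args) {
        String nul = "\0";                                        // U+0000
        String supplementary = new String(Character.toChars(0x1F600)); // outside the BMP
        String ascii = "hello";

        for (String s : new String[] {nul, supplementary, ascii}) {
            byte[] std = standardUtf8(s);
            byte[] mod = modifiedUtf8(s);
            System.out.printf("standard: %d byte(s), modified: %d byte(s), identical: %b%n",
                    std.length, mod.length, Arrays.equals(std, mod));
        }
    }
}
```

For pure ASCII input the two byte sequences are identical, which is why the difference only matters for U+0000 and for characters outside the Basic Multilingual Plane.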

 

Recommendation:

  • Adopting UTF-8 is the right approach, as it is the most widely accepted encoding for cross-platform data exchange.
  • However, we do need to handle modified UTF-8 when the engine is running on a JVM

-          We want to expose this as a public attribute/property of the engine so that the client can build the UTF-8 stream in the appropriate format.

-          This should rarely matter: if the data are in ASCII, both formats produce identical bytes and the client can ignore the difference.
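
To make the recommendation a little more concrete, here is one possible shape for the client-side check, offered only as a sketch. The enum Utf8Flavor and the method names are hypothetical; the actual engine attribute has not been defined yet. The check relies on the two differences listed earlier: a string encodes to the same bytes in both formats when it contains neither U+0000 nor any surrogate code unit.

```java
public class Utf8FlavorCheck {
    // Hypothetical value of the engine attribute discussed above.
    public enum Utf8Flavor { STANDARD, MODIFIED }

    // True when standard and modified UTF-8 produce the same bytes for s,
    // i.e. s has no U+0000 and no surrogate code units (no supplementary characters).
    public static boolean sameInBothFlavors(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == 0 || Character.isSurrogate(c)) {
                return false;
            }
        }
        return true;
    }

    // A client only has to care about the engine's flavor when the data
    // would actually be encoded differently.
    public static boolean needsSpecialHandling(Utf8Flavor engineFlavor, String data) {
        return engineFlavor == Utf8Flavor.MODIFIED && !sameInBothFlavors(data);
    }

    public static void main(String[] args) {
        String ascii = "plain ASCII payload";
        String withNul = "a\0b";
        System.out.println(needsSpecialHandling(Utf8Flavor.MODIFIED, ascii));
        System.out.println(needsSpecialHandling(Utf8Flavor.MODIFIED, withNul));
    }
}
```

A client that knows its data are ASCII can skip the check entirely; the helper is mainly useful when arbitrary text may pass through.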

 

We can discuss this issue further at our next weekly meeting.

 

