[hyades-dev] UTF-8 as the data exchange format to and from the data collection engine

From: "Nguyen, Hoang M" <hoang.m.nguyen@xxxxxxxxx>
Date: Sun, 15 Aug 2004 22:32:22 -0700

In our last “Data Collection” weekly meeting, there was a question about how UTF-8 is supported in cross-platform environments and between Java and C libraries. Here is the follow-up.

Brief introduction to UTF-8

- UTF stands for Unicode Transformation Format.
- UTF-8 uses bit shifting and masking to pack the bits of each Unicode code point into byte values.
- In UTF-8, each Unicode character is represented by 1 to 4 bytes.

For the complete UTF-8 encoding and decoding rules, see the specification at

http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
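
To make the bit-shifting point concrete, here is a minimal sketch of a standard UTF-8 encoder for a single code point. The class and method names (Utf8Encoder, encode) are made up for illustration and are not part of any Hyades code; the sketch also assumes a current JDK rather than the one in use when this note was written.

```java
class Utf8Encoder {
    // Encode one Unicode code point as standard UTF-8 (1 to 4 bytes).
    // Illustrative helper only; surrogate code points are rejected because
    // standard UTF-8 does not allow them.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value: " + cp);
        }
        if (cp < 0x80) {
            // 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {
            // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        if (cp < 0x10000) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    public static void main(String[] args) {
        // 'A', e-acute, euro sign, musical G clef: one sample per length.
        int[] samples = { 0x41, 0xE9, 0x20AC, 0x1D11E };
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, encode(cp).length);
        }
    }
}
```

The four samples cover the four possible lengths, from a plain ASCII letter up to a character outside the Basic Multilingual Plane.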

UTF-8 in Java and JNI

- In JNI, the UTF-8 strings handed to native code are always 0-terminated.
- UTF-8 is upwards-compatible with 7-bit ASCII: every 7-bit ASCII character is encoded as the same single byte.

Ref: http://java.sun.com/docs/books/tutorial/native1.1/implementing/string.html

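However, the JNI string functions use “Modified UTF-8 Strings”, not standard UTF-8. The differences are:

- The null character (U+0000) is encoded as two bytes (0xC0 0x80) instead of a single 0x00 byte.
- There is no four-byte form. A character above U+FFFF is encoded as a surrogate pair, with each surrogate taking three bytes (six bytes in total).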

Ref: http://java.sun.com/j2se/1.5.0/docs/guide/jni/spec/types.html
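
The difference can be seen from pure Java without writing any native code. The sketch below is only an illustration: it uses java.io.DataOutputStream.writeUTF, which is documented to write modified UTF-8, as a stand-in for what the JNI string functions produce, and compares it with String.getBytes using the standard UTF-8 charset. The class and helper names are made up, and a current JDK is assumed.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    // Standard UTF-8 bytes of a string.
    static byte[] standard(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 bytes of a string, taken from writeUTF with its
    // 2-byte length prefix stripped off.
    static byte[] modified(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Space-separated upper-case hex dump, e.g. "C0 80".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        String clef = new String(Character.toChars(0x1D11E)); // outside the BMP
        System.out.println("NUL  standard: " + hex(standard(nul)));
        System.out.println("NUL  modified: " + hex(modified(nul)));
        System.out.println("clef standard: " + hex(standard(clef)));
        System.out.println("clef modified: " + hex(modified(clef)));
    }
}
```

For the null character the standard form is the single byte 00 and the modified form is C0 80. For the clef the standard form has four bytes and the modified form has six. Any string made only of non-null ASCII characters gives identical bytes in both forms.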

Recommendation:

Adopting UTF-8 is the right approach, as it is the universal and widely accepted choice for a cross-platform data format.
But we do need to handle the modified UTF-8 case when the engine is running on a JVM:

- We want to simply expose this as a public attribute/property of the engine so that the client can choose to build the appropriate form of the UTF-8 stream.

- This is expected to be a rare case. If the data are plain ASCII with no embedded null characters, both forms produce identical bytes and the client can ignore this exception.
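
As a starting point for discussion, here is a rough sketch of how a client might use such a property. Everything in it is hypothetical: the Utf8Flavor enum stands in for whatever attribute the engine ends up exposing, and encodeFor and needsFlavorAwareEncoding are illustrative helper names, not existing Hyades APIs. A current JDK is assumed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class EngineEncodingSketch {
    // Hypothetical engine property; not an actual Hyades API.
    enum Utf8Flavor { STANDARD, MODIFIED }

    // Build the byte stream in the form the engine reports it expects.
    static byte[] encodeFor(Utf8Flavor flavor, String text) {
        if (flavor == Utf8Flavor.STANDARD) {
            return text.getBytes(StandardCharsets.UTF_8);
        }
        // Modified UTF-8: encode each UTF-16 char on its own, so U+0000
        // becomes C0 80 and each surrogate becomes a 3-byte sequence.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            int c = text.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                out.write(c);
            } else if (c <= 0x07FF) {
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    // True only when the two forms would differ for this text, i.e. it
    // contains a null character or a surrogate (a char above U+FFFF).
    static boolean needsFlavorAwareEncoding(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\u0000' || Character.isSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String ascii = "cpu=42";
        String withNul = "a\u0000b";
        System.out.println(needsFlavorAwareEncoding(ascii));
        System.out.println(needsFlavorAwareEncoding(withNul));
        System.out.println(encodeFor(Utf8Flavor.MODIFIED, withNul).length);
    }
}
```

A client that knows its data never contains null characters or characters above U+FFFF can skip the check entirely and always send standard UTF-8.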

We can discuss this issue further at our next weekly meeting.

Follow-Ups:
- Re: [hyades-dev] UTF-8 as the data exchange format to and from the data collection engine
  - From: Allan K Pratt
