RE: [hyades-dev] More info on Java UTF-8
There is still some debate about using UTF-8 as a canonical string format,
and a suggestion that we should use a pluggable transcoder stack to handle
impedance matching among components. I don't think we should do that. I
think it will benefit us to pick UTF-8 as a canonical string format, just
like we will pick a canonical byte order for multi-byte numeric values. I
will try to lay out my logic and see if people agree.
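To make "canonical" concrete, here is a minimal sketch of one string field on the wire; the layout (a 4-byte length followed by the bytes) is my own illustration, not an existing Hyades message format:

    // Illustrative only: the string travels as UTF-8 bytes, and the multi-byte
    // length prefix uses one agreed byte order (DataOutputStream is big-endian).
    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    class CanonicalStringField {
        static byte[] encode(String s) throws IOException {
            byte[] utf8 = s.getBytes("UTF-8");             // canonical string encoding
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(utf8.length);                     // canonical (big-endian) byte order
            out.write(utf8);
            return buffer.toByteArray();
        }
    }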
First let me say that I know Java and .Net aren't the only target
environments, and Eclipse/Java isn't the only client environment. I think
they're going to be high-runner cases. While we must not make things
impossibly complex for a non-Java agent or client, I think we can consider
the common cases when deciding which way to streamline things.
Second I'll say this: I'm arguing against us inventing a pluggable
transcoding stack of our own. My resistance is based on the added
complexity and the difficulty of getting the design and implementation
right. If we find and adopt an established standard message system that
already has such a feature, and has an implementation on the platforms of
interest, and meets our other criteria, that's OK with me.
Third, I want to comment that I've just finished making my Java agent
extension component work on z/OS and OS/400; Java uses Unicode, and the
RAC on z/OS wants ASCII while the one on AS/400 wants EBCDIC. Different
command-line options and pragmas in C source are used to control ASCII vs.
EBCDIC string constants. My battle scars are fresh.
Now that the preliminaries are over, I'll start with the bones of my
argument, then flesh them out:
1. A transcoding stack is beneficial in a limited range of
scenarios, compared to picking UTF-8 as a common format.
2. The scenario where such a stack is beneficial will be almost
nonexistent in actual use of the HCE protocol.
3. A transcoding stack adds noticeable complexity to the design
and implementation, so it should be done only if justified.
4. I believe it is not justified. Therefore we should use UTF-8,
and not invent and use a transcoding stack.
Point 1: when does a transcoding stack help you? It saves useless
transcoding when two components with identical encodings want to talk.
Let's say the only two encodings of interest are UTF-8 and EBCDIC. (We can
consider ASCII a subset of UTF-8.) If we pick UTF-8 as the canonical
encoding, then two EBCDIC components that want to talk will go through two
useless conversion steps: EBCDIC to UTF-8 and back again. That's where a
transcoding stack wins: it recognizes that no conversion is necessary.
But that's the ONLY scenario where the stack is helpful compared to
picking UTF-8 as the canonical format. Two interacting UTF-8 components
don't need to transcode in either case, and mismatched components must go
through one transcode step regardless.
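Here is a rough sketch of that double conversion in Java; "Cp1047" is the usual JRE name for EBCDIC Latin-1, and its availability is an assumption about the runtime, not something Hyades defines:

    // The two "useless" conversions an EBCDIC-to-EBCDIC exchange would pay
    // under a canonical UTF-8 wire format. Both steps are pure overhead: the
    // bytes that reach agent B are identical to the bytes that left agent A.
    import java.io.UnsupportedEncodingException;

    class DoubleTranscode {
        public static void main(String[] args) throws UnsupportedEncodingException {
            byte[] fromAgentA = "CPU.TIME".getBytes("Cp1047");                      // EBCDIC on agent A
            byte[] onTheWire  = new String(fromAgentA, "Cp1047").getBytes("UTF-8"); // step 1: EBCDIC -> UTF-8
            byte[] atAgentB   = new String(onTheWire, "UTF-8").getBytes("Cp1047");  // step 2: UTF-8 -> EBCDIC
        }
    }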
And look again: I think this team acknowledges that transcoding is not a
big deal for low-volume traffic. So the only real benefit scenario is
*high-volume* communication between two EBCDIC agents.
And here's something else: even EBCDIC agents will want to be in a Unicode
universe if they're using strings that come from a Java or .Net process -
strings like package and class names, or log messages. I know that's not the
only kind of agent, but it's a kind worth considering. Therefore even
conversations among local components on an EBCDIC platform might want to
use a Unicode-capable format like UTF-8.
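To illustrate the repertoire problem with a made-up example (the class name is hypothetical, and again "Cp1047" is assumed to be installed):

    // A Java class or package name may contain letters outside any single-byte
    // EBCDIC code page, so forcing it into that code page is lossy.
    import java.io.UnsupportedEncodingException;

    class LossyRoundTrip {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String className = "com.example.metrics.ΔeltaCollector"; // Greek delta is a legal identifier char
            byte[] ebcdic = className.getBytes("Cp1047");             // no code point for delta: replaced with '?'
            String roundTrip = new String(ebcdic, "Cp1047");
            System.out.println(className.equals(roundTrip));          // prints false - information was lost
        }
    }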
Point 2: We've established that the only winning scenario is high-volume
communication between EBCDIC components that don't need to operate in a
Unicode world. We have to ask ourselves, how likely is that? While it may
be that the Workbench isn't the only client, it is certainly the major
one. Even considering non-Workbench clients, they will overwhelmingly be
running on non-EBCDIC machines, even when there are EBCDIC machines in the
system being observed. Finally, in order to matter, the communication has
to involve high-volume STRINGS: passing numbers and binary data has no
bearing on this argument.
It would be easier to assign a "likelihood" value if we knew what the use
cases were for agent-to-agent interactions. Right now we don't have such
use cases in front of us. Use cases are valuable for settling questions
like this and for documenting the foundation for a decision so we don't
constantly revisit it.
Point 3: The transcoding stack is heavy-weight. It adds complexity and
risk to the project. We'll have to define a communication broker that all
messages go through. That broker will need a plug-in architecture for
transcoders. Each transcoder registers the input and output formats it
knows how to transform. Every agent (or even every message in the system)
will be tagged with the string encoding it uses. As messages pass through
the broker, it will determine whether the sender and recipient are
compatible. If not, it will consult its roster of transcoders and find one
(or possibly more than one) to transform the sender's encoding to the
recipient's.
Then there are a million details: who loads transcoders? When? Reading
information from what registry store, using what mechanism? Are they
native shared libraries, Java classes, something else, or possibly all of
the above? Can they be unloaded when they're not needed any more?
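To show the kind of plumbing I mean, here is a purely hypothetical sketch; none of these interfaces or classes exist in Hyades, and the open questions above (loading, registry, chaining) are exactly what it leaves unanswered:

    import java.util.HashMap;
    import java.util.Map;

    interface Transcoder {
        String from();                 // e.g. "IBM-1047"
        String to();                   // e.g. "UTF-8"
        byte[] transcode(byte[] data); // convert the message's string fields
    }

    class MessageBroker {
        private final Map<String, Transcoder> roster = new HashMap<String, Transcoder>();

        void register(Transcoder t) {
            roster.put(t.from() + "->" + t.to(), t);
        }

        // Every message (or at least every agent) must be tagged with its string
        // encoding so the broker can decide whether a conversion is needed.
        byte[] route(byte[] payload, String senderEncoding, String receiverEncoding) {
            if (senderEncoding.equals(receiverEncoding)) {
                return payload;        // the one case where the stack saves any work
            }
            Transcoder t = roster.get(senderEncoding + "->" + receiverEncoding);
            if (t == null) {
                throw new IllegalStateException(
                    "no transcoder for " + senderEncoding + " -> " + receiverEncoding);
            }
            return t.transcode(payload);
        }
    }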
All that work, all that complexity, implementation, and testing for a
broker that almost always sees UTF-8 to UTF-8 traffic (which needs no
conversion at all) or UTF-8 to EBCDIC traffic (which makes it do the same
transcode we would have done if we'd made the simple choice). That's what
I see. To keep this new HCE protocol system
achievable, I think this is added risk and complexity we can live without.
-- Allan Pratt, apratt@xxxxxxxxxx
Rational software division of IBM