[asciidoc-lang-dev] Unicode Issues


From: Lex Trotman <exciidoc@xxxxxxxxx>
Date: Wed, 3 Mar 2021 14:51:25 +1000

At some point in the specification process, some issues with Unicode handling need to be addressed and some terminology agreed on.

Unicode is defined in terms of "code points", and a Unicode format input file is an encoded sequence of code points, so that is what AsciiDoc input will be.

Just a first touch on encoding: I suggest that the standard require the encoding to be explicitly specified by the document (BOM or :encoding: attribute), explicitly specified on the command line, or otherwise assumed to be UTF-8. If an implementation picks up the encoding from the environment, documents will break when processed in another environment, so that should be actively discouraged.
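To make the suggestion concrete, here is a minimal Python sketch of that lookup. The function name `resolve_encoding` and the `cli_encoding` parameter are hypothetical, the precedence (BOM first, then command line) is only an assumption, and the :encoding: attribute is left out because reading it needs a first decoding pass. Nothing is taken from the environment.

```python
import codecs
from typing import Optional

# Check the longer BOMs first: the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def resolve_encoding(data: bytes, cli_encoding: Optional[str] = None) -> str:
    """Pick a codec name for raw input bytes (hypothetical helper).

    Assumed precedence: BOM in the document, then an encoding given
    explicitly on the command line, then UTF-8. The locale and other
    environment settings are deliberately never consulted.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name  # these codecs consume the BOM when decoding
    if cli_encoding is not None:
        return cli_encoding
    return "utf-8"


raw = codecs.BOM_UTF8 + "= Title".encode("utf-8")
text = raw.decode(resolve_encoding(raw))
```

With this, the same bytes decode the same way on every machine, which is the point of discouraging environment lookups.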

Once the encoding hurdle is cleared, it's code points all the way down.

But unfortunately "code point" does not equal what we conventionally think of as "character".

In the Unicode glossary, "Character" has four definitions, any of which is a reasonable use of the term.

But even if we pick one, Unicode data such as classes and properties are assigned to code points, not characters. Even if we defined our own meaning for "character", it would not necessarily line up with Unicode classes and properties unless we defined a character to be purely a code point, and in that case I think it is better to be precise and say "code point" to avoid ambiguity.

As an example of why it matters, take an accented Latin-1 character, A grave. Most such characters have a single code point assigned, as this one does (U+00C0), but they can also be written as two code points, here U+0041 followed by U+0300: the base letter, a class L code point, followed by the accent, a combining class M code point. Where a single code point has been assigned, Unicode defines a normalisation process (the composed form, NFC) that converts the pair to the single code point, which is class L.
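The A grave case can be checked with Python's standard `unicodedata` module. This is only an illustration of the two spellings and their classes, not a proposal for how an implementation should do it; the variable names are mine.

```python
import unicodedata

# A grave written as two code points: base letter plus combining accent.
decomposed = "\u0041\u0300"

# General category of each code point: the base is a letter (Lu),
# the accent is a nonspacing mark (Mn).
decomposed_classes = [unicodedata.category(c) for c in decomposed]
print(decomposed_classes)  # ['Lu', 'Mn']

# NFC normalisation composes the pair into the single code point U+00C0.
composed = unicodedata.normalize("NFC", decomposed)
print([f"U+{ord(c):04X}" for c in composed])  # ['U+00C0']
print(unicodedata.category(composed))  # Lu
```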

So should AsciiDoc specify that the input must be normalised, or require implementations to normalise it, adding the cost of scanning the document?

But accents can be stacked, especially in some languages beyond the Latin-1 set, and many such combinations have no single code point assigned, so normalisation won't help: they will still exist as multiple code points in AsciiDoc input. And combining code points can be applied to non-letter code points too.
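A small Python check of this, using combinations I picked for illustration (a "q" with tilde above and dot below, and a digit with an acute accent); to my knowledge neither has a precomposed code point, so NFC leaves the combining marks in place.

```python
import unicodedata

# "q" + combining tilde (U+0303) + combining dot below (U+0323).
stacked = "q\u0303\u0323"
stacked_nfc = unicodedata.normalize("NFC", stacked)

# NFC may reorder the marks, but it cannot merge them into the base,
# so the result is still three code points ending in a class M one.
print(len(stacked_nfc))
print(unicodedata.category(stacked_nfc[-1]))

# A combining mark on a non-letter: digit one + combining acute (U+0301).
digit = unicodedata.normalize("NFC", "1\u0301")
print([unicodedata.category(c) for c in digit])
```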

Since normalisation is not a universal cure, does this imply it should not be required by or of AsciiDoc implementations?

Where does this all matter? Consider the most obvious example, unconstrained quotes: they need spacing (or a few specific code points) on one side and "letter like" entities on the other. But what if the "letter like" entity is not normalised?

I'll use bold `*` below simply to make it concrete and easier to explain, but it applies to all of the quote marks.

If we define "letter like" as the Unicode L class (which I think we should; always delegate as much work to other specs as you can), that's fine for the opening `*`: even if the following accented character has not been normalised, the base letter is the first code point and so has class L. But for the closing `*`, the preceding code point of an unnormalised accented character is the combining code point, which is class M, not class L.
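The asymmetry shows up in a few lines of Python. The `is_letter` helper is hypothetical and simply tests for the L class on a single code point, which is the naive rule described above.

```python
import unicodedata


def is_letter(code_point: str) -> bool:
    """Naive "letter like" test: general category in the L class."""
    return unicodedata.category(code_point).startswith("L")


# Bold marks around an unnormalised A grave (U+0041 U+0300).
text = "*A\u0300*"

opening_ok = is_letter(text[1])   # code point after the opening *
closing_ok = is_letter(text[-2])  # code point before the closing *
print(opening_ok, closing_ok)  # True False

# After NFC the accent is folded into U+00C0 and both sides pass,
# but only because this particular pair has a composed form.
normalised = unicodedata.normalize("NFC", text)
print(is_letter(normalised[-2]))  # True
```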

So what do we do?

a) Shrug and say use constrained quotes in those cases. But to be user friendly, should such use of unconstrained quotes produce a warning, since the user may find it hard to notice a lone pair of `*` left in their output? How much work is that? And how many nuisance warnings does it produce?

b) Should AsciiDoc be specified to use the class of the base code point when a `*` is preceded by a class L code point followed by one or more class M code points? This is quite precise, but may become unwieldy as the various situations in which markup can be used are mixed, e.g. nested quotes like _foo *blah*_, which has the * between the _ and the letter.
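Option b) could be sketched in Python roughly as follows. The function names are hypothetical and this is only one reading of the rule: walk back over any class M code points before the mark and classify the code point found there.

```python
import unicodedata
from typing import Optional


def base_category_before(text: str, index: int) -> Optional[str]:
    """Category of the base code point preceding text[index].

    Skips backwards over class M (combining mark) code points.
    Returns None if nothing but marks precedes the index.
    """
    i = index - 1
    while i >= 0 and unicodedata.category(text[i]).startswith("M"):
        i -= 1
    return unicodedata.category(text[i]) if i >= 0 else None


def closing_mark_follows_letter(text: str, index: int) -> bool:
    """True if the mark at text[index] follows a class L base."""
    category = base_category_before(text, index)
    return category is not None and category.startswith("L")


# Unnormalised A grave before the closing *: accepted under option b).
print(closing_mark_follows_letter("*A\u0300*", 3))  # True

# Nested quotes: the closing _ is preceded by *, not by a letter, so
# this rule alone does not settle the mixed case raised above.
print(closing_mark_follows_letter("_foo *blah*_", 11))  # False
```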

This may all seem low level and picky, but it's my experience that if Unicode's peculiarities are not addressed up front they will haunt you forever. And that is of course particularly true of something that involves human language handling, as AsciiDoc does.

Cheers

Lex

Follow-Ups:
- [asciidoc-lang-dev] Unicode Issues
  - From: Sylvain Leroux
