
Re: [asciidoc-lang-dev] Unicode Issues



On Wed, 3 Mar 2021 at 21:54, Sylvain Leroux <sylvain@xxxxxxxxxxx> wrote:
This probably needs closer attention, but as first thoughts:

On 03/03/2021 05:51, Lex Trotman wrote:
> Just a first touch on encoding, I suggest that the standard require
> encodings to be either explicitly specified by the document (BOM or
> :encoding: attribute) or explicitly specified on the command line or be
> assumed to be UTF-8.  If an implementation picks up encodings from the
> environment that will make documents break if processed in another
> environment, so that should be actively discouraged.
Given you can write a valid Asciidoc document using _only_ the 7 bit
ASCII set, I would be more restrictive here:

"An AsciiDoc document is a stream of Unicode code points. Any conforming
processor MUST accept UTF-8 encoded input streams. As an extension,
processors MAY accept other encodings."

I would agree with that as far as it goes, but it doesn't specify how encodings are determined. And what about mixed encodings in multi-file documents? If a document includes pieces from another file (for example, program source code), it has no control over the encoding of that file. Does the include:: directive need to accept an encoding parameter?

> So should AsciiDoc specify that the input must be normalised, or require
> implementations to do so adding cost scanning the document?
We must agree on one of the Unicode Normalization Forms ("NFC", "NFD",
"NFKC", or "NFKD" [1]). The specs must be written assuming that
normalization form is used for internal processing. A processor may
use another internal representation as long as the observable behavior
is in conformance with the specs.

[1]: https://unicode.org/reports/tr15/


I didn't make a suggestion since I'm not sure it matters: the specification has to account for combining characters in any case, because not all character combinations compose to a single code point. I don't believe AsciiDoc should be defined such that implementations need to know about equivalence classes or other characteristics of the human text between the markups, only the major classes of the code points, as you have outlined below; then normalisation doesn't matter as far as I can tell.
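To make that concrete, here is a minimal sketch (using Python's standard unicodedata module) showing why normalisation cannot eliminate combining marks: "é" has a precomposed code point, but "x" with a circumflex does not, so it survives NFC as a base letter plus a combining mark.

```python
import unicodedata

# "é" has a precomposed form, so NFC collapses 'e' + COMBINING ACUTE
# ACCENT into a single code point.
decomposed_e = "e\u0301"
assert len(unicodedata.normalize("NFC", decomposed_e)) == 1

# There is no precomposed "x with circumflex" code point, so NFC leaves
# 'x' + COMBINING CIRCUMFLEX as two code points: normalisation cannot
# remove the combining mark here.
x_hat = "x\u0302"
assert len(unicodedata.normalize("NFC", x_hat)) == 2
```

So whatever normalization form the spec picks, the grammar still has to cope with sequences of base plus combining code points.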
>
> But accents can be stacked, especially in some languages past the
> Latin-1 set, and there is no single code point assigned to many such
> multi-combinations, so normalisation won't help, they will still exist
> as multiple code points in AsciiDoc input.  And combining code points
> can be applied to non-letter code points too.
> [...]
> But for the closing `*` the preceding code point of an
> unnormalised accented character is the combining code point class M not
> class L.
Assuming the NFD normalization form, it's easy to define a LETTER as:

LETTER := UNICODE_L_CLASS_CP (UNICODE_M_CLASS_CP)*

Even in NF(K)C some "characters" have this form, so we need to handle it under any normalisation, I believe, and this definition would work in any of them.

Actually, I wonder if we couldn't simplify that as:

LETTER := UNICODE_L_CLASS_CP | UNICODE_M_CLASS_CP


That's a good point and a question we need to decide. Are there combining characters that attach to code points we do not want to allow in conjunction with constrained quotes? Unicode is so big that I don't know.

I believe we can be conservative and use the former definition, requiring an L-class base code point, since AsciiDoc also has unconstrained quotes that can be used next to any type of character.

> This is quite precise, but may become unwieldy as the
> various situations markups can be used in are mixed, ie nested quotes
> like _foo *blah*_ has the * between the _ and the letter.
Assuming an LL parser, "*blah*" has already been recognized as an
"inline-strong" non-terminal by the time "_" is processed, so the rule
won't care about the preceding code point:

That implies that an implementation with a separate lexer must receive that context from the parser.

inline := inline-emphasis | inline-strong | ANY_LETTER_OR_MARK
inline-emphasis := <???> "_" inline (ANY_SPACE* inline)* "_" <???>
inline-strong := <???> "*" inline (ANY_SPACE* inline)* "*" <???>

In my personal implementation, I defined "<???>" as
"not(ANY_LETTER_OR_MARK)". But I'm pretty sure it isn't compliant with
the actual Asciidoctor behavior.


Hmmm, you are right, it can't really be addressed in isolation from the whole question of handling quotes markup. That's a big one and should be its own thread[1], so I won't comment further here.

Cheers
Lex

[1] Pending the release of the guidelines for contribution (which I expect will have a period for comments before final promulgation, meaning they are some way away), I'm taking the approach Dan mentioned previously: things should not be mixed in threads, so that it is easy to ensure everything is addressed before the specification is wrapped up.

>
> This may all seem low level and picky, but it's my experience that if
> Unicode's peculiarities are not addressed up front they will haunt you
> forever.  And that is of course particularly true of something that
> involves human language handling as AsciiDoc does.
I agree. This is something that has to be specified upfront.

- Sylvain



_______________________________________________
asciidoc-lang-dev mailing list
asciidoc-lang-dev@xxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.eclipse.org/mailman/listinfo/asciidoc-lang-dev
