Skip to main content

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
[asciidoc-lang-dev] Unicode Issues
  • From: Sylvain Leroux <sylvain@xxxxxxxxxxx>
  • Date: Wed, 3 Mar 2021 12:54:10 +0100
  • Autocrypt: addr=sylvain@xxxxxxxxxxx; keydata= xsFNBFdFUf4BEACl0a/nxBGmY4eqGLMYQTVTaUt+Z7SXkaYiiMx00suDDJpCsE3f6Qet4zaC 1EBBseb0x/164kC92cc8ZV5NN00qOKWEkf05/JrVEFFq4le78l/9yO5GTE9ORnrOEqbYrFYf +3ArkXHnxFmR1SCRyFGKTtgE2nGqbKicQgjOYQFS4DfRVkEyPfKsr7/J1GUUTHu/sD7nnNik +7trfLwva9D6EetRUnd+H/AV6QVw3jhgR9klpKMo7+bXi35IZShnYAN+kvuAvoCQDjv1L2L5 XkOf9gGNLJAdEKbBcK0UiQ80RvO6Vr0FejpA0tmRGGIqB5m6WNxRxpeFhgK32l1+pInjGIP3 1to6xf0+pJWuWL5ZfQq8+8+4J+5ibX/klD5D6b78aNV/B/NTO+wE2B1Umw1JWthnKlTbKLCj t4IvAXsQCJWXi55pyz2S2m2vMd1ffHKPl59jIJzUXy2nM9sQhFTzLeKUZ0V6RBUF9lGDAWwh 3pR0OaIvQzuBEf1qEdLBsjMsI9SJdMY4VOKWMCuSMm+KlaF3jsEPkgu+GymUDCbvv2ZIGwwK kXQbs2gqpicPUKXwiszbgx43wiwpTLQ+6ZRlaoKlbVlHoCC/eO2fMvfasUOJZzLZSHOPPsOr xCtygLrSBx5hLdAA7syJv1GVGQaE8IfQPM7P+5QPHVhgQ/mJEQARAQABzSRTeWx2YWluIExl cm91eCA8c3lsdmFpbkBjaGljb3JlZS5mcj7CwYIEEwEIACwCGyMFCQlmAYAHCwkIBwMCAQYV CAIJCgsEFgIDAQIeAQIXgAUCV+WKiQIZAQAKCRCrWB8dH2HFIpzYD/9KVcvI3xAlR+Ahxlvl AnxzwT1ZIhRT1YPbX3Fwr6l7lBuFfp8sGHejY9XNsGMDM/C4h+GxHKiY87KMLTI2P5TfHy2j MYHW4x2VhXTqOmUMtTO1/4DfamlTF/xwaXTy+jx5Z3ghaZDWWflaNXpbwB1j/gl0TjXCSeiK 7GPGFTPJt04JmTDxuTKXqdwHUpKQSZ5pqdufP2po+W/uxgamRXjHD7z8X04+xK5E7ic5pgaE YtquzZDRfnil3W4GSodX6dKdnhCN2r8tDqV0FsRSp3qRuvzBJ692WCH5FmXmvqiNpVCo+Fj1 T45TYB49yiRAzyJZwgZnEB0vH/HzybPmJC9z3wjPaoFmGOUp2imbHlu3ABWRnqPtdYcbDHBF Mrpop7oFAGxhxxiCGv30eEPYdHWgj0pwgja4Z/dauS1NlHBBAdOtG1ixV0+KgW4mP2RrA8aa epUinq7PydEAS9NoYSeSRaBeFjrZPCS+En6/2jyON5nmlgcnRFbTQWjnhRj5tNXPC/QKNBOd 55m+mZkolkF8wkx44bv+jQ8mmgtQGbrBFF9PAaPidPs4C3t7duIeW8zVXmqFH5lF1KmTsljf j79DhHbz3H5gg1UXFe+NYNVEC3rbTFYkdeuFnAOsWUbXl2B+yJ5KR899aKF5yz6pEWPcwjGk jKOx3wzbebkbVvvHX87BTQRXRVH+ARAAoOcKbTwX/+5hwyqgxF//jDo3eMwQUdXUdi5JkiRA dEmJAlAAAfL6IL03rcrKCViPD9W/hL8coa4uUTko5EXkVFLIvq2Npmlr26lGnE5Ae+L4KHn+ qtUUm5Mg9xjtUoukhYjBv6IDXuONcI1iC93tpTsHbNmqG3QXjRWwVs3cCflZLvpKqoC7cXYt 7bKcb/B7lAD3aYqo+plr6zlqSHKTigGIO64eu/TfcUAQxU+/wGfSv1wekHauvFgRumfPJxU0 s4VLUCtAN9huRuET3iqVRtQk1TayLyZDeryxVJhcMTs6qs2n/9s4aZHRBM1iPbFqZ5YXVF03 ySgCj0fXSZ40PY8tqjMSuowRUSA8979EBMi94j4MLGmBwwbp4P1RaNbvvSyYebr2nV+LPDqc oDEI3BpJDz5PCYJOoKZWc2vTWnCjjzufybhZfzRWfzALupdbKq5XkQwMXxlx40GBngpvXc9P yPp8XkbkeEjx4Z2LWU6SUuZmmzoTDzo7J9KA4X3Shdxjdev8xlhSOCooHre3yi1VfPkeuggn 3JYycrio1uJqGUE01XtKKqmqe0sPNgBA+YyV+QNLsDRzk/qTDvbfjq76onYllZTl5mTEN94B uTmS6vKbqg5wiL9usGzOM9MdLzZ2VEUd2y3FqoUMngNRzpotsTqICNFYTzu7mOr1ji8AEQEA AcLBZQQYAQgADwUCV0VR/gIbDAUJCWYBgAAKCRCrWB8dH2HFIh6ID/9s+rRqmUPJm95gMamc W2qvfXmB60xP+Pcbt9tiJEvHF9PdwfEaREH7DxDrq/URgBJ/EYhcDdKJgOzMzV8dGE/EbuO4 KgpEDwT6P8ZjEhEdGouyPYL9SX0nBoxigI7RCmk+4WJ8S4RNcI6guOgGYKSKo/CdGBQhlhK+ 2PoviUaWpy/pBzMwCr6V74qifu0VS2kneOUYOB5UzI/dOy7akFZl7U1Wk8gtJg+Vcvik+UPg T59MWQU+NVJt2ehllXccjC3ImApufu5Yq4GIFEZ/zmAYCdD4TzgfvknDFC4ibyKkddv+eJHd Vn2bWK24s8f/JekOdOboWEBRPJg1XuGVdiB2o79KOhx42/wxZrnG07+1sUyhcpszruLbGn6H 1sjcPL/ELVoicVB3VcguXw+t3ZrnPSnuwBBNkJsQbA4rcBxbYlHV9BINbaV3W7+7FBnhPMT3 7FZ/xDGcGKlOpQVkuNhP7Awa8DPqPbO63mjnrYhkCQe5ySvNdpMxHVd/j6TWg4XE/fJx+62X NFeLWXsl9tKrrYx0Eqbay7NpodCZ/YhijGi8im46VVXBUH+jA7GLm9D8+afmOCadJj6MQZh1 LO60K3XtOlvoG+1DpnQpb982/zPVmr66FyzD4wHDOtU76+fC7GwnbnoEZIUYnIrLom+qdbsP ZVTXbkoKWnXazv6EYQ==
  • Delivered-to: asciidoc-lang-dev@xxxxxxxxxxx
  • List-archive: <https://dev.eclipse.org/mailman/private/asciidoc-lang-dev/>
  • List-help: <mailto:asciidoc-lang-dev-request@eclipse.org?subject=help>
  • List-subscribe: <https://dev.eclipse.org/mailman/listinfo/asciidoc-lang-dev>, <mailto:asciidoc-lang-dev-request@eclipse.org?subject=subscribe>
  • List-unsubscribe: <https://dev.eclipse.org/mailman/options/asciidoc-lang-dev>, <mailto:asciidoc-lang-dev-request@eclipse.org?subject=unsubscribe>
  • Openpgp: preference=signencrypt
  • User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0 Thunderbird/60.9.0

This probably need a closer attention, but as first toughs:

On 03/03/2021 05:51, Lex Trotman wrote:
> Just a first touch on encoding, I suggest that the standard require
> encodings to be either explicitly specified by the document (BOM or
> :encoding: attribute) or explicitly specified on the command line or be
> assumed to be UTF-8.  If an implementation picks up encodings from the
> environment that will make documents break if processed in another
> environment, so that should be actively discouraged.
Given you can write a valid Asciidoc document using _only_ the 7 bit
ASCII set, I would be more restrictive here:

"An AsciiDoc document is a stream of Unicode code point. Any conforming
processor MUST accept UTF-8 encoded input streams. As an extension,
processors MAY accept other encodings."

> So should AsciiDoc specify that the input must be normalised, or require
> implementations to do so adding cost scanning the document?
We must agree to one of the Unicode Normalization Form ("NFC", "NFD",
"NFKC", or "NFKD" [1]). The specs must be written assuming that
normalization form will be used for internal processing. A processor may
use another internal representation as long as the observable behavior
in conforming with the specs.

[1]: https://unicode.org/reports/tr15/

> 
> But accents can be stacked, especially in some languages past the
> Latin-1 set, and there is no single code point assigned to many such
> multi-combinations, so normalisation won't help, they will still exist
> as multiple code points in AsciiDoc input.  And combining code points
> can be applied to non-letter code points too.
> [...]
> But for the closing `*` the preceding code point of an
> unnormalised accented character is the combining code point class M not
> class L. 
Assuming the NFD normalization form, it's easy to define a LETTER as:

LETTER := UNICODE_L_CLASS_CP (UNICODE_M_CLASS_CP)*

Actually, I wonder if we couldn't simplify that as:

LETTER := UNICODE_L_CLASS_CP | UNICODE_M_CLASS_CP


> This is quite precise, but may become unwieldy as the
> various situations markups can be used in are mixed, ie nested quotes
> like _foo *blah*_ has the * between the _ and the letter.
Assuming a LL parser, "*blah*" was already recognized as an
"inline-strong" non-terminal when "_" is processed. So the rule won't
care about preceding character code:

inline := inline-emphasis | inline-strong | ANY_LETTER_OR_MARK
inline-emphasis := <???> "_" inline (ANY_SPACE* inline)* "_" <???>
inline-strong := <???> "*" inline (ANY_SPACE* inline)* "*" <???>

In my personal implementation, I defined "<???>" as
"not(ANY_LETTER_OR_MARK)". But I'm pretty sure it isn't compliant with
the actual Asciidoctor behavior.


> 
> This may all seem low level and picky, but its my experience that if
> Unicode's peculiarities are not addressed up front they will haunt you
> forever.  And that is of course particularly true of something that
> involves human language handling as AsciiDoc does.
I agree. This is something that has to be specified upfront.

- Sylvain



Attachment: signature.asc
Description: OpenPGP digital signature


Back to the top