Re: [asciidoc-lang-dev] Text Markup, syntax and parsing thereof
- From: Sylvain Leroux <sylvain@xxxxxxxxxxx>
- Date: Mon, 8 Mar 2021 09:54:53 +0100
Hi Lex,
You brought up many great questions. I will answer each main topic
individually as my thinking advances (and _if_ I have something to say).
On 07/03/2021 01:58, Lex Trotman wrote:
> 1.a. The constraints are context around the markup, not part of the
> token, so it does not violate the requirement that markup be ASCII for
> the context to allow any Unicode.
Agreed.
>
> 1.b. As proposed in another thread initial "letter like" could be
> defined as a Unicode class L code point and final "letter like" could be
> defined as Unicode class L code point followed by any number of Unicode
> class M (combining characters such as accents) code points.
_Or_ as any Unicode class L code point or Unicode class M code point.
The current asciidoctor documentation explicitly states that "An
AsciiDoc processor always assumes the content is UTF-8 encoded" [1]. Are
stray Unicode class M code points forbidden by Unicode? Is this
something we should enforce?
[1]: https://docs.asciidoctor.org/asciidoc/latest/normalization/
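To make the difference between the two definitions of a final "letter
like" concrete, here is a minimal Python sketch. It relies only on the
standard `unicodedata` module; the function names are hypothetical and
are not part of any AsciiDoc implementation.

```python
import unicodedata


def is_letter(ch):
    # Unicode general categories Lu, Ll, Lt, Lm, Lo all start with "L".
    return unicodedata.category(ch).startswith("L")


def is_mark(ch):
    # Unicode general categories Mn, Mc, Me all start with "M".
    return unicodedata.category(ch).startswith("M")


def ends_letter_like_strict(text):
    """Quoted 1.b wording: a class L code point followed by any
    number of class M code points."""
    i = len(text)
    while i > 0 and is_mark(text[i - 1]):
        i -= 1  # skip trailing combining marks
    return i > 0 and is_letter(text[i - 1])


def ends_letter_like_loose(text):
    """Alternative wording: the last code point is class L or class M."""
    return bool(text) and (is_letter(text[-1]) or is_mark(text[-1]))
```

The two variants only disagree on a stray combining mark, that is, a
class M code point that does not follow a letter: the loose variant
accepts it, the strict one rejects it.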
>
> 1.c. currently the punctuation allowed is (,;".?!) which are common
> English punctuations but do not include any non-English punctuation.
>
> 1.d. So should common non-English punctuation be allowed and which?
>
> 1.e. Should all Unicode category P punctuation be allowed?
>
> 1.e. Should punctuation be allowed before the initial constrained
> markup?
>
> 1.f. Should only Unicode category Pi and Ps be allowed before and Pf and
> Pe and Po after?
>
> 1.g. What is "space" (here I'm talking in the context of constrained
> markup, there is another thread that addresses it more generally), eg
> Unicode category Zs (https://www.compart.com/en/unicode/category/Zs) and
> AsciiDoc line separators?
>
From an i18n perspective, allowing "Unicode category Zs | Unicode
category Ps" ([2], [3]) before an opening constrained markup seems
reasonable. We can hope this would not dramatically raise the number of
false positives in markup detection. I have no opinion on whether the
Unicode category Pi ([4]) should be included in that set.
The same reasoning applies to the closing markup, using the
complementary Unicode categories (Pe for Ps, Pf for Pi), of course.
[2]: https://www.compart.com/en/unicode/category/Zs
[3]: https://www.compart.com/en/unicode/category/Ps
[4]: https://www.compart.com/en/unicode/category/Pi
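The boundary rule discussed above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
function names and the `allow_pi` / `allow_pf` switches are hypothetical,
`None` stands for the start or end of the text, and line separators are
deliberately left out since they are discussed in another thread.

```python
import unicodedata


def may_open_after(prev, allow_pi=False):
    """True if an opening constrained mark may follow `prev`.

    `prev` is the character just before the mark, or None at the
    start of the text (assumed to be always acceptable).
    """
    if prev is None:
        return True
    allowed = {"Zs", "Ps"}
    if allow_pi:
        allowed.add("Pi")  # optional: initial quotation marks
    return unicodedata.category(prev) in allowed


def may_close_before(nxt, allow_pf=False):
    """Mirror rule for the closing mark, using the complementary
    categories Pe (and optionally Pf)."""
    if nxt is None:
        return True
    allowed = {"Zs", "Pe"}
    if allow_pf:
        allowed.add("Pf")  # optional: final quotation marks
    return unicodedata.category(nxt) in allowed
```

For instance, `may_open_after("(")` and `may_open_after("\u300c")` (a
CJK left corner bracket) are both accepted, while `may_open_after("a")`
is not.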
> 1.h. Or since unconstrained markup is available should the specification
> be conservative on what is allowed bounding unconstrained markup, the
> markups (*_`#~^) are uncommon in general English text, but tend to occur
> when talking about programming code and math, and I don't know how
> common they are in other languages. The rules are intended to minimise
> nuisance recognition of such use-cases as markup, so the more situations
> that markup is allowed the more nuisance occurrences are likely.
This is the conservative approach, aiming to maximize compatibility
with the existing implementation. But, even for Latin scripts, the
current implementation is not satisfactory. Think, for example, of the
inverted question and exclamation marks used in Spanish [5]. So
ultimately, this has to evolve.
Should we adopt option (1.h) for v1.x of the spec, knowing we would
evolve toward a more i18n-aware solution based on the Unicode
categories in v2.x? Or should we make the leap right now, assuming a
"good enough" compatibility with the existing document base?
[5]: https://en.wikipedia.org/wiki/Inverted_question_and_exclamation_marks
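As a quick check of how a category-based rule would treat the Spanish
case, the following Python snippet looks up the general category of the
inverted marks with the standard `unicodedata` module. The set
`OPENING_CONTEXT` is an assumption: it is the widest "before" set
mentioned in this thread (Zs, Ps, Pi), not a decided rule.

```python
import unicodedata

# Assumption: the widest set of categories discussed for the character
# preceding an opening constrained mark.
OPENING_CONTEXT = {"Zs", "Ps", "Pi"}


def category_report(chars):
    """Map each character to its Unicode general category."""
    return {ch: unicodedata.category(ch) for ch in chars}


report = category_report("\u00bf\u00a1(\u00ab")  # inverted ? and !, "(", left guillemet
not_covered = [ch for ch, cat in report.items() if cat not in OPENING_CONTEXT]
print(report)
print(not_covered)
```

Both inverted marks are in category Po, so a "before" set limited to Zs,
Ps and Pi would still not accept them in front of an opening mark; they
would need to be listed explicitly or covered by a wider category.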