Re: [asciidoc-lang-dev] Whitespace handling
  • From: Sylvain Leroux <sylvain@xxxxxxxxxxx>
  • Date: Tue, 2 Mar 2021 19:47:52 +0100

On 01/03/2021 00:54, Lex Trotman wrote:
> 
> 
> On Sun, 28 Feb 2021 at 22:10, Sylvain Leroux <sylvain@xxxxxxxxxxx
> <mailto:sylvain@xxxxxxxxxxx>> wrote:
> 
>     I already discussed that idea with Dan elsewhere. I assume it's not a
>     problem if I quote him here: "A valid space character is a space, a tab,
>     or a line feed (aka newline). It's questionable whether a non-breaking
>     space should be allowed. But it definitely shouldn't extend beyond
>     that."
> 
>     My understanding at that time was a parser only has to deal with Ascii
>     characters (Formally, the C0 Controls and Basic Latin block [1]). So the
>     only spacing allowed between markups and user's content were \u0009 and
>     \u0020. That won't prevent any other spacing character inside the text,
>     at least up to the DOM.
> 
> 
> Agree that the spacing which is markup (like paragraph separators or
> after section and list markups) is only ASCII.

But what about the spaces around constrained markup (strong, emphasis,
...)? Can we safely assume only ASCII is used there too? I didn't find
an actual example, but I can imagine a word processor silently replacing
the space before a "*" or "_" with a non-breaking space (U+00A0) or a
thin space (U+2009). Is this something we should support for
compatibility with the existing document base?
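
To make the question concrete, here is a minimal Python sketch of the two
candidate rules. It is only an illustration: the names ASCII_SPACES,
EXTENDED_SPACES and opens_constrained are hypothetical, and the boundary
test is deliberately simplified (it looks only at the character before
the mark, and ignores the punctuation cases a real parser would handle).

```python
# Hypothetical sets, for illustration only.
# ASCII-only rule: space, tab, line feed.
ASCII_SPACES = {" ", "\t", "\n"}
# Extended rule: additionally accept NBSP (U+00A0) and thin space (U+2009).
EXTENDED_SPACES = ASCII_SPACES | {"\u00a0", "\u2009"}


def opens_constrained(text: str, i: int, spaces: set) -> bool:
    """Return True if the mark at text[i] may open a constrained span.

    Simplified assumption: the mark must sit at the start of the text
    or be preceded by a character from `spaces`.
    """
    return i == 0 or text[i - 1] in spaces


# A word processor replaced the space before "*" with a non-breaking space.
sample = "a\u00a0*bold*"
ascii_only = opens_constrained(sample, 2, ASCII_SPACES)
extended = opens_constrained(sample, 2, EXTENDED_SPACES)
```

Under the ASCII-only rule the "*" in the sample is not treated as an
opening mark; under the extended rule it is.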

> 
> Spacing that is recognised for output purposes, such as PDF
> justification, is different and is in that processor's purview to
> define what it supports IMO.

I agree.

> 
> 
>     Dan also reminded me at that time `asciidoctor` "normalizes" lines early
>     in the document processing to remove trailing spaces ([2]). Once again,
>     I understood "to remove trailing \u0009 and \u0020." That also means
>     trailing spaces will not make their way into the DOM.
> 
> 
> Since trailing spacing has no semantic meaning in standard Asciidoc that
> has no effect on Asciidoc.  But it would however impact extensions that
> assigned semantics to it [...]  So at least on
> literal blocks it should be left so that included content can be
> addressed by extensions unmodified.

I'm not sure we need a special case here. Since trailing spacing has no
effect in Asciidoc, it won't be a breaking change if we keep it up to
the DOM for all blocks. According to my experiments, it would also
slightly simplify the grammar for the inline parser.

> 
> 
>     We didn't discuss the case of the `\r`. AFAIK, asciidoctor only
>     recognizes `\n` as the line terminator. So, we might extend the
>     normalization process to remove `\r` at the end of a line.
> 
> 
> I would phrase it that line separators are \n and \r\n, not sure how
> much \r for old macs exists any more.  That way the \r is simply part of
> the line ending.  And the concept of multibyte line endings allows NEL
> or LS and PS to be added in future if it becomes necessary.

Earlier in this thread, we agreed that the "end of line" should make its
way up to the DOM. Do you think we should normalize the internal
representation of the EOL so processors won't have to deal with its
actual encoding in the source document?
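
As a rough sketch of what I mean by normalizing, assuming the only line
separators are "\r\n" and "\n" as you suggest (a lone "\r" is left
untouched, and trailing spaces are kept). The function names are
hypothetical and this is not how asciidoctor does it:

```python
import re

# Assumption: "\r\n" and "\n" are the only line separators.
# "\r\n" is listed first so the "\r" is consumed as part of the separator.
_EOL = re.compile(r"\r\n|\n")


def split_lines(source: str) -> list:
    """Split the source into lines, dropping the separator itself."""
    return _EOL.split(source)


def normalize_eol(source: str) -> str:
    """Re-join the lines with a single internal EOL representation."""
    return "\n".join(split_lines(source))


normalized = normalize_eol("first  \r\nsecond\nthird")
```

Downstream processors would then only ever see "\n", whatever the source
document used.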


Regards,
- Sylvain
