Re: [asciidoc-lang-dev] Whitespace handling
> My understanding at that time was a parser only has to deal with Ascii
> characters (Formally, the C0 Controls and Basic Latin block). So the
> only spacing allowed between markups and user's content were \u0009 and
> \u0020. That won't prevent any other spacing character inside the text,
> at least up to the DOM.
> Agree that the spacing which is markup (like paragraph separators or
> after section and list markups) is only ASCII.
But, what about the spaces around constrained markups (strong, emphasis,
...)? May we safely assume only ASCII is used there too? I didn't find
an actual example, but I can imagine a word processor silently replacing
the space before a "*" or "_" with a non-breaking space (U+00A0) or a thin
space (U+2009). Is this something we should support for compatibility
with the existing document base?
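
To make the question concrete, here is a minimal Python sketch of the two options. It is not how any existing processor works: the rule is deliberately simplified (a mark counts as "preceded by spacing" if it is at the start of the line or follows a spacing character), and the names `ASCII_SPACING`, `EXTENDED_SPACING` and `preceded_by_spacing` are hypothetical.

```python
# Hypothetical spacing sets, for illustration only.
ASCII_SPACING = frozenset("\t ")                          # U+0009, U+0020
EXTENDED_SPACING = ASCII_SPACING | {"\u00a0", "\u2009"}   # + NBSP, thin space


def preceded_by_spacing(line, mark_index, spacing=ASCII_SPACING):
    """Simplified check: is the mark at mark_index at the start of the
    line or right after a character from the given spacing set?"""
    if mark_index == 0:
        return True
    return line[mark_index - 1] in spacing


plain = "some *strong* text"
pasted = "some\u00a0*strong* text"   # space silently replaced by U+00A0

ascii_plain = preceded_by_spacing(plain, plain.index("*"))
ascii_pasted = preceded_by_spacing(pasted, pasted.index("*"))
extended_pasted = preceded_by_spacing(
    pasted, pasted.index("*"), EXTENDED_SPACING
)
```

Under the ASCII-only set, the pasted line is not recognized; under the extended set, it is. Which behavior we want is exactly the open question.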
> Dan also reminded me at that time `asciidoctor` "normalizes" lines early
> in the document processing to remove trailing spaces. Once again,
> I understood "to remove trailing \u0009 and \u0020." That also means
> trailing spaces will not make their way into the DOM.
> Since trailing spacing has no semantic meaning in standard Asciidoc that
> has no effect on Asciidoc. But it would however impact extensions that
> assigned semantics to it [...] So at least on
> literal blocks it should be left so that included content can be
> addressed by extensions unmodified.
I'm not sure we need a special case here. Since trailing spacing has no
effect in Asciidoc, it won't be a breaking change if we keep it up to
the DOM for all blocks. According to my experiments, it would also
slightly simplify the grammar for the inline parser.
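
For reference, the normalization as I understand it from the quoted text (strip only trailing \u0009 and \u0020) could be sketched as below. This is an illustration in Python, not asciidoctor's code, and `strip_trailing_ascii_spacing` is a hypothetical name.

```python
ASCII_SPACING = "\t "  # U+0009 and U+0020 only


def strip_trailing_ascii_spacing(line):
    """Remove trailing tabs and spaces; leave any other character,
    including non-ASCII spacing such as U+00A0, in place."""
    return line.rstrip(ASCII_SPACING)


stripped = strip_trailing_ascii_spacing("literal text \t ")
kept_nbsp = strip_trailing_ascii_spacing("literal text\u00a0 ")
```

Keeping trailing spacing up to the DOM would simply mean skipping this step and letting each consumer decide whether to strip.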
> We didn't discuss the case of the `\r`. AFAIK, asciidoctor only
> recognizes `\n` as the line terminator. So, we might extend the
> normalization process to remove `\r` at the end of a line.
> I would phrase it that line separators are \n and \r\n, not sure how
> much \r for old macs exists any more. That way the \r is simply part of
> the line ending. And the concept of multibyte line endings allows NEL
> or LS and PS to be added in future if it becomes necessary.
Earlier in this thread, we agreed that "end of line" should make its way
up to the DOM. Do you think we should normalize the internal
representation for the EOL so processors won't have to deal with its
actual encoding in the source document?
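
As a rough sketch of what I mean by normalizing the internal representation, assuming the only recognized line separators are \n and \r\n as proposed above (the `normalize_eol` name is hypothetical):

```python
import re

# Recognized separators, per the proposal above: "\r\n" and "\n".
# "\r\n" is listed first so it is consumed as a single separator.
EOL = re.compile(r"\r\n|\n")


def normalize_eol(source):
    """Rewrite every recognized line separator as a single "\n".
    A lone "\r" is not a separator here and is left untouched."""
    return EOL.sub("\n", source)


normalized = normalize_eol("first\r\nsecond\nthird\rstill third")
```

Adding NEL, LS or PS later would only mean extending the `EOL` pattern. Note that Python's own `str.splitlines()` would not do for this, since it already splits on a lone \r, NEL, LS, PS and several other characters.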