
Re: [asciidoc-lang-dev] Whitespace handling


>     My understanding at that time was a parser only has to deal with Ascii
>     characters (Formally, the C0 Controls and Basic Latin block [1]). So the
>     only spacing allowed between markups and user's content were \u0009 and
>     \u0020. That won't prevent any other spacing character inside the text,
>     at least up to the DOM.
> Agree that the spacing which is markup (like paragraph separators or
> after section and list markups) is only ASCII.

But, what about the spaces around constrained markups (strong, emphasis,
...)? May we safely assume only ASCII is used there too? I didn't find
an actual example, but I can imagine a word processor silently replacing
the space before a "*" or "_" by a non-breaking space (U+00A0) or a thin
space (U+2009). Is this something we should support for compatibility
with the existing document base?

Interesting question, since the spacing is context rather than part of the markup itself, just like the character on the other side.

Thinking about it, the non-spacing context character must be allowed to be any Unicode letter code point (as well as some defined code points), otherwise the markup cannot be used in some non-English languages. So I don't see why the spacing context should not likewise be any code point with the appropriate Unicode spacing property. If non-ASCII context is valid on one side, there is no reason it should not be valid on both sides.
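
To make that concrete, here is a rough Python sketch of a context test based on Unicode general categories. It is only an illustration: the function names are hypothetical, treating "tab or category Zs" as the spacing property is an assumption, and the real rules for constrained markup also allow punctuation and other characters as context.

```python
import unicodedata

def is_spacing_context(ch: str) -> bool:
    """True for a tab or any code point in the Space_Separator (Zs) category.

    Assumption: Zs (plus tab) is the "appropriate spacing property".
    """
    return ch == "\t" or unicodedata.category(ch) == "Zs"

def is_word_context(ch: str) -> bool:
    """True for any Unicode letter (general category starting with L)."""
    return unicodedata.category(ch).startswith("L")

def opens_constrained(text: str, i: int) -> bool:
    """Hypothetical, simplified check: can the mark at index i open a span?

    The mark must be at the start of the text or preceded by spacing,
    and must be followed by a letter.
    """
    before_ok = i == 0 or is_spacing_context(text[i - 1])
    after_ok = i + 1 < len(text) and is_word_context(text[i + 1])
    return before_ok and after_ok
```

With this test, a "*" preceded by U+00A0 or U+2009 opens a span exactly as one preceded by U+0020 would, and the letter after it can come from any script.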

>     Dan also reminded me at that time `asciidoctor` "normalizes" lines early
>     in the document processing to remove trailing spaces ([2]). Once again,
>     I understood "to remove trailing \u0009 and \u0020." That also means
>     trailing spaces will not make their way into the DOM.
> Since trailing spacing has no semantic meaning in standard Asciidoc that
> has no effect on Asciidoc.  But it would however impact extensions that
> assigned semantics to it [...]  So at least on
> literal blocks it should be left so that included content can be
> addressed by extensions unmodified.

I'm not sure we need a special case here. Since trailing spacing has no
effect in Asciidoc, it won't be a breaking change if we keep it up to
the DOM for all blocks. According to my experiments, it would also
slightly simplify the grammar for the inline parser.
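
For reference, the normalisation under discussion, read as "remove trailing \u0009 and \u0020" only, might look like the sketch below. The function name is hypothetical and this is not asciidoctor's actual code. The point is that it differs from a generic whitespace strip, which would also remove non-ASCII spacing.

```python
ASCII_SPACING = " \t"  # U+0020 and U+0009 only

def strip_trailing_ascii_spacing(line: str) -> str:
    """Remove trailing ASCII spaces and tabs, leaving other spacing intact."""
    return line.rstrip(ASCII_SPACING)

# A line ending in: space, NO-BREAK SPACE, space, tab
sample = "text \u00a0 \t"
ascii_only = strip_trailing_ascii_spacing(sample)  # keeps the NO-BREAK SPACE
generic = sample.rstrip()                          # strips all Unicode whitespace
```

Here `ascii_only` still ends with the space and U+00A0, while `generic` is just "text".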


>     We didn't discuss the case of the `\r`. AFAIK, asciidoctor only
>     recognizes `\n` as the line terminator. So, we might extend the
>     normalization process to remove `\r` at the end of a line.
> I would phrase it that line separators are \n and \r\n, not sure how
> much \r for old macs exists any more.  That way the \r is simply part of
> the line ending.  And the concept of multibyte line endings allows NEL
> or LS and PS to be added in future if it becomes necessary.

Earlier in this thread, we agreed that "end of line" should make its way
up to the DOM. Do you think we should normalize the internal
representation for the EOL so processors won't have to deal with its
actual encoding in the source document?

What I'm doing in my experimental fully RD parser is this: a Markup_text node has a list of children, one for each run of markup (e.g. bold, italic, or plain text), and an endline is simply one of those children. So it's a common entity when required, but its value is its original source text, as with all nodes parsed from the source. Both use cases are catered for, but this relies on full RD parsing, which may need to be a future change since it alters document parsing in non-backward-compatible ways.
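
A much simplified Python sketch of that idea, covering only plain text runs and endlines: the `Node` class and `split_runs` function are made-up names for illustration, not my parser's real structures, and only \n and \r\n are recognised as line endings.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

@dataclass
class Node:
    kind: str   # "text" or "endline"
    value: str  # the original source text of this node

# Line endings recognised in this sketch: \r\n first, then a lone \n.
_EOL = re.compile(r"\r\n|\n")

def split_runs(source: str) -> list[Node]:
    """Split source into text runs and endline nodes, keeping original text."""
    nodes: list[Node] = []
    pos = 0
    for match in _EOL.finditer(source):
        if match.start() > pos:
            nodes.append(Node("text", source[pos:match.start()]))
        nodes.append(Node("endline", match.group()))
        pos = match.end()
    if pos < len(source):
        nodes.append(Node("text", source[pos:]))
    return nodes
```

A consumer that only cares that a line ended can test `kind == "endline"`, while one that needs the exact source can read `value`. Joining every `value` in order reproduces the input unchanged.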

> What we do know is that sequences of spaces and newlines should not be visible in normal paragraph text in the output document. That I feel confident in saying is a requirement. If you can guarantee that contract, then for now I'd say it doesn't matter what goes on inside your parser.


One of the concepts I have been toying with is defining a set of common passes for the DOM that extensions and backends could call if they needed them.  Spacing normalisation is one of them.  That way these functions can be shared and give the same results rather than having the backends doing differing things, but HTML doing its own thing as well is the fly in the ointment.  I am not up to that stage yet, so I'm not convinced either way yet.
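
As an illustration of what such a shared pass might do for paragraph text, here is a minimal Python sketch. The function name is hypothetical, and collapsing only ASCII space, tab, \r and \n is an assumption; whether non-ASCII spacing should be collapsed too is exactly the open question above.

```python
import re

# Runs of ASCII space, tab, and line-ending characters.
_SPACING_RUN = re.compile(r"[ \t\r\n]+")

def normalise_spacing(text: str) -> str:
    """Collapse each run of ASCII spacing and line endings to one space."""
    return _SPACING_RUN.sub(" ", text)
```

For example, `normalise_spacing("one  two\r\nthree\n\tfour")` gives "one two three four", while a U+00A0 in the text is left alone.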

