[asciidoc-lang-dev] Text Markup, syntax and parsing thereof


From: Lex Trotman <exciidoc@xxxxxxxxx>
Date: Sun, 7 Mar 2021 10:58:44 +1000
Delivered-to: asciidoc-lang-dev@xxxxxxxxxxx
List-archive: <https://dev.eclipse.org/mailman/private/asciidoc-lang-dev/>
List-help: <mailto:asciidoc-lang-dev-request@eclipse.org?subject=help>
List-subscribe: <https://dev.eclipse.org/mailman/listinfo/asciidoc-lang-dev>, <mailto:asciidoc-lang-dev-request@eclipse.org?subject=subscribe>
List-unsubscribe: <https://dev.eclipse.org/mailman/options/asciidoc-lang-dev>, <mailto:asciidoc-lang-dev-request@eclipse.org?subject=unsubscribe>

There are a number of proposals to improve text markup, specifically constrained and unconstrained delimited text markup.

[Note] All AsciiDoc markup examples are enclosed in backquotes, which are not part of the example (see also 6.)

Here are the questions I am aware of, along with my thoughts; please add others.

1. Constrained markup uses a single character at each end as markup, so to avoid clashes with content these characters are only valid markup in constrained situations. The general intention is that they surround words or groups of words, with the opening before a word and the closing after a word. Start of word is defined as a space followed by a letter-like character, and end of word is defined as a letter-like character followed by a space or certain punctuation.

Thoughts, questions:

1.a. The constraints are context around the markup, not part of the token, so it does not violate the requirement that markup be ASCII for the context to allow any Unicode.

1.b. As proposed in another thread initial "letter like" could be defined as a Unicode class L code point and final "letter like" could be defined as Unicode class L code point followed by any number of Unicode class M (combining characters such as accents) code points.

1.c. Currently the punctuation allowed is (,;".?!), which covers common English punctuation but does not include any non-English punctuation.

1.d. So should common non-English punctuation be allowed and which?

1.e. Should all Unicode category P punctuation be allowed?

1.e. Should punctuation be allowed before the initial constrained markup?

1.f. Should only Unicode category Pi and Ps be allowed before and Pf and Pe and Po after?

1.g. What is "space" (here I'm talking in the context of constrained markup, there is another thread that addresses it more generally), eg Unicode category Zs (https://www.compart.com/en/unicode/category/Zs) and AsciiDoc line separators?

1.h. Or, since unconstrained markup is available, should the specification be conservative about what is allowed to bound constrained markup? The markup characters (*_`#~^) are uncommon in general English text, but tend to occur when talking about programming code and math, and I don't know how common they are in other languages. The rules are intended to minimise nuisance recognition of such use-cases as markup, so the more situations in which markup is allowed, the more nuisance occurrences are likely.
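
To make the boundary questions in 1.b. to 1.h. concrete, here is a minimal Python sketch of the open/close test for a constrained mark. It is not taken from any implementation. The function names are hypothetical, and it assumes "space" means Unicode category Zs plus tab and line ends, "letter like" means category L (with trailing category M code points before a closing mark), and the trailing punctuation is the current list from 1.c.

```python
import unicodedata

MARKS = "*_`#~^"
# Assumption: the trailing punctuation currently allowed (see 1.c.).
TRAILING_PUNCT = ',;".?!'


def is_space(ch):
    # Assumption: "space" is Unicode category Zs plus tab and line ends (see 1.g.).
    return ch in "\t\n\r" or unicodedata.category(ch) == "Zs"


def is_letter(ch):
    # Unicode general categories Lu, Ll, Lt, Lm, Lo.
    return unicodedata.category(ch).startswith("L")


def is_combining(ch):
    # Unicode general categories Mn, Mc, Me.
    return unicodedata.category(ch).startswith("M")


def can_open(text, i):
    """True if the mark at text[i] sits where constrained markup may open."""
    if text[i] not in MARKS:
        return False
    before_ok = i == 0 or is_space(text[i - 1])
    after_ok = i + 1 < len(text) and is_letter(text[i + 1])
    return before_ok and after_ok


def can_close(text, i):
    """True if the mark at text[i] sits where constrained markup may close."""
    if text[i] not in MARKS:
        return False
    # Skip back over combining marks to reach the base letter (see 1.b.).
    j = i - 1
    while j >= 0 and is_combining(text[j]):
        j -= 1
    before_ok = j >= 0 and is_letter(text[j])
    after_ok = (
        i + 1 == len(text)
        or is_space(text[i + 1])
        or text[i + 1] in TRAILING_PUNCT
    )
    return before_ok and after_ok
```

Under these assumptions `2 * 3` is not mistaken for markup, while a closing mark followed by an ideographic full stop is rejected, which is the situation 1.d. and 1.e. ask about. Swapping `TRAILING_PUNCT` for a test on category P would be one way to try out the alternative.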

2. Escaping of unwanted markup (see 1.h.): backslash? But if all ASCII punctuation is allowed in the context, that may interfere with the use of backslash for escaping.

3. Current implementations of AsciiDoc do no parsing of text markup, and it does not exist in the DOM. Instead, direct substitution in a specific order is used, meaning backend issues reach well forward into the implementation. Also, occurrences of markup characters in a legal context for which the matching open/close markup does not exist are currently silently left as text.

3.a. It is proposed that the specification deprecate this mechanism and move to a recursive definition, parsing the text markup into the DOM. But that definition will interpret overlaps in a different manner, so it isn't backward compatible. For example `*foo _blah* bletch_` could currently parse as `<bold>foo _blah</bold> bletch_` or `*foo <italic>blah* bletch</italic>` or the illegal `<bold>foo <italic>blah</bold> bletch</italic>`, depending on the order in which the markup is substituted and whether substitution ignores previous markup.

3.b. Recursive definition would define the parse based on the order of the markup in the source rather than some order in the implementation, and prevent recognition of overlaps so the above is always `<bold>foo _blah</bold> bletch_` since the bold opening is recognised first and there is no closing underscore inside the bold markup and there is no opening underscore outside it.

3.c. This also allows the option of warning about possibly unmatched markup (the underscores above), which is useful since it is easy for humans to miss a single character left in the text when proofreading.

3.d. A recursive definition allows nesting restrictions to be relaxed (see 5).

3.e. Parsing into the DOM allows the semantics to be separately defined for backends rather than as part of the language syntax.
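
The difference between ordered substitution (3.a.) and a source-order recursive parse (3.b.) can be sketched in Python as follows. This is only an illustration under simplifying assumptions: it handles just `*` and `_`, ignores the constrained-context rules from 1., closes each markup at the nearest matching character (so it does not cover the same-markup nesting discussed in 5.), and all names are hypothetical.

```python
import re

# Naive substitution passes, applied one after another over the whole text.
BOLD = (r"\*(.+?)\*", r"<bold>\1</bold>")
ITALIC = (r"_(.+?)_", r"<italic>\1</italic>")


def substitute(text, passes):
    """Apply each (pattern, replacement) pass in the given order."""
    for pattern, repl in passes:
        text = re.sub(pattern, repl, text)
    return text


MARKS = {"*": "bold", "_": "italic"}


def parse(text):
    """Parse in source order into a list of str and (tag, children) nodes."""
    nodes, buf, i = [], "", 0
    while i < len(text):
        ch = text[i]
        # Look for the closing mark only inside the text being parsed.
        close = text.find(ch, i + 1) if ch in MARKS else -1
        if close > i + 1:
            if buf:
                nodes.append(buf)
                buf = ""
            # Recurse on the enclosed text; marks outside it cannot match.
            nodes.append((MARKS[ch], parse(text[i + 1:close])))
            i = close + 1
        else:
            buf += ch  # unmatched mark or ordinary character stays as text
            i += 1
    if buf:
        nodes.append(buf)
    return nodes


def render(nodes):
    """Serialise the node list using the tag notation from 3.a."""
    return "".join(
        n if isinstance(n, str) else f"<{n[0]}>{render(n[1])}</{n[0]}>"
        for n in nodes
    )
```

With the example from 3.a., `substitute("*foo _blah* bletch_", [BOLD, ITALIC])` yields the overlapping `<bold>foo <italic>blah</bold> bletch</italic>`, whereas `render(parse("*foo _blah* bletch_"))` yields `<bold>foo _blah</bold> bletch_`. The underscores left in the text nodes are what a warning as in 3.c. could report.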

4. Attribute lists are currently allowed on highlight (#) markups only. Should they be allowed on other markups? The use-case is that currently only nesting of differing types of markup is allowed, so highlights don't nest and attributes cannot be specified on nested markup. Attributes on all markup would allow `[.arole]#foo [.brole]_blah_ footoo#` to be specified.

5. A recursive definition would allow nesting of the same markup:

5.a. So long as it's inside a different markup it can be recognised as nested, eg it is possible to allow `*foo _blah *footoo* blah_ foo*`. How useful that is depends on the backend, but for example in HTML I'm sure it's possible to use CSS to select and style the nested `footoo` as something other than just bold.

5.b. If attribute lists are allowed on markup other than highlight, then applying a role allows styling to be applied to nested markups even more easily.

5.c. An alternative or additional method of providing nesting is to recognise that an attribute list can be used to distinguish an opening markup from a closing markup, so nesting of the same markup becomes possible, eg `[.red]#foo [.green]#blah# foo#` could be parsed as `<highlight, class=red>foo <highlight, class=green>blah</highlight> foo</highlight>`. Attributes and nesting of different markups would also make it easier for humans to match opening and closing markups, eg `[.red]#lots and lots of text [.green]*lots and lots more text* so this text is far away from the opening markups#`
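
A minimal Python sketch of the idea in 5.c., for the `#` markup only: a `#` directly preceded by an attribute list always opens, and a bare `#` closes the innermost open markup, or opens one if none is open. The function names, the simplified attribute-list pattern (a single bracketed token with no whitespace), and the tag name `highlight` are my assumptions, not part of any proposal text; unclosed openers are not diagnosed.

```python
import re

# Assumption: an attribute list is a bracketed token with no whitespace,
# immediately followed by the mark, e.g. "[.red]#".
TOKEN = re.compile(r"(?:\[([^\]\s]+)\])?#")


def parse_highlights(text):
    """Return a tree of str and {"role", "children"} nodes for '#' markup."""
    root = {"role": None, "children": []}
    stack = [root]
    pos = 0
    for m in TOKEN.finditer(text):
        if m.start() > pos:
            stack[-1]["children"].append(text[pos:m.start()])
        attrs = m.group(1)
        if attrs is not None or len(stack) == 1:
            # An attribute list marks an opener; so does a bare '#' at top level.
            role = attrs.lstrip(".") if attrs is not None else None
            node = {"role": role, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            # A bare '#' inside an open markup closes the innermost one.
            stack.pop()
        pos = m.end()
    if pos < len(text):
        stack[-1]["children"].append(text[pos:])
    return root["children"]


def render(nodes):
    """Serialise using the `<tag, class=role>` notation used in this mail."""
    out = []
    for n in nodes:
        if isinstance(n, str):
            out.append(n)
        else:
            attr = f", class={n['role']}" if n["role"] else ""
            out.append(f"<highlight{attr}>{render(n['children'])}</highlight>")
    return "".join(out)
```

For example, `render(parse_highlights("[.red]#foo [.green]#blah# foo#"))` gives `<highlight, class=red>foo <highlight, class=green>blah</highlight> foo</highlight>`, while plain `#a# b` still gives `<highlight>a</highlight> b`.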

6. Changing the semantics of backquote ` markup. Currently the contents of this markup are parsed as normal marked-up text and monospace style is applied by default. I would propose that the content be parsed as literal text instead. This is because the intended use-case is programming code embedded in the text, and code often contains markup characters. The same problem arises when using backquotes to enclose embedded AsciiDoc, as I have done in this august tome. Having to escape all those would have been painful. As much as it pains me to admit it, I think Markdown is right in this case.
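
The proposed literal behaviour can be sketched in Python by splitting out backquoted spans first and applying other markup only to the text between them. This is a sketch under assumptions: bold stands in for all other markup, the constrained-context rules are ignored, and the `<code>` tag and function name are hypothetical.

```python
import re

CODE_SPAN = re.compile(r"`([^`]+)`")
BOLD = re.compile(r"\*(.+?)\*")


def render_inline(text):
    """Treat backquoted spans as literal; apply bold only outside them."""
    out = []
    # re.split with one capturing group alternates: text, code, text, ...
    for i, part in enumerate(CODE_SPAN.split(text)):
        if i % 2:
            out.append(f"<code>{part}</code>")  # literal, no further markup
        else:
            out.append(BOLD.sub(r"<bold>\1</bold>", part))
    return "".join(out)
```

Here `render_inline("call `a * b * c` *now*")` returns `call <code>a * b * c</code> <bold>now</bold>`, so the asterisks inside the backquotes need no escaping.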

That's enough for now, I'm going for a nap.

Cheers

Lex

