From dae9422eafa0d3a170e34b7c059384467388eb1f Mon Sep 17 00:00:00 2001
From: Ed Page
Date: Thu, 4 Sep 2025 12:54:01 -0500
Subject: [PATCH 1/2] Create Whitespace grammar productions

This does not create any new productions, instead preferring comments.
rust-lang/reference#1974 will involve pulling out the horizontal
whitespace into a separate production.

Comment wording (and casing) is modeled off of
https://www.unicode.org/reports/tr31/#R3a.
I left off a "unicode" prefix for ASCII items as they are likely common
enough in that context that specifying them as "unicode" could cause
more confusion.
---
 src/input-format.md |  6 ------
 src/whitespace.md   | 39 ++++++++++++++++++++++++++-------------
 2 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/src/input-format.md b/src/input-format.md
index 5d2a692755..cf35b2959d 100644
--- a/src/input-format.md
+++ b/src/input-format.md
@@ -6,12 +6,6 @@ r[input.syntax]
 @root CHAR ->
 
 NUL -> U+0000
-
-TAB -> U+0009
-
-LF -> U+000A
-
-CR -> U+000D
 ```
 
 r[input.intro]
diff --git a/src/whitespace.md b/src/whitespace.md
index 45d58b3fa3..761a1428e1 100644
--- a/src/whitespace.md
+++ b/src/whitespace.md
@@ -1,21 +1,34 @@
 r[lex.whitespace]
 # Whitespace
 
+r[whitespace.syntax]
+```grammar,lexer
+@root WHITESPACE ->
+    // end of line
+      LF
+    | U+000B // vertical tabulation
+    | U+000C // form feed
+    | CR
+    | U+0085 // Unicode next line
+    | U+2028 // Unicode LINE SEPARATOR
+    | U+2029 // Unicode PARAGRAPH SEPARATOR
+    // Ignorable Code Point
+    | U+200E // Unicode LEFT-TO-RIGHT MARK
+    | U+200F // Unicode RIGHT-TO-LEFT MARK
+    // horizontal whitespace
+    | TAB
+    | U+0020 // space ' '
+
+TAB -> U+0009 // horizontal tab ('\t')
+
+LF -> U+000A // line feed ('\n')
+
+CR -> U+000D // carriage return ('\r')
+```
+
 r[lex.whitespace.intro]
 Whitespace is any non-empty string containing only characters that have the
-[`Pattern_White_Space`] Unicode property, namely:
-
-- `U+0009` (horizontal tab, `'\t'`)
-- `U+000A` (line feed, `'\n'`)
-- `U+000B` (vertical tab)
-- `U+000C` (form feed)
-- `U+000D` (carriage return, `'\r'`)
-- `U+0020` (space, `' '`)
-- `U+0085` (next line)
-- `U+200E` (left-to-right mark)
-- `U+200F` (right-to-left mark)
-- `U+2028` (line separator)
-- `U+2029` (paragraph separator)
+[`Pattern_White_Space`] Unicode property.
 
 r[lex.whitespace.token-sep]
 Rust is a "free-form" language, meaning that all forms of whitespace serve only

From 60eb145d7e826d3e8dc8a7051fd5b5e5913070b7 Mon Sep 17 00:00:00 2001
From: Eric Huss
Date: Tue, 23 Sep 2025 16:20:30 -0700
Subject: [PATCH 2/2] Reformat whitespace list for consistency

---
 src/whitespace.md | 37 +++++++++++++++++--------------------
 1 file changed, 17 insertions(+), 20 deletions(-)

diff --git a/src/whitespace.md b/src/whitespace.md
index 761a1428e1..b398d0c958 100644
--- a/src/whitespace.md
+++ b/src/whitespace.md
@@ -4,26 +4,23 @@ r[lex.whitespace]
 r[whitespace.syntax]
 ```grammar,lexer
 @root WHITESPACE ->
-    // end of line
-      LF
-    | U+000B // vertical tabulation
-    | U+000C // form feed
-    | CR
-    | U+0085 // Unicode next line
-    | U+2028 // Unicode LINE SEPARATOR
-    | U+2029 // Unicode PARAGRAPH SEPARATOR
-    // Ignorable Code Point
-    | U+200E // Unicode LEFT-TO-RIGHT MARK
-    | U+200F // Unicode RIGHT-TO-LEFT MARK
-    // horizontal whitespace
-    | TAB
-    | U+0020 // space ' '
-
-TAB -> U+0009 // horizontal tab ('\t')
-
-LF -> U+000A // line feed ('\n')
-
-CR -> U+000D // carriage return ('\r')
+      U+0009 // Horizontal tab, `'\t'`
+    | U+000A // Line feed, `'\n'`
+    | U+000B // Vertical tab
+    | U+000C // Form feed
+    | U+000D // Carriage return, `'\r'`
+    | U+0020 // Space, `' '`
+    | U+0085 // Next line
+    | U+200E // Left-to-right mark
+    | U+200F // Right-to-left mark
+    | U+2028 // Line separator
+    | U+2029 // Paragraph separator
+
+TAB -> U+0009 // Horizontal tab, `'\t'`
+
+LF -> U+000A // Line feed, `'\n'`
+
+CR -> U+000D // Carriage return, `'\r'`
 ```
 
 r[lex.whitespace.intro]
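
Reviewer note (not part of either patch): the `WHITESPACE` production that results from PATCH 2/2 can be cross-checked with a small Rust sketch. The helper names `is_pattern_white_space` and `is_whitespace_str` are hypothetical and chosen only for illustration; they are not taken from the Reference or from rustc. The sketch assumes the eleven code points listed in the grammar are the complete set, as the patch states.

```rust
/// Hypothetical helper: returns true for exactly the code points listed in
/// the WHITESPACE production of this patch (the Pattern_White_Space set).
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'   // Horizontal tab, '\t'
        | '\u{000A}' // Line feed, '\n'
        | '\u{000B}' // Vertical tab
        | '\u{000C}' // Form feed
        | '\u{000D}' // Carriage return, '\r'
        | '\u{0020}' // Space, ' '
        | '\u{0085}' // Next line
        | '\u{200E}' // Left-to-right mark
        | '\u{200F}' // Right-to-left mark
        | '\u{2028}' // Line separator
        | '\u{2029}' // Paragraph separator
    )
}

/// Hypothetical helper mirroring the prose rule: whitespace is any
/// non-empty string containing only Pattern_White_Space characters.
fn is_whitespace_str(s: &str) -> bool {
    !s.is_empty() && s.chars().all(is_pattern_white_space)
}
```

This set is not the same as the one tested by the standard library's `char::is_whitespace`, which follows the Unicode `White_Space` property: for example, U+00A0 (no-break space) is `White_Space` but is not in the list above, while U+200E (left-to-right mark) is in the list but is not `White_Space`.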