test/utf8.txt
author David Ludwig <dludwig@pobox.com>
Wed, 25 Dec 2013 21:39:48 -0500
changeset 8563 c0e68f3b6bbb
parent 1518 4d711949cd9a
permissions -rw-r--r--
WinRT: compiled the d3d11 renderer's shaders into SDL itself

Previously, the shaders would get compiled separately, the output of which would need to be packaged into the app. This change should make SDL's dll be the only binary needed to include SDL in a WinRT app.
slouken@1501
     1
UTF-8 decoder capability and stress test
slouken@1501
     2
----------------------------------------
slouken@1501
     3
slouken@1501
     4
Markus Kuhn <http://www.cl.cam.ac.uk/~mgk25/> - 2003-02-19
slouken@1501
     5
slouken@1501
     6
This test file can help you examine how your UTF-8 decoder handles
slouken@1501
     7
various types of correct, malformed, or otherwise interesting UTF-8
slouken@1501
     8
sequences. This file is not meant to be a conformance test. It does
slouken@1501
     9
not prescribe any particular outcome, and therefore there is no way to
slouken@1501
    10
"pass" or "fail" this test file, even though the text suggests a
slouken@1501
    11
preferable decoder behaviour at some places. The aim is instead to
slouken@1501
    12
help you think about and test the behaviour of your UTF-8 decoder on a
slouken@1501
    13
systematic collection of unusual inputs. Experience so far suggests
slouken@1501
    14
that most first-time authors of UTF-8 decoders find at least one
slouken@1501
    15
serious problem in their decoder by using this file.
slouken@1501
    16
slouken@1501
    17
The test lines below cover boundary conditions, malformed UTF-8
slouken@1501
    18
sequences as well as correctly encoded UTF-8 sequences of Unicode code
slouken@1501
    19
points that should never occur in a correct UTF-8 file.
slouken@1501
    20
slouken@1501
    21
According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
slouken@1501
    22
receiving UTF-8 shall interpret a "malformed sequence in the same way
slouken@1501
    23
that it interprets a character that is outside the adopted subset" and
slouken@1501
    24
"characters that are not within the adopted subset shall be indicated
slouken@1501
    25
to the user" by a receiving device. A quite commonly used approach in
slouken@1501
    26
UTF-8 decoders is to replace any malformed UTF-8 sequence by a
slouken@1501
    27
replacement character (U+FFFD), which looks a bit like an inverted
slouken@1501
    28
question mark, or a similar symbol. It might be a good idea to
slouken@1501
    29
visually distinguish a malformed UTF-8 sequence from a correctly
slouken@1501
    30
encoded Unicode character that is just not available in the current
slouken@1501
    31
font but otherwise fully legal, even though ISO 10646-1 doesn't
slouken@1501
    32
mandate this. In any case, just ignoring malformed sequences or
slouken@1501
    33
unavailable characters does not conform to ISO 10646, will make
slouken@1501
    34
debugging more difficult, and can lead to user confusion.
slouken@1501
    35
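As a rough illustration of the difference between signalling and ignoring
malformed input, the sketch below uses Python's built-in UTF-8 codec. That
choice is an assumption of this example (the test file itself is
decoder-agnostic), and the two helper names are hypothetical.

```python
def decode_signalling(raw: bytes) -> str:
    """Replace each malformed part with U+FFFD so the problem stays visible."""
    return raw.decode("utf-8", errors="replace")


def decode_ignoring(raw: bytes) -> str:
    """Silently drop malformed bytes (the behaviour the text warns against)."""
    return raw.decode("utf-8", errors="ignore")


sample = b'"\xff"'  # 0xff can never appear in correct UTF-8

# ascii() escapes non-ASCII characters, so U+FFFD shows up as \ufffd.
print(ascii(decode_signalling(sample)))
print(ascii(decode_ignoring(sample)))
```

With the ignoring variant the malformed byte vanishes without a trace, which
is exactly what makes debugging harder.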
slouken@1501
    36
Please check whether a malformed UTF-8 sequence is (1) represented at
slouken@1501
    37
all, (2) represented by exactly one single replacement character (or
slouken@1501
    38
equivalent signal), and (3) the following quotation mark after an
slouken@1501
    39
illegal UTF-8 sequence is correctly displayed, i.e. proper
slouken@1501
    40
resynchronization takes place immediately after any malformed
slouken@1501
    41
sequence. This file says "THE END" in the last line, so if you don't
slouken@1501
    42
see that, your decoder crashed somehow before, which should always be
slouken@1501
    43
cause for concern.
slouken@1501
    44
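The three checks above can be sketched as a small hand-written decoder. This
is a minimal sketch, not any particular library's implementation. It assumes
the notion of malformed sequence used in this file (a start byte plus
whatever continuation bytes follow it, up to the announced length), and the
names expected_length and decode_one_per_sequence are made up for the
example.

```python
REPLACEMENT = "\ufffd"

# Smallest code point that legitimately needs a sequence of the given length.
MIN_VALUE = {2: 0x80, 3: 0x800, 4: 0x10000}


def expected_length(first: int) -> int:
    """Length announced by a start byte (original 1-6 byte scheme), else 0."""
    if first < 0x80:
        return 1
    if 0xC0 <= first <= 0xDF:
        return 2
    if 0xE0 <= first <= 0xEF:
        return 3
    if 0xF0 <= first <= 0xF7:
        return 4
    if 0xF8 <= first <= 0xFB:
        return 5
    if 0xFC <= first <= 0xFD:
        return 6
    return 0  # stray continuation byte (0x80-0xbf) or 0xfe / 0xff


def decode_one_per_sequence(data: bytes) -> str:
    """Decode UTF-8, emitting exactly one U+FFFD per malformed sequence."""
    out = []
    i = 0
    while i < len(data):
        first = data[i]
        n = expected_length(first)
        i += 1
        if n == 0:
            out.append(REPLACEMENT)
            continue
        if n == 1:
            out.append(chr(first))
            continue
        value = first & (0xFF >> (n + 1))  # payload bits of the start byte
        seen = 1
        # Consume continuation bytes only; stopping at the first other byte
        # is what gives resynchronization right after the bad sequence.
        while seen < n and i < len(data) and data[i] & 0xC0 == 0x80:
            value = (value << 6) | (data[i] & 0x3F)
            i += 1
            seen += 1
        valid = (
            seen == n
            and n <= 4                           # 5/6-byte forms are illegal
            and value >= MIN_VALUE[n]            # reject overlong forms
            and value <= 0x10FFFF
            and not 0xD800 <= value <= 0xDFFF    # reject surrogates
        )
        out.append(chr(value) if valid else REPLACEMENT)
    return "".join(out)


# The quotation mark after the overlong slash (c0 af) survives intact.
print(ascii(decode_one_per_sequence(b'"\xc0\xaf"')))
```

The sketch deliberately accepts U+FFFE and U+FFFF; whether to reject them is
a separate policy decision.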
slouken@1501
    45
All lines in this file are exactly 79 characters long (plus the line
slouken@1501
    46
feed). In addition, all lines end with "|", except for the two test
slouken@1501
    47
lines 2.1.1 and 2.2.1, which contain non-printable ASCII controls
slouken@1501
    48
U+0000 and U+007F. If you display this file with a fixed-width font,
slouken@1501
    49
these "|" characters should all line up in column 79 (right margin).
slouken@1501
    50
This allows you to test quickly whether your UTF-8 decoder finds the
slouken@1501
    51
correct number of characters in every line, that is, whether each
slouken@1501
    52
malformed sequence is replaced by a single replacement character.
slouken@1501
    53
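The column check can also be automated. The following is a minimal sketch
that assumes Python's built-in codec with errors="replace"; the function
name misaligned_lines is hypothetical. It counts code points, not display
columns, and it splits on line feeds only so that other control characters
in the test data are not mistaken for line breaks.

```python
def misaligned_lines(data: bytes, width: int = 79):
    """Return (line_number, length) for "|"-terminated lines of wrong length."""
    text = data.decode("utf-8", errors="replace")
    bad = []
    for number, line in enumerate(text.split("\n"), start=1):
        if line.endswith("|") and len(line) != width:
            bad.append((number, len(line)))
    return bad


good = b" " * 78 + b"|"
# One stray continuation byte becomes one replacement character: aligned.
stray = b'"\x80"' + b" " * 75 + b"|"
# Python's codec flags each byte of this 5-byte form separately: misaligned.
five = b'"\xf8\x88\x80\x80\x80"' + b" " * 75 + b"|"

print(misaligned_lines(b"\n".join([good, stray, five])))
```

A decoder that reports several replacement characters for one malformed
sequence will show up in this list; that is a difference in counting
strategy, not necessarily a bug.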
slouken@1501
    54
Note that as an alternative to the notion of malformed sequence used
slouken@1501
    55
here, it is also a perfectly acceptable (and in some situations even
slouken@1501
    56
preferable) solution to represent each individual byte of a malformed
slouken@1501
    57
sequence by a replacement character. If you follow this strategy in
slouken@1501
    58
your decoder, then please ignore the "|" column.
slouken@1501
    59
slouken@1501
    60
slouken@1501
    61
Here come the tests:                                                          |
slouken@1501
    62
                                                                              |
slouken@1501
    63
1  Some correct UTF-8 text                                                    |
slouken@1501
    64
                                                                              |
slouken@1518
    65
(The codepoints for this test are:                                            |
slouken@1518
    66
  U+03BA U+1F79 U+03C3 U+03BC U+03B5  --ryan.)                                |
slouken@1518
    67
                                                                              |
slouken@1501
    68
You should see the Greek word 'kosme':       "κόσμε"                          |
slouken@1501
    69
                                                                              |
slouken@1518
    70
                                                                              |
slouken@1501
    71
2  Boundary condition test cases                                              |
slouken@1501
    72
                                                                              |
slouken@1501
    73
2.1  First possible sequence of a certain length                              |
slouken@1501
    74
                                                                              |
slouken@1518
    75
(byte zero skipped...there's a null added at the end of the test. --ryan.)    |
slouken@1518
    76
                                                                              |
slouken@1501
    77
2.1.2  2 bytes (U-00000080):        "€"                                       |
slouken@1501
    78
2.1.3  3 bytes (U-00000800):        "ࠀ"                                       |
slouken@1501
    79
2.1.4  4 bytes (U-00010000):        "𐀀"                                       |
slouken@1518
    80
                                                                              |
slouken@1518
    81
(5 and 6 byte sequences were made illegal in rfc3629. --ryan.)                |
slouken@1501
    82
2.1.5  5 bytes (U-00200000):        ""                                       |
slouken@1501
    83
2.1.6  6 bytes (U-04000000):        ""                                       |
slouken@1501
    84
                                                                              |
slouken@1501
    85
2.2  Last possible sequence of a certain length                               |
slouken@1501
    86
                                                                              |
slouken@1518
    87
2.2.1  1 byte  (U-0000007F):        ""                                       |
slouken@1501
    88
2.2.2  2 bytes (U-000007FF):        "߿"                                       |
slouken@1518
    89
                                                                              |
slouken@1518
    90
(Section 5.3.2 below calls this illegal. --ryan.)                             |
slouken@1501
    91
2.2.3  3 bytes (U-0000FFFF):        "￿"                                       |
slouken@1518
    92
                                                                              |
slouken@1518
    93
(5 and 6 byte sequences, and 4 byte sequences > 0x10FFFF were made illegal    |
slouken@1518
    94
 in rfc3629, so these next three should be replaced with an invalid           |
slouken@1518
    95
 character codepoint. --ryan.)                                                |
slouken@1501
    96
2.2.4  4 bytes (U-001FFFFF):        ""                                       |
slouken@1501
    97
2.2.5  5 bytes (U-03FFFFFF):        ""                                       |
slouken@1501
    98
2.2.6  6 bytes (U-7FFFFFFF):        ""                                       |
slouken@1501
    99
                                                                              |
slouken@1501
   100
2.3  Other boundary conditions                                                |
slouken@1501
   101
                                                                              |
slouken@1501
   102
2.3.1  U-0000D7FF = ed 9f bf = "퟿"                                            |
slouken@1501
   103
2.3.2  U-0000E000 = ee 80 80 = ""                                            |
slouken@1501
   104
2.3.3  U-0000FFFD = ef bf bd = "�"                                            |
slouken@1501
   105
2.3.4  U-0010FFFF = f4 8f bf bf = "􏿿"                                         |
slouken@1518
   106
                                                                              |
slouken@1518
   107
(This one is bogus in rfc3629. --ryan.)                                       |
slouken@1501
   108
2.3.5  U-00110000 = f4 90 80 80 = ""                                         |
slouken@1501
   109
                                                                              |
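The byte patterns quoted in section 2 can be cross-checked against an
existing encoder. This sketch assumes Python 3.8 or later (for the separator
argument of bytes.hex) and its built-in UTF-8 codec, which follows the
rfc3629 limits; encode_hex and is_rejected are hypothetical helper names.

```python
def encode_hex(code_point: int) -> str:
    """UTF-8 encoding of a code point as space-separated hex bytes."""
    return chr(code_point).encode("utf-8").hex(" ")


def is_rejected(hex_bytes: str) -> bool:
    """True if a strict decoder refuses the given byte sequence."""
    try:
        bytes.fromhex(hex_bytes).decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


for cp in (0x80, 0x7FF, 0x800, 0xD7FF, 0xE000, 0xFFFD, 0x10000, 0x10FFFF):
    print(f"U-{cp:08X} = {encode_hex(cp)}")

# 2.3.5: one past the last code point has no legal encoding.
print(is_rejected("f4 90 80 80"))
```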
slouken@1501
   110
3  Malformed sequences                                                        |
slouken@1501
   111
                                                                              |
slouken@1501
   112
3.1  Unexpected continuation bytes                                            |
slouken@1501
   113
                                                                              |
slouken@1501
   114
Each unexpected continuation byte should be separately signalled as a         |
slouken@1501
   115
malformed sequence of its own.                                                |
slouken@1501
   116
                                                                              |
slouken@1501
   117
3.1.1  First continuation byte 0x80: ""                                      |
slouken@1501
   118
3.1.2  Last  continuation byte 0xbf: ""                                      |
slouken@1501
   119
                                                                              |
slouken@1501
   120
3.1.3  2 continuation bytes: ""                                             |
slouken@1501
   121
3.1.4  3 continuation bytes: ""                                            |
slouken@1501
   122
3.1.5  4 continuation bytes: ""                                           |
slouken@1501
   123
3.1.6  5 continuation bytes: ""                                          |
slouken@1501
   124
3.1.7  6 continuation bytes: ""                                         |
slouken@1501
   125
3.1.8  7 continuation bytes: ""                                        |
slouken@1501
   126
                                                                              |
slouken@1501
   127
3.1.9  Sequence of all 64 possible continuation bytes (0x80-0xbf):            |
slouken@1501
   128
                                                                              |
slouken@1501
   129
   "                                                          |
slouken@1501
   130
                                                              |
slouken@1501
   131
                                                              |
slouken@1501
   132
    "                                                         |
slouken@1501
   133
                                                                              |
slouken@1501
   134
3.2  Lonely start characters                                                  |
slouken@1501
   135
                                                                              |
slouken@1501
   136
3.2.1  All 32 first bytes of 2-byte sequences (0xc0-0xdf),                    |
slouken@1501
   137
       each followed by a space character:                                    |
slouken@1501
   138
                                                                              |
slouken@1501
   139
   "                                                          |
slouken@1501
   140
                    "                                         |
slouken@1501
   141
                                                                              |
slouken@1501
   142
3.2.2  All 16 first bytes of 3-byte sequences (0xe0-0xef),                    |
slouken@1501
   143
       each followed by a space character:                                    |
slouken@1501
   144
                                                                              |
slouken@1501
   145
   "                "                                         |
slouken@1501
   146
                                                                              |
slouken@1501
   147
3.2.3  All 8 first bytes of 4-byte sequences (0xf0-0xf7),                     |
slouken@1501
   148
       each followed by a space character:                                    |
slouken@1501
   149
                                                                              |
slouken@1501
   150
   "        "                                                         |
slouken@1501
   151
                                                                              |
slouken@1501
   152
3.2.4  All 4 first bytes of 5-byte sequences (0xf8-0xfb),                     |
slouken@1501
   153
       each followed by a space character:                                    |
slouken@1501
   154
                                                                              |
slouken@1501
   155
   "    "                                                                 |
slouken@1501
   156
                                                                              |
slouken@1501
   157
3.2.5  All 2 first bytes of 6-byte sequences (0xfc-0xfd),                     |
slouken@1501
   158
       each followed by a space character:                                    |
slouken@1501
   159
                                                                              |
slouken@1501
   160
   "  "                                                                     |
slouken@1501
   161
                                                                              |
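The inputs for 3.1.9 and 3.2 are easy to regenerate, which is useful when
the raw bytes get damaged in transit. The sketch below is one way to do it,
assuming Python's built-in codec as the decoder under test; lonely_starts is
a hypothetical helper name.

```python
def lonely_starts(low: int, high: int) -> bytes:
    """Every start byte in [low, high], each followed by a space (0x20)."""
    return b"".join(bytes([b, 0x20]) for b in range(low, high + 1))


# 3.1.9: all 64 continuation bytes, none of them preceded by a start byte.
all_continuations = bytes(range(0x80, 0xC0))
decoded = all_continuations.decode("utf-8", errors="replace")
print(len(decoded))

# 3.2.1: the 32 first bytes of 2-byte sequences, each followed by a space.
decoded = lonely_starts(0xC0, 0xDF).decode("utf-8", errors="replace")
print(decoded.count(" "))
```

If the spaces survive decoding, the decoder resynchronized after every
lonely start byte instead of swallowing the character that followed it.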
slouken@1501
   162
3.3  Sequences with last continuation byte missing                            |
slouken@1501
   163
                                                                              |
slouken@1501
   164
All bytes of an incomplete sequence should be signalled as a single           |
slouken@1501
   165
malformed sequence, i.e., you should see only a single replacement            |
slouken@1501
   166
character in each of the next 10 tests. (Characters as in section 2)          |
slouken@1501
   167
                                                                              |
slouken@1501
   168
3.3.1  2-byte sequence with last byte missing (U+0000):     ""               |
slouken@1501
   169
3.3.2  3-byte sequence with last byte missing (U+0000):     ""               |
slouken@1501
   170
3.3.3  4-byte sequence with last byte missing (U+0000):     ""               |
slouken@1501
   171
3.3.4  5-byte sequence with last byte missing (U+0000):     ""               |
slouken@1501
   172
3.3.5  6-byte sequence with last byte missing (U+0000):     ""               |
slouken@1501
   173
3.3.6  2-byte sequence with last byte missing (U-000007FF): ""               |
slouken@1501
   174
3.3.7  3-byte sequence with last byte missing (U-0000FFFF): ""               |
slouken@1501
   175
3.3.8  4-byte sequence with last byte missing (U-001FFFFF): ""               |
slouken@1501
   176
3.3.9  5-byte sequence with last byte missing (U-03FFFFFF): ""               |
slouken@1501
   177
3.3.10 6-byte sequence with last byte missing (U-7FFFFFFF): ""               |
slouken@1501
   178
                                                                              |
slouken@1501
   179
3.4  Concatenation of incomplete sequences                                    |
slouken@1501
   180
                                                                              |
slouken@1501
   181
With all 10 sequences of 3.3 concatenated, you should see 10 malformed        |
slouken@1501
   182
sequences being signalled:                                                    |
slouken@1501
   183
                                                                              |
slouken@1501
   184
   ""                                                               |
slouken@1501
   185
                                                                              |
slouken@1501
   186
3.5  Impossible bytes                                                         |
slouken@1501
   187
                                                                              |
slouken@1501
   188
The following two bytes cannot appear in a correct UTF-8 string               |
slouken@1501
   189
                                                                              |
slouken@1501
   190
3.5.1  fe = ""                                                               |
slouken@1501
   191
3.5.2  ff = ""                                                               |
slouken@1501
   192
3.5.3  fe fe ff ff = ""                                                   |
slouken@1501
   193
                                                                              |
slouken@1501
   194
4  Overlong sequences                                                         |
slouken@1501
   195
                                                                              |
slouken@1501
   196
The following sequences are not malformed according to the letter of          |
slouken@1501
   197
the Unicode 2.0 standard. However, they are longer than necessary and         |
slouken@1501
   198
a correct UTF-8 encoder is not allowed to produce them. A "safe UTF-8         |
slouken@1501
   199
decoder" should reject them just like malformed sequences for two             |
slouken@1501
   200
reasons: (1) It helps to debug applications if overlong sequences are         |
slouken@1501
   201
not treated as valid representations of characters, because this helps        |
slouken@1501
   202
to spot problems more quickly. (2) Overlong sequences provide                 |
slouken@1501
   203
alternative representations of characters that could maliciously be           |
slouken@1501
   204
used to bypass filters that check only for ASCII characters. For              |
slouken@1501
   205
instance, a 2-byte encoded line feed (LF) would not be caught by a            |
slouken@1501
   206
line counter that counts only 0x0a bytes, but it would still be               |
slouken@1501
   207
processed as a line feed by an unsafe UTF-8 decoder later in the              |
slouken@1501
   208
pipeline. From a security point of view, ASCII compatibility of UTF-8         |
slouken@1501
   209
sequences also means that ASCII characters are *only* allowed to be           |
slouken@1501
   210
represented by ASCII bytes in the range 0x00-0x7f. To ensure this             |
slouken@1501
   211
aspect of ASCII compatibility, use only "safe UTF-8 decoders" that            |
slouken@1501
   212
reject overlong UTF-8 sequences for which a shorter encoding exists.          |
slouken@1501
   213
                                                                              |
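To make the idea of an overlong sequence concrete, the sketch below decodes
a single sequence the way an unsafe decoder would and then compares its
length with the shortest encoding of the resulting value. It is only an
illustration under the original 1-6 byte scheme; decode_lenient,
minimal_length and is_overlong are hypothetical names.

```python
def decode_lenient(seq: bytes) -> int:
    """Decode one sequence of up to 6 bytes with no overlong check."""
    first = seq[0]
    if first < 0x80:
        return first
    # The number of leading 1 bits in the start byte is the sequence length.
    length = 0
    mask = 0x80
    while first & mask:
        length += 1
        mask >>= 1
    if not 2 <= length <= 6 or len(seq) != length:
        raise ValueError("not a complete multi-byte sequence")
    value = first & (mask - 1)
    for byte in seq[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


def minimal_length(code_point: int) -> int:
    """Shortest sequence length able to hold the given value."""
    for length, limit in enumerate((0x80, 0x800, 0x10000, 0x200000, 0x4000000), 1):
        if code_point < limit:
            return length
    return 6


def is_overlong(seq: bytes) -> bool:
    return len(seq) > minimal_length(decode_lenient(seq))


# A 2-byte line feed slips past a filter that looks only for 0x0a bytes.
sneaky = bytes.fromhex("c0 8a")
print(0x0A in sneaky, hex(decode_lenient(sneaky)), is_overlong(sneaky))
```

A safe decoder performs the equivalent of the is_overlong test before it
hands any character on to the rest of the pipeline.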
slouken@1501
   214
4.1  Examples of an overlong ASCII character                                  |
slouken@1501
   215
                                                                              |
slouken@1501
   216
With a safe UTF-8 decoder, all of the following five overlong                 |
slouken@1501
   217
representations of the ASCII character slash ("/") should be rejected         |
slouken@1501
   218
like a malformed UTF-8 sequence, for instance by substituting it with         |
slouken@1501
   219
a replacement character. If you see a slash below, you do not have a          |
slouken@1501
   220
safe UTF-8 decoder!                                                           |
slouken@1501
   221
                                                                              |
slouken@1501
   222
4.1.1 U+002F = c0 af             = ""                                        |
slouken@1501
   223
4.1.2 U+002F = e0 80 af          = ""                                        |
slouken@1501
   224
4.1.3 U+002F = f0 80 80 af       = ""                                        |
slouken@1501
   225
4.1.4 U+002F = f8 80 80 80 af    = ""                                        |
slouken@1501
   226
4.1.5 U+002F = fc 80 80 80 80 af = ""                                        |
slouken@1501
   227
                                                                              |
slouken@1501
   228
4.2  Maximum overlong sequences                                               |
slouken@1501
   229
                                                                              |
slouken@1501
   230
Below you see the highest Unicode value that still results in an              |
slouken@1501
   231
overlong sequence if represented with the given number of bytes. This         |
slouken@1501
   232
is a boundary test for safe UTF-8 decoders. All five characters should        |
slouken@1501
   233
be rejected like malformed UTF-8 sequences.                                   |
slouken@1501
   234
                                                                              |
slouken@1501
   235
4.2.1  U-0000007F = c1 bf             = ""                                   |
slouken@1501
   236
4.2.2  U-000007FF = e0 9f bf          = ""                                   |
slouken@1501
   237
4.2.3  U-0000FFFF = f0 8f bf bf       = ""                                   |
slouken@1501
   238
4.2.4  U-001FFFFF = f8 87 bf bf bf    = ""                                   |
slouken@1501
   239
4.2.5  U-03FFFFFF = fc 83 bf bf bf bf = ""                                   |
slouken@1501
   240
                                                                              |
slouken@1501
   241
4.3  Overlong representation of the NUL character                             |
slouken@1501
   242
                                                                              |
slouken@1501
   243
The following five sequences should also be rejected like malformed           |
slouken@1501
   244
UTF-8 sequences and should not be treated like the ASCII NUL                  |
slouken@1501
   245
character.                                                                    |
slouken@1501
   246
                                                                              |
slouken@1501
   247
4.3.1  U+0000 = c0 80             = ""                                       |
slouken@1501
   248
4.3.2  U+0000 = e0 80 80          = ""                                       |
slouken@1501
   249
4.3.3  U+0000 = f0 80 80 80       = ""                                       |
slouken@1501
   250
4.3.4  U+0000 = f8 80 80 80 80    = ""                                       |
slouken@1501
   251
4.3.5  U+0000 = fc 80 80 80 80 80 = ""                                       |
slouken@1501
   252
                                                                              |
slouken@1501
   253
5  Illegal code positions                                                     |
slouken@1501
   254
                                                                              |
slouken@1501
   255
The following UTF-8 sequences should be rejected like malformed               |
slouken@1501
   256
sequences, because they never represent valid ISO 10646 characters and        |
slouken@1501
   257
a UTF-8 decoder that accepts them might introduce security problems           |
slouken@1501
   258
comparable to overlong UTF-8 sequences.                                       |
slouken@1501
   259
                                                                              |
slouken@1501
   260
5.1 Single UTF-16 surrogates                                                  |
slouken@1501
   261
                                                                              |
slouken@1501
   262
5.1.1  U+D800 = ed a0 80 = ""                                                |
slouken@1501
   263
5.1.2  U+DB7F = ed ad bf = ""                                                |
slouken@1501
   264
5.1.3  U+DB80 = ed ae 80 = ""                                                |
slouken@1501
   265
5.1.4  U+DBFF = ed af bf = ""                                                |
slouken@1501
   266
5.1.5  U+DC00 = ed b0 80 = ""                                                |
slouken@1501
   267
5.1.6  U+DF80 = ed be 80 = ""                                                |
slouken@1501
   268
5.1.7  U+DFFF = ed bf bf = ""                                                |
slouken@1501
   269
                                                                              |
slouken@1501
   270
5.2 Paired UTF-16 surrogates                                                  |
slouken@1501
   271
                                                                              |
slouken@1501
   272
5.2.1  U+D800 U+DC00 = ed a0 80 ed b0 80 = ""                               |
slouken@1501
   273
5.2.2  U+D800 U+DFFF = ed a0 80 ed bf bf = ""                               |
slouken@1501
   274
5.2.3  U+DB7F U+DC00 = ed ad bf ed b0 80 = ""                               |
slouken@1501
   275
5.2.4  U+DB7F U+DFFF = ed ad bf ed bf bf = ""                               |
slouken@1501
   276
5.2.5  U+DB80 U+DC00 = ed ae 80 ed b0 80 = ""                               |
slouken@1501
   277
5.2.6  U+DB80 U+DFFF = ed ae 80 ed bf bf = ""                               |
slouken@1501
   278
5.2.7  U+DBFF U+DC00 = ed af bf ed b0 80 = ""                               |
slouken@1501
   279
5.2.8  U+DBFF U+DFFF = ed af bf ed bf bf = ""                               |
slouken@1501
   280
                                                                              |
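A short sketch of why encoded surrogates are a problem: the pair in 5.2.1
denotes the same character as a proper 4-byte sequence, so accepting it
creates a second spelling of that character. The example assumes Python's
built-in codec; combine_surrogates and strict_decode_fails are hypothetical
helper names.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Code point denoted by a UTF-16 surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def strict_decode_fails(raw: bytes) -> bool:
    """True if the strict UTF-8 codec refuses the input."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


pair = bytes.fromhex("ed a0 80 ed b0 80")  # 5.2.1: U+D800 U+DC00
code_point = combine_surrogates(0xD800, 0xDC00)

print(strict_decode_fails(pair))
print(f"U+{code_point:04X} = {chr(code_point).encode('utf-8').hex(' ')}")
```

Python offers a "surrogatepass" error handler that lets such sequences
through on purpose; it is an explicit opt-in, not the default.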
slouken@1501
   281
5.3 Other illegal code positions                                              |
slouken@1501
   282
                                                                              |
slouken@1501
   283
5.3.1  U+FFFE = ef bf be = "￾"                                                |
slouken@1501
   284
5.3.2  U+FFFF = ef bf bf = "￿"                                                |
slouken@1501
   285
                                                                              |
slouken@1501
   286
THE END                                                                       |
slouken@1518
   287