== Case Mapping

Some string-oriented methods use case mapping.

In String:

- String#capitalize
- String#capitalize!
- String#casecmp
- String#casecmp?
- String#downcase
- String#downcase!
- String#swapcase
- String#swapcase!
- String#upcase
- String#upcase!

In Symbol:

- Symbol#capitalize
- Symbol#casecmp
- Symbol#casecmp?
- Symbol#downcase
- Symbol#swapcase
- Symbol#upcase

=== Default Case Mapping

By default, all of these methods use full Unicode case mapping,
which is suitable for most languages.
See {Unicode Latin Case Chart}[https://www.unicode.org/charts/case].

Non-ASCII case mapping and folding are supported for UTF-8,
UTF-16BE/LE, UTF-32BE/LE, and ISO-8859-1~16 Strings/Symbols.
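As a minimal sketch of the encoding support described above, the same non-ASCII letter can be upcased in several of the listed encodings. The helper name +upcase_in+ is hypothetical, introduced only for this illustration; the result is transcoded back to UTF-8 so that it can be printed and compared.

```ruby
# Hypothetical helper: transcode a UTF-8 string to the named encoding,
# upcase it there, and transcode the result back to UTF-8.
def upcase_in(str, encoding_name)
  str.encode(encoding_name).upcase.encode('UTF-8')
end

# "\u00E9" is "é"; it exists in every encoding listed here.
%w[UTF-8 UTF-16LE UTF-32BE ISO-8859-1].each do |name|
  puts "#{name}: #{upcase_in("\u00E9", name)}"
end
```

Each line should show the same uppercase letter, since the mapping is applied in the string's own encoding.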

Context-dependent case mapping as described in
{Table 3-17 of the Unicode standard}[https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf]
is currently not supported.
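One of the context-dependent rules in the Unicode standard is Final_Sigma, under which a capital sigma at the end of a word lowercases to the final form "ς" (U+03C2). Because context is not taken into account, a sketch like the following is expected to produce the non-final "σ" (U+03C3) in every position. The sample word is chosen only for illustration.

```ruby
word  = "\u03A3\u0391\u03A3" # "ΣΑΣ"
lower = word.downcase

# Print each resulting code point so the two sigma forms can be told apart.
puts lower.codepoints.map { |cp| format('U+%04X', cp) }.join(' ')
```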

In most cases, a case conversion yields a string with the same number of characters as the original.
There are exceptions (see also +:fold+ below):

  s = "\u00DF" # => "ß"
  s.upcase     # => "SS"
  s = "\u0149" # => "ŉ"
  s.upcase     # => "ʼN"

Case mapping may also depend on locale (see also +:turkic+ below):

  s = "\u0049"        # => "I"
  s.downcase          # => "i" # Dot above.
  s.downcase(:turkic) # => "ı" # No dot above.

Case changes may not be reversible:

  s = 'Hello World!' # => "Hello World!"
  s.downcase         # => "hello world!"
  s.downcase.upcase  # => "HELLO WORLD!" # Different from original s.

Case-changing methods may not maintain Unicode normalization.
See String#unicode_normalize.
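A minimal sketch of the normalization issue, assuming NFC as the form of interest: a precomposed letter whose uppercase form is a multi-character sequence can leave NFC after upcasing, and String#unicode_normalize restores it. The helper +nfc?+ is hypothetical and exists only for this example.

```ruby
# Hypothetical helper: true if the string is already in NFC.
def nfc?(str)
  str == str.unicode_normalize(:nfc)
end

s  = "\u0390" # "ΐ", a single precomposed code point
up = s.upcase # Expands into a base letter followed by combining marks.

puts nfc?(s)
puts nfc?(up)
puts nfc?(up.unicode_normalize(:nfc))
```

Code that compares or stores case-converted text may therefore want to normalize after the conversion rather than before it.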

=== Options for Case Mapping

Except for +casecmp+ and +casecmp?+,
each of the case-mapping methods listed above
accepts optional arguments, <tt>*options</tt>.

The arguments may be:

- +:ascii+ only.
- +:fold+ only.
- +:turkic+ or +:lithuanian+ or both.
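The allowed combinations can be probed with a small sketch. The helper +options_accepted?+ is hypothetical, and the sketch assumes that a rejected combination raises ArgumentError; the option +:bogus+ is made up to show an unknown symbol.

```ruby
# Hypothetical helper: report whether String#downcase accepts the options.
def options_accepted?(*options)
  'abc'.downcase(*options)
  true
rescue ArgumentError
  false
end

[
  [:ascii], [:fold], [:turkic], [:turkic, :lithuanian],
  [:ascii, :fold], [:ascii, :turkic], [:bogus]
].each do |options|
  puts "#{options.inspect}: #{options_accepted?(*options)}"
end
```

String#downcase is used here because it is one of the methods that accept +:fold+.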

The options:

- +:ascii+:
  ASCII-only mapping:
  only the ASCII letters ('A'..'Z' and 'a'..'z') are case-mapped;
  other characters are not changed.

    s = "Foo \u00D8 \u00F8 Bar" # => "Foo Ø ø Bar"
    s.upcase                    # => "FOO Ø Ø BAR"
    s.downcase                  # => "foo ø ø bar"
    s.upcase(:ascii)            # => "FOO Ø ø BAR"
    s.downcase(:ascii)          # => "foo Ø ø bar"

- +:turkic+:
  Full Unicode case mapping, adapted for the Turkic languages
  that distinguish dotted and dotless I, for example Turkish and Azeri.

    s = 'Türkiye'       # => "Türkiye"
    s.upcase            # => "TÜRKIYE"
    s.upcase(:turkic)   # => "TÜRKİYE" # Dot above.

    s = 'TÜRKIYE'       # => "TÜRKIYE"
    s.downcase          # => "türkiye"
    s.downcase(:turkic) # => "türkıye" # No dot above.

- +:lithuanian+:
  Not yet implemented.

- +:fold+ (available only for String#downcase, String#downcase!,
  and Symbol#downcase):
  Unicode case folding,
  which is more far-reaching than Unicode case mapping.

    s = "\u00DF"      # => "ß"
    s.downcase        # => "ß"
    s.downcase(:fold) # => "ss"
    s.upcase          # => "SS"

    s = "\uFB04"      # => "ﬄ"
    s.downcase        # => "ﬄ"
    s.upcase          # => "FFL"
    s.downcase(:fold) # => "ffl"
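Case folding is mainly useful for caseless comparison. The following sketch compares two strings after folding; the helper +same_folded?+ is hypothetical. String#casecmp?, which its own documentation describes as comparing after Unicode case folding, is shown alongside for reference.

```ruby
# Hypothetical helper: compare two strings after Unicode case folding.
def same_folded?(a, b)
  a.downcase(:fold) == b.downcase(:fold)
end

a = "Stra\u00DFe" # "Straße"
b = 'STRASSE'

puts a.downcase == b.downcase # Plain lowercasing keeps "ß", so these differ.
puts same_folded?(a, b)
puts a.casecmp?(b)
```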