ruby.git/enc/unicode.c, branch v4.0.3

Avoid negative character

2025-10-31T11:49:59+00:00

Better fix for k-takata/Onigmo#107.

https://github.com/k-takata/Onigmo/commit/85393e4a63223b538529e7095255ce1153c09cff

Fix lgtm.com warnings

2025-10-31T11:49:59+00:00

* Multiplication result may overflow 'int' before it is converted to
  'OnigDistance'.
* Comparison is always true because code <= 122.
* This statement makes ExprStmt unreachable.
* Empty block without comment

https://github.com/k-takata/Onigmo/commit/387ad616c3cb9370f99d2b11198c2135fa07030f

Add Encoding::UNICODE_VERSION constant

2025-04-23T05:14:36+00:00

Suppress warnings by gcc 10.1.0-RC-20200430

2020-05-04T03:28:24+00:00

* Folding results should not be empty.

  If `OnigCodePointCount(to->n)` were 0, `for` loop using `fn`
  wouldn't execute and `ncs` elements are not initialized.

  ```
  enc/unicode.c:557:21: warning: 'ncs[0]' may be used uninitialized in this function [-Wmaybe-uninitialized]
    557 |  for (i = 0; i < ncs[0]; i++) {
        |                  ~~~^~~
  ```

* Cast to `enum yytokentype`

  Additional enums for scanner events by ripper are not included
  in `yytokentype`.

  ```
  ripper.y:7274:28: warning: implicit conversion from 'enum ' to 'enum yytokentype' [-Wenum-conversion]
  ```

implement special behavior for Georgian for String#capitalize

2018-12-09T23:14:29+00:00

The modern Georgian script is special in that it has an 'uppercase'
variant called MTAVRULI which can be used for emphasis of whole words,
for screamy headlines, and so on. However, in contrast to all other
bicameral scripts, there is no usage of capitalizing the first letter
in a word or a sentence. Words with mixed capitalization are not used
at all.

We therefore implement special behavior for String#capitalize. Formally,
we define String#capitalize as first applying String#downcase for the
whole string, then using titlecase on the first letter. Because Georgian
defines titlecase as the identity function both for MTAVRULI ('uppercase')
and Mkhedruli (lowercase), this results in String#capitalize being
equivalent to String#downcase for Georgian. This avoids undesirable
mixed case.

* enc/unicode.c: Actual implementation

* string.c: Add mention of this special case for documentation

* test/ruby/enc/test_case_mapping.rb: Add two tests, a general one
  that uses String#capitalize on some (including nonsensical)
  combinations of MTAVRULI and Mkhedruli, and a canary test to
  detect the potential assignment of characters to the currently
  open slots (holes) at U+1CBB and U+1CBC.

* test/ruby/enc/test_case_comprehensive.rb: Tweak generation of
  expectation data.

Together with r65933, this closes issue #14839.

git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@66300 b2dd03c8-39d4-4d8f-98ff-823fe69b080e

remove obsolete data from unicode.c

2018-12-06T00:05:08+00:00

* unicode.c: Remove the arrays onigenc_unicode_GCB_ranges_GAZ,
  onigenc_unicode_GCB_ranges_E_Base, and onigenc_unicode_GCB_ranges_Emoji,
  because they are not needed anymore for Unicode 11.0.0.

* regparse.c: Remove external declarations for above arrays.

git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@66232 b2dd03c8-39d4-4d8f-98ff-823fe69b080e

solve the genie/zombie/wrestlers bug

2018-12-02T10:07:42+00:00

enc/unicode.c: - Add U+1F93C (WRESTLERS), U+1F9DE (GENIE), and U+1F9DF
                 to onigenc_unicode_GCB_ranges_E_Base.
               - Add comments with character names.
test/ruby/enc/test_emoji_breaks.rb: Activate tests for genie/zombie/wrestlers.
This closes issue #15343.

git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@66133 b2dd03c8-39d4-4d8f-98ff-823fe69b080e

Added words in the comment at r65088 [ci skip]

2018-11-30T07:19:49+00:00

git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@66103 b2dd03c8-39d4-4d8f-98ff-823fe69b080e

deal with ONIGENC_CASE_IS_TITLECASE flag on lowercase characters

2018-11-25T10:12:45+00:00

In the function onigenc_unicode_case_map() in enc/unicode.c, deal
with the case that the ONIGENC_CASE_IS_TITLECASE flag is set on
lowercase characters. This is in preparation for Georgian Mtavruli,
which are uppercase but not titlecase, in Unicode 11.0.0.

git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@65971 b2dd03c8-39d4-4d8f-98ff-823fe69b080e

enc/unicode.c: 'a' is bigger than 'A'

2018-11-16T02:34:00+00:00

In ASCII, 'a' is bigger than 'A'. Which means 'A' - 'a' is a negative
number (-32, to be precise). In C, the type of 'a' and 'A' are signed
int (cf: ISO/IEC 9899:1990 section 6.1.3.4). So 'A' - 'a' is also a
signed int. It is `(signed int)-32`.

The problem is, OnigCodePoint is unsigned int. Adding a negative
number to a variable of OnigCodepoint (`code` here) introduces an
unintentional cast of `(unsigned)(signed)-32`, which is
4,294,967,264. Adding this value to code then overflows, and the
result eventually becomes normal codepoint.

The series of operations are not a serious problem but because
`code >= 'a'` holds, we can `(code - 'a') + 'A'` to reroute this.

See also: https://github.com/k-takata/Onigmo/pull/107


git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@65752 b2dd03c8-39d4-4d8f-98ff-823fe69b080e