| Age | Commit message | Author |
|
If we assume that even UTF-8 strings are mostly ASCII, we can implement a
fast path for the ASCII parts.
Before:
```
== Encoding mixed utf8 (20012001 bytes)
ruby 3.4.0dev (2024-10-18T15:12:54Z master https://github.com/ruby/json/commit/d1b5c10957) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
json 5.000 i/100ms
oj 9.000 i/100ms
rapidjson 2.000 i/100ms
Calculating -------------------------------------
json 49.403 (± 2.0%) i/s (20.24 ms/i) - 250.000 in 5.062647s
oj 100.120 (± 2.0%) i/s (9.99 ms/i) - 504.000 in 5.035349s
rapidjson 26.404 (± 0.0%) i/s (37.87 ms/i) - 132.000 in 5.001025s
Comparison:
json: 49.4 i/s
oj: 100.1 i/s - 2.03x faster
rapidjson: 26.4 i/s - 1.87x slower
```
After:
```
== Encoding mixed utf8 (20012001 bytes)
ruby 3.4.0dev (2024-10-18T15:12:54Z master https://github.com/ruby/json/commit/d1b5c10957) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
json 10.000 i/100ms
oj 9.000 i/100ms
rapidjson 2.000 i/100ms
Calculating -------------------------------------
json 95.686 (± 2.1%) i/s (10.45 ms/i) - 480.000 in 5.018575s
oj 96.875 (± 2.1%) i/s (10.32 ms/i) - 486.000 in 5.019097s
rapidjson 26.260 (± 3.8%) i/s (38.08 ms/i) - 132.000 in 5.033151s
Comparison:
json: 95.7 i/s
oj: 96.9 i/s - same-ish: difference falls within error
rapidjson: 26.3 i/s - 3.64x slower
```
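A rough Ruby sketch of the ASCII fast path described above, assuming valid UTF-8 input. The real change lives in the C generator; `escape_json_string` and `SHORT_ESCAPES` are hypothetical names used only for illustration:

```ruby
# Hypothetical illustration, not the code from ruby/json.
SHORT_ESCAPES = {
  0x22 => '\\"', 0x5C => '\\\\', 0x08 => '\\b', 0x0C => '\\f',
  0x0A => '\\n', 0x0D => '\\r', 0x09 => '\\t'
}.freeze

def escape_json_string(str)
  out = +'"'
  run_start = 0 # start of the pending run of bytes that need no escaping
  i = 0
  size = str.bytesize
  while i < size
    byte = str.getbyte(i)
    if byte < 0x80
      if byte >= 0x20 && byte != 0x22 && byte != 0x5C
        # Fast path: plain ASCII, just extend the current run.
        i += 1
        next
      end
      # ASCII byte that needs escaping: flush the run, then emit the escape.
      out << str.byteslice(run_start, i - run_start)
      out << (SHORT_ESCAPES[byte] || format('\\u%04x', byte))
      i += 1
      run_start = i
    else
      # Slow path: multi-byte sequence. Here it is only skipped as a whole;
      # any per-character handling would go in this branch.
      width = if byte >= 0xF0 then 4 elsif byte >= 0xE0 then 3 else 2 end
      i += width
    end
  end
  out << str.byteslice(run_start, size - run_start)
  out << '"'
end

puts escape_json_string("caf\u00e9 \"au lait\"\n")
```

The point of the sketch is that mostly-ASCII input spends nearly all its time in the first branch, which does no table lookup and no copying until a run ends.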
https://github.com/ruby/json/commit/f8166c2d7f
|
|
Note where we currently stand, what the current bottlenecks are
and what can or can't be done.
```
== Encoding small nested array (121 bytes)
ruby 3.3.4 (2024-07-09 revision https://github.com/ruby/json/commit/be1089c8ec) [arm64-darwin23]
Warming up --------------------------------------
json 129.145k i/100ms
json (reuse) 239.395k i/100ms
oj 211.514k i/100ms
rapidjson 130.660k i/100ms
Calculating -------------------------------------
json 1.284M (± 0.3%) i/s (779.11 ns/i) - 6.457M in 5.030954s
json (reuse) 2.405M (± 0.1%) i/s (415.77 ns/i) - 12.209M in 5.076202s
oj 2.118M (± 0.0%) i/s (472.11 ns/i) - 10.787M in 5.092795s
rapidjson 1.325M (± 1.3%) i/s (754.82 ns/i) - 6.664M in 5.030763s
Comparison:
json: 1283514.8 i/s
json (reuse): 2405175.0 i/s - 1.87x faster
oj: 2118132.9 i/s - 1.65x faster
rapidjson: 1324820.8 i/s - 1.03x faster
== Encoding small hash (65 bytes)
ruby 3.3.4 (2024-07-09 revision https://github.com/ruby/json/commit/be1089c8ec) [arm64-darwin23]
Warming up --------------------------------------
json 177.502k i/100ms
json (reuse) 485.963k i/100ms
oj 656.566k i/100ms
rapidjson 227.985k i/100ms
Calculating -------------------------------------
json 1.774M (± 3.1%) i/s (563.67 ns/i) - 8.875M in 5.007964s
json (reuse) 4.804M (± 3.0%) i/s (208.16 ns/i) - 24.298M in 5.062426s
oj 6.564M (± 1.9%) i/s (152.36 ns/i) - 32.828M in 5.003539s
rapidjson 2.229M (± 2.0%) i/s (448.59 ns/i) - 11.171M in 5.013299s
Comparison:
json: 1774084.6 i/s
oj: 6563547.8 i/s - 3.70x faster
json (reuse): 4804083.0 i/s - 2.71x faster
rapidjson: 2229209.5 i/s - 1.26x faster
== Encoding twitter.json (466906 bytes)
ruby 3.3.4 (2024-07-09 revision https://github.com/ruby/json/commit/be1089c8ec) [arm64-darwin23]
Warming up --------------------------------------
json 212.000 i/100ms
oj 222.000 i/100ms
rapidjson 109.000 i/100ms
Calculating -------------------------------------
json 2.135k (± 0.7%) i/s (468.32 μs/i) - 10.812k in 5.063665s
oj 2.219k (± 1.9%) i/s (450.69 μs/i) - 11.100k in 5.004642s
rapidjson 1.093k (± 3.8%) i/s (914.66 μs/i) - 5.559k in 5.090812s
Comparison:
json: 2135.3 i/s
oj: 2218.8 i/s - 1.04x faster
rapidjson: 1093.3 i/s - 1.95x slower
== Encoding citm_catalog.json (500298 bytes)
ruby 3.3.4 (2024-07-09 revision https://github.com/ruby/json/commit/be1089c8ec) [arm64-darwin23]
Warming up --------------------------------------
json 132.000 i/100ms
oj 126.000 i/100ms
rapidjson 96.000 i/100ms
Calculating -------------------------------------
json 1.304k (± 2.2%) i/s (766.96 μs/i) - 6.600k in 5.064483s
oj 1.272k (± 0.8%) i/s (786.14 μs/i) - 6.426k in 5.052044s
rapidjson 997.370 (± 4.8%) i/s (1.00 ms/i) - 4.992k in 5.016266s
Comparison:
json: 1303.9 i/s
oj: 1272.0 i/s - same-ish: difference falls within error
rapidjson: 997.4 i/s - 1.31x slower
== Encoding canada.json (2090234 bytes)
ruby 3.3.4 (2024-07-09 revision https://github.com/ruby/json/commit/be1089c8ec) [arm64-darwin23]
Warming up --------------------------------------
json 2.000 i/100ms
oj 3.000 i/100ms
rapidjson 1.000 i/100ms
Calculating -------------------------------------
json 20.001 (± 0.0%) i/s (50.00 ms/i) - 102.000 in 5.100950s
oj 30.823 (± 0.0%) i/s (32.44 ms/i) - 156.000 in 5.061333s
rapidjson 19.446 (± 0.0%) i/s (51.42 ms/i) - 98.000 in 5.041884s
Comparison:
json: 20.0 i/s
oj: 30.8 i/s - 1.54x faster
rapidjson: 19.4 i/s - 1.03x slower
== Encoding many #to_json calls (2661 bytes)
oj does not match expected output. Skipping
rapidjson unsupported (Invalid object key type: Object)
ruby 3.3.4 (2024-07-09 revision https://github.com/ruby/json/commit/be1089c8ec) [arm64-darwin23]
Warming up --------------------------------------
json 2.200k i/100ms
Calculating -------------------------------------
json 22.253k (± 0.2%) i/s (44.94 μs/i) - 112.200k in 5.041962s
```
https://github.com/ruby/json/commit/77e97b3d4e
|
|
https://github.com/ruby/json/commit/dbf7e9f473
|
|
https://github.com/ruby/json/commit/7b68800991
|
|
There are a large number of outstanding performance PRs that I want to
merge, but we need a decent benchmark to judge whether they are effective.
I borrowed rapidjson's benchmark suite, which is a good start.
I only kept the comparison with Oj and RapidJSON, because YAJL is
slower on most benchmarks, so there is little point in comparing against it.
Encoding:
```
== Encoding small nested array (121 bytes)
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
json 88.225k i/100ms
oj 209.862k i/100ms
rapidjson 128.978k i/100ms
Calculating -------------------------------------
json 914.611k (± 0.4%) i/s (1.09 μs/i) - 4.588M in 5.016099s
oj 2.163M (± 0.2%) i/s (462.39 ns/i) - 10.913M in 5.045964s
rapidjson 1.392M (± 1.3%) i/s (718.55 ns/i) - 6.965M in 5.005438s
Comparison:
json: 914610.6 i/s
oj: 2162693.5 i/s - 2.36x faster
rapidjson: 1391682.6 i/s - 1.52x faster
== Encoding small hash (65 bytes)
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
json 142.093k i/100ms
oj 651.412k i/100ms
rapidjson 237.706k i/100ms
Calculating -------------------------------------
json 1.478M (± 0.7%) i/s (676.78 ns/i) - 7.389M in 5.000866s
oj 7.150M (± 0.7%) i/s (139.85 ns/i) - 35.828M in 5.010756s
rapidjson 2.250M (± 1.6%) i/s (444.46 ns/i) - 11.410M in 5.072451s
Comparison:
json: 1477595.1 i/s
oj: 7150472.0 i/s - 4.84x faster
rapidjson: 2249926.7 i/s - 1.52x faster
== Encoding twitter.json (466906 bytes)
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
json 101.000 i/100ms
oj 223.000 i/100ms
rapidjson 105.000 i/100ms
Calculating -------------------------------------
json 1.017k (± 0.7%) i/s (982.83 μs/i) - 5.151k in 5.062786s
oj 2.244k (± 0.7%) i/s (445.72 μs/i) - 11.373k in 5.069428s
rapidjson 1.069k (± 4.6%) i/s (935.20 μs/i) - 5.355k in 5.016652s
Comparison:
json: 1017.5 i/s
oj: 2243.6 i/s - 2.21x faster
rapidjson: 1069.3 i/s - same-ish: difference falls within error
== Encoding citm_catalog.json (500299 bytes)
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
json 77.000 i/100ms
oj 129.000 i/100ms
rapidjson 96.000 i/100ms
Calculating -------------------------------------
json 767.217 (± 2.5%) i/s (1.30 ms/i) - 3.850k in 5.021957s
oj 1.291k (± 1.5%) i/s (774.45 μs/i) - 6.579k in 5.096439s
rapidjson 959.527 (± 1.1%) i/s (1.04 ms/i) - 4.800k in 5.003052s
Comparison:
json: 767.2 i/s
oj: 1291.2 i/s - 1.68x faster
rapidjson: 959.5 i/s - 1.25x faster
== Encoding canada.json (2090234 bytes)
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
json 1.000 i/100ms
oj 3.000 i/100ms
rapidjson 1.000 i/100ms
Calculating -------------------------------------
json 19.748 (± 0.0%) i/s (50.64 ms/i) - 99.000 in 5.013336s
oj 31.016 (± 0.0%) i/s (32.24 ms/i) - 156.000 in 5.029732s
rapidjson 19.419 (± 0.0%) i/s (51.50 ms/i) - 98.000 in 5.050382s
Comparison:
json: 19.7 i/s
oj: 31.0 i/s - 1.57x faster
rapidjson: 19.4 i/s - 1.02x slower
== Encoding many #to_json calls (2661 bytes)
oj does not match expected output. Skipping
rapidjson unsupported (Invalid object key type: Object)
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
json 2.129k i/100ms
Calculating -------------------------------------
json 21.599k (± 0.6%) i/s (46.30 μs/i) - 108.579k in 5.027198s
```
Parsing:
```
== Parsing small nested array (121 bytes)
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
json 47.497k i/100ms
oj 54.115k i/100ms
oj strict 53.854k i/100ms
Oj::Parser 150.904k i/100ms
rapidjson 80.775k i/100ms
Calculating -------------------------------------
json 481.096k (± 1.1%) i/s (2.08 μs/i) - 2.422M in 5.035657s
oj 554.878k (± 0.6%) i/s (1.80 μs/i) - 2.814M in 5.071521s
oj strict 547.888k (± 0.7%) i/s (1.83 μs/i) - 2.747M in 5.013212s
Oj::Parser 1.545M (± 0.4%) i/s (647.16 ns/i) - 7.847M in 5.078302s
rapidjson 822.422k (± 0.6%) i/s (1.22 μs/i) - 4.120M in 5.009178s
Comparison:
json: 481096.4 i/s
Oj::Parser: 1545223.5 i/s - 3.21x faster
rapidjson: 822422.4 i/s - 1.71x faster
oj: 554877.7 i/s - 1.15x faster
oj strict: 547887.7 i/s - 1.14x faster
== Parsing small hash (65 bytes)
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
json 154.479k i/100ms
oj 220.283k i/100ms
oj strict 249.928k i/100ms
Oj::Parser 445.062k i/100ms
rapidjson 289.615k i/100ms
Calculating -------------------------------------
json 1.581M (± 3.0%) i/s (632.55 ns/i) - 8.033M in 5.086476s
oj 2.202M (± 3.5%) i/s (454.08 ns/i) - 11.014M in 5.008146s
oj strict 2.498M (± 3.5%) i/s (400.25 ns/i) - 12.496M in 5.008245s
Oj::Parser 4.640M (± 0.4%) i/s (215.50 ns/i) - 23.588M in 5.083443s
rapidjson 3.111M (± 0.3%) i/s (321.44 ns/i) - 15.639M in 5.027097s
Comparison:
json: 1580898.5 i/s
Oj::Parser: 4640298.1 i/s - 2.94x faster
rapidjson: 3111005.2 i/s - 1.97x faster
oj strict: 2498421.4 i/s - 1.58x faster
oj: 2202276.6 i/s - 1.39x faster
== Parsing test from oj (256 bytes)
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
json 37.580k i/100ms
oj 41.899k i/100ms
oj strict 50.731k i/100ms
Oj::Parser 74.589k i/100ms
rapidjson 50.954k i/100ms
Calculating -------------------------------------
json 382.150k (± 1.0%) i/s (2.62 μs/i) - 1.917M in 5.015737s
oj 420.282k (± 0.2%) i/s (2.38 μs/i) - 2.137M in 5.084338s
oj strict 511.758k (± 0.5%) i/s (1.95 μs/i) - 2.587M in 5.055821s
Oj::Parser 759.087k (± 0.3%) i/s (1.32 μs/i) - 3.804M in 5.011388s
rapidjson 518.273k (± 1.8%) i/s (1.93 μs/i) - 2.599M in 5.015867s
Comparison:
json: 382149.6 i/s
Oj::Parser: 759087.1 i/s - 1.99x faster
rapidjson: 518272.8 i/s - 1.36x faster
oj strict: 511758.4 i/s - 1.34x faster
oj: 420282.5 i/s - 1.10x faster
== Parsing twitter.json (567916 bytes)
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
json 52.000 i/100ms
oj 63.000 i/100ms
oj strict 74.000 i/100ms
Oj::Parser 79.000 i/100ms
rapidjson 56.000 i/100ms
Calculating -------------------------------------
json 522.896 (± 0.4%) i/s (1.91 ms/i) - 2.652k in 5.071809s
oj 624.849 (± 0.6%) i/s (1.60 ms/i) - 3.150k in 5.041398s
oj strict 737.779 (± 0.4%) i/s (1.36 ms/i) - 3.700k in 5.015117s
Oj::Parser 789.254 (± 0.3%) i/s (1.27 ms/i) - 3.950k in 5.004764s
rapidjson 565.663 (± 0.4%) i/s (1.77 ms/i) - 2.856k in 5.049015s
Comparison:
json: 522.9 i/s
Oj::Parser: 789.3 i/s - 1.51x faster
oj strict: 737.8 i/s - 1.41x faster
oj: 624.8 i/s - 1.19x faster
rapidjson: 565.7 i/s - 1.08x faster
== Parsing citm_catalog.json (1727030 bytes)
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
json 27.000 i/100ms
oj 31.000 i/100ms
oj strict 36.000 i/100ms
Oj::Parser 42.000 i/100ms
rapidjson 38.000 i/100ms
Calculating -------------------------------------
json 305.248 (± 0.3%) i/s (3.28 ms/i) - 1.539k in 5.041813s
oj 320.265 (± 3.4%) i/s (3.12 ms/i) - 1.612k in 5.039715s
oj strict 373.701 (± 1.6%) i/s (2.68 ms/i) - 1.872k in 5.010633s
Oj::Parser 457.792 (± 0.4%) i/s (2.18 ms/i) - 2.310k in 5.046049s
rapidjson 350.933 (± 8.8%) i/s (2.85 ms/i) - 1.748k in 5.052491s
Comparison:
json: 305.2 i/s
Oj::Parser: 457.8 i/s - 1.50x faster
oj strict: 373.7 i/s - 1.22x faster
rapidjson: 350.9 i/s - 1.15x faster
oj: 320.3 i/s - 1.05x faster
== Parsing canada.json (2251051 bytes)
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
json 2.000 i/100ms
oj 2.000 i/100ms
oj strict 2.000 i/100ms
Oj::Parser 2.000 i/100ms
rapidjson 28.000 i/100ms
Calculating -------------------------------------
json 29.216 (± 6.8%) i/s (34.23 ms/i) - 146.000 in 5.053753s
oj 24.899 (± 0.0%) i/s (40.16 ms/i) - 126.000 in 5.061915s
oj strict 24.828 (± 4.0%) i/s (40.28 ms/i) - 124.000 in 5.003067s
Oj::Parser 30.867 (± 3.2%) i/s (32.40 ms/i) - 156.000 in 5.057104s
rapidjson 285.761 (± 1.0%) i/s (3.50 ms/i) - 1.456k in 5.095715s
Comparison:
json: 29.2 i/s
rapidjson: 285.8 i/s - 9.78x faster
Oj::Parser: 30.9 i/s - same-ish: difference falls within error
oj: 24.9 i/s - 1.17x slower
oj strict: 24.8 i/s - 1.18x slower
```
|
|
Now that we've inlined the eden_heap into the size_pool, we should
rename the size_pool to heap, so that Ruby contains multiple heaps, each
holding objects of a different size.
The term heap as a collection of memory pages is more in keeping with
memory management nomenclature, whereas size_pool was a name chosen out of
necessity during the development of the Variable Width Allocation
features of Ruby.
The concept of size pools was introduced in order to facilitate
different sized objects (other than the default 40 bytes). They wrapped
the eden heap and the tomb heap, and some related state, and provided a
reasonably simple way of duplicating all related concerns, to provide
multiple pools that all shared the same structure but held different
objects.
Since then various changes have happened in Ruby's memory layout:
* The concept of tomb heaps has been replaced by a global free pages list,
with each page having its slot size reconfigured at the point when it
is resurrected
* the eden heap has been inlined into the size pool itself, so that now
the size pool directly controls the free_pages list, the sweeping
page, the compaction cursor and the other state that was previously
being managed by the eden heap.
Now that there is no need for a heap wrapper, we should refer to the
collection of pages containing Ruby objects as a heap again rather than
a size pool.
Notes:
Merged: https://github.com/ruby/ruby/pull/11771
|
|
And refine uncommon date cases.
# Iteration per second (i/s)
| |compare-ruby|built-ruby|
|:---------------------------|-----------:|---------:|
|time.xmlschema | 5.020M| 14.192M|
| | -| 2.83x|
|utc_time.xmlschema | 6.454M| 15.331M|
| | -| 2.38x|
|time.xmlschema(6) | 4.216M| 10.043M|
| | -| 2.38x|
|utc_time.xmlschema(6) | 5.486M| 10.592M|
| | -| 1.93x|
|time.xmlschema(9) | 4.294M| 10.340M|
| | -| 2.41x|
|utc_time.xmlschema(9) | 4.784M| 10.909M|
| | -| 2.28x|
|fraction_sec.xmlschema(10) | 366.982k| 3.406M|
| | -| 9.28x|
|future_time.xmlschema | 994.595k| 15.853M|
| | -| 15.94x|
Notes:
Merged: https://github.com/ruby/ruby/pull/11665
|
|
The original function name in ao.c was orthoBasis.
I guess the function is generating orthonormal basis (https://en.wikipedia.org/wiki/Orthonormal_basis).
Notes:
Merged: https://github.com/ruby/ruby/pull/6056
|
|
[Feature #20707]
Converting Time into RFC3339 / ISO8601 representation is a significant
hotspot for applications that serialize data in JSON, XML or other formats.
By moving it into core we can optimize it much further than what `strftime` will
allow.
```
compare-ruby: ruby 3.4.0dev (2024-08-29T13:11:40Z master 6b08a50a62) +YJIT [arm64-darwin23]
built-ruby: ruby 3.4.0dev (2024-08-30T13:17:32Z native-xmlschema 34041ff71f) +YJIT [arm64-darwin23]
warming up......
| |compare-ruby|built-ruby|
|:-----------------------|-----------:|---------:|
|time.xmlschema | 1.087M| 5.190M|
| | -| 4.78x|
|utc_time.xmlschema | 1.464M| 6.848M|
| | -| 4.68x|
|time.xmlschema(6) | 859.960k| 4.646M|
| | -| 5.40x|
|utc_time.xmlschema(6) | 1.080M| 5.917M|
| | -| 5.48x|
|time.xmlschema(9) | 893.909k| 4.668M|
| | -| 5.22x|
|utc_time.xmlschema(9) | 1.056M| 5.707M|
| | -| 5.40x|
```
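For reference, a small example of the output being optimized and the `strftime` formats it corresponds to. The `require "time"` line is only there so the snippet also runs on Rubies that predate the move into core:

```ruby
require "time" # needed before Time#xmlschema moved into core; harmless afterwards

utc = Time.utc(2024, 8, 30, 13, 17, 32, 123_456) # last argument is microseconds
local = Time.new(2024, 8, 30, 13, 17, 32, "+09:00")

utc_plain   = utc.xmlschema      # whole seconds, "Z" suffix for UTC
utc_micro   = utc.xmlschema(6)   # six fractional digits
local_plain = local.xmlschema    # numeric offset for non-UTC times

# The same strings built through strftime, the slower generic route.
utc_plain_strftime   = utc.strftime("%FT%TZ")
utc_micro_strftime   = utc.strftime("%FT%T.%6NZ")
local_plain_strftime = local.strftime("%FT%T%:z")

puts utc_plain, utc_micro, local_plain
```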
Notes:
Merged: https://github.com/ruby/ruby/pull/11510
|
|
Use a classic doubling of capacity rather than only adding
twice as much capacity as is already known to be needed.
```
compare-ruby: ruby 3.4.0dev (2024-09-04T09:21:53Z opt-strftime-2 ae98d19cf9) +YJIT [arm64-darwin23]
built-ruby: ruby 3.4.0dev (2024-09-04T11:46:02Z opt-strftime-growth 586263d6fb) +YJIT [arm64-darwin23]
warming up...
| |compare-ruby|built-ruby|
|:---------------------------|-----------:|---------:|
|time.strftime("%FT%T") | 1.754M| 1.889M|
| | -| 1.08x|
|time.strftime("%FT%T.%3N") | 1.508M| 1.749M|
| | -| 1.16x|
|time.strftime("%FT%T.%6N") | 1.488M| 1.756M|
| | -| 1.18x|
compare-ruby: ruby 3.4.0dev (2024-09-04T09:21:53Z opt-strftime-2 ae98d19cf9) +YJIT [arm64-darwin23]
built-ruby: ruby 3.4.0dev (2024-09-04T09:21:53Z opt-strftime-2 ae98d19cf9) +YJIT [arm64-darwin23]
warming up...
```
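To see why the growth policy matters, here is a small simulation that counts reallocations under both policies. It is a sketch under assumptions: `count_reallocations` is a hypothetical helper, and the "additive" block is only one reading of the old behaviour, not the actual C code:

```ruby
# Counts buffer reallocations while appending `chunks` pieces of `chunk_size`
# bytes. The block receives (capacity, needed) and returns the new capacity.
def count_reallocations(chunks, chunk_size, initial_capacity: 16)
  capacity = initial_capacity
  length = 0
  reallocations = 0
  chunks.times do
    needed = length + chunk_size
    if needed > capacity
      capacity = yield(capacity, needed)
      reallocations += 1
    end
    length = needed
  end
  reallocations
end

# Assumed old policy: grow by twice the amount currently missing.
additive = count_reallocations(10_000, 8) do |capacity, needed|
  capacity + 2 * (needed - capacity)
end

# Classic doubling: double until the buffer is large enough.
doubling = count_reallocations(10_000, 8) do |capacity, needed|
  capacity *= 2 while capacity < needed
  capacity
end

puts "additive: #{additive} reallocations, doubling: #{doubling} reallocations"
```

With many small appends the additive policy reallocates a number of times proportional to the output size, while doubling only needs a logarithmic number.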
Notes:
Merged: https://github.com/ruby/ruby/pull/11542
|
|
[Feature #19236]
When building a large hash, pre-allocating it with enough
capacity can save many re-hashes and significantly improve
performance.
```
/opt/rubies/3.3.0/bin/ruby --disable=gems -rrubygems -I./benchmark/lib ./benchmark/benchmark-driver/exe/benchmark-driver \
--executables="compare-ruby::../miniruby-master -I.ext/common --disable-gem" \
--executables="built-ruby::./miniruby --disable-gem" \
--output=markdown --output-compare -v $(find ./benchmark -maxdepth 1 -name 'hash_new' -o -name '*hash_new*.yml' -o -name '*hash_new*.rb' | sort)
compare-ruby: ruby 3.4.0dev (2024-03-25T11:48:11Z master f53209f023) +YJIT dev [arm64-darwin23]
last_commit=[ruby/irb] Cache RDoc::RI::Driver.new (https://github.com/ruby/irb/pull/911)
built-ruby: ruby 3.4.0dev (2024-03-25T15:29:40Z hash-new-rb 77652b08a2) +YJIT dev [arm64-darwin23]
warming up...
| |compare-ruby|built-ruby|
|:-------------------|-----------:|---------:|
|new | 7.614M| 5.976M|
| | 1.27x| -|
|new_with_capa_1k | 13.931k| 15.698k|
| | -| 1.13x|
|new_with_capa_100k | 124.746| 148.283|
| | -| 1.19x|
```
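A minimal usage sketch. `build_index` is a hypothetical helper; the version guard is there because `Hash.new(capacity:)` only exists on Ruby 3.4 and later, and older versions would interpret the argument as a default value instead:

```ruby
# Builds a key => position index, pre-sizing the hash when the running Ruby
# supports Hash.new(capacity:).
def build_index(keys)
  index =
    if Gem::Version.new(RUBY_VERSION) >= Gem::Version.new("3.4")
      Hash.new(capacity: keys.size) # avoids repeated re-hashing while filling
    else
      {}
    end
  keys.each_with_index { |key, i| index[key] = i }
  index
end

index = build_index((1..100_000).map { |i| "key#{i}" })
puts index.size
```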
|
|
With embedded strings we often have some space left in the slot, which
we can use to store the string Hash code.
It's probably only worth it for string literals, as they are the ones
likely to be used as hash keys.
We chose to store the Hash code right after the string terminator so as to
make it easy and fast to compute, and to avoid requiring one more union in RString.
```
compare-ruby: ruby 3.4.0dev (2024-04-22T06:32:21Z main f77618c1fa) [arm64-darwin23]
built-ruby: ruby 3.4.0dev (2024-04-22T10:13:03Z interned-string-ha.. 8a1a32331b) [arm64-darwin23]
last_commit=Precompute embedded string literals hash code
| |compare-ruby|built-ruby|
|:-----------|-----------:|---------:|
|symbol | 39.275M| 39.753M|
| | -| 1.01x|
|dyn_symbol | 37.348M| 37.704M|
| | -| 1.01x|
|small_lit | 29.514M| 33.948M|
| | -| 1.15x|
|frozen_lit | 27.180M| 33.056M|
| | -| 1.22x|
|iseq_lit | 27.391M| 32.242M|
| | -| 1.18x|
```
Co-Authored-By: Étienne Barrié <etienne.barrie@gmail.com>
|
|
|
|
|
|
These show gains from the recent optimization commits:
```
arg_splat
miniruby: 7346039.9 i/s
miniruby-before: 4692240.8 i/s - 1.57x slower
arg_splat_block
miniruby: 6539749.6 i/s
miniruby-before: 4358063.6 i/s - 1.50x slower
splat_kw_splat
miniruby: 5433641.5 i/s
miniruby-before: 3851048.6 i/s - 1.41x slower
splat_kw_splat_block
miniruby: 4916137.1 i/s
miniruby-before: 3477090.1 i/s - 1.41x slower
splat_kw_block
miniruby: 2912829.5 i/s
miniruby-before: 2465611.7 i/s - 1.18x slower
arg_splat_post
miniruby: 2195208.2 i/s
miniruby-before: 1860204.3 i/s - 1.18x slower
```
zsuper only speeds up in the post argument case, because
it was already set to use splatarray false in cases where
there were no post arguments.
|
|
Thanks to the new semantics from [ruby-core:115808], `**nil` is now
equivalent to `**{}`. Since the only thing one could do with anonymous
keyword rest parameter is to delegate it with `**`, nil is just as good
as an empty hash. Using nil avoids allocating an empty hash.
This is particularly important for `...` methods since they now use
`**kwrest` under the hood after 4f77d8d328. Most calls don't pass
keywords.
Comparison:
fw_no_kw
post: 9816800.9 i/s
pre: 8570297.0 i/s - 1.15x slower
|
|
keyword splat
The following code previously caused a crash:
```ruby
h = {}
1000000.times{|i| h[i.to_s.to_sym] = i}
def f(kw: 1, **kws) end
f(**h)
```
Inside a thread or fiber, the size of the keyword splat could be much smaller
and still cause a crash.
I found this issue while optimizing method calling by reducing implicit
allocations. Given the following code:
```ruby
def f(kw: , **kws) end
kw = {kw: 1}
f(**kw)
```
The `f(**kw)` call previously allocated two hashes callee side instead of a
single hash. This is because `setup_parameters_complex` would extract the
keywords from the keyword splat hash to the C stack, to attempt to mirror
the case when literal keywords are passed without a keyword splat. Then,
`make_rest_kw_hash` would build a new hash based on the extracted keywords
that weren't used for literal keywords.
Switch the implementation so that if a keyword splat is passed, literal keywords
are deleted from the keyword splat hash (or a copy of the hash if the hash is
not mutable).
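The behaviour visible from Ruby stays the same under the new implementation. A small check, with illustrative variable names, that the caller's hash is left untouched while the named keyword is split off:

```ruby
def f(kw:, **kws)
  [kw, kws]
end

h = { kw: 1, a: 2, b: 3 }
kw, rest = f(**h)

p kw   # the literal keyword
p rest # the remaining keywords
p h    # the caller's hash, which must not lose :kw
```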
In addition to avoiding the crash, this new approach is much more
efficient in all cases. With the included benchmark:
```
1
miniruby: 5247879.9 i/s
miniruby-before: 2474050.2 i/s - 2.12x slower
1_mutable
miniruby: 1797036.5 i/s
miniruby-before: 1239543.3 i/s - 1.45x slower
10
miniruby: 1094750.1 i/s
miniruby-before: 365529.6 i/s - 2.99x slower
10_mutable
miniruby: 407781.7 i/s
miniruby-before: 225364.0 i/s - 1.81x slower
100
miniruby: 100992.3 i/s
miniruby-before: 32703.6 i/s - 3.09x slower
100_mutable
miniruby: 40092.3 i/s
miniruby-before: 21266.9 i/s - 1.89x slower
1000
miniruby: 21694.2 i/s
miniruby-before: 4949.8 i/s - 4.38x slower
1000_mutable
miniruby: 5819.5 i/s
miniruby-before: 2995.0 i/s - 1.94x slower
```
|
|
|
|
To avoid stack overflow, Ruby splits compilation of large arrays
into smaller arrays, and concatenates the small arrays together.
It previously used newarray/concatarray for this, which is
inefficient. This switches the compilation to use pushtoarray,
which is much faster. This makes almost all literal arrays only
allocate a single array.
For cases where there is a large amount of static values in the
array, Ruby will statically compile subarrays, and previously
added them using concatarray. This switches to concattoarray,
avoiding an array allocation for the append.
Keyword splats are also supported in arrays, and ignored if the
keyword splat is empty. Previously, this used newarraykwsplat and
concatarray. This still uses newarraykwsplat, but switches to
concattoarray to save an allocation. So large arrays with keyword
splats can allocate 2 arrays instead of 1.
Previously, for the following array sizes (assuming local variable
access for each element), Ruby allocated the following number of
arrays:
1000 elements: 7 arrays
10000 elements: 79 arrays
100000 elements: 781 arrays
With these changes, only a single array is allocated (or 2 for a
large array with a keyword splat).
Results using the included benchmark:
```
array_1000
miniruby: 34770.0 i/s
./miniruby-before: 10511.7 i/s - 3.31x slower
array_10000
miniruby: 4938.8 i/s
./miniruby-before: 483.8 i/s - 10.21x slower
array_100000
miniruby: 727.2 i/s
./miniruby-before: 4.1 i/s - 176.98x slower
```
Co-authored-by: Nobuyoshi Nakada <nobu@ruby-lang.org>
|
|
Benchmark results:
```
named_multi_arg_splat
after: 5344097.6 i/s
before: 3088134.0 i/s - 1.73x slower
named_post_splat
after: 5401882.3 i/s
before: 2629321.8 i/s - 2.05x slower
anon_arg_splat
after: 12242780.9 i/s
before: 6845413.2 i/s - 1.79x slower
anon_arg_kw_splat
after: 11277398.7 i/s
before: 4329509.4 i/s - 2.60x slower
anon_multi_arg_splat
after: 5132699.5 i/s
before: 3018103.7 i/s - 1.70x slower
anon_post_splat
after: 5602915.1 i/s
before: 2645185.5 i/s - 2.12x slower
anon_kw_splat
after: 15403727.3 i/s
before: 6249504.6 i/s - 2.46x slower
anon_fw_to_named_splat
after: 2985715.3 i/s
before: 2049159.9 i/s - 1.46x slower
anon_fw_to_named_no_splat
after: 2941030.4 i/s
before: 2100380.0 i/s - 1.40x slower
fw_to_named_splat
after: 2801008.7 i/s
before: 2012416.4 i/s - 1.39x slower
fw_to_named_no_splat
after: 2742670.4 i/s
before: 1957707.2 i/s - 1.40x slower
fw_to_anon_to_named_splat
after: 2309246.6 i/s
before: 1375924.6 i/s - 1.68x slower
fw_to_anon_to_named_no_splat
after: 2193227.6 i/s
before: 1351184.1 i/s - 1.62x slower
```
|
|
|
|
* YJIT: Allow inlining ISEQ calls with a block
* Leave a TODO comment about u16 inline_block
|
|
|
|
This follows the same approach used for attr_reader/attr_writer in
2d98593bf54a37397c6e4886ccc7e3654c2eaf85, skipping the checking for
tracing after the first call using the call cache, and clearing the
call cache when tracing is turned on/off.
Fixes [Bug #18886]
|
|
`String#+@` is 2-3 times faster than `String#dup` because it can
directly go through `rb_str_dup` instead of using the generic
much slower `rb_obj_dup`.
This fact led to the existence of the ugly `Performance/UnfreezeString`
rubocop performance rule that encourages users to rewrite the much
more readable and convenient `"foo".dup` into the ugly `(+"foo")`.
Let's make that rubocop rule useless.
```
compare-ruby: ruby 3.3.0dev (2023-11-20T02:02:55Z master 701b0650de) [arm64-darwin22]
last_commit=[ruby/prism] feat: add encoding for IBM865 (https://github.com/ruby/prism/pull/1884)
built-ruby: ruby 3.3.0dev (2023-11-20T12:51:45Z faster-str-lit-dup 6b745bbc5d) [arm64-darwin22]
warming up..
| |compare-ruby|built-ruby|
|:------|-----------:|---------:|
|uplus | 16.312M| 16.332M|
| | -| 1.00x|
|dup | 5.912M| 16.329M|
| | -| 2.76x|
```
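For readers unfamiliar with the two spellings, a short comparison of what each returns. Both produce an unfrozen copy of a frozen string; they differ only when the receiver is already mutable:

```ruby
frozen = "foo".freeze

via_uplus = +frozen     # String#+@: unfrozen copy of a frozen string
via_dup   = frozen.dup  # String#dup: always a new, unfrozen string

mutable = String.new("bar")
uplus_same_object = (+mutable).equal?(mutable)   # +@ returns self if not frozen
dup_same_object   = mutable.dup.equal?(mutable)  # dup never returns self

p via_uplus.frozen?, via_dup.frozen?, uplus_same_object, dup_same_object
```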
|
|
When an inline cache misses, it is very likely that the stale shape_id
and the current instance shape_id have a close common ancestor.
For example, if the instance is sometimes frozen and sometimes
not, one of the two shapes will be the direct parent of the other.
Another pattern that commonly causes IC misses is "memoization";
in such cases the object will have a "base common shape" and then
a number of close descendants.
In addition, when we find a common ancestor, we store it in the
inline cache instead of the current shape. This helps prevent the
cache from flip-flopping, ensuring the next lookup will be marginally
faster, and more generally avoids writing to memory too much.
However, now that shapes have an ancestors index, we only check
for a few ancestors before falling back to use the index.
So overall this change speeds up what is assumed to be the more common
case, but makes what is assumed to be the less common case a bit slower.
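As an illustration of the memoization pattern mentioned above, here is a hypothetical class (the name `Report` and its methods are made up for this sketch). Instances share the shape produced by `initialize` and then diverge depending on which memoized values were computed, and in which order:

```ruby
class Report
  def initialize(rows)
    @rows = rows # every instance starts from the same base shape
  end

  def total
    @total ||= @rows.sum # adds @total the first time it is called
  end

  def max
    @max ||= @rows.max # adds @max the first time it is called
  end
end

a = Report.new([1, 2, 3])
b = Report.new([4, 5, 6])
c = Report.new([7, 8, 9])

a.total          # a: @rows, @total
b.max            # b: @rows, @max
c.total; c.max   # c: @rows, @total, @max

# A call site such as `report.total` now sees several closely related shapes.
totals = [a, b, c].map(&:total)
p totals
```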
```
compare-ruby: ruby 3.3.0dev (2023-10-26T05:30:17Z master 701ca070b4) [arm64-darwin22]
built-ruby: ruby 3.3.0dev (2023-10-26T09:25:09Z shapes_double_sear.. a723a85235) [arm64-darwin22]
warming up......
| |compare-ruby|built-ruby|
|:------------------------------------|-----------:|---------:|
|vm_ivar_stable_shape | 11.672M| 11.679M|
| | -| 1.00x|
|vm_ivar_memoize_unstable_shape | 7.551M| 10.506M|
| | -| 1.39x|
|vm_ivar_memoize_unstable_shape_miss | 11.591M| 11.624M|
| | -| 1.00x|
|vm_ivar_unstable_undef | 9.037M| 7.981M|
| | 1.13x| -|
|vm_ivar_divergent_shape | 8.034M| 6.657M|
| | 1.21x| -|
|vm_ivar_divergent_shape_imbalanced | 10.471M| 9.231M|
| | 1.13x| -|
```
Co-Authored-By: John Hawthorn <john@hawthorn.email>
|
|
Co-authored-by: Nobuyoshi Nakada <nobu@ruby-lang.org>
|
|
This is an experimental commit that uses a functional red-black tree to
create an index of the ancestor shapes. It uses an Okasaki style
functional red black tree:
https://www.cs.tufts.edu/comp/150FP/archive/chris-okasaki/redblack99.pdf
This tree is advantageous because:
* It offers O(log n) insertions and O(log n) lookups.
* It shares memory with previous "versions" of the tree
When we insert a node in the tree, only the parts of the tree that need
to be rebalanced are newly allocated. Parts of the tree that don't need
to be rebalanced are not reallocated, so "new trees" are able to share
memory with old trees. This is in contrast to a sorted set where we
would have to duplicate the set, and also resort the set on each
insertion.
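A compact Ruby sketch of an Okasaki-style persistent red-black tree, to make the memory sharing concrete. This is not the C implementation in the commit; `RBNode`, `rb_insert`, `rb_lookup`, `ins`, `balance`, `rotated` and `red?` are names chosen for this example:

```ruby
RBNode = Struct.new(:color, :left, :key, :value, :right)

def red?(node)
  !node.nil? && node.color == :red
end

def rb_lookup(node, key)
  while node
    return node.value if key == node.key
    node = key < node.key ? node.left : node.right
  end
  nil
end

# Returns a new root; the tree passed in is never modified.
def rb_insert(root, key, value)
  r = ins(root, key, value)
  RBNode.new(:black, r.left, r.key, r.value, r.right).freeze
end

def ins(node, key, value)
  return RBNode.new(:red, nil, key, value, nil).freeze if node.nil?
  case key <=> node.key
  when -1
    balance(RBNode.new(node.color, ins(node.left, key, value),
                       node.key, node.value, node.right).freeze)
  when 1
    balance(RBNode.new(node.color, node.left, node.key, node.value,
                       ins(node.right, key, value)).freeze)
  else
    RBNode.new(node.color, node.left, key, value, node.right).freeze
  end
end

# Rebuilds three nodes x < y < z around the untouched subtrees a, b, c, d.
def rotated(a, x, b, y, c, z, d)
  RBNode.new(:red,
             RBNode.new(:black, a, x.key, x.value, b).freeze,
             y.key, y.value,
             RBNode.new(:black, c, z.key, z.value, d).freeze).freeze
end

# Okasaki's four cases: a black node with a red child and red grandchild.
def balance(n)
  return n unless n.color == :black
  l = n.left
  r = n.right
  if red?(l) && red?(l.left)
    rotated(l.left.left, l.left, l.left.right, l, l.right, n, r)
  elsif red?(l) && red?(l.right)
    rotated(l.left, l, l.right.left, l.right, l.right.right, n, r)
  elsif red?(r) && red?(r.left)
    rotated(l, n, r.left.left, r.left, r.left.right, r, r.right)
  elsif red?(r) && red?(r.right)
    rotated(l, n, r.left, r, r.right.left, r.right, r.right.right)
  else
    n
  end
end

old_tree = nil
(1..10).each { |i| old_tree = rb_insert(old_tree, i, "v#{i}") }
new_tree = rb_insert(old_tree, 11, "v11")
```

After the last line `old_tree` still answers lookups exactly as before, and only the nodes along the insertion path (plus any rebalanced ones) were allocated for `new_tree`; the rest are the same objects in both trees.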
I've added a new stat to RubyVM.stat so we can understand how the red
black tree increases.
|
|
|
|
|
|
|
|
|
|
|
|
Empty ranges do not overlap with any range.
Regarding benchmarks, PR#8242 is significantly faster in some cases,
but one of these two cases is a wrong result.
| |ActiveSupport| PR#8242|built-ruby|
|:--------------------------|------------:|-------:|---------:|
|(2..3).overlap?(1..1) | 7.761M| 15.053M| 32.368M|
| | -| 1.94x| 4.17x|
|(2..3).overlap?(2..4) | 25.720M| 55.070M| 21.981M|
| | 1.17x| 2.51x| -|
|(2..3).overlap?(4..5) | 7.616M| 15.048M| 21.730M|
| | -| 1.98x| 2.85x|
|(2..3).overlap?(2..1) | 25.585M| 56.545M| 32.786M|
| | -| 2.21x| 1.28x|
|(2..3).overlap?(0..1) | 7.554M| 14.755M| 32.545M|
| | -| 1.95x| 4.31x|
|(2..3).overlap?(...1) | 6.681M| 5.843M| 32.255M|
| | 1.14x| -| 5.52x|
|(2...3).overlap?(..2) | 6.676M| 5.817M| 21.572M|
| | 1.15x| -| 3.71x|
|(2...3).overlap?(3...) | 7.392M| 14.755M| 31.805M|
| | -| 2.00x| 4.30x|
|(2..3).overlap?('a'..'d') | 3.675M| 3.482M| 17.009M|
| | 1.06x| -| 4.89x|
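The semantics exercised by the benchmark cases above can be written down as a pure-Ruby reference. This is a sketch for comparable bounds only, not the C implementation; `ranges_overlap?`, `empty_range?` and `starts_before_end?` are hypothetical helper names:

```ruby
# A range with both bounds is empty when begin > end, or begin == end with an
# excluded end, or when the bounds cannot be compared at all.
def empty_range?(r)
  return false if r.begin.nil? || r.end.nil?
  cmp = r.begin <=> r.end
  cmp.nil? || cmp > 0 || (cmp == 0 && r.exclude_end?)
end

# True when x.begin does not lie past y.end (nil bounds are unbounded).
def starts_before_end?(x, y)
  return true if x.begin.nil? || y.end.nil?
  cmp = x.begin <=> y.end
  return false if cmp.nil? # incomparable bounds never overlap
  y.exclude_end? ? cmp < 0 : cmp <= 0
end

def ranges_overlap?(a, b)
  return false if empty_range?(a) || empty_range?(b)
  starts_before_end?(a, b) && starts_before_end?(b, a)
end

p ranges_overlap?(2..3, 2..4)
p ranges_overlap?(2..3, 2..1)
```

The second call is the case the message refers to: `2..1` is empty, so it cannot overlap anything, even though its begin lies inside `2..3`.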
|
|
On Range#bsearch for endless ranges, we try positions at `begin + 2**i` (i = 0, 1, 2, ...)
to find a point that satisfies a given condition.
Subsequently, we perform binary searching with the interval `[begin, begin + 2**n]`.
However, the interval `[begin + 2**(n-1), begin + 2**n]` is sufficient for binary search
because `begin + 2**(n-1)` does not satisfy the condition.
The same applies to beginless ranges.
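The narrowing can be sketched in plain Ruby for the find-minimum mode on integers. `bsearch_endless` is a hypothetical name, and the sketch assumes the block eventually returns true (otherwise the probing loop never ends):

```ruby
# Finds the smallest integer >= start for which the block returns true,
# assuming the block is false up to some point and true afterwards.
def bsearch_endless(start)
  low = start
  step = 1
  # Probe start + 2**i until the condition holds.
  until yield(start + step)
    low = start + step + 1 # the failed probe and everything below it is ruled out
    step *= 2
  end
  high = start + step # first probe that satisfied the condition

  # Binary search only in the narrowed interval [low, high].
  while low < high
    mid = low + (high - low) / 2
    if yield(mid)
      high = mid
    else
      low = mid + 1
    end
  end
  low
end

p bsearch_endless(0) { |x| x >= 1000 }
p bsearch_endless(5) { |x| x >= 3 }
```

When the very first probe succeeds, `low` is still `start`, so `start` itself is checked by the binary search.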
|
|
Leave callers to convert byte index to char index, as well as
`rb_str_index`, so that `rb_str_rpartition` does not need to
re-convert char index to byte index.
Notes:
Merged: https://github.com/ruby/ruby/pull/8047
|
|
When copying from another regexp, copy already built `regex_t` instead
of re-compiling its source.
Notes:
Merged: https://github.com/ruby/ruby/pull/7922
|
|
In most cases `sort_by` works on primitive types.
Using `qsort_r` with a function pointer is much slower than comparing the data directly.
I implemented an introsort for `sort_by` that compares primitive data directly.
We can even afford an O(n) type check before sorting primitive data,
and it is still faster.
Notes:
Merged: https://github.com/ruby/ruby/pull/7805
Merged-By: nobu <nobu@ruby-lang.org>
|
|
CALLER_ARG_SPLAT is not necessary for method_missing. We just need
to unshift the method name into the arguments.
This optimizes all method_missing calls:
* mm(recv) ~9%
* mm(recv, *args) ~215% for args.length == 200
* mm(recv, *args, **kw) ~55% for args.length == 200
* mm(recv, **kw) ~22%
* mm(recv, kw: 1) ~100%
Note that empty argument splats do get slower with this approach,
by about 30-40%. Non-empty argument splats are faster, with the
speedup depending on the number of arguments.
Notes:
Merged: https://github.com/ruby/ruby/pull/7522
|
|
Similar to the bmethod/send optimization, this avoids using
CALLER_ARG_SPLAT if not necessary. As long as the receiver argument
can be shifted off, other arguments are passed through as-is.
This optimizes the following types of calls:
* symproc.(recv) ~5%
* symproc.(recv, *args) ~65% for args.length == 200
* symproc.(recv, *args, **kw) ~45% for args.length == 200
* symproc.(recv, **kw) ~30%
* symproc.(recv, kw: 1) ~100%
Note that empty argument splats do get slower with this approach,
by about 2-3%. This is probably because iseq argument setup is
slower for empty argument splats than CALLER_SETUP_ARG is.
Non-empty argument splats are faster, with the speedup depending
on the number of arguments.
The following types of calls are not optimized:
* symproc.(*args)
* symproc.(*args, **kw)
This is because you cannot shift the receiver argument off
without first splatting the arg.
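As a concrete reading of the lists above, here are the call shapes with a symbol proc. `Tagger` is a hypothetical class; the comments restate which shapes the commit message says are and are not optimized.

```ruby
# Hypothetical class used only to show the call shapes.
class Tagger
  def tag(*words, sep: " ")
    words.join(sep)
  end
end

symproc = :tag.to_proc
recv = Tagger.new
args = %w[a b]
kw = { sep: "-" }

# Receiver is a separate leading argument: can be shifted off directly.
plain = symproc.(recv)
by_splat = symproc.(recv, *args)
by_splat_kw = symproc.(recv, *args, **kw)
by_literal_kw = symproc.(recv, "a", sep: "+")

# Receiver is inside the splatted array: the array must be splatted first.
everything = [recv, *args]
unoptimized = symproc.(*everything)
unoptimized_kw = symproc.(*everything, **kw)
```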
Notes:
Merged: https://github.com/ruby/ruby/pull/7522
|
|
Similar to the bmethod optimization, this avoids using
CALLER_ARG_SPLAT if not necessary. As long as the method argument
can be shifted off, other arguments are passed through as-is.
This optimizes the following types of calls:
* send(meth, arg) ~5%
* send(meth, *args) ~75% for args.length == 200
* send(meth, *args, **kw) ~50% for args.length == 200
* send(meth, **kw) ~25%
* send(meth, kw: 1) ~115%
Note that empty argument splats do get slower with this approach,
by about 20%. This is probably because iseq argument setup is
slower for empty argument splats than CALLER_SETUP_ARG is.
Non-empty argument splats are faster, with the speedup depending
on the number of arguments.
The following types of calls are not optimized:
* send(*args)
* send(*args, **kw)
This is because you cannot shift the method argument off
without first splatting the arg.
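The same distinction for `send`, spelled out in Ruby. `Greeter` is a hypothetical class; the point is only where the method name sits relative to the splat.

```ruby
# Hypothetical class used only to show the call shapes.
class Greeter
  def greet(*names, punctuation: "!")
    "hello #{names.join(', ')}#{punctuation}"
  end
end

recv = Greeter.new
args = %w[a b]
kw = { punctuation: "?" }

# Method name is a separate leading argument.
one_arg = recv.send(:greet, "a")
splat = recv.send(:greet, *args)
splat_kw = recv.send(:greet, *args, **kw)
literal_kw = recv.send(:greet, punctuation: ".")

# Method name is inside the splatted array.
everything = [:greet, *args]
unoptimized = recv.send(*everything)
unoptimized_kw = recv.send(*everything, **kw)
```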
Notes:
Merged: https://github.com/ruby/ruby/pull/7522
|
|
This optimizes the following calls:
* ~10-15% for f(*a) when a does not end with a flagged keywords hash
* ~10-15% for f(*a) when a ends with an empty flagged keywords hash
* ~35-40% for f(*a, **kw) if kw is empty
This still copies the array contents to the VM stack, but avoids some
overhead. It would be faster to use the array pointer directly,
but that could cause problems if the array was modified during
the call to the function. You could do that optimization for frozen
arrays, but as splatting frozen arrays is uncommon, and the speedup
is minimal (<5%), it doesn't seem worth it.
The vm_send_cfunc benchmark has been updated to test additional cfunc
call types, and the numbers above were taken from the benchmark results.
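The hazard mentioned above, an array being modified during the call it was splatted into, can be seen from Ruby. The sketch assumes `Numeric#step` is implemented in C in the build being used; because the splatted elements are copied before the call, clearing the array inside the block does not change the arguments the method already received.

```ruby
a = [10, 3]          # limit and step for Numeric#step
visited = []

# The block empties `a` while the call that received *a is still running.
1.step(*a) do |i|
  a.clear
  visited << i
end

# f(*a, **kw) with an empty kw passes no keywords at all.
b = [10, 3]
kw = {}
with_empty_kw = 1.step(*b, **kw).to_a
```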
Notes:
Merged: https://github.com/ruby/ruby/pull/7522
|
|
Currently, bmethod arguments are copied from the VM stack to the
C stack in vm_call_bmethod, then copied from the C stack to the VM
stack later in invoke_iseq_block_from_c. This is inefficient.
This adds vm_call_iseq_bmethod and vm_call_noniseq_bmethod.
vm_call_iseq_bmethod is an optimized method that skips stack
copies (though there is one copy to remove the receiver from
the stack), and avoids calling vm_call_bmethod_body,
rb_vm_invoke_bmethod, invoke_block_from_c_proc,
invoke_iseq_block_from_c, and vm_yield_setup_args.
The vm_call_iseq_bmethod argument handling is similar to the
way normal iseq methods are called, and allows for similar
performance optimizations when using splats or keywords.
However, even in the no argument case it's still significantly
faster.
A benchmark is added for bmethod calling. In my environment,
it improves bmethod calling performance by 38-59% for simple
bmethod calls, and up to 180% for bmethod calls passing
literal keywords on both sides.
```
./miniruby-iseq-bmethod: 18159792.6 i/s
./miniruby-m: 13174419.1 i/s - 1.38x slower
bmethod_simple_1
./miniruby-iseq-bmethod: 15890745.4 i/s
./miniruby-m: 10008972.7 i/s - 1.59x slower
bmethod_simple_0_splat
./miniruby-iseq-bmethod: 13142804.3 i/s
./miniruby-m: 11168595.2 i/s - 1.18x slower
bmethod_simple_1_splat
./miniruby-iseq-bmethod: 12375791.0 i/s
./miniruby-m: 8491140.1 i/s - 1.46x slower
bmethod_no_splat
./miniruby-iseq-bmethod: 10151258.8 i/s
./miniruby-m: 8716664.1 i/s - 1.16x slower
bmethod_0_splat
./miniruby-iseq-bmethod: 8138802.5 i/s
./miniruby-m: 7515600.2 i/s - 1.08x slower
bmethod_1_splat
./miniruby-iseq-bmethod: 8028372.7 i/s
./miniruby-m: 5947658.6 i/s - 1.35x slower
bmethod_10_splat
./miniruby-iseq-bmethod: 6953514.1 i/s
./miniruby-m: 4840132.9 i/s - 1.44x slower
bmethod_100_splat
./miniruby-iseq-bmethod: 5287288.4 i/s
./miniruby-m: 2243218.4 i/s - 2.36x slower
bmethod_kw
./miniruby-iseq-bmethod: 8931358.2 i/s
./miniruby-m: 3185818.6 i/s - 2.80x slower
bmethod_no_kw
./miniruby-iseq-bmethod: 12281287.4 i/s
./miniruby-m: 10041727.9 i/s - 1.22x slower
bmethod_kw_splat
./miniruby-iseq-bmethod: 5618956.8 i/s
./miniruby-m: 3657549.5 i/s - 1.54x slower
```
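For readers unfamiliar with the term, a bmethod is a method whose body is a block, as created by `define_method`. The calls below are a small illustration of the kinds of call the benchmark names suggest (simple, splat, keywords); `Calculator` is a hypothetical class and is not taken from the benchmark file.

```ruby
# Hypothetical class; each method body is a Ruby block (an iseq bmethod).
class Calculator
  define_method(:add) { |a, b| a + b }
  define_method(:total) { |*values| values.sum }
  define_method(:scale) { |value, by: 2| value * by }
end

calc = Calculator.new
numbers = [1, 2, 3]

simple = calc.add(1, 2)
splatted = calc.total(*numbers)
default_kw = calc.scale(5)
literal_kw = calc.scale(5, by: 3)
```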
Notes:
Merged: https://github.com/ruby/ruby/pull/7522
|
|
|
|
Notes:
Merged-By: k0kubun <takashikkbn@gmail.com>
|
|
Notes:
Merged: https://github.com/ruby/ruby/pull/6965
|
|
* Rewrite Kernel#loop in Ruby
* Use enum_for(:loop) { Float::INFINITY }
Co-authored-by: Ufuk Kayserilioglu <ufuk@paralaus.com>
* Limit the scope to rescue StopIteration
Co-authored-by: Ufuk Kayserilioglu <ufuk@paralaus.com>
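A rough sketch of what the bullets above describe, written under a different name so it does not shadow the real method. This is an approximation based on the commit message, not the actual definition: `loop_sketch` is hypothetical, and the details of the real method may differ.

```ruby
# Hypothetical approximation of a Ruby-level Kernel#loop.
def loop_sketch
  # Without a block, return an enumerator whose size is infinite.
  return enum_for(:loop_sketch) { Float::INFINITY } unless block_given?

  begin
    while true
      yield
    end
  rescue StopIteration => e
    # Only StopIteration ends the loop quietly; its result is returned.
    e.result
  end
end

source = [1, 2, 3].each
collected = []
result = loop_sketch { collected << source.next }
enum_size = loop_sketch.size
```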
Notes:
Merged-By: k0kubun <takashikkbn@gmail.com>
|
|
`Time.new` now parses strings such as the result of `Time#inspect`
and restricted ISO-8601 formats.
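A short round-trip example, assuming a Ruby version that includes this change (3.2 or later):

```ruby
original = Time.new(2021, 12, 25, 1, 2, 3, "+09:00")

# Time#inspect produces a string such as "2021-12-25 01:02:03 +0900",
# which Time.new can now parse back into an equivalent Time.
text = original.inspect
parsed = Time.new(text)

round_trips = parsed == original
offset = parsed.utc_offset
```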
Notes:
Merged: https://github.com/ruby/ruby/pull/4825
|
|
Prior to this commit the `OPTIMIZED_CMP` macro relied on a method lookup
to determine whether `<=>` was overridden. The result of the lookup was
cached, but only for the duration of the specific method that
initialized the cmp_opt_data cache structure.
With this method lookup, `[x,y].max` is slower than doing `x > y ?
x : y` even though there's an optimized instruction for "new array max".
(John noticed that somebody proposed a micro-optimization based on this fact
in https://github.com/mastodon/mastodon/pull/19903.)
```rb
require 'benchmark/ips' # from the benchmark-ips gem

a, b = 1, 2
Benchmark.ips do |bm|
  bm.report('conditional') { a > b ? a : b }
  bm.report('method') { [a, b].max }
  bm.compare!
end
```
Before:
```
Comparison:
conditional: 22603733.2 i/s
method: 19820412.7 i/s - 1.14x (± 0.00) slower
```
This commit replaces the method lookup with a new CMP basic op, which
gives the examples above equivalent performance.
After:
```
Comparison:
method: 24022466.5 i/s
conditional: 23851094.2 i/s - same-ish: difference falls within error
```
Relevant benchmarks show an improvement to Array#max and Array#min when
not using the optimized newarray_max instruction as well. They are
noticeably faster for small arrays with the relevant types, and the same
or maybe a touch faster on larger arrays.
```
$ make benchmark COMPARE_RUBY=<master@5958c305> ITEM=array_min
$ make benchmark COMPARE_RUBY=<master@5958c305> ITEM=array_max
```
The benchmarks added in this commit also look generally improved.
Co-authored-by: John Hawthorn <jhawthorn@github.com>
|
|
for consistency with YJIT
Notes:
Merged-By: k0kubun <takashikkbn@gmail.com>
|