| Age | Commit message | Author |
|
|
|
`gc_prof_mark_timer_start` and `gc_prof_mark_timer_stop` include DTrace
hooks for the `MARK_BEGIN` and `MARK_END` events, respectively.
Previously, those probes were only triggered in `gc_marks`. However,
`gc_marks_continue` and `gc_rest` also contain marking activities, but
are not captured by the probes.
We move the invocation of `gc_prof_mark_timer_start` and
`gc_prof_mark_timer_stop` into `gc_marking_enter` and `gc_marking_exit`
to ensure all marking activities are captured by the probes.
|
|
|
|
Snapshotting at start of marking lets sweep-time frees count against the
next epoch, which roughly halves GC frequency on alloc-heavy workloads.
Move the snapshot to end of sweep so the next epoch starts from a clean
baseline.
|
|
|
|
|
|
Several SETs in rb_gc_impl_stat may allocate a T_BIGNUM RVALUE when
the value exceeds FIXNUM_MAX.
This is invisible on LP64 but trips on LLP64 Windows and ILP32 Linux
where FIXNUM_MAX ~= 1.07GB.
If those allocations happen *after* setting heap_live_slots then
stat[:heap_live_slots] reflects a stale snapshot, and tests that assert
on it fail.
This commit reorders everything so every potentially-allocating SET runs
first, and the slot counters are SET last.
|
|
|
|
Replace the single
objspace->malloc_counters.{increase,oldmalloc_increase} size_t fields
with pairs of monotonically-increasing counters. Snapshots of these
counters are taken at each GC, so that the live malloc_increase is computed as
(malloc - malloc_at_last_gc) - (free - free_at_last_gc)
We update the baselines at each GC. Minor GCs update only the baselines
for malloc and free activity associated with young objects ("counters").
Major GCs update the "oldcounters" baselines as well.
The malloc/free counters are 64 bits wide which should provide ample
headroom for real world programs (>500 years at 1 GB/sec allocation rate
XD). We use size_t on 64-bit and uint64_t on 32-bit, wrapped by a
gc_counter_t struct.
However, because updating a uint64_t is a multi-instruction
operation on 32 bit architectures we have to introduce a lock to the
malloc_counters struct to avoid racing.
We introduced 2 new macros MALLOC_COUNTERS_LOCK and
MALLOC_COUNTERS_UNLOCK that use `rb_nativethread_lock_t`.
The lock is initialized in rb_gc_impl_objspace_init and destroyed in
rb_gc_impl_objspace_free. We chose this because it mirrors the existing
finalizer_lock pattern in wbcheck.
On 64 bit platforms aligned 64 bit loads are atomic, and writes are
already using RUBY_ATOMIC_SIZE_ADD so the locks are not needed and the
macros do nothing.
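
The baseline arithmetic above can be sketched roughly as follows. This is a minimal illustration, not the actual code: the struct, its field names, and both helper functions are hypothetical, locking is omitted, and returning a signed value (so that frees outpacing mallocs shows up as a negative increase) is an assumption.

```c
#include <stdint.h>

/* Hypothetical stand-in for the counter pairs described above; the real
 * layout in the GC may differ. */
typedef struct {
    uint64_t malloc_bytes;      /* total bytes ever malloc'd, never decreases */
    uint64_t free_bytes;        /* total bytes ever freed, never decreases */
    uint64_t malloc_at_last_gc; /* baseline snapshot taken at the last GC */
    uint64_t free_at_last_gc;   /* baseline snapshot taken at the last GC */
} malloc_counters_sketch_t;

/* (malloc - malloc_at_last_gc) - (free - free_at_last_gc).
 * Signed result is an assumption: frees may exceed mallocs in an epoch. */
static int64_t
malloc_increase_sketch(const malloc_counters_sketch_t *c)
{
    uint64_t mallocs = c->malloc_bytes - c->malloc_at_last_gc;
    uint64_t frees = c->free_bytes - c->free_at_last_gc;
    return (int64_t)mallocs - (int64_t)frees;
}

/* Called at each GC: move both baselines up to the current totals. */
static void
malloc_counters_snapshot_sketch(malloc_counters_sketch_t *c)
{
    c->malloc_at_last_gc = c->malloc_bytes;
    c->free_at_last_gc = c->free_bytes;
}
```

Because the totals only ever grow, the increase is always derived from two subtractions rather than from a field that is reset in place.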
|
|
We made a mistake calculating slot sizes during the heap slot sizes
refactor. Previously BASE_SLOT_SIZE included RVALUE_OVERHEAD, this was
lost during the refactor to use the SLOT macro.
The result of this was that when Ruby was compiled with -DRUBY_DEBUG it
was assumed that the last word of each slot was RVALUE_OVERHEAD. Because
this hadn't been taken into account at allocation time, all slots were
effectively one word shorter.
This PR adds RVALUE_OVERHEAD to the size calculated in the SLOT macro
directly, so it will be added on to the physically allocated size at
allocation time.
|
|
|
|
|
|
|
|
rb_gc_initialize_vm_context calls GET_EC, which does VM_ASSERT(ec !=
NULL).
When Ruby is built with RUBY_DEBUG=1 and GC stress is set to run at boot
with RUBY_DEBUG=gc_stress then GC gets run inside Init_BareVM when we're
setting up the main_thread.
In gc_start we gate the GC with some early returns that prevent us
actually attempting a GC if the heap and objspace are not ready yet, but
we're attempting to initialize the gc's VM context before those gates,
causing the assertion to fail (because the VM isn't ready yet).
This commit moves the vm_context setup after the gates, so we don't
attempt it before objspace and the heap are fully set up.
To repro this bug configure with --enable-dev-env and
cflags=-DRUBY_DEBUG and then run
RUBY_DEBUG=gc_stress ./ruby -v
|
|
This would allow rb_gc_event_hook to run in a GC thread that is a
non-Ruby thread.
|
|
Back when this code was added, moving a T_OBJECT to a different
size pool required rebuilding its shape tree, which could allocate,
potentially triggering GC during GC.
Ref: https://github.com/ruby/ruby/pull/6926
Ref: https://github.com/ruby/ruby/pull/6938
However, this is no longer a concern.
`SHAPE_T_OBJECT` has been removed, and now transitioning a
shape from one size pool to another never involves an allocation.
Ref: https://github.com/ruby/ruby/pull/13519
Hence we can remove a lot of complexity, and directly update
the shape right after moving the object.
|
|
Since EC is thread-local, we previously used rb_gc_worker_thread_set_vm_context
in MMTk worker threads to temporarily set the EC. However, this was inelegant
and also occasionally caused crashes when marking threads/fibers for the
current EC since it will mark the current machine stack twice (once during
root marking and once for the fiber). However, since the machine
stack is actively being used, the contents may be different when marking
the fiber. Since all objects on the machine stack are pinned, this may
cause an unpinned object to be pinned, which is not allowed in Immix.
The following crash can be observed:
Object 0x200fffbc7d8 is trying to pin 0x200ffc80188
0: mmtk_ruby::handle_gc_thread_panic
1: mmtk_ruby::set_panic_hook::{{closure}}
2: <alloc::boxed::Box<dyn for<'a, 'b> core::ops::function::Fn<(&'a std::panic::PanicHookInfo<'b>,), Output = ()> + core::marker::Sync + core::marker::Send> as core::ops::function::Fn<(&std::panic::PanicHookInfo,)>>::call
at /rustc/59807616e1fa2540724bfbac14d7976d7e4a3860/library/alloc/src/boxed.rs:2254:9
3: std::panicking::panic_with_hook
at /rustc/59807616e1fa2540724bfbac14d7976d7e4a3860/library/std/src/panicking.rs:833:13
4: std::panicking::panic_handler::{closure#0}
at /rustc/59807616e1fa2540724bfbac14d7976d7e4a3860/library/std/src/panicking.rs:698:13
5: std::sys::backtrace::__rust_end_short_backtrace::<std::panicking::panic_handler::{closure#0}, !>
at /rustc/59807616e1fa2540724bfbac14d7976d7e4a3860/library/std/src/sys/backtrace.rs:182:18
6: __rustc::rust_begin_unwind
at /rustc/59807616e1fa2540724bfbac14d7976d7e4a3860/library/std/src/panicking.rs:689:5
7: core::panicking::panic_fmt
at /rustc/59807616e1fa2540724bfbac14d7976d7e4a3860/library/core/src/panicking.rs:80:14
8: <mmtk_ruby::scanning::VMScanning as mmtk::vm::scanning::Scanning<mmtk_ruby::Ruby>>::scan_object_and_trace_edges::{{closure}}
9: mmtk_ruby::abi::ObjectClosure::c_function_registered
10: rb_mmtk_call_object_closure
at gc/mmtk/mmtk.c:976:19
11: rb_gc_impl_mark_and_pin
at gc/mmtk/mmtk.c:1008:5
12: rb_gc_impl_mark_and_pin
at gc/mmtk/mmtk.c:1004:1
13: gc_mark_maybe_internal
at gc.c:2908:5
14: gc_mark_maybe_internal
at gc.c:2906:1
15: gc_mark_maybe_each_location
at gc.c:2939:5
16: gc_mark_maybe_each_location
at gc.c:2937:1
17: each_location
at gc.c:2924:9
18: each_location_ptr
at gc.c:2933:5
19: each_location_ptr
at gc.c:2930:1
20: rb_gc_mark_machine_context
at gc.c:3200:5
21: rb_execution_context_mark
at vm.c:3768:9
22: cont_mark
at cont.c:1155:5
23: fiber_mark
at cont.c:1284:5
24: rb_mmtk_call_gc_mark_children
at gc/mmtk/mmtk.c:318:5
25: <mmtk_ruby::scanning::VMScanning as mmtk::vm::scanning::Scanning<mmtk_ruby::Ruby>>::scan_object_and_trace_edges::{{closure}}
|
|
We implemented some bit twiddling logic with an unsigned int to have a
neat way of tracking which heaps were currently sweeping, but we
actually don't need to care which heap is sweeping right now, just
whether some are or not, so we can replace this with a counter.
|
|
|
|
Add a 7/8 multiplier to the min_free_slots checks in
gc_sweep_finish_heap and gc_marks_finish, allowing heaps to be up to
~12.5% below the free slots target without triggering a major GC or
forced growth.
With 12 heaps instead of 5, each heap independently hitting the exact
threshold would cause excessive memory growth. The slack prevents
cascading growth decisions while still ensuring heaps stay close to
their target occupancy.
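
A minimal sketch of the slack check, assuming the comparison is a plain integer `* 7 / 8` on the target; the function name and exact expression are hypothetical and not taken from gc_sweep_finish_heap or gc_marks_finish.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true only when the heap has fallen more than ~12.5% below its
 * free slots target. Hypothetical helper; the real checks are inline. */
static bool
below_free_slots_threshold_sketch(size_t free_slots, size_t min_free_slots)
{
    return free_slots < min_free_slots * 7 / 8;
}
```

With a target of 800 free slots, a heap with 700 free slots is left alone, while one with 699 would be treated as short of free slots.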
|
|
|
|
Add GC::INTERNAL_CONSTANTS[:RVALUE_SIZE] to store the usable size
(excluding debug overhead) of the smallest pool that can hold a standard
RVALUE.
|
|
Replace the RVALUE_SLOT_SIZE-multiplier based pool sizes with explicit
power-of-two (and near-power-of-two) slot sizes. On 64-bit this gives
12 heaps (32, 40, 64, 80, 96, 128, 160, 256, 512, 640, 768, 1024)
instead of 5, providing finer granularity and less internal
fragmentation. On 32-bit the layout is 5 heaps (32, 64, 128, 256, 512).
|
|
|
|
This reverts commit c617c5ec85ff69a5a8b13c56d51fcd234c00e1e2.
|
|
Anchors the historical 2048/1024 slot counts on the 80-byte
heap instead of the 40-byte heap. This isolates whether the
major GC elimination seen in railsbench was caused by heap 1's
halved budget in the previous commit.
|
|
Larger slot pools are less heavily used, so a fixed slot count
over-services them relative to allocation pressure. Divide a
byte budget by heap->slot_size so the effective per-step slot
count tapers inversely with slot size.
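
As a rough sketch of the tapering, assuming a hypothetical byte budget of 2048 × 40 bytes (the message does not state the real value) and a hypothetical clamp to at least one slot:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical budget chosen so the 40-byte heap gets 2048 slots per step. */
#define STEP_BYTE_BUDGET_SKETCH ((size_t)2048 * 40)

/* Slots serviced per step for a heap with the given slot size. */
static size_t
step_slots_sketch(size_t slot_size)
{
    assert(slot_size > 0);
    size_t slots = STEP_BYTE_BUDGET_SKETCH / slot_size;
    return slots > 0 ? slots : 1; /* clamp is an assumption of this sketch */
}
```

Doubling the slot size halves the per-step slot count, so larger pools receive proportionally fewer slots for the same byte budget.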
|
|
|
|
|
|
Replace per-heap GC_HEAP_INIT_SLOTS with a single GC_HEAP_INIT_BYTES
target.
Instead of allocating a fixed 10k slot budget for each heap to grow
into, this PR gives each heap a fixed 2.5 MiB heap growth allowance. This
keeps the overall heap size budget roughly the same, but allows the
smaller pools to grow much larger before more pages are allocated.
Before (fixed slot count per heap):
```
Heap 0: 10,000 × 40 = 400,000 bytes
Heap 1: 10,000 × 80 = 800,000 bytes
Heap 2: 10,000 × 160 = 1,600,000 bytes
Heap 3: 10,000 × 320 = 3,200,000 bytes
Heap 4: 10,000 × 640 = 6,400,000 bytes
Total: 12,400,000 bytes (50,000 slots)
```
After (fixed byte budget per heap):
```
Heap 0: 2,621,440 / 40 = 65,536 slots
Heap 1: 2,621,440 / 80 = 32,768 slots
Heap 2: 2,621,440 / 160 = 16,384 slots
Heap 3: 2,621,440 / 320 = 8,192 slots
Heap 4: 2,621,440 / 640 = 4,096 slots
Total: 13,107,200 bytes (126,976 slots)
```
|
|
|
|
Index on 8 byte chunks instead of individual bytes. This works because
all pool slot sizes are pointer aligned, so all sizes in an 8 byte range
map to the same heap.
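
The chunked lookup can be sketched like this. The five slot sizes, the table, and every name below are assumptions for illustration; the real table and heap layout may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_COUNT_SKETCH 5
#define MAX_SLOT_SIZE_SKETCH 640

/* Hypothetical slot sizes, all multiples of 8 (pointer aligned on 64-bit). */
static const size_t slot_sizes_sketch[HEAP_COUNT_SKETCH] = {40, 80, 160, 320, 640};

/* One entry per 8-byte chunk instead of one entry per byte. */
static uint8_t heap_idx_table_sketch[MAX_SLOT_SIZE_SKETCH / 8 + 1];

static void
heap_idx_table_init_sketch(void)
{
    size_t heap = 0;
    for (size_t chunk = 0; chunk <= MAX_SLOT_SIZE_SKETCH / 8; chunk++) {
        /* The largest size in this chunk is chunk * 8; pick the first
         * heap whose slots can hold it. */
        while (slot_sizes_sketch[heap] < chunk * 8) heap++;
        heap_idx_table_sketch[chunk] = (uint8_t)heap;
    }
}

static size_t
heap_idx_for_size_sketch(size_t size)
{
    assert(size <= MAX_SLOT_SIZE_SKETCH);
    return heap_idx_table_sketch[(size + 7) / 8]; /* round up to a chunk */
}
```

Sizes 41 through 48 all round up to the same chunk and therefore share one table entry, which is safe only because no slot size falls strictly inside an 8-byte range.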
|
|
Also remove BASE_SLOT_SIZE.
|
|
Use the appropriate modifier. `size_t` is not always `unsigned long`,
even if the size is the same.
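
For illustration, the portable modifier for `size_t` is `%zu`. The helper and format string below are hypothetical and are not the line this commit changed.

```c
#include <stddef.h>
#include <stdio.h>

/* %zu matches size_t on every platform; %lu is only correct where size_t
 * happens to be unsigned long. */
static int
format_slot_size_sketch(char *buf, size_t buflen, size_t slot_size)
{
    return snprintf(buf, buflen, "slot_size=%zu", slot_size);
}
```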
|
|
Previously classes and modules were pre-aged, i.e. as soon as they're
allocated they are aged to old_age - 1. This was done with the
assumption that classes are generally always long lived, so we should
assume that any that survive a single GC can be considered old.
This commit keeps the same semantics, but moves the logic out of the
allocation path, in order to simplify allocation. Classes and modules
are now set to old-age the first time they are marked.
|
|
In gc_sweep_plane, VALGRIND_MAKE_MEM_UNDEFINED was using BASE_SLOT_SIZE
which only covers the smallest pool's slot size. For larger size pools
this left the tail of the slot with stale state. Use the page's actual
slot_size instead.
In gc_prof_set_heap_info, heap_use_size and heap_total_size were computed
as object_count * BASE_SLOT_SIZE, undercounting memory for objects in
larger size pools. Sum across all heaps using each pool's actual slot
size for correct byte totals.
|
|
This is being used to calculate the starting point of the slots in a
page in order to make them evenly divisible by a bitmap plane.
Since https://github.com/ruby/ruby/pull/16150 we restructured the
bitmaps in order to pack them such that 1 bit == 1 slot, and remove the
masking, meaning that we no longer need to align against planes.
This is the last remaining use for the NUM_IN_PAGE macro so we can remove
that as well.
|
|
This was useful when there was only a single size pool to have an easy
way of referencing the average number of objects a page could hold (this
would vary by a few in real terms because of page alignment).
But with multiple heaps, each heap contains pages with different numbers
of objects because slot sizes are different.
So when we use HEAP_PAGE_OBJ_LIMIT for any kind of calculation (such
as calculating freeable pages), we're significantly underestimating
the number of freeable pages in the larger size pools, which will cause
us to hold on to pages unnecessarily.
This commit replaces uses of HEAP_PAGE_OBJ_LIMIT with a more accurate
approximation for the actual heap being manipulated.
It also removes HEAP_PAGE_OBJ_LIMIT from GC::INTERNAL_CONSTANTS.
|
|
As @jhawthorn pointed out, the original calculation used `(1 << 32) /
heap->slot_size + 1)` which leads to a subtle off by one error that gets
shifted away because our slot sizes aren't powers of 2.
This is still worth fixing now, so that we don't trip up over it if we
change slot sizes in the future.
|
|
because BASE_SLOT_SIZE changes on 32 bit, and when debug/devel symbols
are added
|
|
|
|
instead of computing them on page add
|
|
Replace the BASE_SLOT_SIZE-granularity bitmap scheme with slot-based
indexing where each bit represents one slot regardless of size.
Key changes:
- Add slot_div_magic field to heap_page for fast division
- Use Go-inspired formula: slot_index = (offset * div_magic) >> 32
- Update all bitmap iteration to use one-bit-per-slot scheme
- Remove slot_bits_mask from rb_heap_t (no longer needed)
This enables arbitrary slot sizes (not just power-of-two multiples of
BASE_SLOT_SIZE) by decoupling bitmap indexing from slot size.
Functions updated:
- gc_sweep_plane/gc_sweep_page
- rgengc_rememberset_mark/rgengc_rememberset_mark_plane
- gc_marks_wb_unprotected_objects/gc_marks_wb_unprotected_objects_plane
- gc_compact_plane/gc_compact_page
- invalidate_moved_plane/invalidate_moved_page
- RVALUE_AGE_GET/RVALUE_AGE_SET_BITMAP
Inspired by Go runtime's mbitmap.go divideByElemSize().
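
The multiply-and-shift division can be sketched as below. The helper names are hypothetical, and computing the magic value as `UINT32_MAX / slot_size + 1` follows the Go runtime's approach rather than a line confirmed from this commit. The result matches `offset / slot_size` only while offsets stay small relative to 2^32, as they do for offsets within a heap page.

```c
#include <assert.h>
#include <stdint.h>

/* Precompute ceil(2^32 / slot_size) without overflowing 32 bits. */
static uint32_t
slot_div_magic_sketch(uint32_t slot_size)
{
    assert(slot_size > 1); /* slot_size == 1 would wrap the magic to 0 */
    return UINT32_MAX / slot_size + 1;
}

/* slot_index = (offset * div_magic) >> 32, using a 64-bit intermediate. */
static uint32_t
slot_index_sketch(uint32_t offset, uint32_t div_magic)
{
    return (uint32_t)(((uint64_t)offset * div_magic) >> 32);
}
```

The magic value is computed once per slot size, so the per-object cost is one multiply and one shift instead of an integer division.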
|
|
|
|
This aims to speed up sweeping by clearing all age and wb_unprotected
bits for unmarked objects. This should be faster because we can clear
up to a whole plane of objects (64 slots) at once.
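
A minimal sketch of the word-at-a-time clearing, assuming 64-bit planes and a single age word per plane for simplicity (the real age bitmap layout may differ); all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* For each 64-slot plane, keep age and wb_unprotected bits only for slots
 * whose mark bit is set; one AND clears every unmarked slot at once. */
static void
clear_unmarked_bits_sketch(const uint64_t *mark_bits,
                           uint64_t *age_bits,
                           uint64_t *wb_unprotected_bits,
                           size_t planes)
{
    for (size_t i = 0; i < planes; i++) {
        age_bits[i] &= mark_bits[i];
        wb_unprotected_bits[i] &= mark_bits[i];
    }
}
```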
|
|
Previously we used two adjacent bits in the same word to store the
object's age. This changes that to instead store the age in the same bit
position across two adjacent words. This makes age use the exact same
bit positions as the other bitmaps (just across two words).
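
The two-word layout might look roughly like this. Which word holds the low bit and which holds the high bit is an assumption, as are the function names.

```c
#include <stdint.h>

/* age_words[0] holds the low bit of every slot's age and age_words[1] the
 * high bit, both at the slot's usual bit position (assumed ordering). */
static unsigned
age_get_sketch(const uint64_t age_words[2], unsigned bit)
{
    unsigned lo = (unsigned)((age_words[0] >> bit) & 1);
    unsigned hi = (unsigned)((age_words[1] >> bit) & 1);
    return (hi << 1) | lo;
}

static void
age_set_sketch(uint64_t age_words[2], unsigned bit, unsigned age)
{
    uint64_t mask = (uint64_t)1 << bit;
    age_words[0] = (age_words[0] & ~mask) | ((uint64_t)(age & 1) << bit);
    age_words[1] = (age_words[1] & ~mask) | ((uint64_t)((age >> 1) & 1) << bit);
}
```

Because a slot's age bits sit at the same position as its mark bit, a mask built from the mark word can be applied to both age words unchanged.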
|
|
|
|
|
|
For now the provided size is just for GC statistics, but in the future
we may want to forward it to C23's `free_sized` and passing an incorrect
size to it is undefined behavior.
|
|
|