Description
What version of protobuf and what language are you using?
Version: 4.29.0
Language: Ruby
What operating system (Linux, Windows, ...) and version?
- macOS 15.1.1
- Ubuntu 24.04
What runtime / compiler are you using (e.g., python version or gcc version)?
ruby -v
ruby 3.3.6 (2024-11-05 revision 75015d4c1f) [arm64-darwin24]
protoc --version
libprotoc 29.0
What did you do?
Steps to reproduce the behavior:
- Create a message with a repeated recursive field
- In code, create a long-held reference of this type (e.g., in an instance variable)
- Create new instances that link to the long-lived object
- Observe memory grows unbounded
I've pulled together a small reproduction. To see the issue:
git clone https://github.com/nirvdrum/grpc-protobuf-experiments
cd grpc-protobuf-experiments
bundle install
bundle exec rake
bundle exec ruby leak-simple.rb
At the heart of the problem is a simple message definition:
syntax = "proto3";

package proto.leak;

message Recursive {
  repeated Recursive data = 1;
}
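For illustration, linking two generated messages looks like the sketch below (assuming the generated file lives at gen/protobuf/simple_pb.rb, as in the repo). Per the analysis at the end of this report, we believe this linking step is where the C extension fuses the two messages' arenas:

require_relative "gen/protobuf/simple_pb"

parent = Proto::Leak::Recursive.new
child = Proto::Leak::Recursive.new

# Appending one message into another's repeated field ties their
# lifetimes together; we believe this is where arena fusing happens.
parent.data << child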
The repository includes a Gemfile to lock down the versions I used to reproduce the problem and a Rakefile to ease the protoc invocation. You must use an older release of protoc (e.g., brew install protobuf@21; Ubuntu likewise packages an older release, 3.21.12) because there appears to be a Ruby code generation bug in newer releases -- apologies for the confusion. This older version of protoc has no bearing on the runtime, where we do use a modern version of the google-protobuf gem.
The leak-simple.rb script in the linked repo creates local instances of Proto::Leak::Recursive in a loop, with each object linking back to a global instance of Proto::Leak::Recursive via the repeated data field. After an inner loop finishes executing, we trigger GC and then print out the RSS of the process along with Ruby's view of how much memory is allocated. Increasing the inner loop iterations will print out larger RSS values.
In simplified form, the loop looks like:
require_relative "gen/protobuf/simple_pb"
datum = Proto::Leak::Recursive.new
100_000.times { Proto::Leak::Recursive.new(data: [datum]) }
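Fleshed out into a measurement harness, that loop might look like the following sketch (assumptions: a hypothetical rss_kb helper that parses Linux's /proc/self/status, and the objspace extension for Ruby's own accounting; the repo's script may gather these numbers differently):

require "objspace"
require_relative "gen/protobuf/simple_pb"

# Hypothetical helper: read the process's resident set size (Linux-only).
def rss_kb
  File.read("/proc/self/status")[/VmRSS:\s+(\d+)/, 1].to_i
end

datum = Proto::Leak::Recursive.new

10.times do
  100_000.times { Proto::Leak::Recursive.new(data: [datum]) }
  GC.start
  # RSS keeps climbing while Ruby's own accounting stays nearly flat.
  puts "RSS: #{rss_kb} KB, Ruby-visible: #{ObjectSpace.memsize_of_all} bytes"
end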
The repository version expands on that simple loop to print out the total memory growth (RSS) over the duration of the script's execution. Running with VERBOSE=true will also print periodic memory usage numbers. The repo also has leak-bigtable.rb and leak-bigtable-extracted.rb scripts, which can be run with bundle exec ruby leak-bigtable.rb and bundle exec ruby leak-bigtable-extracted.rb, respectively. All three scripts are functionally similar but differ in the source of the proto definitions.
The leak-bigtable.rb file uses the google-cloud-bigtable gem to create instances of Google::Cloud::Bigtable::V2::RowFilter::Chain. This script reproduces the memory leak that's impacting our production workloads.
The leak-bigtable-extracted.rb file uses a subset of the proto definitions from the google-cloud-bigtable gem. The goal here was to use the source proto definitions but strip them down to the bare minimum.
The leak-simple.rb script gets at the heart of the problem and reproduces it without the large google-cloud-bigtable dependency.
What did you expect to see?
I'd expect that objects created in a loop that do not escape it would not lead to growth in memory consumption.
What did you see instead?
Memory continues to grow unbounded. However, ObjectSpace does not reflect this growth, suggesting that the memory is growing off-heap in the google-protobuf extension. After debugging, however, we believe the problem has to do with the fuse operation for Google::Protobuf::Internal::Arena. It looks like there is an arena that is held for a long time and constantly appended to. However, that would be memory on the Ruby heap, and ObjectSpace is telling us that the heap isn't growing. We believe that's an artifact of a second bug, where the memsize hook (i.e., Arena_memsize or a related function) is not properly reporting the post-fuse size. In particular, dividing the size by the fused count looks like the wrong operation:
memsize /= fused_count;
The total memsize is divided by the total number of arenas fused into it, presumably on the assumption that the other arenas would each report their share of the size; but those arenas are no longer alive and thus have no size to report. For example, if 1,000 arenas are fused into one long-lived arena, the survivor reports only 1/1,000th of the true total while the other 999 report nothing.
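That under-reporting can be observed from Ruby with a sketch like the one below (an illustration, assuming ObjectSpace.memsize_of_all picks up the arena's memsize hook, which is how we've been measuring Ruby's view of allocated memory):

require "objspace"
require_relative "gen/protobuf/simple_pb"

datum = Proto::Leak::Recursive.new

before = ObjectSpace.memsize_of_all
100_000.times { Proto::Leak::Recursive.new(data: [datum]) }
GC.start

# If Arena_memsize divides the fused arena's total by fused_count,
# this delta stays small even though RSS has grown by the size of
# every short-lived arena now fused into datum's long-lived arena.
puts ObjectSpace.memsize_of_all - before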