
Memory leak in Ruby gem with repeated recursive messages #19498


Description


What version of protobuf and what language are you using?
Version: 4.29.0
Language: Ruby

What operating system (Linux, Windows, ...) and version?

  • macOS 15.1.1
  • Ubuntu 24.04

What runtime / compiler are you using (e.g., python version or gcc version)

ruby -v
ruby 3.3.6 (2024-11-05 revision 75015d4c1f) [arm64-darwin24]

protoc --version
libprotoc 29.0

What did you do?
Steps to reproduce the behavior:

  1. Create a message with a repeated recursive field
  2. In code, hold a long-lived reference to an instance of this type (e.g., in an instance variable)
  3. Create new instances that link to the long-lived object
  4. Observe memory grows unbounded

I've pulled together a small reproduction. To see the issue:

  1. git clone https://github.com/nirvdrum/grpc-protobuf-experiments
  2. cd grpc-protobuf-experiments
  3. bundle install
  4. bundle exec rake
  5. bundle exec ruby leak-simple.rb

At the heart of the problem is a simple message definition:

syntax = "proto3";

package proto.leak;

message Recursive {
    repeated Recursive data = 1;
}

The repository includes a Gemfile to lock down the versions I used to reproduce the problem and a Rakefile to ease the protoc invocation. You must use an older protoc (e.g., brew install protobuf@21) because there appears to be a Ruby code generation bug in newer releases. The older protoc has no bearing on the runtime, where we do use a modern version of the google-protobuf gem. On Ubuntu, the packaged protoc is also an older release (3.12.21), not a newer one; apologies for the confusion.
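
For reference, the Rakefile task is roughly shaped like the following sketch (the paths proto/simple.proto and gen/protobuf are assumptions inferred from the require below, not copied from the repo):

# Hypothetical Rakefile task; paths are assumed, not taken from the repo.
task :default do
  sh "protoc --proto_path=proto --ruby_out=gen/protobuf proto/simple.proto"
end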

The leak-simple.rb script in the linked repo creates local instances of Proto::Leak::Recursive in a loop, with each object linking back to a global instance of Proto::Leak::Recursive via the repeated data field. After each inner loop finishes, we trigger GC and then print the process RSS along with Ruby's view of how much memory is allocated. Increasing the inner loop iteration count produces larger RSS values.

In simplified form, the loop looks like:

require_relative "gen/protobuf/simple_pb"
datum = Proto::Leak::Recursive.new

100_000.times { Proto::Leak::Recursive.new(data: [datum]) }
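
Building on that snippet, the measurement around the loop is roughly the following sketch (forcing GC and reading RSS via ps are assumptions about the mechanics, not the repo's exact code):

require "objspace"

10.times do
  100_000.times { Proto::Leak::Recursive.new(data: [datum]) }

  GC.start
  rss_kb = `ps -o rss= -p #{Process.pid}`.to_i   # process RSS in KB
  puts "RSS: #{rss_kb} KB, Ruby-reported: #{ObjectSpace.memsize_of_all} bytes"
end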

The repository version expands on that simple loop to print the total memory growth (RSS) over the duration of the script's execution. Running with VERBOSE=true also prints periodic memory usage numbers. The repo additionally has leak-bigtable.rb and leak-bigtable-extracted.rb scripts, which can be run with bundle exec ruby leak-bigtable.rb and bundle exec ruby leak-bigtable-extracted.rb. All three scripts are functionally similar but differ in the source of the proto definitions.

The leak-bigtable.rb file uses the google-cloud-bigtable gem to create instances of Google::Cloud::Bigtable::V2::RowFilter::Chain. This script reproduces the memory leak that is impacting our production workloads.
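
For illustration, the pattern that script exercises looks roughly like this (a sketch, not the actual leak-bigtable.rb; the choice of pass_all_filter for the long-lived filter is an assumption):

require "google/cloud/bigtable/v2"

# A long-lived filter, analogous to `datum` above (assumed shape).
base_filter = Google::Cloud::Bigtable::V2::RowFilter.new(pass_all_filter: true)

# Each new Chain links back to the long-lived filter via its repeated `filters`
# field, mirroring the recursive `data` field in the simple reproduction.
100_000.times do
  Google::Cloud::Bigtable::V2::RowFilter::Chain.new(filters: [base_filter])
end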

The leak-bigtable-extracted.rb file uses a subset of the proto definitions from the google-cloud-bigtable gem. The goal here was to use the source proto definitions but strip them down to the bare minimum.

The leak-simple.rb script gets at the heart of the problem and reproduces it without the large google-cloud-bigtable dependency.

What did you expect to see?

I'd expect that objects created in a loop that don't escape it would not lead to growth in memory consumption.

What did you see instead?

Memory continues to grow unbounded. However, ObjectSpace does not reflect this growth, suggesting that the memory is growing off-heap in the google-protobuf extension. After debugging, we believe the problem has to do with the fuse operation for Google::Protobuf::Internal::Arena: it looks like there is an arena that is held for a long time and constantly appended to. That would be memory on the Ruby heap, though, and ObjectSpace is telling us the heap isn't growing. We believe that's an artifact of a second bug in which the memsize hook (i.e., Arena_memsize or a related function) does not properly report the post-fuse size. In particular, dividing the size by the fused count looks like the wrong operation:

memsize /= fused_count;

The total memsize is divided by the number of arenas fused into it, presumably on the assumption that the other arenas would each report their own share; but those arenas are no longer alive and thus have no size to report, so the surviving arena's reported size drastically understates the memory it actually holds.
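
To make the effect concrete with illustrative (not measured) numbers: if the surviving fused arena actually holds around 100 MB after 100,000 fuses, dividing by the fused count makes it report on the order of 1 KB, which would explain a flat ObjectSpace view while RSS climbs. A hedged way to observe the reported value from Ruby, assuming Google::Protobuf::Internal::Arena is reachable and backed by that memsize hook:

require "objspace"

GC.start
# With the divided memsize, this total stays tiny even after RSS has grown by
# hundreds of megabytes. (Assumed diagnostic, not taken from the repo.)
puts ObjectSpace.memsize_of_all(Google::Protobuf::Internal::Arena)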
