S3 will cache to memory until it gets OOMkilled #2474

Description

@j1ah0ng

Hey! We're using metaflow.S3 in a standalone context as the backend for a PyTorch dataset running within a k8s cluster (roughly as in the sketch below). From the docs:

Note that in most cases when the data fits in memory, no local disk IO is needed as operations are cached by the operating system, which makes operations fast as long as there is enough memory available.
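For context, our access pattern looks roughly like the following sketch (bucket and key names are made up); each `__getitem__` pulls one object through a standalone client:

```python
# Rough sketch of our setup -- bucket/key names are made up,
# decoding and transforms are omitted.
from metaflow import S3
from torch.utils.data import Dataset

class S3ShardDataset(Dataset):
    def __init__(self, s3root, keys):
        # Standalone client, not tied to any Metaflow run/flow.
        self.s3 = S3(s3root=s3root)
        self.keys = keys

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        # Downloads the object to a local temp file managed by the client.
        obj = self.s3.get(self.keys[idx])
        return obj.blob  # raw bytes

dataset = S3ShardDataset(
    "s3://our-bucket/train/",
    [f"shard-{i:05d}.bin" for i in range(1000)],
)
```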

The memory usage on the associated pod appears to grow without bound until it hits our cluster's alarm threshold, at which point the pod is OOMKilled. Is there a way to manually set a memory limit on the S3 client and have it spill over to disk past a certain threshold? Or, even better, could the cache be tiered so that data is shuffled between disk and memory on a least-recently-used basis?

I noticed the existence of the @resources decorator, but it's not clear to me where it would be placed if the S3 client is used outside of a Metaflow flow (in this case it's a Ray job).
