S3 will cache to memory until it gets OOMkilled #2474

Description

@j1ah0ng

Hey! We're using metaflow.S3 in a standalone context as the backend for a PyTorch dataset running within a k8s cluster (roughly as in the sketch below). From the docs:

Note that in most cases when the data fits in memory, no local disk IO is needed as operations are cached by the operating system, which makes operations fast as long as there is enough memory available.
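For context, our access pattern looks roughly like the following sketch (bucket and key names are made up); each `__getitem__` pulls one object through a standalone client:

```python
# Rough sketch of our setup -- bucket/key names are made up,
# decoding and transforms are omitted.
from metaflow import S3
from torch.utils.data import Dataset

class S3ShardDataset(Dataset):
    def __init__(self, s3root, keys):
        # Standalone client, not tied to any Metaflow run/flow.
        self.s3 = S3(s3root=s3root)
        self.keys = keys

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        # Downloads the object to a local temp file managed by the client.
        obj = self.s3.get(self.keys[idx])
        return obj.blob  # raw bytes

dataset = S3ShardDataset(
    "s3://our-bucket/train/",
    [f"shard-{i:05d}.bin" for i in range(1000)],
)
```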

The memory usage on the associated pod appears to grow without bound until it hits our cluster's alarm threshold, at which point the pod is OOMKilled. Is there a way to manually set a memory limit on the S3 client and have it spill over to disk past a certain threshold? Or, even better, could the cache be tiered so that data is shuffled between disk and memory on a least-recently-used basis?

I noticed the existence of the @resources decorator, but it's not clear to me where it would be placed if the S3 client is used outside of a Metaflow flow (in this case it's a Ray job).
