
Conversation

@perfectra1n commented Aug 6, 2025

Describe what this PR does
This PR implements Kopia as a mover.

Is there anything that requires special attention?
To give this a spin yourself and see how it works, you can review the top of the fork's README. You can also see the deployed Kopia documentation here. My fork carries a handful of extra commits so that it works as expected on its own, but I've reverted them in this branch so that it can be used for the explicit purpose of merging.

Related issues:
Closes #320

Kopia

Unlike the other mover tools in VolSync, Kopia operates as a content-addressable backup system with fundamentally different approaches to data storage, deduplication, and repository management.

How Kopia works:

All Backups: [Content Hash ABC123] [Content Hash DEF456] [Content Hash GHI789]
                     ↑                    ↑                    ↑
Backup 1 Index: → ABC123 (File A) → DEF456 (File B) → GHI789 (File C)
Backup 2 Index: → ABC123 (File A) → DEF456 (File B) → JKL012 (File C')  # Only new content stored
Backup 3 Index: → ABC123 (File A) → MNO345 (File B') → JKL012 (File C') # Even more deduplication

Kopia automatically deduplicates identical content across all backups, while Restic/Rsync store incremental changes.

How Concurrent Access Works

Kopia's approach:

Client A ──┐
Client B ──┼─► Kopia Repository ◄─── Safe concurrent writes
Client C ──┘    (Content-addressed storage prevents conflicts)

Kopia's content-addressable design means multiple clients can write simultaneously because identical content gets the same hash, so there are no lock conflicts.
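
To make the content-addressing idea concrete, here is a minimal Go sketch (an illustration of the concept, not Kopia's actual code): because a blob's key is derived from its bytes, identical content is stored once, and concurrent writers putting the same content are naturally idempotent.

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// store is a toy content-addressed blob store. Keys are derived from
// content, so identical data is stored exactly once and concurrent
// Put calls for the same bytes cannot conflict.
type store struct {
	mu    sync.Mutex
	blobs map[string][]byte
}

func (s *store) Put(data []byte) string {
	sum := sha256.Sum256(data)
	key := hex.EncodeToString(sum[:])
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.blobs[key]; !ok { // duplicate content is a no-op
		s.blobs[key] = append([]byte(nil), data...)
	}
	return key
}

func main() {
	s := &store{blobs: map[string][]byte{}}
	backup1 := []string{s.Put([]byte("file A")), s.Put([]byte("file B"))}
	backup2 := []string{s.Put([]byte("file A")), s.Put([]byte("file C'"))}
	// Two backups, but only three unique blobs are stored.
	fmt.Println(len(s.blobs), backup1, backup2)
}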

Kopia's multi-tenant design:

// Each backup gets unique identity in shared repository
username: "team-frontend"     // Tenant identifier
hostname: "prod-cluster-db"   // Application identifier

// Repository structure:
repository/
├─ snapshots/
│  ├─ team-frontend@prod-cluster-db/  # Isolated namespace
│  ├─ team-backend@prod-cluster-app/   # Different tenant
│  └─ team-data@prod-cluster-db/       # Different app, same tenant
├─ content/ (shared - deduplication across tenants!)
└─ policies/ (per-tenant policies)

Benefits include shared storage with per-tenant isolation, cross-tenant deduplication that saves space, and per-tenant policies and access control.
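
In VolSync terms, that means two ReplicationSources can point at the same repository Secret while keeping separate identities. A hedged sketch (resource names here are hypothetical):

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: frontend-backup
spec:
  sourcePVC: frontend-data
  kopia:
    repository: shared-kopia-config  # same repository Secret...
    username: team-frontend          # ...but a distinct tenant identity
    hostname: prod-cluster-db
---
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: backend-backup
spec:
  sourcePVC: backend-data
  kopia:
    repository: shared-kopia-config
    username: team-backend
    hostname: prod-cluster-app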

Kopia uses pluggable storage drivers that abstract the underlying storage while providing consistent repository semantics:

Application Layer:     [Snapshot Management] [Deduplication] [Compression]
                                      ↕
Storage Abstraction:         [Kopia Storage Interface]
                                      ↕
Storage Drivers:       [S3] [GCS] [Azure] [B2] [SFTP] [WebDAV] [FS] [Rclone]
                        ↕     ↕      ↕       ↕      ↕       ↕      ↕      ↕
Physical Storage:   [AWS] [Google] [Azure] [B2] [SSH] [WebDAV] [NFS] [100+ via Rclone]

Unified Repository Format

# Same repository structure works across ALL backends:
KOPIA_REPOSITORY: s3://bucket/backups        # S3
KOPIA_REPOSITORY: gcs://bucket/backups       # Google Cloud  
KOPIA_REPOSITORY: azure://container/backups  # Azure
KOPIA_REPOSITORY: sftp://server/path         # SFTP
# ... exact same backup format and features regardless of backend!

Native Cloud Integration
Unlike Restic, which talks to object stores through a generic S3 API, Kopia ships native drivers for backends such as S3, GCS, and Azure.

S3 Intelligent Path Parsing:

# Kopia automatically handles complex nested paths
KOPIA_REPOSITORY: s3://my-bucket/team-a/prod/databases/postgres
# ↓ Kopia parses this as:
# Bucket: my-bucket
# Prefix: team-a/prod/databases/postgres
# Features: Lifecycle policies, versioning, encryption all work
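
Conceptually, the bucket/prefix split works like the following Go sketch (an illustration of the behavior described above, not Kopia's actual parser):

package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseS3Repository splits an s3:// repository URL into bucket and prefix.
func parseS3Repository(repo string) (bucket, prefix string, err error) {
	u, err := url.Parse(repo)
	if err != nil || u.Scheme != "s3" {
		return "", "", fmt.Errorf("not an s3 repository: %q", repo)
	}
	return u.Host, strings.TrimPrefix(u.Path, "/"), nil
}

func main() {
	bucket, prefix, _ := parseS3Repository("s3://my-bucket/team-a/prod/databases/postgres")
	fmt.Println(bucket, prefix) // my-bucket team-a/prod/databases/postgres
}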

Multi-Backend Credential Flexibility:

# Same secret can contain credentials for multiple backends
apiVersion: v1
kind: Secret
stringData:
  # Primary backend
  KOPIA_REPOSITORY: s3://primary-bucket/backups
  AWS_ACCESS_KEY_ID: primary-key
  
  # Fallback backend (Kopia can migrate between backends)
  KOPIA_FALLBACK_REPOSITORY: gcs://fallback-bucket/backups  
  GOOGLE_APPLICATION_CREDENTIALS: |
    {"type": "service_account", ...}

Configuration Options

The ReplicationSource spec includes Kopia-specific options:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: database-backup
spec:
  sourcePVC: my-database
  kopia:
    repository: kopia-config
    # Performance tuning
    compression: zstd              # zstd, gzip, s2, none
    parallelism: 4                 # Parallel upload streams
    
    # Retention policy
    retain:
      hourly: 24                   # Keep 24 hourly backups
      daily: 7                     # Keep 7 daily backups  
      weekly: 4                    # Keep 4 weekly backups
      monthly: 12                  # Keep 12 monthly backups
      yearly: 5                    # Keep 5 yearly backups
    
    # Cache configuration (flexible EmptyDir fallback)
    cacheCapacity: 8Gi             # Cache size
    cacheStorageClassName: fast-ssd # Optional PVC cache
    cacheAccessModes: [ReadWriteOnce]
    
    # Multi-tenancy
    username: team-frontend        # Custom username
    hostname: prod-cluster-db      # Custom hostname
    
    # Application consistency  
    actions:
      beforeSnapshot: "pg_start_backup('volsync')"
      afterSnapshot: "pg_stop_backup()"
    
    # Path preservation
    sourcePathOverride: /var/lib/postgresql/data
    
    # Maintenance
    maintenanceIntervalDays: 7     # Run maintenance weekly
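
The repository field above names a Secret with the connection details. A minimal sketch of what that Secret could look like (the KOPIA_PASSWORD key is an assumption, mirroring the RESTIC_PASSWORD convention of the restic mover):

apiVersion: v1
kind: Secret
metadata:
  name: kopia-config
stringData:
  KOPIA_REPOSITORY: s3://my-bucket/backups
  KOPIA_PASSWORD: my-repository-password  # assumed key, by analogy with RESTIC_PASSWORD
  AWS_ACCESS_KEY_ID: my-access-key
  AWS_SECRET_ACCESS_KEY: my-secret-key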

Smart Cache Strategy Selection:

Instead of requiring users to understand cache implications, the mover automatically chooses the best strategy:

# Strategy 1: No cache config → EmptyDir fallback (good for testing)
spec:
  kopia: {}
# Result: EmptyDir with 8Gi limit, reasonable performance

# Strategy 2: Size-only config → EmptyDir with limit (good for constrained resources)  
spec:
  kopia:
    cacheCapacity: 4Gi
# Result: EmptyDir with 4Gi limit, predictable resource usage

# Strategy 3: Full config → Dedicated PVC (best for production)
spec:
  kopia:
    cacheCapacity: 20Gi
    cacheStorageClassName: fast-nvme
# Result: Persistent PVC, maximum performance

The mover implements lifecycle-based metrics collection throughout the backup process:

func (m *Mover) Synchronize(ctx context.Context) (mover.Result, error) {
    // 1. Record operation start and set connectivity
    operationStart := time.Now()
    m.metrics.RepositoryConnectivity.With(labels).Set(1)
    
    // 2. Track cache configuration decisions
    m.recordCacheMetrics()  // Records cache type (PVC vs EmptyDir) and size
    
    // 3. Monitor job execution and retries
    if job.Status.Failed > 0 {
        m.recordJobRetry(operation, "job_pod_failure")
    }
    
    // 4. Record final outcome with duration
    if result.Completed {
        duration := time.Since(operationStart)
        m.recordOperationSuccess(operation, duration)
        // Also record maintenance operations when applicable
        if m.shouldRunMaintenance() {
            m.recordMaintenanceOperation()
        }
    } else {
        m.recordOperationFailure(operation, "job_execution_failed")
        m.metrics.RepositoryConnectivity.With(labels).Set(0)
    }
    return result, nil // surrounding setup (labels, job, result) is elided in this excerpt
}

How metrics are structured with labels:

func (m *Mover) getMetricLabels(operation string) prometheus.Labels {
    return prometheus.Labels{
        "obj_name":      m.owner.GetName(),        // ReplicationSource name
        "obj_namespace": m.owner.GetNamespace(),   // Namespace
        "role":          "source",                  // source vs destination
        "operation":     operation,                 // backup, restore, maintenance
        "repository":    m.repositoryName,         // Repository secret name
    }
}

Available metrics categories:

  • Performance: backup_duration_seconds, compression_ratio, data_transfer_rate
  • Reliability: backup_success_total, backup_failure_total, job_retries_total
  • Configuration: cache_type (pvc/emptydir), cache_size_bytes, policy_compliance
  • Repository Health: repository_connectivity, maintenance_operations_total
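
As a sketch of how one of these could be declared against the label set shown above (the volsync_kopia_ name prefix is an assumption, not necessarily what the PR registers):

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical declaration of the backup failure counter using the
// getMetricLabels label set; the "volsync_kopia_" prefix is assumed.
var backupFailureTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "volsync_kopia_backup_failure_total",
		Help: "Total number of failed Kopia backup operations.",
	},
	[]string{"obj_name", "obj_namespace", "role", "operation", "repository"},
)

func init() {
	prometheus.MustRegister(backupFailureTotal)
}

func (m *Mover) recordOperationFailure(operation, reason string) {
	backupFailureTotal.With(m.getMetricLabels(operation)).Inc()
}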

Policy Configuration

Support for structured repository configuration and policy files:

kopia:
  # Option 1: Structured JSON config (validated)
  policyConfig:
    repositoryConfig: |
      {
        "splitters": {"default": "DYNAMIC-4M-BUZHASH"},
        "compression": {"compressor": "zstd", "level": 3},
        "encryption": {"algorithm": "CHACHA20-POLY1305"},
        "caching": {"maxCacheSizeBytes": 1073741824}
      }
  
  # Option 2: File-based policies from ConfigMap/Secret  
  policyConfig:
    configMapName: kopia-policies
    globalPolicyFilename: global-policy.json
    repositoryConfigFilename: repo.config

Core Implementation:

  • internal/controller/mover/kopia/mover.go - Main mover implementation
  • internal/controller/mover/kopia/metrics.go - Prometheus metrics
  • internal/controller/mover/kopia/builder.go - Builder pattern for mover creation
  • mover-kopia/entry.sh - Container entry point with debug support
  • mover-kopia/Dockerfile - Kopia container with all backends

CRD Extensions:

  • api/v1alpha1/replicationsource_types.go - ReplicationSourceKopiaSpec
  • api/v1alpha1/replicationdestination_types.go - ReplicationDestinationKopiaSpec
  • api/v1alpha1/common_types.go - KopiaPolicySpec

Documentation:

  • docs/usage/kopia/index.rst - User guide
  • docs/usage/kopia/database_example.rst - Database backup examples
  • Example PrometheusRule for monitoring

Flexible Job Configuration:

func (m *Mover) ensureJob(ctx context.Context, ...) (*batchv1.Job, error) {
    // Supports EmptyDir OR PVC cache seamlessly
    // Mounts credentials for multiple backends  
    // Configures custom CA certificates
    // Sets up policy configuration files
    // Handles privileged vs unprivileged modes
}
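
For the EmptyDir-vs-PVC cache decision, the logic could look roughly like this (a sketch assuming the spec fields map onto mover fields as named here; the claim name and field types are hypothetical):

// Sketch of the cache strategy selection described earlier: a dedicated
// PVC when a storage class is configured, otherwise a size-limited EmptyDir.
func (m *Mover) cacheVolumeSource() corev1.VolumeSource {
	if m.cacheStorageClassName != nil {
		return corev1.VolumeSource{
			PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
				ClaimName: m.owner.GetName() + "-kopia-cache", // hypothetical naming
			},
		}
	}
	return corev1.VolumeSource{
		EmptyDir: &corev1.EmptyDirVolumeSource{
			// assumed *resource.Quantity; defaults to 8Gi when unset,
			// per the strategies above
			SizeLimit: m.cacheCapacity,
		},
	}
}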

Multi-Backend Environment Variables:

func (m *Mover) buildEnvironmentVariables(repo *corev1.Secret) []corev1.EnvVar {
    envVars := m.buildBasicEnvironmentVariables()
    envVars = append(envVars, m.buildRepositoryEnvironmentVariables(repo)...)
    envVars = append(envVars, m.buildAWSEnvironmentVariables(repo)...)
    envVars = append(envVars, m.buildAzureEnvironmentVariables(repo)...)
    envVars = append(envVars, m.buildGoogleEnvironmentVariables(repo)...)
    envVars = append(envVars, m.buildB2EnvironmentVariables(repo)...)
    envVars = append(envVars, m.buildWebDAVEnvironmentVariables(repo)...)
    envVars = append(envVars, m.buildSFTPEnvironmentVariables(repo)...)
    envVars = append(envVars, m.buildRcloneEnvironmentVariables(repo)...)
    // ... additional configuration
}
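
Each builder presumably maps well-known keys from the repository Secret into the mover Job's environment. A hedged sketch of one of them (the key list and structure are illustrative, not necessarily the PR's exact implementation):

// Illustrative builder: forward well-known AWS keys from the repository
// Secret into the mover Job's environment, if present.
func (m *Mover) buildAWSEnvironmentVariables(repo *corev1.Secret) []corev1.EnvVar {
	var envVars []corev1.EnvVar
	for _, key := range []string{"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN", "AWS_DEFAULT_REGION"} {
		if _, ok := repo.Data[key]; ok {
			envVars = append(envVars, corev1.EnvVar{
				Name: key,
				ValueFrom: &corev1.EnvVarSource{
					SecretKeyRef: &corev1.SecretKeySelector{
						LocalObjectReference: corev1.LocalObjectReference{Name: repo.Name},
						Key:                  key,
					},
				},
			})
		}
	}
	return envVars
}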

Database Backup:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: postgres-backup
spec:
  sourcePVC: postgres-data
  trigger:
    schedule: "0 2 * * *"  # Daily at 2 AM
  kopia:
    repository: postgres-backup-config
    retain:
      daily: 30
      weekly: 12
    actions:
      beforeSnapshot: "pg_start_backup('volsync')"
      afterSnapshot: "pg_stop_backup()"

Multi-Cloud with Policies:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: app-backup-prod
spec:
  sourcePVC: app-data
  trigger:
    schedule: "0 */6 * * *"  # Every 6 hours
  kopia:
    repository: s3-backup-config
    compression: zstd
    parallelism: 8
    cacheCapacity: 32Gi
    cacheStorageClassName: fast-nvme
    username: production-app
    hostname: prod-cluster-east
    retain:
      hourly: 48
      daily: 30
      weekly: 26
      monthly: 24
      yearly: 7
    policyConfig:
      repositoryConfig: |
        {
          "compression": {"compressor": "zstd", "level": 9},
          "encryption": {"algorithm": "CHACHA20-POLY1305"},
          "splitters": {"default": "DYNAMIC-4M-BUZHASH"}
        }

Restore Example:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: restore-from-backup
spec:
  kopia:
    repository: s3-backup-config
    restoreAsOf: "2024-01-15T10:30:00Z"  # Point-in-time restore
    shallow: 1                           # Only restore recent data
  volumeSnapshotClassName: csi-snapclass
  capacity: 100Gi


openshift-ci bot commented Aug 6, 2025

Hi @perfectra1n. Thanks for your PR.

I'm waiting for a backube member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci bot added the size/XXL label Aug 6, 2025
@perfectra1n (Author)

That's a very scary test file!

@tesshuflower (Contributor)

@perfectra1n hey thanks for this PR! Obviously a lot of work went into this - please be patient as it may take some time before we can get around to reviewing....

One thing just from glancing over the description: the "controller sync fix" - I think I'd prefer if an issue was created for this and it was handled separately - I'm not sure I understand the need for a finalizer? It does sound like we most likely do need a check for the deletion timestamp, though. An issue would help to put more info around what is getting stuck (for example, is there a specific resource that is always getting stuck?) and to put in a targeted fix.

@perfectra1n

This comment was marked as outdated.

@onedr0p (Contributor) commented Aug 13, 2025

I was able to kick the tires on this and can confirm backups and restores work the same as with restic, which I currently use. I tested copyMethod: Snapshot for the ReplicationSource and ReplicationDestination with the following software:

Kubernetes: v1.33.3
Talos Linux: v1.11.0
Rook-ceph: v1.17.7
Snapshot controller: v8.3.0

I noticed kopia is much faster than restic, and we don't run into the repository locking issues. There are a lot of features to cover though, so maybe we need more bodies on it for testing.

@tesshuflower @JohnStrunk I understand this is a lengthy PR and the support burden it can add, but I really hope you can find time to review and test as well. This would be a very nice addition to the project, and I am sure Red Hat/OpenShift customers currently using restic would be happy to have Kopia as an option.


openshift-ci bot commented Aug 13, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: perfectra1n
Once this PR has been reviewed and has the lgtm label, please assign falconizmi for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tesshuflower (Contributor)

> Sure! I can remove that from this PR. It was happening to my resources during development and in my homelab - see below: [screenshot]
>
> I added a Finalizer as that's the pattern that other Kubernetes resources follow for lifecycle management, and it allows the ReplicationDestination or ReplicationSource to enter a state where it doesn't continuously reconcile resources.
>
> I just couldn't take it anymore since it was extremely annoying to deal with during development (and I've run into it in the past when using Volsync throughout the past few years), so I just lumped in a fix 😂
>
> I'll remove it from this PR though and just keep the fix in my fork for now, so that people that are testing out this Kopia implementation don't have to deal with infinite sync loops if they delete a ReplicationSource/ReplicationDestination before it completes lol.

@perfectra1n can you create a separate issue with details? I think this is something we'd like to fix, but would like more information on exactly what resources are holding things up, etc. Are there PVCs getting stuck in terminating for a long time or something of the sort?

@perfectra1n (Author)

Sure! I can put it in another PR, I don't want to derail the Kopia discussion though :)

@tesshuflower (Contributor)

> Sure! I can put it in another PR, I don't want to derail the Kopia discussion though :)

I'd like an issue/bug report that explains it first please.

@perfectra1n (Author)

Sorry, I meant to say that I'll open an Issue / bug report for it (not a PR) :)

@perfectra1n (Author)

I'll rebase after this has been reviewed :)

Contributor

Are these changes to .github/workflows intended to be part of this PR? Updates that do extra things, like publishing documentation to GitHub Pages, should instead be an enhancement request and not part of the Kopia implementation.

Additionally, it looks like the other GitHub workflows have been renamed to *.disabled, so I'm assuming these changes were just accidentally added to the PR.

Contributor

Just to warn you, we are not yet ready to move to golang 1.24 - there's actually an issue with building on arm64 right now with golang 1.24. I would also like to move to golang 1.24 separately from the mover PR. So this would either have to wait, or you can start with a build that works with 1.23.

Contributor

related golang issue: golang/go#75074

Author

The latest Kopia releases only build with Golang 1.24, so unless we get the release binary instead of building from source, I'm not quite sure...

@perfectra1n (Author) commented Aug 27, 2025

I did fix that in my Dockerfile though, so I'm able to build on ARM64 via this change, but I'm not sure if that aligns.

Contributor

ah, thanks for that - I may try using that myself as the golang issue seems to still be unresolved for the moment. At least that would unblock us for moving to 1.24.

Contributor

Is this intentional? These permissions are required and should not be removed from the role.

Author

Ahh, that was unintentional. That was from me removing the "sync controller" fix earlier, I'll revert.

@tesshuflower (Contributor)

Note that the DCO check is failing because the commits are not signed off - please see the instructions here on how to resolve: https://github.com/backube/volsync/pull/1723/checks?check_run_id=48522602445

@perfectra1n (Author)

Alright, I'm done messing around with the rebase - I believe I've addressed the concerns you highlighted @tesshuflower. I can also remove the Dockerfile change for Golang 1.24 if you'd prefer to just keep it at Golang 1.23.

… overflow

The Kopia mover logs were filling up the 1GB cache PVC due to high default
retention settings and verbose debug logging. This commit adds environment
variable controls for log configuration with sensible defaults:

- Set default file log level to 'warn' instead of 'debug'
- Limit log retention to 10 files and 24 hours by default
- Add environment variables for users to override these settings:
  - KOPIA_FILE_LOG_LEVEL (default: warn)
  - KOPIA_LOG_DIR_MAX_FILES (default: 10)
  - KOPIA_LOG_DIR_MAX_AGE (default: 24h)
  - KOPIA_CONTENT_LOG_DIR_MAX_FILES (default: 10)
  - KOPIA_CONTENT_LOG_DIR_MAX_AGE (default: 24h)

These settings prevent the cache PVC from filling up while still maintaining
useful logs for debugging when needed. Users can adjust these values through
their Kopia repository secret if different retention is required.


sonarqubecloud bot commented Sep 3, 2025

Quality Gate failed

Failed conditions
1 Security Hotspot

See analysis details on SonarQube Cloud

Contributor

I think maybe you used testing.T rather than ginkgo to avoid the suite setup/teardown, but we'd prefer to have all tests use ginkgo for consistency. Almost any time these tests are run, it will be for the entire suite or at least the package, so I'd expect the suite setup/teardown to be invoked anyway.

We run the "make test" target in our CI, which invokes ginkgo, and I'm not sure it will even pick up these tests by default.
