Skip to content

Conversation

davidkopp
Copy link
Collaborator

Two of my colleagues have already encountered the issue of unintentionally using an ARM image as part of a usage scenario for a GMT measurement. Needless to say, the measurement run failed. However, the error message was confusing:

Health check of container "kadai-postgres" failed terminally with status "unhealthy" after 0s. Health check errors: {"Status":"unhealthy","FailingStreak":0,"Log":[]}

I think it would make sense to add a compatibility check, so a proper error message can be provided.

With this PR the run would fail early directly after the image pull and the error message would be

Architecture mismatch for image 'kadai-postgres': Image is built for arm64 but host platform is amd64.

davidkopp and others added 3 commits August 13, 2025 12:48
- Add _check_image_architecture_compatibility method to detect architecture mismatches early
- Validate image architecture compatibility immediately after Docker image pull
- Provide clear error messages for architecture mismatches instead of misleading health check failures
- Add comprehensive test coverage for compatible, incompatible, and nonexistent image scenarios
- Normalize architecture names (x86_64 → amd64, aarch64 → arm64) for proper comparison

This prevents the confusing "Health check failed" errors when attempting to run ARM64 images on x86 hosts,
providing immediate feedback about architecture incompatibility.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Explain why Docker vs host system architecture name normalization is needed
- Document the specific naming differences (amd64 vs x86_64, etc.)
- Justify why these 3 mappings are sufficient for modern container deployments
- Clarify the use cases for each architecture (servers, Apple Silicon, embedded devices)

This makes the code more maintainable by explaining the architectural decisions.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@ArneTR
Copy link
Member

ArneTR commented Aug 13, 2025

This is definitely something where the error message should be better.

I wonder though if GMT should have such a check routine and this information can always be provided by the docker cli and / or is always relevant.
What for instance happens if there is just a binary in the container?

To better understand: Can you tell which image was used so I can replay the error on a bare metal host and see why the Docker CLI even continues until the healthcheck without a more prominent error

@davidkopp
Copy link
Collaborator Author

In the meantime the image was converted from an ARM-only image to a multi-arch image. Here the commands to get the image with the ARM architecture and check its architecture:

docker pull --platform linux/arm64 registry.gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-da
tabases:postgres-14.7-100k-tasks
docker image inspect -f "{{.Architecture}}" registry.gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-databases:postgres-14.7-100k-tasks

@davidkopp
Copy link
Collaborator Author

Test is failing in the CI pipeline:

FAILED test_runner.py::test_architecture_compatibility_check_compatible - AssertionError: Expected: Architecture should be compatible, Actual: img_arch: unknown, host_arch: unknown

Before I invest time into that, let me know if GMT should include the check at all.

@ribalba
Copy link
Member

ribalba commented Aug 16, 2025

I think it should. This is something I see happening more and more.

@ArneTR
Copy link
Member

ArneTR commented Aug 19, 2025

I also agree, such a check should exist.

The docker error message is very opaque and does not really indicate what the underlying problem is.

However I believe a check after downloading the image is needed, but not the most efficient.

In case an image will be downloaded it should be checked if the image even exists for the target architecture beforehand.

docker manifest inspect can do that. The JSON output contains all possible architectures. Assumption is then that the docker client will fetch the correct architecture automatically

Still a local check must be done afterwards as a built image might still contain a broken architecture ... very unlikely but to my understanding possible if it is based on a broken architecture already and only minor layers are added.

@davidkopp If you agree please update the PR with the manifest check also. Rest of the PR LGTM

@davidkopp
Copy link
Collaborator Author

Executing docker manifest inspect takes a lot of time on my system. The time seems to increase proportionally with the number of manifests of the image.

Examples

1 manifest:

$ time docker manifest inspect registry.gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-example-spring-boot:kadai-10.1.0
real    0m1.960s
user    0m0.044s
sys     0m0.035s

2 manifests:

$ time docker manifest inspect registry.gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-databases:postgres-14.7-100k-tasks
real    0m4.058s
user    0m0.055s
sys     0m0.081s

16 manifests:

$ time docker manifest inspect postgres:15
real    0m19.516s
user    0m0.249s
sys     0m0.174s

Question

Is this normal? If so, I would vote against the use of docker manifest inspect.

Also, docker manifest inspect is an experimental feature.

Mock Docker inspect and platform detection to prevent test failures in CI environments where:
- Docker images may not be available due to network restrictions
- platform.machine() may return unexpected values in virtualized environments

The test now uses consistent mocking to simulate compatible amd64 architecture matching.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Copy link

github-actions bot commented Aug 20, 2025

Eco CI Output - Old Energy Estimation

Eco CI Output [RUN-ID: 17092419355]:

🌳 CO2 Data:
City: Chicago, Lat: 41.8835, Lon: -87.6305
IP: 135.232.208.132
CO₂ from energy is: 1.851494400 g
CO₂ from manufacturing (embodied carbon) is: 0.293228413 g
Carbon Intensity for this location: 448 gCO₂eq/kWh
SCI: 2.144723 gCO₂eq / pipeline run emitted


Total cost of whole PR so far:

Label🖥 avg. CPU utilization [%]🔋 Total Energy [Joules]🔌 avg. Power [Watts]Duration [Seconds]
Measurement #125.43944132.84.021027.74
Total Run25.444132.804.021027.74
Additional overhead from Eco CIN/A12.324.083.02

@davidkopp
Copy link
Collaborator Author

The outcome of a meeting with Arne was, that docker manifest inspect should not be used at the moment because of its slowness and because of rate limits of DockerHub.

Regarding the failed test: I added mocks to the test to ensure it works on all platforms. So there is no test anymore, that reads the actual CPU architecture from the host.

I think the PR is ready for merge.

@ArneTR
Copy link
Member

ArneTR commented Aug 20, 2025

The mocks are atm confusing to me.

You are mocking so much that you are effectively comparing constant strings with each other. I think this is a bit too shallow for a test.

If I misunderstood this please explain a bit what still is dynamic and what is static after the mocking.

IMHO I would rather leave the test as it was before in the test suite and disable it for the GitHub Actions if it is really not possible to run like this.

But I would also really like to understand:

  • Why is it not possible to resolve an image architecture in GitHub actions? The info is in the image itself. Why can this not be read?
  • Why can the architecture inside a VM not be resolved? The information should be present ...?

@ArneTR
Copy link
Member

ArneTR commented Aug 25, 2025

We are currently blocked to investigate this further inside the GitHub VM as our extension to analyse is broken.

We have opened an issue here: lhotari/action-upterm#33

This PR is now waiting for response. If no response comes in we will continue to investigate through different means

davidkopp and others added 3 commits August 26, 2025 09:21
- Add helper method get_compatible_test_image() for platform-agnostic testing
- Update compatible test to work on amd64, arm64, and arm host architectures
- Add docker pull to ensure image exists before testing
- Improve nonexistent image test with guaranteed unique names and better error validation
- Fix linting issues (trailing whitespace, import placement)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Copy link

github-actions bot commented Aug 26, 2025

Eco CI Output - Old Energy Estimation

Eco CI Output [RUN-ID: 17231021760]:

🌳 CO2 Data:
City: , Lat: , Lon:
IP:
CO₂ from energy is: 1.913772960 g
CO₂ from manufacturing (embodied carbon) is: 0.344559220 g
Carbon Intensity for this location: 396 gCO₂eq/kWh
SCI: 2.258332 gCO₂eq / pipeline run emitted


Total cost of whole PR so far:

Label🖥 avg. CPU utilization [%]🔋 Total Energy [Joules]🔌 avg. Power [Watts]Duration [Seconds]
Measurement #124.65094832.764.001207.65
Total Run24.654832.764.001207.65
Additional overhead from Eco CIN/A13.203.943.35

@davidkopp
Copy link
Collaborator Author

Test works now on GitHub runner! It was just an error on my side. One issue was, that the test test_architecture_compatibility_check_compatible did not include a docker pull before the docker inspect. The test can now be executed also on systems with a different architecture than amd64.

I think the PR can now be merged.

@ArneTR
Copy link
Member

ArneTR commented Aug 27, 2025

Thanks for the overhaul. Here my review:

  • I think the test for the compatible image has a high cognitive load and potential for confusion in the future. Personally I would rather like multiple tests (one for each architecture) that are rather skipped if the tests do not run on the matching architecture. However having one dynamic tests which behaves differently under the hood is quite confusing. You can end up with a test with name A failing on machine 1 vs. machine 2. It would be nice to have test A_Arm, A_X86 ... etc. and then directly be able to spot that this test does not work on machine 1 but was skipped on machine 2. What do you think?
  • Also I would like a test that actually pulls the wrong image by force and then see if the check works. atm it is only tested with a non-existent image. Please add this test. This image looks like a good candidate: https://hub.docker.com/_/hello-world/tags - It comes in multiple variants and even has windows/arm64 ... which I have never seen! Would be interesting to understand how GMT handles this kind of image ...
    • To force a download you must address the image with the hash. Otherwise Docker will probably auto-select the architecture

@ArneTR
Copy link
Member

ArneTR commented Aug 27, 2025

Correction: Docker Hub does not work they way I expected. The images are not hash addressable like this. I will provide sample images under our namespace

@ArneTR
Copy link
Member

ArneTR commented Aug 27, 2025

Please use these images:

They have only one architecture. I hope this can reproduce the problem we where seeing with the image from the spring application you had that was only available in the wrong architecture. If not please update which Image architecture / setup I shall upload

davidkopp and others added 2 commits August 27, 2025 22:29
…y tests

- Fix syntax error in scenario_runner.py Docker pull exception chaining
- Add architecture mismatch detection during pull phase with clear error messages
- Remove redundant post-pull architecture compatibility checks
- Replace dynamic architecture tests with explicit platform-specific tests
- Simplify nonexistent image tests by removing interactive mode and dynamic naming
- Create separate scenario files for each test case with descriptive names

This improves error reporting for Docker pull failures and reduces test complexity
while maintaining comprehensive coverage of architecture compatibility scenarios.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
[skip-ci]
@davidkopp
Copy link
Collaborator Author

Thanks for your review and the provided images!

During implementation, Claude proposed moving the architecture check into the error handling of the pull logic instead of doing it afterwards. This resulted in a complete redesign.

Copy link

github-actions bot commented Aug 27, 2025

Eco CI Output - Old Energy Estimation

Eco CI Output [RUN-ID: 17278003752]:

🌳 CO2 Data:
City: Boydton, Lat: 36.6694, Lon: -78.3877
IP: 48.211.213.33
CO₂ from energy is: 1.321829280 g
CO₂ from manufacturing (embodied carbon) is: 0.269130809 g
Carbon Intensity for this location: 348 gCO₂eq/kWh
SCI: 1.590960 gCO₂eq / pipeline run emitted


Total cost of whole PR so far:

Label🖥 avg. CPU utilization [%]🔋 Total Energy [Joules]🔌 avg. Power [Watts]Duration [Seconds]
Measurement #125.54833798.364.03943.28
Total Run25.553798.364.03943.28
Additional overhead from Eco CIN/A10.994.092.69

Copy link

github-actions bot commented Aug 27, 2025

Eco CI Output - Old Energy Estimation

Eco CI Output [RUN-ID: 17278081633]:

🌳 CO2 Data:
City: Des Moines, Lat: 41.6015, Lon: -93.6127
IP: 52.176.138.178
CO₂ from energy is: 2.220557100 g
CO₂ from manufacturing (embodied carbon) is: 0.335780114 g
Carbon Intensity for this location: 498 gCO₂eq/kWh
SCI: 2.556337 gCO₂eq / pipeline run emitted


Total cost of whole PR so far:

Label🖥 avg. CPU utilization [%]🔋 Total Energy [Joules]🔌 avg. Power [Watts]Duration [Seconds]
Measurement #122.48214458.953.791176.88
Total Run22.484458.953.791176.88
Additional overhead from Eco CIN/A13.544.333.13

@ArneTR
Copy link
Member

ArneTR commented Sep 1, 2025

The code looks cleaner and also the new images have been incorporated.

However the error condition is the presence of 'no matching manifest'.

I remember we had quite a different error that was way more cryptic at the time. Why is that not emerging anymore? The fail was at the same location, was it not?

I remember this was the origin: #1289

Suprisingly the docker pull did not fail there but pulled the image correctly. So I feel our test cases are not capturing this error, or?

Docker allows pulling incompatible architecture images when using specific
digest references (e.g., alpine@sha256:...), but these fail at runtime with
"exec format error". This change adds proactive architecture validation
after successful Docker pulls to catch incompatibilities early.

Key changes:
- Add _validate_image_architecture() method using fast 'docker image inspect'
- Validate architecture compatibility immediately after successful pulls
- Extract architecture mapping logic into reusable map_host_to_docker_arch()
- Fail fast with clear error messages instead of runtime failures
- Clean up error handling to remove graceful degradation

The fix ensures that ARM64 digest images pulled on AMD64 hosts (and vice versa)
are detected immediately with clear error messages, rather than failing later
during container execution.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@davidkopp
Copy link
Collaborator Author

I finally understand the issue. There are two different cases that we have to consider:

  1. Pulling an image with an incompatible architecture: this fails during the Docker pull.
  2. Pulling a multi-arch image with a digest referencing an incompatible architecture: this succeeds during the Docker pull (the image has at least one manifest with a compatible architecture), but the container will fail at start.

So, I think it makes sense to have two checks:

  1. Check whether an error that occurred during the Docker pull is architecture-related, and if so, provide a proper error message.
  2. Check after a successful pull whether the architecture is compatible with the host architecture.

Copy link

github-actions bot commented Sep 1, 2025

Eco CI Output [RUN-ID: 17386138735]:

🌳 CO2 Data:
City: Washington, Lat: 38.7095, Lon: -78.1539
IP: 20.55.15.224
CO₂ from energy is: 1.336493410 g
CO₂ from manufacturing (embodied carbon) is: 0.288135562 g
Carbon Intensity for this location: 329 gCO₂eq/kWh
SCI: 1.624629 gCO₂eq / pipeline run emitted


Total cost of whole PR so far:

Label🖥 avg. CPU utilization [%]🔋 Total Energy [Joules]🔌 avg. Power [Watts]Duration [Seconds]
Measurement #125.26214062.294.021009.89
Total Run25.264062.294.021009.89
Additional overhead from Eco CIN/A10.743.992.69

@ArneTR
Copy link
Member

ArneTR commented Sep 2, 2025

great. How can I support? Do you need me to upload an image to docker hub that saties the criteria from your bullet number 2? If so, can you send me a Dockerfile and build string?

@davidkopp
Copy link
Collaborator Author

I have already implemented the necessary changes in commit 0c412f7

For a multi-arch docker image with digests pointing to specific architectures I have used Alpine in the new test cases:

  • multi-arch image without digest: alpine:3.22.1
  • multi-arch image with amd64 digest: alpine@sha256:eafc1edb577d2e9b458664a15f23ea1c370214193226069eb22921169fc7e43f
  • multi-arch image with arm64 digest: alpine@sha256:4562b419adf48c5f3c763995d6014c123b3ce1d2e0ef2613b189779caa787192

So this PR is ready for a hopefully final review.

@davidkopp davidkopp requested a review from ArneTR September 2, 2025 06:54
@ArneTR
Copy link
Member

ArneTR commented Sep 3, 2025

Nice to see the images that can be used. I tried using the index-hash which apparently auto-selects the correct architecture.

However I am still unsure if that would really lead to the error we saw back then.

When I run

$ docker run --rm -it -u 0 --entrypoint timeout alpine@sha256:eafc1edb577d2e9b458664a15f23ea1c370214193226069eb22921169fc7e43f
Unable to find image 'alpine@sha256:eafc1edb577d2e9b458664a15f23ea1c370214193226069eb22921169fc7e43f' locally
docker.io/library/alpine@sha256:eafc1edb577d2e9b458664a15f23ea1c370214193226069eb22921169fc7e43f: Pulling from library/alpine
9824c27679d3: Pull complete
Digest: sha256:eafc1edb577d2e9b458664a15f23ea1c370214193226069eb22921169fc7e43f
Status: Downloaded newer image for alpine@sha256:eafc1edb577d2e9b458664a15f23ea1c370214193226069eb22921169fc7e43f
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
exec /usr/bin/timeout: exec format error

I get a proper response from the daemon, that the image cannot be run.

However on my local box, which supports emulation via Docker Desktop, the image runs with only a minor warning.

So I feel two points are still unclear to me:

  • Did we catch the error from back then? if so: With what GMT run string can I reproduce it on an old checkout of GMT?
  • Should we fail GMT runs if technically the image can be run, but is only emulated? This PR enforces that. But it might be helpful in GMT to benchmark exactly that ...

@davidkopp
Copy link
Collaborator Author

davidkopp commented Sep 3, 2025

Did we catch the error from back then?

Answer: yes

You can use the following command to reproduce the error when you are using the latest version of GMT on main (commit 949f445).

On an amd64 host:

python3 runner.py --name "KADAI (using arm64 postgres image)" --uri "https://gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-resource-efficiency/" --branch "gmt-test-image-architecture" --filename "usage_scenario-arm64.yml" --skip-system-checks --dev-no-sleeps --dev-no-save --skip-unsafe

On an arm64 host:

python3 runner.py --name "KADAI (using amd64 postgres image)" --uri "https://gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-resource-efficiency/" --branch "gmt-test-image-architecture" --filename "usage_scenario-amd64.yml" --skip-system-checks --dev-no-sleeps --dev-no-save --skip-unsafe

Error:

Exception_context (NoneType): None
Final_exception (RuntimeError): Health check of container "kadai-postgres" failed terminally with status "unhealthy" after 0s. Health check errors: {"Status":"unhealthy","FailingStreak":0,"Log":[]}

Using the current implementation in this branch the error message is improved:

Final_exception (RuntimeError): Architecture incompatibility detected: Docker image 'registry.gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-databases@sha256:ad5dfc2b4075233e385e92adad9fbeb8e3f874da4e998eb5eca9c9db4556729a' is not available for host architecture 'amd64'. Image architecture is 'arm64'

Should we fail GMT runs if technically the image can be run, but is only emulated?

No. Good point! I haven't thought about that and only tested the implementation with Docker native in WSL2 and not with Docker Desktop that would use the emulation of ARM.

I have implemented the necessary changes to allow emulation in the following PR:
#1313

Feel free to merge it into this one, if it makes sense to you. The check after the image pull was completely removed. Instead a delay of 1 second was introduced after the docker run to be able to check if the detached container fails immediately after start (e.g. due to an invalid architecture).
If emulation is used, the container will start without issues so there won't be any interruptions by an architecture check. In the tests I added a skip rule if they are run with Docker Desktop.

With the GMT run command from above it will now succeed if you use Docker Desktop (using emulation), but still fail if you use native Docker.

Error message if running an incompatible image architecture on native Docker:

Final_exception (RuntimeError): Container 'kadai-postgres' failed immediately after start, probably due to architecture incompatibility (exit code: 255). Image architecture is 'arm64' but host architecture is 'amd64'.

Copy link
Member

@ArneTR ArneTR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This base PR generally looks good and can be merged once we merge the follow-up improvement on top

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants