Add check if container image architecture is compatible #1289

davidkopp · 2025-08-13T11:07:49Z

Two of my colleagues have already encountered the issue of unintentionally using an ARM image as part of a usage scenario for a GMT measurement. Needless to say, the measurement run failed. However, the error message was confusing:

Health check of container "kadai-postgres" failed terminally with status "unhealthy" after 0s. Health check errors: {"Status":"unhealthy","FailingStreak":0,"Log":[]}

I think it would make sense to add a compatibility check, so a proper error message can be provided.

With this PR the run would fail early directly after the image pull and the error message would be

Architecture mismatch for image 'kadai-postgres': Image is built for arm64 but host platform is amd64.

- Add _check_image_architecture_compatibility method to detect architecture mismatches early - Validate image architecture compatibility immediately after Docker image pull - Provide clear error messages for architecture mismatches instead of misleading health check failures - Add comprehensive test coverage for compatible, incompatible, and nonexistent image scenarios - Normalize architecture names (x86_64 → amd64, aarch64 → arm64) for proper comparison This prevents the confusing "Health check failed" errors when attempting to run ARM64 images on x86 hosts, providing immediate feedback about architecture incompatibility. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

- Explain why Docker vs host system architecture name normalization is needed - Document the specific naming differences (amd64 vs x86_64, etc.) - Justify why these 3 mappings are sufficient for modern container deployments - Clarify the use cases for each architecture (servers, Apple Silicon, embedded devices) This makes the code more maintainable by explaining the architectural decisions. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

ArneTR · 2025-08-13T11:30:11Z

This is definitely something where the error message should be better.

I wonder though if GMT should have such a check routine and this information can always be provided by the docker cli and / or is always relevant.
What for instance happens if there is just a binary in the container?

To better understand: Can you tell which image was used so I can replay the error on a bare metal host and see why the Docker CLI even continues until the healthcheck without a more prominent error

davidkopp · 2025-08-13T11:38:16Z

In the meantime the image was converted from an ARM-only image to a multi-arch image. Here the commands to get the image with the ARM architecture and check its architecture:

docker pull --platform linux/arm64 registry.gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-da
tabases:postgres-14.7-100k-tasks
docker image inspect -f "{{.Architecture}}" registry.gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-databases:postgres-14.7-100k-tasks

davidkopp · 2025-08-13T12:33:05Z

Test is failing in the CI pipeline:

FAILED test_runner.py::test_architecture_compatibility_check_compatible - AssertionError: Expected: Architecture should be compatible, Actual: img_arch: unknown, host_arch: unknown

Before I invest time into that, let me know if GMT should include the check at all.

ribalba · 2025-08-16T11:21:42Z

I think it should. This is something I see happening more and more.

ArneTR · 2025-08-19T08:12:25Z

I also agree, such a check should exist.

The docker error message is very opaque and does not really indicate what the underlying problem is.

However I believe a check after downloading the image is needed, but not the most efficient.

In case an image will be downloaded it should be checked if the image even exists for the target architecture beforehand.

docker manifest inspect can do that. The JSON output contains all possible architectures. Assumption is then that the docker client will fetch the correct architecture automatically

Still a local check must be done afterwards as a built image might still contain a broken architecture ... very unlikely but to my understanding possible if it is based on a broken architecture already and only minor layers are added.

@davidkopp If you agree please update the PR with the manifest check also. Rest of the PR LGTM

davidkopp · 2025-08-19T10:32:35Z

Executing docker manifest inspect takes a lot of time on my system. The time seems to increase proportionally with the number of manifests of the image.

Examples

1 manifest:

$ time docker manifest inspect registry.gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-example-spring-boot:kadai-10.1.0
real    0m1.960s
user    0m0.044s
sys     0m0.035s

2 manifests:

$ time docker manifest inspect registry.gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-databases:postgres-14.7-100k-tasks
real    0m4.058s
user    0m0.055s
sys     0m0.081s

16 manifests:

$ time docker manifest inspect postgres:15
real    0m19.516s
user    0m0.249s
sys     0m0.174s

Question

Is this normal? If so, I would vote against the use of docker manifest inspect.

Also, docker manifest inspect is an experimental feature.

Mock Docker inspect and platform detection to prevent test failures in CI environments where: - Docker images may not be available due to network restrictions - platform.machine() may return unexpected values in virtualized environments The test now uses consistent mocking to simulate compatible amd64 architecture matching. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

github-actions · 2025-08-20T08:23:05Z

Eco CI Output - Old Energy Estimation

Eco CI Output [RUN-ID: 17092419355]:

🌳 CO2 Data:
City: Chicago, Lat: 41.8835, Lon: -87.6305
IP: 135.232.208.132
CO₂ from energy is: 1.851494400 g
CO₂ from manufacturing (embodied carbon) is: 0.293228413 g
Carbon Intensity for this location: 448 gCO₂eq/kWh
SCI: 2.144723 gCO₂eq / pipeline run emitted

Total cost of whole PR so far:

Label	🖥 avg. CPU utilization [%]	🔋 Total Energy [Joules]	🔌 avg. Power [Watts]	Duration [Seconds]

Measurement #1	25.4394	4132.8	4.02	1027.74

Total Run	25.44	4132.80	4.02	1027.74

Additional overhead from Eco CI	N/A	12.32	4.08	3.02

davidkopp · 2025-08-20T09:25:17Z

The outcome of a meeting with Arne was, that docker manifest inspect should not be used at the moment because of its slowness and because of rate limits of DockerHub.

Regarding the failed test: I added mocks to the test to ensure it works on all platforms. So there is no test anymore, that reads the actual CPU architecture from the host.

I think the PR is ready for merge.

ArneTR · 2025-08-20T12:27:34Z

The mocks are atm confusing to me.

You are mocking so much that you are effectively comparing constant strings with each other. I think this is a bit too shallow for a test.

If I misunderstood this please explain a bit what still is dynamic and what is static after the mocking.

IMHO I would rather leave the test as it was before in the test suite and disable it for the GitHub Actions if it is really not possible to run like this.

But I would also really like to understand:

Why is it not possible to resolve an image architecture in GitHub actions? The info is in the image itself. Why can this not be read?
Why can the architecture inside a VM not be resolved? The information should be present ...?

ArneTR · 2025-08-25T08:12:54Z

We are currently blocked to investigate this further inside the GitHub VM as our extension to analyse is broken.

We have opened an issue here: lhotari/action-upterm#33

This PR is now waiting for response. If no response comes in we will continue to investigate through different means

This reverts commit ab35780.

- Add helper method get_compatible_test_image() for platform-agnostic testing - Update compatible test to work on amd64, arm64, and arm host architectures - Add docker pull to ensure image exists before testing - Improve nonexistent image test with guaranteed unique names and better error validation - Fix linting issues (trailing whitespace, import placement) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

github-actions · 2025-08-26T07:42:31Z

Eco CI Output - Old Energy Estimation

Eco CI Output [RUN-ID: 17231021760]:

🌳 CO2 Data:
City: , Lat: , Lon:
IP:
CO₂ from energy is: 1.913772960 g
CO₂ from manufacturing (embodied carbon) is: 0.344559220 g
Carbon Intensity for this location: 396 gCO₂eq/kWh
SCI: 2.258332 gCO₂eq / pipeline run emitted

Total cost of whole PR so far:

Label	🖥 avg. CPU utilization [%]	🔋 Total Energy [Joules]	🔌 avg. Power [Watts]	Duration [Seconds]

Measurement #1	24.6509	4832.76	4.00	1207.65

Total Run	24.65	4832.76	4.00	1207.65

Additional overhead from Eco CI	N/A	13.20	3.94	3.35

davidkopp · 2025-08-26T08:36:49Z

Test works now on GitHub runner! It was just an error on my side. One issue was, that the test test_architecture_compatibility_check_compatible did not include a docker pull before the docker inspect. The test can now be executed also on systems with a different architecture than amd64.

I think the PR can now be merged.

ArneTR · 2025-08-27T07:06:32Z

Thanks for the overhaul. Here my review:

I think the test for the compatible image has a high cognitive load and potential for confusion in the future. Personally I would rather like multiple tests (one for each architecture) that are rather skipped if the tests do not run on the matching architecture. However having one dynamic tests which behaves differently under the hood is quite confusing. You can end up with a test with name A failing on machine 1 vs. machine 2. It would be nice to have test A_Arm, A_X86 ... etc. and then directly be able to spot that this test does not work on machine 1 but was skipped on machine 2. What do you think?
Also I would like a test that actually pulls the wrong image by force and then see if the check works. atm it is only tested with a non-existent image. Please add this test. This image looks like a good candidate: https://hub.docker.com/_/hello-world/tags - It comes in multiple variants and even has windows/arm64 ... which I have never seen! Would be interesting to understand how GMT handles this kind of image ...
- To force a download you must address the image with the hash. Otherwise Docker will probably auto-select the architecture

ArneTR · 2025-08-27T07:09:56Z

Correction: Docker Hub does not work they way I expected. The images are not hash addressable like this. I will provide sample images under our namespace

ArneTR · 2025-08-27T08:21:59Z

Please use these images:

They have only one architecture. I hope this can reproduce the problem we where seeing with the image from the spring application you had that was only available in the wrong architecture. If not please update which Image architecture / setup I shall upload

…y tests - Fix syntax error in scenario_runner.py Docker pull exception chaining - Add architecture mismatch detection during pull phase with clear error messages - Remove redundant post-pull architecture compatibility checks - Replace dynamic architecture tests with explicit platform-specific tests - Simplify nonexistent image tests by removing interactive mode and dynamic naming - Create separate scenario files for each test case with descriptive names This improves error reporting for Docker pull failures and reduces test complexity while maintaining comprehensive coverage of architecture compatibility scenarios. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

[skip-ci]

davidkopp · 2025-08-27T20:35:56Z

Thanks for your review and the provided images!

During implementation, Claude proposed moving the architecture check into the error handling of the pull logic instead of doing it afterwards. This resulted in a complete redesign.

github-actions · 2025-08-27T20:46:11Z

Eco CI Output - Old Energy Estimation

Eco CI Output [RUN-ID: 17278003752]:

🌳 CO2 Data:
City: Boydton, Lat: 36.6694, Lon: -78.3877
IP: 48.211.213.33
CO₂ from energy is: 1.321829280 g
CO₂ from manufacturing (embodied carbon) is: 0.269130809 g
Carbon Intensity for this location: 348 gCO₂eq/kWh
SCI: 1.590960 gCO₂eq / pipeline run emitted

Total cost of whole PR so far:

Label	🖥 avg. CPU utilization [%]	🔋 Total Energy [Joules]	🔌 avg. Power [Watts]	Duration [Seconds]

Measurement #1	25.5483	3798.36	4.03	943.28

Total Run	25.55	3798.36	4.03	943.28

Additional overhead from Eco CI	N/A	10.99	4.09	2.69

github-actions · 2025-08-27T20:53:36Z

Eco CI Output - Old Energy Estimation

Eco CI Output [RUN-ID: 17278081633]:

🌳 CO2 Data:
City: Des Moines, Lat: 41.6015, Lon: -93.6127
IP: 52.176.138.178
CO₂ from energy is: 2.220557100 g
CO₂ from manufacturing (embodied carbon) is: 0.335780114 g
Carbon Intensity for this location: 498 gCO₂eq/kWh
SCI: 2.556337 gCO₂eq / pipeline run emitted

Total cost of whole PR so far:

Label	🖥 avg. CPU utilization [%]	🔋 Total Energy [Joules]	🔌 avg. Power [Watts]	Duration [Seconds]

Measurement #1	22.4821	4458.95	3.79	1176.88

Total Run	22.48	4458.95	3.79	1176.88

Additional overhead from Eco CI	N/A	13.54	4.33	3.13

ArneTR · 2025-09-01T07:27:35Z

The code looks cleaner and also the new images have been incorporated.

However the error condition is the presence of 'no matching manifest'.

I remember we had quite a different error that was way more cryptic at the time. Why is that not emerging anymore? The fail was at the same location, was it not?

I remember this was the origin: #1289

Suprisingly the docker pull did not fail there but pulled the image correctly. So I feel our test cases are not capturing this error, or?

Docker allows pulling incompatible architecture images when using specific digest references (e.g., alpine@sha256:...), but these fail at runtime with "exec format error". This change adds proactive architecture validation after successful Docker pulls to catch incompatibilities early. Key changes: - Add _validate_image_architecture() method using fast 'docker image inspect' - Validate architecture compatibility immediately after successful pulls - Extract architecture mapping logic into reusable map_host_to_docker_arch() - Fail fast with clear error messages instead of runtime failures - Clean up error handling to remove graceful degradation The fix ensures that ARM64 digest images pulled on AMD64 hosts (and vice versa) are detected immediately with clear error messages, rather than failing later during container execution. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

davidkopp · 2025-09-01T20:04:29Z

I finally understand the issue. There are two different cases that we have to consider:

Pulling an image with an incompatible architecture: this fails during the Docker pull.
Pulling a multi-arch image with a digest referencing an incompatible architecture: this succeeds during the Docker pull (the image has at least one manifest with a compatible architecture), but the container will fail at start.

So, I think it makes sense to have two checks:

Check whether an error that occurred during the Docker pull is architecture-related, and if so, provide a proper error message.
Check after a successful pull whether the architecture is compatible with the host architecture.

github-actions · 2025-09-01T20:10:22Z

Eco CI Output [RUN-ID: 17386138735]:

🌳 CO2 Data:
City: Washington, Lat: 38.7095, Lon: -78.1539
IP: 20.55.15.224
CO₂ from energy is: 1.336493410 g
CO₂ from manufacturing (embodied carbon) is: 0.288135562 g
Carbon Intensity for this location: 329 gCO₂eq/kWh
SCI: 1.624629 gCO₂eq / pipeline run emitted

Total cost of whole PR so far:

Label	🖥 avg. CPU utilization [%]	🔋 Total Energy [Joules]	🔌 avg. Power [Watts]	Duration [Seconds]

Measurement #1	25.2621	4062.29	4.02	1009.89

Total Run	25.26	4062.29	4.02	1009.89

Additional overhead from Eco CI	N/A	10.74	3.99	2.69

ArneTR · 2025-09-02T06:22:06Z

great. How can I support? Do you need me to upload an image to docker hub that saties the criteria from your bullet number 2? If so, can you send me a Dockerfile and build string?

davidkopp · 2025-09-02T06:54:17Z

I have already implemented the necessary changes in commit 0c412f7

For a multi-arch docker image with digests pointing to specific architectures I have used Alpine in the new test cases:

multi-arch image without digest: alpine:3.22.1
multi-arch image with amd64 digest: alpine@sha256:eafc1edb577d2e9b458664a15f23ea1c370214193226069eb22921169fc7e43f
multi-arch image with arm64 digest: alpine@sha256:4562b419adf48c5f3c763995d6014c123b3ce1d2e0ef2613b189779caa787192

So this PR is ready for a hopefully final review.

ArneTR · 2025-09-03T09:12:39Z

Nice to see the images that can be used. I tried using the index-hash which apparently auto-selects the correct architecture.

However I am still unsure if that would really lead to the error we saw back then.

When I run

$ docker run --rm -it -u 0 --entrypoint timeout alpine@sha256:eafc1edb577d2e9b458664a15f23ea1c370214193226069eb22921169fc7e43f
Unable to find image 'alpine@sha256:eafc1edb577d2e9b458664a15f23ea1c370214193226069eb22921169fc7e43f' locally
docker.io/library/alpine@sha256:eafc1edb577d2e9b458664a15f23ea1c370214193226069eb22921169fc7e43f: Pulling from library/alpine
9824c27679d3: Pull complete
Digest: sha256:eafc1edb577d2e9b458664a15f23ea1c370214193226069eb22921169fc7e43f
Status: Downloaded newer image for alpine@sha256:eafc1edb577d2e9b458664a15f23ea1c370214193226069eb22921169fc7e43f
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
exec /usr/bin/timeout: exec format error

I get a proper response from the daemon, that the image cannot be run.

However on my local box, which supports emulation via Docker Desktop, the image runs with only a minor warning.

So I feel two points are still unclear to me:

Did we catch the error from back then? if so: With what GMT run string can I reproduce it on an old checkout of GMT?
Should we fail GMT runs if technically the image can be run, but is only emulated? This PR enforces that. But it might be helpful in GMT to benchmark exactly that ...

davidkopp · 2025-09-03T20:48:38Z

Did we catch the error from back then?

Answer: yes

You can use the following command to reproduce the error when you are using the latest version of GMT on main (commit 949f445).

On an amd64 host:

python3 runner.py --name "KADAI (using arm64 postgres image)" --uri "https://gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-resource-efficiency/" --branch "gmt-test-image-architecture" --filename "usage_scenario-arm64.yml" --skip-system-checks --dev-no-sleeps --dev-no-save --skip-unsafe

On an arm64 host:

python3 runner.py --name "KADAI (using amd64 postgres image)" --uri "https://gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-resource-efficiency/" --branch "gmt-test-image-architecture" --filename "usage_scenario-amd64.yml" --skip-system-checks --dev-no-sleeps --dev-no-save --skip-unsafe

Error:

Exception_context (NoneType): None
Final_exception (RuntimeError): Health check of container "kadai-postgres" failed terminally with status "unhealthy" after 0s. Health check errors: {"Status":"unhealthy","FailingStreak":0,"Log":[]}

Using the current implementation in this branch the error message is improved:

Final_exception (RuntimeError): Architecture incompatibility detected: Docker image 'registry.gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-databases@sha256:ad5dfc2b4075233e385e92adad9fbeb8e3f874da4e998eb5eca9c9db4556729a' is not available for host architecture 'amd64'. Image architecture is 'arm64'

Should we fail GMT runs if technically the image can be run, but is only emulated?

No. Good point! I haven't thought about that and only tested the implementation with Docker native in WSL2 and not with Docker Desktop that would use the emulation of ARM.

I have implemented the necessary changes to allow emulation in the following PR:
#1313

Feel free to merge it into this one, if it makes sense to you. The check after the image pull was completely removed. Instead a delay of 1 second was introduced after the docker run to be able to check if the detached container fails immediately after start (e.g. due to an invalid architecture).
If emulation is used, the container will start without issues so there won't be any interruptions by an architecture check. In the tests I added a skip rule if they are run with Docker Desktop.

With the GMT run command from above it will now succeed if you use Docker Desktop (using emulation), but still fail if you use native Docker.

Error message if running an incompatible image architecture on native Docker:

Final_exception (RuntimeError): Container 'kadai-postgres' failed immediately after start, probably due to architecture incompatibility (exit code: 255). Image architecture is 'arm64' but host architecture is 'amd64'.

ArneTR

This base PR generally looks good and can be merged once we merge the follow-up improvement on top

davidkopp and others added 3 commits August 13, 2025 12:48

Remove verbose architecture compatibility success message

1ebd450

davidkopp added 6 commits August 20, 2025 15:29

Temporarily add tmate to CI pipeline

7eb8362

switch from tmate to upterm

a2570c6

adjust upterm config

fc4fb24

Fix upterm config

e5e9d47

Try installing upterm ourselves

2989f30

Going back to simple config

25e631f

davidkopp and others added 3 commits August 26, 2025 09:21

Reset workflow config

1bc43dc

Revert "Fix architecture compatibility test for GitHub Actions"

584ffae

This reverts commit ab35780.

davidkopp and others added 2 commits August 27, 2025 22:29

Cleanup

753c554

[skip-ci]

davidkopp requested a review from ArneTR September 2, 2025 06:54

davidkopp mentioned this pull request Sep 3, 2025

Improve container startup failure detection #1313

Open

ArneTR approved these changes Sep 5, 2025

View reviewed changes

Add check if container image architecture is compatible #1289

Are you sure you want to change the base?

Add check if container image architecture is compatible #1289

Uh oh!

Conversation

davidkopp commented Aug 13, 2025

Uh oh!

ArneTR commented Aug 13, 2025

Uh oh!

davidkopp commented Aug 13, 2025

Uh oh!

davidkopp commented Aug 13, 2025

Uh oh!

ribalba commented Aug 16, 2025

Uh oh!

ArneTR commented Aug 19, 2025

Uh oh!

davidkopp commented Aug 19, 2025

Examples

Question

Uh oh!

github-actions bot commented Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

davidkopp commented Aug 20, 2025

Uh oh!

ArneTR commented Aug 20, 2025

Uh oh!

ArneTR commented Aug 25, 2025

Uh oh!

github-actions bot commented Aug 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

davidkopp commented Aug 26, 2025

Uh oh!

ArneTR commented Aug 27, 2025

Uh oh!

ArneTR commented Aug 27, 2025

Uh oh!

ArneTR commented Aug 27, 2025

Uh oh!

davidkopp commented Aug 27, 2025

Uh oh!

github-actions bot commented Aug 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Aug 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ArneTR commented Sep 1, 2025

Uh oh!

davidkopp commented Sep 1, 2025

Uh oh!

github-actions bot commented Sep 1, 2025

Uh oh!

ArneTR commented Sep 2, 2025

Uh oh!

davidkopp commented Sep 2, 2025

Uh oh!

ArneTR commented Sep 3, 2025

Uh oh!

davidkopp commented Sep 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Did we catch the error from back then?

Should we fail GMT runs if technically the image can be run, but is only emulated?

Uh oh!

ArneTR left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions bot commented Aug 20, 2025 •

edited

Loading

github-actions bot commented Aug 26, 2025 •

edited

Loading

github-actions bot commented Aug 27, 2025 •

edited

Loading

github-actions bot commented Aug 27, 2025 •

edited

Loading

davidkopp commented Sep 3, 2025 •

edited

Loading