-
Notifications
You must be signed in to change notification settings - Fork 34
Add check if container image architecture is compatible #1289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Add _check_image_architecture_compatibility method to detect architecture mismatches early - Validate image architecture compatibility immediately after Docker image pull - Provide clear error messages for architecture mismatches instead of misleading health check failures - Add comprehensive test coverage for compatible, incompatible, and nonexistent image scenarios - Normalize architecture names (x86_64 → amd64, aarch64 → arm64) for proper comparison This prevents the confusing "Health check failed" errors when attempting to run ARM64 images on x86 hosts, providing immediate feedback about architecture incompatibility. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Explain why Docker vs host system architecture name normalization is needed - Document the specific naming differences (amd64 vs x86_64, etc.) - Justify why these 3 mappings are sufficient for modern container deployments - Clarify the use cases for each architecture (servers, Apple Silicon, embedded devices) This makes the code more maintainable by explaining the architectural decisions. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
This is definitely something where the error message should be better. I wonder though if GMT should have such a check routine and this information can always be provided by the docker cli and / or is always relevant. To better understand: Can you tell which image was used so I can replay the error on a bare metal host and see why the Docker CLI even continues until the healthcheck without a more prominent error |
In the meantime the image was converted from an ARM-only image to a multi-arch image. Here the commands to get the image with the ARM architecture and check its architecture: docker pull --platform linux/arm64 registry.gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-da
tabases:postgres-14.7-100k-tasks
docker image inspect -f "{{.Architecture}}" registry.gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-databases:postgres-14.7-100k-tasks |
Test is failing in the CI pipeline:
Before I invest time into that, let me know if GMT should include the check at all. |
I think it should. This is something I see happening more and more. |
I also agree, such a check should exist. The docker error message is very opaque and does not really indicate what the underlying problem is. However I believe a check after downloading the image is needed, but not the most efficient. In case an image will be downloaded it should be checked if the image even exists for the target architecture beforehand.
Still a local check must be done afterwards as a built image might still contain a broken architecture ... very unlikely but to my understanding possible if it is based on a broken architecture already and only minor layers are added. @davidkopp If you agree please update the PR with the manifest check also. Rest of the PR LGTM |
Executing Examples1 manifest: $ time docker manifest inspect registry.gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-example-spring-boot:kadai-10.1.0
real 0m1.960s
user 0m0.044s
sys 0m0.035s 2 manifests: $ time docker manifest inspect registry.gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-databases:postgres-14.7-100k-tasks
real 0m4.058s
user 0m0.055s
sys 0m0.081s 16 manifests: $ time docker manifest inspect postgres:15
real 0m19.516s
user 0m0.249s
sys 0m0.174s QuestionIs this normal? If so, I would vote against the use of Also, docker manifest inspect is an experimental feature. |
Mock Docker inspect and platform detection to prevent test failures in CI environments where: - Docker images may not be available due to network restrictions - platform.machine() may return unexpected values in virtualized environments The test now uses consistent mocking to simulate compatible amd64 architecture matching. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Eco CI Output - Old Energy EstimationEco CI Output [RUN-ID: 17092419355]: 🌳 CO2 Data: Total cost of whole PR so far:
|
The outcome of a meeting with Arne was, that Regarding the failed test: I added mocks to the test to ensure it works on all platforms. So there is no test anymore, that reads the actual CPU architecture from the host. I think the PR is ready for merge. |
The mocks are atm confusing to me. You are mocking so much that you are effectively comparing constant strings with each other. I think this is a bit too shallow for a test. If I misunderstood this please explain a bit what still is dynamic and what is static after the mocking. IMHO I would rather leave the test as it was before in the test suite and disable it for the GitHub Actions if it is really not possible to run like this. But I would also really like to understand:
|
We are currently blocked to investigate this further inside the GitHub VM as our extension to analyse is broken. We have opened an issue here: lhotari/action-upterm#33 This PR is now waiting for response. If no response comes in we will continue to investigate through different means |
This reverts commit ab35780.
- Add helper method get_compatible_test_image() for platform-agnostic testing - Update compatible test to work on amd64, arm64, and arm host architectures - Add docker pull to ensure image exists before testing - Improve nonexistent image test with guaranteed unique names and better error validation - Fix linting issues (trailing whitespace, import placement) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Eco CI Output - Old Energy EstimationEco CI Output [RUN-ID: 17231021760]: 🌳 CO2 Data: Total cost of whole PR so far:
|
Test works now on GitHub runner! It was just an error on my side. One issue was, that the test I think the PR can now be merged. |
Thanks for the overhaul. Here my review:
|
Correction: Docker Hub does not work they way I expected. The images are not hash addressable like this. I will provide sample images under our namespace |
Please use these images:
They have only one architecture. I hope this can reproduce the problem we where seeing with the image from the spring application you had that was only available in the wrong architecture. If not please update which Image architecture / setup I shall upload |
…y tests - Fix syntax error in scenario_runner.py Docker pull exception chaining - Add architecture mismatch detection during pull phase with clear error messages - Remove redundant post-pull architecture compatibility checks - Replace dynamic architecture tests with explicit platform-specific tests - Simplify nonexistent image tests by removing interactive mode and dynamic naming - Create separate scenario files for each test case with descriptive names This improves error reporting for Docker pull failures and reduces test complexity while maintaining comprehensive coverage of architecture compatibility scenarios. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Thanks for your review and the provided images! During implementation, Claude proposed moving the architecture check into the error handling of the pull logic instead of doing it afterwards. This resulted in a complete redesign. |
Eco CI Output - Old Energy EstimationEco CI Output [RUN-ID: 17278003752]: 🌳 CO2 Data: Total cost of whole PR so far:
|
Eco CI Output - Old Energy EstimationEco CI Output [RUN-ID: 17278081633]: 🌳 CO2 Data: Total cost of whole PR so far:
|
The code looks cleaner and also the new images have been incorporated. However the error condition is the presence of 'no matching manifest'. I remember we had quite a different error that was way more cryptic at the time. Why is that not emerging anymore? The fail was at the same location, was it not? I remember this was the origin: #1289 Suprisingly the docker pull did not fail there but pulled the image correctly. So I feel our test cases are not capturing this error, or? |
Docker allows pulling incompatible architecture images when using specific digest references (e.g., alpine@sha256:...), but these fail at runtime with "exec format error". This change adds proactive architecture validation after successful Docker pulls to catch incompatibilities early. Key changes: - Add _validate_image_architecture() method using fast 'docker image inspect' - Validate architecture compatibility immediately after successful pulls - Extract architecture mapping logic into reusable map_host_to_docker_arch() - Fail fast with clear error messages instead of runtime failures - Clean up error handling to remove graceful degradation The fix ensures that ARM64 digest images pulled on AMD64 hosts (and vice versa) are detected immediately with clear error messages, rather than failing later during container execution. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
I finally understand the issue. There are two different cases that we have to consider:
So, I think it makes sense to have two checks:
|
Eco CI Output [RUN-ID: 17386138735]: 🌳 CO2 Data: Total cost of whole PR so far:
|
great. How can I support? Do you need me to upload an image to docker hub that saties the criteria from your bullet number 2? If so, can you send me a Dockerfile and build string? |
I have already implemented the necessary changes in commit 0c412f7 For a multi-arch docker image with digests pointing to specific architectures I have used Alpine in the new test cases:
So this PR is ready for a hopefully final review. |
Nice to see the images that can be used. I tried using the index-hash which apparently auto-selects the correct architecture. However I am still unsure if that would really lead to the error we saw back then. When I run $ docker run --rm -it -u 0 --entrypoint timeout alpine@sha256:eafc1edb577d2e9b458664a15f23ea1c370214193226069eb22921169fc7e43f
Unable to find image 'alpine@sha256:eafc1edb577d2e9b458664a15f23ea1c370214193226069eb22921169fc7e43f' locally
docker.io/library/alpine@sha256:eafc1edb577d2e9b458664a15f23ea1c370214193226069eb22921169fc7e43f: Pulling from library/alpine
9824c27679d3: Pull complete
Digest: sha256:eafc1edb577d2e9b458664a15f23ea1c370214193226069eb22921169fc7e43f
Status: Downloaded newer image for alpine@sha256:eafc1edb577d2e9b458664a15f23ea1c370214193226069eb22921169fc7e43f
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
exec /usr/bin/timeout: exec format error I get a proper response from the daemon, that the image cannot be run. However on my local box, which supports emulation via Docker Desktop, the image runs with only a minor warning. So I feel two points are still unclear to me:
|
Did we catch the error from back then?Answer: yes You can use the following command to reproduce the error when you are using the latest version of GMT on main (commit 949f445). On an amd64 host: python3 runner.py --name "KADAI (using arm64 postgres image)" --uri "https://gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-resource-efficiency/" --branch "gmt-test-image-architecture" --filename "usage_scenario-arm64.yml" --skip-system-checks --dev-no-sleeps --dev-no-save --skip-unsafe On an arm64 host: python3 runner.py --name "KADAI (using amd64 postgres image)" --uri "https://gitlab.com/envite-consulting/sustainable-software-architecture/kadai/kadai-resource-efficiency/" --branch "gmt-test-image-architecture" --filename "usage_scenario-amd64.yml" --skip-system-checks --dev-no-sleeps --dev-no-save --skip-unsafe Error:
Using the current implementation in this branch the error message is improved:
Should we fail GMT runs if technically the image can be run, but is only emulated?No. Good point! I haven't thought about that and only tested the implementation with Docker native in WSL2 and not with Docker Desktop that would use the emulation of ARM. I have implemented the necessary changes to allow emulation in the following PR: Feel free to merge it into this one, if it makes sense to you. The check after the image pull was completely removed. Instead a delay of 1 second was introduced after the With the GMT run command from above it will now succeed if you use Docker Desktop (using emulation), but still fail if you use native Docker. Error message if running an incompatible image architecture on native Docker:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This base PR generally looks good and can be merged once we merge the follow-up improvement on top
Two of my colleagues have already encountered the issue of unintentionally using an ARM image as part of a usage scenario for a GMT measurement. Needless to say, the measurement run failed. However, the error message was confusing:
I think it would make sense to add a compatibility check, so a proper error message can be provided.
With this PR the run would fail early directly after the image pull and the error message would be