[integ-tests-framework] Make capacity reservations for all instance types#7461
[integ-tests-framework] Make capacity reservations for all instance types#7461hanwen-cluster wants to merge 1 commit into
Conversation
…ypes 1. With aws#7440, we started to make capacity reservations for {"c5.xlarge", "m6g.xlarge", "m6i.xlarge"}, and use other similar instance types if a capacity reservation fails to creat. This commit expands the logic to all instance types. 1.1. With instance types <= .xlarge, we make duplicate capacity reservations because multiple tests in parallel could use the same instance types, therefore need multiple capacity reservations. With instance types >.xlarge, we make only one capacity reservation because tests with larger instance types usually make capacity reservations early in the test definition (e.g. test_efa in commercial makes capacity reservation in `develop.yaml`), therefore this second layer of capacity reservation shouldn't make duplicate capacity reservations. 1.2. With instance types supporting EFA, create the capacity reservation in a placement group. With instance types not supporting EFA, create the capacity reservation without a placement group. 2. With this commit, resolve_instance_with_capacity allows specifying alternative_instance_types. Prior to this commit alternative_instance_types was always calculated with `get_similar_instance_types`, which could be too restrictive, so don't give too many alternatives for instance types like `c5n.18xlarge` 3. Improve test_efa in isolated_regions to take a flag to use any efa instances to avoid Insufficient Capacity Error. test_efa in commercial doesn't need this, because it could try out different regions. In isolated regions, the test has to run in a specific region.
| if ( | ||
| reservation.get("AvailabilityZoneId") == az_id | ||
| and reservation.get("InstancePlatform") == instance_platform | ||
| and bool(reservation.get("PlacementGroupArn")) == with_placement_group |
There was a problem hiding this comment.
[Blocking] Should we also check the CapacityResevations available capacity? Or is it handled in other place?
There was a problem hiding this comment.
This second layer of capacity reservation doesn't have visibility into how many instances are required by a test. So we just check a capacity reservation exists and don't care about the available capacity. If a test is peculiar on the available capacity, it should make explicit capacity reservation in develop.yaml
There was a problem hiding this comment.
If multiple tests run in parallel using the same ODCR, is there a risk of consuming all capacity and causing an ICE error?
| try: | ||
| return ec2_client.describe_placement_groups(GroupNames=[name])["PlacementGroups"][0]["GroupArn"] | ||
| except Exception: | ||
| return ec2_client.create_placement_group(GroupName=name, Strategy="cluster")["PlacementGroup"]["GroupArn"] |
There was a problem hiding this comment.
Where do we cleanup these PlacementGroup?
There was a problem hiding this comment.
We don't clean up. Since the naming of the placement groups is {instance_type}_placement_group_{az_id}, we won't create too many even without cleaning up. AWS limits 500 placement groups per region, which is safe for us
There was a problem hiding this comment.
Okay, makes sense, so these PlacementGroup are long live, and can avoid re-creating everytime.
Description of changes
1.1. With instance types <= .xlarge, we make duplicate capacity reservations because multiple tests in parallel could use the same instance types, therefore need multiple capacity reservations. With instance types >.xlarge, we make only one capacity reservation because tests with larger instance types usually make capacity reservations early in the test definition (e.g. test_efa in commercial makes capacity reservation in
develop.yaml), therefore this second layer of capacity reservation shouldn't make duplicate capacity reservations.1.2. With instance types supporting EFA, create the capacity reservation in a placement group. With instance types not supporting EFA, create the capacity reservation without a placement group.
get_similar_instance_types, which could be too restrictive, so don't give too many alternatives for instance types likec5n.18xlargeTests
In the above tests, the test in us-east-1 passed completely. The test in ap-southeast-5 failed some checks in fabtest because it was using g6.8xlarge. This failure is not a regression from this PR, and won't surface in isolated regions because fabtest is not run in isolated regions.
Checklist
developadd the branch name as prefix in the PR title (e.g.[release-3.6]).Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.