Adding the application Flower #31

Draft · wants to merge 160 commits into base: main

Commits (160)
b977e52
properly escape quotes so commands with json will work.
HacksonClark Mar 6, 2025
a39de4d
This error message isn't for actual errors, so I'm removing it
HacksonClark Mar 6, 2025
fb585e6
Added code for onboarding task evaluation
HacksonClark Mar 6, 2025
1952aa5
Add console logs since we no longer use env
HacksonClark Mar 6, 2025
2d3d79a
parse results before creating json
HacksonClark Mar 6, 2025
8bcdfaa
Get name for results file
HacksonClark Mar 6, 2025
e536280
typo fix
HacksonClark Mar 6, 2025
ce23d78
Add task message for assessment
HacksonClark Mar 6, 2025
ccb2620
Add hint to task message
HacksonClark Mar 6, 2025
9805c37
flwr basic example
adityapgupta Mar 9, 2025
abf9902
running flower with k8s
adityapgupta Mar 9, 2025
3702ca3
added flower
adityapgupta Mar 9, 2025
bc0b247
flower workload
adityapgupta Mar 10, 2025
615aaa1
added flower docker and k8s files
adityapgupta Mar 10, 2025
078388e
Merge remote-tracking branch 'upstream/main' into main
adityapgupta Mar 11, 2025
2d3b410
updated submodule
adityapgupta Mar 11, 2025
b91af01
testing edits
adityapgupta Mar 12, 2025
8a53f07
rename onboarding related files
yinfangchen Mar 19, 2025
c588825
Merge pull request #31 from xlab-uiuc/assessment
HacksonClark Mar 19, 2025
7cb7ff3
Handle errors from get_traces&get_metrics
daklqw Mar 19, 2025
fdbf62c
Merge pull request #33 from xlab-uiuc/ask_env-fix
HacksonClark Mar 20, 2025
9c0ef3c
fix the issue of getting log of the workload pod
yinfangchen Mar 20, 2025
17916ed
Merge pull request #34 from xlab-uiuc/default-namespace-handle
yinfangchen Mar 20, 2025
c7c44a3
Add helm as a requirement to README
HacksonClark Mar 20, 2025
32bbda8
Fixed link
HacksonClark Mar 21, 2025
12db7bc
Update README.md
yinfangchen Mar 21, 2025
ad0d5a5
Merge pull request #36 from xlab-uiuc/JacksonArthurClark-patch-1
yinfangchen Mar 21, 2025
cab9e0a
fix return value of read operations
yinfangchen Mar 27, 2025
45b0044
Merge pull request #39 from xlab-uiuc/fix-return-value-get
HacksonClark Mar 27, 2025
e984ed0
Update prometheus interface to use PVC everywhere
HacksonClark Mar 28, 2025
a858f6b
teardown openebs and prometheus during cleanup
HacksonClark Mar 28, 2025
c3a2ccf
Fix function name
HacksonClark Mar 28, 2025
ccc9c0d
fix function name
HacksonClark Mar 28, 2025
504f4d9
Delete pvc instead of pv
HacksonClark Mar 28, 2025
02a3e69
update parameter name
HacksonClark Mar 28, 2025
3509ac1
Fix issue 41 (Fix key error with TTA)
ChuanweiQu Apr 10, 2025
9711138
Issue 42 (Cleanup before exit)
ChuanweiQu Apr 10, 2025
30f4a42
This file is not closed
daklqw Apr 10, 2025
2cac6fc
Fix assign-non-existent-node evaluation issue
daklqw Apr 11, 2025
6d0fc72
Merge pull request #43 from xlab-uiuc/fix-TTA-key-error
HacksonClark Apr 14, 2025
5ac50aa
Merge pull request #45 from xlab-uiuc/patch-Apr-10
HacksonClark Apr 14, 2025
40029bb
Enhances trace processing with error and response tracking
daklqw Apr 14, 2025
59d923d
Merge remote-tracking branch 'origin/main' into trace-improve
daklqw Apr 14, 2025
4407c81
Merge pull request #40 from xlab-uiuc/teardown
yinfangchen Apr 14, 2025
529aefd
Merge pull request #47 from xlab-uiuc/trace-improve
HacksonClark Apr 14, 2025
4cacda9
fix service name issue
daklqw Apr 15, 2025
2751063
Merge remote-tracking branch 'origin/main' into trace-improve
daklqw Apr 15, 2025
2448117
Task prompts
daklqw Apr 15, 2025
3ac8379
Update README.md
HacksonClark Apr 16, 2025
8f2eb4d
Merge pull request #49 from xlab-uiuc/HacksonClark-patch-1
yinfangchen Apr 16, 2025
b86a686
Polish prompt
daklqw Apr 16, 2025
7b9e52c
Merge pull request #50 from xlab-uiuc/trace-improve
yinfangchen Apr 16, 2025
0dca58a
Recover fault before users catch exceptions
ChuanweiQu Apr 17, 2025
868f504
Merge branch 'main' into fault-recovery-before-exit
ChuanweiQu Apr 17, 2025
8026265
Merge pull request #51 from xlab-uiuc/fault-recovery-before-exit
HacksonClark Apr 17, 2025
0741e0c
Add remote chart parameter
HacksonClark Apr 18, 2025
82ed33f
Switch to remote chart
HacksonClark Apr 18, 2025
37ad626
nit: no need to print every loop iteration.
HacksonClark Apr 18, 2025
4dc7c8e
Remove local path
HacksonClark Apr 18, 2025
ec69911
Only use TARGET_MICROSERVICES for local charts
HacksonClark Apr 18, 2025
a380b1f
Fix configmap name
HacksonClark Apr 18, 2025
d89a947
Fix ad service failure name from configmap.
HacksonClark Apr 18, 2025
943ed0b
Restart so changes take effect
HacksonClark Apr 18, 2025
578d951
Update name.
HacksonClark Apr 18, 2025
740a4ae
Fix name to match configmap
HacksonClark Apr 18, 2025
487f522
Fix cm name
HacksonClark Apr 18, 2025
a2d65b5
change to cartFailure
HacksonClark Apr 18, 2025
a01abad
update cm name
HacksonClark Apr 18, 2025
35c9236
Fix names
HacksonClark Apr 18, 2025
f8f8374
Add proxy configuration tips for AIOpsLab users
Flemington8 Apr 20, 2025
67a0e74
Refine task descriptions and instructions for smaller models to ensur…
Flemington8 Apr 21, 2025
e14d413
Update README and llm.py to support .env file for API keys management
Flemington8 Apr 21, 2025
c52b816
Add Weights & Biases integration for session logging and orchestrator…
Flemington8 Apr 21, 2025
a862d39
Refactor W&B integration to use environment variable for configuratio…
Flemington8 Apr 21, 2025
a3d7403
Fix typo in W&B initialization comment for clarity
Flemington8 Apr 21, 2025
cad460d
Load environment variables from .env file to check the presence of "U…
Flemington8 Apr 21, 2025
5d19677
Merge pull request #52 from xlab-uiuc/astronomy-shop-fixes
yinfangchen Apr 21, 2025
d9e6aa4
Refactor Orchestrator initialization to use environment variable for …
Flemington8 Apr 22, 2025
fe63d75
Merge branch 'xlab-uiuc:main' into main
Flemington8 Apr 22, 2025
d887517
Merge pull request #53 from Flemington8/main
HacksonClark Apr 22, 2025
a1e0234
Add new clients and update README with usage instructions
Flemington8 Apr 22, 2025
ec64ea4
Add wandb dependency
HacksonClark Apr 22, 2025
c7edeae
Merge pull request #57 from xlab-uiuc/wandb-dep
HacksonClark Apr 22, 2025
30e1480
Merge remote-tracking branch 'origin/main' into flemington8-main
HacksonClark Apr 23, 2025
2b41afd
Regenerated lock file
HacksonClark Apr 23, 2025
d21f015
Merge pull request #56 from Flemington8/main
HacksonClark Apr 23, 2025
d820fa8
update faulty service to match actual pod names
HacksonClark Apr 23, 2025
aa0d4e0
Merge pull request #60 from xlab-uiuc/astro-localization
HacksonClark Apr 23, 2025
d3bfc37
Add helper function to get CPU architecture of k8s cluster
HacksonClark Apr 24, 2025
317ffd9
Update arch check to check on node and not local.
HacksonClark Apr 24, 2025
5e8d648
Merge pull request #62 from xlab-uiuc/remote-arch
HacksonClark Apr 24, 2025
93846c6
fix frontend service name
HacksonClark Apr 24, 2025
3c32fbf
Add noop problem for astronomy shop
HacksonClark Apr 24, 2025
7f0edd8
Merge pull request #64 from xlab-uiuc/noop-astro
HacksonClark Apr 24, 2025
e77d1ce
Remove operator problems, they are not mature enough for eval
HacksonClark Apr 24, 2025
704f462
Fix trace api in astronomy shop
HacksonClark Apr 24, 2025
da31cbc
Fix get_logs for astronomy shop
HacksonClark Apr 24, 2025
ec14444
Merge pull request #66 from xlab-uiuc/astro-observe
HacksonClark Apr 25, 2025
358ebce
Add agent implementations and registry for AIOpsLab clients
Flemington8 Apr 26, 2025
b728010
Refactor agent registry and add FastAPI service for AIOpsLab simulations
Flemington8 Apr 26, 2025
644af23
Fix simulate function to validate problem ID and retrieve agent corre…
Flemington8 Apr 26, 2025
4e46c3f
Update vLLMClient to use environment variable for base URL
Flemington7 Apr 27, 2025
4f38345
Merge branch 'microsoft:main' into main
adityapgupta Apr 27, 2025
14b8f40
Add vLLM support with configurable parameters in client and service
Flemington8 Apr 27, 2025
5ec9981
added llama client with working fault
adityapgupta Apr 27, 2025
63cdcea
Change service host to localhost for local development
Flemington8 Apr 28, 2025
1c6a63d
Remove unused parameters (top_k, min_p) from vLLMClient and related c…
Flemington8 Apr 28, 2025
61eb67e
Refactor vLLMClient and vLLMAgent to remove guided decoding regex par…
Flemington8 Apr 28, 2025
bf2f41b
Add scripts to load and pull Docker images from a specified list
Flemington8 Apr 28, 2025
4d7fe2d
Refactor CriticalSection to ensure signal handler is only set in the …
Flemington8 Apr 29, 2025
71bb764
Refactor out_tokens function to exclude disallowed special tokens and…
Flemington8 Apr 29, 2025
ddba052
Add system and user messages to simulation response trace
Flemington8 Apr 29, 2025
94f1502
Enhance simulation response by adding system and user messages to the…
Flemington8 Apr 29, 2025
9c71a2a
trace api threading fix
HacksonClark Apr 29, 2025
3ca8504
Merge pull request #69 from xlab-uiuc/threading-fix
yimingsu01 Apr 29, 2025
6566221
Fix loop to use problem registry object
HacksonClark Apr 29, 2025
7548951
Cleanly parse results from e2e run
HacksonClark Apr 29, 2025
692e994
Update flash code to run e2e and cleanly save results
HacksonClark Apr 29, 2025
d12080f
add agent name to results filename
HacksonClark Apr 29, 2025
114146b
Merge pull request #70 from xlab-uiuc/loop
yinfangchen Apr 29, 2025
c85fcaf
Merge https://github.com/xlab-uiuc/AIOpsLab into main
adityapgupta Apr 30, 2025
414de73
Merge branch 'xlab-uiuc:main' into main
Flemington8 Apr 30, 2025
590ee39
Fix mitigation oracle to check for binary
HacksonClark Apr 30, 2025
3240fc4
Add back oracle to check for healthy state
HacksonClark Apr 30, 2025
40e9323
Merge pull request #74 from xlab-uiuc/wrong-bin-oracle
yinfangchen Apr 30, 2025
67b7e35
Add error state to oracle check
HacksonClark Apr 30, 2025
3fcfb78
Merge pull request #75 from xlab-uiuc/misconfig-oracle-fix
yinfangchen Apr 30, 2025
0ff4651
Track threads globally for cleanup, also dynamically adjust port if n…
HacksonClark Apr 30, 2025
4166d74
Use tiktoken to avoid going over context window limit
HacksonClark May 1, 2025
bb2d98a
Merge pull request #76 from xlab-uiuc/eval-fixes
yimingsu01 May 1, 2025
e27416b
added delay to supernode stop
adityapgupta May 1, 2025
bf71fcd
Adding LLaMa client
Armxyz1 May 1, 2025
9440d95
Fix localization to consider both mongodb pods and normal pods as ans…
HacksonClark May 1, 2025
aae876b
Merge pull request #77 from xlab-uiuc/revoke_auth-localization-fix
yinfangchen May 1, 2025
4055bfe
hotfix: load generator flood homepage fault had wrong feature flag name.
HacksonClark May 2, 2025
a45a460
Merge pull request #78 from xlab-uiuc/hotfix-flood
yinfangchen May 2, 2025
50bbe00
Merge pull request #1 from adityapgupta/armaan
adityapgupta May 2, 2025
c6b6a7d
Fix localization oracle for storage_user_unregistered
HacksonClark May 2, 2025
f63f2df
Merge pull request #79 from xlab-uiuc/hotfix-user_unregistered
yinfangchen May 2, 2025
ee0beda
edited flower description
adityapgupta May 2, 2025
1ccc254
Merge branch 'xlab-uiuc:main' into main
Flemington8 May 3, 2025
0ed2f4a
docs: add optional steps for pre-pulling and loading required images …
Flemington8 May 3, 2025
60a5f0f
docs: update README with instructions for running agents locally and …
Flemington8 May 3, 2025
46bdd6b
wait for a few seconds to reduce evaluation false positives due to k8…
rMaxiQp May 3, 2025
fa8a244
Fix agents so they don't violate token limits
HacksonClark May 5, 2025
023a174
address PR comment
rMaxiQp May 5, 2025
098c684
Merge pull request #81 from xlab-uiuc/max/oracle-patch
HacksonClark May 5, 2025
00bc2da
cleanup
adityapgupta May 8, 2025
2fbf911
added model misconfig fault
adityapgupta May 8, 2025
54d2504
refactor: reorganize imports and modify simulation response handling
Flemington8 May 11, 2025
c2e3468
Merge branch 'xlab-uiuc:main' into main
Flemington8 May 11, 2025
a1d868e
fix: update service host to allow external access
Flemington8 May 11, 2025
551bb6c
fix: simplify error message in simulation endpoint
Flemington8 May 12, 2025
377f652
feat: add start_time and end_time to SimulationResponse model; update…
Flemington8 May 13, 2025
5272465
Merge branch 'main' into eval-agents
HacksonClark May 19, 2025
376f06e
Merge pull request #82 from xlab-uiuc/eval-agents
HacksonClark May 19, 2025
7c27429
Merge branch 'main' into main
HacksonClark May 21, 2025
0693e14
Merge pull request #80 from Flemington8/main
HacksonClark May 21, 2025
8300b2c
Merge https://github.com/xlab-uiuc/AIOpsLab into main
adityapgupta May 22, 2025
51da78a
fast forwarded aiopslab-applications
adityapgupta May 22, 2025
3 changes: 3 additions & 0 deletions .gitignore
@@ -327,6 +327,9 @@ cython_debug/
# Visual Studio Code
.vscode/

# Weight & Biases
wandb/

# Project specific
cache_dir
demos
84 changes: 81 additions & 3 deletions README.md
@@ -28,6 +28,7 @@ Moreover, AIOpsLab provides a built-in benchmark suite with a set of problems to

### Requirements
- Python >= 3.11
- [Helm](https://helm.sh/)

Recommended installation:
```bash
@@ -64,10 +65,23 @@ kind create cluster --config kind/kind-config-arm.yaml

If you're running into issues, consider building a Docker image for your machine by following this [README](kind/README.md). Please also open an issue.

### [Tips]
If you are running AIOpsLab behind a proxy, be aware that the HTTP proxy should be exported as `172.17.0.1`. When the kind cluster is created, all of its nodes inherit the proxy settings from the host environment and the Docker container.

The `172.17.0.1` address is used to communicate with the host machine. For more details, refer to the official guide: [Configure Kind to Use a Proxy](https://kind.sigs.k8s.io/docs/user/quick-start/#configure-kind-to-use-a-proxy).

Additionally, Docker doesn't support SOCKS5 proxies directly. If you're using a SOCKS5 proxy, you may need to use [Privoxy](https://www.privoxy.org) to forward SOCKS5 to HTTP.

If you're running vLLM and the LLM agent locally, Privoxy will by default proxy `localhost`, which will cause errors. To avoid this issue, set the following environment variable:

```bash
export no_proxy=localhost
```

After finishing cluster creation, proceed to the next "Update `config.yml`" step.

### b) Remote cluster
AIOpsLab supports any remote kubernetes cluster that your `kubectl` context is set to, whether it's a cluster from a cloud provider or one you build yourself. We have some Ansible playbooks we have to setup clusters on providers like [CloudLab](https://www.cloudlab.us/) and our own machines. Follow this [README](./scripts/ansible/README.md) to set up your own cluster, and then proceed to the next "Update `config.yml`" step.
AIOpsLab supports any remote kubernetes cluster that your `kubectl` context is set to, whether it's a cluster from a cloud provider or one you build yourself. We have some Ansible playbooks to setup clusters on providers like [CloudLab](https://www.cloudlab.us/) and our own machines. Follow this [README](./scripts/ansible/README.md) to set up your own cluster, and then proceed to the next "Update `config.yml`" step.

### Update `config.yml`
```bash
@@ -76,7 +90,7 @@ cp config.yml.example config.yml
```
Update your `config.yml` so that `k8s_host` is the host name of the control plane node of your cluster. Update `k8s_user` to be your username on the control plane node. If you are using a kind cluster, your `k8s_host` should be `kind`. If you're running AIOpsLab on the cluster itself, your `k8s_host` should be `localhost`.

### Running agents
### Running agents locally
Human as the agent:

```bash
@@ -89,19 +103,83 @@ python3 cli.py
Run GPT-4 baseline agent:

```bash
export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
# Create a .env file in the project root (if not exists)
echo "OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>" > .env
# Add more API keys as needed:
# echo "QWEN_API_KEY=<YOUR_QWEN_API_KEY>" >> .env
# echo "DEEPSEEK_API_KEY=<YOUR_DEEPSEEK_API_KEY>" >> .env

python3 clients/gpt.py # you can also change the problem to solve in the main() function
```

The clients will automatically load API keys from your .env file.
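
For reference, here is a minimal sketch of how that loading typically works; it assumes the `python-dotenv` package, and the actual logic in the clients may differ:

```python
# Illustrative sketch only: assumes python-dotenv; the real clients may load keys differently.
import os

from dotenv import load_dotenv

load_dotenv()  # reads KEY=value pairs from a .env file in the working directory
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY not found in the environment or .env file")
```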

You can conveniently check the running status of the cluster using [k9s](https://k9scli.io/) or other cluster monitoring tools.

To browse your logged `session_id` values in the W&B app as a table:

1. Make sure you have W&B installed and configured.
2. Set the `USE_WANDB` environment variable:
```bash
# Add to your .env file
echo "USE_WANDB=true" >> .env
```
3. In the W&B web UI, open any run and click Tables → Add Query Panel.
4. In the key field, type `runs.summary` and click `Run`; the results will be displayed as a table. (A sketch of the logging side follows these steps.)
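
For illustration, the sketch below shows what the logging side might look like; it assumes the orchestrator writes session data to the W&B run summary when `USE_WANDB=true`, which is what makes it queryable via `runs.summary`. The actual integration may differ:

```python
# Sketch only: assumes session results are written to the W&B run summary.
import os

import wandb

if os.getenv("USE_WANDB", "false").lower() == "true":
    run = wandb.init(project="aiopslab")  # project name is a placeholder
    run.summary["session_id"] = "example-session-id"  # placeholder value
    run.summary["problem_id"] = "misconfig_app_hotel_res-mitigation-1"
    run.finish()
```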

<h2 id="⚙️usage">⚙️ Usage</h2>

AIOpsLab can be used in the following ways:
- [Onboard your agent to AIOpsLab](#how-to-onboard-your-agent-to-aiopslab)
- [Add new applications to AIOpsLab](#how-to-add-new-applications-to-aiopslab)
- [Add new problems to AIOpsLab](#how-to-add-new-problems-to-aiopslab)

### Running agents remotely
You can run AIOpsLab on a remote machine with more computational resources. This section guides you through setting up and using AIOpsLab remotely.

1. **On the remote machine, start the AIOpsLab service**:

```bash
SERVICE_HOST=<YOUR_HOST> SERVICE_PORT=<YOUR_PORT> SERVICE_WORKERS=<YOUR_WORKERS> python service.py
```
2. **Test the connection from your local machine**:
On your local machine, you can test the connection to the remote AIOpsLab service using `curl`:

```bash
# Check if the service is running
curl http://<YOUR_HOST>:<YOUR_PORT>/health

# List available problems
curl http://<YOUR_HOST>:<YOUR_PORT>/problems

# List available agents
curl http://<YOUR_HOST>:<YOUR_PORT>/agents
```

3. **Run vLLM on the remote machine (if using vLLM agent):**
If you're using the vLLM agent, make sure to launch the vLLM server on the remote machine:

```bash
# On the remote machine
chmod +x ./clients/launch_vllm.sh
./clients/launch_vllm.sh
```
You can customize the model by editing `launch_vllm.sh` before running it.

4. **Run the agent**:
On your local machine, run the agent with the following command (a Python equivalent is sketched after these steps):

```bash
curl -X POST http://<YOUR_HOST>:<YOUR_PORT>/simulate \
-H "Content-Type: application/json" \
-d '{
"problem_id": "misconfig_app_hotel_res-mitigation-1",
"agent_name": "vllm",
"max_steps": 10,
"temperature": 0.7,
"top_p": 0.9
}'
```
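
Equivalently, you can issue the same request from Python. The sketch below mirrors the `curl` call above and assumes the `requests` package; the response schema is whatever the service returns:

```python
# Sketch only: mirrors the curl example above; assumes the requests package is installed.
import requests

BASE_URL = "http://<YOUR_HOST>:<YOUR_PORT>"  # replace with your service host and port

# Optional sanity check that the service is reachable
print(requests.get(f"{BASE_URL}/health", timeout=10).text)

payload = {
    "problem_id": "misconfig_app_hotel_res-mitigation-1",
    "agent_name": "vllm",
    "max_steps": 10,
    "temperature": 0.7,
    "top_p": 0.9,
}
resp = requests.post(f"{BASE_URL}/simulate", json=payload, timeout=3600)
resp.raise_for_status()
print(resp.json())
```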

### How to onboard your agent to AIOpsLab?

1 change: 0 additions & 1 deletion aiopslab/generators/fault/base.py
@@ -59,7 +59,6 @@ def _recover(
self._invoke_method("recover", fault_type, microservices)
elif fault_type:
self._invoke_method("recover", fault_type)
time.sleep(6)

def _invoke_method(self, action_prefix, *args):
"""helper: injects/recovers faults based on name"""
11 changes: 10 additions & 1 deletion aiopslab/generators/fault/inject_otel.py
@@ -8,7 +8,7 @@ class OtelFaultInjector(FaultInjector):
def __init__(self, namespace: str):
self.namespace = namespace
self.kubectl = KubeCtl()
self.configmap_name = f"{namespace}-flagd-config"
self.configmap_name = "flagd-config"

def inject_fault(self, feature_flag: str):
command = (
@@ -39,6 +39,11 @@ def inject_fault(self, feature_flag: str):
self.kubectl.create_or_update_configmap(
self.configmap_name, self.namespace, updated_data
)

self.kubectl.exec_command(
f"kubectl rollout restart deployment flagd -n {self.namespace}"
)

print(f"Fault injected: Feature flag '{feature_flag}' set to 'on'.")

def recover_fault(self, feature_flag: str):
@@ -70,6 +75,10 @@ def recover_fault(self, feature_flag: str):
self.kubectl.create_or_update_configmap(
self.configmap_name, self.namespace, updated_data
)

self.kubectl.exec_command(
f"kubectl rollout restart deployment flagd -n {self.namespace}"
)
print(f"Fault recovered: Feature flag '{feature_flag}' set to 'off'.")


32 changes: 31 additions & 1 deletion aiopslab/generators/fault/inject_virtual.py
@@ -8,6 +8,7 @@

from aiopslab.service.kubectl import KubeCtl
from aiopslab.service.helm import Helm
from aiopslab.service.dock import Docker
from aiopslab.generators.fault.base import FaultInjector
from aiopslab.service.apps.base import Application
from aiopslab.paths import TARGET_MICROSERVICES
@@ -18,6 +19,7 @@ def __init__(self, namespace: str):
super().__init__(namespace)
self.namespace = namespace
self.kubectl = KubeCtl()
self.docker = Docker()
self.mongo_service_pod_map = {
"url-shorten-mongodb": "url-shorten-service",
}
@@ -248,7 +250,35 @@ def recover_wrong_bin_usage(self, microservices: list[str]):
self.kubectl.exec_command(apply_command)

print(f"Recovered from wrong binary usage fault for service: {service}")


def inject_container_stop(self, microservices: list[str]):
"""Inject a fault to stop a container."""
for service in microservices:
self.docker.get_container(service).stop()
print(f"Stopped container {service}.")

print("Waiting for faults to propagate...")
time.sleep(15)
print("Faults propagated.")

def recover_container_stop(self, microservices: list[str]):
for service in microservices:
self.docker.get_container(service).start()
print(f"Started container {service}.")

def inject_model_misconfig(self, microservices: list[str]):
"""Inject a fault to misconfigure the model in the Flower application."""
for service in microservices:
command = f""" docker exec -it {service} sh -c "sed -i '24s/84/80/' /app/.flwr/apps/*/task.py" """
self.docker.exec_command(command)
print(f"Changed model configuration for service: {service}")

def recover_model_misconfig(self, microservices: list[str]):
for service in microservices:
command = f""" docker exec -it {service} sh -c "sed -i '24s/80/84/' /app/.flwr/apps/*/task.py" """
self.docker.exec_command(command)
print(f"Recovered model configuration for service: {service}")

############# HELPER FUNCTIONS ################
def _wait_for_pods_ready(self, microservices: list[str], timeout: int = 30):
for service in microservices:
3 changes: 2 additions & 1 deletion aiopslab/observer/__init__.py
@@ -12,7 +12,8 @@
root_path = pathlib.Path(__file__).parent
sys.path.append(root_path)
# read the configuration file
monitor_config = full_load(open(root_path / "monitor_config.yaml", "r"))
with open(root_path / "monitor_config.yaml", "r") as f:
monitor_config = full_load(f)


# root_config = full_load(open(root_path / "config.yaml", "r"))
31 changes: 28 additions & 3 deletions aiopslab/observer/metric_api.py
@@ -138,7 +138,8 @@ class PrometheusAPI:
# disable_ssl – (bool) if True, will skip prometheus server's http requests' SSL certificate
def __init__(self, url: str, namespace: str):
self.namespace = namespace
self.port = 32000
self.output_threads = []
self.port = self.find_free_port()
self.port_forward_process = None
self.stop_event = threading.Event()
self.start_port_forward()
@@ -151,6 +152,13 @@ def __init__(self, url: str, namespace: str):
def is_port_in_use(self, port):
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
return s.connect_ex(("127.0.0.1", port)) == 0

def find_free_port(self, start=32000, end=32100):
for port in range(start, end):
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
if s.connect_ex(("127.0.0.1", port)) != 0:
return port
raise RuntimeError("No free ports available in the range.")

def print_output(self, stream):
"""Thread function to print output from a subprocess stream non-blockingly."""
@@ -197,6 +205,7 @@ def start_port_forward(self):
)
thread_out.start()
thread_err.start()
self.output_threads.extend([thread_out, thread_err])

time.sleep(3) # Wait a bit for the port-forward to establish

@@ -209,13 +218,29 @@ def start_port_forward(self):
print("Failed to establish port forwarding after multiple attempts.")

def stop_port_forward(self):
"""Stops the kubectl port-forward command."""
"""Stops the kubectl port-forward command and cleans up resources."""
if self.port_forward_process:
self.port_forward_process.terminate()
self.port_forward_process.wait()
try:
self.port_forward_process.wait(timeout=5)
except subprocess.TimeoutExpired:
print("Port-forward process did not terminate in time, killing...")
self.port_forward_process.kill()

self.stop_event.set()

if self.port_forward_process.stdout:
self.port_forward_process.stdout.close()
if self.port_forward_process.stderr:
self.port_forward_process.stderr.close()

print("Port forwarding stopped.")

for thread in self.output_threads:
thread.join(timeout=5)
if thread.is_alive():
print(f"Warning: Thread {thread.name} did not terminate cleanly.")

def cleanup(self):
"""Cleanup resources like port-forwarding."""
self.stop_port_forward()