Adding the application Flower #31

Draft · wants to merge 160 commits into base: main

Commits (160)
b977e52
properly escape quotes so commands with json will work.
HacksonClark Mar 6, 2025
a39de4d
This error message isn't for actual errors, so I'm removing it
HacksonClark Mar 6, 2025
fb585e6
Added code for onboarding task evaluation
HacksonClark Mar 6, 2025
1952aa5
Add console logs since we no longer use env
HacksonClark Mar 6, 2025
2d3d79a
parse results before creating json
HacksonClark Mar 6, 2025
8bcdfaa
Get name for results file
HacksonClark Mar 6, 2025
e536280
typo fix
HacksonClark Mar 6, 2025
ce23d78
Add task message for assessment
HacksonClark Mar 6, 2025
ccb2620
Add hint to task message
HacksonClark Mar 6, 2025
9805c37
flwr basic example
adityapgupta Mar 9, 2025
abf9902
running flower with k8s
adityapgupta Mar 9, 2025
3702ca3
added flower
adityapgupta Mar 9, 2025
bc0b247
flower workload
adityapgupta Mar 10, 2025
615aaa1
added flower docker and k8s files
adityapgupta Mar 10, 2025
078388e
Merge remote-tracking branch 'upstream/main' into main
adityapgupta Mar 11, 2025
2d3b410
updated submodule
adityapgupta Mar 11, 2025
b91af01
testing edits
adityapgupta Mar 12, 2025
8a53f07
rename onboarding related files
yinfangchen Mar 19, 2025
c588825
Merge pull request #31 from xlab-uiuc/assessment
HacksonClark Mar 19, 2025
7cb7ff3
Handle errors from get_traces&get_metrics
daklqw Mar 19, 2025
fdbf62c
Merge pull request #33 from xlab-uiuc/ask_env-fix
HacksonClark Mar 20, 2025
9c0ef3c
fix the issue of getting log of the workload pod
yinfangchen Mar 20, 2025
17916ed
Merge pull request #34 from xlab-uiuc/default-namespace-handle
yinfangchen Mar 20, 2025
c7c44a3
Add helm as a requirement to README
HacksonClark Mar 20, 2025
32bbda8
Fixed link
HacksonClark Mar 21, 2025
12db7bc
Update README.md
yinfangchen Mar 21, 2025
ad0d5a5
Merge pull request #36 from xlab-uiuc/JacksonArthurClark-patch-1
yinfangchen Mar 21, 2025
cab9e0a
fix return value of read operations
yinfangchen Mar 27, 2025
45b0044
Merge pull request #39 from xlab-uiuc/fix-return-value-get
HacksonClark Mar 27, 2025
e984ed0
Update prometheus interface to use PVC everywhere
HacksonClark Mar 28, 2025
a858f6b
teardown openebs and prometheus during cleanup
HacksonClark Mar 28, 2025
c3a2ccf
Fix function name
HacksonClark Mar 28, 2025
ccc9c0d
fix function name
HacksonClark Mar 28, 2025
504f4d9
Delete pvc instead of pv
HacksonClark Mar 28, 2025
02a3e69
update parameter name
HacksonClark Mar 28, 2025
3509ac1
Fix issue 41 (Fix key error with TTA)
ChuanweiQu Apr 10, 2025
9711138
Issue 42 (Cleanup before exit)
ChuanweiQu Apr 10, 2025
30f4a42
This file is not closed
daklqw Apr 10, 2025
2cac6fc
Fix assign-non-existent-node evaluation issue
daklqw Apr 11, 2025
6d0fc72
Merge pull request #43 from xlab-uiuc/fix-TTA-key-error
HacksonClark Apr 14, 2025
5ac50aa
Merge pull request #45 from xlab-uiuc/patch-Apr-10
HacksonClark Apr 14, 2025
40029bb
Enhances trace processing with error and response tracking
daklqw Apr 14, 2025
59d923d
Merge remote-tracking branch 'origin/main' into trace-improve
daklqw Apr 14, 2025
4407c81
Merge pull request #40 from xlab-uiuc/teardown
yinfangchen Apr 14, 2025
529aefd
Merge pull request #47 from xlab-uiuc/trace-improve
HacksonClark Apr 14, 2025
4cacda9
fix service name issue
daklqw Apr 15, 2025
2751063
Merge remote-tracking branch 'origin/main' into trace-improve
daklqw Apr 15, 2025
2448117
Task prompts
daklqw Apr 15, 2025
3ac8379
Update README.md
HacksonClark Apr 16, 2025
8f2eb4d
Merge pull request #49 from xlab-uiuc/HacksonClark-patch-1
yinfangchen Apr 16, 2025
b86a686
Polish prompt
daklqw Apr 16, 2025
7b9e52c
Merge pull request #50 from xlab-uiuc/trace-improve
yinfangchen Apr 16, 2025
0dca58a
Recover fault before users catch exceptions
ChuanweiQu Apr 17, 2025
868f504
Merge branch 'main' into fault-recovery-before-exit
ChuanweiQu Apr 17, 2025
8026265
Merge pull request #51 from xlab-uiuc/fault-recovery-before-exit
HacksonClark Apr 17, 2025
0741e0c
Add remote chart parameter
HacksonClark Apr 18, 2025
82ed33f
Switch to remote chart
HacksonClark Apr 18, 2025
37ad626
nit: no need to print every loop iteration.
HacksonClark Apr 18, 2025
4dc7c8e
Remove local path
HacksonClark Apr 18, 2025
ec69911
Only use TARGET_MICROSERVICES for local charts
HacksonClark Apr 18, 2025
a380b1f
Fix configmap name
HacksonClark Apr 18, 2025
d89a947
Fix ad service failure name from configmap.
HacksonClark Apr 18, 2025
943ed0b
Restart so changes take effect
HacksonClark Apr 18, 2025
578d951
Update name.
HacksonClark Apr 18, 2025
740a4ae
Fix name to match configmap
HacksonClark Apr 18, 2025
487f522
Fix cm name
HacksonClark Apr 18, 2025
a2d65b5
change to cartFailure
HacksonClark Apr 18, 2025
a01abad
update cm name
HacksonClark Apr 18, 2025
35c9236
Fix names
HacksonClark Apr 18, 2025
f8f8374
Add proxy configuration tips for AIOpsLab users
Flemington8 Apr 20, 2025
67a0e74
Refine task descriptions and instructions for smaller models to ensur…
Flemington8 Apr 21, 2025
e14d413
Update README and llm.py to support .env file for API keys management
Flemington8 Apr 21, 2025
c52b816
Add Weights & Biases integration for session logging and orchestrator…
Flemington8 Apr 21, 2025
a862d39
Refactor W&B integration to use environment variable for configuratio…
Flemington8 Apr 21, 2025
a3d7403
Fix typo in W&B initialization comment for clarity
Flemington8 Apr 21, 2025
cad460d
Load environment variables from .env file to check the presence of "U…
Flemington8 Apr 21, 2025
5d19677
Merge pull request #52 from xlab-uiuc/astronomy-shop-fixes
yinfangchen Apr 21, 2025
d9e6aa4
Refactor Orchestrator initialization to use environment variable for …
Flemington8 Apr 22, 2025
fe63d75
Merge branch 'xlab-uiuc:main' into main
Flemington8 Apr 22, 2025
d887517
Merge pull request #53 from Flemington8/main
HacksonClark Apr 22, 2025
a1e0234
Add new clients and update README with usage instructions
Flemington8 Apr 22, 2025
ec64ea4
Add wandb dependency
HacksonClark Apr 22, 2025
c7edeae
Merge pull request #57 from xlab-uiuc/wandb-dep
HacksonClark Apr 22, 2025
30e1480
Merge remote-tracking branch 'origin/main' into flemington8-main
HacksonClark Apr 23, 2025
2b41afd
Regenerated lock file
HacksonClark Apr 23, 2025
d21f015
Merge pull request #56 from Flemington8/main
HacksonClark Apr 23, 2025
d820fa8
update faulty service to match actual pod names
HacksonClark Apr 23, 2025
aa0d4e0
Merge pull request #60 from xlab-uiuc/astro-localization
HacksonClark Apr 23, 2025
d3bfc37
Add helper function to get CPU architecture of k8s cluster
HacksonClark Apr 24, 2025
317ffd9
Update arch check to check on node and not local.
HacksonClark Apr 24, 2025
5e8d648
Merge pull request #62 from xlab-uiuc/remote-arch
HacksonClark Apr 24, 2025
93846c6
fix frontend service name
HacksonClark Apr 24, 2025
3c32fbf
Add noop problem for astronomy shop
HacksonClark Apr 24, 2025
7f0edd8
Merge pull request #64 from xlab-uiuc/noop-astro
HacksonClark Apr 24, 2025
e77d1ce
Remove operator problems, they are not mature enough for eval
HacksonClark Apr 24, 2025
704f462
Fix trace api in astronomy shop
HacksonClark Apr 24, 2025
da31cbc
Fix get_logs for astronomy shop
HacksonClark Apr 24, 2025
ec14444
Merge pull request #66 from xlab-uiuc/astro-observe
HacksonClark Apr 25, 2025
358ebce
Add agent implementations and registry for AIOpsLab clients
Flemington8 Apr 26, 2025
b728010
Refactor agent registry and add FastAPI service for AIOpsLab simulations
Flemington8 Apr 26, 2025
644af23
Fix simulate function to validate problem ID and retrieve agent corre…
Flemington8 Apr 26, 2025
4e46c3f
Update vLLMClient to use environment variable for base URL
Flemington7 Apr 27, 2025
4f38345
Merge branch 'microsoft:main' into main
adityapgupta Apr 27, 2025
14b8f40
Add vLLM support with configurable parameters in client and service
Flemington8 Apr 27, 2025
5ec9981
added llama client with working fault
adityapgupta Apr 27, 2025
63cdcea
Change service host to localhost for local development
Flemington8 Apr 28, 2025
1c6a63d
Remove unused parameters (top_k, min_p) from vLLMClient and related c…
Flemington8 Apr 28, 2025
61eb67e
Refactor vLLMClient and vLLMAgent to remove guided decoding regex par…
Flemington8 Apr 28, 2025
bf2f41b
Add scripts to load and pull Docker images from a specified list
Flemington8 Apr 28, 2025
4d7fe2d
Refactor CriticalSection to ensure signal handler is only set in the …
Flemington8 Apr 29, 2025
71bb764
Refactor out_tokens function to exclude disallowed special tokens and…
Flemington8 Apr 29, 2025
ddba052
Add system and user messages to simulation response trace
Flemington8 Apr 29, 2025
94f1502
Enhance simulation response by adding system and user messages to the…
Flemington8 Apr 29, 2025
9c71a2a
trace api threading fix
HacksonClark Apr 29, 2025
3ca8504
Merge pull request #69 from xlab-uiuc/threading-fix
yimingsu01 Apr 29, 2025
6566221
Fix loop to use problem registry object
HacksonClark Apr 29, 2025
7548951
Cleanly parse results from e2e run
HacksonClark Apr 29, 2025
692e994
Update flash code to run e2e and cleanly save results
HacksonClark Apr 29, 2025
d12080f
add agent name to results filename
HacksonClark Apr 29, 2025
114146b
Merge pull request #70 from xlab-uiuc/loop
yinfangchen Apr 29, 2025
c85fcaf
Merge https://github.com/xlab-uiuc/AIOpsLab into main
adityapgupta Apr 30, 2025
414de73
Merge branch 'xlab-uiuc:main' into main
Flemington8 Apr 30, 2025
590ee39
Fix mitigation oracle to check for binary
HacksonClark Apr 30, 2025
3240fc4
Add back oracle to check for healthy state
HacksonClark Apr 30, 2025
40e9323
Merge pull request #74 from xlab-uiuc/wrong-bin-oracle
yinfangchen Apr 30, 2025
67b7e35
Add error state to oracle check
HacksonClark Apr 30, 2025
3fcfb78
Merge pull request #75 from xlab-uiuc/misconfig-oracle-fix
yinfangchen Apr 30, 2025
0ff4651
Track threads globally for cleanup, also dynamically adjust port if n…
HacksonClark Apr 30, 2025
4166d74
Use tiktoken to avoid going over context window limit
HacksonClark May 1, 2025
bb2d98a
Merge pull request #76 from xlab-uiuc/eval-fixes
yimingsu01 May 1, 2025
e27416b
added delay to supernode stop
adityapgupta May 1, 2025
bf71fcd
Adding LLaMa client
Armxyz1 May 1, 2025
9440d95
Fix localization to consider both mongodb pods and normal pods as ans…
HacksonClark May 1, 2025
aae876b
Merge pull request #77 from xlab-uiuc/revoke_auth-localization-fix
yinfangchen May 1, 2025
4055bfe
hotfix: load generator flood homepage fault had wrong feature flag name.
HacksonClark May 2, 2025
a45a460
Merge pull request #78 from xlab-uiuc/hotfix-flood
yinfangchen May 2, 2025
50bbe00
Merge pull request #1 from adityapgupta/armaan
adityapgupta May 2, 2025
c6b6a7d
Fix localization oracle for storage_user_unregistered
HacksonClark May 2, 2025
f63f2df
Merge pull request #79 from xlab-uiuc/hotfix-user_unregistered
yinfangchen May 2, 2025
ee0beda
edited flower description
adityapgupta May 2, 2025
1ccc254
Merge branch 'xlab-uiuc:main' into main
Flemington8 May 3, 2025
0ed2f4a
docs: add optional steps for pre-pulling and loading required images …
Flemington8 May 3, 2025
60a5f0f
docs: update README with instructions for running agents locally and …
Flemington8 May 3, 2025
46bdd6b
wait for a few seconds to reduce evaluation false positives due to k8…
rMaxiQp May 3, 2025
fa8a244
Fix agents so they don't violate token limits
HacksonClark May 5, 2025
023a174
address PR comment
rMaxiQp May 5, 2025
098c684
Merge pull request #81 from xlab-uiuc/max/oracle-patch
HacksonClark May 5, 2025
00bc2da
cleanup
adityapgupta May 8, 2025
2fbf911
added model misconfig fault
adityapgupta May 8, 2025
54d2504
refactor: reorganize imports and modify simulation response handling
Flemington8 May 11, 2025
c2e3468
Merge branch 'xlab-uiuc:main' into main
Flemington8 May 11, 2025
a1d868e
fix: update service host to allow external access
Flemington8 May 11, 2025
551bb6c
fix: simplify error message in simulation endpoint
Flemington8 May 12, 2025
377f652
feat: add start_time and end_time to SimulationResponse model; update…
Flemington8 May 13, 2025
5272465
Merge branch 'main' into eval-agents
HacksonClark May 19, 2025
376f06e
Merge pull request #82 from xlab-uiuc/eval-agents
HacksonClark May 19, 2025
7c27429
Merge branch 'main' into main
HacksonClark May 21, 2025
0693e14
Merge pull request #80 from Flemington8/main
HacksonClark May 21, 2025
8300b2c
Merge https://github.com/xlab-uiuc/AIOpsLab into main
adityapgupta May 22, 2025
51da78a
fast forwarded aiopslab-applications
adityapgupta May 22, 2025
3 changes: 3 additions & 0 deletions .gitignore
@@ -327,6 +327,9 @@ cython_debug/
# Visual Studio Code
.vscode/

# Weight & Biases
wandb/

# Project specific
cache_dir
demos
84 changes: 81 additions & 3 deletions README.md
@@ -28,6 +28,7 @@ Moreover, AIOpsLab provides a built-in benchmark suite with a set of problems to

### Requirements
- Python >= 3.11
- [Helm](https://helm.sh/)

Recommended installation:
```bash
@@ -64,10 +65,23 @@ kind create cluster --config kind/kind-config-arm.yaml

If you're running into issues, consider building a Docker image for your machine by following this [README](kind/README.md). Please also open an issue.

### [Tips]
If you are running AIOpsLab behind a proxy, be aware that the HTTP proxy should be exported as `172.17.0.1`. When the kind cluster is created, all of its nodes inherit the proxy settings from the host environment and the Docker container.

The `172.17.0.1` address is used to communicate with the host machine. For more details, refer to the official guide: [Configure Kind to Use a Proxy](https://kind.sigs.k8s.io/docs/user/quick-start/#configure-kind-to-use-a-proxy).

Additionally, Docker doesn't support SOCKS5 proxies directly. If you're using a SOCKS5 proxy, you may need to use [Privoxy](https://www.privoxy.org) to forward SOCKS5 to HTTP.

If you're running vLLM and the LLM agent locally, Privoxy will by default proxy `localhost`, which will cause errors. To avoid this issue, set the following environment variable:

```bash
export no_proxy=localhost
```

After finishing cluster creation, proceed to the next "Update `config.yml`" step.

### b) Remote cluster
AIOpsLab supports any remote kubernetes cluster that your `kubectl` context is set to, whether it's a cluster from a cloud provider or one you build yourself. We have some Ansible playbooks we have to setup clusters on providers like [CloudLab](https://www.cloudlab.us/) and our own machines. Follow this [README](./scripts/ansible/README.md) to set up your own cluster, and then proceed to the next "Update `config.yml`" step.
AIOpsLab supports any remote kubernetes cluster that your `kubectl` context is set to, whether it's a cluster from a cloud provider or one you build yourself. We have some Ansible playbooks to setup clusters on providers like [CloudLab](https://www.cloudlab.us/) and our own machines. Follow this [README](./scripts/ansible/README.md) to set up your own cluster, and then proceed to the next "Update `config.yml`" step.

### Update `config.yml`
```bash
@@ -76,7 +90,7 @@ cp config.yml.example config.yml
```
Update your `config.yml` so that `k8s_host` is the host name of the control plane node of your cluster. Update `k8s_user` to be your username on the control plane node. If you are using a kind cluster, your `k8s_host` should be `kind`. If you're running AIOpsLab on the cluster itself, your `k8s_host` should be `localhost`.

### Running agents
### Running agents locally
Human as the agent:

```bash
@@ -89,19 +103,83 @@ python3 cli.py
Run GPT-4 baseline agent:

```bash
export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
# Create a .env file in the project root (if not exists)
echo "OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>" > .env
# Add more API keys as needed:
# echo "QWEN_API_KEY=<YOUR_QWEN_API_KEY>" >> .env
# echo "DEEPSEEK_API_KEY=<YOUR_DEEPSEEK_API_KEY>" >> .env

python3 clients/gpt.py # you can also change the problem to solve in the main() function
```

The clients will automatically load API keys from your .env file.
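
For reference, here is a minimal sketch of how that loading typically works; it assumes the `python-dotenv` package, and the actual logic in the clients may differ:

```python
# Illustrative sketch only: assumes python-dotenv; the real clients may load keys differently.
import os

from dotenv import load_dotenv

load_dotenv()  # reads KEY=value pairs from a .env file in the working directory
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY not found in the environment or .env file")
```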

You can conveniently check the running status of the cluster using [k9s](https://k9scli.io/) or other cluster monitoring tools.

To browse your logged `session_id` values in the W&B app as a table:

1. Make sure you have W&B installed and configured.
2. Set the `USE_WANDB` environment variable:
```bash
# Add to your .env file
echo "USE_WANDB=true" >> .env
```
3. In the W&B web UI, open any run and click Tables → Add Query Panel.
4. In the key field, type `runs.summary` and click `Run`; the results will be displayed as a table. (A sketch of the logging side follows these steps.)
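
For illustration, the sketch below shows what the logging side might look like; it assumes the orchestrator writes session data to the W&B run summary when `USE_WANDB=true`, which is what makes it queryable via `runs.summary`. The actual integration may differ:

```python
# Sketch only: assumes session results are written to the W&B run summary.
import os

import wandb

if os.getenv("USE_WANDB", "false").lower() == "true":
    run = wandb.init(project="aiopslab")  # project name is a placeholder
    run.summary["session_id"] = "example-session-id"  # placeholder value
    run.summary["problem_id"] = "misconfig_app_hotel_res-mitigation-1"
    run.finish()
```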

<h2 id="⚙️usage">⚙️ Usage</h2>

AIOpsLab can be used in the following ways:
- [Onboard your agent to AIOpsLab](#how-to-onboard-your-agent-to-aiopslab)
- [Add new applications to AIOpsLab](#how-to-add-new-applications-to-aiopslab)
- [Add new problems to AIOpsLab](#how-to-add-new-problems-to-aiopslab)

### Running agents remotely
You can run AIOpsLab on a remote machine with more computational resources. This section guides you through setting up and using AIOpsLab remotely.

1. **On the remote machine, start the AIOpsLab service**:

```bash
SERVICE_HOST=<YOUR_HOST> SERVICE_PORT=<YOUR_PORT> SERVICE_WORKERS=<YOUR_WORKERS> python service.py
```
2. **Test the connection from your local machine**:
On your local machine, you can test the connection to the remote AIOpsLab service using `curl`:

```bash
# Check if the service is running
curl http://<YOUR_HOST>:<YOUR_PORT>/health

# List available problems
curl http://<YOUR_HOST>:<YOUR_PORT>/problems

# List available agents
curl http://<YOUR_HOST>:<YOUR_PORT>/agents
```

3. **Run vLLM on the remote machine (if using vLLM agent):**
If you're using the vLLM agent, make sure to launch the vLLM server on the remote machine:

```bash
# On the remote machine
chmod +x ./clients/launch_vllm.sh
./clients/launch_vllm.sh
```
You can customize the model by editing `launch_vllm.sh` before running it.

4. **Run the agent**:
On your local machine, run the agent with the following command (a Python equivalent is sketched after these steps):

```bash
curl -X POST http://<YOUR_HOST>:<YOUR_PORT>/simulate \
-H "Content-Type: application/json" \
-d '{
"problem_id": "misconfig_app_hotel_res-mitigation-1",
"agent_name": "vllm",
"max_steps": 10,
"temperature": 0.7,
"top_p": 0.9
}'
```
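
Equivalently, you can issue the same request from Python. The sketch below mirrors the `curl` call above and assumes the `requests` package; the response schema is whatever the service returns:

```python
# Sketch only: mirrors the curl example above; assumes the requests package is installed.
import requests

BASE_URL = "http://<YOUR_HOST>:<YOUR_PORT>"  # replace with your service host and port

# Optional sanity check that the service is reachable
print(requests.get(f"{BASE_URL}/health", timeout=10).text)

payload = {
    "problem_id": "misconfig_app_hotel_res-mitigation-1",
    "agent_name": "vllm",
    "max_steps": 10,
    "temperature": 0.7,
    "top_p": 0.9,
}
resp = requests.post(f"{BASE_URL}/simulate", json=payload, timeout=3600)
resp.raise_for_status()
print(resp.json())
```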

### How to onboard your agent to AIOpsLab?

1 change: 0 additions & 1 deletion aiopslab/generators/fault/base.py
@@ -59,7 +59,6 @@ def _recover(
self._invoke_method("recover", fault_type, microservices)
elif fault_type:
self._invoke_method("recover", fault_type)
time.sleep(6)

def _invoke_method(self, action_prefix, *args):
"""helper: injects/recovers faults based on name"""
11 changes: 10 additions & 1 deletion aiopslab/generators/fault/inject_otel.py
@@ -8,7 +8,7 @@ class OtelFaultInjector(FaultInjector):
def __init__(self, namespace: str):
self.namespace = namespace
self.kubectl = KubeCtl()
self.configmap_name = f"{namespace}-flagd-config"
self.configmap_name = "flagd-config"

def inject_fault(self, feature_flag: str):
command = (
@@ -39,6 +39,11 @@ def inject_fault(self, feature_flag: str):
self.kubectl.create_or_update_configmap(
self.configmap_name, self.namespace, updated_data
)

self.kubectl.exec_command(
f"kubectl rollout restart deployment flagd -n {self.namespace}"
)

print(f"Fault injected: Feature flag '{feature_flag}' set to 'on'.")

def recover_fault(self, feature_flag: str):
@@ -70,6 +75,10 @@ def recover_fault(self, feature_flag: str):
self.kubectl.create_or_update_configmap(
self.configmap_name, self.namespace, updated_data
)

self.kubectl.exec_command(
f"kubectl rollout restart deployment flagd -n {self.namespace}"
)
print(f"Fault recovered: Feature flag '{feature_flag}' set to 'off'.")


32 changes: 31 additions & 1 deletion aiopslab/generators/fault/inject_virtual.py
@@ -8,6 +8,7 @@

from aiopslab.service.kubectl import KubeCtl
from aiopslab.service.helm import Helm
from aiopslab.service.dock import Docker
from aiopslab.generators.fault.base import FaultInjector
from aiopslab.service.apps.base import Application
from aiopslab.paths import TARGET_MICROSERVICES
@@ -18,6 +19,7 @@ def __init__(self, namespace: str):
super().__init__(namespace)
self.namespace = namespace
self.kubectl = KubeCtl()
self.docker = Docker()
self.mongo_service_pod_map = {
"url-shorten-mongodb": "url-shorten-service",
}
@@ -248,7 +250,35 @@ def recover_wrong_bin_usage(self, microservices: list[str]):
self.kubectl.exec_command(apply_command)

print(f"Recovered from wrong binary usage fault for service: {service}")


def inject_container_stop(self, microservices: list[str]):
"""Inject a fault to stop a container."""
for service in microservices:
self.docker.get_container(service).stop()
print(f"Stopped container {service}.")

print("Waiting for faults to propagate...")
time.sleep(15)
print("Faults propagated.")

def recover_container_stop(self, microservices: list[str]):
for service in microservices:
self.docker.get_container(service).start()
print(f"Started container {service}.")

def inject_model_misconfig(self, microservices: list[str]):
"""Inject a fault to misconfigure the model in the Flower application."""
for service in microservices:
command = f""" docker exec -it {service} sh -c "sed -i '24s/84/80/' /app/.flwr/apps/*/task.py" """
self.docker.exec_command(command)
print(f"Changed model configuration for service: {service}")

def recover_model_misconfig(self, microservices: list[str]):
for service in microservices:
command = f""" docker exec -it {service} sh -c "sed -i '24s/80/84/' /app/.flwr/apps/*/task.py" """
self.docker.exec_command(command)
print(f"Recovered model configuration for service: {service}")

############# HELPER FUNCTIONS ################
def _wait_for_pods_ready(self, microservices: list[str], timeout: int = 30):
for service in microservices:
3 changes: 2 additions & 1 deletion aiopslab/observer/__init__.py
@@ -12,7 +12,8 @@
root_path = pathlib.Path(__file__).parent
sys.path.append(root_path)
# read the configuration file
monitor_config = full_load(open(root_path / "monitor_config.yaml", "r"))
with open(root_path / "monitor_config.yaml", "r") as f:
monitor_config = full_load(f)


# root_config = full_load(open(root_path / "config.yaml", "r"))
31 changes: 28 additions & 3 deletions aiopslab/observer/metric_api.py
@@ -138,7 +138,8 @@ class PrometheusAPI:
# disable_ssl – (bool) if True, will skip prometheus server's http requests' SSL certificate
def __init__(self, url: str, namespace: str):
self.namespace = namespace
self.port = 32000
self.output_threads = []
self.port = self.find_free_port()
self.port_forward_process = None
self.stop_event = threading.Event()
self.start_port_forward()
@@ -151,6 +152,13 @@ def __init__(self, url: str, namespace: str):
def is_port_in_use(self, port):
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
return s.connect_ex(("127.0.0.1", port)) == 0

def find_free_port(self, start=32000, end=32100):
for port in range(start, end):
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
if s.connect_ex(("127.0.0.1", port)) != 0:
return port
raise RuntimeError("No free ports available in the range.")

def print_output(self, stream):
"""Thread function to print output from a subprocess stream non-blockingly."""
@@ -197,6 +205,7 @@ def start_port_forward(self):
)
thread_out.start()
thread_err.start()
self.output_threads.extend([thread_out, thread_err])

time.sleep(3) # Wait a bit for the port-forward to establish

@@ -209,13 +218,29 @@ def start_port_forward(self):
print("Failed to establish port forwarding after multiple attempts.")

def stop_port_forward(self):
"""Stops the kubectl port-forward command."""
"""Stops the kubectl port-forward command and cleans up resources."""
if self.port_forward_process:
self.port_forward_process.terminate()
self.port_forward_process.wait()
try:
self.port_forward_process.wait(timeout=5)
except subprocess.TimeoutExpired:
print("Port-forward process did not terminate in time, killing...")
self.port_forward_process.kill()

self.stop_event.set()

if self.port_forward_process.stdout:
self.port_forward_process.stdout.close()
if self.port_forward_process.stderr:
self.port_forward_process.stderr.close()

print("Port forwarding stopped.")

for thread in self.output_threads:
thread.join(timeout=5)
if thread.is_alive():
print(f"Warning: Thread {thread.name} did not terminate cleanly.")

def cleanup(self):
"""Cleanup resources like port-forwarding."""
self.stop_port_forward()