58 changes: 24 additions & 34 deletions docs/cloud/best-practices/triage-and-response.mdx
@@ -2,7 +2,7 @@
title: "Triage & response"
---

Maintaining high data quality is much more than adding tests - Its about creating processes.
Maintaining high data quality is much more than adding tests - It's about creating processes.

The processes that will improve your data quality, reduce response times, and prevent repeating incidents have to do with:

@@ -15,7 +15,7 @@ Elementary has tools in place to help, and this guide is meant to help get as mu

## Plan the response in advance

Your response to a data incident doesnt actually start when the failure happens. An effective response starts when you add a test / monitor / dataset.
Your response to a data incident doesn't actually start when the failure happens. An effective response starts when you add a test / monitor / dataset.

For every test or monitor you add, think about the following -

@@ -55,35 +55,34 @@ Alert distribution can be configured in the [Alert rules](/cloud/features/alerts
The alerts can be distributed to different channels (within Slack / MS Teams) and to different tools (Pagerduty, Ops Genie, etc).
Elementary users usually distribute alerts by:

1. Business domain tags - In teams where each domain has their own data teams, its recommended to have a separate Slack channel for alerts on that domains models. The domain alert rules are usually defined by tags.
1. Business domain tags - In teams where each domain has their own data teams, it's recommended to have a separate Slack channel for alerts on that domain's models. The domain alert rules are usually defined by tags.
2. Responsible team - For example, if there is a problem with null values in a Salesforce source, it makes sense to send the alert straight to the Salesforce team. These alert rules can be defined by model / source name, tag or owner.
3. Criticality - The most critical alerts are usually model error alerts, and handling them is critical because it blocks the pipeline. Since those issues are sometimes time sensitive, some teams choose to send them to Pager Duty or Ops Genie, or at least a dedicated Slack channel with different notification settings.
4. Low priority alerts / warnings - We generally recommend refraining from sending Slack alerts for failures that dont have a clear response plan yet. These failures can not be sent at all, or sent to a muted channel that will operate as a feed.
4. Low priority alerts / warnings - We generally recommend refraining from sending Slack alerts for failures that don't have a clear response plan yet. These failures can not be sent at all, or sent to a muted channel that will operate as a "feed".
Such failures can be: 1. Newly configured anomaly detection tests or explicit tests where you have low certainty about the threshold / expectation. 2. Anomaly detection tests that you consider as a safety measure, not a clear failure.
This is not to say that they are not interesting - but, they can be investigated within the Elementary UI, using the incidents page, at a time of convenience. We believe alerts are an interruption to the daily schedule and such an interruption should only occur if its justified. To avoid getting such alerts, we recommend filtering your alert rules on Failure or Error statuses.
This is not to say that they are not interesting - but, they can be investigated within the Elementary UI, using the incidents page, at a time of convenience. We believe alerts are an interruption to the daily schedule and such an interruption should only occur if it's justified. To avoid getting such alerts, we recommend filtering your alert rules on "Failure" or "Error" statuses.
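
The distribution approach described above can be sketched as a small routing function. This is only an illustration of the decision order, not Elementary's alert rules API: the `Alert` fields, the status strings, the channel names, and `route_alert` are all hypothetical, and real rules are configured in the Alert rules settings.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """Hypothetical alert shape, for illustration only."""
    status: str                      # assumed values: "error", "failure", "warning"
    kind: str                        # assumed values: "model" or "test"
    tags: set = field(default_factory=set)
    owner: str = ""


# Hypothetical channel names; replace with your own.
DOMAIN_CHANNELS = {"finance": "#alerts-finance", "marketing": "#alerts-marketing"}
OWNER_CHANNELS = {"salesforce-team": "#alerts-salesforce"}


def route_alert(alert: Alert) -> str:
    """Pick one destination for an alert, most specific rule first."""
    # Warnings without a clear response plan go to a muted feed channel.
    if alert.status == "warning":
        return "#data-feed-muted"
    # Model errors block the pipeline, so they go to the paging tool.
    if alert.kind == "model" and alert.status == "error":
        return "pagerduty"
    # Responsible team: route by owner when one is mapped.
    if alert.owner in OWNER_CHANNELS:
        return OWNER_CHANNELS[alert.owner]
    # Business domain: route by the first matching domain tag.
    for tag in sorted(alert.tags):
        if tag in DOMAIN_CHANNELS:
            return DOMAIN_CHANNELS[tag]
    # Fallback channel for everything else.
    return "#data-alerts"
```

The rule order is a design choice: checking warnings first keeps low-priority failures out of every interrupting channel, regardless of tags or owner.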

## Notifying stakeholders

There are several ways to notify data consumers and stakeholders about ongoing problems.
While some customers prefer to do it personally after triaging the incidents, others prefer saving this time and going with automated notifications.
For models intended for public consumption (by BI dashboards, ML models etc) we recommend setting up [subscribers](/cloud/features/alerts-and-incidents/owners-and-subscribers#subscribers). Those subscribers will be tagged in Slack on every alert that is sent on those tables. Unlike owners, there can be many subscribers to an alert.
Tagging subscribers is of course optional, and simply adding them to the relevant channels can also suffice.
Coming soon:
As part of the data health scores release, we will be supporting a new type of alerts, that notifies a drop in the health score of an asset. This type of alert is intended for data consumers, who don’t need the details and just want a high-level notification in case the data asset shouldn’t be used. We will also support sending daily digests on all assets’ health scores.

You can also use [Incidents Digest](/cloud/features/alerts-and-incidents/incident-digest) to send scheduled summaries of incidents to broader teams, reducing noise while keeping stakeholders informed.

## Incident management

Elementary has an incident page, new failures will either create an incident or be attached to an open incident.
This page is designed to enable your team to stay on top of open incidents and collaborate on resolving them. The page gives a comprehensive overview of all current and previous incidents, where users can view the status, prioritize, assign and resolve incidents.
![Incident management dashboard](https://res.cloudinary.com/diuctyblm/image/upload/v1738149956/Docs/incident-management_up6jzx.png)
Elementary has an [Incidents page](/cloud/features/alerts-and-incidents/incidents) where failures automatically create incidents. The page gives a comprehensive overview of all current and previous incidents, organized by status (Open, Acknowledged, Resolved), where users can view, prioritize, assign, and resolve incidents.

### Incident management usability

- Each incident has 3 settings: Assignee, status and severity.
- These can be changed directly from the Slack notification.
- The severity is set to `high` for failures and `normal` for warnings. You can manually change to `critical` or `low` .
- You can select several incidents and make changes to the settings in bulk.
- Failures of the same test / model of an open incident will not open a new incident, these will be added to an ongoing incident.
- Incidents contain one or more monitors. You can [merge related incidents](/cloud/features/alerts-and-incidents/incident-merging) to group them by root cause, reducing noise.
- The incident detail drawer provides **Lineage**, **Monitors**, and **Timeline** tabs for investigation.

### Incident management best practices

@@ -98,17 +97,7 @@ This page is designed to enable your team to stay on top of open incidents and c
- Normal - Should be resolved by end of week.
- Low - Should be evaluated weekly, might trigger a change in coverage.
- If no one cares about an incident, this should impact coverage.

### Coming soon

Incidents is a beta feature, and we are working on adding functionality. The immediate roadmap includes:

- Notifications to assignees
- Mute / Snooze
- Advanced grouping of failures to incidents according to lineage (example: model failure + all downstream freshness and volume failures)
- Initiating triage from incident management (see picture)

![An interface showing initiating triage from incident management](https://res.cloudinary.com/diuctyblm/image/upload/v1738149956/Docs/triage-response-via-incident-management_acjqow.png)
- Use [auto-merge rules](/cloud/features/alerts-and-incidents/incident-merging#auto-merge-rules) to automatically group related freshness incidents, reducing the number of incidents your team needs to triage.

## Triage incidents

@@ -119,9 +108,13 @@ When triaging incidents, there are 4 steps to go through:
3. Resolution
4. Post mortem - Quality learnings from incidents is how you improve over time, and reduce the time to resolution and frequency of future incidents.

<Tip>
Use the **Investigate with AI** action on any incident to trigger the [Triage & Resolution Agent](/cloud/ai-agents/triage-resolution-agent). It can perform impact analysis, root cause analysis, and even suggest [merging related incidents](/cloud/features/alerts-and-incidents/incident-merging#ai-agent-assisted-merge) by shared root cause.
</Tip>

### Impact analysis

The goal of doing an impact analysis is to determine the severity and urgency of the incident, and understand if you need to communicate the incident to consumers (if there isnt a relevant alert rule).
The goal of doing an impact analysis is to determine the severity and urgency of the incident, and understand if you need to communicate the incident to consumers (if there isn't a relevant alert rule).
These are the questions that should be asked, and product tips on how to answer with Elementary:

- Was this a failure or just a warning?
@@ -137,12 +130,12 @@ These are the questions that should be asked, and product tips on how to answer


- How important is the data asset?
- Check in the catalog or node info section in the lineage if it has a tag like `critical` , `public` or a data product tag. You can also look at the description of the data asset, whether its a table or a column.
- Check in the catalog or node info section in the lineage if it has a tag like `critical` , `public` or a data product tag. You can also look at the description of the data asset, whether it's a table or a column.
- Does the failure impact important downstream assets? Did the issue propagate to downstream assets?

- A table might not be critical, but its upstream from a critical one, making it part of a critical path.
- A table might not be critical, but it's upstream from a critical one, making it part of a critical path.

- Check in the lineage if there are downstream important BI assets / public tables. To see the downstream assets you can navigate to the lineage directly from the test results, but clicking `view in lineage`. If the incident is a failed column test, you can filter only the downstream lineage of the specific column by clicking on `filter column`.
- Check in the lineage if there are downstream important BI assets / public tables. You can navigate to the lineage directly from the incident detail's **Lineage** tab, or from test results by clicking `view in lineage`. If the incident is a failed column test, you can filter only the downstream lineage of the specific column by clicking on `filter column`.

![Lineage filters](https://res.cloudinary.com/diuctyblm/image/upload/v1738149955/Docs/lineage-filters_ipjze3.png)

@@ -169,13 +162,10 @@ If the incident is important we need to start the investigation process, and und
- Is there a data issue at the source?
- Check in the lineage and see if there is coverage and failures on upstream tables, you can use the lineage filters to limit the scope to relevant failures (if `not_null` failed, filter on `not_null` tests).
- Check the test result sample. If you want to see more results copy the test SQL and run it in your DWH console.
- Sometimes an issue would be in a certain dimension, like a specific product event that stopped arriving or changed. Aggregate the test query by key dimensions in the table to understand if it’s relevant to specific subset of the data.
- _Coming soon - Automated post failure queries._
- Check if the test is flaky in the `test performance` screen. This usually means it’s a problem that happens frequently at the source data.
- _Coming soon - Check the metric graphs of the source tables._
- Sometimes an issue would be in a certain dimension, like a specific product event that stopped arriving or changed. Aggregate the test query by key dimensions in the table to understand if it's relevant to specific subset of the data.
- Check if the test is flaky in the `test performance` screen. This usually means it's a problem that happens frequently at the source data.
- Is it a code issue?
- Check recent PRs to the underlying monitored table.
- _Coming soon - Incident timeline with recent PRs and changes._
- Check recent PRs merged to upstream tables.
- Are there any other related failures that happen at the same time following a recent release?
- Check metrics and test results like volume of tables to see if there is a wrong join.
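
The "aggregate by key dimensions" step above can be sketched as follows. This is a minimal illustration that assumes the failed rows from the test result sample have been exported as a list of dictionaries; the helper `failures_by_dimension` and the column names are hypothetical. In practice you would more likely add a `GROUP BY` to the test SQL and run it in your DWH console.

```python
from collections import Counter


def failures_by_dimension(failed_rows, dimension):
    """Count failed rows per value of `dimension`, most frequent first."""
    counts = Counter(row.get(dimension) for row in failed_rows)
    return counts.most_common()


# Hypothetical sample of rows that failed a test.
sample = [
    {"event_name": "checkout", "platform": "ios"},
    {"event_name": "checkout", "platform": "android"},
    {"event_name": "checkout", "platform": "ios"},
    {"event_name": "signup", "platform": "ios"},
]

print(failures_by_dimension(sample, "event_name"))
# prints [('checkout', 3), ('signup', 1)]
```

If one value dominates the counts, the issue is probably limited to that subset of the data rather than the whole table.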
@@ -188,8 +178,8 @@ Learning from incidents we had is how we improve our coverage, response times an

Here are some common actions to take following an incident:

- Incident wasnt important - If the incident wasnt important or significant, remove the test or change the severity to warning.
- Incident wasn't important - If the incident wasn't important or significant, remove the test or change the severity to warning.
- It was hard to determine the severity of the incident - Make changes to the tags and descriptions of the test / asset, to make it easier next time.
- The relevant people werent notified - Make changes to owners and subscribers, and create the relevant alert rules.
- The relevant people weren't notified - Make changes to owners and subscribers, and create the relevant alert rules.
- The result sample was not helpful - Make changes to the test query, to make it easier next time.
- Reoccurring incidents at the source - For incidents that keep happening, the most productive approach is to have a conversation with your data providers, and figure out how to improve response. You can use the `test performance` page, and past incidents on the `incidents` page to communicate stats on the previous incidents.
- Reoccurring incidents at the source - For incidents that keep happening, the most productive approach is to have a conversation with your data providers, and figure out how to improve response. You can use the `test performance` page, and past incidents on the `incidents` page to communicate stats on the previous incidents.
@@ -38,9 +38,9 @@ you will be able to track all open and historical incidents, and get metrics on

- **Alerts customization** - Alerts should include relevant context for quick triage such as **owner**, **tags**, **description**. In Elementary, alerts can be customized to include this information.
- **Alert distribution rules** - Alerts should be sent to relevant recipients. By creating [Alert Rules](/cloud/features/alerts-and-incidents/alert-rules), alerts can be distributed to different channels and systems.
- **Incidents management** - When alerts are distributed to different channels, it can become hard to track what is open. Elementary offers a centralized Incidents page to monitor what is open, and manage incident properties: **assignee**, **status** and **severity**.
- **Grouping alerts to incidents** - New failures related to already open incidents will not trigger new alerts, and will be automatically added to the ongoing incident. This reduces noise and alert fatigue.
- **Automated resolution** - When there is a successful run that means an open incident is resolved, Elementary will automatically resolve the incident. This will help you manage the state of incidents and communicate it to stake holders in real time.
- **Incidents management** - When alerts are distributed to different channels, it can become hard to track what is open. Elementary offers a centralized [Incidents page](/cloud/features/alerts-and-incidents/incidents) to monitor what is open, and manage incident properties: **assignee**, **status** and **severity**.
- **[Incident merging](/cloud/features/alerts-and-incidents/incident-merging)** - Merge related incidents manually, let the AI agent suggest merges by root cause, or configure auto-merge rules to group incidents automatically. This reduces noise and helps your team focus on root causes instead of individual symptoms.
- **Automated resolution** - When all monitors in an incident pass again, Elementary automatically resolves the incident, keeping the state of incidents up to date in real time.
- **Mute test alerts** – Mute your test from the test configuration tab to run tests without triggering alerts, giving you more control over notifications while still monitoring data quality. This is useful when testing new data sets, refining thresholds, or adjusting test logic without unnecessary noise.

<video