Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
228 changes: 228 additions & 0 deletions docs/decisions/0001-data-storage-format.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,228 @@
---
Status: Draft
Date: 2026-04-17
Authors: VikiPeeva
Reviewers:
Replaces:
Superseded by:
---

# ADR-0001: Data Storage Format

## Context

We need to store _Local Process Models (LPMs) and their occurrence lists in an event log_. Event log sizes can vary,
but from the standard ones used for research in process mining, they go up to 1.GB. LPMs are usually small models
covering on average up to five activities, but for a single event log tens of thousands can be discovered.
Moreover, LPMs can come with additional attributes about them, either computed from the occurrence lists or maybe
manual annotations.

If one wants to be future-proof, a pessimistic estimation for size could be event logs processed by Celonis
(from a couple of hundreds GB to TB of data).

## Decision Drivers

The choice affects:
- portability,
- storage efficiency,
- long-term maintenance cost,
- operational complexity,
- query performance (future),
- streaming (future).

Expected access patterns for the data currently are:
- Complete read/write
- Streamed read/write
- Random access read/write
- Query-based read/write

There also exist various ways one can align LPMs to an event log:
- The storage focuses on one variation
- The storage includes occurrence lists for multiple variations
![Alignment Variations](assets/lpm-to-log-alignment-options.png)

## Options Considered

### Option A: ZIP of models and XES

Each LPM is stored in a separate file (the format can be any of the process model formats, pnml, bpmn,
pt, etc.). The occurrence list is stored directly in the event log such that for each event in the XES
there is an attribute `covering-lpms` where all LPM ids of the LPMs covering the event are listed. We use the file
name of each LPM as the id to denote it in the list.

#### Example:
$L = \langle a, b, a, c, d\rangle, \langle a, x, d\rangle$

$lpm1$: a -> d

$lpm2$: a -> b -> d

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<log xes.version="1.0" xes.features="nested-attributes" openxes.version="1.0RC7">
<trace>
<event>
<string key="concept:name" value="a"/>
<list key="covering-lpms" value="[lpm1, lpm2]"/>
</event>
<event>
<string key="concept:name" value="b"/>
<list key="covering-lpms" value="[lpm2]"/>
</event>
<event>
<string key="concept:name" value="a"/>
<list key="covering-lpms" value="[lpm1]"/>
</event>
<event>
<string key="concept:name" value="c"/>
<list key="covering-lpms" value="[]"/>
</event>
<event>
<string key="concept:name" value="d"/>
<list key="covering-lpms" value="[lpm1, lpm2]"/>
</event>
</trace>
<trace>
<event>
<string key="concept:name" value="a"/>
<list key="covering-lpms" value="[lpm1]"/>
</event>
<event>
<string key="concept:name" value="d"/>
<list key="covering-lpms" value="[lpm1]"/>
</event>
</trace>
</log>

```


**Pros**
- Using only existing formats

**Cons**
- Complete occurrence lists need to be reconstructed
- Additional attributes for the models cannot be stored
- Misusing the xes standard

**Alternatives**

To be able to store additional attributes, a new format for the models can be introduced or additional json file
can be used only for storing the additional attributes.

### Option B: Separate Full Alignments File in Human-Readable Format (e.g., JSON)

For each set of models for which an occurrence list is computed, a report is generated that includes:
- lpm set info,
- event log info,
- aligned traces per LPM

#### Example:
$L = \langle a, b, a, c, d\rangle, \langle a, x, d\rangle$

$lpm1$: a -> d

$lpm2$: a -> b -> d

```json
{
"meta": {
"eventlog": "name-id",
"lpm_set": "name-id",
"variant": ["gapped-2", "local-4"]
},
"alignments": {
"lpm1": {
"t1": {
"e1": "t_a",
"e2": ">>",
"e3": "t_a",
"e4": ">>",
"e5": "t_d"
},
"t2": {
"e1": "t_a",
"e2": ">>",
"e3": "t_d"
}
},
"lpm2": {
"t1": {
"e1": "t_a",
"e2": "t_b",
"e3": ">>",
"e4": ">>",
"e5": "t_d"
},
"t2": {
"e1": ">>",
"e2": ">>",
"e3": ">>"
}
}
}
}
```

**Pros**
- Alignments are directly available

**Cons**
- Problematic when one event is in two alignments of the same LPM
- The entire event log is duplicated for each LPM

**Alternatives**
Same as with Option A, where each event occurs once, but in a separate alignment file in JSON format.

### Option C: Separate Occurrence List File in Human-Readable Format (e.g., JSON)

For each set of models for which an occurrence list is computed, a report is generated that includes:
- lpm set info,
- event log info,
- occurrence list

#### Example:
$L = \langle a, b, a, c, d\rangle, \langle a, x, d\rangle$

$lpm1$: a -> d

$lpm2$: a -> b -> d

```json
{
"meta": {
"eventlog": "name-id",
"lpm_set": "name-id",
"variant": ["gapped-2", "local-4"]
},
"alignments": {
"lpm1": {
"alignment1": ["t1-e1", "t1-e5"],
"alignment2": ["t1-e3", "t1-e5"],
"alignment3": ["t2-e1", "t2-e3"]
},
"lpm2": {
"alignment1": ["t1-e1", "t1-e2", "t1-e5"]
}
}
}
```

**Pros**
- Occurrence lists are directly available
- Additional LPM attributes can be stored

**Cons**
- Events are multiplied for each alignment and LPM pair

## Decision

<!-- To be filled in once review is complete. -->

## Consequences

<!-- To be filled in once review is complete. -->

## References

-
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
90 changes: 90 additions & 0 deletions docs/decisions/template.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
---
Status: Draft
Date: YYYY-MM-DD
Authors: <!-- @handle, @handle -->
Reviewers: <!-- @handle, @handle -->
Replaces: <!-- ADR-XXXX, or "–" -->
Superseded by: <!-- ADR-XXXX once this is replaced, or leave blank -->
---

# ADR-XXXX: <Short Title>

## Context

<!-- What is the situation that requires a decision?
Describe the problem, constraints, and relevant background.
Be specific: volume of data, access patterns, team size, timelines — whatever
is relevant. This section should be understandable without prior context. -->

## Decision Drivers

<!-- What matters most when evaluating options? List 3–6 criteria.
Examples: query performance, operational complexity, cost, portability,
team familiarity, licensing. -->

-
-
-

## Options Considered

<!-- One sub-section per option. Use the same structure for each so they are
easy to compare. Add a rough complexity or cost indicator if relevant. -->

### Option A: <Name>

<!-- One-line description -->

**Pros**
-

**Cons**
-

### Option B: <Name>

<!-- One-line description -->

**Pros**
-

**Cons**
-

### Option C: <Name>

<!-- One-line description -->

**Pros**
-

**Cons**
-

## Decision

<!-- Leave blank while Status is Draft or In Review.
State what was decided and, critically, why this option over the others.
Reference the decision drivers above. -->

> **Decided:** Option X — <one sentence rationale>

## Consequences

<!-- What becomes easier or harder as a result of this decision?
Include follow-up actions or open questions if any. -->

**Positive**
-

**Negative / Trade-offs**
-

**Follow-up actions**
- [ ]

## References

<!-- Links to benchmarks, tickets, prior discussions, external documentation. -->

-