diff --git a/docs/decisions/0001-data-storage-format.md b/docs/decisions/0001-data-storage-format.md
new file mode 100644
index 0000000..7635baf
--- /dev/null
+++ b/docs/decisions/0001-data-storage-format.md
@@ -0,0 +1,228 @@
+---
+Status: Draft
+Date: 2026-04-17
+Authors: VikiPeeva
+Reviewers:
+Replaces:
+Superseded by:
+---
+
+# ADR-0001: Data Storage Format
+
+## Context
+
+We need to store _Local Process Models (LPMs) and their occurrence lists in an event log_. Event log sizes vary,
+but the standard logs used for research in process mining go up to 1.GB. LPMs are usually small models
+covering on average up to five activities, but tens of thousands of them can be discovered for a single event log.
+Moreover, LPMs can carry additional attributes, either computed from the occurrence lists or added as
+manual annotations.
+
+To be future-proof, a pessimistic size estimate would be the event logs processed by Celonis
+(from a few hundred GB to terabytes of data).
+
+## Decision Drivers
+
+The choice affects:
+- portability,
+- storage efficiency,
+- long-term maintenance cost,
+- operational complexity,
+- query performance (future),
+- streaming (future).
+
+The currently expected access patterns for the data are:
+- Complete read/write
+- Streamed read/write
+- Random access read/write
+- Query-based read/write
+
+There are also various ways to align LPMs to an event log, so the storage can take one of two forms:
+- The storage focuses on one variation
+- The storage includes occurrence lists for multiple variations
+![Alignment Variations](assets/lpm-to-log-alignment-options.png)
+
+## Options Considered
+
+### Option A: ZIP of models and XES
+
+Each LPM is stored in a separate file (in any process model format: pnml, bpmn, pt, etc.).
+The occurrence lists are stored directly in the event log: each event in the XES file
+has an attribute `covering-lpms` listing the ids of all LPMs that cover the event.
+We use the file name of each LPM as its id in the list.
+
+#### Example:
+$L = \langle a, b, a, c, d\rangle, \langle a, x, d\rangle$
+
+$lpm1$: a -> d
+
+$lpm2$: a -> b -> d
+
+```xml
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+```
+
+
+**Pros**
+- Uses only existing formats
+
+**Cons**
+- Complete occurrence lists need to be reconstructed
+- Additional attributes for the models cannot be stored
+- Misuses the XES standard
+
+**Alternatives**
+
+To store additional attributes, a new format for the models can be introduced, or an additional JSON file
+can be used solely for storing those attributes.
+
+### Option B: Separate Full Alignments File in Human-Readable Format (e.g., JSON)
+
+For each set of models for which an occurrence list is computed, a report is generated that includes:
+- LPM set info,
+- event log info,
+- aligned traces per LPM
+
+#### Example:
+$L = \langle a, b, a, c, d\rangle, \langle a, x, d\rangle$
+
+$lpm1$: a -> d
+
+$lpm2$: a -> b -> d
+
+```json
+{
+  "meta": {
+    "eventlog": "name-id",
+    "lpm_set": "name-id",
+    "variant": ["gapped-2", "local-4"]
+  },
+  "alignments": {
+    "lpm1": {
+      "t1": {
+        "e1": "t_a",
+        "e2": ">>",
+        "e3": "t_a",
+        "e4": ">>",
+        "e5": "t_d"
+      },
+      "t2": {
+        "e1": "t_a",
+        "e2": ">>",
+        "e3": "t_d"
+      }
+    },
+    "lpm2": {
+      "t1": {
+        "e1": "t_a",
+        "e2": "t_b",
+        "e3": ">>",
+        "e4": ">>",
+        "e5": "t_d"
+      },
+      "t2": {
+        "e1": ">>",
+        "e2": ">>",
+        "e3": ">>"
+      }
+    }
+  }
+}
+```
+
+**Pros**
+- Alignments are directly available
+
+**Cons**
+- Problematic when one event belongs to two alignments of the same LPM
+- The entire event log is duplicated for each LPM
+
+**Alternatives**
+As in Option A, each event occurs only once, but the alignment information is kept in a separate JSON file.
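The first con of Option B can be illustrated with a minimal Python sketch. It assumes the report has already been parsed into a dict shaped like the `alignments` object in the example, and that `>>` marks an event the LPM does not cover. The helper name `covered_events` is hypothetical and not part of any existing tool.

```python
# Assumption: ">>" marks an event that the LPM does not cover in this trace.
SKIP = ">>"

# In-memory copy of the "alignments" object from the Option B example.
alignments = {
    "lpm1": {
        "t1": {"e1": "t_a", "e2": ">>", "e3": "t_a", "e4": ">>", "e5": "t_d"},
        "t2": {"e1": "t_a", "e2": ">>", "e3": "t_d"},
    },
    "lpm2": {
        "t1": {"e1": "t_a", "e2": "t_b", "e3": ">>", "e4": ">>", "e5": "t_d"},
        "t2": {"e1": ">>", "e2": ">>", "e3": ">>"},
    },
}


def covered_events(alignments):
    """Return, per LPM and trace, the ids of events mapped to a transition."""
    return {
        lpm: {
            trace: [event for event, move in events.items() if move != SKIP]
            for trace, events in traces.items()
        }
        for lpm, traces in alignments.items()
    }


covered = covered_events(alignments)
print(covered["lpm1"]["t1"])  # ['e1', 'e3', 'e5']
print(covered["lpm2"]["t2"])  # []
```

For `lpm1` in trace `t1`, the mapping only shows that `e1`, `e3`, and `e5` are covered. Which of those events belong to the same occurrence of a -> d cannot be read back from this structure, because `e5` would have to appear in more than one alignment.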
+
+### Option C: Separate Occurrence List File in Human-Readable Format (e.g., JSON)
+
+For each set of models for which an occurrence list is computed, a report is generated that includes:
+- LPM set info,
+- event log info,
+- occurrence list
+
+#### Example:
+$L = \langle a, b, a, c, d\rangle, \langle a, x, d\rangle$
+
+$lpm1$: a -> d
+
+$lpm2$: a -> b -> d
+
+```json
+{
+  "meta": {
+    "eventlog": "name-id",
+    "lpm_set": "name-id",
+    "variant": ["gapped-2", "local-4"]
+  },
+  "alignments": {
+    "lpm1": {
+      "alignment1": ["t1-e1", "t1-e5"],
+      "alignment2": ["t1-e3", "t1-e5"],
+      "alignment3": ["t2-e1", "t2-e3"]
+    },
+    "lpm2": {
+      "alignment1": ["t1-e1", "t1-e2", "t1-e5"]
+    }
+  }
+}
+```
+
+**Pros**
+- Occurrence lists are directly available
+- Additional LPM attributes can be stored
+
+**Cons**
+- Events are repeated for every alignment and LPM pair they belong to
+
+## Decision
+
+
+
+## Consequences
+
+
+
+## References
+
+-
diff --git a/docs/decisions/assets/lpm-to-log-alignment-options.png b/docs/decisions/assets/lpm-to-log-alignment-options.png
new file mode 100644
index 0000000..5de857f
Binary files /dev/null and b/docs/decisions/assets/lpm-to-log-alignment-options.png differ
diff --git a/docs/decisions/template.md b/docs/decisions/template.md
new file mode 100644
index 0000000..444dc26
--- /dev/null
+++ b/docs/decisions/template.md
@@ -0,0 +1,90 @@
+---
+Status: Draft
+Date: YYYY-MM-DD
+Authors:
+Reviewers:
+Replaces:
+Superseded by:
+---
+
+# ADR-XXXX:
+
+## Context
+
+
+
+## Decision Drivers
+
+
+
+-
+-
+-
+
+## Options Considered
+
+
+
+### Option A:
+
+
+
+**Pros**
+-
+
+**Cons**
+-
+
+### Option B:
+
+
+
+**Pros**
+-
+
+**Cons**
+-
+
+### Option C:
+
+
+
+**Pros**
+-
+
+**Cons**
+-
+
+## Decision
+
+
+
+> **Decided:** Option X —
+
+## Consequences
+
+
+
+**Positive**
+-
+
+**Negative / Trade-offs**
+-
+
+**Follow-up actions**
+- [ ]
+
+## References
+
+
+
+-