RFC: Clarify `WARNING_QUIT` semantics and improve fatal-exit diagnostics

## Background

This RFC follows the discussion in #7413.

That PR proposed two behavioral changes to `WARNING_QUIT`:
1. changing the banner from `NOTICE` to `ERROR`;
2. avoiding `TIME STATISTICS` after the terminating message.

The review raised the following arguments:
* a `WARNING_QUIT` condition is not always an error;
* the word `ERROR` may make users feel that the failure was caused by their own operation;
* timing statistics are intentionally printed to help locate related errors;
* the implementation should avoid unnecessary C++ syntax or abstractions.

I understand that the current behavior is intentional. However, I do not think these arguments fully justify the present semantics and diagnostics, so I would like to discuss the design separately from the original PR.

## Current semantics

Despite its name and `NOTICE` banner, `WARNING_QUIT` is not a normal warning mechanism.

By default, it calls:

```cpp
WARNING_QUIT(file, description, 1);
```

and eventually terminates the process through `QUIT(1)`. Therefore, unlike `WARNING`, execution cannot continue after `WARNING_QUIT`. The current implementation also unconditionally invokes `timer::finish()` before exiting.  

From the perspective of program control flow, this is an abnormal termination. Calling it an error condition does not necessarily mean that the user is at fault.

An abnormal termination may result from:
* invalid user input;
* an unsupported feature or build configuration;
* an unavailable external dependency;
* a numerical failure;
* an invalid internal state;
* a programming defect.

These causes assign responsibility differently, but all of them mean that the requested calculation could not be completed.

## Discussion and Rationale

### 1. `ERROR` does not imply user blame

The concern that `ERROR` may create a negative psychological implication appears to conflate two different concepts:
* whether the requested operation failed;
* who is responsible for the failure.

An error message should report the first. It does not need to assign the second.

For example, the following are all errors from the program’s perspective:

```text
This feature is not supported by this build.
A required library is unavailable.
The input value is outside the accepted range.
An internal invariant was violated.
```

Only some of them are caused by the user.

If `ERROR` is considered too accusatory, a neutral but terminating label such as `ABORT`, `TERMINATED`, or `FAILED` would still describe the behavior more accurately than `NOTICE`.

`NOTICE` normally suggests informational or non-fatal output. In `WARNING_QUIT`, however, the program immediately exits with a non-zero status. The banner and the control-flow semantics therefore communicate different things.

### 2. “This is not always an error” indicates that the API is overloaded

I agree that all `WARNING_QUIT` call sites do not represent the same category of problem. However, that is exactly an argument for distinguishing the categories, rather than presenting all of them as `NOTICE`.

The function is currently used for substantially different situations, including:
* input validation failures;
* unsupported functionality;
* missing compile-time features;
* null pointers and invalid indices;
* failed constructors;
* impossible or unexpected internal states.

A single `WARNING_QUIT` interface consequently combines user-facing validation, feature availability checks, numerical failures, and internal assertions.

Possible long-term interfaces could be:

```cpp
INPUT_ERROR(...)
NOT_IMPLEMENTED(...)
INTERNAL_ERROR(...)
ABORT(...)
```

or one fatal-reporting function with an explicit category:

```cpp
ABORT(ErrorCategory::InvalidInput, ...);
ABORT(ErrorCategory::UnsupportedFeature, ...);
ABORT(ErrorCategory::InternalError, ...);
```

The exact API is open for discussion. The important point is that differences in responsibility should be represented explicitly, instead of using a non-fatal-looking banner for every terminating condition.

There is already some inconsistency in the present implementation: `CHECK_WARNING_QUIT` prints `ERROR!` and exits, while `WARNING_QUIT` prints `NOTICE` and exits. The source itself also comments that `warning.log` might eventually be replaced by `error.log`. 

### 3. Timing statistics are profiling information, not causal diagnostics

Timing information answers questions such as:
* where the program spent most of its runtime;
* how often a timed routine was called;
* which completed routines were relatively expensive.

However, it does not answer:
* which statement triggered the termination;
* which condition failed;
* which source location emitted the message;
* what the calling sequence was;
* whether the failure originated in input handling, numerical code, or an internal invariant.

Therefore, timing statistics cannot directly help locate the cause of a `WARNING_QUIT`.

Moreover, they also have a significant presentation cost. `TIME STATISTICS` is printed after the terminating notice, so the final part of stdout or a CI log is dominated by a profiling table rather than the actual reason for termination. This weakens the visibility of the most actionable message.

Therefore, users are more likely to feel confused rather than helped with `TIME STATISTICS` printed after abort message.

### 4. Timing data on an abnormal path may be incomplete even if it's intentially preserved

There is also a technical limitation in using the current timing table for failure diagnosis.

`QUIT()` calls:

```cpp
timer::finish(ofs_running, ..., false);
```

The `false` argument disables the check for timers that have not ended. `timer::finish()` explicitly ends only the global `total` timer. Other active timers remain active.

A timer accumulates the current invocation only when `timer::end()` is called. Consequently, when `WARNING_QUIT` occurs inside an active timed region, the elapsed time of that unfinished invocation is not added to `cpu_second`. The table may therefore omit exactly the interval in which the termination happened.   

This does not make the timing table entirely useless, but it means it should not be treated as a reliable substitute for actual failure-location information.

### 5. Source location and a call stack would be more direct

A more useful fatal diagnostic would report at least:

```text
Reason
Source file
Line number
Function
```

and, where supported, a routine call stack.

CP2K provides one example of this design. Its `CPABORT` macro automatically captures the source file and line number. The abort handler prints the formatted abort message, prints the routine stack, flushes the output, and then aborts the parallel program.  

ABACUS would not need to adopt the complete CP2K implementation. A relatively small first step could use a macro around an implementation function:

```cpp
#define ABACUS_ABORT(context, message) \
    abort_impl(context, message, __FILE__, __LINE__, __func__)
```

This would provide direct source-location information without requiring every caller to pass it manually. A stack trace could be added later behind platform or build-time guards.

### 6. `[[noreturn]]` describes an actual contract

The previous review also suggested avoiding `[[noreturn]]` to keep the C++ syntax simple.

However, `[[noreturn]]` is not merely stylistic decoration. It records the control-flow contract that the function never returns. This can improve compiler diagnostics and make unreachable code visible.

For example, the two-argument `WARNING_QUIT` currently contains MPI cleanup code after calling the terminating three-argument overload. That code is unreachable because the called overload exits the process. 

In any case, this particular question now appears to be settled: the declarations of `QUIT` and `WARNING_QUIT` on the current `develop` branch already use `[[noreturn]]`. 

## Proposed direction

I am not proposing that the original #7413 patch should be restored unchanged. A more flexible solution could be:

1. Keep normal `QUIT()` behavior unchanged, including final timing statistics.
2. Treat `WARNING_QUIT` as an abnormal-termination path.
3. Always print the reason and source location for an abnormal termination.
4. Add an optional call stack where supported.
5. Avoid printing timing statistics to stdout after the fatal message.
6. If timing data is considered valuable, retain it in `running.log`, print it before the final termination block, or make it configurable.
7. Consider replacing `NOTICE` with a neutral label such as `ABORT` or `TERMINATED` if `ERROR` is considered too blame-oriented.
8. In the longer term, distinguish invalid input, unsupported functionality, numerical failure, and internal errors.

One possible simple output would be:

```text
---------------------------------------------------------
                   ABACUS TERMINATED
---------------------------------------------------------

Reason   : nspin must be 1, 2, or 4
Location : source/source_lcao/LCAO_set_st.cpp:128
Function : LCAO_domain::build_ST_new
Details  : OUT.0/warning.log

---------------------------------------------------------
```
Or like:
```text
---------------------------------------------------------
                   ABACUS TERMINATED
---------------------------------------------------------

 DeePKS features require ABACUS to be built with DeePKS.

Location : source/source_lcao/LCAO_set_st.cpp:128
Function : LCAO_domain::build_ST_new
Details  : OUT.0/warning.log

---------------------------------------------------------
```
This communicates that the calculation stopped without implying that the user necessarily caused the problem.

## Questions

1. Should normal termination and abnormal termination share the same timing-output path?
2. Should timing statistics on abnormal termination be printed to stdout, written only to a log, printed before the final message, or made configurable?
3. Would `ABORT` or `TERMINATED` be preferable to both `NOTICE` and `ERROR`?
4. Should fatal-reporting calls automatically capture file, line, and function information?
5. Should ABACUS distinguish user-input failures, unsupported functionality, numerical failures, and internal errors?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: Clarify `WARNING_QUIT` semantics and improve fatal-exit diagnostics #7472

Background

Current semantics

Discussion and Rationale

1. `ERROR` does not imply user blame

2. “This is not always an error” indicates that the API is overloaded

3. Timing statistics are profiling information, not causal diagnostics

4. Timing data on an abnormal path may be incomplete even if it's intentially preserved

5. Source location and a call stack would be more direct

6. `[[noreturn]]` describes an actual contract

Proposed direction

Questions

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

RFC: Clarify WARNING_QUIT semantics and improve fatal-exit diagnostics #7472

Description

Background

Current semantics

Discussion and Rationale

1. ERROR does not imply user blame

2. “This is not always an error” indicates that the API is overloaded

3. Timing statistics are profiling information, not causal diagnostics

4. Timing data on an abnormal path may be incomplete even if it's intentially preserved

5. Source location and a call stack would be more direct

6. [[noreturn]] describes an actual contract

Proposed direction

Questions

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

RFC: Clarify `WARNING_QUIT` semantics and improve fatal-exit diagnostics #7472

1. `ERROR` does not imply user blame

6. `[[noreturn]]` describes an actual contract