Skip to content

RFC: Clarify WARNING_QUIT semantics and improve fatal-exit diagnostics #7472

Description

@Growl1234

Background

This RFC follows the discussion in #7413.

That PR proposed two behavioral changes to WARNING_QUIT:

  1. changing the banner from NOTICE to ERROR;
  2. avoiding TIME STATISTICS after the terminating message.

The review raised the following arguments:

  • a WARNING_QUIT condition is not always an error;
  • the word ERROR may make users feel that the failure was caused by their own operation;
  • timing statistics are intentionally printed to help locate related errors;
  • the implementation should avoid unnecessary C++ syntax or abstractions.

I understand that the current behavior is intentional. However, I do not think these arguments fully justify the present semantics and diagnostics, so I would like to discuss the design separately from the original PR.

Current semantics

Despite its name and NOTICE banner, WARNING_QUIT is not a normal warning mechanism.

By default, it calls:

WARNING_QUIT(file, description, 1);

and eventually terminates the process through QUIT(1). Therefore, unlike WARNING, execution cannot continue after WARNING_QUIT. The current implementation also unconditionally invokes timer::finish() before exiting.

From the perspective of program control flow, this is an abnormal termination. Calling it an error condition does not necessarily mean that the user is at fault.

An abnormal termination may result from:

  • invalid user input;
  • an unsupported feature or build configuration;
  • an unavailable external dependency;
  • a numerical failure;
  • an invalid internal state;
  • a programming defect.

These causes assign responsibility differently, but all of them mean that the requested calculation could not be completed.

Discussion and Rationale

1. ERROR does not imply user blame

The concern that ERROR may create a negative psychological implication appears to conflate two different concepts:

  • whether the requested operation failed;
  • who is responsible for the failure.

An error message should report the first. It does not need to assign the second.

For example, the following are all errors from the program’s perspective:

This feature is not supported by this build.
A required library is unavailable.
The input value is outside the accepted range.
An internal invariant was violated.

Only some of them are caused by the user.

If ERROR is considered too accusatory, a neutral but terminating label such as ABORT, TERMINATED, or FAILED would still describe the behavior more accurately than NOTICE.

NOTICE normally suggests informational or non-fatal output. In WARNING_QUIT, however, the program immediately exits with a non-zero status. The banner and the control-flow semantics therefore communicate different things.

2. “This is not always an error” indicates that the API is overloaded

I agree that all WARNING_QUIT call sites do not represent the same category of problem. However, that is exactly an argument for distinguishing the categories, rather than presenting all of them as NOTICE.

The function is currently used for substantially different situations, including:

  • input validation failures;
  • unsupported functionality;
  • missing compile-time features;
  • null pointers and invalid indices;
  • failed constructors;
  • impossible or unexpected internal states.

A single WARNING_QUIT interface consequently combines user-facing validation, feature availability checks, numerical failures, and internal assertions.

Possible long-term interfaces could be:

INPUT_ERROR(...)
NOT_IMPLEMENTED(...)
INTERNAL_ERROR(...)
ABORT(...)

or one fatal-reporting function with an explicit category:

ABORT(ErrorCategory::InvalidInput, ...);
ABORT(ErrorCategory::UnsupportedFeature, ...);
ABORT(ErrorCategory::InternalError, ...);

The exact API is open for discussion. The important point is that differences in responsibility should be represented explicitly, instead of using a non-fatal-looking banner for every terminating condition.

There is already some inconsistency in the present implementation: CHECK_WARNING_QUIT prints ERROR! and exits, while WARNING_QUIT prints NOTICE and exits. The source itself also comments that warning.log might eventually be replaced by error.log.

3. Timing statistics are profiling information, not causal diagnostics

Timing information answers questions such as:

  • where the program spent most of its runtime;
  • how often a timed routine was called;
  • which completed routines were relatively expensive.

However, it does not answer:

  • which statement triggered the termination;
  • which condition failed;
  • which source location emitted the message;
  • what the calling sequence was;
  • whether the failure originated in input handling, numerical code, or an internal invariant.

Therefore, timing statistics cannot directly help locate the cause of a WARNING_QUIT.

Moreover, they also have a significant presentation cost. TIME STATISTICS is printed after the terminating notice, so the final part of stdout or a CI log is dominated by a profiling table rather than the actual reason for termination. This weakens the visibility of the most actionable message.

Therefore, users are more likely to feel confused rather than helped with TIME STATISTICS printed after abort message.

4. Timing data on an abnormal path may be incomplete even if it's intentially preserved

There is also a technical limitation in using the current timing table for failure diagnosis.

QUIT() calls:

timer::finish(ofs_running, ..., false);

The false argument disables the check for timers that have not ended. timer::finish() explicitly ends only the global total timer. Other active timers remain active.

A timer accumulates the current invocation only when timer::end() is called. Consequently, when WARNING_QUIT occurs inside an active timed region, the elapsed time of that unfinished invocation is not added to cpu_second. The table may therefore omit exactly the interval in which the termination happened.

This does not make the timing table entirely useless, but it means it should not be treated as a reliable substitute for actual failure-location information.

5. Source location and a call stack would be more direct

A more useful fatal diagnostic would report at least:

Reason
Source file
Line number
Function

and, where supported, a routine call stack.

CP2K provides one example of this design. Its CPABORT macro automatically captures the source file and line number. The abort handler prints the formatted abort message, prints the routine stack, flushes the output, and then aborts the parallel program.

ABACUS would not need to adopt the complete CP2K implementation. A relatively small first step could use a macro around an implementation function:

#define ABACUS_ABORT(context, message) \
    abort_impl(context, message, __FILE__, __LINE__, __func__)

This would provide direct source-location information without requiring every caller to pass it manually. A stack trace could be added later behind platform or build-time guards.

6. [[noreturn]] describes an actual contract

The previous review also suggested avoiding [[noreturn]] to keep the C++ syntax simple.

However, [[noreturn]] is not merely stylistic decoration. It records the control-flow contract that the function never returns. This can improve compiler diagnostics and make unreachable code visible.

For example, the two-argument WARNING_QUIT currently contains MPI cleanup code after calling the terminating three-argument overload. That code is unreachable because the called overload exits the process.

In any case, this particular question now appears to be settled: the declarations of QUIT and WARNING_QUIT on the current develop branch already use [[noreturn]].

Proposed direction

I am not proposing that the original #7413 patch should be restored unchanged. A more flexible solution could be:

  1. Keep normal QUIT() behavior unchanged, including final timing statistics.
  2. Treat WARNING_QUIT as an abnormal-termination path.
  3. Always print the reason and source location for an abnormal termination.
  4. Add an optional call stack where supported.
  5. Avoid printing timing statistics to stdout after the fatal message.
  6. If timing data is considered valuable, retain it in running.log, print it before the final termination block, or make it configurable.
  7. Consider replacing NOTICE with a neutral label such as ABORT or TERMINATED if ERROR is considered too blame-oriented.
  8. In the longer term, distinguish invalid input, unsupported functionality, numerical failure, and internal errors.

One possible simple output would be:

---------------------------------------------------------
                   ABACUS TERMINATED
---------------------------------------------------------

Reason   : nspin must be 1, 2, or 4
Location : source/source_lcao/LCAO_set_st.cpp:128
Function : LCAO_domain::build_ST_new
Details  : OUT.0/warning.log

---------------------------------------------------------

Or like:

---------------------------------------------------------
                   ABACUS TERMINATED
---------------------------------------------------------

 DeePKS features require ABACUS to be built with DeePKS.

Location : source/source_lcao/LCAO_set_st.cpp:128
Function : LCAO_domain::build_ST_new
Details  : OUT.0/warning.log

---------------------------------------------------------

This communicates that the calculation stopped without implying that the user necessarily caused the problem.

Questions

  1. Should normal termination and abnormal termination share the same timing-output path?
  2. Should timing statistics on abnormal termination be printed to stdout, written only to a log, printed before the final message, or made configurable?
  3. Would ABORT or TERMINATED be preferable to both NOTICE and ERROR?
  4. Should fatal-reporting calls automatically capture file, line, and function information?
  5. Should ABACUS distinguish user-input failures, unsupported functionality, numerical failures, and internal errors?

Metadata

Metadata

Assignees

No one assigned

    Labels

    Feature DiscussedThe features will be discussed first but will not be implemented soonInput&OutputSuitable for coders without knowing too many DFT detailsRefactorRefactor ABACUS codes

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions