Dictionary file by nicholasfrancia · Pull Request #17 · COMCIFS/Structure_Prediction_Dictionary

nicholasfrancia · 2026-04-24T17:10:03Z

A "Step" class has been added to describe intermediate steps in a CSP workflow, such as conformer generation and
structure similarity calculations.
New data fields describing Conformer Generation methods for organic CSP as an intermediate step.
New data fields describing Structure Similarity methods, used both as a clustering method (to remove duplicates) and to
identify matches with experimental structures.
New data fields in _theoretical_structure to show matches with experimental structures or to specify the cluster center
if the structure has been labeled as a duplicate.
New "Output" class to describe general outputs of a CSP workflow, such as computational cost or global minima in each
step.
New .dic files for the CSP and Computational chemistry dictionaries.
New data fields for ML Potentials.
Markdown files updated with relevant examples.
The "Category" column in the descriptive files has been removed and names converted to their actual use in the .dic
files:

Old version:

| Group | Category   | Data Field |
|-------|------------|------------|
| CSP   | Data Block | class      |

New version:

| Group | Data Field       |
|-------|------------------|
| _csp  | data_block_class |

"Definition", "Constraints" and "Examples" columns changed to improve clarity.
Descriptive dictionaries (Markdown files) in Dictionaries/DFT and Dictionaries/Forcefields no longer duplicate any
data field defined in Dictionaries/CSPCore/cspcore.md.
A few old labels have been changed for consistency across the dictionary.
The monomer correction data field in the "Additional Proposed Fields" section in Dictionaries/DFT/dft.md has been split
to describe functional, basis set and dispersion correction separately.

IsaacSugden · 2026-04-27T09:18:22Z

+| `_csp`      | `structure_ranking_data_block_id`               | char       | To be used in the "Workflow" datablock, it is a short, human-readable description of the ranking method.                                                                                                                                  |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |                                     |                |
+| `_csp`      | `structure_ranking_relative_energy_cutoff`      | numb       | To be used in the "Workflow" datablock, in case of multi-step ranking of structures, it specifies the structures that will be used in the next step by selecting those below a certain relative energy difference from the global minima. | \>=0.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | kJ mol<sup>-1</sup>                 | 10.0           |
+| `_csp`      | `structure_ranking_max_structures_retained`     | numb       | To be used in the "Workflow" datablock, in case of multi-step ranking of structures, it specifies the structures that will be used in the next step by selecting the top N structures in the ranking.                                     | \>=0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |                                     | 250            |
+| `_compchem` | `method`                                        | char       | The energy or scoring model used to rank structures.                                                                                                                                                                                      | - "pDFT" (Sec. 4.2): Ranking methods based on periodic Density Functional Theory methods. <br>- "Forcefield" (Sec. 4.3): Ranking methods based on classical forcefield. <br>- "Semi Empirical" (Sec. 4.4): Ranking methods based on semi empirical approaches, <br>- "Wavefunction" (Sec. 4.5): Ranking methods based on wavefunction approaches. <br>- "ML Potentials" (Sec. 4.6): Ranking methods based on Ml potentials. Forcefields parametrised with ML, should be under the "Forcefield" category. <br>- Other: Any other method. |                                     | Forcefield     |


Will ML potentials be in Forcefields?

No! In ML interatomic potentials such as MACE or UMA, energies and related properties are obtained directly from the neural network model. On the other hand, some approaches (Espaloma is an example) use neural networks to optimise equilibrium values (bond distances, angles or torsions...) and related force constants of a classical force field. Although the parameters are learned using a NN rather than fitted to QM calculations directly, the resulting model still has the form of a classical forcefield. All data fields in the forcefield section could then be used in these cases. (I agree the explanation in here is not very clear!)
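As an aside on the two `structure_ranking_*` filters in the rows quoted above, here is a minimal Python sketch of how a relative energy cutoff and a top-N limit could be applied between ranking steps. The function name, the input mapping and the choice to treat the cutoff as inclusive are assumptions made for illustration; the dictionary text does not prescribe an implementation, nor whether both filters are combined.

```python
def select_for_next_step(energies, relative_energy_cutoff=None,
                         max_structures_retained=None):
    """Return structure ids to carry into the next ranking step.

    energies: mapping of structure id -> energy in kJ/mol (hypothetical input).
    relative_energy_cutoff: keep structures within this many kJ/mol of the
        lowest-energy structure (treated as inclusive here, an assumption).
    max_structures_retained: keep at most this many of the best-ranked structures.
    """
    if not energies:
        return []
    global_minimum = min(energies.values())
    # Rank from lowest to highest energy.
    ranked = sorted(energies, key=energies.get)
    if relative_energy_cutoff is not None:
        ranked = [s for s in ranked
                  if energies[s] - global_minimum <= relative_energy_cutoff]
    if max_structures_retained is not None:
        ranked = ranked[:max_structures_retained]
    return ranked


# Made-up energies, for illustration only.
ENERGIES = {"s1": -100.0, "s2": -95.0, "s3": -88.0, "s4": -99.0}
print(select_for_next_step(ENERGIES, relative_energy_cutoff=10.0))
print(select_for_next_step(ENERGIES, relative_energy_cutoff=10.0,
                           max_structures_retained=2))
```

With the example values, "s3" lies 12 kJ/mol above the minimum and is dropped by the 10.0 cutoff; the top-N limit then trims the remaining list further.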

vaitkus · 2026-04-27T12:14:01Z

@nicholasfrancia I have a few more minor technical suggestions about the dictionaries, but would not want to detract from the main discussion at hand. Would these be useful here or would it be better to wait until it is merged and open a separate issue/PR?

nicholasfrancia · 2026-04-27T12:28:31Z

@vaitkus go ahead in here! If there are fundamental changes to the dictionary, we can open a separate PR later (this is my fault that I've included many changes in a single PR)

vaitkus

Ok, I marked several general classes of issues since it will probably be more efficient to detect and fix them in bulk on your side than for me to mark all of them one-by-one. I only looked at cif_csp.dic so similar issues might also be present in cif_compchem.dic. Once these suggestions are addressed (or rejected), I can take another quick look to see whether any new issues have revealed themselves.

In the future, such checks could be run by GitHub actions (as is done on other dictionary repos).

vaitkus · 2026-04-27T12:31:04Z

+
+    _dictionary.title             CIF_CSP
+    _dictionary.class             Instance
+    _dictionary.version           0.1


The version number must have 3 components to adhere to SemVer, so either 0.0.1 or 0.1.0 in this case.
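A check of this kind is easy to script. The sketch below is a simplified test for the three-component form only; it ignores the optional pre-release and build suffixes that full SemVer allows, and the function name is an assumption.

```python
import re

# MAJOR.MINOR.PATCH: three dot-separated non-negative integers without
# leading zeros. Pre-release/build suffixes are deliberately not handled.
SEMVER_CORE = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def is_three_part_version(version: str) -> bool:
    """Return True if `version` has exactly three numeric components."""
    return SEMVER_CORE.fullmatch(version) is not None


for candidate in ("0.1", "0.0.1", "0.1.0"):
    print(candidate, is_three_part_version(candidate))
```

Here "0.1" is rejected while "0.0.1" and "0.1.0" are accepted.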

I have a script that takes details from a CSV file and creates the 2 .dic files, so yes, most of these issues can be fixed in bulk. Thank you again, Antanas, this is super helpful!

vaitkus · 2026-04-27T13:08:38Z

+
+    _definition.id                CSP
+    _definition.scope             Category
+    _definition.class             Loop


The category is declared as looped, but the data items that constitute the primary loop key (data items that are collectively unique within a CIF loop row) are not provided. This is done by specifying one or more such names as values of the _category_key.name attribute, e.g.

    loop_
      _category_key.name
         '_csp_key_item_1'
         '_csp_key_item_2'
         '_csp_key_item_3'
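Since this class of issue is meant to be fixed in bulk, a rough Python sketch of a detector may help. It splits a dictionary into save frames with a naive line scan (not a real CIF parser, so it would be confused by `save_` inside multi-line text fields) and reports Loop categories that never mention `_category_key.name`. The function names and the key item `_theoretical_structure.id` in the sample are hypothetical.

```python
import re


def split_save_frames(dic_text):
    """Naively map each save frame name to the text between save_NAME and save_."""
    frames, name, lines = {}, None, []
    for line in dic_text.splitlines():
        stripped = line.strip()
        if name is None:
            if stripped.lower().startswith("save_") and len(stripped) > 5:
                name, lines = stripped[5:], []
        elif stripped.lower() == "save_":
            frames[name] = "\n".join(lines)
            name = None
        else:
            lines.append(line)
    return frames


def loop_categories_without_key(dic_text):
    """List looped category frames that do not declare any category key."""
    missing = []
    for name, body in split_save_frames(dic_text).items():
        is_category = re.search(r"^[ \t]*_definition\.scope[ \t]+Category\b", body, re.M)
        is_loop = re.search(r"^[ \t]*_definition\.class[ \t]+Loop\b", body, re.M)
        has_key = re.search(r"_category_key\.name\b", body)
        if is_category and is_loop and not has_key:
            missing.append(name)
    return missing


SAMPLE = """
save_CSP
    _definition.id                CSP
    _definition.scope             Category
    _definition.class             Loop
save_

save_THEORETICAL_STRUCTURE
    _definition.id                THEORETICAL_STRUCTURE
    _definition.scope             Category
    _definition.class             Loop
    loop_
      _category_key.name
         '_theoretical_structure.id'
save_
"""

print(loop_categories_without_key(SAMPLE))
```

In the sample only the CSP frame is reported, because the second frame already lists a key.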

vaitkus

Thank you for the changes to the dictionary.

I marked a few more remaining issues (mostly with incorrect specification of CIF 2.0 lists). Also, while this is not strictly mandated, we tend to assign units to all numeric data items (i.e. Integer, Real, etc.) as it helps later on in automated checks. Note the two special units that may be relevant in such cases:

none (dimensionless - e.g. a ratio, factor, weight or scale)
unspecified (no units supplied - the correct units should be derived based on the context)
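An automated check for this convention could look roughly like the Python sketch below. It assumes the definitions have already been parsed into plain mappings of attribute name to value, with the content type under `_type.contents` and the unit under `_units.code`; the item names in the example are hypothetical, and only Integer and Real are treated as numeric.

```python
# Content types treated as numeric in this sketch; the "etc." in the comment
# above is not expanded here.
NUMERIC_CONTENTS = {"integer", "real"}


def numeric_items_without_units(definitions):
    """Return the names of numeric data items that carry no units code.

    definitions: mapping of data name -> {attribute name: value}.
    """
    missing = []
    for data_name, attributes in definitions.items():
        contents = attributes.get("_type.contents", "").lower()
        if contents in NUMERIC_CONTENTS and "_units.code" not in attributes:
            missing.append(data_name)
    return missing


# Hypothetical, already-parsed definitions.
DEFINITIONS = {
    "_example.count": {"_type.contents": "Integer"},
    "_example.scale_factor": {"_type.contents": "Real", "_units.code": "none"},
    "_example.label": {"_type.contents": "Text"},
}

print(numeric_items_without_units(DEFINITIONS))
```

Only `_example.count` is flagged: the scale factor is dimensionless but says so explicitly with `none`, and the text item is not numeric.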

vaitkus · 2026-05-07T19:07:58Z

+    _name.object_id               'data_block_additional_files'
+    _type.purpose                 Describe
+    _type.source                  Assigned
+    _type.container               List


The type container for several items was recently changed from Table to List, however, the list dimension was not added. For Lists of unspecified/unlimited size it is still recommended to explicitly declare them as such, i.e.:

_type.dimension '[]'
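Because the container change was applied to several items at once, the missing attribute could also be added in bulk. The Python sketch below works on the raw text of a single definition and inserts the unbounded dimension directly after the container line; the function names are assumptions, and the regular expressions only cover the simple one-attribute-per-line layout shown in the quoted hunk.

```python
import re

CONTAINER_LINE = re.compile(r"^([ \t]*)_type\.container[ \t]+List[ \t]*$", re.M)
DIMENSION_LINE = re.compile(r"^[ \t]*_type\.dimension[ \t]+\S+", re.M)


def needs_dimension(frame_text):
    """True if the definition is a List but declares no _type.dimension."""
    return bool(CONTAINER_LINE.search(frame_text)) and not DIMENSION_LINE.search(frame_text)


def add_unbounded_dimension(frame_text):
    """Insert `_type.dimension '[]'` after the container line, keeping its indentation."""
    if not needs_dimension(frame_text):
        return frame_text

    def with_dimension(match):
        indent = match.group(1)
        return f"{match.group(0)}\n{indent}_type.dimension               '[]'"

    return CONTAINER_LINE.sub(with_dimension, frame_text, count=1)


FRAME = """
    _name.object_id               'data_block_additional_files'
    _type.purpose                 Describe
    _type.source                  Assigned
    _type.container               List
"""

print(add_unbounded_dimension(FRAME))
```

Running the function a second time on its own output leaves the text unchanged, so it is safe to apply to definitions that were already fixed.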

vaitkus · 2026-05-07T19:13:18Z

+Indicates whether 1-4 interactions (Lennard-Jones and electrostatic) are scaled
+;
+    _name.category_id             'forcefield'
+    _name.object_id               '1_4_pairs'


I just wanted to note that the dash is only invalid in the _name.object_id attribute, but it is OK to include it in the save frame name or the _definition.id if you wish (e.g. as is done for the _space_group_name_H-M_alt data item in the coreCIF dictionary). Of course, there is no need to do anything if you are happy with the underscore.

vaitkus · 2026-05-07T19:24:11Z

+
+    _definition.id                THEORETICAL_STRUCTURE
+    _definition.scope             Category
+    _definition.class             Loop


DDLm dictionaries allow two types of categories: Set and Loop. Data items that belong to the Set categories are allowed to appear only once per data block. On the other hand, items in the Loop categories can be looped together and have more than one associated value. A typical example of a looped category is ATOM_SITE:

    loop_
    _atom_site_label
    _atom_site_occupancy
    _atom_site_fract_x
    _atom_site_fract_y
    _atom_site_fract_z
    Co1 0.25 0.25(4) 0.75 0.25
    N1 1.00 0.1849(8) 0.6787(8)
    C1 1.00 0.140(1) 0.715(1) 0.093(1)
    # ...

while CELL is a typical Set category:

    _cell_length_a 15.798(5)
    _cell_length_b 15.798(5)
    _cell_length_c 15.674(7)

I see that in the two dictionaries all categories are defined as looped. Is that intentional?
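To make the practical consequence of the Set/Loop distinction concrete, here is a small Python sketch that flags Set-category items carrying more than one value in a data block. The prefix table, the function names and the pre-parsed `data_block` mapping are assumptions for illustration; a real validator would read the category class from the dictionary itself.

```python
# Category class by data-name prefix, following the two examples above.
CATEGORY_CLASS = {"_atom_site_": "Loop", "_cell_": "Set"}


def category_class_of(data_name):
    """Return "Set" or "Loop" for a data name, or None if the category is unknown."""
    prefixes = [p for p in CATEGORY_CLASS if data_name.startswith(p)]
    if not prefixes:
        return None
    # Prefer the longest matching prefix if several categories match.
    return CATEGORY_CLASS[max(prefixes, key=len)]


def set_items_with_multiple_values(data_block):
    """List Set-category items that appear with more than one value.

    data_block: mapping of data name -> list of values found in the block.
    """
    return sorted(
        name for name, values in data_block.items()
        if category_class_of(name) == "Set" and len(values) > 1
    )


# Hypothetical parsed block: the second cell length is deliberately wrong.
DATA_BLOCK = {
    "_atom_site_label": ["Co1", "N1", "C1"],
    "_cell_length_a": ["15.798(5)"],
    "_cell_length_b": ["15.798(5)", "15.674(7)"],
}

print(set_items_with_multiple_values(DATA_BLOCK))
```

Looped items such as `_atom_site_label` pass with any number of values, whereas `_cell_length_b` is reported because a Set item may appear only once per data block.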

nicholasfrancia added 5 commits March 26, 2026 15:30

Dictionary files

5d4f815

Dic files and minor corrections

48240cd

Dic files and minor corrections

9610bcd

Conformers generation and clustering + updated .dic files

d012f75

Conformers generation and clustering + updated .dic files

148be91

nicholasfrancia requested review from IsaacSugden, ijbruno and lhunnisett-ccdc April 24, 2026 17:10

nicholasfrancia self-assigned this Apr 24, 2026

Merge branch 'main' into dictionary_file

54be4dc

IsaacSugden reviewed Apr 27, 2026

Comment thread Dictionaries/CSPCore/cspcore.md Outdated

IsaacSugden approved these changes Apr 27, 2026

vaitkus reviewed Apr 27, 2026

Comment thread cif_csp.dic Outdated

nicholasfrancia added 2 commits April 30, 2026 10:59

Corrections to dict files

b258c19

Merge remote-tracking branch 'origin/dictionary_file' into dictionary_file

395095d

vaitkus reviewed Apr 30, 2026

Comment thread cif_csp.dic Outdated

Comment thread cif_csp.dic Outdated

Comment thread cif_compchem.dic

Comment thread cif_compchem.dic Outdated

Comment thread cif_compchem.dic Outdated

Comment thread cif_compchem.dic Outdated

Comment thread cif_compchem.dic Outdated

nicholasfrancia added 2 commits May 6, 2026 15:23

Corrections to dict files

5a26c43

Update CHANGELOG.md

281113a

vaitkus reviewed May 6, 2026

Comment thread cif_compchem.dic Outdated

Comment thread cif_csp.dic Outdated

nicholasfrancia added 3 commits May 7, 2026 11:51

Corrections to dict files

c1b948f

Corrections to dict files

212e863

Merge remote-tracking branch 'origin/dictionary_file' into dictionary_file

1f8d28e

vaitkus reviewed May 7, 2026

3 participants