Dictionary file#17
| | `_csp` | `structure_ranking_data_block_id` | char | To be used in the "Workflow" datablock, it is a short, human-readable description of the ranking method. | | | | | ||
| | `_csp` | `structure_ranking_relative_energy_cutoff` | numb | To be used in the "Workflow" datablock, in case of multi-step ranking of structures, it specifies the structures that will be used in the next step by selecting those below a certain relative energy difference from the global minima. | \>=0. | kJ mol<sup>-1</sup> | 10.0 | | ||
| | `_csp` | `structure_ranking_max_structures_retained` | numb | To be used in the "Workflow" datablock, in case of multi-step ranking of structures, it specifies the structures that will be used in the next step by selecting the top N structures in the ranking. | \>=0 | | 250 | | ||
| | `_compchem` | `method` | char | The energy or scoring model used to rank structures. | - "pDFT" (Sec. 4.2): Ranking methods based on periodic Density Functional Theory methods. <br>- "Forcefield" (Sec. 4.3): Ranking methods based on classical forcefield. <br>- "Semi Empirical" (Sec. 4.4): Ranking methods based on semi empirical approaches, <br>- "Wavefunction" (Sec. 4.5): Ranking methods based on wavefunction approaches. <br>- "ML Potentials" (Sec. 4.6): Ranking methods based on Ml potentials. Forcefields parametrised with ML, should be under the "Forcefield" category. <br>- Other: Any other method. | | Forcefield | |
Will ML potentials be in Forcefields?
No! In ML interatomic potentials such as MACE or UMA, energies and related properties are obtained directly from the neural network model. On the other hand, some approaches (Espaloma is an example) use neural networks to optimise equilibrium values (bond distances, angles or torsions...) and related force constants of a classical force field. Although the parameters are learned using a NN rather than fitted to QM calculations directly, the resulting model still has the form of a classical forcefield. All data fields in the forcefield section could then be used in these cases. (I agree the explanation in here is not very clear!)
@nicholasfrancia I have a few more minor technical suggestions about the dictionaries, but would not want to detract from the main discussion at hand. Would these be useful here or would it be better to wait until it is merged and open a separate issue/PR?
@vaitkus go ahead in here! If there are fundamental changes to the dictionary, we can open a separate PR later (this is my fault that I've included many changes in a single PR)
vaitkus left a comment
OK, I marked several general classes of issues, since it will probably be more efficient to detect and fix them in bulk on your side than for me to mark all of them one by one. I only looked at cif_csp.dic, so similar issues might also be present in cif_compchem.dic. Once these suggestions are addressed (or rejected), I can take another quick look to check whether any new issues have revealed themselves.
In the future, such checks could be run by GitHub actions (as is done on other dictionary repos).
| _dictionary.title CIF_CSP | ||
| _dictionary.class Instance | ||
| _dictionary.version 0.1 |
The version number must have 3 components to adhere to SemVer, so either 0.0.1 or 0.1.0 in this case.
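A check of this kind is easy to automate. The snippet below is a minimal sketch, assuming only the three-component core of SemVer matters here (pre-release and build suffixes are ignored); the function name `is_three_part_version` is hypothetical and not part of any dictionary tooling.

```python
import re

# Three dot-separated non-negative integers without leading zeros.
# Assumption: only the MAJOR.MINOR.PATCH core of SemVer is checked;
# pre-release and build metadata suffixes are out of scope for this sketch.
SEMVER_CORE = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def is_three_part_version(version: str) -> bool:
    """Return True if `version` has exactly three numeric components."""
    return SEMVER_CORE.fullmatch(version) is not None


# The current value and the two suggested replacements.
for candidate in ("0.1", "0.0.1", "0.1.0"):
    print(candidate, is_three_part_version(candidate))
```

With this check, the current `0.1` is rejected, while both `0.0.1` and `0.1.0` pass.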
I have a script that takes details from a CSV file and creates the 2 .dic files, so yes, most of these issues can be fixed in bulk. Thank you again, Antanas, this is super helpful!
| _definition.id CSP | ||
| _definition.scope Category | ||
| _definition.class Loop |
The category is declared as looped, but the data items that constitute the primary loop key (data items that are collectively unique within a CIF loop row) are not provided. This is done by specifying one or more such names as values of the `_category_key.name` attribute, e.g.
loop_
_category_key.name
'_csp_key_item_1'
'_csp_key_item_2'
'_csp_key_item_3'
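Since the .dic files are generated by a script, the missing-key case could be detected in bulk before writing them out. The following is only a sketch: it assumes the category definitions are already available as a plain Python mapping from category id to DDLm attributes, which is a hypothetical layout and not the output of any particular CIF parser.

```python
def looped_categories_without_keys(categories):
    """Return the ids of Loop categories that declare no category key."""
    return [
        category_id
        for category_id, attributes in categories.items()
        if attributes.get("_definition.class") == "Loop"
        and not attributes.get("_category_key.name")
    ]


# Hypothetical in-memory representation of three category definitions.
categories = {
    "CSP": {"_definition.class": "Loop"},
    "ATOM_SITE": {
        "_definition.class": "Loop",
        "_category_key.name": ["_atom_site.label"],
    },
    "CELL": {"_definition.class": "Set"},
}

print(looped_categories_without_keys(categories))
```

Only `CSP` is reported here: `ATOM_SITE` declares a key, and `CELL` is a Set category, so no key is expected.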
vaitkus left a comment
Thank you for the changes to the dictionary.
I marked a few more remaining issues (mostly with incorrect specification of CIF 2.0 lists). Also, while this is not strictly mandated, we tend to assign units to all numeric data items (i.e. Integer, Real, etc.) as it helps later on in automated checks. Note the two special units that may be relevant in such cases:
- none (dimensionless - e.g. a ratio, factor, weight or scale)
- unspecified (no units supplied - the correct units should be derived based on the context)
| _name.object_id 'data_block_additional_files' | ||
| _type.purpose Describe | ||
| _type.source Assigned | ||
| _type.container List |
The type container for several items was recently changed from Table to List; however, the list dimension was not added. For Lists of unspecified/unlimited size it is still recommended to explicitly declare them as such, i.e.:
_type.dimension '[]'
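Items affected by this could be listed from the generating script. This is a rough sketch under the assumption that each item definition is held as a dictionary of DDLm attributes keyed by item name; that layout and the second and third item names are hypothetical, used only for illustration.

```python
def lists_without_dimension(items):
    """Return names of List-typed items that lack a `_type.dimension`."""
    return [
        name
        for name, attributes in items.items()
        if attributes.get("_type.container") == "List"
        and "_type.dimension" not in attributes
    ]


# Hypothetical item definitions; only the first mirrors the diff above.
items = {
    "data_block_additional_files": {"_type.container": "List"},
    "example_list_item": {"_type.container": "List", "_type.dimension": "[]"},
    "example_single_item": {"_type.container": "Single"},
}

print(lists_without_dimension(items))
```

Only `data_block_additional_files` is flagged, because the other List item already declares `'[]'` and the Single item needs no dimension.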
| Indicates whether 1-4 interactions (Lennard-Jones and electrostatic) are scaled | ||
| ; | ||
| _name.category_id 'forcefield' | ||
| _name.object_id '1_4_pairs' |
I just wanted to note that the dash is only invalid in the _name.object_id attribute, but it is OK to include it in the save frame name or the _definition.id if you wish (e.g. as is done for the _space_group_name_H-M_alt data item in the coreCIF dictionary). Of course, there is no need to do anything if you are happy with the underscore.
| _definition.id THEORETICAL_STRUCTURE | ||
| _definition.scope Category | ||
| _definition.class Loop |
DDLm dictionaries allow two types of categories: Set and Loop. Data items that belong to the Set categories are allowed to appear only once per data block. On the other hand, items in the Loop categories can be looped together and have more than one associated value. A typical example of a looped category is ATOM_SITE:
loop_
_atom_site_label
_atom_site_occupancy
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
Co1 0.25 0.25(4) 0.75 0.25
N1 1.00 0.1849(8) 0.6787(8)
C1 1.00 0.140(1) 0.715(1) 0.093(1)
# ...
while CELL is a typical Set category:
_cell_length_a 15.798(5)
_cell_length_b 15.798(5)
_cell_length_c 15.674(7)
I see that in the two dictionaries all categories are defined as looped. Is that intentional?
- A "Step" class has been added to describe intermediate steps in a CSP workflow, such as conformer generation and structure similarity calculations.
- New datafields describing Conformer Generation methods for organic CSP as an intermediate step.
- New datafields describing Structure Similarity methods, used both as a clustering method (to remove duplicates) and to identify matches with experimental structures.
- New datafields in `_theoretical_structure` to show matches with experimental structures or specify the cluster center if the structure has been labeled as a duplicate.
- New "Output" class to describe general outputs of a CSP workflow, such as computational cost or global minima in each step.
- New `.dic` files for the CSP and Computational chemistry dictionaries.
- New data fields for ML Potentials.
- Markdown files updated with relevant examples.
- The "Category" column in the descriptive files has been removed and names converted to their actual use in the `.dic` files:
  - Old version: `class`
  - New version: `_csp` `data_block_class`
- "Definition", "Constraints" and "Examples" columns changed to improve clarity.
- Descriptive dictionaries (Markdown files) in `Dictionaries/DFT` and `Dictionaries/Forcefields` now do not have any duplicate data field in `Dictionaries/CSPCore/cspcore.md`.
- A few old labels have been changed for consistency across the dictionary.
- Monomer correction datafield in the "Additional Proposed Fields" section in `Dictionaries/DFT/dft.md` has been split to describe functional, basis set and dispersion correction separately.