Dictionary file#17

Open
nicholasfrancia wants to merge 13 commits into main from dictionary_file

@nicholasfrancia
Collaborator

  • A "Step" class has been added to describe intermediate steps in a CSP workflow, such as conformer generation and
    structure similarity calculations.

  • New datafields describing Conformer Generation methods for organic CSP as an intermediate step.

  • New datafields describing Structure Similarity methods, used both as a clustering method (to remove duplicates) and
    to identify matches with experimental structures.

  • New datafields in _theoretical_structure to show matches with experimental structures or specify the cluster center
    if the structure has been labeled as a duplicate.

  • New "Output" class to describe general outputs of a CSP workflow, such as computational cost or global minima in each
    step.

  • New .dic files for the CSP and Computational chemistry dictionaries.

  • New data fields for ML Potentials.

  • Markdown files updated with relevant examples.

  • The "Category" column in the descriptive files has been removed and names converted to their actual use in the .dic
    files:

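    Old version:

    | Group | Category   | Data Field |
    |-------|------------|------------|
    | CSP   | Data Block | class      |

    New version:

    | Group  | Data Field         |
    |--------|--------------------|
    | `_csp` | `data_block_class` |
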
  • "Definition", "Constraints" and "Examples" columns changed to improve clarity.

  • Descriptive dictionaries (Markdown files) in Dictionaries/DFT and Dictionaries/Forcefields no longer duplicate any
    data field defined in Dictionaries/CSPCore/cspcore.md.

  • A few old labels have been changed for consistency across the dictionary.

  • Monomer correction datafield in the "Additional Proposed Fields" section in Dictionaries/DFT/dft.md has been split
    to describe functional, basis set and dispersion correction separately.

Comment thread Dictionaries/CSPCore/cspcore.md Outdated
| `_csp` | `structure_ranking_data_block_id` | char | To be used in the "Workflow" datablock, it is a short, human-readable description of the ranking method. | | | |
| `_csp` | `structure_ranking_relative_energy_cutoff` | numb | To be used in the "Workflow" datablock, in case of multi-step ranking of structures, it specifies the structures that will be used in the next step by selecting those below a certain relative energy difference from the global minima. | \>=0. | kJ mol<sup>-1</sup> | 10.0 |
| `_csp` | `structure_ranking_max_structures_retained` | numb | To be used in the "Workflow" datablock, in case of multi-step ranking of structures, it specifies the structures that will be used in the next step by selecting the top N structures in the ranking. | \>=0 | | 250 |
| `_compchem` | `method` | char | The energy or scoring model used to rank structures. | - "pDFT" (Sec. 4.2): Ranking methods based on periodic Density Functional Theory methods. <br>- "Forcefield" (Sec. 4.3): Ranking methods based on classical forcefield. <br>- "Semi Empirical" (Sec. 4.4): Ranking methods based on semi empirical approaches, <br>- "Wavefunction" (Sec. 4.5): Ranking methods based on wavefunction approaches. <br>- "ML Potentials" (Sec. 4.6): Ranking methods based on Ml potentials. Forcefields parametrised with ML, should be under the "Forcefield" category. <br>- Other: Any other method. | | Forcefield |
Collaborator

will ML potentials be in Forcefields?

Collaborator Author

No! In ML interatomic potentials such as MACE or UMA, energies and related properties are obtained directly from the neural network model. On the other hand, some approaches (Espaloma is an example) use neural networks to optimise equilibrium values (bond distances, angles or torsions...) and related force constants of a classical force field. Although the parameters are learned using a NN rather than fitted to QM calculations directly, the resulting model still has the form of a classical forcefield. All data fields in the forcefield section could then be used in these cases. (I agree the explanation in here is not very clear!)
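
To make the distinction concrete, here is a toy Python sketch. It does not use MACE, UMA or Espaloma; every function, parameter value and weight below is an invented placeholder. In the first case a model only supplies the parameters and the energy still comes from a classical functional form; in the second the model returns the energy directly.

```python
import math

def harmonic_bond_energy(r, k, r0):
    """Classical forcefield bond term: E = 0.5 * k * (r - r0)**2."""
    return 0.5 * k * (r - r0) ** 2

def learned_parameters(bond):
    """Stand-in for a model that predicts forcefield parameters (k, r0).

    The numbers are invented for illustration only.
    """
    table = {("C", "C"): (2500.0, 1.53), ("C", "H"): (2800.0, 1.09)}
    return table[bond]

def forcefield_energy(bond, r):
    """ML-parametrised forcefield: learned parameters, classical formula."""
    k, r0 = learned_parameters(bond)
    return harmonic_bond_energy(r, k, r0)

# Invented (weight, centre) pairs standing in for a trained network.
TOY_WEIGHTS = [(0.8, 1.2), (-1.1, 1.5), (0.3, 2.0)]

def ml_potential_energy(r):
    """Stand-in for an ML interatomic potential: geometry in, energy out,
    with no prescribed classical functional form."""
    return sum(w * math.exp(-((r - c) ** 2)) for w, c in TOY_WEIGHTS)

print(forcefield_energy(("C", "C"), 1.60))
print(ml_potential_energy(1.60))
```

Under this reading, the first case can be described with the existing forcefield data fields (functional form plus parameters), while the second cannot.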

@vaitkus
Contributor

vaitkus commented Apr 27, 2026

@nicholasfrancia I have a few more minor technical suggestions about the dictionaries, but would not want to detract from the main discussion at hand. Would these be useful here or would it be better to wait until it is merged and open a separate issue/PR?

@nicholasfrancia
Collaborator Author

@vaitkus go ahead in here! If there are fundamental changes to the dictionary, we can open a separate PR later (this is my fault that I've included many changes in a single PR)

Contributor

@vaitkus vaitkus left a comment

Ok, I marked several general classes of issues since it will probably be more efficient to detect and fix them in bulk on your side than for me to mark all of them one-by-one. I only looked at cif_csp.dic, so similar issues might also be present in cif_compchem.dic. Once these suggestions are addressed (or rejected), I can take another quick look to check whether any new issues have revealed themselves.

In the future, such checks could be run by GitHub actions (as is done on other dictionary repos).

Comment thread cif_csp.dic Outdated

_dictionary.title CIF_CSP
_dictionary.class Instance
_dictionary.version 0.1
Contributor

The version number must have 3 components to adhere to SemVer, so either 0.0.1 or 0.1.0 in this case.
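
A check like this could be scripted. A minimal Python sketch, assuming only the core MAJOR.MINOR.PATCH form matters here (pre-release and build suffixes are deliberately not handled, and the function name is made up for this example):

```python
import re

# Core SemVer form: three dot-separated non-negative integers,
# none of which may have a leading zero.
SEMVER_CORE = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")

def is_three_part_version(version: str) -> bool:
    """Return True if `version` looks like MAJOR.MINOR.PATCH."""
    return SEMVER_CORE.fullmatch(version) is not None

print(is_three_part_version("0.1"))    # False
print(is_three_part_version("0.1.0"))  # True
```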

Collaborator Author

I have a script that takes details from a CSV file and creates the 2 .dic files, so yes, most of these issues can be fixed in bulk. Thank you again, Antanas, this is super helpful!

Comment thread cif_csp.dic

_definition.id CSP
_definition.scope Category
_definition.class Loop
Contributor

The category is declared as looped, but the data items that constitute the primary loop key (data items that are collectively unique within a CIF loop row) are not provided. This is done by specifying one or more such names as values of the _category_key.name attribute, e.g.

loop_
  _category_key.name
     '_csp_key_item_1'
     '_csp_key_item_2'
     '_csp_key_item_3'
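
Missing keys of this kind can be found in bulk. A rough Python sketch, assuming each definition sits in a `save_NAME ... save_` frame and that a plain-text scan is good enough (a real CIF parser would be more robust; the sample text and the key name `_theoretical_structure.id` are made up for illustration):

```python
import re

# One save frame: "save_NAME" on its own line up to a bare "save_" line.
FRAME = re.compile(r"^save_(\S+)[ \t]*\n(.*?)^save_[ \t]*$", re.M | re.S)

def loop_categories_without_key(dic_text):
    """Return names of frames declared as Loop that have no _category_key.name."""
    missing = []
    for match in FRAME.finditer(dic_text):
        name, body = match.group(1), match.group(2)
        is_loop = re.search(r"^\s*_definition\.class\s+Loop\b", body, re.M)
        has_key = re.search(r"^\s*_category_key\.name\b", body, re.M)
        if is_loop and not has_key:
            missing.append(name)
    return missing

sample = """\
save_CSP
    _definition.id      CSP
    _definition.scope   Category
    _definition.class   Loop
save_

save_THEORETICAL_STRUCTURE
    _definition.id      THEORETICAL_STRUCTURE
    _definition.scope   Category
    _definition.class   Loop
    loop_
      _category_key.name
         '_theoretical_structure.id'
save_
"""

print(loop_categories_without_key(sample))  # ['CSP']
```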

Contributor

@vaitkus vaitkus left a comment

Thank you for the changes to the dictionary.

I marked a few more remaining issues (mostly with incorrect specification of CIF 2.0 lists). Also, while this is not strictly mandated, we tend to assign units to all numeric data items (e.g. Integer, Real, etc.) as it helps later on in automated checks. Note the two special units that may be relevant in such cases:

  • none (dimensionless - e.g. a ratio, factor, weight or scale)
  • unspecified (no units supplied - the correct units should be derived based on the context)
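
The units convention lends itself to an automated check. A minimal Python sketch, assuming the dictionary has already been parsed into a mapping from item name to its DDLm attributes (the mapping layout, the dotted data names, the content types and the unit code below are illustrative assumptions, not taken from the .dic files):

```python
# Content types treated as numeric in this sketch (the two named above).
NUMERIC_CONTENTS = {"Integer", "Real"}

def numeric_items_without_units(items):
    """Return names of numeric items that do not declare `_units.code`."""
    return [
        name
        for name, attrs in items.items()
        if attrs.get("_type.contents") in NUMERIC_CONTENTS
        and "_units.code" not in attrs
    ]

# Hypothetical parsed definitions; the unit code is a placeholder value.
items = {
    "_csp.structure_ranking_relative_energy_cutoff": {
        "_type.contents": "Real",
        "_units.code": "kilojoules_per_mole",
    },
    "_csp.structure_ranking_max_structures_retained": {
        "_type.contents": "Integer",
    },
    "_csp.structure_ranking_data_block_id": {
        "_type.contents": "Text",
    },
}

print(numeric_items_without_units(items))
# ['_csp.structure_ranking_max_structures_retained']
```

In this example the flagged item is a count, so `none` would be the natural unit to add.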

Comment thread cif_csp.dic
_name.object_id 'data_block_additional_files'
_type.purpose Describe
_type.source Assigned
_type.container List
Contributor

The type container for several items was recently changed from Table to List, however, the list dimension was not added. For Lists of unspecified/unlimited size it is still recommended to explicitly declare them as such, i.e.:

_type.dimension               '[]'
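
Since the .dic files are generated by a script, this could be patched in bulk. A small Python sketch, assuming each definition is available as a list of attribute lines (the function name and the line-based representation are assumptions for this example; the sample lines are the ones quoted above):

```python
def add_missing_list_dimension(frame_lines):
    """Add `_type.dimension '[]'` after `_type.container List` when the
    frame does not declare a dimension yet. Returns a new list of lines."""
    if any(line.split()[:1] == ["_type.dimension"] for line in frame_lines):
        return list(frame_lines)  # already declared, leave untouched
    fixed = []
    for line in frame_lines:
        fixed.append(line)
        if line.split() == ["_type.container", "List"]:
            indent = line[: len(line) - len(line.lstrip())]
            fixed.append(f"{indent}_type.dimension '[]'")
    return fixed

frame = [
    "_name.object_id 'data_block_additional_files'",
    "_type.purpose Describe",
    "_type.source Assigned",
    "_type.container List",
]

print("\n".join(add_missing_list_dimension(frame)))
```

Running the function on its own output leaves the frame unchanged, so it is safe to apply repeatedly.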

Comment thread cif_compchem.dic
Indicates whether 1-4 interactions (Lennard-Jones and electrostatic) are scaled
;
_name.category_id 'forcefield'
_name.object_id '1_4_pairs'
Contributor

I just wanted to note that the dash is only invalid in the _name.object_id attribute; it is OK to include it in the save frame name or the _definition.id if you wish (e.g. as is done for the _space_group_name_H-M_alt data item in the coreCIF dictionary). Of course, there is no need to do anything if you are happy with the underscore.

Comment thread cif_compchem.dic

_definition.id THEORETICAL_STRUCTURE
_definition.scope Category
_definition.class Loop
Contributor

DDLm dictionaries allow two types of categories: Set and Loop. Data items that belong to the Set categories are allowed to appear only once per data block. On the other hand, items in the Loop categories can be looped together and have more than one associated value. A typical example of a looped category is ATOM_SITE:

loop_
_atom_site_label
_atom_site_occupancy
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
Co1 0.25 0.25(4) 0.75 0.25
N1 1.00 0.1849(8) 0.6787(8) 
C1 1.00 0.140(1) 0.715(1) 0.093(1)
# ...

while CELL is a typical Set category:

_cell_length_a                   15.798(5)
_cell_length_b                   15.798(5)
_cell_length_c                   15.674(7)

I see that in the two dictionaries all categories are defined as looped. Is that intentional?
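
To illustrate what the distinction means in practice, here is a toy Python check. It assumes a data block has been read into a mapping from data name to the list of values found for it (this representation and the function name are assumptions of the sketch): an item in a Set category may carry only one value per data block, whereas a Loop item may carry many.

```python
def set_items_with_multiple_values(values_by_item, set_items):
    """Return Set-category items that carry more than one value in a block."""
    return sorted(
        name
        for name in set_items
        if len(values_by_item.get(name, [])) > 1
    )

# Values taken from the two examples above.
block = {
    "_cell_length_a": ["15.798(5)"],
    "_atom_site_label": ["Co1", "N1", "C1"],
}

# CELL as a Set category: a single value, nothing to report.
print(set_items_with_multiple_values(block, {"_cell_length_a"}))
# []

# If ATOM_SITE were (wrongly) declared as Set, its items would be flagged.
print(set_items_with_multiple_values(block, {"_cell_length_a", "_atom_site_label"}))
# ['_atom_site_label']
```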
