July 09th, 2017
We fixed the intensities in the TSV archive for challenges 046-243.

June 22nd, 2017
We added Category 4, which covers a subset of the data files.

May 22nd, 2017
We have improved challenges 29, 42, 71, 89, 105, 106 and 144.

April 26th, 2017
The rules and challenges of CASMI 2017 are now public!

Jan 20th, 2017
Organisation of CASMI 2017 is underway, stay tuned!

Contest Categories
In all categories, the goal is to identify the correct molecular structure (connectivity). Correctness is evaluated by matching the first block of the InChIKey between the submissions and the correct solutions.
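
The matching criterion can be sketched in a few lines of Python. This is only an illustration of the described rule, not the official evaluation code; the function names and the InChIKeys in the example are hypothetical.

```python
def inchikey_block1(inchikey: str) -> str:
    """Return the first (skeleton) block of a standard InChIKey,
    i.e. the 14 characters before the first hyphen."""
    return inchikey.split("-")[0]


def is_correct(submitted: str, solution: str) -> bool:
    """A submission counts as correct when its InChIKey block 1 matches
    that of the solution; the later blocks (which encode stereochemistry
    and protonation) are ignored by this comparison."""
    return inchikey_block1(submitted) == inchikey_block1(solution)


# Two keys sharing block 1 (hypothetical values) are scored as a match:
print(is_correct("AAAAAAAAAAAAAA-UHFFFAOYSA-N",
                 "AAAAAAAAAAAANA-UHFFFAOYSA-N"))  # differing block 1: False
print(is_correct("AAAAAAAAAAAAAA-UHFFFAOYSA-N",
                 "AAAAAAAAAAAAAA-UHFFFAOYNA-N"))  # same block 1: True
```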

Category 1: Best Structure Identification on Natural Products

The challenges for Category 1 are plant-derived compounds and a few that are endogenous to humans. Due to the often unique distribution of some of the compounds, only limited metadata on their origin is provided (see the page on challenges 1-45 data).

We provide MS, MS/MS and additional data, which may include sample background, experimental factors and, in some cases, retention information. Participants are free to use any external information. The data is available here.

All kinds of approaches, including manual, semi-automatic and automatic are allowed and encouraged!

Category 2: Best Automatic Structural Identification - In Silico Fragmentation Only

Category 2 consists of all Challenges 1-243, measured in two different analytical labs.

Based on the MS/MS data (and, for Challenges 1-45, optionally also the MS data), the goal is to determine the correct molecular structure using in silico fragmentation techniques alone (i.e. neither retention time, spectral library lookup, nor additional metadata may be considered). Mass spectral libraries may only be used to train prediction models, not to solve a challenge by querying the library with the peak list.

The aim of this category is to compare different fragmentation approaches, ranging from combinatorial to rule-based to simulation-based; the number of challenges means the category is aimed specifically at automatic approaches. The abstract should describe the computational method, including the parameters used for the submission.

Category 3: Best Automatic Structural Identification - Full Information

This category uses the same data files as Category 2, but in Category 3 any form of additional information can be used (retention time information, mass spectral libraries, patents, reference counts, …). This makes it possible to demonstrate whether, and by how much, additional information improves the annotation of unknowns. The approach(es) used should be well documented in the abstract submitted with the results file.

Update 22/06/2017: We have been asked by several participants to provide candidate structures. These are provided in a new category, Category 4. Categories 2 and 3 are designed to evaluate a broader piece of the "identification puzzle"; Category 4 evaluates how different methods rank relative to one another on consistent candidates. We encourage participants to consider submitting to multiple categories.

Category 4: Best Automatic Candidate Ranking

This category uses a subset of the data files in Categories 2 and 3, namely Challenges 46-243. Two fixed candidate lists (the full list and a non-redundant list filtered by InChIKey block 1) are provided for each challenge. They are stored as InChI or SMILES, packed as ZIP files, and available from the challenge data page.

The evaluation logic is the same as for the other categories. Since we will apply the InChIKey block 1 filter to the submissions, participants can choose whether to use the pre-filtered candidate set, with one (arbitrarily chosen) stereo configuration per 2D structure, or the full candidate set with all stereoisomers retrieved from PubChem. If your method assigns identical scores to different stereoisomers, the pre-filtered candidates will be more efficient and yield the same evaluation result.
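
The non-redundant list can be thought of as a block 1 deduplication of the full list: one arbitrary stereoisomer is kept per 2D structure. A minimal sketch of that idea follows; the function name and candidate entries are made up for illustration and are not the actual CASMI candidate data.

```python
def filter_block1(candidates):
    """Keep one candidate (arbitrarily, the first seen) per InChIKey
    block 1, mimicking a pre-filtered candidate set in which each 2D
    structure appears with a single stereo configuration."""
    seen = {}
    for identifier, inchikey in candidates:
        block1 = inchikey.split("-")[0]
        seen.setdefault(block1, (identifier, inchikey))
    return list(seen.values())


# Hypothetical candidates: two stereoisomers of one 2D structure
# plus one distinct structure.
candidates = [
    ("CID-1", "AAAAAAAAAAAAAA-UHFFFAOYSA-N"),
    ("CID-2", "AAAAAAAAAAAAAA-UHFFFAOYNA-N"),
    ("CID-3", "BBBBBBBBBBBBBB-UHFFFAOYSA-N"),
]
print(len(filter_block1(candidates)))  # prints 2
```

Since block 1 alone decides correctness, a method that cannot distinguish stereoisomers loses nothing by ranking the smaller, deduplicated list.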