Critical Assessment of Small Molecule Identification

Category 4 extra contains the scores from two methods that were submitted on 19.10.2017 after the contest deadline. The initial submissions before the deadline were unfortunately identical to IOKR_TanimotoGaussian. With these updated files, IOKR_TanimotoGaussian_AvgScore would have achieved the 3rd rank in Category 4.

ParticipantID	F1 score	Mean rank	Median rank	Top	Top3	Top10	Misses	TopPos	TopNeg	Mean RRP	Median RRP	N
IOKR_TanimotoGaussian_AvgScore	1446	2721.84	201.0	38	60	73	0	22	16	0.683	0.957	198
MPIOKR_GaussianRFF	1066	3862.30	582.5	33	43	52	0	22	11	0.587	0.726	198

This summary is also available as CSV download.

Table legend:

F1 score: The Formula 1 score awards points for each challenge based on the rank of the correct solution, similar to the points scheme in F1 racing. In the participant table, these are summed over all challenges. Please note that the F1 score is thus not necessarily comparable across categories.
Mean/Median rank: Mean and median rank of the correct solution. For tied ranks with other candidates, the average rank of the ties is used.
Top, Top3, Top10: Number of challenges where the correct solution is ranked first, among the top 3, or among the top 10.
Misses: Number of challenges where the correct solution is missing.
TopPos, TopNeg: Number of challenges where the correct solution is ranked first in positive or negative ionization mode, respectively.
Mean/Median RRP: The relative ranking position, which also takes the length of the candidate list into account.
N: Number of submissions that have passed the evaluation scripts.
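
The rank-based columns of the legend can be sketched in a few lines of Python. This is an illustration only, not the contest's evaluation script. Two things are assumptions that the legend does not state: the points table (the Formula 1 scheme used since 2010) and the linear interpolation of points for fractional (tied) ranks. The function names are hypothetical. Under these assumptions, the ranks listed for IOKR_TanimotoGaussian_AvgScore in the Category 4 table below appear to add up to the reported F1 score of 1446.

```python
# Assumed points table (Formula 1 scheme since 2010); not stated in the legend.
F1_POINTS = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1]


def average_rank(scores, correct_index):
    """Rank of the correct candidate (1 = best); ties share the average rank."""
    s = scores[correct_index]
    better = sum(1 for x in scores if x > s)
    tied = sum(1 for x in scores if x == s)  # includes the correct candidate
    # The tied candidates occupy ranks better+1 .. better+tied; return their mean.
    return better + (tied + 1) / 2


def f1_points(rank):
    """Points for a possibly fractional rank (assumed linear interpolation)."""
    def points_at(place):
        return F1_POINTS[place - 1] if 1 <= place <= len(F1_POINTS) else 0

    lower = int(rank)  # ranks are >= 1, so int() acts as floor
    frac = rank - lower
    return (1 - frac) * points_at(lower) + frac * points_at(lower + 1)


def summarize(ranks):
    """Summary columns for one participant from its per-challenge ranks."""
    return {
        "F1 score": sum(f1_points(r) for r in ranks),
        "Top": sum(1 for r in ranks if r == 1),
        "Top3": sum(1 for r in ranks if r <= 3),
        "Top10": sum(1 for r in ranks if r <= 10),
    }
```

For example, a candidate tied with one other candidate behind a single better-scoring one gets `average_rank([0.9, 0.5, 0.5, 0.1], 1) == 2.5`. Note that a tied rank of 1.5 counts towards Top3 but not towards Top in this sketch.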

Summary of Rank by Challenge

For each challenge, the lowest rank among participants is highlighted in bold. If the submission did not contain the correct candidate this is denoted as "-". If someone did not participate in a challenge, the table cell is empty.

Category 4:

Challenge	IOKR_TanimotoGaussian_AvgScore	MPIOKR_GaussianRFF
challenge-046	1647.0	1313.0
challenge-047	384.0	408.0
challenge-048	662.0	6981.0
challenge-049	469.5	1636.5
challenge-050	14.0	2.0
challenge-051	1.0	14.0
challenge-052	1477.0	315.0
challenge-053	2.0	2.0
challenge-054	11295.0	2985.0
challenge-055	1.0	29.0
challenge-056	1.0	1.0
challenge-057	15753.0	15693.0
challenge-058	533.0	10.0
challenge-059	1.0	1.0
challenge-060	8941.0	11264.0
challenge-061	3117.0	3304.0
challenge-062	1.0	216.0
challenge-063	1.0	1.0
challenge-064	11410.0	11188.0
challenge-065	1886.0	2004.0
challenge-066	11104.0	10588.0
challenge-067	1260.0	2918.0
challenge-068	8.5	7.5
challenge-069	307.0	51.0
challenge-070	11294.0	11323.0
challenge-071	107.0	9970.0
challenge-072	15577.0	14528.0
challenge-073	1.0	1.0
challenge-074	2.0	6.0
challenge-075	1.0	587.0
challenge-076	15949.0	15260.0
challenge-077	7099.0	7590.0
challenge-078	8399.0	12007.0
challenge-079	664.0	1030.0
challenge-080	5495.0	1.0
challenge-081	1.0	1.0
challenge-082	19.0	147.0
challenge-083	1.5	43.5
challenge-084	344.0	280.0
challenge-085	9093.0	9193.0
challenge-086	268.0	26388.0
challenge-087	1.0	1.0
challenge-088	3524.5	3503.5
challenge-089	4132.0	4492.0
challenge-090	209.0	67.0
challenge-091	704.0	640.0
challenge-092	32.0	344.0
challenge-093	3704.0	3578.0
challenge-094	8.0	16242.0
challenge-095	9242.0	9355.0
challenge-096	7928.0	247.0
challenge-097	1.0	1.0
challenge-098	3007.0	3090.0
challenge-099	10038.0	10059.0
challenge-100	7538.0	7101.0
challenge-101	12311.0	13007.0
challenge-102	4935.0	4425.0
challenge-103	1.0	1.0
challenge-104	115.0	1897.0
challenge-105	38.0	43.0
challenge-106	8520.0	3868.0
challenge-107	5550.0	5559.0
challenge-108	4.0	3721.0
challenge-109	9176.0	8997.0
challenge-110	6637.0	6611.0
challenge-111	2.0	6.0
challenge-112	1222.0	6219.0
challenge-113	380.0	399.0
challenge-114	7775.0	9881.0
challenge-115	1.0	1.0
challenge-116	1.0	39.0
challenge-117	2629.0	3755.0
challenge-118	6421.0	6842.0
challenge-119	9798.0	8871.0
challenge-120	1.0	10219.0
challenge-121	336.0	46.0
challenge-122	477.0	5358.0
challenge-123	11390.0	13377.0
challenge-124	1366.0	1349.0
challenge-125	2163.5	2277.5
challenge-126	32.0	39.0
challenge-127	562.0	350.0
challenge-128	7700.0	1.0
challenge-129	1.0	1879.0
challenge-130	72.0	110.0
challenge-131	379.0	384.0
challenge-132	1119.0	2367.0
challenge-133	4.0	11.0
challenge-134	1881.0	11909.0
challenge-135	567.5	596.5
challenge-136	1.0	1.0
challenge-137	6.5	1.5
challenge-138	22.0	3.0
challenge-139	2.0	185.0
challenge-140	1.0	11612.0
challenge-141	1.0	1.0
challenge-142	320.0	5299.0
challenge-143	50.0	10.0
challenge-144	25.0	4.0
challenge-145	25.0	1.0
challenge-146	11720.0	11091.0
challenge-147	1.0	1293.0
challenge-148	2.0	1.0
challenge-149	1.0	1.0
challenge-150	2.0	1722.0
challenge-151	10338.0	7983.0
challenge-152	55.0	80.0
challenge-153	23.0	34.0
challenge-154	10102.0	7472.0
challenge-155	2.0	3.0
challenge-156	15.0	20.0
challenge-157	25.0	13.0
challenge-158	3.0	9544.0
challenge-159	1.0	2299.0
challenge-160	12176.0	2629.0
challenge-161	261.0	87.0
challenge-162	1.0	17958.0
challenge-163	1.0	1.0
challenge-164	4.0	1.0
challenge-165	1988.0	2004.0
challenge-166	3641.0	3407.0
challenge-167	14821.0	15347.0
challenge-168	5.0	2212.0
challenge-169	27.0	578.0
challenge-170	1.0	1.0
challenge-171	2.0	1.0
challenge-172	204.0	444.0
challenge-173	1353.0	16181.0
challenge-174	6409.0	6948.0
challenge-175	1396.0	14.0
challenge-176	14.0	283.0
challenge-177	1.0	414.0
challenge-178	14960.0	1385.0
challenge-179	1732.0	2422.0
challenge-180	277.0	289.0
challenge-181	2364.0	16151.0
challenge-182	27.0	39.0
challenge-183	1.5	1.5
challenge-184	1.0	1.0
challenge-185	3729.0	11832.0
challenge-186	2.0	14.0
challenge-187	133.0	243.0
challenge-188	2.0	1.0
challenge-189	1138.0	47.0
challenge-190	2349.0	3666.0
challenge-191	974.0	976.0
challenge-192	36.0	5022.0
challenge-193	1.0	1.0
challenge-194	65.0	8323.0
challenge-195	1.5	3.5
challenge-196	2855.0	2.0
challenge-197	34.0	34.0
challenge-198	1.0	1.0
challenge-199	2745.0	2795.0
challenge-200	1.0	1.0
challenge-201	15085.0	26330.0
challenge-202	2337.0	18616.0
challenge-203	1.0	1.0
challenge-204	6392.0	6410.0
challenge-205	3.0	16.0
challenge-206	2329.0	2407.0
challenge-207	10269.0	9551.0
challenge-208	33.0	139.0
challenge-209	198.0	31.0
challenge-210	32.0	4671.0
challenge-211	1.0	1.0
challenge-212	15505.0	15390.0
challenge-213	1.0	1.0
challenge-214	9752.0	10013.0
challenge-215	9.0	3860.0
challenge-216	8.0	13.0
challenge-217	2.0	9.0
challenge-218	2.0	2.0
challenge-219	1.0	26.0
challenge-220	1.0	15801.0
challenge-221	18460.0	19164.0
challenge-222	585.0	12334.0
challenge-223	1.0	62.0
challenge-224	2.0	2.0
challenge-225	1.5	5.5
challenge-226	858.0	227.0
challenge-227	2908.0	2873.0
challenge-228	5.0	4252.0
challenge-229	18334.0	18559.0
challenge-230	15108.0	18378.0
challenge-231	1373.0	1369.0
challenge-232	1.0	1.0
challenge-233	283.0	2848.0
challenge-234	11448.0	11475.0
challenge-235	3.0	1536.0
challenge-236	7.0	1.0
challenge-237	3.0	21.0
challenge-238	237.0	23.0
challenge-239	8.0	1.0
challenge-240	628.0	5456.0
challenge-241	4.0	1.0
challenge-242	1.0	1.0
challenge-243	2.0	2.0

This summary is also available as CSV download.

Participant information and abstracts

Participant:          Bach
ParticipantID:        IOKR_TanimotoGaussian
Category:             4
Authors:              Eric Bach(1), Céline Brouard(1,2), Kai Dührkop(3),
                      Sebastian Böcker(3) and Juho Rousu(1,2)
Affiliations:         (1) Department of Computer Science, Aalto University,
                          Espoo, Finland
                      (2) Helsinki Institute for Information Technology, Espoo,
                          Finland
                      (3) Chair for Bioinformatics, Friedrich-Schiller University,
                          Jena, Germany
Automatic pipeline:   yes
Spectral libraries:   no

Abstract
We used a recent machine learning approach, called Input Output Kernel Regression 
(IOKR), for predicting the candidate scores. IOKR has been successfully applied 
to metabolite identification [1]. 

In this method, kernel functions measure the similarity between MS/MS spectra
(input kernel) and between molecular structures (output kernel).
On the input side, we use several kernels defined on MS/MS spectra and fragmentation
trees, and combine them uniformly, i.e. we sum up the kernels with equal weights.
On the output side, we use a Gaussian kernel on Tanimoto features calculated 
from binary fingerprints representing the molecular structures in the candidate 
sets.
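
The two kernel constructions described above can be illustrated with NumPy. This is a sketch under assumptions, not the participants' implementation: "Gaussian kernel on Tanimoto features" is read here as a Gaussian kernel on the distance induced by the Tanimoto kernel, the bandwidth parameter `gamma` is hypothetical, and two all-zero fingerprints are given similarity 1 by convention.

```python
import numpy as np


def combine_uniform(kernels):
    """Sum a list of equally sized kernel matrices with equal weights."""
    return np.sum(np.stack(kernels), axis=0)


def tanimoto_kernel(A, B):
    """Tanimoto similarity between the rows of two binary fingerprint matrices."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    inter = A @ B.T                                   # shared on-bits
    union = A.sum(1)[:, None] + B.sum(1)[None, :] - inter
    # Convention (assumption): two all-zero fingerprints have similarity 1.
    return np.where(union > 0, inter / np.maximum(union, 1.0), 1.0)


def gaussian_tanimoto_kernel(A, B, gamma=1.0):
    """Gaussian kernel on the Tanimoto-induced squared distance.

    With T(a, a) = 1, the squared distance is T(a,a) + T(b,b) - 2 T(a,b)
    = 2 - 2 T(a,b).
    """
    return np.exp(-gamma * (2.0 - 2.0 * tanimoto_kernel(A, B)))
```

Identical fingerprints get kernel value 1, and the value decays as the fingerprints share fewer bits.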

We train two separate IOKR models, one for each ionization mode, i.e. positive
and negative. For the positive model we use ~14000 identified MS/MS spectra and
for the negative model ~5800. These spectra are mainly extracted from the GNPS
and MassBank databases. We represent the candidate molecular structures using
~7600 binary molecular fingerprints.

For each challenge spectrum we predict the molecular formula using Sirius [2],
taking into account the possible molecular formulas based on the candidate sets.
The score we submitted for each candidate is the one corresponding to the most
likely molecular formula.

[1] Brouard, C.; Shen, H.; Dührkop, K.; d'Alché-Buc, F.; Böcker, S. & Rousu, J.
    Fast metabolite identification with Input Output Kernel Regression
    Bioinformatics, 2016
[2] https://bio.informatik.uni-jena.de/software/sirius/

Participant:          Bach
ParticipantID:        MPIOKR_GaussianRFF
Category:             4
Authors:              Eric Bach(1), Céline Brouard(1,2), Kai Dührkop(3),
                      Sebastian Böcker(3) and Juho Rousu(1,2)
Affiliations:         (1) Department of Computer Science, Aalto University,
                          Espoo, Finland
                      (2) Helsinki Institute for Information Technology,
                          Espoo, Finland
                      (3) Chair for Bioinformatics, Friedrich-Schiller University,
                          Jena, Germany
Automatic pipeline:   yes
Spectral libraries:   no

Abstract

Magnitude-preserving Input Output Kernel Regression (MP-IOKR) is an
extension of the Input Output Kernel Regression (IOKR) method [1],
which has been successfully applied to metabolite identification
[2]. Magnitude-preserving IOKR uses a modified objective function,
which can exploit the knowledge about the molecular candidates for a
set of training MS/MS spectra.

IOKR objective function for the regression function h (prediction of
the feature vector representing a molecular structure):

(1) h = argmin_h sum_i ||h(x_i) - psi(y_i)||^2 + lambda ||h||^2
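
For illustration, objective (1) has the familiar kernel ridge regression form, whose minimizer can be written as h(x) = sum_i k(x, x_i) c_i with coefficient matrix C = (K + lambda I)^{-1} Psi, where K is the input kernel matrix and the rows of Psi are the psi(y_i). The sketch below assumes explicit, finite-dimensional output features and scores candidates by the inner product with the prediction. Both are simplifying assumptions made here, and the function names are hypothetical.

```python
import numpy as np


def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def fit_iokr(K_train, Psi_train, lam):
    """Solve (K + lam I) C = Psi for the coefficient matrix C."""
    n = K_train.shape[0]
    return np.linalg.solve(K_train + lam * np.eye(n), Psi_train)


def predict(K_query, C):
    """h(x) for each query row, given kernel values against the training set."""
    return K_query @ C


def score_candidates(h_x, Psi_cand):
    """Inner product between the prediction and each candidate's features."""
    return Psi_cand @ h_x
```

With a very small lambda the model interpolates the training outputs, which is a quick sanity check for the solver. In practice lambda would be chosen by validation.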
        
Magnitude-preserving modification of the objective function:

(2) h = argmin_h sum_i 1/n_i sum_j ||(h(x_i)-h(x_j))-(psi(y_i)-psi(y_j))||^2 
                    + lambda ||h||^2
                    
Here i in {1,...,l} indexes the training examples and j in {1,...,n_i} indexes the
molecular candidates of training example i. The x_i and y_i are the training MS/MS
spectra and the corresponding training molecular structures; the x_j and y_j are the
candidates' MS/MS spectra and molecular structures. Equation (1) (IOKR) minimizes the
prediction error between h(x_i) and the feature vector psi(y_i) of the training
molecular structure. In contrast, Equation (2) (MP-IOKR) learns a function h that
also preserves the magnitudes (differences) between the training molecular structure
(y_i) and all its candidate molecular structures (y_j). In this way we consider how
the true candidate relates to all the remaining candidates and include this knowledge
in our learning problem. It is important to note that we do not need the MS/MS
spectrum x_j of each candidate, as we approximate the corresponding input feature
vectors for each candidate using the molecular structure y_j.
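
To make the difference between the two data terms concrete, the following sketch evaluates both for given predictions. It is illustrative only: it assumes that the predictions h(x_i) and h(x_j) are already available as arrays (how the candidate inputs are approximated is not shown), and the function names are hypothetical.

```python
import numpy as np


def iokr_data_term(H, Psi):
    """sum_i ||h(x_i) - psi(y_i)||^2, with one row per training example."""
    return float(((H - Psi) ** 2).sum())


def mp_data_term(H, Psi, H_cand, Psi_cand):
    """sum_i 1/n_i sum_j ||(h(x_i) - h(x_j)) - (psi(y_i) - psi(y_j))||^2.

    H_cand[i] and Psi_cand[i] hold one row per candidate j of example i.
    """
    total = 0.0
    for h_i, psi_i, Hc, Pc in zip(H, Psi, H_cand, Psi_cand):
        diff = (h_i - Hc) - (psi_i - Pc)  # shape (n_i, d)
        total += (diff ** 2).sum() / len(Hc)
    return float(total)
```

One consequence is easy to check: if every prediction is shifted by the same constant vector, the term from Equation (1) grows, while the term from Equation (2) stays at zero, because only differences between the true structure and its candidates enter.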

MP-IOKR is a kernel method. Kernels measure the similarity between structured
objects, e.g. MS/MS spectra (input kernel) or molecular structures (output kernel).
On the input side, we use several kernels defined on MS/MS spectra and fragmentation
trees, and combine them uniformly, i.e. we sum up the kernels with equal weights.
On the output side, we use a Gaussian kernel calculated from binary fingerprints
representing the molecular structures in the candidate sets. As we cannot handle
possibly millions of candidates (and kernel matrices with O(million^2) entries), we
approximate the Gaussian features and use those in our framework.
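
One common way to approximate Gaussian kernel features is the random Fourier feature (RFF) construction. The sketch below assumes that the "RFF" in the participant ID refers to this technique; the abstract itself does not spell out the approximation, and the feature count and bandwidth `sigma` used here are hypothetical. Each input is mapped to an explicit vector z(y) such that z(y)·z(y') approximates exp(-||y - y'||^2 / (2 sigma^2)), so no candidate-by-candidate kernel matrix is needed.

```python
import numpy as np


def draw_rff(dim, n_features, sigma, rng):
    """Draw the random projection shared by all inputs."""
    W = rng.normal(0.0, 1.0 / sigma, size=(dim, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return W, b


def rff_map(X, W, b):
    """Explicit features whose inner products approximate the Gaussian kernel."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)


def gaussian_kernel(A, B, sigma):
    """Exact Gaussian kernel, used here only to check the approximation."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))
```

The approximation error shrinks roughly with the inverse square root of the number of random features, so the feature count trades accuracy against memory.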

We train two separate MP-IOKR models, one for each ionization mode, i.e. positive
and negative. For the positive model we use ~14000 identified MS/MS spectra and
~4 million candidates with their molecular structures. The MS/MS spectra are mainly
extracted from the GNPS and MassBank databases. For the negative model we
use ~5800 identified MS/MS spectra and ~1.5 million candidates. We represent the
candidate molecular structures using ~7600 binary molecular fingerprints.

For each challenge spectrum we predict the molecular formula using Sirius [3],
taking into account the possible molecular formulas based on the candidate sets.
The score we submitted for each candidate is the one corresponding to the most
likely molecular formula.

[1] Brouard, C.; Shen, H.; Dührkop, K.; d'Alché-Buc, F.; Böcker, S. & Rousu, J.
    Fast metabolite identification with Input Output Kernel Regression 
    Bioinformatics, 2016
[2] Brouard, C.; Bach, E.; Böcker, S. & Rousu, J.
    Magnitude-Preserving Ranking for Structured Outputs
    (submitted), 2017
[3] https://bio.informatik.uni-jena.de/software/sirius/

Details per Challenge and Participant. See legend at bottom for more details

The details table is also available as HTML and as CSV download.