An enhanced version of an algorithm is discussed that encodes a
description of the chemical environment
of carbon atoms in a manner that correlates with carbon-13 nuclear
magnetic resonance (13C NMR) chemical
shifts. The encoding algorithm uses a vector-based approach in
which the first dimension of the vector
represents the chemical shift of the carbon atom, the second dimension
represents the collective influence
of atoms one bond away from the carbon on its chemical shift, and each
successive dimension represents
the influence of the atoms one bond further away. This encoding
algorithm is a key component of a 13C
NMR spectrum simulation procedure in which each of the carbons in a
large database of known structures
and spectra is represented as a vector. Database search methods
based on vector comparisons are used to
find the closest matching chemical environments and associated chemical
shifts for each of the carbons in
a structure input by a user. Enhancements to the original
algorithm include an expansion of the number of
atom classes treated, the addition of a scheme to treat aromatic
systems as a special case, and the use of an
expanded vector format to regain some of the information lost by
collapsing the molecular structure to a
vector representation. To test this algorithm, a database of
structures and spectra was split into training and
test sets consisting of 16 959 and 4240 structures, respectively.
Experiments performed to optimize several
parameters associated with the encoding algorithm were followed by
a comparison of the retrieved (i.e., predicted)
and actual chemical shifts for the structures in the test set. For
the optimal parameter settings found, the
median of the mean absolute deviations in chemical shifts for the
structures in the test set was 1.30 ppm
and was obtained with an expanded vector representation based on 15
dimensions.
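
The retrieval and scoring steps summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each carbon is stored as a pair of an environment vector and a chemical shift, with the shift kept separate from the dimensions that are compared, that closeness is measured by Euclidean distance (the abstract does not name the metric), and that only the single nearest match is used. The function names and the toy numbers are hypothetical, and the numerical encoding of the environment dimensions is not reproduced here.

```python
from math import sqrt
from statistics import median


def distance(u, v):
    """Euclidean distance between two equal-length environment vectors.

    Assumption: the source does not state which metric is used.
    """
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def predict_shift(query, library):
    """Return the shift (ppm) of the library carbon closest to `query`.

    `library` is a list of (environment_vector, shift) pairs.
    """
    _, shift = min(library, key=lambda entry: distance(query, entry[0]))
    return shift


def mean_abs_dev(predicted, actual):
    """Mean absolute deviation between predicted and actual shifts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def median_of_mads(structures, library):
    """Median over structures of the per-structure mean absolute deviation.

    `structures` is a list of structures, each a list of
    (environment_vector, actual_shift) pairs, one pair per carbon.
    """
    mads = []
    for carbons in structures:
        predicted = [predict_shift(vec, library) for vec, _ in carbons]
        actual = [shift for _, shift in carbons]
        mads.append(mean_abs_dev(predicted, actual))
    return median(mads)


# Toy data (made up for illustration, not taken from the paper).
library = [
    ((1.0, 2.0, 0.5), 14.1),
    ((3.0, 1.0, 0.0), 128.5),
    ((2.0, 2.5, 1.0), 60.2),
]
test_structures = [
    [((1.1, 2.0, 0.4), 15.0), ((2.9, 1.2, 0.1), 127.5)],
    [((2.0, 2.4, 1.0), 60.0)],
    [((1.0, 2.0, 0.5), 13.1)],
]

if __name__ == "__main__":
    print(median_of_mads(test_structures, library))
```

Reporting the median of the per-structure mean absolute deviations, rather than one overall mean, keeps a few badly predicted structures from dominating the summary figure.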