CSRML format
The Chemical Subgraph Representation Markup Language (CSRML) is an XML-based language for encoding molecular subgraph patterns. pyCSRML parses the two fingerprint definition files shipped with the ChemoTyper tool: ToxPrint v2.0 (729 bits) and TxP_PFAS v1.0.4 (129 bits).
File structure
A CSRML file contains:
<csrml id="..." version="..." csrmlVersion="2">
<title>...</title>
<description>...</description>
<subgraphs>
<subgraph id="txp-001" label="bond:CC_...">
<subgraphMolecule feature="substructureMatch">
<molecule feature="substructureMatch">
<atomArray> ... </atomArray>
<bondArray> ... </bondArray>
</molecule>
<molecule feature="substructureException"> ... </molecule>
</subgraphMolecule>
</subgraph>
...
</subgraphs>
<classes> ... </classes>
</csrml>
Each <subgraph> defines one fingerprint bit. The label attribute becomes
the bit name.
Atom features (matchIf)
Atoms within a CSRML pattern carry <matchIf> child elements that refine the
query beyond a simple element match:
CSRML feature |
SMARTS mapping |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
recursive OR-of-AND tree (see below) |
Pseudo-elements
CSRML defines several symbolic element codes that are not in the periodic table:
Symbol |
SMARTS |
Meaning |
|---|---|---|
|
|
any atom (wildcard) |
|
|
heteroatom (non-C, non-H) |
|
|
heteroatom (synonym of Q in TxP_PFAS) |
|
|
any halogen (F, Cl, Br, I) |
|
|
metal / metalloid (B, Si, Ge, As, Sb, Te) |
|
derived from |
complex query atom (see below) |
combineAtomFeatures (QRY atoms)
Some atoms (labelled QRY) use a recursive combineAtomFeatures structure
that encodes an OR-of-AND tree:
<matchIf feature="combineAtomFeatures" combineBy="or">
<matchIf feature="combineAtomFeatures" combineBy="and">
<matchIf feature="atomList" value="C"/>
<matchIf feature="aliphaticAtom"/>
<matchIf feature="connectivity" value="3"/>
</matchIf>
<matchIf feature="combineAtomFeatures" combineBy="and">
<matchIf feature="atomList" value="C"/>
<matchIf feature="aliphaticAtom"/>
<matchIf feature="attachedHydrogenCount" minInclusive="1"/>
</matchIf>
<!-- ... -->
</matchIf>
This maps to the SMARTS atom primitive: [#6;A;X3,#6;A;!H0,...].
Exception handling
Fingerprint bits may have one or more substructureException molecules.
An exception molecule can either:
Stand-alone – the bit is set to 0 if the molecule also matches the exception as a global substructure query.
Anchor-linked (
matchingQueryAtom) – one atom in the exception has amatchingQueryAtomfeature referencing an atom in the main pattern. In this case the exception is folded into the main SMARTS as a[!$(...)]recursive negation on the referenced atom, encoding the constraint directly rather than as a separate pass.
Example: the bit pfas_chain:perF-linear_cap_C1_excl has the main SMARTS
[#6](-[#9])(-[#9])(-[#9])-[…] and an exception anchored at the terminal CF₃.
After folding the result is:
[#6](-[#9])(-[#9])(-[#9])-[!$([#6;A](-[#9])-[#9])]
Low-level API
from pyCSRML._csrml import parse_csrml_xml, ordered_bit_list
data = parse_csrml_xml("TxP_PFAS_v1.0.4.xml")
bits = ordered_bit_list(data)
for bit in bits[:3]:
print(bit["id"], "→", bit["smarts"])
for exc in bit["exception_smarts"]:
print(" excl:", exc)
Known limitations
The following CSRML features are recognised in the XML but cannot be fully represented as SMARTS by pyCSRML.
elSystems / piSystem / piElectronCount
CSRML v1 uses <elSystems> elements to constrain atoms by membership in a
conjugated or aromatic pi-electron system (e.g. piElectronCount ≥ 4 for
the sulfonyl O=S=O group). These elements have no direct SMARTS equivalent
and are silently ignored by the parser. In practice the explicit double-bond
topology already implied by the pattern graph produces the same constraint for
all tested compounds, so the impact on accuracy is negligible.
Fluorotelomer (FT_n*) bits
Several fluorotelomer chain-length bits (pfas_chain:FT_n1_*,
FT_n2_*, FT_n3_*) achieve 93–99 % concordance rather than 100 %.
The ChemoTyper reference implementation uses Java-side chain-length counting
that cannot be reproduced exactly with a single SMARTS query. The patterns
generated by pyCSRML are therefore conservative approximations.
Concordance summary
After all implemented fixes the overall pyCSRML accuracy across 14 710 PFAS
compounds from Richard et al. (2023) is 99.44 %, with all 129 bits
reaching ≥ 90 % accuracy. The worst single bit is
pfas_chain:FT_n2_hetero at 93.7 %.