Custom fingerprints: JSON and YAML format
In addition to CSRML XML files, the Fingerprinter accepts
fingerprint definitions written in JSON or YAML. This lets you:
- Distribute pre-built fingerprint specs without requiring the XML parser.
- Craft lightweight custom fingerprint sets by hand.
- Share fingerprint definitions that combine patterns from multiple sources.
The format is intentionally simple: a list of bit descriptors, each containing an id, a human-readable label, a SMARTS pattern, and an optional list of exception SMARTS.
File format
Both JSON and YAML use the same schema. Two layouts are accepted:
Wrapper dict (recommended) — includes metadata:
{
  "id": "my-fingerprints",
  "title": "My custom chemotype fingerprints",
  "bits": [
    {
      "id": "fp-001",
      "label": "bond:CF_monofluoro",
      "smarts": "[#6]-[#9]",
      "exception_smarts": []
    },
    {
      "id": "fp-002",
      "label": "chain:CF2_gem-difluoro",
      "smarts": "[#6](-[#9])-[#9]",
      "exception_smarts": []
    }
  ]
}
Plain list — bits only, no metadata:
[
  {
    "id": "fp-001",
    "label": "bond:CF_monofluoro",
    "smarts": "[#6]-[#9]",
    "exception_smarts": []
  }
]
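The two layouts differ only in whether the bit list is wrapped in a metadata dict. The sketch below shows one way a reader could normalize both into a single shape before further processing. `normalize_spec` is a hypothetical helper written for illustration; it is not part of pyCSRML, and filling missing metadata with `None` is an assumption of this sketch rather than documented library behaviour.

```python
import json


def normalize_spec(raw):
    """Return a wrapper-style dict for either accepted layout.

    Hypothetical helper for illustration, not part of pyCSRML.
    """
    if isinstance(raw, list):
        # Plain list: bits only, so there is no metadata to carry over.
        return {"id": None, "title": None, "bits": raw}
    if isinstance(raw, dict) and isinstance(raw.get("bits"), list):
        # Wrapper dict: keep whatever metadata is present.
        return {"id": raw.get("id"), "title": raw.get("title"), "bits": raw["bits"]}
    raise ValueError("expected a list of bits or a dict with a 'bits' list")


plain = json.loads(
    '[{"id": "fp-001", "label": "bond:CF_monofluoro",'
    ' "smarts": "[#6]-[#9]", "exception_smarts": []}]'
)
wrapped = {
    "id": "my-fingerprints",
    "title": "My custom chemotype fingerprints",
    "bits": plain,
}

# Both layouts end up exposing the same list of bit descriptors.
same_bits = normalize_spec(plain)["bits"] == normalize_spec(wrapped)["bits"]
```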
Both layouts are also valid YAML. For example, the same spec in YAML:
id: my-fingerprints
title: My custom chemotype fingerprints
bits:
  - id: fp-001
    label: bond:CF_monofluoro
    smarts: "[#6]-[#9]"
    exception_smarts: []
  - id: fp-002
    label: chain:CF2_gem-difluoro
    smarts: "[#6](-[#9])-[#9]"
    exception_smarts: []
Field reference
| Field | Required | Description |
|---|---|---|
| `id` | yes | Unique identifier for the bit (e.g. `fp-001`). |
| `label` | yes | Human-readable bit name returned by `fingerprint()`. |
| `smarts` | yes | RDKit-compatible SMARTS pattern for the primary substructure match. A bit is set if the molecule contains this substructure and none of the exception SMARTS match. |
| `exception_smarts` | no | List of SMARTS strings. When any of these match the molecule, the bit is forced to 0 even if the main SMARTS matched. Defaults to `[]`. |
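The matching rule in the field reference (main SMARTS must match, no exception SMARTS may match) can be sketched in a few lines. This is an illustration of the rule as described, not pyCSRML's actual implementation. `bit_is_set` and `rdkit_has_match` are hypothetical names, and treating a missing `exception_smarts` field as an empty list is an assumption of this sketch.

```python
def bit_is_set(mol, bit, has_match):
    """Evaluate one bit descriptor against a molecule.

    `has_match(mol, smarts)` is any callable that returns True when the
    SMARTS pattern matches. Hypothetical helper, not pyCSRML's code.
    """
    if not has_match(mol, bit["smarts"]):
        return 0
    # Assumption: a missing exception list behaves like an empty one.
    exceptions = bit.get("exception_smarts") or []
    if any(has_match(mol, exc) for exc in exceptions):
        # An exception pattern vetoes the bit even though the main SMARTS matched.
        return 0
    return 1


def rdkit_has_match(mol, smarts):
    """SMARTS matcher backed by RDKit (imported lazily, so RDKit is optional here)."""
    from rdkit import Chem

    pattern = Chem.MolFromSmarts(smarts)
    if pattern is None:
        raise ValueError(f"invalid SMARTS: {smarts!r}")
    return mol.HasSubstructMatch(pattern)
```

With RDKit installed, a call such as `bit_is_set(Chem.MolFromSmiles("FC(F)(F)CCO"), bit, rdkit_has_match)` evaluates a single descriptor. Passing the matcher as a parameter keeps the rule itself independent of the cheminformatics backend.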
Loading a custom file
from pyCSRML import Fingerprinter
from rdkit import Chem
# JSON
fp = Fingerprinter("my_fingerprints.json")
# YAML (requires pyyaml: pip install pyyaml)
fp = Fingerprinter("my_fingerprints.yaml")
mol = Chem.MolFromSmiles("FC(F)(F)CCO")
arr, names = fp.fingerprint(mol)
print([n for n, b in zip(names, arr) if b])
The bundled pre-built JSON caches for ToxPrint and TxP_PFAS can also be loaded directly, bypassing the XML parser entirely:
import importlib.resources
from pyCSRML import Fingerprinter
from rdkit import Chem
data_dir = importlib.resources.files("pyCSRML") / "data"
fp = Fingerprinter(str(data_dir / "TxP_PFAS_v1.0.4.json"))
mol = Chem.MolFromSmiles("FC(F)(F)C(F)(F)C(=O)O")
arr, names = fp.fingerprint(mol)
Exporting the bundled specs to JSON
The pyCSRML.convert_xml_to_json script regenerates the bundled JSON
caches from the XML sources. You can also export any XML file yourself:
import json
from pyCSRML._csrml import parse_csrml_xml, ordered_bit_list
data = parse_csrml_xml("my_fingerprints.xml")
# Collect one descriptor per bit, in the order given by ordered_bit_list
bits = []
for bit_id in ordered_bit_list(data):
    sg = data["subgraph_index"][bit_id]
    bits.append({
        "id": sg["id"],
        "label": sg["label"],
        "smarts": sg["smarts"],
        "exception_smarts": sg["exception_smarts"],
    })
# Wrap the bits with the file-level metadata and write the JSON spec
spec = {"id": data["id"], "title": data["title"], "bits": bits}
with open("my_fingerprints.json", "w") as f:
    json.dump(spec, f, indent=2)
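Before sharing an exported or hand-written file, it can help to check the bit list against the field reference above. The following is a minimal sketch of such a check using only the standard library. `check_bits` is a hypothetical helper, not part of pyCSRML, and it does not validate SMARTS syntax.

```python
from collections import Counter

# Required keys, taken from the field reference.
REQUIRED_KEYS = ("id", "label", "smarts")


def check_bits(bits):
    """Return a list of human-readable problems; an empty list means none found.

    Hypothetical helper for illustration, not part of pyCSRML.
    """
    problems = []
    for index, bit in enumerate(bits):
        for key in REQUIRED_KEYS:
            if not bit.get(key):
                problems.append(f"bit {index}: missing required field '{key}'")
        # exception_smarts is optional, but must be a list when present.
        if not isinstance(bit.get("exception_smarts", []), list):
            problems.append(f"bit {index}: 'exception_smarts' must be a list")
    # Ids are documented as unique, so report any that repeat.
    id_counts = Counter(bit["id"] for bit in bits if bit.get("id"))
    for bit_id, count in id_counts.items():
        if count > 1:
            problems.append(f"id '{bit_id}' is used {count} times")
    return problems
```

Calling `check_bits(spec["bits"])` on the `spec` built in the export example would return an empty list when every bit has the required fields and a unique id.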
YAML installation
YAML support requires PyYAML, which is not installed by default:
pip install pyyaml
# or, with the yaml extra:
pip install "pyCSRML[yaml]"
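Because PyYAML is optional, code that reads spec files directly may want to import it only when a YAML file is actually requested. The sketch below shows that pattern under the assumption that the parser is chosen from the file extension. `read_spec_file` is a hypothetical helper; pyCSRML's own loader may behave differently.

```python
import json
from pathlib import Path


def read_spec_file(path):
    """Parse a spec file, choosing the parser from the file extension.

    Hypothetical helper for illustration, not part of pyCSRML.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    suffix = path.suffix.lower()
    if suffix == ".json":
        return json.loads(text)
    if suffix in (".yaml", ".yml"):
        try:
            import yaml  # provided by the PyYAML package
        except ImportError as exc:
            # Point the user at the install command instead of a bare failure.
            raise ImportError("YAML specs need PyYAML: pip install pyyaml") from exc
        return yaml.safe_load(text)
    raise ValueError(f"unsupported file extension: {suffix!r}")
```

JSON files load with the standard library alone, so users who never touch YAML do not need the extra dependency.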