Fingerprinter API

ToxPrint fingerprinter: compute binary chemotype fingerprints for molecules using ToxPrint v2.0 (729 bits) or TxP_PFAS v1.0 (129 bits) definitions.

Usage:

from pyCSRML.fingerprinter import Fingerprinter, TOXPRINT_PATH, TXPPFAS_PATH
from rdkit import Chem

fp = Fingerprinter(TOXPRINT_PATH)      # loads the bundled ToxPrint v2.0 definition
mol = Chem.MolFromSmiles("c1ccccc1")
arr, names = fp.fingerprint(mol)       # numpy bool array + list of bit names

fp_pfas = Fingerprinter(TXPPFAS_PATH)  # loads the bundled TxP_PFAS definition
arr_pfas, names_pfas = fp_pfas.fingerprint(mol)

Pattern matching strategy

Each chemotype is defined by:
  1. A primary SMARTS pattern (substructureMatch molecule)

  2. Zero or more exception SMARTS patterns (substructureException molecules)

A fingerprint bit is set to 1 if:
  • The molecule contains a substructure match for the primary pattern, AND

  • The molecule does NOT contain a substructure match for any exception pattern (exception patterns are only applied when the exception molecule contains matchingQueryAtom cross-references to the main pattern; otherwise the exception acts as a global exclusion)

Note: The exception logic is a reasonable approximation; the original ChemoTyper tool may produce slightly different results for edge cases.
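The rule above can be sketched in a few lines. This is a minimal illustration of the global-exclusion case only, not the library's implementation: bit_is_set, toy_matcher, and the substring-based matching are hypothetical stand-ins, and the matchingQueryAtom cross-reference handling is not modeled. With RDKit, the has_match callable could instead wrap Chem.MolFromSmarts and Mol.HasSubstructMatch.

```python
from typing import Callable, Iterable


def bit_is_set(
    has_match: Callable[[str], bool],
    primary: str,
    exceptions: Iterable[str] = (),
) -> bool:
    """Return True when the primary pattern matches and no exception matches.

    Hypothetical helper: `has_match` answers "does the molecule contain this
    pattern?" and is injected so the logic stays independent of any toolkit.
    """
    if not has_match(primary):
        return False
    # Global exclusion: a single matching exception clears the bit.
    return not any(has_match(exc) for exc in exceptions)


def toy_matcher(smiles: str) -> Callable[[str], bool]:
    """Toy stand-in for substructure search: plain substring test on SMILES."""
    return lambda pattern: pattern in smiles


acid = toy_matcher("CC(=O)O")
print(bit_is_set(acid, "C(=O)O"))          # primary matches, no exceptions
print(bit_is_set(acid, "C(=O)O", ["CC"]))  # an exception also matches
print(bit_is_set(acid, "N"))               # primary does not match
```

Injecting the matcher keeps the AND / NOT logic easy to test separately from SMARTS compilation.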

pyCSRML.fingerprinter.TOXPRINT_PATH: Path = PosixPath('.../pyCSRML/data/toxprint_V2.0_r711.json')

Path to the bundled ToxPrint v2.0 JSON fingerprint definition. Pass this to Fingerprinter to load ToxPrint instantly:

fp = Fingerprinter(TOXPRINT_PATH)

pyCSRML.fingerprinter.TXPPFAS_PATH: Path = PosixPath('.../pyCSRML/data/TxP_PFAS_v1.0.4.json')

Path to the bundled TxP_PFAS v1.0.4 JSON fingerprint definition. Pass this to Fingerprinter to load TxP_PFAS instantly:

fp = Fingerprinter(TXPPFAS_PATH)

class pyCSRML.fingerprinter.Fingerprinter(source, json_cache=None, verbose=False)

Bases: object

Compute binary chemotype fingerprints from a CSRML fingerprint definition.

The definition file can be in any of these formats:

  • XML (.xml) — a CSRML XML file (ToxPrint v2 or TxP_PFAS). The parser converts the subgraph patterns to SMARTS on the fly. An optional JSON cache speeds up subsequent loads.

  • JSON (.json) — a pre-built spec file (see Custom fingerprints: JSON and YAML format for the schema).

  • YAML (.yaml / .yml) — same schema as JSON but in YAML syntax. Requires pyyaml.
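The three accepted formats are distinguished by file extension. The following sketch shows one way such a dispatch could look; detect_format and the suffix table are hypothetical names, and the case-insensitive comparison is an assumption rather than documented behavior.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to loader kind.
_SUFFIX_TO_FORMAT = {
    ".xml": "xml",
    ".json": "json",
    ".yaml": "yaml",
    ".yml": "yaml",
}


def detect_format(source) -> str:
    """Return 'xml', 'json', or 'yaml' based on the suffix of `source`."""
    suffix = Path(source).suffix.lower()  # assumption: suffix case is ignored
    if suffix not in _SUFFIX_TO_FORMAT:
        raise ValueError(f"Unsupported fingerprint definition format: {suffix!r}")
    return _SUFFIX_TO_FORMAT[suffix]


print(detect_format("toxprint_V2.0_r711.json"))
print(detect_format("my_chemotypes.yml"))
```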

Parameters:
  • source (Union[str, Path]) – Path to the fingerprint definition file (.xml, .json, .yaml, or .yml).

  • json_cache (Union[str, Path, None]) – Path to a JSON cache file. Only used when source is an XML file. If the cache is newer than the XML, it is loaded directly (faster).

  • verbose (bool) – If True, emit a warning for every pattern that fails to compile.
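The json_cache rule ("if the cache is newer than the XML, it is loaded directly") amounts to a modification-time comparison. A minimal sketch, assuming file modification times are what is compared; cache_is_fresh is a hypothetical helper, not part of the pyCSRML API.

```python
from pathlib import Path


def cache_is_fresh(xml_path, cache_path) -> bool:
    """Return True if the cache file exists and is newer than the XML file."""
    xml_path, cache_path = Path(xml_path), Path(cache_path)
    if not cache_path.is_file():
        return False  # no cache yet: the XML has to be parsed
    # Assumption: "newer" means a strictly later modification time.
    return cache_path.stat().st_mtime > xml_path.stat().st_mtime
```

When a check like this returns False, the XML would be parsed again and the cache rewritten, so editing the XML invalidates a stale cache.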

property n_bits: int

Number of fingerprint bits.

property bit_names: list[str]

Ordered list of bit labels (one per bit).

property bit_ids: list[str]

Ordered list of bit IDs (original subgraph IDs).

property title: str

fingerprint(mol)

Compute the binary fingerprint for a molecule.

Parameters:

mol (rdkit.Chem.Mol) – An RDKit molecule object (must be pre-sanitized).

Return type:

tuple[ndarray, list[str]]

Returns:

  • array (numpy.ndarray of dtype bool) – Binary fingerprint vector of length n_bits.

  • names (list[str]) – Corresponding bit labels.
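Because the array and the names list are aligned position by position, the chemotypes present in a molecule can be listed by pairing them up. A small sketch: active_bits is a hypothetical helper, and the bit names below are made-up placeholders rather than real ToxPrint labels. A plain list of booleans stands in for the numpy array, which iterates the same way.

```python
from typing import Iterable, Sequence


def active_bits(array: Iterable[bool], names: Sequence[str]) -> list[str]:
    """Return the labels of all bits that are set in the fingerprint."""
    return [name for bit, name in zip(array, names) if bit]


# Stand-in values; in practice these come from fp.fingerprint(mol).
arr = [True, False, True]
names = ["bit_a", "bit_b", "bit_c"]  # placeholder labels
print(active_bits(arr, names))  # ['bit_a', 'bit_c']
```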

fingerprint_smiles(smiles)

Compute fingerprint from a SMILES string.

Returns an all-zeros array if the SMILES string is invalid.

Return type:

tuple[ndarray, list[str]]

Parameters:

smiles (str)

fingerprint_batch(mols, smiles_list=None)

Compute fingerprints for a list of molecules (or SMILES strings).

Parameters:
  • mols (iterable of rdkit.Chem.Mol or None) – If None is passed for a molecule, zeros are used.

  • smiles_list (Optional[list[str]]) – If provided, mols is ignored and this list of SMILES is used instead.

Returns:

matrix

Return type:

ndarray