1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34 """FMCS - Find Maximum Common Substructure
35
36 This software finds the maximum common substructure of a set of
37 structures and reports it as a SMARTS strings.
38
39 This implements what I think is a new algorithm for the MCS problem.
40 The core description is:
41
42 best_substructure = None
43 pick one structure as the query, and other as the targets
44 for each substructure in the query graph:
45 convert it to a SMARTS string based on the desired match properties
46 if the SMARTS pattern exists in all of the targets:
47 then this is a common substructure
48 keep track of the maximum such common structure,
49
50 The SMARTS string depends on the desired match properties. For
51 example, if ring atoms are only allowed to match ring atoms then an
52 aliphatic ring carbon in the query is converted to the SMARTS "[C;R]",
53 and the double-bond ring bond converted to "=;@" while the respectice
54 chain-only version are "[C;!R]" and "=;!@".
55
56 The algorithm I outlined earlier will usually take a long time. There
57 are several ways to speed it up.
58
59 == Bond elimination ==
60
61 As the first step, remove bonds which obviously cannot be part of the
62 MCS.
63
64 This requires atom and bond type information, which I store as SMARTS
65 patterns. A bond can only be in the MCS if its canonical bond type is
66 present in all of the structures. A bond type is string made of the
67 SMARTS for one atom, the SMARTS for the bond, and the SMARTS for the
68 other atom. The canonical bond type is the lexographically smaller of
69 the two possible bond types for a bond.
70
71 The atom and bond SMARTS depend on the type comparison used.
72
73 The "ring-matches-ring-only" option adds an "@" or "!@" to the bond
74 SMARTS, so that the canonical bondtype for "C-C" becomes [#6]-@[#6] or
75 [#6]-!@[#6] if the bond is in a ring or not in a ring, and if atoms
76 are compared by element and bonds are compared by bondtype. (This
77 option does not add "R" or "!R" to the atom SMARTS because there
78 should be a single bond in the MCS of c1ccccc1O and CO.)
79
80 The result of all of this atom and bond typing is a "TypedMolecule"
81 for each input structure.
82
83 I then find which canonical bondtypes are present in all of the
84 structures. I convert each TypedMolecule into a
85 FragmentedTypedMolecule which has the same atom information but only
86 those bonds whose bondtypes are in all of the structures. This can
87 break a structure into multiple, disconnected fragments, hence the
88 name.
89
90 (BTW, I would like to use the fragmented molecules as the targets
91 because I think the SMARTS match would go faster, but the RDKit SMARTS
92 matcher doesn't like them. I think it's because the new molecule
93 hasn't been sanitized and the underlying data structure the ring
94 information doesn't exist. Instead, I use the input structures for the
95 SMARTS match.)
96
97 == Use the structure with the smallest largest fragment as the query ==
98 == and sort the targets by the smallest largest fragment ==
99
100 I pick one of the FragmentedTypedMolecule instances as the source of
101 substructure enumeration. Which one?
102
103 My heuristic is to use the one with the smallest largest fragment.
104 Hopefully it produces the least number of subgraphs, but that's also
105 related to the number of rings, so a large linear graph will product
106 fewer subgraphs than a small fused ring system. I don't know how to
107 quantify that.
108
109 For each of the fragmented structures, I find the number of atoms in
110 the fragment with the most atoms, and I find the number of bonds in
111 the fragment with the most bonds. These might not be the same
112 fragment.
113
114 I sort the input structures by the number of bonds in the largest
115 fragment, with ties broken first on the number of atoms, and then on
116 the input order. The smallest such structure is the query structure,
117 and the remaining are the targets.
118
119 == Use a breadth-first search and a priority queue to ==
120 == enumerate the fragment subgraphs ==
121
122 I extract each of the fragments from the FragmentedTypedMolecule into
123 a TypedFragment, which I use to make an EnumerationMolecule. An
124 enumeration molecule contains a pair of directed edges for each atom,
125 which simplifies the enumeration algorithm.
126
127 The enumeration algorithm is based around growing a seed. A seed
128 contains the current subgraph atoms and bonds as well as an exclusion
129 set of bonds which cannot be used for future grown. The initial seed
130 is the first bond in the fragment, which may potentially grow to use
131 the entire fragment. The second seed is the second bond in the
132 fragment, which is excluded from using the first bond in future
133 growth. The third seed starts from the third bond, which may not use
134 the first or second bonds during growth, and so on.
135
136
137 A seed can grow along bonds connected to an atom in the seed but which
138 aren't already in the seed and aren't in the set of excluded bonds for
139 the seed. If there are no such bonds then subgraph enumeration ends
140 for this fragment. Given N bonds there are 2**N-1 possible ways to
141 grow, which is just the powerset of the available bonds, excluding the
142 no-growth case.
143
144 This breadth-first growth takes into account all possibilties of using
145 the available N bonds so all of those bonds are added to the exclusion
146 set of the newly expanded subgraphs.
147
148 For performance reasons, the bonds used for growth are separated into
149 'internal' bonds, which connect two atoms already in the subgraph, and
150 'external' bonds, which lead outwards to an atom not already in the
151 subgraph.
152
153 Each seed growth can add from 0 to N new atoms and bonds. The goal is
154 to maximize the subgraph size so the seeds are stored in a priority
155 queue, ranked so the seed with the most bonds is processed first. This
156 turns the enumeration into something more like a depth-first search.
157
158
159 == Prune seeds which aren't found in all of the structures ==
160
161 At each stage of seed growth I check that the new seed exists in all
162 of the original structures. (Well, all except the one which I
163 enumerate over in the first place; by definition that one will match.)
164 If it doesn't match then there's no reason to include this seed or any
165 larger seeds made from it.
166
167 The check is easy; I turn the subgraph into its corresponding SMARTS
168 string and use RDKit's normal SMARTS matcher to test for a match.
169
170 There are three ways to generate a SMARTS string: 1) arbitrary, 2)
171 canonical, 3) hybrid.
172
173 I have not tested #1. During most of the development I assumed that
174 SMARTS matches across a few hundred structures would be slow, so that
175 the best solution is to generate a *canonical* SMARTS and cache the
176 match information.
177
178 Well, it turns out that my canonical SMARTS match code takes up most
179 of the FMCS run-time. If I drop the canonicalization step then the
180 code averages about 5-10% faster. This isn't the same as #1 - I still
181 do the initial atom assignment based on its neighborhood, which is
182 like a circular fingerprint of size 2 and *usually* gives a consistent
183 SMARTS pattern, which I can then cache.
184
185 However, there are times when the non-canonical SMARTS code is slower.
186 Obviously one is if there are a lot of structures, and another if is
187 there is a lot of symmetry. I'm still working on characterizing this.
188
189
190 == Maximize atoms? or bonds? ==
191
192 The above algorithm enumerates all subgraphs of the query and
193 identifies those subgraphs which are common to all input structures.
194
195 It's trivial then to keep track of the current "best" subgraph, which
196 can defined as having the subgraph with the most atoms, or the most
197 bonds. Both of those options are implemented.
198
199 It would not be hard to keep track of all other subgraphs which are
200 the same size.
201
202 == --complete-ring-only implementation ==
203
204 The "complete ring only" option is implemented by first enabling the
205 "ring-matches-ring-only" option, as otherwise it doesn't make sense.
206
207 Second, in order to be a "best" subgraph, all bonds in the subgraph
208 which are ring bonds in the original molecule must also be in a ring
209 in the subgraph. This is handled as a post-processing step.
210
211 (Note: some possible optimizations, like removing ring bonds from
212 structure fragments which are not in a ring, are not yet implemented.)
213
214
215 == Prune seeds which have no potential for growing large enough ==
216
217 Given a seed, its set of edges available for growth, and the set of
218 excluded bonds, figure out the maximum possible growth for the seed.
219 If this maximum possible is less than the current best subgraph then
220 prune.
221
222 This requires a graph search, currently done in Python, which is a bit
223 expensive. To speed things up, I precompute some edge information.
224 That is, if I know that a given bond is a chain bond (not in a ring)
225 then I can calculate the maximum number of atoms and bonds for seed
226 growth along that bond, in either direction. However, precomputation
227 doesn't take into account the excluded bonds, so after a while the
228 predicted value is too high.
229
230 Again, I'm still working on characterizing this, and an implementation
231 in C++ would have different tradeoffs.
232 """
233
234 __version__ = "1.1"
235 __version_info = (1, 1, 0)
236
237 import sys
238
239 try:
240 from rdkit import Chem
241 from rdkit.six import next
242 from rdkit.six.moves import range
243 except ImportError:
244 sys.stderr.write("Please install RDKit from http://www.rdkit.org/\n")
245 raise
246
247 import copy
248 import itertools
249 import re
250 import weakref
251 from heapq import heappush, heappop, heapify
252 from itertools import chain, combinations, product
253 import collections
254 from collections import defaultdict
255 import time
256
257
258
259
260
270
271
272
273
274 _get_symbol = Chem.GetPeriodicTable().GetElementSymbol
275
276
277
278
279
280
282
284 value = _get_symbol(eleno)
285 self[eleno] = value
286 return value
287
288
289 _atom_smarts_no_aromaticity = AtomSmartsNoAromaticity()
290
291
292
293
294
295 for eleno in (1, 5, 6, 7, 8, 15, 16, 33, 34, 52):
296 _atom_smarts_no_aromaticity[eleno] = "#" + str(eleno)
297 assert _atom_smarts_no_aromaticity[6] == "#6"
298 assert _atom_smarts_no_aromaticity[2] == "He"
299
300
301
303 return ["*"] * len(atoms)
304
305
306
309
310
311 if hasattr(Chem.Atom, "GetIsotope"):
312
314 return ["%d*" % atom.GetIsotope() for atom in atoms]
315 else:
316
317
318
319
320
321
322
323
324
325
326
327
329 atom_smarts_types = []
330 for atom in atoms:
331 mass = atom.GetMass()
332 int_mass = int(round(mass * 1000))
333 if int_mass % 1000 == 0:
334
335 atom_smarts = "%d*" % (int_mass // 1000)
336 else:
337
338
339
340 atom.SetMass(0.0)
341 atom_smarts = "0*"
342 atom_smarts_types.append(atom_smarts)
343 return atom_smarts_types
344
345
346
348 return ["~"] * len(bonds)
349
350
351
352
354
355 bond_smarts_types = []
356 for bond in bonds:
357 bond_term = bond.GetSmarts()
358 if not bond_term:
359
360
361 if bond.GetIsAromatic():
362 bond_term = ':'
363 else:
364 bond_term = '-'
365 bond_smarts_types.append(bond_term)
366
367 return bond_smarts_types
368
369
370 atom_typers = {
371 "any": atom_typer_any,
372 "elements": atom_typer_elements,
373 "isotopes": atom_typer_isotopes,
374 }
375
376 bond_typers = {
377 "any": bond_typer_any,
378 "bondtypes": bond_typer_bondtypes,
379 }
380 default_atom_typer = atom_typers[Default.atomCompare]
381 default_bond_typer = bond_typers[Default.bondCompare]
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400 if hasattr(Chem.Atom, "GetIsotope"):
401
403 return [atom.GetIsotope() for atom in mol.GetAtoms()]
404
406 if mol.GetNumAtoms() != len(isotopes):
407 raise ValueError("Mismatch between the number of atoms and the number of isotopes")
408 for atom, isotope in zip(mol.GetAtoms(), isotopes):
409 atom.SetIsotope(isotope)
410
411 else:
412
414 return [atom.GetMass() for atom in mol.GetAtoms()]
415
417 if mol.GetNumAtoms() != len(isotopes):
418 raise ValueError("Mismatch between the number of atoms and the number of isotopes")
419 for atom, isotope in zip(mol.GetAtoms(), isotopes):
420 atom.SetMass(isotope)
421
422
423 _isotope_dict = weakref.WeakKeyDictionary()
424 _atom_class_dict = weakref.WeakKeyDictionary()
425
426
429
430
433
434
436 atom_classes = _atom_class_dict.get(mol, None)
437 if atom_classes is None:
438 return None
439 return [atom_classes[index] for index in atom_indices]
440
441
448
449
451 try:
452 atom_classes = mol.GetProp(atom_class_tag)
453 except KeyError:
454 raise ValueError("Missing atom class tag %r" % (atom_class_tag, ))
455 fields = atom_classes.split()
456 if len(fields) != mol.GetNumAtoms():
457 raise ValueError(
458 "Mismatch between the number of atoms (#%d) and the number of atom classes (%d)" % (
459 mol.GetNumAtoms(), len(fields)))
460 new_isotopes = []
461 for field in fields:
462 if not field.isdigit():
463 raise ValueError("Atom class %r from tag %r must be a number" % (field, atom_class_tag))
464 isotope = int(field)
465 if not (1 <= isotope <= 10000):
466 raise ValueError("Atom class %r from tag %r must be in the range 1 to 10000" %
467 (field, atom_class_tag))
468 new_isotopes.append(isotope)
469
470 save_isotopes(mol, get_isotopes(mol))
471 save_atom_classes(mol, new_isotopes)
472 set_isotopes(mol, new_isotopes)
473
474
475
476
477
478
479
480
481
482
483
485
486 - def __init__(self, rdmol, rdmol_atoms, rdmol_bonds, atom_smarts_types, bond_smarts_types,
487 canonical_bondtypes):
488 self.rdmol = rdmol
489
490
491
492
493
494
495 self.rdmol_atoms = rdmol_atoms
496 self.rdmol_bonds = rdmol_bonds
497
498
499 self.atom_smarts_types = atom_smarts_types
500 self.bond_smarts_types = bond_smarts_types
501
502
503 self.canonical_bondtypes = canonical_bondtypes
504
505
506
507
508
509
510
511
512
513
514
515
516
518
519 - def __init__(self, rdmol, rdmol_atoms, orig_atoms, orig_bonds, atom_smarts_types,
520 bond_smarts_types, canonical_bondtypes):
521 self.rdmol = rdmol
522 self.rdmol_atoms = rdmol_atoms
523 self.orig_atoms = orig_atoms
524 self.orig_bonds = orig_bonds
525
526 self.atom_smarts_types = atom_smarts_types
527 self.bond_smarts_types = bond_smarts_types
528
529
530 self.canonical_bondtypes = canonical_bondtypes
531
532
533
534
535
536
538
539 - def __init__(self, rdmol, orig_atoms, orig_bonds, atom_smarts_types, bond_smarts_types,
540 canonical_bondtypes):
541 self.rdmol = rdmol
542 self.orig_atoms = orig_atoms
543 self.orig_bonds = orig_bonds
544 self.atom_smarts_types = atom_smarts_types
545 self.bond_smarts_types = bond_smarts_types
546 self.canonical_bondtypes = canonical_bondtypes
547
548
549
550
551
552
553
555 canonical_bondtypes = []
556 for bond, bond_smarts in zip(bonds, bond_smarts_types):
557 atom1_smarts = atom_smarts_types[bond.GetBeginAtomIdx()]
558 atom2_smarts = atom_smarts_types[bond.GetEndAtomIdx()]
559 if atom1_smarts > atom2_smarts:
560 atom1_smarts, atom2_smarts = atom2_smarts, atom1_smarts
561 canonical_bondtypes.append("[%s]%s[%s]" % (atom1_smarts, bond_smarts, atom2_smarts))
562 return canonical_bondtypes
563
564
565
566
567
568
571 atoms = list(rdmol.GetAtoms())
572 atom_smarts_types = atom_typer(atoms)
573
574
575 if matchValences:
576 new_atom_smarts_types = []
577 for (atom, atom_smarts_type) in zip(atoms, atom_smarts_types):
578 valence = atom.GetImplicitValence() + atom.GetExplicitValence()
579 valence_str = "v%d" % valence
580 if "," in atom_smarts_type:
581 atom_smarts_type += ";" + valence_str
582 else:
583 atom_smarts_type += valence_str
584 new_atom_smarts_types.append(atom_smarts_type)
585 atom_smarts_types = new_atom_smarts_types
586
587
588
589 bonds = list(rdmol.GetBonds())
590 bond_smarts_types = bond_typer(bonds)
591 if ringMatchesRingOnly:
592 new_bond_smarts_types = []
593 for bond, bond_smarts in zip(bonds, bond_smarts_types):
594 if bond.IsInRing():
595 if bond_smarts == ":":
596
597 pass
598 else:
599 if "," in bond_smarts:
600 bond_smarts += ";@"
601 else:
602 bond_smarts += "@"
603 else:
604 if "," in bond_smarts:
605 bond_smarts += ";!@"
606 else:
607 bond_smarts += "!@"
608
609 new_bond_smarts_types.append(bond_smarts)
610 bond_smarts_types = new_bond_smarts_types
611
612 canonical_bondtypes = get_canonical_bondtypes(rdmol, bonds, atom_smarts_types, bond_smarts_types)
613 return TypedMolecule(rdmol, atoms, bonds, atom_smarts_types, bond_smarts_types,
614 canonical_bondtypes)
615
616
617
618
620 raise NotImplementedError("not tested!")
621
622 rdmol = copy.copy(rdmol)
623
624 atom_smarts_types = []
625 atoms = list(mol.GetAtoms())
626 for atom, atom_type in zip(atoms, atom_types):
627 atom.SetAtomicNum(0)
628 atom.SetMass(atom_type)
629 atom_term = "%d*" % (atom_type, )
630 if ringMatchesRingOnly:
631 if atom.IsInRing():
632 atom_term += "R"
633 else:
634 atom_term += "!R"
635 atom_smarts_types.append('[' + atom_term + ']')
636
637 bonds = list(rdmol.GetBonds())
638 bond_smarts_types = get_bond_smarts_types(mol, bonds, ringMatchesRingOnly)
639 canonical_bondtypes = get_canonical_bondtypes(mol, bonds, atom_smarts_types, bond_smarts_types)
640
641 return TypedMolecule(mol, atoms, bonds, atom_smarts_types, bond_smarts_types, canonical_bondtypes)
642
643
653
654
656 if num_atoms != len(atom_classes):
657 raise ValueError("mols[%d]: len(atom_classes) must be the same as the number of atoms" %
658 (molno, ))
659 for atom_class in atom_classes:
660 if not isinstance(atom_class, int):
661 raise ValueError("mols[%d]: atom_class elements must be integers" % (molno, ))
662 if not (1 <= atom_class < 1000):
663 raise ValueError("mols[%d]: atom_class elements must be in the range 1 <= value < 1000" %
664 (molno, ))
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
681 d = defaultdict(int)
682 for item in it:
683 d[item] += 1
684 return dict(d)
685
686
687
688
690 d = {}
691 for k, v1 in counts1.iteritems():
692 if k in counts2:
693 v = min(v1, counts2[k])
694 d[k] = v
695 return d
696
697
698
700 overall_counts = defaultdict(list)
701 for typed_mol in typed_mols:
702 bondtype_counts = get_counts(typed_mol.canonical_bondtypes)
703 for k, v in bondtype_counts.items():
704 overall_counts[k].append(v)
705 return overall_counts
706
707
708
709
710
711
712
714 emol = Chem.EditableMol(Chem.Mol())
715
716
717 for atom in typed_mol.rdmol_atoms:
718 emol.AddAtom(atom)
719
720
721
722 orig_bonds = []
723 new_bond_smarts_types = []
724 new_canonical_bondtypes = []
725 for bond, bond_smarts, canonical_bondtype in zip(
726 typed_mol.rdmol_bonds, typed_mol.bond_smarts_types, typed_mol.canonical_bondtypes):
727 if canonical_bondtype in supported_canonical_bondtypes:
728 orig_bonds.append(bond)
729 new_bond_smarts_types.append(bond_smarts)
730 new_canonical_bondtypes.append(canonical_bondtype)
731 emol.AddBond(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType())
732
733 new_mol = emol.GetMol()
734 return FragmentedTypedMolecule(new_mol, list(new_mol.GetAtoms()), typed_mol.rdmol_atoms,
735 orig_bonds, typed_mol.atom_smarts_types, new_bond_smarts_types,
736 new_canonical_bondtypes)
737
738
739
740
741
742
743
744
745
747 max_num_atoms = max_twice_num_bonds = 0
748 for atom_indices in Chem.GetMolFrags(rdmol):
749 num_atoms = len(atom_indices)
750 if num_atoms > max_num_atoms:
751 max_num_atoms = num_atoms
752
753
754
755 twice_num_bonds = 0
756 for atom_index in atom_indices:
757
758 twice_num_bonds += sum(1 for bond in atoms[atom_index].GetBonds())
759 if twice_num_bonds > max_twice_num_bonds:
760 max_twice_num_bonds = twice_num_bonds
761
762 return max_num_atoms, max_twice_num_bonds // 2
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779 EnumerationMolecule = collections.namedtuple("Molecule", "rdmol atoms bonds directed_edges")
780 Atom = collections.namedtuple("Atom", "real_atom atom_smarts bond_indices is_in_ring")
781 Bond = collections.namedtuple("Bond",
782 "real_bond bond_smarts canonical_bondtype atom_indices is_in_ring")
783
784
785
786
787
788
789
790 DirectedEdge = collections.namedtuple("DirectedEdge", "bond_index end_atom_index")
791
792
793 Subgraph = collections.namedtuple("Subgraph", "atom_indices bond_indices")
794
795
797 rdmol = typed_mol.rdmol
798 rdmol_atoms = typed_mol.rdmol_atoms
799
800
801
802
803
804 emol = Chem.EditableMol(Chem.Mol())
805 atom_smarts_types = []
806 atom_map = {}
807 for i, atom_index in enumerate(atom_indices):
808 atom = rdmol_atoms[atom_index]
809 emol.AddAtom(atom)
810 atom_smarts_types.append(typed_mol.atom_smarts_types[atom_index])
811 atom_map[atom_index] = i
812
813
814 orig_bonds = []
815 bond_smarts_types = []
816 new_canonical_bondtypes = []
817 for bond, orig_bond, bond_smarts, canonical_bondtype in zip(
818 rdmol.GetBonds(), typed_mol.orig_bonds, typed_mol.bond_smarts_types,
819 typed_mol.canonical_bondtypes):
820 begin_atom_idx = bond.GetBeginAtomIdx()
821 end_atom_idx = bond.GetEndAtomIdx()
822 count = (begin_atom_idx in atom_map) + (end_atom_idx in atom_map)
823
824 if count == 2:
825 bond_smarts_types.append(bond_smarts)
826 new_canonical_bondtypes.append(canonical_bondtype)
827 emol.AddBond(atom_map[begin_atom_idx], atom_map[end_atom_idx], bond.GetBondType())
828 orig_bonds.append(orig_bond)
829 elif count == 1:
830 raise AssertionError("connected/disconnected atoms?")
831 return TypedFragment(emol.GetMol(),
832 [typed_mol.orig_atoms[atom_index] for atom_index in atom_indices],
833 orig_bonds, atom_smarts_types, bond_smarts_types, new_canonical_bondtypes)
834
835
837 if minNumAtoms < 2:
838 raise ValueError("minNumAtoms must be at least 2")
839
840 fragments = []
841 for atom_indices in Chem.GetMolFrags(typed_mol.rdmol):
842
843 if len(atom_indices) < minNumAtoms:
844 continue
845
846
847
848
849
850
851
852
853 typed_fragment = get_typed_fragment(typed_mol, atom_indices)
854 rdmol = typed_fragment.rdmol
855 atoms = []
856 for atom, orig_atom, atom_smarts_type in zip(rdmol.GetAtoms(), typed_fragment.orig_atoms,
857 typed_fragment.atom_smarts_types):
858 bond_indices = [bond.GetIdx() for bond in atom.GetBonds()]
859
860 atom_smarts = '[' + atom_smarts_type + ']'
861 atoms.append(Atom(atom, atom_smarts, bond_indices, orig_atom.IsInRing()))
862
863 directed_edges = collections.defaultdict(list)
864 bonds = []
865 for bond_index, (bond, orig_bond, bond_smarts, canonical_bondtype) in enumerate(
866 zip(rdmol.GetBonds(), typed_fragment.orig_bonds, typed_fragment.bond_smarts_types,
867 typed_fragment.canonical_bondtypes)):
868 atom_indices = [bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()]
869 bonds.append(Bond(bond, bond_smarts, canonical_bondtype, atom_indices, orig_bond.IsInRing()))
870
871 directed_edges[atom_indices[0]].append(DirectedEdge(bond_index, atom_indices[1]))
872 directed_edges[atom_indices[1]].append(DirectedEdge(bond_index, atom_indices[0]))
873
874 fragment = EnumerationMolecule(rdmol, atoms, bonds, dict(directed_edges))
875 fragments.append(fragment)
876
877
878 fragments.sort(key=lambda fragment: len(fragment.atoms), reverse=True)
879 return fragments
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
900 d = defaultdict(list)
901 q = 2
902 while 1:
903 if q not in d:
904 yield q
905 d[q * q].append(q)
906 else:
907 for p in d[q]:
908 d[p + q].append(p)
909 del d[q]
910 q += 1
911
912
913 _prime_stream = gen_primes()
914
915
916 _primes = []
917
918
926
927
928 _get_nth_prime(1000)
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
946
947
948 __slots__ = ["index", "atom_smarts", "value", "neighbors", "rank", "outgoing_edges"]
949
950 - def __init__(self, index, atom_smarts):
951 self.index = index
952 self.atom_smarts = atom_smarts
953 self.value = 0
954 self.neighbors = []
955 self.rank = 0
956 self.outgoing_edges = []
957
958
959
960 OutgoingEdge = collections.namedtuple(
961 "OutgoingEdge", "from_atom_index bond_index bond_smarts other_node_idx other_node")
962
963
964
965
966
969
970
971
972
973
974
975 atom_map = {}
976
977 cangen_nodes = []
978 atoms = enumeration_mol.atoms
979 canonical_labels = []
980 for i, atom_index in enumerate(subgraph.atom_indices):
981 atom_map[atom_index] = i
982 cangen_nodes.append(CangenNode(i, atoms[atom_index].atom_smarts))
983 canonical_labels.append([])
984
985
986
987 for bond_index in subgraph.bond_indices:
988 bond = enumeration_mol.bonds[bond_index]
989 from_atom_index, to_atom_index = bond.atom_indices
990 from_subgraph_atom_index = atom_map[from_atom_index]
991 to_subgraph_atom_index = atom_map[to_atom_index]
992
993 from_node = cangen_nodes[from_subgraph_atom_index]
994 to_node = cangen_nodes[to_subgraph_atom_index]
995 from_node.neighbors.append(to_node)
996 to_node.neighbors.append(from_node)
997
998 canonical_bondtype = bond.canonical_bondtype
999 canonical_labels[from_subgraph_atom_index].append(canonical_bondtype)
1000 canonical_labels[to_subgraph_atom_index].append(canonical_bondtype)
1001
1002 from_node.outgoing_edges.append(
1003 OutgoingEdge(from_subgraph_atom_index, bond_index, bond.bond_smarts, to_subgraph_atom_index,
1004 to_node))
1005 to_node.outgoing_edges.append(
1006 OutgoingEdge(to_subgraph_atom_index, bond_index, bond.bond_smarts, from_subgraph_atom_index,
1007 from_node))
1008
1009 if do_initial_assignment:
1010
1011
1012 for atom_index, node, canonical_label in zip(subgraph.atom_indices, cangen_nodes,
1013 canonical_labels):
1014
1015
1016
1017
1018
1019 canonical_label.sort()
1020 canonical_label.append(atoms[atom_index].atom_smarts)
1021 label = "\n".join(canonical_label)
1022
1023
1024
1025
1026
1027
1028 node.value = atom_assignment[label]
1029
1030 return cangen_nodes
1031
1032
1033
1035 rank = 0
1036 prev_value = -1
1037 for node in cangen_nodes:
1038 if node.value != prev_value:
1039 rank += 1
1040 prev_value = node.value
1041 node.rank = rank
1042
1043
1044
1045
1047 result = []
1048 prev_value = -1
1049 count = 0
1050 for index in range(start, end):
1051 node = cangen_nodes[index]
1052 if node.value == prev_value:
1053 count += 1
1054 else:
1055 if count > 1:
1056
1057 result.append((start, index))
1058 count = 1
1059 prev_value = node.value
1060 start = index
1061 if count > 1:
1062
1063 result.append((start, end))
1064 return result
1065
1066
1067
1068 -def canon(cangen_nodes):
1069
1070
1071
1072
1073 cangen_nodes.sort(key=lambda node: node.value)
1074 rerank(cangen_nodes)
1075
1076
1077 master_sort_order = cangen_nodes[:]
1078
1079
1080 duplicates = find_duplicates(cangen_nodes, 0, len(cangen_nodes))
1081
1082 PRIMES = _primes
1083
1084 while duplicates:
1085
1086 for node in cangen_nodes:
1087 try:
1088 node.value = PRIMES[node.rank]
1089 except IndexError:
1090 node.value = _get_nth_prime(node.rank)
1091 for node in cangen_nodes:
1092
1093
1094 p = 1
1095 for neighbor in node.neighbors:
1096 p *= neighbor.value
1097 node.value = p
1098
1099
1100
1101
1102 cangen_nodes.sort(key=lambda node: node.value)
1103
1104
1105 rerank(cangen_nodes)
1106
1107
1108 new_duplicates = []
1109 unchanged = True
1110 for (start, end) in duplicates:
1111
1112
1113
1114 if start + 2 == end:
1115 node1, node2 = master_sort_order[start], master_sort_order[end - 1]
1116 if node1.value > node2.value:
1117 master_sort_order[start] = node2
1118 master_sort_order[end - 1] = node1
1119 else:
1120 subset = master_sort_order[start:end]
1121 subset.sort(key=lambda node: node.value)
1122 master_sort_order[start:end] = subset
1123
1124 subset_duplicates = find_duplicates(master_sort_order, start, end)
1125 new_duplicates.extend(subset_duplicates)
1126 if unchanged:
1127
1128 if not (len(subset_duplicates) == 1 and subset_duplicates[0] == (start, end)):
1129 unchanged = False
1130
1131
1132
1133 if not new_duplicates:
1134 break
1135
1136
1137 if not unchanged:
1138 duplicates = new_duplicates
1139 continue
1140
1141 duplicates = new_duplicates
1142
1143
1144 pass
1145
1146
1147
1148
1149 for node in cangen_nodes:
1150 node.value = node.rank * 2
1151
1152
1153
1154
1155 start, end = duplicates[-1]
1156 cangen_nodes[start].value -= 1
1157 if end == start + 2:
1158
1159
1160 del duplicates[-1]
1161 else:
1162
1163 duplicates[-1] = (start + 1, end)
1164 rerank(cangen_nodes)
1165
1166
1167
1168
1169 cangen_nodes.sort(key=lambda node: node.index)
1170
1171
1173 if closure < 10:
1174 return bond_smarts + str(closure)
1175 else:
1176 return bond_smarts + "%%%02d" % closure
1177
1178
1179 _available_closures = list(range(1, 101))
1180 heapify(_available_closures)
1181
1182
1183
1184
1185
1186
1187
1188
1190 start_index = 0
1191 best_rank = cangen_nodes[0].rank
1192 for i, node in enumerate(cangen_nodes):
1193 if node.rank < best_rank:
1194 best_rank = node.rank
1195 start_index = i
1196 node.outgoing_edges.sort(key=lambda edge: edge.other_node.rank)
1197
1198 visited_atoms = [0] * len(cangen_nodes)
1199 closure_bonds = set()
1200
1201
1202 stack = []
1203 atom_idx = start_index
1204 stack.extend(reversed(cangen_nodes[atom_idx].outgoing_edges))
1205 visited_atoms[atom_idx] = True
1206
1207 while stack:
1208 edge = stack.pop()
1209 if visited_atoms[edge.other_node_idx]:
1210 closure_bonds.add(edge.bond_index)
1211 else:
1212 visited_atoms[edge.other_node_idx] = 1
1213 for next_edge in reversed(cangen_nodes[edge.other_node_idx].outgoing_edges):
1214 if next_edge.other_node_idx == edge.from_atom_index:
1215
1216 continue
1217 stack.append(next_edge)
1218
1219 available_closures = _available_closures[:]
1220 unclosed_closures = {}
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230 smiles_terms = []
1231 stack = [(0, (start_index, -1))]
1232 while stack:
1233 action, data = stack.pop()
1234 if action == 0:
1235
1236
1237
1238
1239
1240
1241 while 1:
1242
1243 num_neighbors = 0
1244 atom_idx, prev_bond_idx = data
1245 smiles_terms.append(cangen_nodes[atom_idx].atom_smarts)
1246 outgoing_edges = cangen_nodes[atom_idx].outgoing_edges
1247 for outgoing_edge in outgoing_edges:
1248 bond_idx = outgoing_edge.bond_index
1249
1250
1251 if bond_idx in closure_bonds:
1252
1253 if bond_idx not in unclosed_closures:
1254
1255 closure = heappop(available_closures)
1256 smiles_terms.append(get_closure_label(outgoing_edge.bond_smarts, closure))
1257 unclosed_closures[bond_idx] = closure
1258 else:
1259 closure = unclosed_closures[bond_idx]
1260 smiles_terms.append(get_closure_label(outgoing_edge.bond_smarts, closure))
1261 heappush(available_closures, closure)
1262 del unclosed_closures[bond_idx]
1263 else:
1264
1265 if bond_idx == prev_bond_idx:
1266
1267 continue
1268 if num_neighbors == 0:
1269
1270
1271 data = (outgoing_edge.other_node_idx, bond_idx)
1272 bond_smarts = outgoing_edge.bond_smarts
1273 else:
1274
1275 if num_neighbors == 1:
1276
1277
1278 stack.append((0, data))
1279 stack.append((1, bond_smarts))
1280
1281
1282 stack.append((3, None))
1283 stack.append((0, (outgoing_edge.other_node_idx, bond_idx)))
1284 stack.append((4, outgoing_edge.bond_smarts))
1285
1286 num_neighbors += 1
1287 if num_neighbors != 1:
1288
1289 break
1290 smiles_terms.append(bond_smarts)
1291 elif action == 1:
1292
1293 smiles_terms.append(data)
1294 continue
1295
1296 elif action == 3:
1297 smiles_terms.append(')')
1298 elif action == 4:
1299 smiles_terms.append('(' + data)
1300 else:
1301 raise AssertionError
1302
1303 return "".join(smiles_terms)
1304
1305
1306
1307
1308
1309
1315
1316
1317
1318
1319
1320
1321
1322
1323
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1363 "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
1364 s = list(iterable)
1365 return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
1366
1367
1368
1370 "nonempty_powerset([1,2,3]) --> (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
1371 s = list(iterable)
1372 it = chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
1373 next(it)
1374 return it
1375
1376
1377
1378
1379
1381 c = itertools.count()
1382 return lambda: next(c)
1383
1384
1385 tiebreaker = _Counter()
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1399 internal_bonds = set()
1400 external_edges = []
1401 for atom_index in atom_indices:
1402 for directed_edge in directed_edges[atom_index]:
1403
1404 if directed_edge.bond_index in visited_bond_indices:
1405 continue
1406
1407 if directed_edge.end_atom_index in atom_indices:
1408
1409 internal_bonds.add(directed_edge.bond_index)
1410 else:
1411
1412 external_edges.append(directed_edge)
1413
1414
1415 return list(internal_bonds), external_edges
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425 -def all_subgraph_extensions(enumeration_mol, subgraph, visited_bond_indices, internal_bonds,
1426 external_edges):
1427
1428
1429
1430
1431
1432 if not external_edges:
1433
1434 it = nonempty_powerset(internal_bonds)
1435 for internal_bond in it:
1436
1437 bond_indices = set(subgraph.bond_indices)
1438 bond_indices.update(internal_bond)
1439 yield None, Subgraph(subgraph.atom_indices, frozenset(bond_indices)), 0, 0
1440 return
1441
1442
1443 if not internal_bonds:
1444 it = nonempty_powerset(external_edges)
1445 exclude_bonds = set(chain(visited_bond_indices, (edge.bond_index for edge in external_edges)))
1446 for external_ext in it:
1447 new_atoms = frozenset(ext.end_atom_index for ext in external_ext)
1448 atom_indices = frozenset(chain(subgraph.atom_indices, new_atoms))
1449 bond_indices = frozenset(
1450 chain(subgraph.bond_indices, (ext.bond_index for ext in external_ext)))
1451 num_possible_atoms, num_possible_bonds = find_extension_size(enumeration_mol, new_atoms,
1452 exclude_bonds, external_ext)
1453
1454
1455
1456 yield new_atoms, Subgraph(atom_indices, bond_indices), num_possible_atoms, num_possible_bonds
1457 return
1458
1459
1460 internal_powerset = list(powerset(internal_bonds))
1461 external_powerset = powerset(external_edges)
1462
1463 exclude_bonds = set(chain(visited_bond_indices, (edge.bond_index for edge in external_edges)))
1464
1465 for external_ext in external_powerset:
1466 if not external_ext:
1467
1468 for internal_bond in internal_powerset[1:]:
1469 bond_indices = set(subgraph.bond_indices)
1470 bond_indices.update(internal_bond)
1471 yield None, Subgraph(subgraph.atom_indices, bond_indices), 0, 0
1472 else:
1473 new_atoms = frozenset(ext.end_atom_index for ext in external_ext)
1474 atom_indices = frozenset(chain(subgraph.atom_indices, new_atoms))
1475
1476
1477 bond_indices = frozenset(
1478 chain(subgraph.bond_indices, (ext.bond_index for ext in external_ext)))
1479 num_possible_atoms, num_possible_bonds = find_extension_size(enumeration_mol, atom_indices,
1480 exclude_bonds, external_ext)
1481
1482 for internal_bond in internal_powerset:
1483 bond_indices2 = frozenset(chain(bond_indices, internal_bond))
1484
1485 yield new_atoms, Subgraph(atom_indices,
1486 bond_indices2), num_possible_atoms, num_possible_bonds
1487
1488
1490 num_remaining_atoms = num_remaining_bonds = 0
1491 visited_atoms = set(known_atoms)
1492 visited_bonds = set(exclude_bonds)
1493
1494
1495
1496 for directed_edge in directed_edges:
1497
1498 stack = [directed_edge.end_atom_index]
1499
1500
1501 while stack:
1502 atom_index = stack.pop()
1503 for next_edge in enumeration_mol.directed_edges[atom_index]:
1504
1505 bond_index = next_edge.bond_index
1506 if bond_index in visited_bonds:
1507
1508 continue
1509 num_remaining_bonds += 1
1510 visited_bonds.add(bond_index)
1511
1512
1513 next_atom_index = next_edge.end_atom_index
1514 if next_atom_index in visited_atoms:
1515
1516 continue
1517 num_remaining_atoms += 1
1518
1519 visited_atoms.add(next_atom_index)
1520
1521 stack.append(next_atom_index)
1522
1523
1524 return num_remaining_atoms, num_remaining_bonds
1525
1526
1527
1528
1529
1530
1532
1533 - def __init__(self, targets, required_match_count=None):
1534 self.targets = targets
1535 if required_match_count is None:
1536 required_match_count = len(targets)
1537 self.required_match_count = required_match_count
1538 self._num_allowed_errors = len(targets) - required_match_count
1539 super(dict, self).__init__()
1540
1542 assert self._num_allowed_errors >= 0, (self.required_match_count, self._num_allowed_errors)
1543 self.targets = self.targets[1:]
1544 self._num_allowed_errors = len(self.targets) - self.required_match_count
1545
1547 num_allowed_errors = self._num_allowed_errors
1548 if num_allowed_errors < 0:
1549 raise AssertionError("I should never be called")
1550 self[smarts] = False
1551 return False
1552
1553 pat = Chem.MolFromSmarts(smarts)
1554 if pat is None:
1555 raise AssertionError("Bad SMARTS: %r" % (smarts, ))
1556
1557 num_allowed_errors = self._num_allowed_errors
1558 for target in self.targets:
1559 if not MATCH(target, pat):
1560 if num_allowed_errors == 0:
1561
1562 self[smarts] = False
1563 return False
1564 num_allowed_errors -= 1
1565
1566
1567 self[smarts] = True
1568 return True
1569
1570
1572
1573 - def __init__(self, targets, required_match_count=None):
1574 self.targets = targets
1575 if required_match_count is None:
1576 required_match_count = len(targets)
1577 self.cache = {}
1578 self.required_match_count = required_match_count
1579 self._num_allowed_errors = len(targets) - required_match_count
1580 self.num_lookups = self.num_cached_true = self.num_cached_false = 0
1581 self.num_search_true = self.num_search_false = self.num_matches = 0
1582
1584 assert self._num_allowed_errors >= 0, (self.required_match_count, self._num_allowed_errors)
1585 if self._num_allowed_errors > 1:
1586 self.targets = self.targets[1:]
1587 self._num_allowed_errors = len(self.targets) - self.required_match_count
1588
1590 self.num_lookups += 1
1591 x = self.cache.get(smarts, missing)
1592 if x is not missing:
1593 if x:
1594 self.num_cached_true += 1
1595 else:
1596 self.num_cached_false += 1
1597 return x
1598
1599 pat = Chem.MolFromSmarts(smarts)
1600 if pat is None:
1601 raise AssertionError("Bad SMARTS: %r" % (smarts, ))
1602
1603 for i, target in enumerate(self.targets):
1604 if not MATCH(target, pat):
1605
1606 self.num_search_false += 1
1607 self.num_matches += i + 1
1608 self.cache[smarts] = False
1609 N = len(self.targets)
1610 return False
1611
1612
1613
1614 self.num_matches += i + 1
1615 self.num_search_true += 1
1616 self.cache[smarts] = True
1617 return True
1618
1620 print >> sys.stderr, "%d tests of %d unique SMARTS, cache: %d True %d False, search: %d True %d False (%d substructure tests)" % (
1621 self.num_lookups, len(self.cache), self.num_cached_true, self.num_cached_false,
1622 self.num_search_true, self.num_search_false, self.num_matches)
1623
1624
1625
1627
1628 num_atoms = len(subgraph.atom_indices)
1629 num_bonds = len(subgraph.bond_indices)
1630 best_num_atoms, best_num_bonds = best_sizes
1631
1632
1633 diff_bonds = (num_bonds + num_remaining_bonds) - best_num_bonds
1634 if diff_bonds < 0:
1635 return True
1636 elif diff_bonds == 0:
1637
1638 diff_atoms = (num_atoms + num_remaining_atoms) - best_num_atoms
1639 if diff_atoms <= 0:
1640 return True
1641
1642 return False
1643
1644
1646
1647 num_atoms = len(subgraph.atom_indices)
1648 num_bonds = len(subgraph.bond_indices)
1649 best_num_atoms, best_num_bonds = best_sizes
1650
1651
1652 diff_atoms = (num_atoms + num_remaining_atoms) - best_num_atoms
1653 if diff_atoms < 0:
1654 return True
1655 elif diff_atoms == 0:
1656 diff_bonds = (num_bonds + num_remaining_bonds) - best_num_bonds
1657 if diff_bonds <= 0:
1658 return True
1659 else:
1660
1661
1662 pass
1663
1664 return False
1665
1666
1667
1668
1670
1672 self.best_num_atoms = self.best_num_bonds = -1
1673 self.best_smarts = None
1674 self.sizes = (-1, -1)
1675 self.timer = timer
1676 self.verbose = verbose
1677
1678 - def _new_best(self, num_atoms, num_bonds, smarts):
1679 self.best_num_atoms = num_atoms
1680 self.best_num_bonds = num_bonds
1681 self.best_smarts = smarts
1682 self.sizes = sizes = (num_atoms, num_bonds)
1683 self.timer.mark("new best")
1684 if self.verbose:
1685 dt = self.timer.mark_times["new best"] - self.timer.mark_times["start fmcs"]
1686 sys.stderr.write("Best after %.1fs: %d atoms %d bonds %s\n" %
1687 (dt, num_atoms, num_bonds, smarts))
1688 return sizes
1689
1691 return MCSResult(self.best_num_atoms, self.best_num_bonds, self.best_smarts, completed)
1692
1693
1695
1696 - def __init__(self, num_atoms, num_bonds, smarts, completed):
1697 self.num_atoms = num_atoms
1698 self.num_bonds = num_bonds
1699 self.smarts = smarts
1700 self.completed = completed
1701
1703 return self.smarts is not None
1704
1705
1707
1709 sizes = self.sizes
1710
1711
1712 num_subgraph_atoms = len(subgraph.atom_indices)
1713 if num_subgraph_atoms < sizes[0]:
1714 return sizes
1715
1716 num_subgraph_bonds = len(subgraph.bond_indices)
1717 if num_subgraph_atoms == sizes[0]:
1718 if num_subgraph_bonds <= sizes[1]:
1719 return sizes
1720
1721 return self._new_best(num_subgraph_atoms, num_subgraph_bonds, smarts)
1722
1723
1725
1727 sizes = self.sizes
1728
1729
1730 num_subgraph_bonds = len(subgraph.bond_indices)
1731 if num_subgraph_bonds < sizes[1]:
1732 return sizes
1733
1734 num_subgraph_atoms = len(subgraph.atom_indices)
1735 if num_subgraph_bonds == sizes[1] and num_subgraph_atoms <= sizes[0]:
1736 return sizes
1737 return self._new_best(num_subgraph_atoms, num_subgraph_bonds, smarts)
1738
1739
1740
1741
1742
1743
1745
1746
1747 atoms = enumeration_mol.atoms
1748 bonds = enumeration_mol.bonds
1749
1750
1751 ring_bonds = []
1752 for bond_index in subgraph.bond_indices:
1753 bond = bonds[bond_index]
1754 if bond.is_in_ring:
1755 ring_bonds.append(bond_index)
1756
1757
1758 if not ring_bonds:
1759
1760 return True
1761
1762 if len(ring_bonds) <= 2:
1763
1764 return False
1765
1766
1767
1768
1769 confirmed_ring_bonds = set()
1770 subgraph_ring_bond_indices = set(ring_bonds)
1771 for bond_index in ring_bonds:
1772
1773 if bond_index in confirmed_ring_bonds:
1774 continue
1775
1776 from_atom_index, to_atom_index = bonds[bond_index].atom_indices
1777
1778
1779 atom_depth = {from_atom_index: 0, to_atom_index: 1}
1780 bond_stack = [bond_index]
1781 backtrack_stack = []
1782 prev_bond_index = bond_index
1783 current_atom_index = to_atom_index
1784
1785 while 1:
1786
1787 next_bond_index = next_atom_index = None
1788 this_is_a_ring = False
1789 for outgoing_edge in enumeration_mol.directed_edges[current_atom_index]:
1790 if outgoing_edge.bond_index == prev_bond_index:
1791
1792 continue
1793 if outgoing_edge.bond_index not in subgraph_ring_bond_indices:
1794
1795 continue
1796
1797 if outgoing_edge.end_atom_index in atom_depth:
1798
1799
1800 confirmed_ring_bonds.update(bond_stack[atom_depth[outgoing_edge.end_atom_index]:])
1801 confirmed_ring_bonds.add(outgoing_edge.bond_index)
1802 if len(confirmed_ring_bonds) == len(ring_bonds):
1803
1804 return True
1805 this_is_a_ring = True
1806 continue
1807
1808
1809
1810 if next_bond_index is None:
1811
1812 next_bond_index = outgoing_edge.bond_index
1813 next_atom_index = outgoing_edge.end_atom_index
1814 else:
1815
1816 backtrack_stack.append(
1817 (len(bond_stack), outgoing_edge.bond_index, outgoing_edge.end_atom_index))
1818
1819 if next_bond_index is None:
1820
1821 if this_is_a_ring:
1822
1823
1824 while backtrack_stack:
1825 old_size, prev_bond_index, current_atom_index = backtrack_stack.pop()
1826 if bond_index not in confirmed_ring_bonds:
1827
1828
1829 del bond_stack[old_size:]
1830 break
1831 else:
1832
1833
1834
1835
1836 break
1837 else:
1838
1839 return False
1840 else:
1841
1842 bond_stack.append(next_bond_index)
1843 atom_depth[next_atom_index] = len(bond_stack)
1844 prev_bond_index = next_bond_index
1845 current_atom_index = next_atom_index
1846
1847
1848
1849
1850
1852
1854 sizes = self.sizes
1855
1856
1857 num_subgraph_atoms = len(subgraph.atom_indices)
1858 if num_subgraph_atoms < sizes[0]:
1859 return sizes
1860
1861 num_subgraph_bonds = len(subgraph.bond_indices)
1862 if num_subgraph_atoms == sizes[0] and num_subgraph_bonds <= sizes[1]:
1863 return sizes
1864
1865 if check_completeRingsOnly(smarts, subgraph, mol):
1866 return self._new_best(num_subgraph_atoms, num_subgraph_bonds, smarts)
1867 return sizes
1868
1869
1871
1873 sizes = self.sizes
1874
1875
1876 num_subgraph_bonds = len(subgraph.bond_indices)
1877 if num_subgraph_bonds < sizes[1]:
1878 return sizes
1879
1880 num_subgraph_atoms = len(subgraph.atom_indices)
1881 if num_subgraph_bonds == sizes[1] and num_subgraph_atoms <= sizes[0]:
1882 return sizes
1883
1884 if check_completeRingsOnly(smarts, subgraph, mol):
1885 return self._new_best(num_subgraph_atoms, num_subgraph_bonds, smarts)
1886 return sizes
1887
1888
1889 _maximize_options = {
1890 ("atoms", False): (prune_maximize_atoms, SingleBestAtoms),
1891 ("atoms", True): (prune_maximize_atoms, SingleBestAtomsCompleteRingsOnly),
1892 ("bonds", False): (prune_maximize_bonds, SingleBestBonds),
1893 ("bonds", True): (prune_maximize_bonds, SingleBestBondsCompleteRingsOnly),
1894 }
1895
1896
1897
1898
1899 -def enumerate_subgraphs(enumeration_mols, prune, atom_assignment, matches_all_targets, hits,
1900 timeout, heappush, heappop):
1901 if timeout is None:
1902 end_time = None
1903 else:
1904 end_time = time.time() + timeout
1905
1906 seeds = []
1907
1908 best_sizes = (0, 0)
1909
1910
1911 for mol in enumeration_mols:
1912 atom_range = range(len(mol.atoms))
1913 bond_set = set(range(len(mol.bonds)))
1914 subgraph = Subgraph(atom_range, bond_set)
1915 if not prune(subgraph, mol, 0, 0, best_sizes):
1916
1917
1918
1919
1920 smarts = make_arbitrary_smarts(subgraph, mol, atom_assignment)
1921 if matches_all_targets[smarts]:
1922 best_sizes = hits.add_new_match(subgraph, mol, smarts)
1923
1924 for mol in enumeration_mols:
1925 directed_edges = mol.directed_edges
1926
1927
1928
1929
1930
1931
1932 sorted_bonds = list(enumerate(mol.bonds))
1933
1934 def get_bond_ring_score(bond_data, atoms=mol.atoms):
1935 bond_index, bond = bond_data
1936 a1, a2 = bond.atom_indices
1937 return bond.is_in_ring + atoms[a1].is_in_ring + atoms[a2].is_in_ring
1938
1939 sorted_bonds.sort(key=get_bond_ring_score)
1940
1941 visited_bond_indices = set()
1942 num_remaining_atoms = len(mol.atoms) - 2
1943 num_remaining_bonds = len(mol.bonds)
1944 for bond_index, bond in sorted_bonds:
1945
1946 visited_bond_indices.add(bond_index)
1947 num_remaining_bonds -= 1
1948 subgraph = Subgraph(bond.atom_indices, frozenset([bond_index]))
1949
1950
1951 if prune(subgraph, mol, num_remaining_atoms, num_remaining_bonds, best_sizes):
1952 continue
1953
1954
1955
1956
1957
1958 smarts = bond.canonical_bondtype
1959 if matches_all_targets[smarts]:
1960 best_sizes = hits.add_new_match(subgraph, mol, smarts)
1961 else:
1962
1963
1964 continue
1965
1966 a1, a2 = bond.atom_indices
1967 outgoing_edges = [
1968 e for e in (directed_edges[a1] + directed_edges[a2])
1969 if e.end_atom_index not in bond.atom_indices and e.bond_index not in visited_bond_indices
1970 ]
1971
1972 empty_internal = frozenset()
1973 if not outgoing_edges:
1974 pass
1975 else:
1976
1977
1978
1979
1980 heappush(seeds, (-1, tiebreaker(), subgraph, visited_bond_indices.copy(), empty_internal,
1981 outgoing_edges, mol, directed_edges))
1982
1983
1984
1985
1986 del subgraph
1987
1988 while seeds:
1989 if end_time:
1990 if time.time() >= end_time:
1991 return False
1992
1993
1994 score, _, old_subgraph, visited_bond_indices, internal_bonds, external_edges, mol, directed_edges = heappop(
1995 seeds)
1996
1997 new_visited_bond_indices = visited_bond_indices.copy()
1998 new_visited_bond_indices.update(internal_bonds)
1999
2000
2001 new_visited_bond_indices.update(edge.bond_index for edge in external_edges)
2002
2003 for new_atoms, new_subgraph, num_remaining_atoms, num_remaining_bonds in \
2004 all_subgraph_extensions(mol, old_subgraph, visited_bond_indices, internal_bonds, external_edges):
2005 if prune(new_subgraph, mol, num_remaining_atoms, num_remaining_bonds, best_sizes):
2006
2007 continue
2008 smarts = make_canonical_smarts(new_subgraph, mol, atom_assignment)
2009 if matches_all_targets[smarts]:
2010
2011 best_sizes = hits.add_new_match(new_subgraph, mol, smarts)
2012 else:
2013
2014 continue
2015
2016 if not new_atoms:
2017 continue
2018
2019 new_internal_bonds, new_external_edges = find_extensions(new_atoms, new_visited_bond_indices,
2020 directed_edges)
2021
2022 if new_internal_bonds or new_external_edges:
2023
2024 heappush(seeds, (-len(new_subgraph.bond_indices), tiebreaker(), new_subgraph,
2025 new_visited_bond_indices, new_internal_bonds, new_external_edges, mol,
2026 directed_edges))
2027
2028 return True
2029
2030
2031
2033
2036
2038 self[key] = count = self.counter()
2039 return count
2040
2041
2042
2044 return mol.HasSubstructMatch(pat)
2045
2046
2048
2049 - def __init__(self, trigger, verboseDelay):
2050 self.num_seeds_added = 0
2051 self.num_seeds_processed = 0
2052 self.verboseDelay = verboseDelay
2053 self._time_for_next_report = time.time() + verboseDelay
2054 self.trigger = trigger
2055
2057 self.num_seeds_added += 1
2058 return heappush(seeds, item)
2059
2061 if time.time() >= self._time_for_next_report:
2062 self.trigger()
2063 self.report()
2064 self._time_for_next_report = time.time() + self.verboseDelay
2065 self.num_seeds_processed += 1
2066 return heappop(seeds)
2067
2069 self.trigger()
2070 self.report()
2071
2073 print >> sys.stderr, " %d subgraphs enumerated, %d processed" % (self.num_seeds_added,
2074 self.num_seeds_processed)
2075
2076
2080 assert timer is not None
2081 assert 0 < threshold_count <= len(fragmented_mols), threshold_count
2082 assert len(fragmented_mols) == len(typed_mols)
2083 assert len(fragmented_mols) >= 2
2084 if threshold_count is None:
2085 threshold_count = len(fragmented_mols)
2086 else:
2087 assert threshold_count >= 2, threshold_count
2088
2089 atom_assignment = Uniquer()
2090 if verbose:
2091 if verboseDelay < 0.0:
2092 raise ValueError("verboseDelay may not be negative")
2093 matches_all_targets = VerboseCachingTargetsMatcher(typed_mols[1:], threshold_count - 1)
2094 heapops = VerboseHeapOps(matches_all_targets.report, verboseDelay)
2095 push = heapops.heappush
2096 pop = heapops.heappop
2097 end_verbose = heapops.trigger_report
2098 else:
2099 matches_all_targets = CachingTargetsMatcher(typed_mols[1:], threshold_count - 1)
2100 push = heappush
2101 pop = heappop
2102 end_verbose = lambda: 1
2103
2104 try:
2105 prune, hits_class = _maximize_options[(maximize, bool(completeRingsOnly))]
2106 except KeyError:
2107 raise ValueError("Unknown 'maximize' option %r" % (maximize, ))
2108
2109 hits = hits_class(timer, verbose)
2110
2111 remaining_time = None
2112 if timeout is not None:
2113 stop_time = time.time() + timeout
2114
2115 for query_index, fragmented_query_mol in enumerate(fragmented_mols):
2116 enumerated_query_fragments = fragmented_mol_to_enumeration_mols(fragmented_query_mol,
2117 minNumAtoms)
2118
2119 targets = typed_mols
2120 if timeout is not None:
2121 remaining_time = stop_time - time.time()
2122 success = enumerate_subgraphs(enumerated_query_fragments, prune, atom_assignment,
2123 matches_all_targets, hits, remaining_time, push, pop)
2124 if query_index + threshold_count >= len(fragmented_mols):
2125 break
2126 if not success:
2127 break
2128 matches_all_targets.shift_targets()
2129
2130 end_verbose()
2131
2132 result = hits.get_result(success)
2133 if result.num_atoms < minNumAtoms:
2134 return MCSResult(-1, -1, None, result.completed)
2135 return result
2136
2137
2138
2139
2141
2143 self.mark_times = {}
2144
2145 - def mark(self, name):
2146 self.mark_times[name] = time.time()
2147
2148
2150 if times is None:
2151 return
2152 for (dest, start, end) in (
2153 ("fragment", "start fmcs", "end fragment"), ("select", "end fragment", "end select"),
2154 ("enumerate", "end select", "end fmcs"), ("best_found", "start fmcs", "new best"),
2155 ("mcs", "start fmcs", "end fmcs")):
2156 try:
2157 diff = timer.mark_times[end] - timer.mark_times[start]
2158 except KeyError:
2159 diff = None
2160 times[dest] = diff
2161
2162
2164 if threshold is None:
2165 return num_mols
2166
2167 x = num_mols * threshold
2168 threshold_count = int(x)
2169 if threshold_count < x:
2170 threshold_count += 1
2171
2172 if threshold_count < 2:
2173
2174
2175 threshold_count = 2
2176
2177 return threshold_count
2178
2179
2180 -def fmcs(mols,
2181 minNumAtoms=2,
2182 maximize=Default.maximize,
2183 atomCompare=Default.atomCompare,
2184 bondCompare=Default.bondCompare,
2185 threshold=1.0,
2186 matchValences=Default.matchValences,
2187 ringMatchesRingOnly=False,
2188 completeRingsOnly=False,
2189 timeout=Default.timeout,
2190 times=None,
2191 verbose=False,
2192 verboseDelay=1.0, ):
2193
2194 timer = Timer()
2195 timer.mark("start fmcs")
2196
2197 if minNumAtoms < 2:
2198 raise ValueError("minNumAtoms must be at least 2")
2199 if timeout is not None:
2200 if timeout <= 0.0:
2201 raise ValueError("timeout must be None or a positive value")
2202
2203 threshold_count = _get_threshold_count(len(mols), threshold)
2204 if threshold_count > len(mols):
2205
2206 return MCSResult(-1, -1, None, 1)
2207
2208 if completeRingsOnly:
2209 ringMatchesRingOnly = True
2210
2211 try:
2212 atom_typer = atom_typers[atomCompare]
2213 except KeyError:
2214 raise ValueError("Unknown atomCompare option %r" % (atomCompare, ))
2215 try:
2216 bond_typer = bond_typers[bondCompare]
2217 except KeyError:
2218 raise ValueError("Unknown bondCompare option %r" % (bondCompare, ))
2219
2220
2221 typed_mols = convert_input_to_typed_molecules(mols, atom_typer, bond_typer,
2222 matchValences=matchValences,
2223 ringMatchesRingOnly=ringMatchesRingOnly)
2224 bondtype_counts = get_canonical_bondtype_counts(typed_mols)
2225 supported_bondtypes = set()
2226 for bondtype, count_list in bondtype_counts.items():
2227 if len(count_list) >= threshold_count:
2228 supported_bondtypes.add(bondtype)
2229
2230
2231
2232
2233 fragmented_mols = [remove_unknown_bondtypes(typed_mol, bondtype_counts)
2234 for typed_mol in typed_mols]
2235 timer.mark("end fragment")
2236
2237 sizes = []
2238 max_num_atoms = fragmented_mols[0].rdmol.GetNumAtoms()
2239 max_num_bonds = fragmented_mols[0].rdmol.GetNumBonds()
2240 ignored_count = 0
2241 for tiebreaker, (typed_mol, fragmented_mol) in enumerate(zip(typed_mols, fragmented_mols)):
2242 num_atoms, num_bonds = find_upper_fragment_size_limits(fragmented_mol.rdmol,
2243 fragmented_mol.rdmol_atoms)
2244 if num_atoms < minNumAtoms:
2245
2246 ignored_count += 1
2247 if ignored_count + threshold_count > len(mols):
2248
2249
2250 timer.mark("end select")
2251 timer.mark("end fmcs")
2252 _update_times(timer, times)
2253 return MCSResult(-1, -1, None, True)
2254 else:
2255 if num_atoms < max_num_atoms:
2256 max_num_atoms = num_atoms
2257 if num_bonds < max_num_bonds:
2258 max_num_bonds = num_bonds
2259 sizes.append((num_bonds, num_atoms, tiebreaker, typed_mol, fragmented_mol))
2260
2261 if len(sizes) < threshold_count:
2262 timer.mark("end select")
2263 timer.mark("end fmcs")
2264 _update_times(timer, times)
2265 return MCSResult(-1, -1, None, True)
2266
2267 assert min(size[1] for size in sizes) >= minNumAtoms
2268
2269
2270
2271
2272 sizes.sort()
2273
2274
2275 timer.mark("end select")
2276
2277
2278 fragmented_mols = [size_info[4] for size_info in sizes]
2279 typed_mols = [size_info[3].rdmol for size_info in sizes]
2280
2281 timer.mark("start enumeration")
2282 mcs_result = compute_mcs(fragmented_mols, typed_mols, minNumAtoms,
2283 threshold_count=threshold_count, maximize=maximize,
2284 completeRingsOnly=completeRingsOnly, timeout=timeout, timer=timer,
2285 verbose=verbose, verboseDelay=verboseDelay)
2286 timer.mark("end fmcs")
2287 _update_times(timer, times)
2288 return mcs_result
2289
2290
2291
2292
2293
2294
2295
2297 emol = Chem.EditableMol(Chem.Mol())
2298 atom_map = {}
2299 for atom_index in subgraph.atom_indices:
2300 emol.AddAtom(mol.GetAtomWithIdx(atom_index))
2301 atom_map[atom_index] = len(atom_map)
2302
2303 for bond_index in subgraph.bond_indices:
2304 bond = mol.GetBondWithIdx(bond_index)
2305 emol.AddBond(atom_map[bond.GetBeginAtomIdx()], atom_map[bond.GetEndAtomIdx()],
2306 bond.GetBondType())
2307
2308 return emol.GetMol()
2309
2310
2311
2313 fragment = subgraph_to_fragment(mol, subgraph)
2314 new_smiles = Chem.MolToSmiles(fragment)
2315 return "%s %s\n" % (new_smiles, mol.GetProp("_Name"))
2316
2317
2325
2326
2328
2329 mol_block = Chem.MolToMolBlock(mol, kekulize=False)
2330 tag_data = []
2331 for name in mol.GetPropNames():
2332 if name.startswith("_"):
2333 continue
2334 value = mol.GetProp(name)
2335 tag_data.append("> <" + name + ">\n")
2336 tag_data.append(value + "\n")
2337 tag_data.append("\n")
2338 tag_data.append("$$$$\n")
2339 return mol_block + "".join(tag_data)
2340
2341
2365
2366
2367
2369 fragment = subgraph_to_fragment(mol, subgraph)
2370 Chem.FastFindRings(fragment)
2371 _copy_sd_tags(mol, fragment)
2372
2373 if args.save_atom_class_tag is not None:
2374 output_tag = args.save_atom_class_tag
2375 atom_classes = get_selected_atom_classes(mol, subgraph.atom_indices)
2376 if atom_classes is not None:
2377 fragment.SetProp(output_tag, " ".join(map(str, atom_classes)))
2378
2379 _save_other_tags(fragment, fragment, mcs, mol, subgraph, args)
2380
2381 return _MolToSDBlock(fragment)
2382
2383
2384
2386 fragment = copy.copy(mol)
2387 _copy_sd_tags(mol, fragment)
2388
2389 if args.save_atom_indices_tag is not None:
2390 output_tag = args.save_atom_indices_tag
2391 s = " ".join(str(index) for index in subgraph.atom_indices)
2392 fragment.SetProp(output_tag, s)
2393
2394 _save_other_tags(fragment, subgraph_to_fragment(mol, subgraph), mcs, mol, subgraph, args)
2395
2396 return _MolToSDBlock(fragment)
2397
2398
2399 structure_format_functions = {
2400 "fragment-smiles": make_fragment_smiles,
2401 "fragment-sdf": make_fragment_sdf,
2402 "complete-sdf": make_complete_sdf,
2403 }
2404
2405
2412
2413
2415 num_atoms = int(s)
2416 if num_atoms < 2:
2417 raise argparse.ArgumentTypeError("must be at least 2, not %s" % s)
2418 return num_atoms
2419
2420
2422 try:
2423 import fractions
2424 except ImportError:
2425 threshold = float(s)
2426 one = 1.0
2427 else:
2428 threshold = fractions.Fraction(s)
2429 one = fractions.Fraction(1)
2430 if not (0 <= threshold <= one):
2431 raise argparse.ArgumentTypeError("must be a value between 0.0 and 1.0, not %s" % s)
2432 return threshold
2433
2434
2436 if s == "none":
2437 return None
2438 timeout = float(s)
2439 if timeout < 0.0:
2440 raise argparse.ArgumentTypeError("Must be a non-negative value, not %r" % (s, ))
2441 return timeout
2442
2443
2445
2448
2450 return self.left <= value
2451
2452
2453 range_pat = re.compile(r"(\d+)-(\d*)")
2454 value_pat = re.compile("(\d+)")
2455
2456
2458 ranges = []
2459 start = 0
2460 while 1:
2461 m = range_pat.match(s, start)
2462 if m is not None:
2463
2464
2465 left = int(m.group(1))
2466 right = m.group(2)
2467 if not right:
2468 ranges.append(starting_from(left - 1))
2469 else:
2470 ranges.append(range(left - 1, int(right)))
2471 else:
2472
2473 m = value_pat.match(s, start)
2474 if m is not None:
2475 val = int(m.group(1))
2476 ranges.append(range(val - 1, val))
2477 else:
2478 raise argparse.ArgumentTypeError("Unknown character at position %d of %r" % (start + 1, s))
2479 start = m.end()
2480
2481 t = s[start:start + 1]
2482 if not t:
2483 break
2484 if t == ",":
2485 start += 1
2486 continue
2487 raise argparse.ArgumentTypeError("Unknown character at position %d of %r" % (start + 1, s))
2488 return ranges
2489
2490
2491 compare_shortcuts = {
2492 "topology": ("any", "any"),
2493 "elements": ("elements", "any"),
2494 "types": ("elements", "bondtypes"),
2495 }
2496
2497
2498
2499
2501 bond_indices = []
2502 for bond in pat.GetBonds():
2503 mol_atom1 = match_atom_indices[bond.GetBeginAtomIdx()]
2504 mol_atom2 = match_atom_indices[bond.GetEndAtomIdx()]
2505 bond = mol.GetBondBetweenAtoms(mol_atom1, mol_atom2)
2506 assert bond is not None
2507 bond_indices.append(bond.GetIdx())
2508 return bond_indices
2509
2510
2511 -def main(args=None):
2512 parser = argparse.ArgumentParser(
2513 description="Find the maximum common substructure of a set of structures",
2514 epilog="For more details on these options, see https://bitbucket.org/dalke/fmcs/")
2515 parser.add_argument("filename", nargs=1, help="SDF or SMILES file")
2516
2517 parser.add_argument("--maximize", choices=["atoms", "bonds"], default=Default.maximize,
2518 help="Maximize the number of 'atoms' or 'bonds' in the MCS. (Default: %s)" %
2519 (Default.maximize, ))
2520 parser.add_argument("--min-num-atoms", type=parse_num_atoms, default=2, metavar="INT",
2521 help="Minimimum number of atoms in the MCS (Default: 2)")
2522
2523 class CompareAction(argparse.Action):
2524
2525 def __call__(self, parser, namespace, value, option_string=None):
2526 atomCompare_name, bondCompare_name = compare_shortcuts[value]
2527 namespace.atomCompare = atomCompare_name
2528 namespace.bondCompare = bondCompare_name
2529
2530 parser.add_argument(
2531 "--compare", choices=["topology", "elements", "types"], default=None, action=CompareAction,
2532 help="Use 'topology' as a shorthand for '--atom-compare any --bond-compare any', "
2533 "'elements' is '--atom-compare elements --bond-compare any', "
2534 "and 'types' is '--atom-compare elements --bond-compare bondtypes' "
2535 "(Default: types)")
2536
2537 parser.add_argument(
2538 "--atom-compare", choices=["any", "elements", "isotopes"], default=None, help=(
2539 "Specify the atom comparison method. With 'any', every atom matches every "
2540 "other atom. With 'elements', atoms match only if they contain the same element. "
2541 "With 'isotopes', atoms match only if they have the same isotope number; element "
2542 "information is ignored so [5C] and [5P] are identical. This can be used to "
2543 "implement user-defined atom typing. "
2544 "(Default: elements)"))
2545
2546 parser.add_argument("--bond-compare", choices=["any", "bondtypes"], default="bondtypes", help=(
2547 "Specify the bond comparison method. With 'any', every bond matches every "
2548 "other bond. With 'bondtypes', bonds are the same only if their bond types "
2549 "are the same. (Default: bondtypes)"))
2550
2551 parser.add_argument(
2552 "--threshold", default="1.0", type=parse_threshold,
2553 help="Minimum structure match threshold. A value of 1.0 means that the common "
2554 "substructure must be in all of the input structures. A value of 0.8 finds "
2555 "the largest substructure which is common to at least 80%% of the input "
2556 "structures. (Default: 1.0)")
2557
2558 parser.add_argument(
2559 "--atom-class-tag", metavar="TAG",
2560 help="Use atom class assignments from the field 'TAG'. The tag data must contain a space "
2561 "separated list of integers in the range 1-10000, one for each atom. Atoms are "
2562 "identical if and only if their corresponding atom classes are the same. Note "
2563 "that '003' and '3' are treated as identical values. (Not used by default)")
2564
2565
2566
2567
2568
2569
2570 parser.add_argument(
2571 "--ring-matches-ring-only", action="store_true",
2572 help="Modify the bond comparison so that ring bonds only match ring bonds and chain "
2573 "bonds only match chain bonds. (Ring atoms can still match non-ring atoms.) ")
2574
2575 parser.add_argument(
2576 "--complete-rings-only", action="store_true",
2577 help="If a bond is a ring bond in the input structures and a bond is in the MCS "
2578 "then the bond must also be in a ring in the MCS. Selecting this option also "
2579 "enables --ring-matches-ring-only.")
2580
2581 parser.add_argument(
2582 "--select", type=parse_select, action="store", default="1-",
2583 help="Select a subset of the input records to process. Example: 1-10,13,20,50- "
2584 "(Default: '1-', which selects all structures)")
2585
2586 parser.add_argument("--timeout", type=parse_timeout, metavar="SECONDS", default=Default.timeout,
2587 help="Report the best solution after running for at most 'timeout' seconds. "
2588 "Use 'none' for no timeout. (Default: %s)" % (Default.timeoutString, ))
2589
2590 parser.add_argument("--output", "-o", metavar="FILENAME",
2591 help="Write the results to FILENAME (Default: use stdout)")
2592
2593 parser.add_argument(
2594 "--output-format", choices=["smarts", "fragment-smiles", "fragment-sdf", "complete-sdf"],
2595 default="smarts",
2596 help="'smarts' writes the SMARTS pattern including the atom and bond criteria. "
2597 "'fragment-smiles' writes a matching fragment as a SMILES string. "
2598 "'fragment-sdf' writes a matching fragment as a SD file; see --save-atom-class for "
2599 "details on how atom class information is saved. "
2600 "'complete-sdf' writes the entire SD file with the fragment information stored in "
2601 "the tag specified by --save-fragment-indices-tag. (Default: smarts)")
2602
2603 parser.add_argument(
2604 "--output-all", action="store_true",
2605 help="By default the structure output formats only show an MCS for the first input structure. "
2606 "If this option is enabled then an MCS for all of the structures are shown.")
2607
2608 parser.add_argument(
2609 "--save-atom-class-tag", metavar="TAG",
2610 help="If atom classes are specified (via --class-tag) and the output format is 'fragment-sdf' "
2611 "then save the substructure atom classes to the tag TAG, in fragment atom order. By "
2612 "default this is the value of --atom-class-tag.")
2613
2614 parser.add_argument(
2615 "--save-counts-tag", metavar="TAG",
2616 help="Save the fragment count, atom count, and bond count to the specified SD tag as "
2617 "space separated integers, like '1 9 8'. (The fragment count will not be larger than "
2618 "1 until fmcs supports disconnected MCSes.)")
2619
2620 parser.add_argument("--save-atom-indices-tag", metavar="TAG",
2621 help="If atom classes are specified and the output format is 'complete-sdf' "
2622 "then save the MCS fragment atom indices to the tag TAG, in MCS order. "
2623 "(Default: mcs-atom-indices)")
2624
2625 parser.add_argument(
2626 "--save-smarts-tag", metavar="TAG",
2627 help="Save the MCS SMARTS to the specified SD tag. Uses '-' if there is no MCS")
2628
2629 parser.add_argument(
2630 "--save-smiles-tag", metavar="TAG",
2631 help="Save the fragment SMILES to the specified SD tag. Uses '-' if there is no MCS")
2632
2633 parser.add_argument("--times", action="store_true", help="Print timing information to stderr")
2634 parser.add_argument("-v", "--verbose", action="count", dest="verbosity",
2635 help="Print progress statistics to stderr. Use twice for higher verbosity.")
2636 parser.add_argument("--version", action="version", version="%(prog)s " + __version__)
2637
2638 args = parser.parse_args(args)
2639
2640 filename = args.filename[0]
2641 fname = filename.lower()
2642 if fname.endswith(".smi"):
2643 try:
2644 reader = Chem.SmilesMolSupplier(filename, titleLine=False)
2645 except IOError:
2646 raise SystemExit("Unable to open SMILES file %r" % (filename, ))
2647 elif fname.endswith(".sdf"):
2648 try:
2649 reader = Chem.SDMolSupplier(filename)
2650 except IOError:
2651 raise SystemExit("Unable to open SD file %r" % (filename, ))
2652 elif fname.endswith(".gz"):
2653 raise SystemExit("gzip compressed files not yet supported")
2654 else:
2655 raise SystemExit("Only SMILES (.smi) and SDF (.sdf) files are supported")
2656
2657 if args.minNumAtoms < 2:
2658 parser.error("--min-num-atoms must be at least 2")
2659
2660 if args.atomCompare is None:
2661 if args.atom_class_tag is None:
2662 args.atomCompare = "elements"
2663 else:
2664 args.atomCompare = "isotopes"
2665 else:
2666 if args.atom_class_tag is not None:
2667 parser.error("Cannot specify both --atom-compare and --atom-class-tag fields")
2668
2669
2670
2671 for name in ("atom_class_tag", "save_atom_class_tag", "save_counts_tag", "save_atom_indices_tag",
2672 "save_smarts_tag", "save_smiles_tag"):
2673 value = getattr(args, name)
2674 if value is not None:
2675 if value.startswith("_"):
2676 parser.error("--%s value may not start with a '_': %r" % (name.replace("_", "-"), value))
2677
2678
2679 atom_class_tag = args.atom_class_tag
2680 if args.output_format == "fragment-sdf":
2681 if atom_class_tag is not None:
2682 if args.save_atom_class_tag is None:
2683 args.save_atom_class_tag = atom_class_tag
2684
2685 if args.output_format == "complete-sdf":
2686 if (args.save_atom_indices_tag is None and args.save_counts_tag is None and
2687 args.save_smiles_tag is None and args.save_smarts_tag is None):
2688 parser.error("Using --output-format complete-sdf is useless without at least one "
2689 "of --save-atom-indices-tag, --save-smarts-tag, --save-smiles-tag, "
2690 "or --save-counts-tag")
2691
2692 t1 = time.time()
2693 structures = []
2694 if args.verbosity > 1:
2695 sys.stderr.write("Loading structures from %s ..." % (filename, ))
2696
2697 for molno, mol in enumerate(reader):
2698 if not any(molno in range_ for range_ in args.select):
2699 continue
2700 if mol is None:
2701 print >> sys.stderr, "Skipping unreadable structure #%d" % (molno + 1, )
2702 continue
2703 if atom_class_tag is not None:
2704 try:
2705 assign_isotopes_from_class_tag(mol, atom_class_tag)
2706 except ValueError as err:
2707 raise SystemExit("Structure #%d: %s" % (molno + 1, err))
2708 structures.append(mol)
2709 if args.verbosity > 1:
2710 if len(structures) % 100 == 0:
2711 sys.stderr.write("\rLoaded %d structures from %s ..." % (len(structures), filename))
2712 sys.stderr.flush()
2713
2714 if args.verbosity > 1:
2715 sys.stderr.write("\r")
2716
2717 times = {"load": time.time() - t1}
2718
2719 if args.verbosity:
2720 print >> sys.stderr, "Loaded", len(structures), "structures from", filename, " "
2721
2722 if len(structures) < 2:
2723 raise SystemExit("Input file %r must contain at least two structures" % (filename, ))
2724
2725 mcs = fmcs(structures,
2726 minNumAtoms=args.minNumAtoms,
2727 maximize=args.maximize,
2728 atomCompare=args.atomCompare,
2729 bondCompare=args.bondCompare,
2730 threshold=args.threshold,
2731
2732 matchValences=False,
2733 ringMatchesRingOnly=args.ringMatchesRingOnly,
2734 completeRingsOnly=args.completeRingsOnly,
2735 timeout=args.timeout,
2736 times=times,
2737 verbose=args.verbosity > 1,
2738 verboseDelay=1.0, )
2739
2740 msg_format = "Total time %(total).2f seconds: load %(load).2f fragment %(fragment).2f select %(select).2f enumerate %(enumerate).2f"
2741 times["total"] = times["mcs"] + times["load"]
2742
2743 if mcs and mcs.completed:
2744 msg_format += " (MCS found after %(best_found).2f)"
2745
2746 del mol
2747
2748 if args.output:
2749 outfile = open(args.output, "w")
2750 else:
2751 outfile = sys.stdout
2752
2753 if args.output_format == "smarts":
2754 if not mcs:
2755 outfile.write("No MCS found\n")
2756 else:
2757 if mcs.completed:
2758 status = "(complete search)"
2759 else:
2760 status = "(timed out)"
2761 outfile.write("%s %d atoms %d bonds %s\n" % (mcs.smarts, mcs.num_atoms, mcs.num_bonds,
2762 status))
2763
2764 else:
2765 if mcs.smarts is None:
2766
2767 pat = Chem.MolFromSmarts("[CN]")
2768 else:
2769
2770 pat = Chem.MolFromSmarts(mcs.smarts)
2771 for structure in structures:
2772 atom_indices = structure.GetSubstructMatch(pat)
2773 if not atom_indices:
2774
2775
2776
2777 assert args.threshold < 1, "No indices but should have matched everything!"
2778 continue
2779 bond_indices = _get_match_bond_indices(pat, structure, atom_indices)
2780 subgraph = Subgraph(atom_indices, bond_indices)
2781 if atom_class_tag:
2782 restore_isotopes(structure)
2783
2784 outfile.write(make_structure_format(args.output_format, mcs, structure, subgraph, args))
2785
2786 if not args.output_all:
2787 break
2788
2789 if args.output:
2790 outfile.close()
2791
2792 if args.times or args.verbosity:
2793 print >> sys.stderr, msg_format % times
2794
2795
2796 if __name__ == "__main__":
2797 import argparse
2798 main(sys.argv[1:])
2799