
Source Code for Module rdkit.ML.SLT.Risk

#
#  Copyright (C) 2000-2008  Greg Landrum
#
""" code for calculating empirical risk

"""
import math


def log2(x):
  # base-2 logarithm; equivalent to math.log2, which was not yet
  # available when this module was written
  return math.log(x) / math.log(2.)

def BurgesRiskBound(VCDim, nData, nWrong, conf):
  """ Calculates Burges's formulation of the risk bound

  The formulation is from Eqn. 3 of Burges's review article
  "A Tutorial on Support Vector Machines for Pattern Recognition",
  in _Data Mining and Knowledge Discovery_, Kluwer Academic Publishers
  (1998), Vol. 2

  **Arguments**

    - VCDim: the VC dimension of the system

    - nData: the number of data points used

    - nWrong: the number of data points misclassified

    - conf: the confidence to be used for this risk bound

  **Returns**

    - a float

  **Notes**

    - This has been validated against the Burges paper

    - I believe that this is only technically valid for binary classification

  """
  # maintain consistency of notation with Burges's paper
  h = VCDim
  l = nData
  eta = conf

  numerator = h * (math.log(2. * l / h) + 1.) - math.log(eta / 4.)
  structRisk = math.sqrt(numerator / l)

  rEmp = float(nWrong) / l

  return rEmp + structRisk

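In Burges's notation this returns Remp + sqrt((h*(ln(2l/h) + 1) - ln(eta/4)) / l), where conf plays the role of eta, the probability that the bound fails to hold. A minimal usage sketch; the inputs (VC dimension, data counts, confidence) are illustrative values, not from the original module:

    from rdkit.ML.SLT.Risk import BurgesRiskBound

    # illustrative problem: VC dimension 10, 1000 training points,
    # 50 misclassified, bound to hold with probability 1 - 0.05
    bound = BurgesRiskBound(VCDim=10, nData=1000, nWrong=50, conf=0.05)
    print(bound)  # ~0.31: empirical risk 0.05 plus a VC term of ~0.26
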
def CristianiRiskBound(VCDim, nData, nWrong, conf):
  """ Calculates the Cristianini & Shawe-Taylor formulation of the risk bound

  The formulation here is from pg 58, Theorem 4.6 of the book
  "An Introduction to Support Vector Machines" by Cristianini and
  Shawe-Taylor, Cambridge University Press, 2000

  **Arguments**

    - VCDim: the VC dimension of the system

    - nData: the number of data points used

    - nWrong: the number of data points misclassified

    - conf: the confidence to be used for this risk bound

  **Returns**

    - a float

  **Notes**

    - this generates odd (mismatching) values

  """
  # maintain consistency of notation with Cristianini's book
  d = VCDim
  delta = conf
  l = nData
  k = nWrong

  structRisk = math.sqrt((4. / l) * (d * log2((2. * math.e * l) / d) + log2(4. / delta)))
  rEmp = 2. * k / l
  return rEmp + structRisk

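As written, the function returns 2k/l + sqrt((4/l) * (d*log2(2*e*l/d) + log2(4/delta))), following Theorem 4.6, where delta is the probability that the bound fails to hold. A sketch with the same illustrative inputs as above:

    from rdkit.ML.SLT.Risk import CristianiRiskBound

    bound = CristianiRiskBound(VCDim=10, nData=1000, nWrong=50, conf=0.05)
    print(bound)  # ~0.72, noticeably looser than the Burges bound on the same
                  # inputs (perhaps the "odd (mismatching) values" noted above)
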
def CherkasskyRiskBound(VCDim, nData, nWrong, conf, a1=1.0, a2=2.0):
  """ Calculates Cherkassky and Mulier's formulation of the risk bound

  The formulation here is from Eqns 4.22 and 4.23 on pg 108 of
  Cherkassky and Mulier's book "Learning From Data", Wiley, 1998.

  **Arguments**

    - VCDim: the VC dimension of the system

    - nData: the number of data points used

    - nWrong: the number of data points misclassified

    - conf: the confidence to be used for this risk bound

    - a1, a2: constants in the risk equation. Restrictions on these values:

      - 0 <= a1 <= 4

      - 0 <= a2 <= 2

  **Returns**

    - a float

  **Notes**

    - This appears to behave reasonably

    - the default a1=1.0 is chosen by analogy to Burges's paper

  """
  # maintain consistency of notation with Cherkassky's book
  h = VCDim
  n = nData
  eta = conf
  rEmp = float(nWrong) / n

  numerator = h * (math.log(float(a2 * n) / h) + 1) - math.log(eta / 4.)
  eps = a1 * numerator / n

  structRisk = eps / 2. * (1. + math.sqrt(1. + (4. * rEmp / eps)))

  return rEmp + structRisk

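Here eps = a1 * (h*(ln(a2*n/h) + 1) - ln(eta/4)) / n, and the returned bound is Remp + (eps/2)*(1 + sqrt(1 + 4*Remp/eps)). A short sketch comparing the three bounds side by side on the same illustrative problem as the earlier examples:

    from rdkit.ML.SLT import Risk

    # VC dimension 10, 1000 points, 50 errors, conf = 0.05
    args = (10, 1000, 50, 0.05)
    for fn in (Risk.BurgesRiskBound, Risk.CristianiRiskBound,
               Risk.CherkasskyRiskBound):
      print(fn.__name__, fn(*args))
    # for these inputs: Cherkassky ~0.15 < Burges ~0.31 < Cristiani ~0.72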