机器学习基石作业2

出现noisy时错误的可能性,这题是简单的概率问题,μ是h出错的概率,当y=f(x)时出错:λμ;当otherwise时出错:(1-λ)(1-μ)。

=> λμ+(1-λ)(1-μ)

机器学习基石作业2

 

h的表现和μ无关,λμ+(1-λ)(1-μ)=1-λ-(1-2λ)μ => λ=0.5

 

机器学习基石作业2

 

根据公式4(2N)dVC exp(-1/8 ξ2 N)=0.05求得sample size最接近的是460000

 

机器学习基石作业2

 1 import matplotlib.pyplot as plt
 2 import numpy as np
 3 
 4 t = np.arange(-10,10,0.01)
 5 
 6 def original_vc_bound(n):
 7     return np.sqrt(((8.0 / n) * (np.log(80) + 50 * np.log(2 * n))))
 8 
 9 def rademacher_penalty_bound(n):
10     return 1.0 / n + np.sqrt((2.0 / n) * np.log(20)) + np.sqrt((2.0 * (np.log(2 * n) + 50 * np.log(n))) / n)
11 
12 def parrondo_vandenBroek(t, n):
13     return t - np.sqrt(1.0 / n * (2 * t + np.log(120) + 50 * np.log(2 * n)))
14 
15 def devroye(t, n):
16     return t - np.sqrt(1.0 / (2 * n) * (4 * t * (1 + t) + np.log(80) + 100 * np.log(n)))
17 
18 def variant_VC_bound(n):
19     return np.sqrt(16.0 / n * (np.log(2 / np.sqrt(0.05)) + 50 * np.log(n)))
20 
21 print original_vc_bound(10000)
22 print rademacher_penalty_bound(10000)
23 print variant_VC_bound(10000)
24 
25 plt.subplot(211)
26 plt.plot(t,parrondo_vandenBroek(t, 10000))
27 plt.subplot(212)
28 plt.plot(t, devroye(t, 10000))
29 plt.show()
30 
31 print parrondo_vandenBroek(0.331308785962, 10000)
32 print devroye(0.331308785962, 10000)
33 
34 print parrondo_vandenBroek(0.22, 10000)
35 print devroye(0.22, 10000)

机器学习基石作业2

机器学习基石作业2

 

parrondo_vandenBroek在图上可以看到大概当x = 0.22时,y < 0

devroye在图书可以看到大概当x = 0.22时,y > 0,x的bound肯定小于0.22

综上Devroye bound smallest

机器学习基石作业2
 1 import matplotlib.pyplot as plt
 2 import numpy as np
 3 
 4 t = np.arange(-10,10,0.01)
 5 
 6 def original_vc_bound(n):
 7     return np.sqrt(((8.0 / n) * (np.log(80) + 50 * np.log(2 * n))))
 8 
 9 def rademacher_penalty_bound(n):
10     return 1.0 / n + np.sqrt((2.0 / n) * np.log(20)) + np.sqrt((2.0 * (np.log(2 * n) + 50 * np.log(n))) / n)
11 
12 def parrondo_vandenBroek(t, n):
13     return t - np.sqrt(1.0 / n * (2 * t + np.log(120) + 50 * np.log(2 * n)))
14 
15 def devroye(t, n):
16     return t - np.sqrt(1.0 / (2 * n) * (4 * t * (1 + t) + np.log(80) + 100 * np.log(n)))
17 
18 def variant_VC_bound(n):
19     return np.sqrt(16.0 / n * (np.log(2 / np.sqrt(0.05)) + 50 * np.log(n)))
20 
21 print original_vc_bound(5)
22 print rademacher_penalty_bound(5)
23 print variant_VC_bound(5)
24 
25 plt.subplot(211)
26 plt.plot(t,parrondo_vandenBroek(t, 5))
27 plt.subplot(212)
28 plt.plot(t, devroye(t, 5))
29 plt.show()
30 
31 print parrondo_vandenBroek(7.04877656418, 5)
32 print devroye(7.04877656418, 5)
33 
34 print parrondo_vandenBroek(5.5, 5)
35 print devroye(5.5, 5)

机器学习基石作业2
机器学习基石作业2

很明显,最tightest的是
parrondo_vandenBroek

机器学习基石作业2
此题就是将N个数,拆成positive和negtive,最多只能三部分,排列组合:

C(n-1,2)*2 + C(n-1,1)*2 +2 = n2-n+2


机器学习基石作业2

由于已经有了mhN)所以直接计算break point,n=3时m=8,n=4时m=14 => k=4 => dvc=3


机器学习基石作业2

可以把此题看成点在以a为半径的圆和以b为半径的圆之间时为1,其余情况-1

可将所有情况分为2种,N个点全不在a,b之间和其他

=> 在n个点的n+1个空间里选2个+n个点之外的任意空间 => C(n+1,2)+1


机器学习基石作业2
degree is D,hypothesis parameters c = (c0;c1;····;cd );

=> dVC = D + 1

 

 

机器学习基石作业2

把[xi > ti]看成一个d维度的决策树,那么可以选出这样2d个点使得[xi > ti]后,得到决策树的2d个叶子节点v。可以选择

不同的S,使得2d个叶子节点v,得到所有2^2d个结果。即N=2d时可以shatter,N=2d+1时,[xi > ti]后最多只能得到

2d个叶子节点,所有必定有一个重复,所以不管使用什么S,都无法得到2^(2d+1)个结果,即N=2d+1不可shatter,综上

N=2d+1是break point => dvc =2d

 

 

机器学习基石作业2

感觉上由于α是随便取的,每个点都有机会取到两种情况所以2n,不知道该如何证明。。。。

 

机器学习基石作业2

 

这题很明显,等于是求每个函数的复杂度,√N^dvc是O(N^(dvc/2));mh(n/2)是O((n/2)^dvc);2dvc

2i*mh(n-i)由于指数的存在肯定是大于mh(n)所以至少是O(N^(dvc));

由于N ≥ dvc 2 => O(N^(dvc)) > O(N^(dvc/2)),O((n/2)^dvc),2dvc

=> 2i*mh(n-i) upper

 

机器学习基石作业2

 

很明显N是不对的,如果存在成长函数mh(n)=n,那么它的break point就在n=1的时候,那么如果当n=2

时,mh(2)=2,那么其中至少有一个点存在2种情况 => 该点shatter =>矛盾

此时mh(n)=1

 

机器学习基石作业2

 

intersection,肯定是越交集越小,依题意最小肯定是0,最大肯定也是小于H1~Hk里最小的

 

机器学习基石作业2

 

union,肯定是越并越大,所以最小也得H1~Hk里最大的,关于为什么不是0老师给了很好的解答

(any single hypothesis is of VC dimension 0,but the union of many hypotheses (which is a hypothesis set) can be of non-zero VC dimension)

upper bound比较麻烦,思考了很久,在论坛里看老师和小伙伴的回帖,有了些启发。我们思考这样一种情况Hk

dvc为n,总共有2n+k-1-(k-1)中情况也就是缺k-1种情况shatter,而这k-1种情况分别是single hypothesis H1~HK-1,

很明显unionH1~Hk的dvc是n+k-1 = sum(dvc)+k-1

 并且:不考虑dvc max的H,每个H的情况2dvc_hi ≤ hi ≤ 2dvc_hi+ti < 2dvc_hi+1

因为x>1时,2x1*2x > 2x1+2x2 => union(hi) ≤ sum(2dvc_hi+ti) ≤ ∏(2dvc_hi+1)

=> dvc(union(hi)) ≤ dvc((2dvc_hi+1)) = sum(dvc(hi)+1)

=>加上max 后dvc(∏(2dvc_hi+1)) = sum(dvc(hi))+k-1

 

 

机器学习基石作业2

当Θ>0时:

    当s==1:Eout=((1-Θ)*s*0.2+(1-(1-Θ)*s)*0.8+1*0.2)/2=0.5+0.3*(Θ-1)*s

    当s==-1:Eout=(-(1-Θ)*s*0.8+(1+(1-Θ)*s)*0.2+1*0.8)/2=0.5+0.3*(Θ-1)*s

由于Θ<0时和Θ>0时是对称的,综上

=> Eout=0.5+0.3*(|Θ|-1)*s

 

 

机器学习基石作业2

x = [i / 10.0 for i in range(-10, 11)]
x.remove(0)
y = []
error = 0
best = 100
for i in x:
    if i <= 0:
        y.append(-1)
    else:
        y.append(1)
y[0] = 1
y[9] = 1
y[19] = -1
y[10] = -1

for i in range(5000):
    if i < 2500:
        s = -1
    else:
        s = 1
    theta = -1 + (i % 2500) / 1250.0

    for j in range(20):
        if x[j] > theta:
            t = 1
        else:
            t = -1
        if t * s * y[j] < 0:
            error += 1
    if best > error:
        best = error
    error = 0

print best / 20.0

机器学习基石作业2

 

 

机器学习基石作业2

 1 import math
 2 x = [i / 10.0 for i in range(-10, 11)]
 3 x.remove(0)
 4 y = []
 5 error = 0
 6 best = 100
 7 bestTheta = 0
 8 bestS = 0
 9 
10 for i in x:
11     if i <= 0:
12         y.append(-1)
13     else:
14         y.append(1)
15 y[0] = 1
16 y[9] = 1
17 y[19] = -1
18 y[10] = -1
19 
20 for i in range(5000):
21     if i < 2500:
22         s = -1
23     else:
24         s = 1
25     theta = -1 + (i % 2500) / 1250.0
26 
27     for j in range(20):
28         if x[j] > theta:
29             t = 1
30         else:
31             t = -1
32         if t * s * y[j] < 0:
33             error += 1
34     if best > error:
35         best = error
36         bestTheta = theta
37         bestS = s
38     error = 0
39 
40 print 0.5 + 0.3 * (math.fabs(bestTheta) - 1) * bestS

机器学习基石作业2

机器学习基石作业2

 1 import math
 2 x1 = []
 3 x2 = []
 4 x3 = []
 5 x4 = []
 6 y = []
 7 file = open("C:/Users/samsung/Desktop/ntumlone_hw1_hw1_18_train.dat")
 8 for line in file:
 9     lines = line.split(' ')
10     x1.append(float(lines[0]))
11     x2.append(float(lines[1]))
12     x3.append(float(lines[2]))
13     x4.append(float(lines[3].split('\t')[0]))
14     y.append(int(lines[3].split('\t')[1].split('\n')[0]))
15 file.close()
16 x = [x1, x2, x3, x4]
17 error = 0
18 best = 10000
19 bestTheta = 0
20 bestS = 0
21 for t in range(4):
22     for i in range(5000):
23         if i < 2500:
24             s = -1
25         else:
26             s = 1
27         theta = 0 + (i % 2500) / 2500.0
28 
29         for j in range(500):
30             if x[t][j] > theta:
31                 a = 1
32             else:
33                 a = -1
34             if a * s * y[j] < 0:
35                 error += 1
36         if best > error:
37             best = error
38             bestTheta = theta
39             bestS = s
40         error = 0
41 
42 print best / 500.0

机器学习基石作业2

 

 

 

机器学习基石作业2

 1 import math
 2 x1 = []
 3 x2 = []
 4 x3 = []
 5 x4 = []
 6 y = []
 7 file = open("C:/Users/samsung/Desktop/ntumlone_hw1_hw1_18_train.dat")
 8 for line in file:
 9     lines = line.split(' ')
10     x1.append(float(lines[0]))
11     x2.append(float(lines[1]))
12     x3.append(float(lines[2]))
13     x4.append(float(lines[3].split('\t')[0]))
14     y.append(int(lines[3].split('\t')[1].split('\n')[0]))
15 file.close()
16 x = [x1, x2, x3, x4]
17 error = 0
18 best = 10000
19 bestTheta = 0
20 bestS = 0
21 bestT = 0
22 for t in range(4):
23     for i in range(100000):
24         if i < 50000:
25             s = -1
26         else:
27             s = 1
28         theta = 0 + (i % 50000) / 50000.0
29 
30         for j in range(500):
31             if x[t][j] > theta:
32                 a = 1
33             else:
34                 a = -1
35             if a * s * y[j] < 0:
36                 error += 1
37         if best > error:
38             best = error
39             bestTheta = theta
40             bestS = s
41             bestT = t
42         error = 0
43 print bestT
44 print bestTheta
45 print bestS
46 print best / 500.0
47 
48 file = open("C:/Users/samsung/Desktop/ntumlone_hw1_hw1_18_test.dat")
49 xtest = []
50 ytest = []
51 for line in file:
52     lines = line.split(' ')
53     xtest.append(float(lines[3].split('\t')[0]))
54     ytest.append(int(lines[3].split('\t')[1].split('\n')[0]))
55 
56 file.close()
57 for j in range(500):
58     if xtest[j] > bestTheta:
59         a = 1
60     else:
61         a = -1
62     if a * bestS * ytest[j] < 0:
63         error += 1
64 print error / 500.0

机器学习基石作业2