VC Dimension
VC dimension (for Vapnik–Chervonenkis dimension) (Vapnik and Chervonenkis (1968, 1971), Vapnik (1979)) measures the capacity of a hypothesis space. Capacity is a measure of complexity: it gauges the expressive power, richness or flexibility of a set of functions by assessing how wiggly its members can be.
Sewell (2006)
“Let us say we have a dataset containing N points. These N points can be labeled in 2^{N} ways as positive and negative. Therefore, 2^{N} different learning problems can be defined by N data points. If for any of these problems, we can find a hypothesis h ∈ H that separates the positive examples from the negative, then we say H shatters N points. That is, any learning problem definable by N examples can be learned with no error by a hypothesis drawn from H. The maximum number of points that can be shattered by H is called the Vapnik–Chervonenkis (VC) dimension of H, is denoted as VC(H), and measures the capacity of the hypothesis class H.
In figure 2.5, we see that an axis-aligned rectangle can shatter four points in two dimensions. Then VC(H), when H is the hypothesis class of axis-aligned rectangles in two dimensions, is four. In calculating the VC dimension, it is enough that we find four points that can be shattered; it is not necessary that we be able to shatter any four points in two dimensions. For example, four points placed on a line cannot be shattered by rectangles. However, we cannot place five points in two dimensions anywhere such that a rectangle can separate the positive and negative examples for all possible labelings.
VC dimension may seem pessimistic. It tells us that using a rectangle as our hypothesis class, we can learn only datasets containing four points and not more. A learning algorithm that can learn datasets of four points is not very useful. However, this is because the VC dimension is independent of the probability distribution from which instances are drawn. In real life, the world is smoothly changing, instances close by most of the time have the same labels, and we need not worry about all possible labelings. There are a lot of datasets containing many more data points than four that are learnable by our hypothesis class (figure 2.1). So even hypothesis classes with small VC dimensions are applicable and are preferred over those with large VC dimensions, for example, a lookup table that has infinite VC dimension.”
Alpaydin (2004), page 22
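Alpaydin's rectangle example is small enough to verify by brute force. The sketch below (the function name and point configurations are mine) treats points inside an axis-aligned rectangle as positive, and uses the fact that if any rectangle realises a labeling, then the bounding box of the positive points does too:

```python
from itertools import product

def rectangle_shatters(points):
    """Do axis-aligned rectangles (inside = positive) shatter these 2-D points?
    If any rectangle realises a labeling, the bounding box of the positives does."""
    for labels in product([0, 1], repeat=len(points)):
        pos = [p for p, y in zip(points, labels) if y == 1]
        if not pos:
            continue  # a rectangle away from all points gives the all-negative labeling
        xlo, xhi = min(x for x, _ in pos), max(x for x, _ in pos)
        ylo, yhi = min(y for _, y in pos), max(y for _, y in pos)
        for (x, y), lab in zip(points, labels):
            if lab == 0 and xlo <= x <= xhi and ylo <= y <= yhi:
                return False  # a negative point is trapped inside the bounding box
    return True

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]   # one point per side: shattered
collinear = [(0, 0), (1, 0), (2, 0), (3, 0)]   # four points on a line: not shattered
print(rectangle_shatters(diamond), rectangle_shatters(collinear))  # True False
```

This confirms both halves of the quote: some four-point configuration is shattered, while four collinear points are not.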
“The key to extending results on potential learnability to infinite spaces is the observation that what matters is not the cardinality of H, but rather what may be described as its ‘expressive power’. In this chapter we shall formalise this notion in terms of the Vapnik–Chervonenkis dimension of H, a notion originally defined by Vapnik and Chervonenkis (1971), and introduced into learnability theory by Blumer et al. (1986, 1989). The development of this notion is probably the most significant contribution that mathematics has made to Computational Learning Theory.
[…]
We noted above that the number of possible classifications by H of a sample of length m is at most 2^{m}, this being the number of binary vectors of length m. We say that a sample x of length m is shattered by H, or that H shatters x, if this maximum possible value is attained; that is, if H gives all possible classifications of x. Note that if the examples in x are not distinct then x cannot be shattered by any H. When the examples are distinct, x is shattered by H if and only if for any subset S of E_{x}, there is some hypothesis h in H such that for 1 ≤ i ≤ m,
[…]
S is then the subset of E_{x} comprising the positive examples of h.
Based on the intuitive notion that a hypothesis space H has high expressive power if it can achieve all possible classifications of a large set of examples, we use as a measure of this power the Vapnik–Chervonenkis dimension, or VC dimension, of H, defined as follows. The VC dimension of H is the maximum length of a sample shattered by H; if there is no such maximum, we say that the VC dimension of H is infinite. Using the notation introduced in the previous section, we can say that the VC dimension of H, denoted VCdim(H), is given by
[…]
where we take the maximum to be infinite if the set is unbounded.
[…]”
Anthony and Biggs (1992)
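As a concrete instance of this definition, the class of closed intervals on the real line (inside = positive) has VC dimension 2: any two distinct points can be shattered, but the +,−,+ labeling of three points cannot be realised. A brute-force sketch (the helper name is mine, not from the text):

```python
from itertools import product

def interval_shatters(points):
    """Do closed intervals [a, b] (inside = positive) shatter these reals?
    If any interval realises a labeling, [min(pos), max(pos)] does too."""
    for labels in product([0, 1], repeat=len(points)):
        pos = [x for x, y in zip(points, labels) if y == 1]
        if not pos:
            continue  # an interval away from all points gives the all-negative labeling
        a, b = min(pos), max(pos)
        if any(y == 0 and a <= x <= b for x, y in zip(points, labels)):
            return False
    return True

print(interval_shatters([1.0, 2.0]))       # True: VC dimension of intervals >= 2
print(interval_shatters([1.0, 2.0, 3.0]))  # False: the +,-,+ labeling is unrealisable
```

Since the middle point of any three reals spoils the +,−,+ labeling, no sample of length 3 is shattered and VCdim = 2.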
“The central result of this approach to analysing learning systems relates the number of examples, the training set error and the complexity of the hypothesis space to the generalization error. The appropriate measure for the complexity of the hypothesis space is the Vapnik–Chervonenkis (VC) dimension. (Recall that the VC dimension of a space of {−1, 1}-valued functions is the size of the largest subset of their domain for which the restriction of the space to that subset is the set of all {−1, 1}-valued functions; see Vapnik and Chervonenkis (1971).) For example the following result of Vapnik (in a slightly strengthened version (Shawe-Taylor et al., 1998)) gives a high probability bound on the error of a hypothesis. In this theorem, the training set error Er_{x}(f) of a hypothesis f : X → {−1, 1} for a sequence x = ((x_{1}, y_{1}), …, (x_{l}, y_{l})) ∈ (X × {−1, 1})^{l} of l labelled examples is the proportion of examples for which f(x_{i}) ≠ y_{i}. The generalization error of f is the probability that f(x) ≠ y, for a randomly chosen labelled example (x, y) ∈ X × {−1, 1}.”
Bartlett and Shawe-Taylor (1999), page 44
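The training set error Er_{x}(f) defined above is simply the misclassification fraction over the labelled sample; a minimal sketch with a hypothetical sign hypothesis (the sample data is my own toy example):

```python
def training_error(f, sample):
    """Empirical (training set) error: the fraction of labelled examples
    (x, y) in the sample for which f(x) != y."""
    return sum(f(x) != y for x, y in sample) / len(sample)

# toy sample in (X x {-1, 1})^l with a simple sign hypothesis (my example)
sign = lambda x: 1 if x >= 0 else -1
data = [(-2.0, -1), (-1.0, -1), (0.5, 1), (1.0, -1)]
print(training_error(sign, data))  # 0.25 -- one of the four examples is misclassified
```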
Bishop (1995)
“The VC dimension is a property of a set of functions {f(α)} (again, we use α as a generic set of parameters: a choice of α specifies a particular function), and can be defined for various classes of function f. Here we will only consider functions that correspond to the two-class pattern recognition case, so that f(x, α) ∈ {−1, 1} ∀x, α. Now if a given set of l points can be labeled in all possible 2^{l} ways, and for each labeling, a member of the set {f(α)} can be found which correctly assigns those labels, we say that that set of points is shattered by that set of functions. The VC dimension for the set of functions {f(α)} is defined as the maximum number of training points that can be shattered by {f(α)}. Note that, if the VC dimension is h, then there exists at least one set of h points that can be shattered, but in general it will not be true that every set of h points can be shattered.”
Burges (1998)
“The VC dimension thus gives concreteness to the notion of the capacity of a given set of functions.”
Burges (1998)
“To provide constructive distribution-independent bounds on the generalization ability of learning machines, we need to evaluate the growth function in (4.10). This can be done using the concept of VC-dimension for a set of approximating functions of a learning machine. First we present this concept for the set of indicator functions.
Vapnik and Chervonenkis (1968) proved that the growth function is either linear or bounded by a logarithmic function of the number of samples n (see Fig 4.2). The point n = h where the growth starts to slow down is called the VC-dimension (denoted by h). If it is finite, then the growth function does not grow linearly for large enough samples and, in fact, is bounded by a logarithmic function:
[…]
The VC-dimension h is a characteristic of a set of functions. Finiteness of h provides necessary and sufficient conditions for the fast rate of convergence and for distribution-independent consistency of ERM learning, in view of (4.10). On the other hand, if the bound stays linear for any n,
G(n) = n ln 2 (4.12)
then the VC-dimension for the set of indicator functions is (by definition) infinite. In this case any sample of size n can be split in all 2^{n} possible ways by the functions of a learning machine, and no valid generalization is possible, in view of the condition (4.10).
Importance of finite VC-dimension for consistency of ERM learning can be intuitively explained and related to philosophical theories of nonfalsifiability (Vapnik, 1995). […]
Cherkassky and Mulier (1998), pages 99–100
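The logarithmic bound mentioned in the quote is usually written G(n) ≤ h(ln(n/h) + 1) for n > h; the sketch below (function name and values are mine, assuming that standard form) shows how the bound falls below the linear n ln 2 once n exceeds h:

```python
from math import log

def growth_bound(n, h):
    """Bound on the growth function G(n) for a class of VC-dimension h:
    linear (n ln 2) up to n = h, then logarithmic, h * (ln(n/h) + 1)
    (the standard form; this helper is my own illustration)."""
    return n * log(2) if n <= h else h * (log(n / h) + 1)

h = 10
for n in (5, 10, 100, 1000):
    # compare the bound with linear growth n ln 2
    print(n, round(growth_bound(n, h), 2), round(n * log(2), 2))
```

For n ≤ h the bound coincides with n ln 2; beyond that point it grows only logarithmically, which is what makes valid generalization possible.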
“To control the generalization ability of a learning machine one has to control two different factors: the error rate on the training data and the capacity of the learning machine as measured by its VC-dimension. […] The two factors in (38) form a trade-off: the smaller the VC-dimension of the set of functions of the learning machine, the smaller the confidence interval, but the larger the value of the error frequency. A general way for resolving this trade-off was proposed as the principle of structural risk minimization: for the given data set one has to find a solution that minimizes their sum. A particular case of the structural risk minimization principle is the Occam-Razor principle: keep the first term equal to zero and minimize the second one.”
Cortes and Vapnik (1995)
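The trade-off Cortes and Vapnik describe can be made concrete with the standard form of the VC confidence term in Vapnik's risk bound (the function name and parameter values here are my own illustration): for a fixed sample size, the confidence interval widens as the VC-dimension h grows:

```python
from math import log, sqrt

def vc_confidence(h, l, eta=0.05):
    """Confidence term in the bound R <= R_emp + confidence, holding with
    probability 1 - eta, in its standard form
    sqrt((h * (ln(2l/h) + 1) - ln(eta/4)) / l)  (a sketch, not from the paper)."""
    return sqrt((h * (log(2 * l / h) + 1) - log(eta / 4)) / l)

# the trade-off: a richer class (larger h) gives a wider confidence interval
for h in (10, 50, 200):
    print(h, round(vc_confidence(h, l=1000), 3))
```

Structural risk minimization picks the hypothesis class minimizing the sum of training error and this term; more data (larger l) shrinks the term for any fixed h.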
“A set of points […] for which the set […] is said to be shattered by H. If there are sets of any size which can be shattered then the growth function is equal to 2^{l} for all l. The final ingredient in the Vapnik–Chervonenkis theory is an analysis of the case when there is a finite d which is the largest size of shattered set. In this case the growth function can be bounded as follows for l > d
[…]
giving polynomial growth with exponent d. The value d is known as the Vapnik–Chervonenkis (VC) dimension of the class H, denoted by VCdim(H). These quantities measure the richness or flexibility of the function class, something that is also often referred to as its capacity. Controlling the capacity of a learning system is one way of improving its generalisation accuracy. Putting the above bound on the growth function together with the observation about the fraction of permutations for which the first half of the sample is able to mislead the learner, we obtain the following bound on the left hand side of inequality (4.2):
[…]
Remark 4.2 The theorem shows that for infinite sets of hypotheses the problem of overfitting is avoidable and the measure of complexity that should be used is precisely the VC dimension. The size of training set required to ensure good generalisation scales linearly with this quantity in the case of a consistent hypothesis.
VC theory provides a distribution free bound on the generalisation of a consistent hypothesis, but more than that it can be shown that the bound is in fact tight up to log factors, as the following theorem makes clear.
[…]”
Cristianini and Shawe-Taylor (2000)
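The polynomial bound on the growth function quoted above is the Sauer–Shelah lemma; a minimal sketch (the function name is mine):

```python
from math import comb

def sauer_bound(m, d):
    """Sauer-Shelah lemma: a class of VC dimension d induces at most
    sum_{i=0}^{d} C(m, i) distinct labelings on any m points."""
    return sum(comb(m, i) for i in range(min(m, d) + 1))

print(sauer_bound(3, 3))   # 8 = 2^3: up to m = d, all labelings remain possible
print(sauer_bound(10, 3))  # 176, far below 2^10 = 1024
```

For m ≤ d the bound equals 2^m, and for m > d it grows only polynomially with exponent d, which is exactly the transition the quote describes.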
“The Vapnik–Chervonenkis dimension is a way of measuring the complexity of a class of functions by assessing how wiggly its members can be.
The VC dimension of the class {f(x, α)} is defined to be the largest number of points (in some configuration) that can be shattered by members of {f(x, α)}. A set of points is said to be shattered by a class of functions if, no matter how we we (sic) assign a binary label to each point, a member of the class can perfectly separate them. Figure 7.6 shows that the VC dimension of linear indicator functions in the plane is 3 but not 4, since no four points can be shattered by a set of lines. In general, a linear indicator function in p dimensions has VC dimension p + 1, which is also the number of free parameters. On the other hand, it can be shown that the family sin(αx) has infinite VC dimension, as Figure 7.5 suggests. By appropriate choice of α, any set of points can be shattered by this class (Exercise 7.7). So far we have discussed the VC dimension only of indicator functions, but this can be extended to real-valued functions. The VC dimension of a class of real-valued functions {g(x, α)} is defined to be the VC dimension of the indicator class {I(g(x, α) − β > 0)}, where β takes values over the range of g. One can use the VC dimension in constructing an estimate of in-sample prediction error; different types of results are available. Using the concept of VC dimension, one can prove results about the optimism of the training error when using a class of functions. […]”
Hastie, Tibshirani and Friedman (2001)
“Restriction to particular hypothesis spaces of limited size is one form of bias that has been explored to facilitate learning.^{41} In addition to the cardinality of the hypothesis space, a parameter known as the Vapnik–Chervonenkis (VC) dimension of the hypothesis space has been shown to be useful in quantifying the bias inherent in a restricted hypothesis space.
The VC dimension of a hypothesis space H, denoted VCdim(H), is defined to be the maximum number d of instances that can be labeled as positive and negative examples in all 2^{d} possible ways, such that each labeling is consistent with some hypothesis in H.^{15,55} […]”
Haussler and Warmuth (1995) In: Wolpert
“This result is fundamental as it shows that we can upper bound the richness N_{H} of the hypothesis space by an integer summary—the VC dimension. A lot of research has been done to obtain tight upper bounds on the VC dimension which has, by definition, the following combinatorial interpretation: If ?_{H} = {{(x, y) ∈ Z | l_{0−1}(h(x), y) = 1} | h ∈ H} is the induced set of events that a hypothesis h ∈ H labels (x, y) ∈ Z incorrectly, then the VC dimension ? of ?_{H} is the largest natural number ? such that there exists a sample z ∈ Z^{?} of size ? which can be subdivided in all 2^{?} different ways by (set) intersection with ?_{H}. Then we say that ?_{H} shatters z. If no such number exists we say that the VC dimension of ?_{H} or H is infinite. Sometimes the VC dimension is also called the shatter coefficient […] An important property of the VC dimension is that it does not necessarily coincide with the number of parameters used. This feature is the key to seeing that, by studying convergence of expected risks, we are able to overcome a problem which is known as the curse of dimensionality: […]”
Herbrich (2002), pages 128–130
“How do support vector machines implement structural risk minimization? Vapnik showed that there is a connection between the margin and the VC-dimension. […] The lemma states that the VC-dimension is lower the larger the margin. Note that the VC-dimension of maximum-margin hyperplanes does not necessarily depend on the number of features! Instead the VC-dimension depends on the Euclidean length ‖w‖ of the weight vector w optimized by the support vector machine.
Intuitively, this means that the true error of a separating maximum-margin hyperplane is close to the training error even in high-dimensional spaces, if it has a small weight vector. However, bound (2) does not directly apply to support vector machines, since the VC-dimension depends on the location of the examples [Vapnik, 1995]. The bounds in [Shawe-Taylor et al., 1996] account for this data dependency. An overview is given in [Cristianini and Shawe-Taylor, 2000].”
Joachims (2002)
“Let us consider a few natural geometric concept classes, and informally calculate their VC dimension. It is important to emphasize the nature of the existential and universal quantifiers in the definition of VC dimension: in order to show that the VC dimension of a class is at least d, we must simply find some shattered set of size d. In order to show that the VC dimension is at most d, we must show that no set of size d + 1 is shattered. For this reason, proving upper bounds on the VC dimension is usually considerably more difficult than proving lower bounds. The following examples are not meant to be precise proofs of the stated bounds on the VC dimension, but are simply illustrative exercises to provide some practice thinking about the VC dimension.”
Kearns and Vazirani (1994)
“The VC (Vapnik–Chervonenkis) dimension h is a property of a set of approximating functions of a learning machine that is used in all important results in the statistical learning theory. Despite the fact that the VC dimension is very important, the unfortunate reality is that its analytic estimations can be used only for the simplest sets of functions. The basic concept of the VC dimension is presented first for a two-class pattern recognition case and then generalized for some sets of real approximating functions. Let us first introduce the notion of an indicator function i_{F}(x, w), defined as the function that can assume only two values, say i_{F}(x, w) ∈ {−1, 1}, ∀x, w.
(A standard example of an indicator function is the hard limiting threshold function given as i_{F}(x, w) = sign(x^{T}w); see fig. 2.7 and section 3.1.) In the case of two-class classification tasks, the VC dimension of a set of indicator functions i_{F}(x, w) is defined as the largest number h of points that can be separated (shattered) in all possible ways. For two-class pattern recognition, a set of l points can be labeled in 2^{l} possible ways. According to the definition of the VC dimension, given some set of indicator functions i_{F}(x, w), if there are members of the set that are able to assign all labels correctly, the VC dimension of this set of functions h = l. […]”
Kecman (2001)
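The claim that separating lines shatter 3 points in the plane but never 4 (stated by Hastie et al. above and implicit in Kecman's definition) can be checked exhaustively for small configurations. The sketch below is my own construction: it enumerates every dichotomy realisable by sign(w·x + b), using the fact that for a generic direction w the positive points form a suffix of the points sorted by their projection onto w.

```python
from itertools import combinations
from math import atan2, cos, sin, pi

def halfspace_dichotomies(points):
    """All labelings of `points` (in R^2) realisable by sign(w.x + b).
    Projection orderings only change at directions perpendicular to some
    difference vector, so one generic direction per angular arc suffices."""
    n = len(points)
    # critical direction angles (mod pi) at which two projections tie
    crit = sorted({(atan2(y2 - y1, x2 - x1) + pi / 2) % pi
                   for (x1, y1), (x2, y2) in combinations(points, 2)})
    # one strictly generic direction inside each arc between critical angles
    angles = [(a + b) / 2 for a, b in zip(crit, crit[1:] + [crit[0] + pi])]
    dichos = set()
    for t in angles:
        w = (cos(t), sin(t))
        order = sorted(range(n), key=lambda i: w[0] * points[i][0] + w[1] * points[i][1])
        for k in range(n + 1):  # every threshold cut of the sorted projections
            pos = set(order[k:])
            labels = tuple(1 if i in pos else -1 for i in range(n))
            dichos.add(labels)
            dichos.add(tuple(-y for y in labels))  # opposite orientation of w
    return dichos

triangle = [(0, 0), (1, 0), (0, 1)]
square = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(len(halfspace_dichotomies(triangle)))  # 8 = 2^3: three points are shattered
print(len(halfspace_dichotomies(square)))    # 14 < 16: the two XOR labelings are missing
```

The missing labelings for the square are exactly the XOR patterns, so the VC dimension of lines in the plane is 3, matching the quotes.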
“7.4.1 Shattering a Set of Instances
The VC dimension measures the complexity of the hypothesis space H, not by the number of distinct hypotheses |H|, but instead by the number of distinct instances from X that can be completely discriminated using H. To make this notion more precise, we first define the notion of shattering a set of instances. Consider some subset of instances S ⊆ X. For example, Figure 7.3 shows a subset of three instances from X. Each hypothesis h from H imposes some dichotomy on S; that is, h partitions S into the two subsets {x ∈ S | h(x) = 1} and {x ∈ S | h(x) = 0}. Given some instance set S, there are 2^{|S|} possible dichotomies, though H may be unable to represent some of these. We say that H shatters S if every possible dichotomy of S can be represented by some hypothesis from H.
Definition: A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.
Figure 7.3 illustrates a set S of three instances that is shattered by the hypothesis space. Notice that each of the 2^{3} dichotomies of these three instances is covered by some hypothesis. Note that if a set of instances is not shattered by a hypothesis space, then there must be some concept (dichotomy) that can be defined over the instances, but that cannot be represented by the hypothesis space. The ability of H to shatter a set of instances is thus a measure of its capacity to represent target concepts defined over these instances.
7.4.2 The Vapnik–Chervonenkis Dimension
The ability to shatter a set of instances is closely related to the inductive bias of a hypothesis space. Recall from Chapter 2 that an unbiased hypothesis space is one capable of representing every possible concept (dichotomy) definable over the instance space X. Put briefly, an unbiased hypothesis space H is one that shatters the instance space X. What if H cannot shatter X, but can shatter some large subset S of X? Intuitively, it seems reasonable to say that the larger the subset of X that can be shattered, the more expressive H. The VC dimension of H is precisely this measure.
Definition: The Vapnik–Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) ≡ ∞.
Note that for any finite H, VC(H) ≤ log_{2}|H|. To see this, suppose that VC(H) = d. Then H will require 2^{d} distinct hypotheses to shatter d instances. Hence, 2^{d} ≤ |H|, and d = VC(H) ≤ log_{2}|H|.”
Mitchell (1997)
“A simple hypothesis space (small VC-dimension) will probably not contain good approximating functions and will lead to a high training (and true) error. On the other hand a too rich hypothesis space (large VC-dimension) will lead to a small training error, but the second term in the right-hand side of (3.2) will be large. This reflects the fact that for a hypothesis space with high VC-dimension the hypothesis with low training error may just happen to fit the training data without accurately predicting new examples.”
Joachims (2002)
“The so-called VC dimension provides a measure roughly analogous to, but more general than, the ln|H| measure obtained from PAC analysis. The VC dimension can be applied to continuous function classes, to which standard PAC analysis does not apply.
PAC-learning theory and VC theory were first connected by the “four Germans” (none of whom actually is German): Blumer, Ehrenfeucht, Haussler, and Warmuth (1989).”
Russell and Norvig (2003)
“The VC-dimension h is a property of the hypothesis space H (written as VCdim(H) = h). It is a non-negative integer which can be infinite, and is a measure of complexity or expressive power of the family of classification functions realized by the learning machine that can be implemented using H. For many cases h is proportional to the number of free parameters of {f(·, α)}. The VC-dimension is therefore a capacity measure of the family of classification functions realized by the learning machine. […] In other words, the VC dimension can be interpreted as the largest number of data points that can be separated in all possible ways by the functions contained in the space H. This is the same as the maximal number of training examples that can be learned by the machine without error for all possible binary labelings of the training data. One has to be aware that it is not necessary that for a given VC dimension h, every set of h points can be shattered. This is only guaranteed for at least one case.”
Rychetsky (2001), pages 15–16
“The best-known capacity concept of VC theory is the VC dimension, defined as the largest number h of points that can be separated in all possible ways using functions of the given class (cf. chapter 4). An example of a VC bound is the following: if h < l is the VC dimension of the class of functions that the learning machine can implement, then for all functions of that class, with a probability of at least 1 − δ, the bound […]”
Schölkopf, Burges and Smola (1999)
“A simple VC dimension example. There are 2^{3} = 8 ways of assigning 3 points to two classes. For the displayed points in R^{2}, all 8 possibilities can be realized using separating hyperplanes, in other words, the function class can shatter 3 points.
This would not work if we were given 4 points, no matter how we placed them. Therefore, the VC dimension of the class of separating hyperplanes in R^{2} is 3. The best-known capacity concept of VC theory is the VC dimension, defined as follows: each function of the class separates the patterns in a certain way and thus induces a certain labelling of the patterns. Since the labels are in {±1}, there are at most 2^{m} different labellings for m patterns.”
Schölkopf and Smola (2002)
“The VC dimension h of a space of {−1, 1}-valued functions, G, is the size of the largest subset of domain points that can be labelled arbitrarily by choosing functions only from G. The VC dimension can be used to prove high probability bounds on the error of a hypothesis chosen from a class of decision functions G—this is the famous result of Vapnik and Chervonenkis [1971]. The bounds have since been improved slightly by Talagrand [1994]—see also [Alexander, 1984]. […]


 The first drawback is that the VC dimension must actually be determined (or at least bounded) for the class of interest—and this is often not easy to do. (However, bounds on the VC dimension h have been computed for many natural decision function classes, including parametric classes involving standard arithmetic and boolean operations. See Anthony and Bartlett [1999] for a review of these results.)
 The second (more serious) drawback is that the analysis ignores the structure of the mapping from training samples to hypotheses, and concentrates solely on the range of the learner’s possible outputs. Ignoring the details of the learning map can omit many of the factors that are crucial for determining the success of the learning algorithm in real situations. […]


 . (2000), pages 10–11
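Determining the VC dimension is indeed hard in general, as the first drawback above notes, but for a tiny finite hypothesis class over a finite domain a brute-force search works; the sketch below (the function name and the one-sided threshold class are my own illustrative choices):

```python
from itertools import combinations

def vc_dimension(hypotheses, domain):
    """Brute-force VC dimension of a finite hypothesis class over a finite
    domain (exponential search; only sensible for tiny examples)."""
    d = 0
    for k in range(1, len(domain) + 1):
        shattered_some = False
        for subset in combinations(domain, k):
            behaviours = {tuple(h(x) for x in subset) for h in hypotheses}
            if len(behaviours) == 2 ** k:  # every labeling realised -> shattered
                shattered_some = True
                break
        if not shattered_some:
            return d  # no subset of size k is shattered, so none larger is either
        d = k
    return d

# illustrative class: one-sided thresholds h_a(x) = 1 iff x >= a
domain = [1, 2, 3, 4]
thresholds = [lambda x, a=a: int(x >= a) for a in (0, 1.5, 2.5, 3.5, 5)]
print(vc_dimension(thresholds, domain))  # 1: any single point is shattered, no pair is
```

One-sided thresholds can only produce monotone label patterns, so no pair of points receives the labeling (1, 0), and the search stops at dimension 1.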

“[…]

I(z, α, β) = θ{Q(z, α) − β}, α ∈ Λ, β ∈ (A, B), (3.22)

where θ(u) is the step function: θ(u) = 0 if u < 0, θ(u) = 1 if u ≥ 0.

The VC dimension of the set of real functions Q(z, α), α ∈ Λ, is defined to be the VC dimension of the set of corresponding indicators (3.22) with parameters α ∈ Λ and β ∈ (A, B).”

Vapnik (2000), pages 79–80


 “The VC framework is concerned with confidence intervals P(c − s > ε | f, m) for variable ε.
 The VC dimension is a characterization of the support (over all h’s) of P(h | d). Many theorems bounding P(c − s > ε | f, m) depend only on m, the VC dimension of the generalizer, and ε. In particular, such results hold for all f and/or P(f) and for any generalizer of the given VC dimension.
 In the illustrative case where P(h | d) always guesses the same h, c does not vary in P(c − s > ε | f, m); rather s does. (s is determined by the random variable d_{X}, which is unspecified in the conditioning event.) In all the other frameworks c varies.
 This motivates the analogy of viewing the VC framework as coin-tossing; one has a coin of bias-towards-heads c, and flips it m times. The VC framework’s distribution of interest is analogous to the coin-flipping distribution P(c − s > ε | c, m). This coin-flipping distribution concerns the probability that the empirically observed frequency of heads will differ much from c. This probability does not directly address the process of observing the empirical frequency of heads and then seeing if the bias-towards-heads differs much from that observed frequency. That process instead concerns P(c − s > ε | s, m).
 Accordingly VC results do not directly apply to distributions like P(c | s, m); one cannot use VC results to bound the probability of large error based on an observed value of s and the VC dimension of one’s generalizer. Note no recourse to off-training-set error is needed for this result.
 The behavior of P(c − s > ε | f, m) for off-training-set error is not currently well understood.”
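The coin-tossing analogy can be simulated directly; the sketch below (names and parameter values are my own) estimates the two-sided tail P(|c − s| > ε | c, m) by Monte Carlo and compares it with the distribution-free Hoeffding bound 2·exp(−2ε²m):

```python
import random
from math import exp

def tail_estimate(c, m, eps, trials=20000, seed=0):
    """Monte Carlo estimate of P(|c - s| > eps | c, m): flip a coin of
    bias-towards-heads c, m times; s is the empirical frequency of heads."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        s = sum(rng.random() < c for _ in range(m)) / m
        if abs(c - s) > eps:
            bad += 1
    return bad / trials

p_hat = tail_estimate(c=0.6, m=100, eps=0.1)
hoeffding = 2 * exp(-2 * 0.1 ** 2 * 100)  # distribution-free bound, roughly 0.27
print(p_hat, "<=", hoeffding)
```

As the quote stresses, this bounds the deviation of s given the bias c; it says nothing directly about P(c | s, m), the probability of the bias given an observed frequency.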
