Machine learning - Gibbs Algorithm

Course: Introduction to Machine learning
Institution: University of Greenwich

Gibbs Algorithm

The Bayes optimal classifier provides the best classification result achievable, but it can be computationally intensive: it requires the posterior probability of every hypothesis h ∈ H, the prediction of each hypothesis for each new instance, and the combination of the two to classify each new instance. The Gibbs algorithm instead chooses one hypothesis at random according to the posterior distribution P(h|D) and uses that single hypothesis to classify the new instance. If we assume the target concepts are drawn at random from H according to the priors over H, then the expected error of the Gibbs algorithm is at most twice the expected error of the Bayes optimal classifier. In practice, however, the Gibbs algorithm is seldom used.
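
A minimal sketch of the idea, with a made-up hypothesis space and made-up posterior values (nothing here is taken from the notes): the Bayes optimal classifier weights every hypothesis' vote by P(h|D), whereas the Gibbs algorithm samples a single hypothesis from P(h|D) and uses it on its own.

```python
import random

def bayes_optimal_classify(hypotheses, posteriors, x, labels=(0, 1)):
    """Weight every hypothesis' prediction by its posterior P(h|D)."""
    votes = {label: 0.0 for label in labels}
    for h, p in zip(hypotheses, posteriors):
        votes[h(x)] += p
    return max(votes, key=votes.get)

def gibbs_classify(hypotheses, posteriors, x):
    """Draw one hypothesis h ~ P(h|D) and use it alone to classify x."""
    h = random.choices(hypotheses, weights=posteriors, k=1)[0]
    return h(x)

# Toy example: hypotheses are threshold rules on a 1-D input.
hypotheses = [lambda x, t=t: int(x > t) for t in (0.2, 0.5, 0.8)]
posteriors = [0.2, 0.5, 0.3]   # assumed to have been computed from the data D
print(bayes_optimal_classify(hypotheses, posteriors, 0.6))
print(gibbs_classify(hypotheses, posteriors, 0.6))
```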

Bayesian Belief Networks

The assumption underlying the naive Bayes classifier - that attribute values are conditionally independent given the target value - works well in some settings, but in others it is too restrictive. However, Bayesian classification is intractable without some such assumption, and Bayesian belief networks offer an intermediate approach: rather than obliging us to determine dependencies between all combinations of attributes and attribute values, they describe conditional independence among subsets of attributes. More precisely, if we think of the attributes as discrete-valued random variables Y1, ..., Yn, each of which takes on a value from a set Vi, then the joint space of the variables is V1 × ... × Vn and the joint probability distribution is the probability distribution over this space. Bayesian belief networks describe a joint probability distribution for a set of variables, typically a subset of the available instance attributes, and allow us to combine prior knowledge about dependencies and independencies among attributes with the observed training data.

A Bayesian belief network is represented as a directed acyclic graph in which each node represents a variable, and the directed arcs assert that a variable is conditionally independent of its non-descendants given its immediate predecessors (parents). Associated with each variable is a conditional probability table (CPT) describing the probability distribution over that variable given the values of its parents. The network completely describes the joint probability distribution of the variables represented by its nodes, according to the formula

P(y1, ..., yn) = ∏_{i=1}^{n} P(yi | Parents(Yi)),

where Parents(Yi) denotes the immediate predecessors of Yi in the graph.

CPTs work well when all of the variables in the network are discrete, but if the network contains continuous variables it is impossible to specify a conditional probability explicitly for each value - there are infinitely many values. One way to avoid continuous variables is discretisation - dividing the range of values into a fixed set of intervals. This sometimes works, but it can cost accuracy and produce very large CPTs. Another approach is to use standard families of probability density functions that are specified by a finite number of parameters (e.g., the normal, or Gaussian, distribution). The network then needs to represent the conditional distribution of a continuous variable given discrete or continuous parents, and of a discrete variable given continuous parents.

Bayesian networks can be used to infer the probabilities of the values of one or more network variables given the observed values of the others. If only one variable (the query variable) is unknown and all other variables have observed values, inference is easy; in the general case, however, there may be unobserved, or hidden, variables in addition to the observed, or evidence, variables. What is then sought, for each query variable X and observed event e (where e instantiates a set of evidence variables), is the posterior probability distribution P(X|e). This problem is NP-hard in general, although inference succeeds in many cases: exact inference methods work well for some network structures, and for large, multiply connected networks, approximate inference can be carried out using randomised sampling, or Monte Carlo, algorithms, which provide approximate answers whose accuracy depends on the number of samples generated.
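
As an illustration of both the factorisation formula and sampling-based approximate inference, here is a sketch using a made-up three-variable network (Rain → WetGrass ← Sprinkler, with invented CPT values, not a network from the notes). The joint probability of an assignment is the product of each node's CPT entry given its parents, and P(X|e) is estimated by rejection sampling: draw samples in topological order, keep those consistent with the evidence, and count.

```python
import random

# Hypothetical three-variable network: Sprinkler -> WetGrass <- Rain.
# Each CPT maps an assignment of the parents to P(variable = True).
p_rain = 0.2                                  # P(Rain)
p_sprinkler = 0.4                             # P(Sprinkler)
p_wet = {                                     # P(WetGrass | Sprinkler, Rain)
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.05,
}

def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) = product over nodes of P(node | parents)."""
    pr = p_rain if rain else 1 - p_rain
    ps = p_sprinkler if sprinkler else 1 - p_sprinkler
    pw = p_wet[(sprinkler, rain)] if wet else 1 - p_wet[(sprinkler, rain)]
    return pr * ps * pw

def sample():
    """Ancestral sampling: draw each variable given its already-drawn parents."""
    rain = random.random() < p_rain
    sprinkler = random.random() < p_sprinkler
    wet = random.random() < p_wet[(sprinkler, rain)]
    return rain, sprinkler, wet

def rejection_sample_p_rain_given_wet(n_samples=100_000):
    """Estimate P(Rain | WetGrass=True) by discarding samples with WetGrass=False."""
    kept = rain_and_wet = 0
    for _ in range(n_samples):
        rain, _, wet = sample()
        if wet:
            kept += 1
            rain_and_wet += rain
    return rain_and_wet / kept if kept else float("nan")

print(joint(True, False, True))               # exact joint probability of one assignment
print(rejection_sample_p_rain_given_wet())    # approximate posterior; accuracy grows with samples
```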

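For the continuous-variable case discussed above, one common parametric choice is a linear Gaussian conditional distribution. The sketch below, with made-up parameter values, shows how the conditional distribution of a continuous child given a continuous parent can be specified by just three numbers (a slope, an intercept, and a variance) rather than an infinite table.

```python
import math
import random

# Hypothetical linear Gaussian conditional: a continuous child Y with one
# continuous parent X, modelled as  Y | X=x  ~  Normal(a*x + b, sigma^2).
a, b, sigma = 2.0, 1.0, 0.5

def conditional_density(y, x):
    """p(y | x) under the linear Gaussian model."""
    mean = a * x + b
    return math.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def sample_child(x):
    """Draw y ~ p(y | x)."""
    return random.gauss(a * x + b, sigma)

print(conditional_density(7.0, 3.0))   # density of y=7 given x=3 (the conditional mean here)
print(sample_child(3.0))
```
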
Learning Bayesian Networks

Learning a Bayesian network has a number of aspects: whether the structure is known or unknown, and whether the training examples provide values for all network variables or only some (i.e., the variables may be observable or hidden). If the network structure is known and we can observe all the variables, then the probabilities for the CPTs can be estimated directly from the training data. This is straightforward if all network variables are discrete, but more complex if any network variables are continuous, in which case we need to estimate the parameters of a parametric model (e.g., a linear Gaussian model).

Including hidden variables can greatly reduce the complexity of a Bayesian network - the choice of whether to include them can be guided by the minimum description length principle - by reducing the number of parameters in the conditional distributions and, in turn, the amount of data required to learn those parameters. Various methods have been explored for learning the parameters associated with hidden variables in Bayesian networks; one of the most popular is the expectation-maximisation (EM) algorithm. In the general EM framework we are given data with missing values, the space of possible models, and an initial model; until no change exceeds a chosen threshold, we repeat the expectation step (compute the expectation over the missing values, given the current model) and the maximisation step (replace the current model with the model that maximises the probability of the data). EM is used not only for estimating parameters involving hidden variables in Bayesian networks but also in many other applications, including unsupervised data clustering and hidden Markov models.
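
To make the E-step/M-step loop concrete, here is a minimal sketch of EM for the unsupervised clustering case mentioned above: a two-component, one-dimensional Gaussian mixture in which each point's component label is a hidden variable. The data and starting values are made up; the E-step computes each point's responsibility (the expected value of its hidden indicator), and the M-step re-estimates the mixture weight, means, and variances using those responsibilities as soft counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data drawn from two made-up Gaussians; the component labels are hidden.
data = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.5, 250)])

def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Initial model: mixture weight of component 0, plus means and variances.
w, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):   # or loop until the parameter change falls below a threshold
    # E-step: expected value of each point's hidden indicator (responsibilities).
    p0 = w * gaussian_pdf(data, mu[0], var[0])
    p1 = (1 - w) * gaussian_pdf(data, mu[1], var[1])
    r0 = p0 / (p0 + p1)
    r1 = 1 - r0

    # M-step: re-estimate the parameters to maximise the expected data likelihood.
    w = r0.mean()
    mu = np.array([np.sum(r0 * data) / r0.sum(), np.sum(r1 * data) / r1.sum()])
    var = np.array([np.sum(r0 * (data - mu[0]) ** 2) / r0.sum(),
                    np.sum(r1 * (data - mu[1]) ** 2) / r1.sum()])

print("weight of component 0:", w)
print("means:", mu, "variances:", var)
```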

So far we have assumed that the structure of the network is known, which frequently reflects basic causal knowledge about the domain; in some cases, however, the causal model is not available or is in dispute, and when the network structure is unknown we can search for a good model. Starting from a model with no links, we can add parents for each node, fit the parameters, and measure the accuracy of the resulting model; alternatively, we can guess an initial structure and use a search algorithm (e.g., hill climbing) to add, remove, or reverse edges and nodes. In addition, a method is needed for deciding when a good structure has been found. We can test whether the conditional independence assertions implicit in the structure are satisfied in the data, and we can measure the extent to which the proposed model explains the data - but we must be careful to penalise complexity, since searching for the maximum-likelihood hypothesis alone leads to a fully connected network, because adding more parents can never decrease the likelihood. This remains an active research topic.
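
A hedged sketch of the scoring idea, using a handful of made-up binary training examples: the snippet fits maximum-likelihood CPTs for a few candidate structures (each described simply as a parent set per variable) and compares the raw log-likelihood with a BIC-style penalised score, log-likelihood minus (log N / 2) times the number of free parameters. The raw likelihood can only improve as parents are added, which is exactly why a penalty term is needed; a hill-climbing search would use a score like this to accept or reject edge changes.

```python
import math
from collections import Counter

# Tiny made-up binary dataset over three variables (each row is one training example).
data = [
    {"A": 0, "B": 0, "C": 0}, {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 0}, {"A": 0, "B": 0, "C": 1},
    {"A": 1, "B": 1, "C": 1}, {"A": 0, "B": 1, "C": 0},
    {"A": 1, "B": 0, "C": 1}, {"A": 0, "B": 0, "C": 0},
]
VALUES = (0, 1)

def log_likelihood(structure, data):
    """Log-likelihood of the data under maximum-likelihood CPTs for `structure`,
    where `structure` maps each variable to a tuple of its parents."""
    ll = 0.0
    for var, parents in structure.items():
        joint = Counter((tuple(row[p] for p in parents), row[var]) for row in data)
        parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
        for (pa, _val), n in joint.items():
            ll += n * math.log(n / parent_counts[pa])   # n * log P_ML(val | pa)
    return ll

def num_parameters(structure):
    """Independent CPT entries: each node has (|V|-1) * |V|^(number of parents)."""
    return sum((len(VALUES) - 1) * len(VALUES) ** len(parents)
               for parents in structure.values())

def penalised_score(structure, data):
    """BIC-style score: log-likelihood minus (log N / 2) * number of parameters."""
    return log_likelihood(structure, data) - 0.5 * math.log(len(data)) * num_parameters(structure)

no_links = {"A": (), "B": (), "C": ()}
one_edge = {"A": (), "B": ("A",), "C": ()}              # candidate with the edge A -> B
fully_connected = {"A": (), "B": ("A",), "C": ("A", "B")}

for name, s in [("no links", no_links), ("A->B", one_edge), ("fully connected", fully_connected)]:
    print(name,
          "log-likelihood:", round(log_likelihood(s, data), 2),
          "penalised score:", round(penalised_score(s, data), 2))
```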

