Contains two classes for fitting maximum entropy models subject to linear constraints on the expectations of arbitrary feature statistics. One class, “model”, is for small discrete sample spaces, using explicit summation. The other, “bigmodel”, is for sample spaces that are either continuous (and perhaps high-dimensional) or discrete but too large to sum over, and uses importance sampling. conditional Monte Carlo methods.
The maximum entropy model has exponential form
p(x) = exp(theta^T . f_vec(x)) / Z(theta).
with a real parameter vector theta of the same length as the feature statistic f_vec. For more background, see, for example, Cover and Thomas (1991), Elements of Information Theory.
See the file bergerexample.py for a walk-through of how to use these routines when the sample space is small enough to be enumerated.
See bergerexamplesimulated.py for a a similar walk-through using simulation.
Copyright: Ed Schofield, 2003-2006 License: BSD-style (see LICENSE.txt in main source directory)
model.beginlogging (self, filename[, freq]) | Enable logging params for each fn evaluation to files named ‘filename.freq.pickle’, ‘filename.(2*freq).pickle’, ... each ‘freq’ iterations. |
model.endlogging (self) | Stop logging param values whenever setparams() is called. |
model.clearcache (self) | Clears the interim results of computations depending on the parameters and the sample. |
model.crossentropy (self, fx[, log_prior_x, base]) | Returns the cross entropy H(q, p) of the empirical distribution q of the data (with the given feature matrix fx) with respect to the model p. For discrete distributions this is defined as: |
model.dual (self[, params, ignorepenalty, ...]) | Computes the Lagrangian dual L(theta) of the entropy of the model, for the given vector theta=params. Minimizing this function (without constraints) should fit the maximum entropy model subject to the given constraints. These constraints are specified as the desired (target) values self.K for the expectations of the feature statistic. |
model.fit (self, K[, algorithm]) | Fit the maxent model p whose feature expectations are given by the vector K. |
model.grad (self[, params, ignorepenalty]) | Computes or estimates the gradient of the entropy dual. |
model.log (self, params) | This method is called every iteration during the optimization process. It calls the user-supplied callback function (if any), logs the evolution of the entropy dual and gradient norm, and checks whether the process appears to be diverging, which would indicate inconsistent constraints (or, for bigmodel instances, too large a variance in the estimates). |
model.logparams (self) | Saves the model parameters if logging has been enabled and the # of iterations since the last save has reached self.paramslogfreq. |
model.normconst (self) | Returns the normalization constant, or partition function, for the current model. Warning – this may be too large to represent; if so, this will result in numerical overflow. In this case use lognormconst() instead. |
model.reset (self[, numfeatures]) | Resets the parameters self.params to zero, clearing the cache variables dependent on them. Also resets the number of function and gradient evaluations to zero. |
model.setcallback (self[, callback, callback_dual, ...]) | Sets callback functions to be called every iteration, every function evaluation, or every gradient evaluation. All callback functions are passed one argument, the current model object. |
model.setparams (self, params) | Set the parameter vector to params, replacing the existing parameters. params must be a list or numpy array of the same length as the model’s feature vector f. |
model.setsmooth (sigma) | Speficies that the entropy dual and gradient should be computed with a quadratic penalty term on magnitude of the parameters. This ‘smooths’ the model to account for noise in the target expectation values or to improve robustness when using simulation to fit models and when the sampling distribution has high variance. The smoothing mechanism is described in Chen and Rosenfeld, ‘A Gaussian prior for smoothing maximum entropy models’ (1999). |
model.expectations (self) | The vector E_p[f(X)] under the model p_params of the vector of feature functions f_i over the sample space. |
model.lognormconst (self) | Compute the log of the normalization constant (partition function) Z=sum_{x in samplespace} p_0(x) exp(params . f(x)). The sample space must be discrete and finite. |
model.logpmf (self) | Returns an array indexed by integers representing the logarithms of the probability mass function (pmf) at each point in the sample space under the current model (with the current parameter vector self.params). |
model.pmf_function (self[, f]) | Returns the pmf p_theta(x) as a function taking values on the model’s sample space. The returned pmf is defined as: |
model.setfeaturesandsamplespace (self, f, samplespace) | Creates a new matrix self.F of features f of all points in the sample space. f is a list of feature functions f_i mapping the sample space to real values. The parameter vector self.params is initialized to zero. |
A maximum-entropy (exponential-form) model on a large sample space.
The model expectations are not computed exactly (by summing or integrating over a sample space) but approximately (by Monte Carlo estimation). Approximation is necessary when the sample space is too large to sum or integrate over in practice, like a continuous sample space in more than about 4 dimensions or a large discrete space like all possible sentences in a natural language.
Approximating the expectations by sampling requires an instrumental distribution that should be close to the model for fast convergence. The tails should be fatter than the model.
bigmodel.estimate (self) | This function approximates both the feature expectation vector E_p f(X) and the log of the normalization term Z with importance sampling. |
bigmodel.logpdf (self, fx[, log_prior_x]) | Returns the log of the estimated density p(x) = p_theta(x) at the point x. If log_prior_x is None, this is defined as: log p(x) = theta.f(x) - log Z where f(x) is given by the (m x 1) array fx. |
bigmodel.pdf (self, fx) | Returns the estimated density p_theta(x) at the point x with feature statistic fx = f(x). This is defined as p_theta(x) = exp(theta.f(x)) / Z(theta), where Z is the estimated value self.normconst() of the partition function. |
bigmodel.pdf_function (self) | Returns the estimated density p_theta(x) as a function p(f) taking a vector f = f(x) of feature statistics at any point x. This is defined as: p_theta(x) = exp(theta.f(x)) / Z |
bigmodel.resample (self) | (Re)samples the matrix F of sample features. |
bigmodel.setsampleFgen (self, sampler[, staticsample]) | Initializes the Monte Carlo sampler to use the supplied generator of samples’ features and log probabilities. This is an alternative to defining a sampler in terms of a (fixed size) feature matrix sampleF and accompanying vector samplelogprobs of log probabilities. |
bigmodel.settestsamples (self, F_list, logprob_list[, testevery, priorlogprob_list]) | Requests that the model be tested every ‘testevery’ iterations during fitting using the provided list F_list of feature matrices, each representing a sample {x_j} from an auxiliary distribution q, together with the corresponding log probabiltiy mass or density values log {q(x_j)} in logprob_list. This is useful as an external check on the fitting process with sample path optimization, which could otherwise reflect the vagaries of the single sample being used for optimization, rather than the population as a whole. |
bigmodel.stochapprox (self, K) | Tries to fit the model to the feature expectations K using stochastic approximation, with the Robbins-Monro stochastic approximation algorithm: theta_{k+1} = theta_k + a_k g_k - a_k e_k where g_k is the gradient vector (= feature expectations E - K) evaluated at the point theta_k, a_k is the sequence a_k = a_0 / k, where a_0 is some step size parameter defined as self.a_0 in the model, and e_k is an unknown error term representing the uncertainty of the estimate of g_k. We assume e_k has nice enough properties for the algorithm to converge. |
bigmodel.test (self) | Estimate the dual and gradient on the external samples, keeping track of the parameters that yield the minimum such dual. The vector of desired (target) feature expectations is stored as self.K. |
A conditional maximum-entropy (exponential-form) model p(x|w) on a discrete sample space. This is useful for classification problems: given the context w, what is the probability of each class x?
The form of such a model is
p(x | w) = exp(theta . f(w, x)) / Z(w; theta)
where Z(w; theta) is a normalization term equal to
Z(w; theta) = sum_x exp(theta . f(w, x)).
The sum is over all classes x in the set Y, which must be supplied to the constructor as the parameter ‘samplespace’.
Such a model form arises from maximizing the entropy of a conditional model p(x | w) subject to the constraints:
K_i = E f_i(W, X)
where the expectation is with respect to the distribution
q(w) p(x | w)
where q(w) is the empirical probability mass function derived from observations of the context w in a training set. Normally the vector K = {K_i} of expectations is set equal to the expectation of f_i(w, x) with respect to the empirical distribution.
This method minimizes the Lagrangian dual L of the entropy, which is defined for conditional models as
- L(theta) = sum_w q(w) log Z(w; theta)
- sum_{w,x} q(w,x) [theta . f(w,x)]
Note that both sums are only over the training set {w,x}, not the entire sample space, since q(w,x) = 0 for all w,x not in the training set.
where the expectation is as defined above.
conditionalmodel.dual (self[, params, ignorepenalty]) | The entropy dual function is defined for conditional models as |
conditionalmodel.expectations (self) | The vector of expectations of the features with respect to the distribution p_tilde(w) p(x | w), where p_tilde(w) is the empirical probability mass function value stored as self.p_tilde_context[w]. |
conditionalmodel.fit (self[, algorithm]) | Fits the conditional maximum entropy model subject to the constraints |
conditionalmodel.lognormconst (self) | Compute the elementwise log of the normalization constant (partition function) Z(w)=sum_{y in Y(w)} exp(theta . f(w, y)). The sample space must be discrete and finite. This is a vector with one element for each context w. |
conditionalmodel.logpmf (self) | Returns a (sparse) row vector of logarithms of the conditional probability mass function (pmf) values p(x | c) for all pairs (c, x), where c are contexts and x are points in the sample space. The order of these is log p(x | c) = logpmf()[c * numsamplepoints + x]. |
arrayexp (x) | Returns the elementwise antilog of the real array x. We try to exponentiate with numpy.exp() and, if that fails, with python’s math.exp(). numpy.exp() is about 10 times faster but throws an OverflowError exception for numerical underflow (e.g. exp(-800), whereas python’s math.exp() just returns zero, which is much more helpful. |
arrayexpcomplex (x) | Returns the elementwise antilog of the vector x. We try to exponentiate with numpy.exp() and, if that fails, with python’s math.exp(). numpy.exp() is about 10 times faster but throws an OverflowError exception for numerical underflow (e.g. exp(-800), whereas python’s math.exp() just returns zero, which is much more helpful. |
columnmeans (A) | This is a wrapper for general dense or sparse dot products. It is only necessary as a common interface for supporting ndarray, scipy spmatrix, and PySparse arrays. |
columnvariances (A) | This is a wrapper for general dense or sparse dot products. It is not necessary except as a common interface for supporting ndarray, scipy spmatrix, and PySparse arrays. |
densefeaturematrix (f, sample) | Returns an (m x n) dense array of non-zero evaluations of the scalar functions fi in the list f at the points x_1,...,x_n in the list sample. |
densefeatures (f, x) | Returns a dense array of non-zero evaluations of the functions fi in the list f at the point x. |
dotprod (u, v) | This is a wrapper around general dense or sparse dot products. It is not necessary except as a common interface for supporting ndarray, scipy spmatrix, and PySparse arrays. |
flatten (a) | Flattens the sparse matrix or dense array/matrix ‘a’ into a 1-dimensional array |
innerprod (A, v) | This is a wrapper around general dense or sparse dot products. It is not necessary except as a common interface for supporting ndarray, scipy spmatrix, and PySparse arrays. |
innerprodtranspose (A, v) | This is a wrapper around general dense or sparse dot products. It is not necessary except as a common interface for supporting ndarray, scipy spmatrix, and PySparse arrays. |
logsumexp (a) | Compute the log of the sum of exponentials log(e^{a_1}+...e^{a_n}) of the components of the array a, avoiding numerical overflow. |
logsumexp_naive (values) | For testing logsumexp(). Subject to numerical overflow for large values (e.g. 720). |
robustlog (x) | Returns log(x) if x > 0, the complex log cmath.log(x) if x < 0, or float(‘-inf’) if x == 0. |
rowmeans (A) | This is a wrapper for general dense or sparse dot products. It is only necessary as a common interface for supporting ndarray, scipy spmatrix, and PySparse arrays. |
sample_wr (population, k) | Chooses k random elements (with replacement) from a population. (From the Python Cookbook). |
sparsefeaturematrix (f, sample[, format]) | Returns an (m x n) sparse matrix of non-zero evaluations of the scalar or vector functions f_1,...,f_m in the list f at the points x_1,...,x_n in the sequence ‘sample’. |
sparsefeatures (f, x[, format]) | Returns an Mx1 sparse matrix of non-zero evaluations of the scalar functions f_1,...,f_m in the list f at the point x. |