Fit the maxent model p whose feature expectations are given by the vector K.
Model expectations are computed either exactly or using Monte Carlo simulation, depending on the ‘func’ and ‘grad’ parameters passed to this function.
For ‘model’ instances, expectations are computed exactly, by summing over the given sample space. If the sample space is continuous or too large to iterate over, use the ‘bigmodel’ class instead.
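As a rough illustration of what "computed exactly" means here, the sketch below sums over a small discrete sample space to get the feature expectations of a model p(x) proportional to exp(theta . f(x)). The function name exact_expectations and the die example are hypothetical and not part of this module's API.

```python
import math

def exact_expectations(theta, samplespace, features):
    """Return E_p[f_i(X)] for p(x) proportional to exp(sum_i theta_i * f_i(x))."""
    # Unnormalized log-probability of every point in the sample space
    logp = [sum(t * f(x) for t, f in zip(theta, features)) for x in samplespace]
    # Log of the normalization constant, shifted by the max for numerical stability
    m = max(logp)
    log_z = m + math.log(sum(math.exp(lp - m) for lp in logp))
    probs = [math.exp(lp - log_z) for lp in logp]
    # Exact expectation of each feature: a plain sum over the sample space
    return [sum(p * f(x) for p, x in zip(probs, samplespace)) for f in features]

# Hypothetical example: a six-sided die with the single feature f(x) = x
samplespace = [1, 2, 3, 4, 5, 6]
features = [lambda x: float(x)]
uniform_mean = exact_expectations([0.0], samplespace, features)[0]
tilted_mean = exact_expectations([0.5], samplespace, features)[0]
```

With theta = 0 the model is uniform; a positive theta tilts the probability mass toward larger faces. The cost of each evaluation grows with the size of the sample space, which is why this approach stops being practical for very large or continuous spaces.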
For ‘bigmodel’ instances, the model expectations are not computed exactly (by summing or integrating over a sample space) but approximately (by Monte Carlo simulation). Simulation is necessary when the sample space is too large to sum or integrate over in practice, such as a continuous sample space in more than about 4 dimensions or a large discrete space such as all possible sentences in a natural language.
Approximating the expectations by sampling requires an instrumental distribution. For fast convergence it should be close to the model, and its tails should be fatter than the model's. This instrumental distribution is specified by calling setsampleFgen() with a user-supplied generator function that yields a matrix of features of a random sample and its log pdf values.
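The sampling idea can be sketched with a self-normalized importance sampling estimator. This is a minimal standalone illustration, not the code used by ‘bigmodel’: the function importance_expectations, the Gaussian instrumental distribution, and the sample size are all assumptions chosen for the example. The inputs mirror the description above: a matrix of sample features plus the instrumental log pdf value of each sample point.

```python
import math
import random

def importance_expectations(theta, feature_rows, log_q):
    """Estimate E_p[f] for p(x) proportional to exp(theta . f(x)) from draws of q."""
    # Log importance weights: unnormalized log model density minus log q
    logw = [sum(t * f for t, f in zip(theta, row)) - lq
            for row, lq in zip(feature_rows, log_q)]
    m = max(logw)                          # shift to avoid overflow in exp
    w = [math.exp(lw - m) for lw in logw]
    total = sum(w)
    # Weighted average of each feature column
    return [sum(wj * row[i] for wj, row in zip(w, feature_rows)) / total
            for i in range(len(theta))]

# Hypothetical setup: feature f(x) = x^2 with theta = -0.5, so the model is a
# standard normal and the true value of E_p[x^2] is 1.
rng = random.Random(0)
sigma = 2.0                                # instrumental N(0, 2^2): fatter tails
xs = [rng.gauss(0.0, sigma) for _ in range(20000)]
feature_rows = [[x * x] for x in xs]
log_q = [-x * x / (2 * sigma ** 2) - math.log(sigma * math.sqrt(2 * math.pi))
         for x in xs]
estimate = importance_expectations([-0.5], feature_rows, log_q)[0]
```

Because the instrumental distribution is wider than the model, the importance weights stay bounded. Swapping in an instrumental distribution with thinner tails than the model would let a few rare samples dominate the sum and make the estimate erratic.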
The algorithm can be ‘CG’, ‘BFGS’, ‘LBFGSB’, ‘Powell’, or ‘Nelder-Mead’.
The CG (conjugate gradients) method is the default; it is quite fast and requires only linear space in the number of parameters (not quadratic, like Newton-based methods).
The BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm is a quasi-Newton (variable metric) method. It may be faster than the CG method but requires O(N^2) instead of O(N) memory, so it is infeasible for more than about 10^3 parameters.
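To make the O(N^2) versus O(N) scaling concrete, the short calculation below compares the storage for one dense N x N matrix with that for one length-N vector. It assumes 8-byte floating-point values and a single dense matrix, which is an assumption for illustration; the helper names are hypothetical, and the practical limit also depends on per-iteration cost and available memory.

```python
def dense_matrix_bytes(n_params, bytes_per_float=8):
    # Storage for one dense N x N matrix, as kept by a variable metric method
    return n_params * n_params * bytes_per_float

def vector_bytes(n_params, bytes_per_float=8):
    # Storage for one length-N vector, the unit of memory for CG-style methods
    return n_params * bytes_per_float

for n in (10**3, 10**4, 10**5):
    print(n, "parameters:",
          dense_matrix_bytes(n) / 1e6, "MB for the matrix,",
          vector_bytes(n) / 1e6, "MB per vector")
```

Each tenfold increase in the number of parameters multiplies the matrix storage by one hundred but the vector storage only by ten.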
The Powell algorithm doesn’t require gradients. For small models it is slow but robust. For big models (where func and grad are simulated) with large variance in the function estimates, it may be less robust than the gradient-based algorithms.
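The trade-off between gradient-based and gradient-free algorithms can be tried out on a toy problem. The sketch below is not this function's implementation. It minimizes the standard maxent dual, log Z(theta) - theta . K, whose gradient is E_p[f] - K, using scipy.optimize.minimize directly. The die problem and the names dual, grad, and results are assumptions made for the example, and the method strings are those of scipy.optimize.minimize rather than of the algorithm parameter described above.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy problem: a six-sided die, one feature f(x) = x, target mean 4.5
xs = np.arange(1, 7, dtype=float)
F = xs[np.newaxis, :]                  # feature matrix, shape (n_features, n_points)
K = np.array([4.5])                    # desired feature expectations

def model_probs(theta):
    logits = np.atleast_1d(theta) @ F
    p = np.exp(logits - logits.max())  # shift by the max for numerical stability
    return p / p.sum()

def dual(theta):
    # Dual objective: log Z(theta) - theta . K
    theta = np.atleast_1d(theta)
    logits = theta @ F
    m = logits.max()
    return m + np.log(np.exp(logits - m).sum()) - theta @ K

def grad(theta):
    # Gradient of the dual: model expectations minus target expectations
    return F @ model_probs(theta) - K

results = {}
for method in ("CG", "BFGS", "L-BFGS-B", "Powell", "Nelder-Mead"):
    uses_grad = method in ("CG", "BFGS", "L-BFGS-B")
    res = minimize(dual, np.zeros(1), jac=grad if uses_grad else None, method=method)
    # Record the fitted model's mean, which should match the target K
    results[method] = float((F @ model_probs(res.x))[0])
```

On a problem this small every method should reach the target mean. The differences described above only show up as the number of parameters grows or as the function and gradient values become noisy estimates.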