Completion Checks
=================

The :class:`~libcasm.monte.CompletionCheck` class can be used to check when a Monte Carlo simulation is complete. It is constructed with a :class:`~libcasm.monte.CompletionCheckParams` instance which allows:

- setting the requested absolute or relative precision for convergence of sampled data
- setting cutoff parameters for number of steps or passes, number of samples, amount of simulated time
  or amount of elapsed clocktime, with:

  - minimums, which force the simulation to keep running until a minimium value is reached
  - maximums, which force the simulation to stop running when a maximum value is reached

- controlling when completion checks are performed
- customizing the method used to check for equilibration
- customizing the method used to calculate statistics

A :class:`~libcasm.monte.CompletionCheckParams` instance can be read from a Python dict for input file parsing.

Performing completion checks
----------------------------

Example usage:

.. code-block:: Python

    import libcasm.monte as monte

    # ~~~ Initialize a random number engine ~~~
    random_number_engine = monte.RandomNumberEngine()

    # ~~~ User input: Specify the system ~~~
    # For example, the prim, cluster expansions, composition axes, etc.
    prim = ...
    composition_axes = ...
    formation_energy_clexulator = ...
    formation_energy_coeff = ...

    # ~~~ Construct the system ~~~
    # Construct a data structure that makes system data accessible
    # to the Monte Carlo calculator and sampling functions
    system = SystemType(
        prim=prim,
        composition_axes=composition_axes,
        formation_energy_clexulator=formation_energy_clex_clexulator,
        formation_energy_coeff=formation_energy_clex_coeff,
        ...
    )

    # ~~~ Construct the Monte Carlo calculator ~~~
    # A Monte Carlo calculator implements the Monte Carlo method and holds
    # method specific data along with access to the system.
    # For example, this would have the current configuration and conditions
    mc_calculator = MonteCarloCalculatorType(
        system,
        random_number_engine,
        ...
    )

    # ~~~ Construct sampling functions ~~~
    # Sampling functions should be given access to the mc_calculator to
    # be able to take samples when called, and stored in a StateSamplingFunctionMap
    sampling_functions = monte.StateSamplingFunctionMap()
    _functions = [
        make_formation_energy_f(mc_calculator),
        make_potential_energy_f(mc_calculator),
        make_parametric_composition_f(mc_calculator),
    ]
    for _f in _functions:
        sampling_functions[_f.name] = _f

    # ~~~ User input: Set CompletionCheckParams ~~~
    params = monte.CompletionCheckParams()
    params.cutoff_params.max_count = 100

    completion_check = monte.CompletionCheck(params)

    # ~~~ User input: Specify quantities to be sampled ~~~
    # should be keys into sampling_functions
    quantities = [
        "formation_energy",
        "potential_energy",
        "parametric_composition",
    ]

    # ~~~ Construct data samplers - for all requested quantities ~~~
    samplers = monte.SamplerMap()
    for quantity_name in quantities:
        if quantity_name in sampling_functions:
            f = sampling_functions[quantity_name]
            samplers[f.name] = monte.Sampler(
                shape=f.shape,
                component_names=f.component_names,
            )

    # this is required, but can be left with 0 samples to indicate unweighted
    sample_weight = monte.Sampler(shape=[])

    # method log specifies where to periodically write status messages
    # internally, method log tracks the elapsed clock time
    method_log = monte.MethodLog(path_to_logfile)

    n_steps = 0
    i_sample_period = 0
    sample_period = 1000
    while not completion_check.count_check(
        samplers=samplers,
        sample_weight=sample_weight,
        count=n_steps,
        method_log=method_log,
    ):
        # ... do Monte Carlo step ...
        n_steps += 1

        # ... periodically sample data ...
        if i_sample_period == sample_period:
            for name, f in sampling_functions.items():
                samplers[name].append(f())

Examples
--------

Run for a user-specified number of steps:

.. code-block:: Python

    import libcasm.monte as monte
    params = monte.CompletionCheckParams()
    params.cutoff_params.max_count = 100

    completion_check = monte.CompletionCheck(params)

    n_steps = 0
    while not completion_check.count_check(
        samplers=samplers,
        sample_weight=sample_weight,
        count=n_steps,
        method_log=method_log,
    ):
        # ... do Monte Carlo step ...
        n_steps += 1

Set a limit for the maximum elapsed clock time (in seconds):

.. note::
    Because checking the clock time at every Monte Carlo step slows down calculations excessively, clock time limits are only checked after the number of samples changes, and the actual clock time at which a simulation is stopped may longer than the limit set.

.. code-block:: Python

    import libcasm.monte as monte
    params = monte.CompletionCheckParams()
    params.cutoff_params.max_count = 3.6e2 # stop after running at least 1 hr

    completion_check = monte.CompletionCheck(params)

    n_steps = 0
    while not completion_check.count_check(
        samplers=samplers,
        sample_weight=sample_weight,
        count=n_steps,
        method_log=method_log,
    ):
        # ... do Monte Carlo step ...
        n_steps += 1


The ensemble average value of a property can be estimated from Metropolis Monte Carlo simulations as

.. math::

    \langle X \rangle \approx \bar{X} = \frac{\sum_l^N X_l}{N},

where:

- :math:`X_l` is the :math:`l`-th of :math:`N` observations of property :math:`X`
- :math:`\langle X \rangle` is the ensemble average
- :math:`\bar{X}` is the mean of the observations.

The error, :math:`\bar{X} - \langle X \rangle`, is normally distributed and approaches zero in the limit of large :math:`N` according to the central limit theorem,

.. math::

    \bar{X}-\langle X \rangle \approx \mathcal{N}(0, \sigma^2/N).

For stationary distributions, the variance of the error, :math:`\sigma^2`, can be calculated from the lag :math:`k` autocovariance between observations :math:`X_j` and :math:`X_{j+k}`  as

.. math::

    \sigma^2 = \gamma_0 + 2 \sum^\infty_{k=1} \gamma_k,

    \gamma_k = \mathrm{Cov}\left( X_j, X_{j+k} \right).

After an initial equilibration period, samples drawn from Monte Carlo simulations have a stationary distribution. Therefore, the error can be estimated from the observations by estimating :math:`\sum^\infty_{k=1} \gamma_k`.

If the autocovariance decays like :math:`\gamma_k = \gamma_0 \rho^{-|k|}`, then the infinite sum can be evaluated to give

.. math::

    \sigma^2 = \gamma_0 \left(\frac{1+\rho}{1-\rho}\right).


Van de Walle and Asta introduced methods for determining the initial equilibration period from which observations should be discarded before calculating the mean, and for calculating :math:`\rho` from the remaining observations to estimate the error in the mean. These methods are implemented in CASM in the :func:`~libcasm.monte.default_equilibration_check` and :class:`~libcasm.monte.BasicStatisticsCalculator` methods, respectively, and used as the defaults for automatic convergence to user-specified precision.

Users may also implement and use alternative methods through the `equilibration_check_f` and `calc_statistics_f` parameters of the :class:`~libcasm.monte.CompletionCheckParams` class.


Equilibration check
-------------------

The :func:`~libcasm.monte.default_equilibration_check` method partitions an array of
observations into three ranges:

- the equilibriation stage, ``[0, start1)``,
- the first partition, ``[start1, start2)``,
- and the second partition, ``[start2, N)``,

where `N` is ``len(observations)``, and `start1` and `start2` are indices into
the observations array such that ``0 <= start1 < start2 <= N``, and the number
of elements in the first and second partition are the same (within 1).

The simulation is considered equilibrated at observation `start1` if the
mean of the elements in the first and second partition are approximately equal
to the desired precsion, ``(abs(mean1 - mean2) < prec)``.

Additionally, in CASM, the value `start1` is incremented as much as needed to ensure
that the equilibration stage has observations on either side of the overall mean.

The result of :func:`~libcasm.monte.default_equilibration_check` is of type :class:`~libcasm.monte.IndividualEquilibrationResult`, which has two attributes which are set as follows:

- If all observations are approximately equal, then:

  - ``is_equilibrated = True``
  - ``N_samples_for_equilibration = 0``

- If the equilibration conditions are met, the result contains:

  - ``is_equilibrated = true``
  - ``N_samples_for_equilibration = start1``

- If the equilibration conditions are not met, the result contains:

  - ``is_equilibrated = false``
  - ``N_samples_for_equilibration = <undefined>``

If samples are weighted, as in the n-fold way algorithm, then the same
partitioning method is used, but with weighted observations calculated using:

.. code-block:: Python

    weighted_observation[i] = sample_weight[i] * observation[i] * N / W

where:

.. code-block:: Python

    W = np.sum(sample_weight)

The same weight_factor ``N/W`` applies for all properties.


Calculated precision
--------------------

The value of :math:`\rho` depends on the details of the system and the Monte Carlo method. The method of estimating :math:`\rho` introduced by Van de Walle and Asta `` and implemented in :class:`~libcasm.monte.BasicStatisticsCalculator`, calculates :math:`\hat{\rho}`, an estimate for :math:`\rho`, by searching for the smallest value of :math:`k` for which :math:`\hat{\gamma}_k/\hat{\gamma}_0 \le 1/2`.

If samples are weighted, as in the n-fold way algorithm, then :class:`~libcasm.monte.BasicStatisticsCalculator` uses one of two optional approaches which re-sample the data to generate :math:`N'` equally weighted observations. The weighted observations can be considered a time series, where the time interval associated with each sample value is equal to the sample weight. Then the time series can be sampled at :math:`N'` regular intervals. Given the resampled data, the two approaches are:

1. Calculate :math:`\bar{X} = \sum_l X_l w_l / \sum_l w_l` and :math:`\hat{\gamma}_0 = \mathrm{Var}(X)` directly from the original observations and sample weights, and only calculate :math:`\hat{\rho}` from resampled observations
2. Calculate :math:`\bar{X}`, :math:`\hat{\gamma}_0`, and :math:`\hat{\rho}` from the resampled observations

Given :math:`\hat{\gamma}_0` and :math:`\hat{\rho}`, the error in the mean, :math:`\bar{X} \pm p`, is calculated to a user-specified confidence level according to

.. math::

    p = \sqrt{2} * \mathrm{erf}^{-1}(c) \sqrt{ \frac{\hat{\gamma}_0}{N} \left(\frac{1+\hat{\rho}}{1-\hat{\rho}}\right) }

where :math:`c` is the confidence level, and :math:`p` is the calculated precision.