lisa.stats.Stats

class lisa.stats.Stats(df, value_col='value', ref_group=None, filter_rows=None, compare=True, agg_cols=None, mean_ci_confidence=None, stats=None, stat_col='stat', unit_col='unit', ci_cols=('ci_minus', 'ci_plus'), control_var_col='fixed', mean_kind_col='mean_kind', non_normalizable_units={'pval'})

Bases: Loggable

Compute the statistics on an input pandas.DataFrame in “database” format.

Parameters:
  • df (pandas.DataFrame) –

    Dataframe in database format, i.e. meaningless index, and values in a given column with the other columns used as tags.

    Note

    Redundant tag columns (i.e. columns that are equal to another one) will be removed from the dataframe.

  • value_col (str) – Name of the column containing the values.

  • ref_group (dict(str, object)) –

    Reference group used to compare the other groups against. Its format is dict(tag_column_name, tag_value). The comparison will be made on subgroups built out of all the other tag columns, with the reference subgroup being the one matching that dictionary. If the tag value is None, the key will only be used for grouping in graphs. Comparison will add the following statistics:

    • A 2-sample Kolmogorov-Smirnov test 'ks2samp_test' column. This test is non-parametric and checks for a difference in distributions. The only assumption is that the distribution is continuous, which should suit almost all use cases.

    • Most statistics will be normalized against the reference group as a difference percentage, except for a few non-normalizable values.

    Note

    The group referenced must exist, otherwise unexpected behaviours might occur.

  • filter_rows (dict(object, object) or None) – Filter the given pandas.DataFrame with a dict of {"column": value} that rows have to match to be selected.

  • compare (bool) – If True, normalize most statistics as a percentage of change compared to ref_group.

  • agg_cols (list(str)) –

    Columns to aggregate on. In a sense, the given columns will be treated like a compound iteration number. Defaults to:

    • The iteration column if available, otherwise

    • All the tag columns that are neither the value nor part of the ref_group.

  • mean_ci_confidence (float) – Confidence level used to establish the mean confidence interval, between 0 and 1.

  • stats (dict(str, str or collections.abc.Callable)) –

    Dictionary of statistical functions to summarize each value group formed by tag columns along the aggregation columns. If None is given as value, the name will be passed to pandas.core.groupby.SeriesGroupBy.agg(). Otherwise, the provided function will be run.

    Note

    One set of keys is special: 'mean', 'std' and 'sem'. When the value None is used for one of them, a custom function is used instead of the one from pandas. That function also computes the other related statistics and provides a confidence interval. An attempt will be made to guess the most appropriate kind of mean from mean_kind_col, unit_col and control_var_col. The related statistics are:

    • The mean itself, as:

      • 'mean' (arithmetic)

      • 'hmean' (harmonic)

      • 'gmean' (geometric)

    • The Standard Error of the Mean (SEM):

      • 'sem' (arithmetic)

      • 'hse' (harmonic)

      • 'gse' (geometric)

    • The standard deviation:

      • 'std' (arithmetic)

      • 'hsd' (harmonic)

      • 'gsd' (geometric)

  • stat_col (str) – Name of the column used to hold the name of the statistics that are computed.

  • unit_col (str) – Name of the column holding the unit of each value (as a string).

  • ci_cols (tuple(str, str)) – Name of the two columns holding the confidence interval for each computed statistics.

  • control_var_col (str) – Name of the column holding the control variable name in the experiment leading to the given value. See also guess_mean_kind().

  • mean_kind_col (str) –

    Type of mean to be used to summarize this value.

    Note

    Unless geometric mean is used, unit_col and control_var_col should be used to make things more obvious and reduce risks of confusion.

  • non_normalizable_units (list(str)) – List of units that cannot be normalized against the reference group.
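The three kinds of mean named in the note of the stats parameter can give noticeably different results on the same data. The following is a minimal sketch using only the Python standard library, not the LISA implementation, and the sample values are made up for illustration:

```python
import math
import statistics

# Made-up sample values, for illustration only.
values = [2.0, 8.0]

arithmetic = statistics.mean(values)           # corresponds to 'mean'
harmonic = statistics.harmonic_mean(values)    # corresponds to 'hmean'
geometric = statistics.geometric_mean(values)  # corresponds to 'gmean'

# For positive values that are not all equal, the three means are
# always ordered as: harmonic < geometric < arithmetic.
assert harmonic < geometric < arithmetic

print(f"mean={arithmetic:.2f} hmean={harmonic:.2f} gmean={geometric:.2f}")
```

As a general rule of thumb, the harmonic mean tends to suit rates and the geometric mean tends to suit ratios. See guess_mean_kind() for how the kind of mean is actually chosen.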

Examples:

import pandas as pd
from lisa.stats import Stats

# The index is meaningless, all that matters is to uniquely identify
# each row using a set of tag columns, such as 'board', 'kernel',
# 'iteration', ...
df = pd.DataFrame.from_records(
    [
        ('juno', 'kernel1', 'bench1', 'score1', 1, 42, 'frame/s', 's'),
        ('juno', 'kernel1', 'bench1', 'score1', 2, 43, 'frame/s', 's'),
        ('juno', 'kernel1', 'bench1', 'score2', 1, 420, 'frame/s', 's'),
        ('juno', 'kernel1', 'bench1', 'score2', 2, 421, 'frame/s', 's'),
        ('juno', 'kernel1', 'bench2', 'score',  1, 54, 'foobar', ''),
        ('juno', 'kernel2', 'bench1', 'score1', 1, 420, 'frame/s', 's'),
        ('juno', 'kernel2', 'bench1', 'score1', 2, 421, 'frame/s', 's'),
        ('juno', 'kernel2', 'bench1', 'score2', 1, 4200, 'frame/s', 's'),
        ('juno', 'kernel2', 'bench1', 'score2', 2, 4201, 'frame/s', 's'),
        ('juno', 'kernel2', 'bench2', 'score',  1, 540, 'foobar', ''),

        ('hikey','kernel1', 'bench1', 'score1', 1, 42, 'frame/s', 's'),
        ('hikey','kernel1', 'bench1', 'score2', 1, 420, 'frame/s', 's'),
        ('hikey','kernel1', 'bench2', 'score',  1, 54, 'foobar', ''),
        ('hikey','kernel2', 'bench1', 'score1', 1, 420, 'frame/s', 's'),
        ('hikey','kernel2', 'bench1', 'score2', 1, 4200, 'frame/s', 's'),
        ('hikey','kernel2', 'bench2', 'score',  1, 540, 'foobar', ''),
    ],
    columns=['board', 'kernel', 'benchmark', 'metric', 'iteration', 'value', 'unit', 'fixed'],
)


# Get a DataFrame with all the default statistics.
Stats(df).df

# Using a ref_group will also compare the other groups against it
Stats(df, ref_group={'board': 'juno', 'kernel': 'kernel1'}).df
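To give an intuition of what normalizing against a reference group means, the sketch below computes a percentage of change by hand with plain pandas on a small subset of the data. The formula used here is an assumption made for illustration and is not taken from the LISA sources, so the exact values reported by Stats may be computed differently:

```python
import pandas as pd

# Small subset of the example data: two kernels, one metric, two iterations.
subset = pd.DataFrame.from_records(
    [
        ('kernel1', 1, 42),
        ('kernel1', 2, 43),
        ('kernel2', 1, 420),
        ('kernel2', 2, 421),
    ],
    columns=['kernel', 'iteration', 'value'],
)

# Aggregate over the iteration column: one mean per kernel.
means = subset.groupby('kernel')['value'].mean()

# Assumed formula (illustration only): percentage of change of each
# group relative to the reference group 'kernel1'.
ref = means['kernel1']
pct_change = (means - ref) / ref * 100

print(pct_change)
```

With this formula the reference group itself always comes out at 0%, which is one reason to drop it from the output when comparing.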

Properties

df

pandas.DataFrame containing the statistics.

logger inherited

Convenience short-hand for self.get_logger().

Methods

get_df()

Returns a pandas.DataFrame containing the statistics.

plot_histogram()

Returns a matplotlib.figure.Figure with histogram of the values in the input pandas.DataFrame.

plot_stats()

Returns a matplotlib.figure.Figure containing the statistics for the class input pandas.DataFrame.

plot_values()

Returns a holoviews element with the values in the input pandas.DataFrame.

get_logger() inherited

Provides a logging.Logger named after cls.

log_locals() inherited

Debugging aid: log the local variables of the calling function.

Properties

property Stats.df

pandas.DataFrame containing the statistics.

See also

get_df() for more controls.

property Stats.logger

Inherited property, see lisa.utils.Loggable.logger

Convenience short-hand for self.get_logger().

Methods

Stats.get_df(remove_ref=None, compare=None)

Returns a pandas.DataFrame containing the statistics.

Parameters:
  • compare (bool or None) – See Stats compare parameter. If None, it will default to the value provided to Stats.

  • remove_ref (bool or None) – If True, the rows of the reference group described by ref_group for this object will be removed from the returned dataframe. If None, it will default to compare.
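The effect of remove_ref can be pictured as dropping every row whose tags match all the entries of ref_group. The sketch below shows that idea in plain pandas. The helper drop_ref_rows is hypothetical and not part of LISA, and skipping None tag values is an assumption based on the ref_group description:

```python
import pandas as pd


def drop_ref_rows(df, ref_group):
    """Hypothetical helper: drop rows matching every (column, value) pair."""
    # Assumption: keys with a None value do not select any group.
    criteria = {col: val for col, val in ref_group.items() if val is not None}
    if not criteria:
        return df

    # A row belongs to the reference group only if all criteria match.
    mask = pd.Series(True, index=df.index)
    for col, val in criteria.items():
        mask &= df[col] == val
    return df[~mask]


# Made-up statistics table, for illustration only.
stats_df = pd.DataFrame({
    'board': ['juno', 'juno', 'hikey', 'hikey'],
    'kernel': ['kernel1', 'kernel2', 'kernel1', 'kernel2'],
    'value': [42.5, 420.5, 42.0, 420.0],
})

without_ref = drop_ref_rows(stats_df, {'board': 'juno', 'kernel': 'kernel1'})
print(without_ref)
```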

Stats.plot_histogram(cumulative=False, bins=50, nbins=None, density=False, **kwargs)

Returns a matplotlib.figure.Figure with histogram of the values in the input pandas.DataFrame.

Parameters:
  • cumulative (bool) – Cumulative plot (CDF).

  • bins (int or None) – Number of bins for the distribution.

  • filename (str or None) – Path to the image file to write to.
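The meaning of bins and cumulative can be illustrated independently of the plotting code. The sketch below uses numpy directly on made-up values. It does not reflect how plot_histogram is implemented, it only shows what a binned count and its cumulative (CDF) form look like:

```python
import numpy as np

# Made-up values, for illustration only.
values = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]

# Split the value range into 4 equal-width bins and count the values
# falling into each of them.
counts, edges = np.histogram(values, bins=4)

# Cumulative form: fraction of the values at or below each bin.
cdf = np.cumsum(counts) / counts.sum()

print(counts)
print(cdf)
```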

Stats.plot_stats(filename=None, remove_ref=None, backend=None, groups_as_row=False, kind=None, **kwargs)

Returns a matplotlib.figure.Figure containing the statistics for the class input pandas.DataFrame.

Parameters:
  • filename (str or None) – Path to the image file to write to.

  • remove_ref (bool or None) – If True, do not plot the reference group. See get_df().

  • backend (str or None) – Holoviews backend to use: bokeh or matplotlib. If None, the current holoviews backend selected with hv.extension() will be used.

  • groups_as_row (bool) – By default, subgroups are used as rows in the subplot matrix so that the values shown on a given graph can be expected to be in the same order of magnitude. However, when there are many subgroups, this can lead to a very large and somewhat hard-to-navigate plot matrix. In this case, using the groups for the rows might help a great deal.

  • kind (str or None) –

    Type of plot. Can be any of:

    • horizontal_bar

    • vertical_bar

    • None

Variable keyword arguments:

Forwarded to get_df().

Stats.plot_values(**kwargs)

Returns a holoviews element with the values in the input pandas.DataFrame.

Parameters:
  • filename (str or None) – Path to the image file to write to.

classmethod Stats.get_logger(suffix=None)

Inherited method, see lisa.utils.Loggable.get_logger()

Provides a logging.Logger named after cls.

classmethod Stats.log_locals(var_names=None, level='debug')

Inherited method, see lisa.utils.Loggable.log_locals()

Debugging aid: log the local variables of the calling function.