lisa.stats.Stats#
- class lisa.stats.Stats(df, value_col='value', ref_group=None, filter_rows=None, compare=True, agg_cols=None, mean_ci_confidence=None, stats=None, stat_col='stat', unit_col='unit', ci_cols=('ci_minus', 'ci_plus'), control_var_col='fixed', mean_kind_col='mean_kind', non_normalizable_units={'pval'})[source]#
Bases: Loggable
Compute the statistics on an input pandas.DataFrame in “database” format.
- Parameters:
df (pandas.DataFrame) –
Dataframe in database format, i.e. meaningless index, and values in a given column with the other columns used as tags.
Note
Redundant tag columns (i.e. columns that are equal to one another) will be removed from the dataframe.
value_col (str) – Name of the column containing the values.
ref_group (dict(str, object)) –
Reference group used to compare the other groups against. Its format is dict(tag_column_name, tag_value). The comparison will be made on subgroups built out of all the other tag columns, with the reference subgroups being the ones matching that dictionary. If the tag value is None, the key will only be used for grouping in graphs. Comparison will add the following statistics:
A 2-sample Kolmogorov-Smirnov test 'ks2samp_test' column. This test is non-parametric and checks for difference in distributions. The only assumption is that the distribution is continuous, which should suit almost all use cases.
Most statistics will be normalized against the reference group as a difference percentage, except for a few non-normalizable values.
Note
The group referenced must exist, otherwise unexpected behaviours might occur.
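To give an idea of what the 'ks2samp_test' column measures, here is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic in plain Python. It is an illustration only, not the code used by Stats: the function name and the sample values are made up, and it computes only the statistic D, whereas a full implementation such as scipy.stats.ks_2samp also derives a p-value.

```python
from bisect import bisect_right


def ks_2samp_statistic(sample_a, sample_b):
    """Return the two-sample Kolmogorov-Smirnov statistic D.

    D is the largest absolute gap between the empirical cumulative
    distribution functions (ECDFs) of the two samples.
    """
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")

    a = sorted(sample_a)
    b = sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of observations that are <= x.
        return bisect_right(sorted_sample, x) / len(sorted_sample)

    # Both ECDFs are step functions, so the gap can only change at
    # observed values: checking those points is enough.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


# Hypothetical measurements for a reference group and two other groups.
reference = [42, 43, 41, 44]
other = [420, 421, 419, 422]
same = [41, 42, 43, 44]

d_other = ks_2samp_statistic(reference, other)  # fully separated samples
d_same = ks_2samp_statistic(reference, same)    # same values, other order
```

A value of D close to 1 means the two distributions barely overlap, while a value close to 0 means they are nearly indistinguishable.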
filter_rows (dict(object, object) or None) – Filter the given pandas.DataFrame with a dict of {"column": value} that rows have to match to be selected.
compare (bool) – If True, normalize most statistics as a percentage of change compared to ref_group.
agg_cols – Columns to aggregate on. In a sense, the given columns will be treated like a compound iteration number. Defaults to:
iteration column if available, otherwise
All the tag columns that are neither the value nor part of the ref_group.
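The row filtering and the aggregation over agg_cols can be pictured with plain Python records. This is a rough sketch under stated assumptions, not the Stats implementation: the helper functions and the sample rows are hypothetical, and Stats itself works on a pandas.DataFrame.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records in "database" format: one value per row,
# every other column is a tag.
rows = [
    {'board': 'juno', 'kernel': 'kernel1', 'iteration': 1, 'value': 42},
    {'board': 'juno', 'kernel': 'kernel1', 'iteration': 2, 'value': 44},
    {'board': 'juno', 'kernel': 'kernel2', 'iteration': 1, 'value': 420},
    {'board': 'juno', 'kernel': 'kernel2', 'iteration': 2, 'value': 422},
    {'board': 'hikey', 'kernel': 'kernel1', 'iteration': 1, 'value': 40},
]


def filter_rows(rows, criteria):
    """Keep only the rows matching every {"column": value} pair."""
    return [
        row for row in rows
        if all(row[col] == val for col, val in criteria.items())
    ]


def aggregate(rows, value_col, agg_cols, func):
    """Summarize value_col over agg_cols, grouping by the other tags."""
    groups = defaultdict(list)
    for row in rows:
        # The group key is made of every tag that is not aggregated on.
        key = tuple(sorted(
            (col, val) for col, val in row.items()
            if col != value_col and col not in agg_cols
        ))
        groups[key].append(row[value_col])
    return {key: func(values) for key, values in groups.items()}


selected = filter_rows(rows, {'board': 'juno'})
means = aggregate(selected, value_col='value', agg_cols=['iteration'], func=mean)
```

Here the iteration column plays the role of the "compound iteration number": the rows that only differ by it are folded into one summarized value per (board, kernel) group.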
mean_ci_confidence (float) – Confidence level used to establish the mean confidence interval, between 0 and 1.
stats (dict(str, str or collections.abc.Callable)) –
Dictionary of statistical functions to summarize each value group formed by tag columns along the aggregation columns. If None is given as value, the name will be passed to pandas.core.groupby.SeriesGroupBy.agg(). Otherwise, the provided function will be run.
Note
One set of keys is special: 'mean', 'std' and 'sem'. When value None is used, a custom function is used instead of the one from pandas, which will compute other related statistics and provide a confidence interval. An attempt will be made to guess the most appropriate kind of mean based on the mean_kind_col, unit_col and control_var_col:
The mean itself, as:
'mean' (arithmetic)
'hmean' (harmonic)
'gmean' (geometric)
The Standard Error of the Mean (SEM):
'sem' (arithmetic)
'hse' (harmonic)
'gse' (geometric)
The standard deviation:
'std' (arithmetic)
'hsd' (harmonic)
'gsd' (geometric)
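As a reminder of how the three kinds of mean differ, the sketch below computes them with the Python standard library, along with the arithmetic standard deviation, SEM and a confidence interval of the mean. It is not the custom function used by Stats: the sample values and the mean_ci helper are made up, and the interval uses a normal approximation, which is an assumption (small samples usually call for a Student's t interval). How the harmonic and geometric variants of the SEM and standard deviation are derived is not covered here.

```python
import math
import statistics

# Hypothetical sample, e.g. two speeds measured over the same distance.
values = [40.0, 60.0]

mean = statistics.mean(values)             # arithmetic mean
hmean = statistics.harmonic_mean(values)   # harmonic mean
gmean = statistics.geometric_mean(values)  # geometric mean

# Sample standard deviation and standard error of the arithmetic mean.
std = statistics.stdev(values)
sem = std / math.sqrt(len(values))


def mean_ci(values, confidence):
    """Confidence interval of the arithmetic mean (normal approximation)."""
    center = statistics.mean(values)
    error = statistics.stdev(values) / math.sqrt(len(values))
    # Two-sided interval: the remaining probability is split over both tails.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return (center - z * error, center + z * error)


ci_minus, ci_plus = mean_ci([41.0, 42.0, 43.0, 44.0], confidence=0.95)
```

The harmonic mean is the natural choice for rates measured over a fixed amount of work, which is why the unit and the control variable help in guessing the kind of mean to use.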
stat_col (str) – Name of the column used to hold the name of the statistics that are computed.
unit_col (str) – Name of the column holding the unit of each value (as a string).
ci_cols (tuple(str, str)) – Name of the two columns holding the confidence interval for each computed statistics.
control_var_col (str) – Name of the column holding the control variable name in the experiment leading to the given value.
See also
guess_mean_kind()
mean_kind_col (str) –
Type of mean to be used to summarize this value.
Note
Unless geometric mean is used, unit_col and control_var_col should be used to make things more obvious and reduce risks of confusion.
non_normalizable_units (list(str)) – List of units that cannot be normalized against the reference group.
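The normalization against the reference group, and the exception made for non-normalizable units, could look like the sketch below. The exact formula used by Stats is not spelled out in this page, so the usual "difference percentage" definition is assumed here; the function names and the '%' unit of the result are hypothetical.

```python
def percent_change(value, ref_value):
    """Difference of value relative to ref_value, as a percentage."""
    if ref_value == 0:
        raise ZeroDivisionError("reference value is zero, cannot normalize")
    return (value - ref_value) / ref_value * 100


def normalize(stat, ref_stat, unit, non_normalizable_units=frozenset({'pval'})):
    """Return (value, unit), normalized against the reference when allowed."""
    if unit in non_normalizable_units:
        # A p-value is already the result of a comparison: keep it as-is.
        return (stat, unit)
    # Assumption: the normalized value is reported as a percentage.
    return (percent_change(stat, ref_stat), '%')


score = normalize(63.0, 42.0, 'frame/s')  # 50% higher than the reference
pval = normalize(0.03, 0.5, 'pval')       # left untouched
```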
Examples:
import pandas as pd
from lisa.stats import Stats

# The index is meaningless, all that matters is to uniquely identify
# each row using a set of tag columns, such as 'board', 'kernel',
# 'iteration', ...
df = pd.DataFrame.from_records(
    [
        ('juno', 'kernel1', 'bench1', 'score1', 1, 42, 'frame/s', 's'),
        ('juno', 'kernel1', 'bench1', 'score1', 2, 43, 'frame/s', 's'),
        ('juno', 'kernel1', 'bench1', 'score2', 1, 420, 'frame/s', 's'),
        ('juno', 'kernel1', 'bench1', 'score2', 2, 421, 'frame/s', 's'),
        ('juno', 'kernel1', 'bench2', 'score', 1, 54, 'foobar', ''),
        ('juno', 'kernel2', 'bench1', 'score1', 1, 420, 'frame/s', 's'),
        ('juno', 'kernel2', 'bench1', 'score1', 2, 421, 'frame/s', 's'),
        ('juno', 'kernel2', 'bench1', 'score2', 1, 4200, 'frame/s', 's'),
        ('juno', 'kernel2', 'bench1', 'score2', 2, 4201, 'frame/s', 's'),
        ('juno', 'kernel2', 'bench2', 'score', 1, 540, 'foobar', ''),
        ('hikey', 'kernel1', 'bench1', 'score1', 1, 42, 'frame/s', 's'),
        ('hikey', 'kernel1', 'bench1', 'score2', 1, 420, 'frame/s', 's'),
        ('hikey', 'kernel1', 'bench2', 'score', 1, 54, 'foobar', ''),
        ('hikey', 'kernel2', 'bench1', 'score1', 1, 420, 'frame/s', 's'),
        ('hikey', 'kernel2', 'bench1', 'score2', 1, 4200, 'frame/s', 's'),
        ('hikey', 'kernel2', 'bench2', 'score', 1, 540, 'foobar', ''),
    ],
    columns=['board', 'kernel', 'benchmark', 'metric', 'iteration', 'value', 'unit', 'fixed'],
)

# Get a dataframe with all the default statistics.
Stats(df).df

# Using a ref_group will also compare the other groups against it.
Stats(df, ref_group={'board': 'juno', 'kernel': 'kernel1'}).df
Properties
df – pandas.DataFrame containing the statistics.
logger (inherited) – Convenience short-hand for self.get_logger().
Methods
get_df() – Returns a pandas.DataFrame containing the statistics.
plot_histogram() – Returns a matplotlib.figure.Figure with histogram of the values in the input pandas.DataFrame.
plot_stats() – Returns a matplotlib.figure.Figure containing the statistics for the class input pandas.DataFrame.
plot_values() – Returns a holoviews element with the values in the input pandas.DataFrame.
get_logger() (inherited) – Provides a logging.Logger named after cls.
log_locals() (inherited) – Debugging aid: log the local variables of the calling function.
Properties#
- property Stats.df[source]#
pandas.DataFrame containing the statistics.
See also
get_df() for more controls.
- property Stats.logger#
Inherited property, see lisa.utils.Loggable.logger
Convenience short-hand for self.get_logger().
Methods#
- Stats.get_df(remove_ref=None, compare=None)[source]#
Returns a pandas.DataFrame containing the statistics.
- Parameters:
compare (bool or None) – See the compare parameter of Stats. If None, it will default to the value provided to Stats.
remove_ref (bool or None) – If True, the rows of the reference group described by ref_group for this object will be removed from the returned dataframe. If None, it will default to compare.
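The way the two None defaults chain together can be sketched with a small stand-in class. StatsSketch and resolve_flags are hypothetical names that only model the defaulting rules described above, and the sketch assumes that remove_ref falls back to the already resolved value of compare.

```python
class StatsSketch:
    """Hypothetical stand-in that only models the flag defaults."""

    def __init__(self, compare=True):
        self._compare = compare

    def resolve_flags(self, remove_ref=None, compare=None):
        # compare falls back to the value given to the constructor.
        if compare is None:
            compare = self._compare
        # Assumption: remove_ref falls back to the resolved compare value.
        if remove_ref is None:
            remove_ref = compare
        return (remove_ref, compare)


defaults = StatsSketch(compare=True).resolve_flags()
no_compare = StatsSketch(compare=True).resolve_flags(compare=False)
forced_removal = StatsSketch(compare=False).resolve_flags(remove_ref=True)
```

In other words, when statistics are expressed relative to the reference group, the reference rows are dropped by default, since they would only hold trivial values.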
- Stats.plot_histogram(cumulative=False, bins=50, nbins=None, density=False, **kwargs)[source]#
Returns a matplotlib.figure.Figure with a histogram of the values in the input pandas.DataFrame.
- Stats.plot_stats(filename=None, remove_ref=None, backend=None, groups_as_row=False, kind=None, **kwargs)[source]#
Returns a matplotlib.figure.Figure containing the statistics for the class input pandas.DataFrame.
- Parameters:
filename (str or None) – Path to the image file to write to.
remove_ref (bool or None) – If True, do not plot the reference group. See get_df().
backend (str or None) – Holoviews backend to use: bokeh or matplotlib. If None, the current holoviews backend selected with hv.extension() will be used.
groups_as_row (bool) – By default, subgroups are used as rows in the subplot matrix so that the values shown on a given graph can be expected to be in the same order of magnitude. However, when there are many subgroups, this can lead to a very large and somewhat hard to navigate plot matrix. In this case, using the group for the rows might help a great deal.
kind (str or None) –
Type of plot. Can be any of:
horizontal_bar
vertical_bar
None
- Variable keyword arguments:
Forwarded to get_df().
- Stats.plot_values(**kwargs)[source]#
Returns a holoviews element with the values in the input pandas.DataFrame.
- Parameters:
filename (str or None) – Path to the image file to write to.
- classmethod Stats.get_logger(suffix=None)#
Inherited method, see lisa.utils.Loggable.get_logger()
Provides a logging.Logger named after cls.
- classmethod Stats.log_locals(var_names=None, level='debug')#
Inherited method, see lisa.utils.Loggable.log_locals()
Debugging aid: log the local variables of the calling function.