lisa.datautils.df_combine_duplicates

lisa.datautils.df_combine_duplicates(df, func, output_col, cols=None, all_col=True, prune=True, inplace=False)

Combine the duplicated rows using func and remove the duplicates.

Parameters:
  • df (pandas.DataFrame) – The dataframe to act on.

  • func (collections.abc.Callable) – Function to combine a group of duplicates. It will be passed a pandas.DataFrame corresponding to the group. Depending on the value of prune, it must return either a scalar (prune is True) or a pandas.Series with the same index as its input dataframe (prune is False).

  • prune (bool) –

    If True, func is expected to return a single scalar that will be used in place of the whole duplicated group. Only the first row of the group is kept; the others are removed.

    If False, func is expected to return a pandas.Series that will be used as replacement for the group. No rows will be removed.

  • output_col (str) – Column in which the output of func should be stored.

  • cols (list(str) or None) – Columns to use for duplicate detection.

  • all_col (bool) – If True, all columns will be used.

  • inplace (bool) – If True, the passed dataframe is modified.
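
The two contracts for func described above can be illustrated with a pandas-only sketch. It does not call lisa: combine_duplicates_sketch is a hypothetical stand-in written for this example, and it assumes that rows count as duplicates when they share the same values in cols. Whether the real function restricts this to consecutive rows is not stated here, so the sample data only contains duplicates that are also adjacent.

```python
import pandas as pd


def combine_duplicates_sketch(df, func, output_col, cols, prune=True):
    """Hypothetical stand-in for illustration, NOT the lisa implementation."""
    out = df.copy()
    # Assumption: rows are duplicates when they have equal values in `cols`.
    for _, group in df.groupby(cols, sort=False):
        if len(group) < 2:
            continue  # not a duplicated group, leave the row untouched
        result = func(group)
        if prune:
            # func returned a scalar: store it on the first row of the
            # group and drop the remaining rows.
            out.loc[group.index[0], output_col] = result
            out = out.drop(index=group.index[1:])
        else:
            # func returned a Series indexed like the group: every row is
            # kept and receives its own value.
            out.loc[group.index, output_col] = result
    return out


df = pd.DataFrame({
    "state": ["a", "a", "b", "c", "c", "c"],
    "duration": [1, 2, 3, 4, 5, 6],
})

# prune=True: one scalar per group, here the total duration.
pruned = combine_duplicates_sketch(
    df, lambda g: g["duration"].sum(), "duration", ["state"], prune=True
)

# prune=False: one value per row, here a running total within the group.
kept = combine_duplicates_sketch(
    df, lambda g: g["duration"].cumsum(), "duration", ["state"], prune=False
)
```

With prune=True the result has one row per group, carrying the combined value; with prune=False the row count is unchanged. The sketch always works on a copy, which corresponds to inplace=False.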