lisa.datautils.df_combine_duplicates
- lisa.datautils.df_combine_duplicates(df, func, output_col, cols=None, all_col=True, prune=True, inplace=False)
Combine the duplicated rows using func and remove the duplicates.

Parameters:
df (pandas.DataFrame) – The dataframe to act on.
func (collections.abc.Callable) – Function to combine a group of duplicates. It will be passed a pandas.DataFrame corresponding to the group and must return either a pandas.Series with the same index as its input dataframe, or a scalar, depending on the value of prune.
prune (bool) – If True, func will be expected to return a single scalar that will be used instead of a whole duplicated group. Only the first row of the group is kept; the other ones are removed. If False, func is expected to return a pandas.Series that will be used as replacement for the group. No rows will be removed.
output_col (str) – Column in which the output of func should be stored.
cols (list(str) or None) – Columns to use for duplicates detection.
all_cols (bool) – If True, all columns will be used.
inplace (bool) – If True, the passed dataframe is modified.
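
The interaction between func, prune and output_col can be sketched with plain pandas. This is a minimal illustration, not LISA's implementation: the helper name combine_duplicates_sketch, the sample columns cpu and duration, and the output column names total and running are all hypothetical. It also assumes that a group of duplicates is a run of consecutive rows with equal values in cols, and that a row without any duplicate is treated as a group of one; the description above does not specify either point.

```python
import pandas as pd


def combine_duplicates_sketch(df, func, output_col, cols, prune=True):
    """Pandas-only illustration of the documented contract (hypothetical helper)."""
    # Assumption: duplicates are runs of consecutive rows equal on `cols`.
    # A new run starts wherever a row differs from the previous one.
    starts_new_run = (df[cols] != df[cols].shift()).any(axis=1)
    run_id = starts_new_run.cumsum()
    groups = [group for _, group in df.groupby(run_id, sort=False)]

    if prune:
        # func returns one scalar per group; only the first row of each
        # group is kept and receives that scalar in output_col.
        out = pd.concat([group.iloc[[0]] for group in groups])
        out[output_col] = [func(group) for group in groups]
        return out

    # func returns a Series indexed like its input group; no rows are removed.
    out = df.copy()
    out[output_col] = pd.concat([func(group) for group in groups])
    return out


df = pd.DataFrame({
    "cpu": [0, 0, 1, 1, 1, 0],
    "duration": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# prune=True: one row per group, func reduces each group to a scalar.
pruned = combine_duplicates_sketch(
    df, lambda group: group["duration"].sum(), "total", cols=["cpu"], prune=True
)

# prune=False: every row is kept, func returns one value per row.
expanded = combine_duplicates_sketch(
    df, lambda group: group["duration"].cumsum(), "running", cols=["cpu"], prune=False
)
```

With prune=True the result keeps the rows at index 0, 2 and 5, each carrying the sum of its group in total. With prune=False all six rows remain and running holds the per-group cumulative sum.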