API

DataFrame

DataFrame
DataFrame.add
DataFrame.append
DataFrame.apply
DataFrame.assign
DataFrame.astype
DataFrame.categorize
DataFrame.columns
DataFrame.compute
DataFrame.corr
DataFrame.count
DataFrame.cov
DataFrame.cummax
DataFrame.cummin
DataFrame.cumprod
DataFrame.cumsum
DataFrame.describe
DataFrame.div
DataFrame.drop
DataFrame.drop_duplicates
DataFrame.dropna
DataFrame.dtypes
DataFrame.explode
DataFrame.fillna
DataFrame.floordiv
DataFrame.get_partition
DataFrame.groupby
DataFrame.head
DataFrame.iloc
DataFrame.index
DataFrame.isna
DataFrame.isnull
DataFrame.iterrows
DataFrame.itertuples
DataFrame.join
DataFrame.known_divisions
DataFrame.loc
DataFrame.map_partitions
DataFrame.mask
DataFrame.max
DataFrame.mean
DataFrame.merge
DataFrame.min
DataFrame.mod
DataFrame.mul
DataFrame.ndim
DataFrame.nlargest
DataFrame.npartitions
DataFrame.partitions
DataFrame.pop
DataFrame.pow
DataFrame.prod
DataFrame.quantile
DataFrame.query
DataFrame.radd
DataFrame.random_split
DataFrame.rdiv
DataFrame.rename
DataFrame.repartition
DataFrame.replace
DataFrame.reset_index
DataFrame.rfloordiv
DataFrame.rmod
DataFrame.rmul
DataFrame.rpow
DataFrame.rsub
DataFrame.rtruediv
DataFrame.sample
DataFrame.set_index
DataFrame.shape
DataFrame.std
DataFrame.sub
DataFrame.sum
DataFrame.tail
DataFrame.to_bag
DataFrame.to_csv
DataFrame.to_dask_array
DataFrame.to_delayed
DataFrame.to_hdf
DataFrame.to_json
DataFrame.to_parquet
DataFrame.to_records
DataFrame.truediv
DataFrame.values
DataFrame.var
DataFrame.visualize
DataFrame.where
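
Most of these attributes and methods mirror their pandas counterparts and are lazy: calling them builds a task graph, and compute() runs it. A minimal sketch of that workflow (the data here is illustrative):

>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'a': range(4)}), npartitions=2)  # doctest: +SKIP
>>> # operations are lazy; .compute() executes the graph and returns pandas objects
>>> ddf.a.sum().compute()  # doctest: +SKIP
6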

Series

Series
Series.add
Series.align
Series.all
Series.any
Series.append
Series.apply
Series.astype
Series.autocorr
Series.between
Series.bfill
Series.cat
Series.clear_divisions
Series.clip
Series.clip_lower
Series.clip_upper
Series.compute
Series.copy
Series.corr
Series.count
Series.cov
Series.cummax
Series.cummin
Series.cumprod
Series.cumsum
Series.describe
Series.diff
Series.div
Series.drop_duplicates
Series.dropna
Series.dt
Series.dtype
Series.eq
Series.explode
Series.ffill
Series.fillna
Series.first
Series.floordiv
Series.ge
Series.get_partition
Series.groupby
Series.gt
Series.head
Series.idxmax
Series.idxmin
Series.isin
Series.isna
Series.isnull
Series.iteritems
Series.known_divisions
Series.last
Series.le
Series.loc
Series.lt
Series.map
Series.map_overlap
Series.map_partitions
Series.mask
Series.max
Series.mean
Series.memory_usage
Series.min
Series.mod
Series.mul
Series.nbytes
Series.ndim
Series.ne
Series.nlargest
Series.notnull
Series.nsmallest
Series.nunique
Series.nunique_approx
Series.persist
Series.pipe
Series.pow
Series.prod
Series.quantile
Series.radd
Series.random_split
Series.rdiv
Series.reduction
Series.repartition
Series.replace
Series.rename
Series.resample
Series.reset_index
Series.rolling
Series.round
Series.sample
Series.sem
Series.shape
Series.shift
Series.size
Series.std
Series.str
Series.sub
Series.sum
Series.to_bag
Series.to_csv
Series.to_dask_array
Series.to_delayed
Series.to_frame
Series.to_hdf
Series.to_string
Series.to_timestamp
Series.truediv
Series.unique
Series.value_counts
Series.values
Series.var
Series.visualize
Series.where

Groupby Operations

DataFrameGroupBy.aggregate(arg[, …]) Aggregate using one or more operations over the specified axis.
DataFrameGroupBy.apply(func, *args, **kwargs) Parallel version of pandas GroupBy.apply
DataFrameGroupBy.count([split_every, split_out]) Compute count of group, excluding missing values.
DataFrameGroupBy.cumcount([axis]) Number each item in each group from 0 to the length of that group - 1.
DataFrameGroupBy.cumprod([axis]) Cumulative product for each group.
DataFrameGroupBy.cumsum([axis]) Cumulative sum for each group.
DataFrameGroupBy.get_group(key) Construct DataFrame from group with provided name.
DataFrameGroupBy.max([split_every, split_out]) Compute max of group values.
DataFrameGroupBy.mean([split_every, split_out]) Compute mean of groups, excluding missing values.
DataFrameGroupBy.min([split_every, split_out]) Compute min of group values.
DataFrameGroupBy.size([split_every, split_out]) Compute group sizes.
DataFrameGroupBy.std([ddof, split_every, …]) Compute standard deviation of groups, excluding missing values.
DataFrameGroupBy.sum([split_every, …]) Compute sum of group values.
DataFrameGroupBy.var([ddof, split_every, …]) Compute variance of groups, excluding missing values.
DataFrameGroupBy.cov([ddof, split_every, …]) Compute pairwise covariance of columns, excluding NA/null values.
DataFrameGroupBy.corr([ddof, split_every, …]) Compute pairwise correlation of columns, excluding NA/null values.
DataFrameGroupBy.first([split_every, split_out]) Compute first of group values.
DataFrameGroupBy.last([split_every, split_out]) Compute last of group values.
DataFrameGroupBy.idxmin([split_every, …]) Return index of first occurrence of minimum over requested axis.
DataFrameGroupBy.idxmax([split_every, …]) Return index of first occurrence of maximum over requested axis.
SeriesGroupBy.aggregate(arg[, split_every, …]) Aggregate using one or more operations over the specified axis.
SeriesGroupBy.apply(func, *args, **kwargs) Parallel version of pandas GroupBy.apply
SeriesGroupBy.count([split_every, split_out]) Compute count of group, excluding missing values.
SeriesGroupBy.cumcount([axis]) Number each item in each group from 0 to the length of that group - 1.
SeriesGroupBy.cumprod([axis]) Cumulative product for each group.
SeriesGroupBy.cumsum([axis]) Cumulative sum for each group.
SeriesGroupBy.get_group(key) Construct DataFrame from group with provided name.
SeriesGroupBy.max([split_every, split_out]) Compute max of group values.
SeriesGroupBy.mean([split_every, split_out]) Compute mean of groups, excluding missing values.
SeriesGroupBy.min([split_every, split_out]) Compute min of group values.
SeriesGroupBy.nunique([split_every, split_out])
SeriesGroupBy.size([split_every, split_out]) Compute group sizes.
SeriesGroupBy.std([ddof, split_every, split_out]) Compute standard deviation of groups, excluding missing values.
SeriesGroupBy.sum([split_every, split_out, …]) Compute sum of group values.
SeriesGroupBy.var([ddof, split_every, split_out]) Compute variance of groups, excluding missing values.
SeriesGroupBy.first([split_every, split_out]) Compute first of group values.
SeriesGroupBy.last([split_every, split_out]) Compute last of group values.
SeriesGroupBy.idxmin([split_every, …]) Return index of first occurrence of minimum over requested axis.
SeriesGroupBy.idxmax([split_every, …]) Return index of first occurrence of maximum over requested axis.
Aggregation(name, chunk, agg[, finalize]) User defined groupby-aggregation.
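
Many of these reductions accept split_every and split_out keywords that shape the reduction tree. A minimal sketch of their use (the frame and column names are illustrative, not from this page):

>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'A': [1, 1, 2, 2],
...                                    'B': [1.0, 2.0, 3.0, 4.0]}), npartitions=2)  # doctest: +SKIP
>>> # split_every bounds how many intermediate results are combined per step;
>>> # split_out sets how many partitions the final result has
>>> ddf.groupby('A').B.sum(split_every=2, split_out=1).compute()  # doctest: +SKIP
A
1    3.0
2    7.0
Name: B, dtype: float64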

Rolling Operations

rolling.map_overlap
Series.rolling
DataFrame.rolling
Rolling.apply
Rolling.count
Rolling.kurt
Rolling.max
Rolling.mean
Rolling.median
Rolling.min
Rolling.quantile
Rolling.skew
Rolling.std
Rolling.sum
Rolling.var
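
These mirror the pandas rolling API; windows that cross partition boundaries are handled by copying a small overlap between neighboring partitions (see map_overlap above). A small sketch with illustrative data:

>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(6)}), npartitions=2)  # doctest: +SKIP
>>> # a window of 3; the first two entries are NaN, as in pandas
>>> ddf.x.rolling(window=3).mean().compute()  # doctest: +SKIP
0    NaN
1    NaN
2    1.0
3    2.0
4    3.0
5    4.0
Name: x, dtype: float64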

Create DataFrames

read_csv
read_table
read_fwf
read_parquet
read_hdf
read_json
read_orc
read_sql_table
from_array
from_bcolz
from_dask_array
from_delayed
from_pandas
dask.bag.core.Bag.to_dataframe
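
A brief sketch of two common entry points (the file path below is illustrative):

>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> # from an in-memory pandas DataFrame, split into three partitions
>>> ddf = dd.from_pandas(pd.DataFrame({'a': range(9)}), npartitions=3)  # doctest: +SKIP
>>> # from files on disk; a glob yields roughly one partition per file or block
>>> ddf = dd.read_csv('data/2019-*.csv')  # doctest: +SKIP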

Store DataFrames

to_csv
to_parquet
to_hdf
to_records
to_bag
to_json
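
Writers generally produce one output file per partition. A hedged sketch (the paths are illustrative; to_parquet needs pyarrow or fastparquet installed):

>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'a': range(6)}), npartitions=2)  # doctest: +SKIP
>>> # '*' in the CSV path is replaced by each partition's number
>>> ddf.to_csv('out/part-*.csv')  # doctest: +SKIP
>>> ddf.to_parquet('out.parquet')  # doctest: +SKIP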

Convert DataFrames

to_dask_array
to_delayed
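
Both conversions keep the data lazy. A minimal sketch:

>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0]}), npartitions=2)  # doctest: +SKIP
>>> # a dask array; lengths=True computes the chunk sizes up front
>>> arr = ddf.to_dask_array(lengths=True)  # doctest: +SKIP
>>> # a list of Delayed objects, one per partition
>>> parts = ddf.to_delayed()  # doctest: +SKIP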

Reshape DataFrames

get_dummies
pivot_table
melt
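
get_dummies requires categorical columns whose categories are known, so an object column typically needs categorize first. A sketch with illustrative column names:

>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'color': ['red', 'blue', 'red'],
...                                    'value': [1, 2, 3]}), npartitions=2)  # doctest: +SKIP
>>> # categorize() scans the data so the categories become known
>>> dd.get_dummies(ddf.categorize(columns=['color'])).compute()  # doctest: +SKIP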

DataFrameGroupBy

class dask.dataframe.groupby.DataFrameGroupBy(df, by=None, slice=None, group_keys=True)
agg(arg, split_every=None, split_out=1)

Aggregate using one or more operations over the specified axis.

This docstring was copied from pandas.core.groupby.generic.DataFrameGroupBy.agg.

Some inconsistencies with the Dask version may exist.

Parameters:

func : function, str, list or dict

Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.

Accepted combinations are:

  • function
  • string function name
  • list of functions and/or function names, e.g. [np.sum, 'mean']
  • dict of axis labels -> functions, function names or list of such.

*args

Positional arguments to pass to func.

**kwargs

Keyword arguments to pass to func.

Returns:

scalar, Series or DataFrame

The return can be:

  • scalar : when Series.agg is called with single function
  • Series : when DataFrame.agg is called with a single function
  • DataFrame : when DataFrame.agg is called with several functions

Return scalar, Series or DataFrame.

See also

pandas.DataFrame.groupby.apply, pandas.DataFrame.groupby.transform, pandas.DataFrame.aggregate

Notes

agg is an alias for aggregate. Use the alias.

A passed user-defined-function will be passed a Series for evaluation.

Examples

>>> df = pd.DataFrame({'A': [1, 1, 2, 2],  # doctest: +SKIP
...                    'B': [1, 2, 3, 4],
...                    'C': np.random.randn(4)})
>>> df  # doctest: +SKIP
   A  B         C
0  1  1  0.362838
1  1  2  0.227877
2  2  3  1.267767
3  2  4 -0.562860

The aggregation is for each column.

>>> df.groupby('A').agg('min')  # doctest: +SKIP
   B         C
A
1  1  0.227877
2  3 -0.562860

Multiple aggregations

>>> df.groupby('A').agg(['min', 'max'])  # doctest: +SKIP
    B             C
  min max       min       max
A
1   1   2  0.227877  0.362838
2   3   4 -0.562860  1.267767

Select a column for aggregation

>>> df.groupby('A').B.agg(['min', 'max'])  # doctest: +SKIP
   min  max
A
1    1    2
2    3    4

Different aggregations per column

>>> df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})  # doctest: +SKIP
    B             C
  min max       sum
A
1   1   2  0.590716
2   3   4  0.704907

To control the output names with different aggregations per column, pandas supports “named aggregation”

>>> df.groupby("A").agg(  # doctest: +SKIP
...     b_min=pd.NamedAgg(column="B", aggfunc="min"),
...     c_sum=pd.NamedAgg(column="C", aggfunc="sum"))
   b_min     c_sum
A
1      1 -1.956929
2      3 -0.322183
  • The keywords are the output column names
  • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

See Named aggregation for more.

aggregate(arg, split_every=None, split_out=1)

Aggregate using one or more operations over the specified axis.

This docstring was copied from pandas.core.groupby.generic.DataFrameGroupBy.aggregate.

Some inconsistencies with the Dask version may exist.

Parameters:

func : function, str, list or dict

Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.

Accepted combinations are:

  • function
  • string function name
  • list of functions and/or function names, e.g. [np.sum, 'mean']
  • dict of axis labels -> functions, function names or list of such.

*args

Positional arguments to pass to func.

**kwargs

Keyword arguments to pass to func.

Returns:

scalar, Series or DataFrame

The return can be:

  • scalar : when Series.agg is called with single function
  • Series : when DataFrame.agg is called with a single function
  • DataFrame : when DataFrame.agg is called with several functions

Return scalar, Series or DataFrame.

See also

pandas.DataFrame.groupby.apply, pandas.DataFrame.groupby.transform, pandas.DataFrame.aggregate

Notes

agg is an alias for aggregate. Use the alias.

A passed user-defined-function will be passed a Series for evaluation.

Examples

>>> df = pd.DataFrame({'A': [1, 1, 2, 2],  # doctest: +SKIP
...                    'B': [1, 2, 3, 4],
...                    'C': np.random.randn(4)})
>>> df  # doctest: +SKIP
   A  B         C
0  1  1  0.362838
1  1  2  0.227877
2  2  3  1.267767
3  2  4 -0.562860

The aggregation is for each column.

>>> df.groupby('A').agg('min')  # doctest: +SKIP
   B         C
A
1  1  0.227877
2  3 -0.562860

Multiple aggregations

>>> df.groupby('A').agg(['min', 'max'])  # doctest: +SKIP
    B             C
  min max       min       max
A
1   1   2  0.227877  0.362838
2   3   4 -0.562860  1.267767

Select a column for aggregation

>>> df.groupby('A').B.agg(['min', 'max'])  # doctest: +SKIP
   min  max
A
1    1    2
2    3    4

Different aggregations per column

>>> df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})  # doctest: +SKIP
    B             C
  min max       sum
A
1   1   2  0.590716
2   3   4  0.704907

To control the output names with different aggregations per column, pandas supports “named aggregation”

>>> df.groupby("A").agg(  # doctest: +SKIP
...     b_min=pd.NamedAgg(column="B", aggfunc="min"),
...     c_sum=pd.NamedAgg(column="C", aggfunc="sum"))
   b_min     c_sum
A
1      1 -1.956929
2      3 -0.322183
  • The keywords are the output column names
  • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

See Named aggregation for more.

apply(func, *args, **kwargs)

Parallel version of pandas GroupBy.apply

This mimics the pandas version except for the following:

  1. If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
  2. Dask’s GroupBy.apply is not appropriate for aggregations. For custom aggregations, use dask.dataframe.groupby.Aggregation.

Warning

Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.

Parameters:

func: function

Function to apply

args, kwargs : Scalar, Delayed or object

Arguments and keywords to pass to the function.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns:

applied : Series or DataFrame depending on columns keyword
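
As a hedged sketch of supplying meta explicitly (the grouping and column names here are illustrative):

>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'g': [0, 0, 1, 1],
...                                    'x': [1.0, 2.0, 3.0, 4.0]}), npartitions=2)  # doctest: +SKIP
>>> # func receives each group as a pandas DataFrame; meta as a {name: dtype}
>>> # dict describes the output so Dask does not have to infer it
>>> result = ddf.groupby('g').apply(
...     lambda df: df.assign(x=df.x - df.x.mean()),
...     meta={'g': 'int64', 'x': 'float64'})  # doctest: +SKIP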

corr(ddof=1, split_every=None, split_out=1)

Compute pairwise correlation of columns, excluding NA/null values.

This docstring was copied from pandas.core.frame.DataFrame.corr.

Some inconsistencies with the Dask version may exist.

Groupby correlation: corr(X, Y) = cov(X, Y) / (std_x * std_y)

Parameters:

method : {‘pearson’, ‘kendall’, ‘spearman’} or callable (Not supported in Dask)

  • pearson : standard correlation coefficient
  • kendall : Kendall Tau correlation coefficient
  • spearman : Spearman rank correlation
  • callable: callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

    New in version 0.24.0.

min_periods : int, optional (Not supported in Dask)

Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.

Returns:

DataFrame

Correlation matrix.

See also

DataFrame.corrwith, Series.corr

Examples

>>> def histogram_intersection(a, b):  # doctest: +SKIP
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],  # doctest: +SKIP
...                   columns=['dogs', 'cats'])
>>> df.corr(method=histogram_intersection)  # doctest: +SKIP
      dogs  cats
dogs   1.0   0.3
cats   0.3   1.0

count(split_every=None, split_out=1)

Compute count of group, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.count.

Some inconsistencies with the Dask version may exist.

Returns:

Series or DataFrame

Count of values within each group.

See also

Series.groupby, DataFrame.groupby

cov(ddof=1, split_every=None, split_out=1, std=False)

Compute pairwise covariance of columns, excluding NA/null values.

This docstring was copied from pandas.core.frame.DataFrame.cov.

Some inconsistencies with the Dask version may exist.

Groupby covariance is accomplished by

  1. Computing intermediate values for sum, count, and the pairwise products of all columns: a, b, c -> a*a, a*b, a*c, b*b, b*c, c*c.
  2. The values are then aggregated and the final covariance value is calculated: cov(X, Y) = mean(X*Y) - mean(X) * mean(Y).

When std is True, the correlation is calculated instead of the covariance.

Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.

Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.

This method is generally used for the analysis of time series data to understand the relationship between different measures across time.

Parameters:

min_periods : int, optional (Not supported in Dask)

Minimum number of observations required per pair of columns to have a valid result.

Returns:

DataFrame

The covariance matrix of the series of the DataFrame.

See also

Series.cov
Compute covariance with another Series.
core.window.EWM.cov
Exponential weighted sample covariance.
core.window.Expanding.cov
Expanding sample covariance.
core.window.Rolling.cov
Rolling sample covariance.

Notes

Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-1.

For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.

However, for many applications this estimate may not be acceptable because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.

Examples

>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],  # doctest: +SKIP
...                   columns=['dogs', 'cats'])
>>> df.cov()  # doctest: +SKIP
          dogs      cats
dogs  0.666667 -1.000000
cats -1.000000  1.666667
>>> np.random.seed(42)  # doctest: +SKIP
>>> df = pd.DataFrame(np.random.randn(1000, 5),  # doctest: +SKIP
...                   columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()  # doctest: +SKIP
          a         b         c         d         e
a  0.998438 -0.020161  0.059277 -0.008943  0.014144
b -0.020161  1.059352 -0.008543 -0.024738  0.009826
c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
e  0.014144  0.009826 -0.000271 -0.013692  0.977795

Minimum number of periods

This method also supports an optional min_periods keyword that specifies the required minimum number of non-NA observations for each column pair in order to have a valid result:

>>> np.random.seed(42)  # doctest: +SKIP
>>> df = pd.DataFrame(np.random.randn(20, 3),  # doctest: +SKIP
...                   columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan  # doctest: +SKIP
>>> df.loc[df.index[5:10], 'b'] = np.nan  # doctest: +SKIP
>>> df.cov(min_periods=12)  # doctest: +SKIP
          a         b         c
a  0.316741       NaN -0.150812
b       NaN  1.248003  0.191417
c -0.150812  0.191417  0.895202

cumcount(axis=None)

Number each item in each group from 0 to the length of that group - 1.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumcount.

Some inconsistencies with the Dask version may exist.

Essentially this is equivalent to

>>> self.apply(lambda x: pd.Series(np.arange(len(x)), x.index))  # doctest: +SKIP

Parameters:

ascending : bool, default True (Not supported in Dask)

If False, number in reverse, from length of group - 1 to 0.

Returns:

Series

Sequence number of each element within each group.

See also

ngroup
Number the groups themselves.

Examples

>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],  # doctest: +SKIP
...                   columns=['A'])
>>> df  # doctest: +SKIP
   A
0  a
1  a
2  a
3  b
4  b
5  a
>>> df.groupby('A').cumcount()  # doctest: +SKIP
0    0
1    1
2    2
3    0
4    1
5    3
dtype: int64
>>> df.groupby('A').cumcount(ascending=False)  # doctest: +SKIP
0    3
1    2
2    1
3    1
4    0
5    0
dtype: int64

cumprod(axis=0)

Cumulative product for each group.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumprod.

Some inconsistencies with the Dask version may exist.

Returns:

Series or DataFrame

See also

Series.groupby, DataFrame.groupby

cumsum(axis=0)

Cumulative sum for each group.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumsum.

Some inconsistencies with the Dask version may exist.

Returns:

Series or DataFrame

See also

Series.groupby, DataFrame.groupby

first(split_every=None, split_out=1)

Compute first of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.first.

Some inconsistencies with the Dask version may exist.

Returns:

Series or DataFrame

Computed first of values within each group.

get_group(key)

Construct DataFrame from group with provided name.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.get_group.

Some inconsistencies with the Dask version may exist.

Parameters:

name : object (Not supported in Dask)

the name of the group to get as a DataFrame

obj : DataFrame, default None (Not supported in Dask)

the DataFrame to take the group from. If it is None, the object the groupby was called on will be used

Returns:

group : same type as obj

idxmax(split_every=None, split_out=1, axis=None, skipna=True)

Return index of first occurrence of maximum over requested axis. NA/null values are excluded.

This docstring was copied from pandas.core.frame.DataFrame.idxmax.

Some inconsistencies with the Dask version may exist.

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns:

Series

Indexes of maxima along the specified axis.

Raises:

ValueError

  • If the row/column is empty

See also

Series.idxmax

Notes

This method is the DataFrame version of ndarray.argmax.

idxmin(split_every=None, split_out=1, axis=None, skipna=True)

Return index of first occurrence of minimum over requested axis. NA/null values are excluded.

This docstring was copied from pandas.core.frame.DataFrame.idxmin.

Some inconsistencies with the Dask version may exist.

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns:

Series

Indexes of minima along the specified axis.

Raises:

ValueError

  • If the row/column is empty

See also

Series.idxmin

Notes

This method is the DataFrame version of ndarray.argmin.

last(split_every=None, split_out=1)

Compute last of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.last.

Some inconsistencies with the Dask version may exist.

Returns:

Series or DataFrame

Computed last of values within each group.

max(split_every=None, split_out=1)

Compute max of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.max.

Some inconsistencies with the Dask version may exist.

Returns:

Series or DataFrame

Computed max of values within each group.

mean(split_every=None, split_out=1)

Compute mean of groups, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.mean.

Some inconsistencies with the Dask version may exist.

Returns:

pandas.Series or pandas.DataFrame

See also

Series.groupby, DataFrame.groupby

Examples

>>> df = pd.DataFrame({'A': [1, 1, 2, 1, 2],  # doctest: +SKIP
...                    'B': [np.nan, 2, 3, 4, 5],
...                    'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])

Groupby one column and return the mean of the remaining columns in each group.

>>> df.groupby('A').mean()  # doctest: +SKIP
     B         C
A
1  3.0  1.333333
2  4.0  1.500000

Groupby two columns and return the mean of the remaining column.

>>> df.groupby(['A', 'B']).mean()  # doctest: +SKIP
       C
A B
1 2.0  2
  4.0  1
2 3.0  1
  5.0  2

Groupby one column and return the mean of only particular column in the group.

>>> df.groupby('A')['B'].mean()  # doctest: +SKIP
A
1    3.0
2    4.0
Name: B, dtype: float64

min(split_every=None, split_out=1)

Compute min of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.min.

Some inconsistencies with the Dask version may exist.

Returns:

Series or DataFrame

Computed min of values within each group.

prod(split_every=None, split_out=1, min_count=None)

Compute prod of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.prod.

Some inconsistencies with the Dask version may exist.

Returns:

Series or DataFrame

Computed prod of values within each group.

size(split_every=None, split_out=1)

Compute group sizes.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.size.

Some inconsistencies with the Dask version may exist.

Returns:

Series

Number of rows in each group.

See also

Series.groupby, DataFrame.groupby

std(ddof=1, split_every=None, split_out=1)

Compute standard deviation of groups, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.std.

Some inconsistencies with the Dask version may exist.

For multiple groupings, the result index will be a MultiIndex.

Parameters:

ddof : integer, default 1

degrees of freedom

Returns:

Series or DataFrame

Standard deviation of values within each group.

See also

Series.groupby, DataFrame.groupby

sum(split_every=None, split_out=1, min_count=None)

Compute sum of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.sum.

Some inconsistencies with the Dask version may exist.

Returns:

Series or DataFrame

Computed sum of values within each group.

transform(func, *args, **kwargs)

Parallel version of pandas GroupBy.transform

This mimics the pandas version except for the following:

  1. If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
  2. Dask’s GroupBy.transform is not appropriate for aggregations. For custom aggregations, use dask.dataframe.groupby.Aggregation.

Warning

Pandas’ groupby-transform can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-transform will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.

Parameters:

func: function

Function to apply

args, kwargs : Scalar, Delayed or object

Arguments and keywords to pass to the function.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns:

applied : Series or DataFrame depending on columns keyword

var(ddof=1, split_every=None, split_out=1)

Compute variance of groups, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.var.

Some inconsistencies with the Dask version may exist.

For multiple groupings, the result index will be a MultiIndex.

Parameters:

ddof : integer, default 1

degrees of freedom

Returns:

Series or DataFrame

Variance of values within each group.

See also

Series.groupby, DataFrame.groupby

SeriesGroupBy

class dask.dataframe.groupby.SeriesGroupBy(df, by=None, slice=None, **kwargs)
agg(arg, split_every=None, split_out=1)

Aggregate using one or more operations over the specified axis.

This docstring was copied from pandas.core.groupby.generic.SeriesGroupBy.agg.

Some inconsistencies with the Dask version may exist.

Parameters:

func : function, str, list or dict

Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply.

Accepted combinations are:

  • function
  • string function name
  • list of functions and/or function names, e.g. [np.sum, 'mean']
  • dict of axis labels -> functions, function names or list of such.

*args

Positional arguments to pass to func.

**kwargs

Keyword arguments to pass to func.

Returns:

scalar, Series or DataFrame

The return can be:

  • scalar : when Series.agg is called with single function
  • Series : when DataFrame.agg is called with a single function
  • DataFrame : when DataFrame.agg is called with several functions

Return scalar, Series or DataFrame.

See also

pandas.Series.groupby.apply, pandas.Series.groupby.transform, pandas.Series.aggregate

Notes

agg is an alias for aggregate. Use the alias.

A passed user-defined-function will be passed a Series for evaluation.

Examples

>>> s = pd.Series([1, 2, 3, 4])  # doctest: +SKIP
>>> s  # doctest: +SKIP
0    1
1    2
2    3
3    4
dtype: int64
>>> s.groupby([1, 1, 2, 2]).min()  # doctest: +SKIP
1    1
2    3
dtype: int64
>>> s.groupby([1, 1, 2, 2]).agg('min')  # doctest: +SKIP
1    1
2    3
dtype: int64
>>> s.groupby([1, 1, 2, 2]).agg(['min', 'max'])  # doctest: +SKIP
   min  max
1    1    2
2    3    4

The output column names can be controlled by passing the desired column names and aggregations as keyword arguments.

>>> s.groupby([1, 1, 2, 2]).agg(  # doctest: +SKIP
...     minimum='min',
...     maximum='max',
... )
   minimum  maximum
1        1        2
2        3        4

aggregate(arg, split_every=None, split_out=1)

Aggregate using one or more operations over the specified axis.

This docstring was copied from pandas.core.groupby.generic.SeriesGroupBy.aggregate.

Some inconsistencies with the Dask version may exist.

Parameters:

func : function, str, list or dict

Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply.

Accepted combinations are:

  • function
  • string function name
  • list of functions and/or function names, e.g. [np.sum, 'mean']
  • dict of axis labels -> functions, function names or list of such.

*args

Positional arguments to pass to func.

**kwargs

Keyword arguments to pass to func.

Returns:

scalar, Series or DataFrame

The return can be:

  • scalar : when Series.agg is called with single function
  • Series : when DataFrame.agg is called with a single function
  • DataFrame : when DataFrame.agg is called with several functions

Return scalar, Series or DataFrame.

See also

pandas.Series.groupby.apply, pandas.Series.groupby.transform, pandas.Series.aggregate

Notes

agg is an alias for aggregate. Use the alias.

A passed user-defined-function will be passed a Series for evaluation.

Examples

>>> s = pd.Series([1, 2, 3, 4])  # doctest: +SKIP
>>> s  # doctest: +SKIP
0    1
1    2
2    3
3    4
dtype: int64
>>> s.groupby([1, 1, 2, 2]).min()  # doctest: +SKIP
1    1
2    3
dtype: int64
>>> s.groupby([1, 1, 2, 2]).agg('min')  # doctest: +SKIP
1    1
2    3
dtype: int64
>>> s.groupby([1, 1, 2, 2]).agg(['min', 'max'])  # doctest: +SKIP
   min  max
1    1    2
2    3    4

The output column names can be controlled by passing the desired column names and aggregations as keyword arguments.

>>> s.groupby([1, 1, 2, 2]).agg(  # doctest: +SKIP
...     minimum='min',
...     maximum='max',
... )
   minimum  maximum
1        1        2
2        3        4

apply(func, *args, **kwargs)

Parallel version of pandas GroupBy.apply

This mimics the pandas version except for the following:

  1. If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
  2. Dask’s GroupBy.apply is not appropriate for aggregations. For custom aggregations, use dask.dataframe.groupby.Aggregation.

Warning

Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.

Parameters:

func: function

Function to apply

args, kwargs : Scalar, Delayed or object

Arguments and keywords to pass to the function.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns:

applied : Series or DataFrame depending on columns keyword

corr(ddof=1, split_every=None, split_out=1)

Compute pairwise correlation of columns, excluding NA/null values.

This docstring was copied from pandas.core.frame.DataFrame.corr.

Some inconsistencies with the Dask version may exist.

Groupby correlation: corr(X, Y) = cov(X, Y) / (std_x * std_y)

Parameters:

method : {‘pearson’, ‘kendall’, ‘spearman’} or callable (Not supported in Dask)

  • pearson : standard correlation coefficient
  • kendall : Kendall Tau correlation coefficient
  • spearman : Spearman rank correlation
  • callable: callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

    New in version 0.24.0.

min_periods : int, optional (Not supported in Dask)

Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.

Returns:

DataFrame

Correlation matrix.

See also

DataFrame.corrwith, Series.corr

Examples

>>> def histogram_intersection(a, b):  # doctest: +SKIP
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],  # doctest: +SKIP
...                   columns=['dogs', 'cats'])
>>> df.corr(method=histogram_intersection)  # doctest: +SKIP
      dogs  cats
dogs   1.0   0.3
cats   0.3   1.0

count(split_every=None, split_out=1)

Compute count of group, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.count.

Some inconsistencies with the Dask version may exist.

Returns:

Series or DataFrame

Count of values within each group.

See also

Series.groupby, DataFrame.groupby

cov(ddof=1, split_every=None, split_out=1, std=False)

Compute pairwise covariance of columns, excluding NA/null values.

This docstring was copied from pandas.core.frame.DataFrame.cov.

Some inconsistencies with the Dask version may exist.

Groupby covariance is accomplished by

  1. Computing intermediate values for sum, count, and the pairwise products of all columns: a, b, c -> a*a, a*b, a*c, b*b, b*c, c*c.
  2. The values are then aggregated and the final covariance value is calculated: cov(X, Y) = mean(X*Y) - mean(X) * mean(Y).

When std is True, the correlation is calculated instead of the covariance.

Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.

Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.

This method is generally used for the analysis of time series data to understand the relationship between different measures across time.

Parameters:

min_periods : int, optional (Not supported in Dask)

Minimum number of observations required per pair of columns to have a valid result.

Returns:

DataFrame

The covariance matrix of the series of the DataFrame.

See also

Series.cov
Compute covariance with another Series.
core.window.EWM.cov
Exponential weighted sample covariance.
core.window.Expanding.cov
Expanding sample covariance.
core.window.Rolling.cov
Rolling sample covariance.

Notes

Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-1.

For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.

However, for many applications this estimate may not be acceptable because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.

Examples

>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],  # doctest: +SKIP
...                   columns=['dogs', 'cats'])
>>> df.cov()  # doctest: +SKIP
          dogs      cats
dogs  0.666667 -1.000000
cats -1.000000  1.666667
>>> np.random.seed(42)  # doctest: +SKIP
>>> df = pd.DataFrame(np.random.randn(1000, 5),  # doctest: +SKIP
...                   columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()  # doctest: +SKIP
          a         b         c         d         e
a  0.998438 -0.020161  0.059277 -0.008943  0.014144
b -0.020161  1.059352 -0.008543 -0.024738  0.009826
c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
e  0.014144  0.009826 -0.000271 -0.013692  0.977795

Minimum number of periods

This method also supports an optional min_periods keyword that specifies the required minimum number of non-NA observations for each column pair in order to have a valid result:

>>> np.random.seed(42)  # doctest: +SKIP
>>> df = pd.DataFrame(np.random.randn(20, 3),  # doctest: +SKIP
...                   columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan  # doctest: +SKIP
>>> df.loc[df.index[5:10], 'b'] = np.nan  # doctest: +SKIP
>>> df.cov(min_periods=12)  # doctest: +SKIP
          a         b         c
a  0.316741       NaN -0.150812
b       NaN  1.248003  0.191417
c -0.150812  0.191417  0.895202

cumcount(axis=None)

Number each item in each group from 0 to the length of that group - 1.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumcount.

Some inconsistencies with the Dask version may exist.

Essentially this is equivalent to

>>> self.apply(lambda x: pd.Series(np.arange(len(x)), x.index))  # doctest: +SKIP

Parameters:

ascending : bool, default True (Not supported in Dask)

If False, number in reverse, from length of group - 1 to 0.

Returns:

Series

Sequence number of each element within each group.

See also

ngroup
Number the groups themselves.

Examples

>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],  # doctest: +SKIP
...                   columns=['A'])
>>> df  # doctest: +SKIP
   A
0  a
1  a
2  a
3  b
4  b
5  a
>>> df.groupby('A').cumcount()  # doctest: +SKIP
0    0
1    1
2    2
3    0
4    1
5    3
dtype: int64
>>> df.groupby('A').cumcount(ascending=False)  # doctest: +SKIP
0    3
1    2
2    1
3    1
4    0
5    0
dtype: int64

cumprod(axis=0)

Cumulative product for each group.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumprod.

Some inconsistencies with the Dask version may exist.

Returns:

Series or DataFrame

See also

Series.groupby, DataFrame.groupby

cumsum(axis=0)

Cumulative sum for each group.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumsum.

Some inconsistencies with the Dask version may exist.

Returns:

Series or DataFrame

See also

Series.groupby, DataFrame.groupby

first(split_every=None, split_out=1)

Compute first of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.first.

Some inconsistencies with the Dask version may exist.

Returns:

Series or DataFrame

Computed first of values within each group.

get_group(key)

Construct DataFrame from group with provided name.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.get_group.

Some inconsistencies with the Dask version may exist.

Parameters:

name : object (Not supported in Dask)

the name of the group to get as a DataFrame

obj : DataFrame, default None (Not supported in Dask)

the DataFrame to take the group from. If it is None, the object the groupby was called on will be used

Returns:

group : same type as obj

idxmax(split_every=None, split_out=1, axis=None, skipna=True)

Return index of first occurrence of maximum over requested axis. NA/null values are excluded.

This docstring was copied from pandas.core.frame.DataFrame.idxmax.

Some inconsistencies with the Dask version may exist.

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns:

Series

Indexes of maxima along the specified axis.

Raises:

ValueError

  • If the row/column is empty

See also

Series.idxmax

Notes

This method is the DataFrame version of ndarray.argmax.

idxmin(split_every=None, split_out=1, axis=None, skipna=True)

Return index of first occurrence of minimum over requested axis. NA/null values are excluded.

This docstring was copied from pandas.core.frame.DataFrame.idxmin.

Some inconsistencies with the Dask version may exist.

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns:

Series

Indexes of minima along the specified axis.

Raises:

ValueError

  • If the row/column is empty

See also

Series.idxmin

Notes

This method is the DataFrame version of ndarray.argmin.

last(split_every=None, split_out=1)

Compute last of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.last.

Some inconsistencies with the Dask version may exist.

Returns:

Series or DataFrame

Computed last of values within each group.

max(split_every=None, split_out=1)

Compute max of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.max.

Some inconsistencies with the Dask version may exist.

Returns:

Series or DataFrame

Computed max of values within each group.

mean(split_every=None, split_out=1)

Compute mean of groups, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.mean.

Some inconsistencies with the Dask version may exist.

Returns:

pandas.Series or pandas.DataFrame

See also

Series.groupby, DataFrame.groupby

Examples

>>> df = pd.DataFrame({'A': [1, 1, 2, 1, 2],  # doctest: +SKIP
...                    'B': [np.nan, 2, 3, 4, 5],
...                    'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])

Groupby one column and return the mean of the remaining columns in each group.

>>> df.groupby('A').mean()  # doctest: +SKIP
     B         C
A
1  3.0  1.333333
2  4.0  1.500000

Groupby two columns and return the mean of the remaining column.

>>> df.groupby(['A', 'B']).mean()  # doctest: +SKIP
       C
A B
1 2.0  2
  4.0  1
2 3.0  1
  5.0  2

Groupby one column and return the mean of only particular column in the group.

>>> df.groupby('A')['B'].mean()  # doctest: +SKIP
A
1    3.0
2    4.0
Name: B, dtype: float64

min(split_every=None, split_out=1)

Compute min of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.min.

Some inconsistencies with the Dask version may exist.

Returns:

Series or DataFrame

Computed min of values within each group.

prod(split_every=None, split_out=1, min_count=None)

Compute prod of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.prod.

Some inconsistencies with the Dask version may exist.

Returns:

Series or DataFrame

Computed prod of values within each group.

size(split_every=None, split_out=1)

Compute group sizes.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.size.

Some inconsistencies with the Dask version may exist.

Returns:

Series

Number of rows in each group.

See also

Series.groupby, DataFrame.groupby

std(ddof=1, split_every=None, split_out=1)

Compute standard deviation of groups, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.std.

Some inconsistencies with the Dask version may exist.

For multiple groupings, the result index will be a MultiIndex.

Parameters:

ddof : integer, default 1

degrees of freedom

Returns:

Series or DataFrame

Standard deviation of values within each group.

See also

Series.groupby, DataFrame.groupby

sum(split_every=None, split_out=1, min_count=None)

Compute sum of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.sum.

Some inconsistencies with the Dask version may exist.

Returns:

Series or DataFrame

Computed sum of values within each group.

transform(func, *args, **kwargs)

Parallel version of pandas GroupBy.transform

This mimics the pandas version except for the following:

  1. If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
  2. Dask’s GroupBy.transform is not appropriate for aggregations. For custom aggregations, use dask.dataframe.groupby.Aggregation.

Warning

Pandas’ groupby-transform can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-transform will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.

Parameters:

func: function

Function to apply

args, kwargs : Scalar, Delayed or object

Arguments and keywords to pass to the function.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns:

applied : Series or DataFrame depending on columns keyword

var(ddof=1, split_every=None, split_out=1)

Compute variance of groups, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.var.

Some inconsistencies with the Dask version may exist.

For multiple groupings, the result index will be a MultiIndex.

Parameters:

ddof : integer, default 1

degrees of freedom

Returns:

Series or DataFrame

Variance of values within each group.

See also

Series.groupby, DataFrame.groupby

Custom Aggregation

class dask.dataframe.groupby.Aggregation(name, chunk, agg, finalize=None)

User defined groupby-aggregation.

This class allows users to define their own custom aggregations in terms of operations on Pandas dataframes in a map-reduce style. You need to specify what operation to perform on each chunk of data, how to combine the results of those chunks, and how to finalize the result.

See Aggregate for more.

Parameters:

name : str

the name of the aggregation. It should be unique, since intermediate results will be identified by this name.

chunk : callable

a function that will be called with the grouped column of each partition. It can either return a single series or a tuple of series. The index has to be equal to the groups.

agg : callable

a function that will be called to aggregate the results of each chunk. Again the argument(s) will be grouped series. If chunk returned a tuple, agg will be called with all of them as individual positional arguments.

finalize : callable

an optional finalizer that will be called with the results from the aggregation.

Examples

We could implement sum as follows:

>>> custom_sum = dd.Aggregation(
...     name='custom_sum',
...     chunk=lambda s: s.sum(),
...     agg=lambda s0: s0.sum()
... )  # doctest: +SKIP
>>> df.groupby('g').agg(custom_sum)  # doctest: +SKIP

We can implement mean as follows:

>>> custom_mean = dd.Aggregation(
...     name='custom_mean',
...     chunk=lambda s: (s.count(), s.sum()),
...     agg=lambda count, sum: (count.sum(), sum.sum()),
...     finalize=lambda count, sum: sum / count,
... )  # doctest: +SKIP
>>> df.groupby('g').agg(custom_mean)  # doctest: +SKIP

Though of course, both of these are built-in and so you don’t need to implement them yourself.
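
For reference, a fully self-contained version of the custom_mean example above (the data and variable names are ours, for illustration):

>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> custom_mean = dd.Aggregation(
...     name='custom_mean',
...     chunk=lambda s: (s.count(), s.sum()),                 # per-partition pieces
...     agg=lambda count, total: (count.sum(), total.sum()),  # combine the pieces
...     finalize=lambda count, total: total / count,          # final value per group
... )  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'g': [0, 0, 1],
...                                    'x': [1.0, 2.0, 4.0]}), npartitions=2)  # doctest: +SKIP
>>> ddf.groupby('g').agg(custom_mean).compute()  # doctest: +SKIP
     x
g
0  1.5
1  4.0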
