API¶
DataFrame¶
DataFrame
DataFrame.add
DataFrame.append
DataFrame.apply
DataFrame.assign
DataFrame.astype
DataFrame.categorize
DataFrame.columns
DataFrame.compute
DataFrame.corr
DataFrame.count
DataFrame.cov
DataFrame.cummax
DataFrame.cummin
DataFrame.cumprod
DataFrame.cumsum
DataFrame.describe
DataFrame.div
DataFrame.drop
DataFrame.drop_duplicates
DataFrame.dropna
DataFrame.dtypes
DataFrame.explode
DataFrame.fillna
DataFrame.floordiv
DataFrame.get_partition
DataFrame.groupby
DataFrame.head
DataFrame.iloc
DataFrame.index
DataFrame.isna
DataFrame.isnull
DataFrame.iterrows
DataFrame.itertuples
DataFrame.join
DataFrame.known_divisions
DataFrame.loc
DataFrame.map_partitions
DataFrame.mask
DataFrame.max
DataFrame.mean
DataFrame.merge
DataFrame.min
DataFrame.mod
DataFrame.mul
DataFrame.ndim
DataFrame.nlargest
DataFrame.npartitions
DataFrame.partitions
DataFrame.pop
DataFrame.pow
DataFrame.prod
DataFrame.quantile
DataFrame.query
DataFrame.radd
DataFrame.random_split
DataFrame.rdiv
DataFrame.rename
DataFrame.repartition
DataFrame.replace
DataFrame.reset_index
DataFrame.rfloordiv
DataFrame.rmod
DataFrame.rmul
DataFrame.rpow
DataFrame.rsub
DataFrame.rtruediv
DataFrame.sample
DataFrame.set_index
DataFrame.shape
DataFrame.std
DataFrame.sub
DataFrame.sum
DataFrame.tail
DataFrame.to_bag
DataFrame.to_csv
DataFrame.to_dask_array
DataFrame.to_delayed
DataFrame.to_hdf
DataFrame.to_json
DataFrame.to_parquet
DataFrame.to_records
DataFrame.truediv
DataFrame.values
DataFrame.var
DataFrame.visualize
DataFrame.where
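These methods mirror their pandas counterparts, but each call only extends a lazy task graph; nothing runs until a result is requested. A minimal sketch of the pattern (data and column names are illustrative):

import pandas as pd
import dask.dataframe as dd

# Build a Dask DataFrame from an in-memory pandas DataFrame,
# split into two partitions.
pdf = pd.DataFrame({"x": range(10), "y": range(10, 20)})
ddf = dd.from_pandas(pdf, npartitions=2)

# Each method call returns a new lazy Dask collection.
result = ddf[ddf.x > 3].y.mean()

# .compute() executes the graph and returns a concrete result.
print(result.compute())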
Series¶
Series
Series.add
Series.align
Series.all
Series.any
Series.append
Series.apply
Series.astype
Series.autocorr
Series.between
Series.bfill
Series.cat
Series.clear_divisions
Series.clip
Series.clip_lower
Series.clip_upper
Series.compute
Series.copy
Series.corr
Series.count
Series.cov
Series.cummax
Series.cummin
Series.cumprod
Series.cumsum
Series.describe
Series.diff
Series.div
Series.drop_duplicates
Series.dropna
Series.dt
Series.dtype
Series.eq
Series.explode
Series.ffill
Series.fillna
Series.first
Series.floordiv
Series.ge
Series.get_partition
Series.groupby
Series.gt
Series.head
Series.idxmax
Series.idxmin
Series.isin
Series.isna
Series.isnull
Series.iteritems
Series.known_divisions
Series.last
Series.le
Series.loc
Series.lt
Series.map
Series.map_overlap
Series.map_partitions
Series.mask
Series.max
Series.mean
Series.memory_usage
Series.min
Series.mod
Series.mul
Series.nbytes
Series.ndim
Series.ne
Series.nlargest
Series.notnull
Series.nsmallest
Series.nunique
Series.nunique_approx
Series.persist
Series.pipe
Series.pow
Series.prod
Series.quantile
Series.radd
Series.random_split
Series.rdiv
Series.reduction
Series.repartition
Series.replace
Series.rename
Series.resample
Series.reset_index
Series.rolling
Series.round
Series.sample
Series.sem
Series.shape
Series.shift
Series.size
Series.std
Series.str
Series.sub
Series.sum
Series.to_bag
Series.to_csv
Series.to_dask_array
Series.to_delayed
Series.to_frame
Series.to_hdf
Series.to_string
Series.to_timestamp
Series.truediv
Series.unique
Series.value_counts
Series.values
Series.var
Series.visualize
Series.where
Groupby Operations¶
DataFrameGroupBy.aggregate(arg[, …])
    Aggregate using one or more operations over the specified axis.
DataFrameGroupBy.apply(func, *args, **kwargs)
    Parallel version of pandas GroupBy.apply.
DataFrameGroupBy.count([split_every, split_out])
    Compute count of group, excluding missing values.
DataFrameGroupBy.cumcount([axis])
    Number each item in each group from 0 to the length of that group - 1.
DataFrameGroupBy.cumprod([axis])
    Cumulative product for each group.
DataFrameGroupBy.cumsum([axis])
    Cumulative sum for each group.
DataFrameGroupBy.get_group(key)
    Construct DataFrame from group with provided name.
DataFrameGroupBy.max([split_every, split_out])
    Compute max of group values.
DataFrameGroupBy.mean([split_every, split_out])
    Compute mean of groups, excluding missing values.
DataFrameGroupBy.min([split_every, split_out])
    Compute min of group values.
DataFrameGroupBy.size([split_every, split_out])
    Compute group sizes.
DataFrameGroupBy.std([ddof, split_every, …])
    Compute standard deviation of groups, excluding missing values.
DataFrameGroupBy.sum([split_every, …])
    Compute sum of group values.
DataFrameGroupBy.var([ddof, split_every, …])
    Compute variance of groups, excluding missing values.
DataFrameGroupBy.cov([ddof, split_every, …])
    Compute pairwise covariance of columns, excluding NA/null values.
DataFrameGroupBy.corr([ddof, split_every, …])
    Compute pairwise correlation of columns, excluding NA/null values.
DataFrameGroupBy.first([split_every, split_out])
    Compute first of group values.
DataFrameGroupBy.last([split_every, split_out])
    Compute last of group values.
DataFrameGroupBy.idxmin([split_every, …])
    Return index of first occurrence of minimum over requested axis.
DataFrameGroupBy.idxmax([split_every, …])
    Return index of first occurrence of maximum over requested axis.
SeriesGroupBy.aggregate(arg[, split_every, …])
    Aggregate using one or more operations over the specified axis.
SeriesGroupBy.apply(func, *args, **kwargs)
    Parallel version of pandas GroupBy.apply.
SeriesGroupBy.count([split_every, split_out])
    Compute count of group, excluding missing values.
SeriesGroupBy.cumcount([axis])
    Number each item in each group from 0 to the length of that group - 1.
SeriesGroupBy.cumprod([axis])
    Cumulative product for each group.
SeriesGroupBy.cumsum([axis])
    Cumulative sum for each group.
SeriesGroupBy.get_group(key)
    Construct DataFrame from group with provided name.
SeriesGroupBy.max([split_every, split_out])
    Compute max of group values.
SeriesGroupBy.mean([split_every, split_out])
    Compute mean of groups, excluding missing values.
SeriesGroupBy.min([split_every, split_out])
    Compute min of group values.
SeriesGroupBy.nunique([split_every, split_out])
SeriesGroupBy.size([split_every, split_out])
    Compute group sizes.
SeriesGroupBy.std([ddof, split_every, split_out])
    Compute standard deviation of groups, excluding missing values.
SeriesGroupBy.sum([split_every, split_out, …])
    Compute sum of group values.
SeriesGroupBy.var([ddof, split_every, split_out])
    Compute variance of groups, excluding missing values.
SeriesGroupBy.first([split_every, split_out])
    Compute first of group values.
SeriesGroupBy.last([split_every, split_out])
    Compute last of group values.
SeriesGroupBy.idxmin([split_every, …])
    Return index of first occurrence of minimum over requested axis.
SeriesGroupBy.idxmax([split_every, …])
    Return index of first occurrence of maximum over requested axis.
Aggregation(name, chunk, agg[, finalize])
    User defined groupby-aggregation.
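The Aggregation entry above is the hook for reductions that are not built in. As a sketch (the value_range name is made up for illustration), a custom "range" aggregation is phrased as a per-partition chunk step, an agg step that combines partitions, and a finalize step:

import pandas as pd
import dask.dataframe as dd

# Custom groupby aggregation: the range (max - min) of each group.
value_range = dd.Aggregation(
    name="range",
    # Per partition: each group's min and max.
    chunk=lambda grouped: (grouped.min(), grouped.max()),
    # Combine the per-partition minima and maxima, group-wise.
    agg=lambda mins, maxs: (mins.min(), maxs.max()),
    # Final per-group value.
    finalize=lambda mins, maxs: maxs - mins,
)

pdf = pd.DataFrame({"g": ["a", "a", "b"], "v": [1.0, 3.0, 5.0]})
ddf = dd.from_pandas(pdf, npartitions=2)
print(ddf.groupby("g").v.agg(value_range).compute())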
Rolling Operations¶
rolling.map_overlap
Series.rolling
DataFrame.rolling
Rolling.apply
Rolling.count
Rolling.kurt
Rolling.max
Rolling.mean
Rolling.median
Rolling.min
Rolling.quantile
Rolling.skew
Rolling.std
Rolling.sum
Rolling.var
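Rolling windows need values from neighboring partitions, so Dask shares a small overlap between adjacent partitions before applying the corresponding pandas method. A minimal sketch:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(10)})
ddf = dd.from_pandas(pdf, npartitions=2)

# The 3-wide window spans the partition boundary transparently.
print(ddf.x.rolling(window=3).mean().compute())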
Create DataFrames¶
read_csv
read_table
read_fwf
read_parquet
read_hdf
read_json
read_orc
read_sql_table
from_array
from_bcolz
from_dask_array
from_delayed
from_pandas
dask.bag.core.Bag.to_dataframe
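Most readers accept the same arguments as their pandas equivalents, plus globs and blocksize-style options that control partitioning. A sketch with an illustrative path:

import pandas as pd
import dask.dataframe as dd

# One partition per matched file (large files may be further split
# into blocks). The path is illustrative.
ddf = dd.read_csv("data/2019-*.csv")

# Objects already in memory can be wrapped directly.
ddf2 = dd.from_pandas(pd.DataFrame({"x": [1, 2, 3]}), npartitions=2)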
Store DataFrames¶
to_csv
to_parquet
to_hdf
to_records
to_bag
to_json
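Writers are where the lazy graph actually executes: calling one of these runs the computation and typically emits one file per partition. A sketch with illustrative output paths:

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"x": range(8)}), npartitions=4)

# The '*' in the to_csv template is replaced by each partition number.
ddf.to_csv("out/part-*.csv")

# to_parquet writes a directory with one part file per partition.
ddf.to_parquet("out.parquet")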
Convert DataFrames¶
to_dask_array
to_delayed
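These conversions keep the data lazy but change the collection type. A minimal sketch:

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]}), npartitions=2)

# One array chunk per DataFrame partition; lengths=True computes the
# per-partition row counts so chunk sizes are known up front.
arr = ddf.to_dask_array(lengths=True)

# One dask.delayed object per partition, each evaluating to a pandas
# DataFrame when computed.
parts = ddf.to_delayed()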
Reshape DataFrames¶
get_dummies
pivot_table
melt
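get_dummies in particular needs the full set of categories up front, which a lazy collection cannot know without a pass over the data; categorize() performs that pass. A minimal sketch:

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"g": ["a", "b", "a", "c"]}), npartitions=2)

# Compute the category set so the output columns are known.
ddf = ddf.categorize(columns=["g"])
print(dd.get_dummies(ddf.g).compute())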
DataFrame Methods¶
Series Methods¶
DataFrameGroupBy¶
class dask.dataframe.groupby.DataFrameGroupBy(df, by=None, slice=None, group_keys=True)¶

agg(arg, split_every=None, split_out=1)¶
Aggregate using one or more operations over the specified axis.
This docstring was copied from pandas.core.groupby.generic.DataFrameGroupBy.agg.
Some inconsistencies with the Dask version may exist.
Parameters: func : function, str, list or dict
Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.
Accepted combinations are:
- function
- string function name
- list of functions and/or function names, e.g. [np.sum, 'mean']
- dict of axis labels -> functions, function names or list of such.
*args
Positional arguments to pass to func.
**kwargs
Keyword arguments to pass to func.
Returns: scalar, Series or DataFrame
The return can be:
- scalar : when Series.agg is called with single function
- Series : when DataFrame.agg is called with a single function
- DataFrame : when DataFrame.agg is called with several functions
Return scalar, Series or DataFrame.
See also
pandas.DataFrame.groupby.apply, pandas.DataFrame.groupby.transform, pandas.DataFrame.aggregate
Notes
agg is an alias for aggregate. Use the alias.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> df = pd.DataFrame({'A': [1, 1, 2, 2],  # doctest: +SKIP
...                    'B': [1, 2, 3, 4],
...                    'C': np.random.randn(4)})

>>> df  # doctest: +SKIP
   A  B         C
0  1  1  0.362838
1  1  2  0.227877
2  2  3  1.267767
3  2  4 -0.562860

The aggregation is for each column.

>>> df.groupby('A').agg('min')  # doctest: +SKIP
   B         C
A
1  1  0.227877
2  3 -0.562860

Multiple aggregations

>>> df.groupby('A').agg(['min', 'max'])  # doctest: +SKIP
    B             C
  min max       min       max
A
1   1   2  0.227877  0.362838
2   3   4 -0.562860  1.267767

Select a column for aggregation

>>> df.groupby('A').B.agg(['min', 'max'])  # doctest: +SKIP
   min  max
A
1    1    2
2    3    4

Different aggregations per column

>>> df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})  # doctest: +SKIP
    B             C
  min max       sum
A
1   1   2  0.590716
2   3   4  0.704907

To control the output names with different aggregations per column, pandas supports “named aggregation”

>>> df.groupby("A").agg(  # doctest: +SKIP
...     b_min=pd.NamedAgg(column="B", aggfunc="min"),
...     c_sum=pd.NamedAgg(column="C", aggfunc="sum"))
   b_min     c_sum
A
1      1 -1.956929
2      3 -0.322183

- The keywords are the output column names.
- The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

See Named aggregation for more.
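Note the split_every and split_out keywords in the Dask signature; they are not part of the pandas API. As a sketch of their effect: split_every bounds how many intermediate group results are combined at each level of the tree reduction, and split_out chooses how many partitions the aggregated result is spread across (useful when there are many groups):

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(
    pd.DataFrame({"g": list("abcd") * 25, "v": range(100)}),
    npartitions=8,
)

# Combine at most 4 intermediates per reduction step; spread the
# result over 2 output partitions.
out = ddf.groupby("g").v.agg("sum", split_every=4, split_out=2)
print(out.npartitions)  # 2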
aggregate(arg, split_every=None, split_out=1)¶
Aggregate using one or more operations over the specified axis.
This docstring was copied from pandas.core.groupby.generic.DataFrameGroupBy.aggregate.
Some inconsistencies with the Dask version may exist.
Parameters: func : function, str, list or dict
Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.
Accepted combinations are:
- function
- string function name
- list of functions and/or function names, e.g. [np.sum, 'mean']
- dict of axis labels -> functions, function names or list of such.
*args
Positional arguments to pass to func.
**kwargs
Keyword arguments to pass to func.
Returns: scalar, Series or DataFrame
The return can be:
- scalar : when Series.agg is called with single function
- Series : when DataFrame.agg is called with a single function
- DataFrame : when DataFrame.agg is called with several functions
Return scalar, Series or DataFrame.
See also
pandas.DataFrame.groupby.apply, pandas.DataFrame.groupby.transform, pandas.DataFrame.aggregate
Notes
agg is an alias for aggregate. Use the alias.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> df = pd.DataFrame({'A': [1, 1, 2, 2],  # doctest: +SKIP
...                    'B': [1, 2, 3, 4],
...                    'C': np.random.randn(4)})

>>> df  # doctest: +SKIP
   A  B         C
0  1  1  0.362838
1  1  2  0.227877
2  2  3  1.267767
3  2  4 -0.562860

The aggregation is for each column.

>>> df.groupby('A').agg('min')  # doctest: +SKIP
   B         C
A
1  1  0.227877
2  3 -0.562860

Multiple aggregations

>>> df.groupby('A').agg(['min', 'max'])  # doctest: +SKIP
    B             C
  min max       min       max
A
1   1   2  0.227877  0.362838
2   3   4 -0.562860  1.267767

Select a column for aggregation

>>> df.groupby('A').B.agg(['min', 'max'])  # doctest: +SKIP
   min  max
A
1    1    2
2    3    4

Different aggregations per column

>>> df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})  # doctest: +SKIP
    B             C
  min max       sum
A
1   1   2  0.590716
2   3   4  0.704907

To control the output names with different aggregations per column, pandas supports “named aggregation”

>>> df.groupby("A").agg(  # doctest: +SKIP
...     b_min=pd.NamedAgg(column="B", aggfunc="min"),
...     c_sum=pd.NamedAgg(column="C", aggfunc="sum"))
   b_min     c_sum
A
1      1 -1.956929
2      3 -0.322183

- The keywords are the output column names.
- The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

See Named aggregation for more.
apply(func, *args, **kwargs)¶
Parallel version of pandas GroupBy.apply

This mimics the pandas version except for the following:

- If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
- Dask’s GroupBy.apply is not appropriate for aggregations. For custom aggregations, use dask.dataframe.groupby.Aggregation.

Warning

Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.

Parameters: func : function
    Function to apply
args, kwargs : Scalar, Delayed or object
    Arguments and keywords to pass to the function.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
    An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns: applied : Series or DataFrame depending on columns keyword
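In practice the meta argument is the part of this signature that differs most from pandas. A minimal sketch of supplying it as a {name: dtype} dict (the demean function is illustrative):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"g": ["a", "a", "b"], "v": [1.0, 2.0, 3.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

def demean(df):
    # Subtract the group mean from each row of the group.
    return df.assign(v=df.v - df.v.mean())

# Without meta, Dask would run demean on a dummy frame to infer the
# output schema; passing it explicitly avoids surprises.
out = ddf.groupby("g").apply(demean, meta={"g": "object", "v": "float64"})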
corr(ddof=1, split_every=None, split_out=1)¶
Compute pairwise correlation of columns, excluding NA/null values.
This docstring was copied from pandas.core.frame.DataFrame.corr.
Some inconsistencies with the Dask version may exist.
Groupby correlation: corr(X, Y) = cov(X, Y) / (std_x * std_y)
Parameters: method : {‘pearson’, ‘kendall’, ‘spearman’} or callable (Not supported in Dask)
- pearson : standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
- callable: callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

  New in version 0.24.0.
min_periods : int, optional (Not supported in Dask)
Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.
Returns: DataFrame
Correlation matrix.
See also
DataFrame.corrwith, Series.corr
Examples
>>> def histogram_intersection(a, b):  # doctest: +SKIP
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],  # doctest: +SKIP
...                   columns=['dogs', 'cats'])
>>> df.corr(method=histogram_intersection)  # doctest: +SKIP
      dogs  cats
dogs   1.0   0.3
cats   0.3   1.0
count(split_every=None, split_out=1)¶
Compute count of group, excluding missing values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.count.
Some inconsistencies with the Dask version may exist.
Returns: Series or DataFrame
Count of values within each group.
See also
Series.groupby, DataFrame.groupby
cov(ddof=1, split_every=None, split_out=1, std=False)¶
Compute pairwise covariance of columns, excluding NA/null values.
This docstring was copied from pandas.core.frame.DataFrame.cov.
Some inconsistencies with the Dask version may exist.
Groupby covariance is accomplished by
- Computing intermediate values for sum, count, and the product of all columns: a b c -> a*a, a*b, b*b, b*c, c*c.
- The values are then aggregated and the final covariance value is calculated: cov(X, Y) = X*Y - Xbar * Ybar
When std is True, the correlation is calculated instead.
Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.
Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.

This method is generally used for the analysis of time series data to understand the relationship between different measures across time.
Parameters: min_periods : int, optional (Not supported in Dask)
Minimum number of observations required per pair of columns to have a valid result.
Returns: DataFrame
The covariance matrix of the series of the DataFrame.
See also
Series.cov
    Compute covariance with another Series.
core.window.EWM.cov
    Exponential weighted sample covariance.
core.window.Expanding.cov
    Expanding sample covariance.
core.window.Rolling.cov
    Rolling sample covariance.
Notes
Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-1.
For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.
However, for many applications this estimate may not be acceptable because the estimate covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimate correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.
Examples
>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],  # doctest: +SKIP
...                   columns=['dogs', 'cats'])
>>> df.cov()  # doctest: +SKIP
          dogs      cats
dogs  0.666667 -1.000000
cats -1.000000  1.666667

>>> np.random.seed(42)  # doctest: +SKIP
>>> df = pd.DataFrame(np.random.randn(1000, 5),  # doctest: +SKIP
...                   columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()  # doctest: +SKIP
          a         b         c         d         e
a  0.998438 -0.020161  0.059277 -0.008943  0.014144
b -0.020161  1.059352 -0.008543 -0.024738  0.009826
c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
e  0.014144  0.009826 -0.000271 -0.013692  0.977795

Minimum number of periods

This method also supports an optional min_periods keyword that specifies the required minimum number of non-NA observations for each column pair in order to have a valid result:

>>> np.random.seed(42)  # doctest: +SKIP
>>> df = pd.DataFrame(np.random.randn(20, 3),  # doctest: +SKIP
...                   columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan  # doctest: +SKIP
>>> df.loc[df.index[5:10], 'b'] = np.nan  # doctest: +SKIP
>>> df.cov(min_periods=12)  # doctest: +SKIP
          a         b         c
a  0.316741       NaN -0.150812
b       NaN  1.248003  0.191417
c -0.150812  0.191417  0.895202
cumcount(axis=None)¶
Number each item in each group from 0 to the length of that group - 1.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumcount.
Some inconsistencies with the Dask version may exist.
Essentially this is equivalent to
>>> self.apply(lambda x: pd.Series(np.arange(len(x)), x.index)) # doctest: +SKIP
Parameters: ascending : bool, default True (Not supported in Dask)
If False, number in reverse, from length of group - 1 to 0.
Returns: Series
Sequence number of each element within each group.
See also
ngroup
    Number the groups themselves.
Examples
>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],  # doctest: +SKIP
...                   columns=['A'])
>>> df  # doctest: +SKIP
   A
0  a
1  a
2  a
3  b
4  b
5  a
>>> df.groupby('A').cumcount()  # doctest: +SKIP
0    0
1    1
2    2
3    0
4    1
5    3
dtype: int64
>>> df.groupby('A').cumcount(ascending=False)  # doctest: +SKIP
0    3
1    2
2    1
3    1
4    0
5    0
dtype: int64
cumprod(axis=0)¶
Cumulative product for each group.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumprod.
Some inconsistencies with the Dask version may exist.
Returns: Series or DataFrame

See also
Series.groupby, DataFrame.groupby
cumsum(axis=0)¶
Cumulative sum for each group.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumsum.
Some inconsistencies with the Dask version may exist.
Returns: Series or DataFrame

See also
Series.groupby, DataFrame.groupby
first(split_every=None, split_out=1)¶
Compute first of group values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.first.
Some inconsistencies with the Dask version may exist.
Returns: Series or DataFrame
Computed first of values within each group.
get_group(key)¶
Construct DataFrame from group with provided name.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.get_group.
Some inconsistencies with the Dask version may exist.
Parameters: name : object (Not supported in Dask)
the name of the group to get as a DataFrame
obj : DataFrame, default None (Not supported in Dask)
the DataFrame to take the DataFrame out of. If it is None, the object groupby was called on will be used
Returns: group : same type as obj
idxmax(split_every=None, split_out=1, axis=None, skipna=True)¶
Return index of first occurrence of maximum over requested axis. NA/null values are excluded.
This docstring was copied from pandas.core.frame.DataFrame.idxmax.
Some inconsistencies with the Dask version may exist.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
Returns: Series
Indexes of maxima along the specified axis.
Raises: ValueError
- If the row/column is empty
See also
Series.idxmax
Notes
This method is the DataFrame version of ndarray.argmax.
idxmin(split_every=None, split_out=1, axis=None, skipna=True)¶
Return index of first occurrence of minimum over requested axis. NA/null values are excluded.
This docstring was copied from pandas.core.frame.DataFrame.idxmin.
Some inconsistencies with the Dask version may exist.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
Returns: Series
Indexes of minima along the specified axis.
Raises: ValueError
- If the row/column is empty
See also
Series.idxmin
Notes
This method is the DataFrame version of ndarray.argmin.
last(split_every=None, split_out=1)¶
Compute last of group values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.last.
Some inconsistencies with the Dask version may exist.
Returns: Series or DataFrame
Computed last of values within each group.
max(split_every=None, split_out=1)¶
Compute max of group values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.max.
Some inconsistencies with the Dask version may exist.
Returns: Series or DataFrame
Computed max of values within each group.
mean(split_every=None, split_out=1)¶
Compute mean of groups, excluding missing values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.mean.
Some inconsistencies with the Dask version may exist.
Returns: pandas.Series or pandas.DataFrame

See also
Series.groupby, DataFrame.groupby
Examples
>>> df = pd.DataFrame({'A': [1, 1, 2, 1, 2],  # doctest: +SKIP
...                    'B': [np.nan, 2, 3, 4, 5],
...                    'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])

Groupby one column and return the mean of the remaining columns in each group.

>>> df.groupby('A').mean()  # doctest: +SKIP
     B         C
A
1  3.0  1.333333
2  4.0  1.500000

Groupby two columns and return the mean of the remaining column.

>>> df.groupby(['A', 'B']).mean()  # doctest: +SKIP
       C
A B
1 2.0  2
  4.0  1
2 3.0  1
  5.0  2

Groupby one column and return the mean of only particular column in the group.

>>> df.groupby('A')['B'].mean()  # doctest: +SKIP
A
1    3.0
2    4.0
Name: B, dtype: float64
min(split_every=None, split_out=1)¶
Compute min of group values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.min.
Some inconsistencies with the Dask version may exist.
Returns: Series or DataFrame
Computed min of values within each group.
prod(split_every=None, split_out=1, min_count=None)¶
Compute prod of group values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.prod.
Some inconsistencies with the Dask version may exist.
Returns: Series or DataFrame
Computed prod of values within each group.
size(split_every=None, split_out=1)¶
Compute group sizes.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.size.
Some inconsistencies with the Dask version may exist.
Returns: Series
Number of rows in each group.
See also
Series.groupby, DataFrame.groupby
std(ddof=1, split_every=None, split_out=1)¶
Compute standard deviation of groups, excluding missing values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.std.
Some inconsistencies with the Dask version may exist.
For multiple groupings, the result index will be a MultiIndex.
Parameters: ddof : integer, default 1
degrees of freedom
Returns: Series or DataFrame
Standard deviation of values within each group.
See also
Series.groupby, DataFrame.groupby
sum(split_every=None, split_out=1, min_count=None)¶
Compute sum of group values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.sum.
Some inconsistencies with the Dask version may exist.
Returns: Series or DataFrame
Computed sum of values within each group.
transform(func, *args, **kwargs)¶
Parallel version of pandas GroupBy.transform

This mimics the pandas version except for the following:

- If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
- Dask’s GroupBy.transform is not appropriate for aggregations. For custom aggregations, use dask.dataframe.groupby.Aggregation.

Warning

Pandas’ groupby-transform can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-transform will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.

Parameters: func : function
    Function to apply
args, kwargs : Scalar, Delayed or object
    Arguments and keywords to pass to the function.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
    An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns: applied : Series or DataFrame depending on columns keyword
var(ddof=1, split_every=None, split_out=1)¶
Compute variance of groups, excluding missing values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.var.
Some inconsistencies with the Dask version may exist.
For multiple groupings, the result index will be a MultiIndex.
Parameters: ddof : integer, default 1
degrees of freedom
Returns: Series or DataFrame
Variance of values within each group.
See also
Series.groupby, DataFrame.groupby
SeriesGroupBy¶
class dask.dataframe.groupby.SeriesGroupBy(df, by=None, slice=None, **kwargs)¶

agg(arg, split_every=None, split_out=1)¶
Aggregate using one or more operations over the specified axis.
This docstring was copied from pandas.core.groupby.generic.SeriesGroupBy.agg.
Some inconsistencies with the Dask version may exist.
Parameters: func : function, str, list or dict
Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply.
Accepted combinations are:
- function
- string function name
- list of functions and/or function names, e.g. [np.sum, 'mean']
- dict of axis labels -> functions, function names or list of such.
*args
Positional arguments to pass to func.
**kwargs
Keyword arguments to pass to func.
Returns: scalar, Series or DataFrame
The return can be:
- scalar : when Series.agg is called with single function
- Series : when DataFrame.agg is called with a single function
- DataFrame : when DataFrame.agg is called with several functions
Return scalar, Series or DataFrame.
See also
pandas.Series.groupby.apply, pandas.Series.groupby.transform, pandas.Series.aggregate
Notes
agg is an alias for aggregate. Use the alias.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> s = pd.Series([1, 2, 3, 4])  # doctest: +SKIP

>>> s  # doctest: +SKIP
0    1
1    2
2    3
3    4
dtype: int64

>>> s.groupby([1, 1, 2, 2]).min()  # doctest: +SKIP
1    1
2    3
dtype: int64

>>> s.groupby([1, 1, 2, 2]).agg('min')  # doctest: +SKIP
1    1
2    3
dtype: int64

>>> s.groupby([1, 1, 2, 2]).agg(['min', 'max'])  # doctest: +SKIP
   min  max
1    1    2
2    3    4

The output column names can be controlled by passing the desired column names and aggregations as keyword arguments.

>>> s.groupby([1, 1, 2, 2]).agg(  # doctest: +SKIP
...     minimum='min',
...     maximum='max',
... )
   minimum  maximum
1        1        2
2        3        4
aggregate(arg, split_every=None, split_out=1)¶
Aggregate using one or more operations over the specified axis.
This docstring was copied from pandas.core.groupby.generic.SeriesGroupBy.aggregate.
Some inconsistencies with the Dask version may exist.
Parameters: func : function, str, list or dict
Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply.
Accepted combinations are:
- function
- string function name
- list of functions and/or function names, e.g. [np.sum, 'mean']
- dict of axis labels -> functions, function names or list of such.
*args
Positional arguments to pass to func.
**kwargs
Keyword arguments to pass to func.
Returns: scalar, Series or DataFrame
The return can be:
- scalar : when Series.agg is called with single function
- Series : when DataFrame.agg is called with a single function
- DataFrame : when DataFrame.agg is called with several functions
Return scalar, Series or DataFrame.
See also
pandas.Series.groupby.apply, pandas.Series.groupby.transform, pandas.Series.aggregate
Notes
agg is an alias for aggregate. Use the alias.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> s = pd.Series([1, 2, 3, 4])  # doctest: +SKIP

>>> s  # doctest: +SKIP
0    1
1    2
2    3
3    4
dtype: int64

>>> s.groupby([1, 1, 2, 2]).min()  # doctest: +SKIP
1    1
2    3
dtype: int64

>>> s.groupby([1, 1, 2, 2]).agg('min')  # doctest: +SKIP
1    1
2    3
dtype: int64

>>> s.groupby([1, 1, 2, 2]).agg(['min', 'max'])  # doctest: +SKIP
   min  max
1    1    2
2    3    4

The output column names can be controlled by passing the desired column names and aggregations as keyword arguments.

>>> s.groupby([1, 1, 2, 2]).agg(  # doctest: +SKIP
...     minimum='min',
...     maximum='max',
... )
   minimum  maximum
1        1        2
2        3        4
apply(func, *args, **kwargs)¶
Parallel version of pandas GroupBy.apply

This mimics the pandas version except for the following:

- If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
- Dask’s GroupBy.apply is not appropriate for aggregations. For custom aggregations, use dask.dataframe.groupby.Aggregation.

Warning

Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.

Parameters: func : function
    Function to apply
args, kwargs : Scalar, Delayed or object
    Arguments and keywords to pass to the function.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
    An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns: applied : Series or DataFrame depending on columns keyword
corr(ddof=1, split_every=None, split_out=1)¶
Compute pairwise correlation of columns, excluding NA/null values.
This docstring was copied from pandas.core.frame.DataFrame.corr.
Some inconsistencies with the Dask version may exist.
Groupby correlation: corr(X, Y) = cov(X, Y) / (std_x * std_y)
Parameters: method : {‘pearson’, ‘kendall’, ‘spearman’} or callable (Not supported in Dask)
- pearson : standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
- callable: callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

  New in version 0.24.0.
min_periods : int, optional (Not supported in Dask)
Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.
Returns: DataFrame
Correlation matrix.
See also
DataFrame.corrwith, Series.corr
Examples
>>> def histogram_intersection(a, b):  # doctest: +SKIP
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],  # doctest: +SKIP
...                   columns=['dogs', 'cats'])
>>> df.corr(method=histogram_intersection)  # doctest: +SKIP
      dogs  cats
dogs   1.0   0.3
cats   0.3   1.0
count(split_every=None, split_out=1)¶
Compute count of group, excluding missing values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.count.
Some inconsistencies with the Dask version may exist.
Returns: Series or DataFrame
Count of values within each group.
See also
Series.groupby, DataFrame.groupby
cov(ddof=1, split_every=None, split_out=1, std=False)¶
Compute pairwise covariance of columns, excluding NA/null values.
This docstring was copied from pandas.core.frame.DataFrame.cov.
Some inconsistencies with the Dask version may exist.
Groupby covariance is accomplished by
- Computing intermediate values for sum, count, and the product of all columns: a b c -> a*a, a*b, b*b, b*c, c*c.
- The values are then aggregated and the final covariance value is calculated: cov(X, Y) = X*Y - Xbar * Ybar
When std is True, the correlation is calculated instead.
Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.
Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.

This method is generally used for the analysis of time series data to understand the relationship between different measures across time.
Parameters: min_periods : int, optional (Not supported in Dask)
Minimum number of observations required per pair of columns to have a valid result.
Returns: DataFrame
The covariance matrix of the series of the DataFrame.
See also
Series.cov
    Compute covariance with another Series.
core.window.EWM.cov
    Exponential weighted sample covariance.
core.window.Expanding.cov
    Expanding sample covariance.
core.window.Rolling.cov
    Rolling sample covariance.
Notes
Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-1.
For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.
However, for many applications this estimate may not be acceptable because the estimate covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimate correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.
Examples
>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],  # doctest: +SKIP
...                   columns=['dogs', 'cats'])
>>> df.cov()  # doctest: +SKIP
          dogs      cats
dogs  0.666667 -1.000000
cats -1.000000  1.666667

>>> np.random.seed(42)  # doctest: +SKIP
>>> df = pd.DataFrame(np.random.randn(1000, 5),  # doctest: +SKIP
...                   columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()  # doctest: +SKIP
          a         b         c         d         e
a  0.998438 -0.020161  0.059277 -0.008943  0.014144
b -0.020161  1.059352 -0.008543 -0.024738  0.009826
c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
e  0.014144  0.009826 -0.000271 -0.013692  0.977795

Minimum number of periods

This method also supports an optional min_periods keyword that specifies the required minimum number of non-NA observations for each column pair in order to have a valid result:

>>> np.random.seed(42)  # doctest: +SKIP
>>> df = pd.DataFrame(np.random.randn(20, 3),  # doctest: +SKIP
...                   columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan  # doctest: +SKIP
>>> df.loc[df.index[5:10], 'b'] = np.nan  # doctest: +SKIP
>>> df.cov(min_periods=12)  # doctest: +SKIP
          a         b         c
a  0.316741       NaN -0.150812
b       NaN  1.248003  0.191417
c -0.150812  0.191417  0.895202
cumcount(axis=None)¶
Number each item in each group from 0 to the length of that group - 1.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumcount.
Some inconsistencies with the Dask version may exist.
Essentially this is equivalent to
>>> self.apply(lambda x: pd.Series(np.arange(len(x)), x.index)) # doctest: +SKIP
Parameters: ascending : bool, default True (Not supported in Dask)
If False, number in reverse, from length of group - 1 to 0.
Returns: Series
Sequence number of each element within each group.
See also
ngroup
    Number the groups themselves.
Examples
>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],  # doctest: +SKIP
...                   columns=['A'])
>>> df  # doctest: +SKIP
   A
0  a
1  a
2  a
3  b
4  b
5  a
>>> df.groupby('A').cumcount()  # doctest: +SKIP
0    0
1    1
2    2
3    0
4    1
5    3
dtype: int64
>>> df.groupby('A').cumcount(ascending=False)  # doctest: +SKIP
0    3
1    2
2    1
3    1
4    0
5    0
dtype: int64
cumprod(axis=0)¶
Cumulative product for each group.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumprod.
Some inconsistencies with the Dask version may exist.
Returns: Series or DataFrame

See also
Series.groupby, DataFrame.groupby
cumsum(axis=0)¶
Cumulative sum for each group.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumsum.
Some inconsistencies with the Dask version may exist.
Returns: Series or DataFrame

See also
Series.groupby, DataFrame.groupby
first(split_every=None, split_out=1)¶
Compute first of group values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.first.
Some inconsistencies with the Dask version may exist.
Returns: Series or DataFrame
Computed first of values within each group.
get_group(key)¶
Construct DataFrame from group with provided name.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.get_group.
Some inconsistencies with the Dask version may exist.
Parameters: name : object (Not supported in Dask)
the name of the group to get as a DataFrame
obj : DataFrame, default None (Not supported in Dask)
the DataFrame to take the DataFrame out of. If it is None, the object groupby was called on will be used
Returns: group : same type as obj
idxmax(split_every=None, split_out=1, axis=None, skipna=True)¶
Return index of first occurrence of maximum over requested axis. NA/null values are excluded.
This docstring was copied from pandas.core.frame.DataFrame.idxmax.
Some inconsistencies with the Dask version may exist.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
Returns: Series
Indexes of maxima along the specified axis.
Raises: ValueError
- If the row/column is empty
See also
Series.idxmax
Notes
This method is the DataFrame version of ndarray.argmax.
idxmin(split_every=None, split_out=1, axis=None, skipna=True)¶
Return index of first occurrence of minimum over requested axis. NA/null values are excluded.
This docstring was copied from pandas.core.frame.DataFrame.idxmin.
Some inconsistencies with the Dask version may exist.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
Returns: Series
Indexes of minima along the specified axis.
Raises: ValueError
- If the row/column is empty
See also
Series.idxmin
Notes
This method is the DataFrame version of ndarray.argmin.
last(split_every=None, split_out=1)¶
Compute last of group values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.last.
Some inconsistencies with the Dask version may exist.
Returns: Series or DataFrame
Computed last of values within each group.
max(split_every=None, split_out=1)¶
Compute max of group values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.max.
Some inconsistencies with the Dask version may exist.
Returns: Series or DataFrame
Computed max of values within each group.
mean(split_every=None, split_out=1)¶
Compute mean of groups, excluding missing values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.mean.
Some inconsistencies with the Dask version may exist.
Returns: pandas.Series or pandas.DataFrame

See also
Series.groupby, DataFrame.groupby
Examples
>>> df = pd.DataFrame({'A': [1, 1, 2, 1, 2],  # doctest: +SKIP
...                    'B': [np.nan, 2, 3, 4, 5],
...                    'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])

Groupby one column and return the mean of the remaining columns in each group.

>>> df.groupby('A').mean()  # doctest: +SKIP
     B         C
A
1  3.0  1.333333
2  4.0  1.500000

Groupby two columns and return the mean of the remaining column.

>>> df.groupby(['A', 'B']).mean()  # doctest: +SKIP
       C
A B
1 2.0  2
  4.0  1
2 3.0  1
  5.0  2

Groupby one column and return the mean of only particular column in the group.

>>> df.groupby('A')['B'].mean()  # doctest: +SKIP
A
1    3.0
2    4.0
Name: B, dtype: float64
min(split_every=None, split_out=1)¶
Compute min of group values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.min.
Some inconsistencies with the Dask version may exist.
Returns: Series or DataFrame
Computed min of values within each group.
prod(split_every=None, split_out=1, min_count=None)¶
Compute prod of group values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.prod.
Some inconsistencies with the Dask version may exist.
Returns: Series or DataFrame
Computed prod of values within each group.
size(split_every=None, split_out=1)¶
Compute group sizes.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.size.
Some inconsistencies with the Dask version may exist.
Returns: Series
Number of rows in each group.
See also
Series.groupby, DataFrame.groupby
std(ddof=1, split_every=None, split_out=1)¶
Compute standard deviation of groups, excluding missing values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.std.
Some inconsistencies with the Dask version may exist.
For multiple groupings, the result index will be a MultiIndex.
Parameters: ddof : integer, default 1
degrees of freedom
Returns: Series or DataFrame
Standard deviation of values within each group.
See also
Series.groupby, DataFrame.groupby
sum(split_every=None, split_out=1, min_count=None)¶
Compute sum of group values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.sum.
Some inconsistencies with the Dask version may exist.
Returns: Series or DataFrame
Computed sum of values within each group.
transform(func, *args, **kwargs)¶
Parallel version of pandas GroupBy.transform

This mimics the pandas version except for the following:

- If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
- Dask’s GroupBy.transform is not appropriate for aggregations. For custom aggregations, use dask.dataframe.groupby.Aggregation.

Warning

Pandas’ groupby-transform can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-transform will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.

Parameters: func : function
    Function to apply
args, kwargs : Scalar, Delayed or object
    Arguments and keywords to pass to the function.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
    An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns: applied : Series or DataFrame depending on columns keyword
var(ddof=1, split_every=None, split_out=1)¶
Compute variance of groups, excluding missing values.
This docstring was copied from pandas.core.groupby.groupby.GroupBy.var.
Some inconsistencies with the Dask version may exist.
For multiple groupings, the result index will be a MultiIndex.
Parameters: ddof : integer, default 1
degrees of freedom
Returns: Series or DataFrame
Variance of values within each group.
See also
Series.groupby, DataFrame.groupby
Custom Aggregation¶
class dask.dataframe.groupby.Aggregation(name, chunk, agg, finalize=None)¶
User defined groupby-aggregation.
This class allows users to define their own custom aggregation in terms of operations on Pandas dataframes in a map-reduce style. You need to specify what operation to do on each chunk of data, how to combine those chunks of data together, and then how to finalize the result.
See Aggregate for more.
Parameters: name : str
the name of the aggregation. It should be unique, since intermediate results will be identified by this name.
chunk : callable
a function that will be called with the grouped column of each partition. It can either return a single series or a tuple of series. The index has to be equal to the groups.
agg : callable
a function that will be called to aggregate the results of each chunk. Again the argument(s) will be grouped series. If chunk returned a tuple, agg will be called with all of them as individual positional arguments.

finalize : callable
an optional finalizer that will be called with the results from the aggregation.
Examples
We could implement sum as follows:

>>> custom_sum = dd.Aggregation(
...     name='custom_sum',
...     chunk=lambda s: s.sum(),
...     agg=lambda s0: s0.sum()
... )  # doctest: +SKIP
>>> df.groupby('g').agg(custom_sum)  # doctest: +SKIP

We can implement mean as follows:

>>> custom_mean = dd.Aggregation(
...     name='custom_mean',
...     chunk=lambda s: (s.count(), s.sum()),
...     agg=lambda count, sum: (count.sum(), sum.sum()),
...     finalize=lambda count, sum: sum / count,
... )  # doctest: +SKIP
>>> df.groupby('g').agg(custom_mean)  # doctest: +SKIP

Though of course, both of these are built-in and so you don’t need to implement them yourself.
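For reference, a runnable version of the custom mean sketch above might look like the following (data values are illustrative):

import pandas as pd
import dask.dataframe as dd

custom_mean = dd.Aggregation(
    name="custom_mean",
    # Per partition: each group's count and sum.
    chunk=lambda s: (s.count(), s.sum()),
    # Combine the per-partition counts and sums, group-wise.
    agg=lambda counts, sums: (counts.sum(), sums.sum()),
    # Final per-group value.
    finalize=lambda counts, sums: sums / counts,
)

pdf = pd.DataFrame({"g": ["a", "a", "b", "b"], "v": [1.0, 2.0, 3.0, 5.0]})
ddf = dd.from_pandas(pdf, npartitions=2)
print(ddf.groupby("g").v.agg(custom_mean).compute())
# Expected: a -> 1.5, b -> 4.0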