The difference between pandas groupby count and pivot table

Posted: 2016-12-05 14:29

pandas groupby usage

pandas: comparing groupby, pivot_table, and crosstab
pivot_table
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
The function pandas.pivot_table can be used to create spreadsheet-style pivot tables.
It takes a number of arguments:
data: A DataFrame object
values: a column or a list of columns to aggregate
index: a column, Grouper, array with the same length as data, or a list of them. Keys to group by on the pivot table index. If an array is passed, it is used in the same manner as column values.
columns: a column, Grouper, array with the same length as data, or a list of them. Keys to group by on the pivot table columns. If an array is passed, it is used in the same manner as column values.
aggfunc: function to use for aggregation, defaulting to the mean
fill_value: scalar, default None. Value to replace missing values with.
It is typically called with at least three fields: index, columns, ...
crosstab
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True, normalize=False)
Compute a simple cross-tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.
It takes a number of arguments
index: array-like, values to group by in the rows
columns: array-like, values to group by in the columns
values: array-like, optional, array of values to aggregate according to the factors
aggfunc: function, optional, If no values array is passed, computes a frequency table
rownames: sequence, default None, must match number of row arrays passed
colnames: sequence, default None, if passed, must match number of column arrays passed
margins: boolean, default False, Add row/column margins (subtotals)
normalize: boolean, {'all', 'index', 'columns'}, or {0, 1}, default False. Normalize by dividing all values by the sum of values.
Any Series passed will have its name attribute used unless row or column names for the cross-tabulation are specified.
Note: index, columns, and values here are array-like objects (for example NumPy arrays or Series), not column names.
By default crosstab counts frequencies.
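The margins and normalize arguments described above can be sketched as follows. The small frame is an assumption that mirrors the type and B columns of the post's own example data; nothing else here comes from the original.

```python
import pandas as pd

# Assumed sample data (same values as the 'type' and 'B' columns used later in the post).
df = pd.DataFrame({
    "type": ["x", "y", "z", "x", "z"],
    "B": [3, 3, 4, 4, 4],
})

# Frequency table with row and column subtotals in an extra "All" row/column.
counts = pd.crosstab(df["type"], df["B"], margins=True)

# Share of each cell within its row, so every row sums to 1.
row_share = pd.crosstab(df["type"], df["B"], normalize="index")

print(counts)
print(row_share)
```

Passing Series rather than column names is what lets crosstab pick up the row and column labels from the Series names.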
groupby
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
Group a Series or DataFrame using a mapper (a dict or key function) or by one or more columns.
by : mapping, function, label, or list of labels. If by is a function, it is called on each value of the object's index to determine the groups. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups
axis : int, default 0
level : int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels
as_index : boolean, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
sort : boolean, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.
group_keys : boolean, default True
When calling apply, add group keys to index to identify pieces
squeeze : boolean, default False
Reduce the dimensionality of the return type if possible, otherwise return a consistent type. (This parameter was deprecated in pandas 1.1 and removed in pandas 2.0.)
import pandas as pd
import numpy as np

df_x = pd.DataFrame({'type': ['x', 'y', 'z', 'x', 'z'],
                     'A': [1, 2, 2, 2, 2],
                     'B': [3, 3, 4, 4, 4],
                     'C': [1, 1, np.nan, 1, 1]})

# Note the use of fill_value: empty (type, B) cells become 0 instead of NaN
pd.pivot_table(df_x, index=['type'], columns=['B'], values=['A'], aggfunc='sum', fill_value=0)

# Ways to count how often each combination occurs:
pd.pivot_table(df_x, index=['type'], columns=['B'], values=['A'], aggfunc=len, fill_value=0)
pd.crosstab(df_x.type, df_x.B)

# crosstab can also be passed a third Series and an aggregation function (aggfunc)
# that will be applied to the values of the third Series within each group
# defined by the first two Series:
pd.crosstab(df_x.type, df_x.B, values=df_x.A, aggfunc='sum')

# Pull out a single group, then sum its columns
df_x.groupby('type').get_group('x')
df_x.groupby('type').get_group('x').sum()  # a Series with dtype: object, because 'type' holds strings
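To make the comparison in the title concrete, here is a minimal sketch that builds the same frequency table three ways. It reuses the df_x data from the example above; the variable names by_groupby, by_pivot, and by_crosstab are illustrative only.

```python
import numpy as np
import pandas as pd

df_x = pd.DataFrame({
    "type": ["x", "y", "z", "x", "z"],
    "A": [1, 2, 2, 2, 2],
    "B": [3, 3, 4, 4, 4],
    "C": [1, 1, np.nan, 1, 1],
})

# groupby + size gives a long Series indexed by (type, B); unstack makes it wide.
by_groupby = df_x.groupby(["type", "B"]).size().unstack(fill_value=0)

# pivot_table returns the wide layout directly (counting non-missing values of A).
by_pivot = df_x.pivot_table(index="type", columns="B", values="A",
                            aggfunc="count", fill_value=0)

# crosstab counts co-occurrences without needing a values column.
by_crosstab = pd.crosstab(df_x["type"], df_x["B"])

print(by_groupby)
```

The practical difference is mostly one of shape: groupby produces a long result that has to be unstacked, while pivot_table and crosstab produce the wide table in a single call.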
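The pandas module
There are two ways to install it:
1. On Windows, installing the pandas package alone is not enough, because it does not bundle the packages it depends on. This is a nuisance: you only learn which packages are missing when the error messages appear. In my experience the following packages are needed in total:
pyparsing-2.0.2.win32-py2.7.exe
matplotlib-1.3.1.win32-py2.7.exe
openpyxl-openpyxl-5d2c0c874d2.tar.gz
setuptools-3.8.1.win32-py2.7.exe
numpy-MKL-1.8.1.win32-py2.7.exe
six-1.7.3.win32-py2.7.exe
python-dateutil-2.2.win32-py2.7.exe
These installers can be downloaded from:
Replace the trailing matplotlib with the name of the module you need. Alternatively, go to
2. Installing with pip avoids all of this trouble: just run pip install pandas.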
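Plotting with the plot method of the Series class in pandas.
Suppose tz_counts is a Series:
1. To plot from plain Python, first import matplotlib.pyplot:
import matplotlib.pyplot as plt
tz_counts[:10].plot(kind='barh')  # the kind value is cut off in the source; 'barh' is an assumed choice
plt.show()
2. To plot from IPython, start it in pylab mode:
ipython --pylab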
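The DataFrame class:
A DataFrame has three important attributes:
index: the row index.
columns: the column index.
values: a two-dimensional array of the values.
Unlike a Series, a DataFrame has no name attribute of its own; its index and columns objects each have a name.
This class is one of the most important classes in pandas.
Construction: DataFrame(sequence) builds a frame from a sequence in which every element is a dict.
Once frame = DataFrame(...) has been built, and assuming frame has the three columns 'name', 'age', and 'addr', frame['name'] shows the contents of that column; frame.name works as well.
Each column extracted from frame by name is a Series.
DataFrame supports boolean indexing.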
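groupby(str | array ...): the argument can be the name of a column in frame, an array with as many elements as frame has rows, or a function. A function used as the grouping key is called once on each value of frame's index, and its return value becomes the group label. Any number of keys can be passed. Following groupby with .size() gives a tally for each group.
After groupby you can use:
size(): the number of rows in each group, returned as a Series indexed by the group keys (count() is similar but skips missing values).
sum(): the sum within each group.
apply(func): calls func on each group separately and combines the results. Without groupby, DataFrame.apply(func, axis=0) applies func to each column by default; axis=1 applies it to each row.
reindex(index, columns, method): conforms the frame to a new set of labels, optionally filling the new entries by the given method (this is a DataFrame method rather than a groupby one).
sum(axis).argsort(): if the values in frame are numeric, sum totals them along one axis and returns a Series; argsort on that Series gives integer positions that can be passed to frame.take to fetch the corresponding rows.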
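The difference between size() and count() is easiest to see on a column with a missing value. The frame below is a made-up example, not data from the original notes.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 'z' contains one missing value in column C.
frame = pd.DataFrame({
    "type": ["x", "y", "z", "x", "z"],
    "C": [1, 1, np.nan, 1, 1],
})

grouped = frame.groupby("type")
sizes = grouped.size()          # rows per group, missing values included
counts = grouped["C"].count()   # non-missing values of C per group
totals = grouped["C"].sum()     # per-group sum, missing values skipped

print(pd.DataFrame({"size": sizes, "count": counts, "sum": totals}))
```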
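pivot_table(data, values=str1, index=str2, columns=str3, aggfunc=str4), the pivot table function:
str1: the column or columns handed to the function str4 for aggregation.
str2: supplies the row labels of the returned frame.
str3: supplies the column labels of the returned frame.
str4: the name of the aggregation function, such as 'mean' or 'sum', applied within the groups defined by str2 and str3.
The pivot table function returns a frame, so functions such as .sum() can then be applied to it to total the grouped values along the index or the columns.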
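sort_values(by, ascending):
returns a frame sorted by the column(s) named in by; ascending=True sorts in ascending order and False in descending order.
concat(list): stacks the frames in a list on top of one another, so the row counts add up.
ix[index]: row indexing; plain [] indexing on a DataFrame selects columns. (ix has been removed from pandas; use loc for labels and iloc for positions.)
take(index): also selects rows, but by integer position, whereas loc selects by index label.
unstack(): moves a level of the row index into the columns.
apply(func, axis=0) and applymap(func): apply on a DataFrame applies func to each column by default, or to each row with axis=1. applymap applies func to every element (it was renamed DataFrame.map in pandas 2.1).
combine_first(frame2): fills the missing values in frame with the values at the same positions in frame2. Series has the same method.
stack(): turns the columns of a DataFrame into rows, with the former column labels becoming an inner level of a hierarchical row index. (stack and unstack are inverse operations and can be used to convert between Series and DataFrame.)
duplicated(): returns a boolean Series indicating whether each row is a duplicate.
drop_duplicates(): returns a DataFrame with the duplicate rows removed.
pct_change(): also available on Series; computes the relative change between consecutive values in the same column.
corr(): computes the correlation matrix.
cov(): computes the covariance matrix.
corrwith(Series | list, axis=0): with axis=0, computes the correlation between each column of frame and the argument.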
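A few of the methods listed above, sketched on tiny made-up frames (all names and values here are illustrative assumptions):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]}, index=["r1", "r2"])
other = pd.DataFrame({"a": [10.0, 20.0], "b": [30.0, 40.0]}, index=["r1", "r2"])

# combine_first keeps frame's values and fills its gaps from other.
filled = frame.combine_first(other)

# stack moves the column labels into an inner index level; unstack reverses it.
stacked = filled.stack()
round_trip = stacked.unstack()

# duplicated flags repeated rows; drop_duplicates removes them.
dupes = pd.DataFrame({"k": [1, 1, 2], "v": ["p", "p", "q"]})
flags = dupes.duplicated()
unique_rows = dupes.drop_duplicates()

print(filled)
print(stacked)
```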
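The Series class:
Three important attributes:
values: an array holding the values of the Series.
index: the index labels of the Series.
name: the name of the Series.
The index has a name attribute of its own.
Series(list, index=None) creates a Series; index gives the labels used to access the corresponding data.
A Series can also be created directly from a dict: Series(dict).
value_counts(): counts how often each distinct value occurs in the Series and returns the counts as a Series.
fillna(value): assigns a value to the missing entries of the Series.
plot(): plots a Series, for example one holding counts; it relies on matplotlib.
notnull(): returns a boolean mask marking the positions that are not missing.
sum(): for a numeric Series, returns the total.
cumsum(): for a numeric Series, returns a Series of running totals.
searchsorted(): for a sorted numeric Series, returns the position at which a number would be inserted to keep the order; the number does not have to be present in the Series.
map(func): calls func on every element of the Series and returns a Series of the results.
unique(): similar to DISTINCT in SQL.
isnull()/notnull(): return boolean masks.
sort_values(): sorts by value (called order() in old pandas releases).
sort_index(): sorts by index.
unstack(): turns a level of a hierarchically indexed Series into columns, producing a DataFrame.
replace(): takes a list or dict and replaces the matching values.
str attribute: Series.str treats the Series as a collection of strings and exposes string operations on it, for example:
data = Series(['abc', 'bcd', 'cde'])
data.str[1]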
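Several of the Series methods above in one short sketch. The sample values are invented for illustration.

```python
import pandas as pd

s = pd.Series(["abc", "bcd", "abc", None])

freq = s.value_counts()          # counts per distinct value, missing values skipped
filled = s.fillna("zzz")         # replace missing entries
second = filled.str[1]           # second character of every string
upper = filled.map(str.upper)    # apply a function element by element
distinct = filled.unique()       # like SQL DISTINCT, in order of appearance

nums = pd.Series([3, 1, 2])
running = nums.cumsum()          # running totals
ordered = nums.sort_values()     # sort by value

print(freq)
print(second.tolist())
```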
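read_table(): reads a delimited text file such as a .dat file.
import pandas as pd
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table(r'C:\Users\Administrator\Desktop\python for data analysis data\pydata-book-master\ch02\movielens\movies.dat', sep='::', header=None, names=mnames, engine='python')  # a multi-character sep needs the Python parser
read_csv(): reads a CSV file directly into a frame.
movies = pd.read_csv(r'names\job1880.txt', names=column)  # column: a list of column names defined elsewhere
read_csv has a sep parameter that sets the delimiter; a regular expression can be passed to it.
skiprows: a list of line numbers to skip while reading the file; the first line is 0.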
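The sep and skiprows parameters can be tried without any file on disk. In this sketch the text in raw is a hypothetical stand-in for a file like movies.dat, read through io.StringIO.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for a '::'-separated file with one junk line on top.
raw = "# exported sample\n1::Toy Story (1995)::Animation\n2::Jumanji (1995)::Adventure\n"

movies = pd.read_csv(
    io.StringIO(raw),
    sep="::",            # separators longer than one character are treated as regular expressions
    engine="python",     # the default C parser only handles single-character separators
    header=None,
    names=["movie_id", "title", "genres"],
    skiprows=[0],        # skip the first line (line numbers start at 0)
)

print(movies)
```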
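read_excel()
Reads an Excel file directly into a DataFrame.
merge(frame1, frame2):
Joins the two frames automatically on the column names they share and returns a frame.
The on, left_on, and right_on parameters control which columns frame1 and frame2 are joined on.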
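A minimal sketch of both styles of merge, using two invented frames (left and right are hypothetical names):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "rval": [10, 20, 40]})

# With no extra arguments, merge joins on the shared column name ('key'); the default is an inner join.
auto = pd.merge(left, right)

# When the key columns are named differently, name each side explicitly.
right_renamed = right.rename(columns={"key": "rkey"})
explicit = pd.merge(left, right_renamed, left_on="key", right_on="rkey", how="left")

print(auto)
print(explicit)
```

The how="left" argument in the second call keeps every row of left, so the unmatched key 'c' appears with missing values on the right-hand side.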
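concat:
Stacks DataFrame or Series objects together along the direction given by axis.
cut and qcut:
Bin continuous values into discrete intervals; cut splits on value ranges, qcut on quantiles.
cut is for bins of equal width, qcut for bins that hold an equal number of observations.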
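The contrast between equal-width and equal-count bins shows up clearly when the data contains an outlier. The values below are made up for the purpose.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# cut: 4 bins of equal width over the value range, so the outlier ends up alone.
by_width = pd.cut(values, bins=4)

# qcut: 4 bins chosen at the quartiles, so each holds the same number of observations.
by_quantile = pd.qcut(values, q=4)

# sort=False keeps the bins in their natural (interval) order.
width_counts = by_width.value_counts(sort=False)
quantile_counts = by_quantile.value_counts(sort=False)

print(width_counts)
print(quantile_counts)
```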
to_datetime(str):
Parses common date and time formats.
date_range:
Generates a sequence of dates (a DatetimeIndex).
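A short sketch of the two functions together; the dates are arbitrary examples.

```python
import pandas as pd

# to_datetime parses a common date string into a Timestamp.
stamp = pd.to_datetime("2016-12-05 14:29")

# date_range builds a fixed-frequency DatetimeIndex.
days = pd.date_range("2016-12-01", periods=5, freq="D")

# A Series indexed by dates can be sliced with date strings; both ends are included.
series = pd.Series(range(5), index=days)
first_three = series["2016-12-01":"2016-12-03"]

print(stamp)
print(first_three)
```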
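This is a simple translation of "10 Minutes to pandas" from the official website. It gives a brief introduction to pandas; for a detailed treatment, see the official documentation. By convention, the required packages are imported under short aliases: import pandas as pd and import numpy as np.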
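I. Object creation
See the pandas documentation for details on this section.
1. Create a Series by passing a list; pandas creates a default integer index.
2. Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns.
3. Create a DataFrame by passing a dict of objects that can be converted to series-like structures.
4. View the data types of the different columns.
5. If you are using IPython, tab completion automatically picks up all attributes as well as custom column names.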
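The creation steps above can be sketched as follows. The concrete values follow the style of the official guide but should be read as illustrative assumptions, since the original code was not preserved here.

```python
import numpy as np
import pandas as pd

# 1. A Series from a list; pandas adds a default integer index.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# 2. A DataFrame from a NumPy array, a datetime index, and column labels.
dates = pd.date_range("2013-01-01", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

# 3. A DataFrame from a dict of objects that can be converted to series-like columns.
df2 = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("2013-01-02"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})

# 4. Each column keeps its own dtype.
print(df2.dtypes)
```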
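II. Viewing data
For details, see the pandas documentation.
1. View the first and last rows of the frame.
2. Display the index, the columns, and the underlying NumPy data.
3. describe() gives a quick statistical summary of the data.
4. Transpose the data.
5. Sort by an axis.
6. Sort by values.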
III. Selection
Although the standard Python/NumPy expressions for selecting and setting work as expected, for production code we recommend the optimized pandas data access methods: .at, .iat, .loc, and .iloc (.ix, listed in older versions of the guide, has been removed from pandas). For details, see the pandas documentation.
- Getting
1. Select a single column, which yields a Series, equivalent to df.A.
2. Select with [], which slices the rows.
- Selection by label
1. Use a label to get a cross-section.
2. Select on multiple axes by label.
3. Slice by label.
4. Reduce the dimensions of the returned object.
5. Get a scalar value.
6. Get fast access to a scalar (equivalent to the previous method).
- Selection by position
1. Select by position with a passed integer (this selects a row).
2. Slice by integers, similar to NumPy/Python.
3. Select with a list of integer positions, similar to NumPy/Python.
4. Slice rows.
5. Slice columns.
6. Get a specific value.
- Boolean indexing
1. Use the values of a single column to select data.
2. Use a where operation to select data.
3. Use the isin() method to filter.
- Setting
1. Set a new column.
2. Set new values by label.
3. Set new values by position.
4. Set a group of new values with a NumPy array.
The frame now reflects all of the assignments above.
5. Set new values with a where operation.
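IV. Missing data
pandas primarily uses np.nan to represent missing data; by default these values are not included in computations. For details, see the pandas documentation.
1. reindex() lets you change, add, or delete the index on a specified axis; it returns a copy of the original data.
2. Drop any rows that contain missing data.
3. Fill in missing data.
4. Get the boolean mask of where values are missing.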
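The four steps above, sketched on a small made-up frame:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=4)
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]}, index=dates)

# 1. reindex returns a copy; the new column 'E' did not exist, so it is filled with NaN.
df1 = df.reindex(index=dates[:3], columns=["A", "E"])
df1.loc[dates[0], "E"] = 1.0

# 2. Drop rows that contain any missing value.
dropped = df1.dropna(how="any")

# 3. Replace missing values with a constant.
filled = df1.fillna(value=5.0)

# 4. Boolean mask of the missing positions.
mask = df1.isna()

print(df1)
print(mask)
```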
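V. Operations
For details, see the pandas documentation.
- Stats (operations in general exclude missing data)
1. Perform a descriptive statistic.
2. Perform the same operation on the other axis.
3. Operate on objects that have different dimensionality and need alignment; pandas automatically broadcasts along the specified dimension.
- Apply
1. Apply functions to the data.
For details, see the pandas documentation.
- String methods
A Series comes with a set of string processing methods in its str attribute, which make it easy to operate on each element of the array. For more, see the pandas documentation.
VI. Merge
pandas provides many facilities for easily combining Series and DataFrame objects with various kinds of set logic (old releases also had a Panel class, since removed). For details, see the pandas documentation.
- Join: SQL-style merges. For details, see the pandas documentation.
- Append: append rows to a DataFrame. (DataFrame.append was removed in pandas 2.0; use pd.concat instead.)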
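VII. Grouping
By "group by" we mean a process involving one or more of the following steps:
- Splitting: dividing the data into groups based on some criteria;
- Applying: applying a function to each group independently;
- Combining: combining the results into a data structure.
For details, see the pandas documentation.
1. Group and then apply the sum function to each group.
2. Group by multiple columns to form a hierarchical index, then apply the function.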
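VIII. Reshaping
For details, see the pandas documentation.
- Pivot tables
A pivot table can be produced from tabular data very easily.
IX. Time series
pandas has simple, powerful, and efficient functionality for resampling during frequency conversion (for example, converting data sampled every second into data sampled every 5 minutes). This is extremely common in financial applications. For details, see the pandas documentation.
1. Time zone representation.
2. Converting to another time zone.
3. Converting between time span representations.
4. Converting between periods and timestamps makes some convenient arithmetic functions available.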
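The resampling and conversion steps above, as a sketch. The series contents and the America/New_York zone are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Resample ten minutes of per-second data into 5-minute totals.
rng = pd.date_range("2012-01-01", periods=600, freq="s")
ts = pd.Series(np.ones(600, dtype="int64"), index=rng)
five_min = ts.resample("5min").sum()

# 1-2. Attach a time zone to naive timestamps, then convert to another zone.
daily = pd.Series([1.0, 2.0, 3.0],
                  index=pd.date_range("2012-03-06", periods=3, freq="D"))
utc = daily.tz_localize("UTC")
eastern = utc.tz_convert("America/New_York")

# 3-4. Convert timestamps to monthly periods and back to timestamps.
periods = daily.to_period("M")
back = periods.to_timestamp()

print(five_min)
print(eastern)
```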
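X. Categoricals
Since version 0.15, pandas can include Categorical data in a DataFrame. For a full introduction, see the pandas documentation.
1. Convert the raw grades to a Categorical data type.
2. Rename the categories to more meaningful names.
3. Reorder the categories and add the missing categories.
4. Sorting follows the order of the categories, not lexical order.
5. Grouping by a Categorical column also shows empty categories.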
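The five Categorical steps above, sketched on a small grades frame. The data is modeled on the official guide's example and should be treated as an assumption here.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ["a", "b", "b", "a", "a", "e"]})

# 1. Convert the raw grades to a categorical column.
df["grade"] = df["raw_grade"].astype("category")

# 2. Rename the categories (a, b, e) to something more meaningful.
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

# 3. Reorder the categories and add the ones that are missing.
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])

# 4. Sorting follows category order, not alphabetical order.
ordered = df.sort_values(by="grade")

# 5. Grouping with observed=False also reports the empty categories.
sizes = df.groupby("grade", observed=False).size()

print(ordered)
print(sizes)
```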
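XI. Plotting
For details, see the pandas documentation.
On a DataFrame, plot is a convenient way to plot all of the columns with their labels.
XII. Getting data in and out
- CSV
1. Write to a CSV file.
2. Read from a CSV file.
- HDF5
1. Write to an HDF5 store.
2. Read from an HDF5 store.
- Excel
1. Write to an Excel file.
2. Read from an Excel file.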
pandas aggregation and group operations: groupby
pandas provides a flexible and efficient groupby facility that lets you slice, dice, and summarize data sets in a natural way. With it you can:
Split a pandas object into pieces using one or more keys (functions, arrays, or DataFrame column names).
Compute group summary statistics such as count, mean, or standard deviation, or a user-defined function.
Apply a variety of functions to the columns of a DataFrame.
Apply within-group transformations or other manipulations, such as normalization, linear regression, ranking, or subset selection.
Compute pivot tables and cross-tabulations.
Perform quantile analysis and other group analyses.

1. First, consider this very simple tabular data set (as a DataFrame):

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
...                    'key2': ['one', 'two', 'one', 'two', 'one'],
...                    'data1': np.random.randn(5),
...                    'data2': np.random.randn(5)})
>>> df
      data1     data2 key1 key2
0 -0.410673  0.519378    a  one
1 -2.120793  0.199074    a  two
2  0.642216 -0.143671    b  one
3  0.975133 -0.592994    b  two
4 -1.017495 -0.530459    a  one

Suppose you want to group by key1 and compute the mean of the data1 column. One way is to access data1 and call groupby with the key1 column:

>>> grouped = df['data1'].groupby(df['key1'])
>>> grouped
<pandas.core.groupby.SeriesGroupBy object at 0x04120D70>

The grouped variable is a GroupBy object. It has not actually computed anything yet; it only holds some intermediate data about the group key df['key1']. Calling the GroupBy object's mean method then computes the group means:

>>> grouped.mean()
key1
a   -1.182987
b    0.808674
dtype: float64

Note: the data (a Series) has been aggregated according to the group key, producing a new Series indexed by the unique values in the key1 column. The result index is named key1 because the DataFrame column df['key1'] has that name.

2. Passing several arrays at once gives a different result:

>>> means = df['data1'].groupby([df['key1'], df['key2']]).mean()
>>> means
key1  key2
a     one    -0.714084
      two    -2.120793
b     one     0.642216
      two     0.975133
dtype: float64

Here the data was grouped by two keys, and the resulting Series has a hierarchical index made up of the unique pairs of keys observed:

>>> means.unstack()
key2       one       two
key1
a    -0.714084 -2.120793
b     0.642216  0.975133

In these examples the group keys are all Series. In fact, a group key can be any array of the right length:

>>> states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
>>> years = np.array([2005, 2005, 2006, 2005, 2006])
>>> df['data1'].groupby([states, years]).mean()
California  2005   -2.120793
            2006    0.642216
Ohio        2005    0.282230
            2006   -1.017495
dtype: float64

3. In addition, you can use column names (strings, numbers, or other Python objects) as the group keys:

>>> df.groupby('key1').mean()
         data1     data2
key1
a    -1.182987  0.062665
b     0.808674 -0.368333
>>> df.groupby(['key1', 'key2']).mean()
              data1     data2
key1 key2
a    one  -0.714084 -0.005540
     two  -2.120793  0.199074
b    one   0.642216 -0.143671
     two   0.975133 -0.592994

Note: the result of df.groupby('key1').mean() has no key2 column. Because df['key2'] is not numeric data, it is excluded from the result. By default all of the numeric columns are aggregated, though it is possible to filter down to a subset. (In pandas 2.0 and later the string column is no longer dropped silently; pass numeric_only=True to mean() to get this result.)

Whatever you plan to do with groupby, you are likely to need the GroupBy size method, which returns a Series containing the group sizes:

>>> df.groupby(['key1', 'key2']).size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

Note: any missing values in a group key are excluded from the result.

4. Iterating over groups

The GroupBy object supports iteration, generating a sequence of 2-tuples that contain the group name and the chunk of data. Consider the following:

>>> for name, group in df.groupby('key1'):
...     print(name)
...     print(group)
...
a
      data1     data2 key1 key2
0 -0.410673  0.519378    a  one
1 -2.120793  0.199074    a  two
4 -1.017495 -0.530459    a  one
b
      data1     data2 key1 key2
2  0.642216 -0.143671    b  one
3  0.975133 -0.592994    b  two

With multiple keys, the first element of each tuple is itself a tuple of key values:

>>> for (k1, k2), group in df.groupby(['key1', 'key2']):
...     print(k1, k2)
...     print(group)
...
a one
      data1     data2 key1 key2
0 -0.410673  0.519378    a  one
4 -1.017495 -0.530459    a  one
a two
      data1     data2 key1 key2
1 -2.120793  0.199074    a  two
b one
      data1     data2 key1 key2
2  0.642216 -0.143671    b  one
b two
      data1     data2 key1 key2
3  0.975133 -0.592994    b  two

You can of course do anything you like with the pieces of data. One recipe you may find useful is to collect them into a dict:

>>> pieces = dict(list(df.groupby('key1')))
>>> pieces['b']
      data1     data2 key1 key2
2  0.642216 -0.143671    b  one
3  0.975133 -0.592994    b  two
>>> df.groupby('key1')
<pandas.core.groupby.DataFrameGroupBy object at 0x0413AE30>
>>> list(df.groupby('key1'))
[('a',       data1     data2 key1 key2
0 -0.410673  0.519378    a  one
1 -2.120793  0.199074    a  two
4 -1.017495 -0.530459    a  one), ('b',       data1     data2 key1 key2
2  0.642216 -0.143671    b  one
3  0.975133 -0.592994    b  two)]

By default groupby groups on axis=0, but you can group on any of the other axes. Taking the df from the example above, we can group its columns by dtype:

>>> df.dtypes
data1    float64
data2    float64
key1      object
key2      object
dtype: object
>>> grouped = df.groupby(df.dtypes, axis=1)
>>> dict(list(grouped))
{dtype('O'):   key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one, dtype('float64'):       data1     data2
0 -0.410673  0.519378
1 -2.120793  0.199074
2  0.642216 -0.143671
3  0.975133 -0.592994
4 -1.017495 -0.530459}
>>> grouped
<pandas.core.groupby.DataFrameGroupBy object at 0x...>
>>> list(grouped)
[(dtype('float64'),       data1     data2
0 -0.410673  0.519378
1 -2.120793  0.199074
2  0.642216 -0.143671
3  0.975133 -0.592994
4 -1.017495 -0.530459), (dtype('O'),   key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one)]

5. Selecting a column or a subset of columns

Indexing a GroupBy object created from a DataFrame with a column name (a single string) or a list of column names (an array of strings) selects those columns for aggregation. That is:

>>> df.groupby('key1')['data1']
<pandas.core.groupby.SeriesGroupBy object at 0x06615FD0>
>>> df.groupby('key1')['data2']
<pandas.core.groupby.SeriesGroupBy object at 0x06615CB0>
>>> df.groupby('key1')[['data2']]
<pandas.core.groupby.DataFrameGroupBy object at 0x06615F10>

is equivalent to:
>>> df['data1'].groupby([df['key1']])
<pandas.core.groupby.SeriesGroupBy object at 0x06615FD0>
>>> df[['data2']].groupby([df['key1']])
<pandas.core.groupby.DataFrameGroupBy object at 0x06615F10>
>>> df['data2'].groupby([df['key1']])
<pandas.core.groupby.SeriesGroupBy object at 0x06615E30>

Especially for large data sets, it may be desirable to aggregate only a few columns. For example, to compute the mean of just the data2 column in the data set above and get the result as a DataFrame:

>>> df.groupby(['key1', 'key2'])[['data2']].mean()
              data2
key1 key2
a    one  -0.005540
     two   0.199074
b    one  -0.143671
     two  -0.592994
>>> df.groupby(['key1', 'key2'])['data2'].mean()
key1  key2
a     one    -0.005540
      two     0.199074
b     one    -0.143671
      two    -0.592994
Name: data2, dtype: float64

The object returned by this indexing operation is a grouped DataFrame (if a list or array was passed) or a grouped Series (if a single column name was passed as a scalar):

>>> s_grouped = df.groupby(['key1', 'key2'])['data2']
>>> s_grouped
<pandas.core.groupby.SeriesGroupBy object at 0x06615B10>
>>> s_grouped.mean()
key1  key2
a     one    -0.005540
      two     0.199074
b     one    -0.143671
      two    -0.592994
Name: data2, dtype: float64

6. Grouping with dicts and Series

Grouping information may exist in a form other than an array. Consider another example DataFrame:

>>> people = pd.DataFrame(np.random.randn(5, 5),
...                       columns=['a', 'b', 'c', 'd', 'e'],
...                       index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
>>> people
               a         b         c         d         e
Joe     0.306336 -0.139431  0.210028 -1.489001 -0.172998
Steve   0.998335  0.494229  0.337624 -1.222726 -0.402655
Wes     1.415329  0.450839 -1.052199  0.731721  0.317225
Jim     0.550551  3.201369  0.669713  0.725751  0.577687
Travis -2.013278 -2.010304  0.117713 -0.545000 -1.228323
>>> people.iloc[2:3, [1, 2]] = np.nan  # add a few missing values (originally written with the removed .ix indexer)

Suppose the group correspondence of the columns is known and you want to sum the columns by group:

>>> mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
...            'd': 'blue', 'e': 'red', 'f': 'orange'}
>>> mapping
{'a': 'red', 'c': 'blue', 'b': 'red', 'e': 'red', 'd': 'blue', 'f': 'orange'}
>>> type(mapping)
<class 'dict'>

Now just pass the dict to groupby:

>>> by_column = people.groupby(mapping, axis=1)
>>> by_column
<pandas.core.groupby.DataFrameGroupBy object at 0x...>
>>> by_column.sum()
            blue       red
Joe    -1.278973 -0.006092
Steve  -0.885102  1.089908
Wes     0.731721  1.732554
Jim     1.395465  4.329606
Travis -0.427287 -5.251905

The same functionality holds for Series, which can be viewed as a fixed-size mapping. When a Series is used as the group key, as in the example above, pandas inspects it to make sure its index is aligned with the axis being grouped:

>>> map_series = pd.Series(mapping)
>>> map_series
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object
>>> people.groupby(map_series, axis=1).count()
        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3

7. Grouping with functions

Compared with a dict or Series, a Python function is a more creative and abstract way of defining a group mapping. Any function passed as a group key is called once per index value, and its return values are used as the group names. Concretely, take the DataFrame above, whose index values are people's first names. Suppose you want to group by the length of the names: you could compute an array of string lengths, but it is enough to pass the len function:

>>> people.groupby(len).sum()
          a         b         c         d         e
3  2.272216  3.061938  0.879741 -0.031529  0.721914
5  0.998335  0.494229  0.337624 -1.222726 -0.402655
6 -2.013278 -2.010304  0.117713 -0.545000 -1.228323

Mixing functions with arrays, lists, dicts, or Series is not a problem, because everything is converted to arrays internally:

>>> key_list = ['one', 'one', 'one', 'two', 'two']
>>> people.groupby([len, key_list]).min()
              a         b         c         d         e
3 one  0.306336 -0.139431  0.210028 -1.489001 -0.172998
  two  0.550551  3.201369  0.669713  0.725751  0.577687
5 one  0.998335  0.494229  0.337624 -1.222726 -0.402655
6 two -2.013278 -2.010304  0.117713 -0.545000 -1.228323

8. Grouping by index levels

A final convenience of hierarchically indexed data sets is the ability to aggregate using one of the levels of the index. To do this, pass the level number or name through the level keyword:

>>> columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
...                                      [1, 3, 5, 1, 3]], names=['cty', 'tenor'])
>>> columns
MultiIndex
[US  1,     3,     5, JP  1,     3]
>>> hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
>>> hier_df
cty          US                            JP
tenor         1         3         5         1         3
0     -0.166600  0.248159 -0.082408 -0.710841 -0.097131
1     -1.762270  0.687458  1.235950 -1.407513  1.304055
2      1.089944  0.258175 -0.749688 -0.851948  1.687768
3     -0.378311 -0.078268  0.247147 -0.018829  0.744540
>>> hier_df.groupby(level='cty', axis=1).count()
cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
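Recent pandas releases deprecate grouping with axis=1, so the column-wise examples above may emit warnings or fail there. The following is a minimal sketch of one way to get the same kind of result by transposing, together with the function-key and dict-of-groups recipes. The people frame is filled with consecutive numbers here (an assumption, instead of random data) so that the totals can be checked by hand.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(np.arange(25.0).reshape(5, 5),
                      columns=["a", "b", "c", "d", "e"],
                      index=["Joe", "Steve", "Wes", "Jim", "Travis"])
mapping = {"a": "red", "b": "red", "c": "blue", "d": "blue", "e": "red", "f": "orange"}

# Group columns through a dict: transpose, group the former column labels, transpose back.
by_column = people.T.groupby(mapping).sum().T

# Group rows by a function of the index label (here: the length of the name).
by_len = people.groupby(len).sum()

# Iterate over the groups and collect them into a dict keyed by group name.
df = pd.DataFrame({"key1": ["a", "a", "b", "b", "a"],
                   "data1": [1.0, 2.0, 3.0, 4.0, 5.0]})
pieces = {name: group for name, group in df.groupby("key1")}

print(by_column)
print(by_len)
```

The unused 'f' key in mapping is simply ignored, as in the original example.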