
Learning Common Python Modules Together: pandas

itomcoil 2025-01-04 20:22 29 views

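About the author

@王多鱼

A recommendation algorithm engineer at Baidu.

Mainly responsible for optimizing the recall and ranking models used in recommendation.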

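1 Introduction

pandas is a Python data analysis package created to handle data analysis tasks. It builds on a number of other libraries, offers a set of standard data models, and provides the tools needed to work efficiently with large datasets, along with a large collection of functions and methods for processing data quickly and conveniently.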

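2 Data structures

Data structures:

Series

DataFrame

Panel (deprecated in pandas 0.20 and removed in 0.25; use a DataFrame with a MultiIndex instead)

(Several Series make up a DataFrame; several DataFrames made up a Panel.) These data structures are built on top of NumPy arrays, which is why operations on them are fast.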

Importing the packages

>>> import pandas as pd
>>> import numpy as np

Series

# Define from a list
>>> s = pd.Series(['a', 'b', 'c', 'd'])
>>> s
0    a
1    b
2    c
3    d
dtype: object

# Define from a dict
>>> s = pd.Series({'a' : 0., 'b' : 1., 'c' : 2.})
>>> s
a    0.0
b    1.0
c    2.0
dtype: float64

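DataFrame

A DataFrame stores its data in the following form:

# Define from a list of rows
>>> data = [['Alex',10], ['Bob',12], ['Clarke',13]]
>>> df = pd.DataFrame(data, columns=['Name', 'Age'])
>>> df
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13

# Define from a dict
# (the outputs in this article come from an older pandas release that sorted
# dict keys alphabetically; recent releases keep insertion order, so Name
# would be the first column)
>>> data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'], 'Age':[28,34,29,42]}
>>> df = pd.DataFrame(data)
>>> df
   Age   Name
0   28    Tom
1   34   Jack
2   29  Steve
3   42  Ricky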
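The relationship between the two structures can be checked directly. The following is a minimal sketch, reusing the Name/Age data from the listing; the variable names `age` and `age_array` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
                   'Age': [28, 34, 29, 42]})

# Selecting one column returns a Series that shares the frame's row index.
age = df['Age']
print(type(age).__name__)            # Series
print(age.index.equals(df.index))    # True

# The values of a numeric column are exposed as a NumPy array.
age_array = age.to_numpy()
print(isinstance(age_array, np.ndarray))  # True
print(np.mean(age_array))                 # 33.25
```

Because each column is backed by an array, whole-column operations such as `np.mean` run without an explicit Python loop.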

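Data indexing:

●Conceptually, the data is a set of tuples made up of index labels and a value, (index1, [index2, index3,] value); these tuples are then assembled into the DataFrame you see.

●Missing values: if no tuple exists for the index at some position in the DataFrame, a missing value (NaN) is displayed there.

Defining a Series with an index

>>> data = {'a' : 0., 'b' : 1., 'c' : 2.}
>>> s = pd.Series(data, index=['b','c','d','a'])
>>> s
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64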
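Once a label such as 'd' has no value, the usual next step is to detect or replace the gap. A short sketch with the same data; the fill value -1.0 is an arbitrary placeholder chosen for illustration.

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])

# isna() flags the labels that had no matching entry in the dict.
missing = s.isna()
print(missing[missing].index.tolist())   # ['d']

# fillna() returns a new Series with the gaps replaced.
filled = s.fillna(-1.0)
print(filled.tolist())                   # [1.0, 2.0, -1.0, 0.0]

# dropna() removes the missing entries instead.
kept = s.dropna()
print(kept.index.tolist())               # ['b', 'c', 'a']
```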

3 Data input/output

Method 1: define directly

●pd.Series

●pd.DataFrame

Method 2: reader functions

Reading:

●read_csv/read_table

●read_sql

●read_html

●read_json

Writing:

●to_csv
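A round trip through CSV might look like the sketch below. To stay self-contained it writes to a string and reads it back through `io.StringIO` rather than touching a file on disk; with a real file you would pass a path to both functions.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack'], 'Age': [28, 34]})

# to_csv() returns the CSV text when no path is given;
# index=False leaves the row labels out of the output.
csv_text = df.to_csv(index=False)
print(csv_text)

# read_csv() accepts any file-like object, so StringIO stands in for a file.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Without `index=False`, the row labels are written as an extra unnamed column, which then shows up as a data column when the file is read back.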

4 Basic functionality

Attributes of the data structures

>>> df
   Age   Name
0   28    Tom
1   34   Jack
2   29  Steve
3   42  Ricky

>>> df.axes
[RangeIndex(start=0, stop=4, step=1), Index(['Age', 'Name'], dtype='object')]

>>> df.dtypes
Age      int64
Name    object
dtype: object

>>> df.size
8

>>> df.values
array([[28, 'Tom'],
       [34, 'Jack'],
       [29, 'Steve'],
       [42, 'Ricky']], dtype=object)

Simple statistics

>>> df.describe(include='all')
              Age Name
count    4.000000    4
unique        NaN    4
top           NaN  Tom
freq          NaN    1
mean    33.250000  NaN
std      6.396614  NaN
min     28.000000  NaN
25%     28.750000  NaN
50%     31.500000  NaN
75%     36.000000  NaN
max     42.000000  NaN

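5 Selecting data

Indexers (multi-axis indexing)

●loc[]: label-based indexing

●iloc[]: integer-position-based indexing

Indexer syntax

df.loc[row labels, column labels]

(Figure: selecting data by row index)

(Figure: selecting data by column index)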

●Indexing by specific labels

# The values are random, so your numbers will differ
>>> df = pd.DataFrame(np.random.randn(8, 4),
...                   index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
...                   columns=['A', 'B', 'C', 'D'])
>>> df
          A         B         C         D
a -0.484976  1.958562 -0.073555  0.524286
b  1.681393  1.041901 -0.109796  0.836486
c  0.352229  0.656365  0.590963  0.908981
d  1.325258  1.199558  0.953455 -0.192507
e  0.573300 -0.202530 -0.699603  1.504382
f -1.423372 -0.311816  0.680950 -1.619343
g  0.771233 -0.101350 -0.207373  1.242127
h  0.084874 -0.655007 -0.834754  0.072229

>>> df.loc['a', ['A', 'B']]
A   -0.484976
B    1.958562
Name: a, dtype: float64

●Slice (range) indexing

>>> df.loc[:, 'A']
a   -0.484976
b    1.681393
c    0.352229
d    1.325258
e    0.573300
f   -1.423372
g    0.771233
h    0.084874
Name: A, dtype: float64

# Label slices include both endpoints
>>> df.loc['a':'e','A':'C']
          A         B         C
a -0.484976  1.958562 -0.073555
b  1.681393  1.041901 -0.109796
c  0.352229  0.656365  0.590963
d  1.325258  1.199558  0.953455
e  0.573300 -0.202530 -0.699603

●Boolean indexing

>>> df.loc[df.A>0,]
          A         B         C         D
b  1.681393  1.041901 -0.109796  0.836486
c  0.352229  0.656365  0.590963  0.908981
d  1.325258  1.199558  0.953455 -0.192507
e  0.573300 -0.202530 -0.699603  1.504382
g  0.771233 -0.101350 -0.207373  1.242127
h  0.084874 -0.655007 -0.834754  0.072229

>>> df.loc[df.A.isna(), ]
Empty DataFrame
Columns: [A, B, C, D]
Index: []
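The listings in this section all use loc; iloc is the position-based counterpart. A minimal sketch on a small frame with fixed values (chosen here so the results are reproducible, unlike the random frame used in the listings):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['a', 'b', 'c'],
                  columns=['A', 'B', 'C', 'D'])

# iloc uses integer positions; the end of a slice is excluded.
first_two = df.iloc[0:2, 0:2]
print(first_two)

# loc uses labels; both ends of a label slice are included.
same_by_label = df.loc['a':'b', 'A':'B']
print(first_two.equals(same_by_label))   # True

# A single row/column position returns a scalar.
corner = df.iloc[2, 3]
print(corner)                            # 11
```

The differing slice semantics are the main thing to watch: `0:2` by position stops before position 2, while `'a':'b'` by label includes 'b'.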

6 Manipulating data

Sorting

●sort_index(): sort by index

●sort_values(): sort by values

>>> unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
>>> unsorted_df
   col1  col2
0     2     1
1     1     3
2     1     2
3     1     4

# Sort by one column
>>> unsorted_df.sort_values('col1')
   col1  col2
1     1     3
2     1     2
3     1     4
0     2     1

# Sort by several columns
>>> unsorted_df.sort_values(['col1','col2'])
   col1  col2
2     1     2
1     1     3
3     1     4
0     2     1
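sort_index() is listed above but not demonstrated. The sketch below uses the same col1/col2 values with a deliberately shuffled index (an assumption made so that sorting by index has a visible effect), and also shows a per-key sort direction for sort_values().

```python
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]},
                  index=[3, 0, 2, 1])

# Sort rows by their index labels.
by_index = df.sort_index()
print(by_index.index.tolist())        # [0, 1, 2, 3]

# Descending order.
by_index_desc = df.sort_index(ascending=False)
print(by_index_desc.index.tolist())   # [3, 2, 1, 0]

# axis=1 sorts the column labels instead of the row labels.
by_columns = df.sort_index(axis=1, ascending=False)
print(by_columns.columns.tolist())    # ['col2', 'col1']

# sort_values() accepts one direction per sort key.
mixed = df.sort_values(['col1', 'col2'], ascending=[True, False])
print(mixed['col2'].tolist())         # [4, 3, 2, 1]
```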

Aggregation

●Grouped aggregation: groupby + agg

(Figure: how groupby works; it gathers the rows that share the same key.)

>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
...          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
...          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
...          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
...          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>> df
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017

# Create the groups
>>> df.groupby(['Team','Year'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x112f28c10>

# Inspect the groups (grouped by Team and Year, matching the output below)
>>> df.groupby(['Team','Year']).groups
{('Kings', 2014): Int64Index([4], dtype='int64'), ('Royals', 2014): Int64Index([9], dtype='int64'), ('Riders', 2014): Int64Index([0], dtype='int64'), ('Riders', 2015): Int64Index([1], dtype='int64'), ('Kings', 2016): Int64Index([6], dtype='int64'), ('Riders', 2016): Int64Index([8], dtype='int64'), ('Riders', 2017): Int64Index([11], dtype='int64'), ('Devils', 2014): Int64Index([2], dtype='int64'), ('Devils', 2015): Int64Index([3], dtype='int64'), ('kings', 2015): Int64Index([5], dtype='int64'), ('Royals', 2015): Int64Index([10], dtype='int64'), ('Kings', 2017): Int64Index([7], dtype='int64')}

# Inspect a single group
>>> df.groupby(['Team','Year']).get_group(('Kings',2014))
   Points  Rank   Team  Year
4     741     3  Kings  2014

# Data for the latest year of each team (sort, then take the first row per group)
>>> df.sort_values(['Team','Year'],ascending=False).groupby('Team').nth(0)
        Points  Rank  Year
Team
Devils     673     3  2015
Kings      788     1  2017
Riders     690     2  2017
Royals     804     1  2015
kings      812     4  2015
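The "latest year per team" result can also be obtained without relying on a prior sort. The sketch below swaps in a different technique, `idxmax()` followed by a label lookup, rather than the `nth(0)` call used in the listing; it is offered as an alternative under the assumption that each team has a single row per year, and it uses a shortened version of the data.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'Kings'],
    'Year': [2014, 2015, 2014, 2015, 2014, 2016],
    'Points': [876, 789, 863, 673, 741, 756],
})

# idxmax() gives, for each team, the row label where Year is largest.
latest_rows = df.groupby('Team')['Year'].idxmax()

# Use those labels to pull the full rows out of the original frame.
latest = df.loc[latest_rows].set_index('Team')
print(latest)
```

One reason to prefer this form is that the shape of the `nth()` result has changed between pandas releases, whereas `idxmax()` plus `loc` always yields one full row per group.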
 
# Aggregation functions
>>> df.groupby(['Year'])['Points'].agg('mean')
Year
2014    795.25
2015    769.50
2016    725.00
2017    739.00
Name: Points, dtype: float64

>>> df.groupby(['Year'])['Points'].agg(['mean','sum','median'])
        mean   sum  median
Year
2014  795.25  3181   802.0
2015  769.50  3078   796.5
2016  725.00  1450   725.0
2017  739.00  1478   739.0

# Filtering: keep only the teams with at least 3 rows
>>> df.groupby('Team').filter(lambda x: len(x) >= 3)
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
4      741     3   Kings  2014
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
11     690     2  Riders  2017

# Filtering: keep only the teams whose best score is at least 800
>>> df.groupby('Team').filter(lambda x:max(x['Points'])>=800)
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
5      812     4   kings  2015
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017

●Window aggregation: rolling + agg

Commonly used when building quantitative models.
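As a minimal illustration of window aggregation, the sketch below computes statistics over a sliding window of three observations. The series values and the window size are arbitrary choices made for the example.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Window of 3 observations; the first two positions lack a full window,
# so their result is NaN.
rolling_mean = s.rolling(window=3).mean()
print(rolling_mean)

# agg() computes several statistics per window at once,
# returning one column per statistic.
stats = s.rolling(window=3).agg(['mean', 'sum'])
print(stats)

# min_periods=1 lets the partial windows at the start produce a value.
partial_sum = s.rolling(window=3, min_periods=1).sum()
print(partial_sum)
```

On a time-indexed series the same calls give, for example, a three-day moving average.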

Applying functions

●pipe(): table-wise function application; the function receives the whole table, which makes method chaining convenient

>>> def adder(x,y):
...     return x+y

>>> df = pd.DataFrame(np.random.randn(5,3), columns=['col1','col2','col3'])
>>> df
       col1      col2      col3
0  1.200842 -0.387094  0.218903
1 -2.469144  2.283831  0.342451
2  0.688127  0.445456  0.966626
3  0.912838  0.577441 -0.967456
4 -0.706913  0.791318 -1.040644

# Equivalent to adder(df, 2)
>>> df.pipe(adder, 2)
       col1      col2      col3
0  3.200842  1.612906  2.218903
1 -0.469144  4.283831  2.342451
2  2.688127  2.445456  2.966626
3  2.912838  2.577441  1.032544
4  1.293087  2.791318  0.959356

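●apply(): row-wise or column-wise function application

# Default axis=0: the function is applied to each column
>>> df.apply(np.mean)
col1   -0.074850
col2    0.742191
col3   -0.096024
dtype: float64

# axis=1: the function is applied to each row
>>> df.apply(np.mean,axis=1)
0    0.344217
1    0.052380
2    0.700070
3    0.174274
4   -0.318746
dtype: float64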

●applymap(): element-wise mapping function, similar to map() (pandas 2.1 renamed it to DataFrame.map() and deprecated applymap())

# crypto_util is not imported in this listing; it appears to be an internal helper module
>>> aes_encrypt = crypto_util.AesEncrypt()

>>> def decrypt(line):
...     decrypt_str = aes_encrypt.decrypt(line,
...         crypto_util.constants.Constants.CRM_ENCRYPT_PREFIX)
...     return decrypt_str

>>> df = pd.DataFrame(
...     ['baiducrmcommonciper_LUjEqeTBXHcHFak5E3lwcgOR+Xfl6v/hkbSrzqBBFI4=',
...      'baiducrmcommonciper_4TReevfj06k3mg8871PvslHvPuPwlCUkn4xM6ZjrAn4=',
...      'baiducrmcommonciper_zmrudGYBOalk5LTqlF5ncg=='])

>>> df.applymap(decrypt)
                    0
0  25339384668@qq.com
1   1909062174@qq.com
2    8076719440@qq.om
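Since the decryption helper above is not publicly available, here is a self-contained sketch of the same element-wise pattern with an ordinary string function. The `capitalize` helper and the sample names are made up for illustration; the `hasattr` check is one way to run on pandas releases both before and after `DataFrame.map` was introduced.

```python
import pandas as pd

df = pd.DataFrame({'first': ['alex', 'bob'], 'last': ['smith', 'jones']})

def capitalize(value):
    """Capitalize a single cell value."""
    return value.capitalize()

# DataFrame.map exists from pandas 2.1 on; older releases only have applymap.
elementwise = df.map if hasattr(df, 'map') else df.applymap

# The function is called once per cell, and the result keeps the frame's shape.
result = elementwise(capitalize)
print(result)
```

For a single column, `df['first'].map(capitalize)` does the same job on a Series.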

7 Operating on DataFrames

Concatenation

●append (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; use concat instead)

●concat

concat behaves as follows: (1) when axis is not specified it defaults to axis=0 and stacks the frames vertically; (2) with axis=1 it places them side by side.

>>> one = pd.DataFrame({
...          'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
...          'subject_id':['sub1','sub2','sub4','sub6','sub5'],
...          'Marks_scored':[98,90,87,69,78]},
...          index=[1,2,3,4,5])

>>> two = pd.DataFrame({
...          'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
...          'subject_id':['sub2','sub4','sub3','sub6','sub5'],
...          'Marks_scored':[89,80,79,97,88]},
...          index=[1,2,3,4,5])

>>> pd.concat([one, two])
   Marks_scored    Name subject_id
1            98    Alex       sub1
2            90     Amy       sub2
3            87   Allen       sub4
4            69   Alice       sub6
5            78  Ayoung       sub5
1            89   Billy       sub2
2            80   Brian       sub4
3            79    Bran       sub3
4            97   Bryce       sub6
5            88   Betty       sub5

>>> pd.concat([one, two], axis = 1)
   Marks_scored    Name subject_id  Marks_scored   Name subject_id
1            98    Alex       sub1            89  Billy       sub2
2            90     Amy       sub2            80  Brian       sub4
3            87   Allen       sub4            79   Bran       sub3
4            69   Alice       sub6            97  Bryce       sub6
5            78  Ayoung       sub5            88  Betty       sub5
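Vertical concatenation keeps the original row labels, so the stacked result above contains each label twice. The sketch below, on a shortened version of the two frames, shows two common ways of dealing with that; it also covers what `append` used to do, since `pd.concat` is its replacement.

```python
import pandas as pd

one = pd.DataFrame({'Name': ['Alex', 'Amy'], 'Marks_scored': [98, 90]}, index=[1, 2])
two = pd.DataFrame({'Name': ['Billy', 'Brian'], 'Marks_scored': [89, 80]}, index=[1, 2])

# Plain stacking keeps the original labels, so 1 and 2 each appear twice.
stacked = pd.concat([one, two])
print(stacked.index.tolist())        # [1, 2, 1, 2]

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([one, two], ignore_index=True)
print(renumbered.index.tolist())     # [0, 1, 2, 3]

# keys= adds an outer index level recording which frame each row came from.
labelled = pd.concat([one, two], keys=['one', 'two'])
from_two = labelled.loc['two']['Name'].tolist()
print(from_two)                      # ['Billy', 'Brian']
```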

Merge

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False)

(Figure: how merge works)
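The effect of the `how` and `on` parameters is easiest to see on a small example. The frames below are made up for illustration (the column names `Name_left` and `Name_right` are assumptions, chosen to avoid a name clash); they share the key column `subject_id`.

```python
import pandas as pd

left = pd.DataFrame({'subject_id': ['sub1', 'sub2', 'sub4'],
                     'Name_left': ['Alex', 'Amy', 'Allen']})
right = pd.DataFrame({'subject_id': ['sub2', 'sub4', 'sub3'],
                      'Name_right': ['Billy', 'Brian', 'Bran']})

# Inner join (the default): keep only the keys present in both frames.
inner = pd.merge(left, right, on='subject_id')
print(inner)

# Left join: keep every row of `left`; unmatched rows get NaN on the right side.
left_join = pd.merge(left, right, how='left', on='subject_id')
print(left_join)

# Outer join: keep the union of the keys from both frames.
outer = pd.merge(left, right, how='outer', on='subject_id')
print(len(outer))   # 4
```

When the key columns have different names in the two frames, `left_on` and `right_on` take the place of `on`.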

8 Plotting

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
 
plt.rc('figure', figsize=(5, 3))
ts = pd.Series(np.random.randn(1000), 
               index=pd.date_range('1/1/2000', periods=1000))
 
ts = ts.cumsum()
ts.plot()

df = pd.DataFrame(np.random.randn(1000, 4), 
                  index = ts.index, 
                  columns=list('ABCD'))   
 
df = df.cumsum() 
 
plt.figure(); df.plot(); plt.legend(loc='best')

Have fun exploring your data with pandas!
