pandas apply - 书闪Bookchips

Introducing the apply method

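In brief: apply can replace an explicit Python for loop.
In detail: the apply function in pandas is a powerful tool. It applies a function to each row or each column of a DataFrame and returns a Series or a DataFrame as the result. apply is useful for complex data transformation and analysis tasks, especially when you need to run a custom operation on every row, column, or element of a dataset.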
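
As a minimal sketch of the "replaces a for loop" claim, the snippet below computes the same per-row value twice, once with an explicit loop and once with apply. The column names `price` and `qty` and the sample values are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, not taken from the article.
df = pd.DataFrame({'price': [10.0, 20.0, 30.0], 'qty': [1, 2, 3]})

# Explicit loop: build the result one row at a time.
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row['price'] * row['qty'])

# Same computation with apply: the lambda receives each row as a Series.
totals_apply = df.apply(lambda row: row['price'] * row['qty'], axis=1)

print(totals_loop)
print(totals_apply.tolist())
```

Both approaches give the same numbers; apply simply hides the loop and returns a Series aligned with the original index.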

DataFrame.apply(self, func, axis=0, raw=False, result_type=None, args=(), **kwargs)

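Part	Example	Meaning
Return value	`df =`	The result returned by apply (a Series or a DataFrame)
Original data	`df` or `df['A']`	The DataFrame (or a single column) that apply is called on
apply method	`.apply(`	
func / lambda	`myfun,`	The name of a custom function, or a lambda
axis	`axis=0,`	Apply func to each column (default)
	`axis=1,`	Apply func to each row
raw	`raw=False,`	Each row or column is passed to func as a Series (default)
	`raw=True,`	func receives an ndarray instead
result_type	`result_type=None,`	Default: the return type depends on what func returns
	`result_type='expand',`	List-like results such as [1, 2] are turned into columns (2 columns here)
	`result_type='reduce',`	Returns a Series if possible rather than expanding list-like results; the opposite of 'expand'
	`result_type='broadcast',`	Results are broadcast to the original shape of the DataFrame; the original index and columns are retained
args	`args=(),`	Positional arguments passed to func
**kwargs	`**kwargs`	Additional keyword arguments passed to func
	`)`	

Note: result_type only takes effect when axis=1.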
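
The effect of the raw parameter can be checked directly. The following is a small sketch, reusing the 3x2 DataFrame used throughout this article, that inspects what type of object func actually receives in each mode.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

# raw=False (default): func receives each column as a pandas Series.
is_series = df.apply(lambda col: isinstance(col, pd.Series))

# raw=True: func receives each column as a plain NumPy ndarray.
is_ndarray = df.apply(lambda col: isinstance(col, np.ndarray), raw=True)

print(is_series.tolist())
print(is_ndarray.tolist())
```

raw=True can be worthwhile when func is a NumPy reduction that does not need the index, because it avoids building a Series for every row or column.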

Using apply on a DataFrame: square root

import pandas as pd
import numpy as np
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
df

	A	B
0	4	9
1	4	9
2	4	9

Creating a DataFrame

# Apply a NumPy universal function (equivalent to np.sqrt(df))
df.apply(np.sqrt)

	A	B
0	2.0	3.0
1	2.0	3.0
2	2.0	3.0

Square root
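
The code comment above mentions that np.sqrt(df) does the same job. A short check, offered as a sketch rather than a benchmark, confirms that both calls return the same DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

via_apply = df.apply(np.sqrt)  # func is called once per column
via_ufunc = np.sqrt(df)        # the ufunc is applied to the whole frame

print(via_apply.equals(via_ufunc))
```

For plain NumPy universal functions, calling the function on the DataFrame directly is usually the simpler choice; apply becomes useful once the function is custom.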

Using apply on a DataFrame: aggregation

import pandas as pd
import numpy as np
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
df = df.apply(np.sum, axis=0)
print(df)
'''
A    12
B    27
dtype: int64
'''

Aggregating each column (axis=0)

import pandas as pd
import numpy as np
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
df = df.apply(np.sum, axis=1)
print(df)
'''
0    13
1    13
2    13
dtype: int64
'''

Aggregating each row (axis=1)
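
For a standard reduction such as a sum, the built-in DataFrame.sum method gives the same result as apply(np.sum). The sketch below compares the two on the same data; it is meant only to show the equivalence.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

col_sums = df.sum(axis=0)  # one value per column
row_sums = df.sum(axis=1)  # one value per row

# Compare against the apply-based versions from the examples above.
same_cols = col_sums.equals(df.apply(np.sum, axis=0))
same_rows = row_sums.equals(df.apply(np.sum, axis=1))

print(col_sums.tolist(), row_sums.tolist())
print(same_cols, same_rows)
```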

Returning a list-like result for each row

import pandas as pd
import numpy as np
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
df = df.apply(lambda x: [1, 2], axis=1)
print(df)
'''
0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object
'''

Each row returns a list, so the result is a Series of lists

# result_type='expand' expands list-like results into columns
import pandas as pd
import numpy as np
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
df = df.apply(lambda x: [1, 2], axis=1, result_type='expand')
print(df)

	0	1
0	1	2
1	1	2
2	1	2

Expanding list-like results into columns
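
A typical use of result_type='expand' is splitting one column into several. The sketch below assumes a hypothetical `full_name` column and made-up names; the expanded columns are numbered 0 and 1 and are renamed afterwards.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({'full_name': ['Ada Lovelace', 'Alan Turing']})

# Each row returns a two-element list; 'expand' turns it into two columns.
parts = df.apply(lambda row: row['full_name'].split(' '),
                 axis=1, result_type='expand')
parts.columns = ['first', 'last']  # replace the default labels 0 and 1

print(parts)
```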

Using apply on a DataFrame: returning a Series from the function, so that the resulting column names are the Series index

import pandas as pd
import numpy as np
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
print(df)
df = df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
print(df)

	foo	bar
0	1	2
1	1	2
2	1	2

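The resulting column names are the Series index

# result_type='broadcast' ensures the result has the same shape as the original,
# whether the function returns a list-like or a scalar, broadcasting along the axis.
# The resulting column names are the original column names.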
import pandas as pd
import numpy as np
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
print(df)
df = df.apply(lambda x: [1, 2], axis=1, result_type='broadcast')
print(df)

	A	B
0	1	2
1	1	2
2	1	2

apply broadcast
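
The comment above says broadcasting also works when the function returns a scalar. As a sketch of that case, the function below returns the row sum, and the scalar is repeated across every column of the row:

```python
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

# Each row returns a single number; 'broadcast' fills the whole row with it
# and keeps the original index and column names.
out = df.apply(lambda row: row.sum(), axis=1, result_type='broadcast')

print(out)
```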

Using apply on a DataFrame: column sums and row sums

import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]},
                  index=['a', 'b', 'c'])
print(df)
# Apply the function to each column (axis=0, the default)
df = df.apply(lambda x: np.sum(x))
print(df)

	A	B	C
a	1	4	7
b	2	5	8
c	3	6	9

Column sums with apply

# Apply the function to each row
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]},
                  index=['a', 'b', 'c'])
print(df)
df = df.apply(lambda x: np.sum(x), axis=1)
print(df)
a    12
b    15
c    18
dtype: int64

Row sums with apply

Using apply on a DataFrame: args

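# Define a custom function that needs an extra positional argument
# and pass that argument through the args keyword.
import pandas as pd
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
def fun(row, custom_value):  # the extra argument from args arrives after the row
    print(row)  # with axis=1 this is one row; without axis=1 it is one column
    return row['A'] + custom_value  # return one definite value per row
df['new_col'] = df.apply(fun, args=(5,), axis=1)
# The trailing comma in (5,) matters: it makes args a tuple. (5) is just the int 5,
# and a bare string would be unpacked character by character.
# apply does not modify df in place; it returns a new object.
# Assigning the result to df['new_col'] stores it as a new column.
# The function must therefore return a definite value for every row.
# axis=1 passes the rows one at a time; the default axis=0 passes the columns.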

With axis=1, the function receives one row at a time
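
DataFrame.apply forwards keyword arguments to func as well as positional ones. The sketch below uses a hypothetical function `scale_sum` with parameters `factor` and `offset` (names invented for this example) to show both routes in one call.

```python
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

def scale_sum(row, factor, offset=0):
    # row is one row of df (a Series); factor comes from args,
    # offset comes from the keyword arguments given to apply.
    return row['A'] * factor + row['B'] + offset

out = df.apply(scale_sum, axis=1, args=(2,), offset=1)
print(out.tolist())
```

Keyword arguments tend to be easier to read than a long args tuple once func takes more than one extra parameter.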

Using apply on a Series: squaring

Example from the official documentation

import numpy as np
import pandas as pd
s = pd.Series([20, 21, 12],index=['London', 'New York', 'Helsinki'])
print(s)
# Define a function and pass it to apply to square each value.
def square(x):
     return x ** 2
s = s.apply(square)
print(s)
'''
London      400
New York    441
Helsinki    144
dtype: int64
'''

Squaring with a named function


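# Pass an anonymous function to apply instead (starting again from the original Series)
s = pd.Series([20, 21, 12], index=['London', 'New York', 'Helsinki'])
s = s.apply(lambda x: x ** 2)
print(s)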
'''
London      400
New York    441
Helsinki    144
dtype: int64
'''

Squaring with a lambda

Using apply on a Series: args

# Define a custom function that needs an extra positional argument
# and pass that argument through the args keyword.
import numpy as np
import pandas as pd
se = pd.Series([20, 21, 12],index=['London', 'New York', 'Helsinki'])
print(se)
def fun(x, custom_value):
    return x - custom_value
se = se.apply(fun, args=(5,))
print(se)
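# The trailing comma in (5,) matters: it makes args a tuple; (5) alone is just the int 5
# apply does not modify the Series in place; it returns a new Series, reassigned to se here
# fun must return a definite value for every element
# Series.apply has no axis parameter: fun is called once per element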
'''
London      15
New York    16
Helsinki     7
dtype: int64
'''

The apply method of a Series

Using apply on a Series: kwargs

# Define a custom function that accepts keyword arguments
# and pass those arguments through apply.
import numpy as np
import pandas as pd
se = pd.Series([20, 21, 12],index=['London', 'New York', 'Helsinki'])
print(se)
def add_custom_values(x, **kwargs):
     for month in kwargs:
         x += kwargs[month]
     return x
se = se.apply(add_custom_values, june=30, july=20, august=25)
print(se)
'''
London      95
New York    96
Helsinki    87
dtype: int64
'''

Passing keyword arguments through Series.apply

Using apply on a Series: np.log

# Use a function from the NumPy library
import numpy as np
import pandas as pd
se = pd.Series([20, 21, 12],index=['London', 'New York', 'Helsinki'])
print(se)
se = se.apply(np.log)
print(se)
'''
London      2.995732
New York    3.044522
Helsinki    2.484907
dtype: float64
'''

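Natural logarithm with Series.apply

Adding a progress bar to apply() with tqdm

apply() still iterates row by row under the hood, so for heavy computations a progress bar is a convenient way to monitor how far the run has got.

tqdm: a third-party library for adding progress bars to code
tqdm also has good support for pandas.

We can use progress_apply() in place of apply(), and call tqdm.pandas(desc='') (with tqdm imported via from tqdm import tqdm) before running progress_apply() to enable monitoring of the apply process.

The desc parameter takes a string that describes the progress. Below, we adapt the example from the previous section to add a progress bar: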

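from tqdm import tqdm
def fun_all(name, gender, age):
    # Compare strings with ==, not is
    gender = 'female' if gender == 'F' else 'male'
    return 'There is a person named {}, gender {}, age {}.'.format(name, gender, age)

Enabling monitoring for the apply call that follows

from tqdm import tqdm
tqdm.pandas(desc='apply')
# df is the DataFrame from the previous section, with name, gender and age columns
df.progress_apply(lambda row: fun_all(row['name'], row['gender'],
                  row['age']), axis=1)
apply: 100%|██████████| 10/10 [00:00<00:00, 5011.71it/s]
0     There is a person named Jack, gender female, age 25.
1    There is a person named Alice, gender male, age 34.
2     There is a person named Lily, gender female, age 49.
3    There is a person named Mshis, gender female, age 42.
4     There is a person named Gdli, gender male, age 28.
5    There is a person named Agosh, gender female, age 23.
6     There is a person named Filu, gender male, age 45.
7     There is a person named Mack, gender male, age 21.
8     There is a person named Lucy, gender female, age 34.
9     There is a person named Pony, gender female, age 2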
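
Because the DataFrame used above is not shown in this article, here is a self-contained sketch of the same pattern. The three rows of data and the helper name `describe` are assumptions made up for illustration; tqdm must be installed separately.

```python
import pandas as pd
from tqdm import tqdm

# Hypothetical data; the article's original DataFrame is not shown.
df = pd.DataFrame({'name': ['Jack', 'Alice', 'Lily'],
                   'gender': ['M', 'F', 'F'],
                   'age': [25, 34, 49]})

def describe(name, gender, age):
    gender = 'female' if gender == 'F' else 'male'
    return '{} is {}, aged {}.'.format(name, gender, age)

# Registers progress_apply on pandas objects; desc labels the progress bar.
tqdm.pandas(desc='apply')

result = df.progress_apply(
    lambda row: describe(row['name'], row['gender'], row['age']), axis=1)
print(result.tolist())
```

progress_apply takes the same arguments as apply, so switching between the two is a one-word change.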
