pandas User Data Analysis 2

2023-02-07 16:09:44 Source: 博客园 (cnblogs)

user_analysis

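Part 1: Data type handling

Loading the data

Field meanings:
    user_id: user ID
    order_dt: purchase date
    order_product: number of products purchased
    order_amount: purchase amount

Inspecting the data

Check the data type of each column
Check whether the data contains missing values
Convert order_dt to a datetime type
View the summary statistics
    Compute the average number of products purchased per order across all users
    Compute the average amount spent per order across all users
Add a column to the source data holding the month: astype("datetime64[M]")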

In[]:

# Load the data and name the columns
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

pd.set_option("display.float_format", lambda x: "%.3f" % x)
# The file is whitespace-delimited and has no header row
df = pd.read_csv("./CDNOW_master.txt", header=None, sep=r"\s+",
                 names=["user_id", "order_dt", "order_product", "order_amount"])
df.head()

Out[]:

user_id	order_dt	order_product	order_amount
0	1	19970101	1	11.770
1	2	19970112	1	12.000
2	2	19970112	5	77.000
3	3	19970102	2	20.760
4	3	19970330	2	20.760
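The inspection steps listed earlier (column data types, missing values) do not appear as cells in the notebook. A minimal sketch of how they could be checked, assuming a small in-memory sample (the hypothetical `sample` below, copied from the five rows displayed above) in place of CDNOW_master.txt:

```python
import io
import pandas as pd

# Hypothetical stand-in for the file: the five rows displayed above
sample = io.StringIO(
    "1 19970101 1 11.77\n"
    "2 19970112 1 12.00\n"
    "2 19970112 5 77.00\n"
    "3 19970102 2 20.76\n"
    "3 19970330 2 20.76\n"
)
cols = ["user_id", "order_dt", "order_product", "order_amount"]
demo = pd.read_csv(sample, header=None, sep=r"\s+", names=cols)

# Data type of each column (order_dt is still an integer at this point)
col_types = demo.dtypes
# Number of missing values per column
missing = demo.isnull().sum()

print(col_types)
print(missing)
```

On the real file the same two calls would run against df instead of demo.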

In[]:

# Convert order_dt to a datetime type, parsing the YYYYMMDD format
df["order_dt"] = pd.to_datetime(df["order_dt"], format="%Y%m%d")

In[]:

# Add a month column: truncate each date to the first day of its month
df["month"] = df["order_dt"].values.astype("datetime64[M]")
df.head(20)

Out[]:

user_id	order_dt	order_product	order_amount	month
0	1	1997-01-01	1	11.770	1997-01-01
1	2	1997-01-12	1	12.000	1997-01-01
2	2	1997-01-12	5	77.000	1997-01-01
3	3	1997-01-02	2	20.760	1997-01-01
4	3	1997-03-30	2	20.760	1997-03-01
5	3	1997-04-02	2	19.540	1997-04-01
6	3	1997-11-15	5	57.450	1997-11-01
7	3	1997-11-25	4	20.960	1997-11-01
8	3	1998-05-28	1	16.990	1998-05-01
9	4	1997-01-01	2	29.330	1997-01-01
10	4	1997-01-18	2	29.730	1997-01-01
11	4	1997-08-02	1	14.960	1997-08-01
12	4	1997-12-12	2	26.480	1997-12-01
13	5	1997-01-01	2	29.330	1997-01-01
14	5	1997-01-14	1	13.970	1997-01-01
15	5	1997-02-04	3	38.900	1997-02-01
16	5	1997-04-11	3	45.550	1997-04-01
17	5	1997-05-31	3	38.710	1997-05-01
18	5	1997-06-16	2	26.140	1997-06-01
19	5	1997-07-22	2	28.140	1997-07-01
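The month column relies on casting to NumPy's datetime64[M] unit, which drops the day. A small sketch on made-up dates, comparing that cast with a period-based alternative (to_period("M") followed by to_timestamp()); the alternative is an assumption for illustration and is not what the notebook uses:

```python
import pandas as pd

# Illustrative dates only
dates = pd.Series(pd.to_datetime(["19970112", "19970330", "19980528"], format="%Y%m%d"))

# Notebook approach: cast the underlying NumPy array to month resolution
month_cast = pd.Series(dates.values.astype("datetime64[M]"))

# Alternative: convert to a monthly period, then back to a month-start timestamp
month_period = dates.dt.to_period("M").dt.to_timestamp()

cast_labels = month_cast.dt.strftime("%Y-%m-%d").tolist()
period_labels = month_period.dt.strftime("%Y-%m-%d").tolist()
print(cast_labels)
print(period_labels)
```

Both routes map every date to the first day of its month, which is what makes the column usable as a groupby key.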

In[]:

# Average number of products per order: 2.410040
# Average amount spent per order: 35.893648
df.describe()[["order_product", "order_amount"]]

Out[]:

order_product	order_amount
count	69659.000	69659.000
mean	2.410	35.894
std	2.334	36.282
min	1.000	0.000
25%	1.000	14.490
50%	2.000	25.980
75%	3.000	43.700
max	99.000	1286.010
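The two averages are read off the mean row of describe(). As a hedged sketch, the same figures can be computed directly with mean(); the frame below is illustrative and holds only the first five rows shown earlier, so its averages differ from the full-data values:

```python
import pandas as pd

# Illustrative rows only (the first five rows shown earlier)
orders = pd.DataFrame({
    "order_product": [1, 1, 5, 2, 2],
    "order_amount": [11.77, 12.00, 77.00, 20.76, 20.76],
})

# Mean of each column, computed directly instead of read off describe()
avg = orders[["order_product", "order_amount"]].mean()
# The same values appear in the "mean" row of describe()
described_mean = orders.describe().loc["mean"]
print(avg)
```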

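Part 2: Monthly analysis

Total amount spent by all users per month

Plotted as a line chart

Total number of products purchased by all users per month

Total number of orders placed by all users per month

Number of distinct buyers per month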

In[]:

# Total amount spent by all users per month, plotted as a line chart
df.groupby(by="month")["order_amount"].sum().plot()

Out[]:

In[]:

# Total number of products purchased by all users per month
df.groupby(by="month")["order_product"].sum().plot()

Out[]:

In[]:

# Total number of orders per month (one per order row)
df.groupby(by="month")["user_id"].count()

Out[]:

month
1997-01-01     8928
1997-02-01    11272
1997-03-01    11598
1997-04-01     3781
1997-05-01     2895
1997-06-01     3054
1997-07-01     2942
1997-08-01     2320
1997-09-01     2296
1997-10-01     2562
1997-11-01     2750
1997-12-01     2504
1998-01-01     2032
1998-02-01     2026
1998-03-01     2793
1998-04-01     1878
1998-05-01     1985
1998-06-01     2043
Name: user_id, dtype: int64

In[]:

# Number of distinct buyers per month
df.groupby(by="month")["user_id"].nunique()

Out[]:

month
1997-01-01    7846
1997-02-01    9633
1997-03-01    9524
1997-04-01    2822
1997-05-01    2214
1997-06-01    2339
1997-07-01    2180
1997-08-01    1772
1997-09-01    1739
1997-10-01    1839
1997-11-01    2028
1997-12-01    1864
1998-01-01    1537
1998-02-01    1551
1998-03-01    2060
1998-04-01    1437
1998-05-01    1488
1998-06-01    1506
Name: user_id, dtype: int64
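The four monthly figures above can also be produced in one pass with named aggregation. This is a sketch on a made-up order log (the `toy` frame and the output column names are assumptions), mainly to show how count (order rows) differs from nunique (distinct users):

```python
import pandas as pd

# Toy order log (hypothetical values)
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "month": pd.to_datetime(["1997-01-01", "1997-01-01", "1997-01-01",
                             "1997-02-01", "1997-02-01"]),
    "order_product": [1, 2, 1, 3, 1],
    "order_amount": [10.0, 20.0, 5.0, 30.0, 8.0],
})

monthly = toy.groupby("month").agg(
    total_amount=("order_amount", "sum"),
    total_products=("order_product", "sum"),
    orders=("user_id", "count"),      # one per order row
    buyers=("user_id", "nunique"),    # one per distinct user
)
print(monthly)
```

In the toy data user 1 orders twice in January, so that month has 3 orders but only 2 buyers.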

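Part 3: Per-user purchase analysis

Summary statistics of each user's total spend and number of orders

Scatter plot of each user's total spend against number of orders

Histogram of each user's total spend (users whose total is below 1000)

Histogram of each user's total product quantity (users whose total is below 100)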

In[]:

# Total spend per user
df.groupby(by="user_id")["order_amount"].sum()

Out[]:

user_id
1        11.770
2        89.000
3       156.460
4       100.500
5       385.610
          ...
23566    36.000
23567    20.970
23568   121.700
23569    25.740
23570    94.080
Name: order_amount, Length: 23570, dtype: float64

In[]:

# Number of orders per user
df.groupby(by="user_id")["order_amount"].count()

Out[]:

user_id
1         1
2         2
3         6
4         4
5        11
         ..
23566     1
23567     1
23568     3
23569     1
23570     2
Name: order_amount, Length: 23570, dtype: int64
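The section outline asks for summary statistics of total spend and order count per user, which the cells above stop short of. A minimal sketch, assuming made-up orders and hypothetical column names total_amount and n_orders:

```python
import pandas as pd

# Made-up orders for three users
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 3, 3],
    "order_amount": [10.0, 20.0, 30.0, 5.0, 5.0, 5.0],
})

# One row per user: total spend and number of orders
per_user = toy.groupby("user_id")["order_amount"].agg(total_amount="sum", n_orders="count")
# Summary statistics across users
summary = per_user.describe()
print(summary)
```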

In[]:

# Scatter plot of each user's total spend against number of orders
# Total spend per user
money = df.groupby(by="user_id")["order_amount"].sum()
# Number of orders per user
times = df.groupby(by="user_id")["order_product"].count()
# Plot
plt.scatter(times, money)

Out[]:

In[]:

# Histogram of each user's total spend (users whose total is below 1000)
# numeric_only=True restricts the sum to numeric columns and leaves out the date columns
df.groupby(by="user_id").sum(numeric_only=True).query("order_amount < 1000")["order_amount"].hist()

Out[]:

In[]:

# Histogram of each user's total product quantity (users whose total is below 100)
# numeric_only=True restricts the sum to numeric columns and leaves out the date columns
df.groupby(by="user_id").sum(numeric_only=True).query("order_product < 100")["order_product"].hist()

Out[]:
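The two histograms filter the per-user totals and then bin them. Since the plots are not reproduced here, the following sketch shows the numbers behind such a plot on made-up totals, using numpy.histogram with ten equal-width bins (the default bin count of .hist()):

```python
import numpy as np
import pandas as pd

# Made-up per-user total spend
totals = pd.Series([12.0, 89.0, 156.0, 385.0, 990.0, 1500.0], name="order_amount")

# Keep users whose total is below 1000, as in the query() call
below_1000 = totals[totals < 1000]

# Ten equal-width bins between the smallest and largest remaining total
counts, edges = np.histogram(below_1000, bins=10)
print(counts)
print(edges)
```

The user with a total of 1500 is excluded before binning, so the five remaining users are spread over ten bins.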

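Part 4: User purchase behavior

Distribution of users' first purchase month, with user counts

Plotted as a line chart

Distribution of users' last purchase month, with user counts

Plotted as a line chart

Share of new and returning customers

A user who purchased once is a new customer; a user who purchased more than once is a returning customer
    Find each user's first and last purchase date
    agg(["func1", "func2"]): applies the listed aggregations to the grouped result
    Compute the share of new and returning customers

User segmentation

Build a table rfm holding each user's total purchase quantity, total spend, and most recent purchase date
RFM model design
    R is the time elapsed since the customer's most recent transaction
        / np.timedelta64(1, "D"): converts the timedelta to a plain number of days
    F is the total number of products the customer bought. A larger F means the customer trades more often; a smaller F means the customer is less active.
    M is the customer's total transaction amount. A larger M means a more valuable customer; a smaller M means a less valuable one.
    Apply R, F, and M to the rfm table
Segment users by value into:
    important value customer (重要价值客户)
    important maintain customer (重要保持客户)
    important win-back customer (重要挽留客户)
    important develop customer (重要发展客户)
    general value customer (一般价值客户)
    general maintain customer (一般保持客户)
    general win-back customer (一般挽留客户)
    general develop customer (一般发展客户)
        Use the existing segmentation function rfm_func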

In[]:

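# Count users by the month of their first purchase and plot a line chart
# (sort_index puts the months in chronological order before plotting)
first_con = df.groupby(by="user_id")["month"].min().value_counts().sort_index().plot()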

In[]:

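# Count users by the month of their last purchase and plot a line chart
# (sort_index puts the months in chronological order before plotting)
df.groupby(by="user_id")["month"].max().value_counts().sort_index().plot()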

Out[]:
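value_counts() orders its result by count, not by date, so the counts behind these two charts are easier to read once the index is sorted. A minimal sketch on made-up data (the `toy` frame is an assumption):

```python
import pandas as pd

# Made-up purchase months for three users
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 3],
    "month": pd.to_datetime(["1997-01-01", "1997-01-01", "1997-03-01",
                             "1997-02-01", "1997-03-01"]),
})

by_user = toy.groupby("user_id")["month"]
# Users per first-purchase month and per last-purchase month, in date order
first_counts = by_user.min().value_counts().sort_index()
last_counts = by_user.max().value_counts().sort_index()
print(first_counts)
print(last_counts)
```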

In[]:

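# Share of new vs. returning customers
# One purchase = new customer; more than one purchase = returning customer
# How do we tell whether a user purchased only once? Use the purchase dates:
# if a user's first and last purchase dates are the same, the user is treated as
# having purchased once (new); otherwise the user is a returning customer
new_old_con_df = df.groupby(by="user_id")["order_dt"].agg(["min", "max"])
new_old = new_old_con_df["min"] == new_old_con_df["max"].values
new = new_old.value_counts()[True]
old = new_old.value_counts()[False]
new_proportion = new / (new + old)
old_proportion = old / (new + old)
"Returning customers: {:.2f}%".format(old_proportion*100), "New customers: {:.2f}%".format(new_proportion*100)

Out[]:

("Returning customers: 48.86%", "New customers: 51.14%")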
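The same split can be sketched on made-up data. Because the mean of a boolean Series is the fraction of True values, the share of new customers falls out in one step; the `toy` frame and variable names below are assumptions:

```python
import pandas as pd

# Made-up orders: users 1 and 4 order once, users 2 and 3 order twice
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 3, 4],
    "order_dt": pd.to_datetime(["1997-01-01", "1997-01-12", "1997-02-03",
                                "1997-01-02", "1997-03-30", "1997-01-05"]),
})

first_last = toy.groupby("user_id")["order_dt"].agg(["min", "max"])
# True when the first and last purchase dates coincide
is_new = first_last["min"] == first_last["max"]

new_share = is_new.mean()      # fraction of users flagged as new
old_share = 1 - new_share
print("Returning: {:.2f}%  New: {:.2f}%".format(old_share * 100, new_share * 100))
```

Note that under this rule a user who places several orders on the same day still counts as new, because the first and last dates are equal.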

In[]:

# Build the rfm table with a pivot table: each user's total purchase quantity,
# total spend, and most recent purchase date
rfm = df.pivot_table(index="user_id", aggfunc={"order_product": "sum", "order_amount": "sum", "order_dt": "max"})

In[]:

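# R is the time elapsed since each user's most recent transaction
# R = latest date in df - date of the user's last transaction
# Dividing by np.timedelta64(1, "D") converts the timedelta to a number of days
today = df["order_dt"].max()
rfm["R"] = (today - df.groupby(by="user_id")["order_dt"].max()) / np.timedelta64(1, "D")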

In[]:

# Drop the order_dt column
rfm.drop("order_dt", axis=1, inplace=True)

In[]:

# Rename the columns to M, F, R (order_amount, order_product, R)
rfm.columns = ["M", "F", "R"]
rfm

Out[]:

M	F	R
user_id
1	11.770	1	545.000
2	89.000	6	534.000
3	156.460	16	33.000
4	100.500	7	200.000
5	385.610	29	178.000
...	...	...	...
23566	36.000	2	462.000
23567	20.970	1	462.000
23568	121.700	6	434.000
23569	25.740	2	462.000
23570	94.080	5	461.000

23570 rows × 3 columns
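As an illustration of how the M, F, and R columns come together, here is a sketch that builds the same kind of table with groupby and named aggregation instead of a pivot table. The data is made up, and rfm_demo and last_order are hypothetical names:

```python
import numpy as np
import pandas as pd

# Made-up orders for three users
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "order_dt": pd.to_datetime(["1997-01-01", "1997-01-12", "1997-03-01", "1997-03-31"]),
    "order_product": [1, 1, 5, 2],
    "order_amount": [11.77, 12.0, 77.0, 20.76],
})

today = toy["order_dt"].max()  # reference date: the latest order in the data
rfm_demo = toy.groupby("user_id").agg(
    M=("order_amount", "sum"),
    F=("order_product", "sum"),
    last_order=("order_dt", "max"),
)
# Dividing a timedelta by one day yields a plain float number of days
rfm_demo["R"] = (today - rfm_demo["last_order"]) / np.timedelta64(1, "D")
rfm_demo = rfm_demo.drop(columns="last_order")
print(rfm_demo)
```

Naming the output columns in the aggregation avoids relying on the alphabetical column order that the positional rename depends on.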

In[]:

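# RFM model
def rfm_func(x):
    # x is one user's mean-centered (M, F, R) row;
    # a value at or above the column mean maps to "1", otherwise "0"
    level = x.map(lambda x: "1" if x >= 0 else "0")
    label = level.R + level.F + level.M
    d = {
        "111": "important value customer",
        "011": "important maintain customer",
        "101": "important win-back customer",
        "001": "important develop customer",
        "110": "general value customer",
        "010": "general maintain customer",
        "100": "general win-back customer",
        "000": "general develop customer"
    }
    result = d[label]
    return result

In[]:

# Center each column on its mean, pass each row to rfm_func, and store the result in a new label column
rfm["label"] = rfm.apply(lambda x: x - x.mean()).apply(rfm_func, axis=1)
rfm.head()

Out[]:

M	F	R	label
user_id
1	11.770	1	545.000	general win-back customer
2	89.000	6	534.000	general win-back customer
3	156.460	16	33.000	important maintain customer
4	100.500	7	200.000	general develop customer
5	385.610	29	178.000	important maintain customer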
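The labelling step compares each value with its column mean and concatenates the three resulting bits in R, F, M order. A vectorized sketch of just that bit-string step, on a made-up table (rfm_demo, bits, and code are hypothetical names):

```python
import pandas as pd

# Made-up rfm table (M = spend, F = quantity, R = days since last order)
rfm_demo = pd.DataFrame(
    {"M": [10.0, 90.0, 400.0, 100.0], "F": [1, 6, 30, 3], "R": [500.0, 30.0, 20.0, 450.0]},
    index=pd.Index([1, 2, 3, 4], name="user_id"),
)

# "1" where the value is at or above the column mean, else "0"
bits = (rfm_demo >= rfm_demo.mean()).astype(int).astype(str)
# Concatenate in R, F, M order to get the key looked up in the segment dictionary
code = bits["R"] + bits["F"] + bits["M"]
print(code)
```

Here an R bit of "1" means the days since the last purchase are at or above the mean, that is, the customer has not bought recently.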

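Part 5: User lifecycle

Splitting users into active users and other users

Count each user's purchases in each month
Record whether each user purchased in each month: 1 if they did, 0 otherwise
    Key point: the difference between DataFrame apply and applymap
        applymap: returns a DataFrame
            applies a function to every element of the DataFrame
        apply: returns a Series when the function returns one value per row or column
            apply() applies a function to each row or each column of the DataFrame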
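A small sketch of the difference on a made-up 2x2 frame. It assumes the element-wise method is called applymap on older pandas and map on pandas 2.1 and later, and picks whichever is available:

```python
import pandas as pd

# Made-up purchase counts for two users over two months
toy = pd.DataFrame({"jan": [2, 0], "feb": [0, 1]}, index=["u1", "u2"])

# apply: the function receives a whole column (axis=0) or a whole row (axis=1)
per_column = toy.apply(lambda col: col.sum(), axis=0)
per_row = toy.apply(lambda row: row.sum(), axis=1)

# Element-wise: the function receives one cell at a time.
# DataFrame.map is the newer name (pandas 2.1+); older versions only have applymap.
elementwise = toy.map if hasattr(toy, "map") else toy.applymap
flags = elementwise(lambda v: 1 if v >= 1 else 0)

print(per_column, per_row, flags, sep="\n")
```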

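Classify each user in each month as:

unreg: a prospective user. If a user buys nothing in the first two months and makes a first purchase in the third, the user is unreg in the first two months.
unactive: inactive. After a first purchase, any later month without a purchase marks the user as inactive in that month.
new: a user making a first purchase in the current month is a new user in that month.
active: a user who buys in consecutive months is an active user in those months.
return: the first month in which a user buys again after a gap of n months marks the user as a returning user in that month.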
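The five definitions above can be read as a small state machine over a user's monthly 0/1 purchase flags. The following is a sketch that follows them literally; label_months is a hypothetical helper, not the notebook's own function:

```python
def label_months(purchased):
    """Label each month from a list of 0/1 purchase flags (hypothetical helper)."""
    status = []
    for i, bought in enumerate(purchased):
        prev = status[i - 1] if i > 0 else None
        if not bought:
            # No purchase so far -> still "unreg"; otherwise inactive this month
            status.append("unreg" if prev in (None, "unreg") else "unactive")
        elif prev in (None, "unreg"):
            status.append("new")       # first purchase ever
        elif prev == "unactive":
            status.append("return")    # buying again after a gap
        else:
            status.append("active")    # also bought in the previous month
    return status

labels = label_months([0, 0, 1, 1, 0, 1])
print(labels)
```

For the flags 0, 0, 1, 1, 0, 1 this walks through every status once: two unreg months, then new, active, unactive, and return.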

In[]:

# Count each user's orders in each month with a pivot table  var: user_month_count_df
user_month_count_df = df.pivot_table(index="user_id", values="order_dt", aggfunc="count", columns="month").fillna(value=0)
user_month_count_df

Out[]:

month	1997-01-01	1997-02-01	1997-03-01	1997-04-01	1997-05-01	1997-06-01	1997-07-01	1997-08-01	1997-09-01	1997-10-01	1997-11-01	1997-12-01	1998-01-01	1998-02-01	1998-03-01	1998-04-01	1998-05-01	1998-06-01
user_id
1	1.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
2	2.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
3	1.000	0.000	1.000	1.000	0.000	0.000	0.000	0.000	0.000	0.000	2.000	0.000	0.000	0.000	0.000	0.000	1.000	0.000
4	2.000	0.000	0.000	0.000	0.000	0.000	0.000	1.000	0.000	0.000	0.000	1.000	0.000	0.000	0.000	0.000	0.000	0.000
5	2.000	1.000	0.000	1.000	1.000	1.000	1.000	0.000	1.000	0.000	0.000	2.000	1.000	0.000	0.000	0.000	0.000	0.000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
23566	0.000	0.000	1.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
23567	0.000	0.000	1.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
23568	0.000	0.000	1.000	2.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
23569	0.000	0.000	1.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
23570	0.000	0.000	2.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000

23570 rows × 18 columns

In[]:

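# Record whether each user purchased in each month: 1 if so, 0 otherwise  var: df_purchase
# (pandas 2.1+ deprecates applymap in favor of DataFrame.map, which is called the same way)
df_purchase = user_month_count_df.applymap(lambda x: 1 if x >= 1 else 0)
df_purchase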

Out[]:

month	1997-01-01	1997-02-01	1997-03-01	1997-04-01	1997-05-01	1997-06-01	1997-07-01	1997-08-01	1997-09-01	1997-10-01	1997-11-01	1997-12-01	1998-01-01	1998-02-01	1998-03-01	1998-04-01	1998-05-01	1998-06-01
user_id
1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
2	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3	1	0	1	1	0	0	0	0	0	0	1	0	0	0	0	0	1	0
4	1	0	0	0	0	0	0	1	0	0	0	1	0	0	0	0	0	0
5	1	1	0	1	1	1	1	0	1	0	0	1	1	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
23566	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
23567	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
23568	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
23569	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
23570	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

23570 rows × 18 columns
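The count-then-flag steps can be sketched on made-up data. This version passes fill_value=0 to pivot_table and builds the 0/1 flags with a vectorized comparison in place of an element-wise lambda; both choices are assumptions for illustration:

```python
import pandas as pd

# Made-up orders: user 1 orders twice in January
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "month": pd.to_datetime(["1997-01-01", "1997-01-01", "1997-02-01", "1997-01-01"]),
    "order_dt": pd.to_datetime(["1997-01-03", "1997-01-20", "1997-02-11", "1997-01-09"]),
})

# Orders per user per month; fill_value=0 fills user/month pairs with no orders
counts = toy.pivot_table(index="user_id", columns="month", values="order_dt",
                         aggfunc="count", fill_value=0)
# Vectorized 0/1 flag: did the user buy at all in that month?
purchased = (counts > 0).astype(int)
print(purchased)
```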

In[]:

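# User lifecycle model (fixed rule set)
def active_status(data):
    # data: one user's row of 0/1 purchase flags, one value per month
    status = []
    for i in range(len(data)):
        # No purchase this month
        if data.iloc[i] == 0:
            if len(status) > 0:
                if status[i-1] == "unreg":
                    status.append("unreg")
                else:
                    status.append("unactive")
            else:
                status.append("unreg")
        # Purchase this month
        else:
            if len(status) == 0:
                status.append("new")
            else:
                if status[i-1] == "unactive":
                    status.append("return")
                elif status[i-1] == "unreg":
                    # First purchase after "unreg" months. The outputs shown in this section
                    # come from a run where this comparison read "ureg" and never matched,
                    # so they label such months "active".
                    status.append("new")
                else:
                    status.append("active")
    return status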

In[]:

# Replace the 0/1 values in df_purchase with new, unactive, and so on; returns a new object  var: df_purchase_new
df_purchase_new = df_purchase.apply(active_status, axis=1)
df_purchase_new

Out[]:

user_id
1        [new, unactive, unactive, unactive, unactive, ...
2        [new, unactive, unactive, unactive, unactive, ...
3        [new, unactive, return, active, unactive, unac...
4        [new, unactive, unactive, unactive, unactive, ...
5        [new, active, unactive, return, active, active...
                               ...
23566    [unreg, unreg, active, unactive, unactive, una...
23567    [unreg, unreg, active, unactive, unactive, una...
23568    [unreg, unreg, active, active, unactive, unact...
23569    [unreg, unreg, active, unactive, unactive, una...
23570    [unreg, unreg, active, unactive, unactive, una...
Length: 23570, dtype: object

In[]:

# Convert df_purchase_new (one list of statuses per user) to a list, then build a DataFrame from it
# Reuse the index and columns of df_purchase
# var: df_purchase_new1
df_purchase_new1 = pd.DataFrame(data=df_purchase_new.to_list(), index=df_purchase.index, columns=df_purchase.columns)
df_purchase_new1.head()

Out[]:

month	1997-01-01	1997-02-01	1997-03-01	1997-04-01	1997-05-01	1997-06-01	1997-07-01	1997-08-01	1997-09-01	1997-10-01	1997-11-01	1997-12-01	1998-01-01	1998-02-01	1998-03-01	1998-04-01	1998-05-01	1998-06-01
user_id
1	new	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive
2	new	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive
3	new	unactive	return	active	unactive	unactive	unactive	unactive	unactive	unactive	return	unactive	unactive	unactive	unactive	unactive	return	unactive
4	new	unactive	unactive	unactive	unactive	unactive	unactive	return	unactive	unactive	unactive	return	unactive	unactive	unactive	unactive	unactive	unactive
5	new	active	unactive	return	active	active	active	unactive	return	unactive	unactive	return	active	unactive	unactive	unactive	unactive	unactive

In[]:

# Count the users in each status for every month  var: purchase_status_ct
purchase_status_ct = df_purchase_new1.apply(lambda x: x.value_counts(), axis=0).fillna(0)
purchase_status_ct.head()

Out[]:

month	1997-01-01	1997-02-01	1997-03-01	1997-04-01	1997-05-01	1997-06-01	1997-07-01	1997-08-01	1997-09-01	1997-10-01	1997-11-01	1997-12-01	1998-01-01	1998-02-01	1998-03-01	1998-04-01	1998-05-01	1998-06-01
active	0.000	9633.000	8929.000	1773.000	852.000	747.000	746.000	604.000	528.000	532.000	624.000	632.000	512.000	472.000	571.000	518.000	459.000	446.000
new	7846.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
return	0.000	0.000	595.000	1049.000	1362.000	1592.000	1434.000	1168.000	1211.000	1307.000	1404.000	1232.000	1025.000	1079.000	1489.000	919.000	1029.000	1060.000
unactive	0.000	6689.000	14046.000	20748.000	21356.000	21231.000	21390.000	21798.000	21831.000	21731.000	21542.000	21706.000	22033.000	22019.000	21510.000	22133.000	22082.000	22064.000
unreg	15724.000	7248.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000

In[]:

# Transpose so that months become rows
t_purchase_status_ct = purchase_status_ct.T
t_purchase_status_ct

Out[]:

active	new	return	unactive	unreg
month
1997-01-01	0.000	7846.000	0.000	0.000	15724.000
1997-02-01	9633.000	0.000	0.000	6689.000	7248.000
1997-03-01	8929.000	0.000	595.000	14046.000	0.000
1997-04-01	1773.000	0.000	1049.000	20748.000	0.000
1997-05-01	852.000	0.000	1362.000	21356.000	0.000
1997-06-01	747.000	0.000	1592.000	21231.000	0.000
1997-07-01	746.000	0.000	1434.000	21390.000	0.000
1997-08-01	604.000	0.000	1168.000	21798.000	0.000
1997-09-01	528.000	0.000	1211.000	21831.000	0.000
1997-10-01	532.000	0.000	1307.000	21731.000	0.000
1997-11-01	624.000	0.000	1404.000	21542.000	0.000
1997-12-01	632.000	0.000	1232.000	21706.000	0.000
1998-01-01	512.000	0.000	1025.000	22033.000	0.000
1998-02-01	472.000	0.000	1079.000	22019.000	0.000
1998-03-01	571.000	0.000	1489.000	21510.000	0.000
1998-04-01	518.000	0.000	919.000	22133.000	0.000
1998-05-01	459.000	0.000	1029.000	22082.000	0.000
1998-06-01	446.000	0.000	1060.000	22064.000	0.000
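A table of this shape can be turned into per-month shares by dividing each row by its total. A sketch on a made-up status table (the `status` frame and its values are assumptions, not taken from the data above):

```python
import pandas as pd

# Made-up status table: one row per user, one column per month
status = pd.DataFrame(
    {"1997-01": ["new", "new", "unreg", "unreg"],
     "1997-02": ["active", "unactive", "new", "unreg"],
     "1997-03": ["unactive", "return", "active", "new"]},
    index=pd.Index([1, 2, 3, 4], name="user_id"),
)

# Count each status per month, then transpose so months become rows
counts = status.apply(lambda col: col.value_counts()).fillna(0).T
# Share of each status within a month (each row sums to 1)
share = counts.div(counts.sum(axis=1), axis=0)
print(share)
```

Statuses that do not occur in a month come back as missing from value_counts(), which is why fillna(0) is applied before dividing.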
