最新要闻

广告

手机

iphone11大小尺寸是多少?苹果iPhone11和iPhone13的区别是什么?

iphone11大小尺寸是多少?苹果iPhone11和iPhone13的区别是什么?

警方通报辅警执法直播中被撞飞:犯罪嫌疑人已投案

警方通报辅警执法直播中被撞飞:犯罪嫌疑人已投案

家电

pandas 用户数据分析2

来源:博客园


(资料图片)

user_analysis

第一部分:数据类型处理¶

数据加载¶

字段含义:    user_id:用户ID    order_dt:购买日期    order_product:购买产品的数量    order_amount:购买金额

观察数据¶

查看数据的数据类型数据中是否存储在缺失值将order_dt转换成时间类型查看数据的统计描述    计算所有用户购买商品的平均数量    计算所有用户购买商品的平均花费    在源数据中添加一列表示月份:astype(datetime64[M])
In[]:
# 加载数据,定义字段含义import pandas as pdimport numpy as npfrom matplotlib import pyplot as pltpd.set_option("display.float_format", lambda x: "%.3f" % x)df = pd.read_csv("./CDNOW_master.txt", header=None,                 sep="\s+", names=["user_id", "order_dt", "order_product", "order_amount"])df.head()
Out[]:
user_idorder_dtorder_productorder_amount
0119970101111.770
1219970112112.000
2219970112577.000
3319970102220.760
4319970330220.760
In[]:
# 将order_dt转换成时间类型,格式化时间df["order_dt"] = pd.to_datetime(df["order_dt"], format="%Y%m%d")
In[]:
# 添加month列df["month"] = df["order_dt"].values.astype("datetime64[M]")df.head(20)
Out[]:
user_idorder_dtorder_productorder_amountmonth
011997-01-01111.7701997-01-01
121997-01-12112.0001997-01-01
221997-01-12577.0001997-01-01
331997-01-02220.7601997-01-01
431997-03-30220.7601997-03-01
531997-04-02219.5401997-04-01
631997-11-15557.4501997-11-01
731997-11-25420.9601997-11-01
831998-05-28116.9901998-05-01
941997-01-01229.3301997-01-01
1041997-01-18229.7301997-01-01
1141997-08-02114.9601997-08-01
1241997-12-12226.4801997-12-01
1351997-01-01229.3301997-01-01
1451997-01-14113.9701997-01-01
1551997-02-04338.9001997-02-01
1651997-04-11345.5501997-04-01
1751997-05-31338.7101997-05-01
1851997-06-16226.1401997-06-01
1951997-07-22228.1401997-07-01
In[]:
# 计算所有用户购买商品的平均数量 2.410040# 计算所有用户购买商品的平均花费 35.893648df.describe()[["order_product", "order_amount"]]
Out[]:
order_productorder_amount
count69659.00069659.000
mean2.41035.894
std2.33436.282
min1.0000.000
25%1.00014.490
50%2.00025.980
75%3.00043.700
max99.0001286.010

第二部分:按月数据分析¶

用户每月花费的总金额¶

绘制曲线图展示

所有用户每月的产品购买量¶

所有用户每月的消费总次数¶

统计每月的消费人数¶

In[]:
# 用户每月花费的总金额,并绘制折线图df.groupby(by="month")["order_amount"].sum().plot()
Out[]:
In[]:
# 所有用户每月的产品购买量df.groupby(by="month")["order_product"].sum().plot()
Out[]:
In[]:
# 所有用户每月的消费总次数df.groupby(by="month")["user_id"].count()
Out[]:
month1997-01-01     89281997-02-01    112721997-03-01    115981997-04-01     37811997-05-01     28951997-06-01     30541997-07-01     29421997-08-01     23201997-09-01     22961997-10-01     25621997-11-01     27501997-12-01     25041998-01-01     20321998-02-01     20261998-03-01     27931998-04-01     18781998-05-01     19851998-06-01     2043Name: user_id, dtype: int64
In[]:
# 统计每月的消费人数df.groupby(by="month")["user_id"].nunique()
Out[]:
month1997-01-01    78461997-02-01    96331997-03-01    95241997-04-01    28221997-05-01    22141997-06-01    23391997-07-01    21801997-08-01    17721997-09-01    17391997-10-01    18391997-11-01    20281997-12-01    18641998-01-01    15371998-02-01    15511998-03-01    20601998-04-01    14371998-05-01    14881998-06-01    1506Name: user_id, dtype: int64

第三部分: 用户个体消费数据分析¶

用户消费总金额和消费总次数的统计描述¶

用户消费金额和消费次数的散点图¶

各个用户消费总金额的直方分布图(消费金额在1000之内的分布)¶

各个用户消费的总数量的直方分布图(消费商品的数量在100次之内的分布)¶

In[]:
# 用户消费总金额df.groupby(by="user_id")["order_amount"].sum()
Out[]:
user_id1        11.7702        89.0003       156.4604       100.5005       385.610          ...  23566    36.00023567    20.97023568   121.70023569    25.74023570    94.080Name: order_amount, Length: 23570, dtype: float64
In[]:
# 用户消费总次数df.groupby(by="user_id")["order_amount"].count()
Out[]:
user_id1         12         23         64         45        11         ..23566     123567     123568     323569     123570     2Name: order_amount, Length: 23570, dtype: int64
In[]:
# 用户消费金额和消费次数的散点图# 用户消费金额money = df.groupby(by="user_id")["order_amount"].sum()# 用户消费次数times = df.groupby(by="user_id")["order_product"].count()# 绘图plt.scatter(times, money)
Out[]:
In[]:
# 各个用户消费总金额的直方分布图(消费金额在1000之内的分布)df.groupby(by="user_id").sum().query("order_amount < 1000")["order_amount"].hist()
C:\Users\chenh\AppData\Local\Temp\ipykernel_22864\701786761.py:2: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.  df.groupby(by="user_id").sum().query("order_amount < 1000")["order_amount"].hist()
Out[]:
In[]:In[]:
# 各个用户消费的总数量的直方分布图(消费商品的数量在100次之内的分布)df.groupby(by="user_id").sum().query("order_product < 100")["order_product"].hist()
C:\Users\chenh\AppData\Local\Temp\ipykernel_22864\2679188117.py:2: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.  df.groupby(by="user_id").sum().query("order_product < 100")["order_product"].hist()
Out[]:

第四部分: 用户消费行为分析¶

用户第一次消费的月份分布,和人数统计¶

绘制线形图

用户最后一次消费的时间分布,和人数统计¶

绘制线形图

新老客户的占比¶

消费一次为新用户消费多次为老用户    分析出每一个用户的第一个消费和最后一次消费的时间    agg(["func1func2]):对分组后的结果进行指定聚合    分析出新老客户的消费比例

用户分层¶

分析得出每个用户的总购买量和总消费金额and最近一次消费的时间的表格rfmRFM模型设计    R表示客户最近一次交易时间的间隔        /np.timedelta64(1,"D"):去除days。    F表示客户购买商品的总数量,F值越大,表示客户交易越频繁,反之则表示客户交易不够活跃。    M表示客户交易的金额。M值越大,表示客户价值越高,反之则表示客户价值越低。    将R,F,M作用到rfm表中根据价值分层,将用户分为:    "重要价值客户"    "重要保持客户"    "重要挽留客户"    "重要发展客户"    "一般价值客户"    "一般保持客户"    "一般挽留客户"    "一般发展客户"        使用已有的分层模型rfm_func
In[]:
# 用户第一次消费的月份统计,和人数统计,绘制折线图first_con = df.groupby(by="user_id")["month"].min().value_counts().plot()
In[]:
# 用户最后一次消费的月份统计和人数统计,绘制折线图df.groupby(by="user_id")["month"].max().value_counts().plot()
Out[]:
In[]:
# # 新老用户占比# 消费一次新用户,消费多次老用户# 如何获知用户是否为第一次消费? 可以根据用户的消费时间进行判定?# 如果用户的第一次消费时间和最后一次消费时间一样,则该用户只消费了一次为新用户,否则为老用户new_old_con_df = df.groupby(by="user_id")["order_dt"].agg(["min","max"])new_old = new_old_con_df["min"] == new_old_con_df["max"].valuesnew = new_old.value_counts()[True]old = new_old.value_counts()[False]new_proportion = new / (new + old)old_proportion = old / (new + old)"老用户占比:{:.2f}%".format(old_proportion*100),"新用户占比:{:.2f}%".format(new_proportion*100)
Out[]:
("老用户占比:48.86%", "新用户占比:51.14%")
In[]:
# 分析得出每个用户的总购买量和总消费金额and最近一次消费的时间的表格rfm 用透视表rfm = df.pivot_table(index="user_id", aggfunc={"order_product":"sum", "order_amount": "sum", "order_dt":"max"})
In[]:
# R表示用户最近一次交易时间的间隔# R = df中最大的日期 - 每个用户最后一次交易的日期# 去除days用 /np.timedelta64(1,"D")today = df["order_dt"].max()rfm["R"] = (today - df.groupby(by="user_id")["order_dt"].max()) / np.timedelta64(1,"D")
In[]:
# 删除order_dt字段rfm.drop("order_dt", axis=1, inplace=True)
In[]:
# 重命名字段名为MRFrfm.columns = ["M", "F", "R"]rfm
Out[]:
MFR
user_id
111.7701545.000
289.0006534.000
3156.4601633.000
4100.5007200.000
5385.61029178.000
............
2356636.0002462.000
2356720.9701462.000
23568121.7006434.000
2356925.7402462.000
2357094.0805461.000

23570 rows × 3 columns

In[]:
# RFM模型def rfm_func(x):    level = x.map(lambda x: "1" if x >= 0 else "0")    label = level.R + level.F + level.M    d = {        "111": "重要价值客户",        "011": "重要保持客户",        "101": "重要挽留客户",        "001": "重要发展客户",        "110": "一般价值客户",        "010": "一般保持客户",        "100": "一般挽留客户",        "000": "一般发展客户"    }    result = d[label]    return result
In[]:
# 将rfm_func计算的结果返回给新建label列 (lambda x: x - x.mean()).rfm_funcrfm["label"] = rfm.apply(lambda x: x - x.mean()).apply(rfm_func, axis=1)rfm.head()
Out[]:
MFRlabel
user_id
111.7701545.000一般挽留客户
289.0006534.000一般挽留客户
3156.4601633.000重要保持客户
4100.5007200.000一般发展客户
5385.61029178.000重要保持客户

第五部分: 用户的生命周期¶

将用户划分为活跃用户和其他用户¶

统计每个用户每个月的消费次数统计每个用户每个月是否消费,消费记录为1否则记录为0    知识点: DataFrame的apply和applymap的区别        applymap:返回df            将函数做用于DataFrame中的所有元素(elements)        apply:返回Series            apply()将一个函数作用于DataFrame中的每个行或者列

将用户按照每一个月份分成:¶

unreg:观望用户(前两月没买,第三个月才第一次买,则用户前两个月为观望用户)。unactive:首月购买后,后序月份没有购买则在没有购买的月份中该用户的为非活用户。 new:当前月就进行首次购买的用户在当前月为新用户active:连续月份购买的用户在这些月中为活跃用户return:购买之后间隔n月再次购买的第一个月份为该月份的回头客
In[]:
# 统计每个用户每个月的消费次数 用透视 var:user_month_count_dfuser_month_count_df = df.pivot_table(index="user_id",values="order_dt",aggfunc="count", columns="month").fillna(value=0)user_month_count_df
Out[]:
month1997-01-011997-02-011997-03-011997-04-011997-05-011997-06-011997-07-011997-08-011997-09-011997-10-011997-11-011997-12-011998-01-011998-02-011998-03-011998-04-011998-05-011998-06-01
user_id
11.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.000
22.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.000
31.0000.0001.0001.0000.0000.0000.0000.0000.0000.0002.0000.0000.0000.0000.0000.0001.0000.000
42.0000.0000.0000.0000.0000.0000.0001.0000.0000.0000.0001.0000.0000.0000.0000.0000.0000.000
52.0001.0000.0001.0001.0001.0001.0000.0001.0000.0000.0002.0001.0000.0000.0000.0000.0000.000
.........................................................
235660.0000.0001.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.000
235670.0000.0001.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.000
235680.0000.0001.0002.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.000
235690.0000.0001.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.000
235700.0000.0002.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.000

23570 rows × 18 columns

In[]:
# 统计每个用户每个月是否消费,消费记录为1否则记录为0  var:df_purchasedf_purchase = user_month_count_df.applymap(lambda x : 1 if x >=1 else 0 )df_purchase
Out[]:
month1997-01-011997-02-011997-03-011997-04-011997-05-011997-06-011997-07-011997-08-011997-09-011997-10-011997-11-011997-12-011998-01-011998-02-011998-03-011998-04-011998-05-011998-06-01
user_id
1100000000000000000
2100000000000000000
3101100000010000010
4100000010001000000
5110111101001100000
.........................................................
23566001000000000000000
23567001000000000000000
23568001100000000000000
23569001000000000000000
23570001000000000000000

23570 rows × 18 columns

In[]:
# 用户生命周期模型,固定算法def active_status(data):    status = []    for i in range(18):        # 若本月没有消费        if data[i] == 0:            if len(status) > 0:                if status[i-1] == "unreg":                    status.append("unreg")                else:                    status.append("unactive")            else:                status.append("unreg")        # 若本月消费        else:            if len(status) == 0:                status.append("new")            else:                if status[i-1] == "unactive":                    status.append("return")                elif status[i-1] == "ureg":                    status.append("new")                else:                    status.append("active")    return status
In[]:
# 将df_purchase中的原始数据0和1修改为new,unactive...返回新var:df_purchase_newdf_purchase_new = df_purchase.apply(active_status,axis=1)df_purchase_new
Out[]:
user_id1        [new, unactive, unactive, unactive, unactive, ...2        [new, unactive, unactive, unactive, unactive, ...3        [new, unactive, return, active, unactive, unac...4        [new, unactive, unactive, unactive, unactive, ...5        [new, active, unactive, return, active, active...                               ...                        23566    [unreg, unreg, active, unactive, unactive, una...23567    [unreg, unreg, active, unactive, unactive, una...23568    [unreg, unreg, active, active, unactive, unact...23569    [unreg, unreg, active, unactive, unactive, una...23570    [unreg, unreg, active, unactive, unactive, una...Length: 23570, dtype: object
In[]:
# 将pivoted_status的values转成list,再将list转成DataFrame# 将df_purchase的index作为df_pruchase的index,columns相同# var:df_puechase_newdf_purchase_new1 = pd.DataFrame(data=df_purchase_new.to_list(),index=df_purchase.index, columns=df_purchase.columns)df_purchase_new1.head()
Out[]:
month1997-01-011997-02-011997-03-011997-04-011997-05-011997-06-011997-07-011997-08-011997-09-011997-10-011997-11-011997-12-011998-01-011998-02-011998-03-011998-04-011998-05-011998-06-01
user_id
1newunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactive
2newunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactive
3newunactivereturnactiveunactiveunactiveunactiveunactiveunactiveunactivereturnunactiveunactiveunactiveunactiveunactivereturnunactive
4newunactiveunactiveunactiveunactiveunactiveunactivereturnunactiveunactiveunactivereturnunactiveunactiveunactiveunactiveunactiveunactive
5newactiveunactivereturnactiveactiveactiveunactivereturnunactiveunactivereturnactiveunactiveunactiveunactiveunactiveunactive
In[]:
# 将每月不同活跃用户进行计数 var:purchase_status_ctpurchase_status_ct = df_purchase_new1.apply(lambda x : pd.value_counts(x),axis=0).fillna(0)purchase_status_ct.head()
Out[]:
month1997-01-011997-02-011997-03-011997-04-011997-05-011997-06-011997-07-011997-08-011997-09-011997-10-011997-11-011997-12-011998-01-011998-02-011998-03-011998-04-011998-05-011998-06-01
active0.0009633.0008929.0001773.000852.000747.000746.000604.000528.000532.000624.000632.000512.000472.000571.000518.000459.000446.000
new7846.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.000
return0.0000.000595.0001049.0001362.0001592.0001434.0001168.0001211.0001307.0001404.0001232.0001025.0001079.0001489.000919.0001029.0001060.000
unactive0.0006689.00014046.00020748.00021356.00021231.00021390.00021798.00021831.00021731.00021542.00021706.00022033.00022019.00021510.00022133.00022082.00022064.000
unreg15724.0007248.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.000
In[]:
# 转置t_purchase_status_ct = purchase_status_ct.Tt_purchase_status_ct
Out[]:
activenewreturnunactiveunreg
month
1997-01-010.0007846.0000.0000.00015724.000
1997-02-019633.0000.0000.0006689.0007248.000
1997-03-018929.0000.000595.00014046.0000.000
1997-04-011773.0000.0001049.00020748.0000.000
1997-05-01852.0000.0001362.00021356.0000.000
1997-06-01747.0000.0001592.00021231.0000.000
1997-07-01746.0000.0001434.00021390.0000.000
1997-08-01604.0000.0001168.00021798.0000.000
1997-09-01528.0000.0001211.00021831.0000.000
1997-10-01532.0000.0001307.00021731.0000.000
1997-11-01624.0000.0001404.00021542.0000.000
1997-12-01632.0000.0001232.00021706.0000.000
1998-01-01512.0000.0001025.00022033.0000.000
1998-02-01472.0000.0001079.00022019.0000.000
1998-03-01571.0000.0001489.00021510.0000.000
1998-04-01518.0000.000919.00022133.0000.000
1998-05-01459.0000.0001029.00022082.0000.000
1998-06-01446.0000.0001060.00022064.0000.000

关键词: 所有用户 最后一次 购买商品