using a cohort approach for data analysis

Background:
Nations Info Corp is an online subscription provider of real estate and credit monitoring services.
Our customers pay a monthly membership fee, after a usually $1 trial for several days. One of
the challenges of a subscription business is to forecast the subscription revenue we collect from
each customer (also known as LTV or CLV, lifetime value). On one hand, it’s necessary to know
the LTV because we need that value to know if the business is profitable. For instance, if the
LTV is $100 but we are paying $120 to acquire each customer, that is clearly not a profitable
business to pursue. On the other hand, it is very difficult to know what the LTV is when you first
acquire the customer, since it may take many months for the revenue to come in, as we charge
the customer each month they remain a member subscribed to our product.
Data:
Weare using a cohort approach for data analysis. For this exercise, we will define the cohort as
the group of customers that sign up in a given month (e.g. January 2020).
Wecollect cohort data from 2020 to 2022 for 2 different verticals of business (rto: rent-to-own
and credco: credit monitor). We wanted to use various short term metrics (i.e. how they
performed within 1 month from signup) to predict long term LTV. Data is provided in the
“historical data.csv” file. Here is the data dictionary:
Numbers #:
M0#: number of signups we get. Serve as the denominator of all following metrics.
Dollar Amounts $:
LTV 0-15: average money we collect from customers in 15 days from signup.
LTV 0-30: average money we collect from customers in 30 days from signup.
LTV 0-360: average money we collect from customers in 360 days from signup.
Percentages %:
C0%: cancels on the same day of signup.
C1%: cancels during the trial period, before the first monthly bill.
D0%: failed to pay for signup due to card declines.
D1%: succeeded to stay the trial period but failed to pay for the first monthly bill due to card
declines.
M1%: succeeded to stay the trial period and pay for the first monthly bill.
MOBILE%: signups using mobile device.
PREPAID%: signups using prepaid card.
LOGIN RATE%: login to account after initial singup
SEARCHRATE%: search for properties after initial signup (not available for credit business)
PDP VIEWRATE%: view the property details after initial signup (not available for credit
business)
Task 1:
Our team is tasked with forecasting the LTV 0-360 using the given metrics and any external
data. Marketing and product teams are interested in how data analysis and predictive modeling
could help with their business decisions.
Please compile your Python / R codes and results in a .html file. Also feel free to use any
business intelligence tools to present insights.
Task 2:
Weare constantly testing new features on the sites and want to assess performance of the
changes and optimize profitability of the overall business. We tested different price points of
subscription fees on “rto” business recently, where our old version (Variant A) was put head to
head in a test against the new version (Variant B). The visitor traffic is supposed to split evenly
between A and B.
Analyze the test results and present findings, giving a recommendation about what we
should do for the traffic that is being tested.
Variant A: $49 monthly subscription fee after 7 days trial.
Variant B: KaTeX parse error: Expected 'EOF', got '#' at position 183: …aset: VISITORS#̲: number of uni…: cost we pay to partners on each signup.
Task 3 (SQL Question):
Table1: “orders”- the information of each order being placed
Column Name
Type
order_id
order_status
signup_type
number
varchar
varchar
order_datetime
jluvr
timestamp
varchar
Table 2: “activities”- the users activities on our website, whether before or after the order
ColumnName Type
id number
action_type varchar
user_data varchar
created_at timestamp
jluvr varchar
Prompt:
“jluvr”isakeytodefineeachindividualuservisitingourwebsite,andcanbeusedtolink
“orders”and“activities”tables.Notejluvrisnotuniqueineitherordersoractivitiestable(i.e.
sameusercanplacemultipleordersandcanhavemultipleactivities).
Wewantedtofindthelastactivityofusersbeforeeachorderbeingplaced.Theresultwill
showeachuniqueorder,andthematchedactivityif itexists(ifnotexists,showsnullandkeeps
theorderinfo).Feel freetostateproperassumptionsifanyarenotclarifiedabove.
Pleasesubmit theSQLcodesinaplaintext file/doc.
SampleOutput:
order_id jluvr order_dateti
me
created_at action_type user_data
1001 abc-123-def 2023-12-01
08:00:00
2023-12-01
07:58:00
form_submit rent
1002 abc-123-def 2023-12-01
09:00:00
2023-12-01
08:02:00
form_submit own
1003 abc-123-efg 2023-12-01
10:00:00
2023-12-01
09:55:00
form_view own
1004 abc-123-fgh 2023-12-01
10:01:00
(null) (null) (null)
… … … … … …

<think> 我们参考了上述引用中关于群组分析(Cohort Analysis)的内容,尤其是引用[2]、[3]、[4]中提到的分析步骤和代码示例。群组分析的核心是将用户划分为具有共同特征的群组(如同期注册、首次购买等),然后跟踪这些群组在一段时间内的行为(如留存率、活跃度等)。以下是进行群组分析的一般步骤和Python实现方法: ### 1. 数据准备 首先,需要准备包含以下字段的数据集: - **用户ID**:唯一标识用户。 - **行为时间**:用户发生行为(如注册、购买)的时间戳。 - **行为类型**(可选):如注册、购买等,用于确定群组的划分依据。 ### 2. 定义群组 通常,我们将用户按首次发生关键行为的时间划分群组(如按首次注册的月份)。例如: - 将每个用户分配到其首次注册的月份(群组月份)。 - 计算每个用户后续每个月的活跃情况(相对于首次注册月份的时间间隔)。 ### 3. 计算群组指标 常见的指标是留存率(Retention Rate),即某个群组在后续时间点仍然活跃的用户比例。 ### 4. 数据透视表 将数据转换为群组分析矩阵,行表示群组(如注册月份),列表示时间间隔(如第1个月、第2个月等),值表示留存率或活跃用户数。 ### Python实现示例 以下代码基于引用[4]中的示例,使用pandas进行群组分析: ```python import pandas as pd import numpy as np # 假设df包含以下列:CustomerID(用户ID),OrderDate(订单日期) # 步骤1:创建群组月份(CohortMonth),即每个用户首次购买的月份 df['OrderDate'] = pd.to_datetime(df['OrderDate']) df['OrderMonth'] = df['OrderDate'].dt.to_period('M') # 按用户分组,找到每个用户的首次购买月份(即群组月份) df_cohort = df.groupby('CustomerID')['OrderMonth'].min().reset_index() df_cohort.rename(columns={'OrderMonth':'CohortMonth'}, inplace=True) # 合并回原始数据,这样每条记录都有该用户所在的群组月份 df = pd.merge(df, df_cohort, on='CustomerID') # 步骤2:计算每个订单相对于群组月份的时间间隔(月数) df['CohortIndex'] = (df['OrderMonth'] - df['CohortMonth']).apply(lambda x: x.n) # 步骤3:按群组月份和时间间隔分组,统计唯一用户数 cohort_data = df.groupby(['CohortMonth', 'CohortIndex'])['CustomerID'].nunique().reset_index() # 步骤4:创建数据透视表(群组分析矩阵) cohort_counts = cohort_data.pivot_table(index='CohortMonth', columns='CohortIndex', values='CustomerID') # 步骤5:计算留存率(用每个群组的第一个月的用户数作为分母) cohort_sizes = cohort_counts.iloc[:,0] retention = cohort_counts.divide(cohort_sizes, axis=0) # 按行除,即每个群组除以自己的群组大小 retention = retention.round(3) * 100 # 转换为百分比 # 输出留存率矩阵 print(retention) ``` ### 结果解释 - 行索引:群组月份(如2019-01表示2019年1月首次购买的用户群)。 - 列索引:时间间隔(0表示群组月份当月,1表示下个月,依此类推)。 - 值:留存率(百分比),例如,0列都是100%,因为用户在自己的群组月份都是活跃的。 ### 可视化 可以使用热图(Heatmap)来直观展示留存率矩阵: ```python import seaborn as sns import matplotlib.pyplot as plt plt.figure(figsize=(12, 8)) sns.heatmap(retention, annot=True, fmt='.1f', cmap='Blues', vmin=0, vmax=100) plt.title('Monthly Cohorts: User Retention') plt.ylabel('Cohort Month') plt.xlabel('Months Since First Purchase') plt.show() ``` ### 注意事项 - 群组分析不仅限于时间维度,还可以根据用户的行为特征(如首次购买的商品类别)划分群组。 - 留存率是常用指标,但也可以使用其他指标,如平均订单金额、购买频率等。 §§相关问题§§ 1. 群组分析中,如何选择划分群组的依据? 2. 除了留存率,群组分析还可以分析哪些指标? 3. 如何解释群组分析矩阵中的异常值? </think> ### 群组分析(Cohort Analysis)方法详解 群组分析是通过将用户划分为具有共同特征的群体(如同期注册、相同行为路径等),追踪其随时间变化的指标(如留存率、转化率等)的分析方法。以下是基于引用内容的完整实施步骤: --- #### **1. 定义群组(Cohort)** 根据业务目标选择划分维度: - **时间型群组**:按用户首次行为时间划分 示例:`2023年6月注册用户`、`2023Q1首次购买用户`(引用[3]) - **行为型群组**:按关键行为特征划分 示例:`通过渠道B注册的用户`、`首单金额>500元的用户`(引用[5]) --- #### **2. 数据处理与计算** 使用Python实现核心计算(引用[4]): ```python import pandas as pd # 步骤1:计算每个用户的首次行为时间 df['首次行为日期'] = df.groupby('用户ID')['行为日期'].transform('min') df['群组月份'] = df['首次行为日期'].dt.to_period('M') # 步骤2:计算时间间隔(群组索引) df['时间间隔'] = (df['行为日期'].dt.to_period('M') - df['群组月份']).apply(lambda x: x.n) # 步骤3:按群组和时间间隔聚合 cohort_data = df.groupby(['群组月份', '时间间隔'])['用户ID'].nunique().reset_index() # 步骤4:生成透视表 retention_matrix = cohort_data.pivot_table( index='群组月份', columns='时间间隔', values='用户ID', aggfunc='sum' ) ``` --- #### **3. 关键指标计算** - **留存率公式**: $$留存率 = \frac{当前活跃用户数}{群组初始用户数} \times 100\%$$ 示例代码: ```python initial_users = retention_matrix.iloc[:, 0] # 群组首月用户数 retention_rate = retention_matrix.div(initial_users, axis=0) * 100 ``` - **其他常用指标**: - 人均收入(ARPU) - 行为转化率 - 复购率(引用[5]) --- #### **4. 可视化分析** - **热力图展示留存率**: ```python import seaborn as sns plt.figure(figsize=(12,8)) sns.heatmap(retention_rate, annot=True, fmt=".1f", cmap="YlGnBu") plt.title("月度群组留存率分析", fontsize=14) ``` - **折线图对比群组**: ![](https://example.com/cohort-chart.png) *不同渠道群组的留存趋势对比(引用[5])* --- #### **5. 分析决策应用** - **识别优质群组**:高留存群组(如渠道A用户)增加资源投入(引用[5]) - **产品迭代验证**:对比改版前后新用户留存率变化(引用[5]) - **异常诊断**:特定群组指标骤降时定位问题(如渠道B用户留存差) --- ### 常见工具推荐 1. **Python库**: - `pandas`(数据处理) - `matplotlib/seaborn`(可视化) 2. **BI工具**: - Tableau(内置群组分析模块) - Google Analytics(同期群报告) > **注意事项**:群组规模需>100用户以保证统计显著性,时间周期选择需匹配业务场景(周/月/季)。
评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值