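1. Dataset description
Each row represents one customer, and each column holds a customer attribute described in the column metadata below. The raw data contains 7043 rows (customers) and 21 columns (features).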
| Field | Meaning | Values |
|---|---|---|
| customerID | Customer ID | Unique identifier |
| gender | Gender | (Male, Female) |
| SeniorCitizen | Whether the customer is a senior citizen | (0, 1) |
| Partner | Whether the customer has a partner | (No, Yes) |
| Dependents | Whether the customer has dependents | (No, Yes) |
| tenure | Months the customer has been with the operator | (continuous, 0-72) |
| PhoneService | Whether the customer has phone service | (Yes, No) |
| MultipleLines | Whether the customer has multiple lines | (Yes, No, No phone service) |
| InternetService | Customer's internet service type | (No, DSL, Fiber optic) |
| OnlineSecurity | Online security | (Yes, No, No internet service) |
| OnlineBackup | Online backup | (Yes, No, No internet service) |
| DeviceProtection | Device protection plan | (Yes, No, No internet service) |
| TechSupport | Tech support | (Yes, No, No internet service) |
| StreamingTV | Streaming TV | (Yes, No, No internet service) |
| StreamingMovies | Streaming movies | (Yes, No, No internet service) |
| Contract | Contract term | (Month-to-month, One year, Two year) |
| PaperlessBilling | Paperless billing | (Yes, No) |
| PaymentMethod | Payment method | (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)) |
| MonthlyCharges | Monthly charges | (continuous) |
| TotalCharges | Total charges | (continuous) |
| Churn | Churn label | (No, Yes) |
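2. Analysis approach
Find the features that are related to the churn rate, examine how each of them affects it, build a profile of customers with a high churn rate, and propose recommendations aimed at those customers.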
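The first step of this approach, comparing churn rates across the levels of a feature, can be sketched as follows. This is only a minimal illustration: the `toy` frame contains made-up values rather than rows from the dataset, and `churn_rate_by` is a hypothetical helper name, not something defined in the original analysis.

```python
import pandas as pd

# Toy data for illustration only; the values are made up, not taken from the dataset.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "One year"],
    "Churn": ["Yes", "No", "No", "No", "Yes", "Yes"],
})


def churn_rate_by(frame, feature):
    """Return the share of churned customers within each level of `feature`."""
    churned = frame["Churn"].eq("Yes")           # boolean Series: True = churned
    rates = churned.groupby(frame[feature]).mean()
    return rates.sort_values(ascending=False)    # highest churn rate first


rates = churn_rate_by(toy, "Contract")
print(rates)
```

Applied to the real data, the same helper could be called once per categorical column to see which features separate churned from retained customers most clearly.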
3. Data preprocessing
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['font.sans-serif'] = ['SimHei']  # use a font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
df = pd.read_csv('电信运营商客户数据集.csv')
df.head()
| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
5 rows × 21 columns
# Inspect column types and non-null counts
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerID 7043 non-null object
1 gender 7043 non-null object
2 SeniorCitizen 7043 non-null int64
3 Partner 7043 non-null object
4 Dependents 7043 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7043 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 7043 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 7043 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7043 non-null object
18 MonthlyCharges 7043 non-null float64
19 TotalCharges 7043 non-null object
20 Churn 7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
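No column reports missing values. Note, however, that TotalCharges is stored as object rather than as a numeric type.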
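Because `df.info()` only counts true nulls, a blank string in a text column still shows up as non-null. A quick check along the following lines can reveal such cells. This is a sketch on a made-up `toy` frame, and `blank_counts` is an illustrative name, not part of the original notebook.

```python
import pandas as pd

# Toy frame for illustration; the blank string mimics a cell that info() counts as non-null.
toy = pd.DataFrame({
    "customerID": ["a", "b", "c"],
    "TotalCharges": ["29.85", " ", "108.15"],
    "tenure": [1, 0, 2],
})

# For every non-numeric column, count the cells that are empty or whitespace-only.
blank_counts = {
    col: int(toy[col].astype(str).str.strip().eq("").sum())
    for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col])
}
print(blank_counts)
```

Running the same dictionary comprehension on `df` would show whether any object column hides blanks before a type conversion is attempted.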
# Check for duplicate rows and duplicate customer IDs
sum(df.duplicated())
0
sum(df.customerID.duplicated())
0
There are no duplicate rows or customer IDs, so the data covers 7043 distinct customers.
# Convert TotalCharges to float; values that cannot be parsed become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerID 7043 non-null object
1 gender 7043 non-null object
2 SeniorCitizen 7043 non-null int64
3 Partner 7043 non-null object
4 Dependents 7043 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7043 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 7043 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 7043 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7043 non-null object
18 MonthlyCharges 7043 non-null float64
19 TotalCharges 7032 non-null float64
20 Churn 7043 non-null object
dtypes: float64(2), int64(2), object(17)
memory usage: 1.1+ MB
After the conversion, TotalCharges has 11 missing values (7032 non-null out of 7043).
# List the rows where TotalCharges is missing
df[df.TotalCharges.isna()]
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity |
|---|---|---|---|---|---|---|---|---|---|
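
One possible way to treat such rows is sketched below on made-up data. It rests on an assumption that should be verified against the real rows first: that a missing TotalCharges together with a tenure of 0 means the customer has not been billed yet. The `toy` frame and the `new_customer` mask are illustrative names, and this is not necessarily the treatment the author chose.

```python
import pandas as pd

# Toy data for illustration only; the values are made up.
toy = pd.DataFrame({
    "tenure": [0, 2, 34],
    "MonthlyCharges": [52.55, 53.85, 56.95],
    "TotalCharges": [" ", "108.15", "1889.5"],
})

# errors="coerce" turns unparseable strings (here a blank) into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
missing = toy[toy["TotalCharges"].isna()]
print(missing)

# Assumption: a missing total with tenure 0 means nothing has been billed yet,
# so 0.0 is used as the fill value. Dropping these rows would be another option.
new_customer = toy["TotalCharges"].isna() & toy["tenure"].eq(0)
toy.loc[new_customer, "TotalCharges"] = 0.0
print(toy)
```

Filling only the rows that match the mask leaves any other missing totals untouched, so they would still surface in a later `isna()` check.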
