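TED (Technology, Entertainment, Design)
A nonprofit organization that aims to bring together experts from the fields of technology, entertainment, and design.
Slogan: "Ideas worth spreading"
Every year in February or March, TED convenes outstanding people who distill their work and research into short, powerful talks (usually under 18 minutes). The talks are uploaded to the TED website, where anyone can watch them for free.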
I. Data
Two data sources:
ted_main.csv: metadata for every TED Talk recording uploaded to the official website, TED.com, before September 21, 2017.
transcripts.csv: the full text of each talk.
1. Data overview
from IPython.core.interactiveshell import InteractiveShell  # display the result of every expression in a cell, not only the last one
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
ted=pd.read_csv('C:/Users/ZJDCUser/Desktop/比赛实战/kaggle/TED talks/ted_main.csv')
# Inspect the dataset's columns
ted.columns
# Reorder the features
ted = ted[['name', 'title', 'description', 'main_speaker', 'speaker_occupation', 'num_speaker', 'duration', 'event', 'film_date', 'published_date', 'comments', 'tags', 'languages', 'ratings', 'related_talks', 'url', 'views']]
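17 columns in total:
Feature | Explanation
name | Official name of the talk (main speaker + title)
title | Title of the talk
description | Description of the talk's content
main_speaker | Main speaker
speaker_occupation | Occupation of the main speaker
num_speaker | Number of speakers
duration | Length of the talk, in seconds
event | TED / TEDx event where the talk was given
film_date | Date the talk was filmed (Unix timestamp)
published_date | Date the talk was published (Unix timestamp)
comments | Number of comments
tags | Topic tags associated with the talk
languages | Number of languages in which the talk is available
ratings | A list of dictionaries, one per rating given to the talk (e.g. inspiring, fascinating, surprising)
related_talks | A list of dictionaries, each a recommended talk to watch next
url | URL of the talk
views | Number of views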
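Because ratings and related_talks come out of the CSV as plain strings rather than real Python lists, they generally need to be parsed before they can be analyzed. The sketch below shows one way to do that with the standard library's ast.literal_eval. The sample string and the helper name parse_ratings are made up for illustration; the string only mimics the shape of a ratings cell.

```python
import ast

def parse_ratings(raw):
    """Turn a ratings string into a {rating name: count} dict (hypothetical helper)."""
    # literal_eval only evaluates Python literals, so no arbitrary code is executed.
    entries = ast.literal_eval(raw)
    return {entry["name"]: entry["count"] for entry in entries}

# Illustrative sample shaped like one cell of the ratings column (not a real row).
sample = "[{'id': 7, 'name': 'Funny', 'count': 19645}, {'id': 3, 'name': 'Courageous', 'count': 760}]"

counts = parse_ratings(sample)
top_rating = max(counts, key=counts.get)  # rating name with the highest count
print(counts)
print(top_rating)
```

With the DataFrame loaded above, the same helper could be applied to the whole column, for example ted['ratings'].apply(parse_ratings).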
2. Data quality check
ted.info()
ted.shape
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   comments            2550 non-null   int64
 1   description         2550 non-null   object
 2   duration            2550 non-null   int64
 3   event               2550 non-null   object
 4   film_date           2550 non-null   int64
 5   languages           2550 non-null   int64
 6   main_speaker        2550 non-null   object
 7   name                2550 non-null   object
 8   num_speaker         2550 non-null   int64
 9   published_date      2550 non-null   int64
 10  ratings             2550 non-null   object
 11  related_talks       2550 non-null   object
 12  speaker_occupation  2544 non-null   object
 13  tags                2550 non-null   object
 14  title               2550 non-null   object
 15  url                 2550 non-null   object
 16  views               2550 non-null   int64
dtypes: int64(7), object(10)
(2550, 17)
speaker_occupation has 6 missing values (2544 non-null out of 2550 rows), but this is of little consequence.
ted.head()
ted.isnull().any() # only one column has missing values
ted[ted["speaker_occupation"].isnull()] # show the six rows where the value is missing
index | name | title | description | main_speaker | speaker_occupation | num_speaker | duration | event | film_date | published_date | comments | tags | languages | ratings | related_talks | url | views
0 | Ken Robinson: Do schools kill creativity? | Do schools kill creativity? | Sir Ken Robinson makes an entertaining and pro... | Ken Robinson | Author/educator | 1 | 1164 | TED2006 | 25-02-2006 | 27-06-2006 | 4553 | ['children', 'creativity', 'culture', 'dance',... | 60 | [{'id': 7, 'name': 'Funny', 'count': 19645}, {... | [{'id': 865, 'hero': 'https://pe.tedcdn.com/im... | https://www.ted.com/talks/ken_robinson_says_sc... | 47227110
1 | Al Gore: Averting the climate crisis | Averting the climate crisis | With the same humor and humanity he exuded in ... | Al Gore | Climate advocate | 1 | 977 | TED2006 | 25-02-2006 | 27-06-2006 | 265 | ['alternative energy', 'cars', 'climate change... | 43 | [{'id': 7, 'name': 'Funny', 'count': 544}, {'i... | [{'id': 243, 'hero': 'https://pe.tedcdn.com/im... | https://www.ted.com/talks/al_gore_on_averting_... | 3200520
2 | David Pogue: Simplicity sells | Simplicity sells | New York Times columnist David Pogue takes aim... | David Pogue | Technology columnist | 1 | 1286 | TED2006 | 24-02-2006 | 27-06-2006 | 124 | ['computers', 'entertainment', 'interface desi... | 26 | [{'id': 7, 'name': 'Funny', 'count': 964}, {'i... | [{'id': 1725, 'hero': 'https://pe.tedcdn.com/i... | https://www.ted.com/talks/david_pogue_says_sim... | 1636292
3 | Majora Carter: Greening the ghetto | Greening the ghetto | In an emotionally charged talk, MacArthur-winn... | Majora Carter | Activist for environmental justice | 1 | 1116 | TED2006 | 26-02-2006 | 27-06-2006 | 200 | ['MacArthur grant', 'activism', 'business', 'c... | 35 | [{'id': 3, 'name': 'Courageous', 'count': 760}... | [{'id': 1041, 'hero': 'https://pe.tedcdn.com/i... | https://www.ted.com/talks/majora_carter_s_tale... | 1697550
4 | Hans Rosling: The best stats you've ever seen | The best stats you've ever seen | You've never seen data presented like this. Wi... | Hans Rosling | Global health expert; data visionary | 1 | 1190 | TED2006 | 22-02-2006 | 28-06-2006 | 593 | ['Africa', 'Asia', 'Google', 'demo', 'economic... | 48 | [{'id': 9, 'name': 'Ingenious', 'count': 3202}... | [{'id': 2056, 'hero': 'https://pe.tedcdn.com/i... | https://www.ted.com/talks/hans_rosling_shows_t... | 12005869

name                  False
title                 False
description           False
main_speaker          False
speaker_occupation     True
num_speaker           False
duration              False
event                 False
film_date             False
published_date        False
comments              False
tags                  False
languages             False
ratings               False
related_talks         False
url                   False
views                 False
dtype: bool
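The missing-value check above can also be expressed as a count per column, which shows how many values are missing rather than only whether any are. The following is a minimal sketch on a small made-up DataFrame (the rows are illustrative and not taken from ted_main.csv). Filling the gap with the placeholder "Unknown" is an assumption of this example, not something the original analysis does.

```python
import pandas as pd

# Tiny made-up frame standing in for ted; one occupation is missing.
demo = pd.DataFrame({
    "main_speaker": ["Speaker A", "Speaker B", "Speaker C"],
    "speaker_occupation": ["Author", None, "Designer"],
})

# Number of missing values in each column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Optional: replace missing occupations with a placeholder instead of dropping rows.
filled = demo.fillna({"speaker_occupation": "Unknown"})
print(filled["speaker_occupation"].tolist())
```

On the real data, ted.isnull().sum() would be the corresponding call.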
3. Data processing (dates)
# In the original dataset, film_date and published_date are stored as Unix timestamps.
# We use the datetime library to convert them into a readable date format (DD-MM-YYYY).
# Note: fromtimestamp() converts using the local timezone of the machine running the code.
import datetime
ted['film_date'] = ted['film_date'].apply(lambda x: datetime.datetime.fromtimestamp(int(x)).strftime('%d-%m-%Y'))
ted['published_date'] = ted['published_date'].apply(lambda x: datetime.datetime.fromtimestamp(int(x)).strftime('%d-%m-%Y'))
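One caveat with the conversion above: datetime.datetime.fromtimestamp interprets the timestamp in the local timezone of the machine running the code, so a date near midnight can shift by a day from one machine to another. A sketch of a timezone-independent variant follows. The helper name to_date_utc and the sample timestamp (chosen to be midnight UTC on 25 February 2006) are assumptions made for illustration.

```python
from datetime import datetime, timezone

def to_date_utc(ts):
    """Format a Unix timestamp as DD-MM-YYYY in UTC (hypothetical helper)."""
    return datetime.fromtimestamp(int(ts), tz=timezone.utc).strftime("%d-%m-%Y")

sample_ts = 1140825600  # illustrative value: 2006-02-25 00:00:00 UTC
print(to_date_utc(sample_ts))
```

It could be used in place of the lambda, as in ted['film_date'].apply(to_date_utc). pandas also provides pd.to_datetime(ted['film_date'], unit='s'), which keeps the column as real datetimes instead of strings; which form is preferable depends on what the later analysis needs.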
II. Descriptive analysis
(1) Proportion analysis: share of single-speaker talks and share of talks no longer than 18 minutes
ted.describe()
print("Single-speaker talks make up {}% of all talks".format(round(sum(ted["num_speaker"]==1)*100/len(ted),1)))
# str.format() substitutes its arguments into the {} placeholders of the string passed to print()
print("Talks of 18 minutes or less make up {}% of all talks".format(round(sum(ted["duration"]<=18*60)*100/len(ted),1)))
       num_speaker     duration     comments    languages         views
count  2550.000000  2550.000000  2550.000000  2550.000000  2.550000e+03
mean      1.028235   826.510196   191.562353    27.326275  1.698297e+06
std       0.207705   374.009138   282.315223     9.563452  2.498479e+06
min       1.000000   135.000000     2.000000     0.000000  5.044300e+04
25%       1.000000   577.000000    63.000000    23.000000  7.557928e+05
50%       1.000000   848.000000   118.000000    28.000000  1.124524e+06
75%       1.000000  1046.750000   221.750000    33.000000  1.700760e+06
max       5.000000  5256.000000  6404.000000    72.000000  4.722711e+07
Single-speaker talks make up 97.7% of all talks
Talks of 18 minutes or less make up 79.1% of all talks
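The results show that the vast majority of talks (97.7%) have a single speaker, and 79.1% run 18 minutes or less. A talk receives about 191.6 comments and roughly 1.7 million views on average. Talks are offered in multiple languages, up to a maximum of 72.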
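Each of the two percentages above is a count of rows meeting a condition, divided by the total number of rows. A minimal sketch of that arithmetic in plain Python follows. The durations list is made up for illustration, and share_percent is a hypothetical helper.

```python
def share_percent(values, condition):
    """Percentage of values satisfying condition, rounded to one decimal place."""
    matches = sum(1 for v in values if condition(v))
    return round(matches * 100 / len(values), 1)

# Made-up durations in seconds (not taken from the dataset).
durations = [600, 900, 1080, 1200, 1500]

limit = 18 * 60  # 18 minutes expressed in seconds
at_most_18 = share_percent(durations, lambda d: d <= limit)
under_18 = share_percent(durations, lambda d: d < limit)
print(at_most_18, under_18)
```

The two results differ because one sample talk lasts exactly 18 minutes; the analysis above uses <=, so such talks are counted. In pandas the same share can be written as (ted["duration"] <= 18*60).mean() * 100, since the mean of a boolean column is the fraction of True values.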
(2) Ranking analysis
1. Most viewed talks
# Sort by views in descending order and keep the top 15 rows
pop_talks=ted[["title","main_speaker","views","film_date"]].sort_values("views",ascending=False)[:15]
pop_talks
ted.views.describe() # overall distribution of views
count    2.550000e+03
mean     1.698297e+06
std      2.498479e+06
min      5.044300e+04
25%      7.557928e+05
50%      1.124524e+06
75%      1.700760e+06
max      4.722711e+07
Name: views, dtype: float64
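As an illustration of the ranking step, the sketch below sorts a handful of talks by views in plain Python. The five (speaker, views) pairs are the ones visible in the ted.head() output earlier; top_n is a hypothetical helper introduced only for this example.

```python
def top_n(records, n):
    """Return the n records with the most views, highest first."""
    return sorted(records, key=lambda r: r[1], reverse=True)[:n]

# (main_speaker, views) pairs copied from the ted.head() output shown earlier.
talks = [
    ("Ken Robinson", 47227110),
    ("Al Gore", 3200520),
    ("David Pogue", 1636292),
    ("Majora Carter", 1697550),
    ("Hans Rosling", 12005869),
]

for speaker, views in top_n(talks, 3):
    print(f"{speaker}: {views:,}")
```

On the full DataFrame, ted.nlargest(15, "views") is a comparable alternative to the sort_values(...)[:15] expression used above.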