How to Efficiently Merge Audio and Video Files (Short Time, Low Resource Usage) (Part 2)

English subtitles

1
00:00:06,480 --> 00:00:08,400
Good morning. We have a banger for you

2
00:00:08,400 --> 00:00:09,840
today. We're going to launch ChatGPT

3
00:00:09,840 --> 00:00:11,519
agent. But before jumping into that, I'd

4
00:00:11,519 --> 00:00:12,559
like to ask the team to introduce

5
00:00:12,559 --> 00:00:14,080
themselves. Starting with Yash.

6
00:00:14,080 --> 00:00:17,840
Hi, I'm Yash. I work on agent team and

7
00:00:17,840 --> 00:00:20,080
before that I used to work on operator.

8
00:00:20,080 --> 00:00:22,560
Hi, I'm Jing. I work on agents research

9
00:00:22,560 --> 00:00:24,400
previously on deep research.

10
00:00:24,400 --> 00:00:26,000
Hi, I'm Casey. I'm a researcher on

11
00:00:26,000 --> 00:00:27,920
agents formerly operator.

12
00:00:27,920 --> 00:00:30,560
Hi, I'm Issa. I'm a researcher on agent

13
00:00:30,560 --> 00:00:32,640
formerly on deep research.

14
00:00:32,640 --> 00:00:34,880
So we started launching agents

15
00:00:34,880 --> 00:00:36,800
earlier this year. Uh we launched deep

16
00:00:36,800 --> 00:00:38,879
research, we launched operator and

17
00:00:38,879 --> 00:00:40,160
people were very excited about this.

18
00:00:40,160 --> 00:00:42,480
People could see that now uh AI was

19
00:00:42,480 --> 00:00:44,640
going off to do complex tasks for them.

20
00:00:44,640 --> 00:00:46,079
But it became clear to us that what

21
00:00:46,079 --> 00:00:48,000
people really wanted was for us to bring

22
00:00:48,000 --> 00:00:49,760
those capabilities and more together.

23
00:00:49,760 --> 00:00:51,920
People wanted a unified agent that could

24
00:00:51,920 --> 00:00:55,039
go off, use its own computer and do real

25
00:00:55,039 --> 00:00:57,360
complex tasks for them, that could uh

26
00:00:57,360 --> 00:00:59,359
seamlessly transition from thinking

27
00:00:59,359 --> 00:01:01,520
about something to taking actions to

28
00:01:01,520 --> 00:01:03,359
using lots of tools using the terminal,

29
00:01:03,359 --> 00:01:05,360
clicking around the web, even producing

30
00:01:05,360 --> 00:01:06,880
things like spreadsheets and slides and

31
00:01:06,880 --> 00:01:08,960
much more. And people wanted to

32
00:01:08,960 --> 00:01:10,159
be able to do this over a long time

33
00:01:10,159 --> 00:01:12,159
horizon, and sort of for universal

34
00:01:12,159 --> 00:01:13,840
tasks. So the team has been working

35
00:01:13,840 --> 00:01:16,400
super hard to bring that together. And

36
00:01:16,400 --> 00:01:18,080
today we have ChatGPT agent. Um,

37
00:01:18,080 --> 00:01:19,680
it's probably easier to show it to you

38
00:01:19,680 --> 00:01:21,439
than to keep talking about it. It is one

39
00:01:21,439 --> 00:01:23,360
of the feel-the-AGI moments for me to

40
00:01:23,360 --> 00:01:25,280
watch it work. So, let's take a look.

41
00:01:25,280 --> 00:01:27,840
Awesome. Thanks, Sam. Hello, everyone.

42
00:01:27,840 --> 00:01:29,920
Very excited to share ChatGPT agent

43
00:01:29,920 --> 00:01:31,600
with everybody. And as Sam said, let's

44
00:01:31,600 --> 00:01:33,759
just dive right into the demo. Okay, so

45
00:01:33,759 --> 00:01:36,159
we are on ChatGPT, as we all know and

46
00:01:36,159 --> 00:01:39,119
love. And to turn on the agent mode, you

47
00:01:39,119 --> 00:01:40,880
just click the tools menu and select

48
00:01:40,880 --> 00:01:43,280
agent. You can also just type agent in

49
00:01:43,280 --> 00:01:45,040
the composer bar and it'll take you to

50
00:01:45,040 --> 00:01:47,520
agent mode. Um, Edward and I have a

51
00:01:47,520 --> 00:01:49,360
wedding to go to later this year. Uh,

52
00:01:49,360 --> 00:01:51,119
it's for one of our mutual friends.

53
00:01:51,119 --> 00:01:52,560
Should we have the agent

54
00:01:52,560 --> 00:01:53,280
plan it?

55
00:01:53,280 --> 00:01:55,680
Yeah, let's do it. I need an outfit. And

56
00:01:55,680 --> 00:01:56,799
don't forget the gift.

57
00:01:56,799 --> 00:01:58,719
Okay, great. We won't forget the gift.

58
00:01:58,719 --> 00:02:00,240
Um, it's a little bit of a longer

59
00:02:00,240 --> 00:02:01,680
prompt, so I have it copied in my

60
00:02:01,680 --> 00:02:02,799
buffer, so I'm just going to go ahead

61
00:02:02,799 --> 00:02:05,759
and paste it. Um, okay. So, let's see.

62
00:02:05,759 --> 00:02:07,360
Let's see what it says. Our friends are

63
00:02:07,360 --> 00:02:08,640
getting married later this year, as I

64
00:02:08,640 --> 00:02:10,720
said, Minia and Sarah. And we want the

65
00:02:10,720 --> 00:02:12,879
agent to help us find an outfit that

66
00:02:12,879 --> 00:02:15,520
matches the dress code. Uh, propose a few

67
00:02:15,520 --> 00:02:17,840
options. Nice mid luxury taking into

68
00:02:17,840 --> 00:02:21,040
account venue and weather. We also want

69
00:02:21,040 --> 00:02:23,280
to find us some hotels and as Edward

70
00:02:23,280 --> 00:02:25,760
said, don't forget the gift. Um so let's

71
00:02:25,760 --> 00:02:27,840
see and

72
00:02:27,840 --> 00:02:30,319
send the prompt away. As Sam said, agent

73
00:02:30,319 --> 00:02:32,640
uses a computer. Uh so in the beginning

74
00:02:32,640 --> 00:02:34,959
it sets up its environment. It it you

75
00:02:34,959 --> 00:02:38,000
know it'll take a minute or two or not

76
00:02:38,000 --> 00:02:39,680
really 5 seconds to set up its

77
00:02:39,680 --> 00:02:41,440
environment. And in this case, as you

78
00:02:41,440 --> 00:02:43,840
see, it understands the prompt. It's

79
00:02:43,840 --> 00:02:46,319
asking me for a clarification. I'm

80
00:02:46,319 --> 00:02:48,000
just going to let it just continue and

81
00:02:48,000 --> 00:02:51,120
work. Anyway, um I think it got confused

82
00:02:51,120 --> 00:02:54,239
by saying, "Oh, where's the um what

83
00:02:54,239 --> 00:02:55,680
exactly is the time of the date of the

84
00:02:55,680 --> 00:02:57,200
wedding?" I think it'll figure out using

85
00:02:57,200 --> 00:02:59,840
the website. Okay, cool. So, now it's

86
00:02:59,840 --> 00:03:01,760
kicked off. It's starting the process,

87
00:03:01,760 --> 00:03:03,920
the prompt, and it's opened up a browser.

88
00:03:03,920 --> 00:03:04,959
And to walk you through what's

89
00:03:04,959 --> 00:03:06,800
happening, here's

90
00:03:06,800 --> 00:03:09,040
Yeah. So, as mentioned, we gave the

91
00:03:09,040 --> 00:03:10,879
agent access to its own virtual

92
00:03:10,879 --> 00:03:13,280
computer, and the computer has many

93
00:03:13,280 --> 00:03:14,720
different tools installed, and it can

94
00:03:14,720 --> 00:03:16,239
choose which to use as it's working

95
00:03:16,239 --> 00:03:18,640
through the task. So, in ChatGPT, you

96
00:03:18,640 --> 00:03:21,360
can see a visualization of the agent's

97
00:03:21,360 --> 00:03:23,680
computer screen, and you can see

98
00:03:23,680 --> 00:03:25,519
overlaid its chain of thought in text,

99
00:03:25,519 --> 00:03:27,200
and that's what it's thinking as it's

100
00:03:27,200 --> 00:03:28,480
working through the task and deciding

101
00:03:28,480 --> 00:03:30,799
what to do next. We gave the agent

102
00:03:30,799 --> 00:03:32,400
access to two different ways to browse

103
00:03:32,400 --> 00:03:34,560
the internet. First, we gave it a text

104
00:03:34,560 --> 00:03:36,159
browser, and this is similar to the deep

105
00:03:36,159 --> 00:03:38,000
research tool. And this is what lets it

106
00:03:38,000 --> 00:03:40,159
really efficiently and quickly read many

107
00:03:40,159 --> 00:03:43,440
web pages um um and search for them. And

108
00:03:43,440 --> 00:03:45,040
we also gave it access to a visual

109
00:03:45,040 --> 00:03:46,319
browser. And this is similar to the

110
00:03:46,319 --> 00:03:48,239
operator tool. And this is what lets it

111
00:03:48,239 --> 00:03:50,159
actually interact with the UI of a web

112
00:03:50,159 --> 00:03:52,720
page. So it can um drag things. It can

113
00:03:52,720 --> 00:03:54,879
use the cursor to click around. It can

114
00:03:54,879 --> 00:03:57,280
open UI components. It can fill out

115
00:03:57,280 --> 00:03:59,920
forms and enter text in text areas.

116
00:03:59,920 --> 00:04:02,560
It's very flexible. So those two tools

117
00:04:02,560 --> 00:04:04,720
are very complementary. And then we also

118
00:04:04,720 --> 00:04:06,720
gave it access to its own terminal so

119
00:04:06,720 --> 00:04:08,720
that it can run code and it can also

120
00:04:08,720 --> 00:04:10,640
generate and analyze files like slide

121
00:04:10,640 --> 00:04:12,879
decks and spreadsheets. And then through

122
00:04:12,879 --> 00:04:14,560
the terminal it's also able to call

123
00:04:14,560 --> 00:04:17,840
APIs. So both public APIs and APIs to

124
00:04:17,840 --> 00:04:19,840
access your private data sources like

125
00:04:19,840 --> 00:04:22,479
Google Drive, Google Calendar, GitHub,

126
00:04:22,479 --> 00:04:25,360
SharePoint and many others um and only

127
00:04:25,360 --> 00:04:26,960
if you explicitly connect them similar

128
00:04:26,960 --> 00:04:28,960
to deep research connectors. And then it

129
00:04:28,960 --> 00:04:31,680
also has access to the image gen API so

130
00:04:31,680 --> 00:04:34,240
it can create nice visuals for um slide

131
00:04:34,240 --> 00:04:36,080
decks and other things as it's working

132
00:04:36,080 --> 00:04:38,240
through its tasks.

133
00:04:38,240 --> 00:04:40,800
How is it deciding which tools to use here?

134
00:04:40,800 --> 00:04:42,560
Yes, we train the model to move between

135
00:04:42,560 --> 00:04:44,160
these capabilities with reinforcement

136
00:04:44,160 --> 00:04:46,080
learning. This is the first model we

137
00:04:46,080 --> 00:04:48,880
trained that has access to this unified

138
00:04:48,880 --> 00:04:52,000
toolbox: a text browser, a GUI browser

139
00:04:52,000 --> 00:04:53,840
and a terminal all in one virtual

140
00:04:53,840 --> 00:04:57,120
machine. To guide its learning, we

141
00:04:57,120 --> 00:04:59,360
created hard tasks that require using

142
00:04:59,360 --> 00:05:01,919
all these tools. This allows the model

143
00:05:01,919 --> 00:05:04,000
not only to learn how to use these

144
00:05:04,000 --> 00:05:06,160
tools, but also when to use which tool

145
00:05:06,160 --> 00:05:08,400
depending on the task at hand. At the

146
00:05:08,400 --> 00:05:10,400
beginning of the training, the model

147
00:05:10,400 --> 00:05:12,880
might attempt to use all these tools to

148
00:05:12,880 --> 00:05:15,600
solve a relatively simple problem. Over

149
00:05:15,600 --> 00:05:17,840
time, as we reward the model for solving

150
00:05:17,840 --> 00:05:20,560
problems correctly and efficiently, the

151
00:05:20,560 --> 00:05:24,080
model will have smarter tool choice.

152
00:05:24,080 --> 00:05:27,360
For example, if you ask a model to uh

153
00:05:27,360 --> 00:05:29,039
find a restaurant with specific

154
00:05:29,039 --> 00:05:31,919
requirements and make a reservation, the

155
00:05:31,919 --> 00:05:34,479
model may typically just start a deep

156
00:05:34,479 --> 00:05:36,160
research in the text browser to find

157
00:05:36,160 --> 00:05:39,039
some candidates, then switch to the GUI

158
00:05:39,039 --> 00:05:42,160
browser to view photos of food, uh check

159
00:05:42,160 --> 00:05:45,600
availability, and complete the booking.

160
00:05:45,600 --> 00:05:48,000
Similarly, for a creative task like

161
00:05:48,000 --> 00:05:50,160
creating an artifact, the model will

162
00:05:50,160 --> 00:05:51,680
first search online for public

163
00:05:51,680 --> 00:05:54,479
resources, then switch to the terminal

164
00:05:54,479 --> 00:05:57,039
to do some code editing to compile the

165
00:05:57,039 --> 00:05:59,919
artifact and finally verify the final

166
00:05:59,919 --> 00:06:02,960
outputs in the GUI browser. With this,

167
00:06:02,960 --> 00:06:05,600
we truly feel like we brought together

168
00:06:05,600 --> 00:06:08,240
the best of deep research and operator

169
00:06:08,240 --> 00:06:11,759
and added some extra sparkle.

170
00:06:11,759 --> 00:06:14,000
That's right. Yeah. So to put this

171
00:06:14,000 --> 00:06:15,520
project in context, I want to give a bit

172
00:06:15,520 --> 00:06:18,000
of history. So a few months ago, we

173
00:06:18,000 --> 00:06:20,960
shipped operator in January and this was

174
00:06:20,960 --> 00:06:23,120
our agent that lets you do online tasks

175
00:06:23,120 --> 00:06:25,759
like book reservations and um send

176
00:06:25,759 --> 00:06:27,840
emails and then two weeks later we

177
00:06:27,840 --> 00:06:29,919
shipped deep research and deep research

178
00:06:29,919 --> 00:06:31,919
is a tool that lets you do in-depth

179
00:06:31,919 --> 00:06:35,759
internet research and output high-quality

180
00:06:35,759 --> 00:06:39,280
um um research reports. And after launch

181
00:06:39,280 --> 00:06:41,039
we realized that actually these two

182
00:06:41,039 --> 00:06:42,319
approaches are actually deeply

183
00:06:42,319 --> 00:06:44,160
complementary.

184
00:06:44,160 --> 00:06:46,400
Um for example operator has some trouble

185
00:06:46,400 --> 00:06:48,720
reading super long articles. Um it has

186
00:06:48,720 --> 00:06:50,400
to scroll. It takes a long time. But

187
00:06:50,400 --> 00:06:51,759
that's something that deep research is

188
00:06:51,759 --> 00:06:56,240
good at. Conversely operator uh uh deep

189
00:06:56,240 --> 00:06:58,240
research isn't as good at interacting

190
00:06:58,240 --> 00:07:00,319
with web pages interactive elements

191
00:07:00,319 --> 00:07:03,199
visual uh highly visual web pages but

192
00:07:03,199 --> 00:07:04,800
that's something that operator excels

193
00:07:04,800 --> 00:07:08,639
at. So uh yeah we felt these approaches

194
00:07:08,639 --> 00:07:11,120
were complementary, and then we were

195
00:07:11,120 --> 00:07:13,120
also looking at some customer feedback.

196
00:07:13,120 --> 00:07:14,880
So for example one of our most highly

197
00:07:14,880 --> 00:07:17,120
requested features for deep research was

198
00:07:17,120 --> 00:07:18,960
the ability to log into websites and

199
00:07:18,960 --> 00:07:20,960
access authenticated sources. That's

200
00:07:20,960 --> 00:07:22,880
something that operator can do.

201
00:07:22,880 --> 00:07:24,000
I've been waiting for that for a long

202
00:07:24,000 --> 00:07:24,560
time.

203
00:07:24,560 --> 00:07:26,160
Yeah.

204
00:07:26,160 --> 00:07:28,479
Um another thing is that we were looking

205
00:07:28,479 --> 00:07:29,840
at the prompts that people were trying

206
00:07:29,840 --> 00:07:31,520
for operator and we saw that they were

207
00:07:31,520 --> 00:07:32,880
actually more deep research type

208
00:07:32,880 --> 00:07:35,199
prompts. For example, plan a trip and

209
00:07:35,199 --> 00:07:38,240
then book it. And so, yeah, we we really

210
00:07:38,240 --> 00:07:39,360
feel like we're bringing the best of

211
00:07:39,360 --> 00:07:41,440
both worlds here. And on a personal

212
00:07:41,440 --> 00:07:42,800
note, we've all been friends for a

213
00:07:42,800 --> 00:07:44,160
while, and it's really exciting to be

214
00:07:44,160 --> 00:07:46,479
working together. So, speaking of

215
00:07:46,479 --> 00:07:48,960
matches made in heaven, how is the

216
00:07:48,960 --> 00:07:50,319
wedding planning going?

217
00:07:50,319 --> 00:07:51,759
It's amazing to watch. This is an

218
00:07:51,759 --> 00:07:53,599
example of a task I hate doing. This can

219
00:07:53,599 --> 00:07:55,520
like ruin like, you know, multiple hours

220
00:07:55,520 --> 00:07:56,960
for me as I get sucked into these rabbit

221
00:07:56,960 --> 00:07:58,160
holes. So, just watching this as you

222
00:07:58,160 --> 00:07:59,520
guys have been talking click through

223
00:07:59,520 --> 00:08:01,199
this and just like do the whole thing is

224
00:08:01,199 --> 00:08:03,360
really quite remarkable. Yeah, totally.

225
00:08:03,360 --> 00:08:06,560
Um, looks like it started off by

226
00:08:06,560 --> 00:08:08,560
figuring out the weather. One of the

227
00:08:08,560 --> 00:08:11,280
cool features, um, is that, you know, as

228
00:08:11,280 --> 00:08:12,560
some of these tasks may take a little

229
00:08:12,560 --> 00:08:14,160
bit longer, you can just go back and see

230
00:08:14,160 --> 00:08:15,759
what it was doing. So, that's what we're

231
00:08:15,759 --> 00:08:17,199
exactly going to do. Looks like it went

232
00:08:17,199 --> 00:08:18,720
through the website to use the text

233
00:08:18,720 --> 00:08:21,039
browser. Interestingly, for that, now

234
00:08:21,039 --> 00:08:22,400
it's looking through the suits for

235
00:08:22,400 --> 00:08:23,919
Edward. I think it'll find something

236
00:08:23,919 --> 00:08:25,360
good. Here you can see it switched over

237
00:08:25,360 --> 00:08:27,199
to actually a visual browser to make

238
00:08:27,199 --> 00:08:28,960
sure the suit will look really good on

239
00:08:28,960 --> 00:08:31,280
Edward.

240
00:08:31,280 --> 00:08:34,560
And now looks like yeah, it's got

241
00:08:34,560 --> 00:08:36,880
chugging along, figuring out what to do.

242
00:08:36,880 --> 00:08:39,599
Um, and still on suits and now probably

243
00:08:39,599 --> 00:08:41,919
getting to the gifts section. Um, okay,

244
00:08:41,919 --> 00:08:43,279
cool. So, this is going to take a while.

245
00:08:43,279 --> 00:08:44,959
As Sam said, these tasks sometimes can

246
00:08:44,959 --> 00:08:46,160
take a long time. So, it's going to

247
00:08:46,160 --> 00:08:47,680
continue doing hopefully much faster

248
00:08:47,680 --> 00:08:49,760
than we will do. Um, should we do

249
00:08:49,760 --> 00:08:51,600
something else while it's doing it? I

250
00:08:51,600 --> 00:08:53,519
think the team really wanted the um

251
00:08:53,519 --> 00:08:55,279
stickers, some stickers for the for the

252
00:08:55,279 --> 00:08:56,480
launch. Should we do that?

253
00:08:56,480 --> 00:08:57,279
Yeah, cool.

254
00:08:57,279 --> 00:08:59,040
All right. So, we have a team mascot,

255
00:08:59,040 --> 00:09:00,320
which is one of our colleagues, Bunny

256
00