Embeddings
What are embeddings?
OpenAI’s text embeddings measure the relatedness of text strings. Embeddings are commonly used for:
- Search (where results are ranked by relevance to a query string)
- Clustering (where text strings are grouped by similarity)
- Recommendations (where items with related text strings are recommended)
- Anomaly detection (where outliers with little relatedness are identified)
- Diversity measurement (where similarity distributions are analyzed)
- Classification (where text strings are classified by their most similar label)
An embedding is a vector (list) of floating-point numbers. The distance between two vectors measures their relatedness: small distances suggest high relatedness, and large distances suggest low relatedness.
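To make the idea of distance concrete, here is a minimal sketch that ranks two toy vectors against a query. The text above does not name a distance function; cosine distance is assumed here as one common choice, and the three-dimensional vectors and the names `docs`, `close`, and `far` are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Smaller values mean the vectors point in more similar directions."""
    return 1.0 - cosine_similarity(a, b)

# Toy vectors standing in for real embeddings (assumption: real ones are much longer).
query = [1.0, 0.0, 0.0]
docs = {
    "close": [0.9, 0.1, 0.0],
    "far": [0.0, 0.0, 1.0],
}

# Rank document names from most related (smallest distance) to least related.
ranked = sorted(docs, key=lambda name: cosine_distance(query, docs[name]))
print(ranked)
```

The same ranking step is what a search use case would perform, with embeddings of the query and of each candidate text in place of the toy vectors.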
Visit our pricing page to learn about embeddings pricing. Requests are billed based on the number of tokens in the input sent.
**To see embeddings in action, check out our code samples:**
- Classification
- Topic clustering
- Search
- Recommendations
How to get embeddings
To get an embedding, send your text string to the embeddings API endpoint along with a choice of embedding model ID (e.g., text-embedding-ada-002). The response will contain an embedding, which you can extract, save, and use.
Example requests:
Example: Getting embeddings
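```python
import openai  # pre-1.0 `openai` package interface; assumes an API key is already configured

# Request an embedding for a single text string.
response = openai.Embedding.create(
    input="Your text string goes here",
    model="text-embedding-ada-002"
)
# The embedding vector is the list of floats under data[0].embedding.
embeddings = response['data'][0]['embedding']
```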
Example response:
```json
{
  "data": [
    {
      "embedding": [
        -0.006929283495992422,
        -0.005336422007530928,
        ...
        -4.547132266452536e-05,
        -0.024047505110502243
      ],
      "index": 0,
      "object": "embedding"
    }
  ],
  "model": "text-embedding-ada-002",
  "object": "list",
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}
```
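As a rough illustration of how the fields in this response fit together, the sketch below parses a shortened copy of it with the standard `json` module. The two-value embedding is an abbreviation made for this example (the elided values are not filled in), and the variable names are hypothetical.

```python
import json

# Shortened copy of the example response; a real embedding holds many more values.
raw = """
{
  "data": [
    {
      "embedding": [-0.006929283495992422, -0.005336422007530928],
      "index": 0,
      "object": "embedding"
    }
  ],
  "model": "text-embedding-ada-002",
  "object": "list",
  "usage": {"prompt_tokens": 5, "total_tokens": 5}
}
"""

response = json.loads(raw)

# Map each returned item's index to its vector.
by_index = {item["index"]: item["embedding"] for item in response["data"]}
embedding = by_index[0]

# Token count reported for the request.
total_tokens = response["usage"]["total_tokens"]

print(len(embedding), total_tokens)
```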
See more Python code examples in the OpenAI Cookbook.
When using OpenAI embeddings, please keep in mind their limitations and risks.
Embedding models
OpenAI offers one second-generation embedding model (denoted by -002 in the model ID) and 16 first-generation models (denoted by -001 in the model ID).
We recommend using text-embedding-ada-002 for nearly all use cases. It’s better, cheaper, and simpler to use. Read the blog post announcement.
| MODEL GENERATION | TOKENIZER | MAX INPUT TOKENS | KNOWLEDGE CUTOFF |
|---|---|---|---|
| V2 | cl100k_base | 8191 | Sep 2021 |
| V1 | GPT-2/GPT-3 | 2046 | Aug 2020 |
Usage is priced per input token, at a rate of $0.0004 per 1000 tokens, or roughly 3,000 pages per US dollar (assuming ~800 tokens per page):
| MODEL | ROUGH PAGES PER DOLLAR | EXAMPLE PERFORMANCE ON BEIR SEARCH EVAL |
|---|---|---|
| text-embedding-ada-002 | 3000 | 53.9 |
| \*-davinci-\*-001 | 6 | 52.8 |
| \*-curie-\*-001 | 60 | 50.9 |
| \*-babbage-\*-001 | 240 | 50.4 |
| \*-ada-\*-001 | 300 | 49.0 |
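The pages-per-dollar figure for text-embedding-ada-002 can be checked with a few lines of arithmetic. This sketch uses only the rate and page size quoted above ($0.0004 per 1000 tokens, ~800 tokens per page); the function and constant names are hypothetical.

```python
PRICE_PER_1K_TOKENS_USD = 0.0004  # rate quoted above for text-embedding-ada-002
TOKENS_PER_PAGE = 800             # page size assumed in the text

def embedding_cost_usd(num_tokens, price_per_1k=PRICE_PER_1K_TOKENS_USD):
    """Cost in US dollars of embedding `num_tokens` input tokens."""
    return num_tokens / 1000 * price_per_1k

cost_per_page = embedding_cost_usd(TOKENS_PER_PAGE)
pages_per_dollar = 1 / cost_per_page

print(f"cost per page: ${cost_per_page:.5f}")
print(f"pages per dollar: {pages_per_dollar:.0f}")
```

Under these assumptions the result is about 3,125 pages per dollar, which the table rounds to 3000.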
| MODEL NAME | TOKENIZER | MAX INPUT TOKENS | OUTPUT DIMENSIONS |
|---|---|---|---|
| text-embedding-ada-002 | cl100k_base | 8191 | 1536 |
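Because input is capped at 8191 tokens, longer texts need to be split before they are sent. The sketch below shows one simple way to do that, assuming the text has already been tokenized elsewhere (for example with a cl100k_base tokenizer); `split_into_chunks` is a hypothetical helper, and the integer list merely stands in for real token IDs. How the per-chunk embeddings are then combined is not covered by the text and is left out here.

```python
MAX_INPUT_TOKENS = 8191  # limit listed in the table for text-embedding-ada-002

def split_into_chunks(tokens, max_tokens=MAX_INPUT_TOKENS):
    """Split a token list into consecutive chunks of at most `max_tokens` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

# Stand-in for token IDs produced by a tokenizer (assumption: 20,000 tokens of input).
tokens = list(range(20000))
chunks = split_into_chunks(tokens)

print([len(chunk) for chunk in chunks])
```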
First-generation models (not recommended)
