Benchmarking Natural Language Understanding Systems

Note to the reader: a dynamic version of this article, including interactive data visualisations, can be found HERE.

Over the past few years, natural language interfaces have been transforming the way we interact with technology. Voice assistants in particular have had strong adoption in cases where it’s simpler to speak than write or use a complex user interface. This becomes particularly relevant for IoT, where devices often don’t have touchscreens, making voice a natural interaction mechanism. And since speaking doesn’t require adaptation (unlike learning to use a new app or device), we can hope for a larger adoption of technology across all age groups.

Building a robust AI assistant, however, is still rather complicated. A number of companies are addressing this issue by providing natural language understanding (NLU) solutions that developers can use to augment their product with natural language capabilities.

In this article, we present the results of a benchmark conducted across multiple major providers and startups: Amazon’s Alexa, Google’s Api.ai, Microsoft’s Luis, Apple’s SiriKit, IBM’s Watson, Facebook’s Wit, and Snips (disclaimer: we work at Snips).

Beyond performance, the ability to operate offline can be a game-changer for use cases like connected cars, where internet connectivity cannot be guaranteed. Privacy is also mission critical in areas such as healthcare or banking, and is generally a growing concern for consumers, so sending data to the cloud might not always be an option. Finally, some systems have pre-built models for specific application domains, while others enable custom intents to be built.

Here is a table comparing the main existing solutions on each one of these aspects:

Overview of existing NLU solutions

In order to guarantee privacy by design, the solution we built at Snips can run fully on device, without the need for a server. Snips does not offer hosting services and thus never sees any user data. As such, models had to be trained without user data, leading to a natural question around the tradeoff of privacy vs. AI performance.

The benchmark relies on a set of 328 queries built by the business team at Snips and kept secret from data scientists and engineers throughout the development of the solution. The dataset, as well as the raw performance metrics and benchmark results, is openly accessible on GitHub.

Built-in intents are compared to similar domains across providers. Providers not offering built-in intents were not included in this benchmark: the performance of such services depends on the effort developers put into training their custom intents, which makes comparisons methodologically more complicated. In the end, we tested the following services: Snips (10 intents), Google Api.ai (6 intents), Amazon Alexa (2 intents), Microsoft Luis (3 intents) and Apple Siri (1 intent).

Understanding natural language queries

All commercially available natural language understanding (NLU) services currently work in a similar way:

  • The natural language query is sent to the service (e.g. “where is the closest sushi”)
  • The intention is detected (e.g. “place search”)
  • The specific parameters are extracted (e.g. type=sushi, location=42.78,2.38)
  • A structured description of the user query is sent back (a sketch of such a response follows this list)
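
To make this concrete, here is a minimal sketch of what such a structured description could look like for the example above. The field names and payload layout are purely illustrative, and do not correspond to the exact format of any particular provider:

```python
# Illustrative structured result for the query "where is the closest sushi".
# Field names and layout are hypothetical, not any provider's actual payload.
nlu_response = {
    "query": "where is the closest sushi",
    "intent": {
        "name": "place_search",   # detected intention
        "confidence": 0.92,       # classifier confidence score
    },
    "slots": {                    # extracted parameters
        "place_type": "sushi",
        "location": {"latitude": 42.78, "longitude": 2.38},
    },
}
```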

These steps each introduce uncertainty and require different models to be trained.

Step 1: Detecting user intentions

Since every provider has its own way of describing intents (called an ontology), we first had to build a baseline by creating a joint ontology across services. Different domains were identified, and similar intents were grouped together in the performance estimates.
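
In practice, such a joint ontology can be thought of as a simple mapping from each provider's intent names to a shared label. The sketch below only illustrates the idea: the provider-specific intent names are hypothetical, and this is not the actual mapping used in the benchmark:

```python
# Illustrative joint ontology: map (provider, provider intent name) pairs to a
# shared intent label. The provider intent names here are hypothetical examples.
JOINT_ONTOLOGY = {
    ("snips", "SearchPlace"): "place_search",
    ("api.ai", "places.search"): "place_search",
    ("alexa", "LocalSearchIntent"): "place_search",
    ("snips", "BookTaxi"): "request_ride",
    ("api.ai", "taxi.search"): "request_ride",
}

def to_joint_intent(provider: str, intent_name: str) -> str:
    """Map a provider-specific intent to its joint-ontology label."""
    return JOINT_ONTOLOGY.get((provider, intent_name), "unknown")
```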

The metric we are providing is the fraction of properly classified queries for each intent, along with its 95% confidence interval. The latter enables more robust conclusions, given that the size of the test dataset is limited.
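
For reference, here is a minimal sketch of how such a metric can be computed, using the normal approximation to the binomial proportion for the 95% confidence interval. This is one standard way of doing it; the exact interval computation used in the benchmark may differ:

```python
import math

def accuracy_with_ci(n_correct: int, n_total: int, z: float = 1.96):
    """Fraction of properly classified queries, with a ~95% confidence interval.

    Uses the normal (Wald) approximation to the binomial proportion; the
    benchmark's exact interval computation may differ.
    """
    p = n_correct / n_total
    half_width = z * math.sqrt(p * (1 - p) / n_total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Example: 52 correctly classified queries out of 60 for a given intent.
acc, low, high = accuracy_with_ci(52, 60)
print(f"accuracy = {acc:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```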

Intent classification accuracy

These results suggest that intent classification remains a difficult problem, on which competing solutions perform quite differently. Luis is significantly outperformed by competing solutions. Api.ai and Snips generally yield indistinguishable performances, aside from place search and request ride, for which Api.ai's performance is significantly lower than its competitors'. Alexa, Siri and Snips systematically rank among the best solutions on the intents they were tested on.

Step 2: Identifying the parameters of the query

Aggregated performances

The second step of the NLU process is to extract the parameters related to the detected intent, a process commonly called “slot filling”.

For example, when booking a taxi, these parameters would be: origin, destination and party size (how many people would be part of the trip). Again, each service has defined its own system of intents, and each intent comes with its specific set of parameters. The intents covered in this benchmark have between 1 and 18 parameters (slots).

In the graph below, the dots represent the aggregated precision and recall scores of all slots for a given intent. Recall represents the fraction of relevant slots that are retrieved. Precision accounts for the fraction of retrieved slots that are relevant.

Precision and recall for the slot filling task, more here
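
To make these two metrics concrete, here is a minimal sketch of how slot-level precision and recall can be computed for a single query, treating slots as (name, value) pairs and counting a slot as correct only on an exact value match. This is a simplification; the matching rules actually used in the benchmark may differ:

```python
def slot_precision_recall(expected: dict, predicted: dict):
    """Precision and recall over (slot name, slot value) pairs for one query."""
    expected_pairs = set(expected.items())
    predicted_pairs = set(predicted.items())
    true_positives = len(expected_pairs & predicted_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 1.0
    recall = true_positives / len(expected_pairs) if expected_pairs else 1.0
    return precision, recall

# Example: a taxi booking query with three expected slots, where the service
# gets the origin right, returns a slightly different destination value and
# misses the party size entirely.
expected = {"origin": "home", "destination": "airport", "party_size": "3"}
predicted = {"origin": "home", "destination": "the airport"}
print(slot_precision_recall(expected, predicted))  # (0.5, 0.333...)
```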

These results show that all solutions behave similarly in terms of precision: for each of them, average precision per intent spans between 50% and 90%. This suggests that all solutions have room for improvement, and carry a certain risk of misinterpreting user queries.

In terms of recall, however, significant differences can be identified between solutions. Luis is once again outperformed by the other solutions: its average recall per intent is systematically below 14%. Performances for Api.ai are highly variable, spanning between 0% (taxi.search) and 80% (weather.search). The highest levels of recall are observed for Alexa (42% and 82%), Siri (61%) and Snips (between 57% and 97%).

Diving into the raw data

Global performances give an indication of how different services perform relative to each other. To get a better feel for how acceptable these solutions are from an end user perspective, we can dive deeper into the raw data. Navigating from query to query below shows what each service would return.

Explore the benchmark dataset

Limits and key takeaways

A difficult task

This benchmark shows that there is currently no silver bullet. More precisely, there is no solution that doesn’t misinterpret user queries, and no solution that understands every query.

Unlike in some other areas of Artificial Intelligence, machines haven't yet reached the level of human performance when it comes to NLU.

What this benchmark also shows is how much such tests and methodologies are needed: there are important differences between the solutions available to developers.

As the saying goes, you can’t optimize what you don’t measure. At least we now know that some solutions lack robustness when it comes to variations in how things are asked, while all of them would benefit from improvement in how they fill slots.

Another good surprise (for us!) is that our solution, Snips, compares favorably with the others. It achieves outcomes comparable to Alexa or Siri, while running fully on device and without access to the vast amounts of training data others have. Hopefully this starts to show that AI performance and privacy can co-exist, and hence should be the default.

There is still much to be done

The data used for this benchmark is publicly available on GitHub. It has been designed to cover a wide variety of cases, formulations and levels of complexity, in order to make the metrics as representative as possible. Still, the choices remain manual and somewhat arbitrary, which is why we are transparently making the raw data public. In addition, the supervision of results cannot always be automated, resulting in a significant amount of work. This also explains why our test dataset has a limited size, and why, every time it made sense, we added the corresponding 95% confidence intervals.

Any contribution in extending the benchmark to a wider set of queries is of course welcome, and would contribute to bringing greater transparency to the NLU industry. Feel free to contact us should you have any questions / comments.

Original article: https://medium.com/snips-ai/benchmarking-natural-language-understanding-systems-d35be6ce568d

### Best Practices, Tools and Methods for Performance Benchmarking

When evaluating the performance of systems, code, databases or other IT components, benchmarking is an important way to measure performance, identify bottlenecks and find directions for optimisation. To ensure that results are repeatable and comparable, a set of best practices should be followed, together with suitable tools and methods.

#### Controlling the test environment

Before benchmarking, the test environment must be as stable and isolated as possible. Tests should run on identical hardware configurations, operating system versions, network conditions and load, to limit the influence of external factors on the results. During the test, other heavy workloads should be avoided so that resource contention does not distort the measurements[^1].

#### Choosing test metrics

Choosing the right performance metrics is key to evaluating a system. Common metrics include:

- **Throughput**: the number of requests processed per unit of time.
- **Latency**: the time from issuing a request to its completion, usually reported as average, median and P95/P99 latency.
- **Resource utilisation**: CPU, memory, disk I/O and network bandwidth usage.

In database performance testing, TPC-C (transactions per minute) is commonly used as the standard for measuring OLTP system performance[^1].

#### Common benchmarking tools

Different tools can be chosen depending on what is being tested:

- **System-level performance**: `sysbench`, `Geekbench` and `UnixBench` can be used to evaluate CPU, memory and disk performance.
- **Database performance**: `TPC-C`, `TPC-H` and `HammerDB` are common database benchmarking tools.
- **Web service performance**: `JMeter`, `Locust` and `Gatling` support high-concurrency request simulation, suitable for load testing REST APIs and web applications.
- **Machine learning inference performance**: `MLPerf` is an open benchmark suite widely used to evaluate deep learning models in both training and inference.

Example: a code snippet using Locust for web performance testing:

```python
from locust import HttpUser, task

class WebsiteUser(HttpUser):
    # Each simulated user repeatedly requests the home page.
    @task
    def index_page(self):
        self.client.get("/")
```

#### Test methods and strategies

Benchmarks should follow a scientific methodology so that results are repeatable and comparable. Common approaches include:

- **Single runs vs. averaging multiple runs**: running the test several times and averaging the results reduces the impact of random fluctuations.
- **Gradually increasing the load**: progressively increasing the number of concurrent users or the request rate reveals how performance evolves and where the bottlenecks are.
- **Mixed-workload testing**: simulating the mix of request types seen in real scenarios to evaluate the system under complex load.

#### Performance tuning and analysis

Once benchmark results are available, profiling tools (such as `perf`, `Valgrind`, `Intel VTune` or `NVIDIA Nsight`) should be used to analyse the bottlenecks in depth. These tools help identify hot functions, memory allocation issues and hardware resource bottlenecks, providing a basis for subsequent optimisation[^1].