

Why big data and compute are not necessarily the path to big materials science

  • Naohiro Fujinuma, Brian DeCost, Jason Hattrick-Simpers & Samuel E. Lofland

    Communications Materials, volume 3, Article number: 59 (2022)

Abstract

Applied machine learning has rapidly spread throughout the physical sciences. In fact, machine learning-based data analysis and experimental decision-making have become commonplace. Here, we reflect on the ongoing shift in the conversation from proving that machine learning can be used, to how to effectively implement it for advancing materials science. In particular, we advocate a shift from a big data and large-scale computations mentality to a model-oriented approach that prioritizes the use of machine learning to support the ecosystem of computational models and experimental measurements. We also recommend an open conversation about dataset bias to stabilize productive research through careful model interrogation and deliberate exploitation of known biases. Further, we encourage the community to develop machine learning methods that connect experiments with theoretical models to increase scientific understanding rather than incrementally optimizing materials. Moreover, we envision a future of radical materials innovations enabled by computational creativity tools combined with online visualization and analysis tools that support active outside-the-box thinking within the scientific knowledge feedback loop.

Introduction

Since Frank Rosenblatt created the Perceptron to play checkers1, machine learning (ML) applications have been used to emulate human intelligence. The field has grown immensely with the advent of ever more powerful and increasingly compact computers, combined with the development of robust statistical analyses. These advances allowed Deep Blue to beat Grandmaster Garry Kasparov in chess and Watson to win the game show Jeopardy! The technology has since progressed to more practical applications, such as advanced manufacturing and the common tasks we now expect from our phones, like image and speech recognition. The future of ML promises to obviate much of the tedium of everyday life by assuming responsibility for ever more complex processes, e.g., autonomous driving.

When it comes to scientific application, our perspective is that current ML methods are just another component of the scientific modeling toolbox, with a somewhat different profile of representational basis, parametrization, computational complexity, and data/sample efficiency. Fully embracing this view will help the materials and chemistry communities to overcome perceived limitations and at the same time evaluate and deploy these techniques with the same level of rigor and introspection as any physics-based modeling methodology. Toward this end, in this essay we identify four areas in which materials researchers can clarify our thinking to enable a vibrant and productive community of scientific ML practitioners:

Maintain perspective on resources required

The recent high profile successes in mainstream ML applications enabled by internet-scale data and massive computation2,3 have spurred two lines of discussion in the materials community that are worth examining more closely. The first is an unmediated and limiting preference for large-scale data and computation, under the assumption that successful ML is unrealistic for materials scientists with datasets that are orders of magnitude smaller than those at the forefront of the publicity surrounding deep learning. The second is a tendency to dismiss brute-force ML systems as unscientific. While there is some validity to both these viewpoints, there are opportunities in materials research for productive and creative ML work with small datasets and for the “go big or go home” brute-force approach.

Molehills of data (or compute) are sometimes better than mountains

A common sentiment in the contemporary deep-learning community is that the most reliable means of improving the performance of a deep-learning system is to amass ever larger datasets and apply raw computational power. This sometimes can encourage the fallacy that large-scale data and computation are fundamental requirements for success with ML methods. This can lead to needlessly deploying massively overparameterized models when simpler ones may be more appropriate4, and it limits the scope of applied ML research in materials by biasing the set of problems people are willing to consider addressing. There are many examples of productive, creative ML work with small datasets in materials research that counter this notion5,6.

In the small-data regime, high-quality data with informative features often trump excessive computational power applied to massive data with weakly correlated features. A promising approach is to exploit the bias-variance trade-off by performing more rigorous feature selection or crafting a more physically motivated model form7. Alternatively, it may be wise to reduce the scope of the ML task by restricting the material design space, or to use ML to solve a smaller chunk of the problem at hand. ML tools for exploratory analysis with appropriate features can help us comprehend much higher-dimensional spaces even at an early stage of the research, providing a bird’s-eye view of the target problem. For example, cluster analysis can help researchers identify representative groups in large high-throughput datasets, making the process of formulating hypotheses more tractable.
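As an illustration of the kind of exploratory clustering described above, the sketch below runs plain Lloyd’s-algorithm k-means on a toy two-descriptor dataset. The descriptor values and group structure are invented for the example; in practice one would reach for a library implementation such as scikit-learn’s `KMeans`.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: assign points to the nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        centroids = [
            tuple(sum(coord) / len(coord) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Toy "high-throughput" dataset: two hypothetical descriptor values per
# sample, drawn around two well-separated compositions.
rng = random.Random(1)
group_a = [(rng.gauss(1.0, 0.1), rng.gauss(2.0, 0.1)) for _ in range(30)]
group_b = [(rng.gauss(4.0, 0.1), rng.gauss(0.5, 0.1)) for _ in range(30)]
centroids, clusters = kmeans(group_a + group_b, k=2)
print("centroids:", centroids)
```

The recovered centroids sit near the two underlying group means, which is exactly the kind of bird’s-eye summary that makes hypothesis formulation more tractable.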

There are also specific ML disciplines aimed at addressing the well-known issues of small datasets, dataset bias, noise, incomplete featurization, and over-generalization, and there has been some effort to develop tools to address them. Data augmentation and other regularization strategies can allow even small datasets to be treated with large deep-learning models. Another common approach is transfer learning, where a proxy model is trained on a large dataset and adapted to a related task with fewer data points8,9,10. Chen et al.11 showed that multi-fidelity graph networks could use comparatively inexpensive low-fidelity calculations to bolster the accuracy of ML predictions for expensive high-fidelity calculations. Finally, active learning methods are now being explored in many areas of materials research, where surrogate models are initialized on small datasets and updated as predictions are used to guide the acquisition of new data, often in a manner that balances exploration with optimization12. Generally, a solid understanding of the uncertainty in the data is critical for success with these strategies, but ML systems can lead us to some insights or perhaps serve as a guide for optimization that might otherwise be intractable.
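A minimal sketch of such an active-learning loop is shown below. The surrogate is intentionally crude (inverse-distance interpolation with a distance-to-data uncertainty proxy standing in for a Gaussian process), and the acquisition function is an upper-confidence-bound-style sum of predicted mean and uncertainty; the objective function and all parameters are hypothetical.

```python
import math

def objective(x):
    """Stand-in for an expensive experiment; unknown to the algorithm."""
    return -(x - 0.7) ** 2

def surrogate(x, observed):
    """Toy surrogate: inverse-distance-weighted mean of observed values,
    with distance to the nearest measurement as an uncertainty proxy."""
    dists = [(abs(x - xi), yi) for xi, yi in observed]
    nearest = min(d for d, _ in dists)
    if nearest == 0.0:
        return next(yi for d, yi in dists if d == 0.0), 0.0
    weights = [(1.0 / d, yi) for d, yi in dists]
    mean = sum(w * yi for w, yi in weights) / sum(w for w, _ in weights)
    return mean, nearest

def acquisition(x, observed, kappa=2.0):
    mean, uncertainty = surrogate(x, observed)
    return mean + kappa * uncertainty  # exploitation + exploration

candidates = [i / 100 for i in range(101)]
observed = [(0.0, objective(0.0)), (1.0, objective(1.0))]  # seed measurements
for _ in range(15):
    x_next = max(candidates, key=lambda x: acquisition(x, observed))
    observed.append((x_next, objective(x_next)))

best_x, best_y = max(observed, key=lambda pair: pair[1])
print("best sampled x:", best_x)
```

Early rounds are dominated by the uncertainty term and explore broadly; as the design space fills in, the mean term takes over and sampling concentrates near the optimum.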

We assert that the materials community would generally benefit from taking a more model-oriented approach to applied ML, in contrast to the popular prediction-oriented approach that many method-development papers take. With the current prediction-oriented application of ML to the physical sciences, the primary intent of the model is to obtain property predictions, often for screening or optimization workflows. We propose that the community would be better served to instead use ML as a means to generate scientific understanding, using, for instance, inference techniques to quantify physical constants from experiments. To achieve the goals of scientific discovery and knowledge generation, predictive ML must often play a supporting role within a larger ecosystem of computational models and experimental measurements. It can be productive to reassess13 the predictive tasks we are striving to address with ML methods; more carefully thought out applications may provide more benefit than simply collecting larger datasets and training higher capacity models.
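As a toy example of the inference-oriented use of modeling described above, the sketch below recovers an activation energy from simulated Arrhenius rate data by ordinary least squares on the linearized model ln k = ln A − Ea/(R T). All “experimental” values are synthetic and purely illustrative.

```python
import math
import random

R = 8.314  # gas constant, J/(mol K)

# Simulated "experiment": Arrhenius rates with a known activation energy
# plus measurement noise (values invented for illustration).
true_Ea = 50_000.0  # J/mol
true_lnA = 20.0
rng = random.Random(0)
temps = [300 + 20 * i for i in range(10)]  # K
ln_k = [true_lnA - true_Ea / (R * T) + rng.gauss(0, 0.02) for T in temps]

# Ordinary least squares on the linearized model: slope = -Ea / R
xs = [1.0 / T for T in temps]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ln_k) / n
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ln_k)) / sum(
    (x - x_bar) ** 2 for x in xs
)
Ea_est = -slope * R
print("estimated Ea (J/mol):", round(Ea_est))
```

Here the model’s output of interest is not a property prediction but the physical constant itself, together with (in a fuller treatment) its uncertainty.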

Massive computation can be useful but is not everything

On the other hand, characterizing brute-force computation as “unscientific” can lead to missed opportunities to meaningfully accelerate and enable new kinds or scales of scientific inquiry14. Even without investment in massive datasets or specialized ML models, there is evidence that simply increasing the scale of computation applied can help compensate for small datasets. For example, ref. 15 shows that simply by increasing the number of training iterations, large object-detection and segmentation models trained from random initialization can match the performance of the conventional transfer-learning approach. In many cases, advances enabled in this way do not directly contribute to scientific discovery or development, but they absolutely change the landscape of feasible scientific research by lowering the barrier to exploration and increasing the scale and automation of data analysis.

A perennial challenge in organic chemistry is predicting the structure of proteins, but recent advances in learned potential methods16 have provided paradigm-shifting improvements in performance made possible by sheer computational power. In addition, massive computation can enable new scientific applications through scalable automated data analysis systems. Recent examples include phase identification in electron backscatter diffraction17 and X-ray diffraction18, and local structural analysis via extended x-ray absorption fine structure19,20. These ML systems leverage extensive precomputation through the generation of synthetic training data and training of models; this makes online data analysis possible, removing barriers to more adaptive experiments enabled by real-time decision making.
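The precompute-offline, classify-online pattern can be caricatured as follows: reference “patterns” for hypothetical phases are synthesized ahead of the experiment, so that identifying an incoming measurement reduces to a fast similarity lookup. Peak positions and the similarity measure below are invented for illustration; real systems train neural networks on large simulated libraries.

```python
import math

def pattern(peaks, n=200, width=4.0):
    """Synthesize a 1D 'diffraction pattern' as a sum of Gaussian peaks."""
    return [
        sum(math.exp(-((i - p) ** 2) / (2 * width ** 2)) for p in peaks)
        for i in range(n)
    ]

def correlation(a, b):
    """Cosine similarity between two patterns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Offline, before the experiment: precompute a reference pattern per
# hypothetical phase (the expensive step happens once, up front).
library = {
    "phase_A": pattern([40, 90, 150]),
    "phase_B": pattern([30, 110, 170]),
}

# Online, during the experiment: each new measurement is classified by a
# cheap lookup, fast enough to drive real-time decisions.
measured = pattern([40, 90, 150])  # noise-free for simplicity
best = max(library, key=lambda name: correlation(measured, library[name]))
print("identified:", best)
```

Because all the heavy computation is shifted offline, the per-measurement cost at the beamline or microscope is trivial, which is what makes adaptive, real-time experiments feasible.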

In light of the potential value of large-scale computation in advancing fundamental science, the materials field should make computational efficiency21 an evaluation criterion alongside accuracy and reproducibility22. Comparison of competing methods with equal computational budgets can provide insight into which methodological innovations actually contribute to improved performance (as opposed to simply boosting model capacity) and can provide context for the feasibility of various methods to be deployed as online data analysis tools. Careful design and interpretation of benchmark tasks and performance measures are needed for the community to avoid chasing arbitrary targets that do not meaningfully facilitate scientific discovery and development of novel and functional materials.
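An equal-budget comparison can be as simple as fixing the number of objective evaluations every method is allowed, as in this toy contrast of grid search and random search on an invented property surface. The landscape, budget, and seed are all illustrative; the point is only that the two methods consume identical compute.

```python
import random

def landscape(x, y):
    """Toy property surface to maximize (optimum at (0.3, 0.8))."""
    return -((x - 0.3) ** 2 + (y - 0.8) ** 2)

BUDGET = 64  # identical number of evaluations for every method

# Method 1: an 8 x 8 grid search (8 * 8 = BUDGET evaluations).
grid_best = max(
    landscape(i / 7, j / 7) for i in range(8) for j in range(8)
)

# Method 2: random search with exactly the same budget.
rng = random.Random(0)
rand_best = max(
    landscape(rng.random(), rng.random()) for _ in range(BUDGET)
)

print("grid best:", round(grid_best, 4), "| random best:", round(rand_best, 4))
```

With the budget held fixed, any difference in outcome reflects the search strategy itself rather than one method quietly consuming more compute, which is the kind of controlled comparison argued for above.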

Openly assess dataset bias

Acknowledging dataset bias

It is widely accepted that materials datasets are distinct from the datasets used to train and validate ML systems for more “mainstream” applications in a number of ways. While some of this is hyperbole, there are some genuine differences that have a large impact on the overall outlook for ML in materials research. For instance, there is a community-wide perception that all ML problems involve data on the scale of the classic image recognition and spam/ham problems. While there are over 140,000 labeled structures in the Materials Project Database23 and the MNIST
