Abstract:
This paper introduces medBERT.de, a pre-trained German BERT model designed specifically for the German medical domain. The model was trained on a large corpus of 4.7 million German medical documents and achieves state-of-the-art performance on eight different medical benchmarks covering a wide range of disciplines and medical document types. We examine the impact of data deduplication on model performance, as well as the potential benefits of more efficient tokenization. Our results indicate that domain-specific models such as medBERT.de are particularly useful for longer texts, and that deduplication of the training data does not necessarily lead to performance improvements. Furthermore, we find that efficient tokenization plays only a minor role in improving model performance, and we attribute most of the gains to the large amount of training data. To encourage further research, the pre-trained model weights and new benchmarks based on radiological data have been made publicly available to the scientific community.
Limitations (which limit their comparability):
- limited training data
- narrow focus
- unrepresentative benchmarks
Private Medical Benchmarks:
For the surgery reports, we assigned as labels all OPS codes of the same patient that matched the date of the text document, and we restricted the codes to the surgery chapter of the OPS system. For the discharge summaries, we assigned all codes of the patient as labels (diagnoses as ICD-10 codes in one task, procedures as OPS codes in another). For each of these tasks, we included the most frequent codes as labels, such that the test set contained at least 10 examples for each label.
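As a rough illustration of this multi-label setup, the sketch below builds label sets from two hypothetical tables of reports and coded events; the column names, the pandas-based pipeline, and the simplified frequency filter are assumptions for illustration, not the authors' actual preprocessing.

```python
from collections import Counter

import pandas as pd


# Hypothetical inputs: `reports` with columns (patient_id, date, text) and
# `codes` with columns (patient_id, date, ops_code). Column names are
# illustrative; the actual preprocessing pipeline is not reproduced here.
def build_surgery_labels(reports: pd.DataFrame, codes: pd.DataFrame,
                         min_examples: int = 10) -> pd.DataFrame:
    # Attach all OPS codes of the same patient that match the report date.
    merged = reports.merge(codes, on=["patient_id", "date"], how="left")
    labelled = (merged.groupby(["patient_id", "date", "text"])["ops_code"]
                      .apply(lambda s: sorted(set(s.dropna())))
                      .reset_index(name="labels"))

    # Keep only the most frequent codes. As a simplification, we filter on the
    # overall count; the paper's criterion is >= 10 examples in the test split.
    counts = Counter(code for labels in labelled["labels"] for code in labels)
    frequent = {code for code, n in counts.items() if n >= min_examples}
    labelled["labels"] = labelled["labels"].apply(
        lambda labels: [c for c in labels if c in frequent])
    return labelled[labelled["labels"].map(len) > 0]
```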
Deduplication
Radiology reports are often written in a semi-structured form with very similar sentences. Because of this repetition, the information content of many documents is lower, in terms of semantic concepts, than that of the other data sources we use. Language models tend to overfit quickly on such data-inherent properties. A common strategy to counteract this behavior is to deduplicate the pre-training data. We therefore encode all reports as bag-of-words representations and measure the pairwise cosine similarity between them, keeping only documents for which no other document has a similarity greater than 0.75.
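A minimal sketch of this filtering step using scikit-learn is shown below; the greedy keep/drop rule and the dense pairwise similarity matrix are simplifications for illustration (a dense matrix would not scale to the full report corpus).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def deduplicate_reports(reports: list[str], threshold: float = 0.75) -> list[str]:
    """Greedily keep a report only if no already-kept report exceeds the
    bag-of-words cosine similarity threshold."""
    bows = CountVectorizer().fit_transform(reports)  # sparse bag-of-words matrix
    sims = cosine_similarity(bows)                   # dense pairwise similarity matrix
    np.fill_diagonal(sims, 0.0)                      # ignore self-similarity

    kept: list[int] = []
    for i in range(len(reports)):
        if all(sims[i, j] <= threshold for j in kept):
            kept.append(i)
    return [reports[i] for i in kept]


reports = [
    "Keine Auffaelligkeiten im Roentgen-Thorax.",
    "Keine Auffaelligkeiten im Roentgen-Thorax beidseits.",
    "Frische Fraktur des distalen Radius links.",
]
print(deduplicate_reports(reports))  # drops the near-duplicate second report
```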
Hyperparameters/Pretraining Details
We pre-train the model using the LAMB optimizer.
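A minimal sketch of such a setup with the torch_optimizer package is shown below; the learning rate, weight decay, and vocabulary size are placeholders, not the hyperparameters used for medBERT.de.

```python
import torch_optimizer
from transformers import BertConfig, BertForMaskedLM

# Placeholder configuration and hyperparameters for illustration only.
model = BertForMaskedLM(BertConfig(vocab_size=30_000))
optimizer = torch_optimizer.Lamb(
    model.parameters(),
    lr=1e-3,            # illustrative learning rate
    weight_decay=0.01,  # illustrative weight decay
)
```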
We remove very rare Unicode characters that appear fewer than three times from our pre-training data. This allows the tokenizer vocabulary to contain more specific sub-words and removes unnecessary tokens that would otherwise increase the memory footprint. In addition, we set the minimum number of occurrences required for a word to be included in the vocabulary to 20, to avoid including patient names that may have been missed during anonymization.
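One way to approximate these vocabulary constraints with the Hugging Face tokenizers library is sketched below; the vocabulary size and file path are placeholders, and min_frequency is only an approximation of the per-word occurrence threshold described above.

```python
from tokenizers import BertWordPieceTokenizer

# Train a cased WordPiece tokenizer on the cleaned pre-training corpus.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=["pretraining_corpus.txt"],  # cleaned corpus (rare Unicode characters removed)
    vocab_size=30_000,                 # illustrative vocabulary size
    min_frequency=20,                  # entries seen fewer than 20 times are excluded
)
tokenizer.save_model(".")              # writes vocab.txt for use with BERT
```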
Discussion
In this work, we trained a domain-specific German BERT model on a large dataset of German medical text from articles, papers, and electronic health records. We fine-tuned the model on multiple medical benchmarks and found that it outperforms general-domain language models as well as other medical-domain models.
The results highlight the advantages of using a domain-specific language model for German medical language and also underline the importance of a large training corpus.
The results further suggest that the data used to pre-train or fine-tune domain-specific models should not consist exclusively of specialized language, as this may impair the model's performance on general tasks.
We observed a substantial improvement in performance on the OPS and ICD code classification tasks compared with all other domain-specific and general-domain models.
Fertility alone is not a predictive measure of a model's performance on specialized downstream tasks. Nevertheless, it is likely that the tokenizer's fertility (the average number of sub-word tokens produced per word) played a role in the model's performance on tasks involving longer texts, particularly on the clinical benchmarks based on discharge summaries and surgery reports.
Since the texts for these benchmarks were truncated to fit into 512 tokens, some information may have been lost in the process. A more efficient tokenizer may be able to encode more information, potentially improving the model’s performance on these tasks.
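To make this concrete, the following sketch estimates tokenizer fertility (sub-word tokens per whitespace-separated word) with the transformers library; a lower fertility means more words of a report fit into the 512-token window. The checkpoint name and sample text are illustrative only.

```python
from transformers import AutoTokenizer


def fertility(tokenizer, texts):
    """Average number of sub-word tokens produced per whitespace-separated word."""
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    return n_tokens / n_words


tok = AutoTokenizer.from_pretrained("bert-base-german-cased")  # example checkpoint
sample = ["Der Patient wurde komplikationslos am selben Tag entlassen."]
print(fertility(tok, sample))
```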
We found mixed effects of data deduplication. Although earlier work has shown that deduplication can be beneficial, the deduplicated version (medBERT.de dedup) did not yield consistent performance gains over our model (medBERT.de): on some benchmarks medBERT.de performed better than the deduplicated version, on others worse. This discrepancy may be due to the fact that our deduplication procedure was not extensive, as it was applied only to short reports. Moreover, deduplication was not applied to non-radiology texts, which may also contain duplicates.