**Do not use NLTK version 3.6.6!**

# NLTK word_tokenize throws IndexError: list index out of range

I am working on some NLP experiments in which I want to tokenize texts from users. I am currently using NLTK for this, but I noticed unexpected behavior when tokenizing a raw user-input string. I am not sure whether this is a bug in NLTK or whether I should supply a pre-processed string. I previously had no problems using NLTK to tokenize pre-processed datasets, but with raw user input I run into problems. Can you explain the problem, or give me a hint on how to pre-process my user input before applying NLTK's `word_tokenize`?

I have prepared a minimal reproducible example. I set up a conda environment with Python 3.9 and `nltk==3.6.6`:

```bash
conda create -n "example_nltk" python=3.9 -y
conda activate example_nltk
pip install nltk==3.6.6
```

Then I create and run the following Python file:

```python
import nltk
from nltk import word_tokenize

nltk.download("punkt")
nltk.download("wordnet")
nltk.download("omw-1.4")

text = "? so ein schwachsinn! rot für: dummes post. salzburg gewinnt öfb-cup gegen rapid"
word_tokenize(text, language="german")
```

The script throws an `IndexError: list index out of range` when it runs `word_tokenize(text, language='german')`. The error occurs in `punkt.py`, in the function `_match_potential_end_contexts`, at the line `before_words[match] = split[-1]`, because the variable `split` is an empty list (`[]`).

Do you have a suggestion for how to proceed? Am I doing something wrong? Should I process the raw user input before supplying it to NLTK's `word_tokenize`? Thank you for your support!

Here is the full traceback for details:

```
Traceback (most recent call last):
  File "nltk_test.py", line 8, in <module>
    word_tokenize(text, language='german')
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize
    for sentence in slices:
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
    for sentence1, sentence2 in _pair_iter(slices):
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
    prev = next(iterator)
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
    for match, context in self._match_potential_end_contexts(text):
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
    before_words[match] = split[-1]
IndexError: list index out of range
```
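For context (not part of the original question): the traceback points at an optimization in the Punkt tokenizer that shipped with NLTK 3.6.6. `_match_potential_end_contexts` takes `split[-1]` of the text preceding each potential sentence boundary; when the boundary candidate is the very first character of the string, as with the leading `?` here, there is no text before it, `split` is `[]`, and the indexing fails. The regression was reportedly fixed in NLTK 3.6.7, so upgrading (`pip install -U nltk`) is the cleanest fix. If you are pinned to 3.6.6, here is a minimal sketch of two workarounds; `safe_word_tokenize` is a hypothetical helper name I'm introducing for illustration, and the stripping approach assumes you can afford to lose leading sentence-final punctuation:

```python
import nltk
from nltk import word_tokenize

nltk.download("punkt")

def safe_word_tokenize(text: str, language: str = "german") -> list[str]:
    """Hypothetical workaround helper for the NLTK 3.6.6 Punkt crash.

    The IndexError fires when a potential sentence-ending character
    (e.g. the leading '?') has no words before it, so `split` is []
    and `split[-1]` fails. Stripping leading end-of-sentence
    punctuation and whitespace avoids that code path, at the cost of
    dropping those characters from the output.
    """
    cleaned = text.lstrip(".?! \t\r\n")
    if not cleaned:
        return []
    return word_tokenize(cleaned, language=language)

text = "? so ein schwachsinn! rot für: dummes post. salzburg gewinnt öfb-cup gegen rapid"

# Option 1: strip the leading punctuation before tokenizing.
print(safe_word_tokenize(text))

# Option 2: skip sentence splitting entirely. With preserve_line=True,
# word_tokenize treats the input as a single sentence, so the broken
# Punkt code path is never reached and the leading '?' is kept.
print(word_tokenize(text, language="german", preserve_line=True))
```

Of the two, `preserve_line=True` is the less invasive choice for short user inputs, since it keeps the `?` token; it only changes behavior for multi-sentence inputs, which are then no longer split into sentences before tokenization.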