**Do not use NLTK version 3.6.6!**

# NLTK word_tokenize throws IndexError: list index out of range

I am working on some NLP experiments in which I want to tokenize texts from users. I am currently using NLTK for this, but I noticed unexpected behavior when tokenizing a raw user-input string. I am not sure whether this is a bug in NLTK or whether I should supply a pre-processed string. I previously had no problems using NLTK to tokenize pre-processed datasets, but with raw user input I run into problems. Can you explain the problem, or give me a hint on how to pre-process my user input before applying NLTK's `word_tokenize`?

I have prepared a minimal reproducible example. I set up a conda environment with Python 3.9 and `nltk==3.6.6`:

```bash
conda create -n "example_nltk" python=3.9 -y
conda activate example_nltk
pip install nltk==3.6.6
```

Then I create and run the following Python file:

```python
import nltk
from nltk import word_tokenize

nltk.download("punkt")
nltk.download("wordnet")
nltk.download("omw-1.4")

text = "? so ein schwachsinn! rot für: dummes post. salzburg gewinnt öfb-cup gegen rapid"
word_tokenize(text, language="german")
```

The script throws an `IndexError: list index out of range` when it runs `word_tokenize(text, language='german')`. The error occurs in `punkt.py`, in the function `_match_potential_end_contexts`, at the line `before_words[match] = split[-1]`, because the variable `split` is an empty list (`[]`).

Do you have a suggestion for how to proceed? Am I doing something wrong? Should I process the raw user input before supplying it to NLTK's `word_tokenize`? Thank you for your support!

Here is the full traceback for details:

```
Traceback (most recent call last):
  File "nltk_test.py", line 8, in <module>
    word_tokenize(text, language='german')
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize
    for sentence in slices:
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
    for sentence1, sentence2 in _pair_iter(slices):
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
    prev = next(iterator)
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
    for match, context in self._match_potential_end_contexts(text):
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
    before_words[match] = split[-1]
IndexError: list index out of range
```
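For context (not part of the original question): the traceback points at an optimization in the Punkt tokenizer that shipped with NLTK 3.6.6. `_match_potential_end_contexts` takes `split[-1]` of the text preceding each potential sentence boundary; when the boundary candidate is the very first character of the string, as with the leading `?` here, there is no text before it, `split` is `[]`, and the indexing fails. The regression was reportedly fixed in NLTK 3.6.7, so upgrading (`pip install -U nltk`) is the cleanest fix. If you are pinned to 3.6.6, here is a minimal sketch of two workarounds; `safe_word_tokenize` is a hypothetical helper name I'm introducing for illustration, and the stripping approach assumes you can afford to lose leading sentence-final punctuation:

```python
import nltk
from nltk import word_tokenize

nltk.download("punkt")

def safe_word_tokenize(text: str, language: str = "german") -> list[str]:
    """Hypothetical workaround helper for the NLTK 3.6.6 Punkt crash.

    The IndexError fires when a potential sentence-ending character
    (e.g. the leading '?') has no words before it, so `split` is []
    and `split[-1]` fails. Stripping leading end-of-sentence
    punctuation and whitespace avoids that code path, at the cost of
    dropping those characters from the output.
    """
    cleaned = text.lstrip(".?! \t\r\n")
    if not cleaned:
        return []
    return word_tokenize(cleaned, language=language)

text = "? so ein schwachsinn! rot für: dummes post. salzburg gewinnt öfb-cup gegen rapid"

# Option 1: strip the leading punctuation before tokenizing.
print(safe_word_tokenize(text))

# Option 2: skip sentence splitting entirely. With preserve_line=True,
# word_tokenize treats the input as a single sentence, so the broken
# Punkt code path is never reached and the leading '?' is kept.
print(word_tokenize(text, language="german", preserve_line=True))
```

Of the two, `preserve_line=True` is the less invasive choice for short user inputs, since it keeps the `?` token; it only changes behavior for multi-sentence inputs, which are then no longer split into sentences before tokenization.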