AttributeError: module 'hanlp.utils.rules' has no attribute 'tokenize_english'

Summary: While working through the course 《Python人工智能20小时玩转NLP》, I hit a hanlp English tokenization problem: tokenize_english was missing. I solved it by locating the tokenizer module that actually ships with the hanlp library. The import recommended by the tutorial needs adjusting; the working function is tokenize_english in hanlp.utils.lang.en.english_tokenizer.


Lately I have been watching the Bilibili course Python人工智能20个小时玩转NLP自然语言处理 (20 Hours of NLP with Python, by 黑马程序员).

Lesson p37, on basic text-processing methods, covers the use of hanlp.

It includes the following snippet, whose purpose is to tokenize English text with hanlp:

import hanlp
tokenizer = hanlp.utils.rules.tokenize_english 
tokenizer('Mr. Hankcs bought hankcs.com for 1.5 thousand dollars.')

The correct output should be:

['Mr.', 'Hankcs', 'bought', 'hankcs.com', 'for', '1.5', 'thousand', 'dollars', '.']

But in my environment this raises: AttributeError: module 'hanlp.utils.rules' has no attribute 'tokenize_english'
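You can confirm the attribute really is gone by listing what the module exports. A quick check (my addition, not from the video):

import hanlp.utils.rules

# List the module's public names; 'tokenize_english' is not among them here.
print([name for name in dir(hanlp.utils.rules) if not name.startswith('_')])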

Judging by the comments on the video, plenty of people hit this error, yet online solutions are both scarce and, in my case, ineffective; I tried several and none worked. The root cause is that hanlp.utils.rules simply does not contain tokenize_english. One comment suggested "from hanlp.utils.english_tokenize import tokenize_english" and pointed to the file structure at https://github.com/hankcs/HanLP/blob/doc-zh/README.md. I tried that too, with no luck, and began to suspect that enough time had passed that my installed hanlp version (if "version" is the right word!) differs from the one in the video. The file tree at that link shows no tokenize_english either, but it gave me an idea: search the installed hanlp package directly, in case the function had merely moved. Sure enough, under C:\ProgramData\Anaconda3\Lib\site-packages\hanlp\utils\lang there is an en folder, and its english_tokenizer.py contains a tokenize_english.
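If your install lives somewhere else, you don't have to hunt by hand. A small sketch (my addition; importlib.metadata needs Python 3.8+) that prints the installed hanlp version and searches the package for the definition:

import importlib.metadata
import pathlib

import hanlp

# Show the installed hanlp version, useful when comparing against a tutorial.
print(importlib.metadata.version('hanlp'))

# Walk every .py file in the installed package and report where
# tokenize_english is actually defined.
pkg_root = pathlib.Path(hanlp.__file__).parent
for path in pkg_root.rglob('*.py'):
    if 'def tokenize_english' in path.read_text(encoding='utf-8', errors='ignore'):
        print(path)  # e.g. ...\hanlp\utils\lang\en\english_tokenizer.py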

So:

from hanlp.utils.lang.en.english_tokenizer import tokenize_english
tokenizer = tokenize_english
tokenizer('Mr. Hankcs bought hankcs.com for 1.5 thousand dollars.')

This works (same output). I can't say for certain that it is the same thing as the video's hanlp.utils.rules.tokenize_english, but it almost certainly is, just relocated in a newer release.
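A quick sanity check against the expected output quoted earlier (my addition):

from hanlp.utils.lang.en.english_tokenizer import tokenize_english

# The expected token list is the one shown near the top of this post.
expected = ['Mr.', 'Hankcs', 'bought', 'hankcs.com', 'for', '1.5', 'thousand', 'dollars', '.']
assert tokenize_english('Mr. Hankcs bought hankcs.com for 1.5 thousand dollars.') == expected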

Of course you could just keep using that snippet directly, but to stay conveniently in sync with the code in the video, you can add the following to hanlp.utils.rules (i.e. the rules.py under your hanlp install):

import hanlp.utils.lang.en.english_tokenizer

# Re-expose the relocated function under its old name so the
# tutorial's hanlp.utils.rules.tokenize_english keeps working.
def tokenize_english(text):
    return hanlp.utils.lang.en.english_tokenizer.tokenize_english(text)

I'm a Python novice and my imports always feel slightly off, so please don't laugh; feel free to shorten this. (I actually tried many import styles that looked fine to me and got inexplicable errors with no pattern I could see. This version isn't elegant, but it produces the expected result, and that's good enough.)
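For what it's worth, a one-line re-export inside rules.py should have the same effect as the wrapper above (a sketch, same idea expressed more briefly):

# Inside hanlp/utils/rules.py: re-export under the old name.
from hanlp.utils.lang.en.english_tokenizer import tokenize_english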

With that in place, the code from the video runs unchanged:

import hanlp
tokenizer = hanlp.utils.rules.tokenize_english 
tokenizer('Mr. Hankcs bought hankcs.com for 1.5 thousand dollars.')