Notes on regular expressions

eunicechen

Published 2019-10-12 13:44:02

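Category: NLP  Tags: fundamentals

Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.youkuaiyun.com/eunicechen/article/details/102518941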

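I have recently been cleaning text with regular expressions: removing the various punctuation marks while keeping the spaces between English words. Along the way I experimented with how escape characters differ between hive -e and Python. The two cases are summarized below for future reference: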

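Using the regex in hive -e (the query sits inside shell double quotes, so each backslash in a pattern is written as \\\ and each $ in a replacement as \\\$):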

hive -e "select regexp_replace(
                    regexp_replace(
                        regexp_replace(lower(trim(' 美女 主播# skin a skin and IPhone 11')),'([\\\u4e00-\\\u9fa5]) * ([\\\u4e00-\\\u9fa5])', '\\\$1\\\$2'), 
                        '([a-zA-Z]+) * ([0-9]+)', '\\\$1\\\$2'),'[#]+', '')"
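
To sanity-check what the query should return without a Hive session, the same three replacements can be mirrored in Python. This is only a sketch: normalize_like_hive is a hypothetical helper name that does not appear in the original post, and it assumes Python's re module and the Java regex engine used by Hive treat these simple patterns the same way.

```python
import re

CJK = "[\u4e00-\u9fa5]"  # same character range as in the Hive query


def normalize_like_hive(text: str) -> str:
    """Apply the query's three replacements, innermost call first."""
    # Hive's trim() removes spaces only, so strip(" ") is the closer match.
    text = text.strip(" ").lower()
    # 1. Join two Chinese characters that are separated by spaces.
    text = re.sub(f"({CJK}) * ({CJK})", r"\1\2", text)
    # 2. Join a run of letters with the digits that follow it.
    text = re.sub(r"([a-zA-Z]+) * ([0-9]+)", r"\1\2", text)
    # 3. Drop every '#'.
    return re.sub(r"[#]+", "", text)


result = normalize_like_hive(" 美女 主播# skin a skin and IPhone 11")
print(result)
```

The spaces between the English words survive because neither pattern matches a letter run followed by another letter run.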

Using it in Python:

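import re

def _str_normalization(self, text):
    # The order of the steps below must not be changed.
    # Strip leading/trailing whitespace, then remove spaces between Chinese characters.
    pattern_clean = u"([\u4e00-\u9fa5]) * ([\u4e00-\u9fa5])"
    text_clean = re.sub(pattern_clean, r"\1\2", text.strip())
    # Remove the spaces between English letters and the digits that follow them.
    pattern_clean = u"([a-zA-Z]+) * ([0-9]+)"
    text_clean = re.sub(pattern_clean, r"\1\2", text_clean)
    # Remove punctuation but keep digits. The hyphen is placed last so that it is
    # a literal "-" instead of forming the character range "~-…".
    pattern_clean = u"[#.，,?!/。？！@<>《》+=~…【】{}-]+"
    text_clean = re.sub(pattern_clean, u"", text_clean.lower())
    return text_clean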
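
One limitation worth knowing: re.sub matches do not overlap, so with the capture-group pattern a character consumed as the right-hand side of one match cannot serve as the left-hand side of the next. For a run of single characters such as "美 女 主 播", a single pass only joins the pairs. A possible alternative, sketched below, uses lookbehind and lookahead so that only the spaces themselves are consumed. The helper names are hypothetical and this variant is not from the original post.

```python
import re

# Pattern from the function above: it consumes both neighbouring characters.
PAIR = re.compile("([\u4e00-\u9fa5]) * ([\u4e00-\u9fa5])")
# Lookarounds only check the neighbours; the match is just the spaces.
GAP = re.compile("(?<=[\u4e00-\u9fa5]) +(?=[\u4e00-\u9fa5])")


def join_cjk_pairwise(text: str) -> str:
    """Join Chinese characters using the capture-group pattern."""
    return PAIR.sub(r"\1\2", text)


def join_cjk_lookaround(text: str) -> str:
    """Join Chinese characters by deleting only the spaces between them."""
    return GAP.sub("", text)


sample = "美 女 主 播 skin a skin"
print(join_cjk_pairwise(sample))
print(join_cjk_lookaround(sample))
```

For input like the sample string in this post, where at most two Chinese characters are separated by a space in a row, both versions give the same result.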
