Emoji regex: matching Unicode emoji in Python regular expressions

Latest recommended article published 2023-07-24 20:15:54

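In Python, you can use a regular expression to extract the text between a number and an emoji. To match a whole group of Unicode emoji at once, build a character class that covers a code point range, such as [\u263a-\U0001f645]. For example, the regex `\d+(.*?)[\u263a-\U0001f645]` extracts the content between a number and any emoji that falls inside that range.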

I need to extract the text between a number and an emoticon in a string.

example text:

blah xzuyguhbc ibcbb bqw 2 extract1 ☺️ jbjhcb 6 extract2 🙅 bjvcvvv

output:

extract1

extract2

The regex I wrote extracts the text between two numbers. I need to change the part that ends the match so that it identifies Unicode emoji characters instead, and extracts the text between a number and an emoji.

(?<=[\s][\d])(.*?)(?=[\d])

Please suggest a Python-friendly method. I need it to work with all emoji, not only the ones given in the example.

Solution

Since there are many emoji with different Unicode values, you have to list them explicitly in your regex, or, if they fall within a specific range, use a character class. In this case the second symbol (🙅, U+1F645) has a higher code point than \u263a (the Unicode escape for ☺), so the two can either be listed as alternatives or used as the two ends of a range. Listing them explicitly:

In [70]: import re

In [71]: s = 'blah xzuyguhbc ibcbb bqw 2 extract1 ☺️ jbjhcb 6 extract2 🙅 bjvcvvv'

In [72]: regex = re.compile(r'\d+(.*?)(?:\u263a|\U0001f645)')

In [74]: regex.findall(s)

Out[74]: [' extract1 ', ' extract2 ']
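The session above can also be written as a standalone script. This is a minimal sketch, assuming Python 3.3 or later, where the `re` module itself understands `\u` and `\U` escapes inside a raw-string pattern. The names `text`, `pattern`, and `extracted` are illustrative, and the added `\s*` on both sides of the group is an optional tweak (not part of the answer's regex) that drops the surrounding spaces from the captured text.

```python
import re

# Sample text from the question, written with escapes so the code points are explicit.
# The smiley is U+263A followed by the variation selector U+FE0F.
text = "blah xzuyguhbc ibcbb bqw 2 extract1 \u263a\ufe0f jbjhcb 6 extract2 \U0001f645 bjvcvvv"

# Digits, optional spaces, a lazily captured run of text, optional spaces,
# then one of the two explicitly listed emoji.
pattern = re.compile(r"\d+\s*(.*?)\s*(?:\u263a|\U0001f645)")

extracted = pattern.findall(text)
print(extracted)  # ['extract1', 'extract2']
```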

Or, if you want to match more emoji, you can use a character range (here is a good reference that shows the proper ranges for different emoji: http://apps.timwhitlock.info/emoji/tables/unicode):

In [75]: regex = re.compile(r'\d+(.*?)[\u263a-\U0001f645]')

In [76]: regex.findall(s)

Out[76]: [' extract1 ', ' extract2 ']

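Note that in the second case you have to make sure that every character within that range is an emoji you want to match. The range from U+263A to U+1F645 is very wide and also covers many non-emoji characters, for example the CJK ideographs.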
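To make that caveat concrete, here is a small sketch of a false positive produced by the wide range. The sample sentence and the names `wide_range` and `matches` are made up for illustration. The character 中 (U+4E2D) is a CJK ideograph, not an emoji, yet it lies between U+263A and U+1F645 and therefore ends the first match.

```python
import re

# Same wide range as in the answer.
wide_range = re.compile(r"\d+(.*?)[\u263a-\U0001f645]")

# Hypothetical input: the first number is followed by CJK text, the second by an emoji.
text = "order 7 items \u4e2d\u6587 then 3 done \U0001f645"

matches = wide_range.findall(text)
print(matches)  # [' items ', ' done ']  (' items ' is terminated by a non-emoji character)
```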

Here is another example:

In [77]: s = "blah 4 xzuyguhbc 😺 ibcbb bqw 2 extract1 ☺️ jbjhcb 6 extract2 🙅 bjvcvvv"

In [78]: regex = re.compile(r'\d+(.*?)[\u263a-\U0001f645]')

In [79]: regex.findall(s)

Out[79]: [' xzuyguhbc ', ' extract1 ', ' extract2 ']
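One way to reduce false positives is to replace the single wide range with a character class built from a few Unicode blocks that contain most common emoji. The following is only a sketch under stated assumptions: the block list is a hand-picked selection, it is not exhaustive, and some characters inside these blocks are not emoji. `EMOJI_CLASS` and `extract_between` are hypothetical names, not part of any library.

```python
import re

# Hand-picked Unicode blocks (an assumption, not a complete emoji definition).
EMOJI_CLASS = (
    "["
    "\u2600-\u26ff"          # Miscellaneous Symbols (includes U+263A)
    "\u2700-\u27bf"          # Dingbats
    "\U0001f300-\U0001f5ff"  # Miscellaneous Symbols and Pictographs
    "\U0001f600-\U0001f64f"  # Emoticons (includes U+1F63A and U+1F645)
    "\U0001f680-\U0001f6ff"  # Transport and Map Symbols
    "\U0001f900-\U0001f9ff"  # Supplemental Symbols and Pictographs
    "]"
)

# Digits, then lazily captured text (surrounding spaces excluded), then one emoji.
PATTERN = re.compile(r"\d+\s*(.*?)\s*" + EMOJI_CLASS)


def extract_between(text):
    """Return the text found between each number and the next emoji in the class."""
    return PATTERN.findall(text)


sample = ("blah 4 xzuyguhbc \U0001f63a ibcbb bqw 2 extract1 \u263a\ufe0f "
          "jbjhcb 6 extract2 \U0001f645 bjvcvvv")
print(extract_between(sample))                  # ['xzuyguhbc', 'extract1', 'extract2']
print(extract_between("7 items \u4e2d\u6587"))  # [] because CJK text no longer ends a match
```

If an emoji you need lives outside these blocks, add its block or code point to the class rather than widening one range across unrelated characters.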