I need to extract the text between a number and an emoticon in a string.
example text:
blah xzuyguhbc ibcbb bqw 2 extract1 ☺️ jbjhcb 6 extract2 🙅 bjvcvvv
output:
extract1
extract2
The regex I wrote extracts the text between two numbers. I need to change it so that it recognizes Unicode emoji characters and extracts the text between a number and an emoji.
(?<=[\s][\d])(.*?)(?=[\d])
Please suggest a Python-friendly method. I need it to work with all emoji, not only the ones given in the example.
Solution
Because emoji are spread over many different Unicode code points, you have to either list them explicitly in your regex or, if they fall within a specific range, use a character class. Your second symbol, 🙅 (\U0001f645), has a larger code point than \u263a (the Unicode representation of ☺), so the two can also serve as the ends of a range. First, listing both explicitly:
In [70]: import re
In [71]: s = 'blah xzuyguhbc ibcbb bqw 2 extract1 ☺️ jbjhcb 6 extract2 🙅 bjvcvvv'
In [72]: regex = re.compile(r'\d+(.*?)(?:\u263a|\U0001f645)')
In [74]: regex.findall(s)
Out[74]: [' extract1 ', ' extract2 ']
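When the explicit list grows, writing the alternation by hand gets tedious. A minimal sketch of building it from a list instead is shown here; the `EMOJI` list and the variable names are illustrative assumptions, not part of the original answer. The results are also stripped so they match the output requested in the question.

```python
import re

# Hypothetical list of the emoji to treat as closing delimiters; extend as needed.
EMOJI = ["\u263a", "\U0001f645", "\U0001f63a"]

# Build a non-capturing alternation such as (?:a|b|c) from the list.
emoji_alternation = "(?:" + "|".join(re.escape(e) for e in EMOJI) + ")"

# One or more digits, then the shortest possible text up to the next listed emoji.
pattern = re.compile(r"\d+(.*?)" + emoji_alternation)

s = "blah xzuyguhbc ibcbb bqw 2 extract1 \u263a\ufe0f jbjhcb 6 extract2 \U0001f645 bjvcvvv"

# Strip the surrounding spaces that the lazy group picks up.
matches = [m.strip() for m in pattern.findall(s)]
print(matches)  # ['extract1', 'extract2']
```

This only matches the emoji that are actually in the list, so nothing unexpected can end a match, at the cost of having to maintain the list.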
Or, if you want to match more emoji, you can use a character range (this reference lists the code point ranges for different emoji: http://apps.timwhitlock.info/emoji/tables/unicode):
In [75]: regex = re.compile(r'\d+(.*?)[\u263a-\U0001f645]')
In [76]: regex.findall(s)
Out[76]: [' extract1 ', ' extract2 ']
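Note that in the second case you have to make sure that every character within that range is an emoji you want to match. The range from \u263a to \U0001f645 is very wide and also contains many non-emoji characters, for example the CJK ideographs.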
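One way to act on that caveat is to combine several narrower ranges in a single character class. The sketch below is only an approximation: the two ranges are assumptions chosen to cover the common emoji blocks, they still include a few non-emoji symbols, and `extract_between` is a hypothetical helper name.

```python
import re

# Assumed ranges (illustrative, not exhaustive):
#   U+2600-U+27BF    Miscellaneous Symbols and Dingbats (includes U+263A)
#   U+1F300-U+1FAFF  pictographs, emoticons, transport and supplemental symbols
EMOJI_CLASS = "[\u2600-\u27bf\U0001f300-\U0001faff]"

# Digits, optional whitespace, the shortest possible text, optional whitespace, one emoji.
NUMBER_TO_EMOJI = re.compile(r"\d+\s*(.*?)\s*" + EMOJI_CLASS)


def extract_between(text):
    """Return the text found between each number and the next emoji."""
    return NUMBER_TO_EMOJI.findall(text)


s = "blah xzuyguhbc ibcbb bqw 2 extract1 \u263a\ufe0f jbjhcb 6 extract2 \U0001f645 bjvcvvv"
print(extract_between(s))  # ['extract1', 'extract2']
```

Putting `\s*` on both sides of the group keeps the surrounding spaces out of the captured text, so no separate stripping step is needed.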
Here is another example:
In [77]: s = "blah 4 xzuyguhbc 😺 ibcbb bqw 2 extract1 ☺️ jbjhcb 6 extract2 🙅 bjvcvvv"
In [78]: regex = re.compile(r'\d+(.*?)[\u263a-\U0001f645]')
In [79]: regex.findall(s)
Out[79]: [' xzuyguhbc ', ' extract1 ', ' extract2 ']