Tags: python, unicode, utf-8

This post shows how to read and parse a JSON file with Python and how to tokenize Japanese text with MeCab. A worked example extracts the key fields from the JSON file, followed by the code used to segment the Japanese question sentences.


I'm reading a JSON file with Python 2. The file looks like this:

[
  {
    "id": 1,
    "qas": [
      {
        "image_id": 1,
        "qa_id": 10000000,
        "answer": "グレー",
        "a_objects": [],
        "question": "右側の男性は何色の服を着ていますか?",
        "q_objects": []
      },
      {
        "image_id": 1,
        "qa_id": 10000001,
        "answer": "赤色",
        "a_objects": [],
        "question": "左側の男性は何色の服を着ていますか?",
        "q_objects": []
      },
...
#! /usr/bin/env python
# coding=utf-8
import json
import sys

infile = "/home/c-nrong/VQA/draw/Json/question_answers_jan.json"
jfile = json.load(open(infile, 'r'))
#names_train=jfile["unique_img_train"]
newd = {}
for i in range(len(jfile)):
    # one entry per image
    imgid = jfile[i]["id"]
    print imgid
    for j in range(len(jfile[i]["qas"])):
        # one dict per question/answer pair
        qid = jfile[i]["qas"][j]["qa_id"]
        ques = jfile[i]["qas"][j]["question"]
        print ques
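
For reference, the same extraction can be written by iterating over the parsed structure directly instead of indexing with range(len(...)). This is only a sketch, assuming the same file path and layout as above:

#! /usr/bin/env python
# coding=utf-8
import json

infile = "/home/c-nrong/VQA/draw/Json/question_answers_jan.json"
with open(infile, 'r') as fp:
    data = json.load(fp)

for entry in data:              # one entry per image
    print entry["id"]
    for qa in entry["qas"]:     # one dict per question/answer pair
        print qa["qa_id"], qa["question"]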

The output looks like this:

u'\u53f3\u5074\u306e\u7537\u6027\u306f\u4f55\u8272\u306e\u670d\u3092\u7740\u3066\u3044\u307e\u3059\u304b\uff1f'

Adding

#! /usr/bin/env python
# coding=utf-8

to the beginning of the file, and the output becomes

右側の男性は何色の服を着ていますか?
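
For reference, the \u escapes appear whenever Python 2 prints the repr of a unicode string (for example when the string sits inside a list or dict that gets printed), whereas printing the string itself writes the readable characters when stdout can encode them. A minimal sketch of the difference, with an explicit UTF-8 encode as a fallback for terminals whose locale is not UTF-8 (the variable q is just for illustration):

#! /usr/bin/env python
# coding=utf-8
import json

# json.loads returns unicode objects for all strings in Python 2
q = json.loads('{"question": "右側の男性は何色の服を着ていますか?"}')["question"]

print repr(q)            # u'\u53f3\u5074...' -- the escaped form
print q                  # the readable characters, if sys.stdout can encode them
print q.encode("utf-8")  # explicit UTF-8 bytes, independent of the terminal locale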

Test code for segmenting the Japanese questions with MeCab:

#! /usr/bin/env python
# coding=utf-8
import MeCab
import sys

def tokenize(sentence):
    # `sentence` is a UTF-8 byte string (str) read from the .quesID file;
    # the Python 2 MeCab binding expects bytes, so no extra encoding step is needed here.
    s = sentence

    reslist = []
    mt = MeCab.Tagger("mecabrc")
    res = mt.parseToNode(s)

    res = res.next          # skip the BOS (beginning-of-sentence) node, whose surface is empty
    while res:
        reslist.append(res.surface)
        res = res.next
    del reslist[-1]         # drop the trailing EOS node, whose surface is also empty

    # decode the UTF-8 byte strings into unicode tokens
    reslist_utf8 = [x.decode("utf-8") for x in reslist]
    return reslist_utf8

if __name__ == "__main__":
    labels = ["dare", "doko", "dona", "dorekurai", "dou", "ikutu", "itu", "naze"]
    path = "quesID"
    for i, label in enumerate(labels):
        filei = "%s/%s.quesID" % (path, label)
        f = open(filei, 'r')
        f_lines = f.readlines()
        f.close()

        for ix, line in enumerate(f_lines):
            # the question text is the third ':'-separated field of each line
            sen = line.strip('\n').split(':')[2]
            print(sen)
            res = tokenize(sen)
            sys.exit()   # stop after the first line while testing
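
As a quick sanity check, the same splitting can also be done in one call with MeCab's wakati output mode instead of walking the nodes with parseToNode; it uses the same dictionary, so the surface forms should match the tokenize() loop above. A minimal sketch:

#! /usr/bin/env python
# coding=utf-8
import MeCab

mt = MeCab.Tagger("-Owakati")
s = "右側の男性は何色の服を着ていますか?"   # UTF-8 byte string, as in the script above
# parse() returns the surfaces joined by spaces, with a trailing newline
print mt.parse(s).strip()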