LabeledSentence vs. TaggedDocument vs. TaggedLineDocument, plus some doc2vec notes

Article link: https://blog.youkuaiyun.com/anthea_luo/article/details/117814291

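The doc2vec examples I found online prepare their data in different ways: some use LabeledSentence, some use TaggedDocument, and some use TaggedLineDocument.
The names look so alike that I searched for the differences. Most results only compare LabeledSentence and TaggedDocument: the former is the old, deprecated name, and the latter is recommended instead.
I couldn't find anything on how TaggedDocument and TaggedLineDocument differ, though. A look at the source code settles it.
My gensim version is 3.8.1.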

@deprecated("Class will be removed in 4.0.0, use TaggedDocument instead")
class LabeledSentence(TaggedDocument):
    """Deprecated, use :class:`~gensim.models.doc2vec.TaggedDocument` instead."""
    pass

LabeledSentence carries a deprecation notice that points straight to TaggedDocument.

class TaggedDocument(namedtuple('TaggedDocument', 'words tags')):
    """Represents a document along with a tag, input document format for :class:`~gensim.models.doc2vec.Doc2Vec`.

    A single document, made up of `words` (a list of unicode string tokens) and `tags` (a list of tokens).
    Tags may be one or more unicode string tokens, but typical practice (which will also be the most memory-efficient)
    is for the tags list to include a unique integer id as the only tag.

    Replaces "sentence as a list of words" from :class:`gensim.models.word2vec.Word2Vec`.

    """
    def __str__(self):
        """Human readable representation of the object's state, used for debugging.

        Returns
        -------
        str
           Human readable representation of the object's state (words and tags).

        """
        return '%s(%s, %s)' % (self.__class__.__name__, self.words, self.tags)

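TaggedDocument inherits from a namedtuple of the same name and defines only a __str__ method. In effect it is a namedtuple that just looks a little different when printed (or wherever __str__() is used). I don't see this pattern often, though admittedly I haven't read that much code.
To build data with TaggedDocument, pass two arguments: 'words' and 'tags'.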
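
To make the "namedtuple subclass with a custom __str__" pattern concrete, here is a minimal sketch using only the standard library. TaggedDoc is a hypothetical stand-in that mirrors the structure quoted above; it is not imported from gensim, and the sample words are made up.

```python
from collections import namedtuple


# Hypothetical stand-in that mirrors the pattern quoted above (not gensim's class).
class TaggedDoc(namedtuple('TaggedDoc', 'words tags')):
    def __str__(self):
        # Only str() is customized; repr() stays the namedtuple default.
        return '%s(%s, %s)' % (self.__class__.__name__, self.words, self.tags)


doc = TaggedDoc(words=['hello', 'world'], tags=[0])

# It still behaves like a plain tuple: it unpacks and indexes as usual.
words, tags = doc

print(str(doc))   # custom format from __str__
print(repr(doc))  # default namedtuple format with field names
```

The only visible difference from a bare namedtuple is the string form: str() drops the field names while repr() keeps them.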

class TaggedLineDocument(object):
    """Iterate over a file that contains documents: one line = :class:`~gensim.models.doc2vec.TaggedDocument` object.

    Words are expected to be already preprocessed and separated by whitespace. Document tags are constructed
    automatically from the document line number (each document gets a unique integer tag).

    """
    def __init__(self, source):
        """

        Parameters
        ----------
        source : string or a file-like object
            Path to the file on disk, or an already-open file object (must support `seek(0)`).

        Examples
        --------
        .. sourcecode:: pycon

            >>> from gensim.test.utils import datapath
            >>> from gensim.models.doc2vec import TaggedLineDocument
            >>>
            >>> for document in TaggedLineDocument(datapath("head500.noblanks.cor")):
            ...     pass

        """
        self.source = source

    def __iter__(self):
        """Iterate through the lines in the source.

        Yields
        ------
        :class:`~gensim.models.doc2vec.TaggedDocument`
            Document from `source` specified in the constructor.

        """
        try:
            # Assume it is a file-like object and try treating it as such
            # Things that don't have seek will trigger an exception
            self.source.seek(0)
            for item_no, line in enumerate(self.source):
                yield TaggedDocument(utils.to_unicode(line).split(), [item_no])
        except AttributeError:
            # If it didn't work like a file, use it as a string filename
            with utils.open(self.source, 'rb') as fin:
                for item_no, line in enumerate(fin):
                    yield TaggedDocument(utils.to_unicode(line).split(), [item_no])

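As the generator in TaggedLineDocument shows, each iteration yields one TaggedDocument object.
TaggedLineDocument takes a single argument: a file name or a file-like object. The file should hold one doc per line; each line also gets a fixed tag of its own, namely its line number (starting from 0).
So the code online that keeps a count variable, increments it for every line, and passes it to TaggedDocument could simply use TaggedLineDocument. Of course, if different lines may belong to the same category, stick with TaggedDocument.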
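
The equivalence between the hand-rolled counter and the line-number tagging can be sketched with the standard library alone. TaggedDoc, tag_lines_with_counter and tag_lines are hypothetical names made up for this illustration; tag_lines only imitates the file-like branch of the __iter__ quoted above and is not gensim code.

```python
import io
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument.
TaggedDoc = namedtuple('TaggedDoc', 'words tags')


def tag_lines_with_counter(lines):
    """The pattern often seen online: keep a manual counter per line."""
    docs = []
    count = 0
    for line in lines:
        docs.append(TaggedDoc(line.split(), [count]))
        count += 1
    return docs


def tag_lines(source):
    """Imitates the file-like branch above: the line number becomes the tag."""
    source.seek(0)
    for item_no, line in enumerate(source):
        yield TaggedDoc(line.split(), [item_no])


# An in-memory file with one whitespace-tokenized document per line.
corpus = io.StringIO("human machine interface\ngraph minors survey\n")

manual = tag_lines_with_counter(corpus)
auto = list(tag_lines(corpus))
print(manual == auto)
```

Both approaches produce the same documents with tags [0] and [1], so the counter adds nothing when every line is its own category.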

Why is the second argument to TaggedDocument almost always a tag or line number wrapped in a list, as in [index]? Because each document can carry multiple tags.
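
A small sketch of what multiple tags per document look like, again with a hypothetical TaggedDoc stand-in and made-up data. collect_tags is an illustrative helper, not a gensim function; it simply loops over each document's tags the way any consumer of a tag list would. It also shows why the list wrapper matters: a bare string is itself iterable, so it gets read one character at a time.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument.
TaggedDoc = namedtuple('TaggedDoc', 'words tags')

# Each document carries a unique id tag plus a shared category tag.
docs = [
    TaggedDoc(['cheap', 'flights', 'to', 'rome'], ['doc_0', 'travel']),
    TaggedDoc(['hotel', 'deals', 'in', 'paris'], ['doc_1', 'travel']),
    TaggedDoc(['new', 'gpu', 'benchmarks'], ['doc_2', 'tech']),
]


def collect_tags(documents):
    """Return every distinct tag in first-seen order (illustrative helper)."""
    seen = []
    for doc in documents:
        for tag in doc.tags:
            if tag not in seen:
                seen.append(tag)
    return seen


all_tags = collect_tags(docs)

# Forgetting the list: the string 'tech' is iterated character by character.
accidental = collect_tags([TaggedDoc(['some', 'words'], 'tech')])

print(all_tags)
print(accidental)
```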

Also, doc2vec's model names are rather confusing and easy to mix up with word2vec's:

PV-DM (Distributed Memory Model of Paragraph Vectors) is similar to word2vec's CBOW; in the code it corresponds to not self.sg.
PV-DBOW (Distributed Bag of Words version of Paragraph Vector) is similar to word2vec's Skip-gram; in the code it corresponds to self.sg.

@property
def dm(self):
    """Indicates whether 'distributed memory' (PV-DM) will be used, else 'distributed bag of words'
    (PV-DBOW) is used.

    """
    return not self.sg  # opposite of SG

@property
def dbow(self):
    """Indicates whether 'distributed bag of words' (PV-DBOW) will be used, else 'distributed memory'
    (PV-DM) is used.

    """
    return self.sg  # same as SG

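I read the word2vec and doc2vec docs and code a long time ago, but PV-DM and PV-DBOW only came back to me when I actually had to use them (on rereading, it all felt brand new), because the parameters include dm, which defaults to 1.
My old notes recommend using PV-DM and PV-DBOW together, but I spent a while wondering how you would use both at once. After some searching I learned that you just concatenate the two vectors and feed the result to the downstream classifier... Why couldn't I think of something this simple before?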
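
The concatenation step is as small as it sounds. In this sketch the two vectors are made-up placeholders; in practice they would come from two separately trained models (one with dm=1, one with dm=0) for the same document. concat_doc_vectors is a hypothetical helper name.

```python
def concat_doc_vectors(dm_vec, dbow_vec):
    """Join the PV-DM and PV-DBOW vectors of one document into one feature vector."""
    return list(dm_vec) + list(dbow_vec)


# Placeholder values; real vectors would come from the two trained models.
dm_vec = [0.12, -0.40, 0.33]
dbow_vec = [0.05, 0.91]

features = concat_doc_vectors(dm_vec, dbow_vec)
print(len(features), features)
```

The classifier then sees one feature vector whose length is the sum of the two vector sizes.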
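I also took another look at how doc2vec handles new data at prediction time. It has to freeze the word embeddings and the output-layer parameters, then iteratively train the paragraph vector until it converges. After that you call something like model.docvecs.most_similar([inferred_vector], topn=5) to get the classification result. Note that the similarity lookup goes through docvecs.
So inference is expensive in both time and computation.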
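
To show why inference costs a training loop per new document, here is a toy PV-DBOW-style sketch in pure Python. It is not gensim's implementation: it assumes a full softmax over a four-word vocabulary, plain gradient descent, and a zero starting vector, and OUT_WEIGHTS, infer_doc_vector and the word ids are all made up for illustration. The point is only that the output weights stay frozen while the document vector alone is updated.

```python
import math

# Frozen "output layer": one row per vocabulary word (made-up numbers).
# Tuples are used so the weights cannot be modified during inference.
OUT_WEIGHTS = (
    (0.2, -0.1, 0.4),
    (-0.3, 0.5, 0.1),
    (0.1, 0.1, -0.2),
    (0.4, -0.4, 0.3),
)


def word_probs(doc_vec, out_weights):
    """Softmax over the vocabulary, given a document vector."""
    scores = [sum(w * d for w, d in zip(row, doc_vec)) for row in out_weights]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def doc_loss(doc_vec, out_weights, word_ids):
    """Average negative log-likelihood of the document's words."""
    probs = word_probs(doc_vec, out_weights)
    return -sum(math.log(probs[i]) for i in word_ids) / len(word_ids)


def infer_doc_vector(out_weights, word_ids, steps=100, lr=0.5):
    """Gradient descent on the document vector only; the weights are never touched."""
    dim = len(out_weights[0])
    doc_vec = [0.0] * dim  # assumption: zero start, chosen here for determinism
    for _ in range(steps):
        probs = word_probs(doc_vec, out_weights)
        grad = [0.0] * dim
        for target in word_ids:
            for j, row in enumerate(out_weights):
                err = probs[j] - (1.0 if j == target else 0.0)
                for k in range(dim):
                    grad[k] += err * row[k] / len(word_ids)
        doc_vec = [d - lr * g for d, g in zip(doc_vec, grad)]
    return doc_vec


new_doc = [1, 3, 3]  # word ids of the unseen document
start_loss = doc_loss([0.0, 0.0, 0.0], OUT_WEIGHTS, new_doc)
inferred = infer_doc_vector(OUT_WEIGHTS, new_doc)
end_loss = doc_loss(inferred, OUT_WEIGHTS, new_doc)
print(start_loss, end_loss)
```

Every new document repeats this loop from scratch, which is where the time and computation go.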
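These details only stick once you actually put them to use (at least with my level of talent; top experts can perhaps skip this step). I'm lucky to still have the chance and the time to invest in NLP. An interviewer once asked me why I was interested in NLP. I think that getting a computer, a dumb thing that only understands 0s and 1s, to understand human language is a wonderful and remarkable thing. Without if-else, without class inheritance, without hard-coding for every possible scenario, it can handle things it has never seen. How nice~ CV feels like adding a smart peripheral to the computer, and as for recommendation systems, just think of those addictive short videos; I really do think NLP is the better choice.
On an office wall in a second-tier city where IT isn't very developed, the team that rented the office before us left behind a piece of calligraphy: "People together make a gathering; hearts together make a team." Heh~ When I saw that line, I very much felt that "even wild lilies have their spring."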