Mintz, M.; Bills, S.; Snow, R.; and Jurafsky, D. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL, 1003–1011.
1) This was the first work to apply the DS method to the RE task.
2) They proposed the above-mentioned 'all sentences' assumption.
3) Their task is to extract the relations of Freebase. They tackle it by using the existing relations in Freebase as training data: for each related pair of entities, they collect all sentences that mention both entities as the input observation x, and use the pair's relation type in Freebase as the label y. Together with a set of unrelated entity pairs as negative instances, they train a classifier to predict relations (see the sketch below).
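A minimal Python sketch of this labeling scheme; the facts, sentences, and simple string matching below are toy assumptions, not the authors' pipeline:

    # Minimal sketch of distant-supervision data construction (toy data).
    kb = {("Steve Jobs", "Apple"): "Founded"}          # Freebase-style fact
    sentences = [
        "Steve Jobs founded Apple in 1976.",
        "Steve Jobs introduced a new Apple product.",  # noisy: does not express Founded
        "Barack Obama visited Berlin.",
    ]

    training_data = []
    for (e1, e2), relation in kb.items():
        # 'all sentences' assumption: every co-occurrence gets the KB relation as label
        bag = [s for s in sentences if e1 in s and e2 in s]
        if bag:
            training_data.append((bag, relation))      # observation x, label y

    # Unrelated entity pairs that co-occur become negative (NA) instances.
    negative_pairs = [("Barack Obama", "Berlin")]
    for e1, e2 in negative_pairs:
        bag = [s for s in sentences if e1 in s and e2 in s]
        if bag:
            training_data.append((bag, "NA"))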
Riedel, S.; Yao, L.; and McCallum, A. 2010. Modeling relations and their mentions without labeled text. In Proceedings of ECML PKDD, 148–163.
1) They argue that Mintz's distant supervision ('all sentences') assumption is too strong and needs to be relaxed, so they employ the 'at-least-one-sentence' assumption: if two entities participate in a relation, at least one sentence that mentions these two entities might express that relation.
2) In their work, they jointly model two tasks: (a) whether two entities are related; (b) whether this relation is mentioned in a given sentence.
3) For a pair of entities that appears together in at least one sentence, a relation variable Y denotes the relation between them, or NA if there is no such relation. For each relation mention candidate i, they define a binary relation mention variable Z_i that is true if and only if mention i indeed expresses the relation Y between the two entities. Z denotes the state of all mention candidates, and ‖z‖ the number of active relation mentions for a given assignment z of Z. Their truth functions are as follows:
original distant supervision: an assignment (y, z) is consistent iff y = NA and ‖z‖ = 0, or y ≠ NA and every mention is active (z_i = 1 for all i);
expressed-at-least-once supervision: an assignment (y, z) is consistent iff y = NA and ‖z‖ = 0, or y ≠ NA and ‖z‖ ≥ 1.
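A small sketch of the two truth functions over an assignment (y, z), following the definitions above; a toy check, not the authors' factor-graph implementation:

    def distant_supervision_consistent(y, z):
        # 'all sentences': every mention must be active when y != NA
        return all(z) if y != "NA" else not any(z)

    def at_least_one_consistent(y, z):
        # 'expressed at least once': some mention must be active when y != NA
        return any(z) if y != "NA" else not any(z)

    print(at_least_one_consistent("Founded", [True, False]))         # True
    print(distant_supervision_consistent("Founded", [True, False]))  # False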
4) Their model with the expressed-at-least-once assumption reaches 91% precision on their top 1000 predictions. Compared to 87% precision for a model based on the distant supervision assumption (Mintz), this amounts to a 31% error reduction.
Hoffmann, R.; Zhang, C.; Ling, X.; Zettlemoyer, L.; and Weld, D. S. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of ACL, 541–550.
1) Hoffmann notes that Riedel's sentence-level variables are binary, and that there is only a single aggregate variable taking values r ∈ R ∪ {NA} (where R is the set of target relations), thereby ruling out overlapping relations. Also, Riedel et al. did not report sentence-level performance.
2) They proposed the MULTIR system, which can extract overlapping relations, for example R = Founded(Jobs, Apple) and CEO-of(Jobs, Apple).
3) To measure the impact of modeling overlapping relations, instead of labeling each entity pair with the set of all true Freebase facts, they created a dataset where each true relation was used to create a different training example. Training MULTIR on this data simulates the effect of the conflicting supervision that arises when overlaps are not modeled.
4) MULTIR achieves significantly higher recall with a consistently high level of precision. At the highest recall point, MULTIR reaches 72.4% precision and 51.9% recall, for an F1 score of 60.5%.
Surdeanu, M.; Tibshirani, J.; Nallapati, R.; and Manning, C. D. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of EMNLP-CoNLL, 455–465.
1) They propose a novel MIML (multi-instance multi-label) learning approach for RE. MIML is the first RE approach that jointly models both multiple instances (by modeling the latent labels assigned to instances) and multiple labels (by providing a simple method to capture dependencies between labels).
2) Their work is closest to Hoffmann et al. The differences are as follows:
Difference | Hoffmann | MIML |
Method | Uses a deterministic model that aggregates latent instance labels into a set of labels for the corresponding tuple by OR-ing the classification results. | Uses an object-level classifier that is trained jointly with the classifier that assigns latent labels to instances, and can capture dependencies between labels. |
Training | Uses a perceptron-style additive parameter-update approach. | Trains in a Bayesian framework. |
3) The MIML model plate diagram (figure omitted). In the figure:
• n is the number of distinct entity tuples in D;
• M_i is the set of mentions for the i-th entity pair;
• x is a sentence and z is the latent relation classification for that sentence;
• w_z is the weight vector for the multi-class mention-level classifier;
• k is the number of known relation labels in L;
• y_j is the top-level classification decision for the entity pair as to whether the j-th relation holds;
• w_j is the weight vector for the binary top-level classifier for the j-th relation (a sketch of this two-layer structure follows the list).
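A loose sketch of MIML's two-layer prediction for one entity pair, using plain linear scorers and toy random weights; the real model is trained jointly (per the table above, in a Bayesian framework) and its top-level features are richer than the raw counts used here:

    import numpy as np

    def miml_predict(mentions, w_z, w_y, relations):
        # Layer 1: assign a latent label z_i to each mention x via the
        # multi-class mention-level classifier w_z (includes "NA").
        z = [max(relations + ["NA"], key=lambda r: w_z[r] @ x) for x in mentions]
        # Aggregate the latent labels; the top-level classifiers see all of
        # them, which (unlike Hoffmann's deterministic OR) lets the model
        # capture dependencies between labels.
        counts = np.array([sum(zi == r for zi in z) for r in relations], dtype=float)
        # Layer 2: an independent binary decision y_j per relation j.
        return {r: bool(w_y[r] @ counts > 0) for r in relations}

    rng = np.random.default_rng(0)
    rels = ["Founded", "CEO-of"]
    w_z = {r: rng.normal(size=4) for r in rels + ["NA"]}
    w_y = {r: rng.normal(size=len(rels)) for r in rels}
    print(miml_predict([rng.normal(size=4) for _ in range(3)], w_z, w_y, rels))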
4) MIML-RE generally outperforms the current state of the art. On the Riedel dataset, MIML-RE has higher overall recall than the Riedel et al. model, and, at the same recall point, MIML-RE's precision is between 2 and 15 points higher.
Zeng, D.; Liu, K.; Chen, Y.; and Zhao, J. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of EMNLP, 1753–1762. (PCNN)
1. They point out two problems in DS-RE and propose a solution for each. Their main contribution is the PCNN model for RE.
Problem | Solution |
Wrong labels in DS | Treat DS-RE as a multi-instance learning problem over bags of sentences, which accounts for the uncertainty of instance labels. |
Noise from feature extraction | Adopt a convolutional architecture with piecewise max pooling to learn relevant features automatically. |
2. Their PCNN model (architecture figure omitted). Following the figure from left to right, the PCNN consists of:
1) Vector representation: word embeddings concatenated with position embeddings.
2) The convolutional layer is the same as in a traditional CNN; it outputs feature maps.
3) PCNN changes the pooling layer. They argue that the traditional CNN's single max pooling operation is insufficient for RE, because it reduces the size of the hidden layers too rapidly and is too coarse to capture fine-grained features for relation extraction. So they propose a piecewise max pooling procedure that returns the maximum value in each segment instead of a single maximum value: the feature map is divided into three segments by the positions of the two entities, in order to better capture the structured information between the two entities (see the sketch after this list).
4) Finally, it is classified by the softmax layer.
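A minimal numpy sketch of the piecewise max pooling step; the filter count and entity positions below are arbitrary:

    import numpy as np

    def piecewise_max_pool(feature_map, e1_pos, e2_pos):
        # feature_map: (n_filters, seq_len) output of the convolutional layer.
        # Split each filter's activations into three segments at the two
        # entity positions and keep the max of each segment -> (n_filters, 3).
        segments = [feature_map[:, :e1_pos],
                    feature_map[:, e1_pos:e2_pos],
                    feature_map[:, e2_pos:]]
        return np.stack([seg.max(axis=1) for seg in segments], axis=1)

    pooled = piecewise_max_pool(np.random.randn(230, 40), e1_pos=5, e2_pos=17)
    print(pooled.shape)  # (230, 3) instead of (230,) from single max pooling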
3. Comparison of their method with traditional approaches:
1) ROC curves (figure omitted).
2) Precision values for the top 100, top 200, and top 500 extracted relation instances upon manual evaluation (table omitted).
Lin, Y.; Shen, S.; Liu, Z.; Luan, H.; and Sun, M. 2016. Neural relation extraction with selective attention over instances. In Proceedings of ACL, 2124–2133. (APCNN)
1. They argue that Zeng's at-least-one assumption, under which only one sentence is active for each entity pair, loses the rich information contained in all the neglected sentences. So they propose a sentence-level attention-based model for relation extraction, which is expected to dynamically reduce the weights of noisy instances.
2. The APCNN architecture (figure omitted):
• m_i is the original sentence for an entity pair.
• r_i is the representation of each sentence.
• The CNN in the figure uses Zeng's PCNN.
• a_i is the weight of each sentence vector.
• r is the representation of relation r.
• a_i is further defined as a_i = exp(e_i) / Σ_k exp(e_k), where e_i = r_i A r scores how well sentence i matches relation r (see the sketch after this list).
• A is a weighted diagonal matrix.
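A small numpy sketch of this selective attention, with r_i, A, and r as defined in the list above; the dimensions are arbitrary:

    import numpy as np

    def selective_attention(R, A, r):
        # R: (n_sentences, d) sentence representations r_i from the PCNN encoder.
        # A: (d, d) weighted diagonal matrix; r: (d,) relation query vector.
        e = R @ A @ r                  # e_i = r_i A r, match score per sentence
        a = np.exp(e - e.max())        # softmax over sentences (shifted for stability)
        a /= a.sum()
        return a @ R                   # bag representation: weighted sum of r_i

    d = 230 * 3                        # e.g., a PCNN output size
    bag_vec = selective_attention(np.random.randn(4, d),
                                  np.diag(np.random.rand(d)),
                                  np.random.randn(d))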
3. Comparison of the two APCNN weighting schemes with Zeng's PCNN:
PCNN+max one | Selects the single sentence with the maximum probability of expressing relation r to tag the bag. Zeng's PCNN can be regarded as a special case of APCNN+att in which the weight of the sentence with the highest probability is set to 1 and all other weights to 0. |
APCNN+ave weight | Assumes that all sentences in the bag contribute equally, and uses the average of all sentence vectors to tag the bag. |
APCNN+att weight | Uses selective attention to de-emphasize the noisy sentences. |
4. Their APCNN experiment results (figure omitted).
Ji, G.; Liu, K.; He, S.; and Zhao, J. 2017. Distant supervision for relation extraction with sentence-level attention and entity descriptions. In Proceedings of AAAI, 3060–3066. (APCNN+D)
1. This paper argues that existing approaches are flawed in selecting valid instances and lack background knowledge about the entities. Entity descriptions can help recognize whether or not a sentence expresses the relation.
2. Their model is based on Lin's APCNN.
In addition, they use another CNN to extract feature vectors from entity descriptions, and push the vectors of the entities (e1, e2) close to those of their descriptions by adding constraints to the objective function of APCNN; hence the model is called APCNN+D. They extract descriptions for entities from Freebase and Wikipedia pages.
It is worth mentioning that they use v_relation = e1 − e2 to represent the relation r. If an instance expresses the relation r, its feature vector should have higher similarity with v_relation; otherwise, lower similarity.
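A tiny sketch of this TransE-style relation representation and the similarity test; all vectors below are toy values:

    import numpy as np

    def relation_similarity(e1, e2, instance_vec):
        # v_relation = e1 - e2 represents the relation between the two entities;
        # a higher dot product suggests the instance expresses that relation.
        v_relation = e1 - e2
        return instance_vec @ v_relation

    e1, e2, x = np.random.randn(3, 50)   # toy entity embeddings and instance vector
    print(relation_similarity(e1, e2, x))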
The APCNN model in their paper (figure omitted).
3. Their results:
1) ROC curves (figure omitted).
2) Precision values for the top 100, top 200, and top 500 extracted relation instances upon manual evaluation (table omitted).
Liu, T.; Wang, K.; Chang, B.; and Sui, Z. 2017. A soft-label method for noise-tolerant distantly supervised relation extraction. In Proceedings of EMNLP, 1791–1796. (soft-label)
1. They observe that all previous work uses hard labels that are determined by distant supervision and immutable during training. They therefore introduce an entity-pair-level (not sentence-level) denoising method that exploits semantic information from correctly labeled entity pairs to correct wrong labels dynamically during training. This means the same bag may receive different labels in different training epochs.
2. Their model is based on APCNN with soft labels added during training. The soft label is computed by a joint score function that combines the relational scores based on the entity-pair representation with the confidence of the hard label (see the sketch below).
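A rough sketch of the soft-label computation; the paper's exact joint score function may differ, so assume here that the hard label simply receives a confidence-weighted bonus on top of the bag's relational scores:

    import numpy as np

    def soft_label(bag_scores, hard_label, confidence=0.9):
        # bag_scores: (n_relations,) relational scores from the entity-pair
        # representation; hard_label: index of the DS-assigned relation.
        # The hard label gets a confidence-weighted bonus, but a sufficiently
        # strong competing relation can overrule it in a given epoch.
        joint = bag_scores.copy()
        joint[hard_label] += confidence * bag_scores.max()
        return int(np.argmax(joint))

    print(soft_label(np.array([0.1, 5.0, 0.4]), hard_label=2))  # relabeled to 1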
3. The soft-label method's improvement over PCNN and APCNN (figure omitted).
Zeng, X.; He, S.; Liu, K.; and Zhao, J. 2018. Large scaled relation extraction with reinforcement learning. In Proceedings of AAAI. (Reinforcement+RE)
1. To solve the wrong-label problem, they propose a novel model with reinforcement learning.
2. Their CNN-based model is called PECNN (position-enhanced CNN), because the position embeddings are used not only to compose the entity representations but are also concatenated with the output of the pooling layer. The structure of PECNN (figure omitted).
3. PECNN is trained with reinforcement learning; to learn the relation extractor without direct supervision, they introduce the policy gradient method. The correspondence between training and RL concepts is:
Training | Reinforcement |
A bag | An episode
Sentences | States
Relations | Actions
The relation extractor | The RL agent
1) The return (cumulative reward) of state s_i is calculated as the discounted sum of future rewards, R(s_i) = Σ_{k=i..n} γ^(k−i) r_k. Before all sentences in the bag have been processed, the rewards are set to 0 (that is, r_i = 0 for i = 1, ..., n−1), since it is not yet known whether the episode is good or not, so R(s_i) simplifies to R(s_i) = γ^(n−i) r_n. The order of the sentences in a bag should not influence the predicted result, so they set γ = 1, giving R(s_i) = r_n.
In the experiments, r_n is set to +1 or −1: if the gold relation of the bag is the same as the predicted relation, the episode reward is set to +1, otherwise −1.
2) Their objective function for optimizing the policy follows Williams' REINFORCE algorithm (1992), as in the sketch below.
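A condensed PyTorch-style sketch of the resulting REINFORCE update, using R(s_i) = r_n from above; the policy network, the state features, and the simplification that the bag's prediction is the final action are all placeholders:

    import torch

    def reinforce_update(policy, optimizer, bag_states, gold_relation):
        # One episode = one bag; each sentence (state) gets a relation (action).
        log_probs, actions = [], []
        for state in bag_states:
            probs = policy(state)                    # distribution over relations
            dist = torch.distributions.Categorical(probs)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            actions.append(action.item())
        # Episode reward: +1 if the bag-level prediction matches the gold
        # relation, else -1; with gamma = 1 every state shares this return.
        r_n = 1.0 if actions[-1] == gold_relation else -1.0
        loss = -r_n * torch.stack(log_probs).sum()   # REINFORCE: -R * sum log pi(a|s)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()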
4. Their results (figure omitted).