Content

  1. Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation
  2. Translating Embeddings for Modeling Multi-relational Data
  3. Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction
  4. A Convolutional Neural Network for Modelling Sentences
  5. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
  6. How Translation Alters Sentiment

1. Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation (2016)

This paper provides an embedding method that jointly maps words and entities into the same low-dimensional space. Their focus is Named Entity Disambiguation (NED). Basically, they extend the Skip-Gram Model by integrating two more loss functions, for 3 loss functions in total. The first model (or perhaps sub-model) comes from the Skip-Gram Model and tries to optimize word context probability. The second one (Knowledge Base (KB) Graph model) aims to place entities (e.g., Wikipedia articles) close together if they have similar incoming link patterns. The last one (Anchor Context Model) binds words and entities (otherwise, entity embeddings and word embeddings would end up in different regions of \(\Bbb R^d\)) by trying to predict the context words around an entity's anchor text in Wikipedia. For the sake of computational feasibility, they used Negative Sampling and avoided computing the normalization factor of the softmax function for each training instance. They trained the model on a 40-core CPU in 5 days, iterating 10 epochs over the December 2014 Wikipedia dump. They improved the state-of-the-art scores on both the CoNLL and TAC 2010 datasets (more info on these datasets can be found in the paper).
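Since all three sub-models share the same skip-gram-with-negative-sampling form, here is a minimal sketch of that shared loss for a single training pair. This is not the authors' code; the function and variable names are mine, and the three sub-models would simply plug in different (target, context) pairs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(target_vec, context_vec, negative_vecs):
    """Skip-gram negative-sampling loss for one (target, context) pair.

    Roughly, the three sub-models reuse this same form:
      - word context model:   target = word,   context = a nearby word
      - KB graph model:       target = entity, context = an entity linking to it
      - anchor context model: target = entity, context = a word around its anchor text
    """
    loss = -np.log(sigmoid(target_vec @ context_vec))
    for neg in negative_vecs:  # randomly sampled negative contexts
        loss -= np.log(sigmoid(-(target_vec @ neg)))
    return loss

# Toy usage with random 100-dimensional vectors.
rng = np.random.default_rng(0)
w, c, negs = rng.normal(size=100), rng.normal(size=100), rng.normal(size=(5, 100))
print(neg_sampling_loss(w, c, negs))
```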

2. Translating Embeddings for Modeling Multi-relational Data (2013)

This paper proposes a method to represent multi-relational data (e.g., Freebase, Google Knowledge Graph, Gene Ontology, etc.). Their method learns one low-dimensional vector for each entity and each relation. A simple example: Steven Spielberg directed A.I. Artificial Intelligence. In this example, Steven Spielberg and A.I. Artificial Intelligence are entities, and the relation (or label, \(\boldsymbol{l}\)) between them is directed. The first entity is usually called the head (\(\boldsymbol{h}\)) and the second the tail (\(\boldsymbol{t}\)). They came up with a margin-based ranking criterion for the loss function:

\[\boldsymbol{L} = \sum_{(h,l,t) \in S} \sum_{(h',l,t') \in S'} \max\left(0, \gamma + d(\boldsymbol{h}+\boldsymbol{l}-\boldsymbol{t}) - d(\boldsymbol{h'}+\boldsymbol{l}-\boldsymbol{t'})\right)\]

Here, \(S\) is the set that contains all the triplets in the training set. \(S'\) contains corrupted triplets, generated by fixing the relation and either keeping the head and randomly picking a tail, or keeping the tail and randomly picking a head. You can think of it as Negative Sampling. \(d\) is the distance function (smaller is better). Basically, we would like \(d(\boldsymbol{h}+\boldsymbol{l}-\boldsymbol{t})\) to be small for legitimate triplets and \(d(\boldsymbol{h'}+\boldsymbol{l}-\boldsymbol{t'})\) to be large for corrupted ones.
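A minimal sketch of this loss for one triplet and one corrupted counterpart, assuming the L2 norm for \(d\) (the paper allows L1 or L2); the names are illustrative, not from the paper's implementation.

```python
import numpy as np

def transe_margin_loss(h, l, t, h_corrupt, t_corrupt, gamma=1.0):
    """Margin-based ranking loss for one triplet and one corrupted triplet,
    with d(.) taken as the L2 norm of h + l - t."""
    d_pos = np.linalg.norm(h + l - t)
    d_neg = np.linalg.norm(h_corrupt + l - t_corrupt)
    return max(0.0, gamma + d_pos - d_neg)

def corrupt(head_id, tail_id, num_entities, rng):
    """Fix the relation and replace either the head or the tail
    with a randomly picked entity (negative sampling)."""
    if rng.random() < 0.5:
        return rng.integers(num_entities), tail_id
    return head_id, rng.integers(num_entities)
```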

They evaluated their method by replacing either the head or the tail of a test triplet and predicting the replaced entity from a ranked list of candidates (hits@10). They also compared their results with previously introduced methods. Even though one previously introduced method (Structured Embeddings (SE)) subsumes their method (in other words, the loss function of SE is more complex and theoretically better represents these embeddings), TransE got better results. The plausible explanation is that TransE can be optimized more easily than the complex SE. In order to see how their method performs on different relation types (1-to-1 (e.g., Ankara is the capital of Turkey), 1-to-Many, Many-to-Many, Many-to-1), they divided the dataset accordingly and evaluated the performance for each type separately. The method performs very well on 1-to-Many relations; however, its Many-to-1 performance is very low compared to the other relation types, which can be expected because it is really hard to reach the same point (the tail embedding) by adding a single relation vector to many different heads.
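The hits@10 evaluation on the tail side can be sketched roughly as follows (the head side is symmetric), assuming the entity embeddings are stored as rows of a matrix; this follows the description above rather than the authors' evaluation code.

```python
import numpy as np

def hits_at_10(h, l, true_tail_id, entity_embeddings):
    """Rank every candidate tail by ||h + l - t||_2 and report whether
    the true tail is among the 10 closest candidates."""
    distances = np.linalg.norm(h + l - entity_embeddings, axis=1)  # one distance per entity
    return true_tail_id in np.argsort(distances)[:10]
```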

Representing these triplets can be useful for many applications, for instance, link prediction in knowledge bases such as WordNet.

3. Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction (2013)

This paper focuses on the Relation Extraction (RE) problem. The method uses not only weakly labeled sentences (entities are found beforehand with NER methods) but also Knowledge Base information. Most methods do not use the latter; however, it is important to note that there is no joint learning in this paper. They have two separate embedding-based models for the RE problem. One represents the sentences (which they call mentions) and relations. The other represents the entities and relations in Freebase. Although the relation types are Freebase relations, the two models represent the relations separately (using different datasets). By posing the problem as a ranking problem, the first model aims to give a high score to the dot product of a mention and its corresponding relation in the training set (the NYT+FB dataset, where the New York Times corpus is aligned with Freebase relations), while the other tries to minimize \(\|h + r - t\|_{2}\), where h, r and t are the embeddings of the head entity, the relation and the tail entity of the triplet, respectively. The model predicts relations in the test set as follows. First it fetches all the head-tail pairs in the test set and finds the highest-scored relation for each head-tail pair. While using all mentions for a head-tail pair can increase the confidence in the maximum-scored relation, recall will be low since this single relation will be assigned to all mentions of that pair. If the relation is not NA (Not Available, the relation denoting that no Freebase relation applies to the mention pair in question), the second model computes a score for the maximum-scored relation found in the previous step. They calculated their scores and compared them with previous systems. The proposed method achieves good scores at low recall (<0.1) but performs very poorly when recall is higher than 0.1 (in the aggregated Precision/Recall curve). From the paper, I understand that there is still no very good solution for transforming free text into relation triplets.
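A rough sketch of the two-step prediction as I read it from the summary above; the paper's actual composition of the two scores is more involved, and all the names here are my own.

```python
import numpy as np

def predict_relation(mention_vec, head_vec, tail_vec,
                     text_rel_embeddings, kb_rel_embeddings, na_id):
    """Pick the highest-scoring relation with the mention-based model
    (dot product); if it is not NA, re-score that relation with the
    KB model (negative TransE-style distance)."""
    text_scores = text_rel_embeddings @ mention_vec  # one score per relation
    best_rel = int(np.argmax(text_scores))
    if best_rel == na_id:
        return na_id, None
    kb_score = -np.linalg.norm(head_vec + kb_rel_embeddings[best_rel] - tail_vec)
    return best_rel, kb_score
```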

4. A Convolutional Neural Network for Modelling Sentences (2014)

This paper extends the work described in Natural Language Processing (almost) from Scratch. Their method generalizes the Max Pooling layer in two ways, combined as Dynamic k-Max Pooling. The k in k-Max Pooling means that the layer pools the k most active features, and the “Dynamic” in “Dynamic k-Max Pooling” means that k is not fixed; it is computed as a function of the depth of the network and the sentence length. Another improvement is “Multiple Feature Maps”, which means that the 3 layers (Convolution, Dynamic k-Max Pooling and Non-linear Feature Function) can be computed in parallel with different filters in the Convolution layer, and the resulting scores are then combined simply by summing. They also provided some intuition about their feature filters (i.e., a feature detector that is active for specific patterns). They fed the 7-grams in the validation and test sets to each of the 288 feature detectors and ranked them. They showed feature detectors that are active for “positive”, “negative”, “too”, and “not” phrases. In other words, one feature detector is active when the 7-gram is, for instance, “either too serious or too lighthearted.” They illustrated their system’s performance on 4 different datasets.
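A minimal sketch of these two pieces, assuming a feature map stored as a (dimensions × sentence length) array. The dynamic k (a layer-dependent fraction of the sentence length, floored at a fixed top-level k) follows the paper's formula; the code itself is only illustrative.

```python
import math
import numpy as np

def dynamic_k(layer, total_conv_layers, sentence_length, k_top):
    """k shrinks with depth as a fraction of the sentence length,
    but never drops below the fixed top-level k_top."""
    frac = (total_conv_layers - layer) / total_conv_layers
    return max(k_top, math.ceil(frac * sentence_length))

def k_max_pooling(feature_map, k):
    """Keep the k largest values in each row of the feature map,
    preserving their original left-to-right order."""
    idx = np.argsort(feature_map, axis=1)[:, -k:]  # positions of the k largest values
    idx.sort(axis=1)                               # restore word order
    return np.take_along_axis(feature_map, idx, axis=1)
```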

5. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks (2015)

This paper aims to provide a systematic and controlled experiment suite for Q/A systems. They grouped the question answering problem into 20 subproblems such as “Single Supporting Fact”, “Counting”, “Basic Deduction”, “Path Finding”, “Yes/No Questions”, and so on. This is really useful because you get a better understanding of the weaknesses of a given Q/A system. I said “controlled” since the paper provides a method (code) for dataset generation. Right now, you can only generate short sentences and the diversity of the sentences is somewhat limited, but they think this dataset and the generation process are helpful, especially when developing and analyzing algorithms. Another contribution is that they extend the work in Weston et al. (2014) (Memory Networks) in 3 ways. They use (1) “Adaptive Memories”, so they can do well on the “Three Supporting Facts” subproblem. The original algorithm (Memory Network, MemNN) performs two hops of inference and does not do well on tasks where more than two hops are needed, such as “Three Supporting Facts” or “Path Finding”. The second extension is (2) N-grams. This extension addresses the limitations of the bag-of-words representation in some subtasks such as “Two Argument Relations”. If a system uses only a bag-of-words approach, it cannot distinguish which one is to the north of which: “the office is north of the bedroom” is equivalent to “the bedroom is north of the office” in a bag-of-words representation, even though the semantics are completely different (the short sketch below illustrates this). The last extension is (3) nonlinearity in the matching function. I think the paper is worth a read.
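A tiny illustration (mine, not from the paper) of why bag-of-words collapses the two sentences and why adding n-grams helps:

```python
from collections import Counter

s1 = "the office is north of the bedroom"
s2 = "the bedroom is north of the office"

# A pure bag-of-words representation throws word order away, so the two
# sentences get exactly the same feature counts despite opposite meanings.
print(Counter(s1.split()) == Counter(s2.split()))  # True

# Adding bigrams (a simple n-gram extension) restores enough word order
# to tell the two sentences apart.
bigrams = lambda s: Counter(zip(s.split(), s.split()[1:]))
print(bigrams(s1) == bigrams(s2))  # False
```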

6. How Translation Alters Sentiment (2015)

Warning: This paper will not meet TL;DR criteria at all.

This paper focuses on analyzing a state-of-the-art sentiment analyzer’s performance on translated texts. What do they have?

  • Manually annotated parallel corpora for a resource-rich language (English) and a resource-poor language (Arabic) pair.
  • Manual (crowd-sourced) sentiment annotations for these corpora.
  • A state-of-the-art English sentiment analyzer tool.
  • An Arabic version of the above tool (they implemented the Arabic version with the same method).
  • A Statistical Machine Translation (SMT) system that translates in both directions (\(English \Longleftrightarrow Arabic\)).

Let me introduce some notation:

  • \(L_m^t\): Manual translation of a text in language \(L\) into the focus language.
  • \(L_a^t\): Automatic translation of a text in language \(L\) into the focus language.
  • \((L_m^t)_m^s\): Manual sentiment annotation of the manual translation of a text in language \(L\).
  • \((L_m^t)_a^s\): Automatic sentiment annotation of the manual translation of a text in language \(L\).

Experiment A.

Translate the Arabic text into English (manually and automatically) and annotate the English text for sentiment (manually and automatically). Compare the sentiment labels assigned to the translated English text with the manual sentiment annotations of the original Arabic text. The more similar the sentiment annotations are, the smaller the impact of translation.

Data:

  1. Arabic social media dataset (BBN): 1200 sentences
  2. Random tweets originating from Syria: 2000 tweets

Evaluation:

Match percentages:

  1. Human agreement benchmark for sentiment analysis: 73.82% (you can think of it as an upper bound)
  2. \(A_m^s\) vs \(A_a^s\): 65.31%
  3. \(A_m^s\) vs \((E_m^t)_m^s\): 71.31%
  4. \(A_m^s\) vs \((E_m^t)_a^s\): 67.73%
  5. \(A_m^s\) vs \((E_a^t)_m^s\): 57.21%
  6. \(A_m^s\) vs \((E_a^t)_a^s\): 62.08%
  7. \((E_m^t)_m^s\) vs \((E_a^t)_m^s\): 60.08%
  8. \((E_m^t)_m^s\) vs \((E_m^t)_a^s\): 63.11%
  9. \((E_a^t)_m^s\) vs \((E_a^t)_a^s\): 69.58%
  • Surprising result 1: If the text is translated automatically, human sentiment annotators have difficulty deciding whether the text is positive, neutral, or negative (#5).
  • Interesting result 2: Machine understands machine (#6). The sentiment analyzer outperforms human sentiment annotators when the text in question is translated automatically (i.e., with an SMT system).
  • Surprising result 3: If the text in question is translated automatically, the annotations of the sentiment analyzer (a.k.a. Ex Machina) and of the human judges overlap relatively well (#9). This led the authors to try translating all knowledge resources from the resource-rich language (English) into the resource-poor language (Arabic), and they observed a performance increase in the sentiment analyzer tool fed with these translated resources.

Moreover, they also dug into the discrepancy between the sentiment analyzer and the human judges. In order to shed light on this, they used an annotator who was provided with:

  • The original Arabic tweet,
  • The manually determined sentiment of the Arabic tweet (positive, negative, or neutral),
  • The automatic English translation of the Arabic tweet,
  • The manually determined sentiment of the translation.

Note that both sentiment decisions are manual. The only difference is that we have two sentiment annotations: (1) of the original Arabic tweet and (2) of its (automatic) English translation. Here, they aimed to understand how automatic and manual translation differ with respect to sentiment annotation.

According to a judge who is fluent in both English and Arabic, the main reason sentiment annotators had difficulties dealing with the manual translation of an Arabic sentence is cultural difference (as pointed out in the paper: the translation is reasonable (sentiment-wise), but the same sentence can be viewed as having one sentiment in the Arabic-speaking population and a different sentiment in the English-speaking population due to cultural and lifestyle differences). Cultural difference seemed to create a problem in 63% of the cases, and bad manual translation makes the sentiment analysis problematic in 35% of the cases. On the other hand, for sentiment analysis over automatic translation, the problem is mostly bad translation (81.2%); in more than half of these cases the bad translation arises from sentiment-bearing words disappearing. So, SMT systems cannot translate sentiment-bearing words very well. Chen and Zhu (2014) focused on this problem, provided hand-crafted features, and improved BLEU by 1.1.

Experiment B.

The sentiment analysis system provided good results (see row #9 above) on automatic translations of a text. This motivated the authors to translate all related resources (sense lexicons, etc.) into the resource-poor language, and they were able to improve the performance of a sentiment analysis system that used the translated resources. A similar idea was demonstrated by Lu et al. (2011), who suggested a joint bilingual sentiment classification system.