Datasets for NLP
Named Entity Disambiguation (NED)
- CoNLL (English)
This is one of the standard datasets for NED, constructed by Hoffart et al., 2011. It is based on the Named Entity Recognition data from the CoNLL 2003 shared task and contains 946, 216, and 231 documents in the training, validation, and test sets, respectively.
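As a quick sanity check on the split sizes, here is a minimal Python sketch that counts documents per split. It assumes the commonly distributed single-file layout, in which each document starts with a `-DOCSTART- (...)` header and validation and test documents carry a `testa` or `testb` suffix after the document number. That layout, the `count_documents` helper, and the sample lines are assumptions for illustration, not something stated above.

```python
import re
from collections import Counter

# Assumed header shape: "-DOCSTART- (<number><split> <title>)", where <split>
# is "testa", "testb", or absent for training documents.
DOCSTART = re.compile(r"^-DOCSTART- \((\d+)(testa|testb)?\s")

# Map the assumed split suffix to the split names used in this list.
SPLIT_NAMES = {None: "train", "testa": "validation", "testb": "test"}


def count_documents(lines):
    """Count documents per split from an iterable of text lines."""
    counts = Counter()
    for line in lines:
        match = DOCSTART.match(line)
        if match:
            counts[SPLIT_NAMES[match.group(2)]] += 1
    return counts


# Tiny made-up sample in the assumed layout (not real dataset content).
sample = """\
-DOCSTART- (1 EXAMPLE)
Token
-DOCSTART- (947testa EXAMPLE)
Token
-DOCSTART- (1163testb EXAMPLE)
Token
""".splitlines()

print(dict(count_documents(sample)))
```

If the layout assumption holds, passing an open file handle for the full dataset file to `count_documents` should give counts that can be compared with the 946, 216, and 231 documents listed above.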
- TAC 2010 (English)
This is another popular dataset for NED research, compiled for the Text Analysis Conference (TAC). You can find the overview slides here. It is based on weblog and news articles from various sources and contains 1,043 training documents and 1,013 test documents.