Dataset: Turkish Word Embeddings
Date
2017-05-01
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Huawei Türkiye Ar-Ge Merkezi
Boğaziçi University
Boğaziçi University
Contact Person
Güngör, Onur, onurgu@gmail.com, Boğaziçi University
Abstract
Description
This resource is a database of Turkish word embeddings learned with the skip-gram algorithm. A corpus of 940 million tokens was used to obtain embeddings for 2 million words. The corpus was built by collecting text from several online Turkish resources such as news outlets, forums, blogs, and e-books.
This package consists of both the embeddings and the corpus.
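The embeddings were learned with the skip-gram algorithm (see the keywords: Word2Vec, negative sampling). This is not the code used to produce the released vectors; it is a minimal pure-Python sketch of skip-gram with negative sampling on a toy corpus, with all hyperparameters (dimension, window, learning rate) chosen only for illustration.

```python
import math
import random

def train_sgns(sentences, dim=10, window=2, negative=5, epochs=50, lr=0.025, seed=0):
    """Toy skip-gram with negative sampling; illustrative only."""
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    # input ("word") vectors and output ("context") vectors
    W = [[(rng.random() - 0.5) / dim for _ in range(dim)] for _ in vocab]
    C = [[0.0] * dim for _ in vocab]
    sig = lambda x: 1.0 / (1.0 + math.exp(-max(-60.0, min(60.0, x))))
    for _ in range(epochs):
        for sent in sentences:
            for pos, w in enumerate(sent):
                wi = idx[w]
                for off in range(-window, window + 1):
                    c_pos = pos + off
                    if off == 0 or c_pos < 0 or c_pos >= len(sent):
                        continue
                    # one positive (observed context) and `negative` random samples
                    targets = [(idx[sent[c_pos]], 1.0)]
                    targets += [(rng.randrange(len(vocab)), 0.0) for _ in range(negative)]
                    grad_in = [0.0] * dim
                    for ci, label in targets:
                        score = sig(sum(a * b for a, b in zip(W[wi], C[ci])))
                        g = lr * (label - score)  # gradient of log-sigmoid loss
                        for d in range(dim):
                            grad_in[d] += g * C[ci][d]
                            C[ci][d] += g * W[wi][d]
                    for d in range(dim):
                        W[wi][d] += grad_in[d]
    return {w: W[idx[w]] for w in vocab}
```

In practice a library implementation (e.g. the original word2vec tool or gensim) would be used at this corpus scale; the sketch only shows the update rule.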
Each line in the file that stores the word embeddings contains one word surface form followed by the 300 values that make up the dimensions of its embedding.
The corpus is a single file with one sentence per line.
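Given the line format described above, the embedding file can be parsed as below. This is a hedged sketch: it assumes the conventional word2vec text layout in which the word and its 300 values are separated by single spaces, which the description does not state explicitly.

```python
def load_embeddings(path, dim=300):
    """Parse a file with one word surface form plus `dim` float values per line.

    Assumes space-separated fields (word2vec text-format convention);
    the delimiter is an assumption, not stated in the dataset description.
    """
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != dim + 1:
                continue  # skip malformed or header lines
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings
```

Loading all 2 million vectors this way needs several gigabytes of memory; for exploratory work it may be preferable to filter to a target vocabulary while reading.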
Keywords
Word embedding, Word2Vec, Skip-gram, Negative sampling