Dataset:
Turkish Word Embeddings

Date
2017-05-01
Publisher
Huawei Türkiye Ar-Ge Merkezi
Boğaziçi University
Contact Person
Güngör, Onur, onurgu@gmail.com, Boğaziçi University
Description
This resource is a database of Turkish word embeddings learned with the skip-gram algorithm. A corpus of 940 million tokens was used to obtain 2 million word embeddings. The corpus was built by collecting text from several online Turkish sources such as news outlets, forums, blogs, and e-books. The package contains both the embeddings and the corpus. Each line of the embeddings file holds one word surface form followed by the 300 values that make up the dimensions of its embedding. The corpus is a single file with one sentence per line.
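To illustrate the file layout described above, the following is a minimal Python sketch for loading the embeddings into a dictionary. It assumes each line holds a word followed by 300 space-separated floating-point values; the file name turkish_word_embeddings.txt is a placeholder and not part of the dataset documentation.

import numpy as np

def load_embeddings(path, dim=300):
    # Return a dict mapping each surface form to its embedding vector.
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != dim + 1:
                continue  # skip malformed lines (or a possible header line)
            word = parts[0]
            embeddings[word] = np.array(parts[1:], dtype=np.float32)
    return embeddings

# Example usage (hypothetical file name):
# vectors = load_embeddings("turkish_word_embeddings.txt")
# print(vectors["kitap"].shape)  # expected: (300,)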
Keywords
Word embedding, Word2Vec, Skip-gram, Negative sampling