Dataset: Turkish Word Embeddings
dc.contributor.author | Güngör, Onur | |
dc.contributor.author | Yıldız, Eray | |
dc.date.accessioned | 2023-03-03T22:22:45Z | |
dc.date.available | 2023-03-03T22:22:45Z | |
dc.date.issued | 2017-05-01 | |
dc.description | This resource is a database of Turkish word embeddings learned with the skip-gram algorithm. A corpus of 940 million tokens was used to obtain 2 million word embeddings. The corpus was built by collecting text from several online Turkish sources such as news outlets, forums, blogs, and e-books. This package contains both the embeddings and the corpus. Each line of the embeddings file contains one word surface form and the 300 values that make up the dimensions of its embedding. The corpus is a single file with one sentence per line. | |
dc.identifier.uri | https://tulap.cmpe.boun.edu.tr/handle/20.500.12913/62 | |
dc.language.iso | Turkish | |
dc.publisher | Huawei Türkiye Ar-Ge Merkezi | |
dc.publisher | Boğaziçi University | |
dc.relation.isreferencedby | https://ieeexplore.ieee.org/document/7960223 | |
dc.rights | Apache License 2.0 | |
dc.rights.uri | http://opensource.org/licenses/Apache-2.0 | |
dc.subject | Word embedding | |
dc.subject | Word2Vec | |
dc.subject | Skip-gram | |
dc.subject | Negative sampling | |
dc.title | Turkish Word Embeddings | |
dc.type | corpus | |
dspace.entity.type | Dataset | |
local.contact.person | Onur, Güngör, onurgu@gmail.com, Boğaziçi University |
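The description says each line of the embeddings file holds one word surface form and its 300 values. A minimal sketch of reading such lines is given here, assuming the word comes first, the fields are whitespace-separated, and the file is UTF-8 encoded; none of these details is stated in the record. The function name `load_embeddings` is hypothetical, and lines with an unexpected number of fields (for example a possible word2vec-style header) are skipped rather than treated as errors.

```python
from typing import Dict, Iterable, List


def load_embeddings(lines: Iterable[str], dim: int = 300) -> Dict[str, List[float]]:
    """Parse lines of the assumed form 'word v1 v2 ... v_dim' into a dict.

    Assumptions (not confirmed by the dataset record): the word is the first
    field and all fields are separated by whitespace.
    """
    embeddings: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.split()
        # Skip blank lines and anything that is not word + dim values,
        # such as a possible "vocab_size dim" header line.
        if len(parts) != dim + 1:
            continue
        word = parts[0]
        embeddings[word] = [float(value) for value in parts[1:]]
    return embeddings


# Tiny synthetic demonstration with dim=3 instead of the dataset's 300.
sample_lines = ["2 3", "ev 0.1 0.2 0.3", "kitap 1 0 0", ""]
sample_embeddings = load_embeddings(sample_lines, dim=3)
```

For the real file, the same function could be fed an open file handle, for example `load_embeddings(open(path, encoding="utf-8"))`, where `path` is whatever file the extracted archive contains.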
Files
Original bundle
Name | Size | Format | Description |
huawei-skipgram-min_count_10-word_dim_300.rar | 1.17 GB | Unknown data format | Unknown |
turkish-texts-tokenized.txt.part1.rar | 1.46 GB | Unknown data format | Unknown |
turkish-texts-tokenized.txt.part2.rar | 635.81 MB | Unknown data format | Unknown |
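The corpus is described as one sentence per line, and the file names suggest that the text is already tokenized and that the embeddings were trained with a minimum word count of 10. The following sketch counts token frequencies and keeps the words that meet such a threshold. It assumes tokens are separated by whitespace; the function name `build_vocab` is hypothetical, and this is not the authors' preprocessing code.

```python
from collections import Counter
from typing import Dict, Iterable


def build_vocab(sentences: Iterable[str], min_count: int = 10) -> Dict[str, int]:
    """Count whitespace-separated tokens and keep those seen at least min_count times.

    Assumption: each element of `sentences` is one already-tokenized sentence,
    as the corpus file name suggests.
    """
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return {word: n for word, n in counts.items() if n >= min_count}


# Tiny synthetic demonstration with a lower threshold than the assumed 10.
sample_sentences = ["bu bir kitap", "bu bir ev", "bu"]
sample_vocab = build_vocab(sample_sentences, min_count=2)
```

Because the function consumes any iterable of lines, it can stream the corpus from an open file handle without loading the whole file into memory.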