Dataset:
Turkish Word Embeddings

dc.contributor.author: Güngör, Onur
dc.contributor.author: Yıldız, Eray
dc.date.accessioned: 2023-03-03T22:22:45Z
dc.date.available: 2023-03-03T22:22:45Z
dc.date.issued: 2017-05-01
dc.description: This resource is a database of Turkish word embeddings learned using the skip-gram algorithm. A corpus of 940 million tokens was used to obtain 2 million word embeddings. The corpus is built by collecting from several online Turkish resources such as news outlets, forums, blogs, and e-books. This package consists of both the embeddings and the corpus. Each line in the file that stores the word embeddings contains one word surface form and 300 values that make up the dimensions of its embedding. The corpus consists of a single file that contains a sentence in each line.
dc.identifier.uri: https://tulap.cmpe.boun.edu.tr/handle/20.500.12913/62
dc.language.iso: Turkish
dc.publisher: Huawei Türkiye Ar-Ge Merkezi
dc.publisher: Boğaziçi University
dc.relation.isreferencedby: https://ieeexplore.ieee.org/document/7960223
dc.rights: Apache License 2.0
dc.rights.uri: http://opensource.org/licenses/Apache-2.0
dc.subject: Word embedding
dc.subject: Word2Vec
dc.subject: Skip-gram
dc.subject: Negative sampling
dc.title: Turkish Word Embeddings
dc.type: corpus
dspace.entity.type: Dataset
local.contact.person: Onur, Güngör, onurgu@gmail.com, Boğaziçi University
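
The description states that each line of the embeddings file holds one word surface form followed by 300 values. A minimal Python sketch of reading that layout is given here. It assumes whitespace-separated text and skips any line with fewer than 301 fields, which would also skip a word2vec-style "vocab_size dim" header if the file happens to have one. The name iter_embeddings and the sample data are illustrative and not part of the package.

```python
import io
from typing import Iterator, List, TextIO, Tuple

EMBEDDING_DIM = 300  # dimensionality stated in the dataset description


def iter_embeddings(
    handle: TextIO, dim: int = EMBEDDING_DIM
) -> Iterator[Tuple[str, List[float]]]:
    """Yield (word, vector) pairs from a text file with one embedding per line."""
    for line in handle:
        fields = line.split()
        # Assumption: a valid row is a word plus `dim` numbers. Anything
        # shorter (blank line, possible header line) is skipped.
        if len(fields) < dim + 1:
            continue
        # Take the last `dim` fields as the vector; the rest is the word.
        word = " ".join(fields[:-dim])
        vector = [float(value) for value in fields[-dim:]]
        yield word, vector


if __name__ == "__main__":
    # Tiny made-up sample with 3 dimensions instead of 300.
    sample = io.StringIO("2 3\nev 0.1 0.2 0.3\nkitap -1.0 0.5 2.0\n")
    for word, vector in iter_embeddings(sample, dim=3):
        print(word, vector)
```

For the real data, the extracted text file could be opened with open(path, encoding="utf-8") and the handle passed to iter_embeddings. Both the UTF-8 encoding and the name of the extracted file are assumptions, since the record does not state them.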
Files
Original bundle
Name:
huawei-skipgram-min_count_10-word_dim_300.rar
Size:
1.17 GB
Name:
turkish-texts-tokenized.txt.part1.rar
Size:
1.46 GB
Name:
turkish-texts-tokenized.txt.part2.rar
Size:
635.81 MB
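
According to the description, the corpus is a single file with one sentence per line, distributed here as two .rar parts. Once extracted, it can be streamed line by line rather than loaded into memory at once. The sketch below assumes tokens are separated by whitespace, which the "tokenized" file name suggests but the record does not confirm. The helpers iter_sentences and count_tokens are illustrative names.

```python
import io
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def iter_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each non-empty line as a list of whitespace-separated tokens."""
    for line in lines:
        tokens = line.split()
        if tokens:
            yield tokens


def count_tokens(lines: Iterable[str]) -> Tuple[Counter, int]:
    """Return per-token frequencies and the total token count."""
    counts: Counter = Counter()
    total = 0
    for tokens in iter_sentences(lines):
        counts.update(tokens)
        total += len(tokens)
    return counts, total


if __name__ == "__main__":
    # Made-up three-line sample; the real corpus has one sentence per line.
    sample = io.StringIO("bu bir kitap\n\nbu ev\n")
    counts, total = count_tokens(sample)
    print(total, counts.most_common(1))
```

Frequencies like these are what a vocabulary cutoff would operate on. The min_count_10 in the embeddings file name hints at such a threshold, though the record itself does not describe one.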
License bundle
Name:
license.txt
Size:
1.62 KB
Format:
Plain Text