Dataset: Web Corpus
Date
2010-08-10
Authors
Publisher
Boğaziçi University
Contact Person
Tunga Güngör, gungort@boun.edu.tr, Boğaziçi University
Abstract
Description
The corpus consists of raw Turkish text collected from the web and has three parts. Newscor is a news corpus collected from news sites; it is provided both in split form (train, development, and test sets) and in full form. Gencor is a general corpus collected from sites other than news sites. Bounwebcorpus is the combination of Newscor and Gencor. The corpora are provided in both XML and txt formats. The XML files contain the text in unnormalized form, whereas the txt files contain it in normalized form (split into sentences, with punctuation removed and numeric data converted to written form).
Example:
(xml format)
<p>
<s>
Bu
sistem
,
gelişen
dalıcı
donanımına
uyumlu
olarak
dalış
eğitim
tekniklerini
sürekli
geliştirmiş
ve
güncellemiştir
.
</s>
…
</p>
(txt format)
bu sistem gelişen dalıcı donanımına uyumlu olarak dalış eğitim tekniklerini sürekli geliştirmiş ve güncellemiştir
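The relationship between the two formats in the example above can be sketched in Python. This is only an illustration, not the procedure actually used to build the corpus: the helper names (`turkish_lower`, `normalize_sentence`, `sentences_from_xml`) are hypothetical, the sample string is the single sentence shown above wrapped in its `<p>` and `<s>` tags, and the sketch assumes lowercasing with Turkish casing rules plus dropping punctuation-only tokens. Converting numeric data to written form is not attempted.

```python
import unicodedata
import xml.etree.ElementTree as ET

# The example sentence from the description, one token per line inside <s>.
SAMPLE = """<p>
<s>
Bu
sistem
,
gelişen
dalıcı
donanımına
uyumlu
olarak
dalış
eğitim
tekniklerini
sürekli
geliştirmiş
ve
güncellemiştir
.
</s>
</p>"""


def turkish_lower(text):
    # str.lower() maps "I" to "i" and "İ" to "i" plus a combining dot;
    # Turkish needs "I" -> "ı" and "İ" -> "i", so handle those first.
    return text.replace("I", "ı").replace("İ", "i").lower()


def is_punctuation(token):
    # True when every character falls in a Unicode punctuation category.
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def normalize_sentence(tokens):
    # Assumed normalization: drop punctuation-only tokens, lowercase the rest.
    words = [t for t in tokens if not is_punctuation(t)]
    return " ".join(turkish_lower(w) for w in words)


def sentences_from_xml(xml_text):
    # Yield one normalized line per <s> element found in the fragment.
    root = ET.fromstring(xml_text)
    for sentence in root.iter("s"):
        yield normalize_sentence((sentence.text or "").split())


if __name__ == "__main__":
    for line in sentences_from_xml(SAMPLE):
        print(line)
```

Run on the sample, the sketch prints the same line as the txt example above. The real XML files may have a different root element or additional markup, so the parsing step would need adjusting to the actual file layout.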
Keywords
Raw corpus
Citation
Sponsor
Boğaziçi University, 06A102, Research Fund, Other