Web Corpus

dc.contributor.author: Sak, Haşim
dc.contributor.author: Güngör, Tunga
dc.contributor.author: Saraçlar, Murat
dc.date.accessioned: 2023-03-03T22:22:46Z
dc.date.available: 2023-03-03T22:22:46Z
dc.date.issued: 2010-08-10
dc.description: The corpus consists of raw Turkish text collected from the web and has three parts. Newscor is a news corpus collected from news sites; it is provided both in split form (train, development, and test splits) and in full form. Gencor is a general corpus collected from sites other than news sites. Bounwebcorpus is the union of newscor and gencor. The corpora are provided in both xml and txt formats. The xml files contain unnormalized text, whereas the txt files contain normalized text (split into sentences, punctuation removed, and numeric data converted to written form). Example (xml format): <p> <s> Bu sistem , gelişen dalıcı donanımına uyumlu olarak dalış eğitim tekniklerini sürekli geliştirmiş ve güncellemiştir . </s> … </p> Example (txt format): bu sistem gelişen dalıcı donanımına uyumlu olarak dalış eğitim tekniklerini sürekli geliştirmiş ve güncellemiştir
dc.description.sponsorship: Boğaziçi University, 06A102, Research Fund, Other
dc.identifier.uri: https://tulap.cmpe.boun.edu.tr/handle/20.500.12913/68
dc.language.iso: Turkish
dc.publisher: Boğaziçi University
dc.relation.isreferencedby: https://link.springer.com/article/10.1007/s10579-010-9128-6
dc.rights: Apache License 2.0
dc.rights.uri: http://opensource.org/licenses/Apache-2.0
dc.subject: Raw corpus
dc.title: Web Corpus
dc.type: corpus
dspace.entity.type: Dataset
local.contact.person: Tunga, Güngör, gungort@boun.edu.tr, Boğaziçi University
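
The relationship between the xml and txt forms in the description can be illustrated with a short Python sketch. It is not the authors' normalization pipeline: it assumes sentences are wrapped in <s> tags with whitespace-separated tokens, as in the example, drops tokens made up only of punctuation, and lowercases with Turkish casing rules. Converting numeric data to written form is not attempted. The names normalize_sentences, turkish_lower, and is_punctuation are hypothetical helpers introduced for illustration.

```python
import re
import unicodedata
from typing import List

# Assumption: each sentence sits between <s> and </s>, as in the example.
SENTENCE_RE = re.compile(r"<s>(.*?)</s>", re.DOTALL)


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish rules: 'I' -> 'ı' and 'İ' -> 'i'."""
    # str.lower() alone would map 'I' to 'i', which is wrong for Turkish.
    return text.replace("I", "ı").replace("İ", "i").lower()


def is_punctuation(token: str) -> bool:
    """True if every character of the token is Unicode punctuation."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def normalize_sentences(xml_text: str) -> List[str]:
    """Return one lowercased, punctuation-free line per <s> element."""
    sentences = []
    for raw in SENTENCE_RE.findall(xml_text):
        tokens = [tok for tok in raw.split() if not is_punctuation(tok)]
        if tokens:
            sentences.append(turkish_lower(" ".join(tokens)))
    return sentences


sample = (
    "<p> <s> Bu sistem , gelişen dalıcı donanımına uyumlu olarak dalış "
    "eğitim tekniklerini sürekli geliştirmiş ve güncellemiştir . </s> </p>"
)
for line in normalize_sentences(sample):
    print(line)
```

Applied to the xml example from the description, the sketch prints the same line as the txt example. Real corpus files may contain cases that this simplified rule set does not cover, such as punctuation attached to words.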

Files

Original bundle

Now showing 1 - 5 of 8

Name: newscor_dev.txt.zip
Size: 148.18 KB
Format: Unknown data format
Description: Unknown

Name: newscor_test.txt.zip
Size: 125.43 KB
Format: Unknown data format
Description: Unknown

Name: newscor_train.txt.zip
Size: 477.91 MB
Format: Unknown data format
Description: Unknown

Name: bounwebcorpus.txt.zip
Size: 1.06 GB
Format: Unknown data format
Description: Unknown

Name: bounwebcorpus.xml.zip
Size: 1.25 GB
Format: Unknown data format
Description: Unknown

License bundle

Now showing 1 - 1 of 1

Name: license.txt
Size: 1.62 KB
Format: Plain Text

Collections