Dataset: BOUN Treebank v2.11
Date
2022
Publisher
Boğaziçi University
Contact Person
Büşra Marşan, busra.marsan@boun.edu.tr, Boğaziçi University
Description
This dataset is a re-annotated version of the BOUN Treebank.
Extracted from the Turkish National Corpus (TNC), the BOUN Treebank consists of 9,761 sentences (121,214 tokens) from five text types: biographical texts, national newspapers, instructional texts, popular culture articles, and essays. Linguists manually annotated the syntactic dependency relations and morphological features of the sentences following the Universal Dependencies (UD) scheme.
Some statistics on the treebank:
- Although the dataset shows word order variation, more than 70% of the sentences have OV and SV word order.
- The average sentence length in the updated treebank is 12.74 tokens, and the average dependency arc length is 2.90.
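Statistics of this kind can be recomputed from a treebank file in CoNLL-U format, the standard distribution format for UD treebanks. The following is a minimal sketch, not the procedure used by the dataset authors. It assumes that arc length means the absolute distance in token positions between a dependent and its head, and that root arcs (head 0) are excluded. The helper names `row`, `read_conllu`, and `treebank_stats` are hypothetical, and the two sample sentences are toy data, not taken from the treebank.

```python
def read_conllu(text):
    """Yield each sentence as a list of (token_id, head_id) pairs."""
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("3.1"),
        # and any row whose HEAD column is not a plain integer.
        if not cols[0].isdigit() or not cols[6].isdigit():
            continue
        sentence.append((int(cols[0]), int(cols[6])))
    if sentence:
        yield sentence


def treebank_stats(text):
    """Return (average tokens per sentence, average arc length)."""
    sentences = list(read_conllu(text))
    n_tokens = sum(len(s) for s in sentences)
    # Assumption: arc length = |head - dependent|, root arcs excluded.
    arcs = [abs(head - tok) for s in sentences for tok, head in s if head != 0]
    return n_tokens / len(sentences), sum(arcs) / len(arcs)


def row(i, form, head, deprel):
    """Build one 10-column CoNLL-U token line (unused columns set to "_")."""
    return "\t".join([str(i), form, "_", "_", "_", "_", str(head), deprel, "_", "_"])


# Toy input for illustration only.
SAMPLE = "\n".join([
    "# text = Ali kitabı okudu .",
    row(1, "Ali", 3, "nsubj"),
    row(2, "kitabı", 3, "obj"),
    row(3, "okudu", 0, "root"),
    row(4, ".", 3, "punct"),
    "",
    "# text = Geldi .",
    row(1, "Geldi", 0, "root"),
    row(2, ".", 1, "punct"),
    "",
])

avg_tokens, avg_arc = treebank_stats(SAMPLE)
print(f"average tokens per sentence: {avg_tokens:.2f}")
print(f"average arc length: {avg_arc:.2f}")
```

Whether punctuation and root arcs are counted changes both averages, so figures obtained this way may differ from the ones reported above unless the same conventions are applied.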
Keywords
dependency annotation, universal dependencies
Sponsor
TÜBİTAK, 16909, Dilbilim Temelli Türkçe Doğal Dil İşleme Platformu, nationalFunds