About
PAWUK is an acronym for Polish Automatic Web corpus of UKrainian language. It is a linguistic corpus of Ukrainian texts acquired from the Internet (selected web pages and social network accounts) and is updated daily. The corpus is annotated automatically with morphosyntactic tags, syntactic dependencies and named entities. Annotation is done with Stanza, using a custom-built model for Ukrainian that produces both Universal Dependencies tags and VESUM morphological tags.
PAWUK uses a version of CQL that allows queries for:
- orthographic words, e.g. [orth="павуки"],
- lemmata: [lemma="павук"],
- UD part of speech: [upos="NOUN"],
- VESUM tag: [xpos="noun:anim:p:v_naz"],
- UD morphological features: [ufeat="fem"],
- UD dependency relation: [deprel="nsubj"],
- lemma of the syntactic head: [head.lemma="павук"],
- UD part of speech of the syntactic head: [head.upos="ADJ"],
- UD morphological features of the syntactic head: [head.ufeat="anim"],
- named entities: <ne="PERS" />,
- words or morphological interpretations not found in VESUM: [oov="true"].
For a full list of UD parts of speech, dependency relations and values of morphological categories, see the Universal Dependencies guidelines. The named entity values are PERS, ORG, LOC and MISC.
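The attributes listed above can plausibly be combined in a single query. The sketch below is a hypothetical Python helper, not part of PAWUK, that assembles query strings from those attribute names. It assumes that PAWUK's CQL version accepts `&` to join conditions inside one token and a space to separate consecutive tokens, as common CQL dialects do; this should be checked against the PAWUK query interface before relying on it.

```python
from typing import Dict, List


def cql_token(conditions: Dict[str, str]) -> str:
    """Build one token expression, e.g. [lemma="павук" & upos="NOUN"].

    Keys are attribute names from the list above (orth, lemma, upos, xpos,
    ufeat, deprel, head.lemma, head.upos, head.ufeat, oov). Values are
    inserted as-is, without escaping.
    ASSUMPTION: conditions inside one token are joined with " & ".
    """
    if not conditions:
        raise ValueError("at least one attribute condition is required")
    parts = [f'{name}="{value}"' for name, value in conditions.items()]
    return "[" + " & ".join(parts) + "]"


def cql_sequence(tokens: List[Dict[str, str]]) -> str:
    """Join several token expressions into a query for consecutive tokens.

    ASSUMPTION: consecutive tokens are separated by a single space.
    """
    return " ".join(cql_token(t) for t in tokens)


if __name__ == "__main__":
    # A noun with the lemma "павук"
    print(cql_token({"lemma": "павук", "upos": "NOUN"}))
    # An adjective followed by "павук" used as a nominal subject
    print(cql_sequence([{"upos": "ADJ"}, {"lemma": "павук", "deprel": "nsubj"}]))
```

Under these assumptions the two calls print `[lemma="павук" & upos="NOUN"]` and `[upos="ADJ"] [lemma="павук" & deprel="nsubj"]`. A dictionary is used so that dotted attribute names such as head.lemma can be passed unchanged.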
PAWUK was built and is being maintained by the Linguistic Engineering Group of the Institute of Computer Science, Polish Academy of Sciences.
The people involved in building PAWUK are (in alphabetical order):
- Witold Kieraś
- Łukasz Kobyliński
- Dorota Komosińska
- Bartłomiej Nitoń
- Michał Rudolf
- Maria Shvedova
- Aleksandra Zwierzchowska
When using the corpus in research and publications, please cite it in the following manner:
W. Kieraś, Ł. Kobyliński, D. Komosińska, B. Nitoń, M. Rudolf, M. Shvedova, A. Zwierzchowska, PAWUK: Polish Automatic Web corpus of UKrainian language, Instytut Podstaw Informatyki PAN, Warszawa 2023. URL: https://pawuk.ipipan.waw.pl.
@misc{PAWUK:2023,
  author = "Kieraś, W. and Kobyliński, Ł. and Komosińska, D. and Nitoń, B. and Rudolf, M. and Shvedova, M. and Zwierzchowska, A.",
  title = "PAWUK: Polish Automatic Web corpus of UKrainian language",
  howpublished = "Instytut Podstaw Informatyki PAN, Warszawa",
  year = "2023",
  note = "https://pawuk.ipipan.waw.pl",
}
The work was partially supported by the visiting program of the Polish Academy of Sciences. PAWUK uses software and hardware infrastructure built and financed by the Dariah.lab project.
This work was partially supported by the European Regional Development Fund as a part of the 2014-2020 Smart Growth Operational Programme, CLARIN - Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00-00C002/19.