RNC XML convertor
=================

This document explains how to convert RNC XML documents to Tsakorpus JSON. See general information about source convertors and their configuration files :doc:`here `.

Convertor: ``/src_convertors/xml_rnc2json.py``.

Introduction
------------

The RNC XML convertor understands XML files with raw or morphologically annotated text in the format of the Russian National Corpus. Currently, files with simple annotated texts ("Main subcorpus" and similar subcorpora) and parallel texts are supported.

Although this is a project-specific format, it is a fairly simple one. If your format is not supported by Tsakorpus, it might be easier to convert your files into RNC XML, so that Tsakorpus can take it from there.

Format description
------------------

All data is contained in an ``<html>`` node, which has ``<head>`` and ``<body>`` daughters. ``<head>`` may contain metadata, which alternatively can be stored in a separate metadata file. Each metadata field is stored as ``<meta name="..." content="...">``.

In the case of simple annotated files, ``<body>`` contains paragraphs (``<p>``, possibly with a class attribute), which, in turn, contain sentences (``<se>``). Sentences contain words (``<w>``, see :doc:`parsed_wordlist_format` for details), while punctuation is placed between the word nodes as plain text. Newlines between the words are ignored. Spans marking italics (``<i>``) or boldface (``<b>``) are allowed inside ``<se>``; they will be transformed into :doc:`style spans `.

In parallel corpora, ``<body>`` contains translation units (``<para>``), which contain aligned sentences. Each sentence must have a ``lang`` attribute. The sentences are structured in the same way as in simple texts. There may also be a ``<p>`` layer between ``<body>`` and ``<para>``.

If your files, or some of the languages in your files, do not have morphological annotation, the ``<se>`` elements must contain plain text without the ``<w>`` and ``<ana>`` tags.

Configuration
-------------

Additional settings available for this convertor are the following:

* ``corpus_type`` (string) -- whether the corpus is parallel (``parallel``) or not (``main``). Defaults to ``main``.

* ``meta_in_header`` (Boolean) -- whether the metadata should be searched for in the XML header. If it is found, it undergoes certain name changes to comply with the Tsakorpus requirements (see the ``get_meta_from_header`` function in ``/src_convertors/xml_rnc2json.py`` for details).

* ``multivalued_ana_features`` (list of strings) -- names of the analysis attributes that have to be treated as carrying multiple values separated by whitespace. Such strings will be converted into lists.

* ``language_codes`` (dictionary) -- contains correspondences between the ``lang`` attribute values used to identify the languages in a parallel corpus and the language names as specified in the ``languages`` list.

* ``clean_words_rnc`` (Boolean) -- whether the tokens should undergo additional RNC-style cleaning (such as removal of the stress marks).

Examples
--------

Simple annotated text
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: xml
   :linenos:

   Группа "ПОСЛУШАЙТЕ!"

Parallel annotated text
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: xml
   :linenos:

   УЛААН МОРИД КРАСНЫЕ ВСАДНИКИ
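
As a rough illustration of the parallel layout described in the format description, the following Python sketch reads a tiny hand-made sample with the standard ``xml.etree.ElementTree`` module. It is not the code used by ``xml_rnc2json.py``. The sample sentences, the tag names ``para``, ``se``, ``w`` and ``ana``, the ``lex`` and ``gr`` attribute names, the ``read_parallel`` and ``sentence_text`` helpers and the ``LANGUAGE_CODES`` values are all assumptions made for this example.

.. code-block:: python

   import xml.etree.ElementTree as ET

   # Hypothetical sample written for this sketch, not a real corpus file.
   # The English sentence is annotated, the Russian one is plain text.
   SAMPLE = """<html>
   <head>
   <meta name="title" content="Sample"/>
   </head>
   <body>
   <para>
   <se lang="en"><w><ana lex="red" gr="A"/>Red</w>
   <w><ana lex="rider" gr="S,pl"/>riders</w>.</se>
   <se lang="ru">Красные всадники.</se>
   </para>
   </body>
   </html>"""

   # Hypothetical value of the language_codes setting:
   # lang attribute value -> language name from the languages list.
   LANGUAGE_CODES = {"en": "english", "ru": "russian"}


   def sentence_text(se):
       """Join word text and the punctuation between word nodes.

       Newlines and repeated spaces between words are collapsed.
       """
       return " ".join("".join(se.itertext()).split())


   def read_parallel(xml_string, language_codes):
       """Return one dictionary per translation unit: language -> sentence."""
       root = ET.fromstring(xml_string)
       units = []
       for para in root.iter("para"):
           unit = {}
           for se in para.iter("se"):
               code = se.get("lang")
               # Fall back to the raw code if it is not in the mapping.
               unit[language_codes.get(code, code)] = sentence_text(se)
           units.append(unit)
       return units


   def read_meta(xml_string):
       """Collect header metadata as a name -> content dictionary."""
       root = ET.fromstring(xml_string)
       return {m.get("name"): m.get("content") for m in root.iter("meta")}

Calling ``read_parallel(SAMPLE, LANGUAGE_CODES)`` yields a list with one dictionary per translation unit, keyed by language name. The real convertor additionally builds Tsakorpus JSON tokens and analyses, which this sketch does not attempt.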