Morphological analysis
If your texts already have morphological annotation, you can skip this page.
Analyzed word lists
Convertors that read raw text (from .txt, .eaf and so on) allow you to have a separate file with morphological (or any other word-level) annotation for all or some of the word forms. The only available option for now is xml_rnc, the XML format used in the Russian National Corpus. An annotated word list in this format is a plain text file where each line is a valid XML element that describes one unique word form. The lines should look as follows:
<w><ana lex="..." gr="..." ...></ana>(<ana....></ana>)*wordform</w>
Each word form starts with <w> and ends with </w>. At the beginning, it has an analysis in an <ana> tag, or a concatenated list of multiple possible analyses. The annotation is stored in the attributes of the <ana> element. There are four reserved attribute names:
lex for the lemma
gr for a comma-separated list of grammatical tags
parts for word segmentation into morphemes
gloss for the glossing
All these fields are optional. If you have glossing, the number of morphemes should be equal to the number of glosses (hence, no hyphens in the stem are allowed). Apart from that, you can have any number of other attributes, e.g. trans_en for an English translation of the word. The actual word form must be located at the end, after the analyses.
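The line format described above can be illustrated with a short Python sketch. It is not part of the convertors: the helper names (make_line, parse_line, glossing_is_consistent) and the sample values (dog, N,pl, STEM-PL) are hypothetical, and the sketch assumes that morphemes and glosses are separated by hyphens, as the remark about hyphens in the stem suggests.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def make_line(wordform, analyses):
    """Serialize one word form and its analyses as a single xml_rnc line.

    `analyses` is a list of dicts mapping attribute names (lex, gr, parts,
    gloss or any custom name such as trans_en) to string values.
    """
    anas = "".join(
        "<ana "
        + " ".join(f"{name}={quoteattr(value)}" for name, value in ana.items())
        + "></ana>"
        for ana in analyses
    )
    # The word form itself goes at the end, after all analyses.
    return f"<w>{anas}{escape(wordform)}</w>"


def parse_line(line):
    """Return (wordform, list of attribute dicts) for one xml_rnc line."""
    w = ET.fromstring(line)
    anas = w.findall("ana")
    analyses = [dict(ana.attrib) for ana in anas]
    # The word form is the text that follows the last <ana> element,
    # or the text of <w> if there are no analyses at all.
    wordform = (anas[-1].tail if anas else w.text) or ""
    return wordform.strip(), analyses


def glossing_is_consistent(ana):
    """Check that parts and gloss have the same number of segments.

    Assumption: segments are separated by hyphens.
    """
    if "parts" not in ana or "gloss" not in ana:
        return True
    return len(ana["parts"].split("-")) == len(ana["gloss"].split("-"))


if __name__ == "__main__":
    analysis = {"lex": "dog", "gr": "N,pl", "parts": "dog-s", "gloss": "STEM-PL"}
    line = make_line("dogs", [analysis])
    print(line)
    print(parse_line(line))
    print(glossing_is_consistent(analysis))
```

Running a check like glossing_is_consistent over every line before conversion is one way to catch word forms whose segmentation and glosses do not match.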
Disambiguation
If you choose to disambiguate your files using a Constraint Grammar file, they will be disambiguated after the primary conversion to JSON is complete. Your JSON files will be translated into CG format and stored in the cg directory, which will have language subdirectories. Multilingual files will be split, and sentences in different languages will end up in different subdirectories. CG will process these files and put the results in cg_disamb. When this process is finished, the disambiguated files will be assembled, transformed back into JSON and stored in the json_disamb directory.
Disambiguation requires that you have CG3 installed. On Debian-based Linux distributions, you can just run apt-get install cg3. On Windows, download the executable and add its path to the PATH variable.
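Before starting a conversion with disambiguation, it may be useful to check that CG3 is actually reachable through PATH. The following Python sketch does only that. The function name find_cg3 is hypothetical, and the executable names vislcg3 and cg3 are an assumption about how CG3 is installed on your system; adjust them if your installation uses a different name.

```python
import shutil


def find_cg3(candidates=("vislcg3", "cg3")):
    """Return the full path of the first CG3 executable found on PATH.

    The default candidate names are an assumption; returns None if none
    of them can be found.
    """
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    cg3_path = find_cg3()
    if cg3_path is None:
        print("CG3 was not found on PATH; disambiguation will not work.")
    else:
        print("CG3 found at", cg3_path)
```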
Note that the convertors can only process certain types of CG3 output. Commands like REMOVE, SELECT and ADD will definitely work.