Data model

Introduction

There are three major kinds of objects in Tsakorpus: documents, sentences and tokens. Normally, a document represents one text, a sentence represents, well, one sentence, and tokens represent words or punctuation marks. However, you can interpret these objects differently: for example, a sentence might correspond to one intonational unit or one line of verse. Note that a query for multiple words only finds words that occur within the same sentence. It is impossible not to split the text into sentences, since the sentence is the basic search unit.

A corpus should be a collection of JSON or gzipped JSON files structured according to the rules described below. A corpus may contain any number of files, which may be spread across a file system subtree starting at /corpus/%corpus_name%. The files must be stored in UTF-8 without BOM.

Each JSON file contains a dictionary representing one corpus document. Each dictionary should have the following keys:

  • meta – a dictionary with the document-level metadata.

  • sentences – a list of sentences the document consists of.

The document has, therefore, the following structure:

{
  "meta": {...},
  "sentences": [...]
}
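
Since each document is a plain or gzipped JSON file with these two keys, it is easy to load and check one programmatically. Here is a minimal sketch in Python (load_document is a hypothetical helper, not part of Tsakorpus):

import gzip
import json

def load_document(path):
    # Gzipped documents are assumed here to have a .gz extension.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        doc = json.load(f)
    # Both top-level keys described above must be present.
    assert "meta" in doc and "sentences" in doc
    return doc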

Metadata

The value of the meta key is a dictionary whose keys are the names of the metadata fields and whose values are strings. All fields listed in the viewable_meta array in /conf/corpus.json must be present in each document of the corpus. Other than that, there are no restrictions on metadata; the viewable_meta array may even be empty. However, there are several field names that get special treatment in Tsakorpus:

  • The value of filename is never included in the search results to avoid accidentally compromising the data of corpus developers.

  • By default, the title and author fields are displayed as document identifiers next to each context in the search results.

  • The value of year should be an integer.

  • If your corpus contains texts for which the exact year of creation is unknown, or which contain parts written in different years, you may use the fields year_from and year_to as lower and upper bounds for the year. If the difference between them is less than 2 and the document does not have the year field, it will be created and filled automatically.
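
As an illustration, here is a minimal sketch of the automatic year rule (a hypothetical helper, not the actual Tsakorpus indexing code; which value the real indexer puts into year is an implementation detail, the lower bound is used here):

def fill_year(meta):
    # If year is absent and the bounds are less than 2 years apart,
    # fill it automatically (assumption: with the lower bound).
    if "year" in meta or "year_from" not in meta or "year_to" not in meta:
        return meta
    if int(meta["year_to"]) - int(meta["year_from"]) < 2:
        meta["year"] = int(meta["year_from"])
    return meta

fill_year({"title": "Example text", "author": "AP", "year_from": "2016", "year_to": "2017"})
# -> {"title": "Example text", "author": "AP", "year_from": "2016", "year_to": "2017", "year": 2016}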

Sentence list

The list of sentences is the main part of the document. Each sentence is a dictionary with the following keys:

  • text – a string with the full text of the sentence.

  • words – a list of objects, each representing a token (a word or a punctuation mark) in the sentence together with all its annotation. There are several reasons why the text of the sentence (or at least most of it) is actually stored twice, once in the text field and once inside the word objects. One of them is allowing multiple (ambiguous) tokenization options for a single sentence. Another is allowing easy full-text search, which would be impossible in Elasticsearch without the text field. Yet another is the possibility of normalizing the tokens, so that they can look different in the sentence and in the analysis.

  • lang – a one-byte integer representing the language the sentence is written in. This number should be a valid index for the languages array in /conf/corpus.json.

  • meta – a dictionary with sentence-level metafields. Sentence-level metafields may include, for example, speaker data for multi-tier (dialogue) files or year in a document that includes data from different years. All metafields listed in the sentence_meta array in /conf/corpus.json must be present in this dictionary. The values should be strings.

  • para_alignment (only in parallel corpora, i.e. corpora with several languages where all or some of the sentences in one language are aligned to sentences in another language) – a list with dictionaries, each representing an alignment of some part of the sentence with a part of another sentence in the corpus.

  • src_alignment (only for media-aligned corpora) – a list with dictionaries, each representing an alignment of some part of the sentence with a segment of a video or sound file.

  • style_spans (optional) – a list with dictionaries, each representing a segment of the sentence text that should be displayed in a non-default style, e.g. in italics or in superscript.

The order of the sentences is important. The sentences should be grouped by language, and within each language they should be ordered exactly as they are ordered in the document. When the sentence collection is indexed, each sentence is assigned the keys _id, prev_id and next_id, the latter two being filled based on the mutual position of the sentences in the JSON file.

The elements of the sentences array therefore look like this:

1{
2  "text": "...",
3  "words": [...],
4  "lang": ...,
5  "meta": {...},
6  "para_alignment": [...],
7  "src_alignment": [...],
8  "style_spans": [...]
9}

Words

Each word in the words list is a dictionary with the following keys and values:

  • wf – a string with the token (word form), used for word search.

  • wtype – type of the token. Currently, two values are possible: “word” and “punct”.

  • off_start, off_end – character offsets indicating to which segment of the text string the word corresponds. As mentioned earlier, this can be useful for multiple overlapping tokenization variants, or when the wf value is normalized for search.

  • next_word – an integer or a list of integers indicating the index (in the words array) of the token immediately following the current token. This is also important for multiple tokenization variants.

  • sentence_index – an integer or an array of integers (again, for multiple tokenizations) indicating the 0-based position of the token in the sentence, not counting the leading and trailing punctuation marks (which do not have to have this field).

  • sentence_index_neg – same as sentence_index, but used when the user enters a negative number in the Position in sentence field. It should equal 1 for the last word in the sentence, 2 for the one before the last, and so on (see the sketch that follows the word dictionary example below).

  • ana – a list of possible annotation variants for this word. If the word has no annotation, this key may be omitted.

Additionally, the word may have the following fields, which may be relevant for certain corpora:

  • wf_display – a string with the token in HTML. It can be used if the displayed version of the token should look different from the one used for search. The displayed version is non-searchable and is displayed exactly as written, without the HTML escaping that is applied otherwise. For example, if part of the token has to be in superscript, wf_display may include <sup> and </sup> tags. All HTML tags used should be closed within the same token. If you use transliteration, you should make sure the tags in wf_display are not transliterated. This field only determines what the token looks like in the popup, not in the baseline.

Overall, a word dictionary looks like this:

{
  "wf": "...",
  "wf_display": "...",   // optional
  "wtype": "word|punct",
  "off_start": ...,
  "off_end": ...,
  "next_word": ...,
  "sentence_index": ...,
  "sentence_index_neg": ...,
  "ana": [...]           // optional
}
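
To make the positional fields more concrete, here is a minimal sketch (a hypothetical helper, not part of Tsakorpus) that fills sentence_index and sentence_index_neg for a single tokenization variant. It follows the conventions of the example sentence at the end of this page: leading and trailing punctuation marks get neither field, sentence-internal punctuation gets only sentence_index, and sentence_index_neg counts words from the end starting at 1.

def add_sentence_indices(words):
    # Positions of the tokens with wtype == "word" in the words list.
    word_positions = [i for i, w in enumerate(words) if w["wtype"] == "word"]
    if not word_positions:
        return words
    first, last = word_positions[0], word_positions[-1]
    # sentence_index: 0-based position, skipping leading and trailing punctuation.
    for idx, i in enumerate(range(first, last + 1)):
        words[i]["sentence_index"] = idx
    # sentence_index_neg: 1 for the last word, 2 for the one before it, etc.
    for neg, i in enumerate(reversed(word_positions), start=1):
        words[i]["sentence_index_neg"] = neg
    return words

Applied to the six tokens of the example sentence at the end of this page, this yields the same values as shown there.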

Analyses

A word can have more than one analysis variant. Usually a word having multiple analyses is the result of automatic morphological annotation without subsequent disambiguation. However, this option is useful even in manually annotated corpora when there is no way to distinguish between several homonymous forms, or when the annotator has doubts (which happens especially often when annotating historical corpora). Search queries will find all words that have at least one analysis conforming to the query.

Each analysis is a dictionary with the following keys and values:

  • lex – lemma (dictionary form), a string.

  • any number of keys starting with gr. (such as gr.pos or gr.case) – strings or arrays of strings that contain values of grammatical or lexical categories expressed in the word. The name of the category, as well as the value, should be listed in the categories.json file for the language the sentence is written in, otherwise this information will not be searchable. Each category can have multiple values (this can happen e.g. in case compounding when a stem attaches several case markers).

  • gloss, parts and gloss_index (only for corpora with glossing) – strings representing the glosses for the word (gloss), the segmentation of the word into morphemes (parts), and the combination of these two fields used during search (gloss_index). The gloss field should contain glossing according to the Leipzig glossing rules (the glosses themselves can be arbitrary, but the format should be correct). You might want to gloss the stem as STEM instead of a short English translation, otherwise it would be impossible to make queries such as “find a genitive marker immediately following the stem”. It is okay to do that only in gloss_index, which is used for search, but not in gloss, which is used for displaying the glosses. Glossing and segmentation into morphemes should not contain empty morphemes or glosses for them; all categories that are not overtly expressed in the word should be tagged using the gr. fields. The gloss_index string has the following format: GLOSS1{morpheme1}-GLOSS2{morpheme2}-… Each gloss is accompanied by the corresponding morpheme in curly brackets. All glosses are separated by hyphens, and there should also be a hanging hyphen at the end of the string (see the sketch after this list).

  • any number of other keys with string values, such as trans_en. All fields used here have to be listed in the word_fields list in /conf/corpus.json, and their additional properties can be specified through the optional kw_word_fields, word_table_fields and accidental_word_fields parameters in the same file. You cannot have a field named gr.
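
As an illustration of the gloss_index format, here is a minimal sketch (a hypothetical helper, not part of the Tsakorpus converters) that builds the string from parts and gloss, assuming both use hyphens as separators and contain the same number of segments:

def build_gloss_index(parts, gloss):
    morphemes = parts.split("-")
    glosses = gloss.split("-")
    if len(morphemes) != len(glosses):
        raise ValueError("parts and gloss must have the same number of segments")
    # Each gloss is followed by its morpheme in curly brackets; every segment,
    # including the last one, ends with a hyphen.
    return "".join(g + "{" + m + "}-" for g, m in zip(glosses, morphemes))

build_gloss_index("ta-os", "STEM-PL")   # returns "STEM{ta}-PL{os}-"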

Parallel alignment

If all or some of the documents in your corpus have several parallel tiers, e.g. an original text and its translations into other languages, the tiers have to be assigned different language IDs, starting from zero. These IDs should correspond to the positions of the language names in the languages array in the /conf/corpus.json file. The sentences of all tiers should be stored in one JSON file, but as separate sentence objects. The sentences in the file should be ordered by language ID. In order to indicate that a certain part of a sentence is aligned with a certain part of another sentence in another tier, these sentences should contain the following dictionary in their para_alignment arrays:

{
  "off_start": ...,
  "off_end": ...,
  "para_id": ...
}

The off_start and off_end parameters are integers that determine the aligned span in characters. The para_id parameter is an integer uniquely (at the document level) identifying a group of aligned segments: it should have the same value in all tiers of an aligned segment. The aligned segment may be shorter or longer than the sentence. In the former case, the sentence will contain several dictionaries in its para_alignment array; in the latter, several consecutive sentences in the same tier will have alignments with the same para_id.
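
For instance, a sentence and its translation could be linked as follows (a sketch with hypothetical texts and offsets; only the fields relevant to the alignment are shown). Both sentences carry the same para_id, and the translated sentence (lang 1) appears in the sentences list after all sentences of the original tier (lang 0):

original_sentence = {
    "text": "...",
    "lang": 0,
    "para_alignment": [{"off_start": 0, "off_end": 18, "para_id": 616}]
}

translated_sentence = {
    "text": "...",
    "lang": 1,
    "para_alignment": [{"off_start": 0, "off_end": 27, "para_id": 616}]
}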

Source alignment

If all or some of the documents in your corpus were aligned with sound or video, the aligned sentences (in all tiers, if there are several) should contain the following dictionary in their src_alignment lists:

{
  "off_start_src": ...,
  "off_end_src": ...,
  "off_start_sent": ...,
  "off_end_sent": ...,
  "mtype": "audio|video",
  "src_id": "...",
  "src": "..."
}

The off_start_src and off_end_src parameters are floating-point numbers that determine the relevant segment of the media file in seconds. The off_start_sent and off_end_sent parameters are integers that determine the aligned span in the sentence in characters. The mtype parameter is a string that indicates whether the media is an audio or a video file. The src_id parameter is a string uniquely (at the document level) identifying an aligned segment. The src parameter is the name of the media file together with its relative path. All media files have to be located in the search/media/%corpus_name% directory. Just as with para_alignment, it is possible to have several aligned segments in a sentence or several sentences in an aligned segment.
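
Since every file referenced in src has to be located under search/media/%corpus_name%, it can be useful to verify this before indexing. Here is a minimal sketch (a hypothetical helper, not part of Tsakorpus):

import os

def missing_media(doc, corpus_name, root="search/media"):
    # Collect src values in a document that do not correspond to an existing file.
    missing = set()
    for sentence in doc["sentences"]:
        for alignment in sentence.get("src_alignment", []):
            if not os.path.exists(os.path.join(root, corpus_name, alignment["src"])):
                missing.add(alignment["src"])
    return missing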

Style spans

The baseline of the sentence may contain style spans, i.e. segments that should be displayed in a style other than the default, e.g. in italics or in superscript. Each dictionary in the style_spans list represents one such segment. It looks like this:

{
  "off_start": ...,
  "off_end": ...,
  "span_class": "...",
  "tooltip_text": "..."  // optional
}

The off_start and off_end parameters are integers that define the offset of the relevant segment in the text value in characters. span_class is a string that defines the style. When displayed in a search hit, the relevant segment is put inside a <span> element with the class attribute set to style_%SPAN_CLASS%. For example, if span_class equals i, the actual span tag will look like <span class="style_i">. The classes should be defined in /search/web_app/static/css/span_styles.css. Predefined classes are style_i (italics), style_b (bold), style_sup (superscript), style_sub (subscript), and style_txt_hX for X = 1, 2 and 3 (headers). tooltip_text can be used to display a tooltip for that span. See Spans and styles for more.
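
As an illustration, here is a minimal sketch (not the actual Tsakorpus rendering code, and ignoring tooltip_text) of how a single style span could be turned into HTML when a hit is displayed:

from html import escape

def apply_style_span(text, span):
    # Wrap the segment defined by the span in a <span class="style_..."> element.
    before = escape(text[:span["off_start"]])
    inside = escape(text[span["off_start"]:span["off_end"]])
    after = escape(text[span["off_end"]:])
    return before + '<span class="style_' + span["span_class"] + '">' + inside + '</span>' + after

apply_style_span("tačʼe taos.", {"off_start": 0, "off_end": 5, "span_class": "i"})
# returns '<span class="style_i">tačʼe</span> taos.'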

Sentence example

Here is an example of a sentence. It contains both parallel alignment (the text is aligned with its Russian translation) and media alignment.

{
  "text": "[нрзб] tačʼe taos.",
  "words": [
    {
      "wf": "[",
      "wtype": "punct",
      "off_start": 0,
      "off_end": 1,
      "next_word": 1
    },
    {
      "wf": "нрзб",
      "wtype": "word",
      "off_start": 1,
      "off_end": 5,
      "next_word": 2,
      "sentence_index": 0,
      "sentence_index_neg": 3
    },
    {
      "wf": "]",
      "wtype": "punct",
      "off_start": 5,
      "off_end": 6,
      "next_word": 3,
      "sentence_index": 1
    },
    {
      "wf": "tačʼe",
      "wtype": "word",
      "off_start": 7,
      "off_end": 12,
      "next_word": 4,
      "sentence_index": 2,
      "sentence_index_neg": 2,
      "ana": [
        {
          "lex": "tačʼe",
          "gr.pos": "PRO",
          "gr.number": "sg",
          "gr.case": "nom",
          "parts": "tačʼe",
          "gloss": "STEM",
          "gloss_index": "STEM{tačʼe}-",
          "trans_ru": "такой"
        }
      ]
    },
    {
      "wf": "taos",
      "wtype": "word",
      "off_start": 13,
      "off_end": 17,
      "next_word": 5,
      "sentence_index": 3,
      "sentence_index_neg": 1,
      "ana": [
        {
          "lex": "ta",
          "gr.pos": "PRO",
          "gr.proType": "pers",
          "gr.number": "pl",
          "gr.case": "nom",
          "parts": "ta-os",
          "gloss": "STEM-PL",
          "gloss_index": "STEM{ta}-PL{os}-",
          "trans_ru": "он, она"
        }
      ]
    },
    {
      "wf": ".",
      "wtype": "punct",
      "off_start": 17,
      "off_end": 18,
      "next_word": 6
    }
  ],
  "lang": 0,
  "meta": {
    "speaker": "AP",
    "gender": "M",
    "year": "2017"
  },
  "para_alignment": [
    {
      "off_start": 0,
      "off_end": 18,
      "para_id": 616
    }
  ],
  "src_alignment": [
    {
      "off_start_src": "0.05",
      "off_end_src": "1.3",
      "off_start_sent": 0,
      "off_end_sent": 18,
      "mtype": "audio",
      "src_id": "50_1300",
      "src": "AP_AS_2017.01.06_words_YZ_training-0-0.mp4"
    }
  ]
}