Indexing

How to index your corpus

Once your corpus JSON files are ready, they must be indexed in Elasticsearch before the corpus becomes available. Follow these steps:

  1. Configure the corpus by filling out a configuration page in the browser or editing /conf/corpus.json.

  2. Edit categories.json (or copy an existing one) and put it in /conf.

  3. Make sure all interface messages are translated into all interface languages.

  4. Make sure all transliterations, input methods and virtual keyboards mentioned in corpus.json exist.

  5. Put the JSON files in /corpus/%corpus_name%.

  6. If you have media files cut by the source converters, put them in /search/media/%corpus_name% or symlink this folder to the place where they are stored.

  7. Run /indexator/indexator.py.
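
Before running the indexator, it can help to confirm that the files from steps 1, 2 and 5 are in place. The following is a minimal sketch, not part of the platform: the function name preflight is hypothetical, and it assumes that the /conf and /corpus paths above are relative to a repository root you pass in and that you pass the corpus name yourself.

```python
import json
from pathlib import Path


def preflight(root, corpus_name):
    """Return a list of problems found before indexing; empty means none found.

    Hypothetical helper: `root` is assumed to be the folder that contains
    the conf and corpus folders mentioned in the steps above.
    """
    root = Path(root)
    problems = []

    # Steps 1 and 2: both configuration files must exist and parse as JSON.
    for name in ("corpus.json", "categories.json"):
        path = root / "conf" / name
        if not path.is_file():
            problems.append(f"missing {path}")
            continue
        try:
            # utf-8-sig tolerates a byte order mark at the start of the file.
            json.loads(path.read_text(encoding="utf-8-sig"))
        except json.JSONDecodeError as err:
            problems.append(f"invalid JSON in {path}: {err}")

    # Step 5: the corpus folder must exist and contain at least one file.
    corpus_dir = root / "corpus" / corpus_name
    if not corpus_dir.is_dir():
        problems.append(f"missing folder {corpus_dir}")
    elif not any(p.is_file() for p in corpus_dir.rglob("*")):
        problems.append(f"no files in {corpus_dir}")

    return problems
```

An empty list only means these basic checks passed; it does not validate the contents of the configuration or of the corpus files.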

Indexing may take a while; progress messages are printed as it runs. If you have a large corpus and would like to disconnect from the server before indexing is complete, you can launch it with nohup:

cd indexator
nohup python3 indexator.py -y > index.log 2>&1 &
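
If you would rather start the indexator from a Python script than from the shell, the same idea can be sketched with the standard subprocess module. This is an illustration under assumptions, not something the platform provides: the function name launch_detached is hypothetical, and start_new_session only has an effect on POSIX systems, where it detaches the child from the terminal in roughly the way nohup protects it from a hangup.

```python
import subprocess


def launch_detached(command, cwd, log_path):
    """Start `command` in its own session, sending stdout and stderr to log_path.

    Hypothetical helper. Note that log_path is resolved relative to the
    caller's working directory, not relative to `cwd`.
    """
    with open(log_path, "wb") as log_file:
        process = subprocess.Popen(
            command,
            cwd=cwd,                      # directory the command runs in
            stdin=subprocess.DEVNULL,     # no terminal input is expected
            stdout=log_file,              # same role as "> index.log"
            stderr=subprocess.STDOUT,     # same role as "2>&1"
            start_new_session=True,       # detach from the controlling terminal (POSIX)
        )
    # The child keeps its own copy of the log file descriptor after we close ours.
    return process
```

For the command above, the call would look like launch_detached([sys.executable, "indexator.py", "-y"], cwd="indexator", log_path="indexator/index.log"), with sys imported in the calling script.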

If you are setting up the corpus for the first time, do not forget to set up apache/nginx/… configuration files so that some URL resolves to your corpus, and enable that configuration. If you are reindexing the corpus, reload apache/nginx after indexing is complete.

What indexator does

  1. It creates three Elasticsearch indexes called %corpus_name%.sentences, %corpus_name%.docs and %corpus_name%.words. If indexes with these names already exist, the indexator will ask you for permission to proceed. Use the -y option to overwrite existing indexes without asking.

  2. It puts the contents of your JSON files into the indexes. Sentences are transferred to the database almost without changes.

  3. It calculates word and lemma statistics and puts them into the indexes. The statistics are kept in memory during indexing, so the larger your corpus, the more memory indexing will require.

  4. It generates full-text representations and dictionaries, if you enabled this in the configuration.
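
To see why step 3 needs more memory as the corpus grows, consider the following simplified sketch of in-memory frequency counting. It is not the indexator's actual code: the function name collect_statistics is hypothetical, and the sentence layout used here (a "words" list whose items carry a word form in "wf" and analyses with a lemma in "lex" under "ana") is an assumption made for illustration.

```python
from collections import Counter


def collect_statistics(sentences):
    """Count word forms and lemmata across an iterable of sentence dicts.

    Hypothetical sketch: the keys "words", "wf", "ana" and "lex" are assumed.
    """
    word_freq = Counter()
    lemma_freq = Counter()
    for sentence in sentences:
        for token in sentence.get("words", []):
            wf = token.get("wf")
            if wf is None:
                continue
            word_freq[wf.lower()] += 1
            # Count each distinct lemma once per token, even if several
            # analyses of the same token share it.
            lemmata = {
                ana["lex"].lower()
                for ana in token.get("ana", [])
                if "lex" in ana
            }
            for lemma in lemmata:
                lemma_freq[lemma] += 1
    return word_freq, lemma_freq
```

Both counters hold one entry per distinct word form or lemma until all sentences have been processed, so their size grows with the vocabulary of the corpus.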

PyBabel translations of the interface, which used to be compiled at indexation time, are now generated and compiled each time the corpus app is launched.