The required data format is [Xigt](https://github.com/xigt/xigt/wiki), with morpheme-level segmentation, morpheme-to-gloss alignment, and POS tags.

A sample (toy) dataset is included under `data/`.

There is a fairly reliable Toolbox-to-Xigt converter.

Make sure that your IGT file is properly encoded as UTF-8 and that each IGT in your collection has a unique `\ref` value, then run the converter:
```
$ xigt import -f toolbox -i your_toolbox_igt.txt -o xigtified_igt.xml
```
(The above assumes you've installed xigt via pip; this happens automatically if you used pip to install MOM.)

One way to replace the `\ref` values in your files with line numbers (so that each is unique) is to use the awk tool, which is available on macOS and generally on Unix:
```
$ awk '/^\\ref/{print "\\ref igt" NR;};!/^\\ref/' < original_file.txt > modified_file.txt
```
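If awk is not available, the same renumbering can be sketched in Python. This is a minimal illustration, not part of MOM or xigt: the function names (`renumber_refs`, `duplicate_refs`, `fix_file`) are made up for this example. It mirrors the awk command by rewriting every line that starts with `\ref` to `\ref igtN`, where N is the line number, and it also fails loudly if the file is not valid UTF-8.

```python
from collections import Counter
from pathlib import Path


def renumber_refs(text: str) -> str:
    """Rewrite each line starting with \\ref as '\\ref igtN' (N = 1-based line number)."""
    new_lines = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        if line.startswith("\\ref"):
            new_lines.append(f"\\ref igt{lineno}")
        else:
            new_lines.append(line)
    return "\n".join(new_lines)


def duplicate_refs(text: str) -> list[str]:
    """Return the \\ref values that occur more than once, sorted."""
    values = [
        line[len("\\ref"):].strip()
        for line in text.split("\n")
        if line.startswith("\\ref")
    ]
    counts = Counter(values)
    return sorted(value for value, n in counts.items() if n > 1)


def fix_file(src: str, dst: str) -> None:
    """Read src as strict UTF-8, renumber its \\ref lines, and write the result to dst."""
    # bytes.decode raises UnicodeDecodeError if the file is not valid UTF-8.
    text = Path(src).read_bytes().decode("utf-8")
    fixed = renumber_refs(text)
    # After renumbering, every \ref value is tied to its line number, so none repeat.
    assert not duplicate_refs(fixed)
    Path(dst).write_bytes(fixed.encode("utf-8"))


# Example call (file names are placeholders):
# fix_file("original_file.txt", "modified_file.txt")
```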
A FLEx-to-Xigt converter will be created in the future.

If your dataset does not have morpheme-level segmentation or POS tags, you can use [INTENT][1] to enrich it.

INTENT is available through [this online interface](http://intent.xigt.org/).

If you want to install INTENT and run it on your own machine, the command will look something like this:
```
python3 intent.py enrich xigt_igts.xml igts-enriched.xml --align heur --pos class
```
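To decide whether enrichment is needed at all, one option is to scan the Xigt XML for the tiers mentioned above. The sketch below uses only the Python standard library rather than the xigt API. It assumes that the file has `igt` elements containing `tier` elements with a `type` attribute, and that the relevant tier types are named `morphemes`, `glosses`, and `pos`; these names are assumptions, so adjust them to match your data. The function name `missing_tiers` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed tier type names; change these if your Xigt files use different ones.
REQUIRED_TIER_TYPES = ("morphemes", "glosses", "pos")


def missing_tiers(xigt_xml: str) -> dict[str, list[str]]:
    """Map each igt id to the required tier types that the igt lacks."""
    root = ET.fromstring(xigt_xml)
    report = {}
    for igt in root.iter("igt"):
        present = {tier.get("type") for tier in igt.iter("tier")}
        missing = [t for t in REQUIRED_TIER_TYPES if t not in present]
        if missing:
            report[igt.get("id", "?")] = missing
    return report


# Example call (file name is a placeholder):
# with open("xigt_igts.xml", encoding="utf-8") as fh:
#     print(missing_tiers(fh.read()))
```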
If you have trouble setting up or running INTENT, please contact me at olga.zamaraeva@ gee mail and I will try to run it for you on your data.

[1]: https://github.com/rgeorgi/intent