
dataset · Changes

olzama created page: dataset (authored Sep 18, 2019 by Olga Zamaraeva)
dataset.md 0 → 100644
The required data format is [Xigt](https://github.com/xigt/xigt/wiki), with morpheme-level segmentation,
morpheme-to-gloss alignment, and POS tags.
A sample (toy) dataset is included under data/.
There is a fairly reliable toolbox-to-xigt converter.
Make sure your IGT file is properly encoded as UTF-8 and that
each IGT in your collection has a unique \ref value, then run the converter:
```
$ xigt import -f toolbox -i your_toolbox_igt.txt -o xigtified_igt.xml
```
(The above assumes you've installed xigt via pip;
this will happen on its own if you used pip to install MOM).
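
Both preconditions can be checked before running the converter. The following is a minimal sketch, not part of MOM or xigt: the function name `check_toolbox_text` is hypothetical, and it assumes every \ref marker sits at the start of its own line, as the awk one-liner on this page also assumes.

```python
from collections import Counter


def check_toolbox_text(raw: bytes) -> dict:
    """Report UTF-8 validity and duplicated \\ref values for Toolbox file contents."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # The file is not valid UTF-8; report where decoding failed.
        return {"utf8_ok": False, "error": str(err), "duplicates": []}
    # Collect whatever follows each line-initial \ref marker.
    refs = [
        line[len("\\ref"):].strip()
        for line in text.splitlines()
        if line.startswith("\\ref")
    ]
    counts = Counter(refs)
    # A value seen more than once would not be unique after conversion.
    duplicates = sorted(ref for ref, n in counts.items() if n > 1)
    return {"utf8_ok": True, "error": None, "duplicates": duplicates}
```

To use it, read the file in binary mode, for example `check_toolbox_text(open("your_toolbox_igt.txt", "rb").read())`, and look at the `utf8_ok` and `duplicates` entries of the result.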
One way to replace the \ref values in your file with line numbers (so that each is unique) is
to use the awk tool, available on macOS and on Unix systems generally:
```
$ awk '/^\\ref/{print "\\ref igt" NR;};!/^\\ref/' < original_file.txt > modified_file.txt
```
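
If awk is not available, the same renumbering can be sketched in Python. This is an illustrative equivalent of the one-liner above, not a tool shipped with MOM; the function name `renumber_refs` is hypothetical.

```python
def renumber_refs(lines):
    """Replace each line starting with \\ref by '\\ref igt<N>', N being its 1-based line number."""
    out = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("\\ref"):
            # Mirrors awk's: print "\\ref igt" NR
            out.append(f"\\ref igt{number}")
        else:
            # All other lines are passed through unchanged.
            out.append(line)
    return out
```

Applied to a file, this means reading it with `splitlines()`, passing the list to `renumber_refs`, and writing the result back joined with newlines.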
A FLEx-to-xigt converter will be created in the future.
If your dataset does not have morpheme-level segmentation or POS tags, you can use [INTENT][1] to enrich it.
INTENT is available through [this online interface](http://intent.xigt.org/).
If you would rather install INTENT and run it on your own machine, the command will look something like this:
```
python3 intent.py enrich xigt_igts.xml igts-enriched.xml --align heur --pos class
```
If you have trouble setting up or running INTENT, please contact me at olga.zamaraeva@ gee mail and I will try to run it for you on your data.
[1]:https://github.com/rgeorgi/intent
\ No newline at end of file