## Data Preparation

Please refer to <code>data_prep_config.yml</code> for the following configurations.

### Global Configuration

| Setting | Default | Description |
| ------ | ------ | ------ |
| iso_code | (various) | ISO code of the language |

### Corpus Configuration

| Setting | Default | Description |
| ------ | ------ | ------ |
| raw_data_file | (path_to_file) | Location of the corpus file. If the corpus spans multiple files, combine them into a single file first. |
| raw_data_type | toolbox | File type of the corpus. Supported: toolbox. Not yet supported (TODO): flex, elan, odin, pangloss. |

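
As a rough illustration, the corpus settings above might look like this in <code>data_prep_config.yml</code>. The flat key layout and the example path are assumptions; only the key names and the <code>toolbox</code> value come from the table.

```yaml
# Hypothetical fragment of data_prep_config.yml; flat top-level keys are assumed.
raw_data_file: path/to/combined_corpus_file   # placeholder path, replace with your own
raw_data_type: toolbox                        # the only file type listed as supported
```
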
### Toolbox Corpus Configuration

| Setting | Default | Description |
| ------ | ------ | ------ |
| toolbox_transcript_tier | \\tph | Name of the transcript tier in the Toolbox file |
| toolbox_morpheme_tier | \\mph | Name of the morpheme tier in the Toolbox file |
| toolbox_gloss_tier | \\mgl | Name of the gloss tier in the Toolbox file |
| toolbox_pos_tier | \\ps | Name of the part-of-speech tier in the Toolbox file |
| toolbox_translation_tier | \\eng | Name of the English translation tier in the Toolbox file |

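
A minimal sketch of the tier settings in YAML, assuming flat keys and assuming the defaults in the table stand for Toolbox markers with a single leading backslash. Quoting matters here: single-quoted YAML strings keep the backslash literal, whereas double-quoted strings interpret escape sequences.

```yaml
# Hypothetical fragment of data_prep_config.yml; key layout is assumed,
# values are the defaults from the table.
# Single quotes keep the backslash literal. In a double-quoted YAML string,
# "\tph" would be read as a tab character followed by "ph".
toolbox_transcript_tier: '\tph'
toolbox_morpheme_tier: '\mph'
toolbox_gloss_tier: '\mgl'
toolbox_pos_tier: '\ps'
toolbox_translation_tier: '\eng'
```
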
### XIGT Configuration

| Setting | Default | Description |
| ------ | ------ | ------ |
| raw_to_xigt | true | Set to true to convert the raw corpus to XIGT |
| raw_to_xigt_path | (path_to_XIGT_file) | Where to save the converted XIGT file; only applies when raw_to_xigt=true |
| xigt_path | (path_to_XIGT_directory) | Directory that stores all XIGT files for this language |
| xigt_file_path | (path_to_XIGT_file) | Existing XIGT file to load; only applies when raw_to_xigt=false |

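
The table describes two mutually exclusive modes: converting the raw corpus, or loading an existing XIGT file. The sketch below shows both, with the second commented out. The flat key layout and all paths are placeholders assumed for illustration.

```yaml
# Hypothetical fragment of data_prep_config.yml; flat keys and paths are assumptions.

# Mode 1: convert the raw corpus to XIGT and save the result.
raw_to_xigt: true
raw_to_xigt_path: path/to/XIGT_file      # used because raw_to_xigt is true
xigt_path: path/to/XIGT_directory

# Mode 2 (alternative): skip conversion and load an existing XIGT file.
# raw_to_xigt: false
# xigt_file_path: path/to/XIGT_file      # used because raw_to_xigt is false
# xigt_path: path/to/XIGT_directory
```
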
### XIGT Enrichment Configuration

| Setting | Default | Description |
| ------ | ------ | ------ |
| xigt_to_enriched_xigt | true | Set to true to turn on XIGT enrichment |
| enriched_xigt_file_path | (path_to_enrichedXIGT_file) | Path to store the enriched XIGT file |
| split_enriched_xigt | true | Set to true to split the enriched XIGT into 10 training folds and 10 test folds |

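
For orientation, an enrichment block using the defaults above could be written as follows. This is only a sketch: the flat key layout and the path are assumed, not taken from the actual config file.

```yaml
# Hypothetical fragment of data_prep_config.yml; layout and path are assumptions.
xigt_to_enriched_xigt: true
enriched_xigt_file_path: path/to/enrichedXIGT_file   # placeholder path
split_enriched_xigt: true   # per the table: 10 training folds and 10 test folds
```
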
### Testsuites Configuration

| Setting | Default | Description |
| ------ | ------ | ------ |
| create_testsuites | true | Set to true to create testsuites from the enriched XIGT |
| testsuites_directory_path | (path_to_testsuites_directory) | Directory to store all testsuites |

### AGG Config Preparation

| Setting | Default | Description |
| ------ | ------ | ------ |
| prepare_agg_configs | true | Set to true to run AGG config file preparation |
| agg_config_path | (path_to_agg_config_directory) | Directory to store AGG config files |
| collect_tags | true | Set to true to collect the POS tag set from the enriched XIGT |
| pos_tag_tier_names | "pos m" | POS tag tier names in the enriched XIGT |

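
A possible YAML rendering of the AGG preparation settings, assuming flat keys and a placeholder directory. The <code>pos_tag_tier_names</code> value is copied verbatim from the table; how the tool interprets it is not documented here.

```yaml
# Hypothetical fragment of data_prep_config.yml; layout and path are assumptions.
prepare_agg_configs: true
agg_config_path: path/to/agg_config_directory   # placeholder path
collect_tags: true
pos_tag_tier_names: "pos m"   # quoted exactly as in the table
```
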
## AGG Inferences Configuration

Please refer to <code>agg_config.yml</code> for the following configurations.

| Setting | Default | Description |
| ------ | ------ | ------ |
| agg_config_path | (path_to_agg_config_directory) | Directory of the AGG config files |
| compression_rounds | 1 | TODO |
| output_dir | (path_to_output_directory) | Directory to output inferred choices file and skipped items |
| graph | false | TODO |
| hyphens | true | TODO |
| precluster | None | TODO |
| glosses | true | TODO |
| compression | 0.2 | TODO |
| lexitem_classes | true | TODO |
| all_stems_occur_bare | true | TODO |
| ignore_chars | None | TODO |
| ungrammatical | '*!?' | TODO |
| allomorphs | None | TODO |
| boundaries | true | TODO |
| allow_difference | 5 | TODO |
| escape_special_characters | true | TODO |
| wordlist | '' | TODO |
| infer_case | gram | TODO |

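
As a starting point, the defaults above could be collected into an <code>agg_config.yml</code> along these lines. The flat key layout and both paths are assumptions. Settings whose default is listed as None are left out because the table does not say how an unset value should be written.

```yaml
# Hypothetical agg_config.yml built from the table defaults.
# Flat keys and placeholder paths are assumptions.
agg_config_path: path/to/agg_config_directory
output_dir: path/to/output_directory
compression_rounds: 1
compression: 0.2
graph: false
hyphens: true
glosses: true
lexitem_classes: true
all_stems_occur_bare: true
boundaries: true
allow_difference: 5
escape_special_characters: true
ungrammatical: '*!?'   # quoted: a leading * would otherwise start a YAML alias
wordlist: ''
infer_case: gram
# precluster, ignore_chars, and allomorphs default to None in the table and are
# omitted here; how the tool expects an unset value is not documented.
```
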
## Evaluation Preparation

Please refer to <code>eval_config.yml</code> for the following configurations.

| Setting | Default | Description |
| ------ | ------ | ------ |
| TODO | | |