# Aggregation (AGG) Pipeline

The pipeline has three main stages:

(1) Data Preparation: formats the data so that it can be processed by the AGG pipeline in (2)

(2) Aggregation Inferences: the main process

(3) Evaluations: reports the number of successful parses from (2)

For the details of each step, please follow the instructions below.

## Requirements

- [ ] A suitable [conda](https://conda.io/) environment is installed.

> Note: If you are running the AGG pipeline on [Patas](https://wiki.ling.washington.edu/bin/view.cgi/Main/CondorClusterHomepage), you can simply run the following script to install the required environment.

```
source install.sh
```

# Running AGG Pipeline

## 1. Data Preparation (with INTENT)

> Note: currently, data preparation only supports the following data format(s): toolbox
>
> In development: flex, elan, odin, pangloss

Before running the aggregation process, ensure the correct [configurations](https://git.ling.washington.edu/agg/agg-scripts/-/wikis/Configurations) have been made in <code>data_prep_config.yml</code>.

Set each of the following variables to achieve the desired effect.

- [ ] raw_to_xigt_path: specifies the output path for the translation to Xigt only, without enrichment.
- [ ] enriched_xigt_file_path: specifies the path of the enriched Xigt file.
- [ ] split_enriched_xigt: (if true) splits the enriched Xigt into 10 folds.
- [ ] xigt_path: specifies where the enriched Xigt is stored after being split into 10 folds.
- [ ] testsuites_directory_path: specifies where the testsuites, condensed from the enriched Xigt data, are stored.
- [ ] agg_config_path: specifies the path of the configuration file used in the next step.
- [ ] collect_tags: (if true) creates a file <code>pos-tags.txt</code> storing all part-of-speech tags found in the given corpus.

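As an illustration, a filled-in <code>data_prep_config.yml</code> might look like the following sketch; every path here is a hypothetical placeholder, not a value shipped with the repository:

```yaml
# Hypothetical example values; adjust all paths to your own setup.
raw_to_xigt_path: ./data/raw_xigt.xml
enriched_xigt_file_path: ./data/enriched_xigt.xml
split_enriched_xigt: true
xigt_path: ./data/folds/
testsuites_directory_path: ./data/testsuites/
agg_config_path: ./agg_config.yml
collect_tags: true
```
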
To transform raw data into Xigt, run the following script from the path <code>.../agg_script</code>:

```
python data_preparation.py
```

The data can then go smoothly through the main aggregation process.

## Fill in tags in the corresponding files, based on the tags collected in <code>pos-tags.txt</code>

At this point, the console will prompt for an action that MUST be completed manually.

As prompted, find the part-of-speech tags collected in the previous step at the specified path, and categorize all that are (1) adpositions, (2) nouns, and (3) verbs, following the format below:

```
NOUN
PRON
...
```

Each type of tag has a separate .txt file, following the naming format <code>adp-tags.txt</code>, <code>noun-tags.txt</code>, etc.

If there are tags that should specifically be ignored, list them in <code>ignore-tags.txt</code>.

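As a sketch, the categorization can be done by hand in a text editor, or with a few lines of Python; the tag names below are hypothetical examples, not tags from any particular corpus:

```python
# Write one file per tag category; each line holds one part-of-speech
# tag copied from pos-tags.txt. All tag names here are made-up examples.
categories = {
    "adp-tags.txt": ["ADP"],
    "noun-tags.txt": ["NOUN", "PRON"],
    "verb-tags.txt": ["VERB", "AUX"],
    "ignore-tags.txt": ["PUNCT"],  # tags to skip entirely
}

for filename, tags in categories.items():
    with open(filename, "w") as f:
        f.write("\n".join(tags) + "\n")
```
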
## 2. Aggregation Inferences

Find <code>agg_config.yml</code> at the path specified in the previous step in <code>data_prep_config.yml</code>.

Following the same procedure as in step 1, configure the following variables:

- [ ] xigt_path: same as set in <code>data_prep_config.yml</code>
- [ ] agg_config_path: same as set in <code>data_prep_config.yml</code>
- [ ] output_dir: the desired output path for the choice files.

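For example, a matching <code>agg_config.yml</code> might look like the sketch below; the paths are hypothetical placeholders and must agree with the ones in <code>data_prep_config.yml</code>:

```yaml
# Hypothetical example values; keep consistent with data_prep_config.yml.
xigt_path: ./data/folds/
agg_config_path: ./agg_config.yml
output_dir: ./choices/
```
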
Make sure the paths in <code>agg_config.yml</code> match the ones already set in <code>data_prep_config.yml</code>.

Once the data is prepared and the variables are configured, run aggregation with the following command:

```
python run_agg.py
```

The process returns choice files, which can be used to build a grammar matrix.

## 3. Evaluations

As in steps 1 and 2, configure <code>eval_config.yml</code>, noting the following variables:

- [ ] data_dir: directory of the preprocessed data returned in step 1
- [ ] choice_dir: directory of the choice files returned in step 2
- [ ] output_dir: directory for the results

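A filled-in <code>eval_config.yml</code> might look like this sketch, where all paths are hypothetical placeholders:

```yaml
# Hypothetical example values.
data_dir: ./data/folds/
choice_dir: ./choices/
output_dir: ./result/
```
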
```
python evaluate.py
```

The console will show the result ratio for each fold in the data:

> number of total items / number of parsed items / total number of parses in the fold

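Reading that ratio as parse coverage per fold, it can be computed as in the following minimal sketch; the function name and the counts are hypothetical, not part of the pipeline's actual output:

```python
def parse_coverage(total_items: int, parsed_items: int) -> float:
    """Fraction of items in a fold that received at least one parse."""
    return parsed_items / total_items if total_items else 0.0

# e.g. a hypothetical fold with 100 items, of which 73 parsed
print(parse_coverage(100, 73))  # -> 0.73
```
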
For more details on the results, see the folder <code>./result</code> specified by <code>output_dir</code> in <code>eval_config.yml</code>.

## [Configurations](https://git.ling.washington.edu/agg/agg-scripts/-/wikis/Configurations)