Edi Xin · 711e1bd2
--- a/Home.md
+++ b/Home.md
-## Aggregation(AGG) Pipeline
+# Aggregation(AGG) Pipeline
-TODO
+The pipeline has three main stages:
- Data Preparation
+(1) Data Preparation: formats the data so that it can be processed by the AGG pipeline in (2)
+(2) Aggregation Inferences: the main process
-TODO
+(3) Evaluations: reports the number of successful parses from (2)
- AGGREGATION INFERENCES
+For details on each step, follow the instructions below.
-TODO
+## Requirements
+- [ ] A suitable [conda](https://conda.io/) environment is installed.
- EVALUATIONS
+> Note: If you are running the AGG pipeline on [Patas](https://wiki.ling.washington.edu/bin/view.cgi/Main/CondorClusterHomepage), you can install the required environment by running the following script:
+```
+source install.sh
+```
+# Running AGG Pipeline
+## 1. Data Preparation (with INTENT)
-TODO
+> Note: Currently, data preparation supports only the following data format(s): toolbox
-## Requirements
+> In development: flex, elan, odin, pangloss
-A suitable [conda](https://conda.io/) environment is installed.
+Before running the aggregation process, make sure the file <code> data_prep_config.yml </code> is correctly configured (see [configurations](https://git.ling.washington.edu/agg/agg-scripts/-/wikis/Configurations)).
+Set each of the following variables:
-If you are running the AGG pipeline in [Patas](https://wiki.ling.washington.edu/bin/view.cgi/Main/CondorClusterHomepage), you can simply run the following script to install the required environment.
+- [ ] raw_to_xigt_path: path of the plain xigt conversion output, without enrichment.
+- [ ] enriched_xigt_file_path: path of the enriched-xigt.
+- [ ] split_enriched_xigt: if true, splits the enriched-xigt into 10 folds.
+- [ ] xigt_path: path where the enriched-xigt is stored after it is split into 10 folds.
+- [ ] testsuites_directory_path: path where the testsuites, condensed based on the enriched-xigt data, are stored.
+- [ ] agg_config_path: path of the configuration file used in the next step.
+- [ ] collect_tags: if true, creates a file <code> pos-tags.txt </code> that stores all part-of-speech tags found in the given corpus.
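+
+As a rough illustration of the variables above, a filled-in <code> data_prep_config.yml </code> might look like the sketch below. The flat key layout and every path value are assumptions made for illustration only; the actual file shipped with the scripts may be structured differently.
+```
+# Hypothetical example values; replace each path with your own.
+# Whether each path is a file or a directory is an assumption.
+raw_to_xigt_path: /path/to/xigt/raw.xml
+enriched_xigt_file_path: /path/to/xigt/enriched.xml
+split_enriched_xigt: true
+xigt_path: /path/to/xigt/folds/
+testsuites_directory_path: /path/to/testsuites/
+agg_config_path: /path/to/agg_config.yml
+collect_tags: true
+```
+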
+To transform raw data into xigt, run the following script from the directory <code> .../agg_script </code>:
 ```
-source install.sh
+python data_preparation.py
 ```
-## Running AGG Pipeline
- Data Preparation
-TODO
+After this step, the data is in a form that the main aggregation process can handle.
+## Fill in the tag files based on the tags collected in <code> pos-tags.txt </code>
+At this point, the console prompts for an action that MUST be completed manually.
+As prompted, find the part-of-speech tags collected in the previous step at the specified path, and sort those that are (1) adpositions, (2) nouns, or (3) verbs into the corresponding files, one tag per line, following the format below:
 ```
-python data_preparation.py
+NOUN
+PRON
+...
 ```
+Each type of tag has its own .txt file, named following the pattern <code> adp-tags.txt </code>, <code> noun-tags.txt </code>, etc.
+If there are tags that should be ignored, list them in <code> ignore-tags.txt </code>.
+## 2. Aggregation Inferences
+Find <code> agg_config.yml </code> at the path set by <code> agg_config_path </code> in <code> data_prep_config.yml </code> in the previous step.
+As in step 1, configure the following variables.
- Aggregation Inferences
+- [ ] xigt_path: same as set in <code>data_prep_config.yml</code>
+- [ ] agg_config_path: same as set in <code>data_prep_config.yml</code>
+- [ ] output_dir: path where the choice files are written.
+Make sure the paths in <code>agg_config.yml</code> match the ones already set in <code>data_prep_config.yml</code>.
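+
+A minimal sketch of <code>agg_config.yml</code>, assuming a flat key layout and hypothetical paths. The only point it illustrates is that <code>xigt_path</code> and <code>agg_config_path</code> carry the same values as in <code>data_prep_config.yml</code>.
+```
+# Hypothetical example values; the first two must equal the values
+# set for the same keys in data_prep_config.yml.
+xigt_path: /path/to/xigt/folds/
+agg_config_path: /path/to/agg_config.yml
+output_dir: /path/to/choices/
+```
+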
+Once the data is prepared and the configuration is set, run the aggregation with the following command:
-TODO
 ```
 python run_agg.py
 ```
- Evaluations
+The process produces choice files, which can be used with the Grammar Matrix to build a grammar.
+## 3. Evaluations
+As in steps 1 and 2, configure <code> eval_config.yml</code>, setting the following variables:
+- [ ] data_dir: directory of the preprocessed data produced in step 1
+- [ ] choice_dir: directory of the choice files produced in step 2
+- [ ] output_dir: directory where the results are written
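+
+For orientation, <code> eval_config.yml</code> might be filled in along the following lines. The flat layout and all paths are assumptions for illustration; <code>./result</code> is used for <code>output_dir</code> only because this page refers to that folder.
+```
+# Hypothetical example values; replace each path with your own.
+data_dir: /path/to/testsuites/
+choice_dir: /path/to/choices/
+output_dir: ./result
+```
+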
-WIP
 ```
 python evaluate.py
 ```
+The console shows the result ratio for each fold in the data:
+> number of total items / number of parsed items / total number of parses in fold
-## [Configurations](https://git.ling.washington.edu/agg/agg-scripts/-/wikis/Configurations)
+For more details on the results, see the folder <code>./result </code>, as specified by <code> output_dir</code> in <code> eval_config.yml</code>.