Home.md (view page @ 711e1bd2)

Update Home with comments outlining pipeline structure, authored Jun 23, 2023 by Edi Xin
# Aggregation (AGG) Pipeline

The pipeline has three main stages:

(1) Data Preparation: formats the data so that it can be processed by the AGG pipeline in (2).
(2) Aggregation Inferences: the main process.
(3) Evaluations: reports the number of successful parses from (2).

For the details of each step, please follow the instructions below.
## Requirements

- [ ] A suitable [conda](https://conda.io/) environment is installed.

> Note: If you are running the AGG pipeline on [Patas](https://wiki.ling.washington.edu/bin/view.cgi/Main/CondorClusterHomepage), you can simply run the following script to install the required environment:

```
source install.sh
```
# Running the AGG Pipeline

## 1. Data Preparation (with INTENT)

> Note: currently, data preparation only supports the following data format(s): toolbox.
> In development: flex, elan, odin, pangloss.
Before running the aggregation process, ensure the correct [configurations](https://git.ling.washington.edu/agg/agg-scripts/-/wikis/Configurations) have been made in the file <code>data_prep_config.yml</code>.
Set each of the following variables to achieve the desired effect:
- [ ] raw_to_xigt_path: specifies the output path for the translation to Xigt only, without enrichment.
- [ ] enriched_xigt_file_path: specifies the path of the enriched Xigt.
- [ ] split_enriched_xigt: (if true) splits the enriched Xigt into 10 folds.
- [ ] xigt_path: specifies where the enriched Xigt is stored after being split into 10 folds.
- [ ] testsuites_directory_path: specifies where the testsuites, condensed from the enriched Xigt data, are stored.
- [ ] agg_config_path: specifies the path of the configuration file used in the next step.
- [ ] collect_tags: (if true) creates a file <code>pos-tags.txt</code> storing all part-of-speech tags found in the given corpus.
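Put together, a filled-in <code>data_prep_config.yml</code> might look like the sketch below. All paths and the directory layout are hypothetical placeholders, not values from the repository:

```yaml
# Hypothetical example values; adjust every path to your own layout.
raw_to_xigt_path: ./output/raw_xigt.xml         # Xigt without enrichment
enriched_xigt_file_path: ./output/enriched.xml  # enriched Xigt file
split_enriched_xigt: true                       # split into 10 folds
xigt_path: ./output/folds/                      # where the folds are stored
testsuites_directory_path: ./output/testsuites/
agg_config_path: ./agg_config.yml               # config used by step (2)
collect_tags: true                              # write pos-tags.txt
```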
To transform the raw data into Xigt, run the following script under the path <code>.../agg_script</code>:

```
python data_preparation.py
```
The data can then go smoothly through the aggregation main process.

## Fill in tags in corresponding files, based on the tags collected in <code>pos-tags.txt</code>
At this point, the console will prompt for an action that MUST be completed manually.
As prompted, find the part-of-speech tags collected in the previous step at the specified path, and categorize all that are (1) adpositions, (2) nouns, and (3) verbs, following the format below:
```
NOUN
PRON
...
```
Each type of tag will have a separate .txt file, following the naming format <code>adp-tags.txt</code>, <code>noun-tags.txt</code>, etc.
If there are tags that should specifically be ignored, list them in <code>ignore-tags.txt</code>.
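As an illustration, the categorization can be done directly from the shell. The tag names below are hypothetical examples; only the <code>&lt;category&gt;-tags.txt</code> file-name convention comes from this page, so pick the actual tags from your own <code>pos-tags.txt</code>:

```shell
# Hypothetical tag sets; one tag per line, matching the format shown above.
printf 'ADP\n'         > adp-tags.txt     # adpositions
printf 'NOUN\nPROPN\n' > noun-tags.txt    # nouns
printf 'VERB\nAUX\n'   > verb-tags.txt    # verbs
printf 'PUNCT\n'       > ignore-tags.txt  # tags to skip entirely
```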
## 2. Aggregation Inferences
Find <code>agg_config.yml</code> at the path specified in the previous step in <code>data_prep_config.yml</code>.
Following the same steps as in 1, configure the following variables:
- [ ] xigt_path: same as set in <code>data_prep_config.yml</code>.
- [ ] agg_config_path: same as set in <code>data_prep_config.yml</code>.
- [ ] output_dir: the desired output path for the choice files.
Make sure the paths in <code>agg_config.yml</code> match the ones already set in <code>data_prep_config.yml</code>.
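For example, a minimal <code>agg_config.yml</code> might look like this (all values are hypothetical placeholders):

```yaml
# Hypothetical values; the paths must match data_prep_config.yml.
xigt_path: ./output/folds/
agg_config_path: ./agg_config.yml
output_dir: ./output/choices/
```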
Once the data is prepared and the configuration is set, run the aggregation with the following command:

```
python run_agg.py
```

The process returns choice files, which can be used to build a grammar with the Grammar Matrix.
## 3. Evaluations
As in 1 and 2, configure <code>eval_config.yml</code>, noting the following variables:
- [ ] data_dir: the directory of preprocessed data returned in 1.
- [ ] choice_dir: the directory of choice files returned in 2.
- [ ] output_dir: the directory for results.
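A filled-in <code>eval_config.yml</code> might look like the sketch below (hypothetical paths; adjust them to the outputs of steps 1 and 2):

```yaml
# Hypothetical values for illustration only.
data_dir: ./output/folds/     # preprocessed data from step 1
choice_dir: ./output/choices/ # choice files from step 2
output_dir: ./result/         # where evaluation results are written
```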
Then run:

```
python evaluate.py
```

The console will show the result ratio for each fold in the data:

> num of total items / num of parsed items / num of total parses in the fold
For more details on the results, see the folder <code>./result</code>, as specified by <code>output_dir</code> in <code>eval_config.yml</code>.

## [Configurations](https://git.ling.washington.edu/agg/agg-scripts/-/wikis/Configurations)