guibon / ISTEX-TermSuiteLauncher
Launcher for ISTEX usage of TermSuite
guibon authored on 5 Oct 2015
ISTEX-TermSuiteLauncher New version: TT pipeline fix 2 years ago
model Simplified launcher and source code. 2 years ago
ISTEX-TermSuiteLauncher.jar New version: TT pipeline fix 2 years ago
ISTEXlauncher.sh Added -model option and Xmx 2 years ago
README.md Added -model option and Xmx 2 years ago
launcher.sh~ New version of the launcher using the TermSuite version readjusted for large volumes by Damien Cram 2 years ago
README.md

ISTEX-TermSuiteLauncher

Context

This program is a launcher whose purpose is to make TermSuite easier to use on ISTEX corpora. The goal is to let a newcomer run TermSuite on ISTEX corpora without having to deal with non-essential options or features. It is a modified version of the InlineLauncher written by Damien Cram (LINA-CNRS), to which a custom CLI, models, and utilities were added.

In all cases, the "model" directory must be placed at the root of the project, next to ISTEXlauncher.sh.

Basic Usage

Using the shell script

This is the recommended method, as it saves you from retyping long command lines.

sh ISTEXlauncher.sh

In the ISTEXlauncher.sh file, several variables can be edited.

Mandatory fields:

  1. CORPUS : path of the directory containing your txt file(s). If you have XML files, please refer to the ISTEX-xml2txt program to convert them to txt files.
  2. RESOURCES : path of the directory containing the resources. For instance: "termsuite-resources", which contains termsuite-resources/en
  3. MODEL : path of the directory containing the Mate models. Models are named "mate-lemma-en.model" and "mate-pos-en.model"
  4. OUTPUT : path of the output JSON file.

Optional fields:

  5. LEMMAPOS : selects the lemmatizer and part-of-speech tagger, either "mate" (Mate-tools anna 3.3) or "treetagger" (TreeTagger)
  6. TREETAGGER_HOME : home path of the TreeTagger program, if you want to use it. You will first need to install TreeTagger.
  7. LANG : selects the language, either "en" (English) or "fr" (French). Other languages are not needed for ISTEX.
Using the Java CLI

If you don't want to use the shell script, you can use the following basic command:

java -jar ISTEX-TermSuiteLauncher.jar -corpus [dir path] -resources [dir path] -output [path]

Help option

You can display the help message either by adding the "-help" flag or by launching ISTEX-TermSuiteLauncher without any options.

java -jar ISTEX-TermSuiteLauncher.jar

OR

java -jar ISTEX-TermSuiteLauncher.jar [option1] [option...] -help

Advanced Usage

Using TreeTagger instead of Mate-tools for lemmatization and PoS tagging

By default, Mate is used for PoS tagging and lemmatization. To use TreeTagger instead, you need two options: "-treetaggerhome", which sets the home directory of your TreeTagger installation, and "-lemmatizerpostagger", set to "treetagger". Here is an example:

java -jar ISTEX-TermSuiteLauncher.jar -corpus [dir path] -resources [dir path] -model [dir path] -output [file path] -lemmatizerpostagger treetagger -treetaggerhome /home/gael/programs/TreeTagger
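As a rough sketch of how a wrapper script might choose between the two taggers, the function below assembles the command line from its arguments and prints it without running anything. `build_command` is a hypothetical helper, not something shipped with the launcher; it only uses the options shown in this README and assumes that paths contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: print the launcher command for a given tagger choice.
# Arguments: corpus resources model output [tagger] [treetagger_home]
build_command() {
    corpus=$1; resources=$2; model=$3; output=$4
    tagger=${5:-}; tt_home=${6:-}
    cmd="java -jar ISTEX-TermSuiteLauncher.jar -corpus $corpus -resources $resources -model $model -output $output"
    case "$tagger" in
        mate|"")
            # Mate is the default, so no extra option is added.
            ;;
        treetagger)
            if [ -z "$tt_home" ]; then
                echo "treetagger requires a TreeTagger home directory" >&2
                return 1
            fi
            cmd="$cmd -lemmatizerpostagger treetagger -treetaggerhome $tt_home"
            ;;
        *)
            echo "unknown tagger: $tagger" >&2
            return 1
            ;;
    esac
    echo "$cmd"
}
```

For instance, `build_command ./corpus ./termsuite-resources ./model out.json treetagger /home/gael/programs/TreeTagger` prints a command of the same form as the example above, which can be inspected before it is run.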

TSV output

The launcher always produces two output files: a JSON file and a TSV file. The TSV file contains an index based on the Weirdness Ratio and is created next to the JSON file. Its columns are: line number | Term(T) or Variant(V) | TermPilot | Lemmas | Frequency | Weirdness Ratio | SpottingRule
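To illustrate how the TSV file might be exploited, here is a small sketch that lists the terms with the highest Weirdness Ratio. It rests on several assumptions that should be checked against a real output file: fields are separated by tabs, columns appear in the order given above, there is no header row, and the second column holds the literal letter "T" for terms and "V" for variants. `top_terms` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the N terms (rows marked "T") with the
# highest Weirdness Ratio, as "ratio<TAB>TermPilot" lines.
# Assumed columns: 1 line number, 2 T/V, 3 TermPilot, 4 Lemmas,
# 5 Frequency, 6 Weirdness Ratio, 7 SpottingRule.
top_terms() {
    tsv_file=$1
    count=${2:-10}   # number of terms to show, 10 by default
    awk -F '\t' '$2 == "T" { print $6 "\t" $3 }' "$tsv_file" |
        sort -t "$(printf '\t')" -k1,1nr |
        head -n "$count"
}
```

A call such as `top_terms output.tsv 20` (with `output.tsv` standing for whatever TSV file was created next to your JSON output) would then show the 20 highest-ranked terms, leaving variants aside.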

Contacts

gael dot guibon at gmail.com, gael dot guibon at inist.fr, istex at inist.fr

@2015 ISTEX INIST-CNRS