guibon / ISTEX-LexiconTagger
Program containing tools and options to lemmatize a lexicon (RDF SKOS TPS exact format). Initially made for ISTEX purposes.
guibon authored on 9 Sep 2015

ISTEX-LexiconTagger

This program lemmatizes a lexicon, using different methods to find a sentence context. You can choose among:

  • requesting Wikipedia
  • requesting StartPage
  • looking for a sentence in a given corpus

Basic Usage

Command without any option

java -jar ISTEX-LexiconTagger.jar -lexicon [Path] -ouput [Path] [-addOptionsHere...]

Help option

You can always add the "-help" flag option, or launch LexiconTagger without any options, to show the help message.

java -jar ISTEX-LexiconTagger.jar

OR

java -jar ISTEX-LexiconTagger.jar [option1] [option...] -help

Without context

If you want a first result quickly, use the -contextfree option. It will override the other context options (wikipedia, startpage and corpora).

java -jar ISTEX-LexiconTagger.jar -lexicon [Path] -ouput [Path] -contextfree

Using an external corpus

LexiconTagger can tag a French and English lexicon, so you will have to specify a corpus for each language. Corpora are expected in txt format (tokenized full text without tags). If your corpus isn't tokenized, you will need to add the "-tokenize" flag option. Here is a usage example:

java -jar ISTEX-LexiconTagger.jar -lexicon [Path] -ouput [Path] -corpusFR [path] -corpusEN [path] [-tokenize]

If you only have one corpus but a two-language lexicon, you can specify only one corpus path. It will then be used only for the language of the option it was passed with.
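As a minimal sketch of the one-corpus case, the helper below assembles the command line and adds a corpus option only when the corresponding file exists. The function name `build_corpus_command` is hypothetical and not part of LexiconTagger; only the flags already shown in this README are used.

```shell
#!/bin/sh
# Hypothetical helper (not part of LexiconTagger): print the command line
# for a lexicon, an output path and up to two corpora.
# A corpus option is added only when its file actually exists.
build_corpus_command() {
    lexicon=$1; output=$2; fr_corpus=$3; en_corpus=$4
    # Reuse the positional parameters as the argument list being built.
    set -- java -jar ISTEX-LexiconTagger.jar -lexicon "$lexicon" -ouput "$output"
    if [ -n "$fr_corpus" ] && [ -f "$fr_corpus" ]; then
        set -- "$@" -corpusFR "$fr_corpus"
    fi
    if [ -n "$en_corpus" ] && [ -f "$en_corpus" ]; then
        set -- "$@" -corpusEN "$en_corpus"
    fi
    # Print for inspection; replace this line with "$@" to run the command.
    printf '%s\n' "$*"
}

build_corpus_command ../bioLexicon.xml ../taggedBioLexicon.xml \
    ../corpora/bioCorpusFR.txt ../corpora/bioCorpusEN.txt
```

Printing the command instead of running it makes it easy to check which corpus options were kept before starting a long tagging run.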

Using Wikipedia and/or StartPage

Because you will need an internet connection, please make sure your proxy is configured if necessary (see the "Configure proxy" section). To search for context sentences on Wikipedia or StartPage, you only need to add these flag options. Make sure the "contextfree" and "help" options are absent, as they will override them.

java -jar ISTEX-LexiconTagger.jar -lexicon [Path] -ouput [Path] -wikipedia -startpage

One complete basic usage example

java -jar ISTEX-LexiconTagger.jar -lexicon ../bioLexicon.xml -ouput ../taggedBioLexicon.xml -wikipedia -startpage -corpusFR ../corpora/bioCorpusFR.txt -corpusEN ../corpora/bioCorpusEN.txt -tsv

Advanced Usage

Limit the number of lemmatized terms

You can add a limit with the "-n" option in order to lemmatize only the first x terms of your lexicon. Useful for test purposes.

java -jar ISTEX-LexiconTagger.jar -lexicon [Path] -ouput [Path] -contextfree -n 100

Enable tsv outputs

The "-tsv" flag option adds one tab-separated (TSV) output file per language. These outputs are simpler and are very useful if, for instance, you are using the TermSuite software.

java -jar ISTEX-LexiconTagger.jar -lexicon [Path] -ouput [Path] -contextfree -tsv

Configure proxy

If you or your company use a proxy and you want to use the wikipedia and startpage flag options, you will have to use the "proxy" and "port" options together. Here is an example:

java -jar ISTEX-LexiconTagger.jar -lexicon [Path] -ouput [Path] -wikipedia -startpage -proxy http://proxy.com -port 8080
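Since the two options must be given together, a wrapper script can check this before launching the program. The following is a minimal sketch; the function name `proxy_options` is hypothetical and not part of LexiconTagger.

```shell
#!/bin/sh
# Hypothetical helper (not part of LexiconTagger): print the proxy options,
# or fail when only one of the two values is provided.
proxy_options() {
    proxy_host=$1; proxy_port=$2
    if [ -z "$proxy_host" ] && [ -z "$proxy_port" ]; then
        return 0    # no proxy configured: print nothing
    fi
    if [ -z "$proxy_host" ] || [ -z "$proxy_port" ]; then
        echo "both a proxy and a port are required" >&2
        return 1
    fi
    printf '%s\n' "-proxy $proxy_host -port $proxy_port"
}

proxy_options http://proxy.com 8080
# Possible use (left unquoted on purpose so the result splits into words):
#   java -jar ISTEX-LexiconTagger.jar -lexicon [Path] -ouput [Path] -wikipedia $(proxy_options "$PROXY" "$PORT")
```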

Convert an XML result to TSV

Maybe you used LexiconTagger to tag your lexicon with every option but forgot to add the "tsv" output flag option. If so, you can use LexiconTagger to generate the TSV outputs from an already tagged lexicon without running the tagging again.

To do so you will have to use the "-xml2tsv" flag option like this:

java -jar ISTEX-LexiconTagger.jar -lexicon [Path] -ouput ../lexicon/taggedLexicon.xml -xml2tsv

It will override other options except the "help" option and will create the TSV files in the same directory as the "output" option. With this example you will have two TSV files: "../lexicon/FR_taggedLexicon.tsv" and "../lexicon/EN_taggedLexicon.tsv".
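The naming in this example can be sketched as a small shell function that derives the TSV paths from the output path. It assumes the pattern seen above (same directory, language prefix, ".xml" replaced by ".tsv"); the function name `expected_tsv_paths` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of LexiconTagger): print the TSV paths
# expected for a given output path, one per line.
# Assumed pattern, taken from the example: <dir>/<LANG>_<name without .xml>.tsv
expected_tsv_paths() {
    output_path=$1
    dir=$(dirname "$output_path")
    base=$(basename "$output_path" .xml)
    for lang in FR EN; do
        printf '%s/%s_%s.tsv\n' "$dir" "$lang" "$base"
    done
}

expected_tsv_paths ../lexicon/taggedLexicon.xml
# prints:
#   ../lexicon/FR_taggedLexicon.tsv
#   ../lexicon/EN_taggedLexicon.tsv
```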

Use different models

There are three models used for each language:

  1. model for OpenNLP tokenizer "-modelTokenFR | -modelTokenEN"
  2. model for OpenNLP sentence detector "-modelSentFR | -modelSentEN"
  3. model for Mate 3.3 lemmatizer "-modelLemmaFR | -modelLemmaEN"

Here is an example of how to specify a different model only for French tokenization and sentence detection (any model that is not specified falls back to the default one):

java -jar ISTEX-LexiconTagger.jar -lexicon [Path] -ouput [Path] -contextfree -modelTokenFR [new model path] -modelSentFR [new model path]

Use ISTEX-LexiconTagger for another language

ISTEX-LexiconTagger was initially made for French and/or English lexicons, but it can be used for another language. In that case, the wikipedia option will not be available.

To do so, specify the new models for the given language through the model options of one of the supported languages, and pass the corpus through the corpus option of that same language. The output names will carry the supported language's code rather than the new one, but the result itself will be fine. Here is an example for Italian (IT), using the FR options:

java -jar ISTEX-LexiconTagger.jar -lexicon ../lexiconIT.xml -ouput ../taggedLexiconIT.xml -modelSentFR ../model/modelSentIT -modelTokenFR ../model/modelTokenIT -modelLemmaFR ../model/modelLemmaIT -corpusFR ../corpora/corpusIT.txt -tokenize -tsv -startpage

This will produce two files:

  • ../taggedLexiconIT.xml
  • ../FR_taggedLexiconIT.tsv

You will then only need to rename the TSV output to "IT_taggedLexiconIT.tsv".
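The renaming step can be scripted. The following is a minimal sketch; the function name `rename_tsv_language` is hypothetical and not part of LexiconTagger, and it only handles files whose name starts with "FR_".

```shell
#!/bin/sh
# Hypothetical helper (not part of LexiconTagger): replace the "FR_" prefix
# of a TSV file name with another language code, keeping the same directory.
rename_tsv_language() {
    tsv_path=$1; new_lang=$2
    dir=$(dirname "$tsv_path")
    name=$(basename "$tsv_path")
    case $name in
        FR_*)
            # ${name#FR_} strips the leading "FR_" from the file name.
            mv -- "$tsv_path" "$dir/${new_lang}_${name#FR_}"
            ;;
        *)
            echo "not an FR_ TSV file: $tsv_path" >&2
            return 1
            ;;
    esac
}

# Example call, once the tagging run has produced the file:
#   rename_tsv_language ../FR_taggedLexiconIT.tsv IT
```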

Contacts

  • gael dot guibon at gmail.com
  • gael dot guibon at inist.fr
  • istex at inist.fr

@2015 ISTEX INIST-CNRS