stg7/ngram-extractor

Ngram Extractor (dataset generation for PhrasIt)

Requirements:

  • Java 8

As a first test, run:

./ngram-extractor.sh -h

Once the project has compiled successfully, the help screen is displayed.

Now you can extract n-grams with:

./ngram-extractor.sh FILES

e.g.

./ngram-extractor.sh in_data/*.pdf

The tool prints every n-gram, together with its frequency in your collection, to stdout. Each line has the following format (n-gram, tab character, frequency):

Ngram \t freq
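
Because each output line is just an n-gram, a tab, and a count, the output is easy to post-process. The following is a minimal Python sketch and not part of the tool. The helper names `parse_ngram_lines` and `top_ngrams` are hypothetical, and the sketch assumes that the frequency is an integer and that the last tab on a line separates the n-gram from its frequency.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_ngram_lines(lines: Iterable[str]) -> Counter:
    """Parse 'ngram<TAB>freq' lines into a Counter (hypothetical helper)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        # Split on the last tab so the n-gram itself may contain spaces.
        ngram, freq = line.rsplit("\t", 1)
        # Assumption: freq is an integer; identical n-grams are summed.
        counts[ngram.strip()] += int(freq.strip())
    return counts


def top_ngrams(counts: Counter, k: int = 10) -> List[Tuple[str, int]]:
    """Return the k most frequent (ngram, freq) pairs."""
    return counts.most_common(k)
```

For example, after redirecting the output with `./ngram-extractor.sh in_data/*.pdf > ngrams.tsv`, calling `top_ngrams(parse_ngram_lines(open("ngrams.tsv")))` would list the ten most frequent n-grams.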

Supported Formats

Any text format supported by Apache Tika can be used as input (see formats), for example:

  • txt
  • html
  • pdf
  • ...

Development Notes

You can also compile and run the project manually with Gradle:

./gradlew build
./gradlew run -Pargs=FILE1,FILE2
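
Note that the Gradle invocation takes the input files as one comma-separated value, whereas the wrapper script takes them as separate arguments. As a small sketch of building that command from a list of paths, the hypothetical helper `gradle_run_command` below joins the files with commas. Rejecting file names that contain a comma is an assumption made here, on the grounds that the comma is the separator. It is not documented behavior of the project.

```python
from typing import List, Sequence


def gradle_run_command(files: Sequence[str]) -> List[str]:
    """Build the argv for `./gradlew run -Pargs=FILE1,FILE2` (hypothetical helper)."""
    # Assumption: a comma inside a file name would be read as a separator.
    if any("," in f for f in files):
        raise ValueError("file names containing commas cannot be passed via -Pargs")
    return ["./gradlew", "run", "-Pargs=" + ",".join(files)]
```

The resulting list can be passed to `subprocess.run` from the repository root.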

To build a JAR that bundles all dependencies, run:

./gradlew shadowJar
