ArXiv analysis

Run online variational LDA on all the abstracts from the arXiv. The implementation is based on Matt Hoffman's GPL-licensed code.

Usage

You'll need a mongod instance listening on the port given by the MONGO_PORT environment variable and a redis-server instance listening on the port given by the REDIS_PORT environment variable.
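As a rough sketch, the two variables could be read like this before connecting. The helper name `service_ports` and the fallback ports 27017 and 6379 (the stock mongod and redis-server defaults) are assumptions for illustration and are not taken from analysis.py.

```python
import os


def service_ports(env=None):
    """Return (mongo_port, redis_port) read from the environment.

    Hypothetical helper: the fallback values are the stock defaults of
    mongod and redis-server, not something analysis.py is known to use.
    """
    env = os.environ if env is None else env
    mongo_port = int(env.get("MONGO_PORT", 27017))
    redis_port = int(env.get("REDIS_PORT", 6379))
    return mongo_port, redis_port


if __name__ == "__main__":
    mongo_port, redis_port = service_ports()
    print("mongod port:", mongo_port)
    print("redis-server port:", redis_port)
```

The resulting ports could then be handed to `pymongo.MongoClient("localhost", mongo_port)` and `redis.Redis(port=redis_port)`.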

The code depends on the Python packages: numpy, scipy, requests, pymongo and redis.

mkdir abstracts
./analysis.py scrape abstracts: scrapes all the metadata from the arXiv OAI interface and saves the raw XML responses as abstracts/raw-*.xml. This takes a long time because of the arXiv's flow control policies. It took me approximately 6 hours.
./analysis.py parse abstracts/raw-*.xml: parses the raw responses and saves the abstracts to a MongoDB database called arxiv, in the collection called abstracts.
./analysis.py build-vocab: counts all the words in the corpus, dropping anything with fewer than 3 characters and any stop words.
./analysis.py get-vocab 100 5000 > vocab.txt: lists the vocabulary, skipping the 100 most popular words and keeping 5000 words in total.
./analysis.py run vocab.txt: runs online variational LDA on articles selected at random from the database. The topic distributions are stored in the lambda-*.txt files. This will run forever, so just kill it whenever you feel like it.
./analysis.py vocab.txt lambda-100.txt: lists the topics and their most common words at step 100.
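The build-vocab and get-vocab steps above can be sketched in a few lines. This is only an illustration of the filtering described, not the code in analysis.py: the function names `count_words` and `select_vocab`, the tokenizing regex, and the tiny stop-word list are all assumptions, and the sketch counts words in memory rather than in the database.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the list used by
# analysis.py is not reproduced here).
STOP_WORDS = {"the", "and", "for", "with", "that", "this", "are", "from"}


def count_words(abstracts, min_length=3, stop_words=STOP_WORDS):
    """Count words, dropping short words and stop words."""
    counts = Counter()
    for text in abstracts:
        # Lowercase and split on anything that is not a letter.
        for word in re.findall(r"[a-z]+", text.lower()):
            if len(word) >= min_length and word not in stop_words:
                counts[word] += 1
    return counts


def select_vocab(counts, skip, keep):
    """Rank words by frequency, skip the top `skip`, keep the next `keep`."""
    ranked = [word for word, _ in counts.most_common()]
    return ranked[skip:skip + keep]


if __name__ == "__main__":
    counts = count_words(["the galaxy of galaxy spin", "galaxy spin in a halo"])
    for word in select_vocab(counts, 1, 2):
        print(word)
```

Under these assumptions, `select_vocab(counts, 100, 5000)` would correspond to `get-vocab 100 5000`.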
