
#Text Processing API

####Leila Hofer Mark 43 Application

##About

Send text to the API with curl requests and get the processed results back.

###Endpoints

/words/avg_len : average word length in text
/words/most_com : most common word in text
/words/median : words used the median number of times
/sentences/avg_len : average sentence length

###Example Calls

average word length: curl http://localhost:5000/words/avg_len -d '{"text":"something new"}' -X POST -H "Content-type: application/json"
most common word: curl http://localhost:5000/words/most_com -d '{"text":"something new. and now a new sentence. That is pretty and awesome!"}' -X POST -H "Content-type: application/json"
median words: curl http://localhost:5000/words/median -d '{"text":"something new. and now a new sentence. That is pretty and awesome!"}' -X POST -H "Content-type: application/json"
sentence length: curl http://localhost:5000/sentences/avg_len -d '{"text":"something new. and now a new sentence. That is pretty and awesome!"}' -X POST -H "Content-type: application/json"
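
The same calls can also be made from Python. The following is a minimal sketch, assuming Python 3 and a server running on localhost:5000 as in the curl examples. The helper names `build_request` and `call_api` are hypothetical and not part of this repo, and the response is returned as a raw string because the response format is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # same host and port as the curl examples


def build_request(endpoint, text):
    """Build a POST request carrying {"text": ...} as a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )


def call_api(endpoint, text):
    """Send the request and return the raw response body as a string."""
    with urllib.request.urlopen(build_request(endpoint, text)) as resp:
        return resp.read().decode("utf-8")
```

With the server started, `call_api("/words/avg_len", "something new")` would mirror the first curl example.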

#####NOTE: the terminal I test on requires special characters to be written as Unicode escapes (e.g. an apostrophe should be written as \u0027 instead of ')
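
One way to produce such a payload is to build the JSON in Python and swap each apostrophe for its Unicode escape. This is only a sketch, assuming Python 3; `shell_safe_payload` is a hypothetical helper, not something api.py provides.

```python
import json


def shell_safe_payload(text):
    # json.dumps leaves apostrophes as-is, so replace each one with its
    # six-character JSON Unicode escape. The result contains no literal
    # apostrophe and can sit inside the single-quoted -d argument of curl.
    return json.dumps({"text": text}).replace("'", "\\u0027")


payload = shell_safe_payload("That's pretty and awesome!")
print(payload)
```

Any JSON parser decodes the escape back to an apostrophe, so the server should see the original text.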

###Usage

Start Server in api_mark43 directory

$ . venv/bin/activate
$ python api.py

in another terminal window

$ curl request

###Setup

####IMPORTANT: If this setup doesn't work, please let me know and I can host it on a server elsewhere

Clone Repo

$ git clone https://github.com/lhofer/api_mark43.git

Install virtual environment

(mac) $ sudo easy_install virtualenv
(linux) $ sudo apt-get install python-virtualenv

Enter directory and set up virtual environment

$ cd api_mark43
$ virtualenv venv

Enter virtual environment

(mac) $ . venv/bin/activate
(windows) $ venv\scripts\activate

install dependencies

$ pip install Flask
$ pip install flask-restful
$ pip install nltk
$ pip install lazysorted
$ pip install Counter

###Design

#####Performance

Given the initial overhead of parsing, tokenizing, and filtering the text, I tried where possible to optimize my code for larger values of n, reasoning that the time saved on large lists would far outweigh the time lost on small ones. Additionally, I implemented a hash table with the words as keys in the frequency functions to allow O(1) access to any word-frequency pair.
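
As an illustration of the hash-table idea, here is a minimal sketch using a plain Python dict. The function name `word_frequencies` is hypothetical, and the actual code in api.py may differ (for example in how it normalizes case).

```python
def word_frequencies(words):
    """Count occurrences of each word in a single O(n) pass."""
    freq = {}
    for word in words:
        # dict lookups and updates are O(1) on average
        freq[word] = freq.get(word, 0) + 1
    return freq


freq = word_frequencies(["new", "and", "now", "a", "new"])
print(freq["new"])
```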

Most Common Word

An alternative I considered was to sort the hash table by frequency and then iterate through only the most frequent words. However, this would take O(n log n) to sort, plus the time to iterate through the most frequent words, on top of the O(n) time to hash. The current implementation runs in O(n) time overall (O(n) to hash plus O(n) to iterate through each value to check for the max, so about 2n steps), whereas the sorting alternative would take at least n + n log n steps, which becomes slower than the current implementation as n increases.
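
The linear scan for the maximum might look like the sketch below. It takes an already-built word-to-count dict; the name `most_common` and the choice to return every tied word are assumptions for illustration, not confirmed behavior of the API.

```python
def most_common(freq):
    """Return (words, count) for the highest count in one O(n) pass."""
    best_count = 0
    best_words = []
    for word, count in freq.items():
        if count > best_count:
            # New maximum: restart the list of winners
            best_count, best_words = count, [word]
        elif count == best_count:
            # Tie with the current maximum
            best_words.append(word)
    return best_words, best_count


example = {"something": 1, "new": 2, "and": 2, "now": 1}
print(most_common(example))
```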

Median Frequency Words

Again, an alternative would be to sort the hash table (O(n log n)) and then iterate through it to find the median values, stopping once you pass the last median value in the sorted list. This would take at least O(n log n) plus the O(n) time to hash, which would again be slower than the current implementation as n increases. The current implementation takes O(n) time to generate the hash table, uses a linear-time median function (O(n) again), and then iterates through every value in the list (O(n)). 3n is still less than n log n + n for larger lists.
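
To make the three steps concrete, here is a sketch that swaps in a simple quickselect for the linear median function (expected O(n), not necessarily what api.py uses). It assumes the lower median is wanted when the number of distinct words is even. Both function names are hypothetical.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest item (0-based) in expected O(n) time."""
    values = list(values)
    while True:
        pivot = random.choice(values)
        lower = [v for v in values if v < pivot]
        equal = [v for v in values if v == pivot]
        if k < len(lower):
            values = lower
        elif k < len(lower) + len(equal):
            return pivot
        else:
            # Skip everything at or below the pivot
            k -= len(lower) + len(equal)
            values = [v for v in values if v > pivot]


def median_frequency_words(freq):
    """Return the words whose count equals the (lower) median count."""
    if not freq:
        return []
    counts = list(freq.values())
    median = quickselect(counts, (len(counts) - 1) // 2)
    # Final O(n) pass over every word-count pair
    return [word for word, count in freq.items() if count == median]


example = {"a": 1, "b": 2, "c": 2, "d": 5, "e": 3}
print(median_frequency_words(example))
```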

#####Definitions

As this program is a text processor, not a spell checker, and I wanted to make sure someone could process Dr. Seuss texts, I decided not to cross-check all words against an English-language dictionary. However, I did decide to count only words composed entirely of alphabetic characters, hyphens, and apostrophes (so no numbers or other punctuation can be counted as a word). Beyond this, I mostly left the definitions of words and sentences up to the NLTK library.
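
The word filter described above could be sketched with a regular expression. This assumes ASCII letters only and takes an already-tokenized list as input (the tokenizing itself is left to NLTK). `keep_words` is a hypothetical name, and the real filter in api.py may be written differently.

```python
import re

# One or more letters, apostrophes, or hyphens, and nothing else
WORD_RE = re.compile(r"[A-Za-z'-]+")


def keep_words(tokens):
    """Keep only tokens made up entirely of letters, hyphens, and apostrophes."""
    return [t for t in tokens if WORD_RE.fullmatch(t)]


tokens = ["Sneetches", "star-bellied", "don't", "42", "!", "e-mail2"]
print(keep_words(tokens))
```

Made-up words pass through untouched, since nothing is checked against a dictionary.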
