CCMTM: Scripts to extract useful data from the Common Crawl data set. Created at the Machine Translation Marathon 2013.