# Text Semantic Similarity Analysis Pipeline

This is a Dataflow pipeline that reads article documents from Google Cloud Storage, extracts feature embeddings from the documents, and stores those embeddings in BigQuery. After running the pipeline, you can easily search for contextually similar documents based on the cosine distance between their feature embeddings.

In the pipeline, documents are processed to extract each article’s title, topics, and content. The processing pipeline uses the “Universal Sentence Encoder” module in tf.hub to extract text embeddings for both the title and the content of each article read from the source documents. The title, topics, and content of each article, along with the extracted embeddings, are stored in BigQuery. Having the articles, along with their embeddings, stored in BigQuery allows us to explore similar articles using the cosine similarity metric between embeddings of titles and/or contents.
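
To make this concrete, here is a minimal sketch of the embedding and similarity computation, assuming a TensorFlow 1.x environment with `tensorflow-hub` installed; the tf.hub module URL and the example titles are illustrative and not necessarily what the pipeline itself uses:

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Load the Universal Sentence Encoder module from tf.hub (TF 1.x API).
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")

titles = [
    "OIL PRICES RISE ON SUPPLY CONCERNS",
    "CRUDE FUTURES CLIMB AS STOCKS FALL",
]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(embed(titles))  # shape: (2, 512)

# Cosine similarity between the two title embeddings.
a, b = vectors
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```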

# How to run the pipeline

## Requirements

You need a [GCP project](https://cloud.google.com/resource-manager/docs/creating-managing-projects). You can use [Cloud Shell](https://cloud.google.com/shell/docs/quickstart) or the [gcloud CLI](https://cloud.google.com/sdk/) to run all the commands in this guide.

## Set up a project

Follow the [instructions](https://cloud.google.com/resource-manager/docs/creating-managing-projects) to create a GCP project.
Once it is created, enable the Dataflow and BigQuery APIs on this [page](https://console.developers.google.com/apis/enabled). You can also find more details about enabling [billing](https://cloud.google.com/billing/docs/how-to/modify-project?#enable-billing).

We recommend using Cloud Shell from the GCP console to run the commands below. Cloud Shell starts with an environment that is already logged in to your account and set to the currently selected project. The following commands are required only in a workstation shell environment; they are not needed in Cloud Shell.

```bash
gcloud auth login
gcloud config set project [your-project-id]
gcloud config set compute/zone us-central1-a
```

## Prepare the input data

You need to download the Reuters-21578 dataset from [here](https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection). After downloading reuters21578.tar.gz from the site, run the following commands to store the Reuters dataset in Google Cloud Storage.

```bash
export BUCKET=gs://[your-bucket-name]

mkdir temp reuters
tar -zxvf reuters21578.tar.gz -C temp/
mv temp/*.sgm reuters/ && rm -rf temp
gsutil -m cp -R reuters $BUCKET
```

## Set up the Python environment and sample code

Run the commands below to install the required Python packages and download the Dataflow pipeline code.

```bash
git clone [this-repo]
cd [this-repo]/00_Miscellaneous/text-similarity-analysis

# Make sure you are in a Python 2.7 environment.
pip install -r requirements.txt
```

## Run the pipeline

Set the running configuration for your Dataflow job. You will need high-memory GCE instances for the Dataflow workers because the tf.hub module uses more memory than the default GCE instance type provides.

```bash
# Running configuration for Dataflow.
export PROJECT=[your-project-id]
export JOBNAME=[your-dataflow-job-name]
export REGION=[your-preferred-region]
export RUNNER=DataflowRunner
export MACHINE_TYPE=n1-highmem-2
```

If you've followed the instructions in the previous section, you should have the Reuters dataset in GCS. Set the file pattern of the Reuters dataset in the FILE_PATTERN variable.

```bash
# The file pattern of the Reuters dataset located in GCS.
export FILE_PATTERN=$BUCKET/reuters
```

Note that you have to create the BigQuery dataset before running the Dataflow job (see the sketch after the following block for one way to do this). You should also set the names of the BigQuery dataset and table so that the Dataflow pipeline can write the feature embeddings to the right place in BigQuery.

```bash
# Information about the output table in BigQuery.
export BQ_PROJECT=$PROJECT
export BQ_DATASET=[your-bigquery-dataset-name]
export BQ_TABLE=[your-bigquery-table-name]
```
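
If you haven't created the dataset yet, the following is one possible way to do it, assuming the `google-cloud-bigquery` client library is installed; creating it with the `bq` command-line tool or the BigQuery console works just as well:

```python
from google.cloud import bigquery

# Replace these with the values of $BQ_PROJECT and $BQ_DATASET.
project_id = "your-project-id"
dataset_id = "your_dataset"

client = bigquery.Client(project=project_id)
dataset = bigquery.Dataset("{}.{}".format(project_id, dataset_id))
dataset.location = "US"  # choose a location consistent with your bucket and region

# Creates the dataset if it doesn't already exist.
client.create_dataset(dataset, exists_ok=True)
```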

Next, just run the commands below. The TF_EXPORT directory is where the SavedModel produced by tf.transform will be written. You can reuse it later to extract feature embeddings from documents.

```bash
# A root directory.
export ROOT="$BUCKET/$JOBNAME"

# Working directories for Dataflow jobs.
export DF_ROOT="$ROOT/dataflow"
export DF_STAGING="$DF_ROOT/staging"
export DF_TEMP="$DF_ROOT/temp"

# Working directories for tf.transform.
export TF_ROOT="$ROOT/transform"
export TF_TEMP="$TF_ROOT/temp"
export TF_EXPORT="$TF_ROOT/export"

# A directory where tfrecords data will be output.
export TFRECORD_OUTPUT_DIR="$ROOT/tfrecords"
export PIPELINE_LOG_PREFIX="$ROOT/log/output"
```

Before running the pipeline, you can optionally remove the previous working directory with the command below.

```bash
gsutil rm -r $ROOT
```

Finally, you can run the pipeline with this command.

```bash
python etl/run_pipeline.py \
  --project=$PROJECT \
  --region=$REGION \
  --setup_file=$(pwd)/etl/setup.py \
  --job_name=$JOBNAME \
  --runner=$RUNNER \
  --worker_machine_type=$MACHINE_TYPE \
  --file_pattern=$FILE_PATTERN \
  --bq_project=$BQ_PROJECT \
  --bq_dataset=$BQ_DATASET \
  --bq_table=$BQ_TABLE \
  --staging_location=$DF_STAGING \
  --temp_location=$DF_TEMP \
  --transform_temp_dir=$TF_TEMP \
  --transform_export_dir=$TF_EXPORT \
  --enable_tfrecord \
  --tfrecord_export_dir=$TFRECORD_OUTPUT_DIR \
  --enable_debug \
  --debug_output_prefix=$PIPELINE_LOG_PREFIX
```
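
Once the job finishes and the embeddings are in BigQuery, you can rank articles by similarity. The snippet below is a minimal sketch of that query, assuming the `google-cloud-bigquery` and `numpy` packages are installed; the column names (`title`, `content_embed`) are assumptions about the output schema, so adjust them to match the table the pipeline actually writes:

```python
import numpy as np
from google.cloud import bigquery

# Replace this with the values of $BQ_PROJECT, $BQ_DATASET and $BQ_TABLE.
table = "your-project-id.your_dataset.your_table"

client = bigquery.Client()
rows = list(client.query(
    "SELECT title, content_embed FROM `{}`".format(table)).result())

titles = [row["title"] for row in rows]
embeddings = np.array([row["content_embed"] for row in rows], dtype=np.float64)

# Normalize the embeddings so a dot product gives cosine similarity.
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Rank every article against the first one and print the top 5 matches.
scores = normalized.dot(normalized[0])
for idx in np.argsort(-scores)[:5]:
    print("{:.3f}  {}".format(scores[idx], titles[idx]))
```

For a corpus the size of Reuters-21578 this fits comfortably in memory; for a much larger corpus you would likely push the similarity computation into BigQuery itself rather than pull all embeddings to the client.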