
Commit 2b9201f

Merge pull request #10 from GoogleCloudPlatform/Youki-patch-1
Youki patch 1
2 parents e7938c0 + 33f0238 commit 2b9201f

File tree (1 file changed: 130 additions, 1 deletion)

  • 00_Miscellaneous/text-similarity-analysis
@@ -1 +1,130 @@
-# text-analysis-pipeline
# Text Semantic Similarity Analysis Pipeline
This is a Dataflow pipeline that reads article documents from Google Cloud Storage, extracts feature embeddings from the documents, and stores those embeddings in BigQuery. After running the pipeline, you can easily search for contextually similar documents based on the cosine distance between their feature embeddings.
In the pipeline, documents are processed to extract each article's title, topics, and content. The processing pipeline uses the "Universal Sentence Encoder" module in tf.hub to extract text embeddings for both the title and the content of each article read from the source documents. The title, topics, and content of each article, along with the extracted embeddings, are stored in BigQuery. Having the articles, along with their embeddings, stored in BigQuery allows us to explore similar articles using the cosine similarity metric between embeddings of titles and/or contents.
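
For example, once the output table is populated, a query along the following lines ranks articles by the cosine similarity of their content embeddings. This is only a sketch: the column names (`title`, `embed_content`), the placeholder table name, and the anchor title are assumptions; adjust them to the schema the pipeline actually writes.

```bash
# A sketch only: assumes the output table has a STRING column "title" and a
# repeated FLOAT64 column "embed_content"; adjust names to the real schema.
bq query --use_legacy_sql=false '
SELECT
  b.title,
  SUM(ea * eb) / (SQRT(SUM(ea * ea)) * SQRT(SUM(eb * eb))) AS cosine_similarity
FROM [your-bigquery-dataset-name].[your-bigquery-table-name] AS a
CROSS JOIN [your-bigquery-dataset-name].[your-bigquery-table-name] AS b
CROSS JOIN UNNEST(a.embed_content) AS ea WITH OFFSET i
CROSS JOIN UNNEST(b.embed_content) AS eb WITH OFFSET j
WHERE i = j
  AND a.title = "SOME ARTICLE TITLE"
GROUP BY b.title
ORDER BY cosine_similarity DESC
LIMIT 10'
```

For a small corpus such as Reuters-21578 this brute-force self-join is fine; for a larger corpus you would precompute vector norms or use an approximate nearest-neighbour index.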
# How to run the pipeline
## Requirements
You need a [GCP project](https://cloud.google.com/resource-manager/docs/creating-managing-projects). You can use [Cloud Shell](https://cloud.google.com/shell/docs/quickstart) or the [gcloud CLI](https://cloud.google.com/sdk/) to run all the commands in this guide.
## Set up a project
Follow the [instructions](https://cloud.google.com/resource-manager/docs/creating-managing-projects) to create a GCP project.
Once the project is created, enable the Dataflow and BigQuery APIs on this [page](https://console.developers.google.com/apis/enabled). You can also find more details about enabling [billing](https://cloud.google.com/billing/docs/how-to/modify-project?#enable-billing).
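
If you prefer the command line, you can enable the same APIs with `gcloud`; the service names below are the standard ones for Dataflow and BigQuery.

```bash
# Enable the Dataflow and BigQuery APIs for the currently selected project.
gcloud services enable dataflow.googleapis.com bigquery.googleapis.com
```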
We recommend using Cloud Shell from the GCP console to run the commands below. Cloud Shell starts with an environment that is already logged in to your account and set to the currently selected project. The following commands are required only in a workstation shell environment; they are not needed in Cloud Shell.

```bash
gcloud auth login
gcloud config set project [your-project-id]
gcloud config set compute/zone us-central1-a
```
## Prepare the input data
You need to download the Reuters-21578 dataset from [here](https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection). After downloading reuters21578.tar.gz from the site, run the following commands to store the Reuters dataset in Google Cloud Storage.

```bash
export BUCKET=gs://[your-bucket-name]

mkdir temp reuters
tar -zxvf reuters21578.tar.gz -C temp/
mv temp/*.sgm reuters/ && rm -rf temp
gsutil -m cp -R reuters $BUCKET
```
## Set up the Python environment and sample code
Run the commands below to install the required Python packages and download the Dataflow pipeline code.

```bash
git clone [this-repo]
cd [this-repo]/00_Miscellaneous/text-similarity-analysis

# Make sure you have a Python 2.7 environment.
pip install -r requirements.txt
```
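
If your default interpreter is Python 3, one way to get a Python 2.7 environment is a virtualenv; this is a sketch that assumes `virtualenv` and a `python2.7` interpreter are already installed on your machine.

```bash
# Create and activate an isolated Python 2.7 environment, then run the
# pip install command above inside it.
virtualenv -p python2.7 venv
source venv/bin/activate
```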
## Run the pipeline
Set the run configurations for your Dataflow job. The job needs GCE instances with high memory because the tf.hub module uses more memory than the default GCE machine type provides.

```bash
# Running configurations for Dataflow
export PROJECT=[your-project-name]
export JOBNAME=[your-dataflow-job-name]
export REGION=[your-preferred-region]
export RUNNER=DataflowRunner
export MACHINE_TYPE=n1-highmem-2
```
If you've followed the instructions in the previous section, you should have the Reuters dataset in GCS. Set the file pattern of the Reuters dataset in the FILE_PATTERN variable.

```bash
# A file pattern of the Reuters dataset located in GCS.
export FILE_PATTERN=$BUCKET/reuters
```
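
You can quickly check that the files are where the pipeline expects them, for example:

```bash
# Lists the .sgm files uploaded in the previous section.
gsutil ls $FILE_PATTERN
```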
Note that you have to create the BigQuery dataset before running the Dataflow job (see the example after the next block). You should also set the names of the BigQuery dataset and table so the Dataflow pipeline can write the feature embeddings to the right place in BigQuery.

```bash
# Information about output table in BigQuery.
export BQ_PROJECT=$PROJECT
export BQ_DATASET=[your-bigquery-dataset-name]
export BQ_TABLE=[your-bigquery-table-name]
```
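
If the dataset does not exist yet, you can create it with the `bq` CLI, for example (add a `--location` flag if you need a specific dataset location):

```bash
# Create the target BigQuery dataset before running the Dataflow job.
bq mk --dataset $BQ_PROJECT:$BQ_DATASET
```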
Next, just run the commands below. The TF_EXPORT directory is where the SavedModel will be written by tf.transform. You can reuse it later to extract feature embeddings from documents.

```bash
# A root directory.
export ROOT="$BUCKET/$JOBNAME"

# Working directories for Dataflow jobs.
export DF_ROOT="$ROOT/dataflow"
export DF_STAGING="$DF_ROOT/staging"
export DF_TEMP="$DF_ROOT/temp"

# Working directories for tf.transform.
export TF_ROOT="$ROOT/transform"
export TF_TEMP="$TF_ROOT/temp"
export TF_EXPORT="$TF_ROOT/export"

# A directory where tfrecords data will be output.
export TFRECORD_OUTPUT_DIR="$ROOT/tfrecords"
export PIPELINE_LOG_PREFIX="$ROOT/log/output"
```
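
After the job has completed, you can check what tf.transform exported under these directories, for example:

```bash
# List the exported SavedModel artifacts (present only after the pipeline has run).
gsutil ls -r $TF_EXPORT
```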

Before running the pipeline, you can remove the previous working directory with the command below if you want.

```bash
gsutil rm -r $ROOT
```
Finally, you can run the pipeline with this command.

```bash
python etl/run_pipeline.py \
  --project=$PROJECT \
  --region=$REGION \
  --setup_file=$(pwd)/etl/setup.py \
  --job_name=$JOBNAME \
  --runner=$RUNNER \
  --worker_machine_type=$MACHINE_TYPE \
  --file_pattern=$FILE_PATTERN \
  --bq_project=$BQ_PROJECT \
  --bq_dataset=$BQ_DATASET \
  --bq_table=$BQ_TABLE \
  --staging_location=$DF_STAGING \
  --temp_location=$DF_TEMP \
  --transform_temp_dir=$TF_TEMP \
  --transform_export_dir=$TF_EXPORT \
  --enable_tfrecord \
  --tfrecord_export_dir $TFRECORD_OUTPUT_DIR \
  --enable_debug \
  --debug_output_prefix=$PIPELINE_LOG_PREFIX
```
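
The job takes a while to finish; you can follow its progress on the Dataflow page of the GCP console, or from the command line, for example:

```bash
# Show Dataflow jobs that are currently running in the selected region.
gcloud dataflow jobs list --region=$REGION --status=active
```

Once the job has finished and the table is populated, you can run similarity queries like the sketch shown in the introduction.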
