|
520 | 520 | "* K-Nearest Neighbors (KNN)\n", |
521 | 521 | "* Approximate Nearest Neighbors (ANN)\n", |
522 | 522 | "\n", |
523 | | - "When your dataset is small, the K-Nearest Neighbors (KNN) algorithm works well, but with large datasets, you shall need to use Approximate Nearest Neighbors (ANN) because the latency and cost of a KNN search increases.\nHowever, we shall exhibit how to use both!" |
 |     523 | +    "When your dataset is small, the K-Nearest Neighbors (KNN) algorithm works well, but with large datasets you will need to use Approximate Nearest Neighbors (ANN), because the latency and cost of a KNN search increase with data size.\n",
 |     524 | +    "However, we will demonstrate how to use both!"
524 | 525 | ] |
525 | 526 | }, |
526 | 527 | { |
|
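For intuition, the exact KNN search described above can be sketched in plain Python (a toy illustration with made-up vectors, not how Spanner implements it):

```python
import math

def knn(query, corpus, k):
    """Exact K-Nearest Neighbors: measure the distance to every row
    and keep the k closest. The full scan is why KNN latency and cost
    grow with table size -- the motivation for ANN on large datasets."""
    order = sorted(range(len(corpus)), key=lambda i: math.dist(query, corpus[i]))
    return order[:k]

corpus = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
print(knn([0.9, 0.1], corpus, k=2))  # -> [1, 0]: indices of the two nearest rows
```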
570 | 571 | "id": "KG6rwEuJLNIo" |
571 | 572 | }, |
572 | 573 | "source": [ |
573 | | - "### Try inserting the documents into the vector table\n", |
| 574 | + "#### Try inserting the documents into the vector table\n", |
574 | 575 | "\n", |
575 | 576 | "Now we will create a vector_store object backed by our vector table in the Spanner database. Let's load the data from the documents to the vector table. Note that for each row, the embedding service will be called to compute the embeddings to store in the vector table. Pricing details can be found [here](https://cloud.google.com/vertex-ai/pricing)." |
576 | 577 | ] |
|
643 | 644 | "id": "29iztdvfL2BN" |
644 | 645 | }, |
645 | 646 | "source": [ |
646 | | - "### Import the rest of your data into your vector table\n", |
| 647 | + "#### Import the rest of your data into your vector table\n", |
647 | 648 | "\n", |
648 | 649 | "You don't have to call the embedding service 8,800 times to load all the documents for the demo. Instead, we have prepared data with the all 8,800+ rows with pre-computed embeddings in a `.csv` file. Let's import data from csv directly." |
649 | 650 | ] |
|
698 | 699 | { |
699 | 700 | "cell_type": "markdown", |
700 | 701 | "metadata": { |
701 | | - "id": "jfH8oQJ945Ko" |
| 702 | + "id": "jfH8oQJ945Ko", |
| 703 | + "jp-MarkdownHeadingCollapsed": true |
702 | 704 | }, |
703 | 705 | "source": [ |
704 | 706 | "### Approximate Nearest Neighbors (ANN) based vector store\n", |
 |     707 | +    "For this task, we will pull in documents from a popular HackerNews post, insert them into our ANN-based vector store, and then use ANN to find the most relevant comments.\n",
 |     708 | +    "\n",
 |     709 | +    "To create the vector embeddings we will use Google's Vertex AI `textembedding-gecko@003` model, and for each query we will vectorize the query text with the same embedding service before performing the search.\n",
705 | 710 | "\n", |
706 | | - "For this task, we shall pull in documents from a popular HackerNews post, insert them into our ANN based vector store and then use ANN to find the most relevent comments/content\n\nTo create vector embeddings, we shall be using Google's Vertex AI gecko-003 model and then for all related queries, vectorize the query using our embedding service to then perform the search.\n\n", |
707 | 711 | "Cloud Spanner allows for 3 different algorithms to be created with the vector search index and correspondingly used for the search:\n", |
708 | 712 | "* APPROX_COSINE\n", |
709 | 713 | "* APPROX_DOT_PRODUCT\n", |
710 | 714 | "* APPROX_EUCLIDEAN_DISTANCE\n", |
711 | | - "\n\nIn this exhibit, we shall be using using `APPROX_COSINE`\n", |
712 | | - "Our steps shall comprise:\n* Creating the text embedding service\n* Initializing the ANN vector store\n* Loading data from a popular HackerNews post\n* Adding the documents to the vector store\n* Searching by similarity_search, similarity_search_by_vector, max_marginal_relevance_search_with_score_by_vector\n* Deleting the inserted documents\nAll the above using the langchain.VectorStore interfface.\n\n" |
| 715 | + "\n", |
| 716 | + "\n", |
 |     717 | +    "In this example, we will use `APPROX_COSINE`.\n",
 |     718 | +    "Our steps comprise:\n",
| 719 | + "* Creating the text embedding service\n", |
| 720 | + "* Initializing the ANN vector store\n", |
| 721 | + "* Loading data from a popular HackerNews post\n", |
| 722 | + "* Adding the documents to the vector store\n", |
 |     723 | +    "* Searching via `similarity_search`, `similarity_search_by_vector`, and `max_marginal_relevance_search_with_score_by_vector`\n",
| 724 | + "* Deleting the inserted documents\n", |
 |     725 | +    "\n",
 |     726 | +    "All of the above use the LangChain `VectorStore` interface.\n",
| 726 | + "\n" |
| 727 | + ] |
| 728 | + }, |
| 729 | + { |
| 730 | + "cell_type": "markdown", |
| 731 | + "metadata": { |
| 732 | + "jp-MarkdownHeadingCollapsed": true |
| 733 | + }, |
| 734 | + "source": [ |
| 735 | + "#### Create the embeddings\n", |
 |     736 | +    "We will pull in an article from HackerNews and create vector embeddings using Google Vertex AI's `textembedding-gecko@003` model.\n",
| 737 | + "\n" |
713 | 738 | ] |
714 | 739 | }, |
715 | 740 | { |
|
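As a reference point, the exact cosine distance that `APPROX_COSINE` approximates can be computed directly (a minimal sketch; the vector index estimates this value without scanning every row):

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity. APPROX_COSINE estimates
    this exact quantity via the vector index instead of a full scan."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # same direction -> 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 1.0
```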
718 | 743 | "metadata": {}, |
719 | 744 | "outputs": [], |
720 | 745 | "source": [ |
721 | | - "import os\n", |
722 | | - "import uuid\n", |
723 | | - "\n", |
724 | | - "from langchain_community.document_loaders import HNLoader\n", |
725 | 746 | "from langchain_google_vertexai.embeddings import VertexAIEmbeddings\n", |
726 | | - "from langchain_google_spanner.vector_store import (\n", |
727 | | - " DistanceStrategy,\n", |
728 | | - " QueryParameters,\n", |
729 | | - " SpannerVectorStore,\n", |
730 | | - " TableColumn,\n", |
731 | | - " VectorSearchIndex,\n", |
732 | | - ")\n", |
733 | 747 | "\n", |
734 | 748 | "embeddings_service = VertexAIEmbeddings(\n", |
735 | 749 | " model_name=\"textembedding-gecko@003\", project=project_id\n", |
736 | 750 | ")\n", |
 737 |     751 |    "table_name_ANN = \"hnn_articles\"\n",
738 | 751 | "embedding_vector_size = 768\n", |
739 | 752 | "vector_index_name = \"titles_index\"\n", |
740 | 753 | "title_embedding_column = TableColumn(\n", |
741 | 754 | " name=\"title_embedding\", type=\"ARRAY<FLOAT64>\", is_null=True\n", |
| 755 | + ")" |
| 756 | + ] |
| 757 | + }, |
| 758 | + { |
| 759 | + "cell_type": "markdown", |
| 760 | + "metadata": { |
| 761 | + "jp-MarkdownHeadingCollapsed": true |
| 762 | + }, |
| 763 | + "source": [ |
| 764 | + "#### Initialize the vector store\n", |
 |     765 | +    "These steps let you define the structure of your vector store table and then issue the creation DDL. In our case, we specify that the vector search index will use the `COSINE` distance strategy.\n"
| 766 | + ] |
| 767 | + }, |
| 768 | + { |
| 769 | + "cell_type": "code", |
| 770 | + "execution_count": null, |
| 771 | + "metadata": {}, |
| 772 | + "outputs": [], |
| 773 | + "source": [ |
| 774 | + "from langchain_google_spanner.vector_store import (\n", |
| 775 | + " DistanceStrategy,\n", |
| 776 | + " QueryParameters,\n", |
| 777 | + " SpannerVectorStore,\n", |
| 778 | + " TableColumn,\n", |
| 779 | + " VectorSearchIndex,\n", |
742 | 780 | ")\n", |
743 | 781 | "\n", |
744 | | - "\n", |
745 | | - "def main():\n", |
746 | | - " SpannerVectorStore.init_vector_store_table(\n", |
747 | | - " instance_id=instance_id,\n", |
748 | | - " database_id=database_id,\n", |
749 | | - " table_name=table_name_ANN,\n", |
750 | | - " vector_size=embedding_vector_size,\n", |
751 | | - " id_column=\"row_id\",\n", |
752 | | - " metadata_columns=[\n", |
753 | | - " TableColumn(name=\"metadata\", type=\"JSON\", is_null=True),\n", |
754 | | - " TableColumn(name=\"title\", type=\"STRING(MAX)\", is_null=False),\n", |
755 | | - " ],\n", |
756 | | - " embedding_column=title_embedding_column,\n", |
757 | | - " secondary_indexes=[\n", |
758 | | - " VectorSearchIndex(\n", |
759 | | - " index_name=vector_index_name,\n", |
760 | | - " columns=[title_embedding_column.name],\n", |
761 | | - " nullable_column=True,\n", |
762 | | - " num_branches=1000,\n", |
763 | | - " tree_depth=3,\n", |
764 | | - " distance_type=DistanceStrategy.COSINE,\n", |
765 | | - " num_leaves=100000,\n", |
766 | | - " ),\n", |
767 | | - " ],\n", |
768 | | - " )\n", |
769 | | - "\n", |
770 | | - " # 0. Create the handle to the vector store.\n", |
771 | | - " db = SpannerVectorStore(\n", |
772 | | - " instance_id=instance_id,\n", |
773 | | - " database_id=google_database,\n", |
774 | | - " table_name=table_name_ANN,\n", |
775 | | - " id_column=\"row_id\",\n", |
776 | | - " ignore_metadata_columns=[],\n", |
777 | | - " embedding_service=embeddings_service,\n", |
778 | | - " embedding_column=title_embedding_column,\n", |
779 | | - " metadata_json_column=\"metadata\",\n", |
780 | | - " vector_index_name=vector_index_name,\n", |
781 | | - " query_parameters=QueryParameters(\n", |
782 | | - " algorithm=QueryParameters.NearestNeighborsAlgorithm.APPROXIMATE_NEAREST_NEIGHBOR,\n", |
783 | | - " distance_strategy=DistanceStrategy.COSINE,\n", |
| 782 | + "SpannerVectorStore.init_vector_store_table(\n", |
| 783 | + " instance_id=instance_id,\n", |
| 784 | + " database_id=database_id,\n", |
| 785 | + " table_name=table_name_ANN,\n", |
| 786 | + " vector_size=embedding_vector_size,\n", |
| 787 | + " id_column=\"row_id\",\n", |
| 788 | + " metadata_columns=[\n", |
| 789 | + " TableColumn(name=\"metadata\", type=\"JSON\", is_null=True),\n", |
| 790 | + " TableColumn(name=\"title\", type=\"STRING(MAX)\", is_null=False),\n", |
| 791 | + " ],\n", |
| 792 | + " embedding_column=title_embedding_column,\n", |
| 793 | + " secondary_indexes=[\n", |
| 794 | + " VectorSearchIndex(\n", |
| 795 | + " index_name=vector_index_name,\n", |
| 796 | + " columns=[title_embedding_column.name],\n", |
| 797 | + " nullable_column=True,\n", |
| 798 | + " num_branches=1000,\n", |
| 799 | + " tree_depth=3,\n", |
| 800 | + " distance_type=DistanceStrategy.COSINE,\n", |
| 801 | + " num_leaves=100000,\n", |
784 | 802 | " ),\n", |
785 | | - " )\n", |
786 | | - "\n", |
787 | | - " # 1. Add the documents, loaded in from the HackerNews post.\n", |
788 | | - " loader = HNLoader(\"https://news.ycombinator.com/item?id=42797260\")\n", |
789 | | - " inserted_docs = loader.load()\n", |
790 | | - " docs = inserted_docs.copy()\n", |
791 | | - " ids = [str(uuid.uuid4()) for _ in range(len(docs))]\n", |
792 | | - " db.add_documents(documents=docs, ids=ids)\n", |
793 | | - " print(\"n_docs\", len(docs))\n", |
794 | | - "\n", |
795 | | - " # 2. Use similarity_search.\n", |
796 | | - " docs = db.similarity_search(\n", |
797 | | - " \"Open source software\",\n", |
798 | | - " k=2,\n", |
799 | | - " )\n", |
800 | | - " print(\"by similarity_search\", docs)\n", |
801 | | - "\n", |
802 | | - " # 3. Search by vector similarity.\n", |
803 | | - " embeds = embeddings_service.embed_query(\n", |
804 | | - " \"Open source software\",\n", |
805 | | - " )\n", |
806 | | - " docs = db.similarity_search_by_vector(\n", |
807 | | - " embeds,\n", |
808 | | - " k=3,\n", |
809 | | - " )\n", |
810 | | - " print(\"by direct vector_search\", docs)\n", |
811 | | - "\n", |
812 | | - " # 4. Search by max_marginal_relevance_search_with_score_by_vector.\n", |
813 | | - " docs = db.max_marginal_relevance_search_with_score_by_vector(\n", |
814 | | - " embeds,\n", |
815 | | - " k=3,\n", |
816 | | - " )\n", |
817 | | - " print(\"by max_marginal_relevance_search\", docs)\n", |
818 | | - "\n", |
819 | | - " # 5. Delete the inserted docs.\n", |
820 | | - " deleted = db.delete(documents=inserted_docs)\n", |
821 | | - " print(\"deleted\", deleted)\n", |
822 | | - "\n", |
823 | | - "\n", |
824 | | - "if __name__ == \"__main__\":\n", |
825 | | - " main()" |
| 803 | + " ],\n", |
| 804 | + ")" |
| 805 | + ] |
| 806 | + }, |
| 807 | + { |
| 808 | + "cell_type": "markdown", |
| 809 | + "metadata": { |
| 810 | + "jp-MarkdownHeadingCollapsed": true |
| 811 | + }, |
| 812 | + "source": [ |
| 813 | + "#### Acquire a handle to the vector store\n", |
 |     814 | +    "A handle to the vector store saves you from having to re-specify the distance strategy, vector index name, and embedding column on every call. ***Please note that to use ANN, you must specify the algorithm and distance_strategy in your query_parameters***.\n"
| 815 | + ] |
| 816 | + }, |
| 817 | + { |
| 818 | + "cell_type": "code", |
| 819 | + "execution_count": null, |
| 820 | + "metadata": {}, |
| 821 | + "outputs": [], |
| 822 | + "source": [ |
| 823 | + "db = SpannerVectorStore(\n", |
| 824 | + " instance_id=instance_id,\n", |
 |     825 | +    "    database_id=database_id,\n",
| 826 | + " table_name=table_name_ANN,\n", |
| 827 | + " id_column=\"row_id\",\n", |
| 828 | + " ignore_metadata_columns=[],\n", |
| 829 | + " embedding_service=embeddings_service,\n", |
| 830 | + " embedding_column=title_embedding_column,\n", |
| 831 | + " metadata_json_column=\"metadata\",\n", |
| 832 | + " vector_index_name=vector_index_name,\n", |
| 833 | + " query_parameters=QueryParameters(\n", |
| 834 | + " algorithm=QueryParameters.NearestNeighborsAlgorithm.APPROXIMATE_NEAREST_NEIGHBOR,\n", |
| 835 | + " distance_strategy=DistanceStrategy.COSINE,\n", |
| 836 | + " ),\n", |
| 837 | + ")" |
| 838 | + ] |
| 839 | + }, |
| 840 | + { |
| 841 | + "cell_type": "markdown", |
| 842 | + "metadata": { |
| 843 | + "jp-MarkdownHeadingCollapsed": true |
| 844 | + }, |
| 845 | + "source": [ |
| 846 | + "#### Add documents into the vector store\n", |
| 847 | + "Using LangChain's interface, you can populate the vector store with the posts from HackerNews.\n" |
| 848 | + ] |
| 849 | + }, |
| 850 | + { |
| 851 | + "cell_type": "code", |
| 852 | + "execution_count": null, |
| 853 | + "metadata": {}, |
| 854 | + "outputs": [], |
| 855 | + "source": [ |
 |     856 | +    "import uuid\n",
 |     857 | +    "\n",
 |     858 | +    "from langchain_community.document_loaders import HNLoader\n",
 |     859 | +    "\n",
 |     860 | +    "loader = HNLoader(\"https://news.ycombinator.com/item?id=42797260\")\n",
| 859 | + "inserted_docs = loader.load()\n", |
| 860 | + "docs = inserted_docs.copy()\n", |
| 861 | + "ids = [str(uuid.uuid4()) for _ in range(len(docs))]\n", |
| 862 | + "db.add_documents(documents=docs, ids=ids)\n", |
| 863 | + "print(\"n_docs\", len(docs))" |
| 864 | + ] |
| 865 | + }, |
| 866 | + { |
| 867 | + "cell_type": "markdown", |
| 868 | + "metadata": { |
| 869 | + "jp-MarkdownHeadingCollapsed": true |
| 870 | + }, |
| 871 | + "source": [ |
 |     872 | +    "#### Search for documents by similarity\n",
 |     873 | +    "We will search using three different methods.\n"
| 874 | + ] |
| 875 | + }, |
| 876 | + { |
| 877 | + "cell_type": "markdown", |
| 878 | + "metadata": {}, |
| 879 | + "source": [ |
 |     880 | +    "##### `similarity_search`: the query's embedding is computed internally"
| 881 | + ] |
| 882 | + }, |
| 883 | + { |
| 884 | + "cell_type": "code", |
| 885 | + "execution_count": null, |
| 886 | + "metadata": {}, |
| 887 | + "outputs": [], |
| 888 | + "source": [ |
| 889 | + "docs = db.similarity_search(\n", |
| 890 | + " \"Open source software\",\n", |
| 891 | + " k=2,\n", |
| 892 | + ")\n", |
| 893 | + "print(\"by similarity_search\", docs)" |
| 894 | + ] |
| 895 | + }, |
| 896 | + { |
| 897 | + "cell_type": "markdown", |
| 898 | + "metadata": {}, |
| 899 | + "source": [ |
 |     900 | +    "##### `similarity_search_by_vector`: pass an embedding vector in directly"
| 901 | + ] |
| 902 | + }, |
| 903 | + { |
| 904 | + "cell_type": "code", |
| 905 | + "execution_count": null, |
| 906 | + "metadata": {}, |
| 907 | + "outputs": [], |
| 908 | + "source": [ |
| 909 | + "embeds = embeddings_service.embed_query(\n", |
| 910 | + " \"Open source software\",\n", |
| 911 | + ")\n", |
| 912 | + "docs = db.similarity_search_by_vector(\n", |
| 913 | + " embeds,\n", |
| 914 | + " k=3,\n", |
| 915 | + ")\n", |
| 916 | + "print(\"by direct vector_search\", docs)" |
| 917 | + ] |
| 918 | + }, |
| 919 | + { |
| 920 | + "cell_type": "markdown", |
| 921 | + "metadata": {}, |
| 922 | + "source": [ |
 |     923 | +    "##### `max_marginal_relevance_search_with_score_by_vector`: pass an embedding vector in directly"
| 924 | + ] |
| 925 | + }, |
| 926 | + { |
| 927 | + "cell_type": "code", |
| 928 | + "execution_count": null, |
| 929 | + "metadata": {}, |
| 930 | + "outputs": [], |
| 931 | + "source": [ |
| 932 | + "docs = db.max_marginal_relevance_search_with_score_by_vector(\n", |
| 933 | + " embeds,\n", |
| 934 | + " k=3,\n", |
| 935 | + ")\n", |
| 936 | + "print(\"by max_marginal_relevance_search\", docs)" |
| 937 | + ] |
| 938 | + }, |
| 939 | + { |
| 940 | + "cell_type": "markdown", |
| 941 | + "metadata": { |
| 942 | + "jp-MarkdownHeadingCollapsed": true |
| 943 | + }, |
| 944 | + "source": [ |
| 945 | + "#### Clean up and delete the previously inserted documents" |
| 946 | + ] |
| 947 | + }, |
| 948 | + { |
| 949 | + "cell_type": "code", |
| 950 | + "execution_count": null, |
| 951 | + "metadata": {}, |
| 952 | + "outputs": [], |
| 953 | + "source": [ |
| 954 | + "deleted = db.delete(documents=inserted_docs)\n", |
| 955 | + "print(\"deleted\", deleted)" |
826 | 956 | ] |
827 | 957 | }, |
828 | 958 | { |
|
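For reference, the max-marginal-relevance re-ranking behind `max_marginal_relevance_search_with_score_by_vector` balances relevance to the query against redundancy among already-selected results. A minimal sketch (toy vectors; the `lambda_mult` weight here is chosen to emphasize diversity and is not LangChain's default):

```python
def mmr(query, docs, k=2, lambda_mult=0.3):
    """Greedy MMR: each step picks the candidate maximizing
    lambda * sim(doc, query) - (1 - lambda) * max sim(doc, selected)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    selected, candidates = [], list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((cos(docs[i], docs[j]) for j in selected), default=0.0)
            return lambda_mult * cos(docs[i], query) - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Two duplicate relevant docs plus one diverse doc: MMR keeps one duplicate
# and fills the second slot with the diverse doc instead of the copy.
docs = [[0.96, 0.05], [0.96, 0.05], [0.7, 0.7]]
print(mmr([1.0, 0.0], docs, k=2))  # -> [0, 2]
```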