Weaviate

TODO: Use this data set to try Materials and their Mechanical Properties.

What is Weaviate?

Weaviate is an open-source Vector Database. It enables you to store data objects and vector embeddings and query them based on similarity measures.

Probably the most popular use case of vector databases in the context of LLMs is to “provide LLMs with long-term memory”.

Implementation Steps

There are three steps to using Weaviate: setting up the prerequisites, creating and populating the vector database, and querying it.

Prerequisites

In this example we will run Weaviate in their Cloud Services (WCS). To be able to use the service, you first need to register with WCS. Once you are registered, you can create a new Weaviate Cluster by clicking the “Create cluster” button.

Next, install the Python package:

$ pip install weaviate-client

and import the library:

import weaviate

Here is how to instantiate the client:

auth_config = weaviate.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY")  # Replace with your Weaviate instance API key

# Instantiate the client
client = weaviate.Client(
    url="https://<your-sandbox-name>.weaviate.network",  # Replace with your Weaviate cluster URL
    auth_client_secret=auth_config,
    additional_headers={
        "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY",  # Replace with your OpenAI key
    },
)

To check that everything is set up correctly, run:

client.is_ready()

If it returns True, you’re all set for the next steps.

Create and Populate the Vector Database

First, load your data. Here is an example:

import pandas as pd  
  
df = pd.read_csv("your_file_path.csv", nrows = 100)

Create a Schema

In Weaviate you create schemas to capture each of the entities you will be searching.

A schema is how you tell Weaviate what classes to store, which properties each class has, and how the data should be vectorized.

Here is a basic configuration:

class_obj = {  
	# Class definition  
	"class": "JeopardyQuestion",  
  
	# Property definitions  
	"properties": [  
		{   # These names are the column names in the dataset
			"name": "category",  
			"dataType": ["text"],  
		},  
		{  
			"name": "question",  
			"dataType": ["text"],  
		},  
		{  
			"name": "answer",  
			"dataType": ["text"],  
		},  
	],  
  
	# Specify a vectorizer  
	"vectorizer": "text2vec-openai",  
  
	# Module settings  
	"moduleConfig": {  
		"text2vec-openai": {  
			"vectorizeClassName": False,  
			"model": "ada",  
			"modelVersion": "002",  
			"type": "text"  
		},  
	},  
}

In the above schema, you can see that we will create a class called "JeopardyQuestion", with the three text properties "category", "question", and "answer". The vectorizer we are using is OpenAI’s Ada model (version 2). All properties will be vectorized, but not the class name ("vectorizeClassName": False).

Once you have defined the schema, you can create the class with the create_class() method.

client.schema.create_class(class_obj)

To check if the class has been created successfully, you can review its schema as follows:

client.schema.get("JeopardyQuestion")

Import Data into Weaviate

Now that the class is created, let’s populate it with our dataset. This process is also called “upserting”.

We will upsert the data in batches of 200.

from weaviate.util import generate_uuid5  
  
with client.batch(  
	batch_size=200, # Specify batch size  
	num_workers=2, # Parallelize the process  
) as batch:  
	for _, row in df.iterrows():  
		question_object = {  
			"category": row.category,  
			"question": row.question,  
			"answer": row.answer,  
		}  
		batch.add_data_object(  
			question_object,  
			class_name="JeopardyQuestion",  
			uuid=generate_uuid5(question_object)  
		)

This splits the dataset into batches of 200 objects.
Note: Weaviate can generate a universally unique identifier (UUID) automatically, but here we generate it manually with the generate_uuid5() function from the question_object to avoid importing duplicate items.
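To see why content-derived UUIDs prevent duplicates, here is a stdlib sketch of the idea (not Weaviate's exact implementation — generate_uuid5() similarly derives a UUIDv5 from the object's content):

```python
import json
import uuid

def deterministic_uuid(obj: dict) -> str:
    """Derive a UUIDv5 from the object's content: identical objects always map to the same id."""
    payload = json.dumps(obj, sort_keys=True)  # canonical serialization, so key order doesn't matter
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, payload))

a = deterministic_uuid({"category": "3-LETTER WORDS", "answer": "the ant"})
b = deterministic_uuid({"answer": "the ant", "category": "3-LETTER WORDS"})
assert a == b  # re-importing the same object yields the same UUID
```

Because the id is derived from the content, upserting the same row twice overwrites the existing object instead of adding a copy.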

For a sanity check, you can review the number of imported objects with the following code snippet:

client.query.aggregate("JeopardyQuestion").with_meta_count().do()
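The aggregate call returns a nested dict. Assuming the standard response shape (mocked below rather than fetched from a live cluster), the count can be extracted like this:

```python
# Mocked response in the shape the aggregate query returns
res = {
    "data": {
        "Aggregate": {
            "JeopardyQuestion": [{"meta": {"count": 100}}]
        }
    }
}

# Drill down to the object count
count = res["data"]["Aggregate"]["JeopardyQuestion"][0]["meta"]["count"]
print(count)  # 100
```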

Query the Vector Database

The most common operation you will do with a vector database is to retrieve objects. To retrieve objects, you query the Weaviate vector database with the get() function:

client.query.get(  
	<Class>,  
	[<properties>]  
).<arguments>.do()

Example

Let's retrieve some entries from the JeopardyQuestion class with the get() function to see what they look like. This is very similar to df.head() in pandas, but get() responds in JSON format.

import json

res = (
    client.query.get("JeopardyQuestion",
                     ["question", "answer", "category"])
    .with_additional(["id", "vector"])
    .with_limit(2)
    .do()
)

print(json.dumps(res, indent=4))

In the above code snippet, you can see that we are retrieving objects from the "JeopardyQuestion" class. We specified to retrieve the properties "category", "question", and "answer".

We specified two additional arguments: First, with the .with_additional() argument we retrieve additional information about the object's id and its vector embedding. Second, with the .with_limit(2) argument, we specified only to retrieve two objects. This limit is important, and you will see it again in the later examples: retrieving objects from a vector database does not return exact matches but the most similar objects, so the number of results has to be limited.

Now, we’re ready to do some vector search! What’s cool about retrieving information from a vector database is that you can, for example, tell it to retrieve Jeopardy questions related to the "concepts" around animals.

For this, we can use the .with_near_text() argument and pass it the "concepts" we are interested in as shown below:

res = client.query.get(  
	"JeopardyQuestion",  
	["question", "answer", "category"])\
	.with_near_text({"concepts": ["animals"]})\
	.with_limit(2)\
	.do()

The specified vectorizer then converts the input text ("animals") to a vector embedding and retrieves the two closest results:

{  
	"data": {  
		"Get": {  
			"JeopardyQuestion": [  
				{  
					"answer": "an octopus",  
					"category": "SEE & SAY",  
					"question": "Say the name of <a href=\"http://www.j-archive.com/media/2010-07-06_DJ_26.jpg\" target=\"_blank\">this</a> type of mollusk you see"  
				},  
				{  
					"answer": "the ant",  
					"category": "3-LETTER WORDS",  
					"question": "In the title of an Aesop fable, this insect shared billing with a grasshopper"  
				}  
			]  
		}  
	}  
}
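Because the response is a plain nested dict, individual fields are easy to pull out. Using the response above (abbreviated and reproduced as a literal so the snippet is self-contained):

```python
# The nearText response from above, abbreviated to the fields we need
res = {
    "data": {
        "Get": {
            "JeopardyQuestion": [
                {"answer": "an octopus", "category": "SEE & SAY"},
                {"answer": "the ant", "category": "3-LETTER WORDS"},
            ]
        }
    }
}

# Collect just the answers from the retrieved objects
answers = [obj["answer"] for obj in res["data"]["Get"]["JeopardyQuestion"]]
print(answers)  # ['an octopus', 'the ant']
```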

Advanced Techniques

.with_where() can be used to narrow the search with filters, e.g. a numeric range. Note that calling .with_where() twice overwrites the first filter, so multiple conditions are combined with the And operator:

            .with_where({
                "operator": "And",
                "operands": [
                    {
                        "path": ["su"],
                        "operator": "GreaterThan",
                        "valueInt": 680,
                    },
                    {
                        "path": ["su"],
                        "operator": "LessThan",
                        "valueInt": 900,
                    },
                ],
            })\

Keyword Searching

Keyword search works much like a classic search engine query: the keyword is matched directly using the BM25 algorithm, e.g.

    response = (client.query.get("Material", properties)
            .with_additional("id")
            .with_bm25(query=keyword)
            .do())

Hybrid Searching

You can also use hybrid search to mix the two methods, controlling their relative weight with the alpha value:

    response = (client.query.get("Material", properties)
            .with_additional("id")
            .with_hybrid(query=keyword, alpha=0.8)
            .do())

An alpha of 1 is pure vector search; an alpha of 0 is pure keyword search.
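Conceptually, alpha interpolates between the two rankings. Weaviate's actual fusion algorithm is more involved, so treat this weighted sum as an illustrative sketch only:

```python
def blend(vector_score: float, keyword_score: float, alpha: float) -> float:
    """Illustrative only: alpha weights the vector score against the keyword (BM25) score."""
    return alpha * vector_score + (1 - alpha) * keyword_score

print(blend(0.9, 0.2, alpha=1.0))  # 0.9 -> pure vector search
print(blend(0.9, 0.2, alpha=0.0))  # 0.2 -> pure keyword search
print(blend(0.9, 0.2, alpha=0.8))  # mostly vector, some keyword influence
```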

Question Answering

Question answering is one of the most popular examples when it comes to combining LLMs with vector databases.

To enable question answering, you need to specify a vectorizer (which you should already have) and a question-answering module under the module configuration, as shown in the following example:

# Module settings  
	"moduleConfig": {  
		"text2vec-openai": {  
		...  
		},  
		"qna-openai": {  
			"model": "text-davinci-002"  
		}  
},

For question answering, you need to add the with_ask() argument and also retrieve the _additional properties.

ask = {  
	"question": "Which animal was mentioned in the title of the Aesop fable?",  
	"properties": ["answer"]  
}  
  
res = (  
	client.query  
	.get("JeopardyQuestion", [  
		"question",  
		"_additional {answer {hasAnswer property result} }"  
	])  
	.with_ask(ask)  
	.with_limit(1)  
	.do()  
)

The above piece of code looks through all questions that may contain the answer to the question "Which animal was mentioned in the title of the Aesop fable?" and returns the answer "The ant".

{  
	"JeopardyQuestion": [  
		{  
			"_additional": {  
				"answer": {  
					"hasAnswer": true,  
					"property": "",  
					"result": " The ant"  
				}  
			},  
			"question": "In the title of an Aesop fable, this insect shared billing with a grasshopper"  
		}  
	]  
}
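The generated answer lives under each object's _additional key. With the response above (mocked here as a literal), it can be read out like this:

```python
# Mock of the question-answering response shown above
res = {
    "JeopardyQuestion": [
        {
            "_additional": {
                "answer": {"hasAnswer": True, "property": "", "result": " The ant"}
            },
            "question": "In the title of an Aesop fable, this insect shared billing with a grasshopper",
        }
    ]
}

qa = res["JeopardyQuestion"][0]["_additional"]["answer"]
if qa["hasAnswer"]:
    print(qa["result"].strip())  # The ant
```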

Note: Searching across more properties gives the model more context and can improve the answer, but irrelevant properties add noise. Finding the right balance is crucial.

By incorporating LLMs, you can also transform the data before returning the search result. This concept is called generative search.

To enable generative search, you need to specify a generative module under the module configuration, as shown in the following example:

# Module settings  
	"moduleConfig": {  
		"text2vec-openai": {  
		...  
		},  
		"generative-openai": {  
		"model": "gpt-3.5-turbo"  
		}  
},

For generative search, you only need to add the with_generate() argument to your previous vector search code as shown below:

res = client.query.get(  
	"JeopardyQuestion",  
	["question", "answer"])\  
	.with_near_text({"concepts": ["animals"]})\  
	.with_limit(1)\  
	.with_generate(single_prompt= "Generate a question to which the answer is {answer}")\  
	.do()

The above piece of code does the following:

  1. Search for the question closest to the concept of "animals"
  2. Return the question "Say the name of this type of mollusk you see" with the answer "an octopus"
  3. Generate a completion for the prompt "Generate a question to which the answer is an octopus" with the final result:
{  
	"generate": {  
		"error": null,  
		"singleResult": "What sea creature has eight arms and is known for its intelligence and camouflage abilities?"  
	}  
}