
ClickHouse

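ClickHouse is an open-source database for real-time applications and analytics, built for speed and resource efficiency, with full SQL support and a wide range of functions that help users write analytical queries. Recently added data structures and distance search functions (such as L2Distance), together with approximate nearest neighbor search indexes, allow ClickHouse to serve as a high-performance, scalable vector database that stores and searches vectors with SQL.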

To use this integration, install langchain-community with pip install -qU langchain-community.

This notebook shows how to use the functionality related to ClickHouse vector search.

Setting up the environment

Setting up a local ClickHouse server with Docker (optional)

! docker run -d -p 8123:8123 -p9000:9000 --name langchain-clickhouse-server --ulimit nofile=262144:262144 clickhouse/clickhouse-server:23.4.2.11
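Before moving on, it can help to confirm that the container is actually accepting connections. The following is a minimal sketch, assuming the server from the command above is published on localhost:8123 and exposes ClickHouse's HTTP /ping endpoint; the helper name clickhouse_is_up is hypothetical and not part of LangChain or clickhouse-connect.

```python
import urllib.error
import urllib.request


def clickhouse_is_up(base_url: str = "http://localhost:8123", timeout: float = 2.0) -> bool:
    """Return True if the ClickHouse HTTP interface answers its /ping endpoint.

    Hypothetical helper: base_url assumes the port mapping used in the
    docker command above.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout) as resp:
            # A healthy server replies with the plain-text body "Ok."
            return resp.status == 200 and resp.read().strip() == b"Ok."
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling clickhouse_is_up() a few seconds after docker run should return True once the server has finished starting; a False result usually means the container is still booting or the port mapping differs.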

Setting up the ClickHouse client driver

%pip install --upgrade --quiet  clickhouse-connect

We want to use OpenAIEmbeddings, so we need an OpenAI API key.

import getpass
import os

# Prompt for the key only when it is missing or empty; .get() avoids a KeyError.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
from langchain_community.vectorstores import Clickhouse, ClickhouseSettings
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
for d in docs:
    d.metadata = {"some": "metadata"}
settings = ClickhouseSettings(table="clickhouse_vector_search_example")
docsearch = Clickhouse.from_documents(docs, embeddings, config=settings)

query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)
Inserting data...: 100%|██████████| 42/42 [00:00<00:00, 2801.49it/s]
print(docs[0].page_content)
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

Get connection info and data schema

print(str(docsearch))
default.clickhouse_vector_search_example @ localhost:8123

username: None

Table Schema:
---------------------------------------------------
|id |Nullable(String) |
|document |Nullable(String) |
|embedding |Array(Float32) |
|metadata |Object('json') |
|uuid |UUID |
---------------------------------------------------

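ClickHouse table schema

If the ClickHouse table does not exist, it is created automatically by default. Advanced users can create the table in advance with optimized settings. For a distributed ClickHouse cluster with sharding, the table engine should be configured as Distributed.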

print(f"Clickhouse Table DDL:\n\n{docsearch.schema}")
Clickhouse Table DDL:

CREATE TABLE IF NOT EXISTS default.clickhouse_vector_search_example(
id Nullable(String),
document Nullable(String),
embedding Array(Float32),
metadata JSON,
uuid UUID DEFAULT generateUUIDv4(),
CONSTRAINT cons_vec_len CHECK length(embedding) = 1536,
INDEX vec_idx embedding TYPE annoy(100,'L2Distance') GRANULARITY 1000
) ENGINE = MergeTree ORDER BY uuid SETTINGS index_granularity = 8192
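To make the DDL more concrete, here is a minimal pure-Python sketch of what the cons_vec_len constraint and an L2Distance ranking amount to. It is an exact brute-force search for illustration only; the annoy index in the DDL is approximate and works differently inside ClickHouse. The names check_length, l2_distance, and nearest are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

EMBEDDING_DIM = 1536  # same value as the CHECK constraint in the DDL


def check_length(vec: Sequence[float], dim: int = EMBEDDING_DIM) -> None:
    """Mirror of `CHECK length(embedding) = 1536`: reject wrong-sized vectors."""
    if len(vec) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(vec)}")


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(
    query: Sequence[float],
    rows: List[Tuple[str, Sequence[float]]],
    k: int = 4,
) -> List[Tuple[str, float]]:
    """Return the k (id, distance) pairs closest to the query, smallest first."""
    scored = sorted((l2_distance(query, vec), row_id) for row_id, vec in rows)
    return [(row_id, dist) for dist, row_id in scored[:k]]


# Tiny 2-dimensional example (real embeddings would have EMBEDDING_DIM values).
rows = [("a", [0.0, 0.0]), ("b", [3.0, 4.0]), ("c", [1.0, 0.0])]
print(nearest([0.0, 0.0], rows, k=2))
```

With L2 distance, smaller values mean closer vectors, so results are ordered in ascending distance.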

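Filtering

You have direct access to the ClickHouse SQL WHERE clause, so filters are written as standard SQL conditions.

NOTE: Be aware of SQL injection. This interface must not be called directly by end users.

If you customized column_map in your settings, you can search with a filter like this: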

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Clickhouse, ClickhouseSettings

loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

for i, d in enumerate(docs):
    d.metadata = {"doc_id": i}

docsearch = Clickhouse.from_documents(docs, embeddings)
Inserting data...: 100%|██████████| 42/42 [00:00<00:00, 6939.56it/s]
meta = docsearch.metadata_column
output = docsearch.similarity_search_with_relevance_scores(
    "What did the president say about Ketanji Brown Jackson?",
    k=4,
    where_str=f"{meta}.doc_id<10",
)
for d, dist in output:
    print(dist, d.metadata, d.page_content[:20] + "...")
0.6779101415357189 {'doc_id': 0} Madam Speaker, Madam...
0.6997970363474885 {'doc_id': 8} And so many families...
0.7044504914336727 {'doc_id': 1} Groups of citizens b...
0.7053558702165094 {'doc_id': 6} And I’m taking robus...
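The filter above interpolates a constant into where_str. If the threshold ever comes from outside your own code, one way to reduce injection risk is to validate every piece before building the string. The sketch below is an illustration under that assumption, not a complete defense; the helper doc_id_below is hypothetical and not part of LangChain.

```python
import re

# Plain SQL identifier: letters, digits, underscores, not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def doc_id_below(metadata_column: str, limit: int) -> str:
    """Build a `<column>.doc_id<N` condition from validated parts only.

    Hypothetical helper: the column name must be a plain identifier and the
    limit must be a real int, so arbitrary SQL text cannot be smuggled in.
    """
    if not _IDENTIFIER.fullmatch(metadata_column):
        raise ValueError(f"invalid column name: {metadata_column!r}")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int):
        raise TypeError("limit must be an int")
    return f"{metadata_column}.doc_id<{limit}"


print(doc_id_below("metadata", 10))
```

The resulting string can then be passed as where_str, for example where_str=doc_id_below(docsearch.metadata_column, 10).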

Deleting your data

docsearch.drop()
