Version: Next

ai-proxy-multi

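Description#

The ai-proxy-multi plugin simplifies access to LLM and embedding models by transforming plugin configurations into the request format required by OpenAI, DeepSeek, Azure, AIMLAPI, and other OpenAI-compatible APIs. It extends the capabilities of ai-proxy with load balancing, retries, fallbacks, and health checks.

In addition, the plugin supports logging LLM request information in the access log, such as token usage, model, and time to first response.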

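Request Format#

| Name | Type | Required | Description |
|------|------|----------|-------------|
| messages | Array | | An array of message objects. |
| messages.role | String | | Role of the message (system, user, assistant). |
| messages.content | String | | Content of the message. |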

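Attributes#

| Name | Type | Required | Default | Valid values | Description |
|------|------|----------|---------|--------------|-------------|
| fallback_strategy | string or array | | | string: "instance_health_and_rate_limiting", "http_429", "http_5xx"; array: ["rate_limiting", "http_429", "http_5xx"] | Fallback strategy. When set, the plugin checks whether the specified instance's token quota has been exhausted when forwarding a request. If so, it forwards the request to the next instance regardless of instance priority. When not set, the plugin does not forward requests to lower-priority instances when the tokens of the higher-priority instance are exhausted. |
| balancer | object | | | | Load balancing configurations. |
| balancer.algorithm | string | | roundrobin | [roundrobin, chash] | Load balancing algorithm. When set to roundrobin, the weighted round robin algorithm is used. When set to chash, the consistent hashing algorithm is used. |
| balancer.hash_on | string | | | [vars, headers, cookie, consumer, vars_combinations] | Used when balancer.algorithm is chash. Supports hashing on NGINX variables, headers, cookie, consumer, or a combination of NGINX variables. |
| balancer.key | string | | | | Used when balancer.algorithm is chash. When hash_on is set to header or cookie, key is required. When hash_on is set to consumer, key is not required, as the consumer name is used as the key automatically. |
| instances | array[object] | | | | LLM instance configurations. |
| instances.name | string | | | | Name of the LLM service instance. |
| instances.provider | string | | | [openai, deepseek, azure-openai, aimlapi, openai-compatible] | LLM service provider. When set to openai, the plugin proxies requests to api.openai.com. When set to deepseek, the plugin proxies requests to api.deepseek.com. When set to aimlapi, the plugin uses the OpenAI-compatible driver and proxies requests to api.aimlapi.com by default. When set to openai-compatible, the plugin proxies requests to the custom endpoint configured in override. |
| instances.priority | integer | | 0 | | Priority of the LLM instance in load balancing. priority takes precedence over weight. |
| instances.weight | integer | | 0 | greater than or equal to 0 | Weight of the LLM instance in load balancing. |
| instances.auth | object | | | | Authentication configurations. |
| instances.auth.header | object | | | | Authentication headers. At least one of header and query should be configured. |
| instances.auth.query | object | | | | Authentication query parameters. At least one of header and query should be configured. |
| instances.options | object | | | | Model configurations. In addition to model, you can configure other parameters, which are forwarded to the upstream LLM service in the request body. For instance, if you are working with OpenAI, DeepSeek, or AIMLAPI, you can configure parameters such as max_tokens, temperature, top_p, and stream. See your LLM provider's API documentation for more available options. |
| instances.options.model | string | | | | Name of the LLM model, such as gpt-4 or gpt-3.5. See your LLM provider's API documentation for more available models. |
| logging | object | | | | Logging configurations. |
| logging.summaries | boolean | | false | | If true, log the request LLM model, duration, and request and response tokens. |
| logging.payloads | boolean | | false | | If true, log the request and response payloads. |
| instances.override | object | | | | Override settings. |
| instances.override.endpoint | string | | | | LLM provider endpoint that replaces the default endpoint. If not configured, the plugin uses the default OpenAI endpoint https://api.openai.com/v1/chat/completions. |
| instances.checks | object | | | | Health check configurations. Note that at the moment, OpenAI, DeepSeek, and AIMLAPI do not provide an official health check endpoint. Other LLM services that you configure under the openai-compatible provider may have health check endpoints available. |
| instances.checks.active | object | | | | Active health check configurations. |
| instances.checks.active.type | string | | http | [http, https, tcp] | Type of health check connection. |
| instances.checks.active.timeout | number | | 1 | | Health check timeout in seconds. |
| instances.checks.active.concurrency | integer | | 10 | | Number of upstream nodes to check at the same time. |
| instances.checks.active.host | string | | | | HTTP host. |
| instances.checks.active.port | integer | | | 1 to 65535 (inclusive) | HTTP port. |
| instances.checks.active.http_path | string | | / | | Path for HTTP probing requests. |
| instances.checks.active.https_verify_certificate | boolean | | true | | If true, verify the node's TLS certificate. |
| timeout | integer | | 30000 | greater than or equal to 1 | Request timeout in milliseconds when requesting the LLM service. |
| keepalive | boolean | | true | | If true, keep the connection alive when requesting the LLM service. |
| keepalive_timeout | integer | | 60000 | greater than or equal to 1000 | Keepalive timeout in milliseconds when connecting to the LLM service. |
| keepalive_pool | integer | | 30 | | Keepalive pool size when connecting to the LLM service. |
| ssl_verify | boolean | | true | | If true, verify the LLM service's certificate. |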

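Examples#

The following examples demonstrate how to configure ai-proxy-multi for different scenarios.

Note

You can fetch the admin_key from config.yaml and save it to an environment variable with the following command: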

admin_key=$(yq '.deployment.admin.admin_key[0].key' conf/config.yaml | sed 's/"//g')

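Load Balance between Instances#

The following example demonstrates how to configure two models for load balancing, forwarding 80% of the traffic to one instance and 20% to the other.

For demonstration and easier differentiation, you will configure one OpenAI instance and one DeepSeek instance as the upstream LLM services.

Create a route and update it with your LLM providers, models, API keys, and endpoints, if applicable: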

curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
-H "X-API-KEY: ${admin_key}" \
-d '{
"id": "ai-proxy-multi-route",
"uri": "/anything",
"methods": ["POST"],
"plugins": {
"ai-proxy-multi": {
"instances": [
{
"name": "openai-instance",
"provider": "openai",
"weight": 8,
"auth": {
"header": {
"Authorization": "Bearer '"$OPENAI_API_KEY"'"
}
},
"options": {
"model": "gpt-4"
}
},
{
"name": "deepseek-instance",
"provider": "deepseek",
"weight": 2,
"auth": {
"header": {
"Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
}
},
"options": {
"model": "deepseek-chat"
}
}
]
}
}
}'

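Send 10 POST requests to the route with a system prompt and a sample user question in the request body, to see how many requests are forwarded to OpenAI and how many to DeepSeek: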

openai_count=0
deepseek_count=0

for i in {1..10}; do
model=$(curl -s "http://127.0.0.1:9080/anything" -X POST \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "system", "content": "You are a mathematician" },
{ "role": "user", "content": "What is 1+1?" }
]
}' | jq -r '.model')

if [[ "$model" == *"gpt-4"* ]]; then
((openai_count++))
elif [[ "$model" == "deepseek-chat" ]]; then
((deepseek_count++))
fi
done

echo "OpenAI responses: $openai_count"
echo "DeepSeek responses: $deepseek_count"

You should see a response similar to the following:

OpenAI responses: 8
DeepSeek responses: 2
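
The 8:2 split above follows from the configured weights. As a rough sketch (the helper name expected_requests is hypothetical and this is not how the plugin computes anything internally), the expected share per instance can be worked out from the weights alone. Over a small sample such as 10 requests, the observed counts may differ slightly from the expected ones.

```shell
#!/usr/bin/env bash
# Sketch: expected number of requests per instance under weighted balancing.
# Usage: expected_requests <instance_weight> <total_weight> <total_requests>
# (hypothetical helper, for illustration only)
expected_requests() {
  local weight=$1 total_weight=$2 requests=$3
  # Integer arithmetic; exact when requests * weight is divisible by total_weight.
  echo $(( requests * weight / total_weight ))
}

# Weights taken from the route configuration above.
total_weight=$(( 8 + 2 ))
openai_expected=$(expected_requests 8 "$total_weight" 10)
deepseek_expected=$(expected_requests 2 "$total_weight" 10)

echo "Expected OpenAI requests: $openai_expected"
echo "Expected DeepSeek requests: $deepseek_expected"
```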

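Configure Instance Priority and Rate Limiting#

The following example demonstrates how to configure two models with different priorities and apply rate limiting on the instance with the higher priority. With fallback_strategy set to ["rate_limiting"], the plugin should continue to forward requests to the lower-priority instance once the rate limiting quota of the higher-priority instance is fully consumed.

Create a route and update it with your LLM providers, models, API keys, and endpoints, if applicable: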

curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
-H "X-API-KEY: ${admin_key}" \
-d '{
"id": "ai-proxy-multi-route",
"uri": "/anything",
"methods": ["POST"],
"plugins": {
"ai-proxy-multi": {
"fallback_strategy": ["rate_limiting"],
"instances": [
{
"name": "openai-instance",
"provider": "openai",
"priority": 1,
"weight": 0,
"auth": {
"header": {
"Authorization": "Bearer '"$OPENAI_API_KEY"'"
}
},
"options": {
"model": "gpt-4"
}
},
{
"name": "deepseek-instance",
"provider": "deepseek",
"priority": 0,
"weight": 0,
"auth": {
"header": {
"Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
}
},
"options": {
"model": "deepseek-chat"
}
}
]
},
"ai-rate-limiting": {
"instances": [
{
"name": "openai-instance",
"limit": 10,
"time_window": 60
}
],
"limit_strategy": "total_tokens"
}
}
}'

Send a POST request to the route with a system prompt and a sample user question in the request body:

curl "http://127.0.0.1:9080/anything" -X POST \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "system", "content": "You are a mathematician" },
{ "role": "user", "content": "What is 1+1?" }
]
}'

You should receive a response similar to the following:

{
...,
"model": "gpt-4-0613",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "1+1 equals 2.",
"refusal": null
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 23,
"completion_tokens": 8,
"total_tokens": 31,
"prompt_tokens_details": {
"cached_tokens": 0,
"audio_tokens": 0
},
"completion_tokens_details": {
"reasoning_tokens": 0,
"audio_tokens": 0,
"accepted_prediction_tokens": 0,
"rejected_prediction_tokens": 0
}
},
"service_tier": "default",
"system_fingerprint": null
}

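Since the total_tokens value exceeds the configured quota of 10, the next request within the 60-second window is expected to be forwarded to the other instance.

Within the same 60-second window, send another POST request to the route: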

curl "http://127.0.0.1:9080/anything" -X POST \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "system", "content": "You are a mathematician" },
{ "role": "user", "content": "Explain Newton law" }
]
}'

You should see a response similar to the following:

{
...,
"model": "deepseek-chat",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Certainly! Newton's laws of motion are three fundamental principles that describe the relationship between the motion of an object and the forces acting on it. They were formulated by Sir Isaac Newton in the late 17th century and are foundational to classical mechanics.\n\n---\n\n### **1. Newton's First Law (Law of Inertia):**\n- **Statement:** An object at rest will remain at rest, and an object in motion will continue moving at a constant velocity (in a straight line at a constant speed), unless acted upon by an external force.\n- **Key Idea:** This law introduces the concept of **inertia**, which is the tendency of an object to resist changes in its state of motion.\n- **Example:** If you slide a book across a table, it eventually stops because of the force of friction acting on it. Without friction, the book would keep moving indefinitely.\n\n---\n\n### **2. Newton's Second Law (Law of Acceleration):**\n- **Statement:** The acceleration of an object is directly proportional to the net force acting on it and inversely proportional to its mass. Mathematically, this is expressed as:\n \\[\n F = ma\n \\]\n where:\n - \\( F \\) = net force applied (in Newtons),\n -"
},
...
}
],
...
}
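
The fallback behavior in this example can be modelled roughly as follows. This is a simplified sketch and not the plugin's implementation: the function names pick_instance and record_usage are hypothetical, the 60-second window reset is not modelled, and it assumes the quota is checked before a request is forwarded while token usage is only recorded once the response arrives. That assumption matches the observation above that the first request succeeds even though it uses 31 tokens.

```shell
#!/usr/bin/env bash
# Simplified model of token-quota fallback between two instances.
limit=10   # token quota configured for openai-instance
used=0     # tokens consumed in the current window

# Hypothetical helper: choose the instance for the next request.
pick_instance() {
  if (( used < limit )); then
    echo "openai-instance"    # higher priority, quota still available
  else
    echo "deepseek-instance"  # quota exhausted, fall back
  fi
}

# Hypothetical helper: record total_tokens reported in a response.
record_usage() {
  used=$(( used + $1 ))
}

first=$(pick_instance)    # quota untouched, so the request goes to OpenAI
record_usage 31           # total_tokens from the first response
second=$(pick_instance)   # 31 >= 10, so the request falls back to DeepSeek

echo "First request:  $first"
echo "Second request: $second"
```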
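
Load Balance and Rate Limit by Consumers#

The following example demonstrates how to configure two models for load balancing and apply rate limiting by consumer.

Create a consumer johndoe and set a rate limiting quota of 10 tokens in a 60-second window on the openai-instance instance:
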
curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \
-H "X-API-KEY: ${admin_key}" \
-d '{
"username": "johndoe",
"plugins": {
"ai-rate-limiting": {
"instances": [
{
"name": "openai-instance",
"limit": 10,
"time_window": 60
}
],
"rejected_code": 429,
"limit_strategy": "total_tokens"
}
}
}'

Configure the key-auth credential for johndoe:

curl "http://127.0.0.1:9180/apisix/admin/consumers/johndoe/credentials" -X PUT \
-H "X-API-KEY: ${admin_key}" \
-d '{
"id": "cred-john-key-auth",
"plugins": {
"key-auth": {
"key": "john-key"
}
}
}'

Create another consumer janedoe and set a rate limiting quota of 10 tokens in a 60-second window on the deepseek-instance instance:

curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \
-H "X-API-KEY: ${admin_key}" \
-d '{
"username": "janedoe",
"plugins": {
"ai-rate-limiting": {
"instances": [
{
"name": "deepseek-instance",
"limit": 10,
"time_window": 60
}
],
"rejected_code": 429,
"limit_strategy": "total_tokens"
}
}
}'

Configure the key-auth credential for janedoe:

curl "http://127.0.0.1:9180/apisix/admin/consumers/janedoe/credentials" -X PUT \
-H "X-API-KEY: ${admin_key}" \
-d '{
"id": "cred-jane-key-auth",
"plugins": {
"key-auth": {
"key": "jane-key"
}
}
}'

Create a route and update it with your LLM providers, models, API keys, and endpoints, if applicable:

curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
-H "X-API-KEY: ${admin_key}" \
-d '{
"id": "ai-proxy-multi-route",
"uri": "/anything",
"methods": ["POST"],
"plugins": {
"key-auth": {},
"ai-proxy-multi": {
"fallback_strategy": ["rate_limiting"],
"instances": [
{
"name": "openai-instance",
"provider": "openai",
"weight": 0,
"auth": {
"header": {
"Authorization": "Bearer '"$OPENAI_API_KEY"'"
}
},
"options": {
"model": "gpt-4"
}
},
{
"name": "deepseek-instance",
"provider": "deepseek",
"weight": 0,
"auth": {
"header": {
"Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
}
},
"options": {
"model": "deepseek-chat"
}
}
]
}
}
}'

Send a POST request to the route without any consumer key:

curl -i "http://127.0.0.1:9080/anything" -X POST \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "system", "content": "You are a mathematician" },
{ "role": "user", "content": "What is 1+1?" }
]
}'

You should receive an HTTP/1.1 401 Unauthorized response.

Send a POST request to the route with johndoe's key:

curl "http://127.0.0.1:9080/anything" -X POST \
-H "Content-Type: application/json" \
-H 'apikey: john-key' \
-d '{
"messages": [
{ "role": "system", "content": "You are a mathematician" },
{ "role": "user", "content": "What is 1+1?" }
]
}'

You should receive a response similar to the following:

{
...,
"model": "gpt-4-0613",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "1+1 equals 2.",
"refusal": null
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 23,
"completion_tokens": 8,
"total_tokens": 31,
"prompt_tokens_details": {
"cached_tokens": 0,
"audio_tokens": 0
},
"completion_tokens_details": {
"reasoning_tokens": 0,
"audio_tokens": 0,
"accepted_prediction_tokens": 0,
"rejected_prediction_tokens": 0
}
},
"service_tier": "default",
"system_fingerprint": null
}

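Since the total_tokens value exceeds the quota configured for johndoe on the openai instance, the next request from johndoe within the 60-second window is expected to be forwarded to the deepseek instance.

Within the same 60-second window, send another POST request to the route with johndoe's key: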

curl "http://127.0.0.1:9080/anything" -X POST \
-H "Content-Type: application/json" \
-H 'apikey: john-key' \
-d '{
"messages": [
{ "role": "system", "content": "You are a mathematician" },
{ "role": "user", "content": "Explain Newtons laws to me" }
]
}'

You should see a response similar to the following:

{
...,
"model": "deepseek-chat",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Certainly! Newton's laws of motion are three fundamental principles that describe the relationship between the motion of an object and the forces acting on it. They were formulated by Sir Isaac Newton in the late 17th century and are foundational to classical mechanics.\n\n---\n\n### **1. Newton's First Law (Law of Inertia):**\n- **Statement:** An object at rest will remain at rest, and an object in motion will continue moving at a constant velocity (in a straight line at a constant speed), unless acted upon by an external force.\n- **Key Idea:** This law introduces the concept of **inertia**, which is the tendency of an object to resist changes in its state of motion.\n- **Example:** If you slide a book across a table, it eventually stops because of the force of friction acting on it. Without friction, the book would keep moving indefinitely.\n\n---\n\n### **2. Newton's Second Law (Law of Acceleration):**\n- **Statement:** The acceleration of an object is directly proportional to the net force acting on it and inversely proportional to its mass. Mathematically, this is expressed as:\n \\[\n F = ma\n \\]\n where:\n - \\( F \\) = net force applied (in Newtons),\n -"
},
...
}
],
...
}

Send a POST request to the route with janedoe's key:

curl "http://127.0.0.1:9080/anything" -X POST \
-H "Content-Type: application/json" \
-H 'apikey: jane-key' \
-d '{
"messages": [
{ "role": "system", "content": "You are a mathematician" },
{ "role": "user", "content": "What is 1+1?" }
]
}'

You should receive a response similar to the following:

{
...,
"model": "deepseek-chat",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The sum of 1 and 1 is 2. This is a basic arithmetic operation where you combine two units to get a total of two units."
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 14,
"completion_tokens": 31,
"total_tokens": 45,
"prompt_tokens_details": {
"cached_tokens": 0
},
"prompt_cache_hit_tokens": 0,
"prompt_cache_miss_tokens": 14
},
"system_fingerprint": "fp_3a5770e1b4_prod0225"
}

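Since the total_tokens value exceeds the quota configured for janedoe on the deepseek instance, the next request from janedoe within the 60-second window is expected to be forwarded to the openai instance.

Within the same 60-second window, send another POST request to the route with janedoe's key: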

curl "http://127.0.0.1:9080/anything" -X POST \
-H "Content-Type: application/json" \
-H 'apikey: jane-key' \
-d '{
"messages": [
{ "role": "system", "content": "You are a mathematician" },
{ "role": "user", "content": "Explain Newtons laws to me" }
]
}'

You should see a response similar to the following:

{
...,
"model": "gpt-4-0613",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Sure, here are Newton's three laws of motion:\n\n1) Newton's First Law, also known as the Law of Inertia, states that an object at rest will stay at rest, and an object in motion will stay in motion, unless acted on by an external force. In simple words, this law suggests that an object will keep doing whatever it is doing until something causes it to do otherwise. \n\n2) Newton's Second Law states that the force acting on an object is equal to the mass of that object times its acceleration (F=ma). This means that force is directly proportional to mass and acceleration. The heavier the object and the faster it accelerates, the greater the force.\n\n3) Newton's Third Law, also known as the law of action and reaction, states that for every action, there is an equal and opposite reaction. Essentially, any force exerted onto a body will create a force of equal magnitude but in the opposite direction on the object that exerted the first force.\n\nRemember, these laws become less accurate when considering speeds near the speed of light (where Einstein's theory of relativity becomes more appropriate) or objects very small or very large. However, for everyday situations, they provide a good model of how things move.",
"refusal": null
},
"logprobs": null,
"finish_reason": "stop"
}
],
...
}

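This shows ai-proxy-multi load balancing the traffic according to each consumer's rate limiting rules in ai-rate-limiting.

Restrict Maximum Number of Completion Tokens#

The following example demonstrates how to restrict the number of completion_tokens used when generating a chat completion.

For demonstration and easier differentiation, you will configure one OpenAI instance and one DeepSeek instance as the upstream LLM services.

Create a route and update it with your LLM providers, models, API keys, and endpoints, if applicable: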

curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
-H "X-API-KEY: ${admin_key}" \
-d '{
"id": "ai-proxy-multi-route",
"uri": "/anything",
"methods": ["POST"],
"plugins": {
"ai-proxy-multi": {
"instances": [
{
"name": "openai-instance",
"provider": "openai",
"weight": 0,
"auth": {
"header": {
"Authorization": "Bearer '"$OPENAI_API_KEY"'"
}
},
"options": {
"model": "gpt-4",
"max_tokens": 50
}
},
{
"name": "deepseek-instance",
"provider": "deepseek",
"weight": 0,
"auth": {
"header": {
"Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
}
},
"options": {
"model": "deepseek-chat",
"max_tokens": 100
}
}
]
}
}
}'

Send a POST request to the route with a system prompt and a sample user question in the request body:

curl "http://127.0.0.1:9080/anything" -X POST \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "system", "content": "You are a mathematician" },
{ "role": "user", "content": "Explain Newtons law" }
]
}'

If the request is proxied to OpenAI, you should see a response similar to the following, where the content is truncated at the max_tokens threshold of 50:

{
...,
"model": "gpt-4-0613",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Newton's Laws of Motion are three physical laws that form the bedrock for classical mechanics. They describe the relationship between a body and the forces acting upon it, and the body'",
"refusal": null
},
"logprobs": null,
"finish_reason": "length"
}
],
"usage": {
"prompt_tokens": 20,
"completion_tokens": 50,
"total_tokens": 70,
"prompt_tokens_details": {
"cached_tokens": 0,
"audio_tokens": 0
},
"completion_tokens_details": {
"reasoning_tokens": 0,
"audio_tokens": 0,
"accepted_prediction_tokens": 0,
"rejected_prediction_tokens": 0
}
},
"service_tier": "default",
"system_fingerprint": null
}

If the request is proxied to DeepSeek, you should see a response similar to the following, where the content is truncated at the max_tokens threshold of 100:

{
...,
"model": "deepseek-chat",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Newton's Laws of Motion are three fundamental principles that form the foundation of classical mechanics. They describe the relationship between a body and the forces acting upon it, and the body's motion in response to those forces. Here's a brief explanation of each law:\n\n1. **Newton's First Law (Law of Inertia):**\n - **Statement:** An object will remain at rest or in uniform motion in a straight line unless acted upon by an external force.\n - **Explanation:** This law"
},
"logprobs": null,
"finish_reason": "length"
}
],
"usage": {
"prompt_tokens": 10,
"completion_tokens": 100,
"total_tokens": 110,
"prompt_tokens_details": {
"cached_tokens": 0
},
"prompt_cache_hit_tokens": 0,
"prompt_cache_miss_tokens": 10
},
"system_fingerprint": "fp_3a5770e1b4_prod0225"
}
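
In both responses, truncation is signalled by finish_reason being "length" rather than "stop". A minimal sketch of checking for this in a script is shown below. The helper name is_truncated and the trimmed sample_response are hypothetical, and the pattern match is a shortcut that assumes the field appears once in the body; a proper JSON parser would be more robust.

```shell
#!/usr/bin/env bash
# Reads a chat completion response on stdin and succeeds
# if it contains "finish_reason": "length" (hypothetical helper).
is_truncated() {
  grep -Eq '"finish_reason"[[:space:]]*:[[:space:]]*"length"'
}

# Trimmed, illustrative response body (not real output).
sample_response='{"model": "gpt-4-0613", "choices": [{"index": 0, "finish_reason": "length"}]}'

if printf '%s' "$sample_response" | is_truncated; then
  truncated="yes"
else
  truncated="no"
fi

echo "Truncated by max_tokens: $truncated"
```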

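Proxy to Embedding Models#

The following example demonstrates how to configure the ai-proxy-multi plugin to proxy requests and load balance between embedding models.

Create a route and update it with your LLM providers, embedding models, API keys, and endpoints: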

curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
-H "X-API-KEY: ${admin_key}" \
-d '{
"id": "ai-proxy-multi-route",
"uri": "/anything",
"methods": ["POST"],
"plugins": {
"ai-proxy-multi": {
"instances": [
{
"name": "openai-instance",
"provider": "openai",
"weight": 0,
"auth": {
"header": {
"Authorization": "Bearer '"$OPENAI_API_KEY"'"
}
},
"options": {
"model": "text-embedding-3-small"
},
"override": {
"endpoint": "https://api.openai.com/v1/embeddings"
}
},
{
"name": "az-openai-instance",
"provider": "openai-compatible",
"weight": 0,
"auth": {
"header": {
"Authorization": "Bearer '"$AZ_OPENAI_API_KEY"'"
}
},
"options": {
"model": "text-embedding-3-small"
},
"override": {
"endpoint": "https://ai-plugin-developer.openai.azure.com/openai/deployments/text-embedding-3-small/embeddings?api-version=2023-05-15"
}
}
]
}
}
}'

Send a POST request to the route with an input string:

curl "http://127.0.0.1:9080/anything" -X POST \
-H "Content-Type: application/json" \
-d '{
"input": "hello world"
}'

You should receive a response similar to the following:

{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [
-0.0067144386,
-0.039197803,
0.034177095,
0.028763203,
-0.024785956,
-0.04201061,
...
]
}
],
"model": "text-embedding-3-small",
"usage": {
"prompt_tokens": 2,
"total_tokens": 2
}
}

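Enable Active Health Checks#

The following example demonstrates how to configure the ai-proxy-multi plugin to proxy requests and load balance between models, with active health checks enabled to improve service availability. You can enable health checks on one or more instances.

Create a route and update it with your LLM providers, models, API keys, and health check related configurations: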

curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
-H "X-API-KEY: ${admin_key}" \
-d '{
"id": "ai-proxy-multi-route",
"uri": "/anything",
"methods": ["POST"],
"plugins": {
"ai-proxy-multi": {
"instances": [
{
"name": "llm-instance-1",
"provider": "openai-compatible",
"weight": 0,
"auth": {
"header": {
"Authorization": "Bearer '"$YOUR_LLM_API_KEY"'"
}
},
"options": {
"model": "'"$YOUR_LLM_MODEL"'"
}
},
{
"name": "llm-instance-2",
"provider": "openai-compatible",
"weight": 0,
"auth": {
"header": {
"Authorization": "Bearer '"$YOUR_LLM_API_KEY"'"
}
},
"options": {
"model": "'"$YOUR_LLM_MODEL"'"
},
"checks": {
"active": {
"type": "https",
"host": "yourhost.com",
"http_path": "/your/probe/path",
"healthy": {
"interval": 2,
"successes": 1
},
"unhealthy": {
"interval": 1,
"http_failures": 3
}
}
}
}
]
}
}
}'

To verify, the behavior should be consistent with the verification described for active health checks.
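
To make the thresholds in the configuration above concrete, the following sketch models how an instance might move between healthy and unhealthy states with unhealthy.http_failures set to 3 and healthy.successes set to 1. It is an illustrative state machine only: the function record_probe and the counter handling are assumptions, not the gateway's health checker, and probe intervals and timeouts are not modelled.

```shell
#!/usr/bin/env bash
# Thresholds taken from the checks.active configuration above.
http_failures_threshold=3
successes_threshold=1

state="healthy"
failures=0
successes=0

# Hypothetical helper: feed one probe result ("ok" or "fail") into the model.
record_probe() {
  if [[ "$1" == "ok" ]]; then
    failures=0
    successes=$(( successes + 1 ))
    if [[ "$state" == "unhealthy" ]] && (( successes >= successes_threshold )); then
      state="healthy"
    fi
  else
    successes=0
    failures=$(( failures + 1 ))
    if [[ "$state" == "healthy" ]] && (( failures >= http_failures_threshold )); then
      state="unhealthy"
    fi
  fi
}

# Three consecutive failed probes mark the instance unhealthy.
for result in fail fail fail; do
  record_probe "$result"
done
state_after_failures=$state

# A single successful probe marks it healthy again.
record_probe ok
state_after_recovery=$state

echo "After 3 failures: $state_after_failures"
echo "After 1 success:  $state_after_recovery"
```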

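Include LLM Information in Access Log#

The following example demonstrates how to log LLM request related information in the gateway's access log to improve analytics and auditing. The following variables are available:

  • request_llm_model: Name of the LLM model specified in the request.
  • apisix_upstream_response_time: Time taken for APISIX to send the request to the upstream service and receive the full response.
  • request_type: Type of the request. The value can be traditional_http, ai_chat, or ai_stream.
  • llm_time_to_first_token: Duration from sending the request to receiving the first token from the LLM service, in milliseconds.
  • llm_model: LLM model.
  • llm_prompt_tokens: Number of tokens in the prompt.
  • llm_completion_tokens: Number of tokens in the chat completion.

Update the access log format in your configuration file to include the additional LLM related variables:

conf/config.yaml
nginx_config:
  http:
    access_log_format: "$remote_addr - $remote_user [$time_local] $http_host \"$request_line\" $status $body_bytes_sent $request_time \"$http_referer\" \"$http_user_agent\" $upstream_addr $upstream_status $apisix_upstream_response_time \"$upstream_scheme://$upstream_host$upstream_uri\" \"$apisix_request_id\" \"$request_type\" \"$llm_time_to_first_token\" \"$llm_model\" \"$request_llm_model\" \"$llm_prompt_tokens\" \"$llm_completion_tokens\""

Reload APISIX for the configuration changes to take effect.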

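Next, create a route with the ai-proxy-multi plugin and send a request. For instance, if the request is forwarded to OpenAI and you receive the following response: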

{
...,
"model": "gpt-4-0613",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "1+1 equals 2.",
"refusal": null,
"annotations": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 23,
"completion_tokens": 8,
"total_tokens": 31,
"prompt_tokens_details": {
"cached_tokens": 0,
"audio_tokens": 0
},
...
},
"service_tier": "default",
"system_fingerprint": null
}

In the gateway's access log, you should see a log entry similar to the following:

192.168.215.1 - - [21/Mar/2025:04:28:03 +0000] api.openai.com "POST /anything HTTP/1.1" 200 804 2.858 "-" "curl/8.6.0" - - - 5765 "http://api.openai.com" "5c5e0b95f8d303cb81e4dc456a4b12d9" "ai_chat" "2858" "gpt-4" "gpt-4" "23" "8"

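The access log entry shows that the request type is ai_chat, the APISIX upstream response time is 5765 milliseconds, the time to first token is 2858 milliseconds, the requested LLM model is gpt-4, the LLM model is gpt-4, the prompt token usage is 23, and the completion token usage is 8.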
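
For analysis, the LLM fields can be pulled out of such a log line with standard tools. The sketch below assumes the access log format configured above, where the last six quoted fields are the request type, time to first token, LLM model, requested LLM model, prompt tokens, and completion tokens, in that order. The function name parse_llm_fields and the output labels are hypothetical.

```shell
#!/usr/bin/env bash
# Sample entry copied from the access log above.
log_line='192.168.215.1 - - [21/Mar/2025:04:28:03 +0000] api.openai.com "POST /anything HTTP/1.1" 200 804 2.858 "-" "curl/8.6.0" - - - 5765 "http://api.openai.com" "5c5e0b95f8d303cb81e4dc456a4b12d9" "ai_chat" "2858" "gpt-4" "gpt-4" "23" "8"'

# Hypothetical helper: split on double quotes and read the last six quoted
# fields, counting back from the end so earlier fields cannot shift them.
parse_llm_fields() {
  awk -F'"' '{
    printf "request_type=%s ttft_ms=%s llm_model=%s request_llm_model=%s prompt_tokens=%s completion_tokens=%s\n", $(NF-11), $(NF-9), $(NF-7), $(NF-5), $(NF-3), $(NF-1)
  }'
}

parsed=$(printf '%s\n' "$log_line" | parse_llm_fields)
echo "$parsed"
```

Counting fields from the end of the line keeps the extraction stable even if an earlier field, such as the user agent, happens to contain extra quotes.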