ai-proxy-multi

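The ai-proxy-multi plugin simplifies access to LLMs and embedding models by translating the plugin configuration into the request format required by OpenAI, DeepSeek, Gemini, Vertex AI, and other OpenAI-compatible APIs. It extends ai-proxy with load balancing, retries, fallbacks, and health checks.

The plugin also supports logging LLM request information in the access log, such as token usage, model, and time to first response.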

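Demo

The following demo walks through the instance priority and rate limiting example. It shows how to use the dashboard in API7 Enterprise to configure two models with different priorities and apply rate limiting to the higher-priority instance. With fallback_strategy set to ["rate_limiting"], once the rate limiting quota of the high-priority instance is exhausted, the plugin should continue forwarding requests to the low-priority instance.

Examples

The following examples demonstrate how to configure ai-proxy-multi for different scenarios.

Load Balance between Instances

The following example demonstrates how to configure two models for load balancing, forwarding 80% of the traffic to one instance and 20% to the other.

For demonstration and easier differentiation, you will configure one OpenAI instance and one DeepSeek instance as the upstream LLM services.

Create a route as follows and update your LLM providers, models, API keys, and endpoints as needed: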

curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "id": "ai-proxy-multi-route",
    "uri": "/anything",
    "methods": ["POST"],
    "plugins": {
      "ai-proxy-multi": {
        "instances": [
          {
            "name": "openai-instance",
            "provider": "openai",
            "weight": 8,
            "auth": {
              "header": {
                "Authorization": "Bearer '"$OPENAI_API_KEY"'"
              }
            },
            "options": {
              "model": "gpt-4"
            }
          },
          {
            "name": "deepseek-instance",
            "provider": "deepseek",
            "weight": 2,
            "auth": {
              "header": {
                "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
              }
            },
            "options": {
              "model": "deepseek-chat"
            }
          }
        ]
      }
    }
  }'

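❶ Configure the weight of openai-instance to be 8.

❷ Configure the weight of deepseek-instance to be 2.

Send 10 POST requests to the route, each with a system prompt and a sample user question in the request body, to see how many requests are forwarded to OpenAI and DeepSeek: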

openai_count=0
deepseek_count=0

for i in {1..10}; do 
  model=$(curl -s "http://127.0.0.1:9080/anything" -X POST \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [
        { "role": "system", "content": "You are a mathematician" },
        { "role": "user", "content": "What is 1+1?" }
      ]
    }' | jq -r '.model')

  if [[ "$model" == *"gpt-4"* ]]; then
    ((openai_count++))
  elif [[ "$model" == "deepseek-chat" ]]; then
    ((deepseek_count++))
  fi
done

echo "OpenAI responses: $openai_count"
echo "DeepSeek responses: $deepseek_count"

You should see a response similar to the following:

OpenAI responses: 8
DeepSeek responses: 2
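
As a rough illustration of how weights of 8 and 2 translate into an 80/20 split, the following sketch simulates a smooth weighted round-robin over ten requests. It is a simplified model written for this explanation, not the plugin's balancer code, and all variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Simplified model of weighted selection (smooth weighted round-robin).
# Illustrative only: this is NOT the plugin's implementation.
w_openai=8      # weight of openai-instance, as in the route above
w_deepseek=2    # weight of deepseek-instance, as in the route above
total=$(( w_openai + w_deepseek ))

cur_openai=0; cur_deepseek=0   # running score per instance
n_openai=0;   n_deepseek=0     # requests each instance received

for _ in 1 2 3 4 5 6 7 8 9 10; do
  # Each round, every instance gains its own weight.
  cur_openai=$(( cur_openai + w_openai ))
  cur_deepseek=$(( cur_deepseek + w_deepseek ))
  # The instance with the highest score is picked and pays back the total.
  if (( cur_openai >= cur_deepseek )); then
    cur_openai=$(( cur_openai - total ))
    n_openai=$(( n_openai + 1 ))
  else
    cur_deepseek=$(( cur_deepseek - total ))
    n_deepseek=$(( n_deepseek + 1 ))
  fi
done

echo "openai-instance: $n_openai"
echo "deepseek-instance: $n_deepseek"
```

In this model the ten picks split 8 to 2. Against live upstreams, treat the counts from a small sample as approximate.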

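Load Balance between Gemini and Vertex AI

The following example demonstrates how to configure load balancing between Google AI Studio Gemini and Vertex AI Gemini, forwarding 70% of the traffic to Gemini and 30% to Vertex AI. This example applies only to API7 Enterprise 3.9.2 and later, and is not available in APISIX.

Before proceeding:

For Google AI Studio Gemini, obtain a Gemini API key.
For Vertex AI Gemini, enable Vertex AI and billing for your GCP project. Then, following the service account credentials instructions, create a service account in GCP, assign it the "Vertex AI User" role, and download the account credentials in JSON format.

Create a route as follows and update your project ID and region: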

curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "id": "ai-proxy-multi-google-ai-route",
    "uri": "/anything",
    "methods": ["POST"],
    "plugins": {
      "ai-proxy-multi": {
        "fallback_strategy": ["rate_limiting"],
        "instances": [
          {
            "name": "gemini-instance",
            "provider": "gemini",
            "weight": 7,
            "auth": {
              "header": {
                "Authorization": "Bearer '"$GEMINI_API_KEY"'"
              }
            },
            "options": {
              "model": "gemini-2.5-flash"
            }
          },
          {
            "name": "vertex-ai-instance",
            "provider": "vertex-ai",
            "weight": 3,
            "auth": {
              "gcp": {
                "service_account_json": "'"$GCP_SA_JSON"'"
              }
            },
            "provider_conf": {
              "project_id": "api7-vertex",
              "region": "us-central1"
            },
            "options": {
              "model": "google/gemini-2.5-flash"
            }
          }
        ]
      }
    }
  }'

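❶ Configure the provider as gemini to access Google AI Studio Gemini.

❷ Replace with your Gemini API key in the Authorization header.

❸ Specify the name of the Gemini model used through Google AI Studio, in the <model> format.

❹ Configure the provider as vertex-ai to access Vertex AI Gemini.

❺ Replace with your JSON credentials. Make sure it is a JSON-escaped string.

❻ Replace with your Vertex AI project ID and region.

❼ Specify the name of the Gemini model used through Vertex AI, in the <publisher>/<model> format.

Send 10 POST requests to the route to see the load balancing distribution: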

studio_count=0
vertex_count=0

for i in {1..10}; do 
  model=$(curl -s "http://127.0.0.1:9080/anything" -X POST \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [
        { "role": "system", "content": "You are a mathematician" },
        { "role": "user", "content": "What is 1+1?" }
      ]
    }' | jq -r '.model')

  if [[ "$model" == "gemini-2.5-flash" ]]; then
    ((studio_count++))
  elif [[ "$model" == "google/gemini-2.5-flash" ]]; then
    ((vertex_count++))
  fi
done

echo "Google AI Studio Gemini responses: $studio_count"
echo "Vertex AI Gemini responses: $vertex_count"

You should see a response similar to the following:

Google AI Studio Gemini responses: 7
Vertex AI Gemini responses: 3

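Configure Instance Priority and Rate Limiting

The following example demonstrates how to configure two models with different priorities and apply rate limiting to the higher-priority instance. With fallback_strategy set to ["rate_limiting"], once the rate limiting quota of the high-priority instance is exhausted, the plugin should continue forwarding requests to the low-priority instance.

Create a route as follows and update your LLM providers, models, API keys, and endpoints as needed: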

curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "id": "ai-proxy-multi-route",
    "uri": "/anything",
    "methods": ["POST"],
    "plugins": {
      "ai-proxy-multi": {
        "fallback_strategy": ["rate_limiting"],
        "instances": [
          {
            "name": "openai-instance",
            "provider": "openai",
            "priority": 1,
            "weight": 0,
            "auth": {
              "header": {
                "Authorization": "Bearer '"$OPENAI_API_KEY"'"
              }
            },
            "options": {
              "model": "gpt-4"
            }
          },
          {
            "name": "deepseek-instance",
            "provider": "deepseek",
            "priority": 0,
            "weight": 0,
            "auth": {
              "header": {
                "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
              }
            },
            "options": {
              "model": "deepseek-chat"
            }
          }
        ]
      },
      "ai-rate-limiting": {
        "instances": [
          {
            "name": "openai-instance",
            "limit": 10,
            "time_window": 60
          }
        ],
        "limit_strategy": "total_tokens"
      }
    }
  }'

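❶ Set fallback_strategy to ["rate_limiting"].

❷ Set a higher priority on the openai-instance instance.

❸ Set a lower priority on the deepseek-instance instance.

❹ Apply rate limiting to the openai-instance instance.

❺ Configure a quota of 10 tokens.

❻ Configure a time window of 60 seconds.

❼ Apply rate limiting by total_tokens.

Send a POST request to the route with a system prompt and a sample user question in the request body: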

curl "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a mathematician" },
      { "role": "user", "content": "What is 1+1?" }
    ]
  }'

You should receive a response similar to the following:

{
  ...,
  "model": "gpt-4-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "1+1 equals 2.",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "completion_tokens": 8,
    "total_tokens": 31,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "service_tier": "default",
  "system_fingerprint": null
}

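Since the total_tokens value of 31 exceeds the configured quota of 10, the next request within the 60-second window is expected to be forwarded to the other instance.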
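
One way to read this sequence is as a token budget that is checked before a request and charged after the response, which would explain why the first request is served by OpenAI even though it uses 31 tokens. The sketch below models that ordering. It is an assumption inferred from the behavior shown in this example, not the ai-rate-limiting implementation, and the function and variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Simplified model of a token quota within one time window.
# Illustrative only: NOT the ai-rate-limiting implementation.
limit=10        # token quota configured for openai-instance
used=0          # tokens charged so far in the current window

# choose_instance prints the instance a request would go to in this model.
choose_instance() {
  if (( used < limit )); then
    echo "openai-instance"      # quota not yet exhausted
  else
    echo "deepseek-instance"    # fall back once the quota is used up
  fi
}

first=$(choose_instance)
used=$(( used + 31 ))           # total_tokens reported by the first response
second=$(choose_instance)

echo "first request  -> $first"
echo "second request -> $second"
```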

Within the same 60-second window, send another POST request to the route:

curl "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a mathematician" },
      { "role": "user", "content": "Explain Newton law" }
    ]
  }'

You should see a response similar to the following:

{
  ...,
  "model": "deepseek-chat",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Certainly! Newton's laws of motion are three fundamental principles that describe the relationship between the motion of an object and the forces acting on it. They were formulated by Sir Isaac Newton in the late 17th century and are foundational to classical mechanics.\n\n---\n\n### **1. Newton's First Law (Law of Inertia):**\n- **Statement:** An object at rest will remain at rest, and an object in motion will continue moving at a constant velocity (in a straight line at a constant speed), unless acted upon by an external force.\n- **Key Idea:** This law introduces the concept of **inertia**, which is the tendency of an object to resist changes in its state of motion.\n- **Example:** If you slide a book across a table, it eventually stops because of the force of friction acting on it. Without friction, the book would keep moving indefinitely.\n\n---\n\n### **2. Newton's Second Law (Law of Acceleration):**\n- **Statement:** The acceleration of an object is directly proportional to the net force acting on it and inversely proportional to its mass. Mathematically, this is expressed as:\n  \\[\n  F = ma\n  \\]\n  where:\n  - \\( F \\) = net force applied (in Newtons),\n  -"
      },
      ...
    }
  ],
  ...
}

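Load Balance and Rate Limit by Consumers

The following example demonstrates how to configure two models for load balancing and apply rate limiting by consumer.

Create a consumer johndoe and configure a rate limiting quota of 10 tokens in a 60-second window on the openai-instance instance: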

curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "username": "johndoe",
    "plugins": {
      "ai-rate-limiting": {
        "instances": [
          {
            "name": "openai-instance",
            "limit": 10,
            "time_window": 60
          }
        ],
        "rejected_code": 429,
        "limit_strategy": "total_tokens"
      }
    }
  }'

Configure the key-auth credential for johndoe:

curl "http://127.0.0.1:9180/apisix/admin/consumers/johndoe/credentials" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "id": "cred-john-key-auth",
    "plugins": {
      "key-auth": {
        "key": "john-key"
      }
    }
  }'

Create another consumer janedoe and configure a rate limiting quota of 10 tokens in a 60-second window on the deepseek-instance instance:

curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "username": "janedoe",
    "plugins": {
      "ai-rate-limiting": {
        "instances": [
          {
            "name": "deepseek-instance",
            "limit": 10,
            "time_window": 60
          }
        ],
        "rejected_code": 429,
        "limit_strategy": "total_tokens"
      }
    }
  }'

Configure the key-auth credential for janedoe:

curl "http://127.0.0.1:9180/apisix/admin/consumers/janedoe/credentials" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "id": "cred-jane-key-auth",
    "plugins": {
      "key-auth": {
        "key": "jane-key"
      }
    }
  }'

Create a route as follows and update your LLM providers, models, API keys, and endpoints as needed:

curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "id": "ai-proxy-multi-route",
    "uri": "/anything",
    "methods": ["POST"],
    "plugins": {
      "key-auth": {},
      "ai-proxy-multi": {
        "fallback_strategy": ["rate_limiting"],
        "instances": [
          {
            "name": "openai-instance",
            "provider": "openai",
            "weight": 0,
            "auth": {
              "header": {
                "Authorization": "Bearer '"$OPENAI_API_KEY"'"
              }
            },
            "options": {
              "model": "gpt-4"
            }
          },
          {
            "name": "deepseek-instance",
            "provider": "deepseek",
            "weight": 0,
            "auth": {
              "header": {
                "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
              }
            },
            "options": {
              "model": "deepseek-chat"
            }
          }
        ]
      }
    }
  }'

❶ Enable key-auth on the route.

❷ Configure an openai instance.

❸ Configure a deepseek instance.

Send a POST request to the route without any consumer key:

curl -i "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a mathematician" },
      { "role": "user", "content": "What is 1+1?" }
    ]
  }'

You should receive an HTTP/1.1 401 Unauthorized response.

Send a POST request to the route with johndoe's key:

curl "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -H 'apikey: john-key' \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a mathematician" },
      { "role": "user", "content": "What is 1+1?" }
    ]
  }'

You should receive a response similar to the following:

{
  ...,
  "model": "gpt-4-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "1+1 equals 2.",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "completion_tokens": 8,
    "total_tokens": 31,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "service_tier": "default",
  "system_fingerprint": null
}

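Since the total_tokens value exceeds the quota configured for johndoe on the openai instance, the next request from johndoe within the 60-second window is expected to be forwarded to the deepseek instance.

Within the same 60-second window, send another POST request to the route with johndoe's key: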

curl "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -H 'apikey: john-key' \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a mathematician" },
      { "role": "user", "content": "Explain Newtons laws to me" }
    ]
  }'

You should see a response similar to the following:

{
  ...,
  "model": "deepseek-chat",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Certainly! Newton's laws of motion are three fundamental principles that describe the relationship between the motion of an object and the forces acting on it. They were formulated by Sir Isaac Newton in the late 17th century and are foundational to classical mechanics.\n\n---\n\n### **1. Newton's First Law (Law of Inertia):**\n- **Statement:** An object at rest will remain at rest, and an object in motion will continue moving at a constant velocity (in a straight line at a constant speed), unless acted upon by an external force.\n- **Key Idea:** This law introduces the concept of **inertia**, which is the tendency of an object to resist changes in its state of motion.\n- **Example:** If you slide a book across a table, it eventually stops because of the force of friction acting on it. Without friction, the book would keep moving indefinitely.\n\n---\n\n### **2. Newton's Second Law (Law of Acceleration):**\n- **Statement:** The acceleration of an object is directly proportional to the net force acting on it and inversely proportional to its mass. Mathematically, this is expressed as:\n  \\[\n  F = ma\n  \\]\n  where:\n  - \\( F \\) = net force applied (in Newtons),\n  -"
      },
      ...
    }
  ],
  ...
}

Send a POST request to the route with janedoe's key:

curl "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -H 'apikey: jane-key' \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a mathematician" },
      { "role": "user", "content": "What is 1+1?" }
    ]
  }'

You should receive a response similar to the following:

{
  ...,
  "model": "deepseek-chat",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The sum of 1 and 1 is 2. This is a basic arithmetic operation where you combine two units to get a total of two units."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 31,
    "total_tokens": 45,
    "prompt_tokens_details": {
      "cached_tokens": 0
    },
    "prompt_cache_hit_tokens": 0,
    "prompt_cache_miss_tokens": 14
  },
  "system_fingerprint": "fp_3a5770e1b4_prod0225"
}

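Since the total_tokens value exceeds the quota configured for janedoe on the deepseek instance, the next request from janedoe within the 60-second window is expected to be forwarded to the openai instance.

Within the same 60-second window, send another POST request to the route with janedoe's key: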

curl "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -H 'apikey: jane-key' \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a mathematician" },
      { "role": "user", "content": "Explain Newtons laws to me" }
    ]
  }'

You should see a response similar to the following:

{
  ...,
  "model": "gpt-4-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Sure, here are Newton's three laws of motion:\n\n1) Newton's First Law, also known as the Law of Inertia, states that an object at rest will stay at rest, and an object in motion will stay in motion, unless acted on by an external force. In simple words, this law suggests that an object will keep doing whatever it is doing until something causes it to do otherwise. \n\n2) Newton's Second Law states that the force acting on an object is equal to the mass of that object times its acceleration (F=ma). This means that force is directly proportional to mass and acceleration. The heavier the object and the faster it accelerates, the greater the force.\n\n3) Newton's Third Law, also known as the law of action and reaction, states that for every action, there is an equal and opposite reaction. Essentially, any force exerted onto a body will create a force of equal magnitude but in the opposite direction on the object that exerted the first force.\n\nRemember, these laws become less accurate when considering speeds near the speed of light (where Einstein's theory of relativity becomes more appropriate) or objects very small or very large. However, for everyday situations, they provide a good model of how things move.",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  ...
}

This shows that ai-proxy-multi load balances the traffic according to the per-consumer rate limiting rules in ai-rate-limiting.

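Restrict Maximum Number of Completion Tokens

The following example demonstrates how to restrict the number of completion_tokens used when generating the chat completion.

For demonstration and easier differentiation, you will configure one OpenAI instance and one DeepSeek instance as the upstream LLM services.

Create a route as follows and update your LLM providers, models, API keys, and endpoints as needed: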

curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "id": "ai-proxy-multi-route",
    "uri": "/anything",
    "methods": ["POST"],
    "plugins": {
      "ai-proxy-multi": {
        "instances": [
          {
            "name": "openai-instance",
            "provider": "openai",
            "weight": 0,
            "auth": {
              "header": {
                "Authorization": "Bearer '"$OPENAI_API_KEY"'"
              }
            },
            "options": {
              "model": "gpt-4",
              "max_tokens": 50
            }
          },
          {
            "name": "deepseek-instance",
            "provider": "deepseek",
            "weight": 0,
            "auth": {
              "header": {
                "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
              }
            },
            "options": {
              "model": "deepseek-chat",
              "max_tokens": 100
            }
          }
        ]
      }
    }
  }'

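❶ Configure max_tokens of the OpenAI instance to be 50.

❷ Configure max_tokens of the DeepSeek instance to be 100.

Send a POST request to the route with a system prompt and a sample user question in the request body: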

curl "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a mathematician" },
      { "role": "user", "content": "Explain Newtons law" }
    ]
  }'

If the request is proxied to OpenAI, you should see a response similar to the following, where the content is truncated at the max_tokens threshold of 50:

{
  ...,
  "model": "gpt-4-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Newton's Laws of Motion are three physical laws that form the bedrock for classical mechanics. They describe the relationship between a body and the forces acting upon it, and the body's motion in response to those forces. \n\n1. Newton's First Law",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 50,
    "total_tokens": 70,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "service_tier": "default",
  "system_fingerprint": null
}

If the request is proxied to DeepSeek, you should see a response similar to the following, where the content is truncated at the max_tokens threshold of 100:

{
  ...,
  "model": "deepseek-chat",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Newton's Laws of Motion are three fundamental principles that form the foundation of classical mechanics. They describe the relationship between a body and the forces acting upon it, and the body's motion in response to those forces. Here's a brief explanation of each law:\n\n1. **Newton's First Law (Law of Inertia):**\n   - **Statement:** An object will remain at rest or in uniform motion in a straight line unless acted upon by an external force.\n   - **Explanation:** This law"
      },
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 100,
    "total_tokens": 110,
    "prompt_tokens_details": {
      "cached_tokens": 0
    },
    "prompt_cache_hit_tokens": 0,
    "prompt_cache_miss_tokens": 10
  },
  "system_fingerprint": "fp_3a5770e1b4_prod0225"
}
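
In both sample responses, truncation shows up as a finish_reason of "length" together with completion_tokens equal to the configured max_tokens. The following sketch checks a saved response for that marker. The JSON is a trimmed copy of the OpenAI sample above, and the variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Check whether a saved chat completion was cut off by max_tokens.
# The JSON below is a trimmed copy of the sample OpenAI response above.
response='{"model":"gpt-4-0613","choices":[{"index":0,"finish_reason":"length"}],"usage":{"completion_tokens":50}}'

# Extract the finish_reason value with sed, so no JSON parser is required.
finish_reason=$(printf '%s' "$response" \
  | sed -n 's/.*"finish_reason":[[:space:]]*"\([^"]*\)".*/\1/p')

if [[ "$finish_reason" == "length" ]]; then
  truncated=yes    # the model stopped because it hit the token limit
else
  truncated=no
fi

echo "finish_reason=$finish_reason truncated=$truncated"
```

If jq is installed, as in the load balancing examples, `jq -r '.choices[0].finish_reason'` extracts the same value.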

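Proxy to Embedding Models

The following example demonstrates how to configure the ai-proxy-multi plugin to proxy requests and load balance between embedding models.

Create a route as follows and update your LLM providers, embedding models, API keys, and endpoints as needed: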

curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "id": "ai-proxy-multi-route",
    "uri": "/anything",
    "methods": ["POST"],
    "plugins": {
      "ai-proxy-multi": {
        "instances": [
          {
            "name": "openai-instance",
            "provider": "openai",
            "weight": 0,
            "auth": {
              "header": {
                "Authorization": "Bearer '"$OPENAI_API_KEY"'"
              }
            },
            "options": {
              "model": "text-embedding-3-small"
            },
            "override": {
              "endpoint": "https://api.openai.com/v1/embeddings"
            }
          },
          {
            "name": "az-openai-instance",
            "provider": "azure-openai",
            "weight": 0,
            "auth": {
              "header": {
                "Authorization": "Bearer '"$AZ_OPENAI_API_KEY"'"
              }
            },
            "options": {
              "model": "text-embedding-3-small"
            },
            "override": {
              "endpoint": "https://ai-plugin-developer.openai.azure.com/openai/deployments/text-embedding-3-small/embeddings?api-version=2023-05-15"
            }
          }
        ]
      }
    }
  }'

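❶ Specify the name of the embedding model.

❷ Override the default OpenAI endpoint with the embeddings API endpoint.

❸ Specify the name of the embedding model.

❹ Specify the Azure embeddings API endpoint.

Send a POST request to the route with an input string: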

curl "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "input": "hello world"
  }'

You should receive a response similar to the following:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        -0.0067144386,
        -0.039197803,
        0.034177095,
        0.028763203,
        -0.024785956,
        -0.04201061,
        ...
      ]
    }
  ],
  "model": "text-embedding-3-small",
  "usage": {
    "prompt_tokens": 2,
    "total_tokens": 2
  }
}

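Enable Active Health Checks

The following example demonstrates how to configure the ai-proxy-multi plugin to proxy requests and load balance between models, with active health checks enabled to improve service availability. You can enable health checks on one or more instances.

Create a route as follows and update the LLM providers, models, API keys, and health check configurations: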

curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "id": "ai-proxy-multi-route",
    "uri": "/anything",
    "methods": ["POST"],
    "plugins": {
      "ai-proxy-multi": {
        "instances": [
          {
            "name": "llm-instance-1",
            "provider": "openai-compatible",
            "weight": 0,
            "auth": {
              "header": {
                "Authorization": "Bearer '"$YOUR_LLM_API_KEY"'"
              }
            },
            "options": {
              "model": "'"$YOUR_LLM_MODEL"'"
            }
          },
          {
            "name": "llm-instance-2",
            "provider": "openai-compatible",
            "weight": 0,
            "auth": {
              "header": {
                "Authorization": "Bearer '"$YOUR_LLM_API_KEY"'"
              }
            },
            "options": {
              "model": "'"$YOUR_LLM_MODEL"'"
            },
            "checks": {
              "active": {
                "type": "https",
                "host": "yourhost.com",
                "http_path": "/your/probe/path",
                "healthy": {
                  "interval": 2,
                  "successes": 1
                },
                "unhealthy": {
                  "interval": 1,
                  "http_failures": 3
                }
              }
            }
          }
        ]
      }
    }
  }'

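❶ Update the type of the active health check.

❷ Update the host, if applicable.

❸ Update the probe path.

❹ Configure the interval, in seconds, for periodically checking healthy nodes.

❺ Configure the success count threshold for considering an upstream node healthy.

❻ Configure the interval, in seconds, for periodically checking unhealthy nodes.

❼ Configure the HTTP failure count threshold for considering an upstream node unhealthy.

For verification, the behavior should be consistent with the verification in active health checks.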
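
To make the two thresholds concrete, the sketch below replays a hypothetical sequence of probe results against successes = 1 and http_failures = 3. It is a simplified model that assumes consecutive counting. It is not the gateway's health checker, and the probe sequence and variable names are invented for illustration.

```shell
#!/usr/bin/env bash
# Simplified model of active health check thresholds.
# Illustrative only: assumes consecutive counting, NOT the real checker.
successes_needed=1     # checks.active.healthy.successes
failures_needed=3      # checks.active.unhealthy.http_failures

state="healthy"; ok=0; fail=0
history=""

# Hypothetical probe results: F = HTTP failure, S = success.
for probe in F F F S; do
  if [[ "$probe" == "S" ]]; then
    ok=$(( ok + 1 )); fail=0
    if (( ok >= successes_needed )); then state="healthy"; fi
  else
    fail=$(( fail + 1 )); ok=0
    if (( fail >= failures_needed )); then state="unhealthy"; fi
  fi
  history="$history $state"
  echo "probe=$probe state=$state"
done
```

In this model the instance stays healthy through the first two failures, turns unhealthy on the third, and recovers after a single successful probe.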

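Include LLM Information in Access Log

The following example demonstrates how to log LLM request related information in the gateway's access log to improve analytics and audit. In addition to NGINX variables, the following variables are available:

apisix_upstream_response_time: time taken for APISIX to send the request to the upstream service and receive the full response. Available since API7 Enterprise 3.8.8.
request_type: type of the request, where the value can be traditional_http, ai_chat, or ai_stream.
llm_time_to_first_token: duration from sending the request to receiving the first token from the LLM service, in milliseconds.
llm_model: name of the LLM model the request is forwarded to in the upstream LLM service.
request_llm_model: name of the LLM model specified in the request.
llm_prompt_tokens: number of tokens in the prompt.
llm_completion_tokens: number of chat completion tokens in the response.

The following variables are available since API7 Enterprise 3.9.14:

llm_total_tokens: total number of tokens used, including prompt and completion tokens.
llm_stream: whether the request is a streaming request, either true or false.
llm_has_tool_calls: whether the LLM response contains tool calls, either true or false.
llm_tool_count: number of tools provided in the request.
llm_end_user_id: end user identifier extracted from the request body, such as user, safety_identifier, or metadata.user_id.
llm_cache_read_input_tokens: number of prompt tokens served from the provider's prompt cache.
llm_cache_creation_input_tokens: number of prompt tokens written to the provider's prompt cache.
llm_reasoning_tokens: number of reasoning tokens used by reasoning models.

Tip

These variables are demonstrated in the access log format but also work with logging plugins.

To log these values in the access log, add the LLM related variables to the gateway's access log format: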

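Follow the instructions for your deployment: host or Docker, or Kubernetes (Helm).

For host or Docker, add or update the following configuration in the gateway configuration file: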

config.yaml
nginx_config:
  http:
    access_log_format: "$remote_addr - $remote_user [$time_local] $http_host \"$request_line\" $status $body_bytes_sent $request_time \"$http_referer\" \"$http_user_agent\" $upstream_addr $upstream_status $apisix_upstream_response_time \"$upstream_scheme://$upstream_host$upstream_uri\" \"$apisix_request_id\" \"$request_type\" \"$llm_time_to_first_token\" \"$llm_model\" \"$request_llm_model\" \"$llm_prompt_tokens\" \"$llm_completion_tokens\" \"$llm_total_tokens\" \"$llm_stream\" \"$llm_has_tool_calls\" \"$llm_tool_count\" \"$llm_end_user_id\" \"$llm_cache_read_input_tokens\" \"$llm_cache_creation_input_tokens\" \"$llm_reasoning_tokens\""

Reload the gateway for the configuration changes to take effect.

For Kubernetes (Helm) with the APISIX Helm chart, set the following values:

values.yaml
apisix:
  nginx:
    logs:
      accessLogFormat: "$remote_addr - $remote_user [$time_local] $http_host \"$request_line\" $status $body_bytes_sent $request_time \"$http_referer\" \"$http_user_agent\" $upstream_addr $upstream_status $apisix_upstream_response_time \"$upstream_scheme://$upstream_host$upstream_uri\" \"$apisix_request_id\" \"$request_type\" \"$llm_time_to_first_token\" \"$llm_model\" \"$request_llm_model\" \"$llm_prompt_tokens\" \"$llm_completion_tokens\" \"$llm_total_tokens\" \"$llm_stream\" \"$llm_has_tool_calls\" \"$llm_tool_count\" \"$llm_end_user_id\" \"$llm_cache_read_input_tokens\" \"$llm_cache_creation_input_tokens\" \"$llm_reasoning_tokens\""

For the API7 Gateway Helm chart, set the following values:

values.yaml
logs:
  accessLogFormat: "$remote_addr - $remote_user [$time_local] $http_host \"$request_line\" $status $body_bytes_sent $request_time \"$http_referer\" \"$http_user_agent\" $upstream_addr $upstream_status $apisix_upstream_response_time \"$upstream_scheme://$upstream_host$upstream_uri\" \"$apisix_request_id\" \"$request_type\" \"$llm_time_to_first_token\" \"$llm_model\" \"$request_llm_model\" \"$llm_prompt_tokens\" \"$llm_completion_tokens\" \"$llm_total_tokens\" \"$llm_stream\" \"$llm_has_tool_calls\" \"$llm_tool_count\" \"$llm_end_user_id\" \"$llm_cache_read_input_tokens\" \"$llm_cache_creation_input_tokens\" \"$llm_reasoning_tokens\""

Then apply the values file using the chart that corresponds to your current gateway release:

helm upgrade <release-name> <chart-name> -n <namespace> -f values.yaml

Next, create a route with the ai-proxy-multi plugin as in the previous examples and send a request. For example, if you send the following request:

curl "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5",
    "messages": [
      { "role": "system", "content": "You are a mathematician" },
      { "role": "user", "content": "What is 1+1?" }
    ]
  }'

If the LLM instance model in ai-proxy-multi is gpt-4, the request is forwarded to the GPT-4 model and you will receive a response similar to the following:

{
  ...,
  "model": "gpt-4-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "1+1 equals 2.",
        "refusal": null,
        "annotations": []
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "completion_tokens": 8,
    "total_tokens": 31,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    },
    ...
  },
  "service_tier": "default",
  "system_fingerprint": null
}

In the gateway's access log, you should see a log entry similar to the following:

192.168.215.1 - - [29/Aug/2025:09:54:16 +0000] 127.0.0.1:9080 "POST /anything HTTP/1.1" 200 808 2.670 "-" "curl/8.6.0" - - 2670 "http://127.0.0.1:9080" "6526bf5c961b6e6bb8cfcb66486f02dc" "ai_chat" "2670" "gpt-4" "gpt-3.5" "23" "8" "31" "false" "false" "0" "" "0" "0" "0"

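The access log entry shows an APISIX upstream response time of 2670 (2.670 seconds), a request type of ai_chat, and a time to first token of 2670 milliseconds. The request was forwarded to the gpt-4 model, while the model named in the request was gpt-3.5. Token usage was 23 prompt tokens, 8 completion tokens, and 31 total tokens. The request was non-streaming, the response contained no tool calls, no tools were provided in the request, and there was no end user identifier, no prompt cache tokens, and no reasoning tokens.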
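
For quick analysis, the quoted LLM fields can be pulled out of a log line by position. The sketch below does this with awk on the sample entry above. The field positions assume the exact access log format shown in this section and need adjusting if your format differs. The variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Pull the LLM fields out of one access log line. Field positions assume the
# exact access_log_format shown above; adjust them if your format differs.
line='192.168.215.1 - - [29/Aug/2025:09:54:16 +0000] 127.0.0.1:9080 "POST /anything HTTP/1.1" 200 808 2.670 "-" "curl/8.6.0" - - 2670 "http://127.0.0.1:9080" "6526bf5c961b6e6bb8cfcb66486f02dc" "ai_chat" "2670" "gpt-4" "gpt-3.5" "23" "8" "31" "false" "false" "0" "" "0" "0" "0"'

# Splitting on double quotes puts each quoted value in an even-numbered field.
summary=$(printf '%s\n' "$line" | awk -F'"' '{
  printf "type=%s ttft_ms=%s model=%s requested=%s prompt=%s completion=%s total=%s",
         $12, $14, $16, $18, $20, $22, $24
}')

echo "$summary"
```

For the sample entry this prints the request type, time to first token, forwarded model, requested model, and the three token counts on one line.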

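Send Request Log to Logger

The following example demonstrates how to log requests and request information, including the LLM model, tokens, and payloads, and push them to a logger. Before proceeding, you should first set up a logger, such as Kafka. See kafka-logger for more information.

Create a route to your LLM services and configure the logging details as follows: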

curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "id": "ai-proxy-multi-route",
    "uri": "/anything",
    "methods": ["POST"],
    "plugins": {
      "ai-proxy-multi": {
        "instances": [
          {
            "name": "openai-instance",
            "provider": "openai",
            "weight": 8,
            "auth": {
              "header": {
                "Authorization": "Bearer '"$OPENAI_API_KEY"'"
              }
            },
            "options": {
              "model": "gpt-4"
            }
          },
          {
            "name": "deepseek-instance",
            "provider": "deepseek",
            "weight": 2,
            "auth": {
              "header": {
                "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
              }
            },
            "options": {
              "model": "deepseek-chat"
            }
          }
        ],
        "logging": {
          "summaries": true,
          "payloads": true
        }
      },
      "kafka-logger": {
        "brokers": [
          {
            "host": "127.0.0.1",
            "port": 9092
          }
        ],
        "kafka_topic": "test2",
        "key": "key1",
        "batch_max_size": 1
      }
    }
  }'

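❶ Log the request LLM model, duration, and request and response token counts.

❷ Log the request and response payloads.

❸ Update with your Kafka address.

❹ Update with your Kafka topic.

❺ Update with your Kafka key.

❻ Set to 1 to send log entries immediately.

Send a POST request to the route: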

curl "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a mathematician" },
      { "role": "user", "content": "What is 1+1?" }
    ]
  }'

If the request is forwarded to OpenAI, you should receive a response similar to the following:

{
  ...,
  "model": "gpt-4-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "1+1 equals 2.",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  ...
}

In the Kafka topic, you should also see a log entry for the request, containing the LLM summary and the request and response payloads.
