OpenAI compatible API
This document contains the SambaStudio OpenAI compatible API reference. It describes the input and output formats for the SambaStudio OpenAI compatible API, which makes it easy to try out our open-source models with existing OpenAI-based applications.
Create chat completions
Creates a model response for the given chat conversation.
POST https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>/chat/completions
Request body
The chat request body formats are described below.
Reference
Parameter | Definition | Type | Values |
---|---|---|---|
model | The name of the model to query. | String | The expert name. |
messages | A list of messages comprising the conversation so far. | Array of objects | An array of message objects, each containing a role and content, as shown in the example request below. |
max_tokens | The maximum number of tokens to generate. | Integer | The total length of input tokens and generated tokens is limited by the model's context length. Default value is the context length of the model. |
temperature | Determines the degree of randomness in the response. | Float | The temperature value can be between |
top_p | The top_p (nucleus) parameter is used to dynamically adjust the number of choices for each predicted token based on the cumulative probabilities. | Float | The top_p value can be between |
top_k | The top_k parameter is used to limit the number of choices for the next predicted word or token. | Integer | The top_k value can be between |
stream | If set, partial message deltas will be sent. | Boolean or null | Default is false. |
stream_options | Options for the streaming response. Only set this when stream is true. | Object or null | Default is null. The example request below sets {"include_usage": true}. |
repetition_penalty | Controls how repetitive the generated text can be. A lower value means more repetitive, while a higher value means less repetitive. | Float or null | The repetition penalty value can be between |
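To build intuition for the sampling parameters above, the following sketch shows how top_k and top_p filtering narrow a toy next-token distribution. This is illustrative only, not SambaStudio's actual sampling code; the token names and probabilities are made up.

```python
# Illustrative sketch of top_k and top_p filtering over a toy
# next-token distribution. Not SambaStudio's implementation.

def filter_top_k(probs, k):
    """Keep only the k most probable tokens, then renormalize."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def filter_top_p(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cum = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cum += prob
        if cum >= p:
            break
    total = sum(kept.values())
    return {tok: q / total for tok, q in kept.items()}

probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "newt": 0.05}
print(filter_top_k(probs, 2))    # only "cat" and "dog" remain
print(filter_top_p(probs, 0.8))  # "cat" and "dog" reach 0.8 cumulative probability
```

Higher temperature flattens the distribution before this filtering step, which is why low temperature plus low top_k produces the most deterministic output.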
Example request
Below is an example request body for a streaming response.
{
"messages": [
{"role": "system", "content": "Answer the question in a couple sentences."},
{"role": "user", "content": "Share a happy story with me"}
],
"max_tokens": 800,
"model": "Meta-Llama-3.1-8B-Instruct",
"stream": true,
"stream_options": {"include_usage": true}
}
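For reference, the same streaming request body can be assembled programmatically before being POSTed to the endpoint. This is a sketch; the helper function name is illustrative, and the payload mirrors the example above.

```python
import json

def build_chat_request(model, messages, max_tokens=800, stream=True):
    """Assemble a chat completion request body (illustrative helper)."""
    body = {
        "messages": messages,
        "max_tokens": max_tokens,
        "model": model,
        "stream": stream,
    }
    if stream:
        # Request token usage statistics in the final streamed chunk.
        body["stream_options"] = {"include_usage": True}
    return body

payload = build_chat_request(
    "Meta-Llama-3.1-8B-Instruct",
    [
        {"role": "system", "content": "Answer the question in a couple sentences."},
        {"role": "user", "content": "Share a happy story with me"},
    ],
)
print(json.dumps(payload, indent=2))
# POST this JSON to:
# https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>/chat/completions
```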
Response
The API returns a chat completion object, or a streamed sequence of chat completion chunk objects if the request is streamed.
Chat completion object
Represents a chat completion response returned by the model, based on the provided input.
Reference
Property | Type | Description |
---|---|---|
id | String | A unique identifier for the chat completion. |
choices | Array | A list containing a single chat completion. |
created | Integer | The Unix timestamp (in seconds) of when the chat completion was created. |
model | String | The model used to generate the completion. |
object | String | The object type, which is always chat.completion. |
usage | Object | The token usage statistics for the request. |
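As a sketch of how a client reads these fields, the snippet below walks a hand-written completion object shaped like the table above; the field values are illustrative, not actual API output.

```python
# Hypothetical chat completion object, shaped like the reference table above.
completion = {
    "id": "cmpl-123",
    "object": "chat.completion",
    "created": 1714000000,
    "model": "Meta-Llama-3.1-8B-Instruct",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Once upon a time..."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 25, "completion_tokens": 12, "total_tokens": 37},
}

# The generated text lives in the single entry of "choices".
text = completion["choices"][0]["message"]["content"]
print(text)
```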
Chat completion chunk object
Represents a streamed chunk of a chat completion response returned by the model, based on the provided input.
Reference
Property | Type | Description |
---|---|---|
id | String | A unique identifier for the chat completion. |
choices | Array | A list containing a single chat completion. |
created | Integer | The Unix timestamp (in seconds) of when the chat completion was created. Each chunk has the same timestamp. |
model | String | The model used to generate the completion. |
object | String | The object type, which is always chat.completion.chunk. |
usage | Object or null | An optional field, present when stream_options: {"include_usage": true} is set in the request. It contains a null value for every chunk except the last, which contains the token usage statistics for the entire request. |
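When streaming, each chunk carries a small delta that the client concatenates. The following minimal sketch assembles hand-written chunks shaped like the object above (the delta structure follows the OpenAI streaming format; the content values are illustrative):

```python
# Hand-written chunks shaped like the chat completion chunk object above.
chunks = [
    {"choices": [{"delta": {"role": "assistant", "content": "Hello"}}], "usage": None},
    {"choices": [{"delta": {"content": ", world"}}], "usage": None},
    # Final chunk: empty choices list; usage is filled in when
    # stream_options {"include_usage": true} was set on the request.
    {"choices": [], "usage": {"prompt_tokens": 5, "completion_tokens": 2, "total_tokens": 7}},
]

parts = []
for chunk in chunks:
    for choice in chunk["choices"]:
        content = choice["delta"].get("content")
        if content:
            parts.append(content)

full_text = "".join(parts)
print(full_text)  # Hello, world
```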
Batch API
You can send a batch of queries in one request using the batch API.
curl --location 'https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>/chat/completions' \
--header 'Content-Type: application/json' \
--header 'key: API Key' \
--data '[
{
"model": "Meta-Llama-3-8B-Instruct",
"messages": [
{
"role": "system",
"content": "You are an AI assistant that helps with answering questions and providing information."
},
{
"role": "user",
"content": "What is the capital of France?"
}
],
"process_prompt": true,
"max_tokens": 50,
"stream": true
},
{
"model": "Meta-Llama-3-8B-Instruct",
"messages": [
{
"role": "system",
"content": "You are an AI assistant that helps with answering questions and providing information."
},
{
"role": "user",
"content": "What is the capital of India?"
}
],
"process_prompt": true,
"max_tokens": 50,
"stream": true
}
]'
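Since the batch request is just a JSON array of individual chat requests, it can be assembled programmatically. A sketch mirroring the curl example above (the helper name and question list are illustrative):

```python
import json

SYSTEM = ("You are an AI assistant that helps with answering questions "
          "and providing information.")

def batch_request(questions, model="Meta-Llama-3-8B-Instruct", max_tokens=50):
    """Build one chat request entry per question, mirroring the curl example."""
    return [
        {
            "model": model,
            "messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": q},
            ],
            "process_prompt": True,
            "max_tokens": max_tokens,
            "stream": True,
        }
        for q in questions
    ]

payload = batch_request([
    "What is the capital of France?",
    "What is the capital of India?",
])
print(json.dumps(payload, indent=2))
```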
Making API calls to base models without instruction tuning
Some SambaNova base models, particularly those without "instruct" in the name, do not include a chat template in their tokenizer_config.json. This affects how they should be used during inference.
Chat template
A chat template is a formatting configuration in a model's tokenizer_config.json file. It structures prompts and responses in a chat-like format, enabling compatibility with:
- process_prompt=true (for v1 / v2 APIs)
- /v1/chat/completions (OpenAI-compatible APIs)
Without a chat template, models cannot interpret structured conversation turns and must use alternative endpoints or options:
- process_prompt=false (for v1 / v2 APIs)
- /v1/completions (OpenAI-compatible APIs)
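A quick way to tell which case applies is to look for a chat_template key in the model's tokenizer_config.json. A sketch (the example config contents are illustrative; real files contain many more keys):

```python
import json

def has_chat_template(tokenizer_config_text):
    """Return True if the tokenizer config defines a non-empty chat template."""
    config = json.loads(tokenizer_config_text)
    return bool(config.get("chat_template"))

# Illustrative tokenizer_config.json contents.
with_template = '{"chat_template": "{% for m in messages %}...{% endfor %}"}'
without_template = '{"model_max_length": 4096}'

print(has_chat_template(with_template))     # True
print(has_chat_template(without_template))  # False
```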
Training and inference behavior
Whether a model supports a chat template depends on how it was trained:
- If the tokenizer used during data preparation included a chat template (using --apply-chat-template), the model will support chat-style interaction and inference using /v1/chat/completions.
- If the dataset was prepared without a chat template, the resulting model will not support structured chat interaction and must be queried using /v1/completions.
Most SambaNova-provided base models (e.g., sarashina2-70b) are trained without chat templates and should be queried accordingly.
Models without chat templates are not fine-tuned for instruction-following or conversational behavior. They tend to produce open-ended, free-form text and may require few-shot prompting to elicit meaningful responses.
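Because base models respond to plain text continuation rather than instructions, a few-shot prompt for /v1/completions is often built by stacking worked examples before the real question. A sketch (the example Q/A pairs are illustrative):

```python
# Sketch: build a few-shot prompt for a base (non-instruct) model.
examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]

def few_shot_prompt(question, examples):
    """Stack example Q/A pairs, then leave the final answer open for the model."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

prompt = few_shot_prompt("What is the capital of India?", examples)
print(prompt)
```

The trailing "A:" invites the model to continue the established pattern, which is usually enough to keep a base model's free-form output on task.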
When to Use /v1/completions
Use the /v1/completions endpoint when:
- The model lacks a chat template in its tokenizer_config.json
- You are working with a base model (typically one without "instruct" in its name)
- You do not require chat-style input/output formatting
Attempting to use these models with /v1/chat/completions may fail or produce unexpected output.
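The decision rule in this section can be folded into a small helper. This is a sketch of the rule as stated here, not an official API:

```python
def choose_endpoint(has_chat_template: bool) -> str:
    """Pick the OpenAI-compatible endpoint path based on chat template support."""
    return "/v1/chat/completions" if has_chat_template else "/v1/completions"

print(choose_endpoint(True))   # /v1/chat/completions
print(choose_endpoint(False))  # /v1/completions
```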
Example: Base model inference with /v1/completions
#!/bin/bash

# Set your SambaNova API key
API_KEY="YOUR_API_KEY"

# Create the prompt payload
MESSAGES=$(cat <<EOF
{
  "stream": false,
  "model": "sarashina2-70b",
  "prompt": "Hello"
}
EOF
)

# SambaNova API endpoint (no trailing slash, so the path below joins cleanly)
BASE_URL="https://iftvh4zrezqd.cloud.snova.ai"

# Make the API call
curl -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "$MESSAGES" \
  -X POST "$BASE_URL/v1/completions" \
  | jq
Example requests using OpenAI client
Example requests for streaming and non-streaming are shown below.
Streaming
from openai import OpenAI

# The OpenAI client appends /chat/completions to base_url automatically,
# so base_url must end at the endpoint ID.
client = OpenAI(
    base_url="https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>",
    api_key="YOUR ENDPOINT API KEY"
)

completion = client.chat.completions.create(
    model="Meta-CodeLlama-70b-Instruct",
    messages=[
        {"role": "system", "content": "You are intelligent"},
        {"role": "user", "content": "Tell me a story in 3 lines"}
    ],
    stream=True
)

for chunk in completion:
    if chunk.choices:
        print(chunk.choices[0].delta)
Non-streaming
from openai import OpenAI

# The OpenAI client appends /chat/completions to base_url automatically,
# so base_url must end at the endpoint ID.
client = OpenAI(
    base_url="https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>",
    api_key="YOUR ENDPOINT API KEY"
)

response = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Answer the question in a couple sentences."},
        {"role": "user", "content": "Share a happy story with me"}
    ]
)

print(response.choices[0].message)