OpenAI compatible API

This document contains the SambaStudio OpenAI compatible API reference. It describes the input and output formats for the SambaStudio OpenAI compatible API, which makes it easy to try our open-source models with existing applications.

Create chat completions

Creates a model response for the given chat conversation.

POST https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>/chat/completions

Request body

The chat request body formats are described below.

Reference

Parameter Definition Type Values

model

The name of the model to query.

String

The expert name.

messages

A list of messages comprising the conversation so far.

Array of objects

Array of message objects, each containing:

  • role (string, required): The role of the message author. One of system, user, or assistant.

  • content (string, required): The contents of the message.

max_tokens

The maximum number of tokens to generate.

Integer

The total length of input tokens and generated tokens is limited by the model’s context length. Default value is the context length of the model.

temperature

Determines the degree of randomness in the response.

Float

The temperature value can be between 0 and 1.

top_p

The top_p (nucleus) parameter is used to dynamically adjust the number of choices for each predicted token based on the cumulative probabilities.

Float

The top_p value can be between 0 and 1.

top_k

The top_k parameter is used to limit the number of choices for the next predicted word or token.

Integer

The top_k value can be between 1 and 100.

stream

If set, partial message deltas will be sent.

Boolean or null

Default is false.

stream_options

Options for streaming response. Only set this when you set stream: true.

Object or null

Default is null.

Value can be include_usage: boolean.

repetition_penalty

A parameter that controls how repetitive text can be. A lower value means more repetitive, while a higher value means less repetitive.

Float or null

Default is 1.0, which means no penalty.

The repetition penalty value can be between 1.0 and 10.0.

Example request

Below is an example request body for a streaming response.

Example streaming request
{
   "messages": [
      {"role": "system", "content": "Answer the question in a couple sentences."},
      {"role": "user", "content": "Share a happy story with me"}
   ],
   "max_tokens": 800,
   "model": "Meta-Llama-3.1-8B-Instruct",
   "stream": true,
   "stream_options": {"include_usage": true}
}
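
The same request body can be sent with any HTTP client. Below is a minimal Python sketch using the requests library; it assumes the key request header and placeholder URL used elsewhere in this document, and that streamed responses arrive as server-sent events (data: lines ending with data: [DONE]).

import json
import requests

# Placeholder values; substitute your SambaStudio domain, project ID,
# endpoint ID, and endpoint API key.
URL = "https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>/chat/completions"
API_KEY = "YOUR ENDPOINT API KEY"

body = {
    "messages": [
        {"role": "system", "content": "Answer the question in a couple sentences."},
        {"role": "user", "content": "Share a happy story with me"},
    ],
    "max_tokens": 800,
    "model": "Meta-Llama-3.1-8B-Instruct",
    "stream": True,
    "stream_options": {"include_usage": True},
}

# stream=True keeps the connection open so chunks can be read as they arrive.
with requests.post(URL, headers={"key": API_KEY}, json=body, stream=True) as response:
    response.raise_for_status()
    for raw_line in response.iter_lines():
        if not raw_line:
            continue
        line = raw_line.decode("utf-8")
        # Assumes standard server-sent event framing ("data: ..." lines).
        if line.startswith("data: "):
            payload = line[len("data: "):]
            if payload == "[DONE]":
                break
            chunk = json.loads(payload)
            print(chunk)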

Response

The API returns a chat completion object, or a streamed sequence of chat completion chunk objects if the request is streamed.

Chat completion object

Represents a chat completion response returned by the model, based on the provided input.

Reference

Property Type Description

id

String

A unique identifier for the chat completion.

choices

Array

A list containing a single chat completion.

created

Integer

The Unix timestamp (in seconds) of when the chat completion was created.

model

String

The model used to generate the completion.

object

String

The object type, which is always chat.completion.

usage

Object

An optional field present when stream_options: {"include_usage": true} is set.

When present, it contains a null value except for the last chunk, which contains the token usage statistics for the entire request.

Values returned are:

  • throughput_after_first_token: The rate (as tokens per second) at which output tokens are generated after the first token has been delivered.

  • time_to_first_token: The time (in seconds) the model takes to generate the first token.

  • model_execution_time: The time (in seconds) to generate a complete response or all tokens.

  • completion_tokens: Number of tokens generated in the response.

  • prompt_tokens: Number of tokens in the input prompt.

  • total_tokens: The sum of input and output tokens.

Example

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "Llama-3-8b-chat",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "\n\nHello there, how may I assist you today?",
    },
    "logprobs": null,
    "finish_reason": "stop"
  }]
}
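
As a minimal illustration of the structure above, the sketch below reads the main properties from a parsed (non-streaming) response; the completion dict simply mirrors the example object and would normally come from decoding the response body.

# Minimal sketch of reading the properties described above. The dict mirrors
# the example object; in practice it would come from decoding the response
# body of a non-streaming request.
completion = {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1677652288,
    "model": "Llama-3-8b-chat",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "\n\nHello there, how may I assist you today?"},
        "logprobs": None,
        "finish_reason": "stop",
    }],
}

choice = completion["choices"][0]
print(completion["id"], completion["model"], completion["created"])
print(choice["finish_reason"])           # e.g. "stop"
print(choice["message"]["content"])      # the assistant's reply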

Chat completion chunk object

Represents a streamed chunk of a chat completion response returned by the model, based on the provided input.

Reference

Property Type Description

id

String

A unique identifier for the chat completion.

choices

Array

A list containing a single chat completion.

created

Integer

The Unix timestamp (in seconds) of when the chat completion was created. Each chunk has the same timestamp.

model

String

The model used to generate the completion.

object

String

The object type, which is always chat.completion.chunk.

usage

Object

An optional field present when stream_options: {"include_usage": true} is set.

When present, it contains a null value except for the last chunk, which contains the token usage statistics for the entire request.

Values returned are:

  • throughput_after_first_token: The rate (as tokens per second) at which output tokens are generated after the first token has been delivered.

  • time_to_first_token: The time (in seconds) the model takes to generate the first token.

  • model_execution_time: The time (in seconds) to generate a complete response or all tokens.

  • completion_tokens: Number of tokens generated in the response.

  • prompt_tokens: Number of tokens in the input prompt.

  • total_tokens: The sum of input and output tokens.

Example

{
  "id": "chatcmpl-123",
  "object": "chat.completion.chunk",
  "created": 1694268190,
  "model": "Llama-3-8b-chat",
  "system_fingerprint": "fp_44709d6fcb",
  "choices": [
    {
      "index": 0,
      "delta": {},
      "logprobs": null,
      "finish_reason": "stop"
    }
  ]
}
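
The sketch below shows one way to consume these chunks with the OpenAI Python client, assuming the client setup and placeholder values from the examples later in this document: content deltas are accumulated from choices, and the token usage statistics are read from the final chunk when stream_options: {"include_usage": true} is set.

from openai import OpenAI

# Placeholder values matching the OpenAI client examples later in this document.
client = OpenAI(
    base_url="https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>",
    api_key="YOUR ENDPOINT API KEY",
)

stream = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Share a happy story with me"}],
    stream=True,
    stream_options={"include_usage": True},
)

answer_parts = []
for chunk in stream:
    # Content deltas arrive in choices; the final chunk carries only usage.
    if chunk.choices and chunk.choices[0].delta.content:
        answer_parts.append(chunk.choices[0].delta.content)
    if chunk.usage is not None:
        print(chunk.usage)  # prompt_tokens, completion_tokens, total_tokens, ...

print("".join(answer_parts))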

Batch API

You can send a batch of queries in one request using the batch API.

curl --location 'https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>/chat/completions' \
--header 'Content-Type: application/json' \
--header 'key: <your-endpoint-api-key>' \
--data '[
    {
  "model": "Meta-Llama-3-8B-Instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are an AI assistant that helps with answering questions and providing information."
    },
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ],
  "process_prompt": true,
  "max_tokens": 50,
  "stream": true
},
{
  "model": "Meta-Llama-3-8B-Instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are an AI assistant that helps with answering questions and providing information."
    },
    {
      "role": "user",
      "content": "What is the capital of India?"
    }
  ],
  "process_prompt": true,
  "max_tokens": 50,
  "stream": true
}
]'
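
The same batch can be sent from Python. The sketch below mirrors the curl example, assuming the key header and a placeholder URL; because stream is true, the raw streamed lines are simply printed.

import requests

# Placeholders matching the curl example above.
URL = "https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>/chat/completions"
API_KEY = "<your-endpoint-api-key>"

# The request body is a JSON array, one chat request per element.
batch = [
    {
        "model": "Meta-Llama-3-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are an AI assistant that helps with answering questions and providing information."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "process_prompt": True,
        "max_tokens": 50,
        "stream": True,
    },
    {
        "model": "Meta-Llama-3-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are an AI assistant that helps with answering questions and providing information."},
            {"role": "user", "content": "What is the capital of India?"},
        ],
        "process_prompt": True,
        "max_tokens": 50,
        "stream": True,
    },
]

with requests.post(URL, headers={"key": API_KEY}, json=batch, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if line:
            print(line.decode("utf-8"))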

Making API calls to base models without instruction tuning

Some SambaNova base models—particularly those without “instruct” in the name—do not include a chat template in their tokenizer_config.json. This affects how they should be used during inference.

Chat template

A chat template is a formatting configuration in a model’s tokenizer_config.json file. It structures prompts and responses in a chat-like format, enabling compatibility with:

  • process_prompt=true (for v1 / v2 APIs)

  • /v1/chat/completions (OpenAI-compatible APIs)

Without a chat template, models cannot interpret structured conversation turns and must use alternative endpoints or options (see the sketch after this list):

  • process_prompt=false (for v1 / v2 APIs)

  • /v1/completions (OpenAI-compatible APIs)
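
To make the distinction concrete, the sketch below shows what a chat template does, using the Hugging Face transformers library. The model name is purely illustrative, and this formatting step happens server-side in SambaStudio, so it is not something you call through the API.

from transformers import AutoTokenizer

# Illustrative only: any model whose tokenizer_config.json defines a chat template.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "Answer the question in a couple sentences."},
    {"role": "user", "content": "Share a happy story with me"},
]

# apply_chat_template turns structured turns into the single formatted string
# the model was trained on; this is what process_prompt=true and
# /v1/chat/completions rely on.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)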

Training and inference behavior

Whether a model supports a chat template depends on how it was trained:

  • If the tokenizer used during data preparation included a chat template (using --apply-chat-template), the model will support chat-style interaction and inference using /v1/chat/completions.

  • If the dataset was prepared without a chat template, the resulting model will not support structured chat interaction and must be queried using /v1/completions.

Most SambaNova-provided base models (e.g., sarashina2-70b) are trained without chat templates and should be queried accordingly.

Models without chat templates are not fine-tuned for instruction-following or conversational behavior. They tend to produce open-ended, free-form text and may require few-shot prompting to elicit meaningful responses.
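
As an illustration of few-shot prompting against /v1/completions, the sketch below builds a question-and-answer pattern for the base model to continue. The base URL, Bearer header, and max_tokens value are placeholders that mirror the bash example later in this section, not exact SambaStudio requirements.

import requests

# Hypothetical placeholders; the URL layout and Bearer header mirror the bash
# example later in this section.
BASE_URL = "https://<your-base-url>"
API_KEY = "API_KEY"

# A few-shot prompt: demonstrate the pattern you want the base model to continue.
prompt = (
    "Q: What is the capital of France?\n"
    "A: Paris\n"
    "Q: What is the capital of Japan?\n"
    "A: Tokyo\n"
    "Q: What is the capital of India?\n"
    "A:"
)

body = {
    "model": "sarashina2-70b",
    "prompt": prompt,
    "max_tokens": 20,
    "stream": False,
}

response = requests.post(
    f"{BASE_URL}/v1/completions",
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json=body,
)
print(response.json()["choices"][0]["text"])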

When to use /v1/completions

Use the /v1/completions endpoint when:

  • The model lacks a chat template in its tokenizer_config.json

  • You are working with a base model (typically one without “instruct” in its name)

  • You do not require chat-style input/output formatting

Attempting to use these models with /v1/chat/completions or process_prompt=true will result in errors or unintended behavior.

Example: Base model inference with /v1/completions

#!/bin/bash

# Set your SambaNova API Key
API_KEY="API_KEY"

# Create the prompt payload
MESSAGES=$(cat <<EOF
{
  "stream": false,
  "model": "sarashina2-70b",
  "stream_options": {"include_usage": true},
  "prompt": "Hello"
}
EOF
)

# SambaNova API endpoint
BASE_URL="https://iftvh4zrezqd.cloud.snova.ai/"

# Make the API call
curl -H "Authorization: Bearer $API_KEY" \
     -H "Content-Type: application/json" \
     -d "$MESSAGES" \
     -X POST $BASE_URL/v1/completions \
     | jq

Example response

"choices": [
  {
    "finish_reason": "stop",
    "index": 0,
    "logprobs": null,
    "message": {
      "content": null,
      "role": "assistant"
    },
    "text": "Hello, how can I help you?"
  }
]

Summary: Choosing the right API

Scenario: Chat-capable model

  • API endpoint: /v1/chat/completions

  • process_prompt: true

  • Chat template required: Yes

Scenario: Base model (no chat template)

  • API endpoint: /v1/completions

  • process_prompt: false

  • Chat template required: No

Example requests using the OpenAI client

Example requests for streaming and non-streaming are shown below.

  • The following parameters are not yet supported and will be ignored:

    • logprobs

    • top_logprobs

    • n

    • presence_penalty

    • frequency_penalty

    • logit_bias

    • seed

  • The SambaNova API supports the top_k parameter, which can be used with a direct API call, as shown in the sketch below. However, this is not supported by the OpenAI client.
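
For example, the minimal sketch below passes top_k in a direct request; the URL, key header, and parameter values are placeholders based on the earlier examples in this document.

import requests

# Placeholders based on the direct-request examples earlier in this document.
URL = "https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>/chat/completions"
API_KEY = "YOUR ENDPOINT API KEY"

body = {
    "model": "Meta-Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "Answer the question in a couple sentences."},
        {"role": "user", "content": "Share a happy story with me"},
    ],
    "max_tokens": 400,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,  # accepted in direct API calls, per the note above
}

response = requests.post(URL, headers={"key": API_KEY}, json=body)
print(response.json()["choices"][0]["message"]["content"])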

Streaming

from openai import OpenAI

client = OpenAI(
    base_url="https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>",
    api_key="YOUR ENDPOINT API KEY"
)

completion = client.chat.completions.create(
  model="Meta-CodeLlama-70b-Instruct",
  messages = [
      {"role": "system", "content": "You are intelligent"},
      {"role": "user", "content": "Tell me a story in 3 lines"}
    ],
  stream=True
)

for chunk in completion:
  print(chunk.choices[0].delta)

Non-streaming

from openai import OpenAI

client = OpenAI(
    base_url="https://<your-sambastudio-domain>/v1/<project-id>/<endpoint-id>",
    api_key="YOUR ENDPOINT API KEY"
)

response = client.chat.completions.create(
  model="Meta-Llama-3.1-8B-Instruct",
  messages=[
      {"role": "system", "content": "Answer the question in a couple sentences."},
      {"role": "user", "content": "Share a happy story with me"}
    ]
)
print(response.choices[0].message)