# LM Evaluation Harness

[Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) is a unified evaluation framework developed by [EleutherAI](https://www.eleuther.ai/) for testing generative language models across a wide range of evaluation tasks.

## Features

* Supports over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.
* Compatible with commercial APIs.
* Supports local models and custom benchmarks.
* Uses publicly available prompts to ensure reproducibility and comparability between papers.
* Allows easy integration of custom prompts and evaluation metrics.

## Setup

1. Create a [SambaCloud](https://cloud.sambanova.ai/apis) account and obtain an API key.
2. Clone the `lm-evaluation-harness` repository:

   ```bash
   git clone https://github.com/EleutherAI/lm-evaluation-harness.git
   cd lm-evaluation-harness
   ```
3. Create and activate a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate
   ```
4. Install dependencies:

   ```bash
   pip install -e .
   pip install -e ".[api]"
   pip install tqdm
   ```

<Note>
  Additional Python packages may be required depending on the selected benchmark or task. If you encounter errors related to missing libraries, install them manually.
</Note>
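
One way to spot missing packages before starting a run is a small import check. The sketch below is illustrative only: the helper name `missing_modules` and the list of module names are assumptions, not part of the harness.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Hypothetical list: replace it with the modules your chosen task imports.
    candidates = ["lm_eval", "tqdm"]
    for name in missing_modules(candidates):
        # The pip package name can differ from the import name, so treat
        # this message as a hint rather than an exact install command.
        print(f"missing module: {name}")
```

Run it inside the activated virtual environment so that it inspects the same interpreter the harness will use.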

## Example use case

Run this evaluation locally or in a notebook environment.

* Example benchmark: GSM8K (Grade School Math)
* Model source: SambaCloud
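
As a rough sketch of what such a run can look like, the snippet below assembles an `lm_eval` command line for GSM8K against an OpenAI-compatible chat endpoint. The endpoint URL, the model name, and the helper `build_gsm8k_command` are assumptions made for illustration; check the quickstart notebook and your SambaCloud account for the values that apply to you.

```python
import shlex
from typing import List, Optional

# Assumed placeholders, not taken from this page: adjust both to your account.
BASE_URL = "https://api.sambanova.ai/v1/chat/completions"
MODEL_NAME = "Meta-Llama-3.3-70B-Instruct"


def build_gsm8k_command(
    model: str = MODEL_NAME,
    base_url: str = BASE_URL,
    limit: Optional[int] = None,
) -> List[str]:
    """Build an lm_eval argument list for GSM8K on a chat-completions endpoint."""
    model_args = f"model={model},base_url={base_url}"
    cmd = [
        "lm_eval",
        "--model", "local-chat-completions",
        "--model_args", model_args,
        "--tasks", "gsm8k",
        "--apply_chat_template",
    ]
    if limit is not None:
        # Evaluate only the first `limit` examples, useful for a smoke test.
        cmd += ["--limit", str(limit)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of executing it, so nothing is sent to the API.
    print(shlex.join(build_gsm8k_command(limit=10)))
```

To actually launch the evaluation, the returned list can be passed to `subprocess.run`. The harness's OpenAI-compatible backends usually read the API key from an environment variable (`OPENAI_API_KEY` in the versions this sketch assumes), so export your SambaCloud key under the name your installed version expects.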

### Resources

* [Quickstart Notebook](https://github.com/sambanova/integrations/blob/main/lm_eval_harness/quickstart_eval.ipynb)
* [Integration directory](https://github.com/sambanova/integrations/tree/main/lm_eval_harness)

This example demonstrates how to evaluate the reasoning and arithmetic skills of an LLM using standard prompt formats and metrics.
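
To make the metric concrete, here is a simplified sketch of exact-match scoring for GSM8K-style answers. GSM8K reference solutions end with `#### <answer>`; the sketch compares that value with the last number found in the model's completion. The helper names and the normalization rules are assumptions for illustration and do not reproduce the harness's own answer-extraction filters.

```python
import re
from typing import List, Optional

# Matches integers and decimals, optionally negative, with thousands separators.
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def normalize(text: str) -> str:
    """Strip whitespace, currency signs, commas, and a trailing period."""
    return text.strip().replace(",", "").replace("$", "").rstrip(".")


def gold_answer(reference: str) -> str:
    """Return the final answer that follows '####' in a reference solution."""
    return normalize(reference.split("####")[-1])


def predicted_answer(completion: str) -> Optional[str]:
    """Return the last number mentioned in the completion, if any."""
    numbers = NUMBER_RE.findall(completion)
    return normalize(numbers[-1]) if numbers else None


def exact_match(completions: List[str], references: List[str]) -> float:
    """Fraction of completions whose extracted number equals the gold answer."""
    hits = sum(
        predicted_answer(c) == gold_answer(r)
        for c, r in zip(completions, references)
    )
    return hits / len(references)
```

Taking the last number is a lenient heuristic: it tolerates free-form reasoning before the answer but can misfire when the completion keeps talking after stating its result.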