Evaluation at the forefront.

Standardizing and streamlining the evaluation process from foundational to multi-step models.

Our Mission

The mission of Walnut Research is to bring sustainable, well-maintained safety and evaluation practices to AI practitioners to ensure that we know what our powerful AI systems are capable of.

Problems & Opportunities

Lag between AI Development and Evaluation

AI models are growing more complex with every development cycle, but the evaluation frameworks available to the community often lag behind these advancements.

A Difficult LLM Evaluation Process

Existing LLM evaluation libraries are not flexible enough to serve every paper and evaluation methodology. Moreover, finding the right benchmark is difficult. Our goal is to make this process as modular and effortless as possible.

Implications for Research Ethics

Developers cannot compare their model against others because in-house evaluation setups all differ. Furthermore, researchers can currently only demonstrate a model's validity by exposing their product or model entirely through open sourcing.

Introducing the Nutcracker
We are building a PyTorch-like library for LLM evaluation.

Nutcracker is an open-source tool for evaluating LLM APIs. Through this framework, we hope to advance our mission of streamlining and standardizing the evaluation process together with the AI development community.

Nutcracker currently supports 100+ tasks and benchmarks.

The current beta version of Nutcracker implements more than 100 ready-to-go LLM benchmarks. This is not easily replicated: it requires heavy pre-processing, including converting existing benchmark data into a Nutcracker-specific format.

Streamline evaluation

Customized evaluation does not require writing code from the ground up.

Just as PyTorch provides essential abstractions like DataLoader, Nutcracker provides modular object classes for customized setups.
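To illustrate the idea, here is a minimal sketch of what a DataLoader-style, swappable evaluation abstraction looks like in plain Python. Every name below is hypothetical and chosen for this example; it is not Nutcracker's actual API.

```python
# Hypothetical sketch of a modular evaluation setup; not Nutcracker's real API.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Instance:
    """One benchmark item: a prompt and its reference answer."""
    prompt: str
    reference: str


def evaluate(instances: Iterable[Instance],
             model_fn: Callable[[str], str]) -> float:
    """Exact-match accuracy. model_fn is any prompt -> completion callable
    (an API call, a local model, a multi-step pipeline)."""
    correct = total = 0
    for inst in instances:
        correct += int(model_fn(inst.prompt).strip() == inst.reference)
        total += 1
    return correct / total if total else 0.0


# Usage with a trivial stand-in "model":
data = [Instance("2+2=", "4"), Instance("Capital of France?", "Paris")]
score = evaluate(data, lambda p: "4" if "2+2" in p else "Paris")
```

Because the model is just a callable, swapping in a different backend changes one argument, not the evaluation code.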

Data in your hands

Existing LLM evaluation libraries are too high-level.

Download pre-processed benchmark data in a human-readable format (JSONL) in a user-specified directory.

API-based approach

Nutcracker is lightweight and customizable by design.

Be creative. Bring your RAG pipeline, multi-step inference, or local LLM, and provide an API endpoint. Nutcracker takes care of the rest.

Share model outputs

Share the exact inputs and outputs from the model.

A modular framework means you can pickle-save each and every part of your evaluation pipeline. Share these artifacts to prove the validity of your evaluations.
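Pickling intermediate objects makes each stage of a run reproducible and shareable. A minimal sketch with the standard library (the record structure below is assumed for illustration, not Nutcracker's):

```python
import os
import pickle
import tempfile

# A record of the exact model inputs and outputs; structure is illustrative.
run = {
    "task": "toy-qa",
    "instances": [
        {"prompt": "2+2=", "output": "4"},
        {"prompt": "Capital of France?", "output": "Paris"},
    ],
}

path = os.path.join(tempfile.mkdtemp(), "run.pkl")
with open(path, "wb") as f:
    pickle.dump(run, f)  # serialize the whole run object

# A collaborator can reload the identical object and re-score or audit it.
with open(path, "rb") as f:
    reloaded = pickle.load(f)
```

Sharing the serialized inputs and outputs, rather than just a final score, lets others verify a result without access to the model itself.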

Contact us


Standardize and pave the way for groundbreaking advancements in AI research.

Our streamlined evaluation will provide accurate, ethical evaluations that return the same benchmark score today, tomorrow, and ten years from now.