Evaluation at the forefront.
Standardizing and streamlining the evaluation process, from foundation models to multi-step systems.
Our Mission
The mission of Walnut Research is to bring sustainable, well-maintained safety and evaluation practices to AI practitioners, so that we know what our powerful AI systems are capable of.
Problems & Opportunities
Lag between AI Development and Evaluation
AI models are growing rapidly in complexity, but the evaluation frameworks available to the community often lag behind these advances.
Difficult LLM Evaluation Process
Existing LLM evaluation libraries are not flexible enough to accommodate every paper and evaluation methodology, and it is difficult to find the right benchmark. Our goal is to make this process as modular and effortless as possible.
Implications for Research Ethics
Developers cannot meaningfully compare their model against another when every in-house evaluation setup differs. Furthermore, researchers can currently demonstrate a model's validity only by exposing their product or model entirely through open sourcing.
Introducing the
Nutcracker
Framework.
NUTCRACKER
We are building a PyTorch-like library for LLM evaluation.
Nutcracker is an open-source tool that evaluates LLMs through their APIs. Through this framework, we hope to advance our mission of streamlining and standardizing the evaluation process together with the AI development community.
Nutcracker currently supports 100+ tasks/benchmarks:
The current beta version of Nutcracker ships with more than 100 ready-to-go LLM benchmarks. This is not easily replicated: each benchmark requires heavy pre-processing, including converting existing benchmark data into a Nutcracker-specific format.
Streamline evaluation
Customized evaluation does not require writing code from the ground up.
Just as PyTorch provides essential abstractions like DataLoader, Nutcracker provides modular building blocks for customized evaluation setups.
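As an illustration only, here is a minimal sketch of what composing modular evaluation objects can look like. The class names (Task, Pipeline) and the prompt-in, completion-out model callable are assumptions for the example, not Nutcracker's actual API.

```python
# Hypothetical sketch of a modular evaluation setup.
# Class and method names are illustrative, not Nutcracker's actual API.
from typing import Callable, Dict, List


class Task:
    """A benchmark task: a list of prompts with their gold answers."""

    def __init__(self, name: str, instances: List[Dict[str, str]]):
        self.name = name
        self.instances = instances  # e.g. [{"prompt": ..., "answer": ...}, ...]


class Pipeline:
    """Runs a model callable over a task and scores exact-match accuracy."""

    def __init__(self, task: Task, model_fn: Callable[[str], str]):
        self.task = task
        self.model_fn = model_fn

    def run(self) -> float:
        correct = 0
        for item in self.task.instances:
            prediction = self.model_fn(item["prompt"])
            correct += int(prediction.strip() == item["answer"].strip())
        return correct / len(self.task.instances)


# Usage: swap in any callable that maps a prompt string to a completion string.
task = Task("toy-arithmetic", [{"prompt": "2 + 2 =", "answer": "4"}])
accuracy = Pipeline(task, model_fn=lambda prompt: "4").run()
print(f"accuracy: {accuracy:.2f}")
```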
Data in your hands
Existing LLM evaluation libraries are too high-level.
Download pre-processed benchmark data in a human-readable format (JSONL) to a user-specified directory.
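Because the downloaded data is plain JSONL, it can be inspected or filtered with a few lines of standard Python. The directory, file name, and field names below are placeholders for illustration, not a guaranteed schema.

```python
# Inspect downloaded benchmark data; the path and field names are placeholders.
import json
from pathlib import Path

data_dir = Path("./nutcracker_data/my_benchmark")  # user-specified download directory
records = []
with open(data_dir / "test.jsonl", encoding="utf-8") as f:
    for line in f:
        records.append(json.loads(line))

print(f"{len(records)} instances loaded")
print(records[0])  # each line is one human-readable evaluation instance
```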
API-based approach
Nutcracker is lightweight and customizable by design.
Be creative. Bring your RAG pipeline, multi-step inference system, or local LLM, and expose it as an API endpoint. Nutcracker takes care of the rest.
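A minimal sketch of the idea, assuming the evaluation harness only needs a prompt-in, completion-out HTTP endpoint: wrap whatever generation logic you have behind a small web service. The route and payload shape here are assumptions for the example, not a documented Nutcracker contract.

```python
# Illustrative only: wrap any generation logic (RAG, multi-step, local model)
# behind a prompt-in, completion-out HTTP endpoint. The route and payload
# shape are assumptions, not a documented Nutcracker contract.
from flask import Flask, jsonify, request

app = Flask(__name__)


def generate(prompt: str) -> str:
    # Replace with your own RAG pipeline, agent loop, or local model call.
    return "stub completion for: " + prompt


@app.route("/v1/completions", methods=["POST"])
def completions():
    prompt = request.get_json()["prompt"]
    return jsonify({"completion": generate(prompt)})


if __name__ == "__main__":
    app.run(port=8000)
```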
Share model outputs
Share the exact inputs and outputs from the model.
A modular framework means you can pickle-save each and every part of your evaluation pipeline. Share these artifacts to prove the validity of your evaluations.
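As a rough sketch of the idea, any stage of an evaluation run can be serialized with Python's standard pickle module and shared with reviewers. The artifact below is a plain dict standing in for whatever objects your pipeline produces; its structure is an example, not a fixed Nutcracker format.

```python
# Sketch: persist and reload an evaluation artifact with the standard library.
# The artifact structure here is an example, not a fixed Nutcracker format.
import pickle

artifact = {
    "task": "toy-arithmetic",
    "inputs": ["2 + 2 ="],
    "raw_outputs": ["4"],
    "score": 1.0,
}

with open("evaluation_artifact.pkl", "wb") as f:
    pickle.dump(artifact, f)

# A reviewer can reload the exact inputs and outputs you evaluated on.
with open("evaluation_artifact.pkl", "rb") as f:
    restored = pickle.load(f)
print(restored["raw_outputs"])
```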
Contact us
jehseok@walnutresearch.com
Standardize and pave the way for groundbreaking advancements in AI research.
Our streamlined evaluation will provide an accurate, ethical assessment that returns the same benchmark score today, tomorrow, and ten years from now.