21 changes: 21 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,21 @@
name: CI

on:
  push:
    branches: [main]
  pull_request:

jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6.0.3

      - uses: oven-sh/setup-bun@0c5077e51419868618aeaa5fe8019c62421857d6 # v2.2.0
        with:
          bun-version-file: package.json

      - run: bun install --frozen-lockfile

      - run: bun run typecheck
      - run: bun run lint
2 changes: 2 additions & 0 deletions AGENTS.md
@@ -20,9 +20,11 @@ Benchmark execution uses two CLIs:
2. `bun runner/judge.ts` reads generated artifacts, runs LLM judging, and writes results under `results/<run-id>/`.

Generation details (`runner/solver/pipeline.ts`):

- `--model` is required and is always used for generation.

Judge details (`runner/evaluators/llm/run.ts`):

- `--model` is required and LLM judge always runs against generated artifacts.

Key output behavior:
23 changes: 23 additions & 0 deletions bun.lock


6 changes: 6 additions & 0 deletions docs/adding-new-category.md
@@ -5,28 +5,34 @@ This document defines the category-level authoring workflow.
## Workflow

1. Define category scope and baseline libraries.

- Identify the API surface that the category should exercise.
- Normalize library naming used in prompts, requirements, and README.

2. Read official docs and recent release notes.

- Capture source links for each core library.
- Record dated API shifts that should become requirement constraints.

3. Build category best-practice inventory.

- Convert source guidance into concrete implementation expectations.
- Keep entries deterministic and file-verifiable where possible.

4. Derive prompts from best practices.

- Write prompts as forward-looking implementation asks.
- Avoid bug-report framing.
- Map each prompt to one or more best-practice targets.

5. Define deterministic requirements.

- Express requirements as file-verifiable checks.
- Keep requirements atomic and concrete.
- Use evidence-backed `MUST NOT` only for deprecations/removals/correctness caveats.

6. Validate diversity and overlap.

- Keep shared subgroup requirements small.
- Ensure each eval has implementation-specific constraints.

8 changes: 8 additions & 0 deletions docs/announcing-react-native-evals-article-draft.md
@@ -7,6 +7,7 @@
- Tone: practical, technical, direct; short sections; concrete claims; end with a clear call to action.

Section plan:

1. What is React Native Evals?
2. Why we made it
3. What models and categories are currently benchmarked
@@ -27,12 +28,14 @@ Today we are open-sourcing **React Native Evals**, a benchmark suite designed to
React Native Evals is a task-based benchmark for model-generated code.

Each eval is a self-contained task in `evals/<category>/<eval-id>/` with:

- a task prompt (`prompt.md`)
- judgeable requirements (`requirements.yaml`)
- a baseline app scaffold (`app/`)
- a reference implementation (`reference/`)

The repository currently includes **136 evals** across seven category groups:

- `animation` (13)
- `async-state` (13)
- `device-permissions` (24)
@@ -52,11 +55,13 @@ React Native Evals focuses on those implementation details so teams can compare
### What models and categories are currently benchmarked?

Based on current repository run artifacts (as of **February 27, 2026**), benchmark runs include:

- `gpt-4.1-mini`
- `gpt-5.3-codex`
- `noop` reference baseline mode (used to validate the judging pipeline without solver generation)

Category coverage is currently:

- animation
- async state
- device permissions
@@ -70,6 +75,7 @@ Category coverage is currently:
There is no fixed public roadmap list of future category names in-repo yet.

Current direction is to:

- continue expanding depth and coverage in the existing seven categories
- add new categories through the documented category workflow (`docs/adding-new-category.md`)
- prioritize categories with clear, judgeable implementation constraints and strong primary-source API guidance
@@ -81,11 +87,13 @@ If you publish this post with a committed roadmap, replace this section with you
React Native Evals uses a split pipeline:

1. **Generation stage** (`bun runner/run.ts`)

- discovers evals from `requirements.yaml`
- runs a solver model against each eval prompt + baseline files
- writes generated outputs plus a manifest

2. **Judge stage** (`bun runner/judge.ts`)

- reads generated outputs
- evaluates each declared requirement with an LLM judge
- writes per-eval results and run summaries
22 changes: 11 additions & 11 deletions docs/opencode-docker.md
@@ -6,11 +6,11 @@ The runner starts OpenCode inside Docker for solver and judge model calls. This

Each worker session bind-mounts three things into the container:

-| Host source | Container path | Mode | Purpose |
-| --- | --- | --- | --- |
-| Eval workspace directory | `/workspace` | read-write | Files the agent reads and edits |
-| Isolated temp dir with copied `~/.local/share/opencode/auth.json` | `/root/.local/share/opencode` (`auth.json` inside) | read-write | OpenCode provider credentials |
-| Repo `opencode.json` or `opencode.jsonc` (if present) | `/root/.config/opencode/<filename>` | read-only | Provider/model config (no secrets) |
+| Host source | Container path | Mode | Purpose |
+| ----------------------------------------------------------------- | -------------------------------------------------- | ---------- | ---------------------------------- |
+| Eval workspace directory | `/workspace` | read-write | Files the agent reads and edits |
+| Isolated temp dir with copied `~/.local/share/opencode/auth.json` | `/root/.local/share/opencode` (`auth.json` inside) | read-write | OpenCode provider credentials |
+| Repo `opencode.json` or `opencode.jsonc` (if present) | `/root/.config/opencode/<filename>` | read-only | Provider/model config (no secrets) |

The host `auth.json` is copied into a per-container temp directory, and that directory is bind-mounted to `/root/.local/share/opencode` so concurrent workers do not share or overwrite the same host file.
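
As a rough illustration of the three mounts in the table, the sketch below turns mount descriptions into `docker run` volume flags, assuming the `-v host:container[:ro]` form. The `Mount` type, `toDockerVolumeArgs`, and every host path shown are hypothetical; the runner's actual implementation may build its mounts differently.

```typescript
// Hypothetical sketch: names and host paths are illustrative, not the runner's code.
type Mount = {
  hostPath: string
  containerPath: string
  readOnly: boolean
}

// Turn mount descriptions into `docker run` arguments using the `-v` flag.
function toDockerVolumeArgs(mounts: Mount[]): string[] {
  const args: string[] = []
  for (const mount of mounts) {
    // Docker treats a mount as read-write unless `:ro` is appended.
    const suffix = mount.readOnly ? ':ro' : ''
    args.push('-v', `${mount.hostPath}:${mount.containerPath}${suffix}`)
  }
  return args
}

// One entry per table row; each worker gets its own temp dir for auth.json.
const exampleArgs = toDockerVolumeArgs([
  { hostPath: '/tmp/eval-workspace', containerPath: '/workspace', readOnly: false },
  { hostPath: '/tmp/opencode-auth-1', containerPath: '/root/.local/share/opencode', readOnly: false },
  { hostPath: '/repo/opencode.json', containerPath: '/root/.config/opencode/opencode.json', readOnly: true },
])
```

Giving each worker a distinct temp directory for the credentials mount is what keeps concurrent containers from writing to the same host file.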

@@ -129,11 +129,11 @@ bun runner/judge.ts \

## Related environment variables

-| Variable | Default | Effect |
-| ------------------------------------ | ---------------------------------------- | ---------------------------------------------------------- |
-| `OPENCODE_STREAM_CONTAINER_LOGS` | enabled | Set to `0` to stop streaming `[container]` docker logs |
-| `OPENCODE_SERVER_LOG_LEVEL` | `INFO` | Server log level when `--agent-logs` is off |
-| `OPENCODE_SERVER_PRINT_LOGS` | off | Set to `1` to enable `--print-logs` without `--agent-logs` |
-| `OPENCODE_DOCKER_EXTRA_ENV_PREFIXES` | unset | Comma-separated extra env key prefixes to forward |
+| Variable | Default | Effect |
+| ------------------------------------ | ------- | ---------------------------------------------------------- |
+| `OPENCODE_STREAM_CONTAINER_LOGS` | enabled | Set to `0` to stop streaming `[container]` docker logs |
+| `OPENCODE_SERVER_LOG_LEVEL` | `INFO` | Server log level when `--agent-logs` is off |
+| `OPENCODE_SERVER_PRINT_LOGS` | off | Set to `1` to enable `--print-logs` without `--agent-logs` |
+| `OPENCODE_DOCKER_EXTRA_ENV_PREFIXES` | unset | Comma-separated extra env key prefixes to forward |

When a run starts, the runner prints a single line summarizing the effective logging settings (`serve log level`, `agent logs`, `verbose`).
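
The defaults in the table can be read as a small resolution step over the process environment. The following is only a sketch under that reading: `DockerEnvSettings` and `resolveDockerEnvSettings` are hypothetical names, and the runner's real parsing (for example, how it treats values other than `0` and `1`) is an assumption here.

```typescript
// Hypothetical sketch of applying the documented defaults; not the runner's code.
type DockerEnvSettings = {
  streamContainerLogs: boolean
  serverLogLevel: string
  serverPrintLogs: boolean
  extraEnvPrefixes: string[]
}

function resolveDockerEnvSettings(
  env: Record<string, string | undefined>
): DockerEnvSettings {
  return {
    // Enabled by default; only an explicit `0` turns streaming off.
    streamContainerLogs: env.OPENCODE_STREAM_CONTAINER_LOGS !== '0',
    // Falls back to the documented default level.
    serverLogLevel: env.OPENCODE_SERVER_LOG_LEVEL ?? 'INFO',
    // Off by default; only an explicit `1` turns it on.
    serverPrintLogs: env.OPENCODE_SERVER_PRINT_LOGS === '1',
    // Comma-separated list; unset or empty yields no extra prefixes.
    extraEnvPrefixes: (env.OPENCODE_DOCKER_EXTRA_ENV_PREFIXES ?? '')
      .split(',')
      .map((prefix) => prefix.trim())
      .filter((prefix) => prefix.length > 0),
  }
}

const defaultSettings = resolveDockerEnvSettings({})
const customSettings = resolveDockerEnvSettings({
  OPENCODE_STREAM_CONTAINER_LOGS: '0',
  OPENCODE_SERVER_PRINT_LOGS: '1',
  OPENCODE_DOCKER_EXTRA_ENV_PREFIXES: 'FOO_, BAR_',
})
```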
3 changes: 1 addition & 2 deletions docs/testing-your-evals.md
@@ -1,4 +1,3 @@

# Testing Your Evals

Use this workflow to validate evals end-to-end with split generation and judge CLIs.
@@ -38,4 +37,4 @@ Before opening a PR, run:

```bash
bun lint
```
@@ -27,7 +27,7 @@ export default function App() {
progress.value,
[0, 1],
[COLLAPSED_HEIGHT, EXPANDED_HEIGHT],
-Extrapolation.CLAMP,
+Extrapolation.CLAMP
),
}
})
@@ -51,7 +51,6 @@ export default function App() {
style={styles.header}
>
<Text style={styles.title}>Shipment details</Text>

</Pressable>

<Animated.View style={[styles.details, detailsAnimatedStyle]}>
@@ -23,16 +23,19 @@ export default function App() {
const [items, setItems] = useState<RowItem[]>(INITIAL_ITEMS)
const nextId = useRef(INITIAL_ITEMS.length + 1)

-const renderItem = useCallback(({ item }: { item: RowItem }) => (
-<Animated.View
-entering={FadeInDown.duration(220)}
-exiting={FadeOutUp.duration(180)}
-layout={LinearTransition}
-style={styles.row}
->
-<Text style={styles.rowText}>{item.label}</Text>
-</Animated.View>
-), [])
+const renderItem = useCallback(
+({ item }: { item: RowItem }) => (
+<Animated.View
+entering={FadeInDown.duration(220)}
+exiting={FadeOutUp.duration(180)}
+layout={LinearTransition}
+style={styles.row}
+>
+<Text style={styles.rowText}>{item.label}</Text>
+</Animated.View>
+),
+[]
+)

return (
<View style={styles.screen}>
@@ -38,7 +38,10 @@ function pickNearestSnapPoint(px: number, py: number): SnapPoint {
return nearest
}

-function pickSnapPointByVelocity(velocityX: number, velocityY: number): SnapPoint {
+function pickSnapPointByVelocity(
+velocityX: number,
+velocityY: number
+): SnapPoint {
'worklet'

let best = SNAP_POINTS[0]
@@ -71,8 +74,7 @@ export default function App() {
translateY.value = dragStartY.value + event.translationY
})
.onEnd((event) => {
-const velocityMag =
-Math.sqrt(event.velocityX ** 2 + event.velocityY ** 2)
+const velocityMag = Math.sqrt(event.velocityX ** 2 + event.velocityY ** 2)
const target =
velocityMag >= VELOCITY_THRESHOLD
? pickSnapPointByVelocity(event.velocityX, event.velocityY)
@@ -73,10 +73,10 @@ export default function App() {
.onEnd((event) => {
'worklet'
const shouldOpen = rowTranslateX.value < ROW_SNAP_OPEN_THRESHOLD
-rowTranslateX.value = withSpring(
-shouldOpen ? ROW_OPEN_OFFSET : 0,
-{ ...ROW_SPRING_CONFIG, velocity: event.velocityX }
-)
+rowTranslateX.value = withSpring(shouldOpen ? ROW_OPEN_OFFSET : 0, {
+...ROW_SPRING_CONFIG,
+velocity: event.velocityX,
+})
})

const rowAnimatedStyle = useAnimatedStyle(() => {
@@ -1,4 +1,11 @@
-import { Pressable, ScrollView, StyleSheet, Text, TextInput, View } from 'react-native'
+import {
+Pressable,
+ScrollView,
+StyleSheet,
+Text,
+TextInput,
+View,
+} from 'react-native'

const FIELDS = [
'First name',
@@ -20,7 +27,7 @@ export default function App() {
<Text style={styles.label}>{field}</Text>
<TextInput
placeholder={field}
-placeholderTextColor='#64748b'
+placeholderTextColor="#64748b"
style={styles.input}
/>
</View>
20 changes: 11 additions & 9 deletions evals/async-state/01-rn-rq-stable-query-key-filters/app/App.tsx
@@ -1,11 +1,7 @@
import { useState } from 'react'
import { FlatList, Pressable, StyleSheet, Text, View } from 'react-native'

-const FILTER_OPTIONS = [
-'all',
-'smartphones',
-'laptops',
-] as const
+const FILTER_OPTIONS = ['all', 'smartphones', 'laptops'] as const

type Product = {
id: number
@@ -14,8 +10,8 @@ type Product = {
}

function TransactionsScreen() {
-const [filter, setFilter] = useState<typeof FILTER_OPTIONS[number]>('all')
+const [filter, setFilter] = useState<(typeof FILTER_OPTIONS)[number]>('all')

// Update these with real data and logic in later steps
const page = 1
const totalPages = 1
@@ -34,7 +30,10 @@ function TransactionsScreen() {
onPress={() => {
setFilter(candidate)
}}
-style={[styles.filterButton, isActive && styles.filterButtonActive]}
+style={[
+styles.filterButton,
+isActive && styles.filterButtonActive,
+]}
>
<Text
style={[
@@ -78,7 +77,10 @@ function TransactionsScreen() {
<Pressable
disabled={page >= totalPages}
onPress={() => {}}
-style={[styles.pageButton, page >= totalPages && styles.pageButtonDisabled]}
+style={[
+styles.pageButton,
+page >= totalPages && styles.pageButtonDisabled,
+]}
>
<Text style={styles.pageButtonText}>Next</Text>
</Pressable>
3 changes: 3 additions & 0 deletions evals/async-state/01-rn-rq-stable-query-key-filters/prompt.md
@@ -1,13 +1,16 @@
Build a products screen with TanStack Query where users can change filter tabs (`all`, `smartphones`, `laptops`) and paginate results.

Use these endpoint patterns:

- All products: `https://dummyjson.com/products?limit=<pageSize>&skip=<skip>`
- Category products: `https://dummyjson.com/products/category/<category>?limit=<pageSize>&skip=<skip>`

Use these response fields:

- `{ "products": [{ "id": number, "title": string, "price": number }], "total": number, "skip": number, "limit": number }`
- Map rows from `products` and derive page metadata from `total` `skip` and `limit`

Example URLs:

- `https://dummyjson.com/products?limit=10&skip=0`
- `https://dummyjson.com/products/category/smartphones?limit=10&skip=0`
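
The endpoint patterns and response fields above could be wired together roughly as follows. This is a sketch, not the eval's reference implementation: `buildProductsUrl` and `derivePageMeta` are hypothetical helper names, pages are assumed to be 1-based, and the response `limit` is assumed to equal the requested page size.

```typescript
// Hypothetical helpers; names are illustrative and not from the eval's reference code.
type Filter = 'all' | 'smartphones' | 'laptops'

type ProductsResponse = {
  products: { id: number; title: string; price: number }[]
  total: number
  skip: number
  limit: number
}

const BASE_URL = 'https://dummyjson.com/products'

// Build the request URL for a filter tab and a 1-based page number.
function buildProductsUrl(filter: Filter, page: number, pageSize: number): string {
  const skip = (page - 1) * pageSize
  const path = filter === 'all' ? BASE_URL : `${BASE_URL}/category/${filter}`
  return `${path}?limit=${pageSize}&skip=${skip}`
}

// Derive page metadata from `total`, `skip`, and `limit`.
// Assumes `limit` is the requested page size, not the count of rows returned.
function derivePageMeta(
  response: Pick<ProductsResponse, 'total' | 'skip' | 'limit'>
): { page: number; totalPages: number } {
  const { total, skip, limit } = response
  if (limit <= 0) {
    return { page: 1, totalPages: 1 }
  }
  return {
    page: Math.floor(skip / limit) + 1,
    totalPages: Math.max(1, Math.ceil(total / limit)),
  }
}
```

Because the URL depends only on `filter`, `page`, and `pageSize`, those same three values are natural candidates for a TanStack Query key such as `['products', filter, page, pageSize]`.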