callstackincubator · artus9033 · Jun 3, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,21 @@
+name: CI
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+
+jobs:
+  quality:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6.0.3
+
+      - uses: oven-sh/setup-bun@0c5077e51419868618aeaa5fe8019c62421857d6 # v2.2.0
+        with:
+          bun-version-file: package.json
+
+      - run: bun install --frozen-lockfile
+
+      - run: bun run typecheck
+      - run: bun run lint
diff --git a/AGENTS.md b/AGENTS.md
@@ -20,9 +20,11 @@ Benchmark execution uses two CLIs:
 2. `bun runner/judge.ts` reads generated artifacts, runs LLM judging, and writes results under `results/<run-id>/`.
 
 Generation details (`runner/solver/pipeline.ts`):
+
 - `--model` is required and is always used for generation.
 
 Judge details (`runner/evaluators/llm/run.ts`):
+
 - `--model` is required and LLM judge always runs against generated artifacts.
 
 Key output behavior:

diff --git a/bun.lock b/bun.lock
diff --git a/docs/adding-new-category.md b/docs/adding-new-category.md
@@ -5,28 +5,34 @@ This document defines the category-level authoring workflow.
 ## Workflow
 
 1. Define category scope and baseline libraries.
+
 - Identify the API surface that the category should exercise.
 - Normalize library naming used in prompts, requirements, and README.
 
 2. Read official docs and recent release notes.
+
 - Capture source links for each core library.
 - Record dated API shifts that should become requirement constraints.
 
 3. Build category best-practice inventory.
+
 - Convert source guidance into concrete implementation expectations.
 - Keep entries deterministic and file-verifiable where possible.
 
 4. Derive prompts from best practices.
+
 - Write prompts as forward-looking implementation asks.
 - Avoid bug-report framing.
 - Map each prompt to one or more best-practice targets.
 
 5. Define deterministic requirements.
+
 - Express requirements as file-verifiable checks.
 - Keep requirements atomic and concrete.
 - Use evidence-backed `MUST NOT` only for deprecations/removals/correctness caveats.
 
 6. Validate diversity and overlap.
+
 - Keep shared subgroup requirements small.
 - Ensure each eval has implementation-specific constraints.
 

diff --git a/docs/announcing-react-native-evals-article-draft.md b/docs/announcing-react-native-evals-article-draft.md
@@ -7,6 +7,7 @@
 - Tone: practical, technical, direct; short sections; concrete claims; end with a clear call to action.
 
 Section plan:
+
 1. What is React Native Evals?
 2. Why we made it
 3. What models and categories are currently benchmarked
@@ -27,12 +28,14 @@ Today we are open-sourcing **React Native Evals**, a benchmark suite designed to
 React Native Evals is a task-based benchmark for model-generated code.
 
 Each eval is a self-contained task in `evals/<category>/<eval-id>/` with:
+
 - a task prompt (`prompt.md`)
 - judgeable requirements (`requirements.yaml`)
 - a baseline app scaffold (`app/`)
 - a reference implementation (`reference/`)
 
 The repository currently includes **136 evals** across seven category groups:
+
 - `animation` (13)
 - `async-state` (13)
 - `device-permissions` (24)
@@ -52,11 +55,13 @@ React Native Evals focuses on those implementation details so teams can compare
 ### What models and categories are currently benchmarked?
 
 Based on current repository run artifacts (as of **February 27, 2026**), benchmark runs include:
+
 - `gpt-4.1-mini`
 - `gpt-5.3-codex`
 - `noop` reference baseline mode (used to validate the judging pipeline without solver generation)
 
 Category coverage is currently:
+
 - animation
 - async state
 - device permissions
@@ -70,6 +75,7 @@ Category coverage is currently:
 There is no fixed public roadmap list of future category names in-repo yet.
 
 Current direction is to:
+
 - continue expanding depth and coverage in the existing seven categories
 - add new categories through the documented category workflow (`docs/adding-new-category.md`)
 - prioritize categories with clear, judgeable implementation constraints and strong primary-source API guidance
@@ -81,11 +87,13 @@ If you publish this post with a committed roadmap, replace this section with you
 React Native Evals uses a split pipeline:
 
 1. **Generation stage** (`bun runner/run.ts`)
+
 - discovers evals from `requirements.yaml`
 - runs a solver model against each eval prompt + baseline files
 - writes generated outputs plus a manifest
 
 2. **Judge stage** (`bun runner/judge.ts`)
+
 - reads generated outputs
 - evaluates each declared requirement with an LLM judge
 - writes per-eval results and run summaries

diff --git a/docs/opencode-docker.md b/docs/opencode-docker.md
@@ -6,11 +6,11 @@ The runner starts OpenCode inside Docker for solver and judge model calls. This
 
 Each worker session bind-mounts three things into the container:
 
-| Host source | Container path | Mode | Purpose |
-| --- | --- | --- | --- |
-| Eval workspace directory | `/workspace` | read-write | Files the agent reads and edits |
-| Isolated temp dir with copied `~/.local/share/opencode/auth.json` | `/root/.local/share/opencode` (`auth.json` inside) | read-write | OpenCode provider credentials |
-| Repo `opencode.json` or `opencode.jsonc` (if present) | `/root/.config/opencode/<filename>` | read-only | Provider/model config (no secrets) |
+| Host source                                                       | Container path                                     | Mode       | Purpose                            |
+| ----------------------------------------------------------------- | -------------------------------------------------- | ---------- | ---------------------------------- |
+| Eval workspace directory                                          | `/workspace`                                       | read-write | Files the agent reads and edits    |
+| Isolated temp dir with copied `~/.local/share/opencode/auth.json` | `/root/.local/share/opencode` (`auth.json` inside) | read-write | OpenCode provider credentials      |
+| Repo `opencode.json` or `opencode.jsonc` (if present)             | `/root/.config/opencode/<filename>`                | read-only  | Provider/model config (no secrets) |
 
 The host `auth.json` is copied into a per-container temp directory, and that directory is bind-mounted to `/root/.local/share/opencode` so concurrent workers do not share or overwrite the same host file.
 
@@ -129,11 +129,11 @@ bun runner/judge.ts \
 
 ## Related environment variables
 
-| Variable                             | Default                                  | Effect                                                     |
-| ------------------------------------ | ---------------------------------------- | ---------------------------------------------------------- |
-| `OPENCODE_STREAM_CONTAINER_LOGS`     | enabled                                  | Set to `0` to stop streaming `[container]` docker logs     |
-| `OPENCODE_SERVER_LOG_LEVEL`          | `INFO`                                   | Server log level when `--agent-logs` is off                |
-| `OPENCODE_SERVER_PRINT_LOGS`         | off                                      | Set to `1` to enable `--print-logs` without `--agent-logs` |
-| `OPENCODE_DOCKER_EXTRA_ENV_PREFIXES` | unset                                    | Comma-separated extra env key prefixes to forward          |
+| Variable                             | Default | Effect                                                     |
+| ------------------------------------ | ------- | ---------------------------------------------------------- |
+| `OPENCODE_STREAM_CONTAINER_LOGS`     | enabled | Set to `0` to stop streaming `[container]` docker logs     |
+| `OPENCODE_SERVER_LOG_LEVEL`          | `INFO`  | Server log level when `--agent-logs` is off                |
+| `OPENCODE_SERVER_PRINT_LOGS`         | off     | Set to `1` to enable `--print-logs` without `--agent-logs` |
+| `OPENCODE_DOCKER_EXTRA_ENV_PREFIXES` | unset   | Comma-separated extra env key prefixes to forward          |
 
 When a run starts, the runner prints a single line summarizing the effective logging settings (`serve log level`, `agent logs`, `verbose`).
diff --git a/docs/testing-your-evals.md b/docs/testing-your-evals.md
@@ -1,4 +1,3 @@
-
 # Testing Your Evals
 
 Use this workflow to validate evals end-to-end with split generation and judge CLIs.
@@ -38,4 +37,4 @@ Before opening a PR, run:
 
 ```bash
 bun lint
-```
+```
diff --git a/evals/animation/01-rn-anim-pressable-scale-with-timing/reference/App.tsx b/evals/animation/01-rn-anim-pressable-scale-with-timing/reference/App.tsx
@@ -27,7 +27,7 @@ export default function App() {
         progress.value,
         [0, 1],
         [COLLAPSED_HEIGHT, EXPANDED_HEIGHT],
-        Extrapolation.CLAMP,
+        Extrapolation.CLAMP
       ),
     }
   })

diff --git a/evals/animation/02-rn-anim-spring-toggle-card-height/reference/App.tsx b/evals/animation/02-rn-anim-spring-toggle-card-height/reference/App.tsx
@@ -51,7 +51,6 @@ export default function App() {
           style={styles.header}
         >
           <Text style={styles.title}>Shipment details</Text>
-
         </Pressable>
 
         <Animated.View style={[styles.details, detailsAnimatedStyle]}>

diff --git a/evals/animation/03-rn-anim-layout-enter-exit-list-items/reference/App.tsx b/evals/animation/03-rn-anim-layout-enter-exit-list-items/reference/App.tsx
@@ -23,16 +23,19 @@ export default function App() {
   const [items, setItems] = useState<RowItem[]>(INITIAL_ITEMS)
   const nextId = useRef(INITIAL_ITEMS.length + 1)
 
-  const renderItem = useCallback(({ item }: { item: RowItem }) => (
-    <Animated.View
-      entering={FadeInDown.duration(220)}
-      exiting={FadeOutUp.duration(180)}
-      layout={LinearTransition}
-      style={styles.row}
-    >
-      <Text style={styles.rowText}>{item.label}</Text>
-    </Animated.View>
-  ), [])
+  const renderItem = useCallback(
+    ({ item }: { item: RowItem }) => (
+      <Animated.View
+        entering={FadeInDown.duration(220)}
+        exiting={FadeOutUp.duration(180)}
+        layout={LinearTransition}
+        style={styles.row}
+      >
+        <Text style={styles.rowText}>{item.label}</Text>
+      </Animated.View>
+    ),
+    []
+  )
 
   return (
     <View style={styles.screen}>

diff --git a/evals/animation/05-rn-anim-pan-drag-card-with-snap-points/reference/App.tsx b/evals/animation/05-rn-anim-pan-drag-card-with-snap-points/reference/App.tsx
@@ -38,7 +38,10 @@ function pickNearestSnapPoint(px: number, py: number): SnapPoint {
   return nearest
 }
 
-function pickSnapPointByVelocity(velocityX: number, velocityY: number): SnapPoint {
+function pickSnapPointByVelocity(
+  velocityX: number,
+  velocityY: number
+): SnapPoint {
   'worklet'
 
   let best = SNAP_POINTS[0]
@@ -71,8 +74,7 @@ export default function App() {
       translateY.value = dragStartY.value + event.translationY
     })
     .onEnd((event) => {
-      const velocityMag =
-        Math.sqrt(event.velocityX ** 2 + event.velocityY ** 2)
+      const velocityMag = Math.sqrt(event.velocityX ** 2 + event.velocityY ** 2)
       const target =
         velocityMag >= VELOCITY_THRESHOLD
           ? pickSnapPointByVelocity(event.velocityX, event.velocityY)

diff --git a/evals/animation/12-rn-rngh-scroll-pan-failure-chain/reference/App.tsx b/evals/animation/12-rn-rngh-scroll-pan-failure-chain/reference/App.tsx
@@ -73,10 +73,10 @@ export default function App() {
     .onEnd((event) => {
       'worklet'
       const shouldOpen = rowTranslateX.value < ROW_SNAP_OPEN_THRESHOLD
-      rowTranslateX.value = withSpring(
-        shouldOpen ? ROW_OPEN_OFFSET : 0,
-        { ...ROW_SPRING_CONFIG, velocity: event.velocityX }
-      )
+      rowTranslateX.value = withSpring(shouldOpen ? ROW_OPEN_OFFSET : 0, {
+        ...ROW_SPRING_CONFIG,
+        velocity: event.velocityX,
+      })
     })
 
   const rowAnimatedStyle = useAnimatedStyle(() => {

diff --git a/evals/animation/14-rn-keyboard-controller-aware-scroll-footer/app/App.tsx b/evals/animation/14-rn-keyboard-controller-aware-scroll-footer/app/App.tsx
@@ -1,4 +1,11 @@
-import { Pressable, ScrollView, StyleSheet, Text, TextInput, View } from 'react-native'
+import {
+  Pressable,
+  ScrollView,
+  StyleSheet,
+  Text,
+  TextInput,
+  View,
+} from 'react-native'
 
 const FIELDS = [
   'First name',
@@ -20,7 +27,7 @@ export default function App() {
             <Text style={styles.label}>{field}</Text>
             <TextInput
               placeholder={field}
-              placeholderTextColor='#64748b'
+              placeholderTextColor="#64748b"
               style={styles.input}
             />
           </View>

diff --git a/evals/async-state/01-rn-rq-stable-query-key-filters/app/App.tsx b/evals/async-state/01-rn-rq-stable-query-key-filters/app/App.tsx
@@ -1,11 +1,7 @@
 import { useState } from 'react'
 import { FlatList, Pressable, StyleSheet, Text, View } from 'react-native'
 
-const FILTER_OPTIONS = [
-  'all',
-  'smartphones',
-  'laptops',
-] as const
+const FILTER_OPTIONS = ['all', 'smartphones', 'laptops'] as const
 
 type Product = {
   id: number
@@ -14,8 +10,8 @@ type Product = {
 }
 
 function TransactionsScreen() {
-  const [filter, setFilter] = useState<typeof FILTER_OPTIONS[number]>('all')
-  
+  const [filter, setFilter] = useState<(typeof FILTER_OPTIONS)[number]>('all')
+
   // Update these with real data and logic in later steps
   const page = 1
   const totalPages = 1
@@ -34,7 +30,10 @@ function TransactionsScreen() {
               onPress={() => {
                 setFilter(candidate)
               }}
-              style={[styles.filterButton, isActive && styles.filterButtonActive]}
+              style={[
+                styles.filterButton,
+                isActive && styles.filterButtonActive,
+              ]}
             >
               <Text
                 style={[
@@ -78,7 +77,10 @@ function TransactionsScreen() {
         <Pressable
           disabled={page >= totalPages}
           onPress={() => {}}
-          style={[styles.pageButton, page >= totalPages && styles.pageButtonDisabled]}
+          style={[
+            styles.pageButton,
+            page >= totalPages && styles.pageButtonDisabled,
+          ]}
         >
           <Text style={styles.pageButtonText}>Next</Text>
         </Pressable>

diff --git a/evals/async-state/01-rn-rq-stable-query-key-filters/prompt.md b/evals/async-state/01-rn-rq-stable-query-key-filters/prompt.md
@@ -1,13 +1,16 @@
 Build a products screen with TanStack Query where users can change filter tabs (`all`, `smartphones`, `laptops`) and paginate results.
 
 Use these endpoint patterns:
+
 - All products: `https://dummyjson.com/products?limit=<pageSize>&skip=<skip>`
 - Category products: `https://dummyjson.com/products/category/<category>?limit=<pageSize>&skip=<skip>`
 
 Use these response fields:
+
 - `{ "products": [{ "id": number, "title": string, "price": number }], "total": number, "skip": number, "limit": number }`
 - Map rows from `products` and derive page metadata from `total` `skip` and `limit`
 
 Example URLs:
+
 - `https://dummyjson.com/products?limit=10&skip=0`
 - `https://dummyjson.com/products/category/smartphones?limit=10&skip=0`
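
Reviewer note: the endpoint patterns and response fields in `prompt.md` above can be illustrated with a minimal sketch. The helper names (`buildProductsUrl`, `derivePageMeta`) and the `Filter` type are hypothetical and do not appear in the repository; only the URL patterns and the `total`/`skip`/`limit` fields come from the prompt. The sketch also assumes that `limit` in the response echoes the requested page size, which may not hold on a short final page.

```typescript
// Hypothetical helpers for illustration only; not part of the eval's reference code.
type Filter = 'all' | 'smartphones' | 'laptops'

type ProductsResponse = {
  products: { id: number; title: string; price: number }[]
  total: number
  skip: number
  limit: number
}

const BASE_URL = 'https://dummyjson.com/products'

// Builds the request URL for a 1-based page, following the two patterns in the prompt.
function buildProductsUrl(
  filter: Filter,
  page: number,
  pageSize: number
): string {
  const skip = (page - 1) * pageSize
  const path = filter === 'all' ? BASE_URL : `${BASE_URL}/category/${filter}`
  return `${path}?limit=${pageSize}&skip=${skip}`
}

// Derives page metadata from the response fields named in the prompt.
// Assumption: `limit` equals the requested page size.
function derivePageMeta(
  response: Pick<ProductsResponse, 'total' | 'skip' | 'limit'>
) {
  const pageSize = Math.max(response.limit, 1) // guard against division by zero
  const page = Math.floor(response.skip / pageSize) + 1
  const totalPages = Math.max(1, Math.ceil(response.total / pageSize))
  return { page, totalPages, hasNextPage: page < totalPages }
}
```

With a page size of 10, `buildProductsUrl('smartphones', 1, 10)` reproduces the second example URL in the prompt. Passing the requested page size into `derivePageMeta` instead of reading `limit` from the response would be a reasonable alternative if the assumption above turns out not to hold.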