大相撲 SumoFans
夏場所 Natsu (Summer)
methodology · how it works · last build June 3, 2026 at 1:52 PM

資料 How it works

The pipeline: where the data comes from, how far back it goes, how the snapshot gets built, and where the code lives. Static end-to-end — one derived.json baked at deploy.

00 / Pipeline

The pipeline at a glance

The short version: three upstream sources fused into a single static artifact at build time.

  • 3 sources: sumo-api.com (bouts and banzuke), Wikipedia (photos + pre-1958 yokozuna lineage), JSA (photo fallback)
  • 67 years deep: Nov 1958 - May 2026, the modern six-basho era
  • 1 artifact: one derived.json baked at deploy time, no DB or CMS
01 / Sources

Where the data comes from

sumo-api.com

The primary data source. An unofficial but well-maintained REST API mirroring the Japan Sumo Association's official banzuke and bout records.

GET /api/rikishis?limit=1000&skip=N

Paginated full roster - currently 610 active rikishi across all six divisions. The pipeline pages through until the response is empty, then writes data/raw/rikishis.json.

Sample response
{
  "rikishis": [
    {
      "id": 8854,
      "shikonaEn": "Aonishiki",
      "shikonaJp": "安青錦 新大",
      "currentRank": "Ozeki 1 West",
      "heya": "Ajigawa",
      "shusshin": "Ukraine, Vinnytsia Oblast",
      "height": 182,
      "weight": 141,
      "debut": "202307",
      "birthDate": "2004-03-23T00:00:00Z"
    }
    // ... 620 more rikishi
  ],
  "total": 621
}
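
The paging behaviour described above (request pages until one comes back empty) can be sketched roughly as follows. This is a minimal illustration, not the code in fetch-data.ts: the names RikishiSummary, RikishiPage, PageFetcher and fetchAllRikishi are hypothetical, and the page fetcher is injected so the loop can be exercised without touching the network.

```typescript
// Reduced shapes: only the fields this sketch needs (names are illustrative).
interface RikishiSummary {
  id: number;
  shikonaEn: string;
}

interface RikishiPage {
  rikishis?: RikishiSummary[];
  total?: number;
}

// A page fetcher takes (limit, skip) and resolves to one page of the roster.
// In a real crawl it would wrap an HTTP GET to /api/rikishis?limit=...&skip=...
type PageFetcher = (limit: number, skip: number) => Promise<RikishiPage>;

// Request successive pages until one comes back with no rikishi in it.
async function fetchAllRikishi(
  fetchPage: PageFetcher,
  limit = 1000,
): Promise<RikishiSummary[]> {
  const all: RikishiSummary[] = [];
  for (let skip = 0; ; skip += limit) {
    const page = await fetchPage(limit, skip);
    const batch = page.rikishis ?? [];
    if (batch.length === 0) break; // empty response: the roster is exhausted
    all.push(...batch);
  }
  return all;
}
```

Stopping on an empty page rather than on the reported total keeps the loop independent of how the total field is computed, at the cost of one extra request.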
GET /api/rikishi/<id>/stats

Career W-L by division, yusho and special-prize (sansho) counts. Fetched for Makuuchi sekitori only - reaches back to each rikishi's professional debut.

Sample response
{
  "id": 8854,
  "totalMatches": 177,
  "totalWins": 135,
  "totalLosses": 42,
  "yusho": 4,
  "basho": 16,
  "sansho": {
    "Gino_sho": 3,
    "Kanto_sho": 2,
    "Shukun_sho": 1
  }
}
GET /api/basho/<YYYYMM>

Tournament metadata: dates, status, and yusho winners per division. Fetched for the last 6 bashos (~1 year window).

Sample response
{
  "bashoId": "202507",
  "startDate": "2025-07-13T00:00:00Z",
  "endDate": "2025-07-27T00:00:00Z",
  "yusho": [
    {
      "division": "Makuuchi",
      "rikishiId": 8854,
      "shikonaEn": "Aonishiki"
    }
    // ... 5 more divisions
  ]
}
GET /api/basho/<YYYYMM>/banzuke/Makuuchi

Full 15-day per-bout records for every Makuuchi rikishi in that basho - rank, side, result, opponent, and kimarite for each match day.

Sample response
{
  "bashoId": "202507",
  "division": "Makuuchi",
  "rikishi": [
    {
      "rikishiId": 19,
      "shikonaEn": "Hoshoryu",
      "rank": "Yokozuna 1 East",
      "records": [
        { "day": 1, "result": "win",  "opponentId": 44,   "kimarite": "sotogake" },
        { "day": 2, "result": "loss", "opponentId": 13,   "kimarite": "yorikiri" },
        { "day": 3, "result": "loss", "opponentId": 8854, "kimarite": "watashikomi" }
        // ... 12 more days
      ]
    }
    // ... 41 more rikishi
  ]
}
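
As an illustration of how a records array like the one above might be reduced to a per-basho tally, here is a small sketch. The BoutRecord and Tally types and the tallyRecords function are hypothetical names. Only the "win" and "loss" result values appear in the sample, so anything else is counted under other rather than guessed at.

```typescript
// One entry of the "records" array, as shown in the sample response.
interface BoutRecord {
  day: number;
  result: string; // "win" and "loss" appear in the sample; other values are not assumed
  opponentId: number;
  kimarite: string;
}

interface Tally {
  wins: number;
  losses: number;
  other: number; // any result value this sketch does not recognise
  winningKimarite: Record<string, number>; // kimarite name -> wins by that technique
}

// Reduce a rikishi's per-day records to a win-loss tally plus winning techniques.
function tallyRecords(records: BoutRecord[]): Tally {
  const tally: Tally = { wins: 0, losses: 0, other: 0, winningKimarite: {} };
  for (const r of records) {
    if (r.result === "win") {
      tally.wins += 1;
      tally.winningKimarite[r.kimarite] = (tally.winningKimarite[r.kimarite] ?? 0) + 1;
    } else if (r.result === "loss") {
      tally.losses += 1;
    } else {
      tally.other += 1;
    }
  }
  return tally;
}
```

Applied to the three days shown in the sample, this yields one win (by sotogake) and two losses.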

Pulls last 6 bashos by default (~1 year window). Re-run npm run data daily during a basho to refresh.

Wikipedia / Wikimedia Commons

Wrestler photos via the Wikipedia REST APIs. Two endpoints are tried in sequence for each rikishi in the union of the active roster (610 wrestlers across all divisions) and the historical Makuuchi corpus (558 wrestlers since 1958): 1,095 unique lookups in total. Three quality guards keep the search fallback honest:

  • Title gate - the rikishi's shikona must start the matched Wikipedia page title. This rejects "Mudoho" → "Taihō Kōki", a match that comes up because Mudoho is Taihō's grandson and has no article of his own.
  • Description filter - rejects list pages, "X stable" pages, manga, and fictional characters with "sumo wrestler" descriptions.
  • Cross-ID dedupe - a Wikipedia URL claimed by one rikishi can't be reclaimed by another with a similar shikona.

GET /api/rest_v1/page/summary/<shikonaEn>

Direct title lookup - tried first. Succeeds when the English shikona exactly matches a Wikipedia article title.

Sample response
{
  "title": "Aonishiki Arata",
  "thumbnail": {
    "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/...",
    "width": 320,
    "height": 427
  },
  "originalimage": {
    "source": "https://upload.wikimedia.org/wikipedia/commons/5/5a/...",
    "width": 1200,
    "height": 1600
  },
  "content_urls": {
    "desktop": {
      "page": "https://en.wikipedia.org/wiki/Aonishiki_Arata"
    }
  }
}
GET /w/rest.php/v1/search/page?q=<shikona>+sumo+wrestler

Fallback keyword search when the direct lookup misses. The first result that has a thumbnail and passes the three quality guards is accepted.

Sample response
{
  "pages": [
    {
      "key": "Aonishiki_Arata",
      "title": "Aonishiki Arata",
      "thumbnail": {
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/...",
        "width": 200,
        "height": 267
      }
    }
    // ... more results
  ]
}

Cached locally in data/raw/wiki-photos.json with resolved: "summary" | "search" | "jsa" | "miss" and source: "wikipedia" | "jsa". Wikipedia photos are hotlinked from the Wikimedia thumbnail URL (CC BY-SA).
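
The three quality guards on the search fallback (title gate, description filter, cross-ID dedupe) could look roughly like the sketch below. It is an illustration under stated assumptions, not the logic in fetch-photos.ts: every function name is hypothetical, the reject patterns are guesses at what such a filter might contain, and the description field is assumed to be present on search results even though the sample response above does not show it.

```typescript
interface SearchPage {
  key: string;
  title: string;
  description?: string | null; // assumed field; not shown in the sample response
  thumbnail?: { url: string } | null;
}

// Lowercase and strip combining marks so "Taihō" compares equal to "taiho".
function fold(s: string): string {
  return s.normalize("NFD").replace(/[\u0300-\u036f]/g, "").toLowerCase().trim();
}

// Guard 1 (title gate): the page title must start with the shikona.
function passesTitleGate(shikona: string, title: string): boolean {
  return fold(title).startsWith(fold(shikona));
}

// Guard 2 (description filter): illustrative reject patterns only.
const REJECT_PATTERNS = [/\blist of\b/i, /\bstable\b/i, /\bmanga\b/i, /\bfictional\b/i];

function passesDescriptionFilter(page: SearchPage): boolean {
  const text = `${page.title} ${page.description ?? ""}`;
  return !REJECT_PATTERNS.some((re) => re.test(text));
}

// Guard 3 (cross-ID dedupe) plus the overall pick: return the first result that
// has a thumbnail, passes both filters, and is not already claimed by another
// rikishi. `claimed` maps page key -> rikishi id and is updated on acceptance.
function pickSearchResult(
  rikishiId: number,
  shikona: string,
  pages: SearchPage[],
  claimed: Map<string, number>,
): SearchPage | null {
  for (const page of pages) {
    if (!page.thumbnail) continue;
    if (!passesTitleGate(shikona, page.title)) continue;
    if (!passesDescriptionFilter(page)) continue;
    const owner = claimed.get(page.key);
    if (owner !== undefined && owner !== rikishiId) continue;
    claimed.set(page.key, rikishiId);
    return page;
  }
  return null;
}
```

With this shape, a search for "Mudoho" that returns only the "Taihō Kōki" page yields null, so the lookup falls through to the next source instead of borrowing the wrong portrait.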

JSA official profile (fallback)

When both Wikipedia lookups miss, fall back to the Japan Sumo Association's official rikishi profile page. JSA has an authoritative headshot for every currently-ranked rikishi — this lifts portrait completeness for the active banzuke to ~100%. JSA has no presence for retired wrestlers (their nskId is 0), so the historical Makuuchi corpus relies entirely on Wikipedia for the 55% of profile-page rikishi that resolve; the rest fall through to a 力 silhouette card.

GET /ResultRikishiData/profile/<nskId>/

HTML-scrape lookup: extracts the 270x474 portrait img with class="col1". Uses a polite User-Agent and a 1 req/sec rate limit (per the project's scraping ethics policy on the data shape page). The photo URL is hotlinked from sumo.or.jp/img/sumo_data/rikishi/…; credit reads "Japan Sumo Association".

JSA's robots.txt returns 404 (no rules → default-permissive). nskId is the JSA's internal rikishi identifier; it ships with the sumo-api crawl, so no extra lookup is needed.
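
Putting the three photo sources together, the resolution order (direct summary, then search, then JSA, then a miss) might be expressed as in the sketch below. The resolved and source values come from the cache format described above; the Candidates shape and the resolvePhoto name are hypothetical. For simplicity the sketch takes already-fetched candidate URLs, whereas a real crawl would presumably only request the later sources after the earlier ones miss.

```typescript
type Resolved = "summary" | "search" | "jsa" | "miss";
type Source = "wikipedia" | "jsa";

interface PhotoEntry {
  resolved: Resolved;
  source?: Source; // left out on a miss (an assumption about the cache shape)
  url?: string;
}

// Candidate URLs from each lookup; null or undefined means that lookup missed.
interface Candidates {
  nskId: number; // 0 for retired wrestlers, who have no JSA profile
  summaryUrl?: string | null;
  searchUrl?: string | null;
  jsaUrl?: string | null;
}

// Apply the fallback order: Wikipedia summary, Wikipedia search, JSA, miss.
function resolvePhoto(c: Candidates): PhotoEntry {
  if (c.summaryUrl) return { resolved: "summary", source: "wikipedia", url: c.summaryUrl };
  if (c.searchUrl) return { resolved: "search", source: "wikipedia", url: c.searchUrl };
  // JSA is only consulted for wrestlers with a real nskId.
  if (c.nskId !== 0 && c.jsaUrl) return { resolved: "jsa", source: "jsa", url: c.jsaUrl };
  return { resolved: "miss" };
}
```

A miss is what ends up rendered as the 力 silhouette card.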

Static / authored content

Some sections are not data-driven - they're factual descriptions of the sport and culture:

  • Rules & ranks (sections 02-03) - sumo regulations and the four titled ranks
  • Schedule (05) - 6-tournament calendar, venues, capacities
  • Garb & gear (06) - 12 attire/ritual items with descriptions
  • Glossary (01 + /glossary) - 37 terms across 6 cohorts, deep-linkable by romaji
  • Salary table (04) - JSA monthly pay schedule, public
  • Kimarite glossary & descriptions (17/18/19) - English glosses + mechanical detail

All authored content is in the component files, not the data pipeline. Updating these requires editing the component (intentional - they're slow-moving facts).

02 / On freshness

How current the data is

What snapshot you're seeing

Snapshot
June 3, 2026 at 1:52 PM

The site is a static build. The data above was frozen at this timestamp. To refresh: re-run npm run data && npm run build and redeploy. There's no client-side polling; nothing updates while you watch.

Coverage windows

Roster window
All currently active rikishi (621)

The full roster is pulled every build. Anyone on the books at sumo-api at fetch time is included. New entrants appear after their first banzuke listing; retirees drop off when sumo-api updates.

Banzuke window
Last 6 honbasho (~1 year)

The fetch script auto-snaps to the most recent 6 bashos (Jan/Mar/May/Jul/Sep/Nov). At time of build: 202507 through 202605. Anything older isn't pulled. Bumping the window means changing one constant in fetch-data.ts.
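
A rough sketch of how the six-basho window could be derived from the build date is shown below. The function name recentBashoIds is hypothetical, and the sketch assumes a basho counts as "most recent" as soon as its calendar month has begun; the actual script may instead wait for the tournament to start or for sumo-api to publish it.

```typescript
// Honbasho fall in odd months: Jan, Mar, May, Jul, Sep, Nov.
// Return the `count` most recent basho ids (YYYYMM), oldest first.
function recentBashoIds(now: Date, count = 6): string[] {
  const month = now.getUTCMonth() + 1; // 1..12
  // Months 1-2 map to slot 0 (Jan), 3-4 to slot 1 (Mar), ..., 11-12 to slot 5 (Nov).
  const slot = Math.floor((month - 1) / 2);
  // Number every basho consecutively so stepping back across a year boundary is plain arithmetic.
  const latest = now.getUTCFullYear() * 6 + slot;
  const ids: string[] = [];
  for (let k = count - 1; k >= 0; k--) {
    const ordinal = latest - k;
    const year = Math.floor(ordinal / 6);
    const bashoMonth = (ordinal % 6) * 2 + 1;
    ids.push(`${year}${String(bashoMonth).padStart(2, "0")}`);
  }
  return ids;
}
```

For a build on June 3, 2026 this gives 202507 through 202605, matching the window quoted above. Widening the window is then a matter of changing the count argument.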

Career stats
Full lifetime, per rikishi

Career W-L, yusho count, sansho prize counts, and division-tenure are pulled per-sekitori from sumo-api's /stats endpoint. These reach back to each rikishi's professional debut (so for a veteran like Tamawashi: ~20 years).

Refresh cadence

Live-basho refresh
Manual

During an active basho, day-by-day results land in sumo-api typically within hours of each match day ending. To pick them up here, re-run npm run data:fetch && npm run data:build && npm run build. Cron-friendly but not currently scheduled.

Photo cache
Sticky until refreshed

Wikipedia photo URLs cache in data/raw/wiki-photos.json and survive across builds. The fetch script only re-checks misses unless you pass --refresh. Photos for new sekitori (e.g. just promoted from Juryo) require a one-off npm run data:photos after their first appearance.
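
The sticky-cache rule can be sketched as a small filter over rikishi ids. The type and function names here are hypothetical, and the sketch assumes the cache is keyed by rikishi id and that ids absent from the cache are looked up the same way as recorded misses.

```typescript
type Resolved = "summary" | "search" | "jsa" | "miss";

interface CacheEntry {
  resolved: Resolved;
  source?: "wikipedia" | "jsa";
  url?: string;
}

// Cache keyed by rikishi id as a string (the key format is an assumption).
type PhotoCache = Record<string, CacheEntry>;

// Decide which ids need a fresh lookup on this run.
// Without refresh: only ids that are missing from the cache or recorded as a miss.
// With refresh (the --refresh flag): everyone.
function idsToLookUp(ids: number[], cache: PhotoCache, refresh: boolean): number[] {
  return ids.filter((id) => {
    if (refresh) return true;
    const entry = cache[String(id)];
    return entry === undefined || entry.resolved === "miss";
  });
}
```

Because resolved entries are skipped, re-running the photo step is cheap and idempotent.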

No history kept

No archive
Each build overwrites

We don't keep a history of prior derived.json snapshots. The deployed site reflects whatever was built last. If you wanted historical comparisons (e.g. "how did Yokozuna Hoshoryu's win rate change over 3 bashos?") you'd archive snapshots manually or query sumo-api directly.

03 / Pipeline

Fetch → transform → render

  1. A
    scripts/fetch-data.ts

    Hits sumo-api with throttle (80-200ms between calls). Pages through all rikishis, fetches stats for sekitori, fetches the most recent 6 bashos. Writes raw JSON files into data/raw/.

  2. B
    scripts/fetch-photos.ts

    Cached lookup of Wikipedia summaries for current Makuuchi sekitori. Direct title match first, falls back to keyword search. Idempotent (skips entries already resolved). Run with --refresh to re-check everyone.

  3. C
    scripts/build-data.ts

    Reads data/raw/* and produces src/data/derived.json. All the heavy lifting happens here: filtering to Makuuchi, parsing ranks, computing aggregates, generating per-tier breakdowns, photo merge.

  4. D
    Astro build

    src/pages/index.astro imports derived.json; each component reads what it needs and emits HTML. Output is fully static - one index.html, hashable + CDN-friendly. Zero client-side data fetches.
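
As one example of the transforms in step C, rank strings such as "Yokozuna 1 East" need to be parsed before they can be sorted or grouped. The sketch below is illustrative only: parseRank and rankKey are hypothetical names, and the parser assumes every Makuuchi rank string follows the "<title> <number> <side>" pattern seen in the samples.

```typescript
interface ParsedRank {
  title: string; // e.g. "Yokozuna", "Ozeki", "Maegashira"
  number: number; // position within the title, 1-based
  side: "East" | "West";
}

// Parse "Ozeki 1 West" into its parts; return null for anything unexpected.
function parseRank(rank: string): ParsedRank | null {
  const m = /^([A-Za-z]+) (\d+) (East|West)$/.exec(rank.trim());
  if (m === null) return null;
  return {
    title: m[1] ?? "",
    number: Number(m[2]),
    side: m[3] === "West" ? "West" : "East",
  };
}

// Banzuke order: the four titled ranks first, then Maegashira.
const TITLE_ORDER = ["Yokozuna", "Ozeki", "Sekiwake", "Komusubi", "Maegashira"];

// Sort key: lower means higher on the banzuke. East outranks West at the same number.
function rankKey(r: ParsedRank): number {
  const idx = TITLE_ORDER.indexOf(r.title);
  const titleIdx = idx === -1 ? TITLE_ORDER.length : idx; // unknown titles sort last
  return titleIdx * 1000 + r.number * 2 + (r.side === "East" ? 0 : 1);
}
```

Sorting entries by rankKey reproduces top-to-bottom banzuke order, which makes it easy to slice the division into tiers afterwards.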

Historical pipeline (/history)

The /history page draws on a separate corpus spanning November 1958 through the current basho (~67 years, 406 tournaments). Two additional scripts feed it: fetch-historical.ts crawls sumo-api for every Makuuchi banzuke and rikishi bio since 1958 (cached in data/raw/historical/); build-historical.ts aggregates those into src/data/historical.json with decade-level trends -- kimarite shifts, yokozuna reigns, the Mongolian-wave line chart, and win rates by rank tier.

Body-stats note: wrestler height and weight are JSA-reported career values from the bio endpoint, not per-basho time series. The decade averages on the body-stats chart reflect all wrestlers who appeared in Makuuchi during that decade, averaged with their static measurements -- not a snapshot of what they weighed during those specific years.
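
To make the caveat concrete, the decade averages could be computed along the lines below. The types and the decadeAverages function are hypothetical, and the sketch assumes each wrestler is counted once per decade regardless of how many basho he appeared in; the actual build may weight appearances differently.

```typescript
interface Bio {
  height: number; // cm, static career value
  weight: number; // kg, static career value
}

interface Appearance {
  bashoId: string; // "YYYYMM"
  rikishiId: number;
}

interface DecadeStats {
  avgHeight: number;
  avgWeight: number;
  wrestlers: number;
}

// Average the static measurements of everyone who appeared in each decade.
function decadeAverages(
  appearances: Appearance[],
  bios: Map<number, Bio>,
): Map<number, DecadeStats> {
  // decade (e.g. 1990) -> distinct rikishi ids seen in that decade
  const byDecade = new Map<number, Set<number>>();
  for (const a of appearances) {
    const decade = Math.floor(Number(a.bashoId.slice(0, 4)) / 10) * 10;
    let ids = byDecade.get(decade);
    if (ids === undefined) {
      ids = new Set<number>();
      byDecade.set(decade, ids);
    }
    ids.add(a.rikishiId);
  }

  const out = new Map<number, DecadeStats>();
  byDecade.forEach((ids, decade) => {
    let h = 0;
    let w = 0;
    let n = 0;
    ids.forEach((id) => {
      const bio = bios.get(id);
      if (bio === undefined) return; // no bio on file: leave this wrestler out
      h += bio.height;
      w += bio.weight;
      n += 1;
    });
    if (n > 0) out.set(decade, { avgHeight: h / n, avgWeight: w / n, wrestlers: n });
  });
  return out;
}
```

Note how a wrestler active in both the 1990s and the 2000s contributes the same height and weight to both decades, which is exactly the limitation the note describes.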

04 / Architecture

How the site is built

Repo layout

sumo/
├── package.json              # scripts: data, data:fetch, data:photos, data:build, dev, build
├── astro.config.mjs          # Astro config, port 4327
├── tsconfig.json
├── data/
│   └── raw/                  # gitignored - sumo-api dumps + wiki photo cache
├── scripts/
│   ├── fetch-data.ts         # sumo-api crawl (rikishis, stats, basho, banzuke)
│   ├── fetch-photos.ts       # Wikipedia photo lookup w/ cache
│   └── build-data.ts         # raw → derived.json, all transforms here
└── src/
    ├── data/
    │   └── derived.json      # the single source of truth at build time
    ├── lib/
    │   ├── types.ts          # all schema types (Derived, Rikishi, BanzukeEntry, etc.)
    │   ├── historical-cache.ts  # shared loader for the 1958-present corpus
    │   └── rikishi-profile.ts   # per-wrestler aggregation for the profile pages
    ├── layouts/
    │   ├── Base.astro
    │   └── MethodologyLayout.astro
    ├── pages/
    │   ├── index.astro       # the main dashboard
    │   ├── history.astro     # the longitudinal /history page (1958-present)
    │   ├── rikishi/[id].astro # 558 per-wrestler profile pages
    │   └── methodology/      # this section
    ├── styles/
    │   └── global.css        # paper/sumi-ink design tokens
    └── components/           # ~22 dashboard sections + RikishiLink (the name → profile link)

Profileable entities

Beyond the main dashboard, the site generates deep-link dossier pages for four entity types, each backed by its own build-time aggregation module:

  • Rikishi (/rikishi/[id]) - per-wrestler dossiers covering career rank arc, head-to-head history, kimarite preferences, sansho honors, and post-active kabu tenure.
  • Heya (/heya/[slug]) - per-stable profiles with full roster, all-time yusho/sansho/yokozuna output, and ichimon affiliation.
  • Basho (/basho/[YYYYMM]) - per-tournament pages for every honbasho from 1958 to the present.
  • Kabu (/kabu/[slug]) - the 104 currently-active toshiyori-kabu (年寄株), the fixed pool of inheritable elder names that retired wrestlers acquire from the Japan Sumo Association. Each kabu has a per-name lineage page tracing successive holders generation-by-generation, drawn from Japanese Wikipedia. The directory page groups all 104 by their ichimon affiliation. See the data shape page for data-source and match-quality detail.

Stack

Core

  • Astro 5 - Static site generator with file-based routing and scoped CSS. Chosen for zero-runtime output: the build emits one index.html and static assets; no JS framework ships to the browser.
  • TypeScript end-to-end - Both the data pipeline scripts (scripts/) and Astro components are TypeScript. One type system for the whole repo; src/lib/types.ts is the shared schema.
  • Plain CSS + custom-property tokens - No Tailwind, no CSS framework. The parchment palette lives in src/styles/global.css as --paper / --paper-2 / --ink / etc. Components reference tokens, never raw hex.
  • No client framework - Sidebar nav, sunburst hover, scroll-spy, and other interactive behaviors are vanilla TypeScript in <script> blocks per component. No React, Vue, or Svelte.
  • Hand-rolled SVG - All visualizations (sankey, sunburst, weight distribution) are authored directly as SVG with TypeScript math. No D3, no chart library.
  • One derived.json artifact - The single source of truth at build time. Every component reads from it; no database, no CMS, no client-side state. Deterministic and cacheable: same inputs always produce the same HTML.

Tooling & hosting

  • tsx - Runs TypeScript pipeline scripts directly without a separate compile step (tsx scripts/fetch-data.ts). Dev-dependency only; not in the Astro build.
  • @resvg/resvg-js - SVG-to-PNG rendering for OG image generation. Dev-dependency.
  • @astrojs/sitemap - Sitemap plugin; generates /sitemap-index.xml at build time.
  • Node 22 - Pinned via .nvmrc. Both local dev and Cloudflare Pages build use the same version.
  • Cloudflare Pages - Hosting with auto-deploy from main. Every PR gets a preview at a unique <hash>.sumofans-com.pages.dev URL. Build command: npm run data && npm run build; output: dist/.