NeurIPS 2025

Dec 2025

Mining every NeurIPS 2025 paper to surface the most-supported and most-contradicted claims in the field

NeurIPS 2025 insight graph visualisation: a single claim selected from the corpus. The black dot is the claim itself. Green dots are papers whose claims support it. Red dots are papers whose claims contradict it.

Motivation

I went to NeurIPS 2025. There were 5,846 papers this year and the field moves fast, so I kept wondering: how many of these papers actually contradict each other?

You can't read all of them. But you can have an LLM pull the main claims out of every paper, embed those claims, find the closest matches in the rest of the corpus, and ask another LLM whether each pair agrees or disagrees. The result is a graph. The interesting nodes are the most-supported and most-contradicted claims.

Numbers

5,846 papers (every accepted NeurIPS 2025 paper)
46,771 extracted insights with 1536-d embeddings
53,443 classified relations between insights
49,021 supporting and 4,422 contradicting (~8.3% disagreement)
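
As a quick sanity check, the headline ratios can be recomputed from the counts above. This is only arithmetic on the published numbers; the variable names are mine.

```python
# Counts copied from the list above.
papers = 5_846
insights = 46_771
supporting = 49_021
contradicting = 4_422

total_relations = supporting + contradicting
disagreement_rate = contradicting / total_relations
insights_per_paper = insights / papers
relations_per_insight = total_relations / insights

print(total_relations)                   # 53443
print(f"{disagreement_rate:.1%}")        # 8.3%
print(f"{insights_per_paper:.1f}")       # 8.0
print(f"{relations_per_insight:.2f}")    # 1.14
```

So each paper contributes about eight insights on average, and each insight ends up with slightly more than one classified relation.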

Phase 1 - extraction

Goal of this phase: turn 5,846 PDFs into a queryable table of embedded claims. It runs once, end-to-end, and writes to Postgres. Nothing downstream depends on it being live.

Parse. Each PDF is split into title, abstract, and section-level chunks. Chunks are sized so that a claim fits in one window together with enough surrounding context.
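
A minimal sketch of what such a chunker could look like, assuming sections arrive as plain text with blank lines between paragraphs. The function name, the 200-word budget and the one-paragraph overlap are illustrative assumptions, not the values the pipeline uses.

```python
def chunk_section(text: str, max_words: int = 200, overlap: int = 1) -> list[str]:
    """Greedily pack paragraphs into windows of at most max_words words.

    The last `overlap` paragraphs of a window are repeated at the start of
    the next one so a claim keeps some surrounding context. A paragraph
    longer than max_words is kept whole rather than split.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    window: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        if window and count + n > max_words:
            chunks.append("\n\n".join(window))
            window = window[-overlap:] if overlap else []
            count = sum(len(p.split()) for p in window)
        window.append(para)
        count += n
    if window:
        chunks.append("\n\n".join(window))
    return chunks


if __name__ == "__main__":
    section = "a b c\n\nd e f\n\ng h i"
    for chunk in chunk_section(section, max_words=6):
        print(repr(chunk))
```

The overlap is the cheap way to give a claim near a window boundary its context; the cost is that the same paragraph is sent to the extractor twice.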

Extract. Each chunk goes through gpt-4.1-mini with a structured tool call (see schema below). The model returns 1-30 claims per chunk, each one short and self-contained ("method X improves Y by Z on benchmark W"), plus 1-5 lowercase topic tags and a short quantitative context string. Keeping each claim self-contained matters later: an embedding of a claim that only makes sense next to its source paragraph would not separate it from unrelated claims.

Embed and store. Every insight string is embedded with text-embedding-3-small (1536-d) and inserted into a single insights table in Postgres with the pgvector extension. The embedding column has an HNSW index on cosine distance. End of phase.
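
For orientation, here is a sketch of the storage layer as plain SQL run through any DB-API cursor. The column names and the index name are assumptions (the real migrations live in Alembic and are not shown here); the vector(1536) type, the hnsw index method and the vector_cosine_ops operator class are standard pgvector.

```python
# Hypothetical DDL for the insights table; column names are assumed.
DDL = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS insights (
        id        bigserial PRIMARY KEY,
        paper_id  integer NOT NULL,        -- assumed foreign key to a papers table
        insight   text NOT NULL,
        context   text,
        tags      text[] NOT NULL,
        embedding vector(1536) NOT NULL    -- text-embedding-3-small output size
    )
    """,
    """
    CREATE INDEX IF NOT EXISTS insights_embedding_hnsw
        ON insights USING hnsw (embedding vector_cosine_ops)
    """,
]


def apply_ddl(cursor) -> None:
    """Run each statement in order on a DB-API cursor connected to Postgres."""
    for statement in DDL:
        cursor.execute(statement)
```

The operator class has to match the distance used at query time: an index built with vector_cosine_ops only serves queries that order by the cosine distance operator.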

Phase 1 pipeline (diagram):

5,846 NeurIPS 2025 papers

→ parse (per paper): pdf → text (title, abstract, sections); chunk (claim-sized windows)

→ llm extract (gpt-4.1-mini, extract_insights): claim (self-contained insight); tags (topics, methods); context (numbers, metrics)

→ embed (text-embedding-3-small): vector(1536), 46,771 insights

→ store (Postgres + pgvector): insights (text + embedding + tags); hnsw index (cosine, 1536-d)

LLM tool: extract_insights

Called once per chunk. Returns up to 30 claims with tags and quantitative context.

{
  "name": "extract_insights",
  "description": "Pull self-contained claims out of a paper chunk.",
  "parameters": {
    "type": "object",
    "required": ["insights"],
    "properties": {
      "insights": {
        "type": "array",
        "minItems": 1, "maxItems": 30,
        "items": {
          "type": "object",
          "required": ["insight", "tags"],
          "properties": {
            "insight": { "type": "string", "minLength": 10 },
            "context": {
              "type": "string",
              "description": "numbers, %, metrics, conditions"
            },
            "tags": {
              "type": "array",
              "minItems": 1, "maxItems": 5,
              "items": { "type": "string" }
            }
          }
        }
      }
    }
  }
}
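
Model output does not always respect the schema, so it is worth checking the limits again before anything is embedded. The sketch below assumes the tool-call arguments arrive as a JSON string and re-applies the limits from the schema above; the function name is hypothetical, and lowercasing the tags is my addition because the schema itself does not enforce it.

```python
import json


def parse_extract_insights(raw_arguments: str) -> list[dict]:
    """Check tool-call arguments against the extract_insights limits.

    Returns cleaned insight dicts and raises ValueError on any violation.
    """
    insights = json.loads(raw_arguments).get("insights")
    if not isinstance(insights, list) or not 1 <= len(insights) <= 30:
        raise ValueError("expected between 1 and 30 insights")
    cleaned = []
    for item in insights:
        text, tags = item.get("insight"), item.get("tags")
        if not isinstance(text, str) or len(text) < 10:
            raise ValueError("insight must be a string of at least 10 characters")
        if (not isinstance(tags, list) or not 1 <= len(tags) <= 5
                or not all(isinstance(t, str) for t in tags)):
            raise ValueError("tags must be a list of 1 to 5 strings")
        context = item.get("context")  # optional in the schema
        if context is not None and not isinstance(context, str):
            raise ValueError("context must be a string when present")
        cleaned.append({
            "insight": text.strip(),
            "context": context,
            "tags": [t.strip().lower() for t in tags],
        })
    return cleaned


if __name__ == "__main__":
    raw = json.dumps({"insights": [
        {"insight": "Method X improves Y by Z on benchmark W", "tags": ["Optimisation"]}
    ]})
    print(parse_extract_insights(raw))
```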

Phase 2 - search and classification

This phase reads the table that phase 1 produced. It does not need phase 1 to be running. You could rerun phase 2 with different parameters (different k, different prompt) without re-extracting anything.

Nearest neighbours. For each insight, do an approximate kNN over the HNSW index to get the top-k closest insights from other papers. These are candidates: topically nearby, not necessarily in agreement. Pure embedding similarity can't tell "X says scale helps" from "X says scale doesn't help" because both look semantically close.
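
To make the candidate step concrete, here is an exact brute-force version of the lookup in plain Python. It is a stand-in for the approximate HNSW search, not the pipeline's code; the row layout and the paper_id filter column are assumptions. The SQL string shows what an equivalent pgvector query could look like, using pgvector's cosine distance operator `<=>`.

```python
import math

# Assumed shape of the equivalent pgvector query (not executed here).
KNN_SQL = """
SELECT id FROM insights
WHERE paper_id <> %(paper_id)s
ORDER BY embedding <=> %(embedding)s
LIMIT %(k)s
"""


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def nearest_from_other_papers(focal: dict, rows: list[dict], k: int = 5) -> list[dict]:
    """Return the k rows closest to focal, skipping rows from the same paper."""
    candidates = [r for r in rows if r["paper_id"] != focal["paper_id"]]
    candidates.sort(key=lambda r: cosine_distance(focal["embedding"], r["embedding"]))
    return candidates[:k]


if __name__ == "__main__":
    focal = {"id": 1, "paper_id": 1, "embedding": [1.0, 0.0]}
    rows = [
        {"id": 2, "paper_id": 1, "embedding": [1.0, 0.0]},   # same paper, skipped
        {"id": 3, "paper_id": 2, "embedding": [0.9, 0.1]},
        {"id": 4, "paper_id": 3, "embedding": [0.0, 1.0]},
        {"id": 5, "paper_id": 2, "embedding": [-1.0, 0.0]},
    ]
    print([r["id"] for r in nearest_from_other_papers(focal, rows, k=2)])
```

Nothing in this step looks at polarity: a claim and its negation can sit at almost the same distance from the focal insight, which is why the candidates still need a classifier.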

Classify. Each candidate pair (focal insight, neighbour) is sent to gpt-4.1-mini with a second tool call (see schema below). The model labels each pair as supporting or contradicting and returns a short reasoning string. Pairs the model considers unrelated are simply omitted from the response, which keeps the schema small and avoids forcing a label onto pairs that have no real relation.

The output is the insight_relations table: 53,443 directed edges, 49,021 supporting and 4,422 contradicting.

Phase 2 pipeline (diagram):

insights table (from phase 1)

→ knn (per insight, top-k): hnsw lookup (approximate kNN over 46k); candidates (topically nearby insights)

→ llm classify (gpt-4.1-mini, compare_insights): supporting (claim agrees); contradicting (claim disagrees); unrelated (omitted from response)

→ store (insight_relations): 49,021 supporting; 4,422 contradicting

LLM tool: compare_insights

Called once per focal insight with its top-k candidates. Returns one label per related candidate; unrelated ones are omitted.

{
  "name": "compare_insights",
  "description": "Label each candidate as supporting or contradicting the focal insight.",
  "parameters": {
    "type": "object",
    "required": ["relations"],
    "properties": {
      "relations": {
        "type": "array",
        "items": {
          "type": "object",
          "required": ["other_id", "relation", "reasoning"],
          "properties": {
            "other_id": { "type": "integer" },
            "relation": { "enum": ["supporting", "contradicting"] },
            "reasoning": { "type": "string", "minLength": 10 }
          }
        }
      }
    }
  }
}
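
Turning a compare_insights response into rows for insight_relations could look roughly like this. It is a sketch under assumptions: the arguments arrive as a JSON string, the edge tuple layout is hypothetical, and dropping ids that were never offered as candidates is my own safeguard rather than something described above.

```python
import json

ALLOWED_RELATIONS = {"supporting", "contradicting"}


def relations_to_edges(focal_id: int, candidate_ids: list[int],
                       raw_arguments: str) -> list[tuple[int, int, str, str]]:
    """Convert a compare_insights response into directed edges.

    Candidates missing from the response are treated as unrelated and
    produce no edge. Ids that were not candidates, duplicate ids and
    unknown labels are skipped.
    """
    offered = set(candidate_ids)
    seen: set[int] = set()
    edges = []
    for rel in json.loads(raw_arguments).get("relations", []):
        other_id, label = rel.get("other_id"), rel.get("relation")
        if other_id not in offered or other_id in seen:
            continue
        if label not in ALLOWED_RELATIONS:
            continue
        seen.add(other_id)
        edges.append((focal_id, other_id, label, rel.get("reasoning", "")))
    return edges


if __name__ == "__main__":
    raw = json.dumps({"relations": [
        {"other_id": 2, "relation": "supporting", "reasoning": "Same effect, same benchmark."},
        {"other_id": 4, "relation": "contradicting", "reasoning": "Reports the opposite trend."},
        {"other_id": 99, "relation": "supporting", "reasoning": "Id was never a candidate."},
    ]})
    for edge in relations_to_edges(1, [2, 3, 4], raw):
        print(edge)
```

In the demo, candidate 3 is absent from the response, so it counts as unrelated and no edge is written for it.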

Querying

With the graph in Postgres you can ask things like "which claims have the most contradicting evidence?", "show me all papers that disagree with paper X", or "find the claim most supported across the corpus". The HNSW index makes per-claim neighbour lookups cheap, so the same infrastructure works for an interactive web UI on top.
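
As an illustration of the first question, here is a self-contained sketch that runs the aggregate on toy rows. It uses the standard-library sqlite3 module as a stand-in for Postgres so it can run anywhere; the column names (focal_id, other_id, relation) and the claim texts are made up for the example. The SELECT itself is plain SQL, though a Postgres driver would use a different placeholder style.

```python
import sqlite3

TOP_CLAIMS_SQL = """
SELECT i.id, i.insight, COUNT(*) AS n_relations
FROM insight_relations AS r
JOIN insights AS i ON i.id = r.focal_id
WHERE r.relation = ?
GROUP BY i.id, i.insight
ORDER BY n_relations DESC, i.id
LIMIT ?
"""


def top_claims(conn: sqlite3.Connection, relation: str, limit: int = 10) -> list[tuple]:
    """Claims with the most outgoing edges of the given relation type."""
    return conn.execute(TOP_CLAIMS_SQL, (relation, limit)).fetchall()


def demo_connection() -> sqlite3.Connection:
    """In-memory database with a few toy rows (not real corpus data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE insights (id INTEGER PRIMARY KEY, paper_id INTEGER, insight TEXT)")
    conn.execute("CREATE TABLE insight_relations (focal_id INTEGER, other_id INTEGER, relation TEXT)")
    conn.executemany("INSERT INTO insights VALUES (?, ?, ?)",
                     [(1, 10, "claim A"), (2, 11, "claim B"), (3, 12, "claim C")])
    conn.executemany("INSERT INTO insight_relations VALUES (?, ?, ?)",
                     [(1, 2, "contradicting"), (1, 3, "contradicting"),
                      (2, 1, "supporting"), (3, 1, "contradicting")])
    return conn


if __name__ == "__main__":
    print(top_claims(demo_connection(), "contradicting", limit=1))
```

Because the edges are directed, this counts only relations where the claim is the focal side; counting incoming edges as well would need a second pass over other_id.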

Technology

Postgres, pgvector (HNSW, cosine), OpenAI gpt-4.1-mini for extraction and classification, text-embedding-3-small for embeddings, Python + Alembic for the pipeline, Next.js + Three.js for the visualisation.
