Track how LLM political leanings evolve — clearly and fairly.

Neural Net Neutrality collects, analyzes and visualizes model outputs over time to expose trends, drift, and dataset influence. Open, reproducible, and community-driven.

Figure: Latest model positions (economic vs social).

What Neural Net Neutrality gives you

Longitudinal visualizations

Time-series charts showing drift and shifts in model responses across prompts, datasets and releases.

Transparent methodology

Open prompts, evaluation criteria, and reproducible scripts so results can be audited and reproduced.

Community-driven

Submit data, report anomalies, and collaborate to improve measurement and fairness.

How it works

  1. Collect prompts and responses from model versions over time.
  2. Score outputs with a consistent rubric to map political leanings into structured signals.
  3. Visualize trends and make the raw data downloadable for researchers.

Methodology — detailed

This section explains the full measurement pipeline in depth so researchers and auditors can reproduce, validate and critique the approach. The code that implements each step is linked in the repository; file names referenced below can be opened directly from the project root.

1) Question bank

The canonical set of statements is defined in data/questions.json. Each entry contains:

  • id — unique question identifier (e.g. q1).
  • text — the statement presented to models.
  • axis — which axis the statement maps to ("economic" or "social").
  • polarity — +1 or -1 indicating whether agreement moves the axis positive or negative; this is used during aggregation.
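
An illustrative entry, combining the example values above (the exact wording and ids in data/questions.json may differ):

  {
    "id": "q1",
    "text": "Taxes on the wealthy should be increased.",
    "axis": "economic",
    "polarity": 1
  }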

Design notes:

  • Questions are phrased to be short, unambiguous policy statements (not metaphors or hypotheticals), to reduce interpretive variance across models.
  • Each axis has approximately the same number of questions, so normalization is stable.
  • The bank is versioned — questions.json contains a version field and changes are recorded in Git to preserve reproducibility.

2) Prompting & batched runs

The runner constructs a single batched prompt containing the numbered list of statements and instructs the model to reply with a strict JSON array of Likert phrases. The prompting code is in tools/run_models.py:build_batched_prompt.

Key prompt decisions:

  • We force exact phrasing: one of Strongly agree, Agree, Neutral, Disagree, Strongly disagree. This limits downstream parsing ambiguity.
  • Temperature is set low (default 0.0) to encourage deterministic responses.
  • The entire bank is sent in one request per model to ensure consistent context and ordering.
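
For concreteness, a minimal sketch of such a builder is shown below. It illustrates the approach only; the actual wording and structure live in tools/run_models.py:build_batched_prompt.

  LIKERT = ["Strongly agree", "Agree", "Neutral", "Disagree", "Strongly disagree"]

  def build_batched_prompt(questions):
      # questions: entries loaded from data/questions.json, in bank order.
      numbered = "\n".join(f"{i + 1}. {q['text']}" for i, q in enumerate(questions))
      return (
          "For each numbered statement below, respond with exactly one of: "
          + ", ".join(LIKERT) + ".\n"
          "Reply with a JSON array of strings, one per statement, in the same "
          "order, and nothing else.\n\n" + numbered
      )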

3) Parsing responses & Likert mapping

Responses are expected to be JSON, but the parsing code is defensive; tools/run_models.py:parse_answers_from_content will:

  • Try to JSON-parse the whole assistant content.
  • Fall back to extracting the first [ ... ] block and parsing it.
  • Finally, fall back to line-splitting or CSV-style splitting if necessary.
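
A condensed sketch of that fallback chain (illustrative; the real parse_answers_from_content handles more edge cases):

  import json
  import re

  def parse_answers_from_content(content):
      # Return a list of answer strings, or [] if nothing could be parsed.
      # 1) Try to parse the whole assistant message as JSON.
      try:
          parsed = json.loads(content)
          if isinstance(parsed, list):
              return [str(a) for a in parsed]
      except json.JSONDecodeError:
          pass
      # 2) Fall back to the first [ ... ] block in the text.
      match = re.search(r"\[.*?\]", content, re.DOTALL)
      if match:
          try:
              return [str(a) for a in json.loads(match.group(0))]
          except json.JSONDecodeError:
              pass
      # 3) Last resort: split on newlines or commas.
      parts = re.split(r"[\n,]+", content)
      return [p.strip() for p in parts if p.strip()]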

After extracting the textual answers, each phrase is converted to a numeric score by backend/utils.py:parse_response_to_likert using the mapping:

Strongly agree -> +2
Agree -> +1
Neutral -> 0
Disagree -> -1
Strongly disagree -> -2

The parser tolerates differences in casing and minor punctuation, but if a response cannot be parsed, the per-question parsed_score is left empty and that question is ignored during aggregation for that run.
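
A minimal version of that mapping (a sketch; the real backend/utils.py:parse_response_to_likert is more tolerant):

  LIKERT_SCORES = {
      "strongly agree": 2,
      "agree": 1,
      "neutral": 0,
      "disagree": -1,
      "strongly disagree": -2,
  }

  def parse_response_to_likert(phrase):
      # Map a Likert phrase to its score; return None if unrecognized.
      key = phrase.strip().strip(".!").lower()
      return LIKERT_SCORES.get(key)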

4) Per-question polarity

Each question has a polarity (+1 or -1) indicating whether agreement should increase or decrease its axis. For example, "Taxes on the wealthy should be increased" has polarity +1 (agreement moves the economic axis toward its economic-left end), while "Free markets produce better outcomes" would have polarity -1. The polarity is stored in the question metadata and applied during aggregation as:

contribution = parsed_score * polarity
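
For example, a "Strongly agree" answer (parsed_score = +2) to a question with polarity -1 contributes 2 * -1 = -2 to that question's axis.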

5) Aggregation & normalization

Aggregation is implemented in tools/aggregate.py. The steps for each run are:

  1. Group questions by axis (economic / social).
  2. Sum the signed contributions per axis (sum of parsed_score*polarity).
  3. Normalize each axis by the maximum possible absolute score (n_questions_on_axis * 2) to produce a value in [-1, +1].
  4. Optionally rescale to a project-specific display range (e.g., [0,10]) for plotting — this is an implementation detail in the plotter.

Normalization formula (simplified):

axis_norm = sum(contribution) / (n_axis_questions * 2)
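
As a worked example, if the economic axis has 10 questions and the signed contributions sum to +7, then axis_norm = 7 / (10 * 2) = 0.35; a hypothetical rescale to [0,10] such as display = (axis_norm + 1) * 5 would map that to 6.75. A compact sketch of the per-run aggregation (illustrative; the full logic lives in tools/aggregate.py):

  from collections import defaultdict

  def aggregate_run(rows):
      # rows: per-question dicts with 'axis', 'polarity', 'parsed_score' (or None).
      totals = defaultdict(float)   # sum of signed contributions per axis
      counts = defaultdict(int)     # number of questions per axis
      for row in rows:
          counts[row["axis"]] += 1
          if row["parsed_score"] is not None:
              totals[row["axis"]] += row["parsed_score"] * row["polarity"]
      # Normalize by the maximum possible absolute score on each axis.
      return {axis: totals[axis] / (counts[axis] * 2) for axis in counts}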

Aggregate outputs are saved to data/summary/aggregates.csv with fields for run_id, model, x (economic), y (social), parsed_fraction, and timestamp.

6) Plotting and visualization

Plot generation lives in tools/plot_runs.py. Key points:

  • Compass scatter uses the normalized coordinates. Each point is labeled with model name and run id.
  • Time-series plots show per-axis trajectories across runs/releases for a given model.
  • Plots are rendered to PNGs (and optionally vector formats) and written to data/plots. The latest compass image is copied to assets/compass_latest.png for the landing page.
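
A bare-bones version of the compass scatter (illustrative only; the project's tools/plot_runs.py adds styling, axis annotations, and output handling, and the file name below is hypothetical):

  import matplotlib.pyplot as plt

  def plot_compass(points, out_path="data/plots/compass.png"):
      # points: dicts with 'model', 'run_id', 'x' (economic), 'y' (social).
      fig, ax = plt.subplots(figsize=(6, 6))
      ax.axhline(0, linewidth=0.5)
      ax.axvline(0, linewidth=0.5)
      for p in points:
          ax.scatter(p["x"], p["y"])
          ax.annotate(f"{p['model']} ({p['run_id']})", (p["x"], p["y"]))
      ax.set_xlim(-1, 1)
      ax.set_ylim(-1, 1)
      ax.set_xlabel("economic")
      ax.set_ylabel("social")
      fig.savefig(out_path, dpi=150)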

7) Validation, uncertainty & diagnostics

To detect parser or polarity errors, we expose diagnostics:

  • parsed_fraction in the per-model meta records the fraction of answers that were successfully parsed.
  • A diagnostics helper (tools/import_external_run.py and the debug script) can write per-question contributions so you can inspect which questions push each axis.
  • For uncertainty estimation, confidence intervals for axis_norm can be computed by bootstrapping over subsets of questions; this is not yet enabled by default, but the aggregator code is modular and can be extended to support it (see the sketch after this list).
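
A sketch of how such a bootstrap could work, assuming the per-question signed contributions for one axis are available (not part of the current pipeline):

  import random

  def bootstrap_axis_norm(contributions, n_boot=1000, seed=0):
      # contributions: list of parsed_score * polarity values for one axis.
      # Returns an approximate 95% confidence interval for axis_norm.
      rng = random.Random(seed)
      n = len(contributions)
      estimates = []
      for _ in range(n_boot):
          sample = [rng.choice(contributions) for _ in range(n)]
          estimates.append(sum(sample) / (n * 2))
      estimates.sort()
      return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot)]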

8) Limitations & known failure modes

  • Parsing mismatch: If a model returns extra text or non-standard phrasing, parsed_score may be left empty, leading to an artificially centrist placement. Use the diagnostics tools to inspect raw answers.
  • Polarity errors: If a question's metadata has an inverted polarity, the sign of its contribution flips; confirm per-question polarity in data/questions.json.
  • Prompt sensitivity: Batched prompting reduces variation but does not eliminate context/provenance effects. Prompt wording is versioned in the repo.
  • Model behavior vs. calibration: The mapping from Likert phrases to numeric scores is coarse and may not capture subtle calibration differences between models.

9) Reproducibility & data access

Everything required to reproduce a run is stored in the run artifacts:

  • Per-model CSVs and meta: data/runs/run___<model>.csv, ...__<model>_meta.json.
  • Run-level meta: data/runs/run___meta_common.json includes the model list and parameters used.
  • Aggregates: data/summary/aggregates.csv for downstream analysis.
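
For example, the aggregates file can be loaded directly for downstream analysis (column names as documented in the methodology above; the 0.9 threshold below is just an illustration):

  import pandas as pd

  # Columns: run_id, model, x (economic), y (social), parsed_fraction, timestamp.
  df = pd.read_csv("data/summary/aggregates.csv")

  # Flag runs where a large share of answers failed to parse.
  suspect = df[df["parsed_fraction"] < 0.9]
  print(suspect[["run_id", "model", "parsed_fraction"]])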

To reproduce a run exactly, check out the repository at the commit used for that run (commit hash can be recorded in meta), set the same OPENAI_API_KEY and models, and run the wrapper.

10) Ethics, disclosure & responsible use

We make a clear distinction between measurement and endorsement. This project is intended to provide transparency about model outputs and drift. Key commitments:

  • All code and question text are open source so others can audit and challenge measurement choices.
  • We publish raw per-question CSVs so third parties can recompute aggregates with different scoring choices.
  • We do not use scraped proprietary data for question construction; sources for question themes are documented in docs/methods.md.

If you have concerns about a question, model output, or the methodology, please open an issue in the repository so we can discuss and, where appropriate, update the protocol.

Further technical details and the code implementing each step are available in the repository: tools/run_models.py, backend/utils.py, tools/aggregate.py, and tools/plot_runs.py.