Methodology

How Caller Track measures influencer accuracy.

The Methodology in 60 Seconds

Only real buy calls count. A three-model AI ensemble sorts every tweet into eleven categories — jokes, warnings, commentary, and recaps are not calls, and anything the models can't commit to is excluded, never guessed.
The record starts at the first recommendation. A caller's call is timestamped the first time they post it; re-posts and victory laps don't reset the clock.
Entry is the price you could have gotten. The low of the 1-minute candle at post time — deliberately the caller-friendly price. A call hits if the token reaches the chosen multiplier (2x–10x) within the window, on minute-level on-chain prices.
Completeness is enforced, not assumed. Two independent X data sources are cross-checked and re-queried until they agree; a run that can't prove it captured the full history publishes nothing.
Tickers resolve to the exact on-chain token. A pasted contract address is unambiguous; bare tickers resolve by liquidity and how close the token's creation is to the call.
Every doubt favors the caller. Calls without price data are excluded rather than counted as misses, and every near-miss is re-checked across every price provider before it stays a miss.
No pay-to-play. No financial relationship with any caller; verification is free and changes nothing about how a score is computed.

The full methodology below is the canonical version — every rule, every edge case, every limitation.

What We Measure

Caller Track timestamps the first time a caller makes a direct buy recommendation — by contract address or ticker — on X (Twitter), then tracks whether the token reached a chosen multiplier (2x, 3x, 5x, or 10x) within a time window, scored against minute-level on-chain OHLCV pricing. The default 14-day window captures the slower-developing plays common in meme coins; shorter windows (24h, 48h, 72h, 7d) offer stricter evaluation. The result is a verifiable hit rate: the percentage of calls that reached the target.

Only direct buy recommendations are scored — and a paid shill is still a buy recommendation the caller made, so promoted calls are scored the same as organic ones. Commenting on, warning about, or joking about a token is not a buy recommendation — but most tracking tools count every mention as a call, inflating hit rates with noise. An LLM classification engine separates buy calls from everything else. Major tokens are also excluded from scoring: a token already above $1 billion market cap with a trading pair older than 180 days at call time is not a gem call — scoring it would add market beta, not skill. Excluded calls stay visible in the call log, labeled. The same methodology applies to every caller regardless of verification status.
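The major-token exclusion above can be sketched as a single predicate. This is a minimal illustration, assuming the market cap and the trading pair's creation time are already known at call time; the function and constant names are hypothetical, not Caller Track's own.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the text; the names are illustrative.
MAJOR_MARKET_CAP_USD = 1_000_000_000   # "above $1 billion market cap"
MAJOR_PAIR_AGE = timedelta(days=180)   # "trading pair older than 180 days"


def is_major_token(market_cap_usd: float,
                   pair_created_at: datetime,
                   call_time: datetime) -> bool:
    """Return True when BOTH conditions hold at call time.

    A True result means the call is excluded from scoring but, per the
    text, would still be shown in the call log with a label.
    """
    pair_age = call_time - pair_created_at
    return market_cap_usd > MAJOR_MARKET_CAP_USD and pair_age > MAJOR_PAIR_AGE
```

Both conditions must hold together, so a large but newly listed token, or an old but small one, would still be scored under this reading.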

The Classification Problem

The hardest problem in caller accountability is deciding what counts as a call. A tweet with a ticker could be a buy recommendation, a sell warning, a joke, a follow-up, a casual mention, or paid promotion. Treating every mention as a call — as most tools do — makes every caller look better than they are.

A large language model classifies each post into one of eleven categories: buy call, sell call, commentary, news, shill, joke, casual mention, warning, follow-up, hold, or repost. Few-shot prompting is anchored to a gold set of hundreds of manually ruled classifications spanning diverse caller styles. Tweets that cannot contain a call — pure retweets, emoji-only posts, no token reference — are filtered out before classification.

Two thresholds govern the system. A tweet needs at least 75% classification confidence to be scored at all, and any tweet the primary model rules with less than 85% confidence is routed to two independent helper models for review: same prompt, and no model sees another's output. The primary and the two helpers form a three-model ensemble; two-of-three majority rules, even against the primary, and full disagreement keeps the primary. A certified scorecard is always produced by the complete three-model ensemble: if a helper is unavailable and enough tweets needed review, the run aborts rather than quietly scoring on the primary alone. Nothing is published on a degraded ensemble.
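The voting rule can be illustrated with a short sketch. Several things here are assumptions rather than documented behavior: the 75% gate is applied to the primary model's confidence before any review, helper rulings arrive as plain labels (with `None` standing in for an unavailable helper), and the sketch aborts on the first missing helper because the text does not say how many affected tweets trigger an abort.

```python
from collections import Counter
from typing import Optional, Sequence

SCORE_GATE = 0.75    # below this, the tweet is not scored at all
REVIEW_GATE = 0.85   # below this, the two helper models review the ruling


class DegradedEnsembleError(RuntimeError):
    """A tweet needed review but a helper ruling was unavailable."""


def resolve_label(primary_label: str,
                  primary_conf: float,
                  helper_labels: Sequence[Optional[str]]) -> Optional[str]:
    """Return the final category, or None if the tweet is excluded."""
    if primary_conf < SCORE_GATE:
        return None                       # cannot commit: excluded, never guessed
    if primary_conf >= REVIEW_GATE:
        return primary_label              # confident enough: no review needed
    if len(helper_labels) != 2 or any(label is None for label in helper_labels):
        # Simplification: abort immediately instead of counting affected tweets.
        raise DegradedEnsembleError("review required but ensemble incomplete")
    votes = Counter([primary_label, *helper_labels])
    label, count = votes.most_common(1)[0]
    # Two-of-three majority wins, even against the primary;
    # a three-way split keeps the primary's ruling.
    return label if count >= 2 else primary_label
```

With three voters, at most one label can reach two votes, so the majority branch is unambiguous.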

The category boundary is deliberate: clear forward conviction counts as a call even when phrasing is borderline, because a missed call corrupts a record in either direction — and the confidence gate still excludes anything the model cannot commit to. The examples that anchor the prompt are a fixed, human-reviewed set. They do not grow or change between runs, and no automatic retrieval selects examples per tweet — the same prompt scores every caller, so a score does not drift as the system sees more data.

One boundary case is worth stating plainly: a post containing a token's contract address is scored as a call at its post time — including posts made while the token is already running. Pasting an address mid-pump hands the audience something directly buyable, and the caller claims the credit if it keeps going, so the record treats it the way a follower would experience it: entry at the low of the 1-minute candle containing that post. A victory lap with an address attached is a call at lap-time prices; pure recaps without an address are not calls. This cuts against the caller more often than it helps them, and that is the point.

Once a tweet has been ruled under a given version of the classifier, that ruling is recorded and reused. Re-scoring the same caller does not re-roll the classifier — it reads back the ruling made the first time, so the same tweet always lands in the same category and the same call always counts the same way. Language models are not perfectly repeatable even at their most deterministic setting, so without this a caller could be scored twice and get two slightly different numbers. Pinning the first ruling makes a scorecard reproducible: the same caller under the same configuration yields the same result every time. If the classifier configuration itself changes, that is a new version, and tweets are ruled fresh under it.
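One way to picture ruling reuse is a store keyed by classifier version and tweet ID. This is a hedged sketch: `RulingStore`, its in-memory dictionary, and the `classify` callable are hypothetical stand-ins, and a real system would presumably persist rulings rather than hold them in memory.

```python
from typing import Callable, Dict, Tuple


class RulingStore:
    """Pins the first ruling made for a tweet under a classifier version."""

    def __init__(self, classify: Callable[[str], str]) -> None:
        self._classify = classify                       # possibly non-deterministic
        self._rulings: Dict[Tuple[str, str], str] = {}  # (version, tweet_id) -> label

    def ruling(self, tweet_id: str, text: str, classifier_version: str) -> str:
        key = (classifier_version, tweet_id)
        if key not in self._rulings:
            # First time under this version: classify once and record it.
            self._rulings[key] = self._classify(text)
        # Every later read returns the recorded ruling, never a re-roll.
        return self._rulings[key]
```

Because the version is part of the key, changing the classifier configuration naturally produces fresh rulings while leaving earlier ones intact.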

The Completeness Problem

A caller's score is only as reliable as the data behind it. Missing a buy call that hit 10x makes the caller look worse; missing one that went to zero makes them look better. Either way the score can't be trusted — and X makes completeness genuinely hard: its search is eventually consistent, so a single scrape of a high-volume day can silently capture only 95-97% of tweets. Most tools haven't tested for this.

No system can prove it captured every tweet a person ever posted — deleted posts, private replies, and platform gaps are outside anyone's reach. We do not claim omniscience. What we claim is a defined verification standard, and a certified run must clear it before it publishes.

A self-hosted scraper queries two independently backed X data sources (a search index covering four years of history and a timeline endpoint covering recent months) and unions them, covering gaps neither catches alone. That is the dual-endpoint cross-check: a tweet one source misses is recovered by the other, so the published set is the wider of the two rather than whichever source happened to lag. On high-volume days a multi-pass process re-queries with different credentials to sample different replicas and stops only once a pass recovers nothing new; two passes reaching the same set is the concordance signal that the day is fully captured. Because each pass is unioned into the running set, the multi-pass result is always a superset of any single pass: it never loses tweets, only recovers missed ones.
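The multi-pass loop can be sketched as a union that runs until a pass adds nothing. This is an illustration under assumptions: `fetch_pass` is a hypothetical callable returning the tweet IDs one pass saw, and `max_passes` is an invented safety cap that the text does not specify.

```python
from typing import Callable, Set


def capture_day(fetch_pass: Callable[[int], Set[str]],
                max_passes: int = 10) -> Set[str]:
    """Union repeated passes until one recovers nothing new.

    Raises instead of returning a partial set, mirroring the
    fail-closed behavior the methodology describes.
    """
    captured: Set[str] = set()
    for pass_index in range(max_passes):
        found = fetch_pass(pass_index)
        new_ids = found - captured
        captured |= found              # union: tweets are only ever added
        if pass_index > 0 and not new_ids:
            return captured            # this pass added nothing: converged
    raise RuntimeError("passes did not converge; publishing nothing")
```

The first pass can never signal convergence on its own, so at least two passes always run before a day is accepted.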

Originals, replies, and quote tweets are captured; pure retweets are excluded — a retweet is not the caller's own recommendation. Capture has been manually spot-checked in both directions across four years of history, and automated consistency testing across independent runs has shown full or near-full agreement, including identical tweet-for-tweet output on a worst-case high-volume month. Completeness is enforced, not assumed: a run that cannot meet the verification standard fails closed and produces no scorecard rather than publishing an incomplete one.

Two narrow, fully-recorded exceptions keep the standard honest rather than brittle. When a re-scrape transiently misses a tweet that a prior verified scrape of the same caller captured in full — X's search index is eventually consistent — the record keeps the union: at most a handful of tweets per run, each re-verified as still live at proof time, each listed in the run's verification record. And once a tweet has been captured in a verified scrape, deleting it does not remove it from the record: captured-then-deleted tweets are retained with proof of the deletion, because a record a caller can edit after the fact is not a record. Tweets deleted before we ever observed them remain outside anyone's reach, as stated under Limitations.

The Resolution Problem

When a caller tweets “$PEPE,” which token do they mean? Dozens share that ticker across Solana, Base, Ethereum, and BSC. Matching the wrong one means scoring against the wrong price.

Resolution uses on-chain pair scoring with liquidity weighting, multi-pair ecosystem detection, and temporal proximity — tokens created near the call date are preferred over older tokens with the same ticker, preventing false matches to abandoned or migrated contracts. Dead migration pools are penalized, near-zero-volume pairs deprioritized, and curated blocklists (nearly 300 entries) filter false positives like “HODL” and “FOMC.” A posted contract address resolves exactly — no ambiguity. Resolved data is cached, so the same lookup always returns the same matches.

When a tweet contains both a contract address and ticker symbols, the address takes priority: it is the exact token the caller pasted, and the tickers are treated as context. One exception: if the tweet's own words identify the address as something other than the call — for example a position the caller says they sold while recommending a different token — the target classifier can override the address and score the named ticker instead. If the classifier keeps both, the address wins and only the address is scored; a second token named alongside a pasted address is the one shape this pipeline can undercount.

The Pricing Problem

Entry price determines everything downstream — a few cents changes whether a call hit 2x. Entry is the low of the 1-minute candle at call time: the best price available in the minute the call was posted. That is deliberately conservative in the caller's favor — a higher entry would mean lower multipliers and fewer hits.

Some 1-minute candles report zero traded volume: no swap landed in that exact minute, and the price provider carries the last traded price forward as a flat candle. Those candles count as valid entries. On automated market makers the quoted price is continuous — it holds between trades — so a flat, zero-volume candle is the real price anyone could have transacted at in that minute. Rejecting such candles would silently drop calls, and the dropped calls skew toward misses, which would inflate hit rates.
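The entry rule can be shown in a few lines. This sketch assumes candles are held in a dictionary keyed by the minute they open (timezone-aware UTC); the `Candle` type and `entry_price` function are illustrative names only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass(frozen=True)
class Candle:
    open: float
    high: float
    low: float
    close: float
    volume: float


def entry_price(candles: Dict[datetime, Candle],
                posted_at: datetime) -> Optional[float]:
    """Low of the 1-minute candle containing the post time.

    Returns None when no candle exists for that minute (no price data).
    """
    minute = posted_at.replace(second=0, microsecond=0)  # floor to the minute
    candle = candles.get(minute)
    if candle is None:
        return None
    # No volume check on purpose: a flat, zero-volume candle is a valid entry.
    return candle.low
```

Skipping any volume filter is the point of the design: rejecting zero-volume candles would drop calls that, per the text, skew toward misses.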

Peak price is the highest price within the selected window; the multiplier is peak divided by entry; a call is a “hit” if the multiplier reached the target in the window, and hit rate is hits over scored calls. Unscored calls — where price data could not be resolved — are excluded with a recorded reason rather than counted as misses, again favoring the caller. Before a scorecard can be certified, every no-data call whose failure was a provider-coverage gap — rather than a settled verdict like an unsupported chain or an invalid address — is given one more full pass through every price provider, with the cache bypassed, in case coverage has improved since. Any call that now returns real price data is scored; only calls that still return nothing anywhere are set aside, each with its recorded reason. Pump velocity (time-to-target) is tracked for each hit. Candle highs implying more than 100,000x the entry are dropped as bad provider prints — one garbage candle should not fabricate a hit.
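The arithmetic in this section is compact enough to write out. In this sketch, `highs` is assumed to hold the candle highs inside the selected window, `entry` is assumed positive, and a multiplier of `None` stands for an unscored call; all names are illustrative.

```python
from typing import Iterable, Optional, Sequence

MAX_CREDIBLE_MULTIPLE = 100_000  # highs implying more than this are bad prints


def peak_multiplier(entry: float, highs: Iterable[float]) -> Optional[float]:
    """Peak divided by entry, ignoring highs above the credibility cap."""
    credible = [h / entry for h in highs if h / entry <= MAX_CREDIBLE_MULTIPLE]
    return max(credible) if credible else None


def hit_rate(multipliers: Sequence[Optional[float]],
             target: float) -> Optional[float]:
    """Hits over scored calls; unscored calls (None) are excluded, not misses."""
    scored = [m for m in multipliers if m is not None]
    if not scored:
        return None
    hits = sum(1 for m in scored if m >= target)
    return hits / len(scored)
```

For example, four calls with multipliers 3.0, 1.2, unscored, and 2.0 at a 2x target give two hits over three scored calls.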

Different price providers can disagree slightly on the peak a token reached, because they index different trades. That disagreement only matters near the line between a hit and a miss. So any call scored as a miss that still came within striking distance of its target — at least 80% of the way there — is re-checked against every other price provider, and the highest credible peak wins. This can only turn a near-miss into a hit; it can never turn a hit into a miss. A caller is never scored a miss on one provider's data when another provider's data shows the target was reached.
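A minimal sketch of the near-miss re-check follows. It assumes "80% of the way there" means the primary provider's multiplier reached at least 0.8 times the target, which is one plausible reading of the text, and it takes the other providers' credible multipliers as an already-filtered list; the function name is hypothetical.

```python
from typing import Sequence

NEAR_MISS_FRACTION = 0.8  # assumed reading of "at least 80% of the way there"


def recheck_near_miss(primary_multiplier: float,
                      other_multipliers: Sequence[float],
                      target: float) -> float:
    """Return the multiplier to score after the cross-provider re-check."""
    if primary_multiplier >= target:
        return primary_multiplier      # already a hit: never downgraded
    if primary_multiplier < NEAR_MISS_FRACTION * target:
        return primary_multiplier      # clear miss: no re-check
    # Near-miss: the highest credible peak across providers wins.
    return max([primary_multiplier, *other_multipliers])
```

Taking the maximum is what makes the rule one-directional: a re-check can raise the scored multiplier but never lower it.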

Price data flows through a multi-source OHLCV engine — primary on-chain provider first, cascading to secondary and tertiary sources — behind a cache that makes every retrieved result deterministic.
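The cascade-behind-a-cache pattern might look like the sketch below. The providers are modeled as hypothetical callables that return a list of prices or `None`; no real provider API is implied, and an in-memory dictionary stands in for whatever cache is actually used.

```python
from typing import Callable, Dict, List, Optional, Sequence

Provider = Callable[[str], Optional[List[float]]]


def fetch_candles(key: str,
                  providers: Sequence[Provider],
                  cache: Dict[str, List[float]]) -> Optional[List[float]]:
    """Try providers in priority order; cache the first usable result."""
    if key in cache:
        return cache[key]              # same lookup, same answer
    for provider in providers:         # primary, then secondary, then tertiary
        result = provider(key)
        if result:
            cache[key] = result
            return result
    return None                        # no provider had data
```

Only successful lookups are cached in this sketch, which leaves room for a later retry of no-data calls if coverage improves.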

Peer Ranking

Raw hit rates mislead without context. The “Top X%” ranking compares a caller against the distribution of certified callers at the same multiplier and window: “Top 10%” means a higher hit rate than 90% of the certified callers at that setting. Rankings update with the selectors; pools under 30 callers show no ranking; distributions are recalculated periodically, not in real time.

Board positions require a minimum sample: callers with fewer than 30 scored calls are listed but unranked. Their full scorecard still publishes — every call, every receipt — but they hold no rank and sort below ranked callers until enough calls have been scored. Three hits out of five is a coin streak, not a record; the ranked board only compares records large enough to mean something.
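The ranking rules can be sketched as a percentile with two eligibility checks. The assumptions here are that the pool is a list of hit rates for one multiplier-and-window setting and that ties do not count as beaten; the text does not settle either point, and the names are illustrative.

```python
from typing import Optional, Sequence

MIN_POOL_SIZE = 30      # pools under 30 callers show no ranking
MIN_SCORED_CALLS = 30   # callers under 30 scored calls are unranked


def top_percent(caller_hit_rate: float,
                caller_scored_calls: int,
                pool_hit_rates: Sequence[float]) -> Optional[float]:
    """Return X for "Top X%", or None when no rank applies."""
    if caller_scored_calls < MIN_SCORED_CALLS:
        return None                    # listed, but unranked
    if len(pool_hit_rates) < MIN_POOL_SIZE:
        return None                    # pool too small to rank anyone
    beaten = sum(1 for rate in pool_hit_rates if rate < caller_hit_rate)
    return 100.0 * (1 - beaten / len(pool_hit_rates))
```

A caller who beats 90 of 100 pool members lands at roughly Top 10% under this reading.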

Chart Vision

Calls are identified from tweet text only. A caller who posts nothing but a chart screenshot — no ticker, no contract address, no written recommendation — is not scored on that post. This rule is applied uniformly to every caller: a chart-image-only call never enters the scored set in either direction.

Charts are still read, but only for display. A vision model extracts structured metadata from attached charts — chart type, trend direction, timeframe, indicators, annotated entry zones, targets, stop-loss levels, and a quality score — analyzing each image in isolation, without the tweet text, so it reports what the chart shows rather than what the text says. This surfaces as chart badges and stats on a caller's calls.

Chart analysis never changes a score. It does not promote, demote, or filter any call, and it is not part of how hit rate is computed. Scoring depends only on the text classification and the price data. Reading charts as evidence a viewer can see, while keeping them out of the math, avoids adding a second model's judgment to numbers that must stay reproducible.

Verified Caller Program

Verification means a caller opted in and agreed to public tracking: request, ownership confirmation via DM, a full-history run, then a one-week private review window for flagging misclassifications — each reviewed manually — before the scorecard publishes with a “Verified” badge. Unverified callers get the same methodology and pipeline — the badge reflects participation, nothing more. Verification is free: no paid tier, no pay-to-play, no favorable treatment.

Market Phase Definitions

Scores can be filtered by market phase for context. Phases are defined on Total3 — total cryptocurrency market capitalization excluding Bitcoin and Ethereum — the broad altcoin environment callers operate in.

Bull Market Phases

Oct 16, 2023 – Mar 31, 2024 (ETF Rally)
Nov 1, 2024 – Jan 19, 2025 (Trump Rally)
Jun 1, 2025 – Oct 6, 2025 (Breakout Rally)

Bear Market Phases

Apr 1, 2023 – Oct 15, 2023 (Post-FTX Winter)
Apr 1, 2024 – Oct 31, 2024 (Mid-Cycle Correction)
Oct 7, 2025 – Present (Tariff Crash)

Calls during transitions between defined phases (e.g., Jan 20 – May 31, 2025) appear only in the All Time view. Phase definitions are reviewed as conditions evolve.
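The phase windows listed above translate directly into a lookup table. In this sketch the dates come from the lists above, an end date of `None` stands for "Present", and the `phase_for` helper is a hypothetical name.

```python
from datetime import date
from typing import List, Optional, Tuple

Phase = Tuple[str, str, date, Optional[date]]  # kind, name, start, end (inclusive)

PHASES: List[Phase] = [
    ("bear", "Post-FTX Winter", date(2023, 4, 1), date(2023, 10, 15)),
    ("bull", "ETF Rally", date(2023, 10, 16), date(2024, 3, 31)),
    ("bear", "Mid-Cycle Correction", date(2024, 4, 1), date(2024, 10, 31)),
    ("bull", "Trump Rally", date(2024, 11, 1), date(2025, 1, 19)),
    ("bull", "Breakout Rally", date(2025, 6, 1), date(2025, 10, 6)),
    ("bear", "Tariff Crash", date(2025, 10, 7), None),  # None = still open
]


def phase_for(day: date) -> Optional[Tuple[str, str]]:
    """Return (kind, name) for the phase containing `day`, else None.

    None means the call falls in a transition gap and would appear
    only in the All Time view.
    """
    for kind, name, start, end in PHASES:
        if start <= day and (end is None or day <= end):
            return kind, name
    return None
```

A call dated March 1, 2025 falls in the gap between the Trump Rally and the Breakout Rally, so the lookup returns `None`.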

Limitations

Price data depends on third-party APIs, so some low-cap or very new tokens may lack OHLCV data. Classification is automated and probabilistic; edge cases exist where unusual phrasing is missed or miscategorized. We track X only: calls on Discord, Telegram, private channels, or other platforms are not captured. Calls made only inside a chart image, with no ticker or contract address in the text, are not scored. The scraper captures tweets as they exist at the time of analysis: tweets deleted before the first scrape are never seen, and tweets edited before it are recorded in their final state, not their original content. First mention wins: if a caller mentioned a token before our scrape window, that earlier mention is missed and the record starts at the first mention inside the window. Sample sizes are always displayed prominently, because a caller with 12 scored calls should be interpreted differently from one with 340. Scores are descriptive, not predictive. A high hit rate does not guarantee future performance.

Caller Track scores X callers only. Every published call must carry an independently verifiable public receipt — a link anyone can click to see the original post. Sources that cannot provide that, such as Telegram exports or private groups, are not scored, because a corpus the scored party supplies or curates cannot prove its own completeness.

Our Commitment

This tool has no financial relationship with any caller. We do not accept payment for verification badges, featured placement, or favorable treatment. The data is the data. If you believe a result is incorrect, report it — we investigate every report.