Methodology

How Caller Track measures influencer accuracy.

What We Measure

Caller Track records the first time a caller makes a direct buy recommendation — either by contract address or ticker symbol — using first-mention timestamping across X (Twitter) and Telegram. We then track whether the token reached a specified multiplier (2x, 3x, 5x, or 10x) within a configurable time window using minute-level OHLCV pricing from on-chain data sources. The default window is 14 days, which captures slower-developing plays common in meme coins — callers who identify tokens 10-12 days before the move deserve credit. Shorter windows (24h, 48h, 72h, 7d) are available for stricter evaluation. This produces a verifiable hit rate: the percentage of calls that reached the target.

Only direct buy recommendations are scored. A caller commenting on, warning about, or joking about a token is not making a buy recommendation — but most tracking tools count every mention as a call, inflating hit rates with noise. Caller Track uses an LLM classification engine to separate genuine buy calls from commentary, profit updates, warnings, jokes, and casual mentions. The same methodology applies to every caller regardless of verification status.

The Classification Problem

The hardest problem in caller accountability is determining what counts as a call. A tweet containing a token ticker could be a buy recommendation, a warning to sell, a joke, a follow-up on a previous position, a casual mention, or paid promotion. Treating every mention as a call — as most tracking tools do — inflates hit rates and makes every caller look better than they are.

A large language model classifies each post into one of eleven categories: buy call, shill, follow-up, hold, commentary, news, casual mention, warning, joke, sell call, or repost. Classification uses few-shot prompting with a curated set of real-world examples drawn from hundreds of manually reviewed, high-confidence classifications spanning diverse caller styles and phrasing. Before classification, tweets that cannot contain a call — pure retweets, emoji-only posts, and tweets with no token reference — are filtered out automatically, reducing noise without risking missed calls.
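
The pre-classification filter can be illustrated with a minimal sketch. The regular expressions and function names below are illustrative assumptions, not the production rules; the Solana-style address pattern in particular is a rough heuristic.

```python
import re

# Hypothetical patterns for "token reference": a $TICKER cashtag or
# something shaped like a contract address. Assumptions for illustration only.
CASHTAG = re.compile(r"\$[A-Za-z][A-Za-z0-9]{1,9}\b")
EVM_ADDRESS = re.compile(r"\b0x[a-fA-F0-9]{40}\b")
SOLANA_ADDRESS = re.compile(r"\b[1-9A-HJ-NP-Za-km-z]{32,44}\b")


def has_token_reference(text: str) -> bool:
    """True if the text contains a cashtag or an address-like string."""
    return any(p.search(text) for p in (CASHTAG, EVM_ADDRESS, SOLANA_ADDRESS))


def is_emoji_only(text: str) -> bool:
    """True if the text contains no letters or digits at all."""
    return not any(ch.isalnum() for ch in text)


def should_classify(text: str, is_pure_retweet: bool) -> bool:
    """Decide whether a tweet is worth sending to the classifier."""
    if is_pure_retweet:
        return False
    if is_emoji_only(text):
        return False
    return has_token_reference(text)
```

Anything that passes this filter still goes through full classification; the filter only removes posts that cannot contain a call.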

Two confidence thresholds govern the system. Classification confidence must reach 75% for a tweet to be scored; anything below that is excluded. Separately, tweets where the primary model's confidence falls below 85% are routed to two additional models for independent review. All three models receive the same prompt and examples, and no model sees another's output. When at least two of the three agree, the majority label wins, even if that overrides the primary model. When all three disagree, the primary model's judgment stands. If the secondary models are unavailable or time out, the system falls back to the primary classification rather than failing.
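
The voting rule can be sketched as follows. The function names are hypothetical, and the sketch deliberately leaves open which confidence value is used after an override, since that detail is not specified here.

```python
from collections import Counter
from typing import Optional, Sequence

SCORE_THRESHOLD = 0.75     # below this, a tweet is excluded from scoring
ENSEMBLE_THRESHOLD = 0.85  # below this, two secondary models are consulted


def resolve_label(primary_label: str,
                  primary_confidence: float,
                  secondary_labels: Optional[Sequence[str]] = None) -> str:
    """Return the final label for one tweet.

    secondary_labels is None when the secondary models were unavailable
    or timed out; in that case the primary label is kept.
    """
    if primary_confidence >= ENSEMBLE_THRESHOLD:
        return primary_label  # confident enough, no ensemble review
    if secondary_labels is None or len(secondary_labels) != 2:
        return primary_label  # fallback to the primary classification
    votes = Counter([primary_label, *secondary_labels])
    label, count = votes.most_common(1)[0]
    # Two or three matching votes win; a three-way split keeps the primary.
    return label if count >= 2 else primary_label


def passes_scoring_gate(confidence: float) -> bool:
    """Apply the 75% scoring threshold."""
    return confidence >= SCORE_THRESHOLD
```

For example, a primary "buy call" at 80% confidence with both secondary models answering "commentary" resolves to "commentary".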

The system fails closed: when uncertain, it undercounts rather than inflates.

Every classified tweet is also converted into a vector embedding that captures semantic meaning, which powers a similarity search used for internal quality assurance. High-confidence classifications accumulate into a growing pool of verified examples. As this pool matures, the system will retrieve semantically relevant examples for new tweets, adapting to emerging slang, niche phrasing, and caller-specific language patterns over time.

The Completeness Problem

A caller's score is only as reliable as the data behind it. If the scraper misses tweets, the score is wrong — and not in a way that's obvious. Missing a buy call that hit 10x makes the caller look worse. Missing one that went to zero makes them look better. Either way, the score can't be trusted.

This is harder than it sounds. X's search infrastructure is eventually consistent — distributed replicas can return slightly different result sets between identical queries. A single scrape of a high-volume day might capture 95-97% of tweets, silently missing the rest. Most tools don't acknowledge this because most tools haven't tested for it.

Caller Track uses a self-hosted tweet scraper that queries two independent X data sources and unions the results: the primary search index covers the full historical range up to four years, while an independent timeline endpoint covers recent months. Because these are backed by separate infrastructure on X's side, unioning them covers gaps that neither source would catch alone.

To compensate for search-level non-determinism, the scraper runs a multi-pass retrieval process on high-volume days, re-querying with different credentials to sample different replicas. Results are unioned into a single deduplicated set, and the process stops early when a pass recovers zero new tweets. Because each pass can only add to the union, the multi-pass result is always a superset of any single pass: it never loses tweets, only recovers ones that were missed. The end result is reproducible: independent runs produce identical output.
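
A minimal sketch of the union-with-early-stop loop, assuming a hypothetical fetch_pass(pass_index) callable that runs one query with a different credential and returns tweets as dicts with an "id" key. The pass limit of five is also an assumption.

```python
from typing import Callable, Dict, Iterable


def multi_pass_fetch(fetch_pass: Callable[[int], Iterable[dict]],
                     max_passes: int = 5) -> Dict[str, dict]:
    """Union several passes into one set, deduplicated by tweet id."""
    seen: Dict[str, dict] = {}
    for pass_index in range(max_passes):
        new_count = 0
        for tweet in fetch_pass(pass_index):
            if tweet["id"] not in seen:
                seen[tweet["id"]] = tweet
                new_count += 1
        # Stop once a pass after the first adds nothing new.
        if pass_index > 0 and new_count == 0:
            break
    return seen
```

Because tweets are only ever added to the set, the result can never be smaller than what any single pass returned.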

All tweet types are captured: originals, replies, quote tweets, and retweets. Tweet capture has been manually verified across multiple callers spanning four years of history. Forward verification confirmed that every captured tweet exists on X with matching content. Reverse verification confirmed that every tweet visible in X's search appears in the scraper's output. Automated consistency testing across 15 caller-months showed zero variance after multi-pass processing.

The scraper captures tweets that exist at the time of analysis. Tweets deleted before the scrape are not recoverable. Tweets edited within X's edit window are captured in their final form. These are inherent platform limitations, not engineering gaps.

The Resolution Problem

When a caller tweets “$PEPE,” which token are they talking about? There might be dozens of tokens with that ticker across Solana, Base, Ethereum, and BSC — some active, some abandoned, some created yesterday. Matching the wrong one means scoring against the wrong price, and the caller gets a score that has nothing to do with what they actually recommended.

Ticker-to-contract resolution uses on-chain pair scoring with liquidity weighting, multi-pair ecosystem detection, and temporal proximity — tokens created close to the call date are preferred over older tokens sharing the same ticker, preventing false matches to abandoned or migrated contracts. Dead migration pools are penalized. Near-zero-volume pairs are deprioritized. A 189-entry noise blocklist filters common false positives like “HODL,” “FOMC,” and “STOCKS” that appear as tickers but aren't token references. When a caller posts a contract address directly, resolution is exact — no ambiguity.
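
To make the idea concrete, here is a simplified scoring sketch. Every weight, threshold, and field name in it is a hypothetical placeholder rather than a production value, multi-pair ecosystem detection is omitted, and the rule that a pair must exist before the call is an assumption of the sketch.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional, Set


@dataclass
class Pair:
    contract: str
    liquidity_usd: float
    volume_24h_usd: float
    created_at: datetime
    is_dead_migration_pool: bool = False


def score_pair(pair: Pair, call_time: datetime) -> float:
    """Higher is better. All constants are illustrative assumptions."""
    score = math.log10(1.0 + pair.liquidity_usd)        # liquidity weighting
    age_days = (call_time - pair.created_at).total_seconds() / 86400
    score += 3.0 / (1.0 + age_days / 30.0)              # temporal proximity
    if pair.volume_24h_usd < 100:
        score -= 2.0                                    # near-zero volume
    if pair.is_dead_migration_pool:
        score -= 5.0                                    # dead migration pool
    return score


def resolve_ticker(ticker: str, candidates: Iterable[Pair],
                   call_time: datetime, blocklist: Set[str]) -> Optional[str]:
    """Return the best-matching contract, or None if nothing qualifies."""
    if ticker.upper() in blocklist:
        return None
    eligible = [p for p in candidates if p.created_at <= call_time]
    if not eligible:
        return None
    return max(eligible, key=lambda p: score_pair(p, call_time)).contract
```

With these placeholder weights, a three-day-old pair with healthy volume outranks a two-year-old dead pool that shares its ticker, even when the older pool reports slightly more liquidity.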

All resolved data is cached after first retrieval, making results deterministic and reproducible. The same caller lookup run twice will always produce the same token matches.

The Pricing Problem

Entry price determines everything downstream. A difference of a few cents can change whether a call hit 2x or missed it. Every choice about how to measure price is a design decision with consequences.

Entry price is the low of the 1-minute candle at call time, the best price available in the minute the call was posted. This choice deliberately favors the caller: we assume the caller's audience could have bought at the lowest point of that minute rather than at an average or delayed price. Any higher entry price would produce lower multipliers and fewer hits.

Peak price is the highest price reached within the selected time window (default: 14 days) — derived from the maximum high across all candles in the window. The multiplier is peak divided by entry. A call is a “hit” if the multiplier reached or exceeded the target within the window. Hit rate is the number of hits divided by the number of scored calls. Unscored calls — where price data could not be resolved — are excluded from the hit rate, favoring the caller by not counting data gaps as misses. Pump velocity (time-to-target) is also tracked for each hit, measuring how quickly the token reached 2x, 3x, 5x, or 10x.
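
The arithmetic can be written out as a short sketch. The Candle structure and function names are assumptions made for illustration; candles are indexed by minutes elapsed since the call, with minute 0 being the call minute.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Candle:
    minute: int   # minutes since the call was posted (0 = call minute)
    high: float
    low: float


def call_multiplier(candles: Sequence[Candle],
                    window_minutes: int) -> Optional[float]:
    """Peak high in the window divided by the low of the call-minute candle.

    Returns None (unscored) when the call-minute candle is missing.
    """
    in_window = sorted(
        (c for c in candles if 0 <= c.minute <= window_minutes),
        key=lambda c: c.minute,
    )
    if not in_window or in_window[0].minute != 0 or in_window[0].low <= 0:
        return None
    entry = in_window[0].low
    peak = max(c.high for c in in_window)
    return peak / entry


def hit_rate(multipliers: Sequence[Optional[float]],
             target: float) -> Optional[float]:
    """Hits divided by scored calls; unscored calls (None) are excluded."""
    scored = [m for m in multipliers if m is not None]
    if not scored:
        return None
    return sum(m >= target for m in scored) / len(scored)
```

For instance, an entry low of 1.0 and a peak high of 2.4 inside the window gives a 2.4x multiplier, which counts as a hit at the 2x target but not at 3x.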

Price data is resolved through a multi-source OHLCV engine: the primary on-chain provider is queried first; if unavailable, the request cascades to secondary and tertiary sources. A caching layer ensures that once price data is retrieved, results are deterministic and reproducible — the same query always returns the same answer.
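
The cascade-plus-cache pattern might look like the sketch below. The class name, the provider callables, and the in-memory dictionary cache are all assumptions; the choice not to cache failed lookups is also an assumption of the sketch.

```python
from typing import Callable, Dict, List, Optional, Sequence

Provider = Callable[[str], Optional[List[dict]]]


class CascadingPriceSource:
    """Try providers in order and cache the first non-empty answer."""

    def __init__(self, providers: Sequence[Provider]) -> None:
        self._providers = list(providers)
        self._cache: Dict[str, List[dict]] = {}

    def get_candles(self, key: str) -> Optional[List[dict]]:
        if key in self._cache:
            return self._cache[key]      # same query, same answer
        for provider in self._providers:
            try:
                candles = provider(key)
            except Exception:
                continue                 # provider unavailable, try the next
            if candles:
                self._cache[key] = candles
                return candles
        return None                      # no data: the call stays unscored
```

Once a key has been resolved, later lookups never reach a provider again, which is what keeps repeated queries consistent.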

Peer Ranking

Raw hit rates can be misleading without context. The “Top X%” ranking compares a caller's hit rate against a curated distribution of 50+ tracked callers under the same multiplier target and time window. “Top 10%” means the caller's hit rate is higher than 90% of tracked callers at that setting. Rankings update when you change the multiplier or window selector. If the comparison pool has fewer than 30 callers for a given setting, no ranking is shown. Distributions are stored server-side and recalculated periodically as the tracked pool grows — they are not real-time.
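
As a rough sketch of the ranking arithmetic, assuming the comparison pool is a plain list of hit rates that does not include the caller being ranked, and that results are rounded up to a whole percent with a floor of 1 (both assumptions):

```python
from typing import Optional, Sequence

MIN_POOL_SIZE = 30  # below this, no ranking is shown


def top_percent(caller_rate: float,
                pool_rates: Sequence[float]) -> Optional[int]:
    """Return N for a "Top N%" label, or None if the pool is too small."""
    n = len(pool_rates)
    if n < MIN_POOL_SIZE:
        return None
    beaten = sum(rate < caller_rate for rate in pool_rates)
    not_beaten = n - beaten
    # Integer ceiling division avoids floating-point rounding surprises.
    percent = (100 * not_beaten + n - 1) // n
    return max(1, percent)
```

A caller whose hit rate is higher than 45 of 50 tracked callers (90%) is labeled Top 10%.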

Chart Vision

Callers don't always write out their recommendations — they post chart screenshots with drawn entry zones, targets, and trend lines. A classifier that can only read text misses the intent behind these posts entirely.

When callers attach chart screenshots, a vision model extracts structured metadata — chart type, trend direction, timeframe, technical indicators, annotated entry zones, targets, support/resistance levels, and chart quality score. The vision model analyzes each image in isolation, without access to the tweet text, preventing text-biased interpretation where the model might “see” what the text says rather than what the chart shows.

Vision output feeds back into classification confidence through Bayesian late fusion: a bullish chart with marked entry zones on a borderline buy call can promote it into the scored set, while a bearish chart on a buy call reduces confidence. The adjustment is asymmetric by design: contradictory visual evidence penalizes more than confirmatory evidence boosts. Confidence deltas are computed by deterministic code rather than by the vision model itself, which sidesteps a known calibration weakness in vision language models.
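
One simple way to express an asymmetric, deterministic adjustment is a log-odds update, sketched below. This is an illustration of the idea rather than the actual fusion formula: the two weights, the function name, and the rule for what counts as confirmation are all hypothetical.

```python
import math

# Hypothetical log-odds shifts. The penalty is larger in magnitude than the
# boost, which is what makes the adjustment asymmetric.
CONFIRM_LOG_ODDS = 0.5
CONTRADICT_LOG_ODDS = -1.5


def fuse_buy_call_confidence(text_confidence: float,
                             chart_direction: str,
                             has_entry_zone: bool) -> float:
    """Adjust a buy-call confidence using structured chart metadata."""
    p = min(max(text_confidence, 1e-6), 1.0 - 1e-6)
    log_odds = math.log(p / (1.0 - p))
    if chart_direction == "bullish" and has_entry_zone:
        log_odds += CONFIRM_LOG_ODDS
    elif chart_direction == "bearish":
        log_odds += CONTRADICT_LOG_ODDS
    return 1.0 / (1.0 + math.exp(-log_odds))
```

With these placeholder weights, a 72% buy call paired with a bullish chart and marked entry zone rises above the 75% scoring threshold, while an 80% buy call paired with a bearish chart drops below it.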

Verified Caller Program

Verification means a caller has opted in, reviewed their scorecard, and agreed to public tracking. The process: a caller submits a request, we confirm account ownership via DM, run their full history through the pipeline, and provide a one-week private review window. During this window, callers can flag misclassified calls — each dispute is reviewed manually. After the review period, the scorecard is published with a “Verified” badge. Unverified callers use the same methodology and pipeline — the badge reflects participation in the review process, nothing more. Verification is free with no paid tier, no pay-to-play, and no favorable treatment.

Market Phase Definitions

Caller Track allows filtering scores by market phase to provide context for caller performance under different conditions. Market phases are defined using Total3 — the total cryptocurrency market capitalization excluding Bitcoin and Ethereum — which reflects the broad altcoin market environment callers operate in.

Bull Market Phases

  • Oct 16, 2023 – Mar 31, 2024 (ETF Rally)
  • Nov 1, 2024 – Jan 19, 2025 (Trump Rally)
  • Jun 1, 2025 – Oct 6, 2025 (Breakout Rally)

Bear Market Phases

  • Apr 1, 2023 – Oct 15, 2023 (Post-FTX Winter)
  • Apr 1, 2024 – Oct 31, 2024 (Mid-Cycle Correction)
  • Oct 7, 2025 – Present (Tariff Crash)

Calls made during transition periods between defined phases (e.g., Jan 20 – May 31, 2025) are excluded from both bull and bear filters and only appear in the All Time view.
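
The phase filter amounts to a date-range lookup. A minimal sketch using the dates listed above, where the table layout and function name are assumptions and an end date of None stands for "Present":

```python
from datetime import date
from typing import List, Optional, Tuple

Phase = Tuple[date, Optional[date], str]

BULL_PHASES: List[Phase] = [
    (date(2023, 10, 16), date(2024, 3, 31), "ETF Rally"),
    (date(2024, 11, 1), date(2025, 1, 19), "Trump Rally"),
    (date(2025, 6, 1), date(2025, 10, 6), "Breakout Rally"),
]
BEAR_PHASES: List[Phase] = [
    (date(2023, 4, 1), date(2023, 10, 15), "Post-FTX Winter"),
    (date(2024, 4, 1), date(2024, 10, 31), "Mid-Cycle Correction"),
    (date(2025, 10, 7), None, "Tariff Crash"),
]


def market_phase(call_date: date) -> Optional[str]:
    """Return "bull", "bear", or None for dates outside every defined phase."""
    for kind, phases in (("bull", BULL_PHASES), ("bear", BEAR_PHASES)):
        for start, end, _name in phases:
            if start <= call_date and (end is None or call_date <= end):
                return kind
    return None
```

A call that maps to None belongs to a transition period and shows up only in the All Time view.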

Market phase definitions are reviewed and updated as market conditions evolve.

Limitations

Price data depends on third-party APIs, so some low-cap or very new tokens may lack OHLCV data. Classification is automated and probabilistic; edge cases exist where unusual phrasing is missed or miscategorized. We track X and Telegram only; calls on Discord, private channels, or other platforms are not captured. The scraper captures tweets that exist at the time of analysis: tweets deleted before the scrape are not captured, and edited tweets reflect their final state, not their original content. First-mention wins: if a caller mentioned a token before our scrape window, that earlier mention is missed. Sample sizes are always displayed prominently, because a caller with 12 scored calls should be interpreted differently from one with 340. Scores are descriptive, not predictive. A high hit rate does not guarantee future performance.

Our Commitment

This tool has no financial relationship with any caller. We do not accept payment for verification badges, featured placement, or favorable treatment. The data is the data. If you believe a result is incorrect, contact us — we investigate every report.