Predicting next-day “heat” for stocks from r/WallStreetBets

This project combines Reddit’s r/WallStreetBets discussion data with historical US stock prices to predict which tickers are most likely to experience large next-day moves, and ranks them with a learned heat score. Dataset: 16,645 ticker-days across 78 tickers (2023-06-08 → 2025-03-31), big-move base rate = 10.9%. The final model blends WSB mention volume, sentiment scores, engagement signals, and lagged price/volatility features; baselines include popularity-only rankings and logistic regression on simple tabular features.

Machine Learning · Python · Reddit + Market Data · Meme stocks & volatility 🚀
Designed for peer grading: each section below matches the Assignment 2 rubric (task, data, modeling, evaluation, related work).

1. Predictive Task & Evaluation Plan

Predict large next-day moves for each ticker-day using WSB chatter + price/volume history, then rank tickers by a heat score.

Overview. We start with a time-series view for NVDA that plots an indexed price series against daily WSB attention (mentions, scores, unique authors), highlighting high-attention days and large next-day absolute returns. Hype days often coincide with volatility, but not every spike is followed by a big move.

NVDA: price, WSB attention, and big moves

1.1 Predictive task

Task. For each (ticker, day), predict whether the ticker will make a large next-day move and assign a heat score (probability of a big move).

  • Label: big_move = 1 if |next_return| ≥ 5%; else 0.
  • Inputs (per ticker-day):
    • WSB activity: mention counts, upvote score sums, unique authors, post/comment mix.
    • Text: aggregated WSB posts/comments represented with TF-IDF (1-2 grams).
    • Sentiment: VADER scores aggregated per ticker-day.
    • Price/volume history: return, log-volume, recent return windows (1d/3d/5d), rolling volatility, volume anomalies.
  • Output: P(big_move = 1 | features) — a continuous heat score in [0,1] used for classification or ranking.
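A minimal pandas sketch of the label construction above (toy data; the column names `close`, `return`, `next_return`, `big_move` follow the report, the prices are illustrative):

```python
import pandas as pd

# Toy price frame (column names follow the report; data is illustrative).
px = pd.DataFrame({
    "ticker": ["NVDA"] * 4,
    "datetime": pd.to_datetime(["2023-06-08", "2023-06-09", "2023-06-12", "2023-06-13"]),
    "close": [100.0, 106.0, 104.0, 104.5],
})
px = px.sort_values(["ticker", "datetime"])

# Same-day close-to-close return (a feature) ...
px["return"] = px.groupby("ticker")["close"].pct_change()
# ... and the next day's return (target source): shift back one day per ticker.
px["next_return"] = px.groupby("ticker")["return"].shift(-1)
# Label: 1 if tomorrow's absolute move is at least 5%.
px["big_move"] = (px["next_return"].abs() >= 0.05).astype(int)
```

The per-ticker `shift(-1)` is what makes the label strictly next-day: every feature row only ever sees returns up to its own date.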

1.2 Evaluation setup

  • Train / val / test split: oldest ~70% train, next ~15% val, most recent ~15% test (date-based, no shuffling).
  • Model family: logistic regression pipelines (course-relevant) combining TF-IDF text + standardized numeric features.
  • Hyperparameters & threshold: tuned with TimeSeriesSplit on train+val; decision threshold picked on a held-out slice for F1; test set used once.
  • Metrics: primary F1; also precision, recall, ROC-AUC, accuracy for context; ROC curves judge heat-score ranking.
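The chronological 70/15/15 split can be sketched as plain positional slicing on a date-sorted table (toy data; real code would sort the full ticker-day table):

```python
import numpy as np
import pandas as pd

# Illustrative ticker-day table sorted by date; no shuffling anywhere.
df = pd.DataFrame({
    "datetime": pd.date_range("2023-06-08", periods=100, freq="D"),
    "y": np.random.default_rng(0).integers(0, 2, 100),
})
df = df.sort_values("datetime").reset_index(drop=True)

# Oldest 70% -> train, next 15% -> validation, most recent 15% -> test.
n = len(df)
i_tr, i_va = int(0.70 * n), int(0.85 * n)
train, val, test = df.iloc[:i_tr], df.iloc[i_tr:i_va], df.iloc[i_va:]
```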

1.3 Baselines

  • Trivial: always 0, always 1.
  • Numeric logistic: price-only; WSB-only numeric; price + WSB numeric.
  • Text baselines: TF-IDF text-only; TF-IDF text + basic numeric (return, log_volume, mention_count, score_sum).
  • Final model: TF-IDF text + all numeric (price/volume history, volatility, volume anomalies, WSB activity, sentiment) with tuned hyperparameters/threshold.

1.4 Validity / sanity checks

  • No look-ahead: targets use next-day returns; features use same-day or lagged signals; chronological splits prevent leakage.
  • Robustness: confusion matrices and ROC curves compare price-only vs final; slice by ticker/time (e.g., NVDA) to see when heat aligns with true big moves.
Dataset snapshot:
  • Examples: 16,645 ticker-days
  • Tickers: 78 tracked symbols
  • Date range: 2023-06-08 → 2025-03-31
  • Base rate: 10.9% big moves
Binary classification (big move vs. no move) · Time-based train / val / test split · Imbalanced labels → F1 & ROC focus

2. Exploratory Analysis, Data Collection & Pre-processing

Reddit WSB posts/comments and Yahoo Finance prices are aligned into a daily (ticker, day) table used for EDA and all downstream models.

2.1 Data sources (context)

Goal: align these into a daily (ticker, day) table that feeds both EDA and the prediction task.

Example raw WSB rows (truncated)

register_index post_id comment_id author datetime title url score comments text author_post_karma tag
14b78hkjoe86nf 14b78hk joe86nf scott_jr 2023-06-16 20:36:55 1.0 Watch til 1 10 32102.0 Meme
14b71m2post 14b71m2 merakibret 2023-06-16 20:24:01 I had my first ever big success with options t... https://www.reddit.com/r/wallstreetbets/commen... 8.0 6.0 Entered an Iron Condor on ADBE yesterday at 45... 343.0 Gain
14b71m2joe6du9 14b71m2 joe6du9 VisualMod 2023-06-16 20:24:07 1.0 User Report Tota... 725083.0 Gain
14b71m2joe6een 14b71m2 joe6een VisualMod 2023-06-16 20:24:13 2.0 That was a very wise move 725083.0 Gain
14b71m2joe7yy4 14b71m2 joe7yy4 DreamcatcherEgg 2023-06-16 20:35:23 2.0 All you have to do is repeat this same winning... 6088.0 Gain

2.2 Cleaning and ticker construction

  • Normalize datetime to dates; drop bots/mods and [deleted]/[removed] rows.
  • Build raw_text from title + text, then clean_text by stripping URLs, collapsing whitespace, lowercasing.
  • Extract candidate tickers via cashtags ($TSLA) and uppercase tokens (TSLA, NVDA); keep tokens with ≥500 mentions; filter obvious non-stocks; require valid Yahoo price history.
  • Explode to one row per (WSB row, ticker): wsb_exploded.
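The cashtag/uppercase-token extraction can be sketched with two regexes and a blocklist (a minimal illustration; the real pipeline also applies the ≥500-mention cutoff and the Yahoo price-history check, and its blocklist is larger than the toy `NOT_TICKERS` set here):

```python
import re

CASHTAG = re.compile(r"\$([A-Z]{1,5})\b")      # $TSLA-style cashtags
UPPER_TOKEN = re.compile(r"\b([A-Z]{2,5})\b")  # bare uppercase tokens: TSLA, NVDA
# Small illustrative blocklist of uppercase words that are not tickers.
NOT_TICKERS = {"YOLO", "DD", "CEO", "THE", "USA"}

def candidate_tickers(raw_text: str) -> set[str]:
    """Candidate symbols from cashtags and bare uppercase tokens."""
    tags = set(CASHTAG.findall(raw_text))
    tokens = set(UPPER_TOKEN.findall(raw_text))
    return (tags | tokens) - NOT_TICKERS
```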

2.3 Price alignment and feature table

Sample Yahoo Finance price rows (long format)

datetime ticker adj_close close high low open volume
6/1/23 AAPL 177.93 180.09 180.12 176.93 177.70 68901800
6/2/23 AAPL 178.78 180.95 181.78 179.26 181.03 61996900
6/5/23 AAPL 177.43 179.58 184.95 178.04 182.63 121946500
6/6/23 AAPL 177.06 179.21 180.12 177.43 179.97 64848400
6/7/23 AAPL 175.69 177.82 181.21 177.32 178.44 61944600
  • From Yahoo prices: compute daily return, next_return, and big_move = 1[|next_return| ≥ 5%]; keep (datetime, ticker, close, volume, return, next_return, big_move).
  • Aggregate WSB to (datetime, ticker):
    • Numeric: mention_count, score_sum, score_mean, post_fraction, unique_authors.
    • Text: concatenate clean_text into doc_text.
  • Merge price + WSB numeric + text, add log_volume = log(1 + volume), drop rows with missing core features. Final modeling table: ~16k daily examples.
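The merge step above can be sketched as a left-join of WSB aggregates onto the price table (toy rows; the report drops rows with missing core features, while this sketch fills quiet days with zero activity for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative daily frames (column names follow the report).
price = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-06-16", "2023-06-20"]),
    "ticker": ["NVDA", "NVDA"],
    "return": [0.012, -0.031],
    "volume": [50_000_000, 42_000_000],
})
wsb = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-06-16"]),
    "ticker": ["NVDA"],
    "mention_count": [37],
    "doc_text": ["nvda calls printing ..."],
})

# Left-join WSB onto prices; quiet days get zero activity and empty text.
table = price.merge(wsb, on=["datetime", "ticker"], how="left")
table["mention_count"] = table["mention_count"].fillna(0)
table["doc_text"] = table["doc_text"].fillna("")
table["log_volume"] = np.log1p(table["volume"])  # log(1 + volume)
```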

2.4 Key EDA findings (with plots)

  • Return distribution: fat-tailed around 0; the ±5% cutoff sits in the tails → a reasonable “big move” threshold.
  • WSB attention vs volatility: higher mention buckets have larger mean |next-day return| → attention aligns with higher volatility.
  • Text content: word cloud dominated by tickers, options slang, and event words → confirms domain relevance.
  • Ticker case studies (NVDA, TSLA): indexed price + high-attention marks + big-move bars show large moves often coincide with elevated WSB activity, motivating a probabilistic model.

2.5 Inputs to the model (summary)

  • Inputs: WSB numeric features, price/volume + lags/volatility, VADER sentiment, and doc_text (for TF-IDF).
  • Label: big_move ∈ {0,1} for ≥5% absolute next-day moves.
Daily returns distribution (all tickers)

Takeaway: Returns are centered near 0 with fat tails; the ±5% bands capture the big-move threshold used for labels, underscoring the class imbalance of rare ±5% days.

Average daily return by day of week

Takeaway: Modest positive drift early in the week and weaker performance on Thursdays; day-of-week is a minor effect compared to WSB attention spikes.

WSB attention vs. volatility

Takeaway: Higher mention buckets correlate with larger absolute next-day returns (volatility lift), motivating the inclusion of WSB volume features.

WSB title/phrase cloud
Word cloud of common WSB phrases

Takeaway: Dominant tokens are ticker symbols, options slang, and catalyst language (earnings, calls/puts), which the TF-IDF text features capture.

3. Modeling

Binary classification at the (ticker, day) level; the predicted probability is used as a heat score to rank tickers by risk of a large move.

3.1 Task formulation

  • Unit: one row = one (ticker, date).
  • Label: big_move = 1 if |next_return| ≥ 5% (next-day close-to-close); else 0.
  • Output: p̂ = P(big_move = 1 | x), the heat score; higher = more likely a ≥5% absolute move tomorrow.

3.2 Features

  • Price & volume (short-term dynamics): return, log_volume; lags ret_prev_1d/3d/5d; volatility vol_5d/vol_10d; volume anomaly vol_rel_5d.
  • WSB numeric signals: mention_count, score_sum, score_mean, post_fraction, unique_authors.
  • Sentiment: VADER sent_mean per ticker-day.
  • Textual signal: doc_text (all WSB text per ticker-day) → TF-IDF (1–2 grams, capped vocab).
TF-IDF (1–2g) · Lagged returns & vol · WSB mentions/authors · Sentiment (VADER)
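The lag/volatility/volume-anomaly features can be sketched with grouped rolling windows (toy data; `vol_10d` is computed the same way with a 10-day window, and the exact window conventions here are an assumption):

```python
import pandas as pd

# Toy per-ticker frame, sorted by date; `return`/`volume` as in the report.
df = pd.DataFrame({
    "ticker": ["NVDA"] * 8,
    "return": [0.01, -0.02, 0.03, 0.005, -0.01, 0.02, -0.005, 0.015],
    "volume": [1.0, 1.2, 0.9, 1.1, 1.0, 3.0, 1.1, 1.0],
})
g = df.groupby("ticker")

# Cumulative return over the last k days (known on the prediction day).
for k in (1, 3, 5):
    df[f"ret_prev_{k}d"] = g["return"].transform(lambda s, k=k: s.rolling(k).sum())
# Rolling volatility of daily returns.
df["vol_5d"] = g["return"].transform(lambda s: s.rolling(5).std())
# Volume anomaly: today's volume relative to its own 5-day rolling mean.
df["vol_rel_5d"] = df["volume"] / g["volume"].transform(lambda s: s.rolling(5).mean())
```

Grouping by ticker before rolling keeps each symbol's windows from leaking into its neighbors'.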

3.3 Baseline models (all logistic regression)

  • Price-only logistic — features ['return', 'log_volume'], scaled.
  • WSB-numeric-only logistic — features ['mention_count', 'score_sum'].
  • Price + WSB numeric logistic — ['return', 'log_volume', 'mention_count', 'score_sum'].
  • Text-only logistic — TF-IDF on doc_text (max_features 10k, ngram_range (1,2)).
  • Text + basic numeric logistic — TF-IDF + scaled [return, log_volume, mention_count, score_sum] (sparse + dense hstack).

Each numeric baseline uses:

# Shared numeric-baseline pipeline: standardize features, then
# L2-regularized logistic regression.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(
        penalty="l2",
        C=1.0,
        class_weight="balanced",  # reweight the rare (~11%) big-move class
        max_iter=1000,
        solver="lbfgs",
    )),
])
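The text + basic numeric baseline combines the two feature spaces with a sparse hstack, as noted above. A minimal end-to-end sketch (toy documents and numeric rows; real inputs are the report's `doc_text` and `[return, log_volume, mention_count, score_sum]` columns):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Tiny illustrative data; real features follow the report's columns.
docs_train = ["nvda calls printing", "quiet day nothing happening",
              "tsla earnings yolo puts", "boring sideways chop"]
X_num_train = np.array([[0.03, 17.1], [0.001, 15.2], [-0.04, 16.8], [0.002, 15.0]])
y_train = np.array([1, 0, 1, 0])

tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=10_000)
X_text = tfidf.fit_transform(docs_train)             # sparse TF-IDF block
X_num = StandardScaler().fit_transform(X_num_train)  # dense scaled block
X = hstack([X_text, csr_matrix(X_num)]).tocsr()      # sparse + dense hstack

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y_train)
heat = clf.predict_proba(X)[:, 1]                    # heat scores in [0, 1]
```

Keeping the combined matrix sparse is what makes the 10k-term TF-IDF vocabulary cheap to train on.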
pipe.fit(X_train, y_train)

3.4 Final model and tuning

🏁 Final model · F1-tuned grid search · ROC-based ranking check
  • Inputs: doc_text + all numeric columns (price/volume history, volatility, volume anomalies, WSB activity, sentiment).
  • Architecture: ColumnTransformer with TF-IDF branch + scaled numeric branch → LogisticRegression(max_iter=1000).
  • CV: train+val combined, chronological; TimeSeriesSplit(n_splits=5).
  • Grid search: TF-IDF max_features ∈ {5000,10000}, ngram_range ∈ {(1,1),(1,2)}, min_df=5, max_df=0.7; Logistic C ∈ {0.1,1.0,10.0}, class_weight ∈ {'balanced', None}. Maximize F1; freeze best pipeline and evaluate once on test.
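The tuning setup above can be sketched as a ColumnTransformer pipeline under GridSearchCV with a TimeSeriesSplit (toy chronological data with deliberately predictive text so the sketch runs end to end; the report's full grid also sweeps `max_features`, `min_df`, and `max_df`):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy chronological data: text is deliberately predictive so the sketch runs.
y = np.tile([0, 1, 0, 1, 1, 0], 10)
df = pd.DataFrame({
    "doc_text": ["nvda calls yolo" if t else "quiet boring day" for t in y],
    "return": np.where(y == 1, 0.03, 0.001),
    "log_volume": 16.0 + 0.5 * y,
})

# TF-IDF branch on the text column + scaled numeric branch -> logistic head.
pre = ColumnTransformer([
    ("text", TfidfVectorizer(), "doc_text"),  # report also sets min_df=5, max_df=0.7
    ("num", StandardScaler(), ["return", "log_volume"]),
])
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])

grid = {
    "pre__text__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
    "clf__class_weight": ["balanced", None],
}
# Chronological CV: each fold trains on the past, validates on the future.
search = GridSearchCV(pipe, grid, scoring="f1", cv=TimeSeriesSplit(n_splits=5))
search.fit(df, y)
```

`TimeSeriesSplit` is the key choice: unlike shuffled k-fold, every validation fold lies strictly after its training fold, matching the no-look-ahead protocol.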

3.5 Advantages and limitations

  • Pros: aligned with course content; fast on ~16k examples; produces probability heat scores for ranking; handles sparse TF-IDF + dense numeric.
  • Handling imbalance/dynamics: class weights; lagged returns/volatility/volume anomalies; attention signals (mentions/authors/sentiment).
  • Limits: linear boundary, no explicit text–price interactions, no temporal sequence model.
Next steps: explore boosted trees or sequence models to capture nonlinear text–price interactions and temporal patterns beyond single-day features.

4. Evaluation

Metrics, baselines, and diagnostic plots that justify the final model choice.

4.1 Protocol

  • Split: Time-based 70% train / 15% val / 15% test (no shuffling).
  • Outputs: Model returns P(big_move=1) = heat score.
  • Threshold: One decision threshold picked on validation F1, then fixed; test is used once.
  • Metrics (test): main = F1 (positives ≈ 10–12%); also precision, recall, ROC-AUC, accuracy; diagnostics via confusion matrix + ROC curves.
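The one-time threshold choice can be sketched as an F1 sweep over validation heat scores (toy labels/scores; the grid of candidate thresholds is an assumption):

```python
import numpy as np
from sklearn.metrics import f1_score

# Illustrative validation labels and heat scores from a fitted model.
y_val = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
p_val = np.array([0.1, 0.2, 0.8, 0.3, 0.6, 0.15, 0.42, 0.7, 0.05, 0.33])

# Sweep candidate thresholds on validation; freeze the best one for test.
grid_t = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_val, (p_val >= t).astype(int)) for t in grid_t]
tau = grid_t[int(np.argmax(f1s))]  # fixed before touching the test set
```

Because the threshold is frozen here, the test set really is used only once, for the final numbers.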

4.2 Baselines vs. final model

  • Trivial: always 0 (high accuracy, F1≈0); always 1 (recall=1, terrible precision/accuracy).
  • Course-style logistic regressions: price-only (return, log_volume); WSB numeric-only (mention_count, score_sum); price + WSB numeric (small lift); TF-IDF text-only (big recall/ROC gain); TF-IDF text + basic numeric (best untuned).
  • Final model: TF-IDF + all numeric (price lags, volatility, volume anomaly, WSB numeric, sentiment), tuned by GridSearchCV (TimeSeriesSplit, scoring=F1). Best F1 and ROC-AUC on test with similar accuracy to simpler baselines.

All results are summarized in a test-metrics table with columns model, accuracy, auc, recall, f1, precision, sorted by F1.
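Assembling that table is a small helper over each model's held-out heat scores (toy labels and scores below are illustrative only, not the project's actual results):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def metrics_row(model, y_true, y_prob, tau=0.5):
    """One row of the test-metrics table from a model's heat scores."""
    y_hat = (y_prob >= tau).astype(int)
    return {"model": model,
            "accuracy": accuracy_score(y_true, y_hat),
            "auc": roc_auc_score(y_true, y_prob),
            "recall": recall_score(y_true, y_hat),
            "f1": f1_score(y_true, y_hat),
            "precision": precision_score(y_true, y_hat)}

# Toy scores for two models (illustrative only).
y = np.array([0, 1, 0, 1, 0, 0, 1, 0])
rows = [metrics_row("price_only", y, np.array([0.2, 0.4, 0.3, 0.6, 0.1, 0.5, 0.7, 0.2])),
        metrics_row("final", y, np.array([0.1, 0.8, 0.2, 0.7, 0.1, 0.3, 0.9, 0.2]))]
table = pd.DataFrame(rows).sort_values("f1", ascending=False).reset_index(drop=True)
```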

4.3 Plots & main conclusions

  • Confusion matrix (final model): built from y_test_grid vs. (y_prob_test_best ≥ τ) (F1-tuned τ); shows more true positives than false positives while keeping many true negatives.
  • ROC: price-only vs. final: final curve dominates price-only over most FPR, confirming better ranking of “hot” ticker-days.
Bottom line: Adding WSB text, sentiment, and richer price features clearly improves F1 and ROC-AUC over numeric-only baselines; the tuned text+numeric logistic is the right choice for the final heat score.
Accuracy / metrics table (test set)

Takeaway: Final text+numeric model leads on F1/ROC while keeping accuracy similar to simpler baselines; text-only already provides a big lift over numeric-only features.

Confusion matrix — tuned final model (test)
Confusion matrix for the final model on the test set

Takeaway: More true positives than false positives while keeping many true negatives, consistent with a threshold tuned for F1 on an imbalanced label.

ROC: price-only vs. final model
ROC curve comparing price-only logistic to the tuned final model

Takeaway: The tuned text+numeric model dominates the price-only baseline across operating points, showing better ranking of hot tickers.

F1-driven model selection · ROC / ranking checks · Test set used once