Predicting next-day “heat” for stocks from r/WallStreetBets

This project combines Reddit’s r/WallStreetBets discussion data with historical US stock prices to predict which tickers are most likely to experience large next-day moves, and ranks them with a learned heat score. Dataset: 16,645 ticker-days across 78 tickers (2023-06-08 → 2025-03-31), big-move base rate = 10.9%. The final model blends WSB mention volume, sentiment scores, engagement signals, and lagged price/volatility features; baselines include popularity-only rankings and logistic regression on simple tabular features.

Machine Learning · Python · Reddit + Market Data · Meme stocks & volatility 🚀

Designed for peer grading: each section below matches the Assignment 2 rubric (task, data, modeling, evaluation, related work).

1. Predictive Task & Evaluation Plan

Predict large next-day moves for each ticker-day using WSB chatter + price/volume history, then rank tickers by a heat score.

Overview. We start with a time-series view of NVDA that plots an indexed price alongside daily WSB attention (mentions, scores, unique authors) and highlights high-attention days and large next-day absolute returns. Hype days often coincide with volatility, but not every spike is followed by a big move.

NVDA: price, WSB attention, and big moves

1.1 Predictive task

Task. For each (ticker, day), predict whether the ticker will make a large next-day move and assign a heat score (probability of a big move).

Label: big_move = 1 if |next_return| ≥ 5%; else 0.
Inputs (per ticker-day):
- WSB activity: mention counts, upvote score sums, unique authors, post/comment mix.
- Text: aggregated WSB posts/comments represented with TF-IDF (1-2 grams).
- Sentiment: VADER scores aggregated per ticker-day.
- Price/volume history: return, log-volume, recent return windows (1d/3d/5d), rolling volatility, volume anomalies.
Output: P(big_move = 1 | features) — a continuous heat score in [0,1] used for classification or ranking.

1.2 Evaluation setup

Train / val / test split: oldest ~70% train, next ~15% val, most recent ~15% test (date-based, no shuffling).
Model family: logistic regression pipelines (course-relevant) combining TF-IDF text + standardized numeric features.
Hyperparameters & threshold: tuned with TimeSeriesSplit on train+val; decision threshold picked on a held-out slice for F1; test set used once.
Metrics: primary F1; also precision, recall, ROC-AUC, accuracy for context; ROC curves judge heat-score ranking.
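
A minimal sketch of the date-based split described above. It assumes the modeling table has a `datetime` column and that the cut points are taken over unique dates rather than rows; the helper name `chronological_split` and the toy table are illustrative, not the project's actual code.

```python
import pandas as pd

def chronological_split(df, date_col="datetime", train_frac=0.70, val_frac=0.15):
    """Split rows into train/val/test by date, oldest first, with no shuffling."""
    # Assumption: cut on unique dates so a single day never straddles two splits.
    dates = df[date_col].drop_duplicates().sort_values().reset_index(drop=True)
    n_train = int(round(len(dates) * train_frac))
    n_val = int(round(len(dates) * val_frac))
    train_end = dates.iloc[n_train - 1]
    val_end = dates.iloc[n_train + n_val - 1]
    train = df[df[date_col] <= train_end]
    val = df[(df[date_col] > train_end) & (df[date_col] <= val_end)]
    test = df[df[date_col] > val_end]
    return train, val, test

# Toy table: 20 business days for 2 tickers (placeholder rows only).
days = pd.bdate_range("2023-06-08", periods=20)
toy = pd.DataFrame({
    "datetime": list(days) * 2,
    "ticker": ["NVDA"] * 20 + ["TSLA"] * 20,
})
train, val, test = chronological_split(toy)
```

Cutting on dates rather than row counts keeps every ticker's observation for a given day in the same split, which is one way to honor the "no shuffling" constraint.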

1.3 Baselines

Trivial: always 0, always 1.
Numeric logistic: price-only; WSB-only numeric; price + WSB numeric.
Text baselines: TF-IDF text-only; TF-IDF text + basic numeric (return, log_volume, mention_count, score_sum).
Final model: TF-IDF text + all numeric (price/volume history, volatility, volume anomalies, WSB activity, sentiment) with tuned hyperparameters/threshold.

1.4 Validity / sanity checks

No look-ahead: targets use next-day returns; features use same-day or lagged signals; chronological splits prevent leakage.
Robustness: confusion matrices and ROC curves compare price-only vs final; slice by ticker/time (e.g., NVDA) to see when heat aligns with true big moves.

Examples: 16,645 ticker-days
Tickers: 78 tracked symbols
Date range: 2023-06-08 → 2025-03-31
Base rate: 10.9% big moves

Binary classification (big move vs. no move) · Time-based train / val / test split · Imbalanced labels → F1 & ROC focus

2. Exploratory Analysis, Data Collection & Pre-processing

Reddit WSB posts/comments and Yahoo Finance prices are aligned into a daily (ticker, day) table used for EDA and all downstream models.

2.1 Data sources (context)

Reddit (r/WallStreetBets): CSV of posts/comments with text, author, timestamp, score, and tags.
Yahoo Finance: Daily OHLCV prices for a curated set of WSB tickers over 2023–2025.

Goal: align these into a daily (ticker, day) table that feeds both EDA and the prediction task.

Example raw WSB rows (truncated)

register_index	post_id	comment_id	author	datetime	title	url	score	comments	text	author_post_karma	tag
14b78hkjoe86nf	14b78hk	joe86nf	scott_jr	2023-06-16 20:36:55			1.0		Watch til 1 10	32102.0	Meme
14b71m2post	14b71m2		merakibret	2023-06-16 20:24:01	I had my first ever big success with options t...	https://www.reddit.com/r/wallstreetbets/commen...	8.0	6.0	Entered an Iron Condor on ADBE yesterday at 45...	343.0	Gain
14b71m2joe6du9	14b71m2	joe6du9	VisualMod	2023-06-16 20:24:07			1.0		User Report Tota...	725083.0	Gain
14b71m2joe6een	14b71m2	joe6een	VisualMod	2023-06-16 20:24:13			2.0		That was a very wise move	725083.0	Gain
14b71m2joe7yy4	14b71m2	joe7yy4	DreamcatcherEgg	2023-06-16 20:35:23			2.0		All you have to do is repeat this same winning...	6088.0	Gain

2.2 Cleaning and ticker construction

Normalize datetime to dates; drop bots/mods and [deleted]/[removed] rows.
Build raw_text from title + text, then clean_text by stripping URLs, collapsing whitespace, lowercasing.
Extract candidate tickers via cashtags ($TSLA) and uppercase tokens (TSLA, NVDA); keep tokens with ≥500 mentions; filter obvious non-stocks; require valid Yahoo price history.
Explode to one row per (WSB row, ticker): wsb_exploded.
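
The cleaning and ticker-extraction steps above could look roughly like the sketch below. The regular expressions, the stoplist, the once-per-row mention counting, and the helper names are all assumptions made for illustration; the toy example uses a threshold of 2 mentions instead of the 500 used in the project.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
CASHTAG_RE = re.compile(r"\$([A-Za-z]{1,5})\b")   # e.g. $TSLA
UPPER_RE = re.compile(r"\b[A-Z]{2,5}\b")          # e.g. TSLA, NVDA

def clean_text(raw_text):
    """Strip URLs, collapse whitespace, lowercase."""
    no_urls = URL_RE.sub(" ", raw_text)
    return re.sub(r"\s+", " ", no_urls).strip().lower()

def candidate_tickers(raw_text):
    """Cashtags plus all-caps tokens, taken from the raw (not lowercased) text."""
    text = URL_RE.sub(" ", raw_text)
    cashtags = {m.upper() for m in CASHTAG_RE.findall(text)}
    return cashtags | set(UPPER_RE.findall(text))

def select_tickers(rows, min_mentions=500, stoplist=frozenset()):
    """Keep tokens mentioned in at least min_mentions rows and not in the stoplist."""
    counts = Counter(t for raw in rows for t in candidate_tickers(raw))
    return {t for t, n in counts.items() if n >= min_mentions and t not in stoplist}

rows = [
    "Bought $TSLA calls https://example.com/x   today",
    "NVDA to the moon, YOLO",
    "tsla and nvda are lowercase here",
    "NVDA earnings   tomorrow",
]
kept = select_tickers(rows, min_mentions=2, stoplist={"YOLO"})

# Explode to one (row index, ticker) pair per kept mention.
exploded = [(i, t) for i, raw in enumerate(rows)
            for t in sorted(candidate_tickers(raw) & kept)]
```

The check against valid Yahoo price history is left out here because it depends on the downloaded price table.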

2.3 Price alignment and feature table

Sample Yahoo Finance price rows (long format)

datetime	ticker	adj_close	close	high	low	open	volume
6/1/23	AAPL	177.93	180.09	180.12	176.93	177.70	68901800
6/2/23	AAPL	178.78	180.95	181.78	179.26	181.03	61996900
6/5/23	AAPL	177.43	179.58	184.95	178.04	182.63	121946500
6/6/23	AAPL	177.06	179.21	180.12	177.43	179.97	64848400
6/7/23	AAPL	175.69	177.82	181.21	177.32	178.44	61944600

From Yahoo prices: compute daily return, next_return, and big_move = 1[|next_return| ≥ 5%]; keep (datetime, ticker, close, volume, return, next_return, big_move).
Aggregate WSB to (datetime, ticker):
- Numeric: mention_count, score_sum, score_mean, post_fraction, unique_authors.
- Text: concatenate clean_text → doc_text.
Merge price + WSB numeric + text, add log_volume = log(1 + volume), drop rows with missing core features. Final modeling table: ~16k daily examples.
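
As a rough illustration of the aggregation and merge, the sketch below builds the per-(datetime, ticker) WSB features with pandas and joins them to prices. The `is_post` flag, the inner join, and all toy values are assumptions; the column names otherwise follow the ones listed above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for wsb_exploded: one row per (WSB row, ticker).
wsb_exploded = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-06-15"] * 3 + ["2023-06-16"]),
    "ticker": ["NVDA", "NVDA", "TSLA", "NVDA"],
    "author": ["a", "b", "a", "a"],
    "score": [8.0, 1.0, 2.0, 5.0],
    "is_post": [True, False, False, True],  # hypothetical flag: post vs. comment
    "clean_text": ["nvda calls", "to the moon", "tsla puts", "nvda earnings"],
})

wsb_daily = (
    wsb_exploded.groupby(["datetime", "ticker"])
    .agg(
        mention_count=("score", "size"),
        score_sum=("score", "sum"),
        score_mean=("score", "mean"),
        post_fraction=("is_post", "mean"),
        unique_authors=("author", "nunique"),
        doc_text=("clean_text", " ".join),
    )
    .reset_index()
)

# Toy price rows (placeholder values, not real quotes).
prices = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-06-15", "2023-06-15", "2023-06-16"]),
    "ticker": ["NVDA", "TSLA", "NVDA"],
    "close": [100.0, 200.0, 103.0],
    "volume": [1_000_000, 2_000_000, 1_500_000],
})

# Assumption: an inner join, so ticker-days without WSB activity are dropped.
table = prices.merge(wsb_daily, on=["datetime", "ticker"], how="inner")
table["log_volume"] = np.log1p(table["volume"])  # log(1 + volume)
table = table.dropna(subset=["close", "volume", "mention_count", "doc_text"])
```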

2.4 Key EDA findings (with plots)

Return distribution: centered near 0 with fat tails; 5% sits in the tail → reasonable “big move” cutoff.
WSB attention vs volatility: higher mention buckets have larger mean |next-day return| → attention aligns with higher volatility.
Text content: word cloud dominated by tickers, options slang, and event words → confirms domain relevance.
Ticker case studies (NVDA, TSLA): indexed price + high-attention marks + big-move bars show large moves often coincide with elevated WSB activity, motivating a probabilistic model.
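
The attention-vs-volatility comparison can be reproduced along the lines of the sketch below, which buckets ticker-days by mention count and averages |next-day return| per bucket. The bin edges are an assumption and the numbers are made up purely to exercise the code; they are not the project's results.

```python
import pandas as pd

# Made-up rows for illustration only.
table = pd.DataFrame({
    "mention_count": [1, 2, 3, 8, 12, 15, 40, 60, 90],
    "next_return": [0.01, -0.02, 0.00, 0.03, -0.01, 0.02, -0.06, 0.04, 0.02],
})

# Assumed bucket edges; the report does not state the ones it used.
bins = [0, 5, 20, float("inf")]
labels = ["1-5", "6-20", "21+"]
table["mention_bucket"] = pd.cut(table["mention_count"], bins=bins, labels=labels)

# Mean absolute next-day return per mention bucket.
lift = (
    table.assign(abs_next=table["next_return"].abs())
    .groupby("mention_bucket", observed=True)["abs_next"]
    .mean()
)
```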

2.5 Inputs to the model (summary)

Inputs: WSB numeric features, price/volume + lags/volatility, VADER sentiment, and doc_text (for TF-IDF).
Label: big_move ∈ {0,1} for ≥5% absolute next-day moves.

Daily returns distribution (all tickers)

Takeaway: Returns are centered near 0 with fat tails; the 5% VaR bands capture the big-move threshold used for labels, underscoring the class imbalance of rare ±5% days.

Average daily return by day of week

Takeaway: Modest positive drift early in the week and weaker performance on Thursdays; day-of-week is a minor effect compared to WSB attention spikes.

WSB attention vs. volatility

Takeaway: Higher mention buckets correlate with larger absolute next-day returns (volatility lift), motivating the inclusion of WSB volume features.

WSB title/phrase cloud

Takeaway: Dominant tokens are ticker symbols, options slang, and catalyst language (earnings, calls/puts), which the TF-IDF text features capture.

3. Modeling

Binary classification at the (ticker, day) level; the predicted probability is used as a heat score to rank tickers by risk of a large move.

3.1 Task formulation

Unit: one row = one (ticker, date).
Label: big_move = 1 if |next_return| ≥ 5% (next-day close-to-close); else 0.
Output: \hat{p} = P(big_move = 1 | x), the heat score; higher = more likely ≥5% absolute move tomorrow.
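
A small sketch of how the label could be derived from close-to-close prices. It assumes a long-format price table with `datetime`, `ticker`, and `close` columns; the helper name `add_labels` and the toy prices are illustrative.

```python
import pandas as pd

def add_labels(prices, threshold=0.05):
    """Add return, next_return, and big_move per ticker (close-to-close)."""
    out = prices.sort_values(["ticker", "datetime"]).copy()
    prev_close = out.groupby("ticker")["close"].shift(1)
    out["return"] = out["close"] / prev_close - 1.0
    # next_return is tomorrow's return for the same ticker (the prediction target).
    out["next_return"] = out.groupby("ticker")["return"].shift(-1)
    # The last day of each ticker has no next-day return, so it cannot be labeled.
    out = out.dropna(subset=["next_return"])
    out["big_move"] = (out["next_return"].abs() >= threshold).astype(int)
    return out

# Toy prices (placeholder values).
prices = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2023-06-12", "2023-06-13", "2023-06-14", "2023-06-15",
         "2023-06-12", "2023-06-13", "2023-06-14"]
    ),
    "ticker": ["X", "X", "X", "X", "Y", "Y", "Y"],
    "close": [100.0, 106.0, 104.0, 104.0, 50.0, 47.0, 47.5],
})
labeled = add_labels(prices)
```

Grouping by ticker before shifting keeps one ticker's last price from leaking into another ticker's first return.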

3.2 Features

Price & volume (short-term dynamics): return, log_volume; lags ret_prev_1d/3d/5d; volatility vol_5d/vol_10d; volume anomaly vol_rel_5d.
WSB numeric signals: mention_count, score_sum, score_mean, post_fraction, unique_authors.
Sentiment: VADER sent_mean per ticker-day.
Textual signal: doc_text (all WSB text per ticker-day) → TF-IDF (1–2 grams, capped vocab).
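
One possible construction of the price and volume features, shown as a sketch. The exact definitions are assumptions, since the report only names the columns: here `ret_prev_kd` is the return lagged by k days, `vol_wd` is the rolling standard deviation of returns over w days, and `vol_rel_5d` is volume divided by its 5-day rolling mean.

```python
import pandas as pd

def add_price_features(df):
    """Add lagged-return, rolling-volatility, and volume-anomaly columns per ticker."""
    out = df.sort_values(["ticker", "datetime"]).copy()
    g = out.groupby("ticker")
    for k in (1, 3, 5):
        # Assumed definition: the daily return observed k days earlier.
        out[f"ret_prev_{k}d"] = g["return"].shift(k)
    for w in (5, 10):
        # Assumed definition: rolling std of daily returns over the last w days.
        out[f"vol_{w}d"] = g["return"].transform(lambda s, w=w: s.rolling(w).std())
    # Assumed definition: today's volume relative to its 5-day rolling mean.
    out["vol_rel_5d"] = out["volume"] / g["volume"].transform(
        lambda s: s.rolling(5).mean()
    )
    return out

# Toy single-ticker series (placeholder values).
toy = pd.DataFrame({
    "datetime": pd.bdate_range("2023-06-08", periods=12),
    "ticker": ["NVDA"] * 12,
    "return": [0.01 * i for i in range(12)],
    "volume": [100.0] * 12,
})
features = add_price_features(toy)
```

Every column uses only values up to and including the current day, which is consistent with the no-look-ahead requirement in section 1.4.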

TF-IDF (1–2g) · Lagged returns & vol · WSB mentions/authors · Sentiment (VADER)

3.3 Baseline models (all logistic regression)

Price-only logistic — features ['return', 'log_volume'], scaled.
WSB-numeric-only logistic — features ['mention_count', 'score_sum'].
Price + WSB numeric logistic — ['return', 'log_volume', 'mention_count', 'score_sum'].
Text-only logistic — TF-IDF on doc_text (max_features 10k, ngram_range (1,2)).
Text + basic numeric logistic — TF-IDF + scaled [return, log_volume, mention_count, score_sum] (sparse + dense hstack).
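
The "sparse + dense hstack" in the text + basic numeric baseline could be assembled as in the sketch below. The toy documents, numbers, and labels are placeholders, and applying `class_weight="balanced"` to this baseline is an assumption carried over from the numeric baselines.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy training data: doc_text plus [return, log_volume, mention_count, score_sum].
train_docs = [
    "nvda calls earnings", "tsla puts", "nvda to the moon",
    "quiet day", "tsla earnings calls", "nothing happening",
]
train_num = np.array([
    [0.02, 17.1, 40, 300], [-0.01, 16.8, 12, 50], [0.03, 17.5, 55, 420],
    [0.00, 16.2, 3, 5], [0.01, 17.0, 30, 210], [0.00, 16.0, 2, 4],
], dtype=float)
y_train = np.array([1, 0, 1, 0, 1, 0])

tfidf = TfidfVectorizer(max_features=10_000, ngram_range=(1, 2))
scaler = StandardScaler()

# Fit both transformers on training rows only, then stack sparse text + dense numeric.
X_text = tfidf.fit_transform(train_docs)
X_num = scaler.fit_transform(train_num)
X_train = hstack([X_text, csr_matrix(X_num)]).tocsr()

clf = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Score new ticker-days with the already-fitted transformers.
new_docs = ["nvda earnings calls", "quiet"]
new_num = np.array([[0.02, 17.3, 45, 350], [0.00, 16.1, 2, 3]], dtype=float)
X_new = hstack([tfidf.transform(new_docs), csr_matrix(scaler.transform(new_num))]).tocsr()
heat = clf.predict_proba(X_new)[:, 1]  # heat score = P(big_move = 1)
```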

Each numeric baseline uses:

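from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the numeric features, then fit an L2-regularized logistic regression.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(
        penalty="l2",
        C=1.0,
        class_weight="balanced",  # upweight the rare big_move = 1 class
        max_iter=1000,
        solver="lbfgs",
    )),
])
# X_train holds the baseline's numeric columns; y_train holds big_move.
pipe.fit(X_train, y_train)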

3.4 Final model and tuning

🏁 Final model · F1-tuned grid search · ROC-based ranking check

Inputs: doc_text + all numeric columns (price/volume history, volatility, volume anomalies, WSB activity, sentiment).
Architecture: ColumnTransformer with TF-IDF branch + scaled numeric branch → LogisticRegression(max_iter=1000).
CV: train+val combined, chronological; TimeSeriesSplit(n_splits=5).
Grid search: TF-IDF max_features ∈ {5000,10000}, ngram_range ∈ {(1,1),(1,2)}, min_df=5, max_df=0.7; Logistic C ∈ {0.1,1.0,10.0}, class_weight ∈ {'balanced', None}. Maximize F1; freeze best pipeline and evaluate once on test.
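
A sketch of how the architecture and grid above might be wired together in scikit-learn. The step names (`features`, `text`, `num`, `clf`), the helper `build_search`, the shortened numeric column list, and the synthetic data are assumptions for illustration; the grid values are the ones listed above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Shortened numeric column list for the demo; the final model uses all numeric columns.
NUMERIC_COLS = ["return", "log_volume", "mention_count", "sent_mean"]

def build_search(numeric_cols):
    features = ColumnTransformer([
        # A string (not a list) selector hands TfidfVectorizer a 1-D column of text.
        ("text", TfidfVectorizer(min_df=5, max_df=0.7), "doc_text"),
        ("num", StandardScaler(), numeric_cols),
    ])
    pipe = Pipeline([
        ("features", features),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    param_grid = {
        "features__text__max_features": [5000, 10000],
        "features__text__ngram_range": [(1, 1), (1, 2)],
        "clf__C": [0.1, 1.0, 10.0],
        "clf__class_weight": ["balanced", None],
    }
    # TimeSeriesSplit follows row order, so rows must already be sorted by date.
    return GridSearchCV(pipe, param_grid, scoring="f1", cv=TimeSeriesSplit(n_splits=5))

# Synthetic, date-ordered stand-in for the train+val table (not real data).
rng = np.random.default_rng(0)
n = 120
words = ["calls", "puts", "earnings", "moon", "squeeze", "yolo"]
toy = pd.DataFrame({
    "doc_text": [f"{words[i % 6]} {words[(i + 1) % 6]}" for i in range(n)],
    "return": rng.normal(0, 0.02, n),
    "log_volume": rng.normal(17, 1, n),
    "mention_count": rng.integers(1, 50, n),
    "sent_mean": rng.uniform(-1, 1, n),
})
y = np.array([1 if i % 6 == 0 else 0 for i in range(n)])

search = build_search(NUMERIC_COLS).fit(toy, y)
heat = search.predict_proba(toy)[:, 1]
```

Because `TimeSeriesSplit` cuts on row positions, a fold boundary can fall inside a single trading day when several tickers share a date; sorting by date first keeps each validation fold strictly later than its training fold up to that boundary day.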

3.5 Advantages and limitations

Pros: aligned with course content; fast on ~16k examples; produces probability heat scores for ranking; handles sparse TF-IDF + dense numeric.
Handling imbalance/dynamics: class weights; lagged returns/volatility/volume anomalies; attention signals (mentions/authors/sentiment).
Limits: linear boundary, no explicit text–price interactions, no temporal sequence model.

Next steps: explore boosted trees or sequence models to capture nonlinear text–price interactions and temporal patterns beyond single-day features.

4. Evaluation

Metrics, baselines, and diagnostic plots that justify the final model choice.

4.1 Protocol

Split: Time-based 70% train / 15% val / 15% test (no shuffling).
Outputs: Model returns P(big_move=1) = heat score.
Threshold: One decision threshold picked on validation F1, then fixed; test is used once.
Metrics (test): main = F1 (positives ≈ 10–12%); also precision, recall, ROC-AUC, accuracy; diagnostics via confusion matrix + ROC curves.
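
Picking the decision threshold for F1 can be sketched as a simple sweep over candidate values on validation predictions. The candidate grid, the helper names, and the toy labels and probabilities are assumptions for illustration.

```python
import numpy as np

def f1_at_threshold(y_true, y_prob, tau):
    """F1 of the rule 'predict big move when heat score >= tau'."""
    y_pred = y_prob >= tau
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def pick_threshold(y_true, y_prob, grid=None):
    """Return (tau, F1) for the first threshold in the grid that maximizes F1."""
    grid = np.linspace(0.05, 0.95, 19) if grid is None else np.asarray(grid)
    scores = [f1_at_threshold(y_true, y_prob, t) for t in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), scores[best]

# Toy validation labels and heat scores (placeholders).
y_val = np.array([0, 0, 0, 0, 1, 0, 1, 1])
p_val = np.array([0.12, 0.22, 0.32, 0.47, 0.42, 0.17, 0.72, 0.91])
tau, best_f1 = pick_threshold(y_val, p_val)
```

The chosen `tau` is then held fixed and applied to test-set probabilities exactly once.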

4.2 Baselines vs. final model

Trivial: always 0 (high accuracy, F1≈0); always 1 (recall=1, terrible precision/accuracy).
Course-style logistic baselines: price-only (return, log_volume); WSB numeric-only (mention_count, score_sum); price + WSB numeric (small lift); TF-IDF text-only (big recall/ROC gain); TF-IDF text + basic numeric (best untuned).
Final model: TF-IDF + all numeric (price lags, volatility, volume anomaly, WSB numeric, sentiment), tuned by GridSearchCV (TimeSeriesSplit, scoring=F1). Best F1 and ROC-AUC on test with similar accuracy to simpler baselines.

All results summarized in a test metrics table with columns model, accuracy, auc, recall, f1, precision, sorted by F1.
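
A sketch of how such a table could be assembled, including the two trivial baselines. The labels and the `toy_scores` row are synthetic placeholders, not the project's results, and the 10.9% positive rate in the toy labels simply mirrors the overall base rate; the helper `metrics_row` is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score,
)

def metrics_row(name, y_true, y_prob, tau=0.5):
    """One table row: threshold the heat score at tau, then compute test metrics."""
    y_pred = (y_prob >= tau).astype(int)
    return {
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "precision": precision_score(y_true, y_pred, zero_division=0),
    }

# Synthetic labels with a 10.9% positive rate (109 of 1000).
y_test = np.array([1] * 109 + [0] * 891)
rng = np.random.default_rng(0)

rows = [
    metrics_row("always_0", y_test, np.zeros(len(y_test))),
    metrics_row("always_1", y_test, np.ones(len(y_test))),
    # Synthetic, perfectly separable scores used only to exercise the code.
    metrics_row("toy_scores", y_test, 0.6 * y_test + 0.4 * rng.random(len(y_test))),
]
metrics = (
    pd.DataFrame(rows)
    .sort_values("f1", ascending=False)
    .reset_index(drop=True)
)
```

With a positive rate p, the always-1 baseline has precision p and recall 1, so its F1 is 2p / (1 + p), about 0.197 for p = 0.109; the always-0 baseline reaches accuracy 1 - p with F1 = 0.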

4.3 Plots & main conclusions

Confusion matrix (final model): built from y_test_grid vs. (y_prob_test_best ≥ τ) (F1-tuned τ); shows more true positives than false positives while keeping many true negatives.
ROC: price-only vs. final: final curve dominates price-only over most FPR, confirming better ranking of “hot” ticker-days.

Bottom line: Adding WSB text, sentiment, and richer price features clearly improves F1 and ROC-AUC over numeric-only baselines; the tuned text+numeric logistic is the right choice for the final heat score.

Accuracy / metrics table (test set)

Takeaway: Final text+numeric model leads on F1/ROC while keeping accuracy similar to simpler baselines; text-only already provides a big lift over numeric-only features.

Confusion matrix — tuned final model (test)

Takeaway: More true positives than false positives while keeping many true negatives, consistent with a threshold tuned for F1 on an imbalanced label.

ROC: price-only vs. final model

Takeaway: The tuned text+numeric model dominates the price-only baseline across operating points, showing better ranking of hot tickers.

F1-driven model selection · ROC / ranking checks · Test set used once

5. Related Work & Discussion

WallStreetBets sentiment and markets

Descriptive WSB EDA

Single-stock information transfer (GME)

How we differ