Abstract
This study examines how sentiment intensity and linguistic toxicity differ between X (formerly Twitter) and Bluesky across six topics — four politically polarizing (ICE enforcement, the Israel-Gaza conflict, tariffs, and the job market) and two apolitical controls (coffee and weather). Using approximately 19,000 posts collected via platform scrapers, we apply DistilBERT-based sentiment classification and Toxic-BERT toxicity scoring to each post. We find that platform-level toxicity differences are topic-dependent rather than universal: X users produce more toxic content on identity-charged topics (ICE, Israel), while Bluesky discourse around economic grievances (tariffs, job market) is comparably or more toxic. Algorithmically promoted ("top") feeds are consistently less toxic than chronological ("latest") feeds across both platforms. Statistical tests confirm these differences are significant, though effect sizes are generally small.
1 · Introduction
The migration of users from X to Bluesky following Twitter's 2022 acquisition raised public questions about whether platform design and community norms shape the tenor of political discourse. Bluesky's decentralized, invite-based early growth attracted a distinct user base — predominantly journalists, academics, and tech workers — potentially producing different discourse patterns than X's open, algorithmically driven feed.
This study asks: How do sentiment intensity and linguistic toxicity differ between X and Bluesky when users discuss polarizing topics?
We operationalize this through three measurable constructs:
- Sentiment polarity — whether posts are positive or negative
- Sentiment intensity — the strength and confidence of that sentiment
- Toxicity — model-estimated probability of harmful language, including subtypes (insult, threat, obscenity, identity attack)
Two apolitical topics (coffee, weather) serve as controls to distinguish genuine platform effects from content-driven effects.
2 · Data
2.1 · Collection
Posts were scraped from X and Bluesky for six topics using platform-specific scrapers. For each topic and platform, two feed types were collected: latest (chronological) and top (algorithmically promoted), yielding approximately 800 posts per cell. The total corpus spans approximately 19,000 posts across 24 collection cells (6 topics × 2 platforms × 2 feed types).
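The 24-cell design can be sketched as a simple enumeration. This is an illustrative sketch only: the topic identifiers are shorthand invented here, and scraper internals and per-topic queries are omitted; the ~800-post cap is taken from the text.

```python
from itertools import product

TOPICS = ["ice", "israel_gaza", "tariffs", "job_market", "coffee", "weather"]
PLATFORMS = ["x", "bluesky"]
FEEDS = ["latest", "top"]
POSTS_PER_CELL = 800  # approximate cap per cell, per the collection protocol

# Each cell is one (topic, platform, feed) combination to scrape.
cells = list(product(TOPICS, PLATFORMS, FEEDS))

print(len(cells))                    # 24 collection cells
print(len(cells) * POSTS_PER_CELL)  # ~19,200 posts before filtering
```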
2.2 · Preprocessing
Topic-specific keyword filters removed off-topic noise — for example, ice cream and snow references from the ICE enforcement topic. URLs were stripped and whitespace normalized. Each post was tagged with source platform, feed type, and topic.
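The preprocessing steps above can be sketched as follows. The "ice cream" and "snow" exclusion terms come from the text; the rest of the exclusion list and the exact regexes are assumptions for illustration.

```python
import re

# Hypothetical exclusion terms for the ICE enforcement topic; the text
# names "ice cream" and snow references, but not the full keyword list.
ICE_EXCLUDE = ["ice cream", "snow"]

def clean_post(text: str) -> str:
    """Strip URLs and normalize whitespace, per the preprocessing step."""
    text = re.sub(r"https?://\S+", "", text)
    return re.sub(r"\s+", " ", text).strip()

def is_on_topic(text: str, exclude_terms) -> bool:
    """Drop posts containing any off-topic keyword (case-insensitive)."""
    lowered = text.lower()
    return not any(term in lowered for term in exclude_terms)

post = clean_post("ICE raids reported downtown  https://example.com/x  today")
print(post)  # "ICE raids reported downtown today"
print(is_on_topic("best ice cream in town", ICE_EXCLUDE))  # False
```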
2.3 · Corpus composition
| Topic | Category | Posts retained |
|---|---|---|
| Israel-Gaza | Polarizing | 3,197 |
| Weather | Control | 3,197 |
| Job Market | Polarizing | 3,194 |
| Tariffs | Polarizing | 3,191 |
| Coffee | Control | 3,182 |
| ICE Enforcement | Polarizing | 2,973 |
3 · Methods
3.1 · Sentiment analysis
We use distilbert-base-uncased-finetuned-sst-2-english via HuggingFace Transformers. Each post receives a binary label (POSITIVE / NEGATIVE) and a confidence score in [0.5, 1.0]. We construct a signed sentiment score by assigning negative posts a negative value, mapping the full range to [−1, 1] — capturing both direction and intensity in a single measure.
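The signed-score construction can be sketched as below. The exact transform is not fully specified in the text, so this assumes the confidence range [0.5, 1.0] is linearly rescaled to [0, 1] and then signed by the label, which yields the stated [−1, 1] range.

```python
def signed_sentiment(label: str, confidence: float) -> float:
    """Combine a binary sentiment label and a model confidence in
    [0.5, 1.0] into one signed score in [-1, 1] capturing both
    direction and intensity.

    Assumption: confidence is linearly rescaled to [0, 1] before
    signing; the paper states the target range but not the transform.
    """
    intensity = 2.0 * (confidence - 0.5)  # [0.5, 1.0] -> [0.0, 1.0]
    return intensity if label == "POSITIVE" else -intensity

print(signed_sentiment("NEGATIVE", 0.99))  # close to -1: confidently negative
print(signed_sentiment("POSITIVE", 0.55))  # near 0: weakly positive
```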
3.2 · Toxicity analysis
We use Detoxify's original model (Toxic-BERT), which outputs six scores per post: overall toxicity, severe toxicity, obscenity, insult, threat, and identity attack. The overall toxicity score is the primary outcome variable.
3.3 · Statistical testing
Because toxicity distributions are heavily right-skewed and non-normal, with most posts scoring near zero and a long tail of highly toxic content, we use the Mann-Whitney U test for all platform comparisons rather than parametric alternatives. Effect size is reported as the rank-biserial correlation r = 1 − (2U) / (n₁ · n₂), interpreted as small (|r| < 0.3), medium (0.3 ≤ |r| < 0.5), or large (|r| ≥ 0.5).
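A minimal sketch of the testing procedure, assuming SciPy's `mannwhitneyu` with the first sample's U statistic plugged into the rank-biserial formula above (sign conventions for r vary with which sample's U is used; here it is the first sample's):

```python
from scipy.stats import mannwhitneyu

def platform_toxicity_test(x_scores, bsky_scores):
    """Mann-Whitney U test between two toxicity samples, with the
    rank-biserial effect size r = 1 - 2U / (n1 * n2).
    U here is SciPy's statistic for the first sample."""
    u, p = mannwhitneyu(x_scores, bsky_scores, alternative="two-sided")
    r = 1.0 - (2.0 * u) / (len(x_scores) * len(bsky_scores))
    return u, p, r

# Toy right-skewed samples: most posts near zero, a few highly toxic.
x = [0.01, 0.02, 0.03, 0.05, 0.60, 0.85]
b = [0.01, 0.01, 0.02, 0.03, 0.04, 0.40]
u, p, r = platform_toxicity_test(x, b)
print(f"U={u}, p={p:.3f}, r={r:.2f}")
```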
4 · Results
4.1 · Sentiment overview
Across all topics, political content is overwhelmingly negative. Tariffs had the lowest rate of positive posts (16.8%), followed by ICE enforcement (19.3%) and Israel-Gaza (21.9%). The two control topics were markedly more positive: coffee (41.0%) and weather (44.7%).
[Figure: % positive posts by topic.]
4.2 · Platform differences in sentiment
The largest sentiment gap between platforms appears on the Israel-Gaza topic: Bluesky users post positively only 15.4% of the time, compared to 28.4% on X, a 13-percentage-point difference. On coffee and weather (the controls), the platforms are essentially identical (~41% and ~44–45% positive, respectively), indicating that the Israel-Gaza gap reflects genuine differences in discourse rather than a platform-wide measurement artifact.
Signed sentiment intensity (mean confidence mapped to [−1, 1], where more negative = more intensely negative) reveals a consistent pattern: all six topics trend negative on both platforms, but political topics are far more intensely negative. The chart below shows mean negativity intensity — higher bars mean posts are more confidently negative.
[Figure: mean sentiment negativity intensity by topic and platform (higher = more intensely negative).]
Signed sentiment intensity reveals that political posts are not only more negative but more confidently negative. Bluesky users express stronger negativity than X on Israel (−0.68 vs −0.42) and Job Market (−0.55 vs −0.42), while ICE and Tariffs are similarly intense on both platforms.
4.3 · Toxicity by platform and topic
The central finding of this study is that X is not uniformly more toxic than Bluesky. The direction of the difference depends on the topic:
[Figure: mean toxicity by topic and platform. Y-axis starts at 2% to highlight differences.]
Control topics show near-equal toxicity across platforms, confirming that observed differences on political topics reflect real discourse differences rather than platform-level measurement artifacts.
4.4 · Statistical significance and effect sizes
Mann-Whitney U tests comparing X and Bluesky were run for each topic. Five of six topics show a statistically significant toxicity difference. The single exception is Job Market (p = 0.207), where the platform gap in means does not hold at the rank level, suggesting that the mean difference is driven by outliers rather than a systematic shift. ICE is the only topic with a medium effect size (|r| = 0.30); all others are small.
For signed sentiment intensity, three topics show significant platform differences: Tariffs, Job Market, and Israel — all in the direction of X being more positive (less intensely negative) than Bluesky, consistent with Bluesky's more activist user base on economic and geopolitical topics.
[Figure: rank-biserial r for the platform toxicity effect by topic. * p < 0.05; effect size: |r| < 0.3 small, 0.3–0.5 medium.]
4.5 · Feed type: latest vs. top
Algorithmically promoted (top) feeds are consistently less toxic than chronological (latest) feeds on politically charged topics:
[Figure: mean toxicity, chronological (latest) vs. algorithmically promoted (top) feeds. Y-axis starts at 3% to highlight differences.]
The reduction in toxicity between latest and top feeds occurs on political topics but not on control topics, suggesting that platform algorithms specifically dampen the most charged content in what they surface prominently — not merely that popular posts happen to be calmer.
4.6 · Toxicity subtypes
In the high-toxicity slice (posts above the 95th percentile, threshold ≈ 0.43), the dominant subtypes are obscenity (mean score 0.46) and insult (0.35), with threats (0.05) and severe toxicity (0.05) being comparatively rare. Within this slice for the ICE topic, X accounts for 116 of 150 posts — a 3.4:1 ratio relative to Bluesky's 34, despite the two platforms contributing roughly equal post counts overall.
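The high-toxicity slice can be reproduced as a simple percentile cut. The ~0.43 threshold and subtype means come from the corpus itself; the scores below are toy values generated only to illustrate the slicing logic.

```python
import numpy as np

def high_toxicity_slice(scores, pct=95.0):
    """Return the cut value at the given percentile and the indices of
    posts scoring strictly above it."""
    threshold = np.percentile(scores, pct)
    return threshold, np.flatnonzero(scores > threshold)

rng = np.random.default_rng(0)
# Toy right-skewed scores mimicking the corpus: mass near 0, long tail.
scores = rng.beta(0.3, 5.0, size=1000)
threshold, idx = high_toxicity_slice(scores)
print(threshold, len(idx))  # roughly 5% of posts fall above the cut
```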
5 · Discussion
5.1 · Platform effect is topic-conditional
The assumption that X is categorically more toxic than Bluesky is only partially supported. For topics involving immigration enforcement and geopolitical conflict — where discourse on X skews toward outrage and engagement-driven content — X is significantly more toxic. For economic topics like tariffs and job loss, Bluesky's discourse is equally or more charged. This may reflect the platform's early user base of tech workers and journalists who feel economic precarity acutely and express it with comparable intensity.
5.2 · What algorithmic curation does
The consistent reduction in toxicity between latest and top feeds is a noteworthy and somewhat counterintuitive finding. If confirmed across a larger sample, it suggests that recommendation algorithms — often blamed for amplifying divisive content — may in practice filter out the most harmful posts from prominent placement, even if they simultaneously boost engagement through other mechanisms.
5.3 · Limitations
Several limitations qualify these findings. The sentiment model (DistilBERT fine-tuned on SST-2, a movie review dataset) may misread political sarcasm, irony, or domain-specific terminology. The toxicity model (Toxic-BERT trained on Wikipedia talk pages) differs in register from social media text. Each collection cell was capped at ~800 posts, limiting statistical power for subgroup analyses. Finally, no demographic data is available; observed platform differences may partly reflect user composition rather than platform norms per se.
6 · Conclusion
Platform identity does not straightforwardly predict toxicity or sentiment intensity. X is more toxic on identity-charged political topics; Bluesky matches or exceeds it on economic ones. Algorithmic promotion consistently surfaces less toxic content than chronological feeds, across both platforms and most topics. Taken together, these findings suggest that future platform comparisons should be topic-stratified rather than treating platform as a global moderator of discourse quality — a platform may be simultaneously more civil than its rival on some topics and equally charged on others.
References
- [1] Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
- [2] Hanu, L., & Unitary team. (2020). Detoxify. GitHub. github.com/unitaryai/detoxify
- [3] Socher, R., Perelygin, A., Wu, J., et al. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. Proceedings of EMNLP 2013.
- [4] Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1), 50–60.
- [5] Kerr, G., & Kelleher, J. D. (2023). Comparing discourse across social media platforms: A review. Social Media + Society.