
Sentiment Intensity and Linguistic Toxicity in Political Discourse

A Cross-Platform Comparison of X and Bluesky

Do platform norms shape how toxic political discourse gets? 19,000 posts, 6 topics, two platforms — the answer is more complicated than expected.

April 2026 · Three Months · Solo Project · DATA 480 – Capstone · Nevada State University
NLP · Sentiment Analysis · Toxicity Detection · Social Media · BERT

Abstract

This study examines how sentiment intensity and linguistic toxicity differ between X (formerly Twitter) and Bluesky across six topics — four politically polarizing (ICE enforcement, the Israel-Gaza conflict, tariffs, and the job market) and two apolitical controls (coffee and weather). Using approximately 19,000 posts collected via platform scrapers, I applied DistilBERT-based sentiment classification and Toxic-BERT toxicity scoring to each post. Results show that platform-level toxicity differences are topic-dependent rather than universal: X users produce more toxic content on identity-charged topics (ICE, Israel), while Bluesky discourse around economic grievances (tariffs, job market) is comparably or more toxic. Algorithmically promoted ("top") feeds are consistently less toxic than chronological ("latest") feeds across both platforms. Mann-Whitney U tests confirm these differences are statistically significant, though effect sizes are generally small, with ICE enforcement being the sole exception at a medium effect (|r| = 0.30).

1 · Introduction

The migration of users from X to Bluesky following Twitter's 2022 acquisition raised public questions about whether platform design and community norms shape political discourse. Prior work has established X as a reliable source for measuring public sentiment on real-world events, from geopolitical crises (Das et al., 2025) to consumer perception of public services (Booranakittipinyo et al., 2024). How those patterns change when the same type of discourse moves to a structurally different platform, however, remains underexplored (Kerr & Kelleher, 2023). Bluesky's decentralized, invite-based early growth attracted a distinct user base: predominantly journalists, academics, and tech workers, potentially producing different discourse patterns than X's open, algorithmically driven feed.

This study asks: How do sentiment intensity and linguistic toxicity differ between X and Bluesky when users discuss polarizing topics? I look to answer this question through three measurable constructs:

  • Sentiment polarity - whether posts are positive or negative
  • Sentiment intensity - the strength and confidence of that sentiment
  • Toxicity - model-estimated probability of harmful language, including subtypes (insult, threat, obscenity, identity attack).

Two apolitical topics (coffee, weather) serve as controls to distinguish genuine platform effects from content-driven effects. The central hypothesis is that platform design shapes discourse. The central finding, however, is that the answer is more conditional than the hypothesis implies.

2 · Data

2.1 · Collection

The original plan was to use official APIs from both platforms. However, X's developer API changed from a flat monthly rate to a usage-based model with strict daily pull limits, making this approach impractical without significant cost or time.

So rather than struggling with the official APIs, I turned to Apify, a third-party web scraping platform. It provides scrapers for both X and Bluesky with keyword filtering, feed type selection (top vs. latest), and structured JSON output.

This sidestepped the daily collection limits: building the dataset took hours rather than weeks of incremental pulls, with that time spent queuing topics, monitoring output, and iterating on keyword filters to reduce off-topic noise. The final corpus spans approximately 19,000 posts across 24 collections (6 topics × 2 platforms × 2 feed types), with roughly 800 posts per cell.

Noteworthy

Two feed types were collected per topic per platform: latest (chronological, unfiltered) and top (algorithmically promoted). This distinction became one of the study's more surprising findings and is discussed separately in Section 4.5.

2.2 · Preprocessing

The raw JSON output from Apify contained far more than post text: full user objects, engagement counts, embedded media metadata, reply chains, and retweet wrappers. The first pass stripped everything down to: post text, platform, feed type, topic, and a timestamp. Retweets (posts beginning with "RT @") were dropped as they could have inflated counts for whichever accounts happened to go viral that week. Very short posts (under five tokens after cleaning) were also removed, as they might not carry enough information for the sentiment or toxicity models.

Keyword filters were also needed. The ICE enforcement query returned a large volume of off-topic content such as "ice sculpture," "ice cream," "ice cold," hockey game recaps, and weather reports about ice storms. These were filtered out by requiring additional terms like "immigration," "deportation," "border," or "enforcement." The job market topic had a similar problem: actual job listings ("now hiring," "apply today") and LinkedIn-style promotional posts were included. These were removed by dropping posts with application URLs and common recruiter phrasing. Tariffs and Israel-Gaza were cleaner, with little off-topic noise. Coffee and weather were intentionally left broad, as the goal was to capture casual, low-stakes conversation.

After filtering, URLs were stripped, whitespace normalized, and each post tagged with its source platform, feed type, and topic.
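The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the whitespace-based token count, the `"ice"` topic key, and the exact order of operations are assumptions, and the context terms are only the four examples named in the text.

```python
import re

URL_RE = re.compile(r"https?://\S+")

# Assumed context terms: only the examples named in the text above.
ICE_CONTEXT = ("immigration", "deportation", "border", "enforcement")


def clean_post(text):
    """Strip URLs and collapse runs of whitespace into single spaces."""
    without_urls = URL_RE.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()


def keep_post(raw_text, topic, min_tokens=5):
    """Return the cleaned text if the post passes all filters, else None."""
    if raw_text.startswith("RT @"):  # drop retweets
        return None
    cleaned = clean_post(raw_text)
    if len(cleaned.split()) < min_tokens:  # drop very short posts
        return None
    # Hypothetical topic key "ice": require at least one context term.
    if topic == "ice" and not any(t in cleaned.lower() for t in ICE_CONTEXT):
        return None
    return cleaned
```

A post such as "ice cream is the best thing ever made" is rejected for the ICE topic because it contains none of the context terms, while the same text would pass for a broad control topic.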

2.3 · Corpus composition

Topic | Category | Posts retained
Israel-Gaza | Polarizing | 3,197
Weather | Control | 3,197
Job Market | Polarizing | 3,194
Tariffs | Polarizing | 3,191
Coffee | Control | 3,182
ICE Enforcement | Polarizing | 2,973

3 · Methods

3.1 · Sentiment analysis

The two most common alternatives for social media sentiment, VADER and TextBlob, were both considered and rejected. Prior research has applied both tools to social media toxicity tasks (Singgalen, 2024), demonstrating their utility in controlled settings but also their sensitivity to domain mismatch. I chose DistilBERT (Sanh et al., 2019) fine-tuned on the Stanford Sentiment Treebank (Socher et al., 2013) because its softmax output gives a per-class probability and a confidence score that neither VADER nor TextBlob provides. This confidence yields the signed intensity measure: a post classified NEGATIVE with 0.97 confidence is treated differently from one classified NEGATIVE with 0.52 confidence, and that distinction is central to the platform comparison in Section 4.1. What none of the three models does, and what no off-the-shelf model reliably does, is detect sarcasm. This limitation is discussed in Section 5.3.

I also tested whether the candidate models could detect sarcasm. For scoring, each post receives a binary label (POSITIVE / NEGATIVE) and a confidence score in [0.5, 1.0]. I constructed a signed sentiment score by assigning negative posts a negative value, so that scores span [−1, 1] and capture both direction and intensity in a single measure.
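One plausible reading of the signed score is sketched below: keep the classifier's confidence as the magnitude and flip its sign for NEGATIVE posts. The function name is hypothetical, and the exact mapping used in the study is an assumption here, since the text does not spell out the formula.

```python
from statistics import mean


def signed_sentiment(label, confidence):
    """Signed score: +confidence for POSITIVE posts, -confidence for NEGATIVE.

    Assumed mapping; confidence is the winning-class probability in [0.5, 1.0].
    """
    if label not in ("POSITIVE", "NEGATIVE"):
        raise ValueError(f"unexpected label: {label!r}")
    if not 0.5 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0.5, 1.0]")
    return confidence if label == "POSITIVE" else -confidence


# Illustrative (made-up) predictions, not data from the study.
predictions = [("NEGATIVE", 0.97), ("NEGATIVE", 0.52), ("POSITIVE", 0.90)]
scores = [signed_sentiment(label, conf) for label, conf in predictions]
mean_score = mean(scores)  # negative when negative posts dominate
```

Under this mapping a confidently negative post (−0.97) pulls a topic's mean much further down than a borderline one (−0.52), which is the distinction the intensity measure is meant to capture.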

Sentiment model selection

Why rejected: Relies on a fixed lexicon and hand-tuned heuristics. Handles capitalization and punctuation well but fails on political sarcasm and domain-specific language.
Type: Rule-based (lexicon + heuristics)
Confidence Score: No
Handles Negation: Partial
Handles Sarcasm: Poor
Training Domain: Hand-tuned social media lexicon
Domain Gap: Medium

Test examples

3/5 correct
Post: "deportation flights are great, actually"
Expected: NEGATIVE → got: POSITIVE (+0.625) ✗ Wrong

Scores "great" as strongly positive. The sarcastic "actually" has no lexicon entry and is ignored entirely.

3.2 · Toxicity analysis

The main alternative considered was Google's Perspective API. It requires an authenticated API call per post, which is infeasible for 19,000 posts in an offline pipeline, and Google has announced that Perspective API is being sunset after 2026. I also looked at HateSonar, but it only outputs a single binary hate/not-hate label with no subtype breakdown.

Detoxify's Toxic-BERT (Hanu & Unitary team, 2020) runs fully locally and was trained on the Jigsaw Toxic Comment Classification dataset, a large, well-documented collection of online comments covering a wide range of harmful language styles. It outputs six distinct scores per post rather than a single toxicity flag: overall toxicity, severe toxicity, obscenity, insult, threat, and identity attack. That subtype breakdown ended up being useful: the high-toxicity slice analysis in Section 4.6 shows that most extreme content is driven by obscenity and insults rather than threats, which would have been invisible with a single-score model.

Toxicity model selection

Why rejected: Lexicon-based matching means obfuscated text ("f***ing") evades detection entirely, and implicit toxicity with no explicit slur is also missed. Even when it fires correctly, every result is the same binary flag, with no severity, no subtype, and no continuous score for quantitative comparison.
Type: Binary classifier (SVM)
Output: Binary (hate / not-hate)
Runs Locally: Yes
Open Source: Yes
Subtype Breakdown: No
Training Data: Stormfront forums + Twitter

Test examples

3/5 correct
Post: "send them all back, I don't care what happens to them"
Expected: TOXIC → got: CLEAN (0.62) ✗ Wrong

No explicit slurs. Implicit dehumanization with no pattern match — classified as neither hate nor offensive language.

3.3 · Statistical testing

Three tests were considered for platform comparisons: Student's t-test, Welch's t-test, and Mann-Whitney U (Mann & Whitney, 1947). The t-tests have a practical problem with this data. The toxicity distributions are heavily right-skewed: over 90% of posts score below 0.05, with a long tail reaching near 1.0. A t-test operates on means, which are pulled upward by that tail. A single viral high-toxicity post can shift the platform mean enough to affect a p-value without any real difference in how the typical post on each platform behaves. Mann-Whitney U avoids this by ranking all observations, so an outlier at 0.99 occupies a single rank instead of carrying roughly 100× the weight of a post at 0.01.
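The sensitivity of the mean to the tail can be illustrated with a small synthetic example. The numbers below are invented purely for illustration (a right-skewed baseline of 400 scores plus 50 injected high-toxicity posts) and are not drawn from the study's data.

```python
from statistics import mean, median

# Synthetic, right-skewed baseline: 400 scores, 90% of them at 0.01.
baseline = [0.01] * 360 + [0.05] * 30 + [0.60] * 10

# Inject 50 hypothetical "viral" posts with near-maximal toxicity.
viral = [0.99] * 50
combined = baseline + viral

mean_before, median_before = mean(baseline), median(baseline)
mean_after, median_after = mean(combined), median(combined)

print(f"mean:   {mean_before:.4f} -> {mean_after:.4f}")
print(f"median: {median_before:.4f} -> {median_after:.4f}")
```

In this toy setup the mean rises several-fold while the median does not move at all, which is the behaviour that motivates a rank-based test over a mean-based one for these distributions.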

The effect size is reported as the rank-biserial correlation r = 1 − (2U) / (n₁ · n₂), with U taken for the X sample (the convention implied by the values reported in Section 4.4) and n₁ and n₂ the two sample sizes. Under this convention, a negative r means X posts tend to rank higher in toxicity than Bluesky posts; a positive r means the reverse. The magnitude scale is: |r| < 0.1 = negligible, 0.1–0.3 = small, 0.3–0.5 = medium, > 0.5 = large. The rank-biserial r is bounded to [−1, 1] and has a direct probabilistic interpretation: |r| = 0.3 means a randomly drawn post from the higher-ranking platform outranks a randomly drawn post from the other platform about 65% of the time.
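A small worked sketch of the effect-size formula follows. It counts U directly from pairwise comparisons (ties counted as one half), which is fine for illustration but quadratic in sample size; in practice a library routine such as SciPy's `mannwhitneyu` would supply U. The function names and toy scores are hypothetical. Note that the sign of r depends on which sample's U enters the formula: with U counted for sample a, r is negative when a tends to rank higher, and using the other sample's U flips the sign.

```python
def u_statistic(a, b):
    """Mann-Whitney U for sample a: number of (x, y) pairs with x > y, ties as 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def rank_biserial(u, n1, n2):
    """Rank-biserial r = 1 - 2U / (n1 * n2), as given in the text."""
    return 1 - (2 * u) / (n1 * n2)


def prob_a_outranks_b(r):
    """With U counted for sample a, P(a-post ranks above b-post) = U / (n1 * n2) = (1 - r) / 2."""
    return (1 - r) / 2


# Toy scores, invented for illustration only.
a = [0.9, 0.4, 0.3]
b = [0.5, 0.2, 0.1]

u = u_statistic(a, b)                   # 7 of the 9 pairs have a > b
r = rank_biserial(u, len(a), len(b))    # 1 - 14/9, about -0.556
p = prob_a_outranks_b(r)                # 7/9, about 0.778
```

Plugging in r = −0.3 gives a probability of 0.65, which is the "65% of the time" interpretation quoted for a medium effect.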

[Interactive figure: Why the mean misleads. Histogram of toxicity scores (x-axis: toxicity score, 0.00 to 1.00) for a baseline of n = 400 posts (mean 0.0708, median 0.0183), with a slider that injects up to 50 high-toxicity posts to show how differently the mean and median respond.]

4 · Results

4.1 · Sentiment overview

Across all topics, political content is overwhelmingly negative. Tariffs had the lowest rate of positive posts (16.8%), followed by ICE enforcement (19.3%) and Israel-Gaza (21.9%). The two control topics were markedly more positive: coffee (41.0%) and weather (44.7%).

% Positive posts by topic

Weather
44.7%
Coffee
41%
Job Market
25.3%
Israel-Gaza
21.9%
ICE Enforcement
19.3%
Tariffs
16.8%

4.2 · Platform differences in sentiment

The largest sentiment gap between platforms appears on the Israel-Gaza topic, where Bluesky users post positively only 15.4% of the time, compared to 28.4% on X (a gap of 13 percentage points). On coffee and weather (the controls), the platforms are essentially identical (≈41% and ≈44–45% positive respectively), which suggests the gap is content-driven rather than a platform artifact.

Signed sentiment intensity (mean confidence mapped to [−1, 1], where more negative values indicate more intensely negative posts) reveals a consistent pattern: all six topics trend negative on both platforms, but political topics are far more intensely negative. The table below reports mean negativity intensity as a percentage for each topic and platform.

Mean sentiment negativity intensity by topic and platform (higher = more intensely negative)

Topic | X | Bluesky
Weather | 11% | 9%
Coffee | 17% | 18%
Job Mkt | 42% | 55%
Israel | 42% | 68%
ICE | 58% | 63%
Tariffs | 62% | 69%

Key Finding

Signed sentiment intensity reveals that political posts are not only more negative but more confidently negative. Bluesky users express stronger negativity than X on Israel (−0.68 vs. −0.42) and Job Market (−0.55 vs. −0.42), while ICE and Tariffs are similarly intense on both platforms.

4.3 · Toxicity by platform and topic

The central finding of this study is that X is not uniformly more toxic than Bluesky. The direction of the difference depends on the topic. The table below shows mean toxicity scores disaggregated by platform and topic.

Mean toxicity by topic and platform

Topic | X | Bluesky
ICE | 9% | 3%
Israel | 14% | 5%
Tariffs | 9% | 14%
Job Mkt | 7% | 11%
Coffee | 7% | 8%
Weather | 5% | 4%


Control topics show near-equal toxicity across platforms, confirming that observed differences on political topics reflect real discourse differences rather than platform-level measurement artifacts.

4.4 · Statistical significance and effect sizes

Mann-Whitney U tests were run for each topic comparing X and Bluesky. Five of six topics show a statistically significant toxicity difference. The single exception is Job Market (p = 0.207), where the platform gap in means does not hold at the rank level, suggesting the mean difference is driven by outliers rather than a systematic shift. ICE is the only topic with a medium effect size (|r| = 0.30); all others are small or negligible.

For signed sentiment intensity, three topics show significant platform differences (Tariffs, Job Market, and Israel), all in the direction of X being more positive (less intensely negative) than Bluesky, consistent with Bluesky's more activist user base on economic and geopolitical topics.

Rank-biserial r: toxicity platform effect by topic

Topic | r
ICE | -0.301*
Israel | -0.193*
Weather | -0.103*
Coffee | -0.059*
Job Market | -0.026
Tariffs | +0.09*

Negative r: X more toxic. Positive r: Bluesky more toxic.

* p < 0.05  ·  effect size: |r| < 0.1 negligible, 0.1–0.3 small, 0.3–0.5 medium

4.5 · Feed type: latest vs. top

Algorithmically promoted (top) feeds are consistently less toxic than chronological (latest) feeds on politically charged topics. The table below shows this pattern holds for ICE, Israel-Gaza, and Tariffs, but not for the apolitical coffee control.

Mean toxicity: chronological (latest) vs algorithmically promoted (top)

Topic | Latest | Top
ICE | 7% | 4%
Israel | 12% | 7%
Tariffs | 12% | 10%
Coffee | 7% | 7%


Noteworthy

The reduction in toxicity between latest and top feeds occurs on political topics but not on control topics, suggesting that platform algorithms specifically dampen the most charged content in what they surface prominently, not merely that popular posts happen to be calmer.

4.6 · Toxicity subtypes

In the high-toxicity slice (posts above the 95th percentile, threshold ≈ 0.43), the dominant subtypes are obscenity (mean score 0.46) and insult (0.35), with threats (0.05) and severe toxicity (0.05) being comparatively rare. Within this slice for the ICE topic, X accounts for 116 of 150 posts (3.4:1 ratio relative to Bluesky's 34) despite the two platforms contributing roughly equal post counts overall.
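The slice described above can be reproduced in outline as follows. This is a sketch under assumptions: each post is represented as a dict of scores, the key names (`toxicity`, `obscene`, `insult`, `threat`, `severe_toxicity`) are illustrative, and the percentile is computed over whatever list of posts is passed in, which may or may not match how the study pooled topics.

```python
from statistics import mean, quantiles


def high_toxicity_slice(posts, pct=95):
    """Return (threshold, posts scoring strictly above the pct-th percentile)."""
    scores = [p["toxicity"] for p in posts]
    # quantiles(n=100) returns 99 cut points; index pct - 1 is the pct-th percentile.
    threshold = quantiles(scores, n=100, method="inclusive")[pct - 1]
    return threshold, [p for p in posts if p["toxicity"] > threshold]


def subtype_means(posts, subtypes=("obscene", "insult", "threat", "severe_toxicity")):
    """Mean score per subtype within a slice of posts."""
    return {s: mean(p[s] for p in posts) for s in subtypes}


# Synthetic posts for illustration only: 200 evenly spaced toxicity scores.
posts = [
    {"toxicity": i / 200, "obscene": 0.5, "insult": 0.25,
     "threat": 0.0, "severe_toxicity": 0.0}
    for i in range(200)
]
threshold, top = high_toxicity_slice(posts)
means = subtype_means(top)
```

Comparing `subtype_means` across the slice is what separates obscenity-driven from threat-driven content; counting slice members per platform gives ratios like the ICE figure quoted above.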

This shows that extreme content in political discourse is primarily driven by crude language and personal insults rather than explicit threats. Identity attacks are concentrated on immigration and geopolitical topics (ICE, Israel), which is consistent with the toxicity gap between platforms on those subjects.

5 · Discussion

5.1 · Platform effect is topic-conditional

The assumption that X is more toxic than Bluesky is only partially supported. On topics involving immigration enforcement and geopolitical conflict, where discourse on X skews toward outrage and engagement-driven content, X is significantly more toxic. On economic topics like tariffs and job loss, Bluesky's discourse is equally or more charged. This may reflect the platform's early user base of tech workers and journalists, who feel economically invested and express it with comparable intensity.

The finding that Bluesky users express more intensely negative sentiment on Israel, Job Market, and Tariffs, while scoring lower on toxicity for Israel (5% vs. 14% on X), suggests that negativity and toxicity are dissociable constructs. It is possible to hold and express strong negative opinions without producing content that scores high on toxic language measures.

5.2 · What algorithmic curation does

The consistent reduction in toxicity between latest and top feeds (Section 4.5) is noteworthy and somewhat counterintuitive. If confirmed across a larger sample, it suggests that recommendation algorithms, often blamed for amplifying divisive content, may in practice filter out the most harmful posts from prominent placement, even if they also boost engagement through other mechanisms. The fact that this dampening effect does not appear on control topics (coffee) is particularly informative as it implies the algorithm is not merely selecting for popularity since popular coffee posts are no less toxic than unpopular ones.

5.3 · Limitations

Several limitations qualify these findings.

  • Model domain gap. The sentiment model (DistilBERT fine-tuned on SST-2, a movie review dataset) may misread political sarcasm, irony, or domain-specific terminology. The sarcasm failure demonstrated in Table 3, where all three models classify "deportation flights are great, actually" as strongly POSITIVE, illustrates this point.
  • Toxicity training data. Toxic-BERT was trained on Wikipedia talk page comments and Jigsaw competition data, which differs in register from social media text. It may under-flag implicit toxicity common in political social media (veiled dehumanization, dog whistles) while correctly flagging explicit profanity.
  • Sample size per cell. Each collection cell was capped at ≈ 800 posts, limiting statistical power for subgroup analyses (e.g., feed type within platform within topic).
  • No demographic data. Observed platform differences may partly reflect user composition rather than platform norms per se. Bluesky's early user base was not representative of the general population; the platform has diversified since its public launch but demographic data is not available at the post level.
  • Temporal snapshot. Data was collected during a specific window in early 2026. Discourse around ICE enforcement and Israel-Gaza may be particularly time-sensitive; findings may not generalize to other periods.

5.4 · LLM-based sarcasm audit

Section 5.3 identifies sarcasm detection as a known failure mode of DistilBERT. To quantify how often this failure actually produces a mislabeled post in practice, I conducted a targeted validation using a local large language model: llama3.2, run via Ollama on the same machine as the inference pipeline, requiring no external API calls and producing fully reproducible results.

Procedure

A random sample of 250 posts was drawn from the ICE enforcement topic, the topic most likely to contain political sarcasm, using a fixed random seed (42) for reproducibility. The original label distribution in the sample reflected the broader ICE corpus: 212 NEGATIVE (84.8%) and 38 POSITIVE (15.2%).

Each post was sent to llama3.2 with a request for a JSON response containing five fields: a boolean sarcasm flag, a sarcasm confidence score (0–1), the model's assessed true sentiment (POSITIVE / NEGATIVE / NEUTRAL), an estimated toxicity score (0–1), and a one-sentence reasoning explanation. The model was given no information about the original pipeline's label, ensuring independence.
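Because a local model does not always return well-formed JSON, a validation step along these lines is one way to decide which responses count as valid. This is a hedged sketch: the field names are hypothetical (the text lists the five fields but not their keys), and the call to the model itself is deliberately left out.

```python
import json

# Hypothetical key names for the five fields described above.
REQUIRED = {
    "is_sarcastic": bool,
    "sarcasm_confidence": (int, float),
    "true_sentiment": str,
    "toxicity": (int, float),
    "reasoning": str,
}
SENTIMENTS = {"POSITIVE", "NEGATIVE", "NEUTRAL"}


def parse_llm_response(raw):
    """Return the parsed dict if the response is valid, else None."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict):
        return None
    for key, expected in REQUIRED.items():
        value = data.get(key)
        if not isinstance(value, expected):
            return None
        # bool is a subclass of int, so reject True/False in numeric fields.
        if expected is not bool and isinstance(value, bool):
            return None
    if data["true_sentiment"] not in SENTIMENTS:
        return None
    if not (0 <= data["sarcasm_confidence"] <= 1 and 0 <= data["toxicity"] <= 1):
        return None
    return data
```

Responses that fail this check would simply be excluded from the denominator, which is how a sample of 250 posts can yield fewer than 250 usable responses.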

Results

Table 11 summarizes the key outcomes.

Metric | Count (%)
Posts with valid LLM response | 244 / 250 (97.6%)
Sarcasm detected by llama3.2 | 44 / 244 (18.0%)
Sentiment label disagreements (flips) | 42 / 244 (17.2%)
Clear false flags (sarcastic + flip) | 5 / 244 (2.0%)
Sarcastic but label agrees | 39 / 244 (16.0%)

The 5 clear false flags were posts where llama3.2 detected sarcasm and disagreed with the original sentiment label, representing the most actionable errors. All 5 were originally labeled POSITIVE by DistilBERT and re-labeled NEGATIVE by llama3.2. This is consistent with the sarcasm failure pattern, where a post uses positive language to express something negative and a lexicon-free classifier falsely flags it as positive.

Table 12 shows the three most instructive examples from that set.

Post text | DistilBERT | llama3.2 | LLM reasoning
@WMCActionNews5 Looks like a good place for an ICE raid. | POS | NEG (conf 0.85) | Making a negative comment about an ICE raid
ICE currently incarcerates about 70,000 people on any given night, holding them across 224 detention facilities. The number has nearly doubled over the past year… | POS | NEG (conf 0.85) | Criticizing Trump's deportation agenda with a rhetorical question
This is the Metropolitan Detention Center where ICE takes immigrants to be processed for deportation, and I'm absolutely here for it. If ICE wants to act like Russia, they can get treated like Russia. | POS | NEG (conf 0.00) | Using a threat to express disapproval of ICE's actions

One of the five false flags also illustrates the keyword-filter noise problem described in Section 2.2: the post "every time I go on the floor for a concert it's always so wild knowing that I'm standing on ice" was retained by the ICE keyword filter despite referring to a frozen venue floor rather than the agency. The original pipeline labeled it POSITIVE (correctly), but llama3.2 flagged it as sarcastic NEGATIVE. This is a limitation of the LLM as well: it too can be wrong on out-of-domain content.

Interpretation

A 2% false-flag rate may seem small, but its impact is uneven. In this study, every confirmed false flag appeared in the POSITIVE class, even though POSITIVE posts make up only 19.3% of ICE content. In other words, sarcasm errors mainly inflate positivity estimates. Because POSITIVE posts are relatively rare, even a small number of mistakes can noticeably affect platform-level comparisons. Applied to the full 2,973-post ICE dataset, a 2% rate suggests about 60 likely false positives.

The larger 17.2% disagreement rate between DistilBERT and llama3.2 (42 flips) shows that uncertainty is not only about sarcasm. Just 5 of the 42 flips were tied to detected sarcasm; the other 37 likely come from domain mismatch, ambiguous wording, and the difficulty of classifying political language with a model trained on movie reviews. llama3.2 is not a gold standard and can also make mistakes, but the overall agreement rate (about 83%) is high enough to treat disagreements as meaningful review targets rather than random noise.

6 · Conclusion

Platform identity does not straightforwardly predict toxicity or sentiment intensity. X is more toxic on identity-charged political topics; Bluesky matches or exceeds it on economic ones. Algorithmic promotion consistently surfaces less toxic content than chronological feeds, across both platforms and most topics. Taken together, these findings suggest that future platform comparisons should be topic-stratified rather than treating platform as a global moderator of discourse quality. This means that a platform may be simultaneously more civil than its rival on some topics and equally charged on others.

This study also demonstrates that sentiment intensity and linguistic toxicity are distinct dimensions of political discourse, and that measuring only one produces an incomplete picture. If possible, future work should address the sarcasm detection problem, likely requiring domain-adapted fine-tuning or LLM-in-the-loop re-evaluation rather than off-the-shelf classifiers. It should also extend the dataset across longer time horizons to account for event-driven volatility in political discourse.

References

  1. Booranakittipinyo, A., Li, R. Y. M., & Phakdeephirot, N. (2024). Travelers' perception of smart airport facilities: An X (Twitter) sentiment analysis. Journal of Air Transport Management, 118, 102600.
  2. Das, S., Mondal, S., Majerova, J., Vartiak, L., & Vrana, V. G. (2025). Tweet sentiments: Understanding X (Twitter) users' perceptions of the Russia–Ukrainian crisis on consumer behavior and the economy. International Journal of Consumer Studies, 49(1), 1–23.
  3. Hanu, L., & Unitary team. (2020). Detoxify. GitHub. github.com/unitaryai/detoxify
  4. Kerr, G., & Kelleher, J. D. (2023). Comparing discourse across social media platforms: A review. Social Media + Society, 9(2).
  5. Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1), 50–60.
  6. Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  7. Singgalen, Y. A. (2024). Sentiment and toxicity analysis of digital content using Perspective, VADER, and TextBlob: Tourism and birdwatching. KLIK: Kajian Ilmiah Informatika dan Komputer, 5(1), 142–153.
  8. Socher, R., Perelygin, A., Wu, J., et al. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. Proceedings of EMNLP 2013.

Key Metrics

Posts collected: ~19K
Topics: 6
Platforms: 2
Collection cells: 24
Max tox. gap
Stat. test: M-W U

Tools & Models

Python · DistilBERT · Toxic-BERT · Pandas · SciPy · HuggingFace