Cursing in English on Twitter

Revision as of 05:29, 26 February 2014 by Wenbo (Cursing Lexicon Coding)

Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth

Cursing is not uncommon during conversations in the physical world. On social media, people can chat instantly with friends without face-to-face interaction, usually in a more public fashion, and their messages are broadly disseminated through a highly connected social network. Will these distinctive features of social media lead to a change in people’s cursing behavior? In this paper, we examine the characteristics of cursing activity on a popular social media platform, Twitter, by analyzing about 51 million tweets from about 14 million users. In particular, we explore a set of questions that prior studies have recognized as crucial for understanding cursing in offline communication, including the ubiquity, utility, and contextual dependencies of cursing.

Introduction

Do you curse? Do you curse on social media? How often do you see people cursing on social media (e.g., Twitter)? Cursing, also called swearing, profanity, or bad language, is the use of certain words and phrases that are considered by some to be rude, impolite, offensive, obscene, or insulting <ref> "Profanity - Wikipedia, the free encyclopedia", Wikipedia, March 2013</ref>. In this paper, we use cursing, profanity and swearing interchangeably. As Jay <ref name="The utility and ubiquity of taboo words">Jay, T. The utility and ubiquity of taboo words. Perspectives on Psychological Science 4, 2 (2009), 153–161.</ref> pointed out, cursing is a “rich emotional, psychological and sociocultural phenomenon”, which has attracted many researchers from related fields such as psychology, sociology, and linguistics <ref>Jay, T. Do offensive words harm people? Psychology, public policy, and law 15, 2 (2009), 81.</ref> <ref>Jay, T., and Janschewitz, K. The pragmatics of swearing. Journal of Politeness Research. Language, Behaviour, Culture 4, 2 (2008), 267–288.</ref>.

Over the last decade, social media has become an integral part of our daily lives. According to the 2012 Pew Internet & American Life Project report <ref> "Pew Internet: Social Networking (full detail)", PewResearch Internet Project, February 2013</ref>, 69% of online adults use social media sites and the number is steadily increasing. Another Pew study in 2011 <ref> "How American teens navigate the new world of “digital citizenship”", PewResearch Internet Project, November 2011.</ref> shows that 95% of all teens aged 12-17 are now online and 80% of those online teens are users of social media sites. People post on these sites to share their daily activities, happenings, thoughts and feelings with their contacts, and to keep up with close social ties, which makes social media both a valuable data source and a great target for various areas of research and practice, including the study of cursing. While the CSCW community has made great efforts to study various aspects (e.g., credibility <ref>Morris, M. R., Counts, S., Roseway, A., Hoff, A., and Schwarz, J. Tweeting is believing?: understanding microblog credibility perceptions. In Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work, ACM (2012), 441–450.</ref>, privacy <ref>Almuhimedi, H., Wilson, S., Liu, B., Sadeh, N., and Acquisti, A. Tweets are forever: a large-scale quantitative analysis of deleted tweets. In Proceedings of the 2013 conference on Computer supported cooperative work, ACM (2013), 897–908.</ref>) of social networking and social media, our understanding of cursing on social media remains very limited.

Communication on social media has its own characteristics that differentiate it from offline interaction in the physical world. Take Twitter as an example. The messages posted on Twitter (i.e., tweets) are usually public and can spread rapidly and widely through the highly connected user network, while offline conversations usually remain private among the persons involved. In addition, more of our actual exchange of words in the physical world happens through face-to-face oral communication, while on Twitter we mostly communicate by writing or typing without seeing each other. Will such differences lead to a change in people’s cursing behavior? Will the existing theories on swearing in offline communication in the physical world still be supported if tested on social media?

To address these questions, this paper examines the use of English curse words on the micro-blogging platform Twitter. We collected a random sample of all public tweets and the data of relevant user accounts every day for four weeks. We first identified English cursing tweets in the collection, and extracted numerous attributes that characterize users and users’ tweeting behaviors. We then evaluated the effect of these attributes with respect to cursing behavior on Twitter. This exploratory study aims to improve our understanding of cursing on social media by exploring a set of questions that have been identified as crucial in previous cursing research on offline communication. The answers to these questions may also have valuable implications for studies of language acquisition, emotion, mental health, verbal abuse, harassment, and gender difference <ref name="The utility and ubiquity of taboo words"/>.

Specifically, we examine four research questions:

  • Q1 (Ubiquity): How often do people use curse words on Twitter? What are the most frequently used curse words?
  • Q2 (Utility): Why do people use curse words on Twitter? Previous studies <ref name="The utility and ubiquity of taboo words"/> found that the main purpose of cursing is to express emotions. Do people curse to express emotions on Twitter? What are the emotions that people express using curse words?
  • Q3 (Contextual Variables): Does the use of curse words depend on various contextual variables such as time (when to curse), location (where to curse), or communication type (how to curse)?
  • Q4: Who says curse words to whom on Twitter? Previous research <ref>Jay, T. Why we curse: A neuro-psycho-social theory of speech. John Benjamins Publishing, 2000.</ref> <ref>Kamvar, S. D., and Harris, J. We feel fine and searching the emotional web. In Proceedings of the fourth ACM international conference on Web search and data mining, ACM (2011), 117–126.</ref> suggested that gender and social rank of people play important roles in cursing; do they also affect people using or hearing curse words on Twitter?

Method and Analysis

Data Collection

Twitter provides a small random sample of all public tweets via its sample API in real time <ref>https://dev.twitter.com/docs/api/1.1/get/statuses/sample</ref>. Using this API, we continuously collected tweets for four weeks from March 11th 2013 to April 7th 2013. We kept only the users who specified ‘en’ as their language in profiles. Further, we utilized Google Chrome Browser’s embedded language detection library to remove non-English tweets <ref>https://pypi.python.org/pypi/chromium_compact_language_detector/0.2</ref>. In total, we gathered about 51M tweets from 14M distinct user accounts.
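The two filtering steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the field names `user.lang` and `text` assume the Twitter v1.1 tweet JSON layout, and `detect_language` is a hypothetical callable standing in for the language detection library, replaced here by a crude toy detector.

```python
from typing import Callable, Iterable, Iterator


def english_tweets(tweets: Iterable[dict],
                   detect_language: Callable[[str], str]) -> Iterator[dict]:
    """Yield tweets that pass both the profile-language and the text-language filter."""
    for tweet in tweets:
        # Step 1: keep only users whose profile language is 'en'
        # (field name assumed from the v1.1 tweet JSON layout).
        if tweet.get("user", {}).get("lang") != "en":
            continue
        # Step 2: drop tweets whose text is not detected as English.
        if detect_language(tweet.get("text", "")) != "en":
            continue
        yield tweet


def toy_detector(text: str) -> str:
    """Hypothetical stand-in for a real detector: ASCII-only text counts as English."""
    return "en" if text.isascii() else "und"


sample = [
    {"text": "good morning everyone", "user": {"lang": "en"}},
    {"text": "buenos días", "user": {"lang": "es"}},
    {"text": "こんにちは", "user": {"lang": "en"}},
]
kept = list(english_tweets(sample, toy_detector))
print([t["text"] for t in kept])
```

Passing the detector in as a parameter keeps the sketch independent of any particular language detection package.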

Cursing Lexicon Coding

We asked two college students who are native English speakers to independently annotate potential curse words collected from the Internet. In the end, we kept only the 788 words that both annotators considered to be curse words in most cases. Besides correctly spelled words (e.g., fuck, ass), the lexicon also includes variations of curse words, e.g., a55, @$$, $h1t, b!tch, bi+ch, c0ck, f*ck, l3itch, p*ssy, and dik.

We call a tweet a cursing tweet if it contains at least one curse word. Twitter users may use different variations of the same word, so we first simply compare the words in a tweet against all the curse words in the lexicon. If there is no match, we remove repeating letters in the words of the tweet (e.g., fuckk → fuck) and repeat the matching process. We also convert digits or symbols in a word to their original letters: e.g., 0 → o, 9 → g, ! → i. Moreover, based on our observations, the following symbols, ' ', '%', '-', '.', '#', '\', '’', are frequently used to mask curse words: f ck, f%ck, f.ck, f#ck, f’ck → fuck. We apply an edit distance approach similar to <ref>Sood, S., Antin, J., and Churchill, E. Profanity use in online communities. In Proceedings of the 2012 ACM annual conference on Human Factors in Computing Systems, ACM (2012), 1481–1490.</ref> to spot curse words with mask symbols. Namely, if the edit distance between a candidate word (f ck) and a curse word (fuck) equals the number of mask symbols in the candidate word (1 in this case), then it is a match. Table 1 provides an overview of the per-user counts of overall tweets and cursing tweets in our data collection.
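The matching steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the lexicon is a tiny sample, repeated letters are collapsed to runs of two and then one (the text does not say which), the substitution table contains only the three mappings named above, and the space mask is omitted because the sketch splits tweets on whitespace.

```python
import re

# Tiny illustrative lexicon; the lexicon described in the text has 788 entries.
LEXICON = {"fuck", "shit", "ass", "bitch", "a55", "b!tch"}

# Only the substitutions named in the text; a fuller table would be an assumption.
LEET = str.maketrans({"0": "o", "9": "g", "!": "i"})

# Mask symbols listed in the text, minus the space (see lead-in).
MASKS = set("%-.#\\’")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def squeeze(word: str, keep: int) -> str:
    """Collapse each run of a repeated character to `keep` copies."""
    return re.sub(r"(.)\1+", lambda m: m.group(1) * keep, word)


def is_curse_word(word: str, lexicon=LEXICON) -> bool:
    word = word.lower()
    # Exact match, then match after removing repeated letters.
    candidates = [word, squeeze(word, 2), squeeze(word, 1)]
    # Same candidates with digits/symbols converted back to letters.
    candidates += [c.translate(LEET) for c in candidates]
    if any(c in lexicon for c in candidates):
        return True
    # Mask rule: edit distance must equal the number of mask symbols.
    n_masks = sum(ch in MASKS for ch in word)
    if n_masks == 0:
        return False
    return any(edit_distance(word, entry) == n_masks for entry in lexicon)


def is_cursing_tweet(tweet: str, lexicon=LEXICON) -> bool:
    """A tweet is a cursing tweet if at least one token matches."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in tweet.split())
    return any(tok and is_curse_word(tok, lexicon) for tok in tokens)
```

For example, `is_cursing_tweet("what the fuckk")` and `is_curse_word("f%ck")` both return `True`, while `is_cursing_tweet("have a nice day")` returns `False`. Handling the space mask (f ck) would require generating candidates across adjacent tokens, which this sketch leaves out.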

To evaluate the accuracy of this lexicon-based method for spotting cursing tweets, we drew a random sample of 1000 tweets and asked two annotators to independently label them manually as cursing or non-cursing. In the end, 118 tweets were labeled as cursing by both annotators, and the remaining 882 tweets were labeled as non-cursing. We then tested the lexicon-based spotting approach on this labeled dataset, and it achieved a precision of 98.84%, a recall of 72.03%, and an F1 score of 83.33%. As expected, this lexicon-based approach to profanity detection provides high precision but lower recall, mainly because of the variations in curse words (e.g., misspellings and abbreviations) and the context sensitivity of cursing. Though we believe that, for this work, high precision is preferred and a recall of 72.03% is reasonable, more sophisticated classification methods that can further improve recall remain an interesting topic for future work.
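The reported scores can be checked with a short calculation. The confusion counts below are not stated in the text; they are inferred as the integer counts consistent with the reported percentages and the 118 cursing tweets in the labeled sample (85 found, 33 missed, 1 false alarm), so treat them as an assumption.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Inferred counts (assumption): 85 + 33 = 118 labeled cursing tweets.
p, r, f1 = precision_recall_f1(tp=85, fp=1, fn=33)
print(f"precision={p:.2%} recall={r:.2%} F1={f1:.2%}")
# prints: precision=98.84% recall=72.03% F1=83.33%
```

Under these counts, the method flags 86 tweets in total, and the missed 33 cursing tweets account for the lower recall.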

Cursing Frequency and Choice of Curse Words

Cursing vs. Emotion

Cursing vs. Time

Cursing vs. Message Type

Cursing vs. Location

Cursing vs. Gender

Limitations

Conclusion

Acknowledgments

References

<references/>