Predicting Introversion/Extroversion Based on Your Writing?

Apr 10, 2021

Introversion, extroversion, and ambiversion sit on a non-linear, multi-faceted spectrum, which many people overlook: the same activity can both energize and drain a person, and one’s behavior may change drastically depending on the situation.
That being said, the concept of “energy” itself is fascinating: what exactly is one’s energy, and how does it affect others when you interact?

I am aiming to better understand possible differences in online writing style between people who self-identify with introverted and extroverted tendencies when they write conversationally about themselves. For the sake of simplicity, I will refer to the people in the dataset as “introverts” and “extroverts”, though it’s important to emphasize that I don’t believe in binary labelling, especially for a situation with this many nuances.

By creating a model using NLP techniques that can accurately predict the level of extroversion/introversion based on text data, the following goals can be achieved:

Create a foundation for a more comprehensive friend-finding algorithm

Capitalize on marketing demographics, potentially A/B testing different strategies for various products and services across industries

The dataset:

We used a dataset from Personality Cafe.net, which contains almost 9,000 entries, each with the following data points:

personality_type = self-reported MBTI personality type, e.g. “ENTJ”

posts = raw text of all of the posts a given user has made on the website

We used NLP techniques to filter meaningless words and punctuation out of the text.
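A rough sketch of the kind of word and punctuation filtering described here, using only the standard library. The stop-word list is a tiny illustrative stand-in and `clean_post` is a hypothetical helper name; neither is taken from the project itself.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real project would
# use a fuller list from an NLP library.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "is", "it", "in", "i"}

def clean_post(text):
    """Lowercase the text, drop URLs and punctuation, and remove stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # strip links
    text = re.sub(r"[^a-z\s]", " ", text)       # keep letters and whitespace only
    return [tok for tok in text.split() if tok not in STOP_WORDS]

print(clean_post("I think the party is AMAZING!!! See http://example.com"))
```

Note that stripping every non-letter also splits contractions such as “don’t”, so a real pipeline would likely want a proper tokenizer.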

We also explored some OkCupid profile data.

First, we removed all of the personality information that we don’t currently need. We kept only whether a given personality type (e.g. ENTJ) starts with E or I.
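As a minimal sketch of that filtering step, the E/I letter is simply the first character of the four-letter type. The helper name `energy_letter` is hypothetical; with a pandas DataFrame it could be applied to the `personality_type` column to build the `energy` column used further down.

```python
def energy_letter(mbti_type):
    """Return 'E' or 'I', the first letter of a four-letter MBTI type."""
    letter = mbti_type.strip().upper()[0]
    if letter not in ("E", "I"):
        raise ValueError(f"unexpected personality type: {mbti_type!r}")
    return letter

types = ["ENTJ", "infp", "ISTJ"]   # toy examples
print([energy_letter(t) for t in types])
```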

We set the default class, 0, to introversion tendencies because the classes are imbalanced: there are significantly more introverts than extroverts.

    # Define a function to label extrovert/introvert tendencies
    def label(energy_in):
        # Class 0 (introvert) is the default; class 1 is extrovert
        energy_out = 0
        if energy_in == 'I':
            energy_out = 0
        elif energy_in == 'E':
            energy_out = 1
        return energy_out

    df['energyquant'] = df['energy'].apply(label)
    df.head()

We can see in the confusion matrix that at first, without correcting for our class imbalance, our Naive Bayes model predicts no positives [or extroverts].

    import numpy as np
    import seaborn as sns
    from sklearn.metrics import confusion_matrix

    # Rows are true classes, columns are predicted classes
    nb_matrix = confusion_matrix(y_test, nb_test_preds)
    group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
    group_counts = ["{0:0.0f}".format(value) for value in nb_matrix.flatten()]
    group_percentages = ["{0:.2%}".format(value)
                         for value in nb_matrix.flatten() / np.sum(nb_matrix)]
    labels = [f"{v1}\n{v2}\n{v3}"
              for v1, v2, v3 in zip(group_names, group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    sns.heatmap(nb_matrix, annot=labels, fmt='', cmap='Blues')
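To make the four cells concrete, here is a small standard-library sketch that counts them by hand, in the same order as `group_names` above (true negatives, false positives, false negatives, true positives). The labels are made up for illustration, and `confusion_counts` is a hypothetical helper, not part of the project code.

```python
def confusion_counts(y_true, y_pred):
    """Count (TN, FP, FN, TP) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 1 and pred == 1:
            tp += 1
    return tn, fp, fn, tp

# Toy labels (made up): 0 = introvert, 1 = extrovert.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0]   # a model that never predicts "extrovert"
print(confusion_counts(y_true, y_pred))
```

A model that never predicts the positive class leaves both the false-positive and true-positive cells at zero, which is the pattern described above.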

We then ran SMOTE (Synthetic Minority Over-sampling Technique) to balance out the classes.
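For intuition, SMOTE creates synthetic minority-class examples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows only that core idea in plain Python, with made-up feature vectors; `smote_sketch` is a hypothetical name and this is not the code we ran. In practice, the imbalanced-learn library provides a `SMOTE` class with a `fit_resample` method.

```python
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length numeric feature vectors (at least 2 of them).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` by squared Euclidean distance
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

extroverts = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]  # toy vectors
new_points = smote_sketch(extroverts, n_new=3)
print(len(new_points))
```

Every synthetic point lies on a line segment between two real minority points, so the new examples stay inside the region the minority class already occupies.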

We can now see that the new confusion matrix reflects a new Naive Bayes model that is actually predicting some data points as extroverts, under the false positives and true positives. This is already a great improvement.

Deciding what metrics to optimize for:

Let’s optimize for F1 score.

Our use case is trying to figure out how to match people as part of a friendship algorithm, so neither kind of mistake, mislabelling an introvert or mislabelling an extrovert, is more costly than the other.

The F1 score is the harmonic mean of precision and recall, two standard ways of measuring mislabelling errors, so it is our best option here.
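As a quick illustration of how the three numbers relate, here is a small sketch that computes precision, recall and F1 from raw counts. The counts are made up for the example and are not results from our model.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts for illustration only.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(round(precision, 2), round(recall, 2), round(f1, 3))
```

Because it is a harmonic mean, F1 is pulled toward the lower of the two values, so a model cannot score well by maximizing precision alone or recall alone.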

To quote the book “Quiet: The Power of Introverts in a World That Can’t Stop Talking” by Susan Cain:

“Introversion is different than being shy (fear of social judgment), it is more about how one responds to stimulation, including social stimulation.
Introverts feel more content with less outside stimulation (for example, talking with a close friend or reading a book) compared to extroverts who enjoy more outside stimulation (for example, going to parties and listening to loud music).”

We chose not to make any dangerous assumptions about correlated behavior (e.g. some may believe extroverts are more adaptable), as many of these claims have not been convincingly demonstrated.

We also chose to disregard accuracy for the most part, since it does not account for class imbalance and would be misleading here.

According to DeepAI.org:

“The F1-score is useful where there is a large class imbalance, such as if 10% of apples on trees tend to be unripe.
In this case the accuracy would be misleading, since a classifier that classifies all apples as ripe would automatically get 90% accuracy but would be useless for real-life applications.”
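The apple example can be checked in a few lines. This sketch assumes 100 apples with 10 unripe, matching the 10% in the quote, and treats “unripe” as the positive class.

```python
# 100 apples, 10 of them unripe (label 1); the classifier calls everything ripe (0).
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# F1 for the "unripe" class, computed from raw counts.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
f1_unripe = 2 * tp / (2 * tp + fp + fn) if tp else 0.0

print(accuracy, f1_unripe)
```

Accuracy comes out at 0.9 while the F1 score for the unripe class is 0, which is why we lean on F1 for our imbalanced introvert/extrovert classes.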

We then made a number of adjustments to our Naive Bayes model, including changing the test set size in our train-test split to 0.33 from the default of 0.25. We also experimented with a Random Forest model, tuning hyperparameters such as its number of samples and max depth.

We can see our final results from our Naive Bayes model broken down here, keeping in mind 0 is the introvert class:

Some interesting findings in language differences:

We separated the data into extroverts and introverts. When comparing the top nouns, top adverbs, top adjectives and top words overall for each group, we found that the most frequently used words were quite similar between the groups, with slight variations in order.

However, when the introverts and extroverts are combined, the list of top words is significantly different from either of the two groups’ separate lists, for all top words, top nouns, etc.

Since the top words are similar when comparing introverts vs extroverts, we do not recommend attempting to create separate marketing campaigns at this point. However, it is worth further investigating the most relevant words, instead of simply the most frequent words in each group.

We looked at combinations of words, which provide more information. We searched for the top bigrams in each group, which are the top pairs of words that most frequently occur together.

There is some similarity between the bigrams of introverts and extroverts; however, there are several phrases unique to each group that do not appear in the top 50 bigrams of the other group. For example, “too” followed by “!” shows up relatively frequently among the extroverts, and does not appear in the top 50 results of the introverts at all.
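A minimal sketch of this comparison, using `collections.Counter` on toy token lists. The sentences are invented, whitespace splitting stands in for real tokenization, and `top_bigrams` is a hypothetical helper; it only shows the mechanics of finding bigrams that appear in one group’s top list but not the other’s.

```python
from collections import Counter

def top_bigrams(tokens, n=50):
    """Return the n most frequent adjacent token pairs."""
    return [pair for pair, _ in Counter(zip(tokens, tokens[1:])).most_common(n)]

# Toy token streams (made up), one per group; punctuation kept as its own token.
introvert_tokens = "i think so too . i think i read that book".split()
extrovert_tokens = "me too ! i think so . me too ! so fun".split()

top_i = top_bigrams(introvert_tokens, n=5)
top_e = top_bigrams(extrovert_tokens, n=5)

# Bigrams in the extroverts' top list that are absent from the introverts' list.
only_e = [b for b in top_e if b not in set(top_i)]
print(only_e)
```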

We also looked at each group’s 50 bigrams with the highest mutual information scores, which measure how dependent one word is on another. For example, the words “San” and “Francisco” are highly likely to occur together, so they would have a relatively high mutual information score.
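One common way to score this for bigrams is pointwise mutual information (PMI), which compares how often two words appear together with how often they would if they were independent. The sketch below assumes that definition; the exact formula and library behind our scores may differ, and the sentence and the `bigram_pmi` helper are made up for illustration.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score each adjacent pair by PMI = log2(p(x, y) / (p(x) * p(y)))."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in pair_counts.items():
        if count < min_count:      # rare pairs give unstable, inflated scores
            continue
        p_xy = count / n_pairs
        p_x = word_counts[x] / n_words
        p_y = word_counts[y] / n_words
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

tokens = ("i moved to san francisco and i like san francisco "
          "and i like to walk and i like to read").split()
scores = bigram_pmi(tokens)
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

In this toy text, “san francisco” outranks the more frequent “i like” because “san” and “francisco” never appear apart, whereas “i” and “like” each show up in other contexts.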

We found that many of the introverts’ top mutual information bigrams contain literary references, more so than those of the extroverts. We suspect the nature of the website from which we obtained the data, Personality Cafe, has some effect on this: perhaps introverts are more likely to use a personality website to discuss their lives, and/or to discuss literary references in their everyday online posts. Regardless, this certainly warrants further investigation using a more diverse set of datasets.

Onto future work! I will be incorporating this data as part of a larger project on building/improving a friendship-prediction algorithm. What do you think makes you want to be closer friends with someone? Drop your perspective in the comments.

If you enjoyed this article, click here to get free access to my guides to improve your friendships with others & yourself + meet new people that actually get you.


Written by Alaska Lam