Movie Popularity vs Actual Net Profit

Alaska Lam
3 min read · Apr 10, 2021

IMDb has an obnoxious amount of data, so much that it's hard to know what to do with it all. So many films come out each year that make millions of potential viewers roll their eyes at the sheer ridiculousness of the plot, the combination of actors, or even the title. Feel free to give them a shout in the comments. My personal favorite is Frozen (not the Disney film), a campy B-horror flick that shows what would happen if a group got stranded for days on a ski chair lift. Wolves somehow appear at night and attack at an otherwise perfectly habitable ski resort…

For the first group project we ever did in the Flatiron Data Science bootcamp, my colleagues Jeremy Lee & Zach Paul and I explored what exactly predicts “success” for a given movie.

We used data directly from IMDb, covering 2019 and earlier, and tried several different approaches to this question.

**Which movie genres are most commonly produced, and does quantity equate to higher net profits?**

The first part of this question is relatively simple — let’s take a look at some of the code we wrote.

We took our main dataframe and split each movie's listed genres so that every genre gets its own row.

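```python
import pandas as pd

# Create a genre table that puts each value in the Genre column on its own row.
# imdb_budgets_df is the main dataframe loaded earlier in the project.
imdb_budgets_df['Genre'] = imdb_budgets_df['Genre'].str.split(', ')

# Expand the list of genres into numbered columns (0, 1, 2) and attach them to the main table.
imdb_budgets_df1 = imdb_budgets_df['Genre'].apply(pd.Series)
imdb_budgets_df2 = pd.merge(imdb_budgets_df, imdb_budgets_df1, right_index=True, left_index=True)
imdb_budgets_df3 = imdb_budgets_df2.drop(['Genre'], axis=1)

# Melt the numbered genre columns into rows: one row per movie per genre.
# var_name is a plain string; recent pandas versions reject a list here.
genre_budgets_df = imdb_budgets_df3.melt(id_vars=['Movie', 'Year'], value_vars=[0, 1, 2], var_name='X')

# Bring back the remaining columns, then tidy up.
genre_budgets_df = pd.merge(genre_budgets_df, imdb_budgets_df)
genre_budgets_df = genre_budgets_df.drop(['Genre', 'X'], axis=1)
genre_budgets_df = genre_budgets_df.drop_duplicates()
genre_budgets_df = genre_budgets_df.rename(columns={'value': 'Genre'})
genre_budgets_df = genre_budgets_df.dropna()
```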
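For what it's worth, newer versions of pandas can do this kind of reshaping in fewer steps. The sketch below is an alternative, not the code we ran for the project: it uses `DataFrame.explode` (available since pandas 0.25) on a tiny made-up dataframe, so the `toy_df` name and its rows are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for imdb_budgets_df; the rows are made up for illustration.
toy_df = pd.DataFrame({
    'Movie': ['Movie A', 'Movie B', 'Movie C'],
    'Year': [2010, 2015, 2019],
    'Genre': ['Drama, Comedy', 'Action', 'Animation, Adventure, Comedy'],
})

# Split the comma-separated string into a list, then give each list element its own row.
toy_genre_df = (
    toy_df.assign(Genre=toy_df['Genre'].str.split(', '))
          .explode('Genre')
          .dropna(subset=['Genre'])
          .reset_index(drop=True)
)

print(toy_genre_df)
```

Each movie now appears once per genre, with its other columns repeated on every row, which is the same shape the melt-based version produces.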

Then we group by genre, count the movies in each one, and sort the counts in descending order.

```python
# Count all movies grouped by genre, most common genre first.
m_by_genre = (
    genre_budgets_df.groupby('Genre', as_index=False)['Movie']
    .count()
    .sort_values(by='Movie', ascending=False)
)
```

We can see that drama, comedy and action are the three most common genres among movies released in 2019 or earlier.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the above findings.
plt.figure(figsize=(14, 7))
ax3 = sns.barplot(x=m_by_genre['Movie'], y=m_by_genre['Genre'], palette='GnBu_d')
plt.xlabel('Movie Count', fontsize=12)
plt.ylabel('Genre', fontsize=12)
plt.title('Movie Count By Genre', fontsize=14)
plt.savefig('CountGenre')
```

But what about the genres that were actually most profitable?

```python
# Once again group the movies by genre, showing the median net profit and profit margin for each.
p_by_genre = (
    genre_budgets_df.groupby('Genre', as_index=False)[['Adjusted_Profit', 'Profit_Margin']]
    .median()
    .sort_values(by='Adjusted_Profit', ascending=False)
)

# Plot the above findings.
plt.figure(figsize=(14, 7))
ax4 = sns.barplot(x=p_by_genre['Adjusted_Profit'], y=p_by_genre['Genre'], palette='GnBu_d')
plt.xlabel('Net Profit (Hundreds of Millions)', fontsize=12)
plt.ylabel('Genre', fontsize=12)
plt.title('Net Profit By Genre', fontsize=14)
plt.savefig('NetProfitGenre')
```

We can see that animation, sci-fi and adventure have the highest median net profit by genre, in hundreds of millions.
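A note on the word "average": the groupby above takes the median rather than the mean, which matters because a handful of blockbusters can pull the mean far above what a typical film earns. Here is a minimal sketch with made-up numbers (the `toy_profits` values are hypothetical, not from our dataset):

```python
import pandas as pd

# Hypothetical net profits in dollars for five films in one genre; one is a runaway hit.
toy_profits = pd.Series([5e6, 8e6, 10e6, 12e6, 900e6])

mean_profit = toy_profits.mean()      # pulled upward by the single blockbuster
median_profit = toy_profits.median()  # the middle film, unaffected by the outlier

print(f"mean:   {mean_profit:,.0f}")
print(f"median: {median_profit:,.0f}")
```

In this toy case the mean lands at 187 million while the median stays at 10 million, so the median is the better picture of what a typical film in the genre made.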

Our results suggest there is plenty of room to break into the sci-fi and animation genres: the market is not as saturated as it is for drama, comedy and action, and the profit margins of sci-fi and animation are among the highest.
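One way to make that comparison concrete is to put the two summaries side by side. This is a sketch only: the small tables below stand in for `m_by_genre` and `p_by_genre`, and every number in them is invented for illustration rather than taken from our results.

```python
import pandas as pd

# Hypothetical stand-ins for m_by_genre and p_by_genre; all values are made up.
toy_counts = pd.DataFrame({
    'Genre': ['Drama', 'Comedy', 'Animation', 'Sci-Fi'],
    'Movie': [1200, 900, 150, 200],
})
toy_profits = pd.DataFrame({
    'Genre': ['Drama', 'Comedy', 'Animation', 'Sci-Fi'],
    'Adjusted_Profit': [20e6, 35e6, 180e6, 160e6],
})

# Join the two summaries on genre so volume and profit sit in the same row.
toy_summary = toy_counts.merge(toy_profits, on='Genre')

# Rank both measures (1 = most movies, 1 = highest profit).
# A genre near the bottom for count but near the top for profit hints at an underserved market.
toy_summary['Count_Rank'] = toy_summary['Movie'].rank(ascending=False).astype(int)
toy_summary['Profit_Rank'] = toy_summary['Adjusted_Profit'].rank(ascending=False).astype(int)

print(toy_summary.sort_values('Profit_Rank'))
```

With the real `m_by_genre` and `p_by_genre` tables, the same merge would show at a glance which genres are crowded and which are both sparse and profitable.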

Seriously, though. Drop your comments below about what your favorite bad movie is, and why you think it bombed so badly. Was it a combination of actors? Combination of genres that just didn’t go together? Or a title that scared people off from the start with sheer visceral disgust?

