This exercise examines word counts from CNN.com and Foxnews.com.

Install and then import the following modules

In [1]:
import pandas as pd
import numpy as np
import re
from bokeh.charts import Bar, Scatter, output_notebook, show, output_file
from bokeh.charts.attributes import CatAttr, color
from bokeh.models import HoverTool, Range1d, Span, LabelSet, ColumnDataSource, Title, NumeralTickFormatter
from bokeh.plotting import figure
import matplotlib.pyplot as plt

Read the file "counts.csv" and remove any rows missing a value for "term"

In [2]:
counts_file = "counts.csv"
df_counts = pd.read_csv(counts_file)
df_counts = df_counts.dropna(subset = ['term'])

Normalize word counts by each site's total word count.

Hint: use ".sum()"

In [3]:
df_counts['CNN'] = (df_counts['CNN'] / df_counts['CNN'].sum()) * 100
df_counts['Fox'] = (df_counts['Fox'] / df_counts['Fox'].sum()) * 100
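As a sanity check on this normalization step, here is a minimal sketch with hypothetical counts (not the real data): dividing a column by its `.sum()` and multiplying by 100 turns raw counts into percentages that total 100 per site.

```python
import pandas as pd

# Hypothetical counts for illustration only
toy = pd.DataFrame({'term': ['school', 'wednesday', 'trump'],
                    'CNN': [10, 10, 20],
                    'Fox': [3, 4, 5]})

# Same normalization as above: each site column becomes percentages
for site in ['CNN', 'Fox']:
    toy[site] = (toy[site] / toy[site].sum()) * 100

print(toy)
# Each normalized column sums to 100 (up to floating point)
print(toy['CNN'].sum(), toy['Fox'].sum())
```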
In [4]:
df_counts[:2]
Out[4]:
        term       CNN      Fox
0     school  0.236407  0.52356
1  wednesday  0.236407  0.69808

Reshape the data to have a single column with the term percentage and a column indicating the website

In [5]:
df_counts = pd.melt(df_counts, id_vars = 'term', var_name = 'site', value_name = 'term_pct')
In [6]:
df_counts[:2]
Out[6]:
        term site  term_pct
0     school  CNN  0.236407
1  wednesday  CNN  0.236407
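To see what `pd.melt` is doing here, a small sketch with hypothetical values: each site column is stacked into rows, with the former column name recorded in `site` and its value in `term_pct`.

```python
import pandas as pd

# Hypothetical wide-format data: one column per site
wide = pd.DataFrame({'term': ['school', 'wednesday'],
                     'CNN': [0.24, 0.24],
                     'Fox': [0.52, 0.70]})

# Melt to long format: 2 terms x 2 sites -> 4 rows
long_df = pd.melt(wide, id_vars='term', var_name='site', value_name='term_pct')
print(long_df)
```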

Find the top 5 most common words in CNN and Fox separately.

In [7]:
TOP_NUMBER = 5
top_CNN = df_counts[df_counts['site'] == 'CNN'].sort_values(by = 'term_pct', ascending = False)[: TOP_NUMBER]
top_Fox = df_counts[df_counts['site'] == 'Fox'].sort_values(by = 'term_pct', ascending = False)[: TOP_NUMBER]

top_CNN_terms = top_CNN['term'].tolist()
top_Fox_terms = top_Fox['term'].tolist()

Create a list "top_terms" by combining the CNN and Fox lists. Remove duplicates.

In [8]:
top_terms = list(set(top_CNN_terms + top_Fox_terms))
In [9]:
top_terms
Out[9]:
['week', 'clinton', 'look', 'hillary', 'donald', 'th', 'new', 'first', 'trump']
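Note that `set()` discards the ranked order of the terms. If you wanted to deduplicate while keeping each term's first-occurrence order, one option (sketched with hypothetical lists) is `OrderedDict.fromkeys`, which works on both Python 2 and 3:

```python
from collections import OrderedDict

# Hypothetical ranked term lists for illustration
cnn_terms = ['trump', 'clinton', 'new']
fox_terms = ['clinton', 'hillary', 'trump']

# OrderedDict keys keep insertion order, so the first occurrence
# of each term survives in ranked order, unlike set()
top_terms_ordered = list(OrderedDict.fromkeys(cnn_terms + fox_terms))
print(top_terms_ordered)  # ['trump', 'clinton', 'new', 'hillary']
```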

Create a dataframe called "plot_data" with only terms in the list top_terms.

In [10]:
plot_data = df_counts.loc[df_counts['term'].isin(top_terms)].copy()
plot_data['term'] = plot_data['term'].str.title()
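Filtering with `.isin()` returns a slice, and assigning into that slice without an explicit `.copy()` triggers pandas' SettingWithCopyWarning. A minimal sketch of the safe pattern, using toy data:

```python
import pandas as pd

# Toy data for illustration only
df = pd.DataFrame({'term': ['school', 'trump', 'new'],
                   'term_pct': [0.2, 0.5, 0.3]})
keep = ['trump', 'new']

# .copy() makes the filtered result an independent DataFrame, so the
# assignment below cannot be a write into a view of df
subset = df.loc[df['term'].isin(keep)].copy()
subset['term'] = subset['term'].str.title()

print(subset)          # terms title-cased in the copy
print(df['term'])      # original frame untouched
```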

Use Bokeh to make a bar chart with the percentage each term has by website.

In [11]:
# Let's plot this with Bokeh, making an HTML file
p = Bar(plot_data, label=CatAttr(columns=['term'], sort=True), values='term_pct',
        group="site", legend="top_right", tools="previewsave", height=600, width=900,
        title="Top Terms for CNN and Fox", xlabel="Term", ylabel="Percentage of Terms")

# Fix bar width issue
for r in p.renderers:
    try:
        r.glyph.width = 0.33
    except AttributeError:
        pass

msg = """Note: Data are from CNN.com and Foxnews.com. Common and one-letter words have been excluded."""
caption = Title(text=msg, align='left', text_font_size='8pt')
p.add_layout(caption, 'below')

output_file("term_pct.html")
show(p)

Make a similar plot with Matplotlib

In [12]:
# We can make a similar plot using Matplotlib (ggplot is buggy), producing a PNG image
%matplotlib inline

plot_data = plot_data.sort_values(by = 'term')
cnn_data = plot_data.loc[plot_data['site'] == 'CNN']
fox_data = plot_data.loc[plot_data['site'] == 'Fox']
cnn = cnn_data['term_pct'].tolist()
fox = fox_data['term_pct'].tolist()
ind = np.arange(len(cnn))
width = 0.35

fig, ax = plt.subplots()
rects1 = ax.bar(ind, cnn, width, color='r')
rects2 = ax.bar(ind + width, fox, width, color='y')

# add some text for labels, title and axes ticks
ax.set_title('Term Frequency by News Source', fontsize = 10)
ax.set_ylabel('Percentage of Terms', fontsize = 8)
ax.set_xticks(ind + width)
ax.set_xticklabels(tuple(cnn_data['term'].tolist()), fontsize = 4, rotation = 45)

ax.legend((rects1[0], rects2[0]), ('CNN', 'Fox'), prop={'size':6})

fig.savefig('term_pct.png', dpi = 250)