NYC measles cases by neighborhood (2018 - 2019)

Import libraries

In [1]:
from bokeh.io import export_png, export_svgs
from bokeh.models import ColumnDataSource, LabelSet, Range1d, Title
from bokeh.plotting import figure, output_notebook, show
from bokeh.tile_providers import get_provider, Vendors

from datetime import datetime

import math
import os
import pandas as pd
import re

Notes about additional bokeh.io dependencies:

  • In order to use the export() functions, users have to install some additional dependencies. These dependencies can be installed via conda:

conda install selenium phantomjs pillow

Read all data

This section reads all the necessary data, namelly the measles by neighborhood data and the neighborhood coordinates data.

The measles by neighborhood data has been manually collected from the NYC Health Measles webpage and saved in a CSV file.

The neighborhood coordinates data was collected with simple google queries (e.g., "Williamsburg, NY coordinates") and saved in a CSV file.

In [2]:
# Define data files paths
measles_data_file = os.path.join('..', 'data', 'nyc-health', 'final', 'nyc-measles-confirmed-cases-by-neighborhood.csv')
coordinates_data_file = os.path.join('..', 'data', 'nyc-neighborhoods-coordinates.csv')

# Read measles data
if os.path.exists(measles_data_file):    
    measles_df = pd.read_csv(measles_data_file)
    print('Done: read the measles data')
else:
    print('ERROR: measles data file not found:', measles_data_file)

# Read coordinates data
if os.path.exists(coordinates_data_file):    
    coordinates_df = pd.read_csv(coordinates_data_file)
    print('Done: read the coordinates data')
else:
    print('ERROR: coordinates data file not found:', measles_data_file)
Done: read the measles data
Done: read the coordinates data

Check measles data

The measles data is shown below. These are the details about the two data columns:

  • neighborhood: neighborhood name
  • confirmed cases (start_date to end_date): number of confirmed measles cases in that neighborhood during the outbreak (from start_date to update_date)

Note that some of the data entries/rows are for a set of neighborhoods (e.g., "Brighton Beach/Coney Island"), because this is how NYC Health reports the data.

Also, note that the last row is a TOTAL count.

In [3]:
# Show all the measles data
measles_df
Out[3]:
neighborhood confirmed cases (2018-09-01 to 2019-08-19)
0 Bensonhurst 3
1 Borough Park 121
2 Brighton Beach/Coney Island 4
3 Chelsea/Clinton 1
4 Crown Heights 8
5 Far Rockaway 1
6 Flatbush 1
7 Flushing 3
8 Jamaica 2
9 Midwood/Marine Park 5
10 Port Richmond 3
11 Red Hook 1
12 Sunset Park 16
13 West Queens 1
14 Williamsburg 473
15 Willowbrook 6
16 TOTAL 649

Check coordinates data

The neighborhood coordinates data is shown below, and it contains three simple columns: neighborhood name, latitute and longitude.

Note that some of the data entries/rows are for a set of neighborhoods (e.g., "Brighton Beach/Coney Island"). This is because NYC Health reports the measles data using those neighborhood sets. The latitude/longitude of a set of neighborhoods is the average of the individual neighborhoods latitudes/longitudes.

In [4]:
# Show the first rows of the coordinates data
coordinates_df.head()
Out[4]:
neighborhood latitude longitude
0 Bensonhurst 40.6139 -73.9922
1 Borough Park 40.6350 -73.9921
2 Brighton Beach 40.5781 -73.9597
3 Coney Island 40.5755 -73.9707
4 Brighton Beach/Coney Island 40.5768 -73.9652

Extract reference dates

Extract reference dates from the original column names, and transform them into nicelly formatted strings.

Extract reference dates from the original column names, and transform them into nicelly formatted strings.

In [5]:
measles_original_cols = measles_df.columns

measles_original_cols
Out[5]:
Index(['neighborhood', 'confirmed cases (2018-09-01 to 2019-08-19)'], dtype='object')
In [6]:
[start_month, end_month] = list(map(
    lambda x: datetime.strptime(x, '%Y-%m-%d').strftime('%b %Y'), 
    re.findall(r'\d{4}-\d{2}-\d{2}', measles_original_cols[1])
))

[start_month, end_month]
Out[6]:
['Sep 2018', 'Aug 2019']

Rename measles data columns

Three of the columns names contain useful date information which we will want to shown in the visualizations, to provide context. However, that makes those column names very long/verbose. So, now that we already saved the original column names, let's rename them with shorter names, to make it easier to reference them.

In [7]:
measles_new_cols = ['neighborhood', 'cases']

measles_df.columns = measles_new_cols

measles_df.head()
Out[7]:
neighborhood cases
0 Bensonhurst 3
1 Borough Park 121
2 Brighton Beach/Coney Island 4
3 Chelsea/Clinton 1
4 Crown Heights 8

Extract total cases

In [8]:
# Extract the reported total cases
total_cases = measles_df[measles_df['neighborhood'] == 'TOTAL']['cases'].values[0]

# Show the total cases
total_cases
Out[8]:
649
In [9]:
# Confirm the total cases is correct
measles_df[measles_df['neighborhood'] != 'TOTAL']['cases'].sum()
Out[9]:
649

Merge measles and coordinates data

In [10]:
# Merge the measles and coordinates data into a single dataframe
df = pd.merge(measles_df, coordinates_df, on='neighborhood')

# Show the data
df
Out[10]:
neighborhood cases latitude longitude
0 Bensonhurst 3 40.6139 -73.9922
1 Borough Park 121 40.6350 -73.9921
2 Brighton Beach/Coney Island 4 40.5768 -73.9652
3 Chelsea/Clinton 1 40.7552 -73.9966
4 Crown Heights 8 40.6694 -73.9422
5 Far Rockaway 1 40.5999 -73.7448
6 Flatbush 1 40.6415 -73.9594
7 Flushing 3 40.7675 -73.8331
8 Jamaica 2 40.7027 -73.7890
9 Midwood/Marine Park 5 40.6159 -73.9466
10 Port Richmond 3 40.6355 -74.1255
11 Red Hook 1 40.6734 -74.0083
12 Sunset Park 16 40.6527 -74.0093
13 West Queens 1 40.7443 -73.8882
14 Williamsburg 473 40.7081 -73.9571
15 Willowbrook 6 40.6032 -74.1385

Add Mercator coordinates

The cell below defines functions to transfrom latitude and longitude to the corresponding Mercator X and Y coordinates.

In [11]:
# Function to calculate the Mercator X coordinate for a given longitude
def mercator_x(lon):
    r_major = 6378137.0
    x = r_major * math.radians(lon)
    return x

# Function to calculate the Mercator Y coordinate for a given latitude and longitude
def mercator_y(lat, lon):
    x = mercator_x(lon)
    scale = x/lon
    y = 180.0/math.pi * math.log(math.tan(math.pi/4.0 + lat * (math.pi/180.0)/2.0)) * scale
    return y

# Function to calculate the Mercator X and Y coordinates for a given latitude and longitude
def mercator_x_y(lat, lon):
    x = mercator_x(lon)
    y = mercator_y(lat, lon)
    return [x, y]

#def mercator_x_y(lat, lon):
#    r_major = 6378137.0
#    x = r_major * math.radians(lon)
#    scale = x/lon
#    y = 180.0/math.pi * math.log(math.tan(math.pi/4.0 + lat * (math.pi/180.0)/2.0)) * scale
#    return [x, y]

Test the mercator_x_y function for the NYC coordinates:

In [12]:
# NYC Mercator X and Y coordinates range:
# nyc_mercator_x_range = (-8242000, -8210000)
# nyc_mercator_y_range = (4965000, 4990000)

# NYC latitude and longitude
nyc_lat = 40.7128
nyc_lon = -74.0060

mercator_x_y(nyc_lat, nyc_lon)
Out[12]:
[-8238310.235647004, 4970071.579142425]

Compute the Mercator coordinates for all neighborhoods:

In [13]:
df['mercator_x'] = df['longitude'].apply(mercator_x)
df['mercator_y'] = df.apply(lambda r: mercator_y(r['latitude'], r['longitude']), axis=1)
In [14]:
# Show the beginning of the dataframe to see the new columns (mercator_x and mercator_y)
df.head()
Out[14]:
neighborhood cases latitude longitude mercator_x mercator_y
0 Bensonhurst 3 40.6139 -73.9922 -8.236774e+06 4.955558e+06
1 Borough Park 121 40.6350 -73.9921 -8.236763e+06 4.958652e+06
2 Brighton Beach/Coney Island 4 40.5768 -73.9652 -8.233768e+06 4.950119e+06
3 Chelsea/Clinton 1 40.7552 -73.9966 -8.237264e+06 4.976300e+06
4 Crown Heights 8 40.6694 -73.9422 -8.231208e+06 4.963700e+06

Create map

In [15]:
# Function to define the bubble size
def cases_radius(cases):
    if cases < 10:
        return 25
    elif cases < 100:
        return 35
    elif cases < 200:
        return 45
    else:
        return 55

# Calculate the bubble size for all neighborhoods:
df['cases_radius'] = df['cases'].apply(cases_radius)

# Function to define the x-axis offset for the label showing the number of cases
# Used to put the case number labels in the center of the bubbles
def cases_label_x_offset(cases):
    if cases < 10:
        return -6
    elif cases < 100:
        return -12
    elif cases < 200:
        return -14
    else:
        return -16

# Function to define the y-axis offset for the label showing the number of cases
# Used to put the case number labels in the center of the bubbles
def cases_label_y_offset(cases):
    if cases < 10:
        return -8
    elif cases < 100:
        return -10
    elif cases < 200:
        return -10
    else:
        return -10

# Calculate the 'cases_label_x_offset' and 'cases_label_y_offset' for all neighborhoods:
df['cases_label_x_offset'] = df['cases'].apply(cases_label_x_offset)
df['cases_label_y_offset'] = df['cases'].apply(cases_label_y_offset)

# Create new dataframe columns for the x and y offsets for the labels with neighborhood names
df['neighborhood_label_x_offset'] = [0] * len(df)
df['neighborhood_label_y_offset'] = [0] * len(df)

# Manually adjust the 'neighborhood_label_x_offset' and 'neighborhood_label_y_offset' for all neighborhoods
# Place the neighborhood labels in a "nice" position, so that they don't overlap with each other or with the bubbles
cols = ['neighborhood_label_x_offset', 'neighborhood_label_y_offset']
df.loc[df['neighborhood'] == 'Bensonhurst',                 cols] = [-80, -30]
df.loc[df['neighborhood'] == 'Borough Park',                cols] = [-125, -10]
df.loc[df['neighborhood'] == 'Brighton Beach/Coney Island', cols] = [-115, -35]
df.loc[df['neighborhood'] == 'Chelsea/Clinton',             cols] = [-15, 15]
df.loc[df['neighborhood'] == 'Crown Heights',               cols] = [15, -10]
df.loc[df['neighborhood'] == 'Far Rockaway',                cols] = [-80, -30]
df.loc[df['neighborhood'] == 'Flatbush',                    cols] = [15, -10]
df.loc[df['neighborhood'] == 'Flushing',                    cols] = [15, -10]
df.loc[df['neighborhood'] == 'Jamaica',                     cols] = [15, -10]
df.loc[df['neighborhood'] == 'Midwood/Marine Park',         cols] = [-10, -30]
df.loc[df['neighborhood'] == 'Port Richmond',               cols] = [-45, -30]
df.loc[df['neighborhood'] == 'Red Hook',                    cols] = [-85, -10]
df.loc[df['neighborhood'] == 'Sunset Park',                 cols] = [-110, -10]
df.loc[df['neighborhood'] == 'West Queens',                 cols] = [-40, -30]
df.loc[df['neighborhood'] == 'Williamsburg',                cols] = [30, -10]
df.loc[df['neighborhood'] == 'Willowbrook',                 cols] = [-25, -30]

output_notebook()

#tile_provider = get_provider(Vendors.CARTODBPOSITRON)
tile_provider = get_provider(Vendors.CARTODBPOSITRON_RETINA)

# Create figure
p = figure(
    x_axis_type = 'mercator',
    y_axis_type = 'mercator',
    height = 650,
    width = 650
)

# Define figure title
title = Title(
    text = 'NYC measles cases by neighborhood',
    align = 'center',
    text_font_size = '18pt',
    text_font_style = 'normal'
)

# Define contextual notes to add to the figure
note_1 = Title(
    text = str(total_cases) + ' total confirmed cases from ' + start_month + ' to ' + end_month,
    align = 'center',
    text_font_size = '12pt',
    text_font_style = 'normal'
)
note_2 = Title(
    text = 'Community transmission was declared over on Sep 3, 2019',
    align = 'center',
    text_font_size = '12pt',
    text_font_style = 'normal'
)

# Define credits to add to the figure
credits = Title(
    text = 'Data: NYC Health, Map: OpenStreetMap/CartoDB, Image: carlos-afonso.github.io/measles',
    align = 'center',
    text_font = 'consolas',
    text_font_size = '10pt',
    text_font_style = 'normal'
)

# Add the title, notes, and credits to the figure
p.add_layout(title, 'above')
p.add_layout(note_1, 'below')
p.add_layout(note_2, 'below')
p.add_layout(Title(text=' ', text_font_size='0pt'), 'below') # Adds some empty space
p.add_layout(credits, 'below')

# Add tile provider
p.add_tile(tile_provider)

# Make axis invisible
p.axis.visible = False

# Define the dataframe (df) as the data source
source = ColumnDataSource(df)

# Add the bubbles / circles
p.circle(
    x = 'mercator_x',
    y = 'mercator_y',
    alpha = 0.3,
    fill_color = 'red',
    line_color = 'red',
    size = 'cases_radius',
    source = source
)

# Add labels with the number of cases
p.add_layout(LabelSet(
    x = 'mercator_x',
    y = 'mercator_y',
    text = 'cases',
    #level = 'glyph',
    x_offset = 'cases_label_x_offset',
    y_offset = 'cases_label_y_offset',
    source = source,
    render_mode = 'canvas',
    #text_color = 'black',
    text_font_size = '12pt'
))

# Add labels with the neighborhood names
p.add_layout(LabelSet(
    x = 'mercator_x',
    y = 'mercator_y',
    text = 'neighborhood',
    #level = 'glyph',
    x_offset = 'neighborhood_label_x_offset',
    y_offset = 'neighborhood_label_y_offset',
    source = source,
    render_mode = 'canvas',
    #text_color = 'black',
    text_font_size = '12pt'
))

# Remove the toolbar, to remove plot interactivity
# In this case it's better to have just a static image, to keep the focus on the area of interest
p.toolbar_location = None

show(p)
Loading BokehJS ...

Save map

Note: The export_png below produces a relativelly low resolution image. To get a higher resolution PNG image we can use a manual method: open the SVG version on a browser, zoom in/out as needed, take a screenshot, and then crop the schreenshot.

In [16]:
# Define output file name
output_file = os.path.join('..', 'images', 'nyc-measles-cases-by-neighborhood-map-final')

# Save as PNG
export_png(p, output_file + '.png')

# Save as SVG
p.output_backend = 'svg'
export_svgs(p, output_file + '.svg')
Out[16]:
['..\\images\\nyc-measles-cases-by-neighborhood-map-final.svg']

Export notebook as HTML

In [17]:
# Export this notebook as a static HTML page
os.system('jupyter nbconvert --to html nyc-measles-cases-by-neighborhood-final.ipynb')
Out[17]:
0