** Welcome
*** { @unit = “”, @title = “Course Overview”, @foldout }
CPP 527 is the second course in the Foundations of Data Science sequence. This semester extends work done in CPP 526 by introducing programming topics like control structures, loops, and regular expressions that are necessary for building simulations and specialized applications in R. We will also cover the foundations of document design using both GitHub pages (free websites like this one) and customized RMD templates, so that you can begin developing custom reporting formats that enable you to better structure results from analytical projects or automate tasks.
*** { @unit = “”, @title = “Getting Help”, @assignment, @foldout }
Note that the discussion board is hosted by the GitHub issues feature. It is a great forum because:
Please preview your responses before posting to ensure proper formatting. Note that you format code by placing fences around the code:
```
# your code here
lm( y ~ x1 + x2 )
```
The fences are three back-ticks. These look like quotation marks, but are actually the character at the top left of your keyboard (if you have a US or European keyboard).
GitHub does not have a native math rendering language (RMD documents, on the other hand, support formulas). So you have two options: type formulas as regular text and use code formatting to make them clear (this option is usually sufficient), or type your formula in a formula editor and copy and paste an image of the nicely-formatted result.
```
y = b0 + b1•X1 + b2•X2 + e
b1 = cov(x,y) / var(x)
```
** Week 1 - Control Structures
*** { @unit = “”, @title = “Unit Overview”, @foldout }
This section introduces control structures, which serve to incorporate decision-making into computer code. They enable things like if-then logic to determine which code should run based upon specified conditions.
Once you have completed this section you will be able to
Your assignment this week will be to design computer code to simulate the steps in the game show Let’s Make a Deal.
*** { @unit = “”, @title = “Readings”, @reading, @foldout }
Please revisit the following chapter from last semester:
Required:
Quick Reference on Control Structures
This topic builds off of the use of loops and thus is a little more advanced - we will cover it in CPP 528. It would not hurt to preview the topic now, though.
*** { @unit = “”, @title = “Pseudo-Code”, @lecture, @foldout }
Typically as you start a specific task in programming there are two things to consider.
(1) What are the steps needed to complete the task? (2) How do I implement each step? How do I translate them into the appropriate functions and syntax?
It will save you a huge amount of time if you separate these tasks. First, take a step back from the problem and think about the steps. Write down each step in order. Make some notes about what data is needed from the previous step, and what the return result will be from the current step.
Think back to the cooking example. If we are going to bake cookies our pseudo-code would look something like this:
Note that it lacks many necessary details. How much of each ingredient? What temperature does the oven need to be? How long do we bake for?
Once we have the big picture down and are comfortable with the process then we can start to fill in these details:
Note that in computer programming terms butter, sugar, and brown sugar are the inputs or arguments needed for a function. The wet sand mixture is the return value of the process.
In the final step, we will begin to implement code.
# 1. Preheat the oven.
# - preheat to 375 degrees
preheat_oven <- function( temp=375 )
{
start_oven( temp )
return( NULL )
}
# 2. In a large bowl, mix butter with the sugars until well-combined.
# - 1/3 cup butter
# - 1/2 cup sugar
# - 1/4 cup brown sugar
# - mix until the consistency of wet sand
mix_sugar <- function( butter=0.33, sugar=0.5, brown.sugar=0.25 )
{
sugar.mixture <- mix( butter, sugar, brown.sugar )
return( sugar.mixture )
}
# 3. Stir in vanilla and egg until incorporated.
# - add to sugar mixture
# - mix until consistency of jelly
add_wet_ingredients <- function( sugar=sugar.mixture, eggs=2, vanilla=2 )
{
# note that the results from the previous step are the inputs into this step
wet.mixture <- mix( sugar, eggs, vanilla )
return( wet.mixture )
}
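To see how steps chain together, here is a minimal runnable sketch of the same recipe. The mix() helper is a stand-in of our own invention - it simply bundles its inputs into a list so the example can run:

```
# stand-in helper: bundles its named inputs into a list
mix <- function( ... ) { list( ... ) }

mix_sugar <- function( butter=0.33, sugar=0.5, brown.sugar=0.25 )
{
  sugar.mixture <- mix( butter=butter, sugar=sugar, brown.sugar=brown.sugar )
  return( sugar.mixture )
}

add_wet_ingredients <- function( sugar.mixture, eggs=2, vanilla=2 )
{
  # the return value of the previous step is the input to this step
  wet.mixture <- mix( base=sugar.mixture, eggs=eggs, vanilla=vanilla )
  return( wet.mixture )
}

# chain the steps: each step's return value feeds the next
dough <- add_wet_ingredients( mix_sugar() )
```

The orchestration line at the bottom is the point of the exercise - each function is a self-contained step, and the recipe is just the steps composed in order.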
We are describing here the process of writing pseudo-code. It is the practice of:
Pseudo-code helps you start the process and work incrementally. It is important because the part of your brain that does general problem-solving (creating the basic recipe) is different from the part that drafts specific syntax in a language and debugs the code. If you jump right into the code it is easy to get lost or derailed.
More importantly, pseudo-code captures the problem logic, and thus it is independent of any specific computer language. When collaborating on projects one person might generate the system logic, and another might implement it. So it is important to practice developing a general overview of the task at hand.
Here are some helpful examples:
*** { @unit = “Due Jan 21st”, @title = “Lab 01”, @assignment, @foldout }
This lab is based upon the famous Monty Hall Problem.
Although there was much debate about the correct solution when the problem was initially introduced, there are now many concise explanations of the proper solution:
The Monty Hall Problem is a great example of a mathematical problem that might be hard to solve using a mathematical proof, but it is fairly easy to solve using simulation. Since it is just a game with simple and explicit rules we can build our own virtual version. Then we can compare how outcomes differ when we deploy the two different strategies for selecting doors.
In Lab 01 we will use control structures to build a virtual version of the game. In Lab 02 we will use simulation to play the game thousands of times so that we can get stable estimates of the payoff of each strategy.
** Week 2 - Simulations
*** { @unit = “”, @title = “Unit Overview”, @lecture, @foldout }
This section introduces loops. We will use them to create simulations.
Once you have completed this section you will be able to
*** { @unit = “”, @title = “Readings”, @reading, @foldout }
Required:
Building Simulations in R: Mastering Loops
Creating Animations with Loops
Background reading:
Why Americans Are So Damn Unhealthy, In 4 Shocking Charts
*** { @unit = “FRI Jan 24th”, @title = “YellowDig Practice Problems”, @assignment, @foldout }
How can you make interesting animations in R?
We covered a very basic animation - a random walk - in the lecture notes.
Start the game with $10 in cash and see how long you last. At each step you flip a coin and win a dollar, lose a dollar, or stay the same. How long does the average player survive before going bankrupt?
cash <- 10
results <- NULL
count <- 1
while( cash > 0 )
{
cash <- cash +
sample( c(-1,0,1), size=1 )
results[count] <- cash
count <- count + 1
}
This is a one-dimensional outcome tracked over time. Physicists have used a similar model to examine particle motion. It is called a Brownian Motion model. It is similar to the betting model above, except that in each time period the particle moves in two dimensions.
x <- 0
y <- 0
for( i in 1:1000 )
{
x[i+1] <- x[i] + rnorm(1)
y[i+1] <- y[i] + rnorm(1)
}
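A quick way to inspect the path that this loop produces is a static plot of the coordinates (the animated version is covered in the lecture notes):

```
# simulate a two-dimensional random walk and plot the path
x <- 0
y <- 0
for( i in 1:1000 )
{
  x[i+1] <- x[i] + rnorm(1)
  y[i+1] <- y[i] + rnorm(1)
}

plot( x, y, type="l", col="steelblue",
      main="Brownian Motion", xlab="", ylab="" )
points( x[1001], y[1001], pch=19, col="firebrick" )  # mark the final position
```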
Consider the following two problems.
(1) How long does the typical person take to go bankrupt? If you don’t want to do a complicated mathematical proof, you can create a simulation: play the game 10,000 times, then report the average number of periods each game lasted.
What is the code to make this work?
(2) Note the trailing tail in the Brownian Motion animation. How would you create that as part of an animation?
Post your ideas or solutions on YellowDig:
Share your ideas about these problems with your classmates. Or share another animation that you found that uses loops.
*** { @unit = “FRI Jan 31st”, @title = “Lab 02”, @assignment, @foldout }
Please review the instructions at the end of the lecture notes:
Building Simulations in R: Mastering Loops
** Week 3 - GitHub Pages
*** { @unit = “”, @title = “Unit Overview”, @foldout }
A big part of every analyst's job is trying to find ways to distill large volumes of data and information down to meaningful bites of knowledge, often for diverse stakeholder audiences with varying degrees of technical expertise. For this reason, communication skills are extremely valuable for data scientists. You will constantly be challenged to find the interesting story that emerges from an overwhelming amount of data, and to find creative ways to tell the story so that information becomes actionable.
Although it might not sound as edgy as building a machine learning classifier, the ability to create customized reporting formats and automate various steps of analysis will make you both more efficient and more effective at communication.
This lab introduces you to one powerful tool for your toolkit - using GitHub pages to build a website quickly and inexpensively (for free, actually). You can then use it to host various components of projects, including public-facing reports and rendered RMD documents.
For the project component of the course we will use a CV template to learn how the pagedown package can be used to create highly-customized report templates:
We will also practice automation by the separation of the design elements of reporting from the data contained in the reports. In this example for a CV, Nick Strayer’s positions are stored in a CSV file on GitHub:
And they are added to the document templates using some custom functions which filter positions and loop through the list to iteratively build the document.
This week’s lab will ask you to configure a GitHub page within a repository on your account. GitHub pages are an amazing resource because (1) they allow you to create an unlimited number of websites related to your projects FOR FREE, and (2) they can be created and maintained using Markdown, which simplifies a lot of the complexity of websites. You will learn to link HTML files generated from R Studio so that you can start sharing analytical projects with external audiences.
The set-up of a simple GitHub page is fairly straight-forward and can be completed in a few basic steps:
This will give you a barebones website with a landing page you can write using Markdown, and a few templates that you can select from.
You have access to myriad advanced features on the platform. GitHub pages leverage several powerful web frameworks like Jekyll, Bootstrap, Liquid, and Javascript to make customization of static pages both easy and powerful. We will spend some time talking about how the pieces of a website fit together so that you have a rudimentary knowledge of the platform:
More importantly, GitHub pages can help demonstrate the concept of page templates. We can design the layout of a graphic, table, or section of text on a page, then dynamically populate it with data. GitHub pages allow you to do this with basic HTML and Liquid tags:
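As a small illustration of the idea (the page data below is made up for the example), a Liquid loop can populate an HTML list from values stored in a page's YAML header:

```
---
layout: page
pets:
  - cat
  - dog
  - mouse
---
<ul>
{% for pet in page.pets %}
  <li>{{ pet }}</li>
{% endfor %}
</ul>
```

The layout stays fixed while the data in the header can change - the same separation of design from content that we will use for report templates.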
And the pagedown package in R allows you to develop a variety of templates using similar principles:
Similar to other work we have done in R, we will start by using some working examples, then reverse engineer them so you can see how the pieces fit together. You are not expected to master any of these topics in the short time-frame of the semester. The proper benchmark of knowledge is whether you can take an existing open source project and adapt it as necessary.
You will not be required to learn web programming languages like HTML and Javascript (though they are super useful if you invest the time). You do, however, need to become familiar with very basic CSS, as it is impossible to do customization without it. CSS started as a somewhat modest project but has evolved into a powerful language. R Markdown documents support CSS, which makes them fully customizable. It will also become more important as you begin to develop dashboards or custom interactive Shiny apps, since CSS is the primary means of controlling layouts and other style elements.
These two pages on the example GitHub site have the same content, but CSS elements are used to change the page layout and style on the second. Click on the “see page layout” button to see the CSS elements.
Skim the following chapters, reading to get a general sense of concepts and the basics of how each might function. You can skip sections that explain the code in detail. I am more concerned that you understand how these basic pieces fit together, and when you hear terms like “responsive” you conceptually know what people are talking about.
Introduction to Web Programming
*** { @unit = “FRI Jan 31st”, @title = “YellowDig Practice Problems”, @assignment, @foldout }
*** { @unit = “TUE Feb 11th”, @title = “Lab 03”, @assignment, @foldout }
The animation in the Unit Overview above shows how simple it is to activate GitHub pages for any project repository so that you can turn markdown files into web-hosted HTML files and share tutorials or reports created from RMD files.
If we want a website with a bit more functionality, however, we will need to start from an existing template and adapt it.
For this lab you will be asked to fork the beautiful-jekyll website template:
Beautiful Jekyll Website Template on GitHub
Follow the instructions in the README file to begin customizing your page.
In the _config.yml
file in the default directory do the following:
# Name of website
title: My website
# Short description of your site
description: A virtual proof that I'm awesome
You can update social network IDs if you like, or replace Dean’s info with empty quotes ""
so the social media icons are present but not active.
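For example, the social IDs live in a social-network-links block in _config.yml (the exact key names may vary slightly across versions of the template; the values below are placeholders):

```
# --- Social media links --- #
social-network-links:
  email: ""
  github: "your-github-username"
  twitter: ""
  linkedin: ""
```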
# Personalize the colors in your website. Colour values can be any valid CSS colour
navbar-col: "#F5F5F5"
navbar-text-col: "#404040"
navbar-children-col: "#F5F5F5"
page-col: "#FFFFFF"
link-col: "#008AFF"
hover-col: "#0085A1"
footer-col: "#F5F5F5"
footer-text-col: "#777777"
footer-link-col: "#404040"
You have forked the master branch of the website, which does not include the “getting started” page on the live site menu:
Navigate to the getting started page located on another branch, and copy this file to the main folder on your site. I would copy the text from the raw view of the page and just create a new file called getstarted.md on your site.
Now update the navigation bar and add another option called “Getting Started” under “Resources”. You will use the text “getstarted” for the URL, excluding the .md markdown extension. GitHub pages converts all markdown files to HTML files in the background, so you want to direct the user to the HTML version, which does not require an explicit extension to work in browsers:
# List of links in the navigation bar
navbar-links:
About Me: "aboutme"
Resources:
- Beautiful Jekyll: "http://deanattali.com/beautiful-jekyll/"
- Learn markdown: "http://www.markdowntutorial.com/"
- Getting Started: "getstarted" # ADD THIS LINK
Author's home: "http://deanattali.com"
The index.html file contains text from the landing page of the website. You will find some page title and descriptions in the YAML header of this file:
---
layout: page
title: My website
subtitle: This is where I will tell my friends way too much about me
use-site-title: true
---
Demonstrate that you are able to apply CSS styles to specific elements of a page.
Create a new div section around Step 1 on the Getting Started page.
## Overview of steps required
There are only three simple steps, ....
Here is a 40-second video ....
<img src="../img/install-steps.gif" style="width:100%;" alt="Installation steps" />
<div class="gs-section-01">
### 1. Fork the Beautiful Jekyll repository
Fork the [repository](https://github.com/daattali/beautiful-jekyll)
by clicking the Fork button on the top right corner in GitHub.
</div>
Follow the Barebones Jekyll example for customizing a page style by adding a CSS style sheet at the bottom of the Getting Started page:
<style>
.gs-section-01 h3 {
  color: red;
}
.gs-section-01 p {
  font-size: 30px;
}
</style>
Similarly, add new div sections around Step 02 and Step 03 on the page so that each step has different header styles and text. It doesn’t have to look nice - just show you are able to selectively change the style on a page.
Using the Barebones Jekyll Custom Table example add a page with a custom table.
Copy the liquid-table.html template and add it as a new layout in your site’s layout folder. You will need to change the parent page template on the liquid-table.html page to “default” or “page” in your new site (you don’t have a nice-text layout that you can use as the parent page layout).
---
layout: default
---
Create a new page in your main website folder called table-demo.md and copy the page content from the Barebones Jekyll example.
You will need to add the ryan-v-ryan.jpg image to your img folder for it to be accessible on your new site (you can right-click and save it, then drag it into the image folder on your GitHub site).
You do not need to include the “See Page Layout” button.
---
layout: liquid-table
title: 'amiright?'
reynolds:
strengths:
- good father
- funny
- dated alanis morissette
weaknesses:
- singing
- green lantern movie
- tennis backhand
gosling:
strengths:
- builds houses
- is a real boy
- never dated alanis morissette
weaknesses:
- micky mouse club
- cries a lot
- not ryan reynolds
---
![](img/ryan-v-ryan.jpg)
## Lorem Ipsum
Lorem ipsum dolor sit amet....
Add the page to your navigation bar:
# List of links in the navigation bar
navbar-links:
About Me: "aboutme"
Resources:
- Beautiful Jekyll: "http://deanattali.com/beautiful-jekyll/"
- Learn markdown: "http://www.markdowntutorial.com/"
- Getting Started: "getstarted" # ADD THIS LINK
Table Demo: "table-demo"
When these steps are done, submit a link to (1) your live site and (2) your GitHub repo where the website lives.
And share your page link on YellowDig:
** Week 4 - Regular Expressions
*** { @unit = “”, @title = “Unit Overview”, @reading, @foldout }
So this week comes with an up-front warning: you can get a PhD in Natural Language Processing, an entire field devoted to computational tools used to process and analyze text as data. We have one week to cover the topic!
We obviously cannot go too deep into this interesting field, but let’s at least motivate some of the R functionality with a couple of cool examples.
Which Hip-Hop Artist Has the Largest Vocabulary?
Who is the Anonymous Op-Ed Writer inside the Trump Administration?
These examples all demonstrate interesting uses of text as data. They are also examples of the types of insight that can come from analysis with big data - the patterns are hiding in plain sight, but our brains cannot hold enough information at one time to see them. Once we have a system to extract hidden patterns from language, we can go beyond large public databases to generate insights, and can start using all of Twitter, all published news stories, or all of the internet to identify trends and detect outliers.
The core of all text analysis requires two sets of skills. Text in computer science is referred to as “strings”, a reference to the fact that spoken languages mean nothing to computers, so they just treat them as strings of letters (words) or strings of words (sentences). String processing refers to a set of functions and conventions that are used to manipulate text as data. If you think about the data steps for regular data, we clean, combine, transform, and merge data inside of data frames. Similarly, there are operations for importing text datasets (often lots of documents full of words), cleaning them (removing words, fixing spelling errors), merging documents, etc. Core R contains many string processing functions, and there are lots of great packages.
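For instance, a few of the base R string functions applied to a small character vector:

```
x <- c( "Monty Hall", "Let's Make a Deal" )

nchar( x )                       # number of characters in each string
toupper( x )                     # convert to uppercase
substr( x[1], start=1, stop=5 )  # extract characters 1-5: "Monty"
paste( x[1], x[2], sep=" - " )   # combine strings
gsub( " ", "_", x[1] )           # replace spaces: "Monty_Hall"
```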
Regular expressions are a set of functions used to aid in processing text by defining very precise ways to query a text database - looking for specific strings, or more often strings that match some specific pattern that has meaning. For example, if I gave you the following text with everything but punctuation replaced by X, you could still tell me what the words are for:
So regular expressions can be very useful for searching large databases for general classes of text, or alternatively for searching for generic text that occurs only in a very specific context (at the beginning or end of a word, in the middle of a phrase, etc.).
The function grep( pattern, string ) searches for the pattern in each of the strings in the character vector, strings, defined below. The search pattern in each case below represents a regular expression.
What will the following cases return?
strings <- c("through","rough","thorough","throw","though","true","threw","thought","thru","trough")
# what will the following return?
grep( pattern="th?rough", strings, value = TRUE)
grep( pattern=".ough", strings, value = TRUE)
grep( pattern="^.ough", strings, value = TRUE)
grep( pattern="ough.", strings, value = TRUE)
grep( pattern="[^r]ough", strings, value = TRUE)
# these are not as useful
grep( pattern="tr*", strings, value = TRUE)
grep( pattern="t*o", strings, value = TRUE)
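Why are the last two not as useful? In a regular expression the asterisk means "zero or more of the preceding character" rather than a wildcard, so the pattern "tr*" matches any string that contains a t. Requiring at least one r with the + quantifier behaves more like you might expect:

```
strings <- c("through","rough","thorough","throw","though","true","threw","thought","thru","trough")

# "tr*" = t followed by ZERO or more r's: matches every string containing a t
grep( pattern="tr*", strings, value=TRUE )

# "tr+" = t followed by at least one r
grep( pattern="tr+", strings, value=TRUE )  # "true" "trough"
```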
*** { @unit = “FRI Feb 7th”, @title = “YellowDig Practice Problems”, @assignment, @foldout }
Explain the following unexpected behaviors:
> (1:10) > 5
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
> (1:10) > "5"
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
x is a factor cataloging animals in a shelter, recording the type of animal.
Why can’t I count dogs?
> x # TYPE OF ANIMAL (FACTOR)
[1] cat dog mouse
Levels: cat dog mouse
> x == "cat"
[1] TRUE FALSE FALSE
> x == "dog"
[1] FALSE FALSE FALSE
> x == "mouse"
[1] FALSE FALSE TRUE
I have a sample of 10 people and am trying to determine their average level of education: 12 = high school degree, 16 = four-year college degree, etc.
My data is stored as a factor (which it should be, since it is a categorical variable). But that makes it hard to calculate averages.
What is going wrong here?
grade.levels <- factor( c(12, 16, 12, 7, 7, 5, 6, 5, 9, 10) )
> # want to know average level of
> # schooling for sample:
> mean( grade.levels )
[1] NA
Warning message:
In mean.default(grade.levels) :
argument is not numeric or logical: returning NA
>
> # mean requires a numeric variable
> mean( as.numeric( grade.levels ) )
[1] 3.8
We have a large database where all of the addresses and geographic coordinates are stored as follows:
x <- c("100 LANGDON ST
MADISON, WI
(43.07794869500003, -89.39083342499998)", "00 N STOUGHTON RD E WASHINGTON AVE
MADISON, WI
(43.072951239000076, -89.38668964199996)")
Write a function that accepts the address vector x as the input, and returns a vector of numeric coordinates.
Note that the length of addresses can change, so you will need to use regular expressions (instead of just a substr() function) to generate proper results.
** Week 5 - Text Analysis
*** { @unit = “MON Feb 24th”, @title = “Lab 04”, @assignment, @foldout }
*** { @unit = “FRI Feb 21st”, @title = “YellowDig Discussion”, @assignment, @foldout }
The topic this week is an introduction to the hugely important topic of reproducibility in science - the ability to reproduce results from ground-breaking scientific work that was published in top journals. For a long time there was an assumption that peer review meant that each scientist subjected their work to fellow scientists who were experts on the topic, and thus that it provided a solid barrier that kept error-prone work from being published.
The notion of reproducibility, however, grew from fields like physics and chemistry where early lab experiments could be described with enough precision for another scientist to mix the same chemicals, or recreate the same conditions for a gravity experiment, and easily verify whether the claims in the paper were defensible.
Things get a lot more complicated now that (1) the data requirements necessary to publish in top journals have expanded, (2) methods have become much more complicated, and (3) science is very competitive, with high-stakes rewards for winning grants or coveted endowed professorships, resulting in proprietary data, data collection techniques, or lab conditions like stem cell lines or bacteria strains that cannot be easily replicated. As a result, peer reviewers have to take a lot of what authors say at face value, without having enough information to challenge certain assertions, or without having access to the raw data and thousands of lines of code that produce the results being defended in the paper. Furthermore, weaknesses in how statistical methods are reported have introduced systematic bias into the types of research that get published in top journals (it has to be splashy, and thus is more likely to consist of anomaly studies than reproducible work).
If you are not going into academics, should you care? Yes, because the problems with reproducibility in science are just a proxy for problems with data analysis that will arise in any organization outside of academics as well. Scientists experience pressure to publish. Consultants also experience pressure to do work fast, and to identify patterns that will impress clients. These sorts of issues will arise in any environment where data brings value. In science we care about making the process transparent so that others can replicate work. If you are a manager or project lead for a team of analysts, you should care about transparency because it allows you to ensure your team is doing the job correctly, especially if your name is going on the report!
This week’s topic introduces you to the fascinating replication crisis in science. Your task will be to read two articles on reproducibility in science:
When the Revolution Came for Amy Cuddy, New York Times Magazine, 2017
How Quality Control Could Save Science
You are welcome to skim additional articles on the topic conveniently catalogued by Nature Magazine:
Challenges in irreproducible research
For the discussion topic this week, I would like you to argue either:
(1) that the reproducibility crisis can be effectively ended if science adopts new technologies and better practices, or
(2) that the problems with reproducibility are so engrained in the limits of science and in the DNA of academic institutions that there will always be problems with reproducibility, and attempts to address it are either insufficient in their ability to get to the root of the problem or naive about human nature.
Pick a side and make your case!
** Week 6 - Data APIs & Tidy Data
*** { @unit = “”, @title = “Unit Overview”, @foldout }
Data journalists describe the value of APIs.
Members of the MIT Media Lab spun out a company called Datawheel. Their goal is to make public data more accessible as well as useful. Their team boasts a number of graphic design experts and data visualization geniuses. They have found ways to take large and confusing government datasets, and make them interesting and accessible.
One cool aspect of their DataUSA project is that to make it work they ended up downloading a bunch of large and clumsy US government open datasets, cleaning up their formats, and hosting new copies on their own servers. In order for their website to function, each city view must be able to pull data from over a dozen databases quickly, so they have architected the new data structures so that users can query data at different levels like city, county, or state, and the data is aggregated accordingly.
The really great part, though, is that they have made their API endpoints freely available to the public. Since they have cleaned up over a dozen government datasets and now host them on their fast servers, they have made it easy to access a lot of statistics quickly.
The lab this week will draw from a Gist that explains the structure of the DataUSA API and teaches the basics of how data APIs function.
We will also look at some different types of APIs, such as the Census Bureau’s automatic geocoding tool, which allows you to upload a spreadsheet with 10,000 addresses and returns a new spreadsheet with all addresses geocoded and matched to Census tracts.
This tool is not technically an API, but you can still automate calls by writing an R script that will upload files to the site for you, so that you can batch process more than 10,000 addresses at a time.
*** { @unit = “FRI Feb 28th”, @title = “YellowDig Practice Problems”, @assignment, @foldout }
If you recall the rules of implicit casting, R tries to select the data type that preserves the most information.
> x <- 1:3
> y <- c("a","b","c")
> c( x, y )
[1] "1" "2" "3" "a" "b" "c"
> x.as.string <- as.character( x )
> x.as.string
[1] "1" "2" "3"
> as.numeric( x.as.string )
[1] 1 2 3
> as.numeric( y )
[1] NA NA NA
Warning message:
NAs introduced by coercion
Note the rules for conversion when you combine numeric and logical vectors:
> x <- c(TRUE,FALSE,TRUE)
> c( x, FALSE )
[1] TRUE FALSE TRUE FALSE
> c( x, 1 )
[1] 1 0 1 1
> c( x, "ONE" )
[1] "TRUE" "FALSE" "TRUE" "ONE"
> x2 <- c( x, 1 )
> as.logical( x2 )
[1] TRUE FALSE TRUE TRUE
> x3 <- c( x, 2 )
> as.logical( x3 )
[1] TRUE FALSE TRUE TRUE
> x4 <- c( x, 0 )
> as.logical( x4 )
[1] TRUE FALSE TRUE FALSE
Try to guess how it treats the following cases before you run the code:
x <- c(TRUE,FALSE,TRUE)
c( x, 2 )
x2 <- c( x, 1.1 )
as.logical( x2 )
x3 <- c( x, 0.9 )
as.logical( x3 )
x4 <- c( x, log(1) )
as.logical( x4 )
x5 <- c( x, -1 )
as.logical( x5 )
# How many of these state names contain a W?
> states <- c("New Mexico","New York","Washington","West Virginia")
> grep( pattern = "w", x = states, value = TRUE )
[1] "New Mexico" "New York"
> grep( pattern = "W", x = states, value = TRUE )
[1] "Washington" "West Virginia"
Which pattern would you use to match all state names with a W, no matter if the W is capital or lowercase? You are not allowed to use the ignore.case argument in grep().
*** { @unit = “OPTIONAL”, @title = “Lab 05”, @assignment, @foldout }
THIS LAB IS OPTIONAL
If you recall from CPP 526 we discussed the example where Ben Balter, GitHub’s official government evangelist, created a project to make Washington DC open GIS files more accessible and useful by converting them all to a format more amenable to open-source projects (geoJSON files).
Ben wrote a script that downloaded all of Washington DC’s open data files, converted them to better formats, then uploaded them to GitHub so others have access:
https://github.com/benbalter/dc-maps
The geoJSON files can also be read into R directly from GitHub, making it easy to incorporate the spatial maps and data into a wide variety of projects:
library( geojsonio )
library( sp )
github <- "https://raw.githubusercontent.com/benbalter/dc-maps/master/maps/2006-traffic-volume.geojson"
traffic <- geojson_read( x=github, method="local", what="sp" )
plot( traffic, col="steelblue" )
Recall the lab where you created a Dorling cartogram for your neighborhood clustering project:
library( geojsonio ) # read shapefiles
library( sp ) # work with shapefiles
library( sf ) # work with shapefiles - simple features format
library( tmap ) # theme maps
library( dplyr ) # data wrangling
library( pander ) # nice tables
crosswalk <- "https://raw.githubusercontent.com/DS4PS/cpp-529-master/master/data/cbsatocountycrosswalk.csv"
crosswalk <- read.csv( crosswalk, stringsAsFactors=F, colClasses="character" )
# search for city names by string; use the ^ anchor for "begins with"
grep( "^MIN", crosswalk$msaname, value=TRUE )
# select all FIPS for Minneapolis
these.minneapolis <- crosswalk$msaname == "MINNEAPOLIS-ST. PAUL, MN-WI"
these.fips <- crosswalk$fipscounty[ these.minneapolis ]
these.fips <- na.omit( these.fips )
state.fips <- substr( these.fips, 1, 2 )
county.fips <- substr( these.fips, 3, 5 )
dat <- data.frame( name="MINNEAPOLIS-ST. PAUL, MN-WI",
state.fips, county.fips, fips=these.fips )
dat
| name | state.fips | county.fips | fips |
|---|---|---|---|
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 003 | 27003 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 019 | 27019 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 025 | 27025 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 037 | 27037 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 053 | 27053 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 059 | 27059 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 123 | 27123 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 139 | 27139 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 141 | 27141 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 163 | 27163 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 171 | 27171 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 55 | 093 | 55093 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 55 | 109 | 55109 |
Now download shapefiles with Census data:
library( tidycensus )
# census_api_key("YOUR KEY GOES HERE")
# key <- "abc123"
# census_api_key( key )
# Minneapolis metro area spans two states -
# Minnesota = 27
# Wisconsin = 55
msp.pop1 <-
get_acs( geography = "tract", variables = "B01003_001",
state = "27", county = county.fips[state.fips=="27"], geometry = TRUE ) %>%
select( GEOID, estimate ) %>%
rename( POP=estimate )
msp.pop2 <-
get_acs( geography = "tract", variables = "B01003_001",
state = "55", county = county.fips[state.fips=="55"], geometry = TRUE ) %>%
select( GEOID, estimate ) %>%
rename( POP=estimate )
msp.pop <- rbind( msp.pop1, msp.pop2 )
plot( msp.pop )
Convert to a Dorling cartogram:
# convert sf map object to an sp version
msp.sp <- as_Spatial( msp.pop )
class( msp.sp )
# project map and remove empty tracts
msp.sp <- spTransform( msp.sp, CRS("+init=epsg:3395") )
msp.sp <- msp.sp[ msp.sp$POP != 0 & (! is.na( msp.sp$POP )) , ]
# convert census tract polygons to a dorling cartogram
# note: cartogram_dorling() comes from the cartogram package
library( cartogram )
# standardize the population weights before passing them to the
# algorithm; the k argument scales the circle sizes (default is k=5,
# and this value was found by trial and error)
msp.sp$pop.w <- msp.sp$POP / 9000
msp_dorling <- cartogram_dorling( x=msp.sp, weight="pop.w", k=0.05 )
plot( msp_dorling )
For example, once you have finished, it will be possible to do the following:
# dorling cartogram of Phoenix Census Tracts
github.url <- "https://raw.githubusercontent.com/DS4PS/cpp-529-master/master/data/phx_dorling.geojson"
phx <- geojson_read( x=github.url, what="sp" )
plot( phx )
Start with pseudo-code and write down the steps. I would recommend writing a couple of functions:
Test your code with a single city until it is functional:
these.minneapolis <- crosswalk$msaname == "MINNEAPOLIS-ST. PAUL, MN-WI"
At that point you can scale your steps by generalizing the city name.
city.names <- unique( crosswalk$msaname )
for( i in city.names )
{
# your code here
}
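The skeleton above can be filled in with a helper function that wraps the Minneapolis steps you already tested. Note that `build_city_dorling()` is a hypothetical name for the function you will write, not something provided by the lab files; this is only a sketch of one possible structure.

```r
# a minimal sketch; build_city_dorling() is a hypothetical name
# for a function that wraps the Minneapolis steps above
build_city_dorling <- function( city.name, crosswalk )
{
  these <- crosswalk$msaname == city.name
  fips  <- na.omit( crosswalk$fipscounty[ these ] )
  # ... download tracts with tidycensus, build the cartogram,
  # ... and save the result as a geojson file
}

for( i in city.names )
{
  # try() keeps the loop running if a single city fails
  result <- try( build_city_dorling( i, crosswalk ) )
  if( "try-error" %in% class( result ) ){ next }
}
```

Wrapping each iteration in `try()` is useful here because a bad Census API call or an empty city should not halt a loop over hundreds of cities.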
** Week 7 - Customized Reporting
*** { @unit = “MON Mar 2nd”, @title = “Code-Through Assignment”, @assignment, @foldout }
Since you are sharing your code-through with your classmates on Yellowdig, it will serve as your discussion topic this week.
Add your code-through files (specifically the HTML) to your new website repository on GitHub and generate an active URL for your tutorial so that you can share it with classmates. Note that you cannot host Shiny apps or other dynamic apps on GitHub Pages - they must be static HTML pages.
*** { @unit = “MON Mar 2nd”, @title = “Build an R Package”, @assignment, @foldout }
This tutorial will teach you how to build and share a package in R. At some point you might develop a tool that you want to upload to CRAN so it is widely available. More likely, if you are working with a team of analysts within an organization, you will begin building a library of functions specific to the project. At some point it will be more efficient to maintain the project code as a package, so that team members can update or enhance the functions and others can easily receive those updates by re-installing the package.
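As a rough sketch of what the workflow looks like with the devtools and usethis packages (your tutorial may use slightly different steps):

```r
# a minimal package-building sketch using devtools/usethis
library( devtools )

usethis::create_package( "montyhall" )  # scaffolds DESCRIPTION, NAMESPACE, R/
# copy your functions into montyhall/R/, add roxygen2 comments, then:
document()   # generates help files from the roxygen comments
check()      # runs R CMD check to catch common problems
install()    # installs the package locally

library( montyhall )  # your functions now load like any other package
```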
Complete the tutorial on “packaging” your R code from Labs 01 and 02 into a new montyhall package to make it easier to run simulations to evaluate game strategies.
To receive credit for the assignment, submit the URL to your package on GitHub.
*** { @unit = “MON Mar 2nd”, @title = “Report Template Assignment”, @assignment, @foldout }
This assignment teaches you to use RMD templates to simplify and automate the process of generating reports.
We will explore the process by reverse-engineering a simple example that was created to build resumes:
Begin by reading about the process:
For this assignment you will need to clone Nick Strayer’s CV project:
You can do this in the GitHub desktop application under File » Clone » URL then type in the project URL:
https://github.com/DS4PS/cv
Note, since the project is actively being developed this version on DS4PS is frozen in time for pedagogical purposes. You can follow the link to his repo to see what he has added.
A quick note on the difference between “cloning” a project and “forking” a GitHub project:
A fork is a copy of a repository that allows you to freely experiment with changes without affecting the original project, although a connection exists between your fork and the original repository. In this way, your fork acts as a bridge between the original repository and your personal copy, where you can contribute back to the original project using pull requests.
Unlike forking, when cloning you won’t be able to pull down changes from the original repository you cloned from, and if the project is owned by someone else you won’t be able to contribute back to it unless you are specifically invited as a collaborator. Cloning is ideal for instances when you need a way to quickly get your own copy of a repository where you may not be contributing to the original project.
In this instance we are not contributing back to the project to improve it. We just want our own local copy to work with, so cloning is the best option.
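If you prefer to stay in R rather than using the GitHub Desktop application, the usethis package can perform the clone as well (a sketch; either route produces the same local copy):

```r
# clone (not fork) the repository from within R using usethis;
# fork = FALSE makes this a plain clone with no link back upstream
usethis::create_from_github( "DS4PS/cv", fork = FALSE )
```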
After cloning the files, you should have local copies on your desktop. You will need to edit at least two files:
The “index.Rmd” and “resume.Rmd” files contain the pagedown code that generates the resume. You will need to adapt the code as appropriate for your purposes. Be sure to retain the helper functions, as you are required to pull position data from the CSV file instead of hard-coding it in the file. You can create your own section titles and content. List as many positions, projects, or internships as you can to reach at least 2 pages.
Second, delete Nick’s content from the “positions.csv” file and replace it with your own content for your positions.
When you are done, knit your file to generate your HTML resume.
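If you prefer knitting from the console rather than the RStudio Knit button, a one-line equivalent (assuming your working directory is the project folder) is:

```r
# render the resume from the R console; the HTML file
# is written to the same folder as the Rmd source
rmarkdown::render( "index.Rmd" )
```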
Create a new repository on your GitHub account called “CV”. Initiate with a README file. Clone the repository to your computer, and copy all of the updated files from your project. Commit these files to GitHub so they are in the new CV repo.
Go into settings and activate your GitHub page for this repository. You do not have to select a template.
You should now be able to view your HTML resume online.
For the assignment submit the following:
Consider creating a GitHub site to host a portfolio of projects you are working on. You can add the CV and your code-through assignments to the site.