** Welcome
*** { @unit = “”, @title = “Course Overview”, @foldout }
CPP 527 is the second course in the Foundations of Data Science sequence. This semester extends work done in CPP 526 by introducing programming topics like control structures, loops, and regular expressions that are necessary for building simulations and specialized applications in R. We will also cover the foundations of document design using both GitHub pages (free websites like this one) and customized RMD templates, so that you can begin developing custom reporting formats that enable you to better structure results from analytical projects or automate tasks.
*** { @unit = “”, @title = “Getting Help”, @assignment, @foldout }
Note that the discussion board is hosted by the GitHub issues feature. It is a great forum because:
Please preview your responses before posting to ensure proper formatting. Note that you format code by placing fences around the code:
```
# your code here
lm( y ~ x1 + x2 )
```
The fences are three back-ticks. These look like quotation marks, but are actually the character at the top left of your keyboard (if you have a US or European keyboard).
GitHub does not have a native math rendering language (RMD documents, on the other hand, support formulas). So you have two options: type formulas as regular text and use code formatting to make them clear (this option is usually sufficient), or type your formula in a formula editor and copy and paste an image of the nicely-formatted result.
```
y = b0 + b1•X1 + b2•X2 + e
b1 = cov(x,y) / var(x)
```
** Week 1 - Control Structures
*** { @unit = “”, @title = “Unit Overview”, @foldout }
This section introduces control structures, which serve to incorporate decision-making into computer code. They enable things like if-then logic to determine which code should run based upon specified conditions.
Once you have completed this section you will be able to
Your assignment this week will be to design computer code to simulate the steps in the game show Let’s Make a Deal.
*** { @unit = “”, @title = “Readings”, @reading, @foldout }
Please revisit the following chapter from last semester:
Required:
Quick Reference on Control Structures
This topic builds off of the use of loops and thus is a little more advanced - we will cover it in CPP 528. It would not hurt to preview the topic now, though.
*** { @unit = “”, @title = “Pseudo-Code”, @lecture, @foldout }
Typically as you start a specific task in programming there are two things to consider.
(1) What are the steps needed to complete the task? (2) How do I implement each step? How do I translate them into the appropriate functions and syntax?
It will save you a huge amount of time if you separate these tasks. First, take a step back from the problem and think about the steps. Write down each step in order. Make some notes about what data is needed from the previous step, and what the return result will be from the current step.
Think back to the cooking example. If we are going to bake cookies our pseudo-code would look something like this:
Note that it lacks many necessary details. How much of each ingredient? What temperature does the oven need to be? How long do we bake for?
Once we have the big picture down and are comfortable with the process then we can start to fill in these details:
Note that in computer programming terms butter, sugar, and brown sugar are the inputs or arguments needed for a function. The wet sand mixture is the return value of the process.
In the final step, we will begin to implement code.
# 1. Preheat the oven.
# - preheat to 375 degrees
preheat_oven <- function( temp=375 )
{
start_oven( temp )
return( NULL )
}
# 2. In a large bowl, mix butter with the sugars until well-combined.
# - 1/3 cup butter
# - 1/2 cup sugar
# - 1/4 cup brown sugar
# - mix until the consistency of wet sand
mix_sugar <- function( butter=0.33, sugar=0.5, brown.sugar=0.25 )
{
sugar.mixture <- mix( butter, sugar, brown.sugar )
return( sugar.mixture )
}
# 3. Stir in vanilla and egg until incorporated.
# - add to sugar mixture
# - mix until consistency of jelly
add_wet_ingredients <- function( sugar=sugar.mixture, eggs=2, vanilla=2 )
{
# note that the results from the previous step are the inputs into this step
wet.mixture <- mix( sugar, eggs, vanilla )
return( wet.mixture )
}
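To see how steps chain together, here is a minimal runnable sketch of the same recipe. The mix() helper is a stand-in of our own invention - it simply bundles its inputs into a list so the example can run:

```
# stand-in helper: bundles its named inputs into a list
mix <- function( ... ) { list( ... ) }

mix_sugar <- function( butter=0.33, sugar=0.5, brown.sugar=0.25 )
{
  sugar.mixture <- mix( butter=butter, sugar=sugar, brown.sugar=brown.sugar )
  return( sugar.mixture )
}

add_wet_ingredients <- function( sugar.mixture, eggs=2, vanilla=2 )
{
  # the return value of the previous step is the input to this step
  wet.mixture <- mix( base=sugar.mixture, eggs=eggs, vanilla=vanilla )
  return( wet.mixture )
}

# chain the steps: each step's return value feeds the next
dough <- add_wet_ingredients( mix_sugar() )
```

The orchestration line at the bottom is the point of the exercise - each function is a self-contained step, and the recipe is just the steps composed in order.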
We are describing here the process of writing pseudo-code. It is the practice of:
Pseudo-code helps you start the process and work incrementally. It is important because the part of your brain that does general problem-solving (creating the basic recipe) is different from the part that drafts specific syntax in a language and debugs the code. If you jump right into the code it is easy to get lost or derailed.
More importantly, pseudo-code captures the problem logic, and thus it is independent of any specific computer language. When collaborating on projects one person might generate the system logic, and another might implement it. So it is important to practice developing a general overview of the task at hand.
Here are some helpful examples:
*** { @unit = “Due Jan 21st”, @title = “Lab 01”, @assignment, @foldout }
This lab is based upon the famous Monty Hall Problem.
Although there was much debate about the correct solution when the problem was initially introduced, there are now many concise explanations of the proper solution:
The Monty Hall Problem is a great example of a mathematical problem that might be hard to solve using a mathematical proof, but it is fairly easy to solve using simulation. Since it is just a game with simple and explicit rules we can build our own virtual version. Then we can compare how outcomes differ when we deploy the two different strategies for selecting doors.
In Lab 01 we will use control structures to build a virtual version of the game. In Lab 02 we will use simulation to play the game thousands of times so that we can get stable estimates of the payoff of each strategy.
** Week 2 - Simulations
*** { @unit = “”, @title = “Unit Overview”, @lecture, @foldout }
This section introduces loops. We will use them to create simulations.
Once you have completed this section you will be able to
*** { @unit = “”, @title = “Readings”, @reading, @foldout }
Required:
Building Simulations in R: Mastering Loops
Creating Animations with Loops
Background reading:
Why Americans Are So Damn Unhealthy, In 4 Shocking Charts
*** { @unit = “FRI Jan 24th”, @title = “YellowDig Practice Problems”, @assignment, @foldout }
How can you make interesting animations in R?
We covered a very basic animation - a random walk - in the lecture notes.
Start the game with $10 in cash and see how long you last. At each step you flip a coin and win a dollar, lose a dollar, or stay the same. How long does the average player survive before going bankrupt?
cash <- 10
results <- NULL
count <- 1
while( cash > 0 )
{
cash <- cash +
sample( c(-1,0,1), size=1 )
results[count] <- cash
count <- count + 1
}
This is a one-dimensional outcome tracked over time. Physicists have used a similar model to examine particle motion. It is called a Brownian Motion model. It is similar to the betting model above, except that in each time period the particle moves in two dimensions.
x <- 0
y <- 0
for( i in 1:1000 )
{
x[i+1] <- x[i] + rnorm(1)
y[i+1] <- y[i] + rnorm(1)
}
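A quick way to inspect the path that this loop produces is a static plot of the coordinates (the animated version is covered in the lecture notes):

```
# simulate a two-dimensional random walk and plot the path
x <- 0
y <- 0
for( i in 1:1000 )
{
  x[i+1] <- x[i] + rnorm(1)
  y[i+1] <- y[i] + rnorm(1)
}

plot( x, y, type="l", col="steelblue",
      main="Brownian Motion", xlab="", ylab="" )
points( x[1001], y[1001], pch=19, col="firebrick" )  # mark the final position
```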
Consider the following two problems.
(1) How long does the typical person take to go bankrupt? If you don’t want to do a complicated mathematical proof, you can create a simulation: play the game 10,000 times, then report the average number of periods each game lasted.
What is the code to make this work?
(2) Note the trailing tail in the Brownian Motion animation. How would you create that as part of an animation?
Post your ideas or solutions on YellowDig:
Share your ideas about these problems with your classmates. Or share another animation that you found that uses loops.
*** { @unit = “FRI Jan 31st”, @title = “Lab 02”, @assignment, @foldout }
Please review the instructions at the end of the lecture notes:
Building Simulations in R: Mastering Loops
** Week 3 - GitHub Pages
*** { @unit = “”, @title = “Unit Overview”, @foldout }
A big part of every analyst's job is trying to find ways to distill large volumes of data and information down to meaningful bites of knowledge, often for diverse stakeholder audiences with varying degrees of technical expertise. For this reason, communication skills are extremely valuable for data scientists. You will constantly be challenged to find the interesting story that emerges from an overwhelming amount of data, and to find creative ways to tell the story so that information becomes actionable.
Although it might not sound as edgy as building a machine learning classifier, the ability to create customized reporting formats and automate various steps of analysis will make you both more efficient and more effective at communication.
This lab introduces you to one powerful tool for your toolkit - using GitHub pages to build a website quickly and inexpensively (for free, actually). You can then use it to host various components of projects, including public-facing reports and rendered RMD documents.
For the project component of the course we will use a CV template to learn how the pagedown package can be used to create highly-customized report templates:
We will also practice automation by the separation of the design elements of reporting from the data contained in the reports. In this example for a CV, Nick Strayer’s positions are stored in a CSV file on GitHub:
And they are added to the document templates using some custom functions which filter positions and loop through the list to iteratively build the document.
This week’s lab will ask you to configure a GitHub page within a repository on your account. GitHub pages are an amazing resource because (1) they allow you to create an unlimited number of websites related to your projects FOR FREE, and (2) they can be created and maintained using Markdown, which simplifies a lot of the complexity of websites. You will learn to link HTML files generated from R Studio so that you can start sharing analytical projects with external audiences.
The set-up of a simple GitHub page is fairly straight-forward and can be completed in a few basic steps:
This will give you a barebones website with a landing page you can write using Markdown, and a few templates that you can select from.
You have access to myriad advanced features on the platform. GitHub pages leverage several powerful web frameworks like Jekyll, Bootstrap, Liquid, and Javascript to make customization of static pages both easy and powerful. We will spend some time talking about how the pieces of a website fit together so that you have a rudimentary knowledge of the platform:
More importantly, GitHub pages can help demonstrate the concept of page templates. We can design the layout of a graphic, table, or section of text on a page, then dynamically populate it with data. GitHub pages allow you to do this with basic HTML and Liquid tags:
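As a small illustration of the idea (the page data below is made up for the example), a Liquid loop can populate an HTML list from values stored in a page's YAML header:

```
---
layout: page
pets:
  - cat
  - dog
  - mouse
---
<ul>
{% for pet in page.pets %}
  <li>{{ pet }}</li>
{% endfor %}
</ul>
```

The layout stays fixed while the data in the header can change - the same separation of design from content that we will use for report templates.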
And the pagedown package in R allows you to develop a variety of templates using similar principles:
Similar to other work we have done in R, we will start by using some working examples, then reverse engineer them so you can see how the pieces fit together. You are not expected to master any of these topics in the short time-frame of the semester. The proper benchmark of knowledge is whether you can take an existing open source project and adapt it as necessary.
You will not be required to learn web programming languages like HTML and Javascript (though they are super useful if you invest the time). You do, however, need to become familiar with very basic CSS, as it is impossible to do customization without it. CSS started as a somewhat modest project but has evolved into a powerful language. R Markdown documents support CSS, which makes them fully customizable. It will also become more important as you begin to develop dashboards or custom interactive Shiny apps, since CSS is the primary means of controlling layouts and other style elements.
These two pages on the example GitHub site have the same content, but CSS elements are used to change the page layout and style on the second. Click on the “see page layout” button to see the CSS elements.
Skim the following chapters, reading to get a general sense of concepts and the basics of how each might function. You can skip sections that explain the code in detail. I am more concerned that you understand how these basic pieces fit together, and when you hear terms like “responsive” you conceptually know what people are talking about.
Introduction to Web Programming
*** { @unit = “FRI Jan 31st”, @title = “YellowDig Practice Problems”, @assignment, @foldout }
*** { @unit = “TUE Feb 11th”, @title = “Lab 03”, @assignment, @foldout }
The animation in the Unit Overview above shows how simple it is to activate GitHub pages for any project repository so that you can turn markdown files into web-hosted HTML files and share tutorials or reports created from RMD files.
If we want a website with a bit more functionality, however, we will need to start from an existing template and adapt it.
For this lab you will be asked to fork the beautiful-jekyll website template:
Beautiful Jekyll Website Template on GitHub
Follow the instructions in the README file to begin customizing your page.
In the _config.yml
file in the default directory do the following:
# Name of website
title: My website
# Short description of your site
description: A virtual proof that I'm awesome
You can update social network IDs if you like, or replace Dean’s info with empty quotes ""
so the social media icons are present but not active.
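For example, the social IDs live in a social-network-links block in _config.yml (the exact key names may vary slightly across versions of the template; the values below are placeholders):

```
# --- Social media links --- #
social-network-links:
  email: ""
  github: "your-github-username"
  twitter: ""
  linkedin: ""
```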
# Personalize the colors in your website. Colour values can be any valid CSS colour
navbar-col: "#F5F5F5"
navbar-text-col: "#404040"
navbar-children-col: "#F5F5F5"
page-col: "#FFFFFF"
link-col: "#008AFF"
hover-col: "#0085A1"
footer-col: "#F5F5F5"
footer-text-col: "#777777"
footer-link-col: "#404040"
You have forked the master branch of the website, which does not include the “getting started” page on the live site menu:
Navigate to the getting started page located on another branch, and copy this file to the main folder on your site. I would copy the text from the raw view of the page and just create a new file called getstarted.md on your site.
Now update the navigation bar and add another option called “Getting Started” under “Resources”. You will use the text “getstarted” for the URL, excluding the .md markdown extension. GitHub pages converts all markdown files to HTML files in the background, so you want to direct the user to the HTML version, which does not require an explicit extension to work in browsers:
# List of links in the navigation bar
navbar-links:
About Me: "aboutme"
Resources:
- Beautiful Jekyll: "http://deanattali.com/beautiful-jekyll/"
- Learn markdown: "http://www.markdowntutorial.com/"
- Getting Started: "getstarted" # ADD THIS LINK
Author's home: "http://deanattali.com"
The index.html file contains text from the landing page of the website. You will find some page title and descriptions in the YAML header of this file:
---
layout: page
title: My website
subtitle: This is where I will tell my friends way too much about me
use-site-title: true
---
Demonstrate that you are able to apply CSS styles to specific elements of a page.
Create a new div section around Step 1 on the Getting Started page.
## Overview of steps required
There are only three simple steps, ....
Here is a 40-second video ....
<img src="../img/install-steps.gif" style="width:100%;" alt="Installation steps" />
<div class="gs-section-01">
### 1. Fork the Beautiful Jekyll repository
Fork the [repository](https://github.com/daattali/beautiful-jekyll)
by clicking the Fork button on the top right corner in GitHub.
</div>
Follow the Barebones Jekyll example for customizing a page style by adding a CSS style sheet at the bottom of the Getting Started page:
<style>
.gs-section-01 h3 {
  color: red;
}
.gs-section-01 p {
  font-size: 30px;
}
</style>
Similarly, add new div sections around Step 02 and Step 03 on the page so that each step has different header styles and text. It doesn’t have to look nice - just show you are able to selectively change the style on a page.
Using the Barebones Jekyll Custom Table example add a page with a custom table.
Copy the liquid-table.html template and add it as a new layout in your site’s layout folder. You will need to change the parent page template on the liquid-table.html page to “default” or “page” in your new site (you don’t have a nice-text layout that you can use as the parent page layout).
---
layout: default
---
Create a new page in your main website folder called table-demo.md and copy the page content from the Barebones Jekyll example.
You will need to add the ryan-v-ryan.jpg image to your img folder for it to be accessible on your new site (you can right-click and save it, then drag it into the image folder on your GitHub site).
You do not need to include the “See Page Layout” button.
---
layout: liquid-table
title: 'amiright?'
reynolds:
strengths:
- good father
- funny
- dated alanis morissette
weaknesses:
- singing
- green lantern movie
- tennis backhand
gosling:
strengths:
- builds houses
- is a real boy
- never dated alanis morissette
weaknesses:
- micky mouse club
- cries a lot
- not ryan reynolds
---
![](img/ryan-v-ryan.jpg)
## Lorem Ipsum
Lorem ipsum dolor sit amet....
Add the page to your navigation bar:
# List of links in the navigation bar
navbar-links:
About Me: "aboutme"
Resources:
- Beautiful Jekyll: "http://deanattali.com/beautiful-jekyll/"
- Learn markdown: "http://www.markdowntutorial.com/"
- Getting Started: "getstarted" # ADD THIS LINK
Table Demo: "table-demo"
When these steps are done, submit a link to (1) your live site and (2) your GitHub repo where the website lives.
And share your page link on YellowDig:
** Week 4 - Regular Expressions
*** { @unit = “”, @title = “Unit Overview”, @reading, @foldout }
So this week comes with an up-front warning: you can get a PhD in Natural Language Processing, an entire field devoted to computational tools used to process and analyze text as data. We have one week to cover the topic!
We obviously cannot go too deep into this interesting field, but let’s at least motivate some of the R functionality with a couple of cool examples.
Which Hip-Hop Artist Has the Largest Vocabulary?
Who is the Anonymous Op-Ed Writer inside the Trump Administration?
These examples all demonstrate interesting uses of text as data. They are also examples of the types of insight that can come from analysis with big data - the patterns are hiding in plain sight, but our brains cannot hold enough information at one time to see them. Once we have a system to extract hidden patterns from language, we can go beyond large public databases to generate insights, and can start using all of Twitter, all published news stories, or all of the internet to identify trends and detect outliers.
The core of all text analysis requires two sets of skills. Text in computer science is referred to as “strings”, a reference to the fact that spoken languages mean nothing to computers, so they just treat them as strings of letters (words) or strings of words (sentences). String processing refers to a set of functions and conventions that are used to manipulate text as data. If you think about the data steps for regular data, we clean, combine, transform, and merge data inside of data frames. Similarly, there are operations for importing text datasets (often lots of documents full of words), cleaning them (removing words, fixing spelling errors), merging documents, etc. Core R contains many string processing functions, and there are lots of great packages.
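For instance, a few of the base R string functions applied to a small character vector:

```
x <- c( "Monty Hall", "Let's Make a Deal" )

nchar( x )                       # number of characters in each string
toupper( x )                     # convert to uppercase
substr( x[1], start=1, stop=5 )  # extract characters 1-5: "Monty"
paste( x[1], x[2], sep=" - " )   # combine strings
gsub( " ", "_", x[1] )           # replace spaces: "Monty_Hall"
```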
Regular expressions are a set of functions used to aid in processing text by defining very precise ways to query a text database - looking for specific strings, or more often strings that match some specific pattern that has meaning. For example, if I gave you the following text with everything but punctuation replaced by X, you could still tell me what the words are for:
So regular expressions can be very useful for searching large databases for general classes of text, or alternatively for searching for generic text that occurs only in a very specific context (at the beginning or end of a word, in the middle of a phrase, etc.).
The function grep( pattern, string ) searches for the pattern in each of the strings in the character vector, strings, defined below. The search pattern in each case below represents a regular expression.
What will the following cases return?
strings <- c("through","rough","thorough","throw","though","true","threw","thought","thru","trough")
# what will the following return?
grep( pattern="th?rough", strings, value = TRUE)
grep( pattern=".ough", strings, value = TRUE)
grep( pattern="^.ough", strings, value = TRUE)
grep( pattern="ough.", strings, value = TRUE)
grep( pattern="[^r]ough", strings, value = TRUE)
# these are not as useful
grep( pattern="tr*", strings, value = TRUE)
grep( pattern="t*o", strings, value = TRUE)
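Why are the last two not as useful? In a regular expression the asterisk means "zero or more of the preceding character" rather than a wildcard, so the pattern "tr*" matches any string that contains a t. Requiring at least one r with the + quantifier behaves more like you might expect:

```
strings <- c("through","rough","thorough","throw","though","true","threw","thought","thru","trough")

# "tr*" = t followed by ZERO or more r's: matches every string containing a t
grep( pattern="tr*", strings, value=TRUE )

# "tr+" = t followed by at least one r
grep( pattern="tr+", strings, value=TRUE )  # "true" "trough"
```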
*** { @unit = “FRI Feb 7th”, @title = “YellowDig Practice Problems”, @assignment, @foldout }
Explain the following unexpected behaviors:
> (1:10) > 5
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
> (1:10) > "5"
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
x is a factor cataloging animals in a shelter, recording the type of animal.
Why can’t I count dogs?
> x # TYPE OF ANIMAL (FACTOR)
[1] cat dog mouse
Levels: cat dog mouse
> x == "cat"
[1] TRUE FALSE FALSE
> x == "dog"
[1] FALSE FALSE FALSE
> x == "mouse"
[1] FALSE FALSE TRUE
I have a sample of 10 people and am trying to determine their average level of education: 12 = high school degree, 16 = four-year college degree, etc.
My data is stored as a factor (which it should be, since it is a categorical variable). But that makes it hard to calculate averages.
What is going wrong here?
grade.levels <- factor( c(12, 16, 12, 7, 7, 5, 6, 5, 9, 10) )
> # want to know average level of
> # schooling for sample:
> mean( grade.levels )
[1] NA
Warning message:
In mean.default(grade.levels) :
argument is not numeric or logical: returning NA
>
> # mean requires a numeric variable
> mean( as.numeric( grade.levels ) )
[1] 3.8
We have a large database where all of the addresses and geographic coordinates are stored as follows:
x <- c("100 LANGDON ST
MADISON, WI
(43.07794869500003, -89.39083342499998)", "00 N STOUGHTON RD E WASHINGTON AVE
MADISON, WI
(43.072951239000076, -89.38668964199996)")
Write a function that accepts the address vector x as the input, and returns a vector of numeric coordinates.
Note that the length of addresses can change, so you will need to use regular expressions (instead of just a substr() function) to generate proper results.
** Week 5 - Text Analysis
*** { @unit = “MON Feb 24th”, @title = “Lab 04”, @assignment, @foldout }
*** { @unit = “FRI Feb 21st”, @title = “YellowDig Discussion”, @assignment, @foldout }
The topic this week is an introduction to the hugely important topic of reproducibility in science - the ability to reproduce results from ground-breaking scientific work that was published in top journals. For a long time there was an assumption that peer review meant that each scientist subjected their work to fellow scientists who were experts on the topic, and thus that it provided a solid barrier that kept error-prone work from being published.
The notion of reproducibility, however, grew from fields like physics and chemistry where early lab experiments could be described with enough precision for another scientist to mix the same chemicals, or recreate the same conditions for a gravity experiment, and easily verify whether the claims in the paper were defensible.
Things get a lot more complicated now that (1) the data requirements necessary to publish in top journals have expanded, (2) methods have become much more complicated, and (3) science is very competitive, with high-stakes rewards for winning grants or coveted endowed professorships, resulting in proprietary data, data collection techniques, or lab conditions like stem cell lines or bacteria strains that cannot be easily replicated. As a result, peer reviewers have to take a lot of what authors say at face value, without having enough information to challenge certain assertions, or without having access to the raw data and thousands of lines of code that produce the results being defended in the paper. Furthermore, weaknesses in how statistical methods are reported have introduced systematic bias into the types of research that get published in top journals (it has to be splashy, and thus is more likely to consist of anomaly studies than reproducible work).
If you are not going into academics, should you care? Yes, because the problems with reproducibility in science are just a proxy for problems with data analysis that will arise in any organization outside of academics as well. Scientists experience pressure to publish. Consultants also experience pressure to do work fast, and to identify patterns that will impress clients. These sorts of issues will arise in any environment where data brings value. In science we care about making the process transparent so that others can replicate work. If you are a manager or project lead for a team of analysts, you should care about transparency because it allows you to ensure your team is doing the job correctly, especially if your name is going on the report!
This week’s topic introduces you to the fascinating replication crisis in science. Your task will be to read two articles on reproducibility in science:
When the Revolution Came for Amy Cuddy, New York Times Magazine, 2017
How Quality Control Could Save Science
You are welcome to skim additional articles on the topic conveniently catalogued by Nature Magazine:
Challenges in irreproducible research
For the discussion topic this week, I would like you to argue either:
(1) that the reproducibility crisis can be effectively ended if science adopts new technologies and better practices, or
(2) that the problems with reproducibility are so engrained in the limits of science and in the DNA of academic institutions that there will always be problems with reproducibility, and attempts to address it are either insufficient in their ability to get to the root of the problem or naive about human nature.
Pick a side and make your case!
** Week 6 - Data APIs & Tidy Data
*** { @unit = “”, @title = “Unit Overview”, @foldout }
Data journalists describe the value of APIs.
Members of the MIT Media Lab spun out a company called Datawheel. Their goal is to make public data more accessible as well as useful. Their team boasts a number of graphic design experts and data visualization geniuses. They have found ways to take large and confusing government datasets, and make them interesting and accessible.
One cool aspect of their DataUSA project is that to make it work they ended up downloading a bunch of large and clumsy US government open datasets, cleaning up their formats, and hosting new copies on their own servers. In order for their website to function, each city view must be able to pull data from over a dozen databases quickly, so they have architected the new data structures so that users can query data at different levels like city, county, or state, and the data is aggregated accordingly.
The really great part, though, is that they have made their API endpoints freely available to the public. Since they have cleaned up over a dozen government datasets and now host them on their fast servers, they have made it easy to access a lot of statistics quickly.
The lab this week will draw from a Gist that explains the structure of the DataUSA API and teaches the basics of how data APIs function.
We will also look at some different types of APIs, such as the Census Bureau’s automatic geocoding tool, which allows you to upload a spreadsheet with 10,000 addresses and returns a new spreadsheet with all addresses geocoded and matched to Census tracts.
This tool is not technically an API, but you can still automate calls by writing an R script that will upload files to the site for you, so that you can batch process more than 10,000 addresses at a time.
*** { @unit = “FRI Feb 28th”, @title = “YellowDig Practice Problems”, @assignment, @foldout }
If you recall the rules of implicit casting, R tries to select the data type that preserves the most information.
> x <- 1:3
> y <- c("a","b","c")
> c( x, y )
[1] "1" "2" "3" "a" "b" "c"
> x.as.string <- as.character( x )
> x.as.string
[1] "1" "2" "3"
> as.numeric( x.as.string )
[1] 1 2 3
> as.numeric( y )
[1] NA NA NA
Warning message:
NAs introduced by coercion
Note the rules for conversion when you combine numeric and logical vectors:
> x <- c(TRUE,FALSE,TRUE)
> c( x, FALSE )
[1] TRUE FALSE TRUE FALSE
> c( x, 1 )
[1] 1 0 1 1
> c( x, "ONE" )
[1] "TRUE" "FALSE" "TRUE" "ONE"
> x2 <- c( x, 1 )
> as.logical( x2 )
[1] TRUE FALSE TRUE TRUE
> x3 <- c( x, 2 )
> as.logical( x3 )
[1] TRUE FALSE TRUE TRUE
> x4 <- c( x, 0 )
> as.logical( x4 )
[1] TRUE FALSE TRUE FALSE
Try to guess how it treats the following cases before you run the code:
x <- c(TRUE,FALSE,TRUE)
c( x, 2 )
x2 <- c( x, 1.1 )
as.logical( x2 )
x3 <- c( x, 0.9 )
as.logical( x3 )
x4 <- c( x, log(1) )
as.logical( x4 )
x5 <- c( x, -1 )
as.logical( x5 )
# How many of these state names contain a W?
> states <- c("New Mexico","New York","Washington","West Virginia")
> grep( pattern = "w", x = states, value = TRUE )
[1] "New Mexico" "New York"
> grep( pattern = "W", x = states, value = TRUE )
[1] "Washington" "West Virginia"
Which pattern would you use to match all state names with a W, no matter if the W is capital or lowercase? You are not allowed to use the ignore.case argument in grep().
*** { @unit = “OPTIONAL”, @title = “Lab 05”, @assignment, @foldout }
THIS LAB IS OPTIONAL
If you recall from CPP 526 we discussed the example where Ben Balter, GitHub’s official government evangelist, created a project to make Washington DC open GIS files more accessible and useful by converting them all to a format more amenable to open-source projects (geoJSON files).
Ben wrote a script that downloaded all of Washington DC’s open data files, converted them to better formats, then uploaded them to GitHub so others have access:
https://github.com/benbalter/dc-maps
The geoJSON files can also be read into R directly from GitHub, making it easy to incorporate the spatial maps and data into a wide variety of projects:
library( geojsonio )
library( sp )
github <- "https://raw.githubusercontent.com/benbalter/dc-maps/master/maps/2006-traffic-volume.geojson"
traffic <- geojson_read( x=github, method="local", what="sp" )
plot( traffic, col="steelblue" )
Recall the lab where you created a Dorling cartogram for your neighborhood clustering project:
library( geojsonio ) # read shapefiles
library( sp ) # work with shapefiles
library( sf ) # work with shapefiles - simple features format
library( tmap ) # theme maps
library( dplyr ) # data wrangling
library( pander ) # nice tables
crosswalk <- "https://raw.githubusercontent.com/DS4PS/cpp-529-master/master/data/cbsatocountycrosswalk.csv"
crosswalk <- read.csv( crosswalk, stringsAsFactors=F, colClasses="character" )
# search for city names by string; use the ^ anchor for "begins with"
grep( "^MIN", crosswalk$msaname, value=TRUE )
# select all FIPS for Minneapolis
these.minneapolis <- crosswalk$msaname == "MINNEAPOLIS-ST. PAUL, MN-WI"
these.fips <- crosswalk$fipscounty[ these.minneapolis ]
these.fips <- na.omit( these.fips )
state.fips <- substr( these.fips, 1, 2 )
county.fips <- substr( these.fips, 3, 5 )
dat <- data.frame( name="MINNEAPOLIS-ST. PAUL, MN-WI",
state.fips, county.fips, fips=these.fips )
dat
| name | state.fips | county.fips | fips |
|---|---|---|---|
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 003 | 27003 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 019 | 27019 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 025 | 27025 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 037 | 27037 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 053 | 27053 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 059 | 27059 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 123 | 27123 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 139 | 27139 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 141 | 27141 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 163 | 27163 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 27 | 171 | 27171 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 55 | 093 | 55093 |
| MINNEAPOLIS-ST. PAUL, MN-WI | 55 | 109 | 55109 |
Now download shapefiles with Census data:
library( tidycensus )
# census_api_key("YOUR KEY GOES HERE")
# key <- "abc123"
# census_api_key( key )
# Minneapolis metro area spans two states -
# Minnesota = 27
# Wisconsin = 55
msp.pop1 <-
get_acs( geography = "tract", variables = "B01003_001",
state = "27", county = county.fips[state.fips=="27"], geometry = TRUE ) %>%
select( GEOID, estimate ) %>%
rename( POP=estimate )
msp.pop2 <-
get_acs( geography = "tract", variables = "B01003_001",
state = "55", county = county.fips[state.fips=="55"], geometry = TRUE ) %>%
select( GEOID, estimate ) %>%
rename( POP=estimate )
msp.pop <- rbind( msp.pop1, msp.pop2 )
plot( msp.pop )
Convert to a Dorling cartogram:
# convert sf map object to an sp version
msp.sp <- as_Spatial( msp.pop )
class( msp.sp )
# project map and remove empty tracts
msp.sp <- spTransform( msp.sp, CRS("+init=epsg:3395") )
msp.sp <- msp.sp[ msp.sp$POP != 0 & (! is.na( msp.sp$POP )) , ]
# convert census tract polygons to a dorling cartogram
# note: cartogram_dorling() comes from the cartogram package
library( cartogram )
# standardize the population weights before passing them to the
# algorithm; the k argument scales the circle sizes (default is k=5,
# and this value was found by trial and error)
msp.sp$pop.w <- msp.sp$POP / 9000
msp_dorling <- cartogram_dorling( x=msp.sp, weight="pop.w", k=0.05 )
plot( msp_dorling )
For example, once you have finished, it will be possible to do the following:
# dorling cartogram of Phoenix Census Tracts
github.url <- "https://raw.githubusercontent.com/DS4PS/cpp-529-master/master/data/phx_dorling.geojson"
phx <- geojson_read( x=github.url, what="sp" )
plot( phx )
Start with pseudo-code and write down the steps. I would recommend writing a couple of functions:
Test your code with a single city until it is functional:
these.minneapolis <- crosswalk$msaname == "MINNEAPOLIS-ST. PAUL, MN-WI"
At that point you can scale your steps by generalizing the city name.
city.names <- unique( crosswalk$msaname )
for( i in city.names )
{
# your code here
}
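The skeleton above can be filled in with a helper function that wraps the Minneapolis steps you already tested. Note that `build_city_dorling()` is a hypothetical name for the function you will write, not something provided by the lab files; this is only a sketch of one possible structure.

```r
# a minimal sketch; build_city_dorling() is a hypothetical name
# for a function that wraps the Minneapolis steps above
build_city_dorling <- function( city.name, crosswalk )
{
  these <- crosswalk$msaname == city.name
  fips  <- na.omit( crosswalk$fipscounty[ these ] )
  # ... download tracts with tidycensus, build the cartogram,
  # ... and save the result as a geojson file
}

for( i in city.names )
{
  # try() keeps the loop running if a single city fails
  result <- try( build_city_dorling( i, crosswalk ) )
  if( "try-error" %in% class( result ) ){ next }
}
```

Wrapping each iteration in `try()` is useful here because a bad Census API call or an empty city should not halt a loop over hundreds of cities.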
** Week 7 - Customized Reporting
*** { @unit = “MON Mar 2nd”, @title = “Code-Through Assignment”, @assignment, @foldout }
Since you are sharing your code-through with your classmates on Yellowdig, it will serve as your discussion topic this week.
Add your code-through files (specifically the HTML) to your new website repository on GitHub and generate an active URL for your tutorial so that you can share it with classmates. Note that you cannot host Shiny apps or other dynamic apps on GitHub Pages - they must be static HTML pages.
*** { @unit = “MON Mar 2nd”, @title = “Build an R Package”, @assignment, @foldout }
This tutorial will teach you how to build and share a package in R. At some point you might develop a tool that you want to upload to CRAN so it is widely available. More likely, if you are working with a team of analysts within an organization, you will begin building a library of functions specific to the project. At some point it will be more efficient to maintain the project code as a package, so that team members can update or enhance the functions and others can easily receive those updates by re-installing the package.
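As a rough sketch of what the workflow looks like with the devtools and usethis packages (your tutorial may use slightly different steps):

```r
# a minimal package-building sketch using devtools/usethis
library( devtools )

usethis::create_package( "montyhall" )  # scaffolds DESCRIPTION, NAMESPACE, R/
# copy your functions into montyhall/R/, add roxygen2 comments, then:
document()   # generates help files from the roxygen comments
check()      # runs R CMD check to catch common problems
install()    # installs the package locally

library( montyhall )  # your functions now load like any other package
```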
Complete the tutorial on “packaging” your R code from Labs 01 and 02 into a new montyhall package to make it easier to run simulations to evaluate game strategies.
To receive credit for the assignment, submit the URL to your package on GitHub.
*** { @unit = “MON Mar 2nd”, @title = “Report Template Assignment”, @assignment, @foldout }
This assignment teaches you to use RMD templates to simplify and automate the process of generating reports.
We will explore the process by reverse-engineering a simple example that was created to build resumes:
Begin by reading about the process:
For this assignment you will need to clone Nick Strayer’s CV project:
You can do this in the GitHub desktop application under File » Clone » URL then type in the project URL:
https://github.com/DS4PS/cv
Note, since the project is actively being developed this version on DS4PS is frozen in time for pedagogical purposes. You can follow the link to his repo to see what he has added.
A quick note on the difference between “cloning” a project and “forking” a GitHub project:
A fork is a copy of a repository that allows you to freely experiment with changes without affecting the original project, although a connection exists between your fork and the original repository. In this way, your fork acts as a bridge between the original repository and your personal copy, where you can contribute back to the original project using pull requests.
Unlike forking, when cloning you won’t be able to pull down changes from the original repository you cloned from, and if the project is owned by someone else you won’t be able to contribute back to it unless you are specifically invited as a collaborator. Cloning is ideal for instances when you need a way to quickly get your own copy of a repository where you may not be contributing to the original project.
In this instance we are not contributing back to the project to improve it. We just want our own local copy to work with, so cloning is the best option.
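If you prefer to stay in R rather than using the GitHub Desktop application, the usethis package can perform the clone as well (a sketch; either route produces the same local copy):

```r
# clone (not fork) the repository from within R using usethis;
# fork = FALSE makes this a plain clone with no link back upstream
usethis::create_from_github( "DS4PS/cv", fork = FALSE )
```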
After cloning the files, you should have local copies on your desktop. You will need to edit at least two files:
The “index.Rmd” and “resume.Rmd” files contain the pagedown code that generates the resume. You will need to adapt the code as appropriate for your purposes. Be sure to retain the helper functions, as you are required to pull position data from the CSV file instead of hard-coding it in the file. You can create your own section titles and content. List as many positions, projects, or internships as you can to reach at least 2 pages.
Second, delete Nick’s content from the “positions.csv” file and replace it with your own content for your positions.
When you are done, knit your file to generate your HTML resume.
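If you prefer knitting from the console rather than the RStudio Knit button, a one-line equivalent (assuming your working directory is the project folder) is:

```r
# render the resume from the R console; the HTML file
# is written to the same folder as the Rmd source
rmarkdown::render( "index.Rmd" )
```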
Create a new repository on your GitHub account called “CV”. Initiate with a README file. Clone the repository to your computer, and copy all of the updated files from your project. Commit these files to GitHub so they are in the new CV repo.
Go into settings and activate your GitHub page for this repository. You do not have to select a template.
You should now be able to view your HTML resume online.
For the assignment submit the following:
Consider creating a GitHub site to host a portfolio of projects you are working on. You can add the CV and your code-through assignments to the site.