Rex W. Douglass
Director Machine Learning for Social Science Lab (MSSL)
Center for Peace and Security Studies (cPASS)
Department of Political Science
University of California San Diego
rexdouglass@gmail.com
www.rexdouglass.com
@rexdouglass
Please bring a two-sided coin and some scratch paper to class for passing notes during the demonstrations.
This is a 6 hour introduction to machine learning spread across two three-hour lectures. The goal of this very short course is narrow: to give you enough of an overview, vocabulary, and intuition, so that you can identify machine learning problems in the wild and begin your own research into relevant literatures and possible approaches. The goal is not to train you to execute a particular machine learning solution. There are far too many approaches available; they may not cover whatever problem you find; and the state of the art will be different in a year or two anyway. Instead, we will learn how to think about and classify problems into broad types, how to define and measure the efficacy of different solutions to that problem, how to avoid some common and subtle mistakes, and how to think about a full machine learning pipeline from start to finish.
Math and programming are not something you learn; they're something you get used to. The readings for this course are, with a few exceptions, voluntary and intended for self-study. They are there to point you in the right direction when you realize you need to brush up on a particular set of tools to tackle a particular problem.
Do not purchase any books; each of these should be available for free online at the link given. Any one of them would provide a decent background to the field of machine learning. For this course, I've picked select chapters where I thought they did a good job reviewing a specific subtopic.
Related Classes
* COMS W4721 Machine Learning for Data Science
There are a number of places online for constant updates on machine learning:
* Reddit Machine Learning Subreddit
* arxiv
* Arxiv Sanity Preserver
* My twitter feed
* Political Analysis
* openreview
* Distill
Conferences
* Top Conferences for Machine Learning & Artificial Intelligence
* Neural Information Processing Systems (NIPS)
* International Conference on Machine Learning
Students are not expected to know any particular language or set of software. We will be demonstrating best practices as used in the Machine Learning for Social Science Lab at the Center for Peace and Security Studies, UCSD. In that lab, our software stack consists of Python and R for data preparation and analysis, Spark for database management, Keras/Tensorflow for deep learning, Github for revision control, and Ubuntu for our operating system and command-line tools.
This guide is written as an R notebook using R-Studio. It renders output as static HTML that you should be able to view on a regular web browser.
#install.packages("pacman")
library(pacman)
p_load(infotheo)
p_load(tidyverse)
p_load(ggplot2)
p_load(cowplot)
p_load(mlbench)
p_load(Metrics)
set.seed(123)
Drost (2018), “Philentropy: Information Theory and Distance Quantification with R”, Journal of Open Source Software, 3(26), 765
Zero-bit information source, N-bit message: no variation, measured an arbitrary number of times with an arbitrary number of states
“Design, Inference, and the Strategic Logic of Suicide Terrorism”, Scott Ashworth, Joshua D. Clinton, Adam Meirowitz, and Kristopher W. Ramsay, 2008, APSR
One-Bit information sources are variables that can take on two different states, e.g. a coin flip. Call \(Y\) the true state at the source, and \(\hat{Y}\) the mental model of the state at the destination.
(IntroMachineLearningWithR) “Chapter 3 Example datasets” * Binary_classification
library(infotheo)
N <- 699 #Flip a coin N times (Matched to Breast Cancer Dataset Below)
sample_space <- c(1,0) #Heads and Tails
A fair coin has equal likelihood of both heads and tails. Estimated entropy is close to the true value of 1 bit.
p <- 0.5 #Fair Coin
Y_coin_fair <- sample(sample_space, size = N, replace = TRUE, prob = c(p, 1 - p))
print(table(Y_coin_fair))
Y_coin_fair
0 1
358 341
print(natstobits(entropy(Y_coin_fair, method="emp")))
[1] 0.9995733
An unfair coin is weighted to be more likely to land on one side than the other. The estimated entropy is less than a full bit: there is less surprise than from a fair coin flip.
p <- 0.8 #Unfair Coin
Y_coin_unfair <- sample(sample_space, size = N, replace = TRUE, prob = c(p, 1 - p))
print(table(Y_coin_unfair))
Y_coin_unfair
0 1
126 573
print(natstobits(entropy(Y_coin_unfair, method="emp")))
[1] 0.68064
A two-headed coin will only ever land one way, so it carries no surprise at all; the estimated entropy is zero (up to floating-point error).
p <- 1 #Two headed coin
Y_coin_twoheaded <- sample(sample_space, size = N, replace = TRUE, prob = c(p, 1 - p))
print(table(Y_coin_twoheaded))
Y_coin_twoheaded
1
699
print(natstobits(entropy(Y_coin_twoheaded, method="emp")))
[1] 1.281371e-15
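The three empirical estimates above can be checked against the analytic Shannon entropy of a Bernoulli(\(p\)) source, \(H(p) = -p \log_2 p - (1-p)\log_2(1-p)\); a minimal sketch:

```r
# Analytic entropy (in bits) of a coin that lands heads with probability p
binary_entropy <- function(p) {
  if (p == 0 || p == 1) return(0) # a certain outcome carries no surprise
  -p * log2(p) - (1 - p) * log2(1 - p)
}
binary_entropy(0.5) # fair coin: exactly 1 bit
binary_entropy(0.8) # unfair coin: about 0.72 bits
binary_entropy(1.0) # two-headed coin: 0 bits
```

The empirical estimate for the unfair coin (about 0.68 bits) sits slightly below the analytic 0.72 bits because it is computed from the realized sample proportions rather than the true \(p\).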
“Breast Cancer”, Raul Eulogio, January 26, 2018
data(BreastCancer)
glimpse(BreastCancer)
Observations: 699
Variables: 11
$ Id <chr> "1000025", "1002945", "1015425", "1016277", "1017023", "1017122", "101...
$ Cl.thickness <ord> 5, 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, 5, 1, 8, 7, 4, 4, 10, 6, 7, 10, 3,...
$ Cell.size <ord> 1, 4, 1, 8, 1, 10, 1, 1, 1, 2, 1, 1, 3, 1, 7, 4, 1, 1, 7, 1, 3, 5, 1, ...
$ Cell.shape <ord> 1, 4, 1, 8, 1, 10, 1, 2, 1, 1, 1, 1, 3, 1, 5, 6, 1, 1, 7, 1, 2, 5, 1, ...
$ Marg.adhesion <ord> 1, 5, 1, 1, 3, 8, 1, 1, 1, 1, 1, 1, 3, 1, 10, 4, 1, 1, 6, 1, 10, 3, 1,...
$ Epith.c.size <ord> 2, 7, 2, 3, 2, 7, 2, 2, 2, 2, 1, 2, 2, 2, 7, 6, 2, 2, 4, 2, 5, 6, 2, 2...
$ Bare.nuclei <fct> 1, 10, 2, 4, 1, 10, 10, 1, 1, 1, 1, 1, 3, 3, 9, 1, 1, 1, 10, 1, 10, 7,...
$ Bl.cromatin <fct> 3, 3, 3, 3, 3, 9, 3, 3, 1, 2, 3, 2, 4, 3, 5, 4, 2, 3, 4, 3, 5, 7, 2, 7...
$ Normal.nucleoli <fct> 1, 2, 1, 7, 1, 7, 1, 1, 1, 1, 1, 1, 4, 1, 5, 3, 1, 1, 1, 1, 4, 10, 1, ...
$ Mitoses <fct> 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 4, 1, 1, 1, 2, 1, 4, 1, 1, 1...
$ Class <fct> benign, benign, benign, benign, benign, malignant, benign, benign, ben...
summary(BreastCancer$Class)
benign malignant
458 241
print(natstobits(entropy(BreastCancer$Class, method="emp")))
[1] 0.9293179
data(iris)
glimpse(iris)
Observations: 150
Variables: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8...
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0...
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2...
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2...
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, s...
summary(iris$Species)
setosa versicolor virginica
50 50 50
print(natstobits(entropy(iris$Species, method="emp")))
[1] 1.584962
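For \(K\) equally likely classes, the entropy is simply \(\log_2 K\) bits, which is exactly what the three perfectly balanced iris species give:

```r
# Maximum entropy for K equally likely classes is log2(K) bits
log2(3) # three balanced species: about 1.585 bits, matching the estimate above
```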
data(BostonHousing)
glimpse(BostonHousing)
Observations: 506
Variables: 14
$ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829, 0.14455, 0.2112...
$ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 0.0, ...
$ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.87, 7.87, 7.87, ...
$ chas <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524, 0.524, 0.524, 0...
$ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631, 6.004, 6.377, 6...
$ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 94.3, 82.9, 39.0,...
$ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505, 6.0821, 6.5921...
$ rad <dbl> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ...
$ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 311, 307, 307, 307...
$ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15.2, 15.2, 15.2, ...
$ b <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90, 386.63, 386.71...
$ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10, 20.45, 13.27, ...
$ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15.0, 18.9, 21.7, ...
summary(BostonHousing$medv)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.00 17.02 21.20 22.53 25.00 50.00
print(natstobits(entropy(discretize(BostonHousing$medv), method="emp")))
[1] 2.80701
(IntroMachineLearningWithR) 5.4 Classification performance
How should we compare the true reality \(Y\) to our mental model of it \(\hat{Y}\)?
table(BreastCancer$Class, BreastCancer$Class)
benign malignant
benign 458 0
malignant 0 241
table(BreastCancer$Class, Y_coin_fair)
Y_coin_fair
0 1
benign 239 219
malignant 119 122
table(BreastCancer$Class, Y_coin_unfair)
Y_coin_unfair
0 1
benign 84 374
malignant 42 199
table(BreastCancer$Class, Y_coin_twoheaded)
Y_coin_twoheaded
1
benign 458
malignant 241
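Every classification metric that follows is just arithmetic on the cells of one of these tables. Taking the fair-coin table above, with malignant as the positive class and heads (1) as a positive prediction:

```r
# Metrics by hand from the fair-coin confusion table above
tp <- 122 # malignant, predicted 1
fn <- 119 # malignant, predicted 0
fp <- 219 # benign, predicted 1
tn <- 239 # benign, predicted 0
(tp + tn) / (tp + tn + fp + fn) # accuracy: 0.5164521
tp / (tp + fp)                  # precision: 0.3577713
tp / (tp + fn)                  # recall: 0.5062241
```

These reproduce the package outputs for the fair coin further below.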
p_load(Metrics)
BreastCancer$Class_binary <- as.numeric(BreastCancer$Class=="malignant")
accuracy(BreastCancer$Class_binary, BreastCancer$Class_binary)
[1] 1
accuracy(BreastCancer$Class_binary, Y_coin_fair)
[1] 0.5164521
accuracy(BreastCancer$Class_binary, Y_coin_unfair)
[1] 0.4048641
accuracy(BreastCancer$Class_binary, Y_coin_twoheaded)
[1] 0.3447783
p_load(Metrics)
Metrics::precision(BreastCancer$Class_binary,
BreastCancer$Class_binary)
[1] 1
Metrics::precision(BreastCancer$Class_binary, Y_coin_fair)
[1] 0.3577713
Metrics::precision(BreastCancer$Class_binary, Y_coin_unfair)
[1] 0.3472949
Metrics::precision(BreastCancer$Class_binary, Y_coin_twoheaded)
[1] 0.3447783
p_load(Metrics)
Metrics::recall(BreastCancer$Class_binary,
BreastCancer$Class_binary)
[1] 1
Metrics::recall(BreastCancer$Class_binary, Y_coin_fair)
[1] 0.5062241
Metrics::recall(BreastCancer$Class_binary, Y_coin_unfair)
[1] 0.8257261
Metrics::recall(BreastCancer$Class_binary, Y_coin_twoheaded)
[1] 1
#Note Metrics::f1 doesn't give the correct values
p_load(MLmetrics)
F1_Score(BreastCancer$Class_binary,
BreastCancer$Class_binary)
[1] 1
F1_Score(BreastCancer$Class_binary, Y_coin_fair)
[1] 0.5857843
F1_Score(BreastCancer$Class_binary, Y_coin_unfair)
[1] 0.2876712
#F1_Score(BreastCancer$Class_binary, Y_coin_twoheaded)
#Not happy about all 1 prediction
print( (0.3447783*1)/(0.3447783+1)*2 ) #Calculate by hand
[1] 0.512766
MLmetrics::LogLoss(BreastCancer$Class_binary,
BreastCancer$Class_binary)
[1] 9.992007e-16
MLmetrics::LogLoss(BreastCancer$Class_binary, Y_coin_fair)
[1] 16.70129
MLmetrics::LogLoss(BreastCancer$Class_binary, Y_coin_unfair)
[1] 20.55531
MLmetrics::LogLoss(BreastCancer$Class_binary, Y_coin_twoheaded)
[1] 22.63056
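Log loss is binary cross-entropy averaged over observations; hard 0/1 "probabilities" must be clipped away from exactly 0 and 1 to keep the logarithm finite, which is why even the always-wrong predictions above yield large but finite values. A minimal by-hand sketch (the clipping constant eps is an illustrative assumption, not necessarily what MLmetrics uses internally):

```r
# Binary cross-entropy (log loss) by hand, with clipping
log_loss <- function(y_true, y_prob, eps = 1e-15) {
  y_prob <- pmin(pmax(y_prob, eps), 1 - eps) # keep log() finite
  -mean(y_true * log(y_prob) + (1 - y_true) * log(1 - y_prob))
}
```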
Measuring classifier performance: a coherent alternative to the area under the ROC curve, David J. Hand, Mach Learn (2009) 77: 103–123
Generate ROC Curve Charts for Print and Interactive Use, Michael C Sachs, 2018-06-23
Illustrated Guide to ROC and AUC, Raffael Vogler, June 23, 2015
#Simulate a probabilistic prediction by adding noise to a hard 0/1 label
noised_prediction <- function(prediction){
  noised <- runif(N, 0, 0.5)
  noised[prediction == 1] <- noised[prediction == 1] + 0.5
  return(noised)
}
AUC(noised_prediction(BreastCancer$Class_binary), BreastCancer$Class_binary)
[1] 1
AUC(noised_prediction(Y_coin_fair), BreastCancer$Class_binary)
[1] 0.5230118
AUC(noised_prediction(Y_coin_unfair), BreastCancer$Class_binary)
[1] 0.5179384
AUC(noised_prediction(Y_coin_twoheaded), BreastCancer$Class_binary)
[1] 0.4781388
p_load(plotROC)
set.seed(2529)
D.ex <- rbinom(200, size = 1, prob = .5)
M1 <- rnorm(200, mean = D.ex, sd = .65)
M2 <- rnorm(200, mean = D.ex, sd = 1.5)
test <- data.frame(D = D.ex, D.str = c("Healthy", "Ill")[D.ex + 1],
M1 = M1, M2 = M2, stringsAsFactors = FALSE)
basicplot <- ggplot(test, aes(d = D, m = M1)) + geom_roc(labels = FALSE)
y_hat <- mean(BostonHousing$medv) #Predict the mean for every observation
MAE(BostonHousing$medv, y_hat)
MSE(BostonHousing$medv, y_hat)
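Both regression metrics are one-line averages; a minimal sketch by hand, using the mean of medv as a constant prediction:

```r
# MAE and MSE by hand for a constant (mean) prediction
y     <- BostonHousing$medv
y_hat <- mean(y)
mean(abs(y - y_hat)) # mean absolute error
mean((y - y_hat)^2)  # mean squared error (here, the variance with divisor n)
```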
* Function_(mathematics)
* Inverse_function
“Achieving Statistical Significance with Covariates and without Transparency”, Gabriel Lenz, Alexander Sahn, November 27, 2017
“Do We Really Know the WTO Cures Cancer?” Stephen Chaudoin, Jude Hays and Raymond Hicks, British Journal of Political Science.