This webpage provides the materials for a presentation given during a campus visit to Hamilton College on May 5, 2023.
You can download the slides used in the presentation here:
Download Data_Science_for_Hamiltonians.pdf
Please feel free to contact me at thlim@arizona.edu if you have any questions.
Remember
You can download the do-file and the data set used on this website here:
Download data_science_for_hamiltonians.zip
// Load the data set with the second row as the variable names
import delimited "NY-snowfall-202301.csv", varnames(2) clear
(38 vars, 469 obs)
describe
Contains data
obs: 469
vars: 38
size: 78,792
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
ghcnid str11 %11s GHCN ID
stationname str30 %30s Station Name
county str12 %12s County
state str8 %9s State
elevation int %8.0g Elevation
latitude float %9.0g Latitude
longitude float %9.0g Longitude
jan1 str3 %9s Jan 1
jan2 str3 %9s Jan 2
jan3 str3 %9s Jan 3
jan4 str3 %9s Jan 4
jan5 str3 %9s Jan 5
jan6 str3 %9s Jan 6
jan7 str3 %9s Jan 7
jan8 str3 %9s Jan 8
jan9 str3 %9s Jan 9
jan10 str3 %9s Jan 10
jan11 str4 %9s Jan 11
jan12 str3 %9s Jan 12
jan13 str3 %9s Jan 13
jan14 str3 %9s Jan 14
jan15 str3 %9s Jan 15
jan16 str3 %9s Jan 16
jan17 str3 %9s Jan 17
jan18 str3 %9s Jan 18
jan19 str4 %9s Jan 19
jan20 str3 %9s Jan 20
jan21 str3 %9s Jan 21
jan22 str3 %9s Jan 22
jan23 str4 %9s Jan 23
jan24 str3 %9s Jan 24
jan25 str3 %9s Jan 25
jan26 str3 %9s Jan 26
jan27 str3 %9s Jan 27
jan28 str3 %9s Jan 28
jan29 str3 %9s Jan 29
jan30 str4 %9s Jan 30
jan31 str3 %9s Jan 31
-------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
* Check variable and data type
tab jan1
Jan 1 | Freq. Percent Cum.
------------+-----------------------------------
0.0 | 254 54.16 54.16
M | 214 45.63 99.79
T | 1 0.21 100.00
------------+-----------------------------------
Total | 469 100.00
// Replace values one variable at a time: "M" marks a missing report and "T" a
// trace amount, so recoding both to 0 treats missing days as zero snowfall
replace jan1 = subinstr(jan1, "M", "0", .)
(214 real changes made)
replace jan1 = subinstr(jan1, "T", "0", .)
(1 real change made)
// Check variable and data type
describe jan1
tab jan1
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
jan1 str3 %9s Jan 1
Jan 1 | Freq. Percent Cum.
------------+-----------------------------------
0 | 215 45.84 45.84
0.0 | 254 54.16 100.00
------------+-----------------------------------
Total | 469 100.00
// Convert string to numeric
destring jan1, gen(test)
describe test
jan1 has all characters numeric; test generated as byte
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
test byte %10.0g Jan 1
// Replace "M" (missing) and "T" (trace) with 0 in all columns starting with 'jan'
foreach var of varlist jan* {
replace `var' = subinstr(`var', "M", "0", .)
replace `var' = subinstr(`var', "T", "0", .)
destring `var', replace
}
// Check variable and data type
tab jan2
describe jan2
(0 real changes made)
(0 real changes made)
jan1 has all characters numeric; replaced as byte
(142 real changes made)
(12 real changes made)
jan2 has all characters numeric; replaced as double
(139 real changes made)
(2 real changes made)
jan3 has all characters numeric; replaced as byte
(205 real changes made)
(0 real changes made)
jan4 has all characters numeric; replaced as byte
(210 real changes made)
(0 real changes made)
jan5 has all characters numeric; replaced as double
(185 real changes made)
(25 real changes made)
jan6 has all characters numeric; replaced as double
(177 real changes made)
(60 real changes made)
jan7 has all characters numeric; replaced as double
(112 real changes made)
(77 real changes made)
jan8 has all characters numeric; replaced as double
(113 real changes made)
(38 real changes made)
jan9 has all characters numeric; replaced as double
(106 real changes made)
(60 real changes made)
jan10 has all characters numeric; replaced as double
(95 real changes made)
(47 real changes made)
jan11 has all characters numeric; replaced as double
(150 real changes made)
(75 real changes made)
jan12 has all characters numeric; replaced as double
(166 real changes made)
(39 real changes made)
jan13 has all characters numeric; replaced as double
(157 real changes made)
(54 real changes made)
jan14 has all characters numeric; replaced as double
(120 real changes made)
(95 real changes made)
jan15 has all characters numeric; replaced as double
(89 real changes made)
(20 real changes made)
jan16 has all characters numeric; replaced as double
(122 real changes made)
(14 real changes made)
jan17 has all characters numeric; replaced as double
(187 real changes made)
(33 real changes made)
jan18 has all characters numeric; replaced as double
(147 real changes made)
(19 real changes made)
jan19 has all characters numeric; replaced as double
(180 real changes made)
(46 real changes made)
jan20 has all characters numeric; replaced as double
(152 real changes made)
(52 real changes made)
jan21 has all characters numeric; replaced as double
(110 real changes made)
(85 real changes made)
jan22 has all characters numeric; replaced as double
(121 real changes made)
(16 real changes made)
jan23 has all characters numeric; replaced as double
(125 real changes made)
(78 real changes made)
jan24 has all characters numeric; replaced as double
(88 real changes made)
(92 real changes made)
jan25 has all characters numeric; replaced as double
(134 real changes made)
(15 real changes made)
jan26 has all characters numeric; replaced as double
(95 real changes made)
(55 real changes made)
jan27 has all characters numeric; replaced as double
(117 real changes made)
(65 real changes made)
jan28 has all characters numeric; replaced as double
(113 real changes made)
(28 real changes made)
jan29 has all characters numeric; replaced as double
(126 real changes made)
(55 real changes made)
jan30 has all characters numeric; replaced as double
(108 real changes made)
(22 real changes made)
jan31 has all characters numeric; replaced as double
Jan 2 | Freq. Percent Cum.
------------+-----------------------------------
0 | 465 99.15 99.15
.1 | 3 0.64 99.79
.2 | 1 0.21 100.00
------------+-----------------------------------
Total | 469 100.00
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
jan2 double %10.0g Jan 2
// Add a new column with the sum of each row
egen total = rowtotal(jan1-jan31)
sum total, d
total
-------------------------------------------------------------
Percentiles Smallest
1% 0 0
5% 0 0
10% 0 0 Obs 469
25% 1.1 0 Sum of Wgt. 469
50% 7.5 Mean 7.889765
Largest Std. Dev. 6.822664
75% 11.8 29.7
90% 16.4 31.4 Variance 46.54874
95% 19.4 32.9 Skewness .8291596
99% 29.5 33.8 Kurtosis 3.733055
// Check if Elevation and Latitude are numeric
describe elevation latitude
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
elevation int %8.0g Elevation
latitude float %9.0g Latitude
// Check variable
sum elevation, d
Elevation
-------------------------------------------------------------
Percentiles Smallest
1% 5 -9999
5% 30 3
10% 100 4 Obs 469
25% 380 4 Sum of Wgt. 469
50% 642 Mean 743.5416
Largest Std. Dev. 723.4853
75% 1110 2029
90% 1568 2169 Variance 523431
95% 1698 2220 Skewness -6.733887
99% 2021 2495 Kurtosis 104.8178
// Check error
tab stationname if elevation == -9999
Station Name | Freq. Percent Cum.
-------------------------------+-----------------------------------
BATAVIA 1.4 E | 1 100.00 100.00
-------------------------------+-----------------------------------
Total | 1 100.00
// Recode the -9999 sentinel to a plausible elevation for the Batavia station (about 900 ft)
recode elevation -9999=900
(elevation: 1 changes made)
corr total elevation latitude
(obs=469)
| total elevat~n latitude
-------------+---------------------------
total | 1.0000
elevation | 0.5405 1.0000
latitude | 0.5664 0.3353 1.0000
twoway scatter total elevation, ///
    xtitle("Elevation") ytitle("Total") ///
    title("Scatterplot of Total vs. Elevation")
Total ~ Elevation
twoway (scatter total latitude) (lfit total latitude), ///
    xtitle("Latitude") ytitle("Total") ///
    title("Scatterplot of Total vs. Latitude")
Total ~ Latitude
reg total elevation latitude
Source | SS df MS Number of obs = 469
-------------+---------------------------------- F(2, 466) = 197.88
Model | 10004.6003 2 5002.30013 Prob > F = 0.0000
Residual | 11780.211 466 25.2794228 R-squared = 0.4592
-------------+---------------------------------- Adj R-squared = 0.4569
Total | 21784.8113 468 46.548742 Root MSE = 5.0279
------------------------------------------------------------------------------
total | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
elevation | .0051263 .0004693 10.92 0.000 .0042041 .0060484
latitude | 3.313543 .2761122 12.00 0.000 2.770963 3.856122
_cons | -136.9432 11.62777 -11.78 0.000 -159.7926 -114.0939
------------------------------------------------------------------------------
margins, at(latitude=(40.54(1)44.84))
marginsplot
Predictive margins Number of obs = 469
Model VCE : OLS
Expression : Linear prediction, predict()
1._at : latitude = 40.54
2._at : latitude = 41.54
3._at : latitude = 42.54
4._at : latitude = 43.54
5._at : latitude = 44.54
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_at |
1 | 1.318498 .5947581 2.22 0.027 .1497579 2.487238
2 | 4.63204 .3571999 12.97 0.000 3.930118 5.333962
3 | 7.945583 .2322118 34.22 0.000 7.489271 8.401895
4 | 11.25913 .3643196 30.90 0.000 10.54321 11.97504
5 | 14.57267 .6033334 24.15 0.000 13.38708 15.75826
------------------------------------------------------------------------------
Variables that uniquely identify margins: latitude
Marginal Effect of Latitude
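Each of these margins is simply the fitted value from the regression, with latitude set at the given value and elevation held at its sample mean (about 766.8 once -9999 is recoded to 900). For example, at latitude 42.54:

predicted total = -136.9432 + 0.0051263 × 766.8 + 3.313543 × 42.54 ≈ 7.95

which matches the third margin in the table above.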
reg total c.elevation##c.elevation##c.elevation latitude
Source | SS df MS Number of obs = 469
-------------+---------------------------------- F(4, 464) = 104.84
Model | 10342.088 4 2585.522 Prob > F = 0.0000
Residual | 11442.7233 464 24.6610415 R-squared = 0.4747
-------------+---------------------------------- Adj R-squared = 0.4702
Total | 21784.8113 468 46.548742 Root MSE = 4.966
------------------------------------------------------------------------------
total | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
elevation | -.0076266 .0034867 -2.19 0.029 -.0144782 -.0007749
|
c.elevation#|
c.elevation | .0000138 3.79e-06 3.63 0.000 6.31e-06 .0000212
|
c.elevation#|
c.elevation#|
c.elevation | -4.07e-09 1.20e-09 -3.38 0.001 -6.44e-09 -1.71e-09
|
latitude | 3.954896 .3295734 12.00 0.000 3.307255 4.602537
_cons | -161.5611 13.50858 -11.96 0.000 -188.1066 -135.0155
------------------------------------------------------------------------------
margins, at(elevation=(0(100)2500))
marginsplot
Predictive margins Number of obs = 469
Model VCE : OLS
Expression : Linear prediction, predict()
1._at : elevation = 0
2._at : elevation = 100
3._at : elevation = 200
4._at : elevation = 300
5._at : elevation = 400
6._at : elevation = 500
7._at : elevation = 600
8._at : elevation = 700
9._at : elevation = 800
10._at : elevation = 900
11._at : elevation = 1000
12._at : elevation = 1100
13._at : elevation = 1200
14._at : elevation = 1300
15._at : elevation = 1400
16._at : elevation = 1500
17._at : elevation = 1600
18._at : elevation = 1700
19._at : elevation = 1800
20._at : elevation = 1900
21._at : elevation = 2000
22._at : elevation = 2100
23._at : elevation = 2200
24._at : elevation = 2300
25._at : elevation = 2400
26._at : elevation = 2500
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_at |
1 | . (not estimable)
2 | 5.984411 .5999464 9.97 0.000 4.805462 7.163359
3 | 5.60585 .4222693 13.28 0.000 4.776053 6.435648
4 | 5.453481 .3367075 16.20 0.000 4.791821 6.115142
5 | 5.502861 .3223978 17.07 0.000 4.86932 6.136402
6 | 5.729548 .3373162 16.99 0.000 5.066691 6.392405
7 | 6.1091 .3529676 17.31 0.000 5.415487 6.802713
8 | 6.617076 .3605384 18.35 0.000 5.908586 7.325567
9 | 7.229034 .3614414 20.00 0.000 6.518769 7.939299
10 | 7.920532 .3610039 21.94 0.000 7.211127 8.629937
11 | 8.667128 .3648421 23.76 0.000 7.950181 9.384076
12 | 9.444381 .3762415 25.10 0.000 8.705032 10.18373
13 | 10.22785 .3949241 25.90 0.000 9.451787 11.00391
14 | 10.99309 .417921 26.30 0.000 10.17184 11.81434
15 | 11.71566 .4419586 26.51 0.000 10.84717 12.58415
16 | 12.37112 .4662718 26.53 0.000 11.45486 13.28739
17 | 12.93503 .49549 26.11 0.000 11.96135 13.90871
18 | 13.38294 .5421034 24.69 0.000 12.31766 14.44823
19 | 13.69042 .6260242 21.87 0.000 12.46023 14.92062
20 | 13.83302 .7685783 18.00 0.000 12.3227 15.34335
21 | 13.7863 .9852521 13.99 0.000 11.8502 15.72241
22 | 13.52582 1.284649 10.53 0.000 11.00137 16.05027
23 | 13.02714 1.671993 7.79 0.000 9.741524 16.31276
24 | 12.26581 2.151945 5.70 0.000 8.037047 16.49458
25 | 11.2174 2.729625 4.11 0.000 5.853439 16.58136
26 | 9.857454 3.410736 2.89 0.004 3.155052 16.55986
------------------------------------------------------------------------------
Variables that uniquely identify margins: elevation
Marginal Effect of Elevation
// Reshape data to long
reshape long jan, i(ghcnid) j(date) string
(note: j = 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 23 24 25 26 27 28 29 3 30 31 4 5 6 7 8 9)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      469   ->   14539
Number of variables                  40   ->   11
j variable (31 values)                    ->   date
xij variables:
                          jan1 jan10 ... jan9   ->   jan
-----------------------------------------------------------------------------
rename jan inches
// Destring date
destring date, replace
date has all characters numeric; replaced as byte
line inches date if stationname == "NEW HARTFORD 0.8 S"

```stata
scatter inches date if stationname == "NEW HARTFORD 1.0 WSW"
```
Scatter Plot of New Hartford 1.0 WSW
twoway (scatter inches date if stationname == "NEW HARTFORD 0.8 S") (line inches date if stationname == "NEW HARTFORD 0.8 S")
Scatter Plot and Line Graph of New Hartford 0.8 S
Here are some helpful resources for learning Stata.
Remember
You can download and install R from here and RStudio from here.
You can download the R script and the data set used on this website here:
Download data_science_for_hamiltonians.zip
####Load and Reshape Data####
# Install and attach packages
install.packages('tidyverse')
Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
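This error appears when R has no default CRAN mirror set, which is common when code is run non-interactively. One workaround, sketched below with the cloud mirror as an example, is to pass a mirror explicitly:

```r
# Install tidyverse once, pointing at an explicit CRAN mirror
install.packages("tidyverse", repos = "https://cloud.r-project.org")
```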
library(tidyverse)
-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr 1.1.2 v readr 2.1.4
v forcats 1.0.0 v stringr 1.5.0
v ggplot2 3.4.2 v tibble 3.2.1
v lubridate 1.9.2 v tidyr 1.3.0
v purrr 1.0.1
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
####Get Total Snowfall by GHCN####
# Load the CSV file of snowfall
# Read the CSV, skipping the first row so that the second row supplies the column names
snowfall <- read.csv("NY-snowfall-202301.csv", skip = 1, header = TRUE)
# Display column names
colnames(snowfall)
[1] "GHCN.ID" "Station.Name" "County" "State" "Elevation"
[6] "Latitude" "Longitude" "Jan.1" "Jan.2" "Jan.3"
[11] "Jan.4" "Jan.5" "Jan.6" "Jan.7" "Jan.8"
[16] "Jan.9" "Jan.10" "Jan.11" "Jan.12" "Jan.13"
[21] "Jan.14" "Jan.15" "Jan.16" "Jan.17" "Jan.18"
[26] "Jan.19" "Jan.20" "Jan.21" "Jan.22" "Jan.23"
[31] "Jan.24" "Jan.25" "Jan.26" "Jan.27" "Jan.28"
[36] "Jan.29" "Jan.30" "Jan.31"
# Display the first six rows
head(snowfall)
GHCN.ID Station.Name County State Elevation Latitude
1 US1NYER0211 AKRON 3.6 W ERIE New York 641 43.02
2 US1NYAB0041 ALBANY 0.3 ESE ALBANY New York 213 42.66
3 US1NYAB0023 ALBANY 0.7 SW ALBANY New York 256 42.66
4 US1NYAB0056 ALBANY 2.4 WNW ALBANY New York 276 42.68
5 USW00014735 ALBANY INTERNATIONAL AIRPORT ALBANY New York 280 42.75
6 US1NYNS0042 ALBERTSON 0.2 SSE NASSAU New York 142 40.77
Longitude Jan.1 Jan.2 Jan.3 Jan.4 Jan.5 Jan.6 Jan.7 Jan.8 Jan.9 Jan.10 Jan.11
1 -78.57 M M M M M M M M 0.0 0.0 0.0
2 -73.79 0.0 0.0 0.0 0.0 0.0 M M 0.0 0.0 T 0.0
3 -73.81 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 T 0.1
4 -73.84 M M 0.0 M M M M 0.0 0.0 T 0.0
5 -73.80 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 T T 0.0
6 -73.65 M 0.0 M M M M M 0.0 M 0.0 0.0
Jan.12 Jan.13 Jan.14 Jan.15 Jan.16 Jan.17 Jan.18 Jan.19 Jan.20 Jan.21 Jan.22
1 M M M 0.0 0.0 M M 0.0 M M M
2 T M 0.1 T 0.0 0.0 0.0 0.0 M M T
3 T 0.3 T T 0.0 0.0 T 0.0 0.5 0.1 T
4 M M 0.2 T 0.0 0.0 0.0 0.0 0.4 M M
5 0.2 0.2 T T 0.0 T 0.0 0.3 0.5 T 2.0
6 M M M M M 0.0 M 0.0 M 0.0 0.0
Jan.23 Jan.24 Jan.25 Jan.26 Jan.27 Jan.28 Jan.29 Jan.30 Jan.31
1 M M 0.0 M M 0.0 0.0 0.0 M
2 M M T 0.4 T T 0.0 0.0 0.5
3 3.8 3.0 T 1.0 T T 0.0 0.0 T
4 4.2 3.6 T 1.1 T 0.0 0.0 0.0 0.5
5 6.0 T 1.1 M T T 0.0 0.6 0.1
6 M 0.0 0.0 M 0.0 M 0.0 0.0 M
# Replace "M" (missing) and "T" (trace) with 0 in all columns of the dataframe
snowfall <- snowfall %>%
mutate_all(~ str_replace_all(., c("M" = "0", "T" = "0")))
# Check data type
snowfall$Jan.1 %>% typeof()
[1] "character"
# Convert columns starting with Jan. to numeric
snowfall <- snowfall %>%
mutate_at(vars(starts_with("Jan.")), as.numeric)
# Add a new column with the sum of each row and reorder in descending order of total
snowfall <- snowfall %>%
mutate(total = rowSums(select(., starts_with("Jan.")))) %>%
arrange(desc(total))
# Top six rows
head(snowfall)
GHCN.ID Station.Name County State Elevation Latitude
1 US1NYSL0006 HANNAWA FALLS 0.1 SW ST. LAWRENCE New York 561 44.61
2 USC00301264 CATTARAUGUS 3W CATTARAUGUS New York 1708 42.34
3 USC00308961 WARSAW 4 W WYOMING New York 1750 42.74
4 USC00304555 LAKE PLACID 2 S ESSEX New York 1888 44.3
5 US1NYHM0008 LONG LAKE 1.2 N HAMILTON New York 1793 43.99
6 US1NYFK0007 SARANAC LAKE 6.2 N FRANKLIN New York 1710 44.41
Longitude Jan.1 Jan.2 Jan.3 Jan.4 Jan.5 Jan.6 Jan.7 Jan.8 Jan.9 Jan.10 Jan.11
1 -74.97 0 0.0 0 0 0 0.0 0.1 0.0 0 0.7 0.0
2 -78.91 0 0.0 0 0 0 1.1 4.0 0.4 0 0.0 0.0
3 -78.21 0 0.0 0 0 0 0.0 1.0 0.4 0 0.0 0.0
4 -73.98 0 0.1 0 0 0 0.1 0.5 0.1 0 2.0 0.1
5 -74.42 0 0.0 0 0 0 0.0 0.4 0.6 0 7.0 0.1
6 -74.15 0 0.1 0 0 0 0.0 0.5 0.8 0 0.8 0.1
Jan.12 Jan.13 Jan.14 Jan.15 Jan.16 Jan.17 Jan.18 Jan.19 Jan.20 Jan.21 Jan.22
1 0.5 2.5 4.5 0.0 0 0 0.0 0.5 0.7 1.0 0.0
2 0.0 2.4 2.5 0.7 0 0 0.0 0.1 1.5 2.5 0.0
3 0.0 1.0 3.0 1.0 0 0 0.0 0.1 1.0 2.5 0.4
4 0.1 0.0 3.0 1.0 0 0 0.0 0.3 5.0 1.0 0.1
5 0.0 0.0 1.1 0.0 0 0 0.4 0.3 4.0 1.6 0.0
6 0.1 0.2 2.5 0.2 0 0 0.3 0.1 3.6 1.8 0.0
Jan.23 Jan.24 Jan.25 Jan.26 Jan.27 Jan.28 Jan.29 Jan.30 Jan.31 total
1 0.5 1.0 0.8 6.5 3.5 1.5 1.5 6.5 1.5 33.8
2 4.0 0.6 0.7 3.0 5.5 1.3 0.0 0.1 2.5 32.9
3 3.7 2.2 0.4 7.5 3.4 1.8 0.0 0.0 2.0 31.4
4 2.0 0.3 3.0 3.0 2.5 1.0 1.0 1.0 2.5 29.7
5 2.1 0.6 1.6 3.4 1.6 1.7 0.2 0.0 2.8 29.5
6 1.4 0.2 4.1 3.7 2.0 0.9 0.8 2.1 2.7 29.0
# top 1
snowfall %>%
slice_max(total)
GHCN.ID Station.Name County State Elevation Latitude
1 US1NYSL0006 HANNAWA FALLS 0.1 SW ST. LAWRENCE New York 561 44.61
Longitude Jan.1 Jan.2 Jan.3 Jan.4 Jan.5 Jan.6 Jan.7 Jan.8 Jan.9 Jan.10 Jan.11
1 -74.97 0 0 0 0 0 0 0.1 0 0 0.7 0
Jan.12 Jan.13 Jan.14 Jan.15 Jan.16 Jan.17 Jan.18 Jan.19 Jan.20 Jan.21 Jan.22
1 0.5 2.5 4.5 0 0 0 0 0.5 0.7 1 0
Jan.23 Jan.24 Jan.25 Jan.26 Jan.27 Jan.28 Jan.29 Jan.30 Jan.31 total
1 0.5 1 0.8 6.5 3.5 1.5 1.5 6.5 1.5 33.8
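One design note: because every "M" was recoded to 0, rowSums() never encounters a missing value here. An arguably more faithful treatment keeps "M" as NA (a missing report is not the same as zero snowfall) and sums with na.rm = TRUE. A minimal sketch of that alternative, where snowfall_na is a hypothetical copy:

```r
# Alternative: keep "M" as missing rather than zero (snowfall_na is a hypothetical copy)
snowfall_na <- read.csv("NY-snowfall-202301.csv", skip = 1, header = TRUE) %>%
  mutate(across(starts_with("Jan."), ~ na_if(., "M"))) %>%                  # "M" -> NA
  mutate(across(starts_with("Jan."), ~ str_replace_all(., "T", "0"))) %>%   # trace -> 0
  mutate(across(starts_with("Jan."), as.numeric)) %>%
  mutate(total = rowSums(select(., starts_with("Jan.")), na.rm = TRUE))     # ignore NAs
```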
####Produce an interactive graph of top 15####
# Attach packages
library(ggplot2)
library(plotly)
# Create ggplot object
p <- ggplot(snowfall, aes(x = Station.Name, y = total)) +
geom_bar(stat = "identity", aes(fill = total))
p
# Select top 15 stations and reorder factor levels in descending order
top_stations <- snowfall %>%
top_n(15, total) %>%
mutate(Station.Name = fct_reorder(Station.Name, desc(total)))
# Create ggplot object
p <- ggplot(top_stations, aes(x = Station.Name, y = total)) +
geom_bar(stat = "identity", aes(fill = total))
p
# Create ggplot object with custom color scheme
p <- ggplot(top_stations, aes(x = Station.Name, y = total, text = paste("Snowfall: ", total))) + # text is included for the interactive plot
geom_bar(stat = "identity", aes(fill = total)) +
scale_fill_gradient(low = "grey", high = "blue") + # change color scheme
labs(title = "Top 15 Stations with the Most Snowfall in January",
x = "Station",
y = "Total January Snowfall (in)",
fill = "Total\nSnowfall") + # change labels
theme_bw() + # change themes
theme(axis.text.x = element_text(angle = 90, hjust = 1)) # change x axis label's orientation
p
# Convert ggplot to interactive plot
ggplotly(p)
# modify ggplotly object
ggplotly(p + theme(legend.position = "none"), # delete legend
tooltip = "text", # include text
hovertemplate = paste("<b>%{x}</b><br>",
"Snowfall: %{y}<br>")) %>% # set text template for the text balloon when mouse hovers
layout(hoverlabel = list(bgcolor = c(low = "grey", high = "blue"))) # set color template
####Produce an interactive graph by date####
# Attach packages
library(ggplot2)
library(plotly)
# Keep only the station name and the daily snowfall columns
new_snowfall <- snowfall %>% select(Station.Name, starts_with("Jan."))
# tidy
tidy_snowfall <- new_snowfall %>%
gather(key = "date", value = "Snowfall", -Station.Name)
# Change Jan. to YYYY-MM-DD format
tidy_snowfall$ymd <- as.POSIXct(paste0("2023-", sub("Jan\\.", "01-", tidy_snowfall$date)), format = "%Y-%m-%d")
tidy_snowfall$Date <- ymd(tidy_snowfall$ymd)
tidy_snowfall$Station <- tidy_snowfall$Station.Name
# Create ggplot object
ggp <- ggplot(tidy_snowfall, aes(x = Date, y = Snowfall, color = Station)) +
geom_line() +
labs(title = "Snowfall by Stations",
x = "Date",
y = "Snowfall (in)") +
theme_bw() +
theme(legend.position = "none") +
scale_x_date(date_labels = "%b/%d")
ggp
# Convert to plotly object for interactivity
ggplotly(ggp)
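gather() still works but has been superseded in tidyr; the same reshape, with the date built in one step, could be written with pivot_longer(). A sketch (tidy_snowfall2 is just an illustrative name):

```r
# Equivalent reshape with pivot_longer(), building the date in one step
tidy_snowfall2 <- new_snowfall %>%
  pivot_longer(cols = starts_with("Jan."),
               names_to = "date", values_to = "Snowfall") %>%
  mutate(Date = ymd(paste0("2023-01-", sub("Jan\\.", "", date))))
```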
####Produce an interactive map of snowfall by NYS assembly districts####
# Attach packages
library(sf)
library(tmap)
library(leaflet)
# Read in the shapefile for the New York State Assembly Districts
ad_shapefile <- st_read("https://services6.arcgis.com/EbVsqZ18sv1kVJ3k/arcgis/rest/services/NYS_Assembly_Districts/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson")
Reading layer `OGRGeoJSON' from data source
`https://services6.arcgis.com/EbVsqZ18sv1kVJ3k/arcgis/rest/services/NYS_Assembly_Districts/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson'
using driver `GeoJSON'
Simple feature collection with 150 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -79.76259 ymin: 40.47658 xmax: -71.77749 ymax: 45.01586
Geodetic CRS: WGS 84
# Create a spatial points dataframe
snowfall_sf <- st_as_sf(snowfall, coords = c("Longitude", "Latitude"), crs = 4326)
# Perform spatial join
snowfall_by_district <- st_join(snowfall_sf, ad_shapefile, join = st_intersects)
# Summarize snowfall by district
snowfall_summarized <- snowfall_by_district %>%
group_by(District) %>%
summarize(total_snowfall = sum(total))
# Count the number of observatories in each district
observatory_counts <- snowfall_by_district %>%
group_by(District) %>%
summarize(observatory_count = n())
# Add the observatory count to the snowfall_summarized data frame
snowfall_summarized <- st_join(snowfall_summarized, observatory_counts)
# Calculate the average snowfall for each district
snowfall_averages <- snowfall_summarized %>%
mutate(avg_snowfall = round(total_snowfall / observatory_count, 2)) %>% # create a new column with the average snowfall, rounded to two decimal places
select(-District.y) %>%
rename(District = District.x)
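The sum-then-count route above works, but the district average can also be computed in a single summarize() call. A sketch (snowfall_averages2 is an illustrative name):

```r
# Sketch: average snowfall and station count per district in one pass
snowfall_averages2 <- snowfall_by_district %>%
  group_by(District) %>%
  summarize(avg_snowfall = round(mean(total), 2),  # district mean, two decimals
            observatory_count = n())               # stations per district
```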
# Join the district shapefile with the average snowfall information
nyassembly <- st_join(ad_shapefile, snowfall_averages) %>%
select(-c(District.y, District.x)) %>%
mutate(OBJECTID = OBJECTID - 150)
# Define the color palette
pal <- leaflet::colorNumeric(palette = c("grey", "blue"), domain = nyassembly$avg_snowfall)
# Create the leaflet map
leaflet <- leaflet() %>%
addTiles() %>%
setView(lng = -75.3, lat = 43, zoom = 6.7) %>%
addProviderTiles("CartoDB.Positron") %>%
addPolygons(data = nyassembly,
fillColor = ~ifelse(is.na(avg_snowfall), "#FFFFFF", pal(avg_snowfall)),
weight = 0.3,
color = ~ifelse(is.na(avg_snowfall), "#FFFFFF", pal(avg_snowfall)),
opacity = 1.0,
fillOpacity = 0.5,
label = ~paste("District: ", OBJECTID, ", ",
"Average Snowfall: ", format(round(avg_snowfall, 2), nsmall = 2)),
highlightOptions = leaflet::highlightOptions(
weight = 1.5,
color = "white",
fillOpacity = 0,
bringToFront = TRUE)) %>%
addLegend("bottomleft",
pal = pal,
values = nyassembly$avg_snowfall,
title = "Average Snowfall",
opacity = 0.8) %>%
addScaleBar(position = "topright")
leaflet
Average January Snowfall in Inches by NYS Assembly Districts
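Because the map is an htmlwidget, it can also be saved as a standalone HTML file for sharing; a sketch (the file name is just an example):

```r
# Save the interactive map as a self-contained HTML file
library(htmlwidgets)
saveWidget(leaflet, "ny_snowfall_map.html", selfcontained = TRUE)
```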
# Check data type
is.numeric(snowfall$Elevation)
[1] TRUE
is.numeric(snowfall$Latitude)
[1] TRUE
# Both are already numeric: read.csv parsed them as numbers, and the M/T
# replacement above touched only the Jan. columns
# Linear regression
model <- lm(total ~ Elevation + Latitude, data = snowfall)
summary(model)
Call:
lm(formula = total ~ Elevation + Latitude, data = snowfall)
Residuals:
Min 1Q Median 3Q Max
-13.9577 -2.7186 0.2336 2.4818 27.4458
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.582e+02 1.202e+01 -13.162 < 2e-16 ***
Elevation 2.530e-03 3.506e-04 7.216 2.18e-12 ***
Latitude 3.861e+00 2.839e-01 13.601 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.344 on 466 degrees of freedom
Multiple R-squared: 0.3891, Adjusted R-squared: 0.3864
F-statistic: 148.4 on 2 and 466 DF, p-value: < 2.2e-16
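The estimates differ from the Stata run above largely because the -9999 elevation sentinel for BATAVIA 1.4 E was never recoded on the R side. A sketch of the same fix before refitting, mirroring the Stata recode:

```r
# Mirror the Stata fix: recode the -9999 elevation sentinel, then refit
snowfall$Elevation[snowfall$Elevation == -9999] <- 900
model2 <- lm(total ~ Elevation + Latitude, data = snowfall)
summary(model2)
```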
Here are some helpful resources for learning R: