Data Science Using STATA and R

This webpage provides the materials for a presentation given during a campus visit to Hamilton College on May 5, 2023.

You can download the slides used in the presentation here:

Download Data_Science_for_Hamiltonians.pdf

Please feel free to contact me at thlim@arizona.edu if you have any questions.

STATA

General Information

Remember

  • to use do-files
  • to back up, back up, and then back up again
  • help, Google, and ChatGPT are your friends, when you use them right!
  • to check data types when you hit an error

You can download the do-file and the data set used in this website here:

Download data_science_for_hamiltonians.zip

Snowfall Data

// Load data set with the second row as the variable names
import delimited "NY-snowfall-202301.csv", varnames(2) clear
(38 vars, 469 obs)
describe


Contains data
  obs:           469                          
 vars:            38                          
 size:        78,792                          
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
ghcnid          str11   %11s                  GHCN ID
stationname     str30   %30s                  Station Name
county          str12   %12s                  County
state           str8    %9s                   State
elevation       int     %8.0g                 Elevation
latitude        float   %9.0g                 Latitude
longitude       float   %9.0g                 Longitude
jan1            str3    %9s                   Jan 1
jan2            str3    %9s                   Jan 2
jan3            str3    %9s                   Jan 3
jan4            str3    %9s                   Jan 4
jan5            str3    %9s                   Jan 5
jan6            str3    %9s                   Jan 6
jan7            str3    %9s                   Jan 7
jan8            str3    %9s                   Jan 8
jan9            str3    %9s                   Jan 9
jan10           str3    %9s                   Jan 10
jan11           str4    %9s                   Jan 11
jan12           str3    %9s                   Jan 12
jan13           str3    %9s                   Jan 13
jan14           str3    %9s                   Jan 14
jan15           str3    %9s                   Jan 15
jan16           str3    %9s                   Jan 16
jan17           str3    %9s                   Jan 17
jan18           str3    %9s                   Jan 18
jan19           str4    %9s                   Jan 19
jan20           str3    %9s                   Jan 20
jan21           str3    %9s                   Jan 21
jan22           str3    %9s                   Jan 22
jan23           str4    %9s                   Jan 23
jan24           str3    %9s                   Jan 24
jan25           str3    %9s                   Jan 25
jan26           str3    %9s                   Jan 26
jan27           str3    %9s                   Jan 27
jan28           str3    %9s                   Jan 28
jan29           str3    %9s                   Jan 29
jan30           str4    %9s                   Jan 30
jan31           str3    %9s                   Jan 31
-------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.
// Check variable values and data type
tab jan1
      Jan 1 |      Freq.     Percent        Cum.
------------+-----------------------------------
        0.0 |        254       54.16       54.16
          M |        214       45.63       99.79
          T |          1        0.21      100.00
------------+-----------------------------------
      Total |        469      100.00
// Replace values one variable at a time
replace jan1 = subinstr(jan1, "M", "0", .)
(214 real changes made)

replace jan1 = subinstr(jan1, "T", "0", .)
(1 real change made)
// Check variable and data type
describe jan1
tab jan1
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
jan1            str3    %9s                   Jan 1

      Jan 1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        215       45.84       45.84
        0.0 |        254       54.16      100.00
------------+-----------------------------------
      Total |        469      100.00
// Convert string to numeric
destring jan1, gen(test)
describe test
jan1 has all characters numeric; test generated as byte

              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
test            byte    %10.0g                Jan 1
// Replace "M" (missing) and "T" (trace) with 0 in all columns starting with 'jan'
foreach var of varlist jan* {
    replace `var' = subinstr(`var', "M", "0", .)
    replace `var' = subinstr(`var', "T", "0", .)
    destring `var', replace
}

// Check variable and data type
tab jan2
describe jan2
(0 real changes made)
(0 real changes made)
jan1 has all characters numeric; replaced as byte
(142 real changes made)
(12 real changes made)
jan2 has all characters numeric; replaced as double
(139 real changes made)
(2 real changes made)
jan3 has all characters numeric; replaced as byte
(205 real changes made)
(0 real changes made)
jan4 has all characters numeric; replaced as byte
(210 real changes made)
(0 real changes made)
jan5 has all characters numeric; replaced as double
(185 real changes made)
(25 real changes made)
jan6 has all characters numeric; replaced as double
(177 real changes made)
(60 real changes made)
jan7 has all characters numeric; replaced as double
(112 real changes made)
(77 real changes made)
jan8 has all characters numeric; replaced as double
(113 real changes made)
(38 real changes made)
jan9 has all characters numeric; replaced as double
(106 real changes made)
(60 real changes made)
jan10 has all characters numeric; replaced as double
(95 real changes made)
(47 real changes made)
jan11 has all characters numeric; replaced as double
(150 real changes made)
(75 real changes made)
jan12 has all characters numeric; replaced as double
(166 real changes made)
(39 real changes made)
jan13 has all characters numeric; replaced as double
(157 real changes made)
(54 real changes made)
jan14 has all characters numeric; replaced as double
(120 real changes made)
(95 real changes made)
jan15 has all characters numeric; replaced as double
(89 real changes made)
(20 real changes made)
jan16 has all characters numeric; replaced as double
(122 real changes made)
(14 real changes made)
jan17 has all characters numeric; replaced as double
(187 real changes made)
(33 real changes made)
jan18 has all characters numeric; replaced as double
(147 real changes made)
(19 real changes made)
jan19 has all characters numeric; replaced as double
(180 real changes made)
(46 real changes made)
jan20 has all characters numeric; replaced as double
(152 real changes made)
(52 real changes made)
jan21 has all characters numeric; replaced as double
(110 real changes made)
(85 real changes made)
jan22 has all characters numeric; replaced as double
(121 real changes made)
(16 real changes made)
jan23 has all characters numeric; replaced as double
(125 real changes made)
(78 real changes made)
jan24 has all characters numeric; replaced as double
(88 real changes made)
(92 real changes made)
jan25 has all characters numeric; replaced as double
(134 real changes made)
(15 real changes made)
jan26 has all characters numeric; replaced as double
(95 real changes made)
(55 real changes made)
jan27 has all characters numeric; replaced as double
(117 real changes made)
(65 real changes made)
jan28 has all characters numeric; replaced as double
(113 real changes made)
(28 real changes made)
jan29 has all characters numeric; replaced as double
(126 real changes made)
(55 real changes made)
jan30 has all characters numeric; replaced as double
(108 real changes made)
(22 real changes made)
jan31 has all characters numeric; replaced as double

      Jan 2 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        465       99.15       99.15
         .1 |          3        0.64       99.79
         .2 |          1        0.21      100.00
------------+-----------------------------------
      Total |        469      100.00

              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
jan2            double  %10.0g                Jan 2
// Add a new column with the sum of each row
egen total = rowtotal(jan1 - jan31)

sum total, d
                            total
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                 469
25%          1.1              0       Sum of Wgt.         469

50%          7.5                      Mean           7.889765
                        Largest       Std. Dev.      6.822664
75%         11.8           29.7
90%         16.4           31.4       Variance       46.54874
95%         19.4           32.9       Skewness       .8291596
99%         29.5           33.8       Kurtosis       3.733055
// Check if Elevation and Latitude are numeric
describe elevation latitude
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
elevation       int     %8.0g                 Elevation
latitude        float   %9.0g                 Latitude
// Check variable
sum elevation, d
                          Elevation
-------------------------------------------------------------
      Percentiles      Smallest
 1%            5          -9999
 5%           30              3
10%          100              4       Obs                 469
25%          380              4       Sum of Wgt.         469

50%          642                      Mean           743.5416
                        Largest       Std. Dev.      723.4853
75%         1110           2029
90%         1568           2169       Variance         523431
95%         1698           2220       Skewness      -6.733887
99%         2021           2495       Kurtosis       104.8178
// Check which station has the erroneous value
tab stationname if elevation == -9999
                  Station Name |      Freq.     Percent        Cum.
-------------------------------+-----------------------------------
                 BATAVIA 1.4 E |          1      100.00      100.00
-------------------------------+-----------------------------------
                         Total |          1      100.00
// Recode the -9999 sentinel to a plausible elevation
recode elevation -9999=900
(elevation: 1 changes made)

Plot Correlation

corr total elevation latitude
(obs=469)

             |    total elevat~n latitude
-------------+---------------------------
       total |   1.0000
   elevation |   0.5405   1.0000
    latitude |   0.5664   0.3353   1.0000
twoway scatter total elevation, ///
  xtitle("Elevation") ytitle("Total") ///
  title("Scatterplot of Total vs. Elevation")

Total ~ Elevation

twoway (scatter total latitude) (lfit total latitude), ///
  xtitle("Latitude") ytitle("Total") ///
  title("Scatterplot of Total vs. Latitude")

Total ~ Latitude

Linear Regression

reg total elevation latitude
      Source |       SS           df       MS      Number of obs   =       469
-------------+----------------------------------   F(2, 466)       =    197.88
       Model |  10004.6003         2  5002.30013   Prob > F        =    0.0000
    Residual |   11780.211       466  25.2794228   R-squared       =    0.4592
-------------+----------------------------------   Adj R-squared   =    0.4569
       Total |  21784.8113       468   46.548742   Root MSE        =    5.0279

------------------------------------------------------------------------------
       total |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   elevation |   .0051263   .0004693    10.92   0.000     .0042041    .0060484
    latitude |   3.313543   .2761122    12.00   0.000     2.770963    3.856122
       _cons |  -136.9432   11.62777   -11.78   0.000    -159.7926   -114.0939
------------------------------------------------------------------------------
margins, at(latitude=(40.54(1)44.84))
marginsplot
Predictive margins                              Number of obs     =        469
Model VCE    : OLS

Expression   : Linear prediction, predict()

1._at        : latitude        =       40.54

2._at        : latitude        =       41.54

3._at        : latitude        =       42.54

4._at        : latitude        =       43.54

5._at        : latitude        =       44.54

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   1.318498   .5947581     2.22   0.027     .1497579    2.487238
          2  |    4.63204   .3571999    12.97   0.000     3.930118    5.333962
          3  |   7.945583   .2322118    34.22   0.000     7.489271    8.401895
          4  |   11.25913   .3643196    30.90   0.000     10.54321    11.97504
          5  |   14.57267   .6033334    24.15   0.000     13.38708    15.75826
------------------------------------------------------------------------------


  Variables that uniquely identify margins: latitude

Marginal Effect of Latitude

reg total c.elevation##c.elevation##c.elevation latitude
      Source |       SS           df       MS      Number of obs   =       469
-------------+----------------------------------   F(4, 464)       =    104.84
       Model |   10342.088         4    2585.522   Prob > F        =    0.0000
    Residual |  11442.7233       464  24.6610415   R-squared       =    0.4747
-------------+----------------------------------   Adj R-squared   =    0.4702
       Total |  21784.8113       468   46.548742   Root MSE        =     4.966

------------------------------------------------------------------------------
       total |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   elevation |  -.0076266   .0034867    -2.19   0.029    -.0144782   -.0007749
             |
 c.elevation#|
 c.elevation |   .0000138   3.79e-06     3.63   0.000     6.31e-06    .0000212
             |
 c.elevation#|
 c.elevation#|
 c.elevation |  -4.07e-09   1.20e-09    -3.38   0.001    -6.44e-09   -1.71e-09
             |
    latitude |   3.954896   .3295734    12.00   0.000     3.307255    4.602537
       _cons |  -161.5611   13.50858   -11.96   0.000    -188.1066   -135.0155
------------------------------------------------------------------------------
margins, at(elevation=(0(100)2500))
marginsplot
Predictive margins                              Number of obs     =        469
Model VCE    : OLS

Expression   : Linear prediction, predict()

1._at        : elevation       =           0

2._at        : elevation       =         100

3._at        : elevation       =         200

4._at        : elevation       =         300

5._at        : elevation       =         400

6._at        : elevation       =         500

7._at        : elevation       =         600

8._at        : elevation       =         700

9._at        : elevation       =         800

10._at       : elevation       =         900

11._at       : elevation       =        1000

12._at       : elevation       =        1100

13._at       : elevation       =        1200

14._at       : elevation       =        1300

15._at       : elevation       =        1400

16._at       : elevation       =        1500

17._at       : elevation       =        1600

18._at       : elevation       =        1700

19._at       : elevation       =        1800

20._at       : elevation       =        1900

21._at       : elevation       =        2000

22._at       : elevation       =        2100

23._at       : elevation       =        2200

24._at       : elevation       =        2300

25._at       : elevation       =        2400

26._at       : elevation       =        2500

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _at |
          1  |          .  (not estimable)
          2  |   5.984411   .5999464     9.97   0.000     4.805462    7.163359
          3  |    5.60585   .4222693    13.28   0.000     4.776053    6.435648
          4  |   5.453481   .3367075    16.20   0.000     4.791821    6.115142
          5  |   5.502861   .3223978    17.07   0.000      4.86932    6.136402
          6  |   5.729548   .3373162    16.99   0.000     5.066691    6.392405
          7  |     6.1091   .3529676    17.31   0.000     5.415487    6.802713
          8  |   6.617076   .3605384    18.35   0.000     5.908586    7.325567
          9  |   7.229034   .3614414    20.00   0.000     6.518769    7.939299
         10  |   7.920532   .3610039    21.94   0.000     7.211127    8.629937
         11  |   8.667128   .3648421    23.76   0.000     7.950181    9.384076
         12  |   9.444381   .3762415    25.10   0.000     8.705032    10.18373
         13  |   10.22785   .3949241    25.90   0.000     9.451787    11.00391
         14  |   10.99309    .417921    26.30   0.000     10.17184    11.81434
         15  |   11.71566   .4419586    26.51   0.000     10.84717    12.58415
         16  |   12.37112   .4662718    26.53   0.000     11.45486    13.28739
         17  |   12.93503     .49549    26.11   0.000     11.96135    13.90871
         18  |   13.38294   .5421034    24.69   0.000     12.31766    14.44823
         19  |   13.69042   .6260242    21.87   0.000     12.46023    14.92062
         20  |   13.83302   .7685783    18.00   0.000      12.3227    15.34335
         21  |    13.7863   .9852521    13.99   0.000      11.8502    15.72241
         22  |   13.52582   1.284649    10.53   0.000     11.00137    16.05027
         23  |   13.02714   1.671993     7.79   0.000     9.741524    16.31276
         24  |   12.26581   2.151945     5.70   0.000     8.037047    16.49458
         25  |    11.2174   2.729625     4.11   0.000     5.853439    16.58136
         26  |   9.857454   3.410736     2.89   0.004     3.155052    16.55986
------------------------------------------------------------------------------


  Variables that uniquely identify margins: elevation

Marginal Effect of Elevation

Plot by Date in New Hartford, NY

// Reshape data from wide to long
reshape long jan, i(ghcnid) j(date) string
(note: j = 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 23 24 25 26 27 28 29 3 30 31 4 5 6 7 8 9)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      469   ->   14539
Number of variables                  40   ->      11
j variable (31 values)                    ->   date
xij variables:
                    jan1 jan10 ... jan9   ->   jan
-----------------------------------------------------------------------------
rename jan inches
// Convert date from string to numeric
destring date, replace
date has all characters numeric; replaced as byte
// Sort so the line graph connects points in date order
sort ghcnid date
line inches date if stationname == "NEW HARTFORD 0.8 S"


![Line Graph of New Hartford 0.8 S](line_08S.png)

scatter inches date if stationname == "NEW HARTFORD 1.0 WSW"

Scatter Plot of New Hartford 1.0 WSW

twoway (scatter inches date if stationname == "NEW HARTFORD 0.8 S") (line inches date if stationname == "NEW HARTFORD 0.8 S")

Scatter Plot and Line Graph of New Hartford 0.8 S

R

General Information

Remember

  • to use R scripts
  • to back up, back up, and then back up again
  • CRAN, Google, and ChatGPT are your friends, when you use them right!
  • to check data structures and types when you hit an error

You can download and install R from here and RStudio from here.

You can download the R script and the data set used in this website here:

Download data_science_for_hamiltonians.zip

Snowfall Data

####Load and Reshape Data####
# Attach packages
# Note: install.packages() errors if no CRAN mirror is set; choose a mirror when prompted
install.packages('tidyverse')
Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
# Note: the pipe %>% is not available until the tidyverse is attached
snowfall <- snowfall %>% 
  mutate_all(~ str_replace_all(., c("M" = "0", "T" = "0")))
Error in snowfall %>% mutate_all(~str_replace_all(., c(M = "0", T = "0"))): could not find function "%>%"
library(tidyverse)
-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.2     v readr     2.1.4
v forcats   1.0.0     v stringr   1.5.0
v ggplot2   3.4.2     v tibble    3.2.1
v lubridate 1.9.2     v tidyr     1.3.0
v purrr     1.0.1     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
####Get total snowfall by GHCN station####
# Load the CSV file of snowfall
# Read from csv with the first row skipped and with the second row as the column names of the dataframe
snowfall <- read.csv("NY-snowfall-202301.csv", skip = 1, header = TRUE)
# Display column names
colnames(snowfall)
 [1] "GHCN.ID"      "Station.Name" "County"       "State"        "Elevation"   
 [6] "Latitude"     "Longitude"    "Jan.1"        "Jan.2"        "Jan.3"       
[11] "Jan.4"        "Jan.5"        "Jan.6"        "Jan.7"        "Jan.8"       
[16] "Jan.9"        "Jan.10"       "Jan.11"       "Jan.12"       "Jan.13"      
[21] "Jan.14"       "Jan.15"       "Jan.16"       "Jan.17"       "Jan.18"      
[26] "Jan.19"       "Jan.20"       "Jan.21"       "Jan.22"       "Jan.23"      
[31] "Jan.24"       "Jan.25"       "Jan.26"       "Jan.27"       "Jan.28"      
[36] "Jan.29"       "Jan.30"       "Jan.31"      
# Display top five rows
head(snowfall)
      GHCN.ID                 Station.Name County    State Elevation Latitude
1 US1NYER0211                  AKRON 3.6 W   ERIE New York       641    43.02
2 US1NYAB0041               ALBANY 0.3 ESE ALBANY New York       213    42.66
3 US1NYAB0023                ALBANY 0.7 SW ALBANY New York       256    42.66
4 US1NYAB0056               ALBANY 2.4 WNW ALBANY New York       276    42.68
5 USW00014735 ALBANY INTERNATIONAL AIRPORT ALBANY New York       280    42.75
6 US1NYNS0042            ALBERTSON 0.2 SSE NASSAU New York       142    40.77
  Longitude Jan.1 Jan.2 Jan.3 Jan.4 Jan.5 Jan.6 Jan.7 Jan.8 Jan.9 Jan.10 Jan.11
1    -78.57     M     M     M     M     M     M     M     M   0.0    0.0    0.0
2    -73.79   0.0   0.0   0.0   0.0   0.0     M     M   0.0   0.0      T    0.0
3    -73.81   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0      T    0.1
4    -73.84     M     M   0.0     M     M     M     M   0.0   0.0      T    0.0
5    -73.80   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0     T      T    0.0
6    -73.65     M   0.0     M     M     M     M     M   0.0     M    0.0    0.0
  Jan.12 Jan.13 Jan.14 Jan.15 Jan.16 Jan.17 Jan.18 Jan.19 Jan.20 Jan.21 Jan.22
1      M      M      M    0.0    0.0      M      M    0.0      M      M      M
2      T      M    0.1      T    0.0    0.0    0.0    0.0      M      M      T
3      T    0.3      T      T    0.0    0.0      T    0.0    0.5    0.1      T
4      M      M    0.2      T    0.0    0.0    0.0    0.0    0.4      M      M
5    0.2    0.2      T      T    0.0      T    0.0    0.3    0.5      T    2.0
6      M      M      M      M      M    0.0      M    0.0      M    0.0    0.0
  Jan.23 Jan.24 Jan.25 Jan.26 Jan.27 Jan.28 Jan.29 Jan.30 Jan.31
1      M      M    0.0      M      M    0.0    0.0    0.0      M
2      M      M      T    0.4      T      T    0.0    0.0    0.5
3    3.8    3.0      T    1.0      T      T    0.0    0.0      T
4    4.2    3.6      T    1.1      T    0.0    0.0    0.0    0.5
5    6.0      T    1.1      M      T      T    0.0    0.6    0.1
6      M    0.0    0.0      M    0.0      M    0.0    0.0      M
# Replace "M" (missing) and "T" (trace) with 0 in the daily snowfall columns
# (scoped to the Jan. columns so station and county names are left intact)
snowfall <- snowfall %>% 
  mutate(across(starts_with("Jan."), ~ str_replace_all(., c("M" = "0", "T" = "0"))))
# Check data type
snowfall$Jan.1 %>% typeof()
[1] "character"
# Convert columns starting with Jan. to numeric
snowfall <- snowfall %>% 
  mutate(across(starts_with("Jan."), as.numeric))
# Add a new column with the sum of each row and reorder in descending order of total
snowfall <- snowfall %>% 
  mutate(total = rowSums(select(., starts_with("Jan.")))) %>% 
  arrange(desc(total))
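Recoding "M" to 0 treats a day with no report as a day with zero snowfall, which can understate totals at stations with many missing days. As an alternative sketch (not the approach used above; `snowfall_raw` is a hypothetical copy of the data frame exactly as returned by `read.csv()`), you can keep missing days as NA and exclude them from the sum:

```r
# Alternative: keep "M" as NA so missing days are excluded from the total,
# while still counting "T" (trace) as 0
snowfall_na <- snowfall_raw %>%
  mutate(across(starts_with("Jan."),
                ~ as.numeric(na_if(str_replace_all(., "T", "0"), "M")))) %>%
  mutate(total = rowSums(across(starts_with("Jan.")), na.rm = TRUE))
```

Whether 0 or NA is the better choice depends on whether you read a missing report as "no snow" or "no information".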
# top 5
head(snowfall)
      GHCN.ID         Station.Name       County    State Elevation Latitude
1 US1NYSL0006 HANNAWA FALLS 0.1 SW ST. LAWRENCE New York       561    44.61
2 USC00301264       CATTARAUGUS 3W  CATTARAUGUS New York      1708    42.34
3 USC00308961           WARSAW 4 W      WYOMING New York      1750    42.74
4 USC00304555      LAKE PLACID 2 S        ESSEX New York      1888     44.3
5 US1NYHM0008      LONG LAKE 1.2 N     HAMILTON New York      1793    43.99
6 US1NYFK0007   SARANAC LAKE 6.2 N     FRANKLIN New York      1710    44.41
  Longitude Jan.1 Jan.2 Jan.3 Jan.4 Jan.5 Jan.6 Jan.7 Jan.8 Jan.9 Jan.10 Jan.11
1    -74.97     0   0.0     0     0     0   0.0   0.1   0.0     0    0.7    0.0
2    -78.91     0   0.0     0     0     0   1.1   4.0   0.4     0    0.0    0.0
3    -78.21     0   0.0     0     0     0   0.0   1.0   0.4     0    0.0    0.0
4    -73.98     0   0.1     0     0     0   0.1   0.5   0.1     0    2.0    0.1
5    -74.42     0   0.0     0     0     0   0.0   0.4   0.6     0    7.0    0.1
6    -74.15     0   0.1     0     0     0   0.0   0.5   0.8     0    0.8    0.1
  Jan.12 Jan.13 Jan.14 Jan.15 Jan.16 Jan.17 Jan.18 Jan.19 Jan.20 Jan.21 Jan.22
1    0.5    2.5    4.5    0.0      0      0    0.0    0.5    0.7    1.0    0.0
2    0.0    2.4    2.5    0.7      0      0    0.0    0.1    1.5    2.5    0.0
3    0.0    1.0    3.0    1.0      0      0    0.0    0.1    1.0    2.5    0.4
4    0.1    0.0    3.0    1.0      0      0    0.0    0.3    5.0    1.0    0.1
5    0.0    0.0    1.1    0.0      0      0    0.4    0.3    4.0    1.6    0.0
6    0.1    0.2    2.5    0.2      0      0    0.3    0.1    3.6    1.8    0.0
  Jan.23 Jan.24 Jan.25 Jan.26 Jan.27 Jan.28 Jan.29 Jan.30 Jan.31 total
1    0.5    1.0    0.8    6.5    3.5    1.5    1.5    6.5    1.5  33.8
2    4.0    0.6    0.7    3.0    5.5    1.3    0.0    0.1    2.5  32.9
3    3.7    2.2    0.4    7.5    3.4    1.8    0.0    0.0    2.0  31.4
4    2.0    0.3    3.0    3.0    2.5    1.0    1.0    1.0    2.5  29.7
5    2.1    0.6    1.6    3.4    1.6    1.7    0.2    0.0    2.8  29.5
6    1.4    0.2    4.1    3.7    2.0    0.9    0.8    2.1    2.7  29.0
# top 1
snowfall %>%
  slice_max(total)
      GHCN.ID         Station.Name       County    State Elevation Latitude
1 US1NYSL0006 HANNAWA FALLS 0.1 SW ST. LAWRENCE New York       561    44.61
  Longitude Jan.1 Jan.2 Jan.3 Jan.4 Jan.5 Jan.6 Jan.7 Jan.8 Jan.9 Jan.10 Jan.11
1    -74.97     0     0     0     0     0     0   0.1     0     0    0.7      0
  Jan.12 Jan.13 Jan.14 Jan.15 Jan.16 Jan.17 Jan.18 Jan.19 Jan.20 Jan.21 Jan.22
1    0.5    2.5    4.5      0      0      0      0    0.5    0.7      1      0
  Jan.23 Jan.24 Jan.25 Jan.26 Jan.27 Jan.28 Jan.29 Jan.30 Jan.31 total
1    0.5      1    0.8    6.5    3.5    1.5    1.5    6.5    1.5  33.8

Interactive Graph of Top 15

####Produce an interactive graph of top 15####
# Attach packages
library(ggplot2)
library(plotly)
# Create ggplot object
p <- ggplot(snowfall, aes(x = Station.Name, y = total)) + 
  geom_bar(stat = "identity", aes(fill = total))
p

# Select top 15 stations and reorder factor levels in descending order
top_stations <- snowfall %>% 
  top_n(15, total) %>% 
  mutate(Station.Name = fct_reorder(Station.Name, desc(total)))
# Create ggplot object
p <- ggplot(top_stations, aes(x = Station.Name, y = total)) +
  geom_bar(stat = "identity", aes(fill = total)) 
p

# Create ggplot object with custom color scheme
p <- ggplot(top_stations, aes(x = Station.Name, y = total, text = paste("Snowfall: ", total))) + # text is included for the interactive plot
  geom_bar(stat = "identity", aes(fill = total)) +
  scale_fill_gradient(low = "grey", high = "blue") + # change color scheme
  labs(title = "Top 15 Stations with the Most Snowfall in January",
       x = "Station",
       y = "Total January Snowfall (in)",
       fill = "Total\nSnowfall") + # change labels
  theme_bw() + # change themes
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) # change x axis label's orientation
p

# Convert ggplot to interactive plot
ggplotly(p)
# modify ggplotly object
ggplotly(p + theme(legend.position = "none"), # delete legend
         tooltip = "text", # include text
         hovertemplate = paste("<b>%{x}</b><br>", 
                               "Snowfall: %{y}<br>")) %>% # set text template for the text balloon when mouse hovers
  layout(hoverlabel = list(bgcolor = c(low = "grey", high = "blue"))) # set color template

Interactive Graph by Date

####Produce an interactive graph by date####
# Attach packages
library(ggplot2)
library(plotly)
# Leave the columns with the ID and dates
new_snowfall <- snowfall %>% select(Station.Name, starts_with("Jan."))
# tidy
tidy_snowfall <- new_snowfall %>% 
  pivot_longer(-Station.Name, names_to = "date", values_to = "Snowfall")
# Change Jan. to YYYY-MM-DD format
tidy_snowfall$ymd <- as.POSIXct(paste0("2023-", sub("Jan\\.", "01-", tidy_snowfall$date)), format = "%Y-%m-%d")
tidy_snowfall$Date <- ymd(tidy_snowfall$ymd)
tidy_snowfall$Station <- tidy_snowfall$Station.Name
# Create ggplot object
ggp <- ggplot(tidy_snowfall, aes(x = Date, y = Snowfall, color = Station)) +
  geom_line() +
  labs(title = "Snowfall by Stations",
       x = "Date",
       y = "Snowfall (in)") +
  theme_bw() +
  theme(legend.position = "none") +
  scale_x_date(date_labels = "%b/%d")
ggp

# Convert to plotly object for interactivity
ggplotly(ggp)

Interactive Map

####Produce an interactive map of snowfall by NYS assembly districts####
# Attach packages
library(sf)
library(tmap)
library(leaflet)
# Read in the shapefile for the New York State Assembly Districts
ad_shapefile <- st_read("https://services6.arcgis.com/EbVsqZ18sv1kVJ3k/arcgis/rest/services/NYS_Assembly_Districts/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson")
Reading layer `OGRGeoJSON' from data source 
  `https://services6.arcgis.com/EbVsqZ18sv1kVJ3k/arcgis/rest/services/NYS_Assembly_Districts/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson' 
  using driver `GeoJSON'
Simple feature collection with 150 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -79.76259 ymin: 40.47658 xmax: -71.77749 ymax: 45.01586
Geodetic CRS:  WGS 84
# Create a spatial points dataframe
snowfall_sf <- st_as_sf(snowfall, coords = c("Longitude", "Latitude"), crs = 4326)

# Perform spatial join
snowfall_by_district <- st_join(snowfall_sf, ad_shapefile, join = st_intersects)

# Summarize snowfall by district
snowfall_summarized <- snowfall_by_district %>% 
  group_by(District) %>% 
  summarize(total_snowfall = sum(total))

# Count the number of observatories in each district
observatory_counts <- snowfall_by_district %>%
  group_by(District) %>%
  summarize(observatory_count = n())

# Add the observatory count to the snowfall_summarized data frame
snowfall_summarized <- st_join(snowfall_summarized, observatory_counts)

# Calculate the average snowfall for each district
snowfall_averages <- snowfall_summarized %>%
  mutate(avg_snowfall = round(total_snowfall / observatory_count, 2)) %>% # create a new column with the average snowfall, displayed to two decimal digits
  select(-District.y) %>%
  rename(District = District.x)

# Join the district shapefile with the snowfall averages
nyassembly <- st_join(ad_shapefile, snowfall_averages) %>% 
  select(-c(District.y, District.x)) %>%
  mutate(OBJECTID = OBJECTID - 150)
# Define the color palette
pal <- leaflet::colorNumeric(palette = c("grey", "blue"), domain = nyassembly$avg_snowfall)

# Create the leaflet map
leaflet <- leaflet() %>%
  addTiles() %>% 
  setView(lng = -75.3, lat = 43, zoom = 6.7) %>% 
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(data = nyassembly, 
              fillColor = ~ifelse(is.na(avg_snowfall), "#FFFFFF", pal(avg_snowfall)), 
              weight = 0.3,
              color = ~ifelse(is.na(avg_snowfall), "#FFFFFF", pal(avg_snowfall)),  
              opacity = 1.0,
              fillOpacity = 0.5, 
              label = ~paste("District: ", OBJECTID, ", ",
                             "Average Snowfall: ", format(round(avg_snowfall, 2), nsmall = 2)),
              highlightOptions = leaflet::highlightOptions(
                weight = 1.5,
                color = "white",
                fillOpacity = 0,
                bringToFront = TRUE)) %>%
  addLegend("bottomleft", 
            pal = pal, 
            values = nyassembly$avg_snowfall,
            title = "Average Snowfall", 
            opacity = 0.8) %>%
  addScaleBar(position = "topright")

Average January Snowfall in Inches by NYS Assembly Districts

leaflet
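To share the interactive map outside of RStudio, the widget can be saved as a standalone HTML file (a sketch; the file name here is arbitrary):

```r
# Save the leaflet map as a self-contained HTML file
library(htmlwidgets)
saveWidget(leaflet, "ny_snowfall_map.html", selfcontained = TRUE)
```

The resulting file can be opened in any browser or embedded in a webpage.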

Linear Regression

# Check data type
is.numeric(snowfall$Elevation)
is.numeric(snowfall$Latitude)
# Coerce to numeric (a no-op if the columns are already numeric)
snowfall$Elevation <- as.numeric(snowfall$Elevation)
snowfall$Latitude <- as.numeric(snowfall$Latitude)
# Linear regression
model <- lm(total ~ Elevation + Latitude, data = snowfall)
summary(model)

Call:
lm(formula = total ~ Elevation + Latitude, data = snowfall)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.9577  -2.7186   0.2336   2.4818  27.4458 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.582e+02  1.202e+01 -13.162  < 2e-16 ***
Elevation    2.530e-03  3.506e-04   7.216 2.18e-12 ***
Latitude     3.861e+00  2.839e-01  13.601  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.344 on 466 degrees of freedom
Multiple R-squared:  0.3891,    Adjusted R-squared:  0.3864 
F-statistic: 148.4 on 2 and 466 DF,  p-value: < 2.2e-16
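As a rough R counterpart to Stata's `margins` output above, predicted totals over a grid of latitudes can be computed with `predict()`. This sketch holds Elevation at its sample mean (prediction at the means), which differs slightly from Stata's predictive margins, which average predictions over all observations:

```r
# Predicted total snowfall at selected latitudes, Elevation held at its mean
grid <- data.frame(Latitude  = seq(40.54, 44.54, by = 1),
                   Elevation = mean(snowfall$Elevation, na.rm = TRUE))
cbind(grid, predict(model, newdata = grid, interval = "confidence"))
```

As in the Stata output, the predictions should rise by the Latitude coefficient (about 3.9 inches) per degree of latitude.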