Lab 04

Advanced Data Analysis and Statistics

Notebook Open in Colab

Pre-lab Prep

To use the cloud computing platform Google Colab, you need a Google account and access to Google Drive. SU students can use their @g.syr.edu account.
Students who are not familiar with Google Colab are strongly encouraged to watch this quickview video and to visit the Google Colab website to work through Welcome to Colab, Overview of Colab, and Guide to Markdown.
Students are strongly encouraged to read through Lab 04 Instructions before class.

Lab Materials

1. Lab 04 Instructions

Please download Lab 04 Instructions and go through it before class.

2. Lab 04 Demo

To save time setting up the coding environment and dependencies on your local computer, you can click the Open in Colab button at the top of this webpage to open it in Google Colab.
Once it is open in Google Colab, log in with your Google account, click ‘Runtime’ in the menu bar, select ‘Change runtime type’, and change the runtime type from Python 3 to R.
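
After switching the runtime, it can help to confirm that the notebook is really running R before installing any packages. The check below is only a sketch and is not part of the original demo; 'runtime_is_r' is a name introduced here for illustration. If the cell fails with a Python error instead of printing a version string, the runtime is still set to Python 3.

```r
# R.version is a built-in list that describes the running R interpreter.
print(R.version.string)

# TRUE when the active interpreter identifies itself as R.
runtime_is_r <- identical(R.version$language, "R")
print(runtime_is_r)
```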

###############################################################
# Installing and Loading Packages
###############################################################
# Install all needed packages
install.packages("dataRetrieval")
install.packages("dplyr")
install.packages("lubridate")

# load these packages into the memory for later use
library(dataRetrieval)
library(dplyr)
library(lubridate)

#######################################################################################
# Downloading River Chemistry Time Series Data Using dataRetrieval package
#######################################################################################
# In this demo, we download sodium (Na) data for the USGS site
# with site number USGS-01391500 (Saddle River at Lodi NJ).
# Let's define a variable to store the site number.
siteid <- "USGS-01391500"

# In lab 04 deliverables, you need to explore the other two sites:
# USGS-01111500 (BRANCH RIVER AT FORESTDALE, RI)
# USGS-02336300 (PEACHTREE CREEK AT ATLANTA, GA)

# USGS encodes each water-quality parameter as a numeric code:
# calcium (Ca) is 00915 and sodium (Na) is 00930. This demo uses sodium.
parmCd <- "00930"

# let's focus on water quality data collected from 1978 to 2018
start.date <- as.Date("1978-01-01")
end.date <- as.Date("2019-01-01")

# download the data and assign it to a variable named "demo_site"
demo_site <- readWQPqw(siteNumbers = siteid,
                       parameterCd = parmCd,
                       startDate = start.date,
                       endDate = end.date)

# extract the year, month, and day of month from the sampling date and add them as three new columns
demo_site$year <- year(demo_site$ActivityStartDate)
demo_site$month <- month(demo_site$ActivityStartDate)
demo_site$mday <- mday(demo_site$ActivityStartDate)

# Simplify the dataset by keeping only the most essential columns (i.e., location, sampling date, data value)
demo_site <- demo_site %>%
    select(c("MonitoringLocationIdentifier", "ActivityStartDate",
             "year", "month", "mday", "ResultMeasureValue")) %>%
    rename(site_no = MonitoringLocationIdentifier,
           sample_dt = ActivityStartDate,
           result_va = ResultMeasureValue)

# open the refined dataset for a quick look
View(demo_site)

# we can also save this dataset as a csv file for future use
write.csv(demo_site, file = "./demo_site.csv", row.names = FALSE, na = "")

# in the future, you can read this file by using read.csv function
#read.csv(file = "./demo_site.csv", header = TRUE, na.strings = c("", "NA"))

#######################################################################################
# Data Processing
#######################################################################################

#### Data overview
# print top 6 rows
head(demo_site)

# print top 5 rows
head(demo_site,5)

# print bottom 6 rows
tail(demo_site)

# print the statistical summary of each column
summary(demo_site)

#### Index and subset
# print the cell that is at row 2 and column 3.
demo_site[2,3]
# In the above example, first number indicates row number while the second number indicates column number

# print the 2nd row
demo_site[2,]

# print the first 10 values in the column 'result_va'
demo_site[1:10,"result_va"]

# print the column named 'site_no'
demo_site$site_no

# print out all rows whose 'year' value is 2000 or later
demo_site[demo_site$year>=2000,]

#### Column-wise and row-wise summary
# print out the minimum value of column 'result_va'
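# na.rm = TRUE skips missing values; without it, any NA in the column makes the result NA
min(demo_site$result_va, na.rm = TRUE)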

# What is the max Na concentration?
# hint: use the function max

### Helpful Functions apply() and tapply()
# We create a new data frame with 2 columns. First column contains 1, 2, and 3.
# Second column contains 4, 5, and 6
test_df <- data.frame(c1=c(1,2,3),c2=c(4,5,6))

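# How does 'apply' work?
# Try to run '?apply'
# The second argument (MARGIN) selects the direction: 1 = over rows, 2 = over columns
apply(test_df, 1, sum)   # row sums: 5 7 9
apply(test_df, 2, sum)   # column sums: c1 = 6, c2 = 15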

# Try '?tapply'
# Here: the mean of 'result_va' for each combination of year and site
tapply(demo_site$result_va, list(demo_site$year, demo_site$site_no), mean, na.rm=TRUE)
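
# The same idea on a tiny made-up example (illustrative values, not USGS data):
# tapply(values, groups, function) splits the values by group and applies the function to each group.
# 'toy_values', 'toy_groups', and 'toy_means' are names introduced only for this sketch.
toy_values <- c(10, 20, 30, 40, NA)
toy_groups <- c("a", "a", "b", "b", "b")
toy_means <- tapply(toy_values, toy_groups, mean, na.rm = TRUE)
toy_means   # group a: 15, group b: 35 (the NA is dropped by na.rm = TRUE)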

#######################################################################################
# Data Visualization and Regression Analysis
#######################################################################################

# aggregate data by year
annual_summary <- tapply(demo_site$result_va, demo_site$year, mean, na.rm=TRUE)
annual_summary <- data.frame(year=as.numeric(names(annual_summary)), 
                             result_annual=annual_summary)

# plotting
plot(x = annual_summary$year,
     y = annual_summary$result_annual,
     xlab="Year",
     ylab="Annual Mean Na Concentration (mg/L)",
     main="Temporal Trend of Annual Mean Na Concentration 1978-2018",
     type="b")

# regression analysis: fit the linear trend once, then reuse the fitted model
trend_fit <- lm(result_annual ~ year, data = annual_summary)
abline(trend_fit, col = 2)   # col = 2 is red in the default palette
summary(trend_fit)

# save the figure to pdf
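# A minimal sketch, assuming base R graphics: pdf() opens a file device, the plotting
# calls are repeated so that they draw into the file, and dev.off() closes and writes it.
# The file name "demo_site_Na_trend.pdf" and the page size are arbitrary choices for this sketch.
pdf(file = "./demo_site_Na_trend.pdf", width = 7, height = 5)
plot(x = annual_summary$year,
     y = annual_summary$result_annual,
     xlab = "Year",
     ylab = "Annual Mean Na Concentration (mg/L)",
     main = "Temporal Trend of Annual Mean Na Concentration 1978-2018",
     type = "b")
abline(lm(result_annual ~ year, data = annual_summary), col = 2)
dev.off()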

3. Lab 04 Deliverable

Modify and rerun the demo code to generate the temporal plot of Na concentration for the other two sites (USGS-01111500 and USGS-02336300), and perform the regression analysis for both sites.
Submit a single-page PDF file that includes these two plots plus 2-3 paragraphs describing the plots and what explains the difference in the temporal trends at these two sites (refer to the papers in Lab 01 and previous lectures).
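
Since the same aggregation and regression steps are repeated for each site, one possible approach is to wrap them in a small function. The sketch below is not part of the original demo: 'annual_trend' is a hypothetical helper, and it assumes a data frame with the columns 'year' and 'result_va' as produced by the demo code. It runs on synthetic numbers made up for illustration, not on USGS measurements.

```r
# Hypothetical helper: summarise one site's samples into annual means
# and fit a linear trend to those means.
annual_trend <- function(df) {
  # mean concentration per year, ignoring missing values
  annual <- tapply(df$result_va, df$year, mean, na.rm = TRUE)
  annual <- data.frame(year = as.numeric(names(annual)),
                       result_annual = as.numeric(annual))
  fit <- lm(result_annual ~ year, data = annual)
  list(annual = annual, fit = fit)
}

# Synthetic data for illustration only (two samples per year).
toy <- data.frame(year = rep(2000:2004, each = 2),
                  result_va = c(10, 12, 13, 15, 15, 17, 19, 21, 22, 24))
out <- annual_trend(toy)

print(out$annual)

# Slope (concentration change per year) and its p-value from the fitted model.
print(coef(summary(out$fit))["year", c("Estimate", "Pr(>|t|)")])
```

With real data, the same function could be called once per site after downloading and renaming the columns as in the demo, and the 'Estimate' and 'Pr(>|t|)' values for 'year' are the numbers to discuss when comparing the two trends.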

Deliverables

Deliverables	Date Assigned	Date Due
Lab 04 (refer to the SU Blackboard website)	Thur 10/24/2024	Thur 10/31/2024, 12:30pm ET