

Modeling in R

The purpose of a statistical model is to help understand what variables might best predict a phenomenon of interest and which ones have more or less influence, to define a predictive equation with coefficients for each of the variables, and then to apply that equation to predict values using the same input variables for other areas. This process requires samples with observations of the explanatory (or independent) and response (or dependent) variables in question.

We'll start with a simple linear model using our Sierra February climate data to model temperature (response variable) as predicted by elevation (explanatory variable):

Linear model

Figure 11.1: Linear model

Some common statistical models

There are many types of statistical models. Variables may be nominal (categorical) or interval/ratio data. You may be interested in predicting a continuous interval/ratio variable from other continuous variables, or predicting the probability of an occurrence (e.g. of a species), or maybe the count of something (also perhaps a species). You may need to classify your phenomena based on continuous variables. Here are some examples:

  • lm(y ~ x) linear regression model with one explanatory variable
  • lm(y ~ x1 + x2 + x3) multiple regression, a linear model with multiple explanatory variables
  • glm(y ~ x, family = poisson) generalized linear model, poisson distribution; see ?family to see those supported, including binomial, gaussian, poisson, etc.
  • glm(y ~ x, family = binomial) glm for logistic regression
  • aov(y ~ x) analysis of variance (same as lm except in the summary)
  • gam(y ~ x) generalized additive models
  • tree(y ~ x) or rpart(y ~ x) regression/classification trees

Linear model (lm)

If we look at the Sierra February climate example, the regression line in the graph above shows the response variable temperature predicted by the explanatory variable elevation. Here's the code that produced the model and graph:

library(igisci); library(tidyverse)
sierra <- sierraFeb %>% filter(!is.na(TEMPERATURE))
model1 = lm(TEMPERATURE ~ ELEVATION, data = sierra)
cc = model1$coefficients
sierra$resid = resid(model1)
sierra$predict = predict(model1)
eqn = paste("temperature =", round(cc[1], digits = 2), "+",
            round(cc[2], digits = 3), "elevation + e")
ggplot(sierra, aes(x = ELEVATION, y = TEMPERATURE)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = ELEVATION, yend = predict), alpha = .2) +
  geom_point(aes(color = resid)) +
  scale_color_gradient2(low = "blue", mid = "ivory2", high = "red") +
  guides(color = "none") +
  theme_bw() +
  ggtitle(paste("Residuals (e) from model: ", eqn))

The summary() function can be used to look at the results statistically:

## 
## Call:
## lm(formula = TEMPERATURE ~ ELEVATION, data = sierra)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9126 -1.0466 -0.0027  0.7940  4.5327 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.8813804  0.3825302   31.06   <2e-16 ***
## ELEVATION   -0.0061018  0.0002968  -20.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.533 on 60 degrees of freedom
## Multiple R-squared:  0.8757, Adjusted R-squared:  0.8736 
## F-statistic: 422.6 on 1 and 60 DF,  p-value: < 2.2e-16

Probably the most important statistic is the p value for the explanatory variable ELEVATION, which in this case is very small: \(<2.2\times10^{-16}\).

The graph shows not only the linear prediction line but also the scatter plot of input points, displayed with connecting offsets indicating the magnitude of the residual (error term) of each data point away from the prediction.

Making Predictions

We can use the linear equation with the coefficients derived for the intercept and the explanatory variable (rounded a bit) to predict temperatures given the explanatory variable elevation:

\[ Temperature_{prediction} = 11.88 - 0.006{elevation} \]

So with these coefficients, we can apply the equation to a vector of elevations to predict temperature from those elevations:

a <- model1$coefficients[1]
b <- model1$coefficients[2]
elevations <- c(500, 1000, 1500, 2000)
elevations
## [1]  500 1000 1500 2000
tempEstimate <- a + b * elevations
tempEstimate
## [1]  8.8304692  5.7795580  2.7286468 -0.3222645
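Rather than transcribing coefficients by hand, R's predict() accepts a newdata data frame, which is less error-prone. A minimal sketch using a small made-up data frame (toy, not the Sierra data) so it stands alone:

```r
# Fit a toy linear model, then predict at new elevations with predict()
# rather than extracting and multiplying coefficients manually.
toy <- data.frame(ELEVATION   = c(100, 500, 1000, 2000, 3000),
                  TEMPERATURE = c(11, 9, 6, 0, -6))
toyModel <- lm(TEMPERATURE ~ ELEVATION, data = toy)
est <- predict(toyModel, newdata = data.frame(ELEVATION = c(500, 1000, 1500, 2000)))
est  # estimated temperatures decline with elevation, as in the Sierra model
```

With the Sierra model above, predict(model1, newdata = data.frame(ELEVATION = elevations)) would give the same values as the manual calculation.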

Next, we'll plot predictions and residuals on a map.

Spatial Influences on Statistical Analysis

Spatial statistical analysis brings the spatial dimension into a statistical analysis, ranging from visual analysis of patterns to specialized spatial statistical methods. There are many applications for these methods in environmental research, since spatial patterns are generally highly relevant. We might ask:

  • What patterns can we see?
  • What is the effect of scale?
  • Relationships among variables: do they vary spatially?

In addition to helping to identify variables not considered in the model, the effect of local conditions and the tendency of nearby observations to be related is rich for further exploration. Readers are encouraged to learn more about spatial statistics, and some methods are explored in Hijmans (n.d.).

But we'll use a fairly straightforward method that can help enlighten us about the effect of other variables and the influence of proximity: mapping residuals. Mapping residuals is often enlightening since they can illustrate the influence of other variables. There's even a recent blog on this: https://californiawaterblog.com/2021/10/17/sometimes-studying-the-variation-is-the-interesting-thing/?fbclid=IwAR1tfnj2SM_du6-jEd9h-AApRPO7w88TiPUbvTFz5dkagpQ1RMV04ZZYAr8

Mapping Residuals

If we start with a map of February temperatures in the Sierra and try to model these as predicted by elevation, we might see the following result. (First we'll build the model again, though this is probably still in memory.)

library(tidyverse)
library(igisci)
library(sf)
sierra <- st_as_sf(filter(sierraFeb, !is.na(TEMPERATURE)),
                   coords = c("LONGITUDE", "LATITUDE"), crs = 4326)
modelElev <- lm(TEMPERATURE ~ ELEVATION, data = sierra)
summary(modelElev)
## 
## Call:
## lm(formula = TEMPERATURE ~ ELEVATION, data = sierra)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9126 -1.0466 -0.0027  0.7940  4.5327 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.8813804  0.3825302   31.06   <2e-16 ***
## ELEVATION   -0.0061018  0.0002968  -20.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.533 on 60 degrees of freedom
## Multiple R-squared:  0.8757, Adjusted R-squared:  0.8736 
## F-statistic: 422.6 on 1 and 60 DF,  p-value: < 2.2e-16

We can see from the model summary that ELEVATION is significant as a predictor for TEMPERATURE, with a very low p value, and the R-squared value tells us that 87% of the variation in temperature might be "explained" by elevation. We'll store the residual (resid) and prediction (predict), and set up our mapping environment to create maps of the original data, prediction, and residuals.

cc <- modelElev$coefficients
sierra$resid <- resid(modelElev)
sierra$predict <- predict(modelElev)
eqn = paste("temperature =", round(cc[1], digits = 2), "+",
            round(cc[2], digits = 3), "elevation")
ct <- st_read(system.file("extdata", "sierra/CA_places.shp", package = "igisci"))
ct$AREANAME_pad <- paste0(str_replace_all(ct$AREANAME, '[A-Za-z]', ' '), ct$AREANAME)
bounds <- st_bbox(sierra)

The original data

library(tmap); library(terra); library(igisci)
hillsh <- rast(ex("CA/ca_hillsh_WGS84.tif"))
tm_shape(hillsh, bbox = bounds) +
  tm_raster(palette = "-Greys", style = "cont", legend.show = F) +
  tm_shape(st_make_valid(CA_counties)) + tm_borders() +
  tm_shape(ct) + tm_dots() +
  tm_text("AREANAME", size = 0.5, just = c("left", "top")) +
  tm_shape(sierra) +
  tm_symbols(col = "TEMPERATURE", palette = "-RdBu", size = 0.5) +
  tm_layout(legend.position = c("right", "top")) +
  tm_graticules(lines = F) +
  tm_layout(main.title = "February Normals",
            main.title.position = c("right"), main.title.size = 0.8)

Original February temperature data

Figure 11.2: Original February temperature data

The prediction map

library(tmap)
tm_shape(hillsh, bbox = bounds) +
  tm_raster(palette = "-Greys", style = "cont", legend.show = F) +
  tm_shape(st_make_valid(CA_counties)) + tm_borders() +
  tm_shape(ct) + tm_dots() +
  tm_text("AREANAME", size = 0.5, just = c("left", "top")) +
  tm_shape(sierra) +
  tm_symbols(col = "predict", palette = "-RdBu", size = 0.5) +
  tm_layout(legend.position = c("right", "top")) +
  tm_graticules(lines = F) +
  tm_layout(main.title = eqn,
            main.title.position = c("right"), main.title.size = 0.8)

Temperature predicted by elevation model

Figure 11.3: Temperature predicted by elevation model

We can also use the model to predict temperatures from a raster of elevations, which should match the elevations at the weather stations, allowing us to predict temperature for the remainder of the study area. We should have more confidence in the area (the Sierra Nevada and nearby areas) actually sampled by the stations, so we'll crop with the extent covering that area (though this will include more coastal areas that are likely to differ in February temperatures from patterns observed in the area sampled).

We'll use GCS elevation raster data.

library(igisci)
library(tmap); library(terra)
elev <- rast(ex("CA/ca_elev_WGS84.tif"))
elevSierra <- crop(elev, ext(-122, -118, 35, 41))  # GCS
b0 <- coef(modelElev)[1]
b1 <- coef(modelElev)[2]
tempSierra <- b0 + b1 * elevSierra
names(tempSierra) = "Predicted Temperature"
#tempCA <- b0 + b1 * elev
tm_shape(tempSierra) + tm_raster(palette = "-RdBu") +
  tm_graticules(lines = F) +
  tm_legend(position = c("right", "top")) +
  tm_shape(st_make_valid(CA_counties)) + tm_borders() +
  tm_shape(ct) + tm_dots() +
  tm_text("AREANAME", size = 0.5, just = c("left", "top")) +
  tm_layout(main.title = eqn,
            main.title.position = c("right"), main.title.size = 0.7)

Temperature predicted by elevation raster

Figure 11.4: Temperature predicted by elevation raster

The residuals

library(tmap)
tm_shape(hillsh, bbox = bounds) +
  tm_raster(palette = "-Greys", style = "cont", legend.show = F) +
  tm_shape(st_make_valid(CA_counties)) + tm_borders() +
  tm_shape(ct) + tm_dots() +
  tm_text("AREANAME", size = 0.5, just = c("left", "top")) +
  tm_shape(sierra) +
  tm_symbols(col = "resid", palette = "-RdBu", size = 0.5) +
  tm_layout(legend.position = c("right", "top"),
            main.title = paste("e from", eqn, "+ e"),
            main.title.size = 0.7,
            main.title.position = "right") +
  tm_graticules(lines = F)

Residuals

Figure xi.five: Residuals

Statistically, if the residuals from regression are spatially autocorrelated, we should look for patterns in the residuals to find other explanatory variables. Indeed, we can visually see, based upon the pattern of the residuals, that there's at least one important explanatory variable not included in the model: latitude. We'll leave that for you to look at in the exercises.
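As a sketch of that workflow (with synthetic stand-in data here, not the Sierra observations; with the real data the comparison would be TEMPERATURE ~ ELEVATION versus TEMPERATURE ~ ELEVATION + LATITUDE):

```r
# When residuals trend spatially, add the candidate variable and test
# whether the larger model significantly reduces residual variance.
set.seed(42)
elev <- runif(60, 0, 3000)
lat  <- runif(60, 35, 41)
temp <- 12 - 0.006 * elev - 1.5 * (lat - 38) + rnorm(60, sd = 0.5)
m1 <- lm(temp ~ elev)
m2 <- lm(temp ~ elev + lat)
anova(m1, m2)  # a significant F means lat adds explanatory power
```

All numbers here are made up to illustrate the comparison; the anova() of nested models is the standard way to decide whether the added variable is worth keeping.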

Analysis of Covariance

Analysis of Covariance, or ANCOVA, has the same purpose as the Analysis of Variance that we looked at under tests earlier, but also takes into account the influence of other variables, called covariates. In this way, analysis of covariance combines a linear model with an analysis of variance.

"Are water samples from streams draining sandstone, limestone, and shale different based on carbonate solutes, while taking into account elevation?"

The response variable is modeled from the factor (ANOVA) plus the covariate (regression):

  • ANOVA: TH ~ rocktype [where TH is total hardness, a measure of carbonate solutes]

  • Regression: TH ~ elevation

  • ANCOVA: TH ~ rocktype + elevation

    • However, it shouldn't involve interaction between rocktype (factor) and elevation (covariate):

Example: stream types distinguished by discharge and slope

Three common river types are meandering, braided, and anastomosed. For each, slope varies with bankfull discharge in a relationship that looks something like:

Slope by Q, river types

Figure 11.6: Slope by Q, river types

Braided river

Figure 11.7: Braided river

Meandering river

Figure 11.8: Meandering river

Anastomosed river

Figure 11.9: Anastomosed river

Requirements:

  • No relationship between discharge (covariate) and channel type (factor).
  • Another interpretation: the slope of the relationship between the covariate and response variable is about the same for each group; only the intercept differs. Assumes parallel slopes.

log10(S) ~ strtype * log10(Q) … interaction between covariate and factor

log10(S) ~ strtype + log10(Q) … no interaction, parallel slopes

If the models are not significantly different, remove the interaction term for parsimony; this also satisfies the ANCOVA requirement.

library(tidyverse)
csvPath <- system.file("extdata", "streams.csv", package = "igisci")
streams <- read_csv(csvPath)
streams$strtype <- factor(streams$type,
                          labels = c("Anastomosing", "Braided", "Meandering"))
summary(streams)
##      type                 Q                S                    strtype  
##  Length:41          Min.   :     6   Min.   :0.000011   Anastomosing: 8  
##  Class :character   1st Qu.:    15   1st Qu.:0.000100   Braided     :12  
##  Mode  :character   Median :    40   Median :0.000700   Meandering  :21  
##                     Mean   :  4159   Mean   :0.001737                    
##                     3rd Qu.:   550   3rd Qu.:0.002800                    
##                     Max.   :100000   Max.   :0.009500
ggplot(streams, aes(Q, S, color = strtype)) +
  geom_point()

Q vs S: needs log scaling

Figure 11.10: Q vs S: needs log scaling

library(scales)  # needed for the trans_format function below
ggplot(streams, aes(Q, S, color = strtype)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(trans = log10_trans(),
                     labels = trans_format("log10", math_format(10^.x))) +
  scale_y_continuous(trans = log10_trans(),
                     labels = trans_format("log10", math_format(10^.x)))

log scaled Q vs S with stream type

Figure 11.11: log scaled Q vs S with stream type

ancova = lm(log10(S) ~ strtype * log10(Q), data = streams)
summary(ancova)
## 
## Call:
## lm(formula = log10(S) ~ strtype * log10(Q), data = streams)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.63636 -0.13903 -0.00032  0.12652  0.60750 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -3.91819    0.31094 -12.601 1.45e-14 ***
## strtypeBraided              2.20085    0.35383   6.220 3.96e-07 ***
## strtypeMeandering           1.63479    0.33153   4.931 1.98e-05 ***
## log10(Q)                   -0.43537    0.18073  -2.409   0.0214 *  
## strtypeBraided:log10(Q)    -0.01488    0.19102  -0.078   0.9384    
## strtypeMeandering:log10(Q)  0.05183    0.18748   0.276   0.7838    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2656 on 35 degrees of freedom
## Multiple R-squared:  0.9154, Adjusted R-squared:  0.9033 
## F-statistic: 75.73 on 5 and 35 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: log10(S)
##                  Df  Sum Sq Mean Sq  F value    Pr(>F)    
## strtype           2 18.3914  9.1957 130.3650 < 2.2e-16 ***
## log10(Q)          1  8.2658  8.2658 117.1821 1.023e-12 ***
## strtype:log10(Q)  2  0.0511  0.0255   0.3619    0.6989    
## Residuals        35  2.4688  0.0705                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Now an additive model, which does not have that interaction
ancova2 = lm(log10(S) ~ strtype + log10(Q), data = streams)
anova(ancova2)
## Analysis of Variance Table
## 
## Response: log10(S)
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## strtype    2 18.3914  9.1957  135.02 < 2.2e-16 ***
## log10(Q)   1  8.2658  8.2658  121.37  3.07e-13 ***
## Residuals 37  2.5199  0.0681                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Model 1: log10(S) ~ strtype * log10(Q)
## Model 2: log10(S) ~ strtype + log10(Q)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     35 2.4688                           
## 2     37 2.5199 -2 -0.051051 0.3619 0.6989
# Not significantly different, so model simplification is justified.
# Now we remove the strtype term
ancova3 = update(ancova2, ~ . - strtype)
anova(ancova2, ancova3)
## Analysis of Variance Table
## 
## Model 1: log10(S) ~ strtype + log10(Q)
## Model 2: log10(S) ~ log10(Q)
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     37  2.5199                                  
## 2     39 25.5099 -2    -22.99 168.78 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Goes too far: removing strtype creates a significantly different model.
step(ancova)
## Start:  AIC=-103.2
## log10(S) ~ strtype * log10(Q)
## 
##                    Df Sum of Sq    RSS     AIC
## - strtype:log10(Q)  2  0.051051 2.5199 -106.36
## <none>                          2.4688 -103.20
## 
## Step:  AIC=-106.36
## log10(S) ~ strtype + log10(Q)
## 
##            Df Sum of Sq     RSS      AIC
## <none>                   2.5199 -106.364
## - log10(Q)  1    8.2658 10.7857  -48.750
## - strtype   2   22.9901 25.5099  -15.455
## 
## Call:
## lm(formula = log10(S) ~ strtype + log10(Q), data = streams)
## 
## Coefficients:
##       (Intercept)     strtypeBraided  strtypeMeandering           log10(Q)  
##           -3.9583             2.1453             1.7294            -0.4109

ANOVA & ANCOVA are applications of a linear model.

  • Uses lm in R
  • Response variable is continuous, assumed normally distributed

mymodel = lm(log10(S) ~ strtype + log10(Q))

  • The linear model, with a categorical explanatory variable + covariate

anova(mymodel)

  • Displays the Analysis of Variance table from the linear model

Generalized Linear Model (GLM)

The glm in R allows you to work with diverse types of data using various distributions, described as families, such as:

  • gaussian : normal distribution – what is used with lm

  • binomial : logit – used with probabilities.

    • Used for logistic regression
    • A simple example of predicting a dog panting or not (a binomial, since the dog either panted or didn't) related to temperature: https://bscheng.com/2016/10/23/logistic-regression-in-r/
  • poisson : for counts. Commonly used for species counts.

  • see help(glm) for other examples
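To make the poisson case concrete, here's a minimal sketch on simulated species counts (the variable names and numbers are made up for illustration):

```r
# A poisson-family GLM: counts related to a habitat score through the
# default log link, fit on simulated data.
set.seed(1)
habitat <- runif(50, 0, 10)
counts  <- rpois(50, lambda = exp(0.2 + 0.25 * habitat))
pois1 <- glm(counts ~ habitat, family = poisson)
coef(pois1)  # coefficients are on the log scale
```

Because of the log link, a one-unit increase in habitat multiplies the expected count by exp(coefficient) rather than adding to it.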

Binomial family: Logistic GLM with streams

When the glm family is binomial, we might do what's sometimes called logistic regression, where we look at the probability of a categorical value (might be a species occurrence, for instance) given various independent variables. Let's try this with the streams data we just used for analysis of covariance, to predict the probability of a particular stream type from slope and discharge:

library(igisci)
library(tidyverse)
streams <- read_csv(ex("streams.csv"))
streams$strtype <- factor(streams$type,
                          labels = c("Anastomosing", "Braided", "Meandering"))

For logistic regression, we need probability data, so locations where we know the probability of something being true. If you think about it, that's pretty difficult to know for an observation, but we might know whether something is true or not, like we observed the phenomenon or not, so we simply assign a probability of 1 if something is observed, and 0 if it's not. In the case of the stream types, we really have three different probabilities: whether it's true that the stream is anastomosing, whether it's braided, and whether it's meandering, so we'll create three dummy variables with values of 1 or 0: if it's braided, the value for braided is 1, otherwise it's 0; the same applies to all 3 dummy variables:

streams <- streams %>%
  mutate(braided     = ifelse(type == "B", 1, 0),
         meandering  = ifelse(type == "M", 1, 0),
         anastomosed = ifelse(type == "A", 1, 0))
streams

## # A tibble: 41 x 7
##    type       Q       S strtype braided meandering anastomosed
##    <chr>  <dbl>   <dbl> <fct>     <dbl>      <dbl>       <dbl>
##  1 B        9.9 0.0045  Braided       1          0           0
##  2 B       11   0.007   Braided       1          0           0
##  3 B       16   0.0068  Braided       1          0           0
##  4 B       25   0.0042  Braided       1          0           0
##  5 B       40   0.0028  Braided       1          0           0
##  6 B      150   0.0041  Braided       1          0           0
##  7 B      190   0.002   Braided       1          0           0
##  8 B      550   0.0007  Braided       1          0           0
##  9 B     1100   0.00055 Braided       1          0           0
## 10 B     7100   0.00042 Braided       1          0           0
## # ... with 31 more rows

Then we can run the logistic regression on each one:

summary(glm(braided ~ log(Q) + log(S), family=binomial, data=streams))

## 
## Call:
## glm(formula = braided ~ log(Q) + log(S), family = binomial, data = streams)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.00386  -0.49007  -0.00961   0.15789   1.58088  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  17.5853     6.7443   2.607  0.00912 **
## log(Q)        2.0940     0.8135   2.574  0.01005 * 
## log(S)        4.3115     1.6173   2.666  0.00768 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 49.572  on 40  degrees of freedom
## Residual deviance: 23.484  on 38  degrees of freedom
## AIC: 29.484
## 
## Number of Fisher Scoring iterations: 8
summary(glm(meandering ~ log(Q) + log(S), family=binomial, data=streams))

## 
## Call:
## glm(formula = meandering ~ log(Q) + log(S), family = binomial, 
##     data = streams)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5469  -1.2515   0.8197   1.1073   1.3072  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  2.43255    1.38078   1.762   0.0781 .
## log(Q)       0.04984    0.13415   0.372   0.7103  
## log(S)       0.34750    0.19471   1.785   0.0743 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 56.814  on 40  degrees of freedom
## Residual deviance: 53.074  on 38  degrees of freedom
## AIC: 59.074
## 
## Number of Fisher Scoring iterations: 4
summary(glm(anastomosed ~ log(Q) + log(S), family=binomial, data=streams))

## 
## Call:
## glm(formula = anastomosed ~ log(Q) + log(S), family = binomial, 
##     data = streams)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -2.815e-05  -2.110e-08  -2.110e-08  -2.110e-08   2.244e-05  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -176.09  786410.69       0        1
## log(Q)         -11.81   77224.35       0        1
## log(S)         -24.18   70417.40       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.0472e+01  on 40  degrees of freedom
## Residual deviance: 1.3027e-09  on 38  degrees of freedom
## AIC: 6
## 
## Number of Fisher Scoring iterations: 25

So what we can see from this admittedly small sample is that:

  • Predicting braided character from Q and S works pretty well, distinguishing these characteristics from the characteristics of all other streams [`braided==0`]
  • Predicting anastomosed or meandering however isn't significant.

We really need a larger sample size, and North American geomorphology has recently recognized the "anabranching" stream type (of which anastomosed is a type) that had long been recognized in Australia, and its characteristics overlap with meandering at least, thus muddying the waters…

Logistic landslide model

With its active tectonics, steep slopes, and relatively common intense rain events during the winter season, coastal California has a propensity for landslides. Pacifica has been the focus of many landslide studies by the USGS and others, with events ranging from coastal erosion influenced by wave action to landslides and debris flows on steep hillslopes driven by positive pore pressure (Ellen and Wieczorek (1988)).

Landslide in San Pedro Creek watershed

Figure 11.12: Landslide in San Pedro Creek watershed

Sediment source analysis

Figure 11.13: Sediment source analysis

Also see the relevant section in the San Pedro Creek Virtual Fieldtrip story map: J. Davis (n.d.).

Landslides detected in San Pedro Creek Watershed in Pacifica, CA, are in the SanPedro folder of extdata. Landslides were detected on aerial photography from 1941, 1955, 1975, 1983, and 1997, and confirmed in the field during a study in the early 2000s (Sims (2004), J. Davis and Blesius (2015)). We'll create a logistic regression to compare landslide probability to various environmental factors. We'll use elevation, slope and other derivatives, distance to streams, distance to trails, and the results of a physical model predicting slope stability as a stability index (SI), using slope and soil transmissivity data.

library(igisci); library(sf); library(tidyverse); library(terra)
slides <- st_read(ex("SanPedro/slideCentroids.shp")) %>%
  mutate(IsSlide = 1)
slides$Visible <- factor(slides$Visible)
trails <- st_read(ex("SanPedro/trails.shp"))
streams <- st_read(ex("SanPedro/streams.shp"))
wshed <- st_read(ex("SanPedro/SPCWatershed.shp"))
ggplot(data=slides) + geom_sf(aes(col=Visible)) +
  scale_color_brewer(palette = "Spectral") +
  geom_sf(data=streams, col = "blue") +
  geom_sf(data=wshed, fill = NA)

Landslides in San Pedro Creek watershed

Figure 11.14: Landslides in San Pedro Creek watershed

At this point we have data for occurrences of landslides, so for each point we assign the dummy variable IsSlide = 1. We don't have any absence data, but we can create some using random points and assign them IsSlide = 0. It would probably be a good idea to avoid areas very close to the occurrences, but we'll just put random points everywhere and only remove the ones that don't have SI data.

For the random locations, we'll set a seed so that for the book the results will stay the same; and if we're good with that, since it doesn't really matter what seed we use, we might as well use the answer to everything: 42.

library(terra)
elev <- rast(ex("SanPedro/dem.tif"))
SI <- rast(ex("SanPedro/SI.tif"))
npts <- length(slides$Visible) * 2   # up it a little to get enough after clip
set.seed(42)
x <- runif(npts, min = xmin(elev), max = xmax(elev))
y <- runif(npts, min = ymin(elev), max = ymax(elev))
rsampsRaw <- st_as_sf(data.frame(x, y), coords = c("x","y"), crs = crs(elev))
ggplot(data=rsampsRaw) + geom_sf(size = 0.5) +
  geom_sf(data=streams, col = "blue") +
  geom_sf(data=wshed, fill = NA)

Raw random points

Figure 11.15: Raw random points

Now we have twice as many random points as we have landslides, within the rectangular extent of the watershed, as defined by the elevation raster. We doubled these because we need to exclude some points, yet still have roughly as many random points as slides. We'll exclude:

  • all points outside the watershed
  • points within 100 m of the landslide occurrences, the distance based on knowledge of the grain of landscape features (like spurs and swales) that might influence landslide occurrence
slideBuff <- st_buffer(slides, 100)
wshedNoSlide <- st_difference(wshed, st_union(slideBuff))

The 100 m buffer excludes all areas around all landslides in the data set. We're going to run the model on specific years, but the best option for landslide absences will be to exclude areas around all landslides we know of.

ggplot(data=slides) + geom_sf(aes(col=Visible)) +
  scale_color_brewer(palette = "Spectral") +
  geom_sf(data=streams, col = "blue") +
  geom_sf(data=wshedNoSlide, fill = NA)

Landslides and buffers to exclude from random points

Figure 11.16: Landslides and buffers to exclude from random points

rsamps <- st_intersection(rsampsRaw, wshedNoSlide) %>%
  mutate(IsSlide = 0)
allPts <- bind_rows(slides, rsamps)
ggplot(data=allPts) + geom_sf(aes(col=IsSlide)) +
  geom_sf(data=streams, col = "blue") +
  geom_sf(data=wshed, fill = NA)

Landslides and random points (excluded from slide buffers)

Figure 11.17: Landslides and random points (excluded from slide buffers)

Now we'll derive some environmental variables as rasters and extract their values to the slide/noslide point data set. We'll use the shade function to create an exposure raster with the sun at the average noon angle from the south. Note the use of dplyr::rename to create meaningful variable names for rasters that had names like "lyr1" (who knows why?), which I determined by running it without the rename and seeing what it produced.

slidesV <- vect(allPts)
strms <- rasterize(vect(streams), elev)
tr <- rasterize(vect(trails), elev)
stD <- distance(strms); names(stD) = "stD"
trD <- distance(tr); names(trD) = "trD"
slope <- terrain(elev, v = "slope")
curv <- terrain(slope, v = "slope")
aspect <- terrain(elev, v = "aspect")
exposure <- shade(slope/180*pi, aspect/180*pi, angle = 53, direction = 180)
elev_ex <- terra::extract(elev, slidesV) %>% rename(elev = dem) %>% dplyr::select(-ID)
SI_ex <- terra::extract(SI, slidesV) %>% dplyr::select(-ID)
slope_ex <- terra::extract(slope, slidesV) %>% dplyr::select(-ID)
curv_ex <- terra::extract(curv, slidesV) %>% rename(curv = slope) %>% dplyr::select(-ID)
exposure_ex <- terra::extract(exposure, slidesV) %>% rename(exposure = lyr1) %>% dplyr::select(-ID)
stD_ex <- terra::extract(stD, slidesV) %>% dplyr::select(-ID)
trD_ex <- terra::extract(trD, slidesV) %>% dplyr::select(-ID)
slidePts <- cbind(allPts, elev_ex, SI_ex, slope_ex, curv_ex, exposure_ex, stD_ex, trD_ex)
slidesData <- as.data.frame(slidePts) %>%
  filter(!is.nan(SI)) %>% st_as_sf(crs = st_crs(slides))

We'll filter for just the slides in the 1983 imagery, and for the model also include all of the random points (identified by NA values for Visible):

slides83 <- slidesData %>% filter(Visible == 1983)
slides83plusr <- slidesData %>% filter(Visible == 1983 | is.na(Visible))

The logistic regression in the glm model uses the binomial family:

summary(glm(IsSlide ~ SI + curv + elev + slope + exposure + stD + trD,
            family=binomial, data=slides83plusr))
## 
## Call:
## glm(formula = IsSlide ~ SI + curv + elev + slope + exposure + 
##     stD + trD, family = binomial, data = slides83plusr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3819  -0.5882  -0.0002   0.7378   3.9146  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.1426071  1.0618610   0.134  0.89317    
## SI          -1.8190286  0.3639202  -4.998 5.78e-07 ***
## curv         0.0059237  0.0113657   0.521  0.60223    
## elev        -0.0025050  0.0014398  -1.740  0.08190 .  
## slope        0.0623201  0.0211010   2.953  0.00314 ** 
## exposure     2.6971912  0.6603844   4.084 4.42e-05 ***
## stD         -0.0052639  0.0019765  -2.663  0.00774 ** 
## trD         -0.0020730  0.0006532  -3.174  0.00150 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 668.29  on 483  degrees of freedom
## Residual deviance: 402.90  on 476  degrees of freedom
## AIC: 418.9
## 
## Number of Fisher Scoring iterations: 8

So while we might want to make sure that we've picked the right variables, some conclusions from this might be that stability index (SI) is not surprisingly a strong predictor of landslides, but also that other than elevation and curvature, the other variables are also significant.

Map the logistic result

We should be able to, but we need to use the formula for the model, which in our case will have five explanatory variables SI (\(X_1\)), slope (\(X_2\)), exposure (\(X_3\)), stD (\(X_4\)) and trD (\(X_5\)): \[p = \frac{e^{b_0+b_1X_1+b_2X_2+b_3X_3+b_4X_4+b_5X_5}}{1+e^{b_0+b_1X_1+b_2X_2+b_3X_3+b_4X_4+b_5X_5}}\]

So we retrieve the coefficients with coef() and use map algebra to create the prediction map, including dots for the landslides actually observed in 1983:

library(tmap)
SI_stD_model <- glm(IsSlide ~ SI + slope + exposure + stD + trD,
                    family=binomial, data=slides83plusr)
b0 <- coef(SI_stD_model)[1]
b1 <- coef(SI_stD_model)[2]
b2 <- coef(SI_stD_model)[3]
b3 <- coef(SI_stD_model)[4]
b4 <- coef(SI_stD_model)[5]
b5 <- coef(SI_stD_model)[6]
predictionMap <- (exp(b0+b1*SI+b2*slope+b3*exposure+b4*stD+b5*trD)) /
  (1 + exp(b0+b1*SI+b2*slope+b3*exposure+b4*stD+b5*trD))
names(predictionMap) <- "slide_probability_1983"
tm_shape(predictionMap) + tm_raster() +
  tm_shape(streams) + tm_lines(col = "blue") +
  tm_shape(wshed) + tm_borders(col = "black") +
  tm_shape(slides83) + tm_dots(col = "green") +
  tm_layout(title = "Predicted by SI, slope, exposure, streamD, trailD") +
  tm_graticules(lines = F)

Logistic model prediction of 1983 landslide probability

Figure 11.18: Logistic model prediction of 1983 landslide probability
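As an aside, hand-coding the coefficients makes the logistic formula explicit, but R's predict() with type = "response" applies the same inverse-logit transformation. A minimal sketch on simulated data (not the landslide data) showing the two agree:

```r
# Simulated presence/absence data for a logistic model
set.seed(42)
x <- rnorm(200)
y <- rbinom(200, 1, 1 / (1 + exp(-(0.5 + 2 * x))))
fit <- glm(y ~ x, family = binomial)
# Hand-coded probability, as in the map-algebra version
lp     <- coef(fit)[1] + coef(fit)[2] * x
byHand <- exp(lp) / (1 + exp(lp))
# predict(type = "response") gives the same probabilities
all.equal(unname(byHand), unname(predict(fit, type = "response")))
```

terra also provides a predict() method for applying a fitted model to rasters, which may be worth exploring as an alternative to explicit map algebra.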

Poisson regression

Poisson regression (family=poisson in glm) is for analyzing counts. You might use this to determine which variables are significant predictors, or you might want to see where or when counts are especially high, or higher than expected. Or it might be applied to compare groupings. For example, if instead of looking at differences among rock types in Sinking Cove based on chemical properties in water samples or measurements (where we applied ANOVA), we looked at differences in fish counts, that might be an application of poisson regression, as long as other data assumptions hold.

  • An explanation of the poisson distribution using meteor showers at: https://towardsdatascience.com/the-poisson-distribution-and-poisson-process-explained-4e2cb17d459
  • An explanation of poisson distribution applied to schooling data: https://stats.oarc.ucla.edu/r/dae/poisson-regression/
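The mechanics can be sketched with simulated counts (hypothetical data, not from the book): glm with family = poisson uses a log link, so coefficients are on the log scale and fitted means are exp() of the linear predictor.

```r
# Simulated counts whose log-mean is linear in x
set.seed(42)
x <- runif(200, 0, 2)
counts <- rpois(200, lambda = exp(0.3 + 1.1 * x))
pfit <- glm(counts ~ x, family = poisson)
coef(pfit)   # estimates on the log scale, near the true 0.3 and 1.1
# Fitted means are exp() of the linear predictor (log link)
all.equal(unname(fitted(pfit)[1]),
          exp(sum(coef(pfit) * c(1, x[1]))))
```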

Seabird Poisson Model

This model is further developed in a case study where we'll map a prediction, but we'll look at the basics for applying a poisson family model in glm: seabird observations. The data are potentially suitable since they're counts with known time intervals of observation effort, and initial review of the data suggests that the mean count is reasonably similar to the variance. The data were collected through the ACCESS program (Applied California Current Ecosystem Studies (n.d.)) and a thorough analysis provided by Studwell et al. (2017).

library(igisci); library(sf); library(tidyverse); library(tmap)
library(maptiles)
transects <- st_read(ex("SFmarine/transects.shp"))

One of the seabirds studied, the black-footed albatross, has the lowest counts, so it is a relatively rare bird, and studies of its distribution may be informative. We'll look at counts collected during transect cruises in July 2006.

tmap_mode("plot")
oceanBase <- get_tiles(transects, provider = "Esri.OceanBasemap")
transJul2006 <- transects %>%
  filter(month == 7 & yr == 2006 & avg_tem > 0 & avg_sal > 0 & avg_fluo > 0)
tm_shape(oceanBase) + tm_rgb() +
  tm_shape(transJul2006) + tm_symbols(col = "bfal", size = "bfal")

Black-footed albatross counts, July 2006

Figure 11.19: Black-footed albatross counts, July 2006

Prior studies have suggested that temperature, salinity, fluorescence, depth, and various distances might be good explanatory variables for looking at spatial patterns, so we'll use these in the model.

summary(glm(bfal ~ avg_tem + avg_sal + avg_fluo + avg_dep + dist_land +
            dist_isla + dist_200m + dist_cord,
            data=transJul2006, family=poisson))
## 
## Call:
## glm(formula = bfal ~ avg_tem + avg_sal + avg_fluo + avg_dep + 
##     dist_land + dist_isla + dist_200m + dist_cord, family = poisson, 
##     data = transJul2006)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2935  -0.4646  -0.2445  -0.0945   4.7760  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.754e+02  7.032e+01  -2.495  0.01260 *  
## avg_tem      7.341e-01  3.479e-01   2.110  0.03482 *  
## avg_sal      5.202e+00  2.097e+00   2.480  0.01312 *  
## avg_fluo    -6.978e-01  1.217e+00  -0.574  0.56626    
## avg_dep     -3.561e-03  8.229e-04  -4.328 1.51e-05 ***
## dist_land   -2.197e-04  6.194e-05  -3.547  0.00039 ***
## dist_isla    1.372e-05  1.778e-05   0.771  0.44043    
## dist_200m   -4.696e-04  1.037e-04  -4.528 5.95e-06 ***
## dist_cord   -5.213e-06  1.830e-05  -0.285  0.77570    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 152.14  on 206  degrees of freedom
## Residual deviance: 105.73  on 198  degrees of freedom
## AIC: 160.37
## 
## Number of Fisher Scoring iterations: 7

We can see in the model coefficients table several predictive variables that appear to be significant, which we'll explore further in the case study.

  • avg_tem
  • avg_sal
  • avg_dep
  • dist_land
  • dist_200m

Note that R will allow you to run glm with any family, but it's up to you to make sure it's appropriate. For example, if you provide a response variable that is not count data or doesn't fit the requirements for a poisson model, R will create a result but it's bogus.
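One rough diagnostic (a sketch on simulated data; a rule of thumb, not a formal test): for a poisson model, the residual deviance should be comparable to the residual degrees of freedom; a ratio much greater than 1 suggests overdispersion and that the poisson family may be inappropriate.

```r
# Simulated counts that really are poisson-distributed
set.seed(42)
x <- runif(100)
y <- rpois(100, exp(1 + x))
fit <- glm(y ~ x, family = poisson)
# Dispersion ratio: near 1 here; >> 1 would suggest overdispersion
deviance(fit) / df.residual(fit)
```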

Models Employing Machine Learning

Models using machine learning algorithms are commonly used in data science, fitting with its general exploratory and data-mining approach. There are many machine learning algorithms, and many resources for learning more about them, but they all share a basically black-box approach where a collection of variables is explored for patterns in input variables that help to explain a response variable. The latter is similar to the more conventional statistical modeling described above, with the difference being the machine learning approach – think of robots going through your data looking for connections.

We'll explore a few machine learning methods (such as neural networks, random forests, and support vector machines) in a later chapter when we attempt to use them to classify satellite imagery, an important environmental application. As this application is in the spatial domain, we might want to look further into spatial statistical methods.

Exercise

  1. Start by creating a new RStudio project "modeling" including a "data" folder we'll use for the last question when you bring in your own data.

  2. Add LATITUDE as a second independent variable after ELEVATION in the model predicting February temperatures. Note that you'll need to change a setting in st_as_sf to not remove LATITUDE and LONGITUDE, then you'll need to modify the lm regression formula to include both ELEVATION + LATITUDE. First derive and display the model summary results, then answer the question: is the explanation better based on the \(r^2\)?

  3. Now map the predictions and residuals.

Goal of Question 2

Figure 11.20: Goal of Question 2

  1. Can you see any potential for additional variables? What evidence might there be for localized effects?

  2. Let's build a prediction raster from the above model. Start by reading in the ca_elev_WGS84.tif raster as was done before, with the same crop, again assigning it to elevSierra, and get the three coefficients b0, b1, and b2.

  • Now use the following code to create a raster of the same dimensions as elevSierra, but with latitude assigned as z values to the cells:
    lats <- yFromCell(elevSierra, 1:ncell(elevSierra))
    latitude <- setValues(elevSierra, lats)
    plot(latitude)

Latitude raster

Figure 11.21: Latitude raster

  • Then build a raster prediction of temperature from the model and the two explanatory rasters.

Goal of Question 4

Figure 11.22: Goal of Question 4
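The prediction-raster step above can be sketched with raster algebra, which applies the regression equation cell by cell. This is only a self-contained illustration: the coefficient values and the synthetic 10 × 10 elevation raster below are made up, standing in for the coefficients of your question-2 model and the cropped elevSierra (it assumes the terra package, used elsewhere in this book for rasters):

```r
library(terra)

# Hypothetical coefficients standing in for coef() of the two-variable lm:
# intercept, elevation term, latitude term
b0 <- 35; b1 <- -0.006; b2 <- -0.5

# Synthetic elevation raster standing in for the cropped elevSierra
elevSierra <- rast(nrows = 10, ncols = 10,
                   xmin = -121, xmax = -119, ymin = 37, ymax = 39,
                   vals = runif(100, 0, 3000))

# Latitude raster built the same way as in the exercise code above
lats <- yFromCell(elevSierra, 1:ncell(elevSierra))
latitude <- setValues(elevSierra, lats)

# Raster algebra applies b0 + b1*elev + b2*lat to every cell
tempPred <- b0 + b1 * elevSierra + b2 * latitude
plot(tempPred)
```

With your real model, you would replace the three hypothetical coefficients with `coef(yourModel)[1:3]` and use the actual elevSierra.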

  1. Modify the code for the San Pedro Creek Watershed logistic model to look at 1975 landslides. Are the conclusions different?

  2. Using your own data, which should have location data as well as other attributes, create a statistical model of your choice, and provide your assessment of the results, a prediction map, and where appropriate mapped residuals.

  3. Optionally, using your own data (no location data needed), create a glm binomial model. If you don't have anything obvious, consider doing something like the study of a pet dog panting based on temperature. https://bscheng.com/2016/10/23/logistic-regression-in-r/
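As a starting point for that last question, here's a minimal sketch of a binomial glm on simulated data (the dog-panting variables and coefficients below are invented for illustration, not taken from the linked study):

```r
set.seed(1)
temp <- runif(200, 10, 40)                            # air temperature, deg C
# Simulated 0/1 response: probability of panting rises with temperature
panting <- rbinom(200, 1, plogis(-10 + 0.4 * temp))

dogModel <- glm(panting ~ temp, family = binomial)
summary(dogModel)   # coefficients are on the log-odds (logit) scale

# Predicted probability of panting at 35 degrees
predict(dogModel, data.frame(temp = 35), type = "response")
```

Note that the binomial family models the log-odds of the outcome, so `type = "response"` is needed to get predictions back on the probability scale.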

References

Applied California Current Ecosystem Studies. n.d. https://www.accessoceans.org.

Davis, Jerry. n.d. San Pedro Creek Watershed Virtual Fieldtrip: Story Map. https://storymaps.arcgis.com/stories/62705877a9f64ac5956a64230430c248.

Davis, Jerry, and Leonhard Blesius. 2015. "A Hybrid Physical and Maximum-Entropy Landslide Susceptibility Model." Entropy 17 (6): 4271–92. https://www.mdpi.com/1099-4300/17/6/4271.

Ellen, Stephen D., and Gerald F. Wieczorek. 1988. Landslides, Floods, and Marine Effects of the Storm of January 3-5, 1982, in the San Francisco Bay Region, California. Vol. 1434. USGS. https://pubs.usgs.gov/pp/1988/1434/.

Hijmans, Robert J. n.d. Spatial Data Science. https://rspatial.org.

Sims, Stephanie. 2004. Hillslope Sediment Source Assessment of San Pedro Creek Watershed, California. https://geog.sfsu.edu/thesis/hillslope-sediment-source-assessment-san-pedro-creek-watershed-california.

Studwell, Anna, Ellen Hines, Meredith L. Elliott, Julie Howar, Barbara Holzman, Nadav Nur, and Jaime Jahncke. 2017. "Modeling Nonresident Seabird Foraging Distributions to Inform Ocean Zoning in Central California." PLoS One. https://doi.org/10.1371/journal.pone.0169517.

Source: https://bookdown.org/igisc/EnvDataSci/modeling-in-r.html
