The goal of the script is to fix issues in NuSEDS and to merge the two constituent datasets all_areas_nuseds (referred to as NUSEDS) and conservation_unit_system_sites (referred to as CUSS).

The dataset version corresponding to this script is:

Fisheries and Oceans Canada. 2024. NuSEDS - New Salmon Escapement Database System. Pacific Biological Station. Retrieved January 22, 2024, from https://open.canada.ca/data/en/dataset/c48669a3-045b-400d-b730-48aafe8c5ee6.

1 Import datasets

The NuSEDS (all_areas_nuseds) and CUSS (conservation_unit_system_sites) datasets are imported and duplicated rows are removed. The columns considered for NuSEDS are related to the population, location, fish counts and year of assessment:

##  [1] "SPECIES"                   "POP_ID"                   
##  [3] "GFE_ID"                    "ANALYSIS_YR"              
##  [5] "NATURAL_ADULT_SPAWNERS"    "NATURAL_JACK_SPAWNERS"    
##  [7] "NATURAL_SPAWNERS_TOTAL"    "ADULT_BROODSTOCK_REMOVALS"
##  [9] "JACK_BROODSTOCK_REMOVALS"  "TOTAL_BROODSTOCK_REMOVALS"
## [11] "OTHER_REMOVALS"            "TOTAL_RETURN_TO_RIVER"

The following 1 duplicated rows in NuSEDS when considering these columns are removed:

##        SPECIES POP_ID GFE_ID ANALYSIS_YR NATURAL_ADULT_SPAWNERS
## 190541    Chum  51316   1174        2011                     NA
##        NATURAL_JACK_SPAWNERS NATURAL_SPAWNERS_TOTAL ADULT_BROODSTOCK_REMOVALS
## 190541                    NA                     NA                        NA
##        JACK_BROODSTOCK_REMOVALS TOTAL_BROODSTOCK_REMOVALS OTHER_REMOVALS
## 190541                       NA                        NA             NA
##        TOTAL_RETURN_TO_RIVER
## 190541                    NA

For CUSS, 0 duplicated rows are found:

##  [1] MAP_LABEL         GFE_ID            SYSTEM_SITE       GFE_TYPE         
##  [5] SPECIES_QUALIFIED Y_LAT             X_LONGT           FAZ_ACRO         
##  [9] MAZ_ACRO          JAZ_ACRO          CU_NAME           CU_ACRO          
## [13] CU_LAT            CU_LONGT          CU_TYPE           CU_INDEX         
## [17] FULL_CU_IN        SBJ_ID            POP_ID            IS_INDICATOR     
## [21] CMNTS             EFFECTIVE_DT      WATERSHED_CDE     FWA_WATERSHED_CDE
## <0 rows> (or 0-length row.names)

The list of conservation units (CUs) as shown in the Pacific Salmon Explorer (PSE) is imported. The columns cu_name_pse and cu_name_dfo correspond to CU_NAME in CUSS, cu_index corresponds to FULL_CU_INDEX.

##   region species_name species_abbr cuid             cu_name_pse   cu_name_dfo
## 1 Skeena Lake sockeye          SEL  171                Alastair      Alastair
## 2 Skeena Lake sockeye          SEL  174           Ecstall/Lower Ecstall/Lower
## 3 Skeena Lake sockeye          SEL  175                Johnston      Johnston
## 4 Skeena Lake sockeye          SEL  176             Kitsumkalum   Kitsumkalum
## 5 Skeena Lake sockeye          SEL  177                 Lakelse       Lakelse
## 6 Skeena Lake sockeye          SEL  172 Mcdonell/Dennis/Aldrich       Aldrich
##    cu_index
## 1 SEL-20-01
## 2 SEL-20-04
## 3 SEL-20-05
## 4 SEL-20-06
## 5 SEL-20-07
## 6 SEL-20-02

The DFO file of the stream locations and coordinates (emailed from Wu Zhipeng, DFO, 09/04/2024) is imported. The columns NME and ID correspond to SYSTEM_SITE and GFE_ID in CUSS, respectively.

## # A tibble: 6 × 4
##   NME               ID X_LONGT           Y_LAT           
##   <chr>          <dbl> <chr>             <chr>           
## 1 STANLEY CREEK   1557 -132.691420767338 53.9746116925906
## 2 DAVIDSON CREEK  1558 -132.682181062338 53.9544135568635
## 3 NADEN RIVER     1559 -132.672499482506 53.9448571061959
## 4 LIGNITE CREEK   1560 -132.601901312242 53.97114945092  
## 5 OTUN RIVER      1561 -132.260106013    54.07328672     
## 6 BILL CREEK      1562 -132.395009112518 53.7494926521361

2 Initial modidifications

We remove from NUSEDS and CUSS data related to steelhead, Atlantic salmon and Kokanee salmon, corresponding to 941 rows and 133 populations (POP_ID) in NUSEDS and 0 in CUSS.

We create the column IndexId, which the the combination of the species acronym (i.e., CO, CM, CN, PKO, PKE, SX) and POP_ID:

## [1] "CN_48442"  "CO_590"    "CO_1794"   "CO_1844"   "CO_2172"   "PKE_51264"

We create the column MAX_ESTIMATE which is the maximum value of the fish count columns:

## [1] "NATURAL_ADULT_SPAWNERS"    "NATURAL_JACK_SPAWNERS"    
## [3] "NATURAL_SPAWNERS_TOTAL"    "ADULT_BROODSTOCK_REMOVALS"
## [5] "JACK_BROODSTOCK_REMOVALS"  "TOTAL_BROODSTOCK_REMOVALS"
## [7] "OTHER_REMOVALS"            "TOTAL_RETURN_TO_RIVER"

3 Fixes on NUSEDS and CUSS

3.1 Fix: coordinates of certain locations in CUSS

There are multiple locations in CUSS that have different SYSTEM_SITE and GFE_ID but exactly the same geographic coordinates. This is because the coordinates are not defined as the survey location but the “Location of the mouth of the waterbody if flowing, or the centroid if not.”. There can be large distances between the coordinates in CUSS and the real survey locations for large water bodies. We consequently manually update the coordinates of certain of these locations.

There are 21 unique coordinates with multiple SYSTEM_SITE and GFE_ID (n = 41). We group locations with identical coordinates and for each of the group (identified with a letter in the table below), we manually modify the coordinates of the locations relatively to each other using SYSTEM_SITE:

##                       SYSTEM_SITE  GFE_ID     Y_LAT     X_LONGT group
## 1  FENNELL CREEK AND SASKUM CREEK    2746 51.352577 -119.723268     A
## 2                   FENNELL CREEK     261 51.352577 -119.723268     A
## 3                    CENTRE CREEK      82 49.103185 -121.536585     B
## 4                  CENTRE CHANNEL    2533 49.103185 -121.536585     B
## 5                     FOLEY CREEK      78 49.106138 -121.635958     C
## 6        FOLEY CREEK SIDE CHANNEL     381 49.106138 -121.635958     C
## 7                   NORRISH CREEK      51 49.173172 -122.135356     D
## 8                   HAWKINS CREEK      54 49.173172 -122.135356     D
## 9                    BAXTER BEACH   11513 52.722503 -120.860114     E
## 10             BEAR BEACH - SHORE   11514 52.722503 -120.860114     E
## 11          BETTY FRANK'S - SHORE   11515 52.722503 -120.860114     E
## 12         LILLOOET RIVER - LOWER     177 49.747468 -122.145828     F
## 13                 LILLOOET RIVER    2475 49.747468 -122.145828     F
## 14                  CAYOOSH CREEK     131 50.668727 -121.972610     G
## 15       SETON AND CAYOOSH CREEKS 7990579 50.668727 -121.972610     G
## 16                    BOISE CREEK      24 49.624229 -122.659958     H
## 17            BOISE CREEK - NORTH   26776 49.624229 -122.659958     H
## 18                   FRASER RIVER       1 49.118876 -123.191011     I
## 19   FRASER RIVER AND TRIBUTARIES 7990634 49.118876 -123.191011     I
## 20        CHILCOTIN RIVER - LOWER    2463 51.739953 -122.401309     J
## 21                CHILCOTIN RIVER     285 51.739953 -122.401309     J
## 22                   SOMASS RIVER    1261 49.246397 -124.819840     K
## 23        SOMASS-SPROAT-GC SYSTEM   11485 49.246397 -124.819840     K
## 24        CLAYOQUOT RIVER - LOWER    1310 49.183381 -125.537647     L
## 25                CLAYOQUOT RIVER    2487 49.183381 -125.537647     L
## 26                   PINKUT CREEK  501255 54.446302 -125.460072     M
## 27      PINKUT CREEK - BELOW WEIR    2115 54.446302 -125.460072     M
## 28                 FORESTRY CREEK    2678 52.376885 -126.584031     N
## 29                  MARTY'S CREEK    2680 52.376885 -126.584031     N
## 30                 CAYEGHLE CREEK    1450 50.342470 -127.439898     O
## 31                CAYEGHLE SYSTEM    2302 50.342470 -127.439898     O
## 32                     TAKLA LAKE    2439 55.075216 -125.538582     P
## 33                  TAKLA - SHORE   19719 55.075216 -125.538582     P
## 34                   FULTON RIVER  501254 54.814315 -126.146194     Q
## 35      FULTON RIVER - BELOW WEIR    2111 54.814315 -126.146194     Q
## 36               JENNY BAY CREEKS    1798 52.224030 -127.598389     R
## 37          JENNY BAY SOUTH CREEK    2693 52.224030 -127.598389     R
## 38    FANNIE COVE LEFT HAND CREEK    2687 52.043911 -128.068542     S
## 39   FANNIE COVE RIGHT HAND CREEK    2708 52.043911 -128.068542     S
## 40     SEWELL INLET CREEK L.H. #2    1594 52.876978 -131.949569     T
## 41                  THORSEN CREEK    1595 52.876978 -131.949569     T

For group A:

##   GFE_ID                    SYSTEM_SITE     Y_LAT     X_LONGT Y_LAT_new
## 1   2746 FENNELL CREEK AND SASKUM CREEK 51.352577 -119.723268        NA
## 2    261                  FENNELL CREEK 51.352577 -119.723268 51.344756
##   X_LONGT_new
## 1          NA
## 2 -119.797289

For group B:

##   GFE_ID    SYSTEM_SITE     Y_LAT     X_LONGT Y_LAT_new X_LONGT_new
## 3     82   CENTRE CREEK 49.103185 -121.536585        NA          NA
## 4   2533 CENTRE CHANNEL 49.103185 -121.536585 49.090741 -121.531078

For group C:

##   GFE_ID              SYSTEM_SITE     Y_LAT     X_LONGT Y_LAT_new X_LONGT_new
## 5     78              FOLEY CREEK 49.106138 -121.635958 49.107077 -121.636804
## 6    381 FOLEY CREEK SIDE CHANNEL 49.106138 -121.635958        NA          NA

For group D:

##   GFE_ID   SYSTEM_SITE     Y_LAT     X_LONGT Y_LAT_new X_LONGT_new
## 7     51 NORRISH CREEK 49.173172 -122.135356        NA          NA
## 8     54 HAWKINS CREEK 49.173172 -122.135356 49.169178  -122.19174

For group E:

##    GFE_ID           SYSTEM_SITE     Y_LAT     X_LONGT Y_LAT_new X_LONGT_new
## 9   11513          BAXTER BEACH 52.722503 -120.860114        NA          NA
## 10  11514    BEAR BEACH - SHORE 52.722503 -120.860114 52.720154 -120.867586
## 11  11515 BETTY FRANK'S - SHORE 52.722503 -120.860114 52.746393 -120.837075

For group F:

##    GFE_ID            SYSTEM_SITE     Y_LAT     X_LONGT Y_LAT_new X_LONGT_new
## 12    177 LILLOOET RIVER - LOWER 49.747468 -122.145828        NA          NA
## 13   2475         LILLOOET RIVER 49.747468 -122.145828 49.740213  -122.14739

For group G:

##     GFE_ID              SYSTEM_SITE     Y_LAT    X_LONGT Y_LAT_new X_LONGT_new
## 14     131            CAYOOSH CREEK 50.668727 -121.97261 50.680016 -121.929476
## 15 7990579 SETON AND CAYOOSH CREEKS 50.668727 -121.97261        NA          NA

For group H:

##    GFE_ID         SYSTEM_SITE     Y_LAT     X_LONGT Y_LAT_new X_LONGT_new
## 16     24         BOISE CREEK 49.624229 -122.659958        NA          NA
## 17  26776 BOISE CREEK - NORTH 49.624229 -122.659958 49.626847 -122.671116

For group I:

##     GFE_ID                  SYSTEM_SITE     Y_LAT     X_LONGT Y_LAT_new
## 18       1                 FRASER RIVER 49.118876 -123.191011        NA
## 19 7990634 FRASER RIVER AND TRIBUTARIES 49.118876 -123.191011 49.111242
##    X_LONGT_new
## 18          NA
## 19 -123.168505

For group J:

##    GFE_ID             SYSTEM_SITE     Y_LAT     X_LONGT Y_LAT_new X_LONGT_new
## 20   2463 CHILCOTIN RIVER - LOWER 51.739953 -122.401309        NA          NA
## 21    285         CHILCOTIN RIVER 51.739953 -122.401309 51.745035 -122.413504

For group K:

##    GFE_ID             SYSTEM_SITE     Y_LAT    X_LONGT Y_LAT_new X_LONGT_new
## 22   1261            SOMASS RIVER 49.246397 -124.81984        NA          NA
## 23  11485 SOMASS-SPROAT-GC SYSTEM 49.246397 -124.81984 49.294603 -124.879074

For group L:

##    GFE_ID             SYSTEM_SITE     Y_LAT     X_LONGT Y_LAT_new X_LONGT_new
## 24   1310 CLAYOQUOT RIVER - LOWER 49.183381 -125.537647        NA          NA
## 25   2487         CLAYOQUOT RIVER 49.183381 -125.537647 49.201198  -125.51245

For group M:

##    GFE_ID               SYSTEM_SITE     Y_LAT     X_LONGT Y_LAT_new X_LONGT_new
## 26 501255              PINKUT CREEK 54.446302 -125.460072 54.445124 -125.458809
## 27   2115 PINKUT CREEK - BELOW WEIR 54.446302 -125.460072        NA          NA

For group N:

##    GFE_ID    SYSTEM_SITE     Y_LAT     X_LONGT Y_LAT_new X_LONGT_new
## 28   2678 FORESTRY CREEK 52.376885 -126.584031 52.378979 -126.581766
## 29   2680  MARTY'S CREEK 52.376885 -126.584031        NA          NA

For group O:

##    GFE_ID     SYSTEM_SITE    Y_LAT     X_LONGT Y_LAT_new X_LONGT_new
## 30   1450  CAYEGHLE CREEK 50.34247 -127.439898        NA          NA
## 31   2302 CAYEGHLE SYSTEM 50.34247 -127.439898 50.342682 -127.437795

For group P:

##    GFE_ID   SYSTEM_SITE     Y_LAT     X_LONGT Y_LAT_new X_LONGT_new
## 32   2439    TAKLA LAKE 55.075216 -125.538582        NA          NA
## 33  19719 TAKLA - SHORE 55.075216 -125.538582  55.08134 -125.560056

For group Q:

##    GFE_ID               SYSTEM_SITE     Y_LAT     X_LONGT Y_LAT_new X_LONGT_new
## 34 501254              FULTON RIVER 54.814315 -126.146194        NA          NA
## 35   2111 FULTON RIVER - BELOW WEIR 54.814315 -126.146194  54.81601 -126.163338

For group R:

##    GFE_ID           SYSTEM_SITE    Y_LAT     X_LONGT Y_LAT_new X_LONGT_new
## 36   1798      JENNY BAY CREEKS 52.22403 -127.598389 52.225501 -127.598108
## 37   2693 JENNY BAY SOUTH CREEK 52.22403 -127.598389        NA          NA

For group S:

##    GFE_ID                  SYSTEM_SITE     Y_LAT     X_LONGT Y_LAT_new
## 38   2687  FANNIE COVE LEFT HAND CREEK 52.043911 -128.068542 52.044541
## 39   2708 FANNIE COVE RIGHT HAND CREEK 52.043911 -128.068542        NA
##    X_LONGT_new
## 38 -128.068576
## 39          NA

For group T:

##    GFE_ID                SYSTEM_SITE     Y_LAT     X_LONGT Y_LAT_new
## 40   1594 SEWELL INLET CREEK L.H. #2 52.876978 -131.949569 52.877146
## 41   1595              THORSEN CREEK 52.876978 -131.949569        NA
##    X_LONGT_new
## 40   -132.0007
## 41          NA

The fields X_LONGT and Y_LAT are updated in CUSS, and the new field coordinates_changed is added to indicate which locations are concerned.

# check is there still are different GFE_IDs with same coordinates:
coord_duplicated <- locations[,c("X_LONGT","Y_LAT")][duplicated(locations[,c("X_LONGT","Y_LAT")]),]
nrow(coord_duplicated) # 0

## [1] 0

3.2 Remove the IndexId & GFE_ID time series in NUSEDS with only NAs and/or 0s

This procedure removes 102845 rows in NUSEDS, corresponding to 24.9% of the original dataset.

The data removed concerns 4491 time series, which are referenced in removed_all:

##     IndexId GFE_ID dataset                   comment
## 1  CM_46125    204  NUSEDS Only NAs for MAX_ESTIMATE
## 2 PKO_48406    427  NUSEDS Only NAs for MAX_ESTIMATE
## 3 PKO_50417   1061  NUSEDS Only NAs for MAX_ESTIMATE
## 4  CM_48735    765  NUSEDS Only NAs for MAX_ESTIMATE
## 5  CN_45966    415  NUSEDS Only NAs for MAX_ESTIMATE
## 6  SX_50720    835  NUSEDS Only NAs for MAX_ESTIMATE

Note that we do not remove these series in CUSS at this stage because they could be alternative series for the series in NUSEDS that do not appear in CUSS. It is only once series in NUSEDS absent in CUSS were found alternative series for, that we remove series in CUSS that have only NAs and/or 0s in NUSEDS.

3.3 Fix time series in CUSS that are not in NUSEDS

Look for each time series in CUSS (i.e., unique IndexId & GFE_ID association) and:

check if there are IndexId associated to multiple locations (i.e., GFE_ID); if yes: trouble shoot manually because this should not happen: IndexId should be associated to one unique location.
else look if there is a time series with the same IndexId & GFE_ID in NUSEDS; if yes: all good :-)
else: that could be due to the wrong attribution of either (i) the IndexId or (ii) the GFE_ID. The rest of the code looks for potential alternative series in NUSEDS with either a different IndexId (but with the same species) or a different GFE_ID. Alternative series identified NOT present in CUSS are kept, the ones present are not considered.

The procedure returns a simple data frame with the IndexId & GFE_ID series concerned and associated comment and eventual potential alternative series that have to be checked manually after:

##     i   IndexId GFE_ID in_cuss in_nused
## 1  23 PKE_40049   1211     yes       no
## 2  32 PKE_52934   1143     yes       no
## 3  60 PKE_40149   1221     yes       no
## 4  62 PKE_50494    812     yes       no
## 5 112 PKO_46644    262     yes       no
## 6 176  CN_46842    285     yes       no
##                                                comment
## 1             There is no alternative series in NUSEDS
## 2             There is no alternative series in NUSEDS
## 3             There is no alternative series in NUSEDS
## 4             There is no alternative series in NUSEDS
## 5             There is no alternative series in NUSEDS
## 6 Alternative series: CN_46842 & 2463, CN_46841 &  285

There are 259 out of 7145 time series in CUSS that are not present in NUSEDS (i.e., 3.6%), and the vast majority of them do not have an alternative series in NUSEDS:

table(trackRecord$comment[cond])

## 
##        Alternative series: CN_46842 & 2463, CN_46841 &  285 
##                                                           1 
##                         Alternative series: CO_46835 & 2463 
##                                                           1 
##                     Alternative series: PKE_54793433 & 1490 
##                                                           1 
##                           Alternative series: SX_47954 & 21 
##                                                           1 
## In CUSS: there are multiple GFE_IDs for CN_7479 : 133, 2373 
##                                                           1 
##                    There is no alternative series in NUSEDS 
##                                                         254

3.3.1 Alternative series present

The goal here is to go over all the cases where alternative series are available. The title of the figures below shows the focal series that is in CUSS but not in NUSEDS. The series on the top left corner are potential alternative series (i.e., they are present in NUSEDS but not in CUSS).

comment <- "Alternative series:"
trackRecord_cuss_alternative <- trackRecord[grepl(comment,trackRecord$comment),]

## [1] 4

##      i   IndexId GFE_ID in_cuss in_nused
## 1  176  CN_46842    285     yes       no
## 2 3489 PKE_42409   1490     yes       no
## 3 4724  CO_46835    285     yes       no
## 4 6685   SX_7763     21     yes       no
##                                                comment
## 1 Alternative series: CN_46842 & 2463, CN_46841 &  285
## 2              Alternative series: PKE_54793433 & 1490
## 3                  Alternative series: CO_46835 & 2463
## 4                    Alternative series: SX_47954 & 21

The focal CUSS time series CN_46842 & GFE_ID = 285 corresponds to the CU_NAME:

## [1] "MIDDLE FRASER RIVER_SP_1.3"

The two alternative time series are noted as two different runs, according to NUSEDS’s POPULATION:

## [1] "Chilcotin River (Williams Lake Area) Chinook Run 2"
## [2] "Chilcotin River (Williams Lake Area) Chinook Run 1"

The distance between the two locations (i.e., the two GFE_ID) is 1.01 km.

Decision: It was decided to merge the two time series CN_46842 & GFE_ID = 2463 and CN_46841 & GFE_ID = 285 to the series CN_46842 & GFE_ID = 285 in NUSEDS because there is only one timing for this Middle Fraser River (Spring 5-2) Chinook CU and the two series are complementary.

Decision: we remove series PKE_42409 & GFE_ID = 1490 from CUSS and PKE_54793433 & GFE_ID = 1490 from NUSEDS.

These two locations are from the same river according the CUSS’s SYSTEM_SITE:

## [1] "CHILCOTIN RIVER - LOWER" "CHILCOTIN RIVER"

The distance between the two locations is 1.01 km (same as before).

Decision: Replace CO_46835 & GFE_ID = 2463 by CO_46835 & GFE_ID = 285 in NUSEDS.

The two time series have the same CU_NAME in CUSS:

##   SX_7763  SX_47954 
## "WIDGEON" "WIDGEON"

Decision: Replace SX_47954 & GFE_ID = 21 by SX_7763 & GFE_ID = 21 in NUSEDS.

3.3.2 Alternative series not present in NUSEDS

The goal is to remove these InsexId & GFE_ID references from CUSS and to specify if they were simply not present in the orginal NUSEDS or only contained NAs and/or 0s.

comment <- "There is no alternative series in NUSEDS"
trackRecord_cuss_noAlernative <- trackRecord[grepl(comment,trackRecord$comment),]

## [1] 254

##     IndexId GFE_ID in_cuss in_nused                                  comment
## 1 PKE_40049   1211     yes       no There is no alternative series in NUSEDS
## 2 PKE_52934   1143     yes       no There is no alternative series in NUSEDS
## 3 PKE_40149   1221     yes       no There is no alternative series in NUSEDS
## 4 PKE_50494    812     yes       no There is no alternative series in NUSEDS
## 5 PKO_46644    262     yes       no There is no alternative series in NUSEDS
## 6   CM_3261   2637     yes       no There is no alternative series in NUSEDS

253 of the 254 InsexId & GFE_ID references were removed previously from NUSEDS because they contained only NAs and/or 0s. Only the following reference was not in NUSEDS initially:

##   GFE_ID   SYSTEM_SITE SPECIES_QUALIFIED      CU_NAME FULL_CU_IN
## 1    150 SEEBACH CREEK               SER UPPER FRASER     SER-06

3.3.3 Multiple GFE_IDs for a same IndexId in CUSS

There should not be any InsexId associated to multiple GFE_ID (it is a MANY TO ONE relationship).

comment <- "In CUSS: there are multiple GFE_IDs for"
trackRecord_cuss_multi_fgeid <- trackRecord[grepl(comment,trackRecord$comment),]

This happens for 2 populations:

##      i  IndexId GFE_ID in_cuss in_nused
## 1 1169  CN_7479    133     yes       no
## 2 1169  CN_7479   2373     yes      yes
## 3 6288 SX_45525   2444     yes      yes
## 4 6288 SX_45525    303     yes      yes
##                                                        comment
## 1  In CUSS: there are multiple GFE_IDs for CN_7479 : 133, 2373
## 2  In CUSS: there are multiple GFE_IDs for CN_7479 : 133, 2373
## 3 In CUSS: there are multiple GFE_IDs for SX_45525 : 2444, 303
## 4 In CUSS: there are multiple GFE_IDs for SX_45525 : 2444, 303

The figure above only show the time series present in NUSEDS. The other time series is removed from CUSS.

These two time series of this CU (CU_NAME = NADINA/FRANCOIS-EARLY SUMMER TIMING, NADINA/FRANCOIS-EARLY SUMMER TIMING) should have a different IndexId (i.e. POP_ID). We keep them as is and do not make any change to NUSEDS and CUSS.

3.4 Fix populations in NUSEDS that are not in CUSS

The goal is to look for each IndexId & GFE_ID series in NUSEDS that are not in CUSS and to:

Remove series with < 4 data points (too much work for not enough data points; eventually rescue this data in future)
Look if there are alternative series in NUSEDS that are also present in CUSS

if not, add the IndexId & GFE_ID reference in CUSS
if yes, see if the alternative series can be merged; if not, add the IndexId & GFE_ID reference to CUSS

There are 171 IndexId & GFE_ID references in NUSEDS that are not present in CUSS, for instance:

##   IndexId GFE_ID nb_dataPt alternative_IndexId alternative_GFE_ID
## 1 CM_1622     11         2                none               none
## 2 CM_2434      4        11                none               none
## 3  CM_297     14         1            CM_47925               none
## 4 CM_3305  11486        15                none              11485
## 5 CM_3305  11487         2                none              11485
## 6 CM_3305  11488         3                none              11485

3.4.1 Remove series number of data points < 4

There are 94 populations removed because they have < 4 data points; for instance:

The procedure removes 1387 rows in NUSEDS.

3.4.2 Alternative IndexId AND no alternative GFE_ID

The section concerns focal IndexId & GFE_ID time series in NUSEDS that are not in CUSS and for which there are alternative series in CUSS with the same IndexId but a different GFE_ID. The goal is to compare the focal vs. the alternative time series and take the following actions:

if the time series are 100% complementary: merge them (i.e., edit IndexId in NUSEDS)
if the time series are 100% duplicated: remove the focal series in NUSEDS
if the series are partially conflictual or if there are multiple potential alternative series: trubble shoot manually

There are 6 time series concerned:

##    IndexId GFE_ID nb_dataPt alternative_IndexId
## 1  CN_3331   1156        13 CN_53069 ; CN_53068
## 2  CN_3334   1194        36 CN_51519 ; CN_51518
## 3 CN_39983   1204         7            CN_39984
## 4  CN_7809    442        17            CN_48442
## 5  CO_7776   2243         5             CO_3136
## 6 SX_46308    227        14            SX_46309

The figures below show for each of these cases the decision made. The figures also show the CU_MAME of focal and alternative IndexId. The i at the bottom left of each plot is used after to reference the figures when fixing time series manually. Several cases are set aside to be manually fixed after.

##    IndexId GFE_ID complementary duplicate conflict decision nb_dataPt
## 1 CN_53069   1156           100         0        0       NA        13
## 2 CN_53068   1156           100         0        0       NA        13
## [1] "***"

##    IndexId GFE_ID complementary duplicate conflict decision nb_dataPt
## 1 CN_51519   1194            90         0       10       NA        36
## 2 CN_51518   1194           100         0        0       NA        36
## [1] "***"

##    IndexId GFE_ID complementary duplicate conflict decision nb_dataPt
## 1 CN_39984   1204            70         0       30       NA         7
## [1] "***"

##    IndexId GFE_ID complementary duplicate conflict decision nb_dataPt
## 1 CN_48442    442           100         0        0       NA        17
## [1] "***"
## [1] "Number of rows removed from dataframe = 5"

##   IndexId GFE_ID complementary duplicate conflict decision nb_dataPt
## 1 CO_3136   2243             0       100        0       NA         5
## [1] "***"

##    IndexId GFE_ID complementary duplicate conflict decision nb_dataPt
## 1 SX_46309    227            40        40       20       NA        14
## [1] "***"

We now assess the cases set aside.

In the case i = 1 above, series CN_3331 - 1156 was not merged to any of the alternative series because of its unknown run timing (see below). It is consequently removed from NUSEDS.

##    IndexId        type                               POPULATION
## 1  CN_3331       focal Puntledge River Chinook (unknown timing)
## 2 CN_53069 alternative             Puntledge River Fall Chinook
## 3 CN_53068 alternative           Puntledge River Spring Chinook

In the case i = 2 above, series CN_3334 - 1194 was initially going to be merged to the alternative series with which is was 100% compatible. But is was decided to remove it from NUSEDS because its run timing is unknown:

##    IndexId        type                             POPULATION
## 1  CN_3334       focal Nanaimo River Chinook (unknown timing)
## 2 CN_51519 alternative             Nanaimo River Fall Chinook
## 3 CN_51518 alternative           Nanaimo River Summer Chinook

In the case i = 3 above, series CN_39983 - 1204 was not merged to the alternative series because of different run timings:

##    IndexId        type                     POPULATION
## 1 CN_39983       focal Chemainus River Summer Chinook
## 2 CN_39984 alternative   Chemainus River Fall Chinook

It was decided instead to associate the population CN_39983 to the CU East Vancouver Island-Georgia Strait (Summer 4-1) in the PSE because of the same run timing and the fact that the Chemainus river is connected to this CU. The corresponding CU_NAME in CUSS is:

## [1] "EAST VANCOUVER ISLAND-GEORGIA STRAIT_SU_0.3"

The reference of the CN_39983 - 1204 time series is added to CUSS. The CU-related fields in CUSS are filled with the values corresponding to the above CU_NAME. Here below is the row added to CUSS:

##   MAP_LABEL GFE_ID     SYSTEM_SITE GFE_TYPE SPECIES_QUALIFIED     Y_LAT
## 1        NA   1204 CHEMAINUS RIVER   Stream                CK 48.897926
##       X_LONGT FAZ_ACRO MAZ_ACRO JAZ_ACRO
## 1 -123.673181      EVI     GStr EVI+GStr
##                                       CU_NAME     CU_ACRO     CU_LAT
## 1 EAST VANCOUVER ISLAND-GEORGIA STRAIT_SU_0.3 EVIGStr-sum 49.5622666
##      CU_LONGT CU_TYPE CU_INDEX FULL_CU_IN SBJ_ID POP_ID IS_INDICATOR CMNTS
## 1 -125.201122 Current       83      CK-83     NA  39983         <NA>  <NA>
##   EFFECTIVE_DT                                            WATERSHED_CDE
## 1         <NA> 920-303500-00000-00000-0000-0000-000-000-000-000-000-000
##                                                                                                                                 FWA_WATERSHED_CDE
## 1 920-302830-000000-000000-000000-000000-000000-000000-000000-000000-000000-000000-000000-000000-000000-000000-000000-000000-000000-000000-000000
##   SPECIES species_acronym_ncc  IndexId coordinates_changed
## 1 Chinook                  CN CN_39983               FALSE

In the case i = 4 above, series CN_7809 - 442 can be merged to the alternative series CN_48442 - 442 with which it is 100% compatible. The run timing indicated in POPULATION in NUSEDS are not incompatible:

##                                      CN_7809 
## "Okanagan River (Salmon Arm) Summer Chinook" 
##                                     CN_48442 
##  "Okanagan River (Salmon Arm) Chinook Run 1"

However, we decide to merge the alternative time series to the focal one (i.e., change CN_48442 - 442 to CN_7809 - 442) instead of the contrary because the latter is longer and more recent. Consequently both NUSEDS and CUSS are edited.

## [1] "1 rows were edited in conservation_unit_system_sites at the following fields: POP_ID, SPECIES, species_acronym_ncc, IndexId"

## [1] "4 rows were edited in all_areas_nuseds at the following fields: SPECIES, POPULATION, RUN_TYPE, POP_ID, species_acronym_ncc, IndexId"

In the case i = 5 above, the focal series CO_7776 - 2243 is removed from NUSEDS because it is 100% duplicated with the alternative series CO_3136 - 2243.

In the case i = 6 above, the focal series SX_46308 - 227 is removed from NUSEDS because of a different run timing (in POPULATION) compared to the alternative series (see below) and most values are 0s.

##                                       SX_46308 
## "Ross Creek (Salmon Arm) Early Summer Sockeye" 
##                                       SX_46309 
##         "Ross Creek (Salmon Arm) Late Sockeye"

3.4.3 Alternative GFE_ID

The focal time series (i.e., a unique IndexId - GFE_ID not in CUSS) are compared to alternative series with a different GFE_ID and potentially a different IndexId (but of the same species).

There are 34 time series concerned:

##      IndexId     GFE_ID nb_dataPt alternative_IndexId alternative_GFE_ID
## 1    CM_3305      11486        15                none              11485
## 2   CM_47925         15         8            CM_47928                 14
## 3   CM_47925      31516         8                none                 14
## 4   CM_47925      31740         8                none                 14
## 5   CM_50616       2618         8                none                824
## 6    CN_2178        129         5            CN_47189               2476
## 7    CN_3306      11486        38                none              11485
## 8    CN_3306      11488         4                none              11485
## 9   CN_46891       2467         5                none                290
## 10  CN_46892       2466        22                none                290
## 11  CN_47277       2464        46                none                142
## 12  CN_50618       2618         8                none                824
## 13  CO_44539        129         5            CO_47183               2476
## 14  CO_44539       2451        12                none               2476
## 15  CO_46170       2461         7                none                213
## 16  CO_46170      54282         5                none                213
## 17  CO_46582 1754239945        14                none                256
## 18  CO_46582 2049268617        20                none                256
## 19  CO_46582      33045        20                none                256
## 20  CO_46582      57041         5                none                256
## 21  CO_46582  719256264        21                none                256
## 22  CO_46602  212716981        20                none                258
## 23  CO_46602       2746         6            CO_46632                258
## 24  CO_46632        261         7                none               2746
## 25  CO_46795 1921661712        15                none                281
## 26  CO_46795      33103        16                none                281
## 27  CO_50612       2618         6                none                824
## 28 PKE_50614       2618         6                none                824
## 29  PKO_3304      11486         4                none              11485
## 30 PKO_50614       2618         7                none                824
## 31   SX_3302       3416        23             SX_3310              11485
## 32   SX_3302       3444        23             SX_3325              11485
## 33   SX_3416       2746        41                none                261
## 34  SX_50610       2618        13                none                824

Notice in the table above that several time series have the same IndexId. Again this is problematic because normally only one GFE_ID can be associated to a given IndexId (i.e., POP_ID). We show here after the focal time series grouped with their alternative series. On top of each figure is shown the POPULATION and WATERBODY associated with each IndexId (i.e., POP_ID) and GFE_ID, respectively. Belong each figure is shown a summary table providing the number of data points (nb_dataPt) and the % of them that are complementary, duplicate and in conflict compared with the alternative series. The decision made in each case are provided after the figures, using the index i shown at the bottom left of the plots for reference.

## [1] "*** i = 1 ***"
##   IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 CM_3305  11486            NA        11485        15           100         0
##   conflict
## 1        0

## [1] "*** i = 2 ***"
##   IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 CN_3306  11486            NA        11485        38           100         0
## 2 CN_3306  11488            NA        11485         4           100         0
##   conflict
## 1        0
## 2        0

## [1] "*** i = 3 ***"
##    IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 PKO_3304  11486            NA        11485         4           100         0
##   conflict
## 1        0

## [1] "*** i = 4 ***"
##   IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 SX_3302   3416          <NA>        11485        23           100         0
## 2 SX_3302   3444          <NA>        11485        23           100         0
## 3 SX_3302   3416       SX_3310         <NA>        23           100         0
## 4 SX_3302   3444       SX_3325         <NA>        23           100         0
##   conflict
## 1        0
## 2        0
## 3        0
## 4        0

## [1] "*** i = 5 ***"
##    IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 CM_47925     15          <NA>           14         8            90         0
## 2 CM_47925  31516          <NA>           14         8            90        10
## 3 CM_47925  31740          <NA>           14         8            90         0
## 4 CM_47925     15      CM_47928         <NA>         8           100         0
##   conflict
## 1       10
## 2        0
## 3       10
## 4        0

## [1] "*** i = 6 ***"
##    IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 CN_47277   2464            NA          142        46           100         0
##   conflict
## 1        0

## [1] "*** i = 7 ***"
##    IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 CO_46170   2461            NA          213         7           100         0
## 2 CO_46170  54282            NA          213         5           100         0
##   conflict
## 1        0
## 2        0

## [1] "*** i = 8 ***"
##   IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 CN_2178    129          <NA>         2476         5            80         0
## 2 CN_2178    129      CN_47189         <NA>         5             0       100
##   conflict
## 1       20
## 2        0

## [1] "*** i = 9 ***"
##    IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 CO_44539    129          <NA>         2476         5           100         0
## 2 CO_44539   2451          <NA>         2476        12            90         0
## 3 CO_44539    129      CO_47183         <NA>         5           100         0
##   conflict
## 1        0
## 2       10
## 3        0

## [1] "*** i = 10 ***"
##    IndexId     GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary
## 1 CO_46582 1754239945            NA          256        14           100
## 2 CO_46582 2049268617            NA          256        20           100
## 3 CO_46582      33045            NA          256        20           100
## 4 CO_46582      57041            NA          256         5           100
## 5 CO_46582  719256264            NA          256        21           100
##   duplicate conflict
## 1         0        0
## 2         0        0
## 3         0        0
## 4         0        0
## 5         0        0

## [1] "*** i = 11 ***"
##    IndexId    GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary
## 1 CO_46602 212716981          <NA>          258        20            70
## 2 CO_46602      2746          <NA>          258         6            80
## 3 CO_46602      2746      CO_46632         <NA>         6           100
##   duplicate conflict
## 1         0       30
## 2         0       20
## 3         0        0

## [1] "*** i = 12 ***"
##   IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 SX_3416   2746            NA          261        41           100         0
##   conflict
## 1        0

## [1] "*** i = 13 ***"
##    IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 CO_46632    261            NA         2746         7           100         0
##   conflict
## 1        0

## [1] "*** i = 14 ***"
##    IndexId     GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary
## 1 CO_46795 1921661712            NA          281        15           100
## 2 CO_46795      33103            NA          281        16           100
##   duplicate conflict
## 1         0        0
## 2         0        0

## [1] "*** i = 15 ***"
##    IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 CN_46891   2467            NA          290         5             0       100
##   conflict
## 1        0

## [1] "*** i = 16 ***"
##    IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 CN_46892   2466            NA          290        22            80        10
##   conflict
## 1        0

## [1] "*** i = 17 ***"
##    IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 CM_50616   2618            NA          824         8            40         0
##   conflict
## 1       60

## [1] "*** i = 18 ***"
##    IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 CN_50618   2618            NA          824         8             0         0
##   conflict
## 1      100

## [1] "*** i = 19 ***"
##    IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 CO_50612   2618            NA          824         6             0         0
##   conflict
## 1      100

## [1] "*** i = 20 ***"
##     IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 PKE_50614   2618            NA          824         6             0         0
##   conflict
## 1      100

## [1] "*** i = 21 ***"
##     IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 PKO_50614   2618            NA          824         7            10         0
##   conflict
## 1       90

## [1] "*** i = 22 ***"
##    IndexId GFE_ID IndexId_alter GFE_ID_alter nb_dataPt complementary duplicate
## 1 SX_50610   2618            NA          824        13            20         0
##   conflict
## 1       80

In case i = 1: series CM_3305 - 11486 is not merged to alternative series CM_3305 - 11485 because these chum salmon would not be able to make it up the fall. Consequently, the reference of the focal series is added to CUSS.

Note that GFE_ID = 11485 is present in DFO streams file (under ID) but coordinates are not:

## # A tibble: 1 × 4
##   NME                            ID X_LONGT Y_LAT
##   <chr>                       <dbl> <chr>   <chr>
## 1 WATERSHED ABOVE STAMP FALLS 11486 <NA>    <NA>

We consequently attribute coordinates to this location using Google Maps. We selecte the location just above Stamp Falls, on the left of the Stamp river fish ladder:

##       Y_LAT     X_LONGT 
##   49.332224 -124.919717

In case i = 2: series CN_3306 - 11486 in green is not merged to alternative series CN_3306 - 11485 in red for the same reason as above; its reference are instead added to CUSS. The other focal series CN_3306 - 11488 in blue is deleted because it only has 4 data points.

In case i = 3: series PKO_3304 - 11486 in blue is not merged to alternative series PKO_3304 - 11485 in red for the same reason as above; its reference are instead added to CUSS.

In case i = 4: series PKO_3302 - 3416 in black (bottom plot) is merged to alternative series SX_3310 - 3416 in red and series PKO_3302 - 3444 in blue (bottom plot) is merged to alternative series SX_3325 - 3444 in green.

In case i = 5: the focal series CM_47925 - 15 in blue (bottom plot) is merged to the alternative series CM_47928 - 15 in red (bottom plot). The other two focal series CM_47925 - 31516 in black (top plot) and CM_47925 - 31740 in blue (top plot) are added to CUSS.

In case i = 6: the focal series CN_47277 - 2464 in blue is not merged to the alternative series CN_47277 - 142 in red ; it is instead added to CUSS.

The GFE_ID is present in the DFO streams file but coordinates are not:

## # A tibble: 1 × 4
##   NME                         ID X_LONGT Y_LAT
##   <chr>                    <dbl> <chr>   <chr>
## 1 COTTONWOOD RIVER - LOWER  2464 <NA>    <NA>

The coordinate for this location are consequently defined manually using Google Maps; they corresponds to the mouth of Cottonwood river:

##                SYSTEM_SITE GFE_ID     X_LONGT     Y_LAT
## 1 COTTONWOOD RIVER - LOWER   2464 -122.610584 53.119965

In case i = 7: the focal series CO_46170 - 2461 in green is merged to the alternative series CO_46170 - 213 in red ; while the other focal series CO_46170 - 54282 in blue is instead added to CUSS.

In case i = 8: the focal series CN_2178 - 129 in blue is a duplicate of the atlernative series CN_47189 - 129 in red (bottom plot); it is consequently removed from NUSEDS.

In case i = 9: the focal series CO_44539 - 2451 in blue (top plot) is not merged to the alternative series CO_44539 - 2476 in red because of a conflicting point; its reference is instead added to CUSS. The second focal series CO_44539 - 129 in blue (bottom plot) is merged to the alternative series CO_47183 - 129 in red.

In case i = 10: all five focal series are added to CUSS. Their GFE_ID are present in the DFO streams but the coordinates are not:

## # A tibble: 5 × 4
##   NME                                                   ID X_LONGT Y_LAT
##   <chr>                                              <dbl> <chr>   <chr>
## 1 BIRCH ISLAND CHANNEL                               33045 <NA>    <NA> 
## 2 UPPER NORTH THOMPSON RIVER MAINSTEM           2049268617 <NA>    <NA> 
## 3 MIDDLE NORTH THOMPSON RIVER MAINSTEM          1754239945 <NA>    <NA> 
## 4 PIG CHANNEL COMBINED                           719256264 <NA>    <NA> 
## 5 UPPER NORTH THOMPSON RIVER-MILEDGE CONFLUENCE      57041 <NA>    <NA>

The coordinates are consequently defined manually using Google Maps and the map in page 31 of Arbeider et al. 2020. Interior Fraser Coho Salmon Recovery Potential Assessment:

##                                     SYSTEM_SITE     GFE_ID     X_LONGT
## 1                          BIRCH ISLAND CHANNEL      33045 -119.889197
## 2           UPPER NORTH THOMPSON RIVER MAINSTEM 2049268617 -119.177796
## 3          MIDDLE NORTH THOMPSON RIVER MAINSTEM 1754239945 -119.701668
## 4                          PIG CHANNEL COMBINED  719256264 -119.810014
## 5 UPPER NORTH THOMPSON RIVER-MILEDGE CONFLUENCE      57041 -119.173514
##       Y_LAT
## 1 51.598602
## 2 52.345236
## 3 51.593091
## 4 51.579465
## 5 52.281312

In case i = 11: the focal series CO_46602 - 2746 in blue (bottom plot) is merged to the alternative series CO_46632 - 2476 in red. The other focal series CO_46602 - 212716981 in green (top plot) is added to CUSS.

In case i = 12: the focal series SX_3416 - 2746 in blue (bottom plot) is merged to the alternative series SX_3416 - 261 in red.

In case i = 13: the focal series CO_46632 - 261 in blue (bottom plot) is merged to the alternative series CO_46632 - 2746 in red.

In case i = 14: the focal series CO_46795 - 33103 in blue is merged to the alternative series CO_46795 - 281 in red. The other focal series CO_46795 - 1921661712 in green is added to CUSS. Its GFE_ID is in the DFO streams file but coordinates are not:

## # A tibble: 2 × 4
##   NME                        ID X_LONGT       Y_LAT       
##   <chr>                   <dbl> <chr>         <chr>       
## 1 BLUE RIVER - LOWER      33103 -119.27964147 52.104610605
## 2 BLUE RIVER - UPPER 1921661712 <NA>          <NA>

The coordinates are consequently defined manually using Google Maps:

##          SYSTEM_SITE     GFE_ID     Y_LAT     X_LONGT
## 1 BLUE RIVER - UPPER 1921661712 52.112252 -119.289006

In case i = 15: the focal series CN_46891 - 2467 in blue is a duplicate of the alternative series CN_46891 - 290 in red; it is consequently removed from NUSEDS.

In case i = 16: the focal series CN_46892 - 2466 in blue has three duplicated points with the alternative time series CN_46892 - 290 in red, and a conflicting point. However, this conflicting point in the alternative time series is a duplicated point in the third time series CN_46891 - 290 in purple in the bottom plot below. We consequently (1) remove the duplicated point in the alternative series (in red) and (2) merge the focal series to the latter.

In the remaining cases (i = 17, 18, 19, 20, 21, 22): the focal series have the same IndexId (i.e. POP_ID) but a different GFE_ID compared to their alternative series: the location CLEAR WATER CREEK is not present in CUSS and the location PHILLIPS RIVER is proposed as an alternative. We do not merge the focal series to their alternative series and add them to CUSS instead.

3.4.4 No alternative series

There remain 37 series in NUSEDS with no alternative series, i.e., both their IndexId (POP_ID) and SPECIES & GFE_ID are absent from CUSS. These series are removed from NUSEDS because we cannot associate them to a CU_NAME. It future, it would be worth attempting to salvage the longest series in the list. The figures below show these time series.

3.5 Merge NuSEDS and CUSS and replace 0s by NAs

At this stage NUSEDS and CUSS have the same 6911 time series and can consequently be merged using the common fields IndexId (i.e., POP_ID) and GFE_ID.

col_common <- c("IndexId","POP_ID","GFE_ID")

col_nuseds <- c("SPECIES","WATERBODY","AREA","Year","MAX_ESTIMATE",
                "ENUMERATION_METHODS",            # 
                "ESTIMATE_CLASSIFICATION",
                "ESTIMATE_METHOD",
                "GAZETTED_NAME",
                "LOCAL_NAME_1",
                "LOCAL_NAME_2",
                "ADULT_PRESENCE",
                "JACK_PRESENCE")

col_cuss <- c("SPECIES_QUALIFIED","CU_NAME","CU_TYPE","FAZ_ACRO","JAZ_ACRO","MAZ_ACRO",
              "FULL_CU_IN","SYSTEM_SITE","Y_LAT","X_LONGT","IS_INDICATOR",
              "CU_LAT","CU_LONGT","coordinates_changed")

nuseds_final <- base::merge(x = all_areas_nuseds[,c(col_common,col_nuseds)], 
                            y = conservation_unit_system_sites[,c(col_common,col_cuss)], 
                            by = col_common, 
                            all.x = T)

3.6 Replace zeros by NAs (optional)

So far we have not been able to determine which zeros are actual 0s and those which should be NAs. We consequently implemented the option to replace all 0s by NAs (n = 2996, or 0.98% of the data points).

Decision made:

## [1] "Zeros are replaced by NAs."

3.7 Fix duplicated series and points

There are many cases where a CU in a given location (i.e., a unique GFE_ID) is associated to multiple series (i.e. IndexId/POP_ID), which should not happen. Observing these cases reveals clear duplicated data points or single data point that are not worth keeping. To fix these issues we proceed as follow:

Case 1: one of the duplicated series has only one data point:
- if it is complementary: merge to the other (longer) series
- if it is in conflict or a duplicate: remove the focal series
Case 2: the shorter series is 100% duplicated: removed the focal series
Case 3: for the rest of the duplicated series,
- points that are conflictual or duplicated are summed up
- points that are complementary are merged

The figures below show the time series before and after correction with the action made (i.e., “MERGED”, “DELETED”, “SUMMED”) indicated

There are three instance where a special fix is done, as indicated below in the corresponding figures.

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "Special fix (above): the blue points before 1950 are deleted, the other ones are summed to the red series."
## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "Special fix (above): the blue points is deleted, the green series is merged to the red one."
## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "Special fix (above): the two blue points are deleted; one is a duplicate, there other one is very close to be one."
## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

## [1] "***"

4 Export datasets

The following files are exported:

1_NuSEDS_escapement_data_collated_DATE.csv: the cleaned combined NUSEDS and CUSS datasets
1_series_inNUSEDS_noInCUSS_DATE.csv: information about the series in NUSEDS and not in CUSS
1_series_removed_DATE.csv: the series removed from either NUSEDS or CUSS and why
1_series_added_DATE.csv: the series added to either NUSEDS or CUSS and why
1_log_file.csv: the log file reporting the name of the original NUSEDS or CUSS files and the choices related to removing zeros or not.

1_nuseds_collation

Bruno S. Carturan

2025-01-24