6  Protocol for providing data to be entered into the database

The database is an active resource, and will be updated with relevant data as it becomes available. If you have macroinvertebrate assemblage data that could be suitable for inclusion in the database, please contact Chris Walsh at cwalsh@unimelb.edu.au.

Data for the database should be prepared in a format consistent with the database structure (Figure fig-erd). If your standard approach to data preparation and curation is to use a spreadsheet program such as Microsoft Excel, a useful approach would be to download some data from the data explorer interface (see next section), and use the downloaded xlsx file as a template for data preparation. The data entry app may also prove useful.

Any data supplied should be accompanied (in a cover letter) by information about the project that led to the samples being collected. Did the project have a name? Who funded it? Are there any publications that report on or use the data? (If so please provide citation data, and preferably a copy of the publications.) Who collected the data (names and organization)? Who processed the data (names and organization)? What quality assurance protocols were used in the data collection and processing? This information will be used to populate the samples_provenance table and to assign a sourcecode to the samples.

6.1 Preparation of sites table

A sites table is necessary to build the correct smpcode for each sample. For sites already in the database, you can just subset the sites table. To check for sites already in the database (if you don’t already have correct sitecode values), each site in the supplied data should (at a minimum) be identified by the correct 3-letter streamcode used by the Melbourne Water Stream network, either using the stream line data (downloadable from the stream network download page), or by entering stream names in the stream selection drop-down box at stream network data explorer. You can then check for equivalent sites by site locations by searching the database. For instance to check for previously sampled sites on Lyrebird Gully Creek (strcode LYR);

# assuming the database gpkg is in a subfolder of `wd`, named 'mwbugs'
# connect to the database
mwbugs_dir <- paste0(wd, "mwbugs/")
sql_drv <- RSQLite::SQLite()
mwbugs_db <- RSQLite::dbConnect(sql_drv, paste0(mwbugs_dir,"mwbugs_public.gpkg"))
sites <- sf::st_read(mwbugs_db, "sites")
sites[sites$strcode == "LYR",] # returns two sites alreadly in the database.
Simple feature collection with 2 features and 18 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 358630.8 ymin: 5811206 xmax: 358990.7 ymax: 5811927
CRS:           NA
      fid  sitecode   reach strcode sitecode_v12 reach_v12               str_nm
534   534 LYR-276-0 LYR-276     LYR     LYR_276a   LYR_276 LYREBIRD GULLY CREEK
1137 1137 LYR-722-3 LYR-722     LYR     LYR_728e   LYR_728 LYREBIRD GULLY CREEK
                           location alt_location     subc mgae_7855 mgan_7855
534  Olinda Ck Rd nr Eagles Nest Rd               YARR432  358630.8   5811206
1137               d/s Olinda Ck Rd              YARR8823  358990.7   5811927
              lga dus old_sitecode old_sitecode_v10 mw_subcat mw_cat
534  Yarra Ranges   0      LYR0006       LYR-4441-0        45      5
1137 Yarra Ranges   3      LYR0007       LYR-4859-3        45      5
534  POINT (358630.8 5811206)
1137 POINT (358990.7 5811927)

Or, for sites on longer rivers, such as the Yarra River (YAR), with many sites already sampled, you could restrict the search by catchment area.

# First extract catchment areas in ha from the sitecode
sitecode_split <- strsplit(sites$sitecode,"-")
sites$carea_km2 <- as.numeric(sapply(sitecode_split,function(x) x[2]))/100
# YAR sites with catchment areas between 300 and 500 sq km 
sites[sites$strcode == "YAR" & sites$carea_km2 > 300 & sites$carea_km2 < 500,] 
Simple feature collection with 7 features and 19 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 396840.2 ymin: 5826631 xmax: 402561.8 ymax: 5830244
CRS:           NA
      fid    sitecode     reach strcode sitecode_v12 reach_v12      str_nm
1106 1106 YAR-34257-1 YAR-34257     YAR   YAR_34210b YAR_34210 YARRA RIVER
1107 1107 YAR-34549-5 YAR-34549     YAR   YAR_34502f YAR_34502 YARRA RIVER
1108 1108 YAR-35245-8 YAR-35245     YAR   YAR_35198i YAR_35198 YARRA RIVER
1109 1109 YAR-35911-5 YAR-35911     YAR   YAR_35863f YAR_35863 YARRA RIVER
1110 1110 YAR-35962-3 YAR-35962     YAR   YAR_35914d YAR_35914 YARRA RIVER
1113 1113 YAR-42313-3 YAR-42313     YAR   YAR_42265d YAR_42265 YARRA RIVER
1114 1114 YAR-49518-8 YAR-49518     YAR   YAR_49470i YAR_49470 YARRA RIVER
                                                   location alt_location
1106            downstram of Upper Yarra Reservoir spillway             
1107                                 d/s of Upper Yarra Dam             
1108       Entrance to Upper Yarra Reservoir Park - Reefton             
1109                     Warburton-Woods Point Rd - Reefton         <NA>
1110                                   Armstrong Creek Road             
1113                              d/s Armstrong Ck- Reefton             
1114 Yarra River upstream of McMahons Bridge, Warburton Hwy             
         subc mgae_7855 mgan_7855          lga dus old_sitecode
1106 YARR6238  402561.8   5830244 Yarra Ranges   1      YAR0338
1107 YARR6347  401393.9   5830083 Yarra Ranges   5      YAR0345
1108 YARR6396  400800.1   5829890 Yarra Ranges   8      YAR0351
1109 YARR6377  399322.7   5829548 Yarra Ranges   5      YAR0358
1110 YARR6377  398668.5   5829833 Yarra Ranges   3      YAR0359
1113 YARR6088  398530.7   5829792 Yarra Ranges   3      YAR0423
1114 YARR6758  396840.2   5826631 Yarra Ranges   8      YAR0517
     old_sitecode_v10 mw_subcat mw_cat                     geom carea_km2
1106       YAR-6534-1        37      5 POINT (402561.8 5830244)    342.57
1107       YAR-6537-5        37      5 POINT (401393.9 5830083)    345.49
1108       YAR-6547-8        37      5 POINT (400800.1 5829890)    352.45
1109       YAR-6555-5        37      5 POINT (399322.7 5829548)    359.11
1110      YAR-6555C-3        37      5 POINT (398668.5 5829833)    359.62
1113       YAR-6625-3        37      5 POINT (398530.7 5829792)    423.13
1114       YAR-6694-8        37      5 POINT (396840.2 5826631)    495.18
# Returns 7 sites between the Upper Yarra Dam and McMahons Creek

To add new sites to the database, create a table of the new sites containing the following fields:

  • strcode, derived using the above advice;

  • old_sitecode, the site code that was used during your data collection;

  • location, a description of the site location in the same style as that of the database sites table (as informative as possible, including distance upstream or downstream from a landmark that is unlikely to change);

  • X, easting coordinate (in one of MGA 1994 Zone 55 [crs 28355], MGA 2020 Zone 55 [crs 7785], or decimal latitude);

  • Y, northing coordinate (in one of MGA 1994 Zone 55 [crs 28355], MGA 2020 Zone 55 [crs 7785], or decimal longitude [a negative value]);

and save this table as an excel file for uploading to the Melbourne Stream network site

This is an example of the process for preparing site data to match the database’s site table. The site data as supplied provides the above information for two sites (and for this example, I assume that a check has already been made to confirm that these two sites are not already included in the sites table)

# Download example data to the local 'temp' directory and open
eg_site <- as.data.frame(readxl::read_excel("~/temp/example_data_entry.xlsx"))
  old_sitecode strcode                                     location        X
1        GCEE2     GDM at McMullins Rd of Bacchus Marsh-Gisborne Rd 274986.7
2        PCEN4     PYR    at end of Lebner Track, d/s of gauge weir 277101.9
1 5837043
2 5836311

X and Y in this example are easting and northing coordinates in crs 7855. Coordinates for sampled sites should ideally be collected using a GPS device or a geocoded photo taken at the most downstream point of the sampled reach. Alternatively, coordinates could be mapped in a GIS program with reference to the Melbourne Water stream network (Walsh 2023). I used this excel file as input to the site snapper app, to produce an excel file1 and a gpkg file that snap the sites to the nearest point on the named stream in the stream network and derive sitecodes and other information about each site.

First check that the snapping process has not moved the sites further than expected:

# Download output xlsx file to the local 'temp' directory and open
eg_sites_snapped <- as.data.frame(readxl::read_excel("~/temp/snapped_points.xlsx"))
  old_sitecode sitecode_v11 d_moved_m
1        GCEE2   GDM-3708-8 5.0565398
2        PCEN4   PYR-6295-9 0.4422798

In this case the points were both only a few m from the stream line: coordinates of points that are moved >~25 m should be checked in a GIS program. It is also worth checking that no existing sites in database that should be considered the same site, which can also be done in a GIS program or in R using code like the following, after opening the output gpkg file from the site snapper app

eg_sites_formatted <- sf::st_read("~/temp/snapped_points.gpkg", quiet = TRUE)
# Stream network data from mwstr/data_dl
mwstr_dir <- paste0(wd, "mwstr_v13/")
mwstr_db <- RSQLite::dbConnect(sql_drv, paste0(mwstr_dir,"mwstr_v13.gpkg"))
streams <- sf::st_read(mwstr_db, query = 
    "SELECT * FROM streams WHERE strcode IN ('GDM','PYR');")
plot(sites$geom[sites$strcode %in% eg_sites_formatted$strcode], col = "darkgreen")
plot(eg_sites_formatted$geom, col = "maroon", cex = 2, add = TRUE)
plot(streams$geom, col = "blue", add = TRUE)

This shows that two new sites (large maroon circles) are both close matches for existing sites (smaller green circles), and should not be added to the sites table. The gpkg table compiled by the site snapper app does however provide a table of the same structure of the sites table in the database and can be used for compilation of the samples table (or for addition to the sites table if they were new sites).

Simple feature collection with 2 features and 17 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 274991.7 ymin: 5836311 xmax: 277101.9 ymax: 5837043
Projected CRS: GDA2020 / MGA zone 55
    sitecode    reach strcode sitecode_v12 reach_v12        str_nm
1 GDM-3708-8 GDM-3708     GDM    GDM_3708i  GDM_3708 GOODMAN CREEK
2 PYR-6295-9 PYR-6295     PYR    PYR_6296j  PYR_6296 PYRITES CREEK
                                      location alt_location subc mgae_7855
1 at McMullins Rd of Bacchus Marsh-Gisborne Rd           NA   NA  274991.7
2    at end of Lebner Track, d/s of gauge weir           NA   NA  277101.9
  mgan_7855       lga dus old_sitecode old_sitecode_v10 mw_subcat mw_cat
1   5837043 Moorabool  NA        GCEE2               NA        16      3
2   5836311 Moorabool  NA        PCEN4               NA        61      3
1 POINT (274991.7 5837043)
2 POINT (277101.9 5836311)

6.2 Preparation of samples table

The samples table should contain relevant details of all samples collected (and those unable to be collected if the stream was dry) in the project. If the sample contains no biotic data (for instance if the stream was dry, and the sample was unable to be collected), then that should be noted in the comment field for that sample and the field ‘dry_no_sample’ should be set to 1 in the final samples table.

The samples table should contain the fields:

  • smpcode (see below).

  • date. The date field should be formatted as a date in yyyy-mm-dd format.

  • sitecode. All values must have a matching value in the sites table.

  • monthcode. This can be calculated from the date using the calcMonthDate() function in bugDatabaseFunctions.R in R, or by the following formula in Excel (assuming the date is entered in cell A1): “=VALUE((YEAR(A1) - 1990)*12 + MONTH(A1))”.

  • area_m2. Surface area of the sample, for quantitative collection methods only. Leave blank (or NA in R) if sampled using RBA or another qualitative method.

  • replicate. The default value for replicate is 1. If more than one sample were collected from the same habitat using the same collection method, then the second sample should have replicate = 2, etc. Note it is possible for the same sample to have been processed in different ways to arrive at different biological data. (For instance a single sample could be field-sorted, and the residue kept and sorted in the lab, permitting assembly of a lab-sorted version of the sample). In such cases, the two versions of the sample would have the same replicate, but different method suffixes. (e.g. 143-LYR-722-3-1-AA, a field-sorted sample, and and 143-LYR-722-3-1-AC, a compiled version of the same sample equivalent to a subsampled lab-sorted sample: Walsh (2006)).

  • collection_method. All values must have a matching value in the collection_methods table.

  • processing_method. All values must have a matching value in the processing_methods table.

  • nsamples. If not a composite sample from multiple sample units, this should = 1.

  • subsample_perc. The percentage of the sample that was subsampled for sorting. Leave empty if field-sorted (NA in R), 100 if fully sorted in the lab.

  • sourcecode. Left empty: we will allocate appropriate sourcecodes on entering the data into the database.

  • comment. Any relevant explanatory comments about the sample.

  • embargoed. Probably 0. Let us know if you want the data embargoed for a period.

  • old_sitecode. Must either have a matching value in the sites table (a record of sitecodes used in preparation of dataset), or be empty.

  • old_samplecode. Like old_sitecode, this can be used to keep a record of the sample codes used in preparation of the dataset.

  • samppr. Code for sample pairs. If samples are one of a pair, then create a shared code by shortening the smpcodes.

  • three fields that can be left blank until biota data are compiled: signal_wov2003, signal2, and n_ept_fam

  • dry_no_sample. 1 if no sample could be taken because the stream was dry, 0 if a sample was taken.

With all other fields compiled, the smpcode field can then be compiled by concatenation:

  • in R, “paste(samples$monthcode,”-“, samples$sitecode,”-“, samples$replicate,”-“, cm$ccode[cm$collection_method == samples$collection_method], pm$ccode[pm$processing_method == samples$processing_method], sep =”“)” (where cm and pm are the collection_methods and processing_methods tables, respectively);

  • in Excel, ‘=A1&“-”&A2&“-”&A3&“-”&A4&A5’, assuming that monthcode, sitecode, replicate, ccode, and pcode are in the first five colums of row A.

This is an example of the process for preparing samples data to match the database’s site table, using data for four samples from the example dataset downloaded above.

eg_sample <- as.data.frame(readxl::read_excel("~/temp/example_data_entry.xlsx", 
                                              sheet = 2))

#Data provided for this study was undesirably minimalist
cm <- RPostgreSQL::dbReadTable(mwbugs_db, "collection_methods")
pm <- RPostgreSQL::dbReadTable(mwbugs_db, "processing_methods")
samples <- RPostgreSQL::dbReadTable(mwbugs_db, "samples")
spv <- RPostgreSQL::dbReadTable(mwbugs_db, "spv")

eg_sample$sitecode <- eg_sites_formatted$sitecode[match(eg_sample$old_sitecode, eg_sites_formatted$old_sitecode)]
eg_sample$sitecode_v12 <- eg_sites_formatted$sitecode_v12[match(eg_sample$old_sitecode, eg_sites_formatted$old_sitecode)]
#calculate monthcode using calcMonthDate() in bug_database_functions.R
eg_sample$date <- as.Date(eg_sample$date)
eg_sample$monthcode <- calcMonthDate(eg_sample$date)
#samples are all rapid bioassessment samples lab-sorted to 300 individuals, identified to family
#2 samples at all sites, some riffle-edge, some two edges...
eg_sample$collection_method <- "RBA edge (sweep)"
eg_sample$collection_method[tolower(eg_sample$hab) == "kick"] <- "RBA riffle (kick)"
eg_sample$processing_method <- "lab-subsample to 300, ID to family"
eg_sample$area_m2 <- eg_sample$comment <- eg_sample$signal_wov2003 <- 
                          eg_sample$signal2 <- eg_sample$n_ept_fam <- NA
eg_sample$nsamples <- 1
eg_sample$sourcecode <- 58
eg_sample$embargoed <- 0
eg_sample$replicate <- 1
eg_sample$dry_no_sample <- 1
eg_sample$smpcode <- with(eg_sample, 
                            paste(monthcode, "-", sitecode, "-", replicate, "-", 
                                  cm$ccode[match(collection_method, cm$collection_method)], 
                                  pm$pcode[match(processing_method, pm$processing_method)], 
                                  sep = ""))
eg_sample$replicate[duplicated(eg_sample$smpcode)] <- 2
eg_sample$smpcode <- with(eg_sample, 
                            paste(monthcode, "-", sitecode, "-", replicate, "-", 
                                  cm$ccode[match(collection_method, cm$collection_method)], 
                                  pm$pcode[match(processing_method, pm$processing_method)], 
                                  sep = ""))
# In this case the four samples constitute two sample pairs
eg_sample$samppr <- substr(eg_sample$smpcode, 1, nchar(eg_sample$smpcode) - 3)
eg_sample <- eg_sample[match(names(samples),names(eg_sample))]
              smpcode       date   sitecode sitecode_v12 monthcode area_m2
1 288-GDM-3708-8-1-BC 2013-12-12 GDM-3708-8    GDM_3708i       288      NA
2 288-GDM-3708-8-1-AC 2013-12-12 GDM-3708-8    GDM_3708i       288      NA
3 288-PYR-6295-9-1-BC 2013-12-12 PYR-6295-9    PYR_6296j       288      NA
4 288-PYR-6295-9-1-AC 2013-12-12 PYR-6295-9    PYR_6296j       288      NA
  replicate collection_method                  processing_method nsamples
1         1 RBA riffle (kick) lab-subsample to 300, ID to family        1
2         1  RBA edge (sweep) lab-subsample to 300, ID to family        1
3         1 RBA riffle (kick) lab-subsample to 300, ID to family        1
4         1  RBA edge (sweep) lab-subsample to 300, ID to family        1
  subsample_perc sourcecode comment embargoed old_sitecode old_samplecode
1             10         58      NA         0        GCEE2        GCEE2-K
2             25         58      NA         0        GCEE2        GCEE2-S
3             10         58      NA         0        PCEN4        PCEN4-K
4             10         58      NA         0        PCEN4        PCEN4-S
  signal_wov2003           samppr signal2 n_ept_fam dry_no_sample
1             NA 288-GDM-3708-8-1      NA        NA             1
2             NA 288-GDM-3708-8-1      NA        NA             1
3             NA 288-PYR-6295-9-1      NA        NA             1
4             NA 288-PYR-6295-9-1      NA        NA             1

6.3 Preparation of biota table

If you are beginning with written lab-sheets containing taxon names, then the data-entry app is a useful resource for preparing a biota table with the required structure and content. The table should be provided as a table with 5 or 6 columns:

  1. smpcode, with each value having a matching entry in the samples table. The only entries in the samples table that should not have matches in the biota table are those samples that contained no biota (for instance because the stream was dry at the time of sampling).

  2. taxoncode, with each value being a valid taxoncode (i.e. an 8-character code equivalent to a shortcode value in the taxon_all table, filled with trailing 9s or 0s if the shortcode is <8 characters long).

  3. taxon. While not absolutely necessary, including taxon names in the table is a useful level of redundancy to aid checking for any inconsistencies between taxoncodes and taxon names.

  4. count. The number of specimens of each taxon counted. For subsampled samples, these should be raw counts (Do not multiply the counts by 100/subsample size). For any coarse-picked specimens, multiply their count by the subsample proportion. (For instance, if two Cherax specimens are picked out of the sample before sorting a 15% subsample, enter a count value for these two specimens as 2*15/100 = 0.3, and set the coarsepick field for this specimen to 1). If a further Cherax specimen is found in the subsample, enter that specimen as a separate entry, with a count of 1 and coarsepick = 0. Thus, in this case Cherax would have two entries in the one sample.

  5. coarsepick. Default value is zero. Should only be 1 for specimens that have been picked from a sample before (or after) subsampling.

  6. notes. Any relevant notes on the specimen. For instance explain why a specimen could not be identified to the desired taxonomic level because (was it immature or damaged?)

6.4 Compilation of data file for submission

The three core tables (sites, samples, biota) should be supplied in an appropriate format (e.g samples, and biota could a 2-sheet excel file, and new_sites could be gpkg file) together with a cover letter providing information about the project that led to the data (see above).

Walsh, C J. 2006. “Biological Indicators of Stream Health Using Macroinvertebrate Assemblage Composition: A Comparison of Sensitivity to an Urban Gradient.” Marine and Freshwater Research 57 (1): 37–47.
———. 2023. “Melbourne Water Stream Network. Version 1.3. Melbourne Water Research Practice Partnership Technical Report 19.4d.” Report. School of Agriculture, Food and Ecosystem Sciences, The University of Melbourne. https://tools.thewerg.unimelb.edu.au/mwstr_manual/.

  1. Note that this excel file contains 4 worksheets: snapped_points with information about how far the points were moved; site_env_data, with environmental variables for each reach from the Melbourne Water stream network database; metadata, defining each field in each table; and input_data, equivalent to the data_entry table uploaded to the site snapper app.↩︎