6Protocol for providing data to be entered into the database
The database is an active resource, and will be updated with relevant data as it becomes available. If you have macroinvertebrate assemblage data that could be suitable for inclusion in the database, please contact Chris Walsh at cwalsh@unimelb.edu.au.
Data for the database should be prepared in a format consistent with the database structure (Figure fig-erd). If your standard approach to data preparation and curation is to use a spreadsheet program such as Microsoft Excel, a useful approach would be to download some data from the data explorer interface (see next section), and use the downloaded xlsx file as a template for data preparation. The data entry app may also prove useful.
Any data supplied should be accompanied (in a cover letter) by information about the project that led to the samples being collected. Did the project have a name? Who funded it? Are there any publications that report on or use the data? (If so please provide citation data, and preferably a copy of the publications.) Who collected the data (names and organization)? Who processed the data (names and organization)? What quality assurance protocols were used in the data collection and processing? This information will be used to populate the samples_provenance table and to assign a sourcecode to the samples.
6.1 Preparation of sites table
A sites table is necessary to build the correct smpcode for each sample. For sites already in the database, you can just subset the sites table. To check for sites already in the database (if you don’t already have correct sitecode values), each site in the supplied data should (at a minimum) be identified by the correct 3-letter streamcode used by the Melbourne Water Stream network, either using the stream line data (downloadable from the stream network download page), or by entering stream names in the stream selection drop-down box at stream network data explorer. You can then check for equivalent sites by site locations by searching the database. For instance to check for previously sampled sites on Lyrebird Gully Creek (strcode LYR);
# assuming the database gpkg is in a subfolder of `wd`, named 'mwbugs'# connect to the databasemwbugs_dir <-paste0(wd, "mwbugs/")sql_drv <- RSQLite::SQLite()mwbugs_db <- RSQLite::dbConnect(sql_drv, paste0(mwbugs_dir,"mwbugs_public.gpkg"))sites <- sf::st_read(mwbugs_db, "sites")sites[sites$strcode =="LYR",] # returns two sites alreadly in the database.
Simple feature collection with 2 features and 18 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 358630.8 ymin: 5811206 xmax: 358990.7 ymax: 5811927
CRS: NA
fid sitecode reach strcode sitecode_v12 reach_v12 str_nm
534 534 LYR-276-0 LYR-276 LYR LYR_276a LYR_276 LYREBIRD GULLY CREEK
1137 1137 LYR-722-3 LYR-722 LYR LYR_728e LYR_728 LYREBIRD GULLY CREEK
location alt_location subc mgae_7855 mgan_7855
534 Olinda Ck Rd nr Eagles Nest Rd YARR432 358630.8 5811206
1137 d/s Olinda Ck Rd YARR8823 358990.7 5811927
lga dus old_sitecode old_sitecode_v10 mw_subcat mw_cat
534 Yarra Ranges 0 LYR0006 LYR-4441-0 45 5
1137 Yarra Ranges 3 LYR0007 LYR-4859-3 45 5
geom
534 POINT (358630.8 5811206)
1137 POINT (358990.7 5811927)
Or, for sites on longer rivers, such as the Yarra River (YAR), with many sites already sampled, you could restrict the search by catchment area.
# First extract catchment areas in ha from the sitecodesitecode_split <-strsplit(sites$sitecode,"-")sites$carea_km2 <-as.numeric(sapply(sitecode_split,function(x) x[2]))/100# YAR sites with catchment areas between 300 and 500 sq km sites[sites$strcode =="YAR"& sites$carea_km2 >300& sites$carea_km2 <500,]
Simple feature collection with 7 features and 19 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 396840.2 ymin: 5826631 xmax: 402561.8 ymax: 5830244
CRS: NA
fid sitecode reach strcode sitecode_v12 reach_v12 str_nm
1106 1106 YAR-34257-1 YAR-34257 YAR YAR_34210b YAR_34210 YARRA RIVER
1107 1107 YAR-34549-5 YAR-34549 YAR YAR_34502f YAR_34502 YARRA RIVER
1108 1108 YAR-35245-8 YAR-35245 YAR YAR_35198i YAR_35198 YARRA RIVER
1109 1109 YAR-35911-5 YAR-35911 YAR YAR_35863f YAR_35863 YARRA RIVER
1110 1110 YAR-35962-3 YAR-35962 YAR YAR_35914d YAR_35914 YARRA RIVER
1113 1113 YAR-42313-3 YAR-42313 YAR YAR_42265d YAR_42265 YARRA RIVER
1114 1114 YAR-49518-8 YAR-49518 YAR YAR_49470i YAR_49470 YARRA RIVER
location alt_location
1106 downstram of Upper Yarra Reservoir spillway
1107 d/s of Upper Yarra Dam
1108 Entrance to Upper Yarra Reservoir Park - Reefton
1109 Warburton-Woods Point Rd - Reefton <NA>
1110 Armstrong Creek Road
1113 d/s Armstrong Ck- Reefton
1114 Yarra River upstream of McMahons Bridge, Warburton Hwy
subc mgae_7855 mgan_7855 lga dus old_sitecode
1106 YARR6238 402561.8 5830244 Yarra Ranges 1 YAR0338
1107 YARR6347 401393.9 5830083 Yarra Ranges 5 YAR0345
1108 YARR6396 400800.1 5829890 Yarra Ranges 8 YAR0351
1109 YARR6377 399322.7 5829548 Yarra Ranges 5 YAR0358
1110 YARR6377 398668.5 5829833 Yarra Ranges 3 YAR0359
1113 YARR6088 398530.7 5829792 Yarra Ranges 3 YAR0423
1114 YARR6758 396840.2 5826631 Yarra Ranges 8 YAR0517
old_sitecode_v10 mw_subcat mw_cat geom carea_km2
1106 YAR-6534-1 37 5 POINT (402561.8 5830244) 342.57
1107 YAR-6537-5 37 5 POINT (401393.9 5830083) 345.49
1108 YAR-6547-8 37 5 POINT (400800.1 5829890) 352.45
1109 YAR-6555-5 37 5 POINT (399322.7 5829548) 359.11
1110 YAR-6555C-3 37 5 POINT (398668.5 5829833) 359.62
1113 YAR-6625-3 37 5 POINT (398530.7 5829792) 423.13
1114 YAR-6694-8 37 5 POINT (396840.2 5826631) 495.18
# Returns 7 sites between the Upper Yarra Dam and McMahons Creek
To add new sites to the database, create a table of the new sites containing the following fields:
strcode, derived using the above advice;
old_sitecode, the site code that was used during your data collection;
location, a description of the site location in the same style as that of the database sites table (as informative as possible, including distance upstream or downstream from a landmark that is unlikely to change);
X, easting coordinate (in one of MGA 1994 Zone 55 [crs 28355], MGA 2020 Zone 55 [crs 7785], or decimal latitude);
Y, northing coordinate (in one of MGA 1994 Zone 55 [crs 28355], MGA 2020 Zone 55 [crs 7785], or decimal longitude [a negative value]);
and save this table as an excel file for uploading to the Melbourne Stream network site
This is an example of the process for preparing site data to match the database’s site table. The site data as supplied provides the above information for two sites (and for this example, I assume that a check has already been made to confirm that these two sites are not already included in the sites table)
# Download example data to the local 'temp' directory and opendownload.file("https://tools.thewerg.unimelb.edu.au/data/mwbugs/example_data_entry.xlsx", "~/temp/example_data_entry.xlsx")eg_site <-as.data.frame(readxl::read_excel("~/temp/example_data_entry.xlsx"))eg_site
old_sitecode strcode location X
1 GCEE2 GDM at McMullins Rd of Bacchus Marsh-Gisborne Rd 274986.7
2 PCEN4 PYR at end of Lebner Track, d/s of gauge weir 277101.9
Y
1 5837043
2 5836311
X and Y in this example are easting and northing coordinates in crs 7855. Coordinates for sampled sites should ideally be collected using a GPS device or a geocoded photo taken at the most downstream point of the sampled reach. Alternatively, coordinates could be mapped in a GIS program with reference to the Melbourne Water stream network (Walsh 2023). I used this excel file as input to the site snapper app, to produce an excel file1 and a gpkg file that snap the sites to the nearest point on the named stream in the stream network and derive sitecodes and other information about each site.
First check that the snapping process has not moved the sites further than expected:
# Download output xlsx file to the local 'temp' directory and opendownload.file("https://tools.thewerg.unimelb.edu.au/data/mwbugs/snapped_points_2023-08-13.xlsx", "~/temp/snapped_points.xlsx")eg_sites_snapped <-as.data.frame(readxl::read_excel("~/temp/snapped_points.xlsx"))eg_sites_snapped[c("old_sitecode","sitecode_v11","d_moved_m")]
In this case the points were both only a few m from the stream line: coordinates of points that are moved >~25 m should be checked in a GIS program. It is also worth checking that no existing sites in database that should be considered the same site, which can also be done in a GIS program or in R using code like the following, after opening the output gpkg file from the site snapper app
download.file("https://tools.thewerg.unimelb.edu.au/data/mwbugs/snapped_points_2023-08-13.gpkg", "~/temp/snapped_points.gpkg")eg_sites_formatted <- sf::st_read("~/temp/snapped_points.gpkg", quiet =TRUE)# Stream network data from mwstr/data_dlmwstr_dir <-paste0(wd, "mwstr_v13/")mwstr_db <- RSQLite::dbConnect(sql_drv, paste0(mwstr_dir,"mwstr_v13.gpkg"))streams <- sf::st_read(mwstr_db, query ="SELECT * FROM streams WHERE strcode IN ('GDM','PYR');")plot(sites$geom[sites$strcode %in% eg_sites_formatted$strcode], col ="darkgreen")plot(eg_sites_formatted$geom, col ="maroon", cex =2, add =TRUE)plot(streams$geom, col ="blue", add =TRUE)
This shows that two new sites (large maroon circles) are both close matches for existing sites (smaller green circles), and should not be added to the sites table. The gpkg table compiled by the site snapper app does however provide a table of the same structure of the sites table in the database and can be used for compilation of the samples table (or for addition to the sites table if they were new sites).
eg_sites_formatted
Simple feature collection with 2 features and 17 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 274991.7 ymin: 5836311 xmax: 277101.9 ymax: 5837043
Projected CRS: GDA2020 / MGA zone 55
sitecode reach strcode sitecode_v12 reach_v12 str_nm
1 GDM-3708-8 GDM-3708 GDM GDM_3708i GDM_3708 GOODMAN CREEK
2 PYR-6295-9 PYR-6295 PYR PYR_6296j PYR_6296 PYRITES CREEK
location alt_location subc mgae_7855
1 at McMullins Rd of Bacchus Marsh-Gisborne Rd NA NA 274991.7
2 at end of Lebner Track, d/s of gauge weir NA NA 277101.9
mgan_7855 lga dus old_sitecode old_sitecode_v10 mw_subcat mw_cat
1 5837043 Moorabool NA GCEE2 NA 16 3
2 5836311 Moorabool NA PCEN4 NA 61 3
geom
1 POINT (274991.7 5837043)
2 POINT (277101.9 5836311)
6.2 Preparation of samples table
The samples table should contain relevant details of all samples collected (and those unable to be collected if the stream was dry) in the project. If the sample contains no biotic data (for instance if the stream was dry, and the sample was unable to be collected), then that should be noted in the comment field for that sample and the field ‘dry_no_sample’ should be set to 1 in the final samples table.
The samples table should contain the fields:
smpcode (see below).
date. The date field should be formatted as a date in yyyy-mm-dd format.
sitecode. All values must have a matching value in the sites table.
monthcode. This can be calculated from the date using the calcMonthDate() function in bugDatabaseFunctions.R in R, or by the following formula in Excel (assuming the date is entered in cell A1): “=VALUE((YEAR(A1) - 1990)*12 + MONTH(A1))”.
area_m2. Surface area of the sample, for quantitative collection methods only. Leave blank (or NA in R) if sampled using RBA or another qualitative method.
replicate. The default value for replicate is 1. If more than one sample were collected from the same habitat using the same collection method, then the second sample should have replicate = 2, etc. Note it is possible for the same sample to have been processed in different ways to arrive at different biological data. (For instance a single sample could be field-sorted, and the residue kept and sorted in the lab, permitting assembly of a lab-sorted version of the sample). In such cases, the two versions of the sample would have the same replicate, but different method suffixes. (e.g. 143-LYR-722-3-1-AA, a field-sorted sample, and and 143-LYR-722-3-1-AC, a compiled version of the same sample equivalent to a subsampled lab-sorted sample: Walsh (2006)).
collection_method. All values must have a matching value in the collection_methods table.
processing_method. All values must have a matching value in the processing_methods table.
nsamples. If not a composite sample from multiple sample units, this should = 1.
subsample_perc. The percentage of the sample that was subsampled for sorting. Leave empty if field-sorted (NA in R), 100 if fully sorted in the lab.
sourcecode. Left empty: we will allocate appropriate sourcecodes on entering the data into the database.
comment. Any relevant explanatory comments about the sample.
embargoed. Probably 0. Let us know if you want the data embargoed for a period.
old_sitecode. Must either have a matching value in the sites table (a record of sitecodes used in preparation of dataset), or be empty.
old_samplecode. Like old_sitecode, this can be used to keep a record of the sample codes used in preparation of the dataset.
samppr. Code for sample pairs. If samples are one of a pair, then create a shared code by shortening the smpcodes.
three fields that can be left blank until biota data are compiled: signal_wov2003, signal2, and n_ept_fam
dry_no_sample. 1 if no sample could be taken because the stream was dry, 0 if a sample was taken.
With all other fields compiled, the smpcode field can then be compiled by concatenation:
in R, “paste(samples$monthcode,”-“, samples$sitecode,”-“, samples$replicate,”-“, cm$ccode[cm$collection_method == samples$collection_method], pm$ccode[pm$processing_method == samples$processing_method], sep =”“)” (where cm and pm are the collection_methods and processing_methods tables, respectively);
in Excel, ‘=A1&“-”&A2&“-”&A3&“-”&A4&A5’, assuming that monthcode, sitecode, replicate, ccode, and pcode are in the first five colums of row A.
This is an example of the process for preparing samples data to match the database’s site table, using data for four samples from the example dataset downloaded above.
eg_sample <-as.data.frame(readxl::read_excel("~/temp/example_data_entry.xlsx", sheet =2))#Data provided for this study was undesirably minimalistcm <- RPostgreSQL::dbReadTable(mwbugs_db, "collection_methods")pm <- RPostgreSQL::dbReadTable(mwbugs_db, "processing_methods")samples <- RPostgreSQL::dbReadTable(mwbugs_db, "samples")spv <- RPostgreSQL::dbReadTable(mwbugs_db, "spv")eg_sample$sitecode <- eg_sites_formatted$sitecode[match(eg_sample$old_sitecode, eg_sites_formatted$old_sitecode)]eg_sample$sitecode_v12 <- eg_sites_formatted$sitecode_v12[match(eg_sample$old_sitecode, eg_sites_formatted$old_sitecode)]#calculate monthcode using calcMonthDate() in bug_database_functions.Reg_sample$date <-as.Date(eg_sample$date)eg_sample$monthcode <-calcMonthDate(eg_sample$date)#samples are all rapid bioassessment samples lab-sorted to 300 individuals, identified to family#2 samples at all sites, some riffle-edge, some two edges...eg_sample$collection_method <-"RBA edge (sweep)"eg_sample$collection_method[tolower(eg_sample$hab) =="kick"] <-"RBA riffle (kick)"eg_sample$processing_method <-"lab-subsample to 300, ID to family"eg_sample$area_m2 <- eg_sample$comment <- eg_sample$signal_wov2003 <- eg_sample$signal2 <- eg_sample$n_ept_fam <-NAeg_sample$nsamples <-1eg_sample$sourcecode <-58eg_sample$embargoed <-0eg_sample$replicate <-1eg_sample$dry_no_sample <-1eg_sample$smpcode <-with(eg_sample, paste(monthcode, "-", sitecode, "-", replicate, "-", cm$ccode[match(collection_method, cm$collection_method)], pm$pcode[match(processing_method, pm$processing_method)], sep =""))eg_sample$replicate[duplicated(eg_sample$smpcode)] <-2eg_sample$smpcode <-with(eg_sample, paste(monthcode, "-", sitecode, "-", replicate, "-", cm$ccode[match(collection_method, cm$collection_method)], pm$pcode[match(processing_method, pm$processing_method)], sep =""))# In this case the four samples constitute two sample pairseg_sample$samppr <-substr(eg_sample$smpcode, 1, nchar(eg_sample$smpcode) -3)eg_sample <- eg_sample[match(names(samples),names(eg_sample))]DBI::dbDisconnect(mwbugs_db)eg_sample
smpcode date sitecode sitecode_v12 monthcode area_m2
1 288-GDM-3708-8-1-BC 2013-12-12 GDM-3708-8 GDM_3708i 288 NA
2 288-GDM-3708-8-1-AC 2013-12-12 GDM-3708-8 GDM_3708i 288 NA
3 288-PYR-6295-9-1-BC 2013-12-12 PYR-6295-9 PYR_6296j 288 NA
4 288-PYR-6295-9-1-AC 2013-12-12 PYR-6295-9 PYR_6296j 288 NA
replicate collection_method processing_method nsamples
1 1 RBA riffle (kick) lab-subsample to 300, ID to family 1
2 1 RBA edge (sweep) lab-subsample to 300, ID to family 1
3 1 RBA riffle (kick) lab-subsample to 300, ID to family 1
4 1 RBA edge (sweep) lab-subsample to 300, ID to family 1
subsample_perc sourcecode comment embargoed old_sitecode old_samplecode
1 10 58 NA 0 GCEE2 GCEE2-K
2 25 58 NA 0 GCEE2 GCEE2-S
3 10 58 NA 0 PCEN4 PCEN4-K
4 10 58 NA 0 PCEN4 PCEN4-S
signal_wov2003 samppr signal2 n_ept_fam dry_no_sample
1 NA 288-GDM-3708-8-1 NA NA 1
2 NA 288-GDM-3708-8-1 NA NA 1
3 NA 288-PYR-6295-9-1 NA NA 1
4 NA 288-PYR-6295-9-1 NA NA 1
6.3 Preparation of biota table
If you are beginning with written lab-sheets containing taxon names, then the data-entry app is a useful resource for preparing a biota table with the required structure and content. The table should be provided as a table with 5 or 6 columns:
smpcode, with each value having a matching entry in the samples table. The only entries in the samples table that should not have matches in the biota table are those samples that contained no biota (for instance because the stream was dry at the time of sampling).
taxoncode, with each value being a valid taxoncode (i.e. an 8-character code equivalent to a shortcode value in the taxon_all table, filled with trailing 9s or 0s if the shortcode is <8 characters long).
taxon. While not absolutely necessary, including taxon names in the table is a useful level of redundancy to aid checking for any inconsistencies between taxoncodes and taxon names.
count. The number of specimens of each taxon counted. For subsampled samples, these should be raw counts (Do not multiply the counts by 100/subsample size). For any coarse-picked specimens, multiply their count by the subsample proportion. (For instance, if two Cherax specimens are picked out of the sample before sorting a 15% subsample, enter a count value for these two specimens as 2*15/100 = 0.3, and set the coarsepick field for this specimen to 1). If a further Cherax specimen is found in the subsample, enter that specimen as a separate entry, with a count of 1 and coarsepick = 0. Thus, in this case Cherax would have two entries in the one sample.
coarsepick. Default value is zero. Should only be 1 for specimens that have been picked from a sample before (or after) subsampling.
notes. Any relevant notes on the specimen. For instance explain why a specimen could not be identified to the desired taxonomic level because (was it immature or damaged?)
6.4 Compilation of data file for submission
The three core tables (sites, samples, biota) should be supplied in an appropriate format (e.g samples, and biota could a 2-sheet excel file, and new_sites could be gpkg file) together with a cover letter providing information about the project that led to the data (see above).
Walsh, C J. 2006. “Biological Indicators of Stream Health Using Macroinvertebrate Assemblage Composition: A Comparison of Sensitivity to an Urban Gradient.”Marine and Freshwater Research 57 (1): 37–47.
———. 2023. “Melbourne Water Stream Network. Version 1.3. Melbourne Water Research Practice Partnership Technical Report 19.4d.” Report. School of Agriculture, Food and Ecosystem Sciences, The University of Melbourne. https://tools.thewerg.unimelb.edu.au/mwstr_manual/.
Note that this excel file contains 4 worksheets: snapped_points with information about how far the points were moved; site_env_data, with environmental variables for each reach from the Melbourne Water stream network database; metadata, defining each field in each table; and input_data, equivalent to the data_entry table uploaded to the site snapper app.↩︎