The samples table contains 217,200 records of taxa from 9,308 samples.
The biota table links to the samples table by the smpcode field. Every smpcode has a list of taxa identified by taxoncode, and a count. The count value is the raw count of specimens: Data provided for the database should not be multiplied out.. If the sample was subsampled, a rough, albeit naive, estimate of true abundance in the sample can be estimated by multiplying count by 100/subsample_perc (found in the samples table). However, a better way to use count and subsample_perc data is to model subsample error in any statistical model of taxon counts: see C. J. Walsh et al. (2023) for an example.
When samples are subsampled, it is standard practice to scan the whole sample for any large, rare taxa. Such specimens are identified using the coarsepick field (= 1 if the specimen is the result of such a scan, 0 otherwise).
If the original supplied data provided taxon names derived correctly from the database’s taxon tables, this is included in the ‘taxon’ field of the biota table. Originally supplied taxoncodes are recorded in the ‘originalbugcode’ field (and may be corrected in the process of import into the database (see Chapter 6).
taxoncode values are consistent with past conventions. All are eight-characters long: the first two or three characters indicating phylum, class or order; the first four characters indicating family (or in some cases sub-family or tribe), the first six characters indicating genus; and the first eight characters indicating species. Specimens not identified to species level have trailing 9s or 0s to indicate that they have not been identified beyond a certain level.
shortcode is the taxoncode with trailing 9s or 0s removed. This field links to the shortcode field in taxon_all (see below).
5.2 Taxonomy tables
In preparing the database, I revised the structure of taxonomic data from earlier versions of the database to improve ease of updating taxonomic names and relations as they are revised, and of keeping track of those revisions. The original taxonomy table (called EPA bugcodes in some database versions, and referred to as the source table below) had >8800 rows, including taxoncode variants on partially unidentified specimens, and 15 fields with inconsistent treatment of taxonomic rank above genus-level, and inconsistent means of recording code and taxon-name revisions.
The eight new taxonomic tables (Fig. 1) take advantage of the hierarchical logic of the Victorian 8-character EPA taxoncodes developed by John Dean, that have become widely used across Australia. The database taxonomic tables can be accessed via an excel file from the data downloads tab
Taxon names can be derived from codes and vice-versa using R functions, discussed below. The data entry app uses the tables to translate inputted taxon names into taxoncodes, and compiles a ‘biota’ table suitable for input into the database.
The four main taxonomic table list all Australian stream macroinvertebrate taxa at:
Phylum, Class, Order (table taxon_rank1);
Family, sub-family, tribe (table taxon_fam);
Genus (table taxon_gen);
Species (table taxon_spp);
linked in hierarchical order by an increasingly long shortcode derived from the first portion of the 8-digit EPA taxoncode. Species and other taxa that have not been fullly assigned to the taxonomic hierarchy are stored in a separate table (morpho_spp). The shortcodes in these five tables (called t1code,famcode,gencode and spcode, in each table respectively) are compiled into the taxon_all table (where they are called shortcode) which lists all taxa (and the child table that they come from), and the taxon_all shortcode field is linked to the shortcode field in the biota table. The shortcode field of taxon_all is also linked to the biotic_indices table, which lists a range of biotic indices of stream health. Finally, the unlinked taxon_sym table lists synonyms for taxa that have been re-named or revised. The following gives more details on the structure of each table.
Table taxon_rank1 lists all unique taxa above family level (phylum, class, and order). These are variously identified by the first 1, 2 or 3 characters in the taxoncode, abbreviated as the t1code. Table 5.1 (built using the following code) shows several examples of these variations. The full table contains 90 entries.
# assuming the database gpkg is in a subfolder of `wd`, named 'mwbugs'# connect to the databasemwbugs_dir <-paste0(wd, "mwbugs/")sql_drv <- RSQLite::SQLite()mwbugs_db <- RSQLite::dbConnect(sql_drv, paste0(mwbugs_dir,"mwbugs_public.gpkg"))#load all database tablessource(paste0("https://tools.thewerg.unimelb.edu.au/data/mwbugs/","bug_database_functions.R"))load_all_mwbugs_tables(mwbugs_db)# disconnect from databaseDBI::dbDisconnect(mwbugs_db)flextable::regulartable(taxon_rank1[taxon_rank1$t1code %in%c("IA","IB","IF2","K","KG","O","OJ","OJ3","Q","QT"),])
Table 5.1: A selection of the 90 rows in the taxon_rank1 table to illustrate the variations of the t1 code.
t1code
phylum
class
order
lowesttaxon
IA
Porifera
Porifera
IB
Cnidaria
Hydrozoa
Hydrozoa
IF2
Playhelminthes
Turbellaria
Rhabdocoela
Rhabdocoela
K
Mollusca
Mollusca
KG
Mollusca
Gastropoda
Gastropoda
O
Arthropoda
Crustacea
Crustacea
OJ
Arthropoda
Crustacea
Copepoda
Copepoda
OJ3
Arthropoda
Crustacea
Cyclopoida
Cyclopoida
Q
Arthropoda
Insecta
Insecta
QT
Arthropoda
Insecta
Trichoptera
Trichoptera
taxon_fam lists all families, subfamilies, and tribes (where relevant) specified by the first four characters of the taxoncode, abbreviated as the famcode. taxon_fam is linked to taxon_rank1 by t1code. Revisions to family names are handled by recording the old family name of renamed families in the oldfamily field, and where the revision involved more complex splitting, it is explained in the taxonomic_note column (Table 5.2). For this level and below, Rotifera (t1code = “J”) have been omitted, because the taxon codes in the source table were not as consistent as other groups, and rotifers are rarely identified in macroinvertebrate samples. The table contains 389 rows.
Table 5.2: A selection of the 389 rows in the taxon_fam table to illustrate the table structure, and the nature of revision notes.
t1code
famcode
family
subfamily
tribe
oldfamily
taxonomic_note
lowesttaxon
KG
KG02
Tateidae
Hydrobiidae
Tateidae
KG
KG13
Pomatiopsidae
Hydrobiidae
Coxiella moved to this family
Pomatiopsidae
QC
QC03
Sphaeriusidae
Microsporidae
Revised to combine Microsporidae and Sphaeriidae
Sphaeriusidae
QD
QDAF
Chironomidae
Orthocladiinae
Orthocladiinae
QD
QDAG
Chironomidae
Chironominae
Pseudochironomini
Pseudochironomini
QO
QO16
Corduliidae
Most genera and species transferred to new Families
Corduliidae
QT
QT25
Leptoceridae
Leptoceridae
taxon_gen lists 1,114 genera, specified by the first 6 characters of the taxoncode, abbreviated as gencode. The table is linked to taxon_fam by famcode. As for families, revisions are recorded by noting the old name used for a revised genus in the old_genus field. In its current state, the table includes superceded genus names, as noted in the taxonomic_note field (Table 5.3).
Table 5.3: A selection of the 1114 rows in the taxon_gen table to illustrate the table structure, and the nature of revision notes.
famcode
gencode
genus
old_genus
taxonomic_note
IB02
IB0201
Cordylophora
LO05
LO0506
Pristinella
Transferred to Pristina
QD06
QD0602
Dixella
Paradixa
QDAF
QDAF02
?Acricotopus
0
QE02
QE0201
Baetis
Genus does not occur in Australia
QE06
QE0606
Tillyardophlebia
Leptophlebiidae Genus D
QE08
QE0803
Irpacaenis
Caenidae Genus C
taxon_spp lists 4,611 species, specified by full 8-character taxoncodes (called spcode in this table). Table taxon_spp is linked to taxon_gen by gencode, and to taxon_rank1 by t1code. This table only contains taxa with matches to entries in the taxon_gen and taxon_rank1 tables. It contains some undescribed morphospecies. Morphospecies without links to taxon_gen are listed in the next table (morphospp_etc). Revisions at the species level have not been fully checked at the time of writing: the field tempNote contains notes on revisions taken directly from the source table (Table 5.4).
Table 5.4: A selection of the 4611 rows in the taxon_spp table to illustrate the table structure, and the nature of revision notes.
gencode
spcode
species
tempNote
taxonomic_note
KP0104
KP010401
Velesunio ambiguus
LH0104
LH010401
Glossiphonia australiensis
QO1618
QO161808
Synthemis ofarrelli
[Now QO230802]
QT0605
QT0605B1
Cheumatopsyche sp.AV10
[= QT060503 Ch.kakaduensis]
QT2507
QT2507B1
Oecetis EPA sp.10
(= EPA sp.1, QT2507A1)
QT2511
QT251102
Triplectides australicus
morphospp_etc lists 1008 lowesttaxon entries in the source table of three types (field type):
‘voucher’ taxa, that were identified as morphospecies by various analysts. These taxa have not been well curated, and likely contain substantial redundancy. Their primary value is for within-study use, where they are unlikely to be ambiguous identifications of other taxa. However, some better curated voucher taxa can be corrected as taxonomic knowledge advances, and this has been done for many of the Tipulidae morphospecies (see taxon_syn) below.
‘partially unidentified taxa’ such as two morphologically similar genera that are difficult to tell apart (e.g. Physa/Physastra). Such taxa can be of use within datasets that have consistently been used, but are ambiguous more broadly, and should be used with caution.
‘DNA barcoded species without confirmed identification’ are species that have been identified by DNA barcoding analysis, but not (yet) asssigned to a formal species. DNA barcode data will be a growing element of the database and are discussed further below.
taxon_all is simply a compilation of the shortcode (i.e. t1code,famcode,gencode and spcode) and taxon fields from the preceding taxonomic tables. The name of the shortcode differs among tables to permit the hierarchical linking of the tables: t1code for taxon_rank1, famcode for taxon_fam, gencode for taxon_gen, spcode for taxon_spp, and taxoncode for morphospp_etc). taxon_all has a third field, table, indicating the source table for each taxon.
taxon_syn lists any taxonomic changes to permit tracking of deprecated taxon names and taxoncodes. It is not intended to be a comprehensive list of taxonomic synonyms but rather ensure internal ability to track changes noted in the originalbugcode column of the biota table or inconsistencies in the taxon column in the biota table (which contains taxon names as they were in the supplied data).
While most taxonomic revisions recorded in the taxon_syn table will have no effect on data compilation and inference, two major revisions leading to version 2 of the database could lead to potential ambiguity for mites (t1code MM), and for Tipulidae and Limoniidae (famcodes QD01 and QD02). See Appendix A for more detail.
5.3 Tables and functions for taxonomic analysis
The SIGNAL scores listed in the samples and sample pairs tables were calculated using the calcSIGNAL() function in the melbstreambiota R package (C. Walsh C J 2019), using grades as listed in the biotic_indices table. That table lists all taxa (and their shortcodes) that have been used for SIGNAL and other biotic indices of stream health which assign sensitivity grades to taxa. The sensitivity grades listed in this table are:
SIGNALWoV2001 and SIGNALWoV2003, variants on the SIGNAL scores used in the State Environment Protection Authority (Waters of Victoria) (SEPP Wov, EPA Victoria 2004);
SIGNALWestPt2003 SIGNAL score for streams of the Westernport catchment variation to the SEPP WoV (Government of Victoria 2001);
SIGNALYarra2003 SIGNAL score for streams of the Yarra catchment variation to the SEPP WoV (Victoria 1999);
Note that the biotic_indices table uses shortcode to specify families and does not include the trailing 9s or 0s that are typically used in specifying (say) families (as in the taxoncode field of the biota table. The function taxoncodeToShortcode() in “bug_database_functions.R”1 can be used to convert taxoncodes with trailing 9s and 0s to shortcodes. This function was used to populate the shortcode column in the biota table.
A large proportion, but not all, of the database follows a useful convention that is not widely followed. Taxoncodes that end in trailing 00s indicate specimens that were identified to the specified supra-specific level (genus, family, etc.) because that was the taxonomic level for that taxon used in the sample or study. Taxoncodes that end in trailing 99s indicate specimens that were damaged or immature and were identified to a level higher than was used for other specimens of that taxon in the sample or study. A convention that is retained for past data is coleopteran specimens ending in “9I” are larvae, and dipteran specimens ending in “9I” are pupae. I propose that all future data follow these conventions (with the last convention being optional).
The data entry app aids compilation of sample data by taking inputted taxon names, converting them to the correct taxoncodes and compiling them into a table consistent with the biota table.
A vector of taxoncodes can be imported into R, and converted to their taxonomic hierarchy using the function “codeTaxonomy()”. As an example, the taxoncodes of specimens collected in the 2018-2019 Melbourne Water biomonitoring program as received can be converted to taxon names, producing output as in Table 5.5. The taxonomy table in “mwbugs_taxonomy_tables.xlsx” (downloadble from the data downloads tab of the database web site) was compiled using this function.
# testBiota loaded from supplied data archive (code not shown)# Note that fields for taxon code and taxon name need to be "taxoncode" and # "lowesttaxon", consistent with databasetestBiota <-unique(testBiota[c("bugcode","lowesttaxon")])names(testBiota) <-c("taxoncode","lowesttaxon")bugnames <-codeTaxonomy(testBiota$taxoncode)#show first 6 taxaflextable::regulartable(head(bugnames))
Table 5.5: Example output from the codeTaxonomy() function, and code for running it.
taxoncode
phylum
class
order
family
subfamily
tribe
species
MM999999
Arthropoda
Mites
KG069999
Mollusca
Gastropoda
Planorbidae
Ancylinae
QO999998
Arthropoda
Insecta
Odonata
QDAA9999
Arthropoda
Insecta
Diptera
Chironomidae
Aphroteniinae
QT239999
Arthropoda
Insecta
Trichoptera
Atriplectididae
OT019999
Arthropoda
Crustacea
Caridea
Atyidae
Furthermore, the naming conventions in the supplied data can be checked for consistency with the database. In the supplied 2018-2019 MW biomonitoring data there were 21 mismatches between the names with the supplied data and the database data. The example output in Table 5.6 shows that the mismatches were a combination of taxonomic revisions not being up-to-date in the supplied data (e.g. Sphaeriusidae, Platyncemididae, Cordylophoridae), of non-standard terminology (e.g. Talitridae sp.(Unident.)), and of misspellings (e.g. Neurorthidae, rather than Nevrorthidae), allowing useful checks of data correctness.
mismatches <-checkTaxonNames(testBiota)#show last 7 taxaflextable::regulartable(tail(mismatches,7))
Table 5.6: Example code for running the function checkTaxonNames(), and its output, showing mismatches between supplied taxon names and database names.
taxoncode
phylum
class
order
family
subfamily
tribe
species
lowesttaxon
suppliedNames
QC039999
Arthropoda
Insecta
Coleoptera
Sphaeriusidae
Sphaeriusidae
Microsporidae (Unident.)
OP019999
Arthropoda
Crustacea
Amphipoda
Talitridae
Talitridae
Talitridae sp.(Unident.)
QO049999
Arthropoda
Insecta
Odonata
Platycnemididae
Platycnemididae
Protoneuridae (Unident.)
QN049999
Arthropoda
Insecta
Neuroptera
Nevrorthidae
Nevrorthidae
Neurorthidae (Unident.)
QD0999I1
Arthropoda
Insecta
Diptera
Ceratopogonidae
Ceratopogonidae
Ceratopogoninae (Unident.)
OR259999
Arthropoda
Crustacea
Isopoda
Oniscidae
Oniscidae
Oniscidae sp.(Unident.)
IB029999
Cnidaria
Hydrozoa
Cordylophoridae
Cordylophoridae
Clavidae (Unident.)
5.4 DNA barcode data
At the time of writing no samples identified by DNA barcoding or metabarcoding have been added to the database, but such samples will become an increasingly common element of the database, and the taxonomic tables are being populated with DNA barcoding information.
Many species in the database have DNA barcodes, and the table bold_id_numbers table matches spcode (in the taxon_spp table)/taxoncode (in the taxon_all and morphospp_etc tables) to BOLD identifier codes https://boldsystems.org/.
The large variety of organisms that can be identified in metabarcoding samples requires explicit criteria for which taxa should be excluded from a stream macroinvertebrate database. At the data import stage only species from the phyla Annelida, Arthropoda, Mollusca, Cnidaria, Platyhelminthes, Nemertea, and Nematoda are included. This excludes non-animals, tardigrades, rotifera and non-invertebrates. Terrestrial or largely-terrestrial invertebrate classes Collembola and Diplopoda, and families Cicadellidae, Delphacidae and Thripidae are also excluded. Microcrustacea, which are unlikely to be reliably caught in the 250 micron nets used for macroinvertebrate sampling, are also excluded: class Ostracoda, and and families Chydoridae and Daphniidae. After applying those filters, additional species-level exclusions are recorded in (eith explanation) in the species_excluded table.
Chessman, B C. 2003. “New Sensitivity Grades for Australian River Macroinvertebrates.” Journal Article. Marine and Freshwater Research 54 (2): 95–103.
EPA Victoria. 2004. “Biological Objectives for Rivers and Streams–Ecosystem Protection.” Report 793.2. Environment Protection Authority Victoria.
Government of Victoria. 2001. “Variation to StateEnvironmentProtectionPolicy (Waters of Victoria): Insertion of ScheduleF8, Waters of the WesternPort and Catchment.” Report. The Craftsman Press. http://www.craftpress.com.au/gazette/Gazettes1999/GG1999S089.pdf.
Victoria, Government of. 1999. “Variation to State Environment Protection Policy (Waters of Victoria): Insertion of Schedule F7, Waters of the Yarra Catchment.” Report No. S89. The Craftsman Press. http://www.craftpress.com.au/gazette/Gazettes1999/GG1999S089.pdf.
Walsh, C J, J A Webb, Gwinn D C, and P F Breen. 2023. “Constructed Rock Riffles Increase Habitat Heterogeneity but Not Biodiversity in Streams Constrained by Urban Impacts.”Ecosphere, in review.
Walsh, Chee, C J. 2019. “Assessing and Understanding in-Stream Habitat Suitability for Invertebrate Families, Fish Species and Platypus of the Melbourne Region with the Melbstreambiota Package (Version 0.1.2).”https://osf.io/v2tda.