# assuming the database gpkg is in a subfolder of `wd`, named 'mwbugs'
# connect to the database
mwbugs_dir <- paste0(wd, "mwbugs/")
sql_drv <- RSQLite::SQLite()
db_m <- RSQLite::dbConnect(sql_drv, paste0(mwbugs_dir,"mwbugs_public.gpkg"))
#load all database tables
source(paste0("https://tools.thewerg.unimelb.edu.au/mwbugs/data/",
"bug_database_functions.R"))
load_all_mwbugs_tables(db_m)5 Biota and taxonomy tables
5.1 Biota table
The samples table contains 247,687 records of taxa from 10,007 samples.
The biota table links to the samples table by the smpcode field. Every smpcode has a list of taxa identified by taxoncode, and a count. The count value is the raw count of specimens: Data provided for the database should not be multiplied out. If the sample was subsampled, a rough, albeit naive, estimate of true abundance in the sample can be estimated by multiplying count by 100/subsample_perc (found in the samples table). However, a better way to use count and subsample_perc data is to model subsample error in any statistical model of taxon counts: see Walsh et al. (2023) for an example.
When samples are subsampled, it is standard practice to scan the whole sample for any large, rare taxa. Such specimens are identified using the coarsepick field (= 1 if the specimen is the result of such a scan, 0 otherwise).
If the original supplied data provided taxon names derived correctly from the database’s taxon tables, this is included in the taxon field of the biota table. Originally supplied taxoncodes are recorded in the ‘originalbugcode’ field (and may be corrected in the process of import into the database (see Chapter 7).
taxoncode values are consistent with past conventions. All are eight-characters long: the first two or three characters indicating phylum, class or order; the first four characters indicating family (or in some cases sub-family or tribe), the first six characters indicating genus; and the first eight characters indicating species. Specimens not identified to species level have trailing 9s or 0s to indicate that they have not been identified beyond a certain level.
shortcode is the taxoncode with trailing 9s or 0s removed. This field links to the shortcode field in taxon_all (see below).
5.2 Taxonomy tables
In preparing the database, I revised the structure of taxonomic data from earlier versions of the database to improve ease of updating taxonomic names and relations as they are revised, and of keeping track of those revisions. The original taxonomy table (called EPA bugcodes in some database versions, and referred to as the source table below) had >8800 rows, including taxoncode variants on partially unidentified specimens, and 15 fields with inconsistent treatment of taxonomic rank above genus-level, and inconsistent means of recording code and taxon-name revisions.
This EPA source table included all known Australian stream macroinvertebrates at the time (and some microinvertebrates). In building the new tables, new species have been added as they have been identified from Melbourne region samples, and taxonomic revisions have been made (while recording the nature of name changes).
The eight new taxonomic tables (Figure 2.1) take advantage of the hierarchical logic of the Victorian 8-character EPA taxoncodes developed by John Dean, that have become widely used across Australia. The database taxonomic tables can be accessed as an excel file from the data downloads tab
Taxon names can be derived from codes and vice-versa using R functions, discussed below. The data entry app uses the tables to translate inputted taxon names into taxoncodes, and compiles a ‘biota’ table suitable for input into the database.
The four main taxonomic table list all Australian stream macroinvertebrate taxa at:
Phylum, Class, Order (table taxon_rank1);
Family, sub-family, tribe (table taxon_fam);
Genus (table taxon_gen);
Species (table taxon_spp);
linked in hierarchical order by an increasingly long shortcode derived from the first portion of the 8-digit EPA taxoncode. Species and other taxa that have not been fullly assigned to the taxonomic hierarchy are stored in a separate table (morpho_spp). The shortcodes in these five tables (called t1code,famcode,gencode and spcode, in each table respectively) are compiled into the taxon_all table (where they are called shortcode) which lists all taxa (and the child table that they come from), and the taxon_all shortcode field is linked to the shortcode field in the biota table. The shortcode field of taxon_all is also linked to the biotic_indices table, which lists a range of biotic indices of stream health. Finally, the unlinked taxon_sym table lists synonyms for taxa that have been re-named or revised. The following gives more details on the structure of each table.
- Table taxon_rank1 lists all unique taxa above family level (phylum, class, and order). These are variously identified by the first 1, 2 or 3 characters in the taxoncode, abbreviated as the
t1code. Table 5.1 (built using the following code) shows several examples of these variations. The full table contains 90 entries.
flextable::regulartable(taxon_rank1[taxon_rank1$t1code %in%
c("IA","IB","IF2","K","KG","O","OJ","OJ3","Q","QT"),])t1code | phylum | class | order | lowesttaxon |
|---|---|---|---|---|
IA | Porifera | Porifera | ||
IF2 | Platyhelminthes | Turbellaria | Rhabdocoela | Rhabdocoela |
K | Mollusca | Mollusca | ||
KG | Mollusca | Gastropoda | Gastropoda | |
O | Arthropoda | Crustacea | Crustacea | |
OJ | Arthropoda | Crustacea | Copepoda | Copepoda |
OJ3 | Arthropoda | Crustacea | Cyclopoida | Cyclopoida |
Q | Arthropoda | Insecta | Insecta | |
QT | Arthropoda | Insecta | Trichoptera | Trichoptera |
- taxon_fam lists all families, subfamilies, and tribes (where relevant) specified by the first four characters of the
taxoncode, abbreviated as thefamcode. taxon_fam is linked to taxon_rank1 byt1code. Revisions to family names are handled by recording the old family name of renamed families in theoldfamilyfield, and where the revision involved more complex splitting, it is explained in thetaxonomic_notecolumn (Table 5.2). For this level and below, Rotifera (t1code = “J”) have been omitted, because the taxon codes in the source table were not as consistent as other groups, and rotifers are rarely identified in macroinvertebrate samples. The table contains 397 rows.
flextable::regulartable(taxon_fam[taxon_fam$famcode %in%
c("KG02","KG13","QC03","QDAF","QDAG","QO09","QO16","QT25"),])t1code | famcode | family | subfamily | tribe | oldfamily | taxonomic_note | lowesttaxon |
|---|---|---|---|---|---|---|---|
KG | KG02 | Tateidae | Hydrobiidae | Tateidae | |||
KG | KG13 | Pomatiopsidae | Hydrobiidae | Coxiella moved to this family | Pomatiopsidae | ||
QC | QC03 | Sphaeriusidae | Microsporidae | Revised to combine Microsporidae and Sphaeriidae | Sphaeriusidae | ||
QD | QDAF | Chironomidae | Orthocladiinae | Orthocladiinae | |||
QD | QDAG | Chironomidae | Chironominae | Pseudochironomini | Pseudochironomini | ||
QO | QO16 | Corduliidae | Most genera and species transferred to new Families | Corduliidae | |||
QT | QT25 | Leptoceridae | Leptoceridae |
- taxon_gen lists 1,175 genera, specified by the first 6 characters of the
taxoncode, abbreviated asgencode. The table is linked to taxon_fam byfamcode. As for families, revisions are recorded by noting the old name used for a revised genus in theold_genusfield. In its current state, the table includes superceded genus names, as noted in the taxonomic_note field (Table 5.3).
flextable::regulartable(taxon_gen[taxon_gen$gencode %in% c("IB0201","LO0506","QDAF02","QD0602","QE0201","QE0606","QE0803"),])famcode | gencode | genus | old_genus | taxonomic_note |
|---|---|---|---|---|
IB02 | IB0201 | Cordylophora | ||
LO05 | LO0506 | Pristinella | Transferred to Pristina | |
QD06 | QD0602 | Dixella | Paradixa | |
QDAF | QDAF02 | ?Acricotopus | 0 | |
QE02 | QE0201 | Baetis | Genus does not occur in Australia | |
QE06 | QE0606 | Tillyardophlebia | Leptophlebiidae Genus D | |
QE08 | QE0803 | Irpacaenis | Caenidae Genus C |
taxon_spp lists 5,030 species, specified by full 8-character
taxoncodes(calledspcodein this table). Table taxon_spp is linked to taxon_gen bygencode, and to taxon_rank1 byt1code. This table only contains taxa with matches to entries in the taxon_gen and taxon_rank1 tables. It contains some undescribed morphospecies. Morphospecies without links to taxon_gen are listed in the next table (morphospp_etc). The table also contains species that have a DNA sequence that closely matches (>= 97% similarity) one or more BIN codes (Ratnasingham and Hebert 2013) which have been identified to at least genus-level (see Chapter 6).Revisions at the species level have not been fully checked at the time of writing: the field
tempNotecontains notes on revisions taken directly from the source table (Table 5.4).
flextable::regulartable(taxon_spp[taxon_spp$spcode %in%
c("KP010401","LH010401","QO161808","QT2507B1","QT0605B1","QT251102"),])gencode | spcode | species | temp_note | taxonomic_note |
|---|---|---|---|---|
KP0104 | KP010401 | Velesunio ambiguus | ||
LH0104 | LH010401 | Glossiphonia australiensis | ||
QT0605 | QT0605B1 | Cheumatopsyche sp. AV10 | [= QT060503 Ch.kakaduensis] | |
QT2507 | QT2507B1 | Oecetis EPA sp. 10 | (= EPA sp.1, QT2507A1) | |
QT2511 | QT251102 | Triplectides australicus |
morphospp_etc lists 727 lowesttaxon entries in the source table of three types (field
type):‘voucher’ taxa, that were identified as morphospecies by various analysts. These taxa have not been well curated, and likely contain substantial redundancy. Their primary value is for within-study use, where they are unlikely to be ambiguous identifications of other taxa. However, some better curated voucher taxa can be corrected as taxonomic knowledge advances, and this has been done for many of the Tipulidae morphospecies (see taxon_syn, below).
‘partially unidentified taxa’ such as two morphologically similar genera that are difficult to tell apart (e.g. Physa/Physastra). Such taxa can be of use within datasets that have consistently been used, but are ambiguous more broadly, and should be used with caution.
species that have a DNA sequence that closely matches (>= 97% similarity) one or more BIN codes (Ratnasingham and Hebert 2013) which have been identified to family-level or higher (see Chapter 6).
taxon_all is simply a compilation of the
shortcode(i.e.t1code,famcode,gencodeandspcode) andtaxonfields from the preceding taxonomic tables. The name of the shortcode differs among tables to permit the hierarchical linking of the tables:t1codefor taxon_rank1,famcodefor taxon_fam,gencodefor taxon_gen,spcodefor taxon_spp, andtaxoncodefor morphospp_etc). taxon_all has a third field,table, indicating the source table for each taxon.taxon_syn lists any taxonomic changes to permit tracking of deprecated taxon names and taxoncodes. It is not intended to be a comprehensive list of taxonomic synonyms but rather ensure internal ability to track changes noted in the
originalbugcodecolumn of the biota table or inconsistencies in thetaxoncolumn in the biota table (which contains taxon names as they were in the supplied data).
While most taxonomic revisions recorded in the taxon_syn table will have no effect on data compilation and inference, two major revisions leading to the incorporation of DNA-derived data into the database database could lead to potential ambiguity for mites (t1code MM), and for Tipulidae and Limoniidae (famcodes QD01 and QD02). See Appendix A for more detail.
5.3 Tables and functions for taxonomic analysis
The SIGNAL scores listed in the samples and sample pairs tables were calculated using the calcSIGNAL() function in the melbstreambiota R package (Walsh et al. 2019), using grades as listed in the biotic_indices table. That table lists all taxa (and their shortcodes) that have been used for SIGNAL and other biotic indices of stream health which assign sensitivity grades to taxa. The sensitivity grades listed in this table are:
SIGNAL2 (Chessman 2003);
SIGNALWoV2001 and SIGNALWoV2003, variants on the SIGNAL scores used in the State Environment Protection Authority (Waters of Victoria) (SEPP Wov, EPA Victoria 2004);
SIGNALWestPt2003 SIGNAL score for streams of the Westernport catchment variation to the SEPP WoV (Government of Victoria 2001);
SIGNALYarra2003 SIGNAL score for streams of the Yarra catchment variation to the SEPP WoV (Victoria 1999);
Rheophily.index, an index of sensitivity to flow stress (Bond, Thomson, and Reich 2012).
Note that the biotic_indices table uses shortcode to specify families and does not include the trailing 9s or 0s that are typically used in specifying (say) families (as in the taxoncode field of the biota table. The function taxoncodeToShortcode() in “bug_database_functions.R”1 can be used to convert taxoncodes with trailing 9s and 0s to shortcodes. This function was used to populate the shortcode column in the biota table.
A large proportion, but not all, of the database follows a useful convention that is not widely followed. Taxoncodes that end in trailing 00s indicate specimens that were identified to the specified supra-specific level (genus, family, etc.) because that was the taxonomic level for that taxon used in the sample or study. Taxoncodes that end in trailing 99s indicate specimens that were damaged or immature and were identified to a level higher than was used for other specimens of that taxon in the sample or study. For example, if a project required genus-level identification, taxoncode for identifiable specimens of Cricotopus would be recorded as QDAF1500, while a damaged or immature Orthocladiinae specimen, which may or may not be Cricotopus, would be recorded as QDAF9999”.
A convention that is retained for past data is that coleopteran specimens ending in “9I” are larvae, and dipteran specimens ending in “9I” are pupae. I propose that future data follow these conventions (with the last convention being optional).
The data entry app aids compilation of sample data by taking inputted taxon names, converting them to the correct taxoncodes and compiling them into a table consistent with the biota table.
A vector of taxoncodes can be imported into R, and converted to their taxonomic hierarchy using the function “codeTaxonomy()”. As an example, the taxoncodes of specimens collected in the 2018-2019 Melbourne Water biomonitoring program as received from the consultants who identified them can be converted to taxon names, producing output as in Table 5.5. The taxonomy table in “mwbugs_taxonomy_tables.xlsx” (downloadble from the data downloads tab of the database web site) was compiled using this function.
# testBiota loaded from supplied data archive (code not shown)
# Note that fields for taxon code and taxon name need to be "taxoncode" and
# "lowesttaxon", consistent with database
testBiota <- unique(testBiota[c("bugcode","lowesttaxon")])
names(testBiota) <- c("taxoncode","lowesttaxon")
bugnames <- codeTaxonomy(testBiota$taxoncode)
#show first 6 taxa
flextable::regulartable(head(bugnames))taxoncode | phylum | class | order | family | subfamily | tribe | species |
|---|---|---|---|---|---|---|---|
MM999999 | Arthropoda | Mites | |||||
KG069999 | Mollusca | Gastropoda | Ancylinae (Planorbidae) | Ancylinae | |||
QO999998 | Arthropoda | Insecta | Odonata | ||||
QDAA9999 | Arthropoda | Insecta | Diptera | Chironomidae | Aphroteniinae | ||
QT239999 | Arthropoda | Insecta | Trichoptera | Atriplectididae | |||
OT019999 | Arthropoda | Crustacea | Caridea | Atyidae |
Furthermore, the naming conventions in the supplied data can be checked for consistency with the database. In the supplied 2018-2019 MW biomonitoring data there were 21 mismatches between the names with the supplied data and the database data. The example output in Table 5.6 shows that the mismatches were a combination of taxonomic revisions not being up-to-date in the supplied data (e.g. Sphaeriusidae, Platyncemididae, Cordylophoridae), of non-standard terminology (e.g. Talitridae sp.(Unident.)), and of misspellings (e.g. Neurorthidae, rather than Nevrorthidae), allowing useful checks of data correctness.
mismatches <- checkTaxonNames(testBiota)
#show last 7 taxa
flextable::regulartable(tail(mismatches,7))taxoncode | phylum | class | order | family | subfamily | tribe | species | lowesttaxon | suppliedNames |
|---|---|---|---|---|---|---|---|---|---|
QC039999 | Arthropoda | Insecta | Coleoptera | Sphaeriusidae | Sphaeriusidae | Microsporidae (Unident.) | |||
OP019999 | Arthropoda | Crustacea | Amphipoda | Talitridae | Talitridae | Talitridae sp.(Unident.) | |||
QO049999 | Arthropoda | Insecta | Odonata | Platycnemididae | Platycnemididae | Protoneuridae (Unident.) | |||
QN049999 | Arthropoda | Insecta | Neuroptera | Nevrorthidae | Nevrorthidae | Neurorthidae (Unident.) | |||
QD0999I1 | Arthropoda | Insecta | Diptera | Ceratopogonidae | Ceratopogonidae | Ceratopogoninae (Unident.) | |||
OR259999 | Arthropoda | Crustacea | Isopoda | Oniscidae | Oniscidae | Oniscidae sp.(Unident.) | |||
IB029999 | Cnidaria | Hydrozoa | Anthoathecata | Cordylophoridae | Cordylophoridae | Clavidae (Unident.) |
see the data downloads tab of the database web site.↩︎