5  Biota and taxonomy tables

5.1 Biota table

The biota table contains 217,200 records of taxa from 9,308 samples.

The biota table links to the samples table by the smpcode field. Every smpcode has a list of taxa, each identified by a taxoncode and accompanied by a count. The count value is the raw count of specimens: data provided for the database should not be multiplied out. If the sample was subsampled, a rough, albeit naive, estimate of true abundance in the sample can be calculated by multiplying count by 100/subsample_perc (found in the samples table). However, a better way to use count and subsample_perc data is to model subsample error in any statistical model of taxon counts: see C. J. Walsh et al. (2023) for an example.

When samples are subsampled, it is standard practice to scan the whole sample for any large, rare taxa. Such specimens are identified using the coarsepick field (= 1 if the specimen is the result of such a scan, 0 otherwise).
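The naive abundance estimate described above can be sketched in a few lines of R. The two data frames below are invented toy data that borrow only the field names smpcode, taxoncode, count, coarsepick and subsample_perc from the database. The decision to leave coarse-pick counts unscaled is an assumption made here (those specimens come from a scan of the whole sample), not a rule stated by the database.

```r
# Toy biota and samples tables; all values are invented for illustration.
biota_toy <- data.frame(
  smpcode    = c("S1", "S1", "S1", "S2"),
  taxoncode  = c("QT259999", "QDAF9999", "OT019999", "QT259999"),
  count      = c(12, 40, 1, 7),
  coarsepick = c(0, 0, 1, 0)
)
samples_toy <- data.frame(
  smpcode        = c("S1", "S2"),
  subsample_perc = c(10, 100)
)

# Attach each sample's subsampling percentage to its biota records
x <- merge(biota_toy, samples_toy, by = "smpcode")

# Naive estimate: scale subsampled counts by 100 / subsample_perc.
# Assumption: coarse-pick specimens (coarsepick == 1) were found by scanning
# the whole sample, so their raw counts are left unscaled.
x$est_abundance <- ifelse(x$coarsepick == 1,
                          x$count,
                          x$count * 100 / x$subsample_perc)

x[, c("smpcode", "taxoncode", "count", "coarsepick", "est_abundance")]
```

This remains the rough estimate only; modelling subsample error explicitly, as recommended above, is preferable for inference.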

If the originally supplied data provided taxon names derived correctly from the database’s taxon tables, these are included in the ‘taxon’ field of the biota table. Originally supplied taxoncodes are recorded in the ‘originalbugcode’ field, and may be corrected in the process of import into the database (see Chapter 6).

taxoncode values are consistent with past conventions. All are eight characters long: the first two or three characters indicate phylum, class or order; the first four characters indicate family (or in some cases sub-family or tribe); the first six characters indicate genus; and all eight characters indicate species. Specimens not identified to species level have trailing 9s or 0s to indicate that they have not been identified beyond a certain level.

shortcode is the taxoncode with trailing 9s or 0s removed. This field links to the shortcode field in taxon_all (see below).
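As a rough illustration of how a shortcode relates to a taxoncode, the sketch below removes trailing "99" or "00" pairs at the species and genus positions. The function name strip_unidentified_pairs() is hypothetical and this is not the database's own conversion function; it deliberately stops at the four-character family level, because resolving codes above family level would need the variable-length codes held in the taxonomy tables.

```r
# Hypothetical, simplified helper: strips "99"/"00" pairs from the species
# (characters 7-8) and then genus (characters 5-6) positions only.
strip_unidentified_pairs <- function(taxoncode) {
  code <- toupper(trimws(taxoncode))
  stopifnot(all(nchar(code) == 8))
  for (n in c(8, 6)) {
    # TRUE where the code is still n characters long and ends in 99 or 00
    drop <- nchar(code) == n & substr(code, n - 1, n) %in% c("99", "00")
    code[drop] <- substr(code[drop], 1, n - 2)
  }
  code
}

codes <- c("QT251102", "QT251199", "QT259999", "QD0999I1")
short <- strip_unidentified_pairs(codes)
data.frame(taxoncode = codes, shortcode_sketch = short)
```

Working in pairs, rather than stripping every trailing 9 or 0, avoids truncating a family code such as QO09 whose own characters happen to end in 9.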

5.2 Taxonomy tables

In preparing the database, I revised the structure of taxonomic data from earlier versions of the database to improve ease of updating taxonomic names and relations as they are revised, and of keeping track of those revisions. The original taxonomy table (called EPA bugcodes in some database versions, and referred to as the source table below) had >8800 rows, including taxoncode variants on partially unidentified specimens, and 15 fields with inconsistent treatment of taxonomic rank above genus-level, and inconsistent means of recording code and taxon-name revisions.

The eight new taxonomic tables (Fig. 1) take advantage of the hierarchical logic of the Victorian 8-character EPA taxoncodes developed by John Dean, which have become widely used across Australia. The database taxonomic tables can be accessed via an Excel file from the data downloads tab.

Taxon names can be derived from codes and vice-versa using R functions, discussed below. The data entry app uses the tables to translate inputted taxon names into taxoncodes, and compiles a ‘biota’ table suitable for input into the database.

The four main taxonomic tables list all Australian stream macroinvertebrate taxa at the levels of:

  • Phylum, Class, Order (table taxon_rank1);

  • Family, sub-family, tribe (table taxon_fam);

  • Genus (table taxon_gen);

  • Species (table taxon_spp);

linked in hierarchical order by an increasingly long shortcode derived from the first portion of the 8-character EPA taxoncode. Species and other taxa that have not been fully assigned to the taxonomic hierarchy are stored in a separate table (morphospp_etc). The shortcodes in these five tables (called t1code, famcode, gencode, spcode and taxoncode, respectively) are compiled into the taxon_all table (where they are called shortcode), which lists all taxa (and the child table that they come from). The taxon_all shortcode field is linked to the shortcode field in the biota table, and also to the biotic_indices table, which lists a range of biotic indices of stream health. Finally, the unlinked taxon_syn table lists synonyms for taxa that have been re-named or revised. The following gives more details on the structure of each table.

  1. Table taxon_rank1 lists all unique taxa above family level (phylum, class, and order). These are variously identified by the first 1, 2 or 3 characters in the taxoncode, abbreviated as the t1code. Table 5.1 (built using the following code) shows several examples of these variations. The full table contains 90 entries.
# assuming the database gpkg is in a subfolder of `wd`, named 'mwbugs'
# (`wd` must already hold the path to the parent folder)
# connect to the database
mwbugs_dir <- file.path(wd, "mwbugs")
sql_drv <- RSQLite::SQLite()
mwbugs_db <- DBI::dbConnect(sql_drv, file.path(mwbugs_dir, "mwbugs_public.gpkg"))

# load all database tables into the session
source(paste0("https://tools.thewerg.unimelb.edu.au/data/mwbugs/",
              "bug_database_functions.R"))
load_all_mwbugs_tables(mwbugs_db)
# disconnect from database
DBI::dbDisconnect(mwbugs_db)

flextable::regulartable(taxon_rank1[taxon_rank1$t1code %in% 
                        c("IA","IB","IF2","K","KG","O","OJ","OJ3","Q","QT"),])
Table 5.1: A selection of the 90 rows in the taxon_rank1 table to illustrate the variations of the t1code.

| t1code | phylum | class | order | lowesttaxon |
|---|---|---|---|---|
| IA | Porifera | | | Porifera |
| IB | Cnidaria | Hydrozoa | | Hydrozoa |
| IF2 | Playhelminthes | Turbellaria | Rhabdocoela | Rhabdocoela |
| K | Mollusca | | | Mollusca |
| KG | Mollusca | Gastropoda | | Gastropoda |
| O | Arthropoda | Crustacea | | Crustacea |
| OJ | Arthropoda | Crustacea | Copepoda | Copepoda |
| OJ3 | Arthropoda | Crustacea | Cyclopoida | Cyclopoida |
| Q | Arthropoda | Insecta | | Insecta |
| QT | Arthropoda | Insecta | Trichoptera | Trichoptera |

  2. taxon_fam lists all families, subfamilies, and tribes (where relevant) specified by the first four characters of the taxoncode, abbreviated as the famcode. taxon_fam is linked to taxon_rank1 by t1code. Revisions to family names are handled by recording the old family name of renamed families in the oldfamily field, and where the revision involved more complex splitting, it is explained in the taxonomic_note column (Table 5.2). For this level and below, Rotifera (t1code = “J”) have been omitted, because the taxon codes in the source table were not as consistent as other groups, and rotifers are rarely identified in macroinvertebrate samples. The table contains 389 rows.
flextable::regulartable(taxon_fam[taxon_fam$famcode %in% 
      c("KG02","KG13","QC03","QDAF","QDAG","QO09","QO16","QT25"),])
Table 5.2: A selection of the 389 rows in the taxon_fam table to illustrate the table structure, and the nature of revision notes.

| t1code | famcode | family | subfamily | tribe | oldfamily | taxonomic_note | lowesttaxon |
|---|---|---|---|---|---|---|---|
| KG | KG02 | Tateidae | | | Hydrobiidae | | Tateidae |
| KG | KG13 | Pomatiopsidae | | | Hydrobiidae | Coxiella moved to this family | Pomatiopsidae |
| QC | QC03 | Sphaeriusidae | | | Microsporidae | Revised to combine Microsporidae and Sphaeriidae | Sphaeriusidae |
| QD | QDAF | Chironomidae | Orthocladiinae | | | | Orthocladiinae |
| QD | QDAG | Chironomidae | Chironominae | Pseudochironomini | | | Pseudochironomini |
| QO | QO16 | Corduliidae | | | | Most genera and species transferred to new Families | Corduliidae |
| QT | QT25 | Leptoceridae | | | | | Leptoceridae |

  3. taxon_gen lists 1,114 genera, specified by the first 6 characters of the taxoncode, abbreviated as gencode. The table is linked to taxon_fam by famcode. As for families, revisions are recorded by noting the old name used for a revised genus in the old_genus field. In its current state, the table includes superseded genus names, as noted in the taxonomic_note field (Table 5.3).
flextable::regulartable(taxon_gen[taxon_gen$gencode %in% c("IB0201","LO0506","QDAF02","QD0602","QE0201","QE0606","QE0803"),])
Table 5.3: A selection of the 1114 rows in the taxon_gen table to illustrate the table structure, and the nature of revision notes.

famcode

gencode

genus

old_genus

taxonomic_note

IB02

IB0201

Cordylophora

LO05

LO0506

Pristinella

Transferred to Pristina

QD06

QD0602

Dixella

Paradixa

QDAF

QDAF02

?Acricotopus

0

QE02

QE0201

Baetis

Genus does not occur in Australia

QE06

QE0606

Tillyardophlebia

Leptophlebiidae Genus D

QE08

QE0803

Irpacaenis

Caenidae Genus C

  4. taxon_spp lists 4,611 species, specified by full 8-character taxoncodes (called spcode in this table). Table taxon_spp is linked to taxon_gen by gencode, and to taxon_rank1 by t1code. This table only contains taxa with matches to entries in the taxon_gen and taxon_rank1 tables. It contains some undescribed morphospecies. Morphospecies without links to taxon_gen are listed in the next table (morphospp_etc). Revisions at the species level have not been fully checked at the time of writing: the field tempNote contains notes on revisions taken directly from the source table (Table 5.4).
flextable::regulartable(taxon_spp[taxon_spp$spcode %in%
      c("KP010401","LH010401","QO161808","QT2507B1","QT0605B1","QT251102"),])
Table 5.4: A selection of the 4611 rows in the taxon_spp table to illustrate the table structure, and the nature of revision notes.

| gencode | spcode | species | tempNote | taxonomic_note |
|---|---|---|---|---|
| KP0104 | KP010401 | Velesunio ambiguus | | |
| LH0104 | LH010401 | Glossiphonia australiensis | | |
| QO1618 | QO161808 | Synthemis ofarrelli | [Now QO230802] | |
| QT0605 | QT0605B1 | Cheumatopsyche sp.AV10 | [= QT060503 Ch.kakaduensis] | |
| QT2507 | QT2507B1 | Oecetis EPA sp.10 | (= EPA sp.1, QT2507A1) | |
| QT2511 | QT251102 | Triplectides australicus | | |

  5. morphospp_etc lists 1008 lowesttaxon entries in the source table of three types (field type):

    1. ‘voucher’ taxa, that were identified as morphospecies by various analysts. These taxa have not been well curated, and likely contain substantial redundancy. Their primary value is for within-study use, where they are unlikely to be ambiguous identifications of other taxa. However, some better curated voucher taxa can be corrected as taxonomic knowledge advances, and this has been done for many of the Tipulidae morphospecies (see taxon_syn below).

    2. ‘partially unidentified taxa’ such as two morphologically similar genera that are difficult to tell apart (e.g. Physa/Physastra). Such taxa can be of use within datasets in which they have been used consistently, but are ambiguous more broadly, and should be used with caution.

    3. ‘DNA barcoded species without confirmed identification’ are species that have been identified by DNA barcoding analysis, but not (yet) assigned to a formal species. DNA barcode data will be a growing element of the database and are discussed further below.

  6. taxon_all is simply a compilation of the shortcode and taxon fields from the preceding taxonomic tables. The name of the shortcode differs among tables to permit the hierarchical linking of the tables: t1code for taxon_rank1, famcode for taxon_fam, gencode for taxon_gen, spcode for taxon_spp, and taxoncode for morphospp_etc. taxon_all has a third field, table, indicating the source table for each taxon.

  7. taxon_syn lists any taxonomic changes to permit tracking of deprecated taxon names and taxoncodes. It is not intended to be a comprehensive list of taxonomic synonyms, but rather to ensure that changes noted in the originalbugcode column of the biota table, or inconsistencies in the taxon column of the biota table (which contains taxon names as they were in the supplied data), can be tracked internally.

While most taxonomic revisions recorded in the taxon_syn table will have no effect on data compilation and inference, two major revisions leading to version 2 of the database could lead to potential ambiguity for mites (t1code MM), and for Tipulidae and Limoniidae (famcodes QD01 and QD02). See Appendix A for more detail.
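The hierarchical linking of the taxonomy tables can be illustrated with base R merges. The data frames below are tiny toy extracts (suffix _toy) that use the key fields named above and a few taxa from Tables 5.1 to 5.4; the genus rows are inferred from the species names for illustration, and the real tables contain more columns.

```r
# Toy extracts of the four main taxonomy tables (illustrative rows only)
rank1_toy <- data.frame(t1code = "QT", order = "Trichoptera")
fam_toy   <- data.frame(t1code = "QT", famcode = "QT25",
                        family = "Leptoceridae")
gen_toy   <- data.frame(famcode = "QT25",
                        gencode = c("QT2507", "QT2511"),
                        genus   = c("Oecetis", "Triplectides"))
spp_toy   <- data.frame(gencode = c("QT2507", "QT2511"),
                        spcode  = c("QT2507B1", "QT251102"),
                        species = c("Oecetis EPA sp.10",
                                    "Triplectides australicus"))

# Walk up the hierarchy: species -> genus -> family -> rank-1 taxon
lineage <- merge(spp_toy, gen_toy, by = "gencode")
lineage <- merge(lineage, fam_toy, by = "famcode")
lineage <- merge(lineage, rank1_toy, by = "t1code")

lineage[, c("spcode", "species", "genus", "family", "order")]
```

Each merge uses the shortcode of the parent table as its key, which is why the key carries a different name in each table.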

5.3 Tables and functions for taxonomic analysis

The SIGNAL scores listed in the samples and sample pairs tables were calculated using the calcSIGNAL() function in the melbstreambiota R package (C. J. Walsh 2019), using grades as listed in the biotic_indices table. That table lists all taxa (and their shortcodes) that have been used for SIGNAL and other biotic indices of stream health which assign sensitivity grades to taxa. The sensitivity grades listed in this table are:

  1. SIGNAL2 (Chessman 2003);

  2. SIGNALWoV2001 and SIGNALWoV2003, variants on the SIGNAL scores used in the State Environment Protection Policy (Waters of Victoria) (SEPP WoV, EPA Victoria 2004);

  3. SIGNALWestPt2003, the SIGNAL score for streams of the Westernport catchment variation to the SEPP WoV (Government of Victoria 2001);

  4. SIGNALYarra2003, the SIGNAL score for streams of the Yarra catchment variation to the SEPP WoV (Victoria 1999);

  5. Rheophily.index, an index of sensitivity to flow stress (Bond, Thomson, and Reich 2012).

Note that the biotic_indices table uses shortcode to specify families and does not include the trailing 9s or 0s that are typically used in specifying (say) families (as in the taxoncode field of the biota table). The function taxoncodeToShortcode() in “bug_database_functions.R” (see footnote 1) can be used to convert taxoncodes with trailing 9s and 0s to shortcodes. This function was used to populate the shortcode column in the biota table.
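To show how shortcodes connect biota records to sensitivity grades, the sketch below computes a simple unweighted average grade per sample from toy data. It is not the calcSIGNAL() function, which may handle taxa and weighting differently. The grade values are placeholders rather than values taken from the biotic_indices table, and truncating shortcodes to four characters to reach the family-level code is an assumption of this sketch.

```r
# Toy biota records (shortcodes at mixed taxonomic levels)
biota_toy <- data.frame(
  smpcode   = c("S1", "S1", "S1", "S2", "S2"),
  shortcode = c("QT2511", "QDAF", "OT01", "QT25", "QDAF")
)
# Toy grades keyed by family-level shortcode; values are placeholders
indices_toy <- data.frame(
  shortcode = c("QT25", "QDAF", "OT01"),
  SIGNAL2   = c(6, 4, 3)
)

# Assumption: the first four characters give the family-level shortcode
biota_toy$famcode <- substr(biota_toy$shortcode, 1, 4)

# One row per family per sample, so each family counts once
fam_present <- unique(biota_toy[, c("smpcode", "famcode")])
graded <- merge(fam_present, indices_toy,
                by.x = "famcode", by.y = "shortcode")

# Unweighted index: mean grade of the families present in each sample
signal_sketch <- aggregate(SIGNAL2 ~ smpcode, data = graded, FUN = mean)
signal_sketch
```

Families with no grade in the index table drop out of the inner merge, so they do not contribute to the average.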

A large proportion, but not all, of the database follows a useful convention that is not widely used elsewhere. Taxoncodes that end in trailing 00s indicate specimens that were identified to the specified supra-specific level (genus, family, etc.) because that was the taxonomic level for that taxon used in the sample or study. Taxoncodes that end in trailing 99s indicate specimens that were damaged or immature and were identified to a level higher than was used for other specimens of that taxon in the sample or study. A further convention, retained for past data, is that coleopteran taxoncodes ending in “9I” indicate larvae, and dipteran taxoncodes ending in “9I” indicate pupae. I propose that all future data follow these conventions (with the last convention being optional).
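These suffix conventions lend themselves to a simple check in R. The function below, classify_suffix(), is a hypothetical helper written for this illustration. It inspects only the last two characters of each taxoncode, and assumes that codes beginning with QC are Coleoptera and those beginning with QD are Diptera, as in the tables of this chapter. The example codes are made up to exercise each branch.

```r
# Hypothetical helper: label taxoncodes by their two-character suffix
classify_suffix <- function(taxoncode) {
  suffix <- substr(taxoncode, 7, 8)
  prefix <- substr(taxoncode, 1, 2)
  out <- rep("no suffix convention applies", length(taxoncode))
  out[suffix == "00"] <- "identified to the level used in the study"
  out[suffix == "99"] <- "damaged or immature specimen"
  out[suffix == "9I" & prefix == "QC"] <- "coleopteran larva"
  out[suffix == "9I" & prefix == "QD"] <- "dipteran pupa"
  out
}

# Made-up example codes, one per case
codes <- c("QT250000", "QT259999", "QC03999I", "QD09999I", "QT251102")
labels <- classify_suffix(codes)
data.frame(taxoncode = codes, interpretation = labels)
```

Because not all of the database follows the convention, output like this is best treated as a prompt for checking rather than a definitive classification.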

The data entry app aids compilation of sample data by taking inputted taxon names, converting them to the correct taxoncodes and compiling them into a table consistent with the biota table.

A vector of taxoncodes can be imported into R and converted to its taxonomic hierarchy using the function codeTaxonomy(). As an example, the taxoncodes of specimens collected in the 2018-2019 Melbourne Water biomonitoring program, as received, can be converted to taxon names, producing output as in Table 5.5. The taxonomy table in “mwbugs_taxonomy_tables.xlsx” (downloadable from the data downloads tab of the database web site) was compiled using this function.

# testBiota loaded from supplied data archive (code not shown)
# Note that fields for taxon code and taxon name need to be "taxoncode" and 
#    "lowesttaxon", consistent with database
testBiota <- unique(testBiota[c("bugcode","lowesttaxon")])
names(testBiota) <- c("taxoncode","lowesttaxon")
bugnames <- codeTaxonomy(testBiota$taxoncode)
#show first 6 taxa
flextable::regulartable(head(bugnames))
Table 5.5: Example output from the codeTaxonomy() function, and code for running it.

taxoncode

phylum

class

order

family

subfamily

tribe

species

MM999999

Arthropoda

Mites

KG069999

Mollusca

Gastropoda

Planorbidae

Ancylinae

QO999998

Arthropoda

Insecta

Odonata

QDAA9999

Arthropoda

Insecta

Diptera

Chironomidae

Aphroteniinae

QT239999

Arthropoda

Insecta

Trichoptera

Atriplectididae

OT019999

Arthropoda

Crustacea

Caridea

Atyidae

Furthermore, the naming conventions in the supplied data can be checked for consistency with the database. In the supplied 2018-2019 MW biomonitoring data there were 21 mismatches between the names in the supplied data and those in the database. The example output in Table 5.6 shows that the mismatches were a combination of taxonomic revisions not being up-to-date in the supplied data (e.g. Sphaeriusidae, Platycnemididae, Cordylophoridae), of non-standard terminology (e.g. Talitridae sp.(Unident.)), and of misspellings (e.g. Neurorthidae, rather than Nevrorthidae), allowing useful checks of data correctness.

mismatches <- checkTaxonNames(testBiota)
#show last 7 taxa
flextable::regulartable(tail(mismatches,7))
Table 5.6: Example code for running the function checkTaxonNames(), and its output, showing mismatches between supplied taxon names and database names.

| taxoncode | phylum | class | order | family | subfamily | tribe | species | lowesttaxon | suppliedNames |
|---|---|---|---|---|---|---|---|---|---|
| QC039999 | Arthropoda | Insecta | Coleoptera | Sphaeriusidae | | | | Sphaeriusidae | Microsporidae (Unident.) |
| OP019999 | Arthropoda | Crustacea | Amphipoda | Talitridae | | | | Talitridae | Talitridae sp.(Unident.) |
| QO049999 | Arthropoda | Insecta | Odonata | Platycnemididae | | | | Platycnemididae | Protoneuridae (Unident.) |
| QN049999 | Arthropoda | Insecta | Neuroptera | Nevrorthidae | | | | Nevrorthidae | Neurorthidae (Unident.) |
| QD0999I1 | Arthropoda | Insecta | Diptera | Ceratopogonidae | | | | Ceratopogonidae | Ceratopogoninae (Unident.) |
| OR259999 | Arthropoda | Crustacea | Isopoda | Oniscidae | | | | Oniscidae | Oniscidae sp.(Unident.) |
| IB029999 | Cnidaria | Hydrozoa | | Cordylophoridae | | | | Cordylophoridae | Clavidae (Unident.) |
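The idea behind such a name check can be sketched with a merge on taxoncode followed by an exact comparison of names. This is a simplified illustration, not the checkTaxonNames() function itself: the two toy data frames reuse a few names from Table 5.6, and the column names lowesttaxon_supplied and lowesttaxon_db are produced by the suffixes chosen here.

```r
# Names as supplied with the data (toy subset)
supplied_toy <- data.frame(
  taxoncode   = c("QC039999", "QN049999", "QT259999"),
  lowesttaxon = c("Microsporidae (Unident.)", "Neurorthidae (Unident.)",
                  "Leptoceridae")
)
# Names as held in the database for the same codes (toy subset)
database_toy <- data.frame(
  taxoncode   = c("QC039999", "QN049999", "QT259999"),
  lowesttaxon = c("Sphaeriusidae", "Nevrorthidae", "Leptoceridae")
)

# Pair supplied and database names by taxoncode
paired <- merge(supplied_toy, database_toy, by = "taxoncode",
                suffixes = c("_supplied", "_db"))

# Keep only the rows where the two names differ
mismatch_sketch <- paired[paired$lowesttaxon_supplied != paired$lowesttaxon_db, ]
mismatch_sketch
```

An exact string comparison flags outdated names, non-standard suffixes such as "(Unident.)" and misspellings alike, so each flagged row still needs to be read to decide which kind of mismatch it is.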

5.4 DNA barcode data

At the time of writing no samples identified by DNA barcoding or metabarcoding have been added to the database, but such samples will become an increasingly common element of the database, and the taxonomic tables are being populated with DNA barcoding information.

Many species in the database have DNA barcodes, and the bold_id_numbers table matches spcode (in the taxon_spp table) or taxoncode (in the taxon_all and morphospp_etc tables) to BOLD identifier codes (https://boldsystems.org/).

The large variety of organisms that can be identified in metabarcoding samples requires explicit criteria for which taxa should be excluded from a stream macroinvertebrate database. At the data import stage only species from the phyla Annelida, Arthropoda, Mollusca, Cnidaria, Platyhelminthes, Nemertea, and Nematoda are included. This excludes non-animals, tardigrades, rotifers and non-invertebrates. The terrestrial or largely terrestrial invertebrate classes Collembola and Diplopoda, and the families Cicadellidae, Delphacidae and Thripidae, are also excluded. Microcrustacea, which are unlikely to be reliably caught in the 250 micron nets used for macroinvertebrate sampling, are also excluded: class Ostracoda, and families Chydoridae and Daphniidae. After applying those filters, additional species-level exclusions are recorded (with explanation) in the species_excluded table.
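The import filters described above could be expressed along the following lines. This is a sketch under assumptions: the data frame detections_toy and its columns (label, phylum, class, family) are invented for illustration, and the actual import code may be organised differently. Only the lists of included phyla and excluded classes and families are taken from the text.

```r
# Inclusion and exclusion lists, as described in the text
keep_phyla    <- c("Annelida", "Arthropoda", "Mollusca", "Cnidaria",
                   "Platyhelminthes", "Nemertea", "Nematoda")
drop_classes  <- c("Collembola", "Diplopoda", "Ostracoda")
drop_families <- c("Cicadellidae", "Delphacidae", "Thripidae",
                   "Chydoridae", "Daphniidae")

# Invented metabarcoding detections (one row per detected taxon)
detections_toy <- data.frame(
  label  = c("det1", "det2", "det3", "det4", "det5", "det6"),
  phylum = c("Arthropoda", "Arthropoda", "Rotifera", "Arthropoda",
             "Mollusca", "Tardigrada"),
  class  = c("Insecta", "Collembola", NA, "Branchiopoda",
             "Gastropoda", NA),
  family = c("Leptoceridae", NA, NA, "Daphniidae", "Tateidae", NA)
)

# Keep a detection only if its phylum is included and neither its class
# nor its family is on an exclusion list (NA never matches a list entry)
keep <- detections_toy$phylum %in% keep_phyla &
  !(detections_toy$class %in% drop_classes) &
  !(detections_toy$family %in% drop_families)

detections_toy[keep, ]
```

Species-level exclusions, such as those recorded in the species_excluded table, would be applied as a further step after this filter.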


  1. See the data downloads tab of the database web site.