Version 2.1 of the database and this manual have arisen from the generation of a new, large set of DNA-derived species-level data between 2018 and 2024. The collection of such data began in 2018 with a study comparing the processing of standard RBA samples from 46 sites across Melbourne, sorted in the laboratory, using standard morphological identification approaches and two DNA methods: barcoding of individual specimens and metabarcoding of all sorted specimens in a sample combined (Carew et al. 2019). This led to a larger study, which used metabarcoding analysis to identify macroinvertebrate species in pairs of combined, unsorted RBA samples collected from 225 sites over 3 years (papers in preparation). Combined, these studies identified 931 aquatic macroinvertebrate species across the region.
In this chapter, we describe the metabarcoding analysis of samples, and the incorporation of such data into the database. But we begin by describing the structure of taxonomic tables in the database that we compiled from a library linking barcodes to taxonomic identities. Such a library is a pre-requisite for metabarcoding analysis.
6.1 Table taxon_seq : a reference library of barcodes, sequences and taxonomic identities
In the context of this database, DNA barcoding refers to the DNA sequencing of a select region of the mitochondrial gene, cytochrome c oxidase 1 (COI), which permits identification of almost all animal species (Hebert et al. 2003). DNA metabarcoding refers to the DNA sequencing of bulk (or mixed DNA) samples (Yu et al. 2012). Our metabarcoding workflow uses two amplicons with sequence de-noising performed using DADA2 (Callahan et al. 2016), producing amplicon sequence variants (ASVs), each with a unique asv_code. DADA2 ensures that each ASV has a single asv_code independent of the input dataset.
We used the public reference databases GenBank (Benson et al. 2012) and BOLD Systems version 5 (Ratnasingham et al. 2024), and additional unpublished DNA barcodes for freshwater macroinvertebrates from greater Melbourne (Carew, unpublished data) to determine the taxonomic identity (mostly species) of ASVs from metabarcoding of stream macroinvertebrate samples collected across the Melbourne region.
All ASVs derived from metabarcoding 1, and their taxonomic identities, are stored in table taxon_seq2. The ASVs include the target taxa (stream macroinvertebrates), but also other taxa caught incidentally in the nets, and unassigned sequences. While the focus of the database is stream macroinvertebrates, we include the full range of ASVs in the database to permit filtering of sequences and the updating of taxonomic identities as speciments are DNA-barcoded, or taxonomy is determined or revised. Table taxon_seq has three central fields:
The table has 11 fields, the three most central being:
asv-code, a unique 32-character code for each ASV;
lowest_taxon The taxonomic identity of the sequence: 66% of sequences have been identified to species.
asv-seq, the DNA sequence of each ASV (a unique string of 140–430 characters, each representing a DNA nucleotide (A, C, G, T);
The remaining 8 fields (taxon_level, max_p_identity, max_p_update, amplicon, bold_match, source_no, animal, and aq-macro: all described below) provide criteria for searching and subsetting the table. But first we describe several tables added to the database to aid exploration of the taxon_seq taxonomic identities, and the methods we used to assign those taxonomic identities.
6.1.1 Nested taxonomical tables t_kingdom_phylum, t_phylum_class, t_class_order, t_order_fam, and t_fam_gen
The five nested taxonomic tables, t_kingdom_phylum, t_phylum_class, t_class_order, t_order_fam, and t_fam_gen, each with two columns as suggested by their names, provide a hierarchical set of taxonomic identities to permit building of taxonomic tables for every record in taxon_seq. A sixth, unnested table t_fam_subfam records sub-family names, used in species names in some cases. These tables differ from the nested taxonomic tables for assigning taxoncodes to aquatic macroinvertebrates (Section 5.2), as they include non-aquatic-macroinvertebrate taxa, and lack taxoncodes. However, all genera and family names in the aquatic invertebrate tables taxon_fam and taxon_gen are also in t_fam_gen, permitting links between the tables.
The function build_taxonomic_table()3 automates the process of building taxonomic tables. We demonstrate the function by selecting four example records from taxon_seq (Code and tables 6.1). In doing so, we also introduce five other fields in the taxon_seq table.
taxon_level, one of “species”,“genus”,“family”,“order”,“class”,“phylum”, and “kingdom”, indicates the taxonomic level to which lowest_taxon is identified. We used this field to select three example records identified to species and one identified to family.
aq_macro, an integer code indicating if the taxon to which the sequence belongs is a freshwater macroinvertebrate (1); 0 if not. Derived using the `only_aqu_macro_invert() function and the non_aqu_macro_invert table (see below). Taxa identified to higher taxonomic levels in groups for which some members are freshwater (e.g. Chrysomelidae) have aq_macro = 0.5. We used this to select the two aquatic macroinvertebrate examples (Gripopterygidae and Austrosimulium furiosum), and the non-aquatic animal (the terrestrial oligochaete, Aporrectodea caliginosa; aq_macro == 0 & animal = 1).
animal, an integer code indicates if the taxon to which the sequence belongs is an animal (1); 0 if not = 0, or -1 if identification is uncertain at phylum level or above. We used this field to select the first example—a non-animal (the red alga, Audouinella hermannii)—and the example of a non-aquatic animal (aq_macro == 0 & animal = 1).
Finally, in presenting the taxonomic table, we added two fields from taxon_seq commonly used in the metabarcoding analysis (which would also require addition of the asv_code and asv_seq fields, which we omit for this example):
max_p_identity, the maximum percentage similarity of each sequence to records in all of the libraries used for identification (minimum 97 for species-level identifications);
amplicon, one of “long”, “short”, “left” or “right”; descriptors based on the size or position of the amplicon within the DNA barcode region4.
*Code and tables 6.1. Code for selecting an example set of four taxon_seq records and for building a taxonomic table for them.
############ Note: the following code assumes you have created db_m, a connection to the# mwbugs database as described in section 2.2, to permit loading the following taxon_all <- DBI::dbReadTable(db_m, "taxon_all") taxon_seq <- DBI::dbReadTable(db_m, "taxon_seq") asv_bin <- DBI::dbReadTable(db_m, "asv_bin") non_aqu_macro_invert <- DBI::dbReadTable(db_m, "non_aqu_macro_invert") miseq_samples <- DBI::dbReadTable(db_m, "miseq_samples") miseq_asv <- DBI::dbReadTable(db_m, "miseq_asv") biota <- DBI::dbGetQuery(db_m, "SELECT * FROM biota;") samples <- DBI::dbReadTable(db_m, "samples")# and source the metabarcoding functions atsource("https://tools.thewerg.unimelb.edu.au/mwbugs/data/metabarcoding_functions.R" )############ Examples from taxon_seqeg_taxon_seq <-rbind(# a non-animal, species-level record # (i.e. taxon_seq$animal == 0 & taxon_seq$taxon_level == "species") taxon_seq[taxon_seq$lowest_taxon =="Audouinella hermannii",][1,],# a non-aquatic animal species-level record (i.e. taxon_seq$aq_macro == 0 & # taxon_seq$animal == 1 & taxon_seq$taxon_level == "species") taxon_seq[taxon_seq$lowest_taxon =="Aporrectodea caliginosa",][1,],# An aquatic macroinvertebrate identified only to family taxon_seq[taxon_seq$aq_macro ==1& taxon_seq$taxon_level =="family",][326,],# An aquatic macroinvertebrate identified to species taxon_seq[taxon_seq$aq_macro ==1& taxon_seq$taxon_level =="species",][4000,] )eg_taxon_seq$asv_seq <-paste0(substr(eg_taxon_seq$asv_seq,1,6), "... ", nchar(eg_taxon_seq$asv_seq), " characters")flextable::flextable(eg_taxon_seq[c("asv_code","lowest_taxon","asv_seq")])
asv_code
lowest_taxon
asv_seq
8ad32b8d3aa74bb3ca1dc9f2919de4ba
Audouinella hermannii
TTTAAG... 313 characters
20528b32484e8f6dbf288c84f748a328
Aporrectodea caliginosa
TCTAGC... 313 characters
036f19da8274ac7ebd219f54fa480730
Chironomidae
CCTTTC... 313 characters
30af995b0d7eaa478289cfc7f38c84b2
Dinotoperla hirsuta
TTTATC... 313 characters
# use build_taxonomic_table function (mwbugs_connection not necessary if all # taxonomic tables (t_fam_gen etc) and taxon_all are already in the environment)eg_taxonomy <-build_taxonomic_table(eg_taxon_seq$lowest_taxon,mwbugs_connection = db_m,taxon_tables_loaded =FALSE)# remove the "unmatched" column, which flags any unmatched lowest_taxon values# (None in this case), and add max_p_identity and amplicon fieldstab <-cbind(eg_taxonomy[names(eg_taxonomy) !="unmatched"], eg_taxon_seq[c("max_p_identity","amplicon")])flextable::flextable(tab)
kingdom
phylum
class
order
family
genus
species
taxoncode
max_p_identity
amplicon
Eukaryota
Rhodophyta
Florideophyceae
Acrochaetiales
Acrochaetiaceae
Audouinella
Audouinella hermannii
99.00
right
Eukaryota
Annelida
Clitellata
Crassiclitellata
Lumbricidae
Aporrectodea
Aporrectodea caliginosa
100.00
right
Eukaryota
Arthropoda
Insecta
Diptera
Chironomidae
QDAZ
94.17
right
Eukaryota
Arthropoda
Insecta
Plecoptera
Gripopterygidae
Dinotoperla
Dinotoperla hirsuta
QP030210
99.36
right
6.1.2 Assigning taxonomic identities and table asv_bin
We assigned a species name to an ASV if it had ≥97% similarity to a reference barcode, with a few minor exceptions to avoid unnecessary splitting or lumping that would be caused by a match near 97%. Matches with 95–97% similarity were identified to genus, 92–95% to family, and 85–92% to order. Matches were first assigned from GenBank and M Carew’s private library, and then cross-checked with the BOLD database.
Two additional taxon_seq fields can be used to find records with differing levels of match to the BOLD database
bold_match, indicates the degree of match to the BOLD database: 2 if the record has a species level match to the BOLD database (22,914 records); 1 if matched at a higher taxonomic level (5,958) 0 if not matched (8,087);
max_p_update, a binary variable: 1 if max_p_identity has been changed from the original vsearch value (usually using the maximum similarity with a BOLD database record), 0 if not.
The BOLD database groups barcodes into BIN (bin_uri) codes (Ratnasingham and Hebert 2013), most of which correspond to a single species. Many sequences have ≥97% similarity with more than one bin_uri. At its simplest, this means that some species in our dataset have more than one bin_uri, but in some cases bin_uris corresponding to a formal species name have ≥97% similarity with bin-uris corresponding to other names, resulting in the need to group some species. We used the following logic to naming species and species groups5.
The BOLD database groups barcodes into BIN (bin_uri) codes (Ratnasingham and Hebert 2013), most of which correspond to a single species. Many sequences have ≥97% similarity with more than one bin_uri. At its simplest, this means that some species in our dataset have more than one bin_uri, but in some cases bin_uris corresponding to a formal species name have ≥97% similarity with bin-uris corresponding to other names, resulting in the need to group some species. We used the following logic to naming species and species groups4.
Species with a formal name, which match one or more bin_uris (none of which are ≥97% similar to bin-uris that match other formally named species) are given the formal species name as it is. (e.g. Austroaeschna obscura, which matches bin_uri ADC44036;
Species that match a single bin_uri, but have not yet been linked to a formal species name, are given a name using their bin_uri (e.g. Hydrozetes sp. B-AEI5130, which matches bin_uri AEI5130);
Species that match two or more bin_uris, none of which have a formal identification, are given a species names using one bin_uri and the term “group” (e.g. Apsectrotanypus sp. B-AEM8177 group, which matches bin_uris AEM8177 and AEM8936);
Species that match more than one bin_uri that have two formal species names are given both species names separated by “/” (e.g. Hydra oligactis/robusta, which matches bin_uris AAI8740 and AAX2383);
Species which match more than two bin_uris that have more than two formal species names are given the most common formal species name followed by group (e.g. Austropyrgus centralia group, which matches bin-uris AAW3113, AAW3094, AAW3122, ADK5456,and AAW3117, with names A. angasi, A. turbatus, A. centralia, A. niger, A. sparsus, and A. simsonianus, and with similarity between some sequences as low as 94%).
Because DNA sequences can be matched to more than one bin_uri, the table asv_bin separately stores the bin_uris for each of the 22,264 matching sequences identified to species level (except bacteria and fungal species). The table has four fields:
asv_code, linking to the asv_code field in the taxon_sequence table;
species, Species name (equals lowest_taxon in taxon_seq table);
bin_uri, the BOLD codes matched to asv_codes;
similarity, the maximum percentage similarity of the sequence identified by asv_code to a BOLD record.
As an example, here are the first 4 records in asv_bin
flextable::flextable(head(asv_bin,4))
asv_code
species
group_no
bin_uri
similarity
772aa07cdc8d686d12455c601651f4ea
Aacanthocnema dobsoni
1
ADS1933
100.00
bb17cabc92f6f8b7798a7de1193dfc7e
Aacanthocnema dobsoni
1
ADS1933
99.36
542978486d7676cd9a4ec0b3ef1adc2c
Ablabesmyia sp. B-AAP5135 group
2
AAP5135
98.14
641827978e11fb1be0854880946edfb3
Ablabesmyia sp. B-AAP5135 group
2
AAP5135
99.51
A shorter table (3,336 records) of species names and their matching bin_uri can be compiled as follows:
14,045 records in taxon_seq matched records in the BOLD database with <97% similarity (bold_match <2). 1,530 of these had >97% similarity to voucher specimens of identified species in M. Carew’s private library, and were given those species names (bold_match < 2 & taxon_level = “species”). 4,858 had sufficient similarity to a BOLD record to identify them to genus or a higher taxonomic level (bold_match = 1 & taxon_level != “species”), and 8,087 had no match to the BOLD database (bold_match = 0). In most cases this was because we had only searched for matches to these records with genus-level or species-level searches. It is likely that higher level matches (similarity <90%) would be found for these records with an exhaustive search of the BOLD database. This last group retained the higher-level taxonomic identity assigned in the initial reference database search.
The metabarcoding sample data (in the biota table) only uses records identified to species. However, we have retained the full set of detected sequences in taxon_seq, including unmatched records and those matched only to genus or above, to permit updates of taxonomic identities, to permit updates of taxonomic identities when new sequence-taxon matches are determined, or when taxonomy is revised.
The bulk-processing method for metabarcoding samples (pcode P, processing_methods table) homogenizes entire samples without thorough sorting. The homogenized samples thus typically contain terrestrial and non-invertebrate biological material. As a result, the raw metabarcoding data for import into the database includes records of many non-macroinverterbrate and non-aquatic taxa (and occasionally marine taxa).
The aq_macro field in taxon_seq distinguishes aquatic macroinvertebrates (aq_macro = 1) from other taxa. This classification was determined using the function only_aqu_inverts()7, which classifies taxonomy tables. The function reads the table non_aqu_macro_invert which lists those taxa considered non-aquatic or non-macroinvertebrate.
The function only_aqu_inverts() requires an input table with fields phylum, class, order, family, genus and species, as built by the function build_taxonomy_table() (Code and tables 6.1, above). It reads the input table hierarchically, splitting it into “include” (aquatic macroinvertebrate) and “exclude” (non-aquatic or non-macroinvertebrate) tables, by matching taxa listed in the non_aqu_macro_invert table for exclusion and inclusion.
Broadly the logic of the table and the function is as follows. Only phyla Annelida, Arthropoda, Mollusca, Cnidaria, Platyhelminthes, Nemertea, and Nematoda are included. This excludes non-animals, tardigrades, Rotifera and non-invertebrates. Terrestrial or largely-terrestrial invertebrate classes Collembola and Diplopoda, and families such as Cicadellidae, Delphacidae and Thripidae are also excluded. Macroinvertebrate families and genera with non-aqautic members in the minority are dealt with by excluding relevant genera or species. Mostly aquatic families and genera are dealt with by conditionally excluding the family/genus with relevant exceptions listed for inclusion.
Microcrustacea, which are unlikely to be reliably caught in the 250-\(\mu\)m nets used for macroinvertebrate sampling, are also excluded: class Ostracoda, and and families Chydoridae and Daphniidae.
Note that the table (see Table 6.1 for an indicative selection of its content) was not constructed systematically from a taxonomic base, but iteratively and reactively to data received for import into the database. Additions and refinements to the table are likely with further use.
Table 6.1: An indicative selection of rows from the non_aqu_macro_invert table
source_no, an integer code indicating the source for this sequence in the database, which links to the asv_sources table.
Table asv_sources records the original data sources from which each sequence in the taxon_seq table was compiled (field source). The field script records scripts that were used for compiling each set of data, and repo records the data repository where the scripts are kept.
6.1.5 Assigning taxoncodes to new species
The process of compiling taxon_seq identified or created 501 species that were not already in the database taxonomic tables taxon_spp or morphospp_etc. We assigned a shortcode to each species using the conventions of the shortcode field of the taxon_all table, as described in (Section 5.2). This was achieved in a automated process using code in section “Revision of taxonomic tables” in 01_add_asv_data_to_mwbugs_miseq13_22_mw46.qmd8.
6.2 Source metabarcoding data: tables miseq_samples and miseq_asv
The extraction and identification of barcode sequences from the macroinvertebrate sample (smpcode) generates intermediate data that is stored separately in the database. The DNA in each sample is extracted to an extraction (given an extraction_name in the laboratory). Technical replicates (seq_code) are taken from the extraction and submitted to GenBank for analysis. For most of the samples in the database, three replicates from each sample were taken, but in earlier studies two were taken.
Each smpcode, detailed in the samples table, is therefore the product of (usually) two or three replicate seq_codes. Each seq_code has a record in the miseq_samples table, which links to the samples table by the smpcode field.
The ASVs detected in each replicate are stored in the miseq_asv table, linked to miseq_samples by the seq_code field. The taxonomic identity of each asv_code in miseq_asv is linked to taxon_seq by the asv_code field.
Table miseq_asv contains all sequences detected in each seq_code (with consensus >=0.7 and constituting >=0.0001% of the reads in a replicate). We aggregated and filtered the seq-code sequence data to produce the set of species present in each smpcode that was added to the biota table. We aggregated the asv_codes to taxonomic identity and filtered them to include only aquatic macroinvertebrate taxa identified to species level. We filtered only a small number of likely false positive records, but otherwise, the species recorded for each smpcode are not filtered based on the number of reads.
6.2.1 Table miseq_samples
Table miseq_samples details each replicate miSeq sample (seq_code) sent to GenBank. Each processed smpcode was given a unique extraction_name, and the seq_code for each replicate is the extraction_name + “rep” + replicate number (replicate).
smpcode_exp and subsamp_perc are fields added for the ten samples taken for the bulk_processing experiment [ref]. 4 subsamples of each smpcode were processed separately, and each subsample of each sample has a smpcode_exp (= smpcode followed by the percentage subsample), and a percentage size subsamp_perc.
The remaining fields record technical details about the replicate: [MEL: can you add a short description of what each of these mean, or alternatively let me know if any of them do not need to be put in the database?]
samp_no, the order of samples in the 96-well plate;
plate_position, sample position in in the plate (rows A-H, column 1-12);
i7_index, Illumina Nextera index (8 bp) used for sample labelling with the Illumina p7 primer;
i5_index, Illumina Nextera index (8 bp) used for sample labelling with the Illumina p5 primer;
amplicons, the amplified regiosn of the DNA used for metabarcoding (see footnote 4).
The final field, miseq_run, permits grouping of seq_codes that were part of the same process, important for checking potential contamination among samples across a run (see Section 6.2.3).
6.2.2 Table miseq_asv
Table miseq_asv records all asv_codes recorded in each miseq_code, with the number of reads. This table contains the fields:
miseq_code, linking to table miseq_samples;
asv_code, linking to tables asv_bugcode and taxon_seq;
reads, the number of sequences produced for each ASV;
likely_true_positive, a logical variable—TRUE for all but 28 records that are likely false positives (see next section).
This table is the source data for the smpcode-species data added to the biota table. The following code replicates the process used to aggregate and filter these data for appending to the biota table.
# Filter miseq_asv to include only aquatic macroinvertebrate species that are# likely true positivesadd_asvs <- miseq_asv[miseq_asv$asv_code %in% taxon_seq$asv_code[taxon_seq$taxon_level =="species"& taxon_seq$aq_macro ==1] & miseq_asv$likely_true_positive,]# Add species field from taxon_seq and smpcode field from miseq_samples add_asvs <- dplyr::mutate(add_asvs, species = taxon_seq$lowest_taxon[match(add_asvs$asv_code, taxon_seq$asv_code)],smpcode = miseq_samples$smpcode[match(add_asvs$seq_code, miseq_samples$seq_code)],.after = asv_code)# Add taxoncode field from taxon_alladd_asvs <- dplyr::mutate(add_asvs,taxoncode = taxon_all$shortcode[match(add_asvs$species, taxon_all$taxon)],.after = asv_code)# Aggregate species occurrence by smpcode, recording criteria that may be used # for filtering (maximum reads, no. replicates detected in)add_biota <-aggregate(add_asvs$reads, by =list(smpcode = add_asvs$smpcode,taxoncode = add_asvs$taxoncode, taxon = add_asvs$species), FUN = max)add_biota$notes <-paste0("metabarcoding: maximum no. reads = ", add_biota$x)for(i in1:nrow(add_biota)){ miseq_asv_i <-unique(add_asvs[add_asvs$species == add_biota$taxon[i] & add_asvs$smpcode == add_biota$smpcode[i],c("species","seq_code")]) add_biota$notes[i] <-paste0(add_biota$notes[i], "; detected in ", nrow(miseq_asv_i), " out of ", sum(miseq_samples$smpcode == add_biota$smpcode[i]), " replicates")}# Order and arrange fields for appending to biota tableadd_biota <- add_biota[order(add_biota$smpcode),]add_biota$count <-1# record only as presenceadd_biota$coarsepick <- add_biota$originalbugcode <-NAadd_biota$shortcode <- add_biota$taxoncodeadd_biota <- add_biota[match(c("smpcode","taxoncode","count","taxon","coarsepick","notes","originalbugcode","shortcode"),names(add_biota))]
6.2.3 Filtering miseq_asv data before aggregating to biota table.
As coded above, the biota table includes only aquatic macroinvertebrate taxa (taxon_seq field aq_macro = 1) identified to species (taxon_seq field taxon_level = “species”).
We also excluded likely false positives (miseq_asv field likely_true_positive = FALSE). 28 records in miseq_asv were identified as likely false positives resulting from cross-contamination of sequences with many (>16,000) reads in sample in the same run. The method used to identify these likely false positives is described in supplementary material A of Wilkins et al. (in preparation).
The frequency of false positives arising from cross-contamination from sequences with few reads was very small (~0.01% of all records). Exclusion of records with small numbers of reads to remove such false positives removes many more true positives. We thus judged the small number potential remaining false positives to be an acceptably small source of error in the dataset.
Every metabarcoding record added to the biota table has a comment in the notes field that allows parsing of the maximum number of reads for each record and how many replicates it was detected in. All such notes are in this form:
metabarcoding: maximum no. reads = 16; detected in 2 out of 3 replicates
107 records from ten smpcodes have an additional phrase
; EXCLUDE this record from 3-rep 30% sample
The purpose and use of this phrase is illustrated in the example code for extracting project 3 data below (Section 6.3.3).
6.3 Metabarcoding addenda to other database tables
6.3.1 Tables samples and biota
The spring 2018 project added 92 DNA-based samples to the samples table (smpcodes ending in L—individually barcoded specimens—, or M—picked samples homogenized). These samples added 3,421 species-level records to the biota table.
The combined pairs of RBA samples collected for bulk-processed metabarcoding added 275 samples to the samples table (smpcodes ending in P). These samples added 20,691 species-level records to the biota table.
6.3.2 Tables collection_methods and processing_methods
The standard sampling approach for the metabarcoding survey was to collect two separate RBA samples and combine them. As a result, we have added two methods to the collection_methods table (Table 4.1):
And we added three new processing methods to the processing_methods table (Table 4.2)
pcode L: picked samples combined and homegenized and DNA-metabarcoded (used only for the 2018 ‘46-site’ study);
pcode M: specimens from picked samples non-destructively DNA-metabarcoded allowing subsequent morphological identification (used only for the 2018 ‘46-site’ study);
pcode P: two RBA samples combined and homogenized and DNA-metabarcoded (the standard method for subsequent metabarcoding samples in the database).
6.3.3 Tables projects and sample_project_groups
With the addition of DNA-based samples, we initiated these two tables to allow easier extraction of data from the database by project. Table projects describes each project (projects 1,2,3 and 5 encompass the metabarcoding data, as described below), and table sample_project_groups contains all smpcodes that form part of the project.
The following code demonstrates the use of these tables to extract data for each of the projects:
#### # Project 1: Trial comparison of morphological family-level ID and DNA barcoding # and metabarcoding on spring 2018 RBA samples (nick-name, the 46-site study)samples_1 <- DBI::dbGetQuery(db_m, "SELECT * FROM samples JOIN sample_project_groups ON samples.smpcode = sample_project_groups.smpcode WHERE sample_project_groups.project_code = 1;")# returns 184 samples from 46 sites biota_1 <- DBI::dbGetQuery(db_m, "SELECT * FROM biota JOIN sample_project_groups ON biota.smpcode = sample_project_groups.smpcode WHERE sample_project_groups.project_code = 1;")# returns 5466 recordssites_1 <- sf::st_read(db_m, query =paste0("SELECT * FROM sites WHERE sitecode IN ('",paste(samples_1$sitecode, collapse ="','"),"');"))# returns 46 sites (a spatial table, so sf:st_read() was used)#### # Project 2: Experiment to optimise the processing strategy for combined pairs # of RBA samples (nick-name, the bulk-processing experiment)samples_2 <- DBI::dbGetQuery(db_m, "SELECT * FROM samples JOIN sample_project_groups ON samples.smpcode = sample_project_groups.smpcode WHERE sample_project_groups.project_code = 2;")# returns 10 samples from 10 sites, which were each split into 4 sub-samples# So, need to use the smpcode_exp and the miseq-replicate-level datamiseq_samples_2 <- DBI::dbGetQuery(db_m, "SELECT * FROM miseq_samples JOIN sample_project_groups ON miseq_samples.smpcode = sample_project_groups.smpcode WHERE sample_project_groups.project_code = 2;")# returns 86 replicates (seq_codes) of 40 subsamples (smpcode_exp):# 2-5 replicates per subsamplemiseq_asv_2 <- DBI::dbGetQuery(db_m,"SELECT miseq_asv.*, taxon_seq.lowest_taxon FROM miseq_asv JOIN taxon_seq ON miseq_asv.asv_code = taxon_seq.asv_code WHERE taxon_seq.aq_macro = 1 AND taxon_seq.taxon_level = 'species' AND likely_true_positive;")# returns 128,961 records (excludes 107 false positives). sites_2 <- sf::st_read(db_m, query =paste0("SELECT * FROM sites WHERE sitecode IN ('",paste(samples_2$sitecode, collapse ="','"),"');"))# returns 10 sites#### # Project 3: Metabarcoded combined-RBA-sample-pair survey of 225 sites across# Melbourne 2021-2024 (Nickname MW metabarcoding survey)samples_3 <- DBI::dbGetQuery(db_m, "SELECT * FROM samples JOIN sample_project_groups ON samples.smpcode = sample_project_groups.smpcode WHERE sample_project_groups.project_code = 3;")# returns 184 samples from 46 sites biota_3 <- DBI::dbGetQuery(db_m, "SELECT * FROM biota JOIN sample_project_groups ON biota.smpcode = sample_project_groups.smpcode WHERE sample_project_groups.project_code = 3;")# returns 20691 records # NOTE project 3 includes the samples collected for project 2, which were fully# processed, with 8-11 replicate MiSeq samples. All species detected in those# samples are included in the biota table. The samples can be adjusted to # approximate the standard 30% subsampled, 3-replicate bulk-processing method# by excluding records by excluding records with the phrase: # 'EXCLUDE this record from 3-rep 30% sample' in the notes column biota_3 <- biota_3[!grepl("EXCLUDE this record from 3-rep", biota_3$notes),]# removes 107 records that were found in other replicates/subsamplessites_3 <- sf::st_read(db_m, query =paste0("SELECT * FROM sites WHERE sitecode IN ('",paste(samples_3$sitecode, collapse ="','"),"');"))# returns 225 sites (a spatial table, so sf:st_read() was used)#### # Project 5: Comparison of morphological family-level ID samples and # bulk-processed metabarcoding from 130 sites sampled in Spring 2021# (Nickname 'Paired morphological/metabarcoding sample set)samples_5 <- DBI::dbGetQuery(db_m, "SELECT * FROM samples JOIN sample_project_groups ON samples.smpcode = sample_project_groups.smpcode WHERE sample_project_groups.project_code = 5;")# returns 390 samples from 130 sites (2 pairs of morphologically ID-ed samples# and a combined pair of samples bulk-processed as a single metabarcoding sample)biota_5 <- DBI::dbGetQuery(db_m, "SELECT * FROM biota JOIN sample_project_groups ON biota.smpcode = sample_project_groups.smpcode WHERE sample_project_groups.project_code = 5;")# returns 17235 recordssites_5 <- sf::st_read(db_m, query =paste0("SELECT * FROM sites WHERE sitecode IN ('",paste(samples_5$sitecode, collapse ="','"),"');"))# returns 132 sites (a spatial table, so sf:st_read() was used)# The additional two samples are because of morphological and metabarcoding # samples having been collected in slightly different locations. These reaches# can be treated as the same for comparison of sample types (group by site_no)sites_5$site_no[!sites_5$sitecode %in%c("STV-3149-0","PYR-6295-0")] <-1:sum(!sites_5$sitecode %in%c("STV-3149-0","PYR-6295-0"))sites_5$site_no[sites_5$sitecode =="STV-3149-0"] <- sites_5$site_no[sites_5$sitecode =="STV-2279-5"]sites_5$site_no[sites_5$sitecode =="PYR-6295-0"] <- sites_5$site_no[sites_5$sitecode =="PYR-6295-2"]# 130 site_nos# Add site_no to samples_5 for groupingsamples_5$site_no <- sites_5$site_no[match(samples_5$sitecode,sites_5$sitecode)]# Note there are no sample_pairs statistics for the metabarcoding samples.
Benson, D. A., M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers. 2012. “GenBank.”Nucleic Acids Research 41 (D1): D36–42. https://doi.org/10.1093/nar/gks1195.
Callahan, B.J., P. J. McMurdie, M. J. Rosen, A. W. Han, A. J. A. Johnson, and S. P. Holmes. 2016. “DADA2: High-Resolution Sample Inference from Illumina Amplicon Data.”Nature Methods 13 (7): 581–83. https://doi.org/10.1038/nmeth.3869.
Carew, M, A Hoffman, C Kellar, C J Walsh, K Stevenson, and R Coleman. 2019. “Integrating DNA Metabarcoding into Melbourne Water’s Macroinvertebrate Monitoring Program.” Report to {{Melbourne Water}}. School of Biosciences, The University of Melbourne.
Elbrecht, V., and F. Leese. 2017. “Validation and Development of COI Metabarcoding Primers for Freshwater Macroinvertebrate Bioassessment.”Frontiers in Environmental Science 5. https://doi.org/10.3389/fenvs.2017.00011.
Hajibabaei, M., J. L. Spall, S. Shokralla, and S. Van Konynenburg. 2012. “Assessing Biodiversity of a Freshwater Benthic Macroinvertebrate Community Through Non-Destructive Environmental Barcoding of DNA from Preservative Ethanol.”BMC Ecology 12 (1): 28. https://doi.org/10.1186/1472-6785-12-28.
Hebert, P. D. N., A. Cywinska, S. L. Ball, and J. R. DeWaard. 2003. “Biological Identifications Through DNA Barcodes.”Proceedings of the Royal Society of London. Series B: Biological Sciences 270 (1512): 313–21.
Hebert, P. D. N., E. H. Penton, J. M. Burns, D. H. Janzen, and W. Hallwachs. 2004. “Ten Species in One: DNA Barcoding Reveals Cryptic Species in the Neotropical Skipper Butterfly Astraptes Fulgerator.”Proceedings of the National Academy of Sciences 101 (41): 14812–17. https://doi.org/10.1073/pnas.0406166101.
Leray, M., J. Y. Yang, C. P. Meyer, S. C. Mills, N. Agudelo, V. Ranwez, J. T. Boehm, and R. J. Machida. 2013. “A New Versatile Primer Set Targeting a Short Fragment of the Mitochondrial COI Region for Metabarcoding Metazoan Diversity: Application for Characterizing Coral Reef Fish Gut Contents.”Frontiers in Zoology 10 (1): 34. https://doi.org/10.1186/1742-9994-10-34.
Ratnasingham, S., and P. D. N. Hebert. 2013. “A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System.” Edited by Diego Fontaneto. PLoS ONE 8 (7): e66213. https://doi.org/10.1371/journal.pone.0066213.
Ratnasingham, S., C. Wei, D. Chan, J. Agda, J. Agda, L. Ballesteros-Mejia, H. A. Boutou, et al. 2024. “BOLD V4: A Centralized Bioinformatics Platform for DNA-Based Biodiversity Data.” In DNA Barcoding: Methods and Protocols, 403–41. Springer.
Yu, D. W., Y. Ji, B. C. Emerson, X. Wang, C. Ye, C. Yang, and Z. Ding. 2012. “Biodiversity Soup: Metabarcoding of Arthropods for Rapid Biodiversity Assessment and Biomonitoring.”Methods in Ecology and Evolution 3 (4): 613–23. https://doi.org/10.1111/j.2041-210X.2012.00198.x.
sequences with <0.7 consensus with a sequence in a reference database were not included.↩︎
We use the term ‘group’ to indicate a group of species that cannot be confidently separated by variation in the COI gene, without any implication of a species complex.↩︎
Note these hyperlinks point to the BOLD API. To find the BOLD page for any bin_uri, simply replace ‘XXXXXX’ in the following URL with the 7-character bin_uri: https://portal.boldsystems.org/bin/BOLD:XXXXXX. Some bin_uris have DOIs and can similarly be linked as https://doi.org/10.5883/BOLD:XXXXXXX.↩︎
In bug_database_functions.R: see the data downloads page of the database website.↩︎