6  DNA-derived data tables

Christopher J Walsh and Melissa E Carew

Version 2.1 of the database and this manual have arisen from the generation of a new, large set of DNA-derived species-level data between 2018 and 2024. The collection of such data began in 2018 with a study comparing the processing of standard RBA samples from 46 sites across Melbourne, sorted in the laboratory, using standard morphological identification approaches and two DNA methods: barcoding of individual specimens and metabarcoding of all sorted specimens in a sample combined (Carew et al. 2019). This led to a larger study, which used metabarcoding analysis to identify macroinvertebrate species in pairs of combined, unsorted RBA samples collected from 225 sites over 3 years (papers in preparation). Combined, these studies identified 931 aquatic macroinvertebrate species across the region.

In this chapter, we describe the metabarcoding analysis of samples, and the incorporation of such data into the database. But we begin by describing the structure of taxonomic tables in the database that we compiled from a library linking barcodes to taxonomic identities. Such a library is a pre-requisite for metabarcoding analysis.

6.1 Table taxon_seq : a reference library of barcodes, sequences and taxonomic identities

In the context of this database, DNA barcoding refers to the DNA sequencing of a select region of the mitochondrial gene, cytochrome c oxidase 1 (COI), which permits identification of almost all animal species (Hebert et al. 2003). DNA metabarcoding refers to the DNA sequencing of bulk (or mixed DNA) samples (Yu et al. 2012). Our metabarcoding workflow uses two amplicons with sequence de-noising performed using DADA2 (Callahan et al. 2016), producing amplicon sequence variants (ASVs), each with a unique asv_code. DADA2 ensures that each ASV has a single asv_code independent of the input dataset.

We used the public reference databases GenBank (Benson et al. 2012) and BOLD Systems version 5 (Ratnasingham et al. 2024), and additional unpublished DNA barcodes for freshwater macroinvertebrates from greater Melbourne (Carew, unpublished data) to determine the taxonomic identity (mostly species) of ASVs from metabarcoding of stream macroinvertebrate samples collected across the Melbourne region.

All ASVs derived from metabarcoding 1, and their taxonomic identities, are stored in table taxon_seq2. The ASVs include the target taxa (stream macroinvertebrates), but also other taxa caught incidentally in the nets, and unassigned sequences. While the focus of the database is stream macroinvertebrates, we include the full range of ASVs in the database to permit filtering of sequences and the updating of taxonomic identities as speciments are DNA-barcoded, or taxonomy is determined or revised. Table taxon_seq has three central fields:

The table has 11 fields, the three most central being:

  • asv-code, a unique 32-character code for each ASV;

  • lowest_taxon The taxonomic identity of the sequence: 66% of sequences have been identified to species.

  • asv-seq, the DNA sequence of each ASV (a unique string of 140–430 characters, each representing a DNA nucleotide (A, C, G, T);

The remaining 8 fields (taxon_level, max_p_identity, max_p_update, amplicon, bold_match, source_no, animal, and aq-macro: all described below) provide criteria for searching and subsetting the table. But first we describe several tables added to the database to aid exploration of the taxon_seq taxonomic identities, and the methods we used to assign those taxonomic identities.

6.1.1 Nested taxonomical tables t_kingdom_phylum, t_phylum_class, t_class_order, t_order_fam, and t_fam_gen

The five nested taxonomic tables, t_kingdom_phylum, t_phylum_class, t_class_order, t_order_fam, and t_fam_gen, each with two columns as suggested by their names, provide a hierarchical set of taxonomic identities to permit building of taxonomic tables for every record in taxon_seq. A sixth, unnested table t_fam_subfam records sub-family names, used in species names in some cases. These tables differ from the nested taxonomic tables for assigning taxoncodes to aquatic macroinvertebrates (Section 5.2), as they include non-aquatic-macroinvertebrate taxa, and lack taxoncodes. However, all genera and family names in the aquatic invertebrate tables taxon_fam and taxon_gen are also in t_fam_gen, permitting links between the tables.

The function build_taxonomic_table()3 automates the process of building taxonomic tables. We demonstrate the function by selecting four example records from taxon_seq (Code and tables 6.1). In doing so, we also introduce five other fields in the taxon_seq table.

  • taxon_level, one of “species”,“genus”,“family”,“order”,“class”,“phylum”, and “kingdom”, indicates the taxonomic level to which lowest_taxon is identified. We used this field to select three example records identified to species and one identified to family.

  • aq_macro, an integer code indicating if the taxon to which the sequence belongs is a freshwater macroinvertebrate (1); 0 if not. Derived using the `only_aqu_macro_invert() function and the non_aqu_macro_invert table (see below). Taxa identified to higher taxonomic levels in groups for which some members are freshwater (e.g. Chrysomelidae) have aq_macro = 0.5. We used this to select the two aquatic macroinvertebrate examples (Gripopterygidae and Austrosimulium furiosum), and the non-aquatic animal (the terrestrial oligochaete, Aporrectodea caliginosa; aq_macro == 0 & animal = 1).

  • animal, an integer code indicates if the taxon to which the sequence belongs is an animal (1); 0 if not = 0, or -1 if identification is uncertain at phylum level or above. We used this field to select the first example—a non-animal (the red alga, Audouinella hermannii)—and the example of a non-aquatic animal (aq_macro == 0 & animal = 1).

Finally, in presenting the taxonomic table, we added two fields from taxon_seq commonly used in the metabarcoding analysis (which would also require addition of the asv_code and asv_seq fields, which we omit for this example):

  • max_p_identity, the maximum percentage similarity of each sequence to records in all of the libraries used for identification (minimum 97 for species-level identifications);

  • amplicon, one of “long”, “short”, “left” or “right”; descriptors based on the size or position of the amplicon within the DNA barcode region4.

*Code and tables 6.1. Code for selecting an example set of four taxon_seq records and for building a taxonomic table for them.

###########
# Note: the following code assumes you have created db_m, a connection to the
# mwbugs database as described in section 2.2, to permit loading the following
  taxon_all <- DBI::dbReadTable(db_m, "taxon_all")
  taxon_seq <- DBI::dbReadTable(db_m, "taxon_seq")
  asv_bin <- DBI::dbReadTable(db_m, "asv_bin")
  non_aqu_macro_invert <- DBI::dbReadTable(db_m, "non_aqu_macro_invert")
  miseq_samples <- DBI::dbReadTable(db_m, "miseq_samples")
  miseq_asv <- DBI::dbReadTable(db_m, "miseq_asv")
  biota <- DBI::dbGetQuery(db_m, "SELECT * FROM biota;")
  samples <- DBI::dbReadTable(db_m, "samples")
# and source the metabarcoding functions at
source(
  "https://tools.thewerg.unimelb.edu.au/mwbugs/data/metabarcoding_functions.R"
  )
###########

# Examples from taxon_seq
eg_taxon_seq <- rbind(
   # a non-animal, species-level record 
   #  (i.e. taxon_seq$animal == 0 & taxon_seq$taxon_level == "species")
  taxon_seq[taxon_seq$lowest_taxon == "Audouinella hermannii",][1,],
   # a non-aquatic animal species-level record (i.e. taxon_seq$aq_macro == 0 & 
   #       taxon_seq$animal == 1 & taxon_seq$taxon_level == "species")
  taxon_seq[taxon_seq$lowest_taxon == "Aporrectodea caliginosa",][1,],
   # An aquatic macroinvertebrate identified only to family
  taxon_seq[taxon_seq$aq_macro == 1 & taxon_seq$taxon_level == "family",][326,],
   # An aquatic macroinvertebrate identified to species
  taxon_seq[taxon_seq$aq_macro == 1 & taxon_seq$taxon_level == "species",][4000,]
                      )
eg_taxon_seq$asv_seq <- paste0(substr(eg_taxon_seq$asv_seq,1,6), "... ", 
                               nchar(eg_taxon_seq$asv_seq), " characters")
flextable::flextable(eg_taxon_seq[c("asv_code","lowest_taxon","asv_seq")])

asv_code

lowest_taxon

asv_seq

8ad32b8d3aa74bb3ca1dc9f2919de4ba

Audouinella hermannii

TTTAAG... 313 characters

20528b32484e8f6dbf288c84f748a328

Aporrectodea caliginosa

TCTAGC... 313 characters

036f19da8274ac7ebd219f54fa480730

Chironomidae

CCTTTC... 313 characters

30af995b0d7eaa478289cfc7f38c84b2

Dinotoperla hirsuta

TTTATC... 313 characters

# use build_taxonomic_table function (mwbugs_connection not necessary if all 
# taxonomic tables (t_fam_gen etc) and taxon_all are already in the environment)
eg_taxonomy <- build_taxonomic_table(eg_taxon_seq$lowest_taxon,
                                     mwbugs_connection = db_m,
                                     taxon_tables_loaded = FALSE)
# remove the "unmatched" column, which flags any unmatched lowest_taxon values
# (None in this case), and add max_p_identity and amplicon fields
tab <- cbind(eg_taxonomy[names(eg_taxonomy) != "unmatched"],
             eg_taxon_seq[c("max_p_identity","amplicon")])
flextable::flextable(tab)

kingdom

phylum

class

order

family

genus

species

taxoncode

max_p_identity

amplicon

Eukaryota

Rhodophyta

Florideophyceae

Acrochaetiales

Acrochaetiaceae

Audouinella

Audouinella hermannii

99.00

right

Eukaryota

Annelida

Clitellata

Crassiclitellata

Lumbricidae

Aporrectodea

Aporrectodea caliginosa

100.00

right

Eukaryota

Arthropoda

Insecta

Diptera

Chironomidae

QDAZ

94.17

right

Eukaryota

Arthropoda

Insecta

Plecoptera

Gripopterygidae

Dinotoperla

Dinotoperla hirsuta

QP030210

99.36

right

6.1.2 Assigning taxonomic identities and table asv_bin

We assigned a species name to an ASV if it had ≥97% similarity to a reference barcode, with a few minor exceptions to avoid unnecessary splitting or lumping that would be caused by a match near 97%. Matches with 95–97% similarity were identified to genus, 92–95% to family, and 85–92% to order. Matches were first assigned from GenBank and M Carew’s private library, and then cross-checked with the BOLD database.

Two additional taxon_seq fields can be used to find records with differing levels of match to the BOLD database

  • bold_match, indicates the degree of match to the BOLD database: 2 if the record has a species level match to the BOLD database (22,914 records); 1 if matched at a higher taxonomic level (5,958) 0 if not matched (8,087);

  • max_p_update, a binary variable: 1 if max_p_identity has been changed from the original vsearch value (usually using the maximum similarity with a BOLD database record), 0 if not.

The BOLD database groups barcodes into BIN (bin_uri) codes (Ratnasingham and Hebert 2013), most of which correspond to a single species. Many sequences have ≥97% similarity with more than one bin_uri. At its simplest, this means that some species in our dataset have more than one bin_uri, but in some cases bin_uris corresponding to a formal species name have ≥97% similarity with bin-uris corresponding to other names, resulting in the need to group some species. We used the following logic to naming species and species groups5.

The BOLD database groups barcodes into BIN (bin_uri) codes (Ratnasingham and Hebert 2013), most of which correspond to a single species. Many sequences have ≥97% similarity with more than one bin_uri. At its simplest, this means that some species in our dataset have more than one bin_uri, but in some cases bin_uris corresponding to a formal species name have ≥97% similarity with bin-uris corresponding to other names, resulting in the need to group some species. We used the following logic to naming species and species groups4.

  • Species with a formal name, which match one or more bin_uris (none of which are ≥97% similar to bin-uris that match other formally named species) are given the formal species name as it is. (e.g. Austroaeschna obscura, which matches bin_uri ADC44036;

  • Species that match a single bin_uri, but have not yet been linked to a formal species name, are given a name using their bin_uri (e.g. Hydrozetes sp. B-AEI5130, which matches bin_uri AEI5130);

  • Species that match two or more bin_uris, none of which have a formal identification, are given a species names using one bin_uri and the term “group” (e.g. Apsectrotanypus sp. B-AEM8177 group, which matches bin_uris AEM8177 and AEM8936);

  • Species that match more than one bin_uri that have two formal species names are given both species names separated by “/” (e.g. Hydra oligactis/robusta, which matches bin_uris AAI8740 and AAX2383);

  • Species which match more than two bin_uris that have more than two formal species names are given the most common formal species name followed by group (e.g. Austropyrgus centralia group, which matches bin-uris AAW3113, AAW3094, AAW3122, ADK5456,and AAW3117, with names A. angasi, A. turbatus, A. centralia, A. niger, A. sparsus, and A. simsonianus, and with similarity between some sequences as low as 94%).

Because DNA sequences can be matched to more than one bin_uri, the table asv_bin separately stores the bin_uris for each of the 22,264 matching sequences identified to species level (except bacteria and fungal species). The table has four fields:

  • asv_code, linking to the asv_code field in the taxon_sequence table;

  • species, Species name (equals lowest_taxon in taxon_seq table);

  • bin_uri, the BOLD codes matched to asv_codes;

  • similarity, the maximum percentage similarity of the sequence identified by asv_code to a BOLD record.

As an example, here are the first 4 records in asv_bin

flextable::flextable(head(asv_bin,4))

asv_code

species

group_no

bin_uri

similarity

772aa07cdc8d686d12455c601651f4ea

Aacanthocnema dobsoni

1

ADS1933

100.00

bb17cabc92f6f8b7798a7de1193dfc7e

Aacanthocnema dobsoni

1

ADS1933

99.36

542978486d7676cd9a4ec0b3ef1adc2c

Ablabesmyia sp. B-AAP5135 group

2

AAP5135

98.14

641827978e11fb1be0854880946edfb3

Ablabesmyia sp. B-AAP5135 group

2

AAP5135

99.51

A shorter table (3,336 records) of species names and their matching bin_uri can be compiled as follows:

species_bin <- unique(asv_bin[c("species","bin_uri")])

14,045 records in taxon_seq matched records in the BOLD database with <97% similarity (bold_match <2). 1,530 of these had >97% similarity to voucher specimens of identified species in M. Carew’s private library, and were given those species names (bold_match < 2 & taxon_level = “species”). 4,858 had sufficient similarity to a BOLD record to identify them to genus or a higher taxonomic level (bold_match = 1 & taxon_level != “species”), and 8,087 had no match to the BOLD database (bold_match = 0). In most cases this was because we had only searched for matches to these records with genus-level or species-level searches. It is likely that higher level matches (similarity <90%) would be found for these records with an exhaustive search of the BOLD database. This last group retained the higher-level taxonomic identity assigned in the initial reference database search.

The metabarcoding sample data (in the biota table) only uses records identified to species. However, we have retained the full set of detected sequences in taxon_seq, including unmatched records and those matched only to genus or above, to permit updates of taxonomic identities, to permit updates of taxonomic identities when new sequence-taxon matches are determined, or when taxonomy is revised.

6.1.3 Separating aquatic macroinvertebrate species: table non_aqu_macro_invert

The bulk-processing method for metabarcoding samples (pcode P, processing_methods table) homogenizes entire samples without thorough sorting. The homogenized samples thus typically contain terrestrial and non-invertebrate biological material. As a result, the raw metabarcoding data for import into the database includes records of many non-macroinverterbrate and non-aquatic taxa (and occasionally marine taxa).

The aq_macro field in taxon_seq distinguishes aquatic macroinvertebrates (aq_macro = 1) from other taxa. This classification was determined using the function only_aqu_inverts()7, which classifies taxonomy tables. The function reads the table non_aqu_macro_invert which lists those taxa considered non-aquatic or non-macroinvertebrate.

The function only_aqu_inverts() requires an input table with fields phylum, class, order, family, genus and species, as built by the function build_taxonomy_table() (Code and tables 6.1, above). It reads the input table hierarchically, splitting it into “include” (aquatic macroinvertebrate) and “exclude” (non-aquatic or non-macroinvertebrate) tables, by matching taxa listed in the non_aqu_macro_invert table for exclusion and inclusion.

Broadly the logic of the table and the function is as follows. Only phyla Annelida, Arthropoda, Mollusca, Cnidaria, Platyhelminthes, Nemertea, and Nematoda are included. This excludes non-animals, tardigrades, Rotifera and non-invertebrates. Terrestrial or largely-terrestrial invertebrate classes Collembola and Diplopoda, and families such as Cicadellidae, Delphacidae and Thripidae are also excluded. Macroinvertebrate families and genera with non-aqautic members in the minority are dealt with by excluding relevant genera or species. Mostly aquatic families and genera are dealt with by conditionally excluding the family/genus with relevant exceptions listed for inclusion.

Microcrustacea, which are unlikely to be reliably caught in the 250-\(\mu\)m nets used for macroinvertebrate sampling, are also excluded: class Ostracoda, and and families Chydoridae and Daphniidae.

Note that the table (see Table 6.1 for an indicative selection of its content) was not constructed systematically from a taxonomic base, but iteratively and reactively to data received for import into the database. Additions and refinements to the table are likely with further use.

Table 6.1: An indicative selection of rows from the non_aqu_macro_invert table

rank

taxon_gp

taxon

description

exclusion_reason

exclusion_exception

1

phylum

Ascomycota

fungi

non-invertebrate

0

2

order

Cyclopoida

copepods

predominantly micro

0

3

family

Thripidae

thrips

non-aquatic

0

4

genus

Paederus

Rove beetles

terrestrial

0

5

species

Rhopalosiphum padi

aphid

non-aquatic

0

5

species

Lumbricillus variegatus

aquatic Enchytraeid oligochaetes

NOT EXCLUDED: aquatic species

1

6.1.4 Tracking reference library sources: table asv_sources

The last undescribed field in table taxon_seq is:

  • source_no, an integer code indicating the source for this sequence in the database, which links to the asv_sources table.

Table asv_sources records the original data sources from which each sequence in the taxon_seq table was compiled (field source). The field script records scripts that were used for compiling each set of data, and repo records the data repository where the scripts are kept.

6.1.5 Assigning taxoncodes to new species

The process of compiling taxon_seq identified or created 501 species that were not already in the database taxonomic tables taxon_spp or morphospp_etc. We assigned a shortcode to each species using the conventions of the shortcode field of the taxon_all table, as described in (Section 5.2). This was achieved in a automated process using code in section “Revision of taxonomic tables” in 01_add_asv_data_to_mwbugs_miseq13_22_mw46.qmd8.

6.2 Source metabarcoding data: tables miseq_samples and miseq_asv

The extraction and identification of barcode sequences from the macroinvertebrate sample (smpcode) generates intermediate data that is stored separately in the database. The DNA in each sample is extracted to an extraction (given an extraction_name in the laboratory). Technical replicates (seq_code) are taken from the extraction and submitted to GenBank for analysis. For most of the samples in the database, three replicates from each sample were taken, but in earlier studies two were taken.

Each smpcode, detailed in the samples table, is therefore the product of (usually) two or three replicate seq_codes. Each seq_code has a record in the miseq_samples table, which links to the samples table by the smpcode field.

The ASVs detected in each replicate are stored in the miseq_asv table, linked to miseq_samples by the seq_code field. The taxonomic identity of each asv_code in miseq_asv is linked to taxon_seq by the asv_code field.

Table miseq_asv contains all sequences detected in each seq_code (with consensus >=0.7 and constituting >=0.0001% of the reads in a replicate). We aggregated and filtered the seq-code sequence data to produce the set of species present in each smpcode that was added to the biota table. We aggregated the asv_codes to taxonomic identity and filtered them to include only aquatic macroinvertebrate taxa identified to species level. We filtered only a small number of likely false positive records, but otherwise, the species recorded for each smpcode are not filtered based on the number of reads.

6.2.1 Table miseq_samples

Table miseq_samples details each replicate miSeq sample (seq_code) sent to GenBank. Each processed smpcode was given a unique extraction_name, and the seq_code for each replicate is the extraction_name + “rep” + replicate number (replicate).

smpcode_exp and subsamp_perc are fields added for the ten samples taken for the bulk_processing experiment [ref]. 4 subsamples of each smpcode were processed separately, and each subsample of each sample has a smpcode_exp (= smpcode followed by the percentage subsample), and a percentage size subsamp_perc.

The remaining fields record technical details about the replicate: [MEL: can you add a short description of what each of these mean, or alternatively let me know if any of them do not need to be put in the database?]

  • samp_no, the order of samples in the 96-well plate;

  • plate_position, sample position in in the plate (rows A-H, column 1-12);

  • i7_index, Illumina Nextera index (8 bp) used for sample labelling with the Illumina p7 primer;

  • i5_index, Illumina Nextera index (8 bp) used for sample labelling with the Illumina p5 primer;

  • amplicons, the amplified regiosn of the DNA used for metabarcoding (see footnote 4).

The final field, miseq_run, permits grouping of seq_codes that were part of the same process, important for checking potential contamination among samples across a run (see Section 6.2.3).

6.2.2 Table miseq_asv

Table miseq_asv records all asv_codes recorded in each miseq_code, with the number of reads. This table contains the fields:

  • miseq_code, linking to table miseq_samples;

  • asv_code, linking to tables asv_bugcode and taxon_seq;

  • reads, the number of sequences produced for each ASV;

  • likely_true_positive, a logical variable—TRUE for all but 28 records that are likely false positives (see next section).

This table is the source data for the smpcode-species data added to the biota table. The following code replicates the process used to aggregate and filter these data for appending to the biota table.

# Filter miseq_asv to include only aquatic macroinvertebrate species that are
#        likely true positives
add_asvs <- miseq_asv[miseq_asv$asv_code %in% 
                       taxon_seq$asv_code[taxon_seq$taxon_level == "species" & 
                                          taxon_seq$aq_macro == 1] &
                        miseq_asv$likely_true_positive,]
# Add species field from taxon_seq and smpcode field from miseq_samples                      
add_asvs <- dplyr::mutate(add_asvs, 
                     species = taxon_seq$lowest_taxon[match(add_asvs$asv_code,
                                                      taxon_seq$asv_code)],
                     smpcode = miseq_samples$smpcode[match(add_asvs$seq_code,
                                                      miseq_samples$seq_code)],
                     .after = asv_code)
# Add taxoncode field from taxon_all
add_asvs <- dplyr::mutate(add_asvs,
                     taxoncode = taxon_all$shortcode[match(add_asvs$species, 
                                                           taxon_all$taxon)],
                     .after = asv_code)
# Aggregate species occurrence by smpcode, recording criteria that may be used 
# for filtering (maximum reads, no. replicates detected in)
add_biota <- aggregate(add_asvs$reads, 
                       by = list(smpcode = add_asvs$smpcode,
                                 taxoncode = add_asvs$taxoncode, 
                                 taxon = add_asvs$species), FUN = max)
add_biota$notes <- paste0("metabarcoding: maximum no. reads = ", add_biota$x)
for(i in 1:nrow(add_biota)){
  miseq_asv_i <- unique(add_asvs[add_asvs$species == add_biota$taxon[i] & 
                                   add_asvs$smpcode == add_biota$smpcode[i],
                                 c("species","seq_code")])
  add_biota$notes[i] <- paste0(add_biota$notes[i],  "; detected in ", 
                                 nrow(miseq_asv_i), " out of ", 
                                 sum(miseq_samples$smpcode == 
                                       add_biota$smpcode[i]), " replicates")
}
# Order and arrange fields for appending to biota table
add_biota <- add_biota[order(add_biota$smpcode),]
add_biota$count <- 1 # record only as presence
add_biota$coarsepick <- add_biota$originalbugcode <- NA
add_biota$shortcode <- add_biota$taxoncode
add_biota <- add_biota[match(c("smpcode","taxoncode","count","taxon",
         "coarsepick","notes","originalbugcode","shortcode"),names(add_biota))]

6.2.3 Filtering miseq_asv data before aggregating to biota table.

As coded above, the biota table includes only aquatic macroinvertebrate taxa (taxon_seq field aq_macro = 1) identified to species (taxon_seq field taxon_level = “species”).

We also excluded likely false positives (miseq_asv field likely_true_positive = FALSE). 28 records in miseq_asv were identified as likely false positives resulting from cross-contamination of sequences with many (>16,000) reads in sample in the same run. The method used to identify these likely false positives is described in supplementary material A of Wilkins et al. (in preparation).

The frequency of false positives arising from cross-contamination from sequences with few reads was very small (~0.01% of all records). Exclusion of records with small numbers of reads to remove such false positives removes many more true positives. We thus judged the small number potential remaining false positives to be an acceptably small source of error in the dataset.

Every metabarcoding record added to the biota table has a comment in the notes field that allows parsing of the maximum number of reads for each record and how many replicates it was detected in. All such notes are in this form:

metabarcoding: maximum no. reads = 16; detected in 2 out of 3 replicates

107 records from ten smpcodes have an additional phrase

; EXCLUDE this record from 3-rep 30% sample

The purpose and use of this phrase is illustrated in the example code for extracting project 3 data below (Section 6.3.3).

6.3 Metabarcoding addenda to other database tables

6.3.1 Tables samples and biota

The spring 2018 project added 92 DNA-based samples to the samples table (smpcodes ending in L—individually barcoded specimens—, or M—picked samples homogenized). These samples added 3,421 species-level records to the biota table.

The combined pairs of RBA samples collected for bulk-processed metabarcoding added 275 samples to the samples table (smpcodes ending in P). These samples added 20,691 species-level records to the biota table.

6.3.2 Tables collection_methods and processing_methods

The standard sampling approach for the metabarcoding survey was to collect two separate RBA samples and combine them. As a result, we have added two methods to the collection_methods table (Table 4.1):

  • ccode D: RBA samples combined: edge (sweep) + riffle (kick);

  • ccode E: 2 RBA edge (sweep) samples combined.

And we added three new processing methods to the processing_methods table (Table 4.2)

  • pcode L: picked samples combined and homegenized and DNA-metabarcoded (used only for the 2018 ‘46-site’ study);

  • pcode M: specimens from picked samples non-destructively DNA-metabarcoded allowing subsequent morphological identification (used only for the 2018 ‘46-site’ study);

  • pcode P: two RBA samples combined and homogenized and DNA-metabarcoded (the standard method for subsequent metabarcoding samples in the database).

6.3.3 Tables projects and sample_project_groups

With the addition of DNA-based samples, we initiated these two tables to allow easier extraction of data from the database by project. Table projects describes each project (projects 1,2,3 and 5 encompass the metabarcoding data, as described below), and table sample_project_groups contains all smpcodes that form part of the project.

The following code demonstrates the use of these tables to extract data for each of the projects:

####  
# Project 1: Trial comparison of morphological family-level ID and DNA barcoding 
# and metabarcoding on spring 2018 RBA samples (nick-name, the 46-site study)

samples_1 <- DBI::dbGetQuery(db_m, 
                          "SELECT * FROM samples 
                           JOIN sample_project_groups
                           ON samples.smpcode = sample_project_groups.smpcode
                           WHERE sample_project_groups.project_code = 1;")
# returns 184 samples from 46 sites 
biota_1 <- DBI::dbGetQuery(db_m, 
                           "SELECT * FROM biota 
                           JOIN sample_project_groups
                           ON biota.smpcode = sample_project_groups.smpcode
                           WHERE sample_project_groups.project_code = 1;")
# returns 5466 records
sites_1 <- sf::st_read(db_m, query = 
                      paste0("SELECT * FROM sites WHERE sitecode IN ('",
                             paste(samples_1$sitecode, collapse = "','"),"');"))
# returns 46 sites (a spatial table, so sf:st_read() was used)

####  
# Project 2: Experiment to optimise the processing strategy for combined pairs 
#  of RBA samples (nick-name, the bulk-processing experiment)

samples_2 <- DBI::dbGetQuery(db_m, 
                          "SELECT * FROM samples 
                           JOIN sample_project_groups
                           ON samples.smpcode = sample_project_groups.smpcode
                           WHERE sample_project_groups.project_code = 2;")
# returns 10 samples from 10 sites, which were each split into 4 sub-samples
# So, need to use the smpcode_exp and the miseq-replicate-level data
miseq_samples_2 <- DBI::dbGetQuery(db_m, 
                       "SELECT * FROM miseq_samples 
                        JOIN sample_project_groups
                        ON miseq_samples.smpcode = sample_project_groups.smpcode
                        WHERE sample_project_groups.project_code = 2;")
# returns 86 replicates (seq_codes) of 40 subsamples (smpcode_exp):
#  2-5 replicates per subsample
miseq_asv_2 <- DBI::dbGetQuery(db_m,
                        "SELECT miseq_asv.*, taxon_seq.lowest_taxon 
                         FROM miseq_asv JOIN taxon_seq
                         ON miseq_asv.asv_code = taxon_seq.asv_code 
                         WHERE taxon_seq.aq_macro = 1 
                         AND taxon_seq.taxon_level = 'species'
                         AND likely_true_positive;")
# returns 128,961 records (excludes 107 false positives). 
sites_2 <- sf::st_read(db_m, query = 
                      paste0("SELECT * FROM sites WHERE sitecode IN ('",
                             paste(samples_2$sitecode, collapse = "','"),"');"))
# returns 10 sites

####  
# Project 3:  Metabarcoded combined-RBA-sample-pair survey of 225 sites across
#   Melbourne 2021-2024 (Nickname MW metabarcoding survey)

samples_3 <- DBI::dbGetQuery(db_m, 
                          "SELECT * FROM samples 
                           JOIN sample_project_groups
                           ON samples.smpcode = sample_project_groups.smpcode
                           WHERE sample_project_groups.project_code = 3;")
# returns 184 samples from 46 sites 
biota_3 <- DBI::dbGetQuery(db_m, 
                           "SELECT * FROM biota 
                           JOIN sample_project_groups
                           ON biota.smpcode = sample_project_groups.smpcode
                           WHERE sample_project_groups.project_code = 3;")
# returns 20691 records 
# NOTE project 3 includes the samples collected for project 2, which were fully
# processed, with 8-11 replicate MiSeq samples. All species detected in those
# samples are included in the biota table.  The samples can be adjusted to 
# approximate the standard 30% subsampled, 3-replicate bulk-processing method
# by excluding records by excluding records with the phrase: 
# 'EXCLUDE this record from 3-rep 30% sample' in the notes column 
biota_3 <- biota_3[!grepl("EXCLUDE this record from 3-rep", 
                          biota_3$notes),]
# removes 107 records that were found in other replicates/subsamples
sites_3 <- sf::st_read(db_m, query = 
                      paste0("SELECT * FROM sites WHERE sitecode IN ('",
                             paste(samples_3$sitecode, collapse = "','"),"');"))
# returns 225 sites (a spatial table, so sf:st_read() was used)

####  
# Project 5: Comparison of morphological family-level ID samples and 
# bulk-processed metabarcoding from 130 sites sampled in Spring 2021
# (Nickname 'Paired morphological/metabarcoding sample set)

samples_5 <- DBI::dbGetQuery(db_m, 
                          "SELECT * FROM samples 
                           JOIN sample_project_groups
                           ON samples.smpcode = sample_project_groups.smpcode
                           WHERE sample_project_groups.project_code = 5;")
# returns 390 samples from 130 sites (2 pairs of morphologically ID-ed samples
# and a combined pair of samples bulk-processed as a single metabarcoding sample)
biota_5 <- DBI::dbGetQuery(db_m, 
                           "SELECT * FROM biota 
                           JOIN sample_project_groups
                           ON biota.smpcode = sample_project_groups.smpcode
                           WHERE sample_project_groups.project_code = 5;")
# returns 17235 records
sites_5 <- sf::st_read(db_m, query = 
                      paste0("SELECT * FROM sites WHERE sitecode IN ('",
                             paste(samples_5$sitecode, collapse = "','"),"');"))
# returns 132 sites (a spatial table, so sf:st_read() was used)
# The additional two samples are because of morphological and metabarcoding 
# samples having been collected in slightly different locations. These reaches
# can be treated as the same for comparison of sample types (group by site_no)
sites_5$site_no[!sites_5$sitecode %in% c("STV-3149-0","PYR-6295-0")] <- 
   1:sum(!sites_5$sitecode %in% c("STV-3149-0","PYR-6295-0"))
sites_5$site_no[sites_5$sitecode == "STV-3149-0"] <- 
                sites_5$site_no[sites_5$sitecode == "STV-2279-5"]
sites_5$site_no[sites_5$sitecode == "PYR-6295-0"] <- 
                sites_5$site_no[sites_5$sitecode == "PYR-6295-2"]
# 130 site_nos
# Add site_no to samples_5 for grouping
samples_5$site_no <- sites_5$site_no[match(samples_5$sitecode,sites_5$sitecode)]
# Note there are no sample_pairs statistics for the metabarcoding samples.

  1. sequences with <0.7 consensus with a sequence in a reference database were not included.↩︎

  2. Code used to compile the library is stored in Appendix 2 of the metabarcoding workflow repository (https://github.com/mecarew/metabarcoding_workflow). Code used to add barcodes from all samples is recorded in “01_add_asv_data_to_mwbugs_miseq13_22_mw46.qmd”, which can serve as a template for adding future barcode data to the database. Those two documents are stored in https://github.com/cjbwalsh/bulk_processing_experiment_metabarcoding.↩︎

  3. in metabarcoding_functions.R: : see the data downloads page of the database website.↩︎

  4. “short” amplicons used miCOIintF/BF1 with BR1/R5 primers (Hajibabaei et al. 2012; Leray et al. 2013; Elbrecht and Leese 2017) amplifying ~217 bp. “right” amplicons used miCOIintF/BF1 with dgHCO2198/ LepR1 primers (Hebert et al. 2004; Leray et al. 2013; Elbrecht and Leese 2017) amplifying ~313 bp).↩︎

  5. We use the term ‘group’ to indicate a group of species that cannot be confidently separated by variation in the COI gene, without any implication of a species complex.↩︎

  6. Note these hyperlinks point to the BOLD API. To find the BOLD page for any bin_uri, simply replace ‘XXXXXX’ in the following URL with the 7-character bin_uri: https://portal.boldsystems.org/bin/BOLD:XXXXXX. Some bin_uris have DOIs and can similarly be linked as https://doi.org/10.5883/BOLD:XXXXXXX.↩︎

  7. In bug_database_functions.R: see the data downloads page of the database website.↩︎

  8. Stored in the repository github.com/cjbwalsh/bulk_processing_experiment_metabarcoding↩︎