Continued from the KBase narrative "JGI QC impact on assembly, binning, phylogenomics, and functional analysis" [34]:

Please also reference the journal article:

Trimming and decontamination of metagenomic data can significantly impact assembly and binning metrics, phylogenomic and functional analysis

Jason M. Whitham and Amy M. Grunden, 2021

[email protected]✉ and [email protected]

North Carolina State University, 4550A Thomas Hall, Box 7615, Raleigh NC, 27695, United States of America

Assembly and binning of select readsets trimmed and decontaminated with recommended parameters

Modules used in 10158.6*fastq processing

Modules used in 9117.8*fastq processing

Modules used in 9108.2*fastq processing

Modules used in 9117.7*fastq processing

Modules used in 11306.3*fastq processing

Modules used in 9117.4*fastq processing

Trimming and decontamination removed as much as tens of millions of reads and tens of billions of bases from read files

Eighty-four trimmed and/or decontaminated fastq files were generated from raw fastq files to evaluate the effects of these methods on assembly and binning metrics. These read files were created by sequentially force trimming, kmer trimming, quality trimming, decontaminating, or some combination of these steps, based on recommendations posted in online bioinformatics forums. The number and percentage of reads and bases removed are provided in the Supplemental Files. When used, force trimming removed no reads but removed about 5% of bases. Kmer trimming removed about 4% or fewer of reads and as much as 5% of bases. Decontamination of raw fastq files removed between 0 and 7% of reads and bases, and less than 3% for files that were force, quality, and/or kmer trimmed prior to decontamination. Quality trimming to Q10 removed about 2 to 4% of both reads and bases, while quality trimming to Q20 removed about 8 to 15% of reads and about 9 to 16% of bases. No generalizations could be made about which steps had the greatest impact on reads or bases, except that quality trimming to Q20 consistently removed the most of each. Total reads removed by all combinations of steps tested ranged from about 0 to 16%, and total bases removed ranged from 1 to 22%. The greatest change was from about 399M to 334M reads and from about 60.3B to 46.7B bases.
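The percentages in this paragraph are simple before/after ratios. As a minimal sketch, the largest change can be recomputed from the rounded counts quoted in the text; the function name `percent_removed` is introduced here for illustration only and is not part of the narrative's code.

```python
def percent_removed(before, after):
    """Return the percentage of `before` that is no longer present in `after`."""
    return 100.0 * (before - after) / before

# Rounded counts for the largest change quoted in the text
# (about 399M -> 334M reads and about 60.3B -> 46.7B bases).
reads_before, reads_after = 399e6, 334e6
bases_before, bases_after = 60.3e9, 46.7e9

reads_pct = percent_removed(reads_before, reads_after)
bases_pct = percent_removed(bases_before, bases_after)

print(f"reads removed: {reads_pct:.1f}%")
print(f"bases removed: {bases_pct:.1f}%")
```

Because the inputs are rounded, the results are only approximate and should be compared with the upper ends of the ranges reported above rather than read as exact values.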

In addition to the read files generated, the raw and JGI-processed reads were included in the subsequent analyses, making a total of 96 read files. These ranged from 245M to 399M reads, a span of 154M reads, and from 34.4B to 60.3B bases, a span of 25.9B bases.

Total MAG counts correlated with bases and reads

MAGs are binned contigs assembled from metagenomic reads. It is therefore no surprise that MAG counts were correlated with input read counts, expressed in tens of millions of reads (tMreads) (p_raw = 0.040, 1.664 MAGs/tMreads, 95% CI [0.080, 3.249], adjusted Pearson's r = 0.382), and with base counts, expressed in billions of bases (Bbases) (p_raw = 0.041, 1.099 MAGs/Bbases, 95% CI [0.338, 1.392], adjusted Pearson's r = 0.382). Read and base counts were also correlated with medium MAGs (0.648 medium MAGs/tMreads, 95% CI [0.129, 1.166], adjusted Pearson's r = 0.455; 0.428 medium MAGs/Bbases, 95% CI [0.085, 0.772], adjusted Pearson's r = 0.455) and good MAGs from raw reads (p_raw = 0.004, 0.529 good MAGs/tMreads, 95% CI [0.187, 0.872], adjusted Pearson's r = 0.545; p_raw = 0.004, 0.350 good MAGs/Bbases, 95% CI [0.123, 0.577], adjusted Pearson's r = 0.544). We wanted to know, though, whether the reduction of reads and bases due to trimming and decontamination also reduced MAG counts.
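The "adjusted Pearson's r" values quoted here are the square root of the adjusted R-squared of each ordinary least squares fit. A minimal sketch of that arithmetic for the raw-reads model follows, assuming the R-squared of 0.185 and the 23 observations reported in this narrative's OLS summary for that model; the helper name `adjusted_r_squared` is hypothetical.

```python
import math

def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjustment of R-squared for sample size and predictor count."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values taken from the raw-reads OLS summary (r_bins ~ tMreads):
# R-squared = 0.185, 23 raw read files, one predictor.
adj_r2 = adjusted_r_squared(0.185, n_obs=23, n_predictors=1)

# The square root is only defined when the adjusted R-squared is non-negative.
adj_r = math.sqrt(adj_r2)

print(f"adjusted R-squared: {adj_r2:.3f}")
print(f"adjusted Pearson's r: {adj_r:.3f}")
```

The square root discards the sign of the relationship, so the slope (here positive) still has to be read from the fitted coefficient.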

Since trimmed and decontaminated reads were observations dependent upon the original raw files, we used linear mixed-effects models to avoid violating the ordinary least squares model assumption that observations are independent [36]. We found that MAG counts were correlated with read and base counts of trimmed and decontaminated reads (p_trim_decon < 0.001, 2.095 MAGs/tMreads, 95% CI [1.435, 2.755]; p_trim_decon < 0.001, 1.320 MAGs/Bbases, 95% CI [1.018, 1.622]). Read and base counts of trimmed and decontaminated reads were also correlated with medium MAGs (p_trim_decon < 0.001, 0.883 medium MAGs/tMreads, 95% CI [0.492, 1.275]; p_trim_decon < 0.001, 0.610 medium MAGs/Bbases, 95% CI [0.423, 0.797]) and good MAGs (p_trim_decon = 0.003, 0.399 good MAGs/tMreads, 95% CI [0.160, 0.638]; p_trim_decon < 0.001, 0.309 good MAGs/Bbases, 95% CI [0.196, 0.421]). No significant correlations were found between average MAG completeness or contamination and read or base counts for raw or trimmed and decontaminated reads.
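To illustrate why grouping matters, the sketch below compares a pooled least squares slope with a slope estimated after centering each variable within its group. This within-group centering is a simpler stand-in, not the mixed model fitted in this narrative, and the toy data and helper names (`ols_slope`, `center_within_groups`) are hypothetical.

```python
from collections import defaultdict

def ols_slope(x, y):
    """Slope of the simple least squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def center_within_groups(values, groups):
    """Subtract each group's mean from the values belonging to that group."""
    sums, counts = defaultdict(float), defaultdict(int)
    for value, group in zip(values, groups):
        sums[group] += value
        counts[group] += 1
    return [value - sums[group] / counts[group]
            for value, group in zip(values, groups)]

# Hypothetical toy data: two source samples with different baselines.
groups = ["A", "A", "A", "B", "B", "B"]
reads = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mags = [10.0, 11.0, 12.0, 3.0, 4.0, 5.0]

# Pooled fit ignores which sample each observation came from.
pooled_slope = ols_slope(reads, mags)

# Within-group fit removes each sample's baseline before fitting.
within_slope = ols_slope(center_within_groups(reads, groups),
                         center_within_groups(mags, groups))

print(f"pooled slope: {pooled_slope:.3f}")
print(f"within-group slope: {within_slope:.3f}")
```

In this toy case the pooled slope is negative even though MAG counts rise with reads inside each group, which is the kind of distortion that modelling the source sample as a grouping factor is meant to avoid.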

#MAG counts were correlated with read counts of trimmed and decontaminated reads
#MAG counts were correlated with read counts of raw reads at alpha = 0.05

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols

#data
Mix_Group = ['10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4']
td_read_files = ['10158.6_raw', '10158.6_qc', '10158.6_trim150', '10158.6_ftrim', '10158.6_ktrim', '10158.6_atrim', '10158.6_aqbtrim', '10158.6_aqtrim', '10158.6_qbtrim', '10158.6_qtrim', '10158.6_bb1', '10158.6_bb2', '10158.6_bb3', '10158.6_bb4', '10158.6_bb5', '10158.6_bb6', '9117.8_raw', '9117.8_qc', '9117.8_trim150', '9117.8_ftrim', '9117.8_ktrim', '9117.8_atrim', '9117.8_aqbtrim', '9117.8_aqtrim', '9117.8_qbtrim', '9117.8_qtrim', '9117.8_bb1', '9117.8_bb2', '9117.8_bb3', '9117.8_bb4', '9117.8_bb5', '9117.8_bb6', '9108.2_raw', '9108.2_qc', '9108.2_trim150', '9108.2_ftrim', '9108.2_ktrim', '9108.2_atrim', '9108.2_aqbtrim', '9108.2_aqtrim', '9108.2_qbtrim', '9108.2_qtrim', '9108.2_bb1', '9108.2_bb2', '9108.2_bb3', '9108.2_bb4', '9108.2_bb5', '9108.2_bb6', '9117.7_raw', '9117.7_qc', '9117.7_trim150', '9117.7_ftrimmed', '9117.7_ktrimmed', '9117.7_atrimmed', '9117.7_aqbtrimmed', '9117.7_aqtrimmed', '9117.7_qbtrimmed', '9117.7_qtrimmed', '9117.7_bb1', '9117.7_bb2', '9117.7_bb3', '9117.7_bb4', '9117.7_bb5', '9117.7_bb6', '11306.3_raw', '11306.3_qc', '11306.3_trim150', '11306.3_ftrimmed', '11306.3_ktrimmed', '11306.3_atrimmed', '11306.3_aqbtrimmed', '11306.3_aqtrimmed', '11306.3_qbtrimmed', '11306.3_qtrimmed', '11306.3_bb1', '11306.3_bb2', '11306.3_bb3', '11306.3_bb4', '11306.3_bb5', '11306.3_bb6', '9117.4_raw', '9117.4_qc', '9117.4_trim150', '9117.4_ftrimmed', '9117.4_ktrimmed', '9117.4_atrimmed', '9117.4_aqbtrimmed', '9117.4_aqtrimmed', '9117.4_qbtrimmed', '9117.4_qtrimmed', '9117.4_bb1', '9117.4_bb2', '9117.4_bb3', '9117.4_bb4', '9117.4_bb5', '9117.4_bb6']
td_tMreads = [36.0129894, 35.2337896, 36.0129894, 36.0129894, 35.8983284, 35.933862, 34.552143, 31.2682706, 34.3282984, 30.6449696, 35.8442254, 31.2345964, 30.615651, 35.3058416, 34.4731868, 34.2527702, 28.2058246, 26.5787996, 28.2058246, 28.2058246, 27.66738, 27.6666568, 27.2469874, 25.5238858, 27.2469874, 25.2340502, 26.886397, 24.8035162, 24.519662, 26.7855648, 26.4733418, 26.3827484, 39.9148934, 37.3791976, 39.9148934, 39.9148934, 38.7122998, 38.708094, 37.7905962, 34.6650054, 37.6107554, 34.1456456, 37.8858836, 33.921016, 33.4111558, 37.7734844, 36.9821596, 36.8056446, 33.4711394, 32.6004118, 33.4711394, 33.4711394, 33.04149, 33.0383456, 30.3924568, 32.5038064, 30.0542808, 32.3956334, 32.9696938, 30.3601654, 30.025172, 32.8646394, 32.4398904, 32.3335082, 31.8428354, 31.3942576, 31.8428354, 31.8428354, 31.637986, 31.6358996, 29.2361886, 31.0176556, 28.9929926, 30.9364282, 31.6358006, 29.2361776, 28.9929822, 31.5146638, 31.0175998, 30.9363772, 34.9441596, 32.3312534, 34.9441596, 34.9441596, 33.4723724, 33.4683444, 30.8153706, 32.9387816, 30.4748512, 32.8289102, 32.6738296, 30.0963168, 29.7659662, 32.561628, 32.1535464, 32.0471204]
td_Bbases = [54.379613994, 52.641056249, 54.0194841, 51.498574842, 50.780358538, 53.669962132, 51.549471661, 45.964065176, 48.624647571, 42.845175292, 53.535374086, 45.919226343, 42.80751312, 53.311820816, 51.43795566, 48.522733084, 42.590795146, 40.133987396, 42.3087369, 40.334329178, 39.44260618, 41.667343504, 40.776782835, 37.725547193, 40.776782835, 35.427126679, 40.490156646, 36.660246851, 34.422663647, 40.446202848, 39.621053097, 37.463971015, 60.271489034, 56.442588376, 59.8723401, 57.078297562, 55.202040634, 58.310667004, 56.435790641, 51.019814812, 53.310062558, 47.762707137, 57.069909524, 49.92030233, 46.728770205, 57.037961444, 55.229109813, 52.166971365, 50.541420494, 49.226621818, 50.2067091, 47.863729342, 47.116772654, 49.7656626, 44.974095956, 48.667868199, 42.241274113, 46.022382453, 49.6630349, 44.930324539, 42.203344352, 49.625605494, 48.576089344, 45.937352987, 48.082681454, 47.024019995, 47.7642531, 45.535254622, 45.103175218, 47.640851402, 43.590661125, 46.584185484, 41.025529046, 44.063678425, 47.640703902, 43.590646205, 41.025515524, 47.587142338, 46.584108342, 44.063610873, 52.765680996, 48.820192634, 52.4162394, 49.970148228, 47.71891569, 50.406042552, 45.62052235, 49.326011714, 42.846059312, 46.640424087, 49.207539094, 44.559568777, 41.852323333, 49.16805828, 48.15227162, 45.531321684]
td_bins = [78, 85, 82, 83, 85, 83, 82, 78, 79, 63, 90, 72, 67, 78, 85, 83, 65, 62, 64, 52, 55, 59, 59, 56, 54, 50, 62, 55, 50, 60, 61, 53, 69, 76, 74, 75, 75, 71, 73, 72, 68, 65, 74, 71, 66, 74, 74, 65, 95, 103, 107, 96, 97, 103, 88, 98, 79, 81, 104, 90, 76, 101, 101, 90, 99, 100, 96, 97, 94, 97, 96, 97, 95, 82, 101, 96, 80, 97, 96, 91, 70, 68, 70, 65, 68, 69, 64, 61, 65, 64, 77, 62, 64, 78, 65, 62]
td_Mean_Completeness = [53.86, 53.28, 56.31, 57.37, 54.68, 50.1, 58.05, 54.04, 51.16, 54.76, 51.12, 52.23, 57.52, 52.15, 58.83, 54.29, 55.06, 54.27, 53.37, 53.16, 53.77, 57.81, 54.55, 54.88, 54.17, 52.29, 54.94, 52.04, 51.41, 58.36, 54.69, 56.81, 56.66, 53.72, 57.38, 56.22, 52.7, 52.65, 55.11, 54.88, 57.48, 54.64, 53.6, 54.34, 58.18, 52.11, 56.26, 58.28, 55.64, 57.45, 58.55, 54.54, 57.45, 58.65, 55.18, 55.5, 58.25, 60.02, 55.82, 58.17, 60.34, 56.8, 57.0, 59.14, 54.4, 51.04, 58.28, 56.73, 48.21, 51.68, 52.66, 56.77, 55.24, 57.44, 58.56, 57.33, 54.41, 56.39, 53.78, 58.04, 57.17, 59.53, 56.07, 51.82, 58.95, 60.91, 56.23, 56.59, 60.48, 59.56, 56.87, 61.29, 54.39, 60.71, 53.56, 53.89]
td_Mean_Contamination = [64.8, 56.57, 63.75, 56.14, 63.41, 65.77, 68.71, 59.0, 51.31, 53.39, 53.35, 56.24, 59.82, 71.72, 69.63, 66.78, 53.25, 46.8, 58.27, 54.69, 48.23, 50.47, 57.38, 53.78, 50.02, 45.22, 47.86, 47.19, 54.22, 48.53, 54.71, 53.82, 83.63, 71.19, 88.26, 89.13, 73.4, 65.59, 87.02, 85.4, 79.15, 67.55, 71.37, 72.84, 82.49, 64.29, 92.78, 85.56, 107.95, 99.99, 99.71, 85.91, 99.47, 78.31, 82.71, 107.35, 97.95, 92.98, 87.49, 93.41, 102.6, 77.96, 97.57, 99.91, 83.7, 63.09, 68.23, 69.77, 73.61, 78.21, 70.56, 69.82, 69.56, 64.5, 63.16, 82.06, 72.7, 69.26, 75.71, 62.79, 62.01, 54.02, 69.1, 58.79, 55.27, 51.99, 60.61, 57.98, 59.85, 65.24, 56.41, 58.49, 58.44, 53.79, 57.06, 54.33]
td_good_bins = [21, 19, 19, 14, 15, 20, 20, 17, 17, 12, 19, 16, 12, 22, 20, 17, 16, 16, 17, 17, 15, 17, 19, 15, 15, 14, 16, 16, 15, 18, 19, 16, 18, 17, 17, 16, 16, 16, 17, 17, 14, 15, 17, 15, 15, 19, 15, 17, 22, 21, 18, 19, 21, 22, 17, 21, 21, 16, 23, 16, 17, 24, 20, 21, 19, 21, 21, 15, 18, 19, 20, 18, 15, 17, 19, 18, 18, 18, 19, 16, 17, 17, 19, 18, 18, 17, 19, 17, 19, 18, 18, 16, 18, 21, 20, 17]
td_good_Mean_Completeness = [85.98, 87.17, 86.9, 87.66, 86.64, 86.04, 85.18, 86.79, 86.35, 86.03, 88.47, 85.58, 89.46, 86.17, 83.85, 86.86, 87.87, 87.38, 87.61, 86.94, 87.35, 86.96, 88.23, 88.04, 88.62, 90.16, 88.26, 89.11, 86.1, 88.26, 87.55, 87.21, 87.51, 86.6, 87.87, 87.83, 86.62, 87.87, 87.12, 87.7, 87.06, 87.94, 87.37, 85.69, 87.28, 85.92, 88.11, 86.67, 87.68, 87.33, 87.89, 88.48, 89.21, 88.4, 86.1, 86.69, 87.99, 88.53, 89.18, 87.33, 86.83, 88.58, 87.12, 87.34, 88.54, 86.69, 87.03, 86.06, 88.99, 86.81, 86.17, 86.12, 87.78, 85.64, 86.41, 87.08, 85.98, 88.56, 87.42, 87.18, 85.99, 87.07, 86.97, 86.86, 86.75, 89.43, 86.51, 86.19, 86.17, 85.87, 86.68, 87.63, 86.26, 89.38, 87.22, 87.03]
td_good_Mean_Contamination = [5.06, 4.78, 4.28, 4.58, 3.83, 4.25, 4.66, 4.68, 4.57, 4.01, 4.01, 4.34, 4.43, 4.96, 4.23, 5.17, 3.79, 3.63, 3.76, 3.61, 3.99, 3.66, 4.04, 3.17, 3.67, 3.37, 4.3, 3.39, 3.89, 3.63, 4.05, 4.51, 3.87, 3.93, 3.27, 3.48, 4.16, 4.88, 3.89, 3.16, 4.65, 4.07, 4.41, 4.22, 4.07, 3.78, 3.85, 4.16, 3.92, 3.06, 4.2, 4.41, 3.21, 3.81, 4.31, 3.64, 3.84, 4.09, 3.84, 3.64, 3.48, 3.71, 3.4, 4.06, 3.76, 4.32, 4.61, 4.74, 3.56, 4.54, 4.1, 4.05, 4.26, 4.31, 4.84, 4.06, 3.88, 4.59, 4.63, 3.76, 3.57, 3.88, 3.56, 3.5, 3.68, 4.71, 3.46, 3.84, 3.94, 3.61, 3.5, 3.36, 3.4, 3.57, 3.91, 4.21]
r_read_files = ['9117.5_raw', '10158.8_raw', '11263.1_raw', '11306.3_raw', '11306.1_raw', '11260.6_raw', '11260.5_raw', '9108.1_raw', '9053.2_raw', '9672.8_raw', '9108.2_raw', '9053.4_raw', '9053.3_raw', '9117.4_raw', '9117.6_raw', '9117.7_raw', '9117.8_raw', '10158.6_raw', '10186.3_raw', '10186.4_raw', '7331.1_raw', '9053.5_raw', '9041.8_raw']
r_tMreads = [36.0129894, 17.6218972, 38.2800142, 34.9076424, 35.3037194, 37.1504476, 40.3613864, 20.7773948, 31.8428354, 30.1166938, 27.718318, 40.8492618, 39.7169858, 34.4581152, 26.9696492, 21.3309852, 39.9148934, 34.9441596, 35.690255, 35.5019026, 33.4711394, 28.2058246, 36.96984]
r_Bbases = [54.379613994, 26.609064772, 57.802821442, 52.710540024, 53.308616294, 56.097175876, 60.945693464, 31.373866148, 48.082681454, 45.1750407, 41.85466018, 61.682385318, 59.972648558, 52.031753952, 40.724170292, 32.209787652, 60.271489034, 52.765680996, 53.89228505, 53.607872926, 50.541420494, 42.590795146, 55.8244584]
r_bins = [65, 47, 139, 99, 55, 90, 115, 38, 86, 87, 69, 71, 95, 70, 62, 95, 65, 78, 85, 109, 45, 49, 52]
r_Mean_Completeness = [45.96, 58.83, 51.28, 56.54, 56.55, 63.26, 52.23, 58.47, 54.69, 53.7, 58.18, 60.32, 62.41, 65.52, 52.14, 50.97, 56.26, 57.0, 65.39, 54.89, 53.78, 53.56, 52.81]
r_Mean_Contamination = [24.03, 69.63, 67.13, 60.4, 53.21, 81.43, 55.06, 35.14, 54.71, 74.62, 54.7, 76.81, 66.86, 78.94, 67.85, 46.5, 92.78, 97.57, 121.14, 103.47, 75.71, 57.06, 65.71]
r_good_bins = [15, 4, 23, 19, 14, 25, 29, 9, 20, 20, 18, 18, 23, 17, 10, 22, 16, 21, 18, 20, 9, 9, 14]
r_good_Mean_Completeness = [84.79, 83.85, 89.19, 87.14, 90.06, 87.41, 90.7, 86.6, 87.55, 85.4, 87.31, 86.68, 86.18, 84.94, 87.02, 88.92, 88.11, 87.12, 87.24, 89.3, 87.42, 87.22, 86.64]
r_good_Mean_Contamination = [3.37, 4.23, 3.57, 4.4, 3.62, 4.4, 4.56, 3.0, 4.05, 4.32, 4.53, 3.9, 4.48, 4.17, 2.86, 2.87, 3.85, 3.4, 3.63, 4.37, 4.63, 3.91, 4.25]

#create dataset
df = pd.DataFrame({'td_tMreads': td_tMreads,
                   'td_Bbases': td_Bbases,
                   'td_bins': td_bins,
                   'td_Mean_Completeness': td_Mean_Completeness,
                   'td_Mean_Contamination': td_Mean_Contamination,
                   'td_good_bins': td_good_bins,
                   'td_good_Mean_Completeness': td_good_Mean_Completeness,
                   'td_good_Mean_Contamination': td_good_Mean_Contamination,
                   'Mix_Group': Mix_Group})
df.rename(columns={'td_tMreads': 'tMreads', 'td_Bbases': 'Bbases'}, inplace=True)

df2 = pd.DataFrame({'r_tMreads': r_tMreads,
                   'r_Bbases': r_Bbases,
                   'r_bins': r_bins,
                   'r_Mean_Completeness': r_Mean_Completeness,
                   'r_Mean_Contamination': r_Mean_Contamination,
                   'r_good_bins': r_good_bins,
                   'r_good_Mean_Completeness': r_good_Mean_Completeness,
                   'r_good_Mean_Contamination': r_good_Mean_Contamination})
df2.rename(columns={'r_tMreads': 'tMreads', 'r_Bbases': 'Bbases'}, inplace=True)

#view dataset
#print(df)

#fit regression model
model = smf.mixedlm("td_bins ~ tMreads", data=df, groups=df["Mix_Group"])
modelf = model.fit()
model1 = ols('td_bins ~ tMreads', data=df).fit()
model2 = ols('r_bins ~ tMreads', data=df2).fit()

#adjusted Pearson's r = square root of the adjusted R-squared of the raw-reads OLS fit (model2)
#... r = sqrt(0.146)
#adjusted Pearson's r = 0.382

#mdf = md.fit()
#print(mdf.summary())

#view model summary
print(modelf.summary())
print(model1.summary())
print(model2.summary())

#define figure size
fig = plt.figure(figsize=(12,8))
fig2 = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(model1, 'tMreads', fig=fig)
fig2 = sm.graphics.plot_regress_exog(model2, 'tMreads', fig=fig2)

#post estimation
#refit the same mixed model by maximum likelihood (reml=False) instead of REML
model3 = smf.mixedlm("td_bins ~ tMreads", data=df, groups=df["Mix_Group"])
modelf3 = model3.fit(reml=False)
print(modelf3.summary())

         Mixed Linear Model Regression Results
=======================================================
Model:            MixedLM Dependent Variable: td_bins  
No. Observations: 96      Method:             REML     
No. Groups:       6       Scale:              27.4833  
Min. group size:  16      Likelihood:         -306.4298
Max. group size:  16      Converged:          Yes      
Mean group size:  16.0                                 
-------------------------------------------------------
             Coef.  Std.Err.   z   P>|z|  [0.025 0.975]
-------------------------------------------------------
Intercept     9.712   12.537 0.775 0.439 -14.860 34.284
tMreads       2.095    0.337 6.222 0.000   1.435  2.755
Group Var   231.903   29.013                           
=======================================================

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                td_bins   R-squared:                       0.077
Model:                            OLS   Adj. R-squared:                  0.068
Method:                 Least Squares   F-statistic:                     7.888
Date:                Sun, 28 Mar 2021   Prob (F-statistic):            0.00605
Time:                        22:25:31   Log-Likelihood:                -392.51
No. Observations:                  96   AIC:                             789.0
Df Residuals:                      94   BIC:                             794.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     39.3912     13.602      2.896      0.005      12.384      66.399
tMreads        1.1758      0.419      2.809      0.006       0.345       2.007
==============================================================================
Omnibus:                       35.897   Durbin-Watson:                   0.385
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                9.857
Skew:                           0.506   Prob(JB):                      0.00724
Kurtosis:                       1.800   Cond. No.                         297.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 r_bins   R-squared:                       0.185
Model:                            OLS   Adj. R-squared:                  0.146
Method:                 Least Squares   F-statistic:                     4.771
Date:                Sun, 28 Mar 2021   Prob (F-statistic):             0.0404
Time:                        22:25:31   Log-Likelihood:                -103.89
No. Observations:                  23   AIC:                             211.8
Df Residuals:                      21   BIC:                             214.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     21.9212     25.577      0.857      0.401     -31.269      75.111
tMreads        1.6644      0.762      2.184      0.040       0.080       3.249
==============================================================================
Omnibus:                        1.524   Durbin-Watson:                   2.039
Prob(Omnibus):                  0.467   Jarque-Bera (JB):                1.220
Skew:                           0.536   Prob(JB):                        0.543
Kurtosis:                       2.652   Cond. No.                         178.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
         Mixed Linear Model Regression Results
=======================================================
Model:            MixedLM Dependent Variable: td_bins  
No. Observations: 96      Method:             ML       
No. Groups:       6       Scale:              27.1879  
Min. group size:  16      Likelihood:         -308.9553
Max. group size:  16      Converged:          Yes      
Mean group size:  16.0                                 
-------------------------------------------------------
             Coef.  Std.Err.   z   P>|z|  [0.025 0.975]
-------------------------------------------------------
Intercept     9.925   12.192 0.814 0.416 -13.970 33.820
tMreads       2.088    0.334 6.251 0.000   1.433  2.743
Group Var   191.725   22.176                           
=======================================================


#MAG counts were correlated with base counts of trimmed and decontaminated reads
#MAG counts were correlated with base counts of raw reads at alpha = 0.05

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols

#data
Mix_Group = ['10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4']
td_read_files = ['10158.6_raw', '10158.6_qc', '10158.6_trim150', '10158.6_ftrim', '10158.6_ktrim', '10158.6_atrim', '10158.6_aqbtrim', '10158.6_aqtrim', '10158.6_qbtrim', '10158.6_qtrim', '10158.6_bb1', '10158.6_bb2', '10158.6_bb3', '10158.6_bb4', '10158.6_bb5', '10158.6_bb6', '9117.8_raw', '9117.8_qc', '9117.8_trim150', '9117.8_ftrim', '9117.8_ktrim', '9117.8_atrim', '9117.8_aqbtrim', '9117.8_aqtrim', '9117.8_qbtrim', '9117.8_qtrim', '9117.8_bb1', '9117.8_bb2', '9117.8_bb3', '9117.8_bb4', '9117.8_bb5', '9117.8_bb6', '9108.2_raw', '9108.2_qc', '9108.2_trim150', '9108.2_ftrim', '9108.2_ktrim', '9108.2_atrim', '9108.2_aqbtrim', '9108.2_aqtrim', '9108.2_qbtrim', '9108.2_qtrim', '9108.2_bb1', '9108.2_bb2', '9108.2_bb3', '9108.2_bb4', '9108.2_bb5', '9108.2_bb6', '9117.7_raw', '9117.7_qc', '9117.7_trim150', '9117.7_ftrimmed', '9117.7_ktrimmed', '9117.7_atrimmed', '9117.7_aqbtrimmed', '9117.7_aqtrimmed', '9117.7_qbtrimmed', '9117.7_qtrimmed', '9117.7_bb1', '9117.7_bb2', '9117.7_bb3', '9117.7_bb4', '9117.7_bb5', '9117.7_bb6', '11306.3_raw', '11306.3_qc', '11306.3_trim150', '11306.3_ftrimmed', '11306.3_ktrimmed', '11306.3_atrimmed', '11306.3_aqbtrimmed', '11306.3_aqtrimmed', '11306.3_qbtrimmed', '11306.3_qtrimmed', '11306.3_bb1', '11306.3_bb2', '11306.3_bb3', '11306.3_bb4', '11306.3_bb5', '11306.3_bb6', '9117.4_raw', '9117.4_qc', '9117.4_trim150', '9117.4_ftrimmed', '9117.4_ktrimmed', '9117.4_atrimmed', '9117.4_aqbtrimmed', '9117.4_aqtrimmed', '9117.4_qbtrimmed', '9117.4_qtrimmed', '9117.4_bb1', '9117.4_bb2', '9117.4_bb3', '9117.4_bb4', '9117.4_bb5', '9117.4_bb6']
td_tMreads = [36.0129894, 35.2337896, 36.0129894, 36.0129894, 35.8983284, 35.933862, 34.552143, 31.2682706, 34.3282984, 30.6449696, 35.8442254, 31.2345964, 30.615651, 35.3058416, 34.4731868, 34.2527702, 28.2058246, 26.5787996, 28.2058246, 28.2058246, 27.66738, 27.6666568, 27.2469874, 25.5238858, 27.2469874, 25.2340502, 26.886397, 24.8035162, 24.519662, 26.7855648, 26.4733418, 26.3827484, 39.9148934, 37.3791976, 39.9148934, 39.9148934, 38.7122998, 38.708094, 37.7905962, 34.6650054, 37.6107554, 34.1456456, 37.8858836, 33.921016, 33.4111558, 37.7734844, 36.9821596, 36.8056446, 33.4711394, 32.6004118, 33.4711394, 33.4711394, 33.04149, 33.0383456, 30.3924568, 32.5038064, 30.0542808, 32.3956334, 32.9696938, 30.3601654, 30.025172, 32.8646394, 32.4398904, 32.3335082, 31.8428354, 31.3942576, 31.8428354, 31.8428354, 31.637986, 31.6358996, 29.2361886, 31.0176556, 28.9929926, 30.9364282, 31.6358006, 29.2361776, 28.9929822, 31.5146638, 31.0175998, 30.9363772, 34.9441596, 32.3312534, 34.9441596, 34.9441596, 33.4723724, 33.4683444, 30.8153706, 32.9387816, 30.4748512, 32.8289102, 32.6738296, 30.0963168, 29.7659662, 32.561628, 32.1535464, 32.0471204]
td_Bbases = [54.379613994, 52.641056249, 54.0194841, 51.498574842, 50.780358538, 53.669962132, 51.549471661, 45.964065176, 48.624647571, 42.845175292, 53.535374086, 45.919226343, 42.80751312, 53.311820816, 51.43795566, 48.522733084, 42.590795146, 40.133987396, 42.3087369, 40.334329178, 39.44260618, 41.667343504, 40.776782835, 37.725547193, 40.776782835, 35.427126679, 40.490156646, 36.660246851, 34.422663647, 40.446202848, 39.621053097, 37.463971015, 60.271489034, 56.442588376, 59.8723401, 57.078297562, 55.202040634, 58.310667004, 56.435790641, 51.019814812, 53.310062558, 47.762707137, 57.069909524, 49.92030233, 46.728770205, 57.037961444, 55.229109813, 52.166971365, 50.541420494, 49.226621818, 50.2067091, 47.863729342, 47.116772654, 49.7656626, 44.974095956, 48.667868199, 42.241274113, 46.022382453, 49.6630349, 44.930324539, 42.203344352, 49.625605494, 48.576089344, 45.937352987, 48.082681454, 47.024019995, 47.7642531, 45.535254622, 45.103175218, 47.640851402, 43.590661125, 46.584185484, 41.025529046, 44.063678425, 47.640703902, 43.590646205, 41.025515524, 47.587142338, 46.584108342, 44.063610873, 52.765680996, 48.820192634, 52.4162394, 49.970148228, 47.71891569, 50.406042552, 45.62052235, 49.326011714, 42.846059312, 46.640424087, 49.207539094, 44.559568777, 41.852323333, 49.16805828, 48.15227162, 45.531321684]
td_bins = [78, 85, 82, 83, 85, 83, 82, 78, 79, 63, 90, 72, 67, 78, 85, 83, 65, 62, 64, 52, 55, 59, 59, 56, 54, 50, 62, 55, 50, 60, 61, 53, 69, 76, 74, 75, 75, 71, 73, 72, 68, 65, 74, 71, 66, 74, 74, 65, 95, 103, 107, 96, 97, 103, 88, 98, 79, 81, 104, 90, 76, 101, 101, 90, 99, 100, 96, 97, 94, 97, 96, 97, 95, 82, 101, 96, 80, 97, 96, 91, 70, 68, 70, 65, 68, 69, 64, 61, 65, 64, 77, 62, 64, 78, 65, 62]
td_Mean_Completeness = [53.86, 53.28, 56.31, 57.37, 54.68, 50.1, 58.05, 54.04, 51.16, 54.76, 51.12, 52.23, 57.52, 52.15, 58.83, 54.29, 55.06, 54.27, 53.37, 53.16, 53.77, 57.81, 54.55, 54.88, 54.17, 52.29, 54.94, 52.04, 51.41, 58.36, 54.69, 56.81, 56.66, 53.72, 57.38, 56.22, 52.7, 52.65, 55.11, 54.88, 57.48, 54.64, 53.6, 54.34, 58.18, 52.11, 56.26, 58.28, 55.64, 57.45, 58.55, 54.54, 57.45, 58.65, 55.18, 55.5, 58.25, 60.02, 55.82, 58.17, 60.34, 56.8, 57.0, 59.14, 54.4, 51.04, 58.28, 56.73, 48.21, 51.68, 52.66, 56.77, 55.24, 57.44, 58.56, 57.33, 54.41, 56.39, 53.78, 58.04, 57.17, 59.53, 56.07, 51.82, 58.95, 60.91, 56.23, 56.59, 60.48, 59.56, 56.87, 61.29, 54.39, 60.71, 53.56, 53.89]
td_Mean_Contamination = [64.8, 56.57, 63.75, 56.14, 63.41, 65.77, 68.71, 59.0, 51.31, 53.39, 53.35, 56.24, 59.82, 71.72, 69.63, 66.78, 53.25, 46.8, 58.27, 54.69, 48.23, 50.47, 57.38, 53.78, 50.02, 45.22, 47.86, 47.19, 54.22, 48.53, 54.71, 53.82, 83.63, 71.19, 88.26, 89.13, 73.4, 65.59, 87.02, 85.4, 79.15, 67.55, 71.37, 72.84, 82.49, 64.29, 92.78, 85.56, 107.95, 99.99, 99.71, 85.91, 99.47, 78.31, 82.71, 107.35, 97.95, 92.98, 87.49, 93.41, 102.6, 77.96, 97.57, 99.91, 83.7, 63.09, 68.23, 69.77, 73.61, 78.21, 70.56, 69.82, 69.56, 64.5, 63.16, 82.06, 72.7, 69.26, 75.71, 62.79, 62.01, 54.02, 69.1, 58.79, 55.27, 51.99, 60.61, 57.98, 59.85, 65.24, 56.41, 58.49, 58.44, 53.79, 57.06, 54.33]
td_good_bins = [21, 19, 19, 14, 15, 20, 20, 17, 17, 12, 19, 16, 12, 22, 20, 17, 16, 16, 17, 17, 15, 17, 19, 15, 15, 14, 16, 16, 15, 18, 19, 16, 18, 17, 17, 16, 16, 16, 17, 17, 14, 15, 17, 15, 15, 19, 15, 17, 22, 21, 18, 19, 21, 22, 17, 21, 21, 16, 23, 16, 17, 24, 20, 21, 19, 21, 21, 15, 18, 19, 20, 18, 15, 17, 19, 18, 18, 18, 19, 16, 17, 17, 19, 18, 18, 17, 19, 17, 19, 18, 18, 16, 18, 21, 20, 17]
td_good_Mean_Completeness = [85.98, 87.17, 86.9, 87.66, 86.64, 86.04, 85.18, 86.79, 86.35, 86.03, 88.47, 85.58, 89.46, 86.17, 83.85, 86.86, 87.87, 87.38, 87.61, 86.94, 87.35, 86.96, 88.23, 88.04, 88.62, 90.16, 88.26, 89.11, 86.1, 88.26, 87.55, 87.21, 87.51, 86.6, 87.87, 87.83, 86.62, 87.87, 87.12, 87.7, 87.06, 87.94, 87.37, 85.69, 87.28, 85.92, 88.11, 86.67, 87.68, 87.33, 87.89, 88.48, 89.21, 88.4, 86.1, 86.69, 87.99, 88.53, 89.18, 87.33, 86.83, 88.58, 87.12, 87.34, 88.54, 86.69, 87.03, 86.06, 88.99, 86.81, 86.17, 86.12, 87.78, 85.64, 86.41, 87.08, 85.98, 88.56, 87.42, 87.18, 85.99, 87.07, 86.97, 86.86, 86.75, 89.43, 86.51, 86.19, 86.17, 85.87, 86.68, 87.63, 86.26, 89.38, 87.22, 87.03]
td_good_Mean_Contamination = [5.06, 4.78, 4.28, 4.58, 3.83, 4.25, 4.66, 4.68, 4.57, 4.01, 4.01, 4.34, 4.43, 4.96, 4.23, 5.17, 3.79, 3.63, 3.76, 3.61, 3.99, 3.66, 4.04, 3.17, 3.67, 3.37, 4.3, 3.39, 3.89, 3.63, 4.05, 4.51, 3.87, 3.93, 3.27, 3.48, 4.16, 4.88, 3.89, 3.16, 4.65, 4.07, 4.41, 4.22, 4.07, 3.78, 3.85, 4.16, 3.92, 3.06, 4.2, 4.41, 3.21, 3.81, 4.31, 3.64, 3.84, 4.09, 3.84, 3.64, 3.48, 3.71, 3.4, 4.06, 3.76, 4.32, 4.61, 4.74, 3.56, 4.54, 4.1, 4.05, 4.26, 4.31, 4.84, 4.06, 3.88, 4.59, 4.63, 3.76, 3.57, 3.88, 3.56, 3.5, 3.68, 4.71, 3.46, 3.84, 3.94, 3.61, 3.5, 3.36, 3.4, 3.57, 3.91, 4.21]
r_read_files = ['9117.5_raw', '10158.8_raw', '11263.1_raw', '11306.3_raw', '11306.1_raw', '11260.6_raw', '11260.5_raw', '9108.1_raw', '9053.2_raw', '9672.8_raw', '9108.2_raw', '9053.4_raw', '9053.3_raw', '9117.4_raw', '9117.6_raw', '9117.7_raw', '9117.8_raw', '10158.6_raw', '10186.3_raw', '10186.4_raw', '7331.1_raw', '9053.5_raw', '9041.8_raw']
r_tMreads = [36.0129894, 17.6218972, 38.2800142, 34.9076424, 35.3037194, 37.1504476, 40.3613864, 20.7773948, 31.8428354, 30.1166938, 27.718318, 40.8492618, 39.7169858, 34.4581152, 26.9696492, 21.3309852, 39.9148934, 34.9441596, 35.690255, 35.5019026, 33.4711394, 28.2058246, 36.96984]
r_Bbases = [54.379613994, 26.609064772, 57.802821442, 52.710540024, 53.308616294, 56.097175876, 60.945693464, 31.373866148, 48.082681454, 45.1750407, 41.85466018, 61.682385318, 59.972648558, 52.031753952, 40.724170292, 32.209787652, 60.271489034, 52.765680996, 53.89228505, 53.607872926, 50.541420494, 42.590795146, 55.8244584]
r_bins = [65, 47, 139, 99, 55, 90, 115, 38, 86, 87, 69, 71, 95, 70, 62, 95, 65, 78, 85, 109, 45, 49, 52]
r_Mean_Completeness = [45.96, 58.83, 51.28, 56.54, 56.55, 63.26, 52.23, 58.47, 54.69, 53.7, 58.18, 60.32, 62.41, 65.52, 52.14, 50.97, 56.26, 57.0, 65.39, 54.89, 53.78, 53.56, 52.81]
r_Mean_Contamination = [24.03, 69.63, 67.13, 60.4, 53.21, 81.43, 55.06, 35.14, 54.71, 74.62, 54.7, 76.81, 66.86, 78.94, 67.85, 46.5, 92.78, 97.57, 121.14, 103.47, 75.71, 57.06, 65.71]
r_good_bins = [15, 4, 23, 19, 14, 25, 29, 9, 20, 20, 18, 18, 23, 17, 10, 22, 16, 21, 18, 20, 9, 9, 14]
r_good_Mean_Completeness = [84.79, 83.85, 89.19, 87.14, 90.06, 87.41, 90.7, 86.6, 87.55, 85.4, 87.31, 86.68, 86.18, 84.94, 87.02, 88.92, 88.11, 87.12, 87.24, 89.3, 87.42, 87.22, 86.64]
r_good_Mean_Contamination = [3.37, 4.23, 3.57, 4.4, 3.62, 4.4, 4.56, 3.0, 4.05, 4.32, 4.53, 3.9, 4.48, 4.17, 2.86, 2.87, 3.85, 3.4, 3.63, 4.37, 4.63, 3.91, 4.25]

#create dataset
df = pd.DataFrame({'td_tMreads': td_tMreads,
                   'td_Bbases': td_Bbases,
                   'td_bins': td_bins,
                   'td_Mean_Completeness': td_Mean_Completeness,
                   'td_Mean_Contamination': td_Mean_Contamination,
                   'td_good_bins': td_good_bins,
                   'td_good_Mean_Completeness': td_good_Mean_Completeness,
                   'td_good_Mean_Contamination': td_good_Mean_Contamination,
                   'Mix_Group': Mix_Group})
df.rename(columns={'td_tMreads': 'tMreads', 'td_Bbases': 'Bbases'}, inplace=True)

df2 = pd.DataFrame({'r_tMreads': r_tMreads,
                   'r_Bbases': r_Bbases,
                   'r_bins': r_bins,
                   'r_Mean_Completeness': r_Mean_Completeness,
                   'r_Mean_Contamination': r_Mean_Contamination,
                   'r_good_bins': r_good_bins,
                   'r_good_Mean_Completeness': r_good_Mean_Completeness,
                   'r_good_Mean_Contamination': r_good_Mean_Contamination})
df2.rename(columns={'r_tMreads': 'tMreads', 'r_Bbases': 'Bbases'}, inplace=True)

#view dataset
#print(df)

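#fit regression models: a mixed model with a random intercept per source readset (Mix_Group),
#plus ordinary least squares fits for the trimmed/decontaminated (df) and raw (df2) datasets
model = smf.mixedlm("td_bins ~ Bbases", data=df, groups=df["Mix_Group"])
modelf = model.fit()
model1 = ols('td_bins ~ Bbases', data=df).fit()
model2 = ols('r_bins ~ Bbases', data=df2).fit()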

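#adjusted R^2 is R^2 penalized for the number of predictors; for the raw-read OLS model (model2) it is 0.146
#its square root serves as an adjusted analogue of Pearson's product-moment correlation coefficient (r)
#adjusted Pearson's r = sqrt(0.146) = 0.382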

#mdf = md.fit()
#print(mdf.summary())

#view model summary
print(modelf.summary())
print(model1.summary())
print(model2.summary())

#define figure size
fig = plt.figure(figsize=(12,8))
fig2 = plt.figure(figsize=(12,8))

#produce regression plots (each model is drawn on its own figure)
fig = sm.graphics.plot_regress_exog(model1, 'Bbases', fig=fig)
fig2 = sm.graphics.plot_regress_exog(model2, 'Bbases', fig=fig2)

         Mixed Linear Model Regression Results
=======================================================
Model:            MixedLM Dependent Variable: td_bins  
No. Observations: 96      Method:             REML     
No. Groups:       6       Scale:              21.6539  
Min. group size:  16      Likelihood:         -296.4177
Max. group size:  16      Converged:          Yes      
Mean group size:  16.0                                 
-------------------------------------------------------
              Coef.  Std.Err.   z   P>|z| [0.025 0.975]
-------------------------------------------------------
Intercept     14.715    9.563 1.539 0.124 -4.028 33.458
Bbases         1.320    0.154 8.565 0.000  1.018  1.622
Group Var    226.343   31.822                          
=======================================================

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                td_bins   R-squared:                       0.101
Model:                            OLS   Adj. R-squared:                  0.092
Method:                 Least Squares   F-statistic:                     10.61
Date:                Sun, 28 Mar 2021   Prob (F-statistic):            0.00157
Time:                        22:25:41   Log-Likelihood:                -391.24
No. Observations:                  96   AIC:                             786.5
Df Residuals:                      94   BIC:                             791.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     36.3272     12.685      2.864      0.005      11.142      61.513
Bbases         0.8649      0.266      3.257      0.002       0.338       1.392
==============================================================================
Omnibus:                       46.989   Durbin-Watson:                   0.336
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               10.364
Skew:                           0.498   Prob(JB):                      0.00562
Kurtosis:                       1.736   Cond. No.                         413.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 r_bins   R-squared:                       0.184
Model:                            OLS   Adj. R-squared:                  0.146
Method:                 Least Squares   F-statistic:                     4.749
Date:                Sun, 28 Mar 2021   Prob (F-statistic):             0.0409
Time:                        22:25:41   Log-Likelihood:                -103.90
No. Observations:                  23   AIC:                             211.8
Df Residuals:                      21   BIC:                             214.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     22.0775     25.566      0.864      0.398     -31.090      75.245
Bbases         1.0994      0.505      2.179      0.041       0.050       2.149
==============================================================================
Omnibus:                        1.517   Durbin-Watson:                   2.039
Prob(Omnibus):                  0.468   Jarque-Bera (JB):                1.218
Skew:                           0.535   Prob(JB):                        0.544
Kurtosis:                       2.647   Cond. No.                         268.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

<Figure size 864x576 with 0 Axes>
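
The note in the cell above relating adjusted R-squared to Pearson's r can be checked directly from the raw-read lists. The following is a minimal sketch, not part of the original narrative: it assumes only NumPy, copies the `r_Bbases` values (rounded to three decimals) and the `r_bins` values, and uses a hypothetical helper named `adjusted_r2`. For an OLS fit with a single predictor, R-squared equals the square of Pearson's r, so the printed values should land close to the 0.184 and 0.146 reported in the raw-read summary.

```python
import numpy as np

# Raw-read values copied from r_Bbases (billions of bases, rounded) and r_bins
r_Bbases = [54.380, 26.609, 57.803, 52.711, 53.309, 56.097, 60.946, 31.374,
            48.083, 45.175, 41.855, 61.682, 59.973, 52.032, 40.724, 32.210,
            60.271, 52.766, 53.892, 53.608, 50.541, 42.591, 55.824]
r_bins = [65, 47, 139, 99, 55, 90, 115, 38, 86, 87, 69, 71, 95, 70, 62, 95,
          65, 78, 85, 109, 45, 49, 52]


def adjusted_r2(r2, n_obs, n_predictors):
    """Hypothetical helper: penalize R^2 for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Pearson's r between bases and bin counts; with one predictor, R^2 = r^2
pearson_r = np.corrcoef(r_Bbases, r_bins)[0, 1]
r2 = pearson_r ** 2
adj_r2 = adjusted_r2(r2, n_obs=len(r_bins), n_predictors=1)

print(f"Pearson's r          = {pearson_r:.3f}")
print(f"R-squared            = {r2:.3f}")
print(f"Adjusted R-squared   = {adj_r2:.3f}")
print(f"sqrt(adj. R-squared) = {np.sqrt(adj_r2):.3f}")
```

The unadjusted Pearson's r is somewhat larger than the square root of the adjusted R-squared, because the adjustment shrinks R-squared in proportion to the small sample size (23 raw readsets).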

#Average MAG completeness was not correlated with read counts of trimmed and decontaminated reads
#Average MAG completeness was not correlated with read counts of raw reads at alpha = 0.05

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols

#data
Mix_Group = ['10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4']
td_read_files = ['10158.6_raw', '10158.6_qc', '10158.6_trim150', '10158.6_ftrim', '10158.6_ktrim', '10158.6_atrim', '10158.6_aqbtrim', '10158.6_aqtrim', '10158.6_qbtrim', '10158.6_qtrim', '10158.6_bb1', '10158.6_bb2', '10158.6_bb3', '10158.6_bb4', '10158.6_bb5', '10158.6_bb6', '9117.8_raw', '9117.8_qc', '9117.8_trim150', '9117.8_ftrim', '9117.8_ktrim', '9117.8_atrim', '9117.8_aqbtrim', '9117.8_aqtrim', '9117.8_qbtrim', '9117.8_qtrim', '9117.8_bb1', '9117.8_bb2', '9117.8_bb3', '9117.8_bb4', '9117.8_bb5', '9117.8_bb6', '9108.2_raw', '9108.2_qc', '9108.2_trim150', '9108.2_ftrim', '9108.2_ktrim', '9108.2_atrim', '9108.2_aqbtrim', '9108.2_aqtrim', '9108.2_qbtrim', '9108.2_qtrim', '9108.2_bb1', '9108.2_bb2', '9108.2_bb3', '9108.2_bb4', '9108.2_bb5', '9108.2_bb6', '9117.7_raw', '9117.7_qc', '9117.7_trim150', '9117.7_ftrimmed', '9117.7_ktrimmed', '9117.7_atrimmed', '9117.7_aqbtrimmed', '9117.7_aqtrimmed', '9117.7_qbtrimmed', '9117.7_qtrimmed', '9117.7_bb1', '9117.7_bb2', '9117.7_bb3', '9117.7_bb4', '9117.7_bb5', '9117.7_bb6', '11306.3_raw', '11306.3_qc', '11306.3_trim150', '11306.3_ftrimmed', '11306.3_ktrimmed', '11306.3_atrimmed', '11306.3_aqbtrimmed', '11306.3_aqtrimmed', '11306.3_qbtrimmed', '11306.3_qtrimmed', '11306.3_bb1', '11306.3_bb2', '11306.3_bb3', '11306.3_bb4', '11306.3_bb5', '11306.3_bb6', '9117.4_raw', '9117.4_qc', '9117.4_trim150', '9117.4_ftrimmed', '9117.4_ktrimmed', '9117.4_atrimmed', '9117.4_aqbtrimmed', '9117.4_aqtrimmed', '9117.4_qbtrimmed', '9117.4_qtrimmed', '9117.4_bb1', '9117.4_bb2', '9117.4_bb3', '9117.4_bb4', '9117.4_bb5', '9117.4_bb6']
td_tMreads = [36.0129894, 35.2337896, 36.0129894, 36.0129894, 35.8983284, 35.933862, 34.552143, 31.2682706, 34.3282984, 30.6449696, 35.8442254, 31.2345964, 30.615651, 35.3058416, 34.4731868, 34.2527702, 28.2058246, 26.5787996, 28.2058246, 28.2058246, 27.66738, 27.6666568, 27.2469874, 25.5238858, 27.2469874, 25.2340502, 26.886397, 24.8035162, 24.519662, 26.7855648, 26.4733418, 26.3827484, 39.9148934, 37.3791976, 39.9148934, 39.9148934, 38.7122998, 38.708094, 37.7905962, 34.6650054, 37.6107554, 34.1456456, 37.8858836, 33.921016, 33.4111558, 37.7734844, 36.9821596, 36.8056446, 33.4711394, 32.6004118, 33.4711394, 33.4711394, 33.04149, 33.0383456, 30.3924568, 32.5038064, 30.0542808, 32.3956334, 32.9696938, 30.3601654, 30.025172, 32.8646394, 32.4398904, 32.3335082, 31.8428354, 31.3942576, 31.8428354, 31.8428354, 31.637986, 31.6358996, 29.2361886, 31.0176556, 28.9929926, 30.9364282, 31.6358006, 29.2361776, 28.9929822, 31.5146638, 31.0175998, 30.9363772, 34.9441596, 32.3312534, 34.9441596, 34.9441596, 33.4723724, 33.4683444, 30.8153706, 32.9387816, 30.4748512, 32.8289102, 32.6738296, 30.0963168, 29.7659662, 32.561628, 32.1535464, 32.0471204]
td_Bbases = [54.379613994, 52.641056249, 54.0194841, 51.498574842, 50.780358538, 53.669962132, 51.549471661, 45.964065176, 48.624647571, 42.845175292, 53.535374086, 45.919226343, 42.80751312, 53.311820816, 51.43795566, 48.522733084, 42.590795146, 40.133987396, 42.3087369, 40.334329178, 39.44260618, 41.667343504, 40.776782835, 37.725547193, 40.776782835, 35.427126679, 40.490156646, 36.660246851, 34.422663647, 40.446202848, 39.621053097, 37.463971015, 60.271489034, 56.442588376, 59.8723401, 57.078297562, 55.202040634, 58.310667004, 56.435790641, 51.019814812, 53.310062558, 47.762707137, 57.069909524, 49.92030233, 46.728770205, 57.037961444, 55.229109813, 52.166971365, 50.541420494, 49.226621818, 50.2067091, 47.863729342, 47.116772654, 49.7656626, 44.974095956, 48.667868199, 42.241274113, 46.022382453, 49.6630349, 44.930324539, 42.203344352, 49.625605494, 48.576089344, 45.937352987, 48.082681454, 47.024019995, 47.7642531, 45.535254622, 45.103175218, 47.640851402, 43.590661125, 46.584185484, 41.025529046, 44.063678425, 47.640703902, 43.590646205, 41.025515524, 47.587142338, 46.584108342, 44.063610873, 52.765680996, 48.820192634, 52.4162394, 49.970148228, 47.71891569, 50.406042552, 45.62052235, 49.326011714, 42.846059312, 46.640424087, 49.207539094, 44.559568777, 41.852323333, 49.16805828, 48.15227162, 45.531321684]
td_bins = [78, 85, 82, 83, 85, 83, 82, 78, 79, 63, 90, 72, 67, 78, 85, 83, 65, 62, 64, 52, 55, 59, 59, 56, 54, 50, 62, 55, 50, 60, 61, 53, 69, 76, 74, 75, 75, 71, 73, 72, 68, 65, 74, 71, 66, 74, 74, 65, 95, 103, 107, 96, 97, 103, 88, 98, 79, 81, 104, 90, 76, 101, 101, 90, 99, 100, 96, 97, 94, 97, 96, 97, 95, 82, 101, 96, 80, 97, 96, 91, 70, 68, 70, 65, 68, 69, 64, 61, 65, 64, 77, 62, 64, 78, 65, 62]
td_Mean_Completeness = [53.86, 53.28, 56.31, 57.37, 54.68, 50.1, 58.05, 54.04, 51.16, 54.76, 51.12, 52.23, 57.52, 52.15, 58.83, 54.29, 55.06, 54.27, 53.37, 53.16, 53.77, 57.81, 54.55, 54.88, 54.17, 52.29, 54.94, 52.04, 51.41, 58.36, 54.69, 56.81, 56.66, 53.72, 57.38, 56.22, 52.7, 52.65, 55.11, 54.88, 57.48, 54.64, 53.6, 54.34, 58.18, 52.11, 56.26, 58.28, 55.64, 57.45, 58.55, 54.54, 57.45, 58.65, 55.18, 55.5, 58.25, 60.02, 55.82, 58.17, 60.34, 56.8, 57.0, 59.14, 54.4, 51.04, 58.28, 56.73, 48.21, 51.68, 52.66, 56.77, 55.24, 57.44, 58.56, 57.33, 54.41, 56.39, 53.78, 58.04, 57.17, 59.53, 56.07, 51.82, 58.95, 60.91, 56.23, 56.59, 60.48, 59.56, 56.87, 61.29, 54.39, 60.71, 53.56, 53.89]
td_Mean_Contamination = [64.8, 56.57, 63.75, 56.14, 63.41, 65.77, 68.71, 59.0, 51.31, 53.39, 53.35, 56.24, 59.82, 71.72, 69.63, 66.78, 53.25, 46.8, 58.27, 54.69, 48.23, 50.47, 57.38, 53.78, 50.02, 45.22, 47.86, 47.19, 54.22, 48.53, 54.71, 53.82, 83.63, 71.19, 88.26, 89.13, 73.4, 65.59, 87.02, 85.4, 79.15, 67.55, 71.37, 72.84, 82.49, 64.29, 92.78, 85.56, 107.95, 99.99, 99.71, 85.91, 99.47, 78.31, 82.71, 107.35, 97.95, 92.98, 87.49, 93.41, 102.6, 77.96, 97.57, 99.91, 83.7, 63.09, 68.23, 69.77, 73.61, 78.21, 70.56, 69.82, 69.56, 64.5, 63.16, 82.06, 72.7, 69.26, 75.71, 62.79, 62.01, 54.02, 69.1, 58.79, 55.27, 51.99, 60.61, 57.98, 59.85, 65.24, 56.41, 58.49, 58.44, 53.79, 57.06, 54.33]
td_good_bins = [21, 19, 19, 14, 15, 20, 20, 17, 17, 12, 19, 16, 12, 22, 20, 17, 16, 16, 17, 17, 15, 17, 19, 15, 15, 14, 16, 16, 15, 18, 19, 16, 18, 17, 17, 16, 16, 16, 17, 17, 14, 15, 17, 15, 15, 19, 15, 17, 22, 21, 18, 19, 21, 22, 17, 21, 21, 16, 23, 16, 17, 24, 20, 21, 19, 21, 21, 15, 18, 19, 20, 18, 15, 17, 19, 18, 18, 18, 19, 16, 17, 17, 19, 18, 18, 17, 19, 17, 19, 18, 18, 16, 18, 21, 20, 17]
td_good_Mean_Completeness = [85.98, 87.17, 86.9, 87.66, 86.64, 86.04, 85.18, 86.79, 86.35, 86.03, 88.47, 85.58, 89.46, 86.17, 83.85, 86.86, 87.87, 87.38, 87.61, 86.94, 87.35, 86.96, 88.23, 88.04, 88.62, 90.16, 88.26, 89.11, 86.1, 88.26, 87.55, 87.21, 87.51, 86.6, 87.87, 87.83, 86.62, 87.87, 87.12, 87.7, 87.06, 87.94, 87.37, 85.69, 87.28, 85.92, 88.11, 86.67, 87.68, 87.33, 87.89, 88.48, 89.21, 88.4, 86.1, 86.69, 87.99, 88.53, 89.18, 87.33, 86.83, 88.58, 87.12, 87.34, 88.54, 86.69, 87.03, 86.06, 88.99, 86.81, 86.17, 86.12, 87.78, 85.64, 86.41, 87.08, 85.98, 88.56, 87.42, 87.18, 85.99, 87.07, 86.97, 86.86, 86.75, 89.43, 86.51, 86.19, 86.17, 85.87, 86.68, 87.63, 86.26, 89.38, 87.22, 87.03]
td_good_Mean_Contamination = [5.06, 4.78, 4.28, 4.58, 3.83, 4.25, 4.66, 4.68, 4.57, 4.01, 4.01, 4.34, 4.43, 4.96, 4.23, 5.17, 3.79, 3.63, 3.76, 3.61, 3.99, 3.66, 4.04, 3.17, 3.67, 3.37, 4.3, 3.39, 3.89, 3.63, 4.05, 4.51, 3.87, 3.93, 3.27, 3.48, 4.16, 4.88, 3.89, 3.16, 4.65, 4.07, 4.41, 4.22, 4.07, 3.78, 3.85, 4.16, 3.92, 3.06, 4.2, 4.41, 3.21, 3.81, 4.31, 3.64, 3.84, 4.09, 3.84, 3.64, 3.48, 3.71, 3.4, 4.06, 3.76, 4.32, 4.61, 4.74, 3.56, 4.54, 4.1, 4.05, 4.26, 4.31, 4.84, 4.06, 3.88, 4.59, 4.63, 3.76, 3.57, 3.88, 3.56, 3.5, 3.68, 4.71, 3.46, 3.84, 3.94, 3.61, 3.5, 3.36, 3.4, 3.57, 3.91, 4.21]
r_read_files = ['9117.5_raw', '10158.8_raw', '11263.1_raw', '11306.3_raw', '11306.1_raw', '11260.6_raw', '11260.5_raw', '9108.1_raw', '9053.2_raw', '9672.8_raw', '9108.2_raw', '9053.4_raw', '9053.3_raw', '9117.4_raw', '9117.6_raw', '9117.7_raw', '9117.8_raw', '10158.6_raw', '10186.3_raw', '10186.4_raw', '7331.1_raw', '9053.5_raw', '9041.8_raw']
r_tMreads = [36.0129894, 17.6218972, 38.2800142, 34.9076424, 35.3037194, 37.1504476, 40.3613864, 20.7773948, 31.8428354, 30.1166938, 27.718318, 40.8492618, 39.7169858, 34.4581152, 26.9696492, 21.3309852, 39.9148934, 34.9441596, 35.690255, 35.5019026, 33.4711394, 28.2058246, 36.96984]
r_Bbases = [54.379613994, 26.609064772, 57.802821442, 52.710540024, 53.308616294, 56.097175876, 60.945693464, 31.373866148, 48.082681454, 45.1750407, 41.85466018, 61.682385318, 59.972648558, 52.031753952, 40.724170292, 32.209787652, 60.271489034, 52.765680996, 53.89228505, 53.607872926, 50.541420494, 42.590795146, 55.8244584]
r_bins = [65, 47, 139, 99, 55, 90, 115, 38, 86, 87, 69, 71, 95, 70, 62, 95, 65, 78, 85, 109, 45, 49, 52]
r_Mean_Completeness = [45.96, 58.83, 51.28, 56.54, 56.55, 63.26, 52.23, 58.47, 54.69, 53.7, 58.18, 60.32, 62.41, 65.52, 52.14, 50.97, 56.26, 57.0, 65.39, 54.89, 53.78, 53.56, 52.81]
r_Mean_Contamination = [24.03, 69.63, 67.13, 60.4, 53.21, 81.43, 55.06, 35.14, 54.71, 74.62, 54.7, 76.81, 66.86, 78.94, 67.85, 46.5, 92.78, 97.57, 121.14, 103.47, 75.71, 57.06, 65.71]
r_good_bins = [15, 4, 23, 19, 14, 25, 29, 9, 20, 20, 18, 18, 23, 17, 10, 22, 16, 21, 18, 20, 9, 9, 14]
r_good_Mean_Completeness = [84.79, 83.85, 89.19, 87.14, 90.06, 87.41, 90.7, 86.6, 87.55, 85.4, 87.31, 86.68, 86.18, 84.94, 87.02, 88.92, 88.11, 87.12, 87.24, 89.3, 87.42, 87.22, 86.64]
r_good_Mean_Contamination = [3.37, 4.23, 3.57, 4.4, 3.62, 4.4, 4.56, 3.0, 4.05, 4.32, 4.53, 3.9, 4.48, 4.17, 2.86, 2.87, 3.85, 3.4, 3.63, 4.37, 4.63, 3.91, 4.25]

#create dataset
df = pd.DataFrame({'td_tMreads': td_tMreads,
                   'td_Bbases': td_Bbases,
                   'td_bins': td_bins,
                   'td_Mean_Completeness': td_Mean_Completeness,
                   'td_Mean_Contamination': td_Mean_Contamination,
                   'td_good_bins': td_good_bins,
                   'td_good_Mean_Completeness': td_good_Mean_Completeness,
                   'td_good_Mean_Contamination': td_good_Mean_Contamination,
                   'Mix_Group': Mix_Group})
df.rename(columns={'td_tMreads': 'tMreads', 'td_Bbases': 'Bbases'}, inplace=True)

df2 = pd.DataFrame({'r_tMreads': r_tMreads,
                   'r_Bbases': r_Bbases,
                   'r_bins': r_bins,
                   'r_Mean_Completeness': r_Mean_Completeness,
                   'r_Mean_Contamination': r_Mean_Contamination,
                   'r_good_bins': r_good_bins,
                   'r_good_Mean_Completeness': r_good_Mean_Completeness,
                   'r_good_Mean_Contamination': r_good_Mean_Contamination})
df2.rename(columns={'r_tMreads': 'tMreads', 'r_Bbases': 'Bbases'}, inplace=True)

#view dataset
#print(df)

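#fit regression models: a mixed model with a random intercept per source readset (Mix_Group),
#plus ordinary least squares fits for the trimmed/decontaminated (df) and raw (df2) datasets
model = smf.mixedlm("td_Mean_Completeness ~ tMreads", data=df, groups=df["Mix_Group"])
modelf = model.fit()
model1 = ols('td_Mean_Completeness ~ tMreads', data=df).fit()
model2 = ols('r_Mean_Completeness ~ tMreads', data=df2).fit()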

#mdf = md.fit()
#print(mdf.summary())

#view model summary
print(modelf.summary())
print(model1.summary())
print(model2.summary())

#define figure size
fig = plt.figure(figsize=(12,8))
fig2 = plt.figure(figsize=(12,8))

#produce regression plots (each model is drawn on its own figure)
fig = sm.graphics.plot_regress_exog(model1, 'tMreads', fig=fig)
fig2 = sm.graphics.plot_regress_exog(model2, 'tMreads', fig=fig2)

              Mixed Linear Model Regression Results
==================================================================
Model:            MixedLM Dependent Variable: td_Mean_Completeness
No. Observations: 96      Method:             REML                
No. Groups:       6       Scale:              5.9304              
Min. group size:  16      Likelihood:         -226.7723           
Max. group size:  16      Converged:          Yes                 
Mean group size:  16.0                                            
--------------------------------------------------------------------
                Coef.    Std.Err.     z      P>|z|   [0.025   0.975]
--------------------------------------------------------------------
Intercept       57.240      4.077   14.039   0.000   49.249   65.232
tMreads         -0.049      0.125   -0.393   0.694   -0.294    0.196
Group Var        1.826      0.610                                   
==================================================================

                             OLS Regression Results                             
================================================================================
Dep. Variable:     td_Mean_Completeness   R-squared:                       0.001
Model:                              OLS   Adj. R-squared:                 -0.010
Method:                   Least Squares   F-statistic:                   0.06334
Date:                  Sun, 28 Mar 2021   Prob (F-statistic):              0.802
Time:                          22:28:24   Log-Likelihood:                -230.60
No. Observations:                    96   AIC:                             465.2
Df Residuals:                        94   BIC:                             470.3
Df Model:                             1                                         
Covariance Type:              nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     55.0264      2.518     21.850      0.000      50.026      60.027
tMreads        0.0195      0.078      0.252      0.802      -0.134       0.173
==============================================================================
Omnibus:                        0.800   Durbin-Watson:                   1.696
Prob(Omnibus):                  0.670   Jarque-Bera (JB):                0.887
Skew:                          -0.119   Prob(JB):                        0.642
Kurtosis:                       2.593   Cond. No.                         297.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                             OLS Regression Results                            
===============================================================================
Dep. Variable:     r_Mean_Completeness   R-squared:                       0.010
Model:                             OLS   Adj. R-squared:                 -0.037
Method:                  Least Squares   F-statistic:                    0.2155
Date:                 Sun, 28 Mar 2021   Prob (F-statistic):              0.647
Time:                         22:28:24   Log-Likelihood:                -68.304
No. Observations:                   23   AIC:                             140.6
Df Residuals:                       21   BIC:                             142.9
Df Model:                            1                                         
Covariance Type:             nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     53.8117      5.443      9.887      0.000      42.493      65.131
tMreads        0.0753      0.162      0.464      0.647      -0.262       0.412
==============================================================================
Omnibus:                        0.192   Durbin-Watson:                   1.901
Prob(Omnibus):                  0.908   Jarque-Bera (JB):                0.159
Skew:                           0.159   Prob(JB):                        0.923
Kurtosis:                       2.745   Cond. No.                         178.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

<Figure size 864x576 with 0 Axes>
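
The coefficient rows in the mixed-model summary above can be read with a little arithmetic. As a hedged sketch (standard library only; the function name `wald_summary` is hypothetical and not part of statsmodels), the z statistic is the coefficient divided by its standard error, the two-sided P value comes from the standard normal distribution, and the 95% interval is the coefficient plus or minus about 1.96 standard errors. Because the inputs are the rounded values printed in the table, the results agree with the summary only to about two or three decimals.

```python
import math


def wald_summary(coef, std_err, z_crit=1.959964):
    """Hypothetical helper: Wald z test and 95% interval for one coefficient."""
    z = coef / std_err
    # Two-sided P value under a standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    ci_low = coef - z_crit * std_err
    ci_high = coef + z_crit * std_err
    return z, p_value, ci_low, ci_high


# Rounded values taken from the tMreads row of the MixedLM table
z, p, lo, hi = wald_summary(coef=-0.049, std_err=0.125)
print(f"z = {z:.3f}, P = {p:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")

# The interval spans zero, consistent with no detectable association
print("interval contains zero:", lo < 0.0 < hi)
```

The same helper can be applied to any row of the OLS tables as an approximation, although those tables report t statistics, so their P values and intervals use the t distribution rather than the normal.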

#Average MAG completeness was not correlated with base counts of trimmed and decontaminated reads
#Average MAG completeness was not correlated with base counts of raw reads at alpha = 0.05

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols

#data
Mix_Group = ['10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4']
td_read_files = ['10158.6_raw', '10158.6_qc', '10158.6_trim150', '10158.6_ftrim', '10158.6_ktrim', '10158.6_atrim', '10158.6_aqbtrim', '10158.6_aqtrim', '10158.6_qbtrim', '10158.6_qtrim', '10158.6_bb1', '10158.6_bb2', '10158.6_bb3', '10158.6_bb4', '10158.6_bb5', '10158.6_bb6', '9117.8_raw', '9117.8_qc', '9117.8_trim150', '9117.8_ftrim', '9117.8_ktrim', '9117.8_atrim', '9117.8_aqbtrim', '9117.8_aqtrim', '9117.8_qbtrim', '9117.8_qtrim', '9117.8_bb1', '9117.8_bb2', '9117.8_bb3', '9117.8_bb4', '9117.8_bb5', '9117.8_bb6', '9108.2_raw', '9108.2_qc', '9108.2_trim150', '9108.2_ftrim', '9108.2_ktrim', '9108.2_atrim', '9108.2_aqbtrim', '9108.2_aqtrim', '9108.2_qbtrim', '9108.2_qtrim', '9108.2_bb1', '9108.2_bb2', '9108.2_bb3', '9108.2_bb4', '9108.2_bb5', '9108.2_bb6', '9117.7_raw', '9117.7_qc', '9117.7_trim150', '9117.7_ftrimmed', '9117.7_ktrimmed', '9117.7_atrimmed', '9117.7_aqbtrimmed', '9117.7_aqtrimmed', '9117.7_qbtrimmed', '9117.7_qtrimmed', '9117.7_bb1', '9117.7_bb2', '9117.7_bb3', '9117.7_bb4', '9117.7_bb5', '9117.7_bb6', '11306.3_raw', '11306.3_qc', '11306.3_trim150', '11306.3_ftrimmed', '11306.3_ktrimmed', '11306.3_atrimmed', '11306.3_aqbtrimmed', '11306.3_aqtrimmed', '11306.3_qbtrimmed', '11306.3_qtrimmed', '11306.3_bb1', '11306.3_bb2', '11306.3_bb3', '11306.3_bb4', '11306.3_bb5', '11306.3_bb6', '9117.4_raw', '9117.4_qc', '9117.4_trim150', '9117.4_ftrimmed', '9117.4_ktrimmed', '9117.4_atrimmed', '9117.4_aqbtrimmed', '9117.4_aqtrimmed', '9117.4_qbtrimmed', '9117.4_qtrimmed', '9117.4_bb1', '9117.4_bb2', '9117.4_bb3', '9117.4_bb4', '9117.4_bb5', '9117.4_bb6']
td_tMreads = [36.0129894, 35.2337896, 36.0129894, 36.0129894, 35.8983284, 35.933862, 34.552143, 31.2682706, 34.3282984, 30.6449696, 35.8442254, 31.2345964, 30.615651, 35.3058416, 34.4731868, 34.2527702, 28.2058246, 26.5787996, 28.2058246, 28.2058246, 27.66738, 27.6666568, 27.2469874, 25.5238858, 27.2469874, 25.2340502, 26.886397, 24.8035162, 24.519662, 26.7855648, 26.4733418, 26.3827484, 39.9148934, 37.3791976, 39.9148934, 39.9148934, 38.7122998, 38.708094, 37.7905962, 34.6650054, 37.6107554, 34.1456456, 37.8858836, 33.921016, 33.4111558, 37.7734844, 36.9821596, 36.8056446, 33.4711394, 32.6004118, 33.4711394, 33.4711394, 33.04149, 33.0383456, 30.3924568, 32.5038064, 30.0542808, 32.3956334, 32.9696938, 30.3601654, 30.025172, 32.8646394, 32.4398904, 32.3335082, 31.8428354, 31.3942576, 31.8428354, 31.8428354, 31.637986, 31.6358996, 29.2361886, 31.0176556, 28.9929926, 30.9364282, 31.6358006, 29.2361776, 28.9929822, 31.5146638, 31.0175998, 30.9363772, 34.9441596, 32.3312534, 34.9441596, 34.9441596, 33.4723724, 33.4683444, 30.8153706, 32.9387816, 30.4748512, 32.8289102, 32.6738296, 30.0963168, 29.7659662, 32.561628, 32.1535464, 32.0471204]
td_Bbases = [54.379613994, 52.641056249, 54.0194841, 51.498574842, 50.780358538, 53.669962132, 51.549471661, 45.964065176, 48.624647571, 42.845175292, 53.535374086, 45.919226343, 42.80751312, 53.311820816, 51.43795566, 48.522733084, 42.590795146, 40.133987396, 42.3087369, 40.334329178, 39.44260618, 41.667343504, 40.776782835, 37.725547193, 40.776782835, 35.427126679, 40.490156646, 36.660246851, 34.422663647, 40.446202848, 39.621053097, 37.463971015, 60.271489034, 56.442588376, 59.8723401, 57.078297562, 55.202040634, 58.310667004, 56.435790641, 51.019814812, 53.310062558, 47.762707137, 57.069909524, 49.92030233, 46.728770205, 57.037961444, 55.229109813, 52.166971365, 50.541420494, 49.226621818, 50.2067091, 47.863729342, 47.116772654, 49.7656626, 44.974095956, 48.667868199, 42.241274113, 46.022382453, 49.6630349, 44.930324539, 42.203344352, 49.625605494, 48.576089344, 45.937352987, 48.082681454, 47.024019995, 47.7642531, 45.535254622, 45.103175218, 47.640851402, 43.590661125, 46.584185484, 41.025529046, 44.063678425, 47.640703902, 43.590646205, 41.025515524, 47.587142338, 46.584108342, 44.063610873, 52.765680996, 48.820192634, 52.4162394, 49.970148228, 47.71891569, 50.406042552, 45.62052235, 49.326011714, 42.846059312, 46.640424087, 49.207539094, 44.559568777, 41.852323333, 49.16805828, 48.15227162, 45.531321684]
td_bins = [78, 85, 82, 83, 85, 83, 82, 78, 79, 63, 90, 72, 67, 78, 85, 83, 65, 62, 64, 52, 55, 59, 59, 56, 54, 50, 62, 55, 50, 60, 61, 53, 69, 76, 74, 75, 75, 71, 73, 72, 68, 65, 74, 71, 66, 74, 74, 65, 95, 103, 107, 96, 97, 103, 88, 98, 79, 81, 104, 90, 76, 101, 101, 90, 99, 100, 96, 97, 94, 97, 96, 97, 95, 82, 101, 96, 80, 97, 96, 91, 70, 68, 70, 65, 68, 69, 64, 61, 65, 64, 77, 62, 64, 78, 65, 62]
td_Mean_Completeness = [53.86, 53.28, 56.31, 57.37, 54.68, 50.1, 58.05, 54.04, 51.16, 54.76, 51.12, 52.23, 57.52, 52.15, 58.83, 54.29, 55.06, 54.27, 53.37, 53.16, 53.77, 57.81, 54.55, 54.88, 54.17, 52.29, 54.94, 52.04, 51.41, 58.36, 54.69, 56.81, 56.66, 53.72, 57.38, 56.22, 52.7, 52.65, 55.11, 54.88, 57.48, 54.64, 53.6, 54.34, 58.18, 52.11, 56.26, 58.28, 55.64, 57.45, 58.55, 54.54, 57.45, 58.65, 55.18, 55.5, 58.25, 60.02, 55.82, 58.17, 60.34, 56.8, 57.0, 59.14, 54.4, 51.04, 58.28, 56.73, 48.21, 51.68, 52.66, 56.77, 55.24, 57.44, 58.56, 57.33, 54.41, 56.39, 53.78, 58.04, 57.17, 59.53, 56.07, 51.82, 58.95, 60.91, 56.23, 56.59, 60.48, 59.56, 56.87, 61.29, 54.39, 60.71, 53.56, 53.89]
td_Mean_Contamination = [64.8, 56.57, 63.75, 56.14, 63.41, 65.77, 68.71, 59.0, 51.31, 53.39, 53.35, 56.24, 59.82, 71.72, 69.63, 66.78, 53.25, 46.8, 58.27, 54.69, 48.23, 50.47, 57.38, 53.78, 50.02, 45.22, 47.86, 47.19, 54.22, 48.53, 54.71, 53.82, 83.63, 71.19, 88.26, 89.13, 73.4, 65.59, 87.02, 85.4, 79.15, 67.55, 71.37, 72.84, 82.49, 64.29, 92.78, 85.56, 107.95, 99.99, 99.71, 85.91, 99.47, 78.31, 82.71, 107.35, 97.95, 92.98, 87.49, 93.41, 102.6, 77.96, 97.57, 99.91, 83.7, 63.09, 68.23, 69.77, 73.61, 78.21, 70.56, 69.82, 69.56, 64.5, 63.16, 82.06, 72.7, 69.26, 75.71, 62.79, 62.01, 54.02, 69.1, 58.79, 55.27, 51.99, 60.61, 57.98, 59.85, 65.24, 56.41, 58.49, 58.44, 53.79, 57.06, 54.33]
td_good_bins = [21, 19, 19, 14, 15, 20, 20, 17, 17, 12, 19, 16, 12, 22, 20, 17, 16, 16, 17, 17, 15, 17, 19, 15, 15, 14, 16, 16, 15, 18, 19, 16, 18, 17, 17, 16, 16, 16, 17, 17, 14, 15, 17, 15, 15, 19, 15, 17, 22, 21, 18, 19, 21, 22, 17, 21, 21, 16, 23, 16, 17, 24, 20, 21, 19, 21, 21, 15, 18, 19, 20, 18, 15, 17, 19, 18, 18, 18, 19, 16, 17, 17, 19, 18, 18, 17, 19, 17, 19, 18, 18, 16, 18, 21, 20, 17]
td_good_Mean_Completeness = [85.98, 87.17, 86.9, 87.66, 86.64, 86.04, 85.18, 86.79, 86.35, 86.03, 88.47, 85.58, 89.46, 86.17, 83.85, 86.86, 87.87, 87.38, 87.61, 86.94, 87.35, 86.96, 88.23, 88.04, 88.62, 90.16, 88.26, 89.11, 86.1, 88.26, 87.55, 87.21, 87.51, 86.6, 87.87, 87.83, 86.62, 87.87, 87.12, 87.7, 87.06, 87.94, 87.37, 85.69, 87.28, 85.92, 88.11, 86.67, 87.68, 87.33, 87.89, 88.48, 89.21, 88.4, 86.1, 86.69, 87.99, 88.53, 89.18, 87.33, 86.83, 88.58, 87.12, 87.34, 88.54, 86.69, 87.03, 86.06, 88.99, 86.81, 86.17, 86.12, 87.78, 85.64, 86.41, 87.08, 85.98, 88.56, 87.42, 87.18, 85.99, 87.07, 86.97, 86.86, 86.75, 89.43, 86.51, 86.19, 86.17, 85.87, 86.68, 87.63, 86.26, 89.38, 87.22, 87.03]
td_good_Mean_Contamination = [5.06, 4.78, 4.28, 4.58, 3.83, 4.25, 4.66, 4.68, 4.57, 4.01, 4.01, 4.34, 4.43, 4.96, 4.23, 5.17, 3.79, 3.63, 3.76, 3.61, 3.99, 3.66, 4.04, 3.17, 3.67, 3.37, 4.3, 3.39, 3.89, 3.63, 4.05, 4.51, 3.87, 3.93, 3.27, 3.48, 4.16, 4.88, 3.89, 3.16, 4.65, 4.07, 4.41, 4.22, 4.07, 3.78, 3.85, 4.16, 3.92, 3.06, 4.2, 4.41, 3.21, 3.81, 4.31, 3.64, 3.84, 4.09, 3.84, 3.64, 3.48, 3.71, 3.4, 4.06, 3.76, 4.32, 4.61, 4.74, 3.56, 4.54, 4.1, 4.05, 4.26, 4.31, 4.84, 4.06, 3.88, 4.59, 4.63, 3.76, 3.57, 3.88, 3.56, 3.5, 3.68, 4.71, 3.46, 3.84, 3.94, 3.61, 3.5, 3.36, 3.4, 3.57, 3.91, 4.21]
r_read_files = ['9117.5_raw', '10158.8_raw', '11263.1_raw', '11306.3_raw', '11306.1_raw', '11260.6_raw', '11260.5_raw', '9108.1_raw', '9053.2_raw', '9672.8_raw', '9108.2_raw', '9053.4_raw', '9053.3_raw', '9117.4_raw', '9117.6_raw', '9117.7_raw', '9117.8_raw', '10158.6_raw', '10186.3_raw', '10186.4_raw', '7331.1_raw', '9053.5_raw', '9041.8_raw']
r_tMreads = [36.0129894, 17.6218972, 38.2800142, 34.9076424, 35.3037194, 37.1504476, 40.3613864, 20.7773948, 31.8428354, 30.1166938, 27.718318, 40.8492618, 39.7169858, 34.4581152, 26.9696492, 21.3309852, 39.9148934, 34.9441596, 35.690255, 35.5019026, 33.4711394, 28.2058246, 36.96984]
r_Bbases = [54.379613994, 26.609064772, 57.802821442, 52.710540024, 53.308616294, 56.097175876, 60.945693464, 31.373866148, 48.082681454, 45.1750407, 41.85466018, 61.682385318, 59.972648558, 52.031753952, 40.724170292, 32.209787652, 60.271489034, 52.765680996, 53.89228505, 53.607872926, 50.541420494, 42.590795146, 55.8244584]
r_bins = [65, 47, 139, 99, 55, 90, 115, 38, 86, 87, 69, 71, 95, 70, 62, 95, 65, 78, 85, 109, 45, 49, 52]
r_Mean_Completeness = [45.96, 58.83, 51.28, 56.54, 56.55, 63.26, 52.23, 58.47, 54.69, 53.7, 58.18, 60.32, 62.41, 65.52, 52.14, 50.97, 56.26, 57.0, 65.39, 54.89, 53.78, 53.56, 52.81]
r_Mean_Contamination = [24.03, 69.63, 67.13, 60.4, 53.21, 81.43, 55.06, 35.14, 54.71, 74.62, 54.7, 76.81, 66.86, 78.94, 67.85, 46.5, 92.78, 97.57, 121.14, 103.47, 75.71, 57.06, 65.71]
r_good_bins = [15, 4, 23, 19, 14, 25, 29, 9, 20, 20, 18, 18, 23, 17, 10, 22, 16, 21, 18, 20, 9, 9, 14]
r_good_Mean_Completeness = [84.79, 83.85, 89.19, 87.14, 90.06, 87.41, 90.7, 86.6, 87.55, 85.4, 87.31, 86.68, 86.18, 84.94, 87.02, 88.92, 88.11, 87.12, 87.24, 89.3, 87.42, 87.22, 86.64]
r_good_Mean_Contamination = [3.37, 4.23, 3.57, 4.4, 3.62, 4.4, 4.56, 3.0, 4.05, 4.32, 4.53, 3.9, 4.48, 4.17, 2.86, 2.87, 3.85, 3.4, 3.63, 4.37, 4.63, 3.91, 4.25]

#create dataset
df = pd.DataFrame({'td_tMreads': td_tMreads,
                   'td_Bbases': td_Bbases,
                   'td_bins': td_bins,
                   'td_Mean_Completeness': td_Mean_Completeness,
                   'td_Mean_Contamination': td_Mean_Contamination,
                   'td_good_bins': td_good_bins,
                   'td_good_Mean_Completeness': td_good_Mean_Completeness,
                   'td_good_Mean_Contamination': td_good_Mean_Contamination,
                   'Mix_Group': Mix_Group})
df.rename(columns={'td_tMreads': 'tMreads', 'td_Bbases': 'Bbases'}, inplace=True)

df2 = pd.DataFrame({'r_tMreads': r_tMreads,
                   'r_Bbases': r_Bbases,
                   'r_bins': r_bins,
                   'r_Mean_Completeness': r_Mean_Completeness,
                   'r_Mean_Contamination': r_Mean_Contamination,
                   'r_good_bins': r_good_bins,
                   'r_good_Mean_Completeness': r_good_Mean_Completeness,
                   'r_good_Mean_Contamination': r_good_Mean_Contamination})
df2.rename(columns={'r_tMreads': 'tMreads', 'r_Bbases': 'Bbases'}, inplace=True)

#view dataset
#print(df)

#fit regression model
model = smf.mixedlm("td_Mean_Completeness ~ Bbases", data=df, groups=df["Mix_Group"])
modelf = model.fit()
model1 = ols('td_Mean_Completeness ~ Bbases', data=df).fit()
model2 = ols('r_Mean_Completeness ~ Bbases', data=df2).fit()

#view model summary
print(modelf.summary())
print(model1.summary())
print(model2.summary())

#define figure size
fig = plt.figure(figsize=(12,8))
fig2 = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(model1, 'Bbases', fig=fig)
fig2 = sm.graphics.plot_regress_exog(model2, 'Bbases', fig=fig2)

              Mixed Linear Model Regression Results
==================================================================
Model:            MixedLM Dependent Variable: td_Mean_Completeness
No. Observations: 96      Method:             REML                
No. Groups:       6       Scale:              5.9308              
Min. group size:  16      Likelihood:         -227.2845           
Max. group size:  16      Converged:          Yes                 
Mean group size:  16.0                                            
--------------------------------------------------------------------
                Coef.    Std.Err.     z      P>|z|   [0.025   0.975]
--------------------------------------------------------------------
Intercept       57.331      3.383   16.945   0.000   50.700   63.962
Bbases          -0.035      0.070   -0.503   0.615   -0.173    0.102
Group Var        1.774      0.587                                   
==================================================================

                             OLS Regression Results                             
================================================================================
Dep. Variable:     td_Mean_Completeness   R-squared:                       0.000
Model:                              OLS   Adj. R-squared:                 -0.010
Method:                   Least Squares   F-statistic:                   0.02058
Date:                  Sun, 28 Mar 2021   Prob (F-statistic):              0.886
Time:                          22:28:41   Log-Likelihood:                -230.62
No. Observations:                    96   AIC:                             465.2
Df Residuals:                        94   BIC:                             470.4
Df Model:                             1                                         
Covariance Type:              nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     55.3173      2.380     23.240      0.000      50.591      60.043
Bbases         0.0071      0.050      0.143      0.886      -0.092       0.106
==============================================================================
Omnibus:                        0.847   Durbin-Watson:                   1.695
Prob(Omnibus):                  0.655   Jarque-Bera (JB):                0.917
Skew:                          -0.117   Prob(JB):                        0.632
Kurtosis:                       2.583   Cond. No.                         413.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                             OLS Regression Results                            
===============================================================================
Dep. Variable:     r_Mean_Completeness   R-squared:                       0.010
Model:                             OLS   Adj. R-squared:                 -0.037
Method:                  Least Squares   F-statistic:                    0.2185
Date:                 Sun, 28 Mar 2021   Prob (F-statistic):              0.645
Time:                         22:28:41   Log-Likelihood:                -68.302
No. Observations:                   23   AIC:                             140.6
Df Residuals:                       21   BIC:                             142.9
Df Model:                            1                                         
Covariance Type:             nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     53.7971      5.438      9.893      0.000      42.489      65.105
Bbases         0.0502      0.107      0.467      0.645      -0.173       0.273
==============================================================================
Omnibus:                        0.191   Durbin-Watson:                   1.902
Prob(Omnibus):                  0.909   Jarque-Bera (JB):                0.158
Skew:                           0.158   Prob(JB):                        0.924
Kurtosis:                       2.745   Cond. No.                         268.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

<Figure size 864x576 with 0 Axes>
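
The OLS summaries above report a slope, an intercept, R-squared, and a t statistic for the single predictor. As a minimal sketch of how those quantities arise for a one-predictor model, the function below computes them from two plain lists using only the standard library. The name `simple_ols` and the toy numbers are illustrative assumptions and are not part of the original analysis, which used statsmodels.

```python
import math


def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor.

    Hypothetical helper for illustration; it is not part of statsmodels.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have equal length of at least 3")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squares and cross-products around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual sum of squares and the quantities derived from it
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1.0 - sse / sst
    slope_se = math.sqrt(sse / (n - 2) / sxx)
    t_stat = slope / slope_se if slope_se > 0 else math.inf

    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": r_squared,
        "slope_se": slope_se,
        "t": t_stat,
    }


# Toy data (made up for this example, not taken from the read files)
toy_fit = simple_ols([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(toy_fit)
```

Calling `simple_ols(r_Bbases, r_Mean_Completeness)` with the lists defined in the cell above would be one way to cross-check the Bbases row of the raw-read OLS summary; that comparison has not been run here.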

#Average MAG contamination was not correlated with read counts of trimmed and decontaminated reads
#Average MAG contamination was not correlated with read counts of raw reads at alpha = 0.05

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols

#data
Mix_Group = ['10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4']
td_read_files = ['10158.6_raw', '10158.6_qc', '10158.6_trim150', '10158.6_ftrim', '10158.6_ktrim', '10158.6_atrim', '10158.6_aqbtrim', '10158.6_aqtrim', '10158.6_qbtrim', '10158.6_qtrim', '10158.6_bb1', '10158.6_bb2', '10158.6_bb3', '10158.6_bb4', '10158.6_bb5', '10158.6_bb6', '9117.8_raw', '9117.8_qc', '9117.8_trim150', '9117.8_ftrim', '9117.8_ktrim', '9117.8_atrim', '9117.8_aqbtrim', '9117.8_aqtrim', '9117.8_qbtrim', '9117.8_qtrim', '9117.8_bb1', '9117.8_bb2', '9117.8_bb3', '9117.8_bb4', '9117.8_bb5', '9117.8_bb6', '9108.2_raw', '9108.2_qc', '9108.2_trim150', '9108.2_ftrim', '9108.2_ktrim', '9108.2_atrim', '9108.2_aqbtrim', '9108.2_aqtrim', '9108.2_qbtrim', '9108.2_qtrim', '9108.2_bb1', '9108.2_bb2', '9108.2_bb3', '9108.2_bb4', '9108.2_bb5', '9108.2_bb6', '9117.7_raw', '9117.7_qc', '9117.7_trim150', '9117.7_ftrimmed', '9117.7_ktrimmed', '9117.7_atrimmed', '9117.7_aqbtrimmed', '9117.7_aqtrimmed', '9117.7_qbtrimmed', '9117.7_qtrimmed', '9117.7_bb1', '9117.7_bb2', '9117.7_bb3', '9117.7_bb4', '9117.7_bb5', '9117.7_bb6', '11306.3_raw', '11306.3_qc', '11306.3_trim150', '11306.3_ftrimmed', '11306.3_ktrimmed', '11306.3_atrimmed', '11306.3_aqbtrimmed', '11306.3_aqtrimmed', '11306.3_qbtrimmed', '11306.3_qtrimmed', '11306.3_bb1', '11306.3_bb2', '11306.3_bb3', '11306.3_bb4', '11306.3_bb5', '11306.3_bb6', '9117.4_raw', '9117.4_qc', '9117.4_trim150', '9117.4_ftrimmed', '9117.4_ktrimmed', '9117.4_atrimmed', '9117.4_aqbtrimmed', '9117.4_aqtrimmed', '9117.4_qbtrimmed', '9117.4_qtrimmed', '9117.4_bb1', '9117.4_bb2', '9117.4_bb3', '9117.4_bb4', '9117.4_bb5', '9117.4_bb6']
td_tMreads = [36.0129894, 35.2337896, 36.0129894, 36.0129894, 35.8983284, 35.933862, 34.552143, 31.2682706, 34.3282984, 30.6449696, 35.8442254, 31.2345964, 30.615651, 35.3058416, 34.4731868, 34.2527702, 28.2058246, 26.5787996, 28.2058246, 28.2058246, 27.66738, 27.6666568, 27.2469874, 25.5238858, 27.2469874, 25.2340502, 26.886397, 24.8035162, 24.519662, 26.7855648, 26.4733418, 26.3827484, 39.9148934, 37.3791976, 39.9148934, 39.9148934, 38.7122998, 38.708094, 37.7905962, 34.6650054, 37.6107554, 34.1456456, 37.8858836, 33.921016, 33.4111558, 37.7734844, 36.9821596, 36.8056446, 33.4711394, 32.6004118, 33.4711394, 33.4711394, 33.04149, 33.0383456, 30.3924568, 32.5038064, 30.0542808, 32.3956334, 32.9696938, 30.3601654, 30.025172, 32.8646394, 32.4398904, 32.3335082, 31.8428354, 31.3942576, 31.8428354, 31.8428354, 31.637986, 31.6358996, 29.2361886, 31.0176556, 28.9929926, 30.9364282, 31.6358006, 29.2361776, 28.9929822, 31.5146638, 31.0175998, 30.9363772, 34.9441596, 32.3312534, 34.9441596, 34.9441596, 33.4723724, 33.4683444, 30.8153706, 32.9387816, 30.4748512, 32.8289102, 32.6738296, 30.0963168, 29.7659662, 32.561628, 32.1535464, 32.0471204]
td_Bbases = [54.379613994, 52.641056249, 54.0194841, 51.498574842, 50.780358538, 53.669962132, 51.549471661, 45.964065176, 48.624647571, 42.845175292, 53.535374086, 45.919226343, 42.80751312, 53.311820816, 51.43795566, 48.522733084, 42.590795146, 40.133987396, 42.3087369, 40.334329178, 39.44260618, 41.667343504, 40.776782835, 37.725547193, 40.776782835, 35.427126679, 40.490156646, 36.660246851, 34.422663647, 40.446202848, 39.621053097, 37.463971015, 60.271489034, 56.442588376, 59.8723401, 57.078297562, 55.202040634, 58.310667004, 56.435790641, 51.019814812, 53.310062558, 47.762707137, 57.069909524, 49.92030233, 46.728770205, 57.037961444, 55.229109813, 52.166971365, 50.541420494, 49.226621818, 50.2067091, 47.863729342, 47.116772654, 49.7656626, 44.974095956, 48.667868199, 42.241274113, 46.022382453, 49.6630349, 44.930324539, 42.203344352, 49.625605494, 48.576089344, 45.937352987, 48.082681454, 47.024019995, 47.7642531, 45.535254622, 45.103175218, 47.640851402, 43.590661125, 46.584185484, 41.025529046, 44.063678425, 47.640703902, 43.590646205, 41.025515524, 47.587142338, 46.584108342, 44.063610873, 52.765680996, 48.820192634, 52.4162394, 49.970148228, 47.71891569, 50.406042552, 45.62052235, 49.326011714, 42.846059312, 46.640424087, 49.207539094, 44.559568777, 41.852323333, 49.16805828, 48.15227162, 45.531321684]
td_bins = [78, 85, 82, 83, 85, 83, 82, 78, 79, 63, 90, 72, 67, 78, 85, 83, 65, 62, 64, 52, 55, 59, 59, 56, 54, 50, 62, 55, 50, 60, 61, 53, 69, 76, 74, 75, 75, 71, 73, 72, 68, 65, 74, 71, 66, 74, 74, 65, 95, 103, 107, 96, 97, 103, 88, 98, 79, 81, 104, 90, 76, 101, 101, 90, 99, 100, 96, 97, 94, 97, 96, 97, 95, 82, 101, 96, 80, 97, 96, 91, 70, 68, 70, 65, 68, 69, 64, 61, 65, 64, 77, 62, 64, 78, 65, 62]
td_Mean_Completeness = [53.86, 53.28, 56.31, 57.37, 54.68, 50.1, 58.05, 54.04, 51.16, 54.76, 51.12, 52.23, 57.52, 52.15, 58.83, 54.29, 55.06, 54.27, 53.37, 53.16, 53.77, 57.81, 54.55, 54.88, 54.17, 52.29, 54.94, 52.04, 51.41, 58.36, 54.69, 56.81, 56.66, 53.72, 57.38, 56.22, 52.7, 52.65, 55.11, 54.88, 57.48, 54.64, 53.6, 54.34, 58.18, 52.11, 56.26, 58.28, 55.64, 57.45, 58.55, 54.54, 57.45, 58.65, 55.18, 55.5, 58.25, 60.02, 55.82, 58.17, 60.34, 56.8, 57.0, 59.14, 54.4, 51.04, 58.28, 56.73, 48.21, 51.68, 52.66, 56.77, 55.24, 57.44, 58.56, 57.33, 54.41, 56.39, 53.78, 58.04, 57.17, 59.53, 56.07, 51.82, 58.95, 60.91, 56.23, 56.59, 60.48, 59.56, 56.87, 61.29, 54.39, 60.71, 53.56, 53.89]
td_Mean_Contamination = [64.8, 56.57, 63.75, 56.14, 63.41, 65.77, 68.71, 59.0, 51.31, 53.39, 53.35, 56.24, 59.82, 71.72, 69.63, 66.78, 53.25, 46.8, 58.27, 54.69, 48.23, 50.47, 57.38, 53.78, 50.02, 45.22, 47.86, 47.19, 54.22, 48.53, 54.71, 53.82, 83.63, 71.19, 88.26, 89.13, 73.4, 65.59, 87.02, 85.4, 79.15, 67.55, 71.37, 72.84, 82.49, 64.29, 92.78, 85.56, 107.95, 99.99, 99.71, 85.91, 99.47, 78.31, 82.71, 107.35, 97.95, 92.98, 87.49, 93.41, 102.6, 77.96, 97.57, 99.91, 83.7, 63.09, 68.23, 69.77, 73.61, 78.21, 70.56, 69.82, 69.56, 64.5, 63.16, 82.06, 72.7, 69.26, 75.71, 62.79, 62.01, 54.02, 69.1, 58.79, 55.27, 51.99, 60.61, 57.98, 59.85, 65.24, 56.41, 58.49, 58.44, 53.79, 57.06, 54.33]
td_good_bins = [21, 19, 19, 14, 15, 20, 20, 17, 17, 12, 19, 16, 12, 22, 20, 17, 16, 16, 17, 17, 15, 17, 19, 15, 15, 14, 16, 16, 15, 18, 19, 16, 18, 17, 17, 16, 16, 16, 17, 17, 14, 15, 17, 15, 15, 19, 15, 17, 22, 21, 18, 19, 21, 22, 17, 21, 21, 16, 23, 16, 17, 24, 20, 21, 19, 21, 21, 15, 18, 19, 20, 18, 15, 17, 19, 18, 18, 18, 19, 16, 17, 17, 19, 18, 18, 17, 19, 17, 19, 18, 18, 16, 18, 21, 20, 17]
td_good_Mean_Completeness = [85.98, 87.17, 86.9, 87.66, 86.64, 86.04, 85.18, 86.79, 86.35, 86.03, 88.47, 85.58, 89.46, 86.17, 83.85, 86.86, 87.87, 87.38, 87.61, 86.94, 87.35, 86.96, 88.23, 88.04, 88.62, 90.16, 88.26, 89.11, 86.1, 88.26, 87.55, 87.21, 87.51, 86.6, 87.87, 87.83, 86.62, 87.87, 87.12, 87.7, 87.06, 87.94, 87.37, 85.69, 87.28, 85.92, 88.11, 86.67, 87.68, 87.33, 87.89, 88.48, 89.21, 88.4, 86.1, 86.69, 87.99, 88.53, 89.18, 87.33, 86.83, 88.58, 87.12, 87.34, 88.54, 86.69, 87.03, 86.06, 88.99, 86.81, 86.17, 86.12, 87.78, 85.64, 86.41, 87.08, 85.98, 88.56, 87.42, 87.18, 85.99, 87.07, 86.97, 86.86, 86.75, 89.43, 86.51, 86.19, 86.17, 85.87, 86.68, 87.63, 86.26, 89.38, 87.22, 87.03]
td_good_Mean_Contamination = [5.06, 4.78, 4.28, 4.58, 3.83, 4.25, 4.66, 4.68, 4.57, 4.01, 4.01, 4.34, 4.43, 4.96, 4.23, 5.17, 3.79, 3.63, 3.76, 3.61, 3.99, 3.66, 4.04, 3.17, 3.67, 3.37, 4.3, 3.39, 3.89, 3.63, 4.05, 4.51, 3.87, 3.93, 3.27, 3.48, 4.16, 4.88, 3.89, 3.16, 4.65, 4.07, 4.41, 4.22, 4.07, 3.78, 3.85, 4.16, 3.92, 3.06, 4.2, 4.41, 3.21, 3.81, 4.31, 3.64, 3.84, 4.09, 3.84, 3.64, 3.48, 3.71, 3.4, 4.06, 3.76, 4.32, 4.61, 4.74, 3.56, 4.54, 4.1, 4.05, 4.26, 4.31, 4.84, 4.06, 3.88, 4.59, 4.63, 3.76, 3.57, 3.88, 3.56, 3.5, 3.68, 4.71, 3.46, 3.84, 3.94, 3.61, 3.5, 3.36, 3.4, 3.57, 3.91, 4.21]
r_read_files = ['9117.5_raw', '10158.8_raw', '11263.1_raw', '11306.3_raw', '11306.1_raw', '11260.6_raw', '11260.5_raw', '9108.1_raw', '9053.2_raw', '9672.8_raw', '9108.2_raw', '9053.4_raw', '9053.3_raw', '9117.4_raw', '9117.6_raw', '9117.7_raw', '9117.8_raw', '10158.6_raw', '10186.3_raw', '10186.4_raw', '7331.1_raw', '9053.5_raw', '9041.8_raw']
r_tMreads = [36.0129894, 17.6218972, 38.2800142, 34.9076424, 35.3037194, 37.1504476, 40.3613864, 20.7773948, 31.8428354, 30.1166938, 27.718318, 40.8492618, 39.7169858, 34.4581152, 26.9696492, 21.3309852, 39.9148934, 34.9441596, 35.690255, 35.5019026, 33.4711394, 28.2058246, 36.96984]
r_Bbases = [54.379613994, 26.609064772, 57.802821442, 52.710540024, 53.308616294, 56.097175876, 60.945693464, 31.373866148, 48.082681454, 45.1750407, 41.85466018, 61.682385318, 59.972648558, 52.031753952, 40.724170292, 32.209787652, 60.271489034, 52.765680996, 53.89228505, 53.607872926, 50.541420494, 42.590795146, 55.8244584]
r_bins = [65, 47, 139, 99, 55, 90, 115, 38, 86, 87, 69, 71, 95, 70, 62, 95, 65, 78, 85, 109, 45, 49, 52]
r_Mean_Completeness = [45.96, 58.83, 51.28, 56.54, 56.55, 63.26, 52.23, 58.47, 54.69, 53.7, 58.18, 60.32, 62.41, 65.52, 52.14, 50.97, 56.26, 57.0, 65.39, 54.89, 53.78, 53.56, 52.81]
r_Mean_Contamination = [24.03, 69.63, 67.13, 60.4, 53.21, 81.43, 55.06, 35.14, 54.71, 74.62, 54.7, 76.81, 66.86, 78.94, 67.85, 46.5, 92.78, 97.57, 121.14, 103.47, 75.71, 57.06, 65.71]
r_good_bins = [15, 4, 23, 19, 14, 25, 29, 9, 20, 20, 18, 18, 23, 17, 10, 22, 16, 21, 18, 20, 9, 9, 14]
r_good_Mean_Completeness = [84.79, 83.85, 89.19, 87.14, 90.06, 87.41, 90.7, 86.6, 87.55, 85.4, 87.31, 86.68, 86.18, 84.94, 87.02, 88.92, 88.11, 87.12, 87.24, 89.3, 87.42, 87.22, 86.64]
r_good_Mean_Contamination = [3.37, 4.23, 3.57, 4.4, 3.62, 4.4, 4.56, 3.0, 4.05, 4.32, 4.53, 3.9, 4.48, 4.17, 2.86, 2.87, 3.85, 3.4, 3.63, 4.37, 4.63, 3.91, 4.25]

#create dataset
df = pd.DataFrame({'td_tMreads': td_tMreads,
                   'td_Bbases': td_Bbases,
                   'td_bins': td_bins,
                   'td_Mean_Completeness': td_Mean_Completeness,
                   'td_Mean_Contamination': td_Mean_Contamination,
                   'td_good_bins': td_good_bins,
                   'td_good_Mean_Completeness': td_good_Mean_Completeness,
                   'td_good_Mean_Contamination': td_good_Mean_Contamination,
                   'Mix_Group': Mix_Group})
df.rename(columns={'td_tMreads': 'tMreads', 'td_Bbases': 'Bbases'}, inplace=True)

df2 = pd.DataFrame({'r_tMreads': r_tMreads,
                   'r_Bbases': r_Bbases,
                   'r_bins': r_bins,
                   'r_Mean_Completeness': r_Mean_Completeness,
                   'r_Mean_Contamination': r_Mean_Contamination,
                   'r_good_bins': r_good_bins,
                   'r_good_Mean_Completeness': r_good_Mean_Completeness,
                   'r_good_Mean_Contamination': r_good_Mean_Contamination})
df2.rename(columns={'r_tMreads': 'tMreads', 'r_Bbases': 'Bbases'}, inplace=True)

#view dataset
#print(df)

#fit regression model
model = smf.mixedlm("td_Mean_Contamination ~ tMreads", data=df, groups=df["Mix_Group"])
modelf = model.fit()
model1 = ols('td_Mean_Contamination ~ tMreads', data=df).fit()
model2 = ols('r_Mean_Contamination ~ tMreads', data=df2).fit()

#view model summary
print(modelf.summary())
print(model1.summary())
print(model2.summary())

#define figure size
fig = plt.figure(figsize=(12,8))
fig2 = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(model1, 'tMreads', fig=fig)
fig2 = sm.graphics.plot_regress_exog(model2, 'tMreads', fig=fig2)

               Mixed Linear Model Regression Results
===================================================================
Model:            MixedLM Dependent Variable: td_Mean_Contamination
No. Observations: 96      Method:             REML                 
No. Groups:       6       Scale:              48.3367              
Min. group size:  16      Likelihood:         -331.4206            
Max. group size:  16      Converged:          Yes                  
Mean group size:  16.0                                             
----------------------------------------------------------------------
              Coef.     Std.Err.      z      P>|z|    [0.025    0.975]
----------------------------------------------------------------------
Intercept     47.905      15.522    3.086    0.002    17.482    78.329
tMreads        0.660       0.443    1.492    0.136    -0.207     1.528
Group Var    217.245      20.703                                      
===================================================================

                              OLS Regression Results                             
=================================================================================
Dep. Variable:     td_Mean_Contamination   R-squared:                       0.152
Model:                               OLS   Adj. R-squared:                  0.143
Method:                    Least Squares   F-statistic:                     16.82
Date:                   Sun, 28 Mar 2021   Prob (F-statistic):           8.72e-05
Time:                           22:28:56   Log-Likelihood:                -393.32
No. Observations:                     96   AIC:                             790.6
Df Residuals:                         94   BIC:                             795.8
Df Model:                              1                                         
Covariance Type:               nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     13.3043     13.717      0.970      0.335     -13.931      40.540
tMreads        1.7316      0.422      4.102      0.000       0.893       2.570
==============================================================================
Omnibus:                       13.222   Durbin-Watson:                   0.517
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               14.931
Skew:                           0.963   Prob(JB):                     0.000573
Kurtosis:                       3.151   Cond. No.                         297.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                             OLS Regression Results                             
================================================================================
Dep. Variable:     r_Mean_Contamination   R-squared:                       0.115
Model:                              OLS   Adj. R-squared:                  0.073
Method:                   Least Squares   F-statistic:                     2.733
Date:                  Sun, 28 Mar 2021   Prob (F-statistic):              0.113
Time:                          22:28:56   Log-Likelihood:                -101.59
No. Observations:                    23   AIC:                             207.2
Df Residuals:                        21   BIC:                             209.4
Df Model:                             1                                         
Covariance Type:              nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     31.1563     23.136      1.347      0.192     -16.957      79.270
tMreads        1.1395      0.689      1.653      0.113      -0.294       2.573
==============================================================================
Omnibus:                        1.877   Durbin-Watson:                   1.151
Prob(Omnibus):                  0.391   Jarque-Bera (JB):                0.610
Skew:                           0.223   Prob(JB):                        0.737
Kurtosis:                       3.662   Cond. No.                         178.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

<Figure size 864x576 with 0 Axes>
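
For the trimmed and decontaminated reads, the pooled OLS fit above reports a significant tMreads slope (p < 0.001), while the mixed model grouped by Mix_Group does not (p = 0.136). One plausible reading is that the 16 read files derived from each source sample are not independent, so differences between samples dominate the pooled fit. The sketch below illustrates that effect on toy data by comparing a pooled slope with a within-group slope obtained after centering each group on its own means. This within-group estimator is a simpler stand-in chosen for illustration; it is not the REML mixed model fitted by `smf.mixedlm`, and the function names and numbers are assumptions.

```python
from collections import defaultdict


def pooled_slope(x, y):
    """Least-squares slope of y on x, ignoring any grouping."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def within_group_slope(x, y, groups):
    """Slope of y on x after subtracting each group's own means."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for xi, yi, g in zip(x, y, groups):
        totals[g][0] += xi
        totals[g][1] += yi
        totals[g][2] += 1
    means = {g: (sx / n, sy / n) for g, (sx, sy, n) in totals.items()}

    # Centering removes between-group differences in level
    x_centered = [xi - means[g][0] for xi, g in zip(x, groups)]
    y_centered = [yi - means[g][1] for yi, g in zip(y, groups)]
    return pooled_slope(x_centered, y_centered)


# Toy data: y falls with x inside each group, but group B sits higher on both axes
toy_x = [1, 2, 3, 11, 12, 13]
toy_y = [5, 4, 3, 15, 14, 13]
toy_groups = ["A", "A", "A", "B", "B", "B"]

print("pooled slope:", pooled_slope(toy_x, toy_y))
print("within-group slope:", within_group_slope(toy_x, toy_y, toy_groups))
```

In the toy data the pooled slope is positive even though the slope inside every group is negative, which is the kind of disagreement that grouping by source sample is meant to expose.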

#Average MAG contamination was not correlated with base counts of trimmed and decontaminated reads
#Average MAG contamination was not correlated with base counts of raw reads at alpha = 0.05

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols

#data
Mix_Group = ['10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4']
td_read_files = ['10158.6_raw', '10158.6_qc', '10158.6_trim150', '10158.6_ftrim', '10158.6_ktrim', '10158.6_atrim', '10158.6_aqbtrim', '10158.6_aqtrim', '10158.6_qbtrim', '10158.6_qtrim', '10158.6_bb1', '10158.6_bb2', '10158.6_bb3', '10158.6_bb4', '10158.6_bb5', '10158.6_bb6', '9117.8_raw', '9117.8_qc', '9117.8_trim150', '9117.8_ftrim', '9117.8_ktrim', '9117.8_atrim', '9117.8_aqbtrim', '9117.8_aqtrim', '9117.8_qbtrim', '9117.8_qtrim', '9117.8_bb1', '9117.8_bb2', '9117.8_bb3', '9117.8_bb4', '9117.8_bb5', '9117.8_bb6', '9108.2_raw', '9108.2_qc', '9108.2_trim150', '9108.2_ftrim', '9108.2_ktrim', '9108.2_atrim', '9108.2_aqbtrim', '9108.2_aqtrim', '9108.2_qbtrim', '9108.2_qtrim', '9108.2_bb1', '9108.2_bb2', '9108.2_bb3', '9108.2_bb4', '9108.2_bb5', '9108.2_bb6', '9117.7_raw', '9117.7_qc', '9117.7_trim150', '9117.7_ftrimmed', '9117.7_ktrimmed', '9117.7_atrimmed', '9117.7_aqbtrimmed', '9117.7_aqtrimmed', '9117.7_qbtrimmed', '9117.7_qtrimmed', '9117.7_bb1', '9117.7_bb2', '9117.7_bb3', '9117.7_bb4', '9117.7_bb5', '9117.7_bb6', '11306.3_raw', '11306.3_qc', '11306.3_trim150', '11306.3_ftrimmed', '11306.3_ktrimmed', '11306.3_atrimmed', '11306.3_aqbtrimmed', '11306.3_aqtrimmed', '11306.3_qbtrimmed', '11306.3_qtrimmed', '11306.3_bb1', '11306.3_bb2', '11306.3_bb3', '11306.3_bb4', '11306.3_bb5', '11306.3_bb6', '9117.4_raw', '9117.4_qc', '9117.4_trim150', '9117.4_ftrimmed', '9117.4_ktrimmed', '9117.4_atrimmed', '9117.4_aqbtrimmed', '9117.4_aqtrimmed', '9117.4_qbtrimmed', '9117.4_qtrimmed', '9117.4_bb1', '9117.4_bb2', '9117.4_bb3', '9117.4_bb4', '9117.4_bb5', '9117.4_bb6']
td_tMreads = [36.0129894, 35.2337896, 36.0129894, 36.0129894, 35.8983284, 35.933862, 34.552143, 31.2682706, 34.3282984, 30.6449696, 35.8442254, 31.2345964, 30.615651, 35.3058416, 34.4731868, 34.2527702, 28.2058246, 26.5787996, 28.2058246, 28.2058246, 27.66738, 27.6666568, 27.2469874, 25.5238858, 27.2469874, 25.2340502, 26.886397, 24.8035162, 24.519662, 26.7855648, 26.4733418, 26.3827484, 39.9148934, 37.3791976, 39.9148934, 39.9148934, 38.7122998, 38.708094, 37.7905962, 34.6650054, 37.6107554, 34.1456456, 37.8858836, 33.921016, 33.4111558, 37.7734844, 36.9821596, 36.8056446, 33.4711394, 32.6004118, 33.4711394, 33.4711394, 33.04149, 33.0383456, 30.3924568, 32.5038064, 30.0542808, 32.3956334, 32.9696938, 30.3601654, 30.025172, 32.8646394, 32.4398904, 32.3335082, 31.8428354, 31.3942576, 31.8428354, 31.8428354, 31.637986, 31.6358996, 29.2361886, 31.0176556, 28.9929926, 30.9364282, 31.6358006, 29.2361776, 28.9929822, 31.5146638, 31.0175998, 30.9363772, 34.9441596, 32.3312534, 34.9441596, 34.9441596, 33.4723724, 33.4683444, 30.8153706, 32.9387816, 30.4748512, 32.8289102, 32.6738296, 30.0963168, 29.7659662, 32.561628, 32.1535464, 32.0471204]
td_Bbases = [54.379613994, 52.641056249, 54.0194841, 51.498574842, 50.780358538, 53.669962132, 51.549471661, 45.964065176, 48.624647571, 42.845175292, 53.535374086, 45.919226343, 42.80751312, 53.311820816, 51.43795566, 48.522733084, 42.590795146, 40.133987396, 42.3087369, 40.334329178, 39.44260618, 41.667343504, 40.776782835, 37.725547193, 40.776782835, 35.427126679, 40.490156646, 36.660246851, 34.422663647, 40.446202848, 39.621053097, 37.463971015, 60.271489034, 56.442588376, 59.8723401, 57.078297562, 55.202040634, 58.310667004, 56.435790641, 51.019814812, 53.310062558, 47.762707137, 57.069909524, 49.92030233, 46.728770205, 57.037961444, 55.229109813, 52.166971365, 50.541420494, 49.226621818, 50.2067091, 47.863729342, 47.116772654, 49.7656626, 44.974095956, 48.667868199, 42.241274113, 46.022382453, 49.6630349, 44.930324539, 42.203344352, 49.625605494, 48.576089344, 45.937352987, 48.082681454, 47.024019995, 47.7642531, 45.535254622, 45.103175218, 47.640851402, 43.590661125, 46.584185484, 41.025529046, 44.063678425, 47.640703902, 43.590646205, 41.025515524, 47.587142338, 46.584108342, 44.063610873, 52.765680996, 48.820192634, 52.4162394, 49.970148228, 47.71891569, 50.406042552, 45.62052235, 49.326011714, 42.846059312, 46.640424087, 49.207539094, 44.559568777, 41.852323333, 49.16805828, 48.15227162, 45.531321684]
td_bins = [78, 85, 82, 83, 85, 83, 82, 78, 79, 63, 90, 72, 67, 78, 85, 83, 65, 62, 64, 52, 55, 59, 59, 56, 54, 50, 62, 55, 50, 60, 61, 53, 69, 76, 74, 75, 75, 71, 73, 72, 68, 65, 74, 71, 66, 74, 74, 65, 95, 103, 107, 96, 97, 103, 88, 98, 79, 81, 104, 90, 76, 101, 101, 90, 99, 100, 96, 97, 94, 97, 96, 97, 95, 82, 101, 96, 80, 97, 96, 91, 70, 68, 70, 65, 68, 69, 64, 61, 65, 64, 77, 62, 64, 78, 65, 62]
td_Mean_Completeness = [53.86, 53.28, 56.31, 57.37, 54.68, 50.1, 58.05, 54.04, 51.16, 54.76, 51.12, 52.23, 57.52, 52.15, 58.83, 54.29, 55.06, 54.27, 53.37, 53.16, 53.77, 57.81, 54.55, 54.88, 54.17, 52.29, 54.94, 52.04, 51.41, 58.36, 54.69, 56.81, 56.66, 53.72, 57.38, 56.22, 52.7, 52.65, 55.11, 54.88, 57.48, 54.64, 53.6, 54.34, 58.18, 52.11, 56.26, 58.28, 55.64, 57.45, 58.55, 54.54, 57.45, 58.65, 55.18, 55.5, 58.25, 60.02, 55.82, 58.17, 60.34, 56.8, 57.0, 59.14, 54.4, 51.04, 58.28, 56.73, 48.21, 51.68, 52.66, 56.77, 55.24, 57.44, 58.56, 57.33, 54.41, 56.39, 53.78, 58.04, 57.17, 59.53, 56.07, 51.82, 58.95, 60.91, 56.23, 56.59, 60.48, 59.56, 56.87, 61.29, 54.39, 60.71, 53.56, 53.89]
td_Mean_Contamination = [64.8, 56.57, 63.75, 56.14, 63.41, 65.77, 68.71, 59.0, 51.31, 53.39, 53.35, 56.24, 59.82, 71.72, 69.63, 66.78, 53.25, 46.8, 58.27, 54.69, 48.23, 50.47, 57.38, 53.78, 50.02, 45.22, 47.86, 47.19, 54.22, 48.53, 54.71, 53.82, 83.63, 71.19, 88.26, 89.13, 73.4, 65.59, 87.02, 85.4, 79.15, 67.55, 71.37, 72.84, 82.49, 64.29, 92.78, 85.56, 107.95, 99.99, 99.71, 85.91, 99.47, 78.31, 82.71, 107.35, 97.95, 92.98, 87.49, 93.41, 102.6, 77.96, 97.57, 99.91, 83.7, 63.09, 68.23, 69.77, 73.61, 78.21, 70.56, 69.82, 69.56, 64.5, 63.16, 82.06, 72.7, 69.26, 75.71, 62.79, 62.01, 54.02, 69.1, 58.79, 55.27, 51.99, 60.61, 57.98, 59.85, 65.24, 56.41, 58.49, 58.44, 53.79, 57.06, 54.33]
td_good_bins = [21, 19, 19, 14, 15, 20, 20, 17, 17, 12, 19, 16, 12, 22, 20, 17, 16, 16, 17, 17, 15, 17, 19, 15, 15, 14, 16, 16, 15, 18, 19, 16, 18, 17, 17, 16, 16, 16, 17, 17, 14, 15, 17, 15, 15, 19, 15, 17, 22, 21, 18, 19, 21, 22, 17, 21, 21, 16, 23, 16, 17, 24, 20, 21, 19, 21, 21, 15, 18, 19, 20, 18, 15, 17, 19, 18, 18, 18, 19, 16, 17, 17, 19, 18, 18, 17, 19, 17, 19, 18, 18, 16, 18, 21, 20, 17]
td_good_Mean_Completeness = [85.98, 87.17, 86.9, 87.66, 86.64, 86.04, 85.18, 86.79, 86.35, 86.03, 88.47, 85.58, 89.46, 86.17, 83.85, 86.86, 87.87, 87.38, 87.61, 86.94, 87.35, 86.96, 88.23, 88.04, 88.62, 90.16, 88.26, 89.11, 86.1, 88.26, 87.55, 87.21, 87.51, 86.6, 87.87, 87.83, 86.62, 87.87, 87.12, 87.7, 87.06, 87.94, 87.37, 85.69, 87.28, 85.92, 88.11, 86.67, 87.68, 87.33, 87.89, 88.48, 89.21, 88.4, 86.1, 86.69, 87.99, 88.53, 89.18, 87.33, 86.83, 88.58, 87.12, 87.34, 88.54, 86.69, 87.03, 86.06, 88.99, 86.81, 86.17, 86.12, 87.78, 85.64, 86.41, 87.08, 85.98, 88.56, 87.42, 87.18, 85.99, 87.07, 86.97, 86.86, 86.75, 89.43, 86.51, 86.19, 86.17, 85.87, 86.68, 87.63, 86.26, 89.38, 87.22, 87.03]
td_good_Mean_Contamination = [5.06, 4.78, 4.28, 4.58, 3.83, 4.25, 4.66, 4.68, 4.57, 4.01, 4.01, 4.34, 4.43, 4.96, 4.23, 5.17, 3.79, 3.63, 3.76, 3.61, 3.99, 3.66, 4.04, 3.17, 3.67, 3.37, 4.3, 3.39, 3.89, 3.63, 4.05, 4.51, 3.87, 3.93, 3.27, 3.48, 4.16, 4.88, 3.89, 3.16, 4.65, 4.07, 4.41, 4.22, 4.07, 3.78, 3.85, 4.16, 3.92, 3.06, 4.2, 4.41, 3.21, 3.81, 4.31, 3.64, 3.84, 4.09, 3.84, 3.64, 3.48, 3.71, 3.4, 4.06, 3.76, 4.32, 4.61, 4.74, 3.56, 4.54, 4.1, 4.05, 4.26, 4.31, 4.84, 4.06, 3.88, 4.59, 4.63, 3.76, 3.57, 3.88, 3.56, 3.5, 3.68, 4.71, 3.46, 3.84, 3.94, 3.61, 3.5, 3.36, 3.4, 3.57, 3.91, 4.21]
r_read_files = ['9117.5_raw', '10158.8_raw', '11263.1_raw', '11306.3_raw', '11306.1_raw', '11260.6_raw', '11260.5_raw', '9108.1_raw', '9053.2_raw', '9672.8_raw', '9108.2_raw', '9053.4_raw', '9053.3_raw', '9117.4_raw', '9117.6_raw', '9117.7_raw', '9117.8_raw', '10158.6_raw', '10186.3_raw', '10186.4_raw', '7331.1_raw', '9053.5_raw', '9041.8_raw']
r_tMreads = [36.0129894, 17.6218972, 38.2800142, 34.9076424, 35.3037194, 37.1504476, 40.3613864, 20.7773948, 31.8428354, 30.1166938, 27.718318, 40.8492618, 39.7169858, 34.4581152, 26.9696492, 21.3309852, 39.9148934, 34.9441596, 35.690255, 35.5019026, 33.4711394, 28.2058246, 36.96984]
r_Bbases = [54.379613994, 26.609064772, 57.802821442, 52.710540024, 53.308616294, 56.097175876, 60.945693464, 31.373866148, 48.082681454, 45.1750407, 41.85466018, 61.682385318, 59.972648558, 52.031753952, 40.724170292, 32.209787652, 60.271489034, 52.765680996, 53.89228505, 53.607872926, 50.541420494, 42.590795146, 55.8244584]
r_bins = [65, 47, 139, 99, 55, 90, 115, 38, 86, 87, 69, 71, 95, 70, 62, 95, 65, 78, 85, 109, 45, 49, 52]
r_Mean_Completeness = [45.96, 58.83, 51.28, 56.54, 56.55, 63.26, 52.23, 58.47, 54.69, 53.7, 58.18, 60.32, 62.41, 65.52, 52.14, 50.97, 56.26, 57.0, 65.39, 54.89, 53.78, 53.56, 52.81]
r_Mean_Contamination = [24.03, 69.63, 67.13, 60.4, 53.21, 81.43, 55.06, 35.14, 54.71, 74.62, 54.7, 76.81, 66.86, 78.94, 67.85, 46.5, 92.78, 97.57, 121.14, 103.47, 75.71, 57.06, 65.71]
r_good_bins = [15, 4, 23, 19, 14, 25, 29, 9, 20, 20, 18, 18, 23, 17, 10, 22, 16, 21, 18, 20, 9, 9, 14]
r_good_Mean_Completeness = [84.79, 83.85, 89.19, 87.14, 90.06, 87.41, 90.7, 86.6, 87.55, 85.4, 87.31, 86.68, 86.18, 84.94, 87.02, 88.92, 88.11, 87.12, 87.24, 89.3, 87.42, 87.22, 86.64]
r_good_Mean_Contamination = [3.37, 4.23, 3.57, 4.4, 3.62, 4.4, 4.56, 3.0, 4.05, 4.32, 4.53, 3.9, 4.48, 4.17, 2.86, 2.87, 3.85, 3.4, 3.63, 4.37, 4.63, 3.91, 4.25]

#create dataset
df = pd.DataFrame({'td_tMreads': td_tMreads,
                   'td_Bbases': td_Bbases,
                   'td_bins': td_bins,
                   'td_Mean_Completeness': td_Mean_Completeness,
                   'td_Mean_Contamination': td_Mean_Contamination,
                   'td_good_bins': td_good_bins,
                   'td_good_Mean_Completeness': td_good_Mean_Completeness,
                   'td_good_Mean_Contamination': td_good_Mean_Contamination,
                   'Mix_Group': Mix_Group})
df.rename(columns={'td_tMreads': 'tMreads', 'td_Bbases': 'Bbases'}, inplace=True)

df2 = pd.DataFrame({'r_tMreads': r_tMreads,
                   'r_Bbases': r_Bbases,
                   'r_bins': r_bins,
                   'r_Mean_Completeness': r_Mean_Completeness,
                   'r_Mean_Contamination': r_Mean_Contamination,
                   'r_good_bins': r_good_bins,
                   'r_good_Mean_Completeness': r_good_Mean_Completeness,
                   'r_good_Mean_Contamination': r_good_Mean_Contamination})
df2.rename(columns={'r_tMreads': 'tMreads', 'r_Bbases': 'Bbases'}, inplace=True)

#view dataset
#print(df)

#fit regression model
model = smf.mixedlm("td_Mean_Contamination ~ Bbases", data=df, groups=df["Mix_Group"])
modelf = model.fit()
model1 = ols('td_Mean_Contamination ~ Bbases', data=df).fit()
model2 = ols('r_Mean_Contamination ~ Bbases', data=df2).fit()

#mdf = md.fit()
#print(mdf.summary())

#view model summary
print(modelf.summary())
print(model1.summary())
print(model2.summary())

#define figure size
fig = plt.figure(figsize=(12,8))
fig2 = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(model1, 'Bbases', fig=fig)
#draw the raw-read model on its own figure (fig2) rather than on fig
fig2 = sm.graphics.plot_regress_exog(model2, 'Bbases', fig=fig2)

               Mixed Linear Model Regression Results
===================================================================
Model:            MixedLM Dependent Variable: td_Mean_Contamination
No. Observations: 96      Method:             REML                 
No. Groups:       6       Scale:              48.5993              
Min. group size:  16      Likelihood:         -332.4019            
Max. group size:  16      Converged:          Yes                  
Mean group size:  16.0                                             
----------------------------------------------------------------------
              Coef.     Std.Err.      z      P>|z|    [0.025    0.975]
----------------------------------------------------------------------
Intercept     55.584      12.528    4.437    0.000    31.029    80.139
Bbases         0.288       0.230    1.251    0.211    -0.163     0.738
Group Var    224.759      21.324                                      
===================================================================

                              OLS Regression Results                             
=================================================================================
Dep. Variable:     td_Mean_Contamination   R-squared:                       0.136
Model:                               OLS   Adj. R-squared:                  0.127
Method:                    Least Squares   F-statistic:                     14.83
Date:                   Sun, 28 Mar 2021   Prob (F-statistic):           0.000215
Time:                           22:29:16   Log-Likelihood:                -394.19
No. Observations:                     96   AIC:                             792.4
Df Residuals:                         94   BIC:                             797.5
Df Model:                              1                                         
Covariance Type:               nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     19.1996     13.079      1.468      0.145      -6.770      45.169
Bbases         1.0544      0.274      3.851      0.000       0.511       1.598
==============================================================================
Omnibus:                       12.843   Durbin-Watson:                   0.551
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               14.479
Skew:                           0.950   Prob(JB):                     0.000718
Kurtosis:                       3.108   Cond. No.                         413.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                             OLS Regression Results                             
================================================================================
Dep. Variable:     r_Mean_Contamination   R-squared:                       0.115
Model:                              OLS   Adj. R-squared:                  0.073
Method:                   Least Squares   F-statistic:                     2.722
Date:                  Sun, 28 Mar 2021   Prob (F-statistic):              0.114
Time:                          22:29:16   Log-Likelihood:                -101.59
No. Observations:                    23   AIC:                             207.2
Df Residuals:                        21   BIC:                             209.5
Df Model:                             1                                         
Covariance Type:              nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     31.2555     23.121      1.352      0.191     -16.827      79.338
Bbases         0.7528      0.456      1.650      0.114      -0.196       1.702
==============================================================================
Omnibus:                        1.868   Durbin-Watson:                   1.152
Prob(Omnibus):                  0.393   Jarque-Bera (JB):                0.604
Skew:                           0.222   Prob(JB):                        0.739
Kurtosis:                       3.658   Cond. No.                         268.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

<Figure size 864x576 with 0 Axes>
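The mixed-model slope for Bbases (0.288, P = 0.211) and the pooled OLS slope (1.0544, P < 0.001) disagree, and the intraclass correlation helps in reading that gap: it is the share of variance in mean contamination attributable to differences between the six source readsets. The sketch below is not part of the original narrative. It assumes that the `Group Var` row of the summary reports the random-intercept variance and that `Scale` reports the residual variance; the function name `intraclass_correlation` is hypothetical.

```python
def intraclass_correlation(group_var, residual_var):
    """Share of total variance explained by the grouping factor."""
    return group_var / (group_var + residual_var)

# values copied from the mixed-model summary above
group_var = 224.759   # 'Group Var' row (assumed random-intercept variance)
scale = 48.5993       # 'Scale' (assumed residual variance)

icc = intraclass_correlation(group_var, scale)
print(f"ICC = {icc:.3f}")
```

With these inputs the ratio is about 0.82. Under the stated assumptions, most of the spread in mean contamination would come from which readset was processed rather than from the trimming treatment, which is one plausible reading of why the pooled and grouped fits differ.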

#Good MAG counts were correlated with read counts of trimmed and decontaminated reads
#Good MAG counts were correlated with read counts of raw reads at alpha = 0.05

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols

#data
Mix_Group = ['10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4']
td_read_files = ['10158.6_raw', '10158.6_qc', '10158.6_trim150', '10158.6_ftrim', '10158.6_ktrim', '10158.6_atrim', '10158.6_aqbtrim', '10158.6_aqtrim', '10158.6_qbtrim', '10158.6_qtrim', '10158.6_bb1', '10158.6_bb2', '10158.6_bb3', '10158.6_bb4', '10158.6_bb5', '10158.6_bb6', '9117.8_raw', '9117.8_qc', '9117.8_trim150', '9117.8_ftrim', '9117.8_ktrim', '9117.8_atrim', '9117.8_aqbtrim', '9117.8_aqtrim', '9117.8_qbtrim', '9117.8_qtrim', '9117.8_bb1', '9117.8_bb2', '9117.8_bb3', '9117.8_bb4', '9117.8_bb5', '9117.8_bb6', '9108.2_raw', '9108.2_qc', '9108.2_trim150', '9108.2_ftrim', '9108.2_ktrim', '9108.2_atrim', '9108.2_aqbtrim', '9108.2_aqtrim', '9108.2_qbtrim', '9108.2_qtrim', '9108.2_bb1', '9108.2_bb2', '9108.2_bb3', '9108.2_bb4', '9108.2_bb5', '9108.2_bb6', '9117.7_raw', '9117.7_qc', '9117.7_trim150', '9117.7_ftrimmed', '9117.7_ktrimmed', '9117.7_atrimmed', '9117.7_aqbtrimmed', '9117.7_aqtrimmed', '9117.7_qbtrimmed', '9117.7_qtrimmed', '9117.7_bb1', '9117.7_bb2', '9117.7_bb3', '9117.7_bb4', '9117.7_bb5', '9117.7_bb6', '11306.3_raw', '11306.3_qc', '11306.3_trim150', '11306.3_ftrimmed', '11306.3_ktrimmed', '11306.3_atrimmed', '11306.3_aqbtrimmed', '11306.3_aqtrimmed', '11306.3_qbtrimmed', '11306.3_qtrimmed', '11306.3_bb1', '11306.3_bb2', '11306.3_bb3', '11306.3_bb4', '11306.3_bb5', '11306.3_bb6', '9117.4_raw', '9117.4_qc', '9117.4_trim150', '9117.4_ftrimmed', '9117.4_ktrimmed', '9117.4_atrimmed', '9117.4_aqbtrimmed', '9117.4_aqtrimmed', '9117.4_qbtrimmed', '9117.4_qtrimmed', '9117.4_bb1', '9117.4_bb2', '9117.4_bb3', '9117.4_bb4', '9117.4_bb5', '9117.4_bb6']
td_tMreads = [36.0129894, 35.2337896, 36.0129894, 36.0129894, 35.8983284, 35.933862, 34.552143, 31.2682706, 34.3282984, 30.6449696, 35.8442254, 31.2345964, 30.615651, 35.3058416, 34.4731868, 34.2527702, 28.2058246, 26.5787996, 28.2058246, 28.2058246, 27.66738, 27.6666568, 27.2469874, 25.5238858, 27.2469874, 25.2340502, 26.886397, 24.8035162, 24.519662, 26.7855648, 26.4733418, 26.3827484, 39.9148934, 37.3791976, 39.9148934, 39.9148934, 38.7122998, 38.708094, 37.7905962, 34.6650054, 37.6107554, 34.1456456, 37.8858836, 33.921016, 33.4111558, 37.7734844, 36.9821596, 36.8056446, 33.4711394, 32.6004118, 33.4711394, 33.4711394, 33.04149, 33.0383456, 30.3924568, 32.5038064, 30.0542808, 32.3956334, 32.9696938, 30.3601654, 30.025172, 32.8646394, 32.4398904, 32.3335082, 31.8428354, 31.3942576, 31.8428354, 31.8428354, 31.637986, 31.6358996, 29.2361886, 31.0176556, 28.9929926, 30.9364282, 31.6358006, 29.2361776, 28.9929822, 31.5146638, 31.0175998, 30.9363772, 34.9441596, 32.3312534, 34.9441596, 34.9441596, 33.4723724, 33.4683444, 30.8153706, 32.9387816, 30.4748512, 32.8289102, 32.6738296, 30.0963168, 29.7659662, 32.561628, 32.1535464, 32.0471204]
td_Bbases = [54.379613994, 52.641056249, 54.0194841, 51.498574842, 50.780358538, 53.669962132, 51.549471661, 45.964065176, 48.624647571, 42.845175292, 53.535374086, 45.919226343, 42.80751312, 53.311820816, 51.43795566, 48.522733084, 42.590795146, 40.133987396, 42.3087369, 40.334329178, 39.44260618, 41.667343504, 40.776782835, 37.725547193, 40.776782835, 35.427126679, 40.490156646, 36.660246851, 34.422663647, 40.446202848, 39.621053097, 37.463971015, 60.271489034, 56.442588376, 59.8723401, 57.078297562, 55.202040634, 58.310667004, 56.435790641, 51.019814812, 53.310062558, 47.762707137, 57.069909524, 49.92030233, 46.728770205, 57.037961444, 55.229109813, 52.166971365, 50.541420494, 49.226621818, 50.2067091, 47.863729342, 47.116772654, 49.7656626, 44.974095956, 48.667868199, 42.241274113, 46.022382453, 49.6630349, 44.930324539, 42.203344352, 49.625605494, 48.576089344, 45.937352987, 48.082681454, 47.024019995, 47.7642531, 45.535254622, 45.103175218, 47.640851402, 43.590661125, 46.584185484, 41.025529046, 44.063678425, 47.640703902, 43.590646205, 41.025515524, 47.587142338, 46.584108342, 44.063610873, 52.765680996, 48.820192634, 52.4162394, 49.970148228, 47.71891569, 50.406042552, 45.62052235, 49.326011714, 42.846059312, 46.640424087, 49.207539094, 44.559568777, 41.852323333, 49.16805828, 48.15227162, 45.531321684]
td_bins = [78, 85, 82, 83, 85, 83, 82, 78, 79, 63, 90, 72, 67, 78, 85, 83, 65, 62, 64, 52, 55, 59, 59, 56, 54, 50, 62, 55, 50, 60, 61, 53, 69, 76, 74, 75, 75, 71, 73, 72, 68, 65, 74, 71, 66, 74, 74, 65, 95, 103, 107, 96, 97, 103, 88, 98, 79, 81, 104, 90, 76, 101, 101, 90, 99, 100, 96, 97, 94, 97, 96, 97, 95, 82, 101, 96, 80, 97, 96, 91, 70, 68, 70, 65, 68, 69, 64, 61, 65, 64, 77, 62, 64, 78, 65, 62]
td_Mean_Completeness = [53.86, 53.28, 56.31, 57.37, 54.68, 50.1, 58.05, 54.04, 51.16, 54.76, 51.12, 52.23, 57.52, 52.15, 58.83, 54.29, 55.06, 54.27, 53.37, 53.16, 53.77, 57.81, 54.55, 54.88, 54.17, 52.29, 54.94, 52.04, 51.41, 58.36, 54.69, 56.81, 56.66, 53.72, 57.38, 56.22, 52.7, 52.65, 55.11, 54.88, 57.48, 54.64, 53.6, 54.34, 58.18, 52.11, 56.26, 58.28, 55.64, 57.45, 58.55, 54.54, 57.45, 58.65, 55.18, 55.5, 58.25, 60.02, 55.82, 58.17, 60.34, 56.8, 57.0, 59.14, 54.4, 51.04, 58.28, 56.73, 48.21, 51.68, 52.66, 56.77, 55.24, 57.44, 58.56, 57.33, 54.41, 56.39, 53.78, 58.04, 57.17, 59.53, 56.07, 51.82, 58.95, 60.91, 56.23, 56.59, 60.48, 59.56, 56.87, 61.29, 54.39, 60.71, 53.56, 53.89]
td_Mean_Contamination = [64.8, 56.57, 63.75, 56.14, 63.41, 65.77, 68.71, 59.0, 51.31, 53.39, 53.35, 56.24, 59.82, 71.72, 69.63, 66.78, 53.25, 46.8, 58.27, 54.69, 48.23, 50.47, 57.38, 53.78, 50.02, 45.22, 47.86, 47.19, 54.22, 48.53, 54.71, 53.82, 83.63, 71.19, 88.26, 89.13, 73.4, 65.59, 87.02, 85.4, 79.15, 67.55, 71.37, 72.84, 82.49, 64.29, 92.78, 85.56, 107.95, 99.99, 99.71, 85.91, 99.47, 78.31, 82.71, 107.35, 97.95, 92.98, 87.49, 93.41, 102.6, 77.96, 97.57, 99.91, 83.7, 63.09, 68.23, 69.77, 73.61, 78.21, 70.56, 69.82, 69.56, 64.5, 63.16, 82.06, 72.7, 69.26, 75.71, 62.79, 62.01, 54.02, 69.1, 58.79, 55.27, 51.99, 60.61, 57.98, 59.85, 65.24, 56.41, 58.49, 58.44, 53.79, 57.06, 54.33]
td_good_bins = [21, 19, 19, 14, 15, 20, 20, 17, 17, 12, 19, 16, 12, 22, 20, 17, 16, 16, 17, 17, 15, 17, 19, 15, 15, 14, 16, 16, 15, 18, 19, 16, 18, 17, 17, 16, 16, 16, 17, 17, 14, 15, 17, 15, 15, 19, 15, 17, 22, 21, 18, 19, 21, 22, 17, 21, 21, 16, 23, 16, 17, 24, 20, 21, 19, 21, 21, 15, 18, 19, 20, 18, 15, 17, 19, 18, 18, 18, 19, 16, 17, 17, 19, 18, 18, 17, 19, 17, 19, 18, 18, 16, 18, 21, 20, 17]
td_good_Mean_Completeness = [85.98, 87.17, 86.9, 87.66, 86.64, 86.04, 85.18, 86.79, 86.35, 86.03, 88.47, 85.58, 89.46, 86.17, 83.85, 86.86, 87.87, 87.38, 87.61, 86.94, 87.35, 86.96, 88.23, 88.04, 88.62, 90.16, 88.26, 89.11, 86.1, 88.26, 87.55, 87.21, 87.51, 86.6, 87.87, 87.83, 86.62, 87.87, 87.12, 87.7, 87.06, 87.94, 87.37, 85.69, 87.28, 85.92, 88.11, 86.67, 87.68, 87.33, 87.89, 88.48, 89.21, 88.4, 86.1, 86.69, 87.99, 88.53, 89.18, 87.33, 86.83, 88.58, 87.12, 87.34, 88.54, 86.69, 87.03, 86.06, 88.99, 86.81, 86.17, 86.12, 87.78, 85.64, 86.41, 87.08, 85.98, 88.56, 87.42, 87.18, 85.99, 87.07, 86.97, 86.86, 86.75, 89.43, 86.51, 86.19, 86.17, 85.87, 86.68, 87.63, 86.26, 89.38, 87.22, 87.03]
td_good_Mean_Contamination = [5.06, 4.78, 4.28, 4.58, 3.83, 4.25, 4.66, 4.68, 4.57, 4.01, 4.01, 4.34, 4.43, 4.96, 4.23, 5.17, 3.79, 3.63, 3.76, 3.61, 3.99, 3.66, 4.04, 3.17, 3.67, 3.37, 4.3, 3.39, 3.89, 3.63, 4.05, 4.51, 3.87, 3.93, 3.27, 3.48, 4.16, 4.88, 3.89, 3.16, 4.65, 4.07, 4.41, 4.22, 4.07, 3.78, 3.85, 4.16, 3.92, 3.06, 4.2, 4.41, 3.21, 3.81, 4.31, 3.64, 3.84, 4.09, 3.84, 3.64, 3.48, 3.71, 3.4, 4.06, 3.76, 4.32, 4.61, 4.74, 3.56, 4.54, 4.1, 4.05, 4.26, 4.31, 4.84, 4.06, 3.88, 4.59, 4.63, 3.76, 3.57, 3.88, 3.56, 3.5, 3.68, 4.71, 3.46, 3.84, 3.94, 3.61, 3.5, 3.36, 3.4, 3.57, 3.91, 4.21]
r_read_files = ['9117.5_raw', '10158.8_raw', '11263.1_raw', '11306.3_raw', '11306.1_raw', '11260.6_raw', '11260.5_raw', '9108.1_raw', '9053.2_raw', '9672.8_raw', '9108.2_raw', '9053.4_raw', '9053.3_raw', '9117.4_raw', '9117.6_raw', '9117.7_raw', '9117.8_raw', '10158.6_raw', '10186.3_raw', '10186.4_raw', '7331.1_raw', '9053.5_raw', '9041.8_raw']
r_tMreads = [36.0129894, 17.6218972, 38.2800142, 34.9076424, 35.3037194, 37.1504476, 40.3613864, 20.7773948, 31.8428354, 30.1166938, 27.718318, 40.8492618, 39.7169858, 34.4581152, 26.9696492, 21.3309852, 39.9148934, 34.9441596, 35.690255, 35.5019026, 33.4711394, 28.2058246, 36.96984]
r_Bbases = [54.379613994, 26.609064772, 57.802821442, 52.710540024, 53.308616294, 56.097175876, 60.945693464, 31.373866148, 48.082681454, 45.1750407, 41.85466018, 61.682385318, 59.972648558, 52.031753952, 40.724170292, 32.209787652, 60.271489034, 52.765680996, 53.89228505, 53.607872926, 50.541420494, 42.590795146, 55.8244584]
r_bins = [65, 47, 139, 99, 55, 90, 115, 38, 86, 87, 69, 71, 95, 70, 62, 95, 65, 78, 85, 109, 45, 49, 52]
r_Mean_Completeness = [45.96, 58.83, 51.28, 56.54, 56.55, 63.26, 52.23, 58.47, 54.69, 53.7, 58.18, 60.32, 62.41, 65.52, 52.14, 50.97, 56.26, 57.0, 65.39, 54.89, 53.78, 53.56, 52.81]
r_Mean_Contamination = [24.03, 69.63, 67.13, 60.4, 53.21, 81.43, 55.06, 35.14, 54.71, 74.62, 54.7, 76.81, 66.86, 78.94, 67.85, 46.5, 92.78, 97.57, 121.14, 103.47, 75.71, 57.06, 65.71]
r_good_bins = [15, 4, 23, 19, 14, 25, 29, 9, 20, 20, 18, 18, 23, 17, 10, 22, 16, 21, 18, 20, 9, 9, 14]
r_good_Mean_Completeness = [84.79, 83.85, 89.19, 87.14, 90.06, 87.41, 90.7, 86.6, 87.55, 85.4, 87.31, 86.68, 86.18, 84.94, 87.02, 88.92, 88.11, 87.12, 87.24, 89.3, 87.42, 87.22, 86.64]
r_good_Mean_Contamination = [3.37, 4.23, 3.57, 4.4, 3.62, 4.4, 4.56, 3.0, 4.05, 4.32, 4.53, 3.9, 4.48, 4.17, 2.86, 2.87, 3.85, 3.4, 3.63, 4.37, 4.63, 3.91, 4.25]

#create dataset
df = pd.DataFrame({'td_tMreads': td_tMreads,
                   'td_Bbases': td_Bbases,
                   'td_bins': td_bins,
                   'td_Mean_Completeness': td_Mean_Completeness,
                   'td_Mean_Contamination': td_Mean_Contamination,
                   'td_good_bins': td_good_bins,
                   'td_good_Mean_Completeness': td_good_Mean_Completeness,
                   'td_good_Mean_Contamination': td_good_Mean_Contamination,
                   'Mix_Group': Mix_Group})
df.rename(columns={'td_tMreads': 'tMreads', 'td_Bbases': 'Bbases'}, inplace=True)

df2 = pd.DataFrame({'r_tMreads': r_tMreads,
                   'r_Bbases': r_Bbases,
                   'r_bins': r_bins,
                   'r_Mean_Completeness': r_Mean_Completeness,
                   'r_Mean_Contamination': r_Mean_Contamination,
                   'r_good_bins': r_good_bins,
                   'r_good_Mean_Completeness': r_good_Mean_Completeness,
                   'r_good_Mean_Contamination': r_good_Mean_Contamination})
df2.rename(columns={'r_tMreads': 'tMreads', 'r_Bbases': 'Bbases'}, inplace=True)

#view dataset
#print(df)

#fit regression model
model = smf.mixedlm("td_good_bins ~ tMreads", data=df, groups=df["Mix_Group"])
modelf = model.fit()
model1 = ols('td_good_bins ~ tMreads', data=df).fit()
model2 = ols('r_good_bins ~ tMreads', data=df2).fit()

#adjusted R^2 of the raw-read OLS fit (r_good_bins ~ tMreads) = 0.297, i.e. R^2 adjusted for the number of predictors
#taking the square root gives a Pearson-style r: sqrt(0.297) = 0.545
#this value is reported as the adjusted Pearson's r

#mdf = md.fit()
#print(mdf.summary())

#view model summary
print(modelf.summary())
print(model1.summary())
print(model2.summary())

#define figure size
fig = plt.figure(figsize=(12,8))
fig2 = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(model1, 'tMreads', fig=fig)
#draw the raw-read model on its own figure (fig2) rather than on fig
fig2 = sm.graphics.plot_regress_exog(model2, 'tMreads', fig=fig2)

          Mixed Linear Model Regression Results
==========================================================
Model:            MixedLM Dependent Variable: td_good_bins
No. Observations: 96      Method:             REML        
No. Groups:       6       Scale:              3.4594      
Min. group size:  16      Likelihood:         -204.1157   
Max. group size:  16      Converged:          Yes         
Mean group size:  16.0                                    
-----------------------------------------------------------
              Coef.  Std.Err.    z    P>|z|  [0.025  0.975]
-----------------------------------------------------------
Intercept     4.835     4.017  1.204  0.229  -3.037  12.708
tMreads       0.399     0.122  3.276  0.001   0.160   0.638
Group Var     3.774     1.555                              
==========================================================

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           td_good_bins   R-squared:                       0.020
Model:                            OLS   Adj. R-squared:                  0.009
Method:                 Least Squares   F-statistic:                     1.902
Date:                Sun, 28 Mar 2021   Prob (F-statistic):              0.171
Time:                        22:29:30   Log-Likelihood:                -215.55
No. Observations:                  96   AIC:                             435.1
Df Residuals:                      94   BIC:                             440.2
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     14.7670      2.153      6.858      0.000      10.492      19.042
tMreads        0.0914      0.066      1.379      0.171      -0.040       0.223
==============================================================================
Omnibus:                        0.619   Durbin-Watson:                   1.508
Prob(Omnibus):                  0.734   Jarque-Bera (JB):                0.550
Skew:                           0.182   Prob(JB):                        0.760
Kurtosis:                       2.926   Cond. No.                         297.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            r_good_bins   R-squared:                       0.329
Model:                            OLS   Adj. R-squared:                  0.297
Method:                 Least Squares   F-statistic:                     10.31
Date:                Sun, 28 Mar 2021   Prob (F-statistic):            0.00419
Time:                        22:29:30   Log-Likelihood:                -68.680
No. Observations:                  23   AIC:                             141.4
Df Residuals:                      21   BIC:                             143.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.3598      5.532     -0.065      0.949     -11.865      11.146
tMreads        0.5293      0.165      3.211      0.004       0.187       0.872
==============================================================================
Omnibus:                        0.823   Durbin-Watson:                   2.044
Prob(Omnibus):                  0.663   Jarque-Bera (JB):                0.827
Skew:                           0.372   Prob(JB):                        0.661
Kurtosis:                       2.444   Cond. No.                         178.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

<Figure size 864x576 with 0 Axes>
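The comment in the cell above derives a Pearson-style r from the adjusted R-squared of the raw-read fit. For a least-squares fit with a single predictor, R-squared equals the square of Pearson's r, so the figures in the OLS summary can be cross-checked directly from the raw lists. What follows is a minimal standard-library sketch, not the authors' code; the helper names `pearson_r` and `adjusted_r2` are assumptions introduced for illustration.

```python
import math

# raw-read data copied from the cell above
r_tMreads = [36.0129894, 17.6218972, 38.2800142, 34.9076424, 35.3037194,
             37.1504476, 40.3613864, 20.7773948, 31.8428354, 30.1166938,
             27.718318, 40.8492618, 39.7169858, 34.4581152, 26.9696492,
             21.3309852, 39.9148934, 34.9441596, 35.690255, 35.5019026,
             33.4711394, 28.2058246, 36.96984]
r_good_bins = [15, 4, 23, 19, 14, 25, 29, 9, 20, 20, 18, 18, 23, 17, 10,
               22, 16, 21, 18, 20, 9, 9, 14]

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def adjusted_r2(r2, n_obs, n_predictors=1):
    """R-squared penalized for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

r = pearson_r(r_tMreads, r_good_bins)
r2 = r ** 2
adj = adjusted_r2(r2, len(r_tMreads))
print(f"r = {r:.3f}, R^2 = {r2:.3f}, adjusted R^2 = {adj:.3f}, "
      f"sqrt(adjusted R^2) = {math.sqrt(adj):.3f}")
```

The R-squared and adjusted R-squared printed here should agree with the 0.329 and 0.297 reported for r_good_bins in the OLS summary. The unadjusted r is slightly larger than the square root of the adjusted R-squared.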

#Good MAG counts were correlated with base counts of trimmed and decontaminated reads
#Good MAG counts were correlated with base counts of raw reads at alpha = 0.05

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols

#data
Mix_Group = ['10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4']
td_read_files = ['10158.6_raw', '10158.6_qc', '10158.6_trim150', '10158.6_ftrim', '10158.6_ktrim', '10158.6_atrim', '10158.6_aqbtrim', '10158.6_aqtrim', '10158.6_qbtrim', '10158.6_qtrim', '10158.6_bb1', '10158.6_bb2', '10158.6_bb3', '10158.6_bb4', '10158.6_bb5', '10158.6_bb6', '9117.8_raw', '9117.8_qc', '9117.8_trim150', '9117.8_ftrim', '9117.8_ktrim', '9117.8_atrim', '9117.8_aqbtrim', '9117.8_aqtrim', '9117.8_qbtrim', '9117.8_qtrim', '9117.8_bb1', '9117.8_bb2', '9117.8_bb3', '9117.8_bb4', '9117.8_bb5', '9117.8_bb6', '9108.2_raw', '9108.2_qc', '9108.2_trim150', '9108.2_ftrim', '9108.2_ktrim', '9108.2_atrim', '9108.2_aqbtrim', '9108.2_aqtrim', '9108.2_qbtrim', '9108.2_qtrim', '9108.2_bb1', '9108.2_bb2', '9108.2_bb3', '9108.2_bb4', '9108.2_bb5', '9108.2_bb6', '9117.7_raw', '9117.7_qc', '9117.7_trim150', '9117.7_ftrimmed', '9117.7_ktrimmed', '9117.7_atrimmed', '9117.7_aqbtrimmed', '9117.7_aqtrimmed', '9117.7_qbtrimmed', '9117.7_qtrimmed', '9117.7_bb1', '9117.7_bb2', '9117.7_bb3', '9117.7_bb4', '9117.7_bb5', '9117.7_bb6', '11306.3_raw', '11306.3_qc', '11306.3_trim150', '11306.3_ftrimmed', '11306.3_ktrimmed', '11306.3_atrimmed', '11306.3_aqbtrimmed', '11306.3_aqtrimmed', '11306.3_qbtrimmed', '11306.3_qtrimmed', '11306.3_bb1', '11306.3_bb2', '11306.3_bb3', '11306.3_bb4', '11306.3_bb5', '11306.3_bb6', '9117.4_raw', '9117.4_qc', '9117.4_trim150', '9117.4_ftrimmed', '9117.4_ktrimmed', '9117.4_atrimmed', '9117.4_aqbtrimmed', '9117.4_aqtrimmed', '9117.4_qbtrimmed', '9117.4_qtrimmed', '9117.4_bb1', '9117.4_bb2', '9117.4_bb3', '9117.4_bb4', '9117.4_bb5', '9117.4_bb6']
td_tMreads = [36.0129894, 35.2337896, 36.0129894, 36.0129894, 35.8983284, 35.933862, 34.552143, 31.2682706, 34.3282984, 30.6449696, 35.8442254, 31.2345964, 30.615651, 35.3058416, 34.4731868, 34.2527702, 28.2058246, 26.5787996, 28.2058246, 28.2058246, 27.66738, 27.6666568, 27.2469874, 25.5238858, 27.2469874, 25.2340502, 26.886397, 24.8035162, 24.519662, 26.7855648, 26.4733418, 26.3827484, 39.9148934, 37.3791976, 39.9148934, 39.9148934, 38.7122998, 38.708094, 37.7905962, 34.6650054, 37.6107554, 34.1456456, 37.8858836, 33.921016, 33.4111558, 37.7734844, 36.9821596, 36.8056446, 33.4711394, 32.6004118, 33.4711394, 33.4711394, 33.04149, 33.0383456, 30.3924568, 32.5038064, 30.0542808, 32.3956334, 32.9696938, 30.3601654, 30.025172, 32.8646394, 32.4398904, 32.3335082, 31.8428354, 31.3942576, 31.8428354, 31.8428354, 31.637986, 31.6358996, 29.2361886, 31.0176556, 28.9929926, 30.9364282, 31.6358006, 29.2361776, 28.9929822, 31.5146638, 31.0175998, 30.9363772, 34.9441596, 32.3312534, 34.9441596, 34.9441596, 33.4723724, 33.4683444, 30.8153706, 32.9387816, 30.4748512, 32.8289102, 32.6738296, 30.0963168, 29.7659662, 32.561628, 32.1535464, 32.0471204]
td_Bbases = [54.379613994, 52.641056249, 54.0194841, 51.498574842, 50.780358538, 53.669962132, 51.549471661, 45.964065176, 48.624647571, 42.845175292, 53.535374086, 45.919226343, 42.80751312, 53.311820816, 51.43795566, 48.522733084, 42.590795146, 40.133987396, 42.3087369, 40.334329178, 39.44260618, 41.667343504, 40.776782835, 37.725547193, 40.776782835, 35.427126679, 40.490156646, 36.660246851, 34.422663647, 40.446202848, 39.621053097, 37.463971015, 60.271489034, 56.442588376, 59.8723401, 57.078297562, 55.202040634, 58.310667004, 56.435790641, 51.019814812, 53.310062558, 47.762707137, 57.069909524, 49.92030233, 46.728770205, 57.037961444, 55.229109813, 52.166971365, 50.541420494, 49.226621818, 50.2067091, 47.863729342, 47.116772654, 49.7656626, 44.974095956, 48.667868199, 42.241274113, 46.022382453, 49.6630349, 44.930324539, 42.203344352, 49.625605494, 48.576089344, 45.937352987, 48.082681454, 47.024019995, 47.7642531, 45.535254622, 45.103175218, 47.640851402, 43.590661125, 46.584185484, 41.025529046, 44.063678425, 47.640703902, 43.590646205, 41.025515524, 47.587142338, 46.584108342, 44.063610873, 52.765680996, 48.820192634, 52.4162394, 49.970148228, 47.71891569, 50.406042552, 45.62052235, 49.326011714, 42.846059312, 46.640424087, 49.207539094, 44.559568777, 41.852323333, 49.16805828, 48.15227162, 45.531321684]
td_bins = [78, 85, 82, 83, 85, 83, 82, 78, 79, 63, 90, 72, 67, 78, 85, 83, 65, 62, 64, 52, 55, 59, 59, 56, 54, 50, 62, 55, 50, 60, 61, 53, 69, 76, 74, 75, 75, 71, 73, 72, 68, 65, 74, 71, 66, 74, 74, 65, 95, 103, 107, 96, 97, 103, 88, 98, 79, 81, 104, 90, 76, 101, 101, 90, 99, 100, 96, 97, 94, 97, 96, 97, 95, 82, 101, 96, 80, 97, 96, 91, 70, 68, 70, 65, 68, 69, 64, 61, 65, 64, 77, 62, 64, 78, 65, 62]
td_Mean_Completeness = [53.86, 53.28, 56.31, 57.37, 54.68, 50.1, 58.05, 54.04, 51.16, 54.76, 51.12, 52.23, 57.52, 52.15, 58.83, 54.29, 55.06, 54.27, 53.37, 53.16, 53.77, 57.81, 54.55, 54.88, 54.17, 52.29, 54.94, 52.04, 51.41, 58.36, 54.69, 56.81, 56.66, 53.72, 57.38, 56.22, 52.7, 52.65, 55.11, 54.88, 57.48, 54.64, 53.6, 54.34, 58.18, 52.11, 56.26, 58.28, 55.64, 57.45, 58.55, 54.54, 57.45, 58.65, 55.18, 55.5, 58.25, 60.02, 55.82, 58.17, 60.34, 56.8, 57.0, 59.14, 54.4, 51.04, 58.28, 56.73, 48.21, 51.68, 52.66, 56.77, 55.24, 57.44, 58.56, 57.33, 54.41, 56.39, 53.78, 58.04, 57.17, 59.53, 56.07, 51.82, 58.95, 60.91, 56.23, 56.59, 60.48, 59.56, 56.87, 61.29, 54.39, 60.71, 53.56, 53.89]
td_Mean_Contamination = [64.8, 56.57, 63.75, 56.14, 63.41, 65.77, 68.71, 59.0, 51.31, 53.39, 53.35, 56.24, 59.82, 71.72, 69.63, 66.78, 53.25, 46.8, 58.27, 54.69, 48.23, 50.47, 57.38, 53.78, 50.02, 45.22, 47.86, 47.19, 54.22, 48.53, 54.71, 53.82, 83.63, 71.19, 88.26, 89.13, 73.4, 65.59, 87.02, 85.4, 79.15, 67.55, 71.37, 72.84, 82.49, 64.29, 92.78, 85.56, 107.95, 99.99, 99.71, 85.91, 99.47, 78.31, 82.71, 107.35, 97.95, 92.98, 87.49, 93.41, 102.6, 77.96, 97.57, 99.91, 83.7, 63.09, 68.23, 69.77, 73.61, 78.21, 70.56, 69.82, 69.56, 64.5, 63.16, 82.06, 72.7, 69.26, 75.71, 62.79, 62.01, 54.02, 69.1, 58.79, 55.27, 51.99, 60.61, 57.98, 59.85, 65.24, 56.41, 58.49, 58.44, 53.79, 57.06, 54.33]
td_good_bins = [21, 19, 19, 14, 15, 20, 20, 17, 17, 12, 19, 16, 12, 22, 20, 17, 16, 16, 17, 17, 15, 17, 19, 15, 15, 14, 16, 16, 15, 18, 19, 16, 18, 17, 17, 16, 16, 16, 17, 17, 14, 15, 17, 15, 15, 19, 15, 17, 22, 21, 18, 19, 21, 22, 17, 21, 21, 16, 23, 16, 17, 24, 20, 21, 19, 21, 21, 15, 18, 19, 20, 18, 15, 17, 19, 18, 18, 18, 19, 16, 17, 17, 19, 18, 18, 17, 19, 17, 19, 18, 18, 16, 18, 21, 20, 17]
td_good_Mean_Completeness = [85.98, 87.17, 86.9, 87.66, 86.64, 86.04, 85.18, 86.79, 86.35, 86.03, 88.47, 85.58, 89.46, 86.17, 83.85, 86.86, 87.87, 87.38, 87.61, 86.94, 87.35, 86.96, 88.23, 88.04, 88.62, 90.16, 88.26, 89.11, 86.1, 88.26, 87.55, 87.21, 87.51, 86.6, 87.87, 87.83, 86.62, 87.87, 87.12, 87.7, 87.06, 87.94, 87.37, 85.69, 87.28, 85.92, 88.11, 86.67, 87.68, 87.33, 87.89, 88.48, 89.21, 88.4, 86.1, 86.69, 87.99, 88.53, 89.18, 87.33, 86.83, 88.58, 87.12, 87.34, 88.54, 86.69, 87.03, 86.06, 88.99, 86.81, 86.17, 86.12, 87.78, 85.64, 86.41, 87.08, 85.98, 88.56, 87.42, 87.18, 85.99, 87.07, 86.97, 86.86, 86.75, 89.43, 86.51, 86.19, 86.17, 85.87, 86.68, 87.63, 86.26, 89.38, 87.22, 87.03]
td_good_Mean_Contamination = [5.06, 4.78, 4.28, 4.58, 3.83, 4.25, 4.66, 4.68, 4.57, 4.01, 4.01, 4.34, 4.43, 4.96, 4.23, 5.17, 3.79, 3.63, 3.76, 3.61, 3.99, 3.66, 4.04, 3.17, 3.67, 3.37, 4.3, 3.39, 3.89, 3.63, 4.05, 4.51, 3.87, 3.93, 3.27, 3.48, 4.16, 4.88, 3.89, 3.16, 4.65, 4.07, 4.41, 4.22, 4.07, 3.78, 3.85, 4.16, 3.92, 3.06, 4.2, 4.41, 3.21, 3.81, 4.31, 3.64, 3.84, 4.09, 3.84, 3.64, 3.48, 3.71, 3.4, 4.06, 3.76, 4.32, 4.61, 4.74, 3.56, 4.54, 4.1, 4.05, 4.26, 4.31, 4.84, 4.06, 3.88, 4.59, 4.63, 3.76, 3.57, 3.88, 3.56, 3.5, 3.68, 4.71, 3.46, 3.84, 3.94, 3.61, 3.5, 3.36, 3.4, 3.57, 3.91, 4.21]
r_read_files = ['9117.5_raw', '10158.8_raw', '11263.1_raw', '11306.3_raw', '11306.1_raw', '11260.6_raw', '11260.5_raw', '9108.1_raw', '9053.2_raw', '9672.8_raw', '9108.2_raw', '9053.4_raw', '9053.3_raw', '9117.4_raw', '9117.6_raw', '9117.7_raw', '9117.8_raw', '10158.6_raw', '10186.3_raw', '10186.4_raw', '7331.1_raw', '9053.5_raw', '9041.8_raw']
r_tMreads = [36.0129894, 17.6218972, 38.2800142, 34.9076424, 35.3037194, 37.1504476, 40.3613864, 20.7773948, 31.8428354, 30.1166938, 27.718318, 40.8492618, 39.7169858, 34.4581152, 26.9696492, 21.3309852, 39.9148934, 34.9441596, 35.690255, 35.5019026, 33.4711394, 28.2058246, 36.96984]
r_Bbases = [54.379613994, 26.609064772, 57.802821442, 52.710540024, 53.308616294, 56.097175876, 60.945693464, 31.373866148, 48.082681454, 45.1750407, 41.85466018, 61.682385318, 59.972648558, 52.031753952, 40.724170292, 32.209787652, 60.271489034, 52.765680996, 53.89228505, 53.607872926, 50.541420494, 42.590795146, 55.8244584]
r_bins = [65, 47, 139, 99, 55, 90, 115, 38, 86, 87, 69, 71, 95, 70, 62, 95, 65, 78, 85, 109, 45, 49, 52]
r_Mean_Completeness = [45.96, 58.83, 51.28, 56.54, 56.55, 63.26, 52.23, 58.47, 54.69, 53.7, 58.18, 60.32, 62.41, 65.52, 52.14, 50.97, 56.26, 57.0, 65.39, 54.89, 53.78, 53.56, 52.81]
r_Mean_Contamination = [24.03, 69.63, 67.13, 60.4, 53.21, 81.43, 55.06, 35.14, 54.71, 74.62, 54.7, 76.81, 66.86, 78.94, 67.85, 46.5, 92.78, 97.57, 121.14, 103.47, 75.71, 57.06, 65.71]
r_good_bins = [15, 4, 23, 19, 14, 25, 29, 9, 20, 20, 18, 18, 23, 17, 10, 22, 16, 21, 18, 20, 9, 9, 14]
r_good_Mean_Completeness = [84.79, 83.85, 89.19, 87.14, 90.06, 87.41, 90.7, 86.6, 87.55, 85.4, 87.31, 86.68, 86.18, 84.94, 87.02, 88.92, 88.11, 87.12, 87.24, 89.3, 87.42, 87.22, 86.64]
r_good_Mean_Contamination = [3.37, 4.23, 3.57, 4.4, 3.62, 4.4, 4.56, 3.0, 4.05, 4.32, 4.53, 3.9, 4.48, 4.17, 2.86, 2.87, 3.85, 3.4, 3.63, 4.37, 4.63, 3.91, 4.25]

#create dataset
df = pd.DataFrame({'td_tMreads': td_tMreads,
                   'td_Bbases': td_Bbases,
                   'td_bins': td_bins,
                   'td_Mean_Completeness': td_Mean_Completeness,
                   'td_Mean_Contamination': td_Mean_Contamination,
                   'td_good_bins': td_good_bins,
                   'td_good_Mean_Completeness': td_good_Mean_Completeness,
                   'td_good_Mean_Contamination': td_good_Mean_Contamination,
                   'Mix_Group': Mix_Group})
df.rename(columns={'td_tMreads': 'tMreads', 'td_Bbases': 'Bbases'}, inplace=True)

df2 = pd.DataFrame({'r_tMreads': r_tMreads,
                   'r_Bbases': r_Bbases,
                   'r_bins': r_bins,
                   'r_Mean_Completeness': r_Mean_Completeness,
                   'r_Mean_Contamination': r_Mean_Contamination,
                   'r_good_bins': r_good_bins,
                   'r_good_Mean_Completeness': r_good_Mean_Completeness,
                   'r_good_Mean_Contamination': r_good_Mean_Contamination})
df2.rename(columns={'r_tMreads': 'tMreads', 'r_Bbases': 'Bbases'}, inplace=True)

#view dataset
#print(df)

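#fit regression models: a mixed model with a random intercept per source readset (Mix_Group),
#plus ordinary least squares fits for the trimmed/decontaminated (df) and raw (df2) datasets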
model = smf.mixedlm("td_good_bins ~ Bbases", data=df, groups=df["Mix_Group"])
modelf = model.fit()
model1 = ols('td_good_bins ~ Bbases', data=df).fit()
model2 = ols('r_good_bins ~ Bbases', data=df2).fit()

#with a single predictor, R^2 is the square of the Pearson product-moment correlation coefficient (r);
#adjusted R^2 additionally corrects R^2 for the number of predictors
#raw reads (model2): sqrt(adj. R^2) = sqrt(0.296) = 0.544

#mdf = md.fit()
#print(mdf.summary())

#view model summary
print(modelf.summary())
print(model1.summary())
print(model2.summary())

#define figure size
fig = plt.figure(figsize=(12,8))
fig2 = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(model1, 'Bbases', fig=fig)
fig2 = sm.graphics.plot_regress_exog(model2, 'Bbases', fig=fig2)

          Mixed Linear Model Regression Results
==========================================================
Model:            MixedLM Dependent Variable: td_good_bins
No. Observations: 96      Method:             REML        
No. Groups:       6       Scale:              2.9316      
Min. group size:  16      Likelihood:         -197.6067   
Max. group size:  16      Converged:          Yes         
Mean group size:  16.0                                    
-----------------------------------------------------------
              Coef.  Std.Err.    z    P>|z|  [0.025  0.975]
-----------------------------------------------------------
Intercept     3.079     2.863  1.076  0.282  -2.531   8.690
Bbases        0.309     0.058  5.359  0.000   0.196   0.421
Group Var     4.218     1.763                              
==========================================================

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           td_good_bins   R-squared:                       0.059
Model:                            OLS   Adj. R-squared:                  0.049
Method:                 Least Squares   F-statistic:                     5.911
Date:                Sun, 28 Mar 2021   Prob (F-statistic):             0.0169
Time:                        22:29:45   Log-Likelihood:                -213.59
No. Observations:                  96   AIC:                             431.2
Df Residuals:                      94   BIC:                             436.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     12.9052      1.993      6.474      0.000       8.948      16.863
Bbases         0.1014      0.042      2.431      0.017       0.019       0.184
==============================================================================
Omnibus:                        0.552   Durbin-Watson:                   1.429
Prob(Omnibus):                  0.759   Jarque-Bera (JB):                0.576
Skew:                           0.175   Prob(JB):                        0.750
Kurtosis:                       2.854   Cond. No.                         413.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            r_good_bins   R-squared:                       0.328
Model:                            OLS   Adj. R-squared:                  0.296
Method:                 Least Squares   F-statistic:                     10.26
Date:                Sun, 28 Mar 2021   Prob (F-statistic):            0.00428
Time:                        22:29:45   Log-Likelihood:                -68.700
No. Observations:                  23   AIC:                             141.4
Df Residuals:                      21   BIC:                             143.7
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.3125      5.533     -0.056      0.955     -11.818      11.193
Bbases         0.3497      0.109      3.203      0.004       0.123       0.577
==============================================================================
Omnibus:                        0.824   Durbin-Watson:                   2.042
Prob(Omnibus):                  0.662   Jarque-Bera (JB):                0.831
Skew:                           0.369   Prob(JB):                        0.660
Kurtosis:                       2.432   Cond. No.                         268.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

<Figure size 864x576 with 0 Axes>
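
The relationship between R-squared, adjusted R-squared, and the correlation coefficient noted in the code comments can be checked by hand. The sketch below is a minimal illustration, not part of the original analysis: the helper name `adjusted_r2` is hypothetical, and the R-squared values and observation counts are copied from the two OLS summaries printed above. It applies the standard formula adj. R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1) with p = 1 predictor.

```python
import math

def adjusted_r2(r2: float, n_obs: int, n_predictors: int = 1) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# (R-squared, number of observations) copied from the OLS summaries above
summaries = {
    "td_good_bins ~ Bbases": (0.059, 96),
    "r_good_bins ~ Bbases": (0.328, 23),
}

for label, (r2, n_obs) in summaries.items():
    adj = adjusted_r2(r2, n_obs)
    # For one predictor, |r| is sqrt(R^2); sqrt(adj. R^2) is the adjusted analogue
    print(f"{label}: |r| = {math.sqrt(r2):.3f}, "
          f"adj. R^2 = {adj:.3f}, sqrt(adj. R^2) = {math.sqrt(adj):.3f}")
```

Because the inputs are the rounded values from the printed summaries, results may differ from the statsmodels output in the last decimal place.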

#Medium MAG counts were correlated with read counts of trimmed and decontaminated reads
#Medium MAG counts were correlated with read counts of raw reads at alpha = 0.05

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols

#data
Mix_Group = ['10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4']
td_read_files = ['10158.6_raw', '10158.6_qc', '10158.6_trim150', '10158.6_ftrim', '10158.6_ktrim', '10158.6_atrim', '10158.6_aqbtrim', '10158.6_aqtrim', '10158.6_qbtrim', '10158.6_qtrim', '10158.6_bb1', '10158.6_bb2', '10158.6_bb3', '10158.6_bb4', '10158.6_bb5', '10158.6_bb6', '9117.8_raw', '9117.8_qc', '9117.8_trim150', '9117.8_ftrim', '9117.8_ktrim', '9117.8_atrim', '9117.8_aqbtrim', '9117.8_aqtrim', '9117.8_qbtrim', '9117.8_qtrim', '9117.8_bb1', '9117.8_bb2', '9117.8_bb3', '9117.8_bb4', '9117.8_bb5', '9117.8_bb6', '9108.2_raw', '9108.2_qc', '9108.2_trim150', '9108.2_ftrim', '9108.2_ktrim', '9108.2_atrim', '9108.2_aqbtrim', '9108.2_aqtrim', '9108.2_qbtrim', '9108.2_qtrim', '9108.2_bb1', '9108.2_bb2', '9108.2_bb3', '9108.2_bb4', '9108.2_bb5', '9108.2_bb6', '9117.7_raw', '9117.7_qc', '9117.7_trim150', '9117.7_ftrimmed', '9117.7_ktrimmed', '9117.7_atrimmed', '9117.7_aqbtrimmed', '9117.7_aqtrimmed', '9117.7_qbtrimmed', '9117.7_qtrimmed', '9117.7_bb1', '9117.7_bb2', '9117.7_bb3', '9117.7_bb4', '9117.7_bb5', '9117.7_bb6', '11306.3_raw', '11306.3_qc', '11306.3_trim150', '11306.3_ftrimmed', '11306.3_ktrimmed', '11306.3_atrimmed', '11306.3_aqbtrimmed', '11306.3_aqtrimmed', '11306.3_qbtrimmed', '11306.3_qtrimmed', '11306.3_bb1', '11306.3_bb2', '11306.3_bb3', '11306.3_bb4', '11306.3_bb5', '11306.3_bb6', '9117.4_raw', '9117.4_qc', '9117.4_trim150', '9117.4_ftrimmed', '9117.4_ktrimmed', '9117.4_atrimmed', '9117.4_aqbtrimmed', '9117.4_aqtrimmed', '9117.4_qbtrimmed', '9117.4_qtrimmed', '9117.4_bb1', '9117.4_bb2', '9117.4_bb3', '9117.4_bb4', '9117.4_bb5', '9117.4_bb6']
td_tMreads = [36.0129894, 35.2337896, 36.0129894, 36.0129894, 35.8983284, 35.933862, 34.552143, 31.2682706, 34.3282984, 30.6449696, 35.8442254, 31.2345964, 30.615651, 35.3058416, 34.4731868, 34.2527702, 28.2058246, 26.5787996, 28.2058246, 28.2058246, 27.66738, 27.6666568, 27.2469874, 25.5238858, 27.2469874, 25.2340502, 26.886397, 24.8035162, 24.519662, 26.7855648, 26.4733418, 26.3827484, 39.9148934, 37.3791976, 39.9148934, 39.9148934, 38.7122998, 38.708094, 37.7905962, 34.6650054, 37.6107554, 34.1456456, 37.8858836, 33.921016, 33.4111558, 37.7734844, 36.9821596, 36.8056446, 33.4711394, 32.6004118, 33.4711394, 33.4711394, 33.04149, 33.0383456, 30.3924568, 32.5038064, 30.0542808, 32.3956334, 32.9696938, 30.3601654, 30.025172, 32.8646394, 32.4398904, 32.3335082, 31.8428354, 31.3942576, 31.8428354, 31.8428354, 31.637986, 31.6358996, 29.2361886, 31.0176556, 28.9929926, 30.9364282, 31.6358006, 29.2361776, 28.9929822, 31.5146638, 31.0175998, 30.9363772, 34.9441596, 32.3312534, 34.9441596, 34.9441596, 33.4723724, 33.4683444, 30.8153706, 32.9387816, 30.4748512, 32.8289102, 32.6738296, 30.0963168, 29.7659662, 32.561628, 32.1535464, 32.0471204]
td_Bbases = [54.379613994, 52.641056249, 54.0194841, 51.498574842, 50.780358538, 53.669962132, 51.549471661, 45.964065176, 48.624647571, 42.845175292, 53.535374086, 45.919226343, 42.80751312, 53.311820816, 51.43795566, 48.522733084, 42.590795146, 40.133987396, 42.3087369, 40.334329178, 39.44260618, 41.667343504, 40.776782835, 37.725547193, 40.776782835, 35.427126679, 40.490156646, 36.660246851, 34.422663647, 40.446202848, 39.621053097, 37.463971015, 60.271489034, 56.442588376, 59.8723401, 57.078297562, 55.202040634, 58.310667004, 56.435790641, 51.019814812, 53.310062558, 47.762707137, 57.069909524, 49.92030233, 46.728770205, 57.037961444, 55.229109813, 52.166971365, 50.541420494, 49.226621818, 50.2067091, 47.863729342, 47.116772654, 49.7656626, 44.974095956, 48.667868199, 42.241274113, 46.022382453, 49.6630349, 44.930324539, 42.203344352, 49.625605494, 48.576089344, 45.937352987, 48.082681454, 47.024019995, 47.7642531, 45.535254622, 45.103175218, 47.640851402, 43.590661125, 46.584185484, 41.025529046, 44.063678425, 47.640703902, 43.590646205, 41.025515524, 47.587142338, 46.584108342, 44.063610873, 52.765680996, 48.820192634, 52.4162394, 49.970148228, 47.71891569, 50.406042552, 45.62052235, 49.326011714, 42.846059312, 46.640424087, 49.207539094, 44.559568777, 41.852323333, 49.16805828, 48.15227162, 45.531321684]
td_bins = [78, 85, 82, 83, 85, 83, 82, 78, 79, 63, 90, 72, 67, 78, 85, 83, 65, 62, 64, 52, 55, 59, 59, 56, 54, 50, 62, 55, 50, 60, 61, 53, 69, 76, 74, 75, 75, 71, 73, 72, 68, 65, 74, 71, 66, 74, 74, 65, 95, 103, 107, 96, 97, 103, 88, 98, 79, 81, 104, 90, 76, 101, 101, 90, 99, 100, 96, 97, 94, 97, 96, 97, 95, 82, 101, 96, 80, 97, 96, 91, 70, 68, 70, 65, 68, 69, 64, 61, 65, 64, 77, 62, 64, 78, 65, 62]
td_Mean_Completeness = [53.86, 53.28, 56.31, 57.37, 54.68, 50.1, 58.05, 54.04, 51.16, 54.76, 51.12, 52.23, 57.52, 52.15, 58.83, 54.29, 55.06, 54.27, 53.37, 53.16, 53.77, 57.81, 54.55, 54.88, 54.17, 52.29, 54.94, 52.04, 51.41, 58.36, 54.69, 56.81, 56.66, 53.72, 57.38, 56.22, 52.7, 52.65, 55.11, 54.88, 57.48, 54.64, 53.6, 54.34, 58.18, 52.11, 56.26, 58.28, 55.64, 57.45, 58.55, 54.54, 57.45, 58.65, 55.18, 55.5, 58.25, 60.02, 55.82, 58.17, 60.34, 56.8, 57.0, 59.14, 54.4, 51.04, 58.28, 56.73, 48.21, 51.68, 52.66, 56.77, 55.24, 57.44, 58.56, 57.33, 54.41, 56.39, 53.78, 58.04, 57.17, 59.53, 56.07, 51.82, 58.95, 60.91, 56.23, 56.59, 60.48, 59.56, 56.87, 61.29, 54.39, 60.71, 53.56, 53.89]
td_Mean_Contamination = [64.8, 56.57, 63.75, 56.14, 63.41, 65.77, 68.71, 59.0, 51.31, 53.39, 53.35, 56.24, 59.82, 71.72, 69.63, 66.78, 53.25, 46.8, 58.27, 54.69, 48.23, 50.47, 57.38, 53.78, 50.02, 45.22, 47.86, 47.19, 54.22, 48.53, 54.71, 53.82, 83.63, 71.19, 88.26, 89.13, 73.4, 65.59, 87.02, 85.4, 79.15, 67.55, 71.37, 72.84, 82.49, 64.29, 92.78, 85.56, 107.95, 99.99, 99.71, 85.91, 99.47, 78.31, 82.71, 107.35, 97.95, 92.98, 87.49, 93.41, 102.6, 77.96, 97.57, 99.91, 83.7, 63.09, 68.23, 69.77, 73.61, 78.21, 70.56, 69.82, 69.56, 64.5, 63.16, 82.06, 72.7, 69.26, 75.71, 62.79, 62.01, 54.02, 69.1, 58.79, 55.27, 51.99, 60.61, 57.98, 59.85, 65.24, 56.41, 58.49, 58.44, 53.79, 57.06, 54.33]
td_good_bins = [21, 19, 19, 14, 15, 20, 20, 17, 17, 12, 19, 16, 12, 22, 20, 17, 16, 16, 17, 17, 15, 17, 19, 15, 15, 14, 16, 16, 15, 18, 19, 16, 18, 17, 17, 16, 16, 16, 17, 17, 14, 15, 17, 15, 15, 19, 15, 17, 22, 21, 18, 19, 21, 22, 17, 21, 21, 16, 23, 16, 17, 24, 20, 21, 19, 21, 21, 15, 18, 19, 20, 18, 15, 17, 19, 18, 18, 18, 19, 16, 17, 17, 19, 18, 18, 17, 19, 17, 19, 18, 18, 16, 18, 21, 20, 17]
td_good_Mean_Completeness = [85.98, 87.17, 86.9, 87.66, 86.64, 86.04, 85.18, 86.79, 86.35, 86.03, 88.47, 85.58, 89.46, 86.17, 83.85, 86.86, 87.87, 87.38, 87.61, 86.94, 87.35, 86.96, 88.23, 88.04, 88.62, 90.16, 88.26, 89.11, 86.1, 88.26, 87.55, 87.21, 87.51, 86.6, 87.87, 87.83, 86.62, 87.87, 87.12, 87.7, 87.06, 87.94, 87.37, 85.69, 87.28, 85.92, 88.11, 86.67, 87.68, 87.33, 87.89, 88.48, 89.21, 88.4, 86.1, 86.69, 87.99, 88.53, 89.18, 87.33, 86.83, 88.58, 87.12, 87.34, 88.54, 86.69, 87.03, 86.06, 88.99, 86.81, 86.17, 86.12, 87.78, 85.64, 86.41, 87.08, 85.98, 88.56, 87.42, 87.18, 85.99, 87.07, 86.97, 86.86, 86.75, 89.43, 86.51, 86.19, 86.17, 85.87, 86.68, 87.63, 86.26, 89.38, 87.22, 87.03]
td_good_Mean_Contamination = [5.06, 4.78, 4.28, 4.58, 3.83, 4.25, 4.66, 4.68, 4.57, 4.01, 4.01, 4.34, 4.43, 4.96, 4.23, 5.17, 3.79, 3.63, 3.76, 3.61, 3.99, 3.66, 4.04, 3.17, 3.67, 3.37, 4.3, 3.39, 3.89, 3.63, 4.05, 4.51, 3.87, 3.93, 3.27, 3.48, 4.16, 4.88, 3.89, 3.16, 4.65, 4.07, 4.41, 4.22, 4.07, 3.78, 3.85, 4.16, 3.92, 3.06, 4.2, 4.41, 3.21, 3.81, 4.31, 3.64, 3.84, 4.09, 3.84, 3.64, 3.48, 3.71, 3.4, 4.06, 3.76, 4.32, 4.61, 4.74, 3.56, 4.54, 4.1, 4.05, 4.26, 4.31, 4.84, 4.06, 3.88, 4.59, 4.63, 3.76, 3.57, 3.88, 3.56, 3.5, 3.68, 4.71, 3.46, 3.84, 3.94, 3.61, 3.5, 3.36, 3.4, 3.57, 3.91, 4.21]
td_medium_bins = [32, 33, 32, 24, 26, 35, 33, 26, 29, 18, 35, 27, 18, 33, 32, 28, 22, 20, 20, 23, 23, 21, 25, 22, 21, 22, 22, 23, 22, 22, 26, 21, 28, 27, 30, 26, 25, 27, 29, 23, 24, 21, 28, 23, 20, 26, 26, 24, 36, 34, 31, 26, 29, 36, 33, 29, 27, 21, 39, 24, 22, 37, 32, 28, 28, 28, 31, 26, 32, 30, 32, 29, 28, 29, 31, 28, 30, 32, 30, 27, 22, 23, 28, 26, 24, 23, 24, 25, 27, 23, 25, 27, 25, 25, 25, 25]
r_read_files = ['9117.5_raw', '10158.8_raw', '11263.1_raw', '11306.3_raw', '11306.1_raw', '11260.6_raw', '11260.5_raw', '9108.1_raw', '9053.2_raw', '9672.8_raw', '9108.2_raw', '9053.4_raw', '9053.3_raw', '9117.4_raw', '9117.6_raw', '9117.7_raw', '9117.8_raw', '10158.6_raw', '10186.3_raw', '10186.4_raw', '7331.1_raw', '9053.5_raw', '9041.8_raw']
r_tMreads = [36.0129894, 17.6218972, 38.2800142, 34.9076424, 35.3037194, 37.1504476, 40.3613864, 20.7773948, 31.8428354, 30.1166938, 27.718318, 40.8492618, 39.7169858, 34.4581152, 26.9696492, 21.3309852, 39.9148934, 34.9441596, 35.690255, 35.5019026, 33.4711394, 28.2058246, 36.96984]
r_Bbases = [54.379613994, 26.609064772, 57.802821442, 52.710540024, 53.308616294, 56.097175876, 60.945693464, 31.373866148, 48.082681454, 45.1750407, 41.85466018, 61.682385318, 59.972648558, 52.031753952, 40.724170292, 32.209787652, 60.271489034, 52.765680996, 53.89228505, 53.607872926, 50.541420494, 42.590795146, 55.8244584]
r_bins = [65, 47, 139, 99, 55, 90, 115, 38, 86, 87, 69, 71, 95, 70, 62, 95, 65, 78, 85, 109, 45, 49, 52]
r_Mean_Completeness = [45.96, 58.83, 51.28, 56.54, 56.55, 63.26, 52.23, 58.47, 54.69, 53.7, 58.18, 60.32, 62.41, 65.52, 52.14, 50.97, 56.26, 57.0, 65.39, 54.89, 53.78, 53.56, 52.81]
r_Mean_Contamination = [24.03, 69.63, 67.13, 60.4, 53.21, 81.43, 55.06, 35.14, 54.71, 74.62, 54.7, 76.81, 66.86, 78.94, 67.85, 46.5, 92.78, 97.57, 121.14, 103.47, 75.71, 57.06, 65.71]
r_good_bins = [15, 4, 23, 19, 14, 25, 29, 9, 20, 20, 18, 18, 23, 17, 10, 22, 16, 21, 18, 20, 9, 9, 14]
r_medium_bins = [24, 9, 37, 28, 22, 36, 41, 14, 34, 26, 28, 27, 33, 22, 16, 36, 22, 32, 28, 31, 14, 16, 20]
r_good_Mean_Completeness = [84.79, 83.85, 89.19, 87.14, 90.06, 87.41, 90.7, 86.6, 87.55, 85.4, 87.31, 86.68, 86.18, 84.94, 87.02, 88.92, 88.11, 87.12, 87.24, 89.3, 87.42, 87.22, 86.64]
r_good_Mean_Contamination = [3.37, 4.23, 3.57, 4.4, 3.62, 4.4, 4.56, 3.0, 4.05, 4.32, 4.53, 3.9, 4.48, 4.17, 2.86, 2.87, 3.85, 3.4, 3.63, 4.37, 4.63, 3.91, 4.25]


#create dataset
df = pd.DataFrame({'td_tMreads': td_tMreads,
                   'td_Bbases': td_Bbases,
                   'td_bins': td_bins,
                   'td_Mean_Completeness': td_Mean_Completeness,
                   'td_Mean_Contamination': td_Mean_Contamination,
                   'td_good_bins': td_good_bins,
                   'td_medium_bins': td_medium_bins,
                   'td_good_Mean_Completeness': td_good_Mean_Completeness,
                   'td_good_Mean_Contamination': td_good_Mean_Contamination,
                   'Mix_Group': Mix_Group})
df.rename(columns={'td_tMreads': 'tMreads', 'td_Bbases': 'Bbases'}, inplace=True)

df2 = pd.DataFrame({'r_tMreads': r_tMreads,
                   'r_Bbases': r_Bbases,
                   'r_bins': r_bins,
                   'r_Mean_Completeness': r_Mean_Completeness,
                   'r_Mean_Contamination': r_Mean_Contamination,
                   'r_good_bins': r_good_bins,
                   'r_medium_bins': r_medium_bins,
                   'r_good_Mean_Completeness': r_good_Mean_Completeness,
                   'r_good_Mean_Contamination': r_good_Mean_Contamination})
df2.rename(columns={'r_tMreads': 'tMreads', 'r_Bbases': 'Bbases'}, inplace=True)

#view dataset
#print(df)

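#fit regression models: a mixed model with a random intercept per source readset (Mix_Group),
#plus ordinary least squares fits for the trimmed/decontaminated (df) and raw (df2) datasets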
model = smf.mixedlm("td_medium_bins ~ tMreads", data=df, groups=df["Mix_Group"])
modelf = model.fit()
model1 = ols('td_medium_bins ~ tMreads', data=df).fit()
model2 = ols('r_medium_bins ~ tMreads', data=df2).fit()

#with a single predictor, R^2 is the square of the Pearson product-moment correlation coefficient (r);
#adjusted R^2 additionally corrects R^2 for the number of predictors
#raw reads (model2): sqrt(adj. R^2) = sqrt(0.207) = 0.455

#mdf = md.fit()
#print(mdf.summary())

#view model summary
print(modelf.summary())
print(model1.summary())
print(model2.summary())

#define figure size
fig = plt.figure(figsize=(12,8))
fig2 = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(model1, 'tMreads', fig=fig)
fig2 = sm.graphics.plot_regress_exog(model2, 'tMreads', fig=fig2)

           Mixed Linear Model Regression Results
============================================================
Model:            MixedLM Dependent Variable: td_medium_bins
No. Observations: 96      Method:             REML          
No. Groups:       6       Scale:              10.0592       
Min. group size:  16      Likelihood:         -254.6484     
Max. group size:  16      Converged:          Yes           
Mean group size:  16.0                                      
-------------------------------------------------------------
             Coef.   Std.Err.    z     P>|z|   [0.025  0.975]
-------------------------------------------------------------
Intercept    -1.702     6.623  -0.257  0.797  -14.682  11.279
tMreads       0.883     0.200   4.422  0.000    0.492   1.275
Group Var    12.867     2.927                                
============================================================

                            OLS Regression Results                            
==============================================================================
Dep. Variable:         td_medium_bins   R-squared:                       0.119
Model:                            OLS   Adj. R-squared:                  0.110
Method:                 Least Squares   F-statistic:                     12.73
Date:                Sun, 11 Apr 2021   Prob (F-statistic):           0.000568
Time:                        12:21:36   Log-Likelihood:                -274.12
No. Observations:                  96   AIC:                             552.2
Df Residuals:                      94   BIC:                             557.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     12.7682      3.963      3.222      0.002       4.900      20.636
tMreads        0.4352      0.122      3.568      0.001       0.193       0.677
==============================================================================
Omnibus:                        3.754   Durbin-Watson:                   1.392
Prob(Omnibus):                  0.153   Jarque-Bera (JB):                3.763
Skew:                           0.463   Prob(JB):                        0.152
Kurtosis:                       2.711   Cond. No.                         297.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          r_medium_bins   R-squared:                       0.243
Model:                            OLS   Adj. R-squared:                  0.207
Method:                 Least Squares   F-statistic:                     6.745
Date:                Sun, 11 Apr 2021   Prob (F-statistic):             0.0168
Time:                        12:21:36   Log-Likelihood:                -78.203
No. Observations:                  23   AIC:                             160.4
Df Residuals:                      21   BIC:                             162.7
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.5665      8.370      0.546      0.591     -12.840      21.974
tMreads        0.6476      0.249      2.597      0.017       0.129       1.166
==============================================================================
Omnibus:                        1.144   Durbin-Watson:                   2.336
Prob(Omnibus):                  0.564   Jarque-Bera (JB):                1.006
Skew:                           0.461   Prob(JB):                        0.605
Kurtosis:                       2.554   Cond. No.                         178.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


#Medium MAG counts were correlated with base counts of trimmed and decontaminated reads
#Medium MAG counts were correlated with base counts of raw reads at alpha = 0.05

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols

#data
Mix_Group = ['10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '10158.6', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9117.8', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9108.2', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '9117.7', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '11306.3', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4', '9117.4']
td_read_files = ['10158.6_raw', '10158.6_qc', '10158.6_trim150', '10158.6_ftrim', '10158.6_ktrim', '10158.6_atrim', '10158.6_aqbtrim', '10158.6_aqtrim', '10158.6_qbtrim', '10158.6_qtrim', '10158.6_bb1', '10158.6_bb2', '10158.6_bb3', '10158.6_bb4', '10158.6_bb5', '10158.6_bb6', '9117.8_raw', '9117.8_qc', '9117.8_trim150', '9117.8_ftrim', '9117.8_ktrim', '9117.8_atrim', '9117.8_aqbtrim', '9117.8_aqtrim', '9117.8_qbtrim', '9117.8_qtrim', '9117.8_bb1', '9117.8_bb2', '9117.8_bb3', '9117.8_bb4', '9117.8_bb5', '9117.8_bb6', '9108.2_raw', '9108.2_qc', '9108.2_trim150', '9108.2_ftrim', '9108.2_ktrim', '9108.2_atrim', '9108.2_aqbtrim', '9108.2_aqtrim', '9108.2_qbtrim', '9108.2_qtrim', '9108.2_bb1', '9108.2_bb2', '9108.2_bb3', '9108.2_bb4', '9108.2_bb5', '9108.2_bb6', '9117.7_raw', '9117.7_qc', '9117.7_trim150', '9117.7_ftrimmed', '9117.7_ktrimmed', '9117.7_atrimmed', '9117.7_aqbtrimmed', '9117.7_aqtrimmed', '9117.7_qbtrimmed', '9117.7_qtrimmed', '9117.7_bb1', '9117.7_bb2', '9117.7_bb3', '9117.7_bb4', '9117.7_bb5', '9117.7_bb6', '11306.3_raw', '11306.3_qc', '11306.3_trim150', '11306.3_ftrimmed', '11306.3_ktrimmed', '11306.3_atrimmed', '11306.3_aqbtrimmed', '11306.3_aqtrimmed', '11306.3_qbtrimmed', '11306.3_qtrimmed', '11306.3_bb1', '11306.3_bb2', '11306.3_bb3', '11306.3_bb4', '11306.3_bb5', '11306.3_bb6', '9117.4_raw', '9117.4_qc', '9117.4_trim150', '9117.4_ftrimmed', '9117.4_ktrimmed', '9117.4_atrimmed', '9117.4_aqbtrimmed', '9117.4_aqtrimmed', '9117.4_qbtrimmed', '9117.4_qtrimmed', '9117.4_bb1', '9117.4_bb2', '9117.4_bb3', '9117.4_bb4', '9117.4_bb5', '9117.4_bb6']
td_tMreads = [36.0129894, 35.2337896, 36.0129894, 36.0129894, 35.8983284, 35.933862, 34.552143, 31.2682706, 34.3282984, 30.6449696, 35.8442254, 31.2345964, 30.615651, 35.3058416, 34.4731868, 34.2527702, 28.2058246, 26.5787996, 28.2058246, 28.2058246, 27.66738, 27.6666568, 27.2469874, 25.5238858, 27.2469874, 25.2340502, 26.886397, 24.8035162, 24.519662, 26.7855648, 26.4733418, 26.3827484, 39.9148934, 37.3791976, 39.9148934, 39.9148934, 38.7122998, 38.708094, 37.7905962, 34.6650054, 37.6107554, 34.1456456, 37.8858836, 33.921016, 33.4111558, 37.7734844, 36.9821596, 36.8056446, 33.4711394, 32.6004118, 33.4711394, 33.4711394, 33.04149, 33.0383456, 30.3924568, 32.5038064, 30.0542808, 32.3956334, 32.9696938, 30.3601654, 30.025172, 32.8646394, 32.4398904, 32.3335082, 31.8428354, 31.3942576, 31.8428354, 31.8428354, 31.637986, 31.6358996, 29.2361886, 31.0176556, 28.9929926, 30.9364282, 31.6358006, 29.2361776, 28.9929822, 31.5146638, 31.0175998, 30.9363772, 34.9441596, 32.3312534, 34.9441596, 34.9441596, 33.4723724, 33.4683444, 30.8153706, 32.9387816, 30.4748512, 32.8289102, 32.6738296, 30.0963168, 29.7659662, 32.561628, 32.1535464, 32.0471204]
td_Bbases = [54.379613994, 52.641056249, 54.0194841, 51.498574842, 50.780358538, 53.669962132, 51.549471661, 45.964065176, 48.624647571, 42.845175292, 53.535374086, 45.919226343, 42.80751312, 53.311820816, 51.43795566, 48.522733084, 42.590795146, 40.133987396, 42.3087369, 40.334329178, 39.44260618, 41.667343504, 40.776782835, 37.725547193, 40.776782835, 35.427126679, 40.490156646, 36.660246851, 34.422663647, 40.446202848, 39.621053097, 37.463971015, 60.271489034, 56.442588376, 59.8723401, 57.078297562, 55.202040634, 58.310667004, 56.435790641, 51.019814812, 53.310062558, 47.762707137, 57.069909524, 49.92030233, 46.728770205, 57.037961444, 55.229109813, 52.166971365, 50.541420494, 49.226621818, 50.2067091, 47.863729342, 47.116772654, 49.7656626, 44.974095956, 48.667868199, 42.241274113, 46.022382453, 49.6630349, 44.930324539, 42.203344352, 49.625605494, 48.576089344, 45.937352987, 48.082681454, 47.024019995, 47.7642531, 45.535254622, 45.103175218, 47.640851402, 43.590661125, 46.584185484, 41.025529046, 44.063678425, 47.640703902, 43.590646205, 41.025515524, 47.587142338, 46.584108342, 44.063610873, 52.765680996, 48.820192634, 52.4162394, 49.970148228, 47.71891569, 50.406042552, 45.62052235, 49.326011714, 42.846059312, 46.640424087, 49.207539094, 44.559568777, 41.852323333, 49.16805828, 48.15227162, 45.531321684]
td_bins = [78, 85, 82, 83, 85, 83, 82, 78, 79, 63, 90, 72, 67, 78, 85, 83, 65, 62, 64, 52, 55, 59, 59, 56, 54, 50, 62, 55, 50, 60, 61, 53, 69, 76, 74, 75, 75, 71, 73, 72, 68, 65, 74, 71, 66, 74, 74, 65, 95, 103, 107, 96, 97, 103, 88, 98, 79, 81, 104, 90, 76, 101, 101, 90, 99, 100, 96, 97, 94, 97, 96, 97, 95, 82, 101, 96, 80, 97, 96, 91, 70, 68, 70, 65, 68, 69, 64, 61, 65, 64, 77, 62, 64, 78, 65, 62]
td_Mean_Completeness = [53.86, 53.28, 56.31, 57.37, 54.68, 50.1, 58.05, 54.04, 51.16, 54.76, 51.12, 52.23, 57.52, 52.15, 58.83, 54.29, 55.06, 54.27, 53.37, 53.16, 53.77, 57.81, 54.55, 54.88, 54.17, 52.29, 54.94, 52.04, 51.41, 58.36, 54.69, 56.81, 56.66, 53.72, 57.38, 56.22, 52.7, 52.65, 55.11, 54.88, 57.48, 54.64, 53.6, 54.34, 58.18, 52.11, 56.26, 58.28, 55.64, 57.45, 58.55, 54.54, 57.45, 58.65, 55.18, 55.5, 58.25, 60.02, 55.82, 58.17, 60.34, 56.8, 57.0, 59.14, 54.4, 51.04, 58.28, 56.73, 48.21, 51.68, 52.66, 56.77, 55.24, 57.44, 58.56, 57.33, 54.41, 56.39, 53.78, 58.04, 57.17, 59.53, 56.07, 51.82, 58.95, 60.91, 56.23, 56.59, 60.48, 59.56, 56.87, 61.29, 54.39, 60.71, 53.56, 53.89]
td_Mean_Contamination = [64.8, 56.57, 63.75, 56.14, 63.41, 65.77, 68.71, 59.0, 51.31, 53.39, 53.35, 56.24, 59.82, 71.72, 69.63, 66.78, 53.25, 46.8, 58.27, 54.69, 48.23, 50.47, 57.38, 53.78, 50.02, 45.22, 47.86, 47.19, 54.22, 48.53, 54.71, 53.82, 83.63, 71.19, 88.26, 89.13, 73.4, 65.59, 87.02, 85.4, 79.15, 67.55, 71.37, 72.84, 82.49, 64.29, 92.78, 85.56, 107.95, 99.99, 99.71, 85.91, 99.47, 78.31, 82.71, 107.35, 97.95, 92.98, 87.49, 93.41, 102.6, 77.96, 97.57, 99.91, 83.7, 63.09, 68.23, 69.77, 73.61, 78.21, 70.56, 69.82, 69.56, 64.5, 63.16, 82.06, 72.7, 69.26, 75.71, 62.79, 62.01, 54.02, 69.1, 58.79, 55.27, 51.99, 60.61, 57.98, 59.85, 65.24, 56.41, 58.49, 58.44, 53.79, 57.06, 54.33]
td_good_bins = [21, 19, 19, 14, 15, 20, 20, 17, 17, 12, 19, 16, 12, 22, 20, 17, 16, 16, 17, 17, 15, 17, 19, 15, 15, 14, 16, 16, 15, 18, 19, 16, 18, 17, 17, 16, 16, 16, 17, 17, 14, 15, 17, 15, 15, 19, 15, 17, 22, 21, 18, 19, 21, 22, 17, 21, 21, 16, 23, 16, 17, 24, 20, 21, 19, 21, 21, 15, 18, 19, 20, 18, 15, 17, 19, 18, 18, 18, 19, 16, 17, 17, 19, 18, 18, 17, 19, 17, 19, 18, 18, 16, 18, 21, 20, 17]
td_good_Mean_Completeness = [85.98, 87.17, 86.9, 87.66, 86.64, 86.04, 85.18, 86.79, 86.35, 86.03, 88.47, 85.58, 89.46, 86.17, 83.85, 86.86, 87.87, 87.38, 87.61, 86.94, 87.35, 86.96, 88.23, 88.04, 88.62, 90.16, 88.26, 89.11, 86.1, 88.26, 87.55, 87.21, 87.51, 86.6, 87.87, 87.83, 86.62, 87.87, 87.12, 87.7, 87.06, 87.94, 87.37, 85.69, 87.28, 85.92, 88.11, 86.67, 87.68, 87.33, 87.89, 88.48, 89.21, 88.4, 86.1, 86.69, 87.99, 88.53, 89.18, 87.33, 86.83, 88.58, 87.12, 87.34, 88.54, 86.69, 87.03, 86.06, 88.99, 86.81, 86.17, 86.12, 87.78, 85.64, 86.41, 87.08, 85.98, 88.56, 87.42, 87.18, 85.99, 87.07, 86.97, 86.86, 86.75, 89.43, 86.51, 86.19, 86.17, 85.87, 86.68, 87.63, 86.26, 89.38, 87.22, 87.03]
td_good_Mean_Contamination = [5.06, 4.78, 4.28, 4.58, 3.83, 4.25, 4.66, 4.68, 4.57, 4.01, 4.01, 4.34, 4.43, 4.96, 4.23, 5.17, 3.79, 3.63, 3.76, 3.61, 3.99, 3.66, 4.04, 3.17, 3.67, 3.37, 4.3, 3.39, 3.89, 3.63, 4.05, 4.51, 3.87, 3.93, 3.27, 3.48, 4.16, 4.88, 3.89, 3.16, 4.65, 4.07, 4.41, 4.22, 4.07, 3.78, 3.85, 4.16, 3.92, 3.06, 4.2, 4.41, 3.21, 3.81, 4.31, 3.64, 3.84, 4.09, 3.84, 3.64, 3.48, 3.71, 3.4, 4.06, 3.76, 4.32, 4.61, 4.74, 3.56, 4.54, 4.1, 4.05, 4.26, 4.31, 4.84, 4.06, 3.88, 4.59, 4.63, 3.76, 3.57, 3.88, 3.56, 3.5, 3.68, 4.71, 3.46, 3.84, 3.94, 3.61, 3.5, 3.36, 3.4, 3.57, 3.91, 4.21]
td_medium_bins = [32, 33, 32, 24, 26, 35, 33, 26, 29, 18, 35, 27, 18, 33, 32, 28, 22, 20, 20, 23, 23, 21, 25, 22, 21, 22, 22, 23, 22, 22, 26, 21, 28, 27, 30, 26, 25, 27, 29, 23, 24, 21, 28, 23, 20, 26, 26, 24, 36, 34, 31, 26, 29, 36, 33, 29, 27, 21, 39, 24, 22, 37, 32, 28, 28, 28, 31, 26, 32, 30, 32, 29, 28, 29, 31, 28, 30, 32, 30, 27, 22, 23, 28, 26, 24, 23, 24, 25, 27, 23, 25, 27, 25, 25, 25, 25]
r_read_files = ['9117.5_raw', '10158.8_raw', '11263.1_raw', '11306.3_raw', '11306.1_raw', '11260.6_raw', '11260.5_raw', '9108.1_raw', '9053.2_raw', '9672.8_raw', '9108.2_raw', '9053.4_raw', '9053.3_raw', '9117.4_raw', '9117.6_raw', '9117.7_raw', '9117.8_raw', '10158.6_raw', '10186.3_raw', '10186.4_raw', '7331.1_raw', '9053.5_raw', '9041.8_raw']
r_tMreads = [36.0129894, 17.6218972, 38.2800142, 34.9076424, 35.3037194, 37.1504476, 40.3613864, 20.7773948, 31.8428354, 30.1166938, 27.718318, 40.8492618, 39.7169858, 34.4581152, 26.9696492, 21.3309852, 39.9148934, 34.9441596, 35.690255, 35.5019026, 33.4711394, 28.2058246, 36.96984]
r_Bbases = [54.379613994, 26.609064772, 57.802821442, 52.710540024, 53.308616294, 56.097175876, 60.945693464, 31.373866148, 48.082681454, 45.1750407, 41.85466018, 61.682385318, 59.972648558, 52.031753952, 40.724170292, 32.209787652, 60.271489034, 52.765680996, 53.89228505, 53.607872926, 50.541420494, 42.590795146, 55.8244584]
r_bins = [65, 47, 139, 99, 55, 90, 115, 38, 86, 87, 69, 71, 95, 70, 62, 95, 65, 78, 85, 109, 45, 49, 52]
r_Mean_Completeness = [45.96, 58.83, 51.28, 56.54, 56.55, 63.26, 52.23, 58.47, 54.69, 53.7, 58.18, 60.32, 62.41, 65.52, 52.14, 50.97, 56.26, 57.0, 65.39, 54.89, 53.78, 53.56, 52.81]
r_Mean_Contamination = [24.03, 69.63, 67.13, 60.4, 53.21, 81.43, 55.06, 35.14, 54.71, 74.62, 54.7, 76.81, 66.86, 78.94, 67.85, 46.5, 92.78, 97.57, 121.14, 103.47, 75.71, 57.06, 65.71]
r_good_bins = [15, 4, 23, 19, 14, 25, 29, 9, 20, 20, 18, 18, 23, 17, 10, 22, 16, 21, 18, 20, 9, 9, 14]
r_medium_bins = [24, 9, 37, 28, 22, 36, 41, 14, 34, 26, 28, 27, 33, 22, 16, 36, 22, 32, 28, 31, 14, 16, 20]
r_good_Mean_Completeness = [84.79, 83.85, 89.19, 87.14, 90.06, 87.41, 90.7, 86.6, 87.55, 85.4, 87.31, 86.68, 86.18, 84.94, 87.02, 88.92, 88.11, 87.12, 87.24, 89.3, 87.42, 87.22, 86.64]
r_good_Mean_Contamination = [3.37, 4.23, 3.57, 4.4, 3.62, 4.4, 4.56, 3.0, 4.05, 4.32, 4.53, 3.9, 4.48, 4.17, 2.86, 2.87, 3.85, 3.4, 3.63, 4.37, 4.63, 3.91, 4.25]

#create dataset
df = pd.DataFrame({'td_tMreads': td_tMreads,
                   'td_Bbases': td_Bbases,
                   'td_bins': td_bins,
                   'td_Mean_Completeness': td_Mean_Completeness,
                   'td_Mean_Contamination': td_Mean_Contamination,
                   'td_good_bins': td_good_bins,
                   'td_medium_bins': td_medium_bins,
                   'td_good_Mean_Completeness': td_good_Mean_Completeness,
                   'td_good_Mean_Contamination': td_good_Mean_Contamination,
                   'Mix_Group': Mix_Group})
df.rename(columns={'td_tMreads': 'tMreads', 'td_Bbases': 'Bbases'}, inplace=True)

df2 = pd.DataFrame({'r_tMreads': r_tMreads,
                   'r_Bbases': r_Bbases,
                   'r_bins': r_bins,
                   'r_Mean_Completeness': r_Mean_Completeness,
                   'r_Mean_Contamination': r_Mean_Contamination,
                   'r_good_bins': r_good_bins,
                   'r_medium_bins': r_medium_bins,
                   'r_good_Mean_Completeness': r_good_Mean_Completeness,
                   'r_good_Mean_Contamination': r_good_Mean_Contamination})
df2.rename(columns={'r_tMreads': 'tMreads', 'r_Bbases': 'Bbases'}, inplace=True)

#view dataset
#print(df)

#fit regression model
model = smf.mixedlm("td_medium_bins ~ Bbases", data=df, groups=df["Mix_Group"])
modelf = model.fit()
model1 = ols('td_medium_bins ~ Bbases', data=df).fit()
model2 = ols('r_medium_bins ~ Bbases', data=df2).fit()

#adj r^2 = Pearson product-moment correlation coefficient (r) adjusted for number of predictors 
#... r = sqrt(0.296) 
#adjusted Pearson's r = 0.544

#mdf = md.fit()
#print(mdf.summary())

#view model summary
print(modelf.summary())
print(model1.summary())
print(model2.summary())

#define figure size
fig = plt.figure(figsize=(12,8))
fig2 = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(model1, 'Bbases', fig=fig)
#draw the raw-read model on its own figure (fig2) rather than reusing fig
fig2 = sm.graphics.plot_regress_exog(model2, 'Bbases', fig=fig2)

           Mixed Linear Model Regression Results
============================================================
Model:            MixedLM Dependent Variable: td_medium_bins
No. Observations: 96      Method:             REML          
No. Groups:       6       Scale:              8.4560        
Min. group size:  16      Likelihood:         -247.5058     
Max. group size:  16      Converged:          Yes           
Mean group size:  16.0                                      
-------------------------------------------------------------
             Coef.   Std.Err.    z     P>|z|   [0.025  0.975]
-------------------------------------------------------------
Intercept    -2.118     4.767  -0.444  0.657  -11.462   7.226
Bbases        0.610     0.095   6.390  0.000    0.423   0.797
Group Var    12.750     3.052                                
============================================================

                            OLS Regression Results                            
==============================================================================
Dep. Variable:         td_medium_bins   R-squared:                       0.178
Model:                            OLS   Adj. R-squared:                  0.169
Method:                 Least Squares   F-statistic:                     20.29
Date:                Sun, 11 Apr 2021   Prob (F-statistic):           1.91e-05
Time:                        18:21:14   Log-Likelihood:                -270.83
No. Observations:                  96   AIC:                             545.7
Df Residuals:                      94   BIC:                             550.8
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     10.6339      3.619      2.939      0.004       3.449      17.819
Bbases         0.3412      0.076      4.504      0.000       0.191       0.492
==============================================================================
Omnibus:                        3.934   Durbin-Watson:                   1.321
Prob(Omnibus):                  0.140   Jarque-Bera (JB):                3.838
Skew:                           0.441   Prob(JB):                        0.147
Kurtosis:                       2.575   Cond. No.                         413.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          r_medium_bins   R-squared:                       0.243
Model:                            OLS   Adj. R-squared:                  0.207
Method:                 Least Squares   F-statistic:                     6.733
Date:                Sun, 11 Apr 2021   Prob (F-statistic):             0.0169
Time:                        18:21:14   Log-Likelihood:                -78.208
No. Observations:                  23   AIC:                             160.4
Df Residuals:                      21   BIC:                             162.7
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.5998      8.365      0.550      0.588     -12.796      21.995
Bbases         0.4283      0.165      2.595      0.017       0.085       0.772
==============================================================================
Omnibus:                        1.130   Durbin-Watson:                   2.333
Prob(Omnibus):                  0.568   Jarque-Bera (JB):                1.000
Skew:                           0.458   Prob(JB):                        0.607
Kurtosis:                       2.548   Cond. No.                         268.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
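
The summaries above can be turned into two quantities that are easier to compare across models. The sketch below copies the reported values by hand and assumes that "Group Var" is the random-intercept variance and "Scale" the residual variance of the mixed model, which is how statsmodels labels them; the variable names are illustrative only.

```python
import math

# Values copied from the model summaries printed above.
group_var = 12.750     # "Group Var" in the MixedLM summary
residual_var = 8.4560  # "Scale" in the MixedLM summary
r2_td = 0.178          # R-squared, td_medium_bins ~ Bbases (OLS)
r2_raw = 0.243         # R-squared, r_medium_bins ~ Bbases (OLS)

# Intraclass correlation: share of the variance in medium MAG counts
# that is attributable to the readset grouping (Mix_Group).
icc = group_var / (group_var + residual_var)

# For a single-predictor OLS fit, |Pearson r| is the square root of R-squared.
# Both Bbases slopes are positive, so r is positive here.
r_td = math.sqrt(r2_td)
r_raw = math.sqrt(r2_raw)

print(f"ICC = {icc:.3f}")
print(f"r (trimmed and decontaminated) = {r_td:.3f}")
print(f"r (raw) = {r_raw:.3f}")
```

An intraclass correlation well above zero would suggest that readset identity explains a large share of the variation, which is one reason to prefer the grouped model over the pooled OLS fit for the trimmed and decontaminated files.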


DISCUSSION¶

In this study, we demonstrated that JGI trimming and decontamination procedures had little impact on the quantity or quality of MAGs from complex rhizosphere metagenomes, or on the functional profiling of raw and qc MAGs that were phylogenomically paired (Table 2). However, we did observe that the number of raw and qc MAGs discretely placed in species trees increased from zero to four MAGs, to five to seven MAGs, to seven to 10 MAGs as quality thresholds for completeness and contamination were lowered from high to good to medium quality (Fig. 3, Supplemental_Fig.1). Phylogenomic differences of MAGs may be explained by differences in binning and assembly metrics, including the 2.0% lower average contamination of qc MAGs compared to raw MAGs, and the significantly higher total contig counts, counts of contigs greater than 10k bp in length, and larger total lengths of raw assemblies compared to qc assemblies. Since choosing JGI trimmed and decontaminated or raw reads means reporting and depositing a similar quantity and quality of MAGs, some with phylogenomic differences, researchers may choose to assemble each and retain the union of discrete and paired MAGs to increase the total number in their analysis and avoid missing functionally important community members.¶

We believe our methods were appropriate for the questions we were asking, but there are other ways of analyzing the data. To illustrate this point, consider that binning of a single assembly generates multiple MAGs, 77 MAGs for example. Binning multiple assemblies generates multiple MAG counts (e.g. assembly01 = 77 MAGs, assembly02 = 66 MAGs, assembly03 = 91 MAGs, ... assembly24 = 73 MAGs). So, a distribution of MAG counts can be generated from a set of assemblies, which can subsequently be compared to another distribution of MAG counts from an alternative assembly set (e.g. raw vs qc assembly MAG counts). However, each MAG has its own completeness percentage (e.g. bin001 = 14.9%, bin002 = 8.7%, bin003 = 93.4%, ... bin077 = 23.1%), contamination percentage, and counts of single-copy and multi-copy markers, which are used to calculate the completeness and contamination percentages. Since each assembly has multiple MAGs, each assembly set contains multiple distributions for these other metrics. To evaluate differences in binning metrics besides MAG counts, we elected to average the MAGs' single-copy marker counts, multi-copy marker counts, completeness scores, and contamination scores for each assembly. Distributions used for statistical testing were therefore distributions of average values. The consequence is that we tested differences in the averages of averages. A possible alternative method would be to build distributions by combining all values for each MAG metric across all assemblies generated with the same trimming and decontamination procedure, disregarding intuitive assembly-level groupings. We believe our method is more relevant to the researcher who wants to know whether their assembly, when binned, is going to have better or worse binning metrics than an assembly prepared a different way (raw vs qc).¶
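
The difference between the two approaches described above can be sketched with a small, made-up example. The completeness values and assembly names below are hypothetical and chosen only to show that averaging per assembly first and pooling all MAGs generally give different summaries when assemblies contain different numbers of MAGs.

```python
import pandas as pd

# Hypothetical per-MAG completeness values for two assemblies (illustration only).
mags = pd.DataFrame({
    "assembly": ["assembly01"] * 3 + ["assembly02"] * 2,
    "completeness": [14.9, 8.7, 93.4, 50.0, 70.0],
})

# Approach described in the text: one mean per assembly; the distribution of
# these per-assembly means is what gets compared between raw and qc sets.
per_assembly_mean = mags.groupby("assembly")["completeness"].mean()
mean_of_means = per_assembly_mean.mean()

# Alternative approach: pool every MAG value, ignoring assembly-level grouping.
pooled_mean = mags["completeness"].mean()

print(per_assembly_mean)
print(f"mean of per-assembly means: {mean_of_means:.1f}")
print(f"pooled mean over all MAGs:  {pooled_mean:.1f}")
```

In the per-assembly approach each assembly contributes equally regardless of how many MAGs it produced, whereas pooling weights assemblies by their MAG counts.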

We failed to reject the null hypotheses that there were no significant differences in several key binning metrics for assemblies that were JGI trimmed and decontaminated compared to raw assemblies. These metrics include total counts of MAGs, and the completeness averages, single-copy marker count averages, and multi-copy marker count averages of assembly MAGs. However, our study was underpowered, comparing only 23 assembly pairs. Differences in these metrics could be found significant given a much larger sample size. Based on the small effect size of less than 0.1 found for the significant difference in average contamination, however, any such significant differences would be expected to have little practical effect. We calculate that a powered study (power = 0.8) would need a sample size of greater than 824 assembly pairs (801 more pairs than we used) for an effect size less than or equal to 0.1 and α = 0.05. Then again, the effect size may be greater for low quality data, and some JGI datasets are of worse quality than the ones used in this study. Therefore, in addition to sample size, future studies should consider using average Q scores as a factor or filter in experimental designs.¶
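
A sample-size estimate of this kind can be sketched with statsmodels. The text does not state which test or software the calculation was based on, so the choices below are assumptions: a two-sided paired (one-sample) t-test solved with `TTestPower`, followed by the common adjustment for a Wilcoxon signed-rank test using its asymptotic relative efficiency of 3/π under normality.

```python
import math

import numpy as np
from statsmodels.stats.power import TTestPower

effect_size = 0.1  # standardized effect size from the text
alpha = 0.05       # significance level from the text
power = 0.8        # target power from the text

# Number of pairs for a two-sided paired t-test (assumed test, not stated in the text).
n_ttest = np.asarray(
    TTestPower().solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative="two-sided",
    )
).item()

# Assumed adjustment: divide by the asymptotic relative efficiency (3/pi)
# of the Wilcoxon signed-rank test relative to the t-test.
n_wilcoxon = n_ttest / (3 / math.pi)

print(f"paired t-test: about {math.ceil(n_ttest)} pairs")
print(f"Wilcoxon signed-rank (ARE-adjusted): about {math.ceil(n_wilcoxon)} pairs")
```

Under these assumptions the required number of pairs is on the order of 800, far more than the 23 pairs available, which is consistent with the conclusion that the comparison was underpowered for effects this small.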

We also found that more aggressive trimming reduced MAG counts, including counts of good quality MAGs, with small to medium effect. While JGI's methods of trimming and decontamination removed between 0.6% and 7.7% of reads in the fastq files, we removed as much as 16% of reads. Parameters that were overly aggressive included quality trimming to Q20 and discarding reads that were trimmed to less than 100 bp. To avoid loss of MAGs, milder parameters such as trimming to Q8 - Q12 and discarding reads that are less than 40 bp are recommended for those who elect to trim their reads.¶
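
As an illustration of the milder settings, the sketch below assembles a BBDuk [12, 13] command line in Python. It assumes BBDuk is the trimming tool and that `bbduk.sh` is on the PATH; the file names and the helper `build_bbduk_command` are hypothetical, and Q10 is simply one value inside the Q8 - Q12 range mentioned above. The command is only built and printed, not executed.

```python
import shlex


def build_bbduk_command(reads_in, reads_out, trim_quality=10, min_length=40):
    """Build a BBDuk quality-trimming command (hypothetical helper).

    qtrim=rl trims both read ends, trimq sets the quality threshold,
    and minlen discards reads shorter than the given length after trimming.
    """
    return [
        "bbduk.sh",
        f"in={reads_in}",
        f"out={reads_out}",
        "qtrim=rl",
        f"trimq={trim_quality}",
        f"minlen={min_length}",
    ]


# Hypothetical file names, for illustration only.
mild = build_bbduk_command("reads_raw.fastq", "reads_qtrim.fastq")
print(" ".join(shlex.quote(arg) for arg in mild))
```

The printed command could be passed to `subprocess.run` once the input paths point at real files; an aggressive run would differ only in `trimq=20` and `minlen=100`.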

CONCLUSIONS¶

Mild trimming and decontamination of metagenomic reads can change the way an investigator answers the questions "Who is there and what are they doing?" This is because some MAGs assembled with JGI trimming and decontamination are phylogenomically distinct from ones assembled with raw reads. Phylogenomics informs investigators of MAG identities and functions through relatedness to other organisms, and phylogenomically distinct microbes also have differing COG, PFAM, and TIGRFAM functional profiles. Since the number of MAGs discretely placed in species trees increases with inclusion of MAGs of lower quality, the discrepancy will be more substantial with medium quality MAGs than with high quality MAGs. While mild JGI trimming and decontamination can impact MAG identities and functions, it does not appear to impact how many are assembled. Aggressive trimming, however, reduced MAG counts and should be avoided.¶

List of abbreviations¶

IMG/M = Integrated Microbial Genomes and Microbiomes¶

DOE = United States Department of Energy¶

JGI = Joint Genome Institute¶

KBS = Kellogg Biological Station¶

MAGs = metagenome-assembled genomes¶

PCA = principal component analysis¶

qc = JGI trimmed and decontaminated fastq files or reads¶

raw = raw fastq files or reads¶

DECLARATIONS¶

Not applicable¶

Availability of data and material¶

All data and code generated and analyzed during this study are included in this published article, JGI IMG/M (Proposal ID: 1296, [5]), in the KBase narratives [34-35], and in the GitHub repository [33].¶

Competing interests¶

The authors declare that they have no competing interests.¶

Funding¶

Funding was provided by the United States Department of Energy, Award No. DE-EE0008523.¶

Authors' contributions¶

JMW was responsible for experimental design, data acquisition, wrangling, statistical analyses, creating figures and tables, depositing code and generated data into repositories, and drafting the manuscript.¶

AMG contributed to manuscript edits.¶

Acknowledgements¶

We acknowledge the computing resources provided on Henry2, a high-performance computing cluster operated by North Carolina State University, and acknowledge Lisa L. Lowe for her assistance with adding software packages to Henry2, which was provided through the Office of Information Technology High Performance Computing services at NC State University.¶

REFERENCES¶

1. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072-1075. doi:10.1093/bioinformatics/btt086.¶

2. Mikheenko A, Saveliev V, Gurevich A. MetaQUAST: evaluation of metagenome assemblies. Bioinformatics. 2016;32:1088-90. doi:10.1093/bioinformatics/btv697.¶

3. Bowers RM, Kyrpides NC, Stepanauskas R, Harmon-Smith M, Doud D, Reddy TB, et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nature biotechnology. 2017;35:725-31. doi:10.1038/nbt.3893.¶

4. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25:1043-1055. doi:10.1101/gr.186072.114.¶

5. Tiedje JM. Metagenomic analysis of the rhizosphere of three biofuel crops at the KBS intensive site. United States: N. p. 2013. doi:10.25585/1488010.¶

6. Guo J, Cole JR, Zhang Q, Brown CT, Tiedje JM. Microbial community analysis with ribosomal gene fragments from shotgun metagenomes. Appl. Environ. Microbiol. 2016;82:157-166.¶

7. Bay SK, Dong X, Bradley JA, Leung PM, Grinter R, Jirapanjawat T, et al. Trace gas oxidizers are widespread and active members of soil microbial communities. Nat. Microbiology. 2021:1-11.¶

8. Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nat. Biotechnology. 2018;36:566. doi:10.1038/nbt.4163.¶

9. Kluyver T, Ragan-Kelley B, Pérez F, Granger BE, Bussonnier M, Frederic J, et al. Jupyter Notebooks - a publishing format for reproducible computational workflows. ELPUB. 2016.¶

10. Chen IM, Chu K, Palaniappan K, Ratner A, Huang J, Huntemann M, et al. The IMG/M data management and analysis system v. 6.0: new tools and advanced capabilities. Nucleic Acids Res. 2021;49:D751-63. doi:10.1093/nar/gkaa939.¶

11. Mukherjee S, Stamatis D, Bertsch J, Ovchinnikova G, Sundaramurthi JC, Lee J, et al. Genomes OnLine Database (GOLD) v. 8: overview and updates. Nucleic Acids Res. 2021;49:D723-33. doi:10.1093/nar/gkaa983.¶

12. Bushnell B: BBTools Software Package. 2017. http://sourceforge.net/projects/bbmap. Accessed 15 Oct 2020.¶

13. BBDuk Guide. https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/. Accessed 15 Oct 2020.¶

14. SeqAnswers BBDuk. http://seqanswers.com/forums/showthread.php?t=96593&goto=nextnewest. Accessed 15 Oct 2020.¶

15. BioStars BBDuk 1. https://www.biostars.org/p/237714/#237745. Accessed 15 Oct 2020.¶

16. BioStars BBDuk 2. https://www.biostars.org/p/237931/. Accessed 15 Oct 2020.¶

17. Gelman A, Hill J. Data analysis using regression and multilevel/hierarchical models. Cambridge University Press. 2006.¶

18. Li D, Luo R, Liu CM, Leung CM, Ting HF, Sadakane K, et al. MEGAHIT v1.0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods. 2016;102:3-11.¶

19. Azad A, Pavlopoulos GA, Ouzounis CA, Kyrpides NC, Buluç A. HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. Nucleic Acids Res. 2018;46:e33.¶

20. Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. Using SPAdes De Novo Assembler. Curr. Protoc. Bioinform. 2020;70:e102.¶

21. Peng Y, Leung HC, Yiu SM, Chin FY. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28:1420-8. doi:10.1093/bioinformatics/bts174.¶

22. Whitham JM. KBase Silver Case Study: Determining Media Formulation Requirements for Isolation of Microbiome Constituents. United States: N. p. 2021. doi:10.25982/68579.143/1766297.¶

23. Kang DD, Li F, Kirton E, Thomas A, Egan R, An H, Wang Z. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ. 2019;7:e7359. doi:10.7717/peerj.7359.¶

24. Wu YW, Simmons BA, Singer SW. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics. 2016;32:605-607.¶

25. Yue Y, Huang H, Qi Z, Dou HM, Liu XY, Han TF, et al. Evaluating metagenomics tools for genome binning with real metagenomic datasets and CAMI datasets. BMC Bioinformatics. 2020;21:334. doi:10.1186/s12859-020-03667-3.¶

26. Brettin T, Davis JJ, Disz T, Edwards RA, Gerdes S, Olsen GJ, et al. RASTtk: a modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes. Sci. Rep. 2015;5:1-6.¶

27. Price MN, Dehal PS, Arkin AP. FastTree 2 - approximately maximum-likelihood trees for large alignments. PLoS One. 2010;5. doi:10.1371/journal.pone.0009490.¶

28. Huerta-Cepas J, Serra F, Bork P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol Biol Evol. 2016;33:1635-8.¶

29. Galperin MY, Wolf YI, Makarova KS, Vera Alvarez R, Landsman D, Koonin EV. COG database update: focus on microbial diversity, model organisms, and widespread pathogens. Nucleic Acids Res. 2021;49:D274-81.¶

30. Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer EL, et al. Pfam: The protein families database in 2021. Nucleic Acids Research. 2021;49:D412-9.¶

31. Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O. TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Res. 2001;29:41-3.¶

32. Torchiano M. effsize: Efficient Effect Size Computation. 2020. doi:10.5281/zenodo.1480624.¶

33. GitHub. https://github.com/jmwhitha/Trimming_and_decon. Accessed 22 April 2021.¶

34. Whitham, Jason. JGI QC impact on assembly, binning, phylogenomics, and functional analysis. United States: N. p., 2021. Web. doi:10.25982/62657.1515/1779219.¶


Trimming and decontamination removed as much as tens of millions of reads and tens of billions of bases from read files

In addition to the 84 read files generated, the raw and JGI-processed reads were included in the subsequent analyses, making a total of 96 read files. These ranged from 245M to 399M reads, a span of 154M reads, and from 34.4B to 60.3B bases, a span of 25.9B bases.
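The counts and spans above can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from the study: the six sample identifiers are taken from the "Modules used in ... processing" headings, and it assumes one raw and one JGI-processed file per sample. The function and variable names are hypothetical.

```python
# Sample identifiers taken from the "Modules used in ... processing" headings.
SAMPLES = ["10158.6", "9117.8", "9108.2", "9117.7", "11306.3", "9117.4"]

# Number of trimmed and/or decontaminated read files reported in the text.
GENERATED_FILES = 84


def total_read_files(generated, n_samples, extra_per_sample=2):
    """Total files analyzed.

    Assumption (not stated explicitly in the text): each sample contributes
    one raw file and one JGI-processed file, hence extra_per_sample=2.
    """
    return generated + extra_per_sample * n_samples


def span(low, high):
    """Difference between the largest and smallest value, to one decimal."""
    return round(high - low, 1)


total = total_read_files(GENERATED_FILES, len(SAMPLES))
read_span_millions = span(245, 399)    # reads, in millions
base_span_billions = span(34.4, 60.3)  # bases, in billions

print(f"total read files: {total}")
print(f"read span: {read_span_millions}M reads")
print(f"base span: {base_span_billions}B bases")
```

Under these assumptions the sketch reproduces the 96 read files and the spans of 154M reads and 25.9B bases reported above.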

Total MAG counts correlated with bases and reads

DISCUSSION

CONCLUSIONS

List of abbreviations

IMG/M = Integrated Microbial Genomes and Microbiomes

DOE = United States Department of Energy

JGI = Joint Genome Institute

KBS = Kellogg Biological Station

MAGs = metagenome assembled genomes

PCA = principal component analysis

qc = JGI trimmed and decontaminated fastq files or reads

raw = raw fastq files or reads

DECLARATIONS

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Availability of data and material

All data and code generated and analyzed during this study are included in this published article, in JGI IMG/M (Proposal ID: 1296, [5]), in the KBase narratives [34-35], and in the GitHub repository [33].

Competing interests

The authors declare that they have no competing interests.

Funding

Funding was provided by the United States Department of Energy, Award No. DE-EE0008523.

Authors' contributions

JMW was responsible for experimental design, data acquisition, wrangling, statistical analyses, creating figures and tables, depositing code and generated data into repositories, and drafting the manuscript.

AMG contributed to manuscript edits.

Acknowledgements

REFERENCES

1. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072-1075. doi:10.1093/bioinformatics/btt086.

2. Mikheenko A, Saveliev V, Gurevich A. MetaQUAST: evaluation of metagenome assemblies. Bioinformatics. 2016;32:1088-90. doi:10.1093/bioinformatics/btv697.

3. Bowers RM, Kyrpides NC, Stepanauskas R, Harmon-Smith M, Doud D, Reddy TB, et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnology. 2017;35:725-31. doi:10.1038/nbt.3893.

4. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25:1043-1055. doi:10.1101/gr.186072.114.

5. Tiedje JM. Metagenomic analysis of the rhizosphere of three biofuel crops at the KBS intensive site. United States: N. p. 2013. doi:10.25585/1488010.

6. Guo J, Cole JR, Zhang Q, Brown CT, Tiedje JM. Microbial community analysis with ribosomal gene fragments from shotgun metagenomes. Appl. Environ. Microbiol. 2016;82:157-166.

7. Bay SK, Dong X, Bradley JA, Leung PM, Grinter R, Jirapanjawat T, et al. Trace gas oxidizers are widespread and active members of soil microbial communities. Nat. Microbiology. 2021:1-11.

8. Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nat. Biotechnology. 2018;36:566. doi:10.1038/nbt.4163.

9. Kluyver T, Ragan-Kelley B, Pérez F, Granger BE, Bussonnier M, Frederic J, et al. Jupyter Notebooks - a publishing format for reproducible computational workflows. ELPUB. 2016.

10. Chen IM, Chu K, Palaniappan K, Ratner A, Huang J, Huntemann M, et al. The IMG/M data management and analysis system v. 6.0: new tools and advanced capabilities. Nucleic Acids Res. 2021;49:D751-63. doi:10.1093/nar/gkaa939.

11. Mukherjee S, Stamatis D, Bertsch J, Ovchinnikova G, Sundaramurthi JC, Lee J, et al. Genomes OnLine Database (GOLD) v. 8: overview and updates. Nucleic Acids Res. 2021;49:D723-33. doi:10.1093/nar/gkaa983.

12. Bushnell B. BBTools Software Package. 2017. http://sourceforge.net/projects/bbmap. Accessed 15 Oct 2020.

13. BBDuk Guide. https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/. Accessed 15 Oct 2020.

14. SeqAnswers BBDuk. http://seqanswers.com/forums/showthread.php?t=96593&goto=nextnewest. Accessed 15 Oct 2020.

15. BioStars BBDuk 1. https://www.biostars.org/p/237714/#237745. Accessed 15 Oct 2020.

16. BioStars BBDuk 2. https://www.biostars.org/p/237931/. Accessed 15 Oct 2020.

17. Gelman A, Hill J. Data analysis using regression and multilevel/hierarchical models. Cambridge University Press. 2006.

18. Li D, Luo R, Liu CM, Leung CM, Ting HF, Sadakane K, et al. MEGAHIT v1.0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods. 2016;102:3-11.

19. Azad A, Pavlopoulos GA, Ouzounis CA, Kyrpides NC, Buluç A. HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. Nucleic Acids Res. 2018;46:e33.

20. Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. Using SPAdes De Novo Assembler. Curr. Protoc. Bioinform. 2020;70:e102.

21. Peng Y, Leung HC, Yiu SM, Chin FY. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28:1420-8. doi:10.1093/bioinformatics/bts174.

22. Whitham JM. KBase Silver Case Study: Determining Media Formulation Requirements for Isolation of Microbiome Constituents. United States: N. p. 2021. doi:10.25982/68579.143/1766297.

23. Kang DD, Li F, Kirton E, Thomas A, Egan R, An H, Wang Z. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ. 2019;7:e7359. doi:10.7717/peerj.7359.

24. Wu YW, Simmons BA, Singer SW. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics. 2016;32:605-607.

25. Yue Y, Huang H, Qi Z, Dou HM, Liu XY, Han TF, et al. Evaluating metagenomics tools for genome binning with real metagenomic datasets and CAMI datasets. BMC Bioinformatics. 2020;21:334. doi:10.1186/s12859-020-03667-3.

26. Brettin T, Davis JJ, Disz T, Edwards RA, Gerdes S, Olsen GJ, et al. RASTtk: a modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes. Sci. Rep. 2015;5:1-6.

27. Price MN, Dehal PS, Arkin AP. FastTree 2 - Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One. 2010;5. doi:10.1371/journal.pone.0009490.

28. Huerta-Cepas J, Serra F, Bork P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol Biol Evol. 2016;33:1635-8.

29. Galperin MY, Wolf YI, Makarova KS, Vera Alvarez R, Landsman D, Koonin EV. COG database update: focus on microbial diversity, model organisms, and widespread pathogens. Nucleic Acids Res. 2021;49:D274-81.

30. Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer EL, et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 2021;49:D412-9.

31. Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O. TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Res. 2001;29:41-3.

32. Torchiano M. effsize: Efficient Effect Size Computation. 2020. doi:10.5281/zenodo.1480624.

33. GitHub. https://github.com/jmwhitha/Trimming_and_decon. Accessed 22 April 2021.

34. Whitham, Jason. JGI QC impact on assembly, binning, phylogenomics, and functional analysis. United States: N. p., 2021. Web. doi:10.25982/62657.1515/1779219.

35. Whitham, Jason. Impact of BBDuk metagenomic read trimming and decontamination. United States: N. p., 2021. Web. doi:10.25982/77705.1341/1779218.

36. Sainani K. The importance of accounting for correlated observations. PM&R. 2010;2:858-861.
