1. How to generate input data?

There are three types of input data. The gene cluster file holds the ortholog information for the analyzed bacterial population, and PGAP can output this file directly. Other programs, such as OrthoMCL, PanOCT, and Inparanoid/Multiparanoid, can also produce gene-cluster-like files. When preparing the gene cluster file, strain-specific genes should be included as well.

2. How to choose a sampling algorithm?

For a population of no more than 15 strains, the Traverse All algorithm is the best choice, and the run time should not exceed 5 minutes. When the population has more than 15 strains, Totally Random (TR) and Distance Guide (DG) are recommended, with DG strongly preferred. In theory, PanGP can be used for pan-genome profile analysis of bacteria, fungi, and other organisms. For species other than bacteria with a population of more than 15 strains, TR is a good choice, because DG has only been tested on bacteria.
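The guidance above can be written as a small decision rule. This is only an illustrative sketch: the function name and its parameters are hypothetical and are not part of PanGP, which exposes this choice through its own interface.

```python
def choose_sampling_algorithm(n_strains: int, is_bacteria: bool = True) -> str:
    """Return the sampling algorithm suggested by the FAQ guidance.

    Hypothetical helper for illustration; not a PanGP function.
    """
    if n_strains < 1:
        raise ValueError("n_strains must be at least 1")
    # Up to 15 strains: enumerating every combination is still cheap.
    if n_strains <= 15:
        return "Traverse All"
    # More than 15 strains: DG is preferred for bacteria, but it has only
    # been tested on bacteria, so TR is suggested for other species.
    return "Distance Guide" if is_bacteria else "Totally Random"


if __name__ == "__main__":
    for n, bacteria in [(10, True), (40, True), (40, False)]:
        print(n, bacteria, choose_sampling_algorithm(n, bacteria))
```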

3. How to choose the sample size?

In theory, the larger the sample size, the better the result, but also the longer the run time. For N strains with a total of M gene clusters, refer to Table 5.1 for a suitable sample size. Usually, 500 is enough.

4. What does k (Amplification Coefficient) do and how does it work?

When calculating the pan-genome size for n out of N strains, there are C(N,n) possible combinations. When C(N,n) is very large, we cannot afford the time cost of calculating the pan-genome size and genome diversity for all combinations. Thus, the DG sampling algorithm uses a two-step process: it first evaluates the genome diversity of the combinations, and then samples combinations for pan-genome size calculation.
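To see why enumerating all combinations quickly becomes impractical, the counts can be computed directly. The sketch below uses only Python's standard `math.comb`; the helper name is hypothetical and the values of N are arbitrary examples.

```python
from math import comb


def total_combinations(n_total: int) -> int:
    """Sum of C(N, n) for n = 1..N, i.e. every non-empty subset of N strains."""
    return sum(comb(n_total, n) for n in range(1, n_total + 1))


if __name__ == "__main__":
    for n_total in (15, 30, 60):
        # The largest single term is C(N, N // 2).
        largest = comb(n_total, n_total // 2)
        print(f"N={n_total}: largest C(N,n)={largest}, "
              f"all subsets={total_combinations(n_total)}")
```

For N = 15 the total is 32767 combinations, which is still tractable, while for N = 30 the single term C(30,15) alone is 155117520.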

In the first step, we randomly sample s × k combinations from the total C(N,n) combinations, where s is the sample size. These s × k combinations are used to estimate the distribution of genome diversity across all C(N,n) combinations. The k value can be adjusted to control the sample size during this evaluation. As the total number of genomes N increases, C(N,n) becomes very large, and a large k value can capture the distribution of genome diversity at low cost.

In the second step, we sample s combinations from the s × k combinations based on genome diversity. Once the sample size s is fixed, k determines the size of the pool from which the final combinations for pan-genome profile analysis are drawn. In general, the number of final combinations used for pan-genome size calculation determines the time cost of the DG algorithm.

In summary, k not only controls the sample size for evaluating the distribution of genome diversity across all combinations, but also determines the size of the pool from which the combinations for pan-genome profile analysis are drawn.
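The two steps can be sketched as follows. This is not PanGP's implementation: the text does not specify how diversity is measured or how the s combinations are chosen from the pool, so two assumptions are made here. Diversity is taken as the mean pairwise distance from a hypothetical strain distance matrix `dist`, and the second step picks combinations evenly spaced along the sorted diversity values. All function names are hypothetical.

```python
import random
from itertools import combinations


def mean_pairwise_distance(combo, dist):
    """Assumed diversity measure: mean distance over all strain pairs."""
    pairs = list(combinations(combo, 2))
    if not pairs:
        return 0.0
    return sum(dist[a][b] for a, b in pairs) / len(pairs)


def two_step_sample(n_total, n, s, k, dist, seed=0):
    """Draw s combinations of n strains (out of n_total) via a pool of s * k."""
    rng = random.Random(seed)
    # Step 1: randomly draw s * k combinations to approximate the diversity
    # distribution of all C(N, n) combinations. Repeats are possible.
    pool = [tuple(sorted(rng.sample(range(n_total), n))) for _ in range(s * k)]
    # Step 2 (assumed rule): sort the pool by diversity and keep every k-th
    # combination, so the s picks span the observed diversity range.
    pool.sort(key=lambda combo: mean_pairwise_distance(combo, dist))
    return pool[k // 2::k]


if __name__ == "__main__":
    n_total = 20
    # Toy distance matrix for illustration only.
    dist = [[abs(i - j) for j in range(n_total)] for i in range(n_total)]
    for combo in two_step_sample(n_total, n=5, s=10, k=4, dist=dist):
        print(combo, round(mean_pairwise_distance(combo, dist), 2))
```

In this sketch, only the s selected combinations go on to the pan-genome size calculation, while k only enlarges the pool used to estimate diversity, which matches the roles of s and k described above.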
