kGEM: k-Genotype Expectation Maximization algorithm for Reconstructing a Viral population from Single-Amplicon reads.


The kGEM tool reconstructs haplotypes from single-amplicon sequencing data. It requires aligned reads in a special internal format; the auxiliary program ERIF converts reads into this format from either fasta (unaligned) or SAM (pairwise alignment) input. ERIF uses InDelFixer internally for pairwise alignment.

The Java Runtime Environment is required to run both kGEM and ERIF (http://java.com/en/download/index.jsp).

Download ERIF-1.0  

Download kGEM-0.3.1

Both tools provide help on the command line.

There are two ways to prepare reads for kGEM:

  1. Using reads in fasta format:
    1. Run ERIF using the following command:
       java -jar <path_to_ERIF.jar> -g <path_to_reference.fasta> -i <path_to_reads.fasta> -o <folder_for_output>
      where <path_to_reference.fasta> is the path to the reference, in fasta format, used for aligning the reads (the reference should be a single sequence),
      <path_to_reads.fasta> is the path to the reads in fasta format, and
      <folder_for_output> is the path where the results will be stored (default: current folder). Several files will appear in this folder (reads.sam - the InDelFixer alignment, <ref>_ext.fasta - the extended reference sequence, and <reads>_ext.txt - the aligned reads in the special internal format).
    2. The aligned_reads.fas file in the output folder is the prepared input for kGEM.
  2. Using reads in SAM format (aligned):
    1. First, if you have an indexed BAM file, convert it to SAM using SAMtools.
    2. Run ERIF using the following command:
      java -jar <path_to_ERIF.jar> -g <path_to_reference.fasta> -sam <path_to_reads.sam> -o <folder_for_output>
      where <path_to_reference.fasta> is the path to the reference, in fasta format, used for aligning the reads (the reference should be a single sequence),
      <path_to_reads.sam> is the path to the reads in SAM format, and
      <folder_for_output> is the path where the results will be stored (default: current folder). Several files will appear in this folder (reads.sam - the InDelFixer alignment, <ref>_ext.fasta - the extended reference sequence, and aligned_reads.fas - the aligned reads in the special internal format).
    3. The aligned_reads.fas file in the output folder is the prepared input for kGEM.
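
For step 1 of the SAM route, the BAM-to-SAM conversion might look like the sketch below. It assumes SAMtools is installed and on the PATH; the function name bam_to_sam and the file names are placeholders and are not part of kGEM or ERIF. The -h option keeps the header lines in the SAM output.

```shell
# Hypothetical helper (not part of kGEM/ERIF): convert a BAM file to SAM text.
# Usage: bam_to_sam reads.bam reads.sam
bam_to_sam() {
    # -h writes the header lines as well; -o names the output file
    samtools view -h -o "$2" "$1"
}
```

The resulting SAM file can then be passed to ERIF with the -sam option as described in step 2.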

Once the aligned reads file has been obtained, run kGEM using the following command:

 java -jar <path_to_KGEM-v.jar>  <path_to_reads>/aligned_reads.fas  <k> -o <output_directory>

where <k> is the number of initial haplotypes used for estimation. It must be a positive integer and should be higher than the actual number of haplotypes in the population; to cluster the reads more aggressively, <k> can be reduced.

aligned_reads.fas is the reads file obtained in the previous step, and <output_directory> (default: current directory) will contain two files after the program has finished. The file haplotypes.fa will contain the haplotypes in fasta format, with their frequencies in the description line. For example,

>read1_0.38

ACTGGAA......

means that this haplotype has a frequency of 38%.

The second file, reads.fa, contains the same haplotypes, but instead of recording the frequencies in the description, the program repeats each haplotype in proportion to its frequency. This file contains the same number of entries as the initial reads file.
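
The frequencies can be pulled out of the description lines of haplotypes.fa with a short shell function. This is only a sketch: it assumes every header has the form of the example (>name_frequency, with the frequency after the last underscore), and list_frequencies is an illustrative name, not a kGEM command.

```shell
# Hypothetical helper: print "<name><TAB><frequency>" for every header line.
# Assumes headers look like ">read1_0.38" (frequency after the last underscore).
# Usage: list_frequencies haplotypes.fa
list_frequencies() {
    awk '/^>/ {
        header = substr($0, 2)             # drop the leading ">"
        if (match(header, /_[^_]*$/))      # last "_" and everything after it
            printf "%s\t%s\n", substr(header, 1, RSTART - 1), substr(header, RSTART + 1)
    }' "$1"
}
```

Sequence lines are ignored, so the function can be run on the file before or after the alignment dashes are removed.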

  • Note: the result files reads.fa and haplotypes.fa may contain dashes '-' that were used for alignment. To get pure sequences, clean the file in any text editor by replacing all '-' with '' (nothing), or on Linux machines with the command:
sed -e 's/\-//g' haplotypes.fa > haplotypes_cleaned.fa
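
The sed command above deletes every dash in the file, including any that occur in description lines. If the headers may themselves contain '-', a variant that only touches sequence lines could look like the sketch below. strip_gaps is an illustrative name, not part of kGEM.

```shell
# Hypothetical helper: remove alignment dashes from sequence lines only.
# Usage: strip_gaps haplotypes.fa > haplotypes_cleaned.fa
strip_gaps() {
    # The address /^>/! applies the substitution only to lines
    # that do not start with ">", so header lines are left as they are.
    sed -e '/^>/!s/-//g' "$1"
}
```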

 

Example

Assume that ERIF.jar, KGEM-0.3.1.jar, sample_data.fa and reference.fa are in the current directory. First run the following command:

java -jar ERIF.jar -g reference.fa -i sample_data.fa -o test_

Alternatively, you could use a SAM file (reads.sam) instead of fasta:

java -jar ERIF.jar -g reference.fa -sam reads.sam -o test_

After that, the output file test_reads.sam_ext.txt will appear in this folder.

Then run the next command:

java -jar KGEM-0.3.1.jar test_reads.sam_ext.txt 100

After kGEM completes, two files will appear in the current directory: haplotypes.fa and reads.fa.

Linux users can remove the dashes from the output with the following command:

 sed -e 's/\-//g' haplotypes.fa > haplotypes_cleaned.fa

As a result, the haplotypes with their frequencies will be stored in the haplotypes_cleaned.fa file.

For developers: 

The source code is available in the git repository KGEM_on_github

The programming language is Scala; Maven is required for compilation.

  1. Download and install Maven 2 or 3
  2. Download sources from github repository
  3. From the folder where the sources are placed, run:
    mvn clean package

    Note: you could also download and build the jar from the Maven repository directly:

  • mvn org.apache.maven.plugins:maven-dependency-plugin:2.4:get -DremoteRepositories=https://raw.github.org/night-stalker/KGEM -Dartifact=kgem:kgem:0.3.1

ERIF is currently not available from Maven directly!

A kgem repository is also available for developers using Maven. To use it inside a Maven project, the following configuration is necessary:

In the pom.xml, add to the repositories tag:

<repository>
      <id>kgem</id>
      <name>KGEM repository</name>
      <url>https://github.com/night-stalker/KGEM</url>
</repository>

and to the dependencies tag:

<dependency>
     <groupId>kgem</groupId>
     <artifactId>kgem</artifactId>
     <version>0.3.1</version>
</dependency>