art_illumina -ss HS20 -i GRCH38BWAindex/GRCH38chr1L3556522.fna -p -l 100 -m 200 -s 10  -c 10000000 -o g38L100c10000000Nhs20Paired


spark-submit  --class cs.ucla.edu.bwaspark.BWAMEMSpark --master spark://  /home/hadoop/xubo/tools/cloud-scale-bwamem-0.2.1/target/cloud-scale-bwamem-0.2.0-assembly.jar upload-fastq   1 21 g38L100c10000000Nhs20Paired1.fq g38L100c10000000Nhs20Paired2.fq /xubo/data/alignment/cs-bwamem/fastq/g38L100c10000000Nhs20Paired12.fastq


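The FASTA reference must be dispatched to every node, every node needs the cs-bwamem jar, and the JNI library must be referenced by a global (absolute) path that is valid on all nodes.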
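The distribution step described above might look like the following minimal sketch. The host names in `NODES`, the use of `ssh`/`scp`, and the `DRY_RUN` switch are assumptions for illustration and are not part of the original setup. The three paths are taken from the commands in this document, and it is assumed that the BWA index files sit in the same directory as the FASTA file. The `-localRef 1` option in the alignment command suggests that each node reads the reference from its local file system, which is why the same absolute path is used on every node.

```shell
#!/bin/sh
# Sketch only: copy everything cs-bwamem needs to the SAME absolute path on each node.
# NODES is a hypothetical host list; replace it with your own worker host names.
# DRY_RUN=1 (the default) prints the commands instead of running them.
NODES="${NODES:-node1 node2}"
DRY_RUN="${DRY_RUN:-1}"

# Paths taken from the commands in this document.
# Assumption: the BWA index files live next to the FASTA file, so the whole directory is copied.
ITEMS="/home/hadoop/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem/GRCH38BWAindex
/home/hadoop/xubo/tools/cloud-scale-bwamem-0.2.1/target/cloud-scale-bwamem-0.2.0-assembly.jar
/home/hadoop/xubo/tools/cloud-scale-bwamem-0.2.1/target/jniNative.so"

dispatch() {
  for node in $NODES; do
    for item in $ITEMS; do
      parent=$(dirname "$item")
      if [ "$DRY_RUN" = "1" ]; then
        # Dry run: show what would be executed.
        echo "ssh $node mkdir -p $parent"
        echo "scp -r $item $node:$parent/"
      else
        # Create the parent directory remotely, then copy the file or directory into it.
        ssh "$node" "mkdir -p '$parent'" && scp -r "$item" "$node:$parent/"
      fi
    done
  done
}

dispatch
```

Saved under a hypothetical name such as `dispatch.sh`, it could be run for real with `NODES="worker1 worker2" DRY_RUN=0 sh dispatch.sh`. Keeping identical absolute paths on all nodes means the `-jniPath` value and the reference path passed to `spark-submit` resolve the same way on every executor.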


spark-submit --executor-memory 4g --class cs.ucla.edu.bwaspark.BWAMEMSpark --total-executor-cores 20 --master spark://  --conf spark.driver.host= --conf spark.driver.cores=4 --conf spark.driver.maxResultSize=4g --conf spark.storage.memoryFraction=0.7  --conf spark.akka.threads=2 --conf spark.akka.frameSize=1024 /home/hadoop/xubo/tools/cloud-scale-bwamem-0.2.1/target/cloud-scale-bwamem-0.2.0-assembly.jar cs-bwamem -bfn 1 -bPSW 1 -sbatch 10 -bPSWJNI 1  -oChoice 2 -oPath hdfs:// -localRef 1 -jniPath /home/hadoop/xubo/tools/cloud-scale-bwamem-0.2.1/target/jniNative.so -isSWExtBatched 1  1 /home/hadoop/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem/GRCH38BWAindex/GRCH38chr1L3556522.fasta  /xubo/data/alignment/cs-bwamem/fastq/g38L100c10000000Nhs20Paired12.fastq


spark-submit --executor-memory 6g --class cs.ucla.edu.bwaspark.BWAMEMSpark --total-executor-cores 20 --master spark://  --conf spark.driver.host= --conf spark.driver.cores=4 --conf spark.driver.maxResultSize=6g --conf spark.storage.memoryFraction=0.7  --conf spark.akka.threads=2 --conf spark.akka.frameSize=1024 /home/hadoop/xubo/tools/cloud-scale-bwamem-0.2.1/target/cloud-scale-bwamem-0.2.0-assembly.jar merge hdfs:// /xubo/data/alignment/cs-bwamem/fastq/g38L100c10000000Nhs20Paired12F2.adam /xubo/data/alignment/cs-bwamem/fastq/g38L100c10000000Nhs20Paired12F2.merge.adam


package org.bdgenomics.avocado.cli

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by xubo on 2016/5/27.
  * Downloads from HDFS the data that has already been aligned by avocado.
  * run: success
  */
object parquetRead2csbwamemDup {
  def main(args: Array[String]): Unit = {
    // Run locally with 4 threads; the app name is the object name without the trailing '$'
    val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$')))
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    //    val file = "hdfs://"
    //    val file = "hdfs://"
    val file = "hdfs://"

    // mergeSchema reconciles the schemas of all Parquet part files into one
    val df3 = sqlContext.read.option("mergeSchema", "true").parquet(file)
    //    df3.printSchema()
    var j = 1
    //    for (i <- df3) {
    //      println(j + ":" + i)
    //      j += 1
    //    }
    //    println("distinct:" + df3.distinct.count())
  }
}


scala>     println(df3.count())
scala> println("distinct:" + df3.distinct.count())



Worker1 Time: 815792
Calculate Metrics Time: 2093
Worker2 Time: 688463
16/06/03 18:07:19 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(4,WrappedArray())
16/06/03 18:07:19 WARN QueuedThreadPool: 1 threads could not be stopped
CS-BWAMEM Finished!!!
Jun 3, 2016 5:41:47 PM INFO: parquet.hadoop.ParquetInputFormat: Total input paths to process : 21
Jun 3, 2016 5:41:57 PM INFO: parquet.hadoop.ParquetInputFormat: Total input paths to process : 21




【1】 [BIBM] Bo Xu, Changlong Li, Hang Zhuang, Jiali Wang, Qingfeng Wang, Chao Wang, Xuehai Zhou. Distributed Gene Clinical Decision Support System Based on Cloud Computing. IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2017, CCF-B).
【2】 [IEEE CLOUD] Bo Xu, Changlong Li, Hang Zhuang, Jiali Wang, Qingfeng Wang, Xuehai Zhou. Efficient Distributed Smith-Waterman Algorithm Based on Apache Spark (CLOUD 2017, CCF-C).
【3】 [CCGrid] Bo Xu, Changlong Li, Hang Zhuang, Jiali Wang, Qingfeng Wang, Jinhong Zhou, Xuehai Zhou. DSA: Scalable Distributed Sequence Alignment System Using SIMD Instructions (CCGrid 2017, CCF-C).
【4】 More: https://github.com/xubo245/Publications


If you have any questions or suggestions, please open an issue in this project or send an e-mail to xubo245@mail.ustc.edu.cn
WeChat: xu601450868
QQ: 601450868
