Running Trinity
2013-12-24 16:11
Trinity is run via the script Trinity.pl, found in the base installation directory. Usage info is as follows:
###############################################################################
#
# Required:
#
#  --seqType <string>      :type of reads: ( fa, or fq )
#
#  --JM <string>           :(Jellyfish Memory) number of GB of system memory to use for
#                           k-mer counting by jellyfish (eg. 10G) *include the 'G' char
#
#  If paired reads:
#      --left  <string>    :left reads, one or more (separated by space)
#      --right <string>    :right reads, one or more (separated by space)
#
#  Or, if unpaired reads:
#      --single <string>   :single reads, one or more (note, if single file contains pairs, can use flag: --run_as_paired )
#
####################################
##  Misc:  #########################
#
#  --SS_lib_type <string>          :Strand-specific RNA-Seq read orientation.
#                                   if paired: RF or FR,
#                                   if single: F or R. (dUTP method = RF)
#                                   See web documentation.
#
#  --output <string>               :name of directory for output (will be
#                                   created if it doesn't already exist)
#                                   default( "/Users/bhaas/SVN/trinityrnaseq/trunk/trinity_out_dir" )
#  --CPU <int>                     :number of CPUs to use, default: 2
#  --min_contig_length <int>       :minimum assembled contig length to report
#                                   (def=200)
#  --genome_guided                 :set to genome guided mode, only retains assembly fasta file.
#  --jaccard_clip                  :option, set if you have paired reads and
#                                   you expect high gene density with UTR
#                                   overlap (use FASTQ input file format
#                                   for reads).
#                                   (note: jaccard_clip is an expensive
#                                   operation, so avoid using it unless
#                                   necessary due to finding excessive fusion
#                                   transcripts w/o it.)
#
#  --prep                          :Only prepare files (high I/O usage) and stop before kmer counting.
#
#  --no_cleanup                    :retain all intermediate input files.
#  --full_cleanup                  :only retain the Trinity fasta file, rename as ${output_dir}.Trinity.fasta
#
#  --cite                          :show the Trinity literature citation
#
#  --version                       :reports Trinity version (BLEEDING_EDGE) and exits.
#
####################################################
# Inchworm and K-mer counting-related options: #####
#
#  --min_kmer_cov <int>            :min count for K-mers to be assembled by
#                                   Inchworm (default: 1)
#  --inchworm_cpu <int>            :number of CPUs to use for Inchworm, default is min(6, --CPU option)
#
#  --no_run_inchworm               :stop after running jellyfish, before inchworm.
#
###################################
# Chrysalis-related options: ######
#
#  --max_reads_per_graph <int>     :maximum number of reads to anchor within
#                                   a single graph (default: 200000)
#  --no_run_chrysalis              :stop Trinity after Inchworm and before
#                                   running Chrysalis
#  --no_run_quantifygraph          :stop Trinity just before running the
#                                   parallel QuantifyGraph computes, to
#                                   leverage a compute farm and massively
#                                   parallel execution..
#
#  --chrysalis_output <string>     :name of directory for chrysalis output (will be
#                                   created if it doesn't already exist)
#                                   default( "chrysalis" )
#
#  --no_bowtie                     :dont run bowtie to use pair info in chrysalis clustering.
#
#####################################
###  Butterfly-related options:  ####
#
#  --bfly_opts <string>            :additional parameters to pass through to butterfly
#                                   (see butterfly options: java -jar Butterfly.jar ).
#                                   (note: only for expert or experimental use. Commonly used parameters are exposed through this Trinity menu here).
#
#  //////////////////////////////////
#  Alternative reconstruction modes:
#  Default mode is the 'regular' Butterfly transcript reconstruction by graph node extension.
#
#  --PasaFly                       PASA-like algorithm for maximally-supported isoforms (conservative reconstructions, fewer isoforms)
#    or
#  --CuffFly                       Cufflinks-like algorithm to report minimum transcripts (fewest isoforms)
#
#
#  Butterfly read-pair grouping settings (used for all reconstruction modes to define 'pair paths'):
#
#  --group_pairs_distance <int>    :maximum length expected between fragment pairs (default: 500)
#                                   (reads outside this distance are treated as single-end)
#
#  ///////////////////////////////////////////////
#  Butterfly default reconstruction mode settings. (no CuffFly or PasaFly custom settings are currently available).
#
#  --path_reinforcement_distance <int>   :minimum overlap of reads with growing transcript
#                                         path (default: PE: 75, SE: 25)
#                                         Set to 1 for the most lenient path extension requirements.
#
#  --triplet_lock                  : (increase stringency of regular butterfly reconstruction)
#                                    lock triplet-supported nodes: node 'c' having read path 'A-B-C' disables 'Z-B-C' if no such read support exists.
#
#  --extended_lock                 : (further increase the stringency of regular butterfy reconstruction)
#                                    extend the triplet lock to include longer range read path information.
#                                    ex. in extending path 'A-B-Z' to 'A-B-Z-D', we only find read support for 'A-B-C-D', that 'A-B-Z' extension to 'D' will be blocked.
#                                    (assumes --triplet_lock)
#
#  /////////////////////////////////////////
#  Butterfly transcript reduction settings:
#
#  --no_path_merging               : all transcript candidates are output (including SNP variations, however, some SNPs may be unphased)
#
#  By default, alternative transcript candidates are merged (in reality, discarded) if they are found to be too similar, according to the following logic:
#
#  (identity=(numberOfMatches/shorterLen) > 95.0% or if we have <= 2 mismatches) and if we have internal gap lengths <= 10
#
#  with parameters as:
#
#      --min_per_id_same_path <int>          default: 95     min percent identity for two paths to be merged into single paths
#      --max_diffs_same_path <int>           default: 2      max allowed differences encountered between path sequences to combine them
#      --max_internal_gap_same_path <int>    default: 10     maximum number of internal consecutive gap characters allowed for paths to be merged into single paths.
#
#  If, in a comparison between two alternative transcripts, they are found too similar, the transcript with the greatest cumulative
#  compatible read (pair-path) support is retained, and the other is discarded.
#
#
#  //////////////////////////////////////////////
#  Butterfly Java and parallel execution settings.
#
#  --bflyHeapSpaceMax <string>     :java max heap space setting for butterfly
#                                   (default: 10G) => yields command
#                                   'java -Xmx10G -jar Butterfly.jar ... $bfly_opts'
#  --bflyHeapSpaceInit <string>    :java initial hap space settings for
#                                   butterfly (default: 1G) => yields command
#                                   'java -Xms1G -jar Butterfly.jar ... $bfly_opts'
#  --bflyGCThreads <int>           :threads for garbage collection
#                                   (default, not specified, so java decides)
#  --bflyCPU <int>                 :CPUs to use (default will be normal
#                                   number of CPUs; e.g., 2)
#  --bflyCalculateCPU              :Calculate CPUs based on 80% of max_memory
#                                   divided by maxbflyHeapSpaceMax
#  --no_run_butterfly              :stops after the Chrysalis stage. You'll
#                                   need to run the Butterfly computes
#                                   separately, such as on a computing grid.
#                                   Then, concatenate all the Butterfly assemblies by running:
#                                   'find trinity_out_dir/ -name "*allProbPaths.fasta"
#                                    -exec cat {} + > trinity_out_dir/Trinity.fasta'
#
#################################
# Grid-computing options: #######
#
#  --grid_computing_module <string>  : Perl module in /Users/bhaas/SVN/trinityrnaseq/trunk/PerlLibAdaptors/
#                                      that implements 'run_on_grid()'
#                                      for naively parallel cmds. (eg. 'BroadInstGridRunner')
#
#
###############################################################################
#
#  *Note, a typical Trinity command might be:
#        Trinity.pl --seqType fq --JM 100G --left reads_1.fq --right reads_2.fq --CPU 6
#
#     see: /Users/bhaas/SVN/trinityrnaseq/trunk/sample_data/test_Trinity_Assembly/
#          for sample data and 'runMe.sh' for example Trinity execution
#     For more details, visit: http://trinityrnaseq.sf.net
#
###############################################################################
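The path-merging rule quoted in the usage text can be read as a small predicate. The following is only a sketch of that stated logic, not Butterfly's actual code: the function name paths_too_similar and its arguments are hypothetical, and it assumes the match count, difference count, and longest internal gap have already been obtained from an alignment of the two candidate paths.

```python
def paths_too_similar(num_matches, shorter_len, num_diffs, max_internal_gap,
                      min_per_id_same_path=95.0,
                      max_diffs_same_path=2,
                      max_internal_gap_same_path=10):
    """Return True if two candidate paths would be merged under the stated rule.

    Rule from the usage text:
      (identity > 95.0% or mismatches <= 2) and internal gap length <= 10
    The keyword defaults mirror the documented option defaults.
    """
    if shorter_len <= 0:
        raise ValueError("shorter_len must be positive")
    # identity = numberOfMatches / shorterLen, expressed as a percentage
    identity = 100.0 * num_matches / shorter_len
    similar = identity > min_per_id_same_path or num_diffs <= max_diffs_same_path
    return similar and max_internal_gap <= max_internal_gap_same_path
```

For example, under the defaults a pair with 96 matches over a shorter length of 100 and no internal gap counts as too similar, while the same pair with an 11-base internal gap does not.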
Note: Trinity performs best with strand-specific data, in which case sense and antisense transcripts can be resolved. For protocols on strand-specific RNA-Seq, see: Borodina T, Adjaye J, Sultan M. A strand-specific library preparation protocol for RNA sequencing. Methods Enzymol. 2011;500:79-98. PubMed PMID: 21943893.
Paired reads:
RF: the first read (/1) of the fragment pair is sequenced as antisense (reverse, R), and the second read (/2) is in the sense strand (forward, F); typical of the dUTP/UDG sequencing method.
FR: the first read (/1) of the fragment pair is sequenced as sense (forward, F), and the second read (/2) is in the antisense strand (reverse, R).
Unpaired (single) reads:
F: the single read is in the sense (forward) orientation
R: the single read is in the antisense (reverse) orientation
By setting the --SS_lib_type parameter to one of the above, you are indicating that the reads are strand-specific. By default, reads are treated as not strand-specific.
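As a minimal sketch of the rule above, a wrapper script could check the library type against the read layout before adding it to the command line. The helper name ss_lib_type_args is hypothetical; only the --SS_lib_type flag and its four values come from the usage text.

```python
PAIRED_TYPES = {"RF", "FR"}   # valid values for paired reads
SINGLE_TYPES = {"F", "R"}     # valid values for unpaired (single) reads


def ss_lib_type_args(lib_type, paired):
    """Return the command-line arguments for --SS_lib_type.

    lib_type=None means the reads are not strand-specific, which is
    the default, so no argument is emitted.
    """
    if lib_type is None:
        return []
    allowed = PAIRED_TYPES if paired else SINGLE_TYPES
    if lib_type not in allowed:
        raise ValueError(
            f"{lib_type!r} is not valid for "
            f"{'paired' if paired else 'single'} reads; "
            f"expected one of {sorted(allowed)}"
        )
    return ["--SS_lib_type", lib_type]
```

Rejecting a mismatched value early (for instance F with paired reads) avoids starting a long assembly with the wrong orientation setting.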
Other important considerations:
Whether you use FASTQ or FASTA formatted input files, be sure to keep the reads oriented as they are reported by Illumina if the data are strand-specific, because Trinity will properly orient
the sequences according to the specified library type. If the data are not strand-specific, no worries, because the reads will be parsed in both orientations.
If you have both paired and unpaired data, and the data are NOT strand-specific, you can combine the unpaired data with the left reads of the paired fragments. Be sure that the unpaired reads have
a /1 as a suffix to the accession value similarly to the left fragment reads. The right fragment reads should all have /2 as the accession suffix. Then, run Trinity using the --left and --right parameters as if all the data were paired.
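One way to prepare unpaired reads for merging with the left reads is to make sure every accession ends in /1. The sketch below assumes plain four-line FASTQ records and treats the first whitespace-delimited token of the header as the accession; the function name with_left_suffix is hypothetical and not part of Trinity.

```python
from typing import Iterable, Iterator


def with_left_suffix(fastq_lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTQ lines with '/1' appended to each read accession.

    Assumes strict 4-line records (header, sequence, '+', qualities),
    so header lines are identified by position, not by a leading '@'
    (quality strings may also start with '@').
    """
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            if not line.startswith("@"):
                raise ValueError(f"expected FASTQ header at line {i + 1}")
            name, sep, rest = line.partition(" ")
            if name.endswith("/2"):
                raise ValueError(f"{name} is marked as a right read")
            if not name.endswith("/1"):
                name += "/1"
            line = name + sep + rest
        yield line
```

The output lines can then be appended to the left-reads file, after which Trinity is run with --left and --right as described above.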
If you have multiple paired-end library fragment sizes, set the --group_pairs_distance according to the larger insert library. Pairings that exceed that distance will
be treated as if they were unpaired by the Butterfly process.
By setting the --CPU option, you are indicating the maximum number of threads to be used by processes within Trinity. Note that Inchworm alone will be capped at 6 threads,
since performance will not improve for this step beyond that setting.
Typical Trinity Command Line
A typical Trinity command for assembling non-strand-specific RNA-seq data runs the entire process on a single high-memory server. Aim for about 1G of RAM per 1M ~76-base Illumina paired reads, although often much less memory is required.
Run Trinity like so:
Trinity.pl --seqType fq --JM 10G --left reads_1.fq --right reads_2.fq --CPU 6
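The sizing guidance and the command above can be captured in a short script. This is a sketch under stated assumptions: the helper names are hypothetical, the RAM estimate simply applies the 1G per 1M read pairs rule of thumb as an upper guide, and the thread cap reflects the documented min(6, --CPU) default for Inchworm. The flags themselves are taken from the usage text.

```python
import math


def estimated_ram_gb(num_read_pairs):
    """Rule-of-thumb RAM in GB: about 1G per 1M paired reads."""
    return max(1, math.ceil(num_read_pairs / 1_000_000))


def inchworm_threads(cpu):
    """Threads Inchworm would use by default: min(6, --CPU)."""
    return min(6, cpu)


def trinity_command(left, right, jm="10G", cpu=6, seq_type="fq"):
    """Build the argument list for a paired-end Trinity.pl run."""
    if seq_type not in ("fa", "fq"):
        raise ValueError("seq_type must be 'fa' or 'fq'")
    if not jm.endswith("G"):
        raise ValueError("--JM must include the 'G' character, e.g. 10G")
    return ["Trinity.pl", "--seqType", seq_type, "--JM", jm,
            "--left", *left, "--right", *right, "--CPU", str(cpu)]


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run() on a machine where Trinity.pl is installed.
    print(" ".join(trinity_command(["reads_1.fq"], ["reads_2.fq"])))
```

Building the command as a list rather than a single string keeps file names with spaces intact if the list is later handed to subprocess.run.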
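Example data and a sample pipeline are provided in the sample_data/test_Trinity_Assembly/ directory of the Trinity installation; see 'runMe.sh' there for an example Trinity execution.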
Output of Trinity
When Trinity completes, it will create a Trinity.fasta output file in the trinity_out_dir/ output directory (or the output directory you specify). Obtain basic stats for the number of transcripts, components, and contig N50 value by running:
% $TRINITY_HOME/util/TrinityStats.pl trinity_out_dir/Trinity.fasta
Total trinity transcripts: 9351
Total trinity components: 8695
Contig N50: 1585
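For readers unfamiliar with the N50 statistic, the sketch below computes it using the conventional definition: the length of the contig at which the cumulative length of contigs, sorted longest first, reaches at least half of the total assembly length. How TrinityStats.pl computes its value is not shown here, so treat this as an illustration of the conventional statistic rather than a reimplementation of that script; both function names are hypothetical.

```python
def fasta_lengths(lines):
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)            # start a new record
        elif line and lengths:
            lengths[-1] += len(line)     # sequence may span several lines
    return lengths


def n50(lengths):
    """Conventional N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("no sequence data")
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
```

With a real assembly, the lengths could be read with fasta_lengths(open("trinity_out_dir/Trinity.fasta")) and passed to n50.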
After obtaining Trinity transcripts, there are downstream processes available to further explore these
data.