使用seqkit统计fastq.gz文件行数
主要是一个操作fastq文件的生信软件
>>> seqkit stats ./test.fq.gz
file format type num_seqs sum_len min_len avg_len max_len
./test.fq.gz FASTQ DNA 5,723,738 200,330,830 35 35 35
Seqkit writes gzip files very fast, much faster than the multi-threaded pigz,
therefore there's no need to pipe the result to gzip/pigz.
Seqkit also supports reading and writing xz (.xz) and zstd (.zst) formats since v2.2.0.
Bzip2 format is supported since v2.4.0.
Compression level:
format range default comment
gzip 1-9 5 https://github.com/klauspost/pgzip sets 5 as the default value.
xz NA NA https://github.com/ulikunitz/xz does not support.
zstd 1-4 2 roughly equals to zstd 1, 3, 7, 11, respectively.
bzip 1-9 6 https://github.com/dsnet/compress
Usage:
seqkit [command]
Available Commands:
amplicon extract amplicon (or specific region around it) via primer(s)
bam monitoring and online histograms of BAM record features
common find common sequences of multiple files by id/name/sequence
concat concatenate sequences with the same ID from multiple files
convert convert FASTQ quality encoding between Sanger, Solexa and Illumina
duplicate duplicate sequences N times
fa2fq retrieve corresponding FASTQ records by a FASTA file
faidx create FASTA index file and extract subsequence
fish look for short sequences in larger sequences using local alignment
fq2fa convert FASTQ to FASTA
fx2tab convert FASTA/Q to tabular format (and length, GC content, average quality...)
genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell)
grep search sequences by ID/name/sequence/sequence motifs, mismatch allowed
head print first N FASTA/Q records
head-genome print sequences of the first genome with common prefixes in name
locate locate subsequences/motifs, mismatch allowed
mutate edit sequence (point mutation, insertion, deletion)
pair match up paired-end reads from two fastq files
range print FASTA/Q records in a range (start:end)
rename rename duplicated IDs
replace replace name/sequence by regular expression
restart reset start position for circular genome
rmdup remove duplicated sequences by ID/name/sequence
sample sample sequences by number or proportion
sana sanitize broken single line FASTQ files
scat real time recursive concatenation and streaming of fastx files
seq transform sequences (extract ID, filter by length, remove gaps, reverse complement...)
shuffle shuffle sequences
sliding extract subsequences in sliding windows
sort sort sequences by id/name/sequence/length
split split sequences into files by id/seq region/size/parts (mainly for FASTA)
split2 split sequences into files by size/parts (FASTA, PE/SE FASTQ)
stats simple statistics of FASTA/Q files
subseq get subsequences by region/gtf/bed, including flanking sequences
sum compute message digest for all sequences in FASTA/Q files
tab2fx convert tabular format to FASTA/Q format
translate translate DNA/RNA to protein sequence (supporting ambiguous bases)
version print version information and check for update
watch monitoring and online histograms of sequence features
Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on
which seqkit guesses the sequence type (0 for whole seq)
(default 10000)
--compress-level int compression level for gzip, zstd, xz and bzip2. type "seqkit -h"
for the range and default value for each format (default -1)
-h, --help help for seqkit
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2|
Pseud...
--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
--infile-list string file of input files list (one file per line), if given, they are
appended to files from cli arguments
-w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it
automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. can also set with environment variable
SEQKIT_THREADS) (default 4)