Other parsers¶
VCF_Reader and VariantCall¶
VCF is a text file format (usually stored in compressed form). It contains meta-information lines, a header line, and then data lines, each containing information about a position in the genome.
Genotype information on samples for each position is optional.
See the definitions at
As usual, there is a parser class, called VCF_Reader, that can generate an
iterator of objects describing the structural variant calls. These objects are of type VariantCall
and each describes one line of a VCF file. See below for an example.
-
class
HTSeq.VCF_Reader(filename_or_sequence)¶ As a subclass of
FileOrSequence, VCF_Reader can be initialized either with a file name or with an open file or another sequence of lines. When requesting an iterator, it generates objects of type
VariantCall.
-
metadata¶ VCF_Reader skips all lines starting with a single ‘#’, as this marks a comment. However, lines starting with ‘##’ contain metadata (information about filters and about the fields in the ‘info’ column).
-
parse_meta(header_filename = None)¶ The VCF_Reader normally does not parse the meta-information, and the
VariantCall does not contain unpacked meta-information either. The function parse_meta reads the header information either from the attached FileOrSequence or from a file opened from the provided header_filename. This is important if you want to access sample-specific information for the VariantCall objects in your .vcf file.
-
make_info_dict()¶ This function will parse the info string and create the attribute
infodict, which contains a dict with key:value pairs containing the type information for each entry of the VariantCall’s info field.
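To make the idea of a type dictionary concrete, here is a minimal sketch in plain Python of how the `##INFO` meta-information lines of a VCF header could be turned into a mapping from info keys to converters. This is not HTSeq's implementation, and the actual layout of `infodict` may differ; the names `build_info_types`, `INFO_LINE` and `CONVERTERS` are hypothetical, and the header lines are made-up sample data.

```python
import re

# Matches meta-information lines such as
# ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
INFO_LINE = re.compile(r'^##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,>]+)')

# Illustrative mapping from VCF type names to Python converters (an assumption).
CONVERTERS = {"Integer": int, "Float": float, "Flag": bool,
              "Character": str, "String": str}


def build_info_types(lines):
    """Return {info_id: converter} built from the '##INFO' lines of a header."""
    types = {}
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information lines come first; stop at the header line
        match = INFO_LINE.match(line)
        if match:
            info_id, _number, type_name = match.groups()
            types[info_id] = CONVERTERS.get(type_name, str)
    return types


# Made-up header lines for illustration only.
header = [
    '##fileformat=VCFv4.0',
    '##INFO=<ID=RSPOS,Number=1,Type=Integer,Description="Position">',
    '##INFO=<ID=GMAF,Number=1,Type=Float,Description="Allele frequency">',
    '##INFO=<ID=VLD,Number=0,Type=Flag,Description="Is validated">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO',
]
info_types = build_info_types(header)
```

A mapping of this kind is what allows the values in the info column, which are plain strings in the file, to be converted to integers, floats or flags.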
-
-
class
HTSeq.VariantCall(line, nsamples = 0, sampleids=[])¶ A VariantCall object always contains the following attributes:
-
alt¶ The alternative base(s) of the
VariantCall. This is a list containing all called alternatives.
-
chrom¶ The chromosome on which the
VariantCall was called.
-
filter¶ This specifies whether the
VariantCall passed all the filters given in the .vcf header (value=PASS) or contains a list of the filters that failed (the filter IDs are also specified in the header).
-
format¶ Contains the format string specifying which per-sample information is stored in
VariantCall.samples.
-
id¶ The id of the
VariantCall, if it has been found in any database; for unknown variants this will be “.”.
-
info¶ This will contain either the string version of the info field for this
VariantCall or a dict with the parsed and processed info string.
-
pos¶ A
HTSeq.GenomicPosition that specifies the position of the VariantCall.
-
qual¶ The quality of the
VariantCall.
-
ref¶ The reference base(s) of the
VariantCall.
-
samples¶ A dict mapping sample IDs to subdicts which use the entries of
VariantCall.format as keys to store the per-sample information.
-
unpack_info(infodict)¶ This function parses the info string and replaces it with a dict representation if the infodict of the originating VCF_Reader is provided.
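The attributes listed above correspond closely to the tab-separated columns of a VCF data line. The following is a minimal plain-Python sketch of that correspondence, not HTSeq's own code: `parse_vcf_line` and `unpack_info_string` are hypothetical helpers, a plain dict stands in for a VariantCall, and the data line and sample IDs are made up.

```python
def unpack_info_string(info, info_types):
    """Turn 'A=1;B;C=0.5' into a dict, converting values via info_types."""
    if info == ".":
        return {}  # '.' marks an empty info field
    unpacked = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if not sep:
            unpacked[key] = True  # a key without a value is a flag
            continue
        convert = info_types.get(key, str)
        values = [convert(v) for v in value.split(",")]
        unpacked[key] = values[0] if len(values) == 1 else values
    return unpacked


def parse_vcf_line(line, sample_ids=(), info_types=None):
    """Split one VCF data line into a dict keyed like the attributes above."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]
    record = {
        "chrom": chrom,
        "pos": int(pos),        # position exactly as written in the file
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),  # list of all called alternatives
        "qual": qual,
        "filter": filt,
        "info": info,           # stays a string unless info_types is given
        "format": None,
        "samples": {},
    }
    if info_types is not None:
        record["info"] = unpack_info_string(info, info_types)
    if len(fields) > 8:
        record["format"] = fields[8]
        keys = fields[8].split(":")
        for sample_id, column in zip(sample_ids, fields[9:]):
            record["samples"][sample_id] = dict(zip(keys, column.split(":")))
    return record


# Made-up data line and sample IDs for illustration only.
line = "1\t10327\t.\tT\tC\t.\tPASS\tRSPOS=10327;VLD;GMAF=0.02\tGT:DP\t0|1:12\t1|1:7\n"
record = parse_vcf_line(line, ["sampleA", "sampleB"],
                        {"RSPOS": int, "GMAF": float})
```

Without a type dictionary the info field stays a string, which mirrors the two possible states of the info attribute described above.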
-
Example workflow for reading dbSNP in VCF format (obtained from dbSNP, ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/v4.0/00-All.vcf.gz):
>>> import HTSeq
>>> vcfr = HTSeq.VCF_Reader( "00-All.vcf.gz" )
>>> vcfr.parse_meta()
>>> vcfr.make_info_dict()
>>> for vc in vcfr:
...     print(vc)
1:10327:'T'->'C'
1:10433:'A'->'AC'
1:10439:'AC'->'A'
1:10440:'C'->'A'
FIXME The example above is not run, as the example file is still missing!
Wiggle Reader¶
The Wiggle format (file extension often .wig) is a format to describe numeric scores assigned to base-pair positions on a genome.
The class WiggleReader is a parser for such files.
-
class
HTSeq.WiggleReader(filename_or_sequence, verbose=True)¶ The class is instantiated with the file name of a Wiggle file, or a sequence of lines in Wiggle format. A
WiggleReader object generates an iterator, which yields pairs of the form (iv, score), where iv is a GenomicInterval object and score is a float with the score that the file assigns to the specified interval. If verbose is set to True, the user is alerted to skipped lines (comments or browser lines) by a message printed to the standard output.
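To illustrate how lines in Wiggle format map to (interval, score) pairs, here is a minimal plain-Python sketch. It is not HTSeq's implementation: `iter_wiggle` is a hypothetical function, plain `(chrom, start, end)` tuples stand in for GenomicInterval objects, and the sketch assumes that the 1-based positions of the Wiggle file are converted to 0-based, half-open intervals. The input lines are made up.

```python
def iter_wiggle(lines):
    """Yield ((chrom, start, end), score) with 0-based, half-open coordinates."""
    mode = chrom = position = step = None
    span = 1
    for raw in lines:
        line = raw.strip()
        # Skip blank lines, comments, and browser/track lines.
        if not line or line.startswith(("#", "browser", "track")):
            continue
        if line.startswith(("variableStep", "fixedStep")):
            # Declaration line, e.g. 'fixedStep chrom=chr2 start=11 step=5'
            tokens = line.split()
            mode = tokens[0]
            params = dict(token.split("=", 1) for token in tokens[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params.get("step", 1))
            continue
        if mode == "variableStep":
            # Data line: '<1-based position> <score>'
            pos_text, score_text = line.split()
            start = int(pos_text) - 1
            yield (chrom, start, start + span), float(score_text)
        elif mode == "fixedStep":
            # Data line: '<score>'; the position advances by 'step' each line.
            start = position - 1
            yield (chrom, start, start + span), float(line)
            position += step
        else:
            raise ValueError("data line before any declaration line: %r" % line)


# Made-up input for illustration only.
wig = [
    "track type=wiggle_0",
    "variableStep chrom=chr1 span=2",
    "101 1.5",
    "105 2.0",
    "fixedStep chrom=chr2 start=11 step=5",
    "0.5",
    "0.25",
]
pairs = list(iter_wiggle(wig))
```

Passing a list of lines instead of a file name mirrors the filename_or_sequence argument described above.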
BED Reader¶
The BED format is a format originally used to describe gene models but is also commonly used to describe other genomic features.
-
class
HTSeq.BED_Reader(filename_or_sequence)¶ The class is instantiated with the file name of a BED file, or a sequence of lines in BED format. A
BED_Reader object generates an iterator, which yields a GenomicFeature object for each line in the BED file (except for lines starting with track, which are skipped).
The attributes of the yielded GenomicFeature objects are as follows:
iv - a GenomicInterval object with the coordinates as given by the 1st, 2nd, 3rd, and 6th column of the BED file. If the BED file has fewer than 6 columns, the strand is set to “.”.
name - the name of the feature as given in the 4th column, or unnamed if the file has only three columns.
type - always the string BED line.
score - a float with the score as given by the 5th column (or None if the BED file has fewer than 5 columns).
thick - a GenomicInterval object containing the “thick” part of the feature, as specified by the 7th and 8th column, with chromosome and strand copied from iv (or None if the BED file has fewer than 8 columns).
itemRgb - a list of three int values, taken from the 9th column (None if the BED file has fewer than 9 columns). In a BED file, this triple is meant to specify the colour in which the feature should be drawn in a browser.
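The column-to-attribute mapping can be sketched in plain Python as follows. This is an illustration, not HTSeq's own code: `parse_bed_line` is a hypothetical helper, a dict stands in for the GenomicFeature, and tuples stand in for GenomicInterval objects. The sketch follows the standard BED column order (strand in column 6, thickStart and thickEnd in columns 7 and 8, itemRgb in column 9).

```python
def parse_bed_line(line):
    """Map the columns of one BED line to the attributes described above."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs at least 3 columns")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    strand = fields[5] if len(fields) > 5 else "."
    return {
        # (chrom, start, end, strand) stands in for a GenomicInterval
        "iv": (chrom, start, end, strand),
        "name": fields[3] if len(fields) > 3 else "unnamed",
        "type": "BED line",
        "score": float(fields[4]) if len(fields) > 4 else None,
        # thickStart / thickEnd are columns 7 and 8 (indices 6 and 7)
        "thick": (chrom, int(fields[6]), int(fields[7]), strand)
                 if len(fields) > 7 else None,
        # itemRgb is column 9 (index 8), written as 'R,G,B'
        "itemRgb": [int(v) for v in fields[8].split(",")]
                   if len(fields) > 8 else None,
    }


def iter_bed(lines):
    """Yield one parsed feature per line, skipping 'track' lines."""
    for line in lines:
        if line.startswith("track"):
            continue
        yield parse_bed_line(line)


# Made-up input for illustration only.
bed = [
    'track name="example"',
    "chr7\t127471196\t127472363\tPos1\t0\t+\t127471196\t127472363\t255,0,0",
    "chr7\t127472363\t127473530",
]
features = list(iter_bed(bed))
```

The second data line has only three columns, so its optional attributes fall back to the defaults listed above.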