pbcore.io.dataset¶
The Python DataSet XML API is designed to be a lightweight interface for creating, opening, manipulating and writing DataSet XML files. It provides both a native Python API and console entry points for use in manual dataset curation or as a resource for P_Module developers.
The API and console entry points are designed around the set operations one might perform on the various types of data held by a DataSet XML: merge, split, write, etc. While various types of DataSets can be found in XML files, the API (and, to an extent, the console entry point dataset.py) has DataSet as its base type, with various subtypes extending or replacing functionality as needed.
Console Entry Point Usage¶
The following sub-commands are available through the main script, dataset.py:
usage: dataset.py [-h] [-v] [--debug]
{create,filter,merge,split,validate,loadstats,consolidate}
...
Run dataset.py by specifying a command.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
--debug Turn on debug level logging
DataSet sub-commands:
{create,filter,merge,split,validate,loadstats,consolidate}
Type {command} -h for a command's options
Create:
usage: dataset.py create [-h] [--type DSTYPE] [--novalidate] [--relative]
outfile infile [infile ...]
Create an XML file from a fofn or bam
positional arguments:
outfile The XML to create
infile The fofn or BAM file(s) to make into an XML
optional arguments:
-h, --help show this help message and exit
--type DSTYPE The type of XML to create
--novalidate Don't validate the resulting XML, don't touch paths
--relative Make the included paths relative instead of absolute (not
compatible with --novalidate)
Filter:
usage: dataset.py filter [-h] infile outfile filters [filters ...]
Add filters to an XML file. Suggested fields: ['bcf', 'bcq', 'bcr',
'length', 'pos', 'qend', 'qname', 'qstart', 'readstart', 'rname', 'rq',
'tend', 'tstart', 'zm']. More expensive fields: ['accuracy', 'bc', 'movie',
'qs']
positional arguments:
infile The xml file to filter
outfile The resulting xml file
filters The values and thresholds to filter (e.g. 'rq>0.85')
optional arguments:
-h, --help show this help message and exit
Merge:
usage: dataset.py merge [-h] outfile infiles [infiles ...]
Combine XML (and BAM) files
positional arguments:
outfile The resulting XML file
infiles The XML files to merge
optional arguments:
-h, --help show this help message and exit
Validate:
usage: dataset.py validate [-h] infile
Validate ResourceId files (XML validation only available in testing)
positional arguments:
infile The XML file to validate
optional arguments:
-h, --help show this help message and exit
Load PipeStats:
usage: dataset.py loadstats [-h] [--outfile OUTFILE] infile statsfile
Load an sts.xml file into a DataSet XML file
positional arguments:
infile The XML file to modify
statsfile The .sts.xml file to load
optional arguments:
-h, --help show this help message and exit
--outfile OUTFILE The XML file to output
Split:
usage: dataset.py split [-h] [--contigs] [--chunks CHUNKS] [--subdatasets]
[--outdir OUTDIR]
infile ...
Split the dataset
positional arguments:
infile The xml file to split
outfiles The resulting xml files
optional arguments:
-h, --help show this help message and exit
--contigs Split on contigs
--chunks CHUNKS Split contigs into <chunks> total windows
--subdatasets Split on subdatasets
--outdir OUTDIR Specify an output directory
Consolidate:
usage: dataset.py consolidate [-h] [--numFiles NUMFILES] [--noTmp]
infile datafile xmlfile
Consolidate the XML files
positional arguments:
infile The XML file to consolidate
datafile The resulting data file
xmlfile The resulting XML file
optional arguments:
-h, --help show this help message and exit
--numFiles NUMFILES The number of data files to produce (1)
--noTmp Don't copy to a tmp location to ensure local disk
use
Usage Examples¶
Filter Reads (CLI version)¶
In this scenario we have one or more bam files worth of subreads, aligned or otherwise, that we want to filter and put in a single bam file. This is possible using the CLI with the following steps, starting with a DataSet XML file:
# usage: dataset.py filter <in_fn.xml> <out_fn.xml> <filters>
dataset.py filter in_fn.subreadset.xml filtered_fn.subreadset.xml 'rq>0.85'
# usage: dataset.py consolidate <in_fn.xml> <out_data_fn.bam> <out_fn.xml>
dataset.py consolidate filtered_fn.subreadset.xml consolidate.subreads.bam out_fn.subreadset.xml
The filtered DataSet and the consolidated DataSet should be read-for-read equivalent when used with SMRT Analysis software.
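As a quick sanity check (a sketch only, reusing the file names produced by the commands above), both XMLs can be opened with the Python API and their record counts compared:
from pbcore.io import SubreadSet
filtered = SubreadSet('filtered_fn.subreadset.xml')
consolidated = SubreadSet('out_fn.subreadset.xml')
# The consolidated set holds the same reads, now in a single BAM,
# so the record counts should agree.
print(filtered.numRecords, consolidated.numRecords)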
Filter Reads (API version)¶
The API version of filtering allows for more advanced filtering criteria:
from pbcore.io import SubreadSet
ss = SubreadSet('in_fn.subreadset.xml')
ss.filters.addRequirement(rname=[('=', 'E.faecalis.2'),
('=', 'E.faecalis.2')],
tStart=[('<', '99'),
('<', '299')],
tEnd=[('>', '0'),
('>', '200')])
Produces the following conditions for a read to be considered passing:
(rname = E.faecalis.2 AND tstart < 99 AND tend > 0) OR (rname = E.faecalis.2 AND tstart < 299 AND tend > 200)
You can add sets of filters by providing equal length lists of requirements for each filter.
Additional requirements added singly will be added to all filters:
ss.filters.addRequirement(rq=[('>', '0.85')])
(rname = E.faecalis.2 AND tstart < 99 AND tend > 0 AND rq > 0.85) OR (rname = E.faecalis.2 AND tstart < 299 AND tend > 200 AND rq > 0.85)
Additional requirements added with a plurality of options will duplicate the previous requirements for each option:
ss.filters.addRequirement(length=[('>', 500), ('>', 1000)])
(rname = E.faecalis.2 AND tstart < 99 AND tend > 0 AND rq > 0.85 AND length > 500) OR (rname = E.faecalis.2 AND tstart < 299 AND tend > 200 AND rq > 0.85 AND length > 500) OR (rname = E.faecalis.2 AND tstart < 99 AND tend > 0 AND rq > 0.85 AND length > 1000) OR (rname = E.faecalis.2 AND tstart < 299 AND tend > 200 AND rq > 0.85 AND length > 1000)
Of course you can always wipe the filters and start over:
ss.filters = None
Consolidation works much as it does in the CLI version:
ss.consolidate('cons.bam')
ss.write('cons.xml')
Resequencing Pipeline (CLI version)¶
In this scenario, we have two movies worth of subreads in two SubreadSets that we want to align to a reference, merge together, split into DataSet chunks by contig, then send through quiver on a chunkwise basis (in parallel).
Align each movie to the reference, producing a dataset with one bam file for each execution:
pbalign movie1.subreadset.xml referenceset.xml movie1.alignmentset.xml
pbalign movie2.subreadset.xml referenceset.xml movie2.alignmentset.xml
Merge the files into a FOFN-like dataset (bams aren’t touched):
# dataset.py merge <out_fn> <in_fn> [<in_fn> <in_fn> ...]
dataset.py merge merged.alignmentset.xml movie1.alignmentset.xml movie2.alignmentset.xml
Split the dataset into chunks by contig (rname) (bams aren’t touched). Note that supplying output files splits the dataset into that many output files (up to the number of contigs), with multiple contigs per file. Not supplying output files splits the dataset into one output file per contig, named automatically. Specifying a number of chunks instead will produce that many files, with contig or even sub-contig (reference window) splitting:
dataset.py split --contigs --chunks 8 merged.alignmentset.xml
Quiver then consumes these chunks:
variantCaller.py --alignmentSetRefWindows --referenceFileName referenceset.xml --outputFilename chunk1consensus.fasta --algorithm quiver chunk1contigs.alignmentset.xml
variantCaller.py --alignmentSetRefWindows --referenceFileName referenceset.xml --outputFilename chunk2consensus.fasta --algorithm quiver chunk2contigs.alignmentset.xml
The chunking works by duplicating the original merged dataset (no bam duplication) and adding filters to each duplicate such that only reads belonging to the appropriate contigs are emitted. The contigs are distributed amongst the output files in such a way that the total number of records per chunk is about even.
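A rough API sketch of the same idea (assuming the merged.alignmentset.xml produced above): each chunk is a full copy of the dataset whose filters restrict it to a subset of contigs.
from pbcore.io import AlignmentSet
dset = AlignmentSet('merged.alignmentset.xml')
for i, chunk in enumerate(dset.split(contigs=True, chunks=8)):
    # Same underlying BAM files in every chunk; only the filters differ.
    print(i, chunk.numRecords, str(chunk.filters))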
Tangential Information¶
DataSet.refNames (which returns a list of reference names available in the dataset) is also subject to the filtering imposed during the split. Therefore you won't be running through superfluous (and apparently unsampled) contigs to get the reads in this chunk. The DataSet.records generator is also subject to filtering, but not as efficiently as readsInRange. If you do not have a reference window, readsInReference() is also an option.
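For example (a sketch, using one of the chunk files written above), iterating over only the contigs sampled in this chunk looks like:
from pbcore.io import AlignmentSet
aln = AlignmentSet('chunk1contigs.alignmentset.xml')
for rname in aln.refNames:
    # refNames honors the chunk's filters, so only this chunk's contigs appear
    for read in aln.readsInReference(rname):
        print('hn: %i' % read.holeNumber)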
As the bam files are never touched, each dataset contains all the information necessary to access all reads for all contigs. Doing so on these filtered datasets would require disabling the filters first:
dset.disableFilters()
Or removing the specific filter giving you problems:
dset.filters.removeRequirement('rname')
Resequencing Pipeline (API version)¶
In this scenario, we have two movies worth of subreads in two SubreadSets that we want to align to a reference, merge together, split into DataSet chunks by contig, then send through quiver on a chunkwise basis (in parallel). We want to do them using the API, rather than the CLI.
Align each movie to the reference, producing a dataset with one bam file for each execution
# CLI (or see pbalign API):
pbalign movie1.subreadset.xml referenceset.xml movie1.alignmentset.xml
pbalign movie2.subreadset.xml referenceset.xml movie2.alignmentset.xml
Merge the files into a FOFN-like dataset (bams aren’t touched)
# API, filename_list is dummy data:
from pbcore.io import AlignmentSet
filename_list = ['movie1.alignmentset.xml', 'movie2.alignmentset.xml']
# open:
dsets = [AlignmentSet(fn) for fn in filename_list]
# merge with + operator:
from functools import reduce
dset = reduce(lambda x, y: x + y, dsets)
# OR:
dset = AlignmentSet(*filename_list)
Split the dataset into chunks by contigs (or subcontig windows)
# split:
dsets = dset.split(contigs=True, chunks=8)
Quiver then consumes these chunks
# write out if you need to (or pass directly to quiver API):
outfilename_list = ['chunk1contigs.alignmentset.xml', 'chunk2contigs.alignmentset.xml']
# write with the 'write' method:
for ds, nm in zip(dsets, outfilename_list):
    ds.write(nm)
# CLI (or see quiver API):
variantCaller.py --alignmentSetRefWindows --referenceFileName referenceset.xml --outputFilename chunk1consensus.fasta --algorithm quiver chunk1contigs.alignmentset.xml
variantCaller.py --alignmentSetRefWindows --referenceFileName referenceset.xml --outputFilename chunk2consensus.fasta --algorithm quiver chunk2contigs.alignmentset.xml
# Inside quiver (still using the python dataset API):
import itertools
aln = AlignmentSet(fname)
# get this set's windows:
refWindows = aln.refWindows
# gather the reads for these windows using readsInRange, e.g.:
reads = list(itertools.chain.from_iterable(
    aln.readsInRange(rId, start, end) for rId, start, end in refWindows))
API overview¶
The chunking works by duplicating the original merged dataset (no bam duplication) and adding filters to each duplicate such that only reads belonging to the appropriate contigs/windows are emitted. The contigs are distributed amongst the output files in such a way that the total number of records per chunk is about even.
DataSets can be created using the appropriate constructor (SubreadSet), or with the common constructor (DataSet) and later cast to a specific type (copy(asType="SubreadSet")). The DataSet constructor acts as a factory function (an artifact of early API designs). The factory behavior is defined in the DataSet metaclass.
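A minimal sketch of both routes (the BAM filename here is a placeholder):
from pbcore.io import DataSet, SubreadSet
# Generic constructor; the factory behavior lives in the DataSet metaclass:
ds = DataSet('movie1.subreads.bam')
# Cast to a specific type later if needed:
ss = ds.copy(asType='SubreadSet')
# Or open with the specific constructor from the start:
ss = SubreadSet('movie1.subreads.bam')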
Inheritance diagram (summary): DataSet is the base class. ReadSet and ContigSet derive from DataSet; AlignmentSet, ConsensusReadSet, SubreadSet, and TranscriptSet derive from ReadSet; ConsensusAlignmentSet and TranscriptAlignmentSet derive from AlignmentSet; BarcodeSet and ReferenceSet derive from ContigSet; GmapReferenceSet derives from ReferenceSet. MissingFileError derives from InvalidDataSetIOError.
Classes representing DataSets of various types.
- class pbcore.io.dataset.DataSetIO.AlignmentSet(*files, **kwargs)¶
Bases:
ReadSet
DataSet type specific to Alignments. No type specific Metadata exists, so the base class version is OK (this just ensures type representation on output and expandability).
- __init__(*files, **kwargs)¶
An AlignmentSet
- Args:
- files
handled by super
- referenceFastaFname=None
the reference fasta filename for this alignment.
- strict=False
see base class
- skipCounts=False
see base class
- countRecords(rname=None, winStart=None, winEnd=None)¶
Count the number of records mapped to ‘rname’ that overlap with ‘window’
- property fullRefNames¶
A list of reference full names (full header).
- intervalContour(rname, tStart=0, tEnd=None)¶
Take a set of index records and build a pileup of intervals, or “contour” describing coverage over the contig
.. note:: Naively incrementing values in an array is too slow and takes too much memory. Sorting tuples by starts and ends and iterating through them and the reference (O(nlogn + nlogn + n + n + m)) takes too much memory and time. Iterating over the reference, using numpy conditional indexing at each base on tStart and tEnd columns uses no memory, but is too slow (O(nm), but in numpy (C, hopefully)). Building a delta list via sorted tStarts and tEnds one at a time saves memory and is ~5x faster than the second method above (O(nlogn + nlogn + m)).
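For illustration, a standalone sketch of the delta-list approach described above (not the library's implementation; tEnd is treated as end-exclusive):
import numpy as np
def coverage_contour(tStarts, tEnds, refLength):
    # +1 at each read start, -1 just past each read end,
    # then a cumulative sum recovers per-base coverage.
    deltas = np.zeros(refLength + 1, dtype=int)
    for start, end in zip(tStarts, tEnds):
        deltas[start] += 1
        deltas[end] -= 1
    return np.cumsum(deltas)[:refLength]
# Two reads covering [0, 5) and [3, 8) on a 10 bp reference:
print(coverage_contour([0, 3], [5, 8], 10))
# -> [1 1 1 2 2 1 1 1 0 0]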
- readsInRange(refName, start, end, buffsize=50, usePbi=True, longest=False, sampleSize=0, justIndices=False)¶
A generator of (usually) BamAlignment objects for the reads in one or more Bam files pointed to by the ExternalResources in this DataSet that have at least one coordinate within the specified range in the reference genome.
Rather than developing some convoluted approach for dealing with auto-inferring the desired references, this method and self.refNames should allow users to compose the desired query.
- Args:
- refName
the name of the reference that we are sampling
- start
the start of the range (inclusive, index relative to reference)
- end
the end of the range (inclusive, index relative to reference)
- Yields:
BamAlignment objects
- Doctest:
>>> import pbcore.data.datasets as data
>>> from pbcore.io import AlignmentSet
>>> ds = AlignmentSet(data.getBam())
>>> for read in ds.readsInRange(ds.refNames[15], 100, 150):
...     print('hn: %i' % read.holeNumber)
hn: ...
- readsInReference(refName)¶
A generator of (usually) BamAlignment objects for the reads in one or more Bam files pointed to by the ExternalResources in this DataSet that are mapped to the specified reference genome.
- Args:
- refName
the name of the reference that we are sampling.
- Yields:
BamAlignment objects
- Doctest:
>>> import pbcore.data.datasets as data
>>> from pbcore.io import AlignmentSet
>>> ds = AlignmentSet(data.getBam())
>>> for read in ds.readsInReference(ds.refNames[15]):
...     print('hn: %i' % read.holeNumber)
hn: ...
- property records¶
A generator of (usually) BamAlignment objects for the records in one or more Bam files pointed to by the ExternalResources in this DataSet.
- Yields:
A BamAlignment object
- Doctest:
>>> import pbcore.data.datasets as data
>>> from pbcore.io import AlignmentSet
>>> ds = AlignmentSet(data.getBam())
>>> for record in ds.records:
...     print('hn: %i' % record.holeNumber)
hn: ...
- property recordsByReference¶
The records in this AlignmentSet, sorted by tStart.
- property refIds¶
A dict of refName: refId for the joined referenceInfoTable. TODO: deprecate in favor of the more descriptive rname2tid
- refInfo(key)¶
Return a column in the referenceInfoTable, tupled with the reference name. TODO(mdsmith)(2016-01-27): pick a better name for this method…
- refLength(rname)¶
The length of reference ‘rname’. This is expensive, so if you’re going to do many lookups cache self.refLengths locally and use that.
- property refLengths¶
A dict of refName: refLength
- property refNames¶
A list of reference names (id).
- property refWindows¶
Going to be tricky unless the filters are really focused on windowing the reference. Much nesting or duplication and the correct results are really not guaranteed
- referenceInfo(refName)¶
Select a row from the DataSet.referenceInfoTable using the reference name as a unique key (or ID, if you really have to)
- property referenceInfoTable¶
The merged reference info tables from the external resources. Record.ID is remapped to a unique integer key (though using record.Name is preferred).
.. note:: Reference names are assumed to be unique
- resourceReaders(refName=False)¶
A generator of Indexed*Reader objects for the ExternalResources in this DataSet.
- Args:
- refName
Only yield open resources if they have refName in their referenceInfoTable
- Yields:
An open indexed alignment file
- Doctest:
>>> import pbcore.data.datasets as data
>>> from pbcore.io import AlignmentSet
>>> ds = AlignmentSet(data.getBam())
>>> for seqFile in ds.resourceReaders():
...     for record in seqFile:
...         print('hn: %i' % record.holeNumber)
hn: ...
- property rname2tid¶
A dict of refName: refId for the joined referenceInfoTable
- splitContour(contour, splits)¶
Take a contour and a number of splits, return the location of each coverage mediated split with the first at 0
- split_references(chunks)¶
Chunks requested:
0 or >= num_refs: One chunk per reference
1 to (num_refs - 1): Grouped somewhat evenly by num_records
- property tid2rname¶
A dict of refId: refName for the joined referenceInfoTable
- class pbcore.io.dataset.DataSetIO.BarcodeSet(*files, **kwargs)¶
Bases:
ContigSet
DataSet type specific to Barcodes
- __init__(*files, **kwargs)¶
DataSet constructor
Initialize representations of the ExternalResources, MetaData, Filters, and LabeledSubsets, parse inputs if possible
- Args:
- files
one or more filenames or uris to read
- strict=False
strictly require all index files
- skipCounts=False
skip updating counts for faster opening
- Doctest:
>>> import os, tempfile
>>> import pbcore.data.datasets as data
>>> from pbcore.io import AlignmentSet, SubreadSet
>>> # Prog like pbalign provides a .bam file:
>>> # e.g. d = AlignmentSet("aligned.bam")
>>> # Something like the test files we have:
>>> inBam = data.getBam()
>>> d = AlignmentSet(inBam)
>>> # A UniqueId is generated, despite being a BAM input
>>> bool(d.uuid)
True
>>> dOldUuid = d.uuid
>>> # They can write this BAM to an XML:
>>> # e.g. d.write("alignmentset.xml")
>>> outdir = tempfile.mkdtemp(suffix="dataset-doctest")
>>> outXml = os.path.join(outdir, 'tempfile.xml')
>>> d.write(outXml)
>>> # And then recover the same XML:
>>> d = AlignmentSet(outXml)
>>> # The UniqueId will be the same
>>> d.uuid == dOldUuid
True
>>> # Inputs can be many and varied
>>> ds1 = AlignmentSet(data.getXml(7), data.getBam(1))
>>> ds1.numExternalResources
2
>>> ds1 = AlignmentSet(data.getFofn())
>>> ds1.numExternalResources
2
>>> # Constructors should be used directly
>>> SubreadSet(data.getSubreadSet(),
...            skipMissing=True)
<SubreadSet...
>>> # Even with untyped inputs
>>> AlignmentSet(data.getBam())
<AlignmentSet...
>>> # AlignmentSets can also be manipulated after opening:
>>> # Add external Resources:
>>> ds = AlignmentSet()
>>> _ = ds.externalResources.addResources(["IdontExist.bam"])
>>> ds.externalResources[-1].resourceId == "IdontExist.bam"
True
>>> # Add an index file
>>> pbiName = "IdontExist.bam.pbi"
>>> ds.externalResources[-1].addIndices([pbiName])
>>> ds.externalResources[-1].indices[0].resourceId == pbiName
True
- addMetadata(newMetadata, **kwargs)¶
Add metadata specific to this subtype, while leaning on the superclass method for generic metadata. Also enforce metadata type correctness.
- property metadata¶
Return the DataSet metadata as a DataSetMetadata object. Attributes should be populated intuitively, but see DataSetMetadata documentation for more detail.
- class pbcore.io.dataset.DataSetIO.ConsensusAlignmentSet(*files, **kwargs)¶
Bases:
AlignmentSet
Dataset type for aligned CCS reads. Essentially identical to AlignmentSet aside from the contents of the underlying BAM files.
- class pbcore.io.dataset.DataSetIO.ConsensusReadSet(*files, **kwargs)¶
Bases:
ReadSet
DataSet type specific to CCSreads. No type specific Metadata exists, so the base class version is OK (this just ensures type representation on output and expandability).
- Doctest:
>>> import pbcore.data.datasets as data
>>> from pbcore.io import ConsensusReadSet
>>> ds2 = ConsensusReadSet(data.getXml(2), strict=False,
...                        skipMissing=True)
>>> ds2
<ConsensusReadSet...
>>> ds2._metadata
<SubreadSetMetadata...
- class pbcore.io.dataset.DataSetIO.ContigSet(*files, **kwargs)¶
Bases:
DataSet
DataSet type specific to Contigs
- __init__(*files, **kwargs)¶
DataSet constructor
Initialize representations of the ExternalResources, MetaData, Filters, and LabeledSubsets, parse inputs if possible
- Args:
- files
one or more filenames or uris to read
- strict=False
strictly require all index files
- skipCounts=False
skip updating counts for faster opening
- Doctest:
>>> import os, tempfile
>>> import pbcore.data.datasets as data
>>> from pbcore.io import AlignmentSet, SubreadSet
>>> # Prog like pbalign provides a .bam file:
>>> # e.g. d = AlignmentSet("aligned.bam")
>>> # Something like the test files we have:
>>> inBam = data.getBam()
>>> d = AlignmentSet(inBam)
>>> # A UniqueId is generated, despite being a BAM input
>>> bool(d.uuid)
True
>>> dOldUuid = d.uuid
>>> # They can write this BAM to an XML:
>>> # e.g. d.write("alignmentset.xml")
>>> outdir = tempfile.mkdtemp(suffix="dataset-doctest")
>>> outXml = os.path.join(outdir, 'tempfile.xml')
>>> d.write(outXml)
>>> # And then recover the same XML:
>>> d = AlignmentSet(outXml)
>>> # The UniqueId will be the same
>>> d.uuid == dOldUuid
True
>>> # Inputs can be many and varied
>>> ds1 = AlignmentSet(data.getXml(7), data.getBam(1))
>>> ds1.numExternalResources
2
>>> ds1 = AlignmentSet(data.getFofn())
>>> ds1.numExternalResources
2
>>> # Constructors should be used directly
>>> SubreadSet(data.getSubreadSet(),
...            skipMissing=True)
<SubreadSet...
>>> # Even with untyped inputs
>>> AlignmentSet(data.getBam())
<AlignmentSet...
>>> # AlignmentSets can also be manipulated after opening:
>>> # Add external Resources:
>>> ds = AlignmentSet()
>>> _ = ds.externalResources.addResources(["IdontExist.bam"])
>>> ds.externalResources[-1].resourceId == "IdontExist.bam"
True
>>> # Add an index file
>>> pbiName = "IdontExist.bam.pbi"
>>> ds.externalResources[-1].addIndices([pbiName])
>>> ds.externalResources[-1].indices[0].resourceId == pbiName
True
- addMetadata(newMetadata, **kwargs)¶
Add metadata specific to this subtype, while leaning on the superclass method for generic metadata. Also enforce metadata type correctness.
- consolidate(outfn=None, numFiles=1, useTmp=False)¶
Consolidation should be implemented for window text in names and for filters in ContigSets
- property contigNames¶
The names assigned to the External Resources, or contigs if no name assigned.
- property contigs¶
A generator of contigs from the fastaReader objects for the ExternalResources in this ReferenceSet.
- Yields:
A fasta file entry
- get_contig(contig_id)¶
Get a contig by ID
- induceIndices(force=False)¶
Generate indices for ExternalResources.
Not compatible with DataSet base type
- property metadata¶
Return the DataSet metadata as a DataSetMetadata object. Attributes should be populated intuitively, but see DataSetMetadata documentation for more detail.
- resourceReaders(refName=None)¶
A generator of fastaReader objects for the ExternalResources in this ReferenceSet.
- Yields:
An open fasta file
- split(nchunks)¶
Deep copy the DataSet into a number of new DataSets containing roughly equal chunks of the ExternalResources or subdatasets.
Examples:
split into exactly n datasets where each addresses a different piece of the collection of contigs:
dset.split(contigs=True, chunks=n)
split into at most n datasets where each addresses a different piece of the collection of contigs, but contigs are kept whole:
dset.split(contigs=True, maxChunks=n)
split into at most n datasets where each addresses a different piece of the collection of contigs and the number of chunks is in part based on the number of reads:
dset.split(contigs=True, maxChunks=n, breakContigs=True)
- Args:
- chunks
the number of chunks to split the DataSet.
- ignoreSubDatasets
(True) do not split by subdatasets
- contigs
split on contigs instead of external resources
- zmws
Split by zmws instead of external resources
- barcodes
Split by barcodes instead of external resources
- maxChunks
The upper limit on the number of chunks.
- breakContigs
Whether or not to break contigs
- byRecords
Split contigs by mapped records, rather than ref length
- targetSize
The target minimum number of reads per chunk
- updateCounts
Update the count metadata in each chunk
- Returns:
A generator of new DataSet objects (all other information deep copied).
- Doctest:
>>> import pbcore.data.datasets as data
>>> from pbcore.io import AlignmentSet
>>> # splitting is pretty intuitive:
>>> ds1 = AlignmentSet(data.getXml(11))
>>> # but divides up extRes's, so have more than one:
>>> ds1.numExternalResources > 1
True
>>> # the default is one AlignmentSet per ExtRes:
>>> dss = list(ds1.split())
>>> len(dss) == ds1.numExternalResources
True
>>> # but you can specify a number of AlignmentSets to produce:
>>> dss = list(ds1.split(chunks=1))
>>> len(dss) == 1
True
>>> dss = list(ds1.split(chunks=2, ignoreSubDatasets=True))
>>> len(dss) == 2
True
>>> # The resulting objects are similar:
>>> dss[0].uuid == dss[1].uuid
False
>>> dss[0].name == dss[1].name
True
>>> # Previously merged datasets are 'unmerged' upon split, unless
>>> # otherwise specified.
>>> # Lets try merging and splitting on subdatasets:
>>> ds1 = AlignmentSet(data.getXml(7))
>>> ds1.totalLength
123588
>>> ds1tl = ds1.totalLength
>>> ds2 = AlignmentSet(data.getXml(10))
>>> ds2.totalLength
117086
>>> ds2tl = ds2.totalLength
>>> # merge:
>>> dss = ds1 + ds2
>>> dss.totalLength == (ds1tl + ds2tl)
True
>>> # unmerge:
>>> ds1, ds2 = sorted(
...     list(dss.split(2, ignoreSubDatasets=False)),
...     key=lambda x: x.totalLength, reverse=True)
>>> ds1.totalLength == ds1tl
True
>>> ds2.totalLength == ds2tl
True
- updateCounts()¶
Update the TotalLength and NumRecords for this DataSet.
Not compatible with the base DataSet class, which has no ability to touch ExternalResources. -1 is used as a sentinel value for failed size determination. It should never be written out to XML in regular use.
- class pbcore.io.dataset.DataSetIO.DataSet(*files, **kwargs)¶
Bases:
object
The record containing the DataSet information, with possible type specific subclasses
- __add__(otherDataset)¶
Merge the representations of two DataSets without modifying the original datasets. (Fails if filters are incompatible).
- Args:
- otherDataset
a DataSet to merge with self
- Returns:
A new DataSet with members containing the union of the input DataSets’ members and subdatasets representing the input DataSets
- Doctest:
>>> import pbcore.data.datasets as data
>>> from pbcore.io import AlignmentSet
>>> from pbcore.io.dataset.DataSetWriter import toXml
>>> # xmls with different resourceIds: success
>>> ds1 = AlignmentSet(data.getXml(7))
>>> ds2 = AlignmentSet(data.getXml(10))
>>> ds3 = ds1 + ds2
>>> expected = ds1.numExternalResources + ds2.numExternalResources
>>> ds3.numExternalResources == expected
True
>>> # xmls with different resourceIds but conflicting filters:
>>> # failure to merge
>>> ds2.filters.addRequirement(rname=[('=', 'E.faecalis.1')])
>>> ds3 = ds1 + ds2
>>> ds3
>>> # xmls with same resourceIds: ignores new inputs
>>> ds1 = AlignmentSet(data.getXml(7))
>>> ds2 = AlignmentSet(data.getXml(7))
>>> ds3 = ds1 + ds2
>>> expected = ds1.numExternalResources
>>> ds3.numExternalResources == expected
True
- __deepcopy__(memo)¶
Deep copy this Dataset by recursively deep copying the members (objMetadata, DataSet metadata, externalResources, filters and subdatasets)
- __eq__(other)¶
Test for DataSet equality. The method specified in the documentation calls for md5 hashing the “Core XML” elements and comparing. This is the same procedure for generating the Uuid, so the same method may be used. However, as simultaneously or regularly updating the Uuid is not specified, we opt to not set the newUuid when checking for equality.
- Args:
- other
The other DataSet to compare to this DataSet.
- Returns:
T/F the Core XML elements of this and the other DataSet hash to the same value
- __init__(*files, **kwargs)¶
DataSet constructor
Initialize representations of the ExternalResources, MetaData, Filters, and LabeledSubsets, parse inputs if possible
- Args:
- files
one or more filenames or uris to read
- strict=False
strictly require all index files
- skipCounts=False
skip updating counts for faster opening
- Doctest:
>>> import os, tempfile
>>> import pbcore.data.datasets as data
>>> from pbcore.io import AlignmentSet, SubreadSet
>>> # Prog like pbalign provides a .bam file:
>>> # e.g. d = AlignmentSet("aligned.bam")
>>> # Something like the test files we have:
>>> inBam = data.getBam()
>>> d = AlignmentSet(inBam)
>>> # A UniqueId is generated, despite being a BAM input
>>> bool(d.uuid)
True
>>> dOldUuid = d.uuid
>>> # They can write this BAM to an XML:
>>> # e.g. d.write("alignmentset.xml")
>>> outdir = tempfile.mkdtemp(suffix="dataset-doctest")
>>> outXml = os.path.join(outdir, 'tempfile.xml')
>>> d.write(outXml)
>>> # And then recover the same XML:
>>> d = AlignmentSet(outXml)
>>> # The UniqueId will be the same
>>> d.uuid == dOldUuid
True
>>> # Inputs can be many and varied
>>> ds1 = AlignmentSet(data.getXml(7), data.getBam(1))
>>> ds1.numExternalResources
2
>>> ds1 = AlignmentSet(data.getFofn())
>>> ds1.numExternalResources
2
>>> # Constructors should be used directly
>>> SubreadSet(data.getSubreadSet(),
...            skipMissing=True)
<SubreadSet...
>>> # Even with untyped inputs
>>> AlignmentSet(data.getBam())
<AlignmentSet...
>>> # AlignmentSets can also be manipulated after opening:
>>> # Add external Resources:
>>> ds = AlignmentSet()
>>> _ = ds.externalResources.addResources(["IdontExist.bam"])
>>> ds.externalResources[-1].resourceId == "IdontExist.bam"
True
>>> # Add an index file
>>> pbiName = "IdontExist.bam.pbi"
>>> ds.externalResources[-1].addIndices([pbiName])
>>> ds.externalResources[-1].indices[0].resourceId == pbiName
True
- __repr__()¶
Represent the dataset with an informative string:
- Returns:
“<type uuid filenames>”
- addDatasets(otherDataSet)¶
Add subsets to a DataSet object using other DataSets.
The following method of enabling merge-based split prevents nesting of datasets more than one deep. Nested relationships are flattened.
Note
Most often used by the __add__ method, rather than directly.
- addExternalResources(newExtResources, updateCount=True)¶
Add additional ExternalResource objects, ensuring no duplicate resourceIds. Most often used by the __add__ method, rather than directly.
- Args:
- newExtResources
A list of new ExternalResource objects, either created de novo from a raw bam input, parsed from an xml input, or already contained in a separate DataSet object and being merged.
- Doctest:
>>> from pbcore.io.dataset.DataSetMembers import ExternalResource
>>> from pbcore.io import DataSet
>>> ds = DataSet()
>>> # it is possible to add ExtRes's as ExternalResource objects:
>>> er1 = ExternalResource()
>>> er1.resourceId = "test1.bam"
>>> er2 = ExternalResource()
>>> er2.resourceId = "test2.bam"
>>> er3 = ExternalResource()
>>> er3.resourceId = "test1.bam"
>>> ds.addExternalResources([er1], updateCount=False)
>>> len(ds.externalResources)
1
>>> # different resourceId: succeeds
>>> ds.addExternalResources([er2], updateCount=False)
>>> len(ds.externalResources)
2
>>> # same resourceId: fails
>>> ds.addExternalResources([er3], updateCount=False)
>>> len(ds.externalResources)
2
>>> # but it is probably better to add them a little deeper:
>>> ds.externalResources.addResources(
...     ["test3.bam"])[0].addIndices(["test3.bam.bai"])
- addFilters(newFilters, underConstruction=False)¶
Add new or extend the current list of filters. Public because there is already a reasonably compelling reason (the console script entry point). Most often used by the __add__ method.
- Args:
- newFilters
a Filters object or properly formatted Filters record
- Doctest:
>>> import pbcore.data.datasets as data
>>> from pbcore.io import SubreadSet, DataSet
>>> from pbcore.io.dataset.DataSetMembers import Filters
>>> ds1 = SubreadSet()
>>> filt = Filters()
>>> filt.addRequirement(rq=[('>', '0.85')])
>>> ds1.addFilters(filt)
>>> print(ds1.filters)
( rq > 0.85 )
>>> # Or load with a DataSet
>>> ds2 = DataSet(data.getXml(15))
>>> print(ds2.filters)
... ( rname = E.faecalis...
- addMetadata(newMetadata, **kwargs)¶
Add dataset metadata.
Currently we ignore duplicates while merging (though perhaps other transformations are more appropriate) and plan to remove/merge conflicting metadata with identical attribute names.
All metadata elements should be strings, deepcopy shouldn’t be necessary.
This method is most often used by the __add__ method, rather than directly.
- Args:
- newMetadata
a dictionary of object metadata from an XML file (or carefully crafted to resemble one), or a wrapper around said dictionary
- kwargs
new metadata fields to be piled into the current metadata (as an attribute)
- Doctest:
>>> import pbcore.data.datasets as data
>>> from pbcore.io import DataSet
>>> ds = DataSet()
>>> # it is possible to add new metadata:
>>> ds.addMetadata(None, Name='LongReadsRock')
>>> print(ds._metadata.getV(container='attrib', tag='Name'))
LongReadsRock
>>> # but most will be loaded and modified:
>>> ds2 = DataSet(data.getXml(7))
>>> ds2._metadata.totalLength
123588
>>> ds2._metadata.totalLength = 100000
>>> ds2._metadata.totalLength
100000
>>> ds2._metadata.totalLength += 100000
>>> ds2._metadata.totalLength
200000
>>> ds3 = DataSet(data.getXml(7))
>>> ds3.loadStats(data.getStats())
>>> ds4 = DataSet(data.getXml(10))
>>> ds4.loadStats(data.getStats())
>>> ds5 = ds3 + ds4
- property barcodes¶
Return the list of barcodes explicitly set by filters via DataSet.split(barcodes=True).
- classmethod castableTypes()¶
The types to which this DataSet type may be cast. This is a property instead of a member variable as we can enforce casting limits here (and modify if needed by overriding them in subclasses).
- Returns:
A dictionary of MetaType->Class mappings, e.g. ‘DataSet’: DataSet
- close()¶
Close all of the opened resource readers
- copy(asType=None)¶
Deep copy the representation of this DataSet
- Args:
- asType
The type of DataSet to return, e.g. ‘AlignmentSet’
- Returns:
A DataSet object that is identical but for UniqueId
- Doctest:
>>> from functools import reduce
>>> import pbcore.data.datasets as data
>>> from pbcore.io import DataSet, SubreadSet
>>> ds1 = DataSet(data.getXml(11))
>>> # Deep copying datasets is easy:
>>> ds2 = ds1.copy()
>>> # But the resulting uuid's should be different.
>>> ds1 == ds2
False
>>> ds1.uuid == ds2.uuid
False
>>> ds1 is ds2
False
>>> # Most members are identical
>>> ds1.name == ds2.name
True
>>> ds1.externalResources == ds2.externalResources
True
>>> ds1.filters == ds2.filters
True
>>> ds1.subdatasets == ds2.subdatasets
True
>>> len(ds1.subdatasets) == 2
True
>>> len(ds2.subdatasets) == 2
True
>>> # Except for the one that stores the uuid:
>>> ds1.objMetadata == ds2.objMetadata
False
>>> # And of course identical != the same object:
>>> assert not reduce(lambda x, y: x or y,
...                   [ds1d is ds2d for ds1d in
...                    ds1.subdatasets for ds2d in
...                    ds2.subdatasets])
>>> # But types are maintained:
>>> ds1 = SubreadSet(data.getXml(9), strict=True)
>>> ds1.metadata
<SubreadSetMetadata...
>>> ds2 = ds1.copy()
>>> ds2.metadata
<SubreadSetMetadata...
>>> # Lets try casting
>>> ds1 = DataSet(data.getBam())
>>> ds1
<DataSet...
>>> ds1 = ds1.copy(asType='SubreadSet')
>>> ds1
<SubreadSet...
>>> # Lets do some illicit casting
>>> ds1 = ds1.copy(asType='ReferenceSet')
Traceback (most recent call last):
TypeError: Cannot cast from SubreadSet to ReferenceSet
>>> # Lets try not having to cast
>>> ds1 = SubreadSet(data.getBam())
>>> ds1
<SubreadSet...
- copyFiles(outdir)¶
Copy all of the top level ExternalResources to an output directory ‘outdir’
- copyTo(dest, relative=False, subdatasets=False)¶
Doesn’t resolve resource name collisions
- property createdAt¶
Return the DataSet CreatedAt timestamp
- property description¶
The description of this DataSet
- disableFilters()¶
Disable read filtering for this object
- enableFilters()¶
Re-enable read filtering for this object
- property filters¶
Limit setting to ensure cache hygiene and filter compatibility
- induceIndices(force=False)¶
Generate indices for ExternalResources.
Not compatible with DataSet base type
- loadMetadata(filename)¶
Load pipeline metadata from a <moviename>.metadata.xml file (or other DataSet)
- Args:
- filename
the filename of a <moviename>.metadata.xml file
- loadStats(filename=None)¶
Load pipeline statistics from a <moviename>.sts.xml file. The subset of these data that are defined in the DataSet XSD become available via DataSet.metadata.summaryStats.<…> and will be written out to the DataSet XML format according to the DataSet XML XSD.
- Args:
- filename
the filename of a <moviename>.sts.xml file. If None: load all stats from sts.xml files, including for subdatasets.
- Doctest:
>>> import pbcore.data.datasets as data
>>> from pbcore.io import AlignmentSet
>>> ds1 = AlignmentSet(data.getXml(7))
>>> ds1.loadStats(data.getStats())
>>> ds2 = AlignmentSet(data.getXml(10))
>>> ds2.loadStats(data.getStats())
>>> ds3 = ds1 + ds2
>>> ds1.metadata.summaryStats.prodDist.bins
[1576, 901, 399, 0]
>>> ds2.metadata.summaryStats.prodDist.bins
[1576, 901, 399, 0]
>>> ds3.metadata.summaryStats.prodDist.bins
[3152, 1802, 798, 0]
- makePathsAbsolute(curStart='.')¶
As part of the validation process, make all ResourceIds absolute URIs rather than relative paths. Generally not called by API users.
- Args:
- curStart
The location from which relative paths should emanate.
- makePathsRelative(outDir=False)¶
Make things easier for writing test cases: make all ResourceIds relative paths rather than absolute paths. A less common use case for API consumers.
- Args:
- outDir
The location from which relative paths should originate
- merge(other, copyOnMerge=True, newuuid=True)¶
Merge an ‘other’ dataset with this dataset, same as the add operator, but can take arguments
- property metadata¶
Return the DataSet metadata as a DataSetMetadata object. Attributes should be populated intuitively, but see DataSetMetadata documentation for more detail.
- property name¶
The name of this DataSet
- newRandomUuid()¶
Generate a new random UUID
- newUuid(setter=True, random=False)¶
Generate and enforce the uniqueness of an ID for a new DataSet. While user-settable fields are stripped out of the Core DataSet object used for comparison, the previous UniqueId is not. That means that copies will still be unique, despite having the same contents.
- Args:
- setter=True
Setting to False allows MD5 hashes to be generated (e.g. for comparison with other objects) without modifying the object’s UniqueId
- random=False
If true, the new UUID will be generated randomly. Otherwise a hashing algorithm will be applied to "core" elements of the XML. This will yield a reproducible UUID for datasets that have the same "core" attributes/metadata.
- Returns:
The new Id, a properly formatted md5 hash of the Core DataSet
- Doctest:
>>> from pbcore.io import AlignmentSet
>>> ds = AlignmentSet()
>>> old = ds.uuid
>>> _ = ds.newUuid()
>>> old != ds.uuid
True
- property numExternalResources¶
The number of ExternalResources in this DataSet
- property numRecords¶
The number of records in this DataSet (from the metadata)
- processFilters()¶
Generate a list of functions to apply to a read, all of which return T/F. Each function is an OR filter, so any() true passes the read. These functions are the AND filters, and will likely check all() of other functions. These filtration functions are cached so that they are not regenerated from the base filters for every read
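Conceptually the result is an OR of ANDs; a minimal sketch of those semantics (not the cached implementation used by the library) looks like:
def read_passes(read, filter_branches):
    # filter_branches: one list of requirement predicates per filter;
    # a read passes if every requirement of at least one filter holds.
    return any(all(req(read) for req in branch)
               for branch in filter_branches)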
- reFilter(light=True)¶
The filters on this dataset have changed, update DataSet state as needed
- readsInSubDatasets(subNames=None)¶
To be used in conjunction with self.subSetNames
- property records¶
A generator of (usually) BamAlignment objects for the records in one or more Bam files pointed to by the ExternalResources in this DataSet.
- Yields:
A BamAlignment object
- Doctest:
>>> import pbcore.data.datasets as data
>>> from pbcore.io import AlignmentSet
>>> ds = AlignmentSet(data.getBam())
>>> for record in ds.records:
...     print('hn: %i' % record.holeNumber)
hn: ...
- resourceReaders()¶
Return a list of open pbcore Reader objects for the top level ExternalResources in this DataSet
- split(chunks=0, ignoreSubDatasets=True, contigs=False, maxChunks=0, breakContigs=False, targetSize=5000, zmws=False, barcodes=False, byRecords=False, updateCounts=True)¶
Deep copy the DataSet into a number of new DataSets containing roughly equal chunks of the ExternalResources or subdatasets.
Examples:
split into exactly n datasets where each addresses a different piece of the collection of contigs:
dset.split(contigs=True, chunks=n)
split into at most n datasets where each addresses a different piece of the collection of contigs, but contigs are kept whole:
dset.split(contigs=True, maxChunks=n)
split into at most n datasets where each addresses a different piece of the collection of contigs and the number of chunks is in part based on the number of reads:
dset.split(contigs=True, maxChunks=n, breakContigs=True)
- Args:
- chunks
the number of chunks to split the DataSet.
- ignoreSubDatasets
(True) do not split by subdatasets
- contigs
split on contigs instead of external resources
- zmws
Split by zmws instead of external resources
- barcodes
Split by barcodes instead of external resources
- maxChunks
The upper limit on the number of chunks.
- breakContigs
Whether or not to break contigs
- byRecords
Split contigs by mapped records, rather than ref length
- targetSize
The target minimum number of reads per chunk
- updateCounts
Update the count metadata in each chunk
- Returns:
A generator of new DataSet objects (all other information deep copied).
- Doctest:
>>> import pbcore.data.datasets as data
>>> from pbcore.io import AlignmentSet
>>> # splitting is pretty intuitive:
>>> ds1 = AlignmentSet(data.getXml(11))
>>> # but divides up extRes's, so have more than one:
>>> ds1.numExternalResources > 1
True
>>> # the default is one AlignmentSet per ExtRes:
>>> dss = list(ds1.split())
>>> len(dss) == ds1.numExternalResources
True
>>> # but you can specify a number of AlignmentSets to produce:
>>> dss = list(ds1.split(chunks=1))
>>> len(dss) == 1
True
>>> dss = list(ds1.split(chunks=2, ignoreSubDatasets=True))
>>> len(dss) == 2
True
>>> # The resulting objects are similar:
>>> dss[0].uuid == dss[1].uuid
False
>>> dss[0].name == dss[1].name
True
>>> # Previously merged datasets are 'unmerged' upon split, unless
>>> # otherwise specified.
>>> # Lets try merging and splitting on subdatasets:
>>> ds1 = AlignmentSet(data.getXml(7))
>>> ds1.totalLength
123588
>>> ds1tl = ds1.totalLength
>>> ds2 = AlignmentSet(data.getXml(10))
>>> ds2.totalLength
117086
>>> ds2tl = ds2.totalLength
>>> # merge:
>>> dss = ds1 + ds2
>>> dss.totalLength == (ds1tl + ds2tl)
True
>>> # unmerge:
>>> ds1, ds2 = sorted(
...     list(dss.split(2, ignoreSubDatasets=False)),
...     key=lambda x: x.totalLength, reverse=True)
>>> ds1.totalLength == ds1tl
True
>>> ds2.totalLength == ds2tl
True
- property subSetNames¶
The subdataset names present in this DataSet
- property tags¶
The tags of this DataSet
- property timeStampedName¶
The timeStampedName of this DataSet
- toExternalFiles()¶
Returns a list of top level external resources (no indices).
- toFofn(outfn=None, uri=False, relative=False)¶
Return a list of resource filenames (and write to optional outfile)
- Args:
- outfn
(None) the file to which the resource filenames are to be written. If None, the only emission is a returned list of file names.
- uri
(t/F) write the resource filenames as URIs.
- relative
(t/F) emit paths relative to outfofn or ‘.’ if no outfofn
- Returns:
A list of filenames or uris
- Writes:
(Optional) A file containing a list of filenames or uris
- Doctest:
>>> from pbcore.io import DataSet
>>> DataSet("bam1.bam", "bam2.bam", strict=False,
...         skipMissing=True).toFofn(uri=False)
['bam1.bam', 'bam2.bam']
- property totalLength¶
The total length of this DataSet
- property uniqueId¶
The UniqueId of this DataSet
- updateCounts()¶
Update the TotalLength and NumRecords for this DataSet.
Not compatible with the base DataSet class, which has no ability to touch ExternalResources. -1 is used as a sentinel value for failed size determination. It should never be written out to XML in regular use.
- property uuid¶
The UniqueId of this DataSet
- write(outFile, validate=True, modPaths=None, relPaths=None, pretty=True)¶
Write to disk as an XML file
- Args:
- outFile
The filename of the xml file to be created
- validate
T/F (True) validate the ExternalResource ResourceIds
- relPaths
T/F (None/no change) make the ExternalResource ResourceIds relative instead of absolute filenames
- modPaths
DEPRECATED (T/F) allow paths to be modified
- Doctest:
>>> import pbcore.data.datasets as data
>>> from pbcore.io import DataSet
>>> import tempfile, os
>>> outdir = tempfile.mkdtemp(suffix="dataset-doctest")
>>> outfile = os.path.join(outdir, 'tempfile.xml')
>>> ds1 = DataSet(data.getXml(), skipMissing=True)
>>> ds1.write(outfile, validate=False)
>>> ds2 = DataSet(outfile, skipMissing=True)
>>> ds1 == ds2
True
- property zmwRanges¶
Return the end-inclusive range of ZMWs covered by the dataset if this was explicitly set by filters via DataSet.split(zmws=True).
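A brief usage sketch (the input filename is a placeholder):
from pbcore.io import SubreadSet
ss = SubreadSet('movie1.subreadset.xml')
for chunk in ss.split(zmws=True, chunks=4):
    # Each chunk carries ZMW-range filters; zmwRanges reports them.
    print(chunk.zmwRanges)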
- class pbcore.io.dataset.DataSetIO.GmapReferenceSet(*files, **kwargs)¶
Bases:
ReferenceSet
DataSet type specific to GMAP References
- __init__(*files, **kwargs)¶
DataSet constructor
Initialize representations of the ExternalResources, MetaData, Filters, and LabeledSubsets, parse inputs if possible
- Args:
- files
one or more filenames or uris to read
- strict=False
strictly require all index files
- skipCounts=False
skip updating counts for faster opening
- Doctest:
>>> import os, tempfile
>>> import pbcore.data.datasets as data
>>> from pbcore.io import AlignmentSet, SubreadSet
>>> # Prog like pbalign provides a .bam file:
>>> # e.g. d = AlignmentSet("aligned.bam")
>>> # Something like the test files we have:
>>> inBam = data.getBam()
>>> d = AlignmentSet(inBam)
>>> # A UniqueId is generated, despite being a BAM input
>>> bool(d.uuid)
True
>>> dOldUuid = d.uuid
>>> # They can write this BAM to an XML:
>>> # e.g. d.write("alignmentset.xml")
>>> outdir = tempfile.mkdtemp(suffix="dataset-doctest")
>>> outXml = os.path.join(outdir, 'tempfile.xml')
>>> d.write(outXml)
>>> # And then recover the same XML:
>>> d = AlignmentSet(outXml)
>>> # The UniqueId will be the same
>>> d.uuid == dOldUuid
True
>>> # Inputs can be many and varied
>>> ds1 = AlignmentSet(data.getXml(7), data.getBam(1))
>>> ds1.numExternalResources
2
>>> ds1 = AlignmentSet(data.getFofn())
>>> ds1.numExternalResources
2
>>> # Constructors should be used directly
>>> SubreadSet(data.getSubreadSet(),
...            skipMissing=True)
<SubreadSet...
>>> # Even with untyped inputs
>>> AlignmentSet(data.getBam())
<AlignmentSet...
>>> # AlignmentSets can also be manipulated after opening:
>>> # Add external Resources:
>>> ds = AlignmentSet()
>>> _ = ds.externalResources.addResources(["IdontExist.bam"])
>>> ds.externalResources[-1].resourceId == "IdontExist.bam"
True
>>> # Add an index file
>>> pbiName = "IdontExist.bam.pbi"
>>> ds.externalResources[-1].addIndices([pbiName])
>>> ds.externalResources[-1].indices[0].resourceId == pbiName
True
- exception pbcore.io.dataset.DataSetIO.MissingFileError¶
Bases:
InvalidDataSetIOError
Specifically thrown by _fileExists(), and trapped in openDataSet().
- class pbcore.io.dataset.DataSetIO.ReadSet(*files, **kwargs)¶
Bases:
DataSet
Base type for read sets, should probably never be used as a concrete class
- __init__(*files, **kwargs)¶
DataSet constructor
Initialize representations of the ExternalResources, MetaData, Filters, and LabeledSubsets, parse inputs if possible
- Args:
- files
one or more filenames or uris to read
- strict=False
strictly require all index files
- skipCounts=False
skip updating counts for faster opening
- Doctest:
>>> import os, tempfile
>>> import pbcore.data.datasets as data
>>> from pbcore.io import AlignmentSet, SubreadSet
>>> # Prog like pbalign provides a .bam file:
>>> # e.g. d = AlignmentSet("aligned.bam")
>>> # Something like the test files we have:
>>> inBam = data.getBam()
>>> d = AlignmentSet(inBam)
>>> # A UniqueId is generated, despite being a BAM input
>>> bool(d.uuid)
True
>>> dOldUuid = d.uuid
>>> # They can write this BAM to an XML:
>>> # e.g. d.write("alignmentset.xml")
>>> outdir = tempfile.mkdtemp(suffix="dataset-doctest")
>>> outXml = os.path.join(outdir, 'tempfile.xml')
>>> d.write(outXml)
>>> # And then recover the same XML:
>>> d = AlignmentSet(outXml)
>>> # The UniqueId will be the same
>>> d.uuid == dOldUuid
True
>>> # Inputs can be many and varied
>>> ds1 = AlignmentSet(data.getXml(7), data.getBam(1))
>>> ds1.numExternalResources
2
>>> ds1 = AlignmentSet(data.getFofn())
>>> ds1.numExternalResources
2
>>> # Constructors should be used directly
>>> SubreadSet(data.getSubreadSet(),
...            skipMissing=True)
<SubreadSet...
>>> # Even with untyped inputs
>>> AlignmentSet(data.getBam())
<AlignmentSet...
>>> # AlignmentSets can also be manipulated after opening:
>>> # Add external Resources:
>>> ds = AlignmentSet()
>>> _ = ds.externalResources.addResources(["IdontExist.bam"])
>>> ds.externalResources[-1].resourceId == "IdontExist.bam"
True
>>> # Add an index file
>>> pbiName = "IdontExist.bam.pbi"
>>> ds.externalResources[-1].addIndices([pbiName])
>>> ds.externalResources[-1].indices[0].resourceId == pbiName
True
- addMetadata(newMetadata, **kwargs)¶
Add metadata specific to this subtype, while leaning on the superclass method for generic metadata. Also enforce metadata type correctness.
- assertBarcoded()¶
Test whether all resources are barcoded files
- consolidate(dataFile, numFiles=1, useTmp=True)¶
Consolidate a larger number of bam files to a smaller number of bam files (min 1)
- Args:
- dataFile
The name of the output file. If numFiles >1 numbers will be added.
- numFiles
The number of data files to be produced.
- property hasPbi¶
Test whether all resources are opened as IndexedBamReader objects
- induceIndices(force=False)¶
Generate indices for ExternalResources.
Not compatible with DataSet base type
- property isBarcoded¶
Determine whether all resources are barcoded files
- property mov2qid¶
A dict of movieId: movieName for the joined readGroupTable
- property movieIds¶
A dict of movieName: movieId for the joined readGroupTable. TODO: deprecate this for the more descriptive mov2qid
- property qid2mov¶
A dict of movieId: movieName for the joined readGroupTable
- property readGroupTable¶
Combine the readGroupTables of each external resource
- resourceReaders()¶
Open the files in this ReadSet
- split_movies(chunks)¶
Chunks requested:
0 or >= num_movies: One chunk per movie
1 to (num_movies - 1): Grouped somewhat evenly by num_records
- updateCounts()¶
Update the TotalLength and NumRecords for this DataSet.
Not compatible with the base DataSet class, which has no ability to touch ExternalResources. -1 is used as a sentinel value for failed size determination. It should never be written out to XML in regular use.
- class pbcore.io.dataset.DataSetIO.ReferenceSet(*files, **kwargs)¶
Bases:
ContigSet
DataSet type specific to References
- __init__(*files, **kwargs)¶
DataSet constructor
Initialize representations of the ExternalResources, MetaData, Filters, and LabeledSubsets, parse inputs if possible
- Args:
- files
one or more filenames or uris to read
- strict=False
strictly require all index files
- skipCounts=False
skip updating counts for faster opening
- Doctest:
>>> import os, tempfile
>>> import pbcore.data.datasets as data
>>> from pbcore.io import AlignmentSet, SubreadSet
>>> # Prog like pbalign provides a .bam file:
>>> # e.g. d = AlignmentSet("aligned.bam")
>>> # Something like the test files we have:
>>> inBam = data.getBam()
>>> d = AlignmentSet(inBam)
>>> # A UniqueId is generated, despite being a BAM input
>>> bool(d.uuid)
True
>>> dOldUuid = d.uuid
>>> # They can write this BAM to an XML:
>>> # e.g. d.write("alignmentset.xml")
>>> outdir = tempfile.mkdtemp(suffix="dataset-doctest")
>>> outXml = os.path.join(outdir, 'tempfile.xml')
>>> d.write(outXml)
>>> # And then recover the same XML:
>>> d = AlignmentSet(outXml)
>>> # The UniqueId will be the same
>>> d.uuid == dOldUuid
True
>>> # Inputs can be many and varied
>>> ds1 = AlignmentSet(data.getXml(7), data.getBam(1))
>>> ds1.numExternalResources
2
>>> ds1 = AlignmentSet(data.getFofn())
>>> ds1.numExternalResources
2
>>> # Constructors should be used directly
>>> SubreadSet(data.getSubreadSet(),
...            skipMissing=True)
<SubreadSet...
>>> # Even with untyped inputs
>>> AlignmentSet(data.getBam())
<AlignmentSet...
>>> # AlignmentSets can also be manipulated after opening:
>>> # Add external Resources:
>>> ds = AlignmentSet()
>>> _ = ds.externalResources.addResources(["IdontExist.bam"])
>>> ds.externalResources[-1].resourceId == "IdontExist.bam"
True
>>> # Add an index file
>>> pbiName = "IdontExist.bam.pbi"
>>> ds.externalResources[-1].addIndices([pbiName])
>>> ds.externalResources[-1].indices[0].resourceId == pbiName
True
- property refNames¶
The reference names assigned to the External Resources, or the contig names if no names are assigned.
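A brief sketch, assuming a hypothetical ReferenceSet XML on disk:

from pbcore.io import ReferenceSet

refs = ReferenceSet("my.referenceset.xml")  # hypothetical file name
for name in refs.refNames:                  # falls back to contig names when no names are assigned
    print(name)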
- class pbcore.io.dataset.DataSetIO.SubreadSet(*files, **kwargs)¶
Bases:
ReadSet
DataSet type specific to Subreads
DocTest:
>>> from pbcore.io import SubreadSet
>>> from pbcore.io.dataset.DataSetMembers import ExternalResources
>>> import pbcore.data.datasets as data
>>> ds1 = SubreadSet(data.getXml(no=5), skipMissing=True)
>>> ds2 = SubreadSet(data.getXml(no=5), skipMissing=True)
>>> # So they don't conflict:
>>> ds2.externalResources = ExternalResources()
>>> ds1
<SubreadSet...
>>> ds1._metadata
<SubreadSetMetadata...
>>> ds1._metadata
<SubreadSetMetadata...
>>> ds1.metadata
<SubreadSetMetadata...
>>> len(ds1.metadata.collections)
1
>>> len(ds2.metadata.collections)
1
>>> ds3 = ds1 + ds2
>>> len(ds3.metadata.collections)
2
>>> ds4 = SubreadSet(data.getSubreadSet(), skipMissing=True)
>>> ds4
<SubreadSet...
>>> ds4._metadata
<SubreadSetMetadata...
>>> len(ds4.metadata.collections)
1
- __init__(*files, **kwargs)¶
DataSet constructor
Initialize representations of the ExternalResources, MetaData, Filters, and LabeledSubsets, parse inputs if possible
- Args:
- files
one or more filenames or uris to read
- strict=False
strictly require all index files
- skipCounts=False
skip updating counts for faster opening
- Doctest:
>>> import os, tempfile
>>> import pbcore.data.datasets as data
>>> from pbcore.io import AlignmentSet, SubreadSet
>>> # Prog like pbalign provides a .bam file:
>>> # e.g. d = AlignmentSet("aligned.bam")
>>> # Something like the test files we have:
>>> inBam = data.getBam()
>>> d = AlignmentSet(inBam)
>>> # A UniqueId is generated, despite being a BAM input
>>> bool(d.uuid)
True
>>> dOldUuid = d.uuid
>>> # They can write this BAM to an XML:
>>> # e.g. d.write("alignmentset.xml")
>>> outdir = tempfile.mkdtemp(suffix="dataset-doctest")
>>> outXml = os.path.join(outdir, 'tempfile.xml')
>>> d.write(outXml)
>>> # And then recover the same XML:
>>> d = AlignmentSet(outXml)
>>> # The UniqueId will be the same
>>> d.uuid == dOldUuid
True
>>> # Inputs can be many and varied
>>> ds1 = AlignmentSet(data.getXml(7), data.getBam(1))
>>> ds1.numExternalResources
2
>>> ds1 = AlignmentSet(data.getFofn())
>>> ds1.numExternalResources
2
>>> # Constructors should be used directly
>>> SubreadSet(data.getSubreadSet(),
...            skipMissing=True)
<SubreadSet...
>>> # Even with untyped inputs
>>> AlignmentSet(data.getBam())
<AlignmentSet...
>>> # AlignmentSets can also be manipulated after opening:
>>> # Add external Resources:
>>> ds = AlignmentSet()
>>> _ = ds.externalResources.addResources(["IdontExist.bam"])
>>> ds.externalResources[-1].resourceId == "IdontExist.bam"
True
>>> # Add an index file
>>> pbiName = "IdontExist.bam.pbi"
>>> ds.externalResources[-1].addIndices([pbiName])
>>> ds.externalResources[-1].indices[0].resourceId == pbiName
True
- class pbcore.io.dataset.DataSetIO.TranscriptAlignmentSet(*files, **kwargs)¶
Bases:
AlignmentSet
Dataset type for aligned RNA transcripts. Essentially identical to AlignmentSet aside from the contents of the underlying BAM files.
- class pbcore.io.dataset.DataSetIO.TranscriptSet(*files, **kwargs)¶
Bases:
ReadSet
DataSet type for processed RNA transcripts in BAM format. These are not technically “reads”, but they share many of the same properties and are therefore handled the same way.
- pbcore.io.dataset.DataSetIO.checkAndResolve(fname, possibleRelStart=None)¶
Try to skip resolveLocation if possible
- pbcore.io.dataset.DataSetIO.divideKeys(keys, chunks)¶
Returns all of the keys in a list of lists, corresponding to evenly sized chunks of the original keys
- pbcore.io.dataset.DataSetIO.filtered(generator)¶
Wrap a generator with postfiltering
- pbcore.io.dataset.DataSetIO.isDataSet(xmlfile)¶
Determine if a file is a DataSet before opening it
- pbcore.io.dataset.DataSetIO.openDataFile(*files, **kwargs)¶
Factory function for DataSet types determined by the first data file
- pbcore.io.dataset.DataSetIO.openDataSet(*files, **kwargs)¶
Return a DataSet, based on the named “files”. If any files contain other files, and if those others cannot be found, then we try to resolve symlinks, but only for .xml and .fofn files themselves.
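These factories are convenient when the dataset type is not known ahead of time. A hedged sketch (the file names are hypothetical):

from pbcore.io.dataset.DataSetIO import openDataFile, openDataSet

ds = openDataSet("movie.subreadset.xml")  # concrete type chosen from the XML's MetaType
ds2 = openDataFile("aligned.bam")         # concrete type inferred from the first data file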
- pbcore.io.dataset.DataSetIO.splitKeys(keys, chunks)¶
Returns key pairs for each chunk defining the bounds of each chunk
The operations possible between DataSets of the same and different types are limited, see the DataSet XML documentation for details.
DataSet XML files have a few major components: XML metadata, ExternalReferences, Filters, DataSet Metadata, etc. These are represented in different ways internally, depending on their complexity. DataSet metadata especially contains a large number of different potential elements, many of which are accessible in the API as nested attributes. To preserve the API’s ability to grant access to any DataSet Metadata available now and in the future, as well as to maintain the performance of dataset reading and writing, each DataSet stores its metadata in what approximates a tree structure, with various helper classes and functions manipulating this tree. The structure of this tree and currently implemented helper classes are available in the DataSetMembers module.
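For example, most of this metadata tree is reachable through plain attribute access (a sketch using the bundled test data, as in the doctests elsewhere in this module):

import pbcore.data.datasets as data
from pbcore.io import SubreadSet

ds = SubreadSet(data.getSubreadSet(), skipMissing=True)
print(len(ds.metadata.collections))    # 1 in the bundled test data
print(ds.metadata.bioSamples[0].name)  # nested attribute access into the metadata tree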
Inheritance diagram (rendered as a figure in the HTML docs): nearly all classes in DataSetMembers derive from RecordWrapper. DataSetMetadata (itself a RecordWrapper) is the base for BarcodeSetMetadata, ContigSetMetadata and ReadSetMetadata, and ReadSetMetadata is the base for SubreadSetMetadata; BindingKit, CellPac, SequencingKitPlate and TemplatePrepKit derive from Kit; BinBoundaryMismatchError, BinNumberMismatchError and BinWidthMismatchError derive from BinMismatchError.
DataSetMetadata (also the tag of the Element in the DataSet XML representation) is somewhat challenging to store, access, and (de)serialize efficiently. Here, we maintain a bulk representation of all of the dataset metadata (or any other XML data, like ExternalResources) found in the XML file in the following data structure:
Child elements are represented similarly and stored (recursively) as a list in ‘children’. The top level we store for DataSetMetadata is just a list, which can be thought of as the list of children of a different element (say, a DataSet or SubreadSet element, if we stored that):
DataSetMetadata = [XmlTag, XmlTagWithSameOrDifferentTag]
We keep this for three reasons:
- We don’t want to have to write a lot of logic to go from XML to an internal representation and then back to XML.
- We want to be able to store and at least write metadata that doesn’t yet exist, even if we can’t merge it intelligently.
- Keeping and manipulating a dictionary is ~10x faster than an OrderedAttrDict, and probably faster to use than a full stack of objects.
Instead, we keep and modify this list/dictionary structure, wrapping it in classes as necessary. The classes and methods that wrap this data structure serve two purposes:
- Provide an interface for our code (and make merging clean), e.g.:
DataSet("test.xml").metadata.numRecords += 1
- Provide an interface for users of the DataSet API, e.g.:
numRecords = DataSet("test.xml").metadata.numRecords
bioSamplePointer = (DataSet("test.xml")
                    .metadata.collections[0]
                    .wellSample.bioSamplePointers[0])
- Though users can still access novel metadata types the hard way, e.g.:
bioSamplePointer = (DataSet("test.xml")
                    .metadata.collections[0]
                    ['WellSample']['BioSamplePointers']
                    ['BioSamplePointer'].record['text'])
- Notes:
If you want temporary children to be retained for a class’s children, pass parent=self to the child’s constructor.
It helps to add a TAG member…
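As an illustration only (MyWrapper is not an existing class), a minimal wrapper following these notes might look like:

from pbcore.io.dataset.DataSetMembers import RecordWrapper

class MyWrapper(RecordWrapper):   # hypothetical subclass for illustration
    TAG = 'MyWrapper'             # a TAG member, as suggested above
    NS = 'pbds'                   # namespace prefix, mirroring the classes below

    @property
    def comments(self):
        # use the inherited generic accessor to read a child element's text
        return self.getMemberV('Comments')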
- class pbcore.io.dataset.DataSetMembers.Automation(record=None, parent=None)¶
Bases:
RecordWrapper
- NS = 'pbmeta'¶
- property automationParameters¶
- class pbcore.io.dataset.DataSetMembers.AutomationParameter(record=None)¶
Bases:
RecordWrapper
- NS = 'pbbase'¶
- property value¶
- class pbcore.io.dataset.DataSetMembers.AutomationParameters(record=None)¶
Bases:
RecordWrapper
- NS = 'pbbase'¶
- __getitem__(tag)¶
Override to use tag as Name instead of strictly tag
- addParameter(key, value)¶
- property automationParameter¶
- property parameterNames¶
- class pbcore.io.dataset.DataSetMembers.BarcodeSetMetadata(record=None)¶
Bases:
DataSetMetadata
The DataSetMetadata subtype specific to BarcodeSets.
- TAG = 'DataSetMetadata'¶
- property barcodeConstruction¶
- exception pbcore.io.dataset.DataSetMembers.BinBoundaryMismatchError(min1, min2)¶
Bases:
BinMismatchError
- exception pbcore.io.dataset.DataSetMembers.BinNumberMismatchError(num1, num2)¶
Bases:
BinMismatchError
- exception pbcore.io.dataset.DataSetMembers.BinWidthMismatchError(width1, width2)¶
Bases:
BinMismatchError
- class pbcore.io.dataset.DataSetMembers.BioSampleMetadata(record=None, parent=None)¶
Bases:
RecordWrapper
The metadata for a single BioSample
- property DNABarcodes¶
- NS = 'pbsample'¶
- TAG = 'BioSample'¶
- class pbcore.io.dataset.DataSetMembers.BioSamplesMetadata(record=None, parent=None)¶
Bases:
RecordWrapper
The metadata for the list of BioSamples
- Doctest:
>>> from pbcore.io import SubreadSet
>>> import pbcore.data.datasets as data
>>> ds = SubreadSet(data.getSubreadSet(), skipMissing=True)
>>> ds.metadata.bioSamples[0].name
'consectetur purus'
>>> for bs in ds.metadata.bioSamples:
...     print(bs.name)
consectetur purus
>>> em = {'tag':'BioSample', 'text':'', 'children':[],
...       'attrib':{'Name':'great biosample'}}
>>> ds.metadata.bioSamples.append(em)
>>> ds.metadata.bioSamples[1].name
'great biosample'
- NS = 'pbsample'¶
- TAG = 'BioSamples'¶
- __getitem__(index)¶
Get a biosample
- __iter__()¶
Iterate over biosamples
- addSample(name)¶
- merge(other)¶
- class pbcore.io.dataset.DataSetMembers.CollectionMetadata(record=None, parent=None)¶
Bases:
RecordWrapper
The metadata for a single collection. It contains Context, InstrumentName, etc. as attributes, and InstCtrlVer, etc. as children.
- NS = 'pbmeta'¶
- TAG = 'CollectionMetadata'¶
- property automation¶
- property bindingKit¶
- property cellIndex¶
- property cellPac¶
- property collectionNumber¶
- property consensusReadSetRef¶
- property context¶
- property instCtrlVer¶
- property instrumentId¶
- property instrumentName¶
- property primary¶
- property runDetails¶
- property secondary¶
- property sequencingKitPlate¶
- property sigProcVer¶
- property templatePrepKit¶
- property wellSample¶
- class pbcore.io.dataset.DataSetMembers.CollectionsMetadata(record=None, parent=None)¶
Bases:
RecordWrapper
The Element should just have children: a list of CollectionMetadataTags
- NS = 'pbmeta'¶
- TAG = 'Collections'¶
- __getitem__(index)¶
Try to get a specific child (only useful in simple cases where children will not be wrapped in a special wrapper object; returns the first instance of ‘tag’)
- __iter__()¶
Get each child iteratively (only useful in simple cases where children will not be wrapped in a special wrapper object)
- merge(other, forceUnique=False)¶
- class pbcore.io.dataset.DataSetMembers.ConsensusReadSetRef(record=None, parent=None)¶
Bases:
RecordWrapper
- property uuid¶
- class pbcore.io.dataset.DataSetMembers.ContigSetMetadata(record=None)¶
Bases:
DataSetMetadata
The DataSetMetadata subtype specific to ContigSets.
- TAG = 'DataSetMetadata'¶
- property organism¶
- property ploidy¶
- class pbcore.io.dataset.DataSetMembers.ContinuousDistribution(record=None, parent=None)¶
Bases:
RecordWrapper
- property binWidth¶
- property bins¶
- property description¶
- property labels¶
Label the bins with the min value of each bin
- property maxBinValue¶
- property maxOutlierValue¶
- merge(other)¶
- property minBinValue¶
- property minOutlierValue¶
- property numBins¶
- property sample95thPct¶
- property sampleMean¶
- property sampleMed¶
- property sampleMedian¶
- property sampleMode¶
- property sampleN50¶
- property sampleSize¶
- property sampleStd¶
- class pbcore.io.dataset.DataSetMembers.CopyFilesMetadata(record=None, parent=None)¶
Bases:
RecordWrapper
The CopyFile members don’t seem complex enough to justify class representation; instead, rely on the base class methods
- TAG = 'CopyFiles'¶
- class pbcore.io.dataset.DataSetMembers.DNABarcode(record=None, parent=None)¶
Bases:
RecordWrapper
- NS = 'pbsample'¶
- TAG = 'DNABarcode'¶
- class pbcore.io.dataset.DataSetMembers.DNABarcodes(record=None, parent=None)¶
Bases:
RecordWrapper
- NS = 'pbsample'¶
- TAG = 'DNABarcodes'¶
- __getitem__(index)¶
Get a DNABarcode
- __iter__()¶
Iterate over DNABarcodes
- addBarcode(name)¶
- class pbcore.io.dataset.DataSetMembers.DataSetMetadata(record=None)¶
Bases:
RecordWrapper
The root of the DataSetMetadata element tree, used as base for subtype specific DataSet or for generic “DataSet” records.
- NS = 'pbds'¶
- TAG = 'DataSetMetadata'¶
- addParentDataSet(uniqueId, metaType, timeStampedName='', createdBy='AnalysisJob')¶
Add a ParentDataSet record in the Provenance section. Currently only used for SubreadSets.
- merge(other)¶
- property numRecords¶
Return the number of records in a DataSet using helper functions defined in the base class
- property provenance¶
- property summaryStats¶
- property totalLength¶
Return the TotalLength property of this dataset. TODO: update the value from the actual external reference on ValueError
- class pbcore.io.dataset.DataSetMembers.DiscreteDistribution(record=None, parent=None)¶
Bases:
RecordWrapper
- property bins¶
- property description¶
- property labels¶
- merge(other)¶
- property numBins¶
- class pbcore.io.dataset.DataSetMembers.ExternalResource(record=None)¶
Bases:
RecordWrapper
- NS = 'pbbase'¶
- property adapters¶
- addIndices(indices)¶
- property bai¶
- property bam¶
- property barcodes¶
- property control¶
- property externalResources¶
- property gmap¶
Unusual: returns the gmap external resource instead of the resId
- property indices¶
- merge(other)¶
- property metaType¶
- property pbi¶
- property reference¶
- property resourceId¶
- property scraps¶
- property sts¶
- property tags¶
Return the list of tags for children in this element
- property timeStampedName¶
- class pbcore.io.dataset.DataSetMembers.ExternalResources(record=None)¶
Bases:
RecordWrapper
- NS = 'pbbase'¶
- __getitem__(index)¶
Try to get a specific child (only useful in simple cases where children will not be wrapped in a special wrapper object; returns the first instance of ‘tag’)
- __iter__()¶
Get each child iteratively (only useful in simple cases where children will not be wrapped in a special wrapper object)
- addResources(resourceIds)¶
Add new external resources with the given URIs. If you’re looking to add ExternalResource objects, append() or extend() them instead.
- Args:
resourceIds: a list of uris as strings
- merge(other)¶
- property resourceIds¶
- property resources¶
- sort()¶
In theory we could sort the ExternalResource objects, but that would require opening them
- class pbcore.io.dataset.DataSetMembers.FileIndex(record=None)¶
Bases:
RecordWrapper
- KEEP_WITH_PARENT = True¶
- NS = 'pbbase'¶
- property metaType¶
- property resourceId¶
- property timeStampedName¶
- class pbcore.io.dataset.DataSetMembers.FileIndices(record=None)¶
Bases:
RecordWrapper
- NS = 'pbbase'¶
- __getitem__(index)¶
Try to get a specific child (only useful in simple cases where children will not be wrapped in a special wrapper object; returns the first instance of ‘tag’)
- __iter__()¶
Get each child iteratively (only useful in simple cases where children will not be wrapped in a special wrapper object)
- class pbcore.io.dataset.DataSetMembers.Filter(record=None)¶
Bases:
RecordWrapper
- NS = 'pbds'¶
- __getitem__(index)¶
Try to get a specific child (only useful in simple cases where children will not be wrapped in a special wrapper object; returns the first instance of ‘tag’)
- __iter__()¶
Get each child iteratively (only useful in simple cases where children will not be wrapped in a special wrapper object)
- addRequirement(name, operator, value, modulo=None)¶
- merge(other)¶
- property plist¶
- pop(index)¶
- removeRequirement(req)¶
- class pbcore.io.dataset.DataSetMembers.Filters(record=None)¶
Bases:
RecordWrapper
- NS = 'pbds'¶
- __getitem__(index)¶
Try to get a specific child (only useful in simple cases where children will not be wrapped in a special wrapper object; returns the first instance of ‘tag’)
- __iter__()¶
Get each child iteratively (only useful in simple cases where children will not be wrapped in a special wrapper object)
- addFilter(**kwargs)¶
Use this to add filters. Members of the list will be considered requirements for fulfilling this option. Use multiple calls to add multiple filters.
- Args:
name: The name of the requirement, e.g. ‘rq’
options: A list of (operator, value) tuples, e.g. (‘>’, ‘0.85’)
- addFilterList(filters)¶
filters is a list of options, with a list of reqs for each option. Each req is a tuple (name, oper, val)
- addRequirement(**kwargs)¶
Use this to add requirements. Members of the list will be considered options for fulfilling this requirement, all other filters will be duplicated for each option. Use multiple calls to add multiple requirements to the existing filters. Use removeRequirement first to not add conflicting filters.
- Args:
name: The name of the requirement, e.g. ‘rq’
options: A list of (operator, value) tuples, e.g. (‘>’, ‘0.85’)
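A short sketch of adding a read-quality requirement to an open dataset (the bundled test BAM as in the doctests above; the threshold is arbitrary):

import pbcore.data.datasets as data
from pbcore.io import AlignmentSet

ds = AlignmentSet(data.getBam())
ds.filters.addRequirement(rq=[('>', '0.85')])  # keep records with rq > 0.85
print(ds.filters)                              # the filters now include the new requirement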
- broadcastFilters(filts)¶
filts is a list of Filter objects or lists of requirements. Take all existing filters, then duplicate and combine them with each new filter
- filterIndexRecords(indexRecords, nameMap, movieMap, readType='bam')¶
- fromString(filterString)¶
- mapRequirement(**kwargs)¶
Add requirements to each of the existing requirements, mapped one to one
- merge(other)¶
- removeFilter(index)¶
- removeRequirement(req)¶
- testCompatibility(other)¶
- testField(param, values, testType=<class 'str'>, oper='=')¶
- testParam(param, value, testType=<class 'str'>, oper='=')¶
- tests(readType='bam', tIdMap=None)¶
- class pbcore.io.dataset.DataSetMembers.Kit(record=None, parent=None)¶
Bases:
RecordWrapper
- property barcode¶
- property expirationDate¶
- property lotNumber¶
- property partNumber¶
- class pbcore.io.dataset.DataSetMembers.OutputOptions(record=None, parent=None)¶
Bases:
RecordWrapper
- NS = 'pbmeta'¶
- property collectionPathUri¶
- property copyFiles¶
- property resultsFolder¶
- class pbcore.io.dataset.DataSetMembers.ParentDataSet(record=None, parent=None)¶
Bases:
RecordWrapper
- NS = 'pbds'¶
- property metaType¶
- property timeStampedName¶
- class pbcore.io.dataset.DataSetMembers.ParentTool(record=None, parent=None)¶
Bases:
RecordWrapper
- NS = 'pbds'¶
- class pbcore.io.dataset.DataSetMembers.PbiFlags¶
Bases:
object
- ADAPTER_AFTER = 2¶
- ADAPTER_BEFORE = 1¶
- BARCODE_AFTER = 8¶
- BARCODE_BEFORE = 4¶
- FORWARD_PASS = 16¶
- NO_LOCAL_CONTEXT = 0¶
- REVERSE_PASS = 32¶
- classmethod flagMap(flag)¶
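A small sketch of combining the flag constants bitwise (an illustration only; the values are the class attributes listed above):

from pbcore.io.dataset.DataSetMembers import PbiFlags

# a subread with adapters observed on both sides of the insert:
both_adapters = PbiFlags.ADAPTER_BEFORE | PbiFlags.ADAPTER_AFTER
print(both_adapters)  # 3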
- class pbcore.io.dataset.DataSetMembers.PrimaryMetadata(record=None, parent=None)¶
Bases:
RecordWrapper
- Doctest:
>>> import os, tempfile
>>> from pbcore.io import SubreadSet
>>> import pbcore.data.datasets as data
>>> ds1 = SubreadSet(data.getXml(5), skipMissing=True)
>>> ds1.metadata.collections[0].primary.outputOptions.resultsFolder
'Analysis_Results'
>>> ds1.metadata.collections[0].primary.outputOptions.resultsFolder = (
...     'BetterAnalysis_Results')
>>> ds1.metadata.collections[0].primary.outputOptions.resultsFolder
'BetterAnalysis_Results'
>>> outdir = tempfile.mkdtemp(suffix="dataset-doctest")
>>> outXml = 'xml:' + os.path.join(outdir, 'tempfile.xml')
>>> ds1.write(outXml, validate=False)
>>> ds2 = SubreadSet(outXml, skipMissing=True)
>>> ds2.metadata.collections[0].primary.outputOptions.resultsFolder
'BetterAnalysis_Results'
- NS = 'pbmeta'¶
- TAG = 'Primary'¶
- property automationName¶
- property configFileName¶
- property outputOptions¶
- property sequencingCondition¶
- class pbcore.io.dataset.DataSetMembers.Properties(record=None)¶
Bases:
RecordWrapper
- NS = 'pbbase'¶
- __getitem__(index)¶
Try to get a specific child (only useful in simple cases where children will not be wrapped in a special wrapper object; returns the first instance of ‘tag’)
- __iter__()¶
Get each child iteratively (only useful in simple cases where children will not be wrapped in a special wrapper object)
- merge(other)¶
- class pbcore.io.dataset.DataSetMembers.Property(record=None)¶
Bases:
RecordWrapper
- NS = 'pbbase'¶
- property hashfunc¶
- property modulo¶
- property name¶
- property operator¶
- property value¶
- class pbcore.io.dataset.DataSetMembers.Provenance(record=None, parent=None)¶
Bases:
RecordWrapper
The metadata concerning this dataset’s provenance
- NS = 'pbds'¶
- addParentDataSet(uniqueId, metaType, timeStampedName)¶
- property createdBy¶
- property parentDataSet¶
- property parentTool¶
- class pbcore.io.dataset.DataSetMembers.ReadSetMetadata(record=None)¶
Bases:
DataSetMetadata
- property bioSamples¶
- merge(other)¶
- class pbcore.io.dataset.DataSetMembers.RecordWrapper(record=None, parent=None)¶
Bases:
object
The base functionality of a metadata element.
Many of the methods here are intended for use with children of RecordWrapper (e.g. append, extend). Methods in child classes often provide similar functionality for more raw inputs (e.g. resourceIds as strings)
- KEEP_WITH_PARENT = False¶
- NS = ''¶
- __getitem__(tag)¶
Try to get a specific child (only useful in simple cases where children will not be wrapped in a special wrapper object; returns the first instance of ‘tag’)
- __iter__()¶
Get each child iteratively (only useful in simple cases where children will not be wrapped in a special wrapper object)
- __repr__()¶
Return a pretty string representation of this object:
“<type tag text attribs children>”
- addMetadata(key, value)¶
Add a key, value pair to this metadata object (attributes)
- append(newMember)¶
Append to the actual list of child elements
- property attrib¶
- clearCallbacks()¶
- property createdAt¶
- property description¶
- extend(newMembers)¶
Extend the actual list of child elements
- findChildren(tag)¶
- getMemberV(tag, container='text', default=None, asType=<class 'str'>, attrib=None)¶
Generic accessor for the contents of the children of this element, without having to interface with them directly
- getV(container='text', tag=None)¶
Generic accessor for the contents of this element’s ‘attrib’ or ‘text’ fields
- index(tag)¶
Return the index in ‘children’ list of item with ‘tag’ member
- merge(other)¶
- property metadata¶
Cleaner accessor for this node’s attributes. Returns mutable, doesn’t need setter
- property metaname¶
Cleaner accessor for this node’s tag
- property metavalue¶
Cleaner accessor for this node’s text
- property name¶
- property namespace¶
- pop(index)¶
- pruneChildrenTo(whitelist)¶
- registerCallback(func)¶
- removeChildren(tag)¶
- setMemberV(tag, value, container='text', attrib=None)¶
Generic accessor for the contents of the children of this element, without having to interface with them directly
- setV(value, container='text', tag=None)¶
Generic accessor for the contents of this element’s ‘attrib’ or ‘text’ fields
- property submetadata¶
Cleaner accessor for wrapped versions of this node’s children.
- property subrecords¶
Cleaner accessor for this node’s children. Returns mutable, doesn’t need setter
- property tags¶
Return the list of tags for children in this element
- property text¶
- property uniqueId¶
- property value¶
- property version¶
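A brief sketch of the cleaner accessors on a wrapped node (using a BioSample element from the bundled test data, as in the BioSamplesMetadata doctest above; the available attributes depend on the XML):

import pbcore.data.datasets as data
from pbcore.io import SubreadSet

ds = SubreadSet(data.getSubreadSet(), skipMissing=True)
bs = ds.metadata.bioSamples[0]   # any RecordWrapper-derived node works the same way
print(bs.metaname)               # the underlying element tag
print(bs.metadata)               # its attribute dictionary (includes 'Name' here)
print(bs.tags)                   # the tags of its child elements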
- class pbcore.io.dataset.DataSetMembers.RunDetailsMetadata(record=None, parent=None)¶
Bases:
RecordWrapper
- NS = 'pbmeta'¶
- TAG = 'RunDetails'¶
- property name¶
- property timeStampedName¶
- class pbcore.io.dataset.DataSetMembers.SecondaryMetadata(record=None, parent=None)¶
Bases:
RecordWrapper
- TAG = 'Secondary'¶
- property cellCountInJob¶
- class pbcore.io.dataset.DataSetMembers.StatsMetadata(record=None, parent=None)¶
Bases:
RecordWrapper
The metadata from the machine sts.xml
- CHANNEL_DISTS = ['BaselineLevelDist', 'BaselineStdDist', 'SnrDist', 'HqRegionSnrDist', 'HqBasPkMidDist', 'BaselineLevelSequencingDist', 'TotalBaseFractionPerChannel', 'DmeAngleEstDist']¶
- MERGED_DISTS = ['ProdDist', 'ReadTypeDist', 'ReadLenDist', 'ReadQualDist', 'MedianInsertDist', 'InsertReadQualDist', 'InsertReadLenDist', 'ControlReadQualDist', 'ControlReadLenDist']¶
- OTHER_DISTS = ['PausinessDist', 'PulseRateDist', 'PulseWidthDist', 'BaseRateDist', 'BaseWidthDist', 'BaseIpdDist', 'LocalBaseRateDist', 'NumUnfilteredBasecallsDist', 'HqBaseFractionDist', 'NumUnfilteredBasecallsDist']¶
- UNMERGED_DISTS = ['BaselineLevelDist', 'BaselineStdDist', 'SnrDist', 'HqRegionSnrDist', 'HqBasPkMidDist', 'BaselineLevelSequencingDist', 'TotalBaseFractionPerChannel', 'DmeAngleEstDist', 'PausinessDist', 'PulseRateDist', 'PulseWidthDist', 'BaseRateDist', 'BaseWidthDist', 'BaseIpdDist', 'LocalBaseRateDist', 'NumUnfilteredBasecallsDist', 'HqBaseFractionDist', 'NumUnfilteredBasecallsDist']¶
- __getitem__(key)¶
Try to get a specific child (only useful in simple cases where children will not be wrapped in a special wrapper object; returns the first instance of ‘tag’)
- property adapterDimerFraction¶
- availableDists()¶
- property channelDists¶
This can be modified to use the new accessors above instead of the brittle list of channel dists above
- property controlReadLenDist¶
- property controlReadQualDist¶
- getDist(key, unwrap=True)¶
- property insertReadLenDist¶
- property insertReadLenDists¶
- property insertReadQualDist¶
- property insertReadQualDists¶
- property medianInsertDist¶
- property medianInsertDists¶
- merge(other)¶
This can be modified to use the new accessors above instead of the brittle list of dists above
- property numSequencingZmws¶
- property otherDists¶
This can be modified to use the new accessors above instead of the brittle list of dists above
- property prodDist¶
- property readLenDist¶
- property readLenDists¶
- property readQualDist¶
- property readQualDists¶
- property readTypeDist¶
- property shortInsertFraction¶
- class pbcore.io.dataset.DataSetMembers.SubreadSetMetadata(record=None)¶
Bases:
ReadSetMetadata
The DataSetMetadata subtype specific to SubreadSets. Deals explicitly with the merging of Collections metadata hierarchies.
- TAG = 'DataSetMetadata'¶
- property collections¶
Return a list of wrappers around Collections elements of the Metadata Record
- merge(other)¶
- class pbcore.io.dataset.DataSetMembers.TemplatePrepKit(record=None, parent=None)¶
Bases:
Kit
TemplatePrepKit metadata
- property leftAdaptorSequence¶
- property rightAdaptorSequence¶
- class pbcore.io.dataset.DataSetMembers.WellSampleMetadata(record=None, parent=None)¶
Bases:
RecordWrapper
- NS = 'pbmeta'¶
- TAG = 'WellSample'¶
- property bioSamples¶
- property comments¶
- property concentration¶
- property sampleReuseEnabled¶
- property sizeSelectionEnabled¶
- property stageHotstartEnabled¶
- property useCount¶
- property wellName¶
- pbcore.io.dataset.DataSetMembers.accs(key, container='attrib', asType=<function <lambda>>, parent=False)¶
- pbcore.io.dataset.DataSetMembers.breakqname(qname)¶
- pbcore.io.dataset.DataSetMembers.filter_read(accessor, operator, value, read)¶
- pbcore.io.dataset.DataSetMembers.fromFile(value)¶
- pbcore.io.dataset.DataSetMembers.getter(key, container='attrib', asType=<function <lambda>>, parent=False)¶
- pbcore.io.dataset.DataSetMembers.histogram_percentile(counts, labels, percentile)¶
- pbcore.io.dataset.DataSetMembers.inOp(ar1, ar2)¶
- pbcore.io.dataset.DataSetMembers.isFile(string)¶
- pbcore.io.dataset.DataSetMembers.isListString(string)¶
Detect if a string is actually a representation of a stringified list
- pbcore.io.dataset.DataSetMembers.make_mod_hash_acc(accessor, mod, hashname)¶
- pbcore.io.dataset.DataSetMembers.mapOp(op)¶
- pbcore.io.dataset.DataSetMembers.map_val_or_vec(func, target)¶
- pbcore.io.dataset.DataSetMembers.n_subreads(index)¶
- pbcore.io.dataset.DataSetMembers.newUuid(record)¶
- pbcore.io.dataset.DataSetMembers.qname2vec(qnames, movie_map)¶
Break a list of qname strings into a list of qname field tuples
- pbcore.io.dataset.DataSetMembers.qnamer(qid2mov, qId, hn, qs, qe)¶
- pbcore.io.dataset.DataSetMembers.qnames2recarrays_by_size(qnames, movie_map, dtype)¶
Note that qname filters can be specified as partial qnames. Therefore we return a recarray for each size in qnames, in a dictionary
- pbcore.io.dataset.DataSetMembers.reccheck(records, qname_tables)¶
Create a mask for those records present in qname_tables. qname_tables is a dict of {numfields: recarray}, where each recarray contains records that specify a means of passing the filter, e.g. a qname in a whitelist. Qname filters can be specified as partial qnames, e.g. just a context id and hole number (you cannot skip fields, e.g. specify just a hole number). We also want to allow people to mix partially and fully specified qnames in the whitelist. Therefore we keep a separate table for each length, so we can use np.in1d, which operates on recarrays quite nicely.
- pbcore.io.dataset.DataSetMembers.recordMembership(records, constraints)¶
- pbcore.io.dataset.DataSetMembers.runonce(func)¶
- pbcore.io.dataset.DataSetMembers.setify(value)¶
- pbcore.io.dataset.DataSetMembers.setter(key, container='attrib')¶
- pbcore.io.dataset.DataSetMembers.str2list(value)¶
- pbcore.io.dataset.DataSetMembers.subaccs(key, container='text', default=None, asType=<function <lambda>>, attrib=None)¶
- pbcore.io.dataset.DataSetMembers.subgetter(key, container='text', default=None, asType=<function <lambda>>, attrib=None)¶
- pbcore.io.dataset.DataSetMembers.subsetter(key, container='text', attrib=None)¶
- pbcore.io.dataset.DataSetMembers.updateNamespace(ele, ns)¶
- pbcore.io.dataset.DataSetMembers.updateTag(ele, tag)¶
- pbcore.io.dataset.DataSetMembers.uri2fn(fn)¶
- pbcore.io.dataset.DataSetMembers.uri2scheme(fn)¶