casacore
StatisticsAlgorithm.h
1//# Copyright (C) 2000,2001
2//# Associated Universities, Inc. Washington DC, USA.
3//#
4//# This library is free software; you can redistribute it and/or modify it
5//# under the terms of the GNU Library General Public License as published by
6//# the Free Software Foundation; either version 2 of the License, or (at your
7//# option) any later version.
8//#
9//# This library is distributed in the hope that it will be useful, but WITHOUT
10//# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
11//# FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public
12//# License for more details.
13//#
14//# You should have received a copy of the GNU Library General Public License
15//# along with this library; if not, write to the Free Software Foundation,
16//# Inc., 675 Massachusetts Ave, Cambridge, MA 02139, USA.
17//#
18//# Correspondence concerning AIPS++ should be addressed as follows:
19//# Internet email: casa-feedback@nrao.edu.
20//# Postal address: AIPS++ Project Office
21//# National Radio Astronomy Observatory
22//# 520 Edgemont Road
23//# Charlottesville, VA 22903-2475 USA
24//#
25
26#ifndef SCIMATH_STATISTICSALGORITHM_H
27#define SCIMATH_STATISTICSALGORITHM_H
28
29#include <casacore/casa/aips.h>
30#include <casacore/casa/Exceptions/Error.h>
31#include <casacore/scimath/StatsFramework/StatsDataProvider.h>
32#include <casacore/scimath/StatsFramework/StatisticsData.h>
33#include <casacore/scimath/StatsFramework/StatisticsDataset.h>
34#include <casacore/scimath/StatsFramework/StatisticsTypes.h>
35
36#include <map>
37#include <set>
38#include <vector>
39#include <memory>
40
41namespace casacore {
42
43// Base class of statistics algorithm class hierarchy.
44
45// The default implementation is such that statistics are only calculated when
46// methods that actually compute statistics are called. Until then, the
47// iterators which point to the beginning of data sets, masks, etc. are held in
48// memory. Thus, the caller must keep all data sets available for the statistics
49// object until these methods are called, and of course, if the actual data
50// values are changed between adding data and calculating statistics, the
51// updated values are used when calculating statistics. Derived classes may
52// override this behavior.
53//
54// PRECISION CONSIDERATIONS
55// Many statistics are computed via accumulators. This can lead to precision
56// issues, especially for large datasets. For this reason, it is highly
57// recommended that the data type one uses as the AccumType be of higher
58// precision, if possible, than the data type pointed to by the input iterator.
59// For example, if one has a data set of Float values (to which the
60// DataIterator type points), then one should use type Double for the
61// AccumType. In this case, the Float data values will be converted to Doubles
62// before they are accumulated.
63//
64// METHODS OF PROVIDING DATA
65// Data may be provided in one of two mutually exclusive ways. The first way is
66// simpler, and that is to use the setData()/addData() methods. Calling
67// setData() will clear any previous data that was added via these methods or
68// via a data provider (see below). Calling addData() after having called
69// setData() will add a data set to the set of data sets on which statistics
70// will be calculated. In order for this to work correctly, the iterators which
71// are passed into these methods must still be valid when statistics are
72// calculated (although note that some derived classes allow certain statistics
73// to be updated as data sets are added via these methods. See specific classes
74// for details).
75//
76// The second way to provide data is via an object derived from class
77// StatsDataProvider, in which methods are implemented for retrieving various
78// information about the data sets to be included. Such an interface is
79// necessary for data structures which do not easily lend themselves to being
80// provided via the setData()/addData() methods. For example, in the case of
81// iterating through a Lattice, a lattice iterator will overwrite the memory
82// location of the previous chunk of data with the current chunk of data.
83// Therefore, if one does not wish to load data from the entire lattice into
84// memory (which is why LatticeIterator was designed to have the behavior it
85// does), one must use the LatticeStatsDataProvider class, which the statistics
86// framework will use to iterate through the lattice, only keeping one chunk of
87// the data of the lattice in memory at any given moment.
88//
89// STORAGE OF DATA
90// In order to reduce maintenance costs, the accounting details of the data sets
91// are maintained in a StatisticsDataset object. This object is held in memory
92// at the StatisticsAlgorithm level in the _dataset private field of this class
93// when a derived class is instantiated. A StatisticsDataset object should never
94// need to be explicitly instantiated by an API developer.
95//
96// QUANTILES
97// A quantile is a value contained in a data set such that it has a zero-based
98// index of ceil(q*n)-1 in the equivalent ordered dataset, where 0 < q < 1
99// specifies the fractional location within the ordered dataset and n is the
100// total number of valid elements. Note that, for a dataset with an odd number
101// of elements, the median is the same as the quantile value when q = 0.5.
102// However, there is no such correspondence for a dataset with
103// an even number of elements, since the median in that case is given by the
104// mean of the elements of zero-based indices n/2-1 and n/2 in the equivalent
105// ordered dataset. Thus, in the case of a dataset with an even number of
106// values, the median may not even exist in the dataset, while a generic
107// quantile value must exist in the dataset by definition. Note that when
108// calculating quantile values, a datum that does not fall in the specified data
109// ranges, is not included via a stride specification, is masked, or has a weight
110// of zero is not considered a member of the dataset for the purposes of quantile
111// calculations.
112//
113// CLASS ORGANIZATION
114// In general, in the StatsFramework class hierarchy, classes derived from
115// StatisticsAlgorithm and its descendants contain methods which calculate the
116// relevant statistics which are computed via accumulation. These classes also
117// contain the top level methods for computing the quantile-like statistics, for
118// the convenience of the API developer. Derived classes of StatisticsAlgorithm
119// normally will have a private field which is an object that contains methods
120// which compute the various quantile-like statistics. These so-called
121// QuantileComputer classes have been created to reduce maintenance costs,
122// because putting all the code into single class files was becoming unwieldy.
123// The concrete QuantileComputer classes are ultimately derived from
124// StatisticsAlgorithmQuantileComputer, which is the abstract base class of this
125// hierarchy. StatisticsAlgorithm objects do not contain a
126// StatisticsAlgorithmQuantileComputer private field, since StatisticsAlgorithm
127// is also an abstract base class and hence no actual statistics are computed
128// within it. The design is such that the only classes an API developer should
129// ever instantiate are the derived classes of StatisticsAlgorithm; the
130// QuantileComputer classes should never be explicitly instantiated in code
131// which uses the StatsFramework API.
132
133template <
134 class AccumType, class DataIterator, class MaskIterator=const Bool *,
135 class WeightsIterator=DataIterator
136>
137class StatisticsAlgorithm {
138
139public:
140
142
143 // Clone this instance
144 virtual StatisticsAlgorithm<CASA_STATP>* clone() const = 0;
145
146 // <group>
147 // Add a dataset to an existing set of datasets on which statistics are to
148 // be calculated. nr is the number of points to be considered. If
149 // <src>dataStride</src> is greater than 1,
150 // <src>nrAccountsForStride</src>=True indicates that the stride has been
151 // taken into account in the value of <src>nr</src>. Otherwise, it has not,
152 // so that the actual number of points to include is nr/dataStride if
153 // nr % dataStride == 0 or (int)(nr/dataStride) + 1 otherwise. If one calls
154 // this method after a data provider has been set, an exception will be
155 // thrown. In this case, one should call setData(), rather than addData(),
156 // to indicate that the underlying data provider should be removed.
157 // <src>dataRanges</src> provide the ranges of data to include if
158 // <src>isInclude</src> is True, or ranges of data to exclude if
159 // <src>isInclude</src> is False. If a datum equals the end point of a data
160 // range, it is considered good (included) if <src>isInclude</src> is True,
161 // and it is considered bad (excluded) if <src>isInclude</src> is False.
162
163 void addData(
164 const DataIterator& first, uInt nr, uInt dataStride=1,
165 Bool nrAccountsForStride=False
166 );
167
168 void addData(
169 const DataIterator& first, uInt nr,
170 const DataRanges& dataRanges, Bool isInclude=True, uInt dataStride=1,
171 Bool nrAccountsForStride=False
172 );
173
174 void addData(
175 const DataIterator& first, const MaskIterator& maskFirst,
176 uInt nr, uInt dataStride=1, Bool nrAccountsForStride=False,
177 uInt maskStride=1
178 );
179
180 void addData(
181 const DataIterator& first, const MaskIterator& maskFirst,
182 uInt nr, const DataRanges& dataRanges, Bool isInclude=True,
183 uInt dataStride=1, Bool nrAccountsForStride=False, uInt maskStride=1
184 );
185
186 void addData(
187 const DataIterator& first, const WeightsIterator& weightFirst,
188 uInt nr, uInt dataStride=1, Bool nrAccountsForStride=False
189 );
190
191 void addData(
192 const DataIterator& first, const WeightsIterator& weightFirst,
193 uInt nr, const DataRanges& dataRanges, Bool isInclude=True,
194 uInt dataStride=1, Bool nrAccountsForStride=False
195 );
196
197 void addData(
198 const DataIterator& first, const WeightsIterator& weightFirst,
199 const MaskIterator& maskFirst, uInt nr, uInt dataStride=1,
200 Bool nrAccountsForStride=False, uInt maskStride=1
201 );
202
203 void addData(
204 const DataIterator& first, const WeightsIterator& weightFirst,
205 const MaskIterator& maskFirst, uInt nr, const DataRanges& dataRanges,
206 Bool isInclude=True, uInt dataStride=1, Bool nrAccountsForStride=False,
207 uInt maskStride=1
208 );
209 // </group>
210
211 // get the algorithm that this object uses for computing stats
212 virtual StatisticsData::ALGORITHM algorithm() const = 0;
213
214 virtual AccumType getMedian(
215 std::shared_ptr<uInt64> knownNpts=nullptr,
216 std::shared_ptr<AccumType> knownMin=nullptr,
217 std::shared_ptr<AccumType> knownMax=nullptr,
218 uInt binningThreshholdSizeBytes=4096*4096,
219 Bool persistSortedArray=False, uInt nBins=10000
220 ) = 0;
221
222 // The return value is the median; the quantiles are returned in the
223 // <src>quantileToValue</src> map.
224 virtual AccumType getMedianAndQuantiles(
225 std::map<Double, AccumType>& quantileToValue,
226 const std::set<Double>& quantiles,
227 std::shared_ptr<uInt64> knownNpts=nullptr,
228 std::shared_ptr<AccumType> knownMin=nullptr,
229 std::shared_ptr<AccumType> knownMax=nullptr,
230 uInt binningThreshholdSizeBytes=4096*4096,
231 Bool persistSortedArray=False, uInt nBins=10000
232 ) = 0;
233
234 // get the median of the absolute deviation about the median of the data.
235 virtual AccumType getMedianAbsDevMed(
236 std::shared_ptr<uInt64> knownNpts=nullptr,
237 std::shared_ptr<AccumType> knownMin=nullptr,
238 std::shared_ptr<AccumType> knownMax=nullptr,
239 uInt binningThreshholdSizeBytes=4096*4096,
240 Bool persistSortedArray=False, uInt nBins=10000
241 ) = 0;
242
243 // Purposefully not virtual. Derived classes should not implement.
244 AccumType getQuantile(
245 Double quantile, std::shared_ptr<uInt64> knownNpts=nullptr,
246 std::shared_ptr<AccumType> knownMin=nullptr,
247 std::shared_ptr<AccumType> knownMax=nullptr,
248 uInt binningThreshholdSizeBytes=4096*4096,
249 Bool persistSortedArray=False, uInt nBins=10000
250 );
251
252 // get a map of quantiles to values.
253 virtual std::map<Double, AccumType> getQuantiles(
254 const std::set<Double>& quantiles, std::shared_ptr<uInt64> npts=nullptr,
255 std::shared_ptr<AccumType> min=nullptr, std::shared_ptr<AccumType> max=nullptr,
256 uInt binningThreshholdSizeBytes=4096*4096,
257 Bool persistSortedArray=False, uInt nBins=10000
258 ) = 0;
259
260 // get the value of the specified statistic. Purposefully not virtual.
261 // Derived classes should not implement.
262 AccumType getStatistic(StatisticsData::STATS stat);
263
264 // certain statistics such as max and min have locations in the dataset
265 // associated with them. This method gets those locations. The first value
266 // in the returned pair is the zero-based dataset number that was set or
267 // added. The second value is the zero-based index in that dataset. A data
268 // stride of greater than one is not accounted for, so the index represents
269 // the actual location in the data set, independent of the dataStride value.
270 virtual LocationType getStatisticIndex(StatisticsData::STATS stat) = 0;
271
272 // Return statistics. Purposefully not virtual. Derived classes should not
273 // implement.
274 StatsData<AccumType> getStatistics();
275
276 // reset this object by clearing data.
277 virtual void reset();
278
279 // <group>
280 // setData() clears any current datasets or data provider and then adds the
281 // specified data set as the first dataset in the (possibly new) set of data
282 // sets for which statistics are to be calculated. See addData() for
283 // parameter meanings. These methods are purposefully not virtual. Derived
284 // classes should not implement.
285 void setData(
286 const DataIterator& first, uInt nr, uInt dataStride=1,
287 Bool nrAccountsForStride=False
288 );
289
290 void setData(
291 const DataIterator& first, uInt nr, const DataRanges& dataRanges,
292 Bool isInclude=True, uInt dataStride=1, Bool nrAccountsForStride=False
293 );
294
295 void setData(
296 const DataIterator& first, const MaskIterator& maskFirst, uInt nr,
297 uInt dataStride=1, Bool nrAccountsForStride=False, uInt maskStride=1
298 );
299
300 void setData(
301 const DataIterator& first, const MaskIterator& maskFirst,
302 uInt nr, const DataRanges& dataRanges, Bool isInclude=True,
303 uInt dataStride=1, Bool nrAccountsForStride=False, uInt maskStride=1
304 );
305
306 void setData(
307 const DataIterator& first, const WeightsIterator& weightFirst, uInt nr,
308 uInt dataStride=1, Bool nrAccountsForStride=False
309 );
310
311 void setData(
312 const DataIterator& first, const WeightsIterator& weightFirst, uInt nr,
313 const DataRanges& dataRanges, Bool isInclude=True, uInt dataStride=1,
314 Bool nrAccountsForStride=False
315 );
316
317 void setData(
318 const DataIterator& first, const WeightsIterator& weightFirst,
319 const MaskIterator& maskFirst, uInt nr, uInt dataStride=1,
320 Bool nrAccountsForStride=False, uInt maskStride=1
321 );
322
323 void setData(
324 const DataIterator& first, const WeightsIterator& weightFirst,
325 const MaskIterator& maskFirst, uInt nr, const DataRanges& dataRanges,
326 Bool isInclude=True, uInt dataStride=1, Bool nrAccountsForStride=False,
327 uInt maskStride=1
328 );
329 // </group>
330
331 // instead of setting and adding data "by hand", set the data provider
332 // that will provide all the data sets. Calling this method will clear
333 // any other data sets that have previously been set or added. Method
334 // is virtual to allow derived classes to carry out any necessary
335 // specialized accounting when resetting the data provider.
336 virtual void setDataProvider(StatsDataProvider<CASA_STATP> *dataProvider);
337
338 // Provide guidance to algorithms by specifying a priori which statistics
339 // the caller would like calculated.
340 virtual void setStatsToCalculate(std::set<StatisticsData::STATS>& stats);
341
342protected:
344
345 // use copy semantics, except for the data provider which uses reference
346 // semantics
347 StatisticsAlgorithm(const StatisticsAlgorithm& other);
348
349 // use copy semantics, except for the data provider which uses reference
350 // semantics
351 StatisticsAlgorithm& operator=(const StatisticsAlgorithm& other);
352
353 // Allows derived classes to do things after data is set or added.
354 // Default implementation does nothing.
355 virtual void _addData() {}
356
357 // <group>
358 // These methods are purposefully not virtual. Derived classes should
359 // not implement.
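360 const StatisticsDataset<CASA_STATP>& _getDataset() const {
361 return _dataset;
362 }
363
364 StatisticsDataset<CASA_STATP>& _getDataset() { return _dataset; }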
365 // </group>
366
367 virtual AccumType _getStatistic(StatisticsData::STATS stat) = 0;
368
369 virtual StatsData<AccumType> _getStatistics() = 0;
370
371 const std::set<StatisticsData::STATS> _getStatsToCalculate() const {
372 return _statsToCalculate;
373 }
374
375 virtual const std::set<StatisticsData::STATS>&
376 _getUnsupportedStatistics() const {
377 return _unsupportedStats;
378 }
379
380 // Derived classes should normally call this in their constructors, if
381 // applicable.
382 void _setUnsupportedStatistics(
383 const std::set<StatisticsData::STATS>& stats
384 ) {
385 _unsupportedStats = stats;
386 }
387
388private:
389 std::set<StatisticsData::STATS> _statsToCalculate{}, _unsupportedStats{};
390 StatisticsDataset<CASA_STATP> _dataset{};
392
394
395};
396
397}
398
399#ifndef CASACORE_NO_AUTO_TEMPLATES
400#include <casacore/scimath/StatsFramework/StatisticsAlgorithm.tcc>
401#endif
402
403#endif
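
The QUANTILES comment in the class documentation defines a quantile by its zero-based index ceil(q*n)-1 in the sorted data, and the median of an even-sized dataset as the mean of the elements at indices n/2-1 and n/2. The following is a minimal standalone sketch of those two definitions only. It does not use casacore, and the names quantileValue and medianValue are hypothetical helpers invented for illustration. It also ignores masks, weights, ranges, strides and the binning strategy that the real classes may apply.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not casacore API): value at zero-based index
// ceil(q*n)-1 of the sorted data, for 0 < q < 1 and a non-empty dataset.
double quantileValue(std::vector<double> data, double q) {
    assert(!data.empty() && q > 0.0 && q < 1.0);
    std::sort(data.begin(), data.end());
    const double n = static_cast<double>(data.size());
    const std::size_t idx = static_cast<std::size_t>(std::ceil(q * n)) - 1;
    return data[idx];
}

// Hypothetical helper (not casacore API): the middle element for odd n,
// the mean of the elements at indices n/2-1 and n/2 for even n.
double medianValue(std::vector<double> data) {
    assert(!data.empty());
    std::sort(data.begin(), data.end());
    const std::size_t n = data.size();
    if (n % 2 == 1) {
        return data[n / 2];
    }
    return (data[n / 2 - 1] + data[n / 2]) / 2.0;
}
```

For the values 4, 1, 3, 2 the q = 0.5 quantile under this definition is 2, which is an element of the dataset, while the median is 2.5, which is not. This is the distinction the comment draws for datasets with an even number of elements.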
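
The addData() comment describes how nr, dataStride and nrAccountsForStride together determine how many points are actually included. The sketch below is one reading of that rule as plain arithmetic. The function effectivePointCount is a hypothetical helper written for this illustration, not a casacore function, and the library's internal accounting may differ in detail.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not casacore API): number of points included for a
// dataset of nr points read with the given stride.
// If nrAccountsForStride is true, nr already is the number of points to use.
// Otherwise the count is nr/dataStride, rounded up when nr is not a multiple
// of dataStride.
std::uint64_t effectivePointCount(
    std::uint64_t nr, std::uint64_t dataStride, bool nrAccountsForStride
) {
    assert(dataStride >= 1);
    if (nrAccountsForStride || dataStride == 1) {
        return nr;
    }
    return nr / dataStride + (nr % dataStride == 0 ? 0 : 1);
}
```

With nr = 11 and dataStride = 2, this gives 6 points when nrAccountsForStride is false (indices 0, 2, 4, 6, 8, 10) and 11 points when it is true.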