Re: [guppi-list] Patches for Goose



On Tue, Mar 30, 1999 at 08:39:40PM -0500, Bradford Hovinen wrote:
> 
> Greetings!
> 
> I'm submitting to the list several additions to Goose that I hope you will 
> find useful. 

Excellent!

> 
> First, I looked through the source code for a while for statistical tests,
> and while my knowledge of statistics hardly qualifies me as an expert, I
> was somewhat hard-pressed to find anything that performed the tests
> for sample proportions and sample means.

You didn't find them because they weren't there.  In Goose, we do the
obscure first... the commonplace comes later. :-)

> `hypo-test.patch.gz' implements several common tests, including 1- and 2-
> proportion z-tests, 1- and 2- sample t-tests, a 1-sample z-test, and a
> paired t-test.

I actually had many of these written already, but hadn't checked them
in.  I'll go through your patch and extract what I can from it.

> It also has a rather rudimentary chi-square test based
> on CategoricalSet, but it is currently #ifdef'ed out since CategoricalSet
> is apparently not functioning.

Yeah, I still haven't decided how to best represent categorical data.
What is in there now is a (fairly broken) early rough draft.  The
question of how to do this right needs to be addressed eventually.

> Each test is implemented as a class inheriting the base HypothesisTest...

Good.  I've been focusing on confidence interval methods so, to avoid
code duplication, we might want to define the hypothesis tests in
terms of the (more general) confidence intervals.
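
To make the idea concrete: a two-sided test at level alpha rejects
H0: mu = mu0 exactly when mu0 falls outside the (1 - alpha) confidence
interval.  A minimal sketch, assuming a z interval with known sigma
and a critical value supplied by the caller; Interval,
z_interval_mean and rejects_null are made-up names for illustration,
not classes that exist in Goose:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical interval type for illustration; not Goose's actual class.
struct Interval {
    double lower;
    double upper;
    bool contains(double x) const { return lower <= x && x <= upper; }
};

// Two-sided z interval for a mean when sigma is known.
// z_crit is the critical value chosen by the caller (about 1.96 for 95%).
Interval z_interval_mean(double mean, double sigma, int n, double z_crit) {
    double half = z_crit * sigma / std::sqrt(static_cast<double>(n));
    return Interval{mean - half, mean + half};
}

// Reject H0: mu == mu0 iff mu0 lies outside the confidence interval.
bool rejects_null(const Interval& ci, double mu0) {
    return !ci.contains(mu0);
}
```

With this arrangement the test needs no arithmetic of its own; it only
asks the interval a question.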

> The proportion z-tests' constructors throw an exception if the basic
> assumptions to combat skewness in the sampling distribution are not met.

Now this raises an interesting philosophical issue: what do you do if
someone runs a test on data that doesn't match the underlying
assumptions of the test?  I don't mean pathological data, but
situations where the test statistic (or whatever) can perfectly well be
calculated, but just won't necessarily be meaningful.

I think that throwing an exception is not The Right Thing to do here.
Exceptions should be for unrecoverable errors, not a tool to stop
people from performing well-defined but ill-advised operations.  If
nothing else, this would make it impossible to write programs that
analyze how common tests *fail* when various assumptions are
violated.  (This isn't exactly an everyday thing to do, but it
certainly isn't something that we should implicitly disallow.)
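
One alternative is to compute the statistic anyway and report whether
the assumptions held, leaving the decision to the caller.  A rough
sketch for the 1-proportion z-test; ZTestResult,
one_proportion_z_test and the min_count threshold of 10 are
assumptions for illustration (not code from the patch), and truly
degenerate inputs such as p0 = 0 or n = 0 are not handled here:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type; not part of Goose as far as this thread shows.
struct ZTestResult {
    double z;              // test statistic
    double p_value;        // two-sided p-value
    bool assumptions_met;  // false if the normal approximation is doubtful
};

// One-proportion z-test of H0: p == p0.
// min_count is an assumed rule-of-thumb threshold (10 is one common choice).
ZTestResult one_proportion_z_test(int successes, int n, double p0,
                                  double min_count = 10.0) {
    double p_hat = static_cast<double>(successes) / n;
    double se = std::sqrt(p0 * (1.0 - p0) / n);
    double z = (p_hat - p0) / se;
    // Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    double p_value = std::erfc(std::fabs(z) / std::sqrt(2.0));
    // Flag, rather than forbid, samples too small for the approximation.
    bool ok = (n * p0 >= min_count) && (n * (1.0 - p0) >= min_count);
    return ZTestResult{z, p_value, ok};
}
```

A caller who cares can check assumptions_met and refuse to go on; a
caller studying how the test misbehaves can ignore it.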

> `ndtr.patch.gz' just adds a couple of overloads for the normal_cdf and
> t_cdf functions, allowing the calculation of the cdf between two points
> rather than just the area to the left of the given point.

Good.  Maybe it would be nice to have those kinds of functions for
all of the various cdfs?
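
The pattern is the same for any cdf F: the area between a and b is
F(b) - F(a).  A sketch for the normal case, assuming a one-argument
cdf built on std::erfc; the namespace and signatures are illustrative
and are not necessarily those used in the patch:

```cpp
#include <cassert>
#include <cmath>

namespace sketch {  // hypothetical namespace; not Goose's real signatures

// P(Z <= x) for a standard normal Z.
double normal_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// P(a <= Z <= b): the area between two points.
double normal_cdf(double a, double b) {
    return normal_cdf(b) - normal_cdf(a);
}

}  // namespace sketch
```

For example, sketch::normal_cdf(-1.96, 1.96) comes out at roughly
0.95.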

> `confint-smob.patch.gz' implements some Guile bindings to produce
> confidence intervals in the Guile console.

Excellent.  I've been too lazy to do this previously.

> `specfns-smob.patch.gz' adds bindings for the t cdf (t-cdf) and the
> inverse t cdf (inv-t-cdf) to Guile.

Ah, I did forget that one.  Good.

> Finally, `confint-prop.patch.gz' implements a couple of functions to
> produce confidence intervals for proportions. 

I've got a new unchecked-in file called
"parametric_estimation.{cpp,h}", which is where I'm putting all code
that (logically enough) estimates parameters under parametric
assumptions (like normality, Poisson, binomial, etc.)
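
For the binomial case, the simplest such estimate is the
normal-approximation (Wald) interval for a proportion.  A minimal
sketch with made-up names and the critical value passed in by the
caller; this is not the code from either the patch or my unchecked-in
file, and it makes no attempt to clamp the bounds to [0, 1]:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical type for illustration only.
struct ProportionInterval {
    double lower;
    double upper;
};

// Wald interval: p_hat +/- z_crit * sqrt(p_hat * (1 - p_hat) / n).
// z_crit is supplied by the caller (about 1.96 for 95% confidence).
ProportionInterval wald_interval(int successes, int n, double z_crit) {
    double p_hat = static_cast<double>(successes) / n;
    double half = z_crit * std::sqrt(p_hat * (1.0 - p_hat) / n);
    return ProportionInterval{p_hat - half, p_hat + half};
}
```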

> I hope I'm not duplicating too much of your work here. If I am, please let
> me know so that I know where to look for these features in the future.

The code duplication involved here isn't that bad.  In the future, we
can coordinate our efforts more.

I'll apply your patches, tweak things to eliminate duplication, and
check stuff into CVS in the next day or two.

If you (or anyone else reading this) are interested in projects to
work on, here are some ideas.  (Of course, what anyone does should be
driven by their own interests.)

* Confidence intervals and inference on estimated variances.

* Multiple Regression, following the model of how things are done
  in the case of simple Regression.

* Confidence intervals and inference on parameters of other
  distributions, such as exponential, Poisson, etc.

* Figure out a good interface and internal representation for
  categorical data sets.  A good solution will be general enough to
  work well for N-way tables.  However, 1-way and 2-way layouts should
  still be very easy to deal with.

* Anything for dealing with time series.  A way to fit time series
  data to various models (AR, ARIMA, ARCH, GARCH, etc.) would be
  excellent.

-JT


