1. Title of Database: Case Studies in Scientific Function Finding

2. Sources

   -- Donor:
         Cullen Schaffer
         Department of Computer Science
         Rutgers University
         New Brunswick, NJ 08903
         schaffer@paul.rutgers.edu

   -- Source:
         Cullen Schaffer, Domain-Independent Scientific Function Finding.
         PhD Thesis, Department of Computer Science, Rutgers University,
         1990 (Technical Report LCSR-TR-149).

   -- Date Received: Sept. 1, 1990

3. Past Usage:

   -- See Cullen Schaffer, "A Proven Domain-Independent Scientific
      Function-Finding Algorithm," in AAAI-90 for a brief account of the
      results of the original study based on this collection, or the PhD
      thesis cited above for an in-depth report.

      Schaffer's work includes (1) development of an algorithm, E*,
      designed to find functional relationships of scientific significance
      in data of the kind collected in this database; (2) analysis of
      previous scientific function-finding algorithms in the light of real
      data; and (3) a general inquiry into the nature of scientific
      function finding as practiced by scientists.

4. Overview:

   [Please note the use of LaTeX format here for algebraic expressions.
   See Leslie Lamport, LaTeX: A Document Preparation System,
   Addison-Wesley, 1986 for details.]

   This database contains 352 bivariate numeric data sets collected from
   diverse sources and resulting, with a few exceptions, from
   investigations in physical science.

   For each data set, the collection includes:

   1. Source: Bibliographic information for the source of the data.

   2. Description: Identification of the variables $x$ and $y$. Except in
      a few clearly identified instances, the abbreviated format $y$ vs.
      $x$ is employed. An entry of the form

         Description: Force vs. separation.

      indicates that $x$ is a separation and $y$ is a force. In some
      cases--when the information was readily available--the description
      also includes the units in which the data was originally reported.

   3. Reference relation: The functional relationship proposed by the
      reporting scientist in the original source.

   4. Comments (optional): Additional information pertaining to the case.

   In recording reference relations, the database often omits details of
   parameter values. If a scientist proposes $y=23.1x-.0014$, the
   reference relation may be given as just $y=k_{1}x+k_{2}$. Also, since
   algebraic transformations have been employed freely, the same relation
   might be given as $y/x=k_{2}/x+k_{1}$.

   In general, data collected here is given in full as it appeared in the
   original source. Fractions have been converted to decimals, numbers
   have been freely translated to and from scientific notation, and zeros
   have sometimes been added to decimal numbers to facilitate tabulation.
   Any additional deviations from verbatim transcription are noted in the
   Comments entry of the associated case. Note in particular that, in a
   few clearly identified cases, apparent typographical errors have been
   corrected and that, in others, data points identified by the reporting
   scientist as *not* conforming to the proposed relationship have been
   omitted.

5. Database organization:

   The 352 data sets in this collection are organized into 217 cases,
   each case normally consisting of one to four data sets reported in
   support of a common hypothesized relationship. An example is Case 91,
   which consists of two data sets--91a and 91b--reported to show the
   linear dependence of electric force on the inverse root of the radius
   of a conducting wire. With very few exceptions, cases are formed from
   data sets reported together in a single article or other publication.

   Cases are numbered in order of collection. A few early cases consisted
   of data for which no reference relation was proposed, and these have
   been omitted here. Hence, for example, the collection does not include
   Case 26. A complete list of the case numbers appearing in this
   collection is given at the end of this file.
   Briefly, the cases fall into two groups, described below:

   Cases 1 through 62 are "Selected":
   -- These were deliberately chosen as useful, notable or interesting
      from a wide variety of sources including handbooks, theses, journal
      articles, textbooks, student laboratory reports and others.

   Cases 63 through 222 are "Sampled":
   -- These were obtained by scanning issues of the journal Physical
      Review from the early years of this century and recording {\em all}
      examples of scientific function finding satisfying four conditions:

      1. The source reported a governing functional relationship.
      2. This relationship was bivariate.
      3. The data was reported in tabular rather than graphic form.
      4. The data was measured rather than theoretically postulated.

   The rationale for these conditions is given in Chapter 1 of the thesis
   cited above. Chapter 3 gives a detailed account of methodological
   difficulties encountered in attempting to apply the conditions
   objectively to obtain a representative sample of scientific
   function-finding problems. Finally, Appendix D lists a large number of
   data sets *not* collected for various specified reasons.

   ANYONE INTENDING TO EMPLOY THE DATA SETS IN THIS DATABASE IN A SERIOUS
   RESEARCH PROGRAM IS STRONGLY ADVISED TO CONSULT THESE CHAPTERS AND
   APPENDIX, SINCE THEY CONTAIN A HOST OF IMPORTANT CAVEATS REGARDING THE
   COLLECTION.

6. A note on use of the collection:

   This is, to date, the only collection of its kind in existence and as
   such it may be of use to researchers studying scientific function
   finding. Schaffer, for example, designed his E* function-finding
   algorithm on the basis of experience with Cases 1 through 122 and then
   tested the algorithm prospectively on Cases 123 through 222 to see how
   often it proposed the reference relation and how often it proposed
   other, presumably spurious relationships.
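
   The two case-number partitions described above (Selected vs. Sampled,
   and Schaffer's design vs. test cases) can be written down as simple
   ranges. The following is only an illustrative sketch: the names
   SELECTED, SAMPLED, DESIGN, TEST and describe_case are hypothetical and
   do not come from the thesis. The function only classifies a case
   number; it does not check whether that case actually appears in the
   collection (Case 26, for instance, was omitted).

```python
# Hypothetical names; only the numeric boundaries come from this file.
SELECTED = range(1, 63)    # Cases 1 through 62 ("Selected")
SAMPLED = range(63, 223)   # Cases 63 through 222 ("Sampled")
DESIGN = range(1, 123)     # Cases 1 through 122, used while designing E*
TEST = range(123, 223)     # Cases 123 through 222, tested prospectively


def describe_case(case_number):
    """Return (group, role) for a case number between 1 and 222."""
    if case_number not in range(1, 223):
        raise ValueError(f"case number out of range: {case_number}")
    group = "Selected" if case_number in SELECTED else "Sampled"
    role = "design" if case_number in DESIGN else "test"
    return group, role


if __name__ == "__main__":
    for n in (62, 63, 122, 123):
        print(n, describe_case(n))
```

   Note that every Selected case falls on the design side of the split,
   while the Sampled cases are divided between design and test.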

   Future researchers may wish to use the same data for similar purposes,
   but they must be careful to avoid "testing on the training set"--
   designing algorithms on the basis of this collection of problems and
   then reporting performance on the same problems. By contrast, Cases
   123 through 222 were fresh data for Schaffer, since he collected them
   after the design of E* was fixed. Researchers intending to use such a
   subset of cases for testing should refrain from examining them in any
   fashion prior to the test.

7. Maintenance:

   Send comments or corrections to Cullen Schaffer at the addresses
   listed above.

8. Case Numbers Appearing in this Database:

   1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,
   25,27,28,30a,30b,31,35,36,37a,37b,38,39,40,41a,41b,41c,41d,42,
   43a,43b,43c,44,45,46a,46b,47,48,49,50,51,52,53,54,55,56a,56b,
   56c,58,59,60,61a,61b,62a,62b,63,64,66,67,68,69,70a,70b,71a,71b,
   72,73,74,75,76a,76b,76c,77,78,79,80a,80b,81a,81b,82a,82b,83,84a,
   84b,84c,84d,85,86,87,88,89,90,91a,91b,92,93,94,95,96a,96b,97,
   98a,98b,98c,98d,99a,99b,100a,100b,100c,100d,101a,101b,102,103,104a,
   104b,105,106a,106b,106c,107,108a,108b,108c,108d,109,110,111a,111b,
   111c,112a,112b,113a,113b,113c,114,115,116,117,118,119,120,121,
   122a,122b,122c,122d,123,124,125a,125b,126a,126b,127a,127b,127c,128a,
   128b,128c,128d,129a,129b,129c,130a,130b,130c,131a,131b,131c,132,133,
   134a,134b,135a,135b,135c,135d,136a,136b,136c,136d,137,138a,138b,139,
   140,141,142,143,144a,144b,145a,145b,146,147,148a,148b,148c,148d,
   149,150,151,152a,152b,152c,153a,153b,153c,153d,154,155,156,157,
   158a,158b,158c,159a,159b,160,161a,161b,162,163,164,165,166a,166b,
   167,168,169,170a,170b,171a,171b,171c,172,173a,173b,174,175,176a,
   176b,177,178a,178b,178c,179a,179b,180,181a,181b,181c,181d,182a,182b,
   182c,182d,183a,183b,183c,184,185,186a,186b,186c,186d,187a,187b,187c,
   187d,188,189,190a,190b,190c,191,192,193,194a,194b,194c,194d,195a,195b,
   196a,196b,196c,196d,197a,197b,197c,197d,198,199a,199b,200a,200b,200c,
   200d,201,202,203,204a,204b,204c,204d,205,206a,206b,206c,206d,207,
   208,209a,209b,209c,209d,210,211,212a,212b,212c,213a,213b,214a,214b,
   214c,214d,215,216,217,218a,218b,218c,218d,219,220,221,222
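
   For readers who want to work with the listing above programmatically,
   the following minimal sketch splits a comma-separated listing into
   data set identifiers and groups them by case. The helper names
   (parse_ids, group_by_case) are hypothetical, and the short sample
   string is an excerpt of the listing rather than the full list. The
   sketch assumes each identifier is a case number optionally followed by
   a single lowercase letter, as in "91a".

```python
import re
from collections import defaultdict

# Assumed identifier shape: digits plus an optional lowercase letter.
_ID = re.compile(r"^(\d+)([a-z]?)$")


def parse_ids(listing):
    """Split a comma-separated listing into (case_number, suffix) pairs."""
    ids = []
    for token in listing.replace("\n", " ").split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray or doubled commas
        match = _ID.match(token)
        if match is None:
            raise ValueError(f"unrecognized data set ID: {token!r}")
        ids.append((int(match.group(1)), match.group(2)))
    return ids


def group_by_case(ids):
    """Map each case number to the list of its data set identifiers."""
    cases = defaultdict(list)
    for number, suffix in ids:
        cases[number].append(f"{number}{suffix}")
    return dict(cases)


if __name__ == "__main__":
    sample = "25,27,28,30a,30b,91a,91b"  # excerpt only
    for case, data_sets in sorted(group_by_case(parse_ids(sample)).items()):
        print(case, data_sets)
```

   Applied to the full listing, the number of parsed identifiers and the
   number of distinct case numbers can be compared against the totals
   stated in the Overview and Database organization sections.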