Filter
train_test_filter(seed=180, train_p=0.5)
A pipeline filter which pseudorandomly assigns values from the input into "train" or "test" groups. This is particularly useful when working with streams of data of potentially unbounded length.
This method is not well suited to small datasets, as it does not guarantee that the resulting training and test sets match the specified proportions as closely as possible. Consider using scikit-learn's train_test_split in cases where it is practical to enumerate the full dataset in advance.
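For intuition, here is a minimal, hypothetical sketch of how such a pair of complementary streaming filters could be built with Python's standard library. The name toy_train_test_filter and all of its internals are illustrative assumptions, not the actual autora implementation:

```python
import random
from typing import Callable, Iterable, Iterator, Tuple


def toy_train_test_filter(
    seed: int = 180, train_p: float = 0.5
) -> Tuple[Callable[[Iterable], Iterator], Callable[[Iterable], Iterator]]:
    # Illustrative sketch only, not the library's real implementation.
    def make_filter(keep_train: bool) -> Callable[[Iterable], Iterator]:
        # Each filter owns one RNG seeded identically, so both filters see
        # the same sequence of coin flips and partition the input exactly.
        rng = random.Random(seed)

        def _filter(values: Iterable) -> Iterator:
            for value in values:
                # One coin flip per incoming value; yield on the matching branch.
                if (rng.random() < train_p) == keep_train:
                    yield value

        return _filter

    return make_filter(True), make_filter(False)
```

Because each RNG lives in the filter's closure rather than being re-seeded per call, repeated calls to the same filter continue the flip stream, which mirrors the rerun behaviour shown in the examples below.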
Parameters:
Name | Type | Description | Default |
---|---|---|---|
seed | int | random number generator seeding value | 180 |
train_p | float | proportion of data which go into the training set; a float between 0 and 1 | 0.5 |
Returns:
Type | Description |
---|---|
Tuple[Callable[[Iterable], Iterable], Callable[[Iterable], Iterable]] | a tuple of callables |
Examples:
We can create complementary train and test filters using the function:
>>> train_filter, test_filter = train_test_filter(train_p=0.6, seed=180)
The train_filter yields approximately 60% of the input values. In this case it happens to pass 15 of the 20 datapoints, whereas an exact 60% split would be 12 of 20. Again, for data of bounded length it is advisable to use scikit-learn's train_test_split instead.
>>> list(train_filter(range(20)))
[0, 2, 3, 4, 5, 6, 9, 10, 11, 12, 15, 16, 17, 18, 19]
When we run the test_filter, it fills in the gaps, giving us the remaining 5 values:
>>> list(test_filter(range(20)))
[1, 7, 8, 13, 14]
We can continue to generate new values for as long as we like using the same filter and the continuation of the input range:
>>> list(train_filter(range(20, 40)))
[20, 22, 23, 27, 28, 29, 30, 31, 32, 33, 34, 36, 37, 38, 39]
... and some more.
>>> list(train_filter(range(40, 50)))
[41, 42, 44, 45, 46, 49]
As the number of samples grows, the fractions in the train and test sets will approach train_p and 1 - train_p respectively.
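We can see this convergence numerically with a standalone illustration using Python's random module (a stand-in assumption for the filter's internal RNG, which may differ), counting how many of n seeded coin flips land in the "train" branch:

```python
import random


def train_fraction(n: int, seed: int = 180, train_p: float = 0.6) -> float:
    # One seeded coin flip per datapoint, mimicking the assignment rule.
    rng = random.Random(seed)
    return sum(rng.random() < train_p for _ in range(n)) / n
```

For small n the observed fraction can be well off target, but for large n it settles close to train_p.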
The test_filter fills in the gaps again.
>>> list(test_filter(range(20, 30)))
[21, 24, 25, 26]
If you rerun the same test_filter on a fresh range, the results will be different from the first time around, as the filter's pseudorandom state has advanced:
>>> list(test_filter(range(20)))
[5, 10, 13, 17, 18]
... but if you regenerate the test_filter with the same seed, it will reproduce the original sequence:
>>> _, test_filter_regenerated = train_test_filter(train_p=0.6, seed=180)
>>> list(test_filter_regenerated(range(20)))
[1, 7, 8, 13, 14]
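The reproducibility comes from seeding. Using Python's standard library as an analogy (how the library seeds its own RNG internally is an assumption here), a generator re-created with the same seed replays exactly the same draws:

```python
import random


def draws(seed: int, n: int) -> list:
    # A fresh generator constructed with the same seed replays the same draws.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]
```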
It also works on tuple-valued data:
>>> from itertools import product
>>> train_filter_tuple, test_filter_tuple = train_test_filter(train_p=0.3, seed=42)
>>> list(test_filter_tuple(product(["a", "b"], [1, 2, 3])))
[('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 3)]
>>> list(train_filter_tuple(product(["a","b"], [1,2,3])))
[('b', 2)]
The filters also handle streams of unbounded length, such as itertools.count; use takewhile (or similar) to consume finite chunks:
>>> from itertools import count, takewhile
>>> train_filter_unbounded, test_filter_unbounded = train_test_filter(train_p=0.5, seed=21)
>>> list(takewhile(lambda s: s < 90, count(79)))
[79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
>>> train_pool = train_filter_unbounded(count(79))
>>> list(takewhile(lambda s: s < 90, train_pool))
[82, 85, 86, 89]
>>> test_pool = test_filter_unbounded(count(79))
>>> list(takewhile(lambda s: s < 90, test_pool))
[79, 80, 81, 83, 84, 87, 88]
>>> list(takewhile(lambda s: s < 110, test_pool))
[91, 93, 94, 97, 100, 105, 106, 109]
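One caveat with this pattern, shown here with only the standard library: takewhile consumes and discards the first element that fails its predicate, so a boundary value silently disappears from the shared iterator:

```python
from itertools import count, takewhile

pool = count(0)
first = list(takewhile(lambda s: s < 3, pool))
second = list(takewhile(lambda s: s < 6, pool))
# takewhile pulled 3 from the pool to test it, found the predicate false,
# and dropped it, so 3 appears in neither chunk:
# first == [0, 1, 2], second == [4, 5]
```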
Source code in autora/experimentalist/filter.py