filter

train_test_filter(seed=180, train_p=0.5)

A pipeline filter which pseudorandomly assigns values from the input into "train" or "test" groups. This is particularly useful when working with streams of data of potentially unbounded length.

This isn't a great method for small datasets, as it doesn't guarantee producing training and test sets which are as close as possible to the specified desired proportions. Consider using the scikit-learn train_test_split for cases where it's practical to enumerate the full dataset in advance.
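For a dataset that fits in memory, an exact split is straightforward to do by hand as well; a minimal stdlib sketch (the function name and defaults here are illustrative, not part of this package):

```python
import random

def exact_train_test_split(data, train_p=0.5, seed=180):
    """Exact split for bounded data: materialize, shuffle, slice."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_train = round(train_p * len(items))
    return items[:n_train], items[n_train:]

train, test = exact_train_test_split(range(20), train_p=0.6)
print(len(train), len(test))  # 12 8 -- exactly the requested proportions
```

Unlike the streaming filter, this always hits the requested proportions exactly, at the cost of enumerating the whole dataset up front.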

Parameters:

    seed (int, default 180): random number generator seeding value.
    train_p (float, default 0.5): proportion of data which goes into the training set; a float between 0 and 1.

Returns:

    Tuple[Callable[[Iterable], Iterable], Callable[[Iterable], Iterable]]:
    a tuple of callables (train_filter, test_filter) which split the input data into two complementary streams.

Examples:

We can create complementary train and test filters using the function:

>>> train_filter, test_filter = train_test_filter(train_p=0.6, seed=180)

The train_filter generates a sequence of roughly 60% of the input list – in this case 15 of 20 datapoints, whereas an exact 60% split would be 12 of 20. Again, for data of bounded length it is advisable to use scikit-learn train_test_split instead.

>>> list(train_filter(range(20)))
[0, 2, 3, 4, 5, 6, 9, 10, 11, 12, 15, 16, 17, 18, 19]

When we run the test_filter, it fills in the gaps, giving us the remaining 5 values:

>>> list(test_filter(range(20)))
[1, 7, 8, 13, 14]

We can continue to generate new values for as long as we like using the same filter and the continuation of the input range:

>>> list(train_filter(range(20, 40)))
[20, 22, 23, 27, 28, 29, 30, 31, 32, 33, 34, 36, 37, 38, 39]

... and some more.

>>> list(train_filter(range(40, 50)))
[41, 42, 44, 45, 46, 49]

As the number of samples grows, the fraction in the train and test sets will approach train_p and 1 - train_p.
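This convergence can be checked directly: each assignment is an independent pseudorandom draw, so the observed train fraction obeys the law of large numbers. A stdlib sketch of the same Bernoulli process (the real filter uses numpy's generator, so the individual draws differ):

```python
import random

train_p = 0.6
for n in (100, 10_000, 1_000_000):
    rng = random.Random(180)
    # Count how many of n independent draws fall in the "train" group.
    n_train = sum(rng.random() < train_p for _ in range(n))
    print(n, n_train / n)  # the fraction drifts toward 0.6 as n grows
```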

The test_filter fills in the gaps again.

>>> list(test_filter(range(20, 30)))
[21, 24, 25, 26]

If you rerun the same test_filter on a fresh range, the results will be different from the first time around:

>>> list(test_filter(range(20)))
[5, 10, 13, 17, 18]

... but if you regenerate the test_filter, it'll reproduce the original sequence:

>>> _, test_filter_regenerated = train_test_filter(train_p=0.6, seed=180)
>>> list(test_filter_regenerated(range(20)))
[1, 7, 8, 13, 14]

It also works on iterables of tuples:

>>> from itertools import product
>>> train_filter_tuple, test_filter_tuple = train_test_filter(train_p=0.3, seed=42)
>>> list(test_filter_tuple(product(["a", "b"], [1, 2, 3])))
[('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 3)]
>>> list(train_filter_tuple(product(["a","b"], [1,2,3])))
[('b', 2)]
>>> from itertools import count, takewhile
>>> train_filter_unbounded, test_filter_unbounded = train_test_filter(train_p=0.5, seed=21)
>>> list(takewhile(lambda s: s < 90, count(79)))
[79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
>>> train_pool = train_filter_unbounded(count(79))
>>> list(takewhile(lambda s: s < 90, train_pool))
[82, 85, 86, 89]
>>> test_pool = test_filter_unbounded(count(79))
>>> list(takewhile(lambda s: s < 90, test_pool))
[79, 80, 81, 83, 84, 87, 88]
>>> list(takewhile(lambda s: s < 110, test_pool))
[91, 93, 94, 97, 100, 105, 106, 109]
Source code in autora/experimentalist/filter.py
def train_test_filter(
    seed: int = 180, train_p: float = 0.5
) -> Tuple[Callable[[Iterable], Iterable], Callable[[Iterable], Iterable]]:
    """
    A pipeline filter which pseudorandomly assigns values from the input into "train" or "test"
    groups. This is particularly useful when working with streams of data of potentially
    unbounded length.

    This isn't a great method for small datasets, as it doesn't guarantee producing training
    and test sets which are as close as possible to the specified desired proportions.
    Consider using the scikit-learn `train_test_split` for cases where it's practical to
    enumerate the full dataset in advance.

    Args:
        seed: random number generator seeding value
        train_p: proportion of data which goes into the training set. A float between 0 and 1.

    Returns:
        a tuple of callables `(train_filter, test_filter)` which split the input data
            into two complementary streams.


    Examples:
        We can create complementary train and test filters using the function:
        >>> train_filter, test_filter = train_test_filter(train_p=0.6, seed=180)

        The `train_filter` generates a sequence of roughly 60% of the input list –
        in this case 15 of 20 datapoints,
        whereas an exact 60% split would be 12 of 20.
        Again, for data of bounded length it is advisable
        to use scikit-learn `train_test_split` instead.
        >>> list(train_filter(range(20)))
        [0, 2, 3, 4, 5, 6, 9, 10, 11, 12, 15, 16, 17, 18, 19]

        When we run the `test_filter`, it fills in the gaps, giving us the remaining 5 values:
        >>> list(test_filter(range(20)))
        [1, 7, 8, 13, 14]

        We can continue to generate new values for as long as we like using the same filter and the
        continuation of the input range:
        >>> list(train_filter(range(20, 40)))
        [20, 22, 23, 27, 28, 29, 30, 31, 32, 33, 34, 36, 37, 38, 39]

        ... and some more.
        >>> list(train_filter(range(40, 50)))
        [41, 42, 44, 45, 46, 49]

        As the number of samples grows, the fraction in the train and test sets
        will approach `train_p` and `1 - train_p`.

        The test_filter fills in the gaps again.
        >>> list(test_filter(range(20, 30)))
        [21, 24, 25, 26]

        If you rerun the *same* test_filter on a fresh range, the results will be different
        from the first time around:
        >>> list(test_filter(range(20)))
        [5, 10, 13, 17, 18]

        ... but if you regenerate the test_filter, it'll reproduce the original sequence:
        >>> _, test_filter_regenerated = train_test_filter(train_p=0.6, seed=180)
        >>> list(test_filter_regenerated(range(20)))
        [1, 7, 8, 13, 14]

        It also works on iterables of tuples:
        >>> from itertools import product
        >>> train_filter_tuple, test_filter_tuple = train_test_filter(train_p=0.3, seed=42)
        >>> list(test_filter_tuple(product(["a", "b"], [1, 2, 3])))
        [('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 3)]

        >>> list(train_filter_tuple(product(["a","b"], [1,2,3])))
        [('b', 2)]

        >>> from itertools import count, takewhile
        >>> train_filter_unbounded, test_filter_unbounded = train_test_filter(train_p=0.5, seed=21)

        >>> list(takewhile(lambda s: s < 90, count(79)))
        [79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89]

        >>> train_pool = train_filter_unbounded(count(79))
        >>> list(takewhile(lambda s: s < 90, train_pool))
        [82, 85, 86, 89]

        >>> test_pool = test_filter_unbounded(count(79))
        >>> list(takewhile(lambda s: s < 90, test_pool))
        [79, 80, 81, 83, 84, 87, 88]

        >>> list(takewhile(lambda s: s < 110, test_pool))
        [91, 93, 94, 97, 100, 105, 106, 109]

    """

    test_p = 1 - train_p

    _TrainTest = Enum("_TrainTest", ["train", "test"])

    def train_test_stream():
        """Generates a pseudorandom stream of _TrainTest.train and _TrainTest.test."""
        rng = np.random.default_rng(seed)
        while True:
            yield rng.choice([_TrainTest.train, _TrainTest.test], p=(train_p, test_p))

    def _factory(allow):
        """Factory to make complementary generators which split their input
        corresponding to the values of the pseudorandom train_test_stream."""
        _stream = train_test_stream()

        def _generator(values):
            """Generator which yields items from the `values` depending on
            whether the corresponding item from the `_stream`
            matches the `allow` parameter."""
            for v, train_test in zip(values, _stream):
                if train_test == allow:
                    yield v

        return _generator

    return _factory(_TrainTest.train), _factory(_TrainTest.test)
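
A note on the design: each call to `_factory` builds its own `train_test_stream`, but because both streams are seeded identically they yield the same assignment sequence, draw for draw. That is what makes the two returned filters complementary without sharing any mutable state. A stdlib sketch of that property (illustrative, not from the package):

```python
import random

def assignments(seed, n, train_p=0.5):
    """One independently seeded stream of train/test draws."""
    rng = random.Random(seed)
    return [rng.random() < train_p for _ in range(n)]

# Two streams with the same seed agree on every draw, so filters built
# from them partition the input into complementary subsets.
print(assignments(180, 10) == assignments(180, 10))  # True
```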