=head1 NAME

TUWF::Validate - Data and form validation and normalization

=head1 DESCRIPTION

This module provides an easy and simple interface for data validation. It can
handle most types of data structures (scalars, hashes, arrays and nested data
structures), and has some conveniences for validating form-like data.

This module requires no additional modules from CPAN, and can be used
stand-alone, outside of the L<TUWF> ecosystem.  For integration with L<TUWF>,
see the C<compile()> and C<validate()> methods in L<TUWF::Misc>.

Note that this module will not solve B<all> your input validation problems. It
can validate the format and the structure of the data, but it does not support
validations that depend on other input values. For example, it is not possible
to specify that the contents of a I<password> field must be equivalent to that
of a I<confirm_password> field, but you can specify that both fields need to be
filled out. Recursive data structures are not supported. There is also no
built-in support for validating hashes with dynamic keys or arrays where not
all elements conform to the same schema. These could technically still be
validated with custom validations, but it won't be as convenient.

This module is designed to validate any kind of program input after it has been
parsed into a Perl data structure. It should not be used to validate function
parameters within Perl code. In fact, the correct answer to "how do I validate
function parameters?" is "don't, document your assumptions instead".

=head1 API

=head2 Validation

L<TUWF::Validate> provides two functions: C<compile()> and C<validate()>. These
functions can be called with the full package name, but are also exported on
request.

  use feature 'state';  # 'state' variables need Perl 5.10+
  use TUWF::Validate;
  state $validator = TUWF::Validate::compile($validations, $schema);
  my $result = $validator->validate($input);

  # Equivalent:
  use TUWF::Validate qw/compile/;
  state $validator = compile $validations, $schema;
  my $result = $validator->validate($input);

C<validate()> can also be used as a function with three arguments, so you can
skip the compilation step:

  use TUWF::Validate qw/validate/;
  my $result = validate $validations, $schema, $input;

  # Is equivalent to:
  use TUWF::Validate qw/compile/;
  my $result = compile($validations, $schema)->validate($input);

But if you are going to use the same schema to validate multiple inputs, it may
be faster to call C<compile()> only once and reuse the compiled C<$validator>
object.
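
As a minimal sketch of that compile-once pattern, the following validates a
list of inputs against one compiled schema. The C<@inputs> list and the
C<< uint => 1 >> schema are made up for illustration.

  use strict;
  use warnings;
  use TUWF::Validate qw/compile/;

  # Compile the schema once, outside the loop.
  my $validator = compile({}, { uint => 1 });

  # Hypothetical inputs; the same validator is reused for each of them.
  my @inputs = ('1', '42', 'abc');
  my @ok = grep { $validator->validate($_) } @inputs;
  # @ok holds the inputs that validated, here '1' and '42'.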

In the above examples, C<$schema> is the schema that describes the data to be
validated (see L</SCHEMA DEFINITION> below), C<$validations> is a hashref
containing L<custom validations|/Custom validations> that C<$schema> can refer
to, C<$input> is the data to be validated, and the C<$result> object is
L<described below|/Result object>.

Both C<compile()> and C<validate()> may throw an error if the C<$validations>
or C<$schema> are invalid. Errors in the C<$input> should never cause an error
to be thrown; these are always reported in the C<$result> object.

This module takes great care that C<$input> is not being modified in place,
even if data normalization is being performed. The normalized data can be read
from the C<$result> object.
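
For instance, the default whitespace removal should only affect the copy held
by the result object. A small sketch, using an arbitrary example string and the
empty schema:

  use strict;
  use warnings;
  use TUWF::Validate qw/validate/;

  my $input  = '  hello  ';
  my $result = validate({}, {}, $input);

  # The normalized value lives in the result object...
  my $clean = $result->data;   # 'hello'
  # ...while $input still holds the original, untrimmed string.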

=head2 Result object

The C<$result> object returned by C<validate()> overloads boolean context, so
you can check if the validation succeeded with a simple if statement:

  my $result = TUWF::Validate::validate(..);
  if($result) {
    # Success!
    my $data = $result->data;
  } else {
    # Input failed to validate...
    my $error = $result->err;
  }

In addition, the result object implements the following methods:

=over

=item data()

Returns the validated and normalized data. This method will throw an error if
validation failed, so if you're lazy and don't want to bother too much with
proper error reporting, you can safely I<validate-and-die> in a single step:

  my $validated_data = validate(..)->data;

(Note regarding reference semantics: The returned data will usually be a
(possibly modified) copy of C<$input>, but may in some cases still have nested
references to data in C<$input> - so if you are working with nested hashrefs,
arrayrefs or other objects and are going to make modifications to the values
embedded within them, these changes may or may not also affect the values in
the original C<$input>. Make a deep copy of the data if you're concerned about
this).

=item unsafe_data()

Same as C<data()>, but does not throw an error if validation failed. Instead,
it returns the partially validated/normalized data. Can be used to throw the
data back at the user in a "Here, this is what I made of it, but I still don't
like it so please fix it!" fashion.

=item err()

Returns I<undef> if validation succeeded, an error object otherwise.

An error object is a hashref containing at least one key: I<validation>, which
indicates the name of the validation that failed. Additional keys with more
detailed information may be present, depending on the validation. These are
documented in L</SCHEMA DEFINITION> below.

=back
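
The three methods can be combined as in the following sketch. The input
C<'abc'> and the C<< uint => 1 >> schema are arbitrary choices that make the
validation fail.

  use strict;
  use warnings;
  use TUWF::Validate qw/validate/;

  # 'abc' is not an unsigned integer, so validation fails.
  my $result = validate({}, { uint => 1 }, 'abc');

  # Name of the validation that failed, or 'none' on success.
  my $name = $result ? 'none' : $result->err->{validation};

  # What the validator made of the input, without throwing.
  my $echo = $result->unsafe_data;

  # data() throws on failure, so wrap it in eval when that matters.
  my $data = eval { $result->data };
  my $died = defined $data ? 0 : 1;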


=head1 SCHEMA DEFINITION

A schema is a hashref in which each key is the name of a built-in option or of
a validation to be performed. None of the options or validations are required,
but some built-ins have default values. This means that the empty schema C<{}>
is actually equivalent to:

  { type         => 'scalar',
    rmwhitespace => 1,
    required     => 1
  }
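
A short sketch of what those defaults mean in practice, using the empty schema
and three arbitrary inputs:

  use strict;
  use warnings;
  use TUWF::Validate qw/validate/;

  # Surrounding whitespace is removed (rmwhitespace => 1).
  my $ok = validate({}, {}, ' some text ');

  # Only whitespace counts as empty, so this fails (required => 1).
  my $empty = validate({}, {}, '   ');

  # An array is not a scalar, so this fails (type => 'scalar').
  my $array = validate({}, {}, [1, 2]);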

=head2 Built-in options

=over

=item type => $type

Specify the required type of the input. This can be I<scalar>, I<array>,
I<hash> or I<any>. If no type is specified or implied by other validations, the
default type is I<scalar>.

Upon failure, the error object will look something like:

  { validation => 'type',
    expected   => 'hash',
    got        => 'scalar'
  }

=item required => 0/1

Whether this input is required to have a value. Specifically, this means
C<exists($x) && defined($x) && $x ne ''>. If the input is empty and this option
is disabled, the value of the I<default> option is returned if it is set;
otherwise the input is simply returned as-is.

As a corollary: Other validations will never get to validate undef or an empty
string; these values are either rejected or substituted with a default.

Note that this option is checked after I<rmwhitespace> and before any other
validation. So a string containing only whitespace is considered an empty
string, and will fail the I<required> test.

Default: true.

=item default => $val

The value to return if I<required> is false and the input is empty or undef.

=item rmwhitespace => 0/1

By default, any whitespace around scalar-type input is removed before testing
any other validations. Setting I<rmwhitespace> to a false value will disable
this behavior.

=item keys => $hashref

For C<< type => 'hash' >>, this option specifies which keys are permitted, and
how to validate the values. Each key in C<$hashref> corresponds to a key with
the same name in the input. Each value is a schema definition by which the
value in the input will be validated. The schema definition may be a bare
hashref or a validator returned by C<compile()>. If a value with
C<< required => 0 >> is not present in the input hash, it will be created in
the output with the default value (or undef).

For example, the following schema specifies that the input must be a hash with
three keys:

  { type => 'hash',
    keys => {
      username => { maxlength => 16 },
      password => { minlength => 8 },
      email    => { required => 0, email => 1 }
    }
  }

If validation on one or more keys fails, the error object that is returned
looks like:

  { validation => 'keys',
    errors => [
      # List of error objects, each with an additional 'key' field.
      { key => 'username', validation => 'required' }
      # In this case, the username was required but either absent or empty.
    ]
  }

=item unknown => $option

For C<< type => 'hash' >>, this option specifies what to do with keys in the
input data that have not been defined in the I<keys> option. Possible values
are I<remove> to remove unknown keys from the output data (this is the
default), I<reject> to return an error if there are unknown keys in the input,
or I<pass> to pass through any unknown keys to the output data. Note that the
values for passed-through keys will not be validated against any schema!

In the case of I<reject>, the error object will look like:

  { validation => 'unknown',
    # List of unknown keys present in the input
    keys       => ['unknown1', .. ],
    # List of known keys (which may or may not be present
    # in the input - that is checked at a later stage)
    expected   => ['known1', .. ]
  }

=item values => $schema

For C<< type => 'array' >>, this defines the schema that applies to all items
in the array.  The schema definition may be a bare hashref or a validator
returned by C<compile()>.

Failure is reported in a similar fashion to I<keys>:

  { validation => 'values',
    errors => [
      { index => 1, validation => 'required' }
    ]
  }

=item scalar => 0/1

For C<< type => 'array' >>, this option will also permit the input to be a
scalar. In this case, the input is interpreted and returned as an array with
only one element. This option exists to make it easy to validate multi-value
form inputs. For example, suppose that we wanted to parse a query string where
an option may be present multiple times with different values, like in
C<a=1&b=2&a=3>, and suppose that we have a query string parser that, given such
a string, would parse that into the following hash:

  { a => [1, 3], b => 2 }

But if C<a> is only specified once, it would parse into a scalar instead of an
array. With the I<scalar> option, we can permit C<a> to be a scalar and force
it into a single-element array. The following schema definition will validate
the above hash:

  { type => 'hash',
    keys => {
      a => { type => 'array', scalar => 1 },
      b => { }
    }
  }

=item sort => $option

For C<< type => 'array' >>, sort the array after validating its elements.
C<$option> determines how the array is sorted. Possible values are I<str> for
string comparison, I<num> for numeric comparison, or a subroutine reference for
a custom comparison function. The subroutine must be similar to the one given
to Perl's C<sort()> function, except it should compare C<$_[0]> and C<$_[1]>
instead of C<$a> and C<$b>.

=item unique => $option

For C<< type => 'array' >>, require elements to be unique. That is, don't allow
duplicate elements. There are several ways to specify what uniqueness means in
this context:

If C<$option> is a subroutine reference, then the subroutine is given an
element as first argument, and it should return a string that is used to check
for uniqueness. For example, if array elements are hashes, and you want to
check for uniqueness of a hash key named I<id>, you can specify this as
C<< unique => sub { $_[0]{id} } >>.

Otherwise, if C<$option> is true and the I<sort> option is set, then the
comparison function used for sorting is also used as uniqueness check. Two
elements are the same if the comparison function returns C<0>.

If C<$option> is true and I<sort> is not set, then the elements will be
interpreted as strings, similar to setting C<< unique => sub { $_[0] } >>.

All of that may sound complicated, but it's quite easy to use. Here are a few
examples:

  # This describes an array of hashes with keys 'id' and 'name'.
  { type => 'array',
    values => {
      type => 'hash',
      keys => {
        id   => { uint => 1 },
        name => {}
      }
    },
    # Sort the array on 'id'
    sort => sub { $_[0]{id} <=> $_[1]{id} },
    # And require that 'id' fields are unique
    unique => 1
  }

  # Contrived example: An array of strings, and we want
  # each string to start with a different character.
  { type => 'array',
    values => { minlength => 1 },
    unique => sub { substr $_[0], 0, 1 }
  }

On failure, this validation returns the following error object. This output
assumes the first schema from the previous example.

  { validation => 'unique',
    # Index and value of element a
    index_a => 1,
    value_a => { id => 3, name => 'whatever' },
    # Index and value of duplicate element b
    index_b => 4,
    value_b => { id => 3, name => 'something else' },
    # If string-based uniqueness was used, this is included as well:
    # key => '..'
  }


=item func => $sub

Run the input through a subroutine to perform additional validation or
normalization. The subroutine is only called after all other validations have
been checked. The subroutine is called with the input as its only argument.
Normalization of the input can be done by assigning to the first argument or
modifying its value in-place.

On success, the subroutine should return a true value. On failure, it should
return either a false value or a hashref. The hashref will have the
I<validation> key set to I<func>, and this will be returned as error object.

(Note that, when I<func> is used inside a custom validation, the returned error
object will have its I<validation> field set to the name of the custom
validation. This makes custom validations behave as first-class validations in
terms of error reporting).


=back
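
Several of these options can be combined in one schema. The following is a
sketch only: the key names (I<name>, I<tags>, I<page>, I<colour>) and the
inputs are invented for illustration.

  use strict;
  use warnings;
  use TUWF::Validate qw/compile/;

  my $validator = compile({}, {
    type    => 'hash',
    unknown => 'reject',   # refuse keys that are not listed below
    keys    => {
      name => { maxlength => 16 },
      # Accept either a single scalar or an array of non-empty strings.
      tags => { type => 'array', scalar => 1, values => { minlength => 1 } },
      # Optional; falls back to 1 when absent or empty.
      page => { required => 0, default => 1, uint => 1 },
    },
  });

  # 'tags' is given as a scalar and 'page' is left out.
  my $good = $validator->validate({ name => ' Alice ', tags => 'perl' });
  my $data = $good->data;
  # $data->{name} is trimmed, $data->{tags} is a one-element array
  # and $data->{page} holds the default.

  # 'colour' is not a known key, so this input is rejected.
  my $bad = $validator->validate({ name => 'Bob', tags => [], colour => 'red' });
  my $why = $bad->err->{validation};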

=head2 Standard validations

Standard validations are provided by the module. It is possible to override,
re-implement and supplement these with custom validations. Internally, these
are, in fact, implemented as custom validations.

=over

=item regex => $re

Implies C<< type => 'scalar' >>. Validate the input against a regular
expression.

=item enum => $options

Implies C<< type => 'scalar' >>. Validate the input against a list of known
values. C<$options> can be either a scalar (in which case that is the only
permitted input), an array (listing all possible inputs) or a hash (where the
hash keys are considered to be the list of permitted inputs).

=item minlength => $num

Minimum length of the input. The I<length> is the string C<length()> if the
input is a scalar, the number of elements if the input is an array, or the
number of keys if the input is a hash.

=item maxlength => $num

Maximum length of the input.

=item length => $option

If C<$option> is a number, then this specifies the exact length of the input.
If C<$option> is an array, then this is a shorthand for
C<[$minlength,$maxlength]>.

=item anybool => 1

Accept any value of any type as input, and normalize it to either a C<0> or a
C<1> according to Perl's idea of truth.

=item jsonbool => 1

Require the input to be a boolean type returned by a JSON parser. Supported
types are L<JSON::PP>, L<JSON::XS>, L<Types::Serialiser>, L<Cpanel::JSON::XS>
and L<boolean>.

=item num => 1

Implies C<< type => 'scalar' >>. Require the input to be a number formatted
using the format permitted by JSON. Note that this is slightly more restrictive
than Perl's number formatting, in that 'NaN', 'Inf' and thousand separators are
not permitted.

=item int => 1

Implies C<< type => 'scalar' >>. Require the input to be an (arbitrarily large)
integer.

=item uint => 1

Implies C<< type => 'scalar' >>. Require the input to be an (arbitrarily large)
non-negative integer.

=item min => $num

Implies C<< num => 1 >>. Require the input to be larger than or equal to
C<$num>.

=item max => $num

Implies C<< num => 1 >>. Require the input to be smaller than or equal to
C<$num>.

=item range => [$min,$max]

Equivalent to C<< min => $min, max => $max >>.

=item ascii => 1

Implies C<< type => 'scalar' >>. Require the input to wholly consist of
printable ASCII characters.

=item ipv4 => 1

Implies C<< type => 'scalar' >>. Require the input to be an IPv4 address.

=item ipv6 => 1

Implies C<< type => 'scalar' >>. Require the input to be an IPv6 address. Note
that the IP address is not normalized, and fancy features such as
IPv4-mapped-IPv6 addresses are not permitted.

=item ip => 1

Require either C<< ipv4 => 1 >> or C<< ipv6 => 1 >>.

=item email => 1

Implies C<< type => 'scalar' >>. Validate the email address against a
monstrosity of a regular expression. This email validation is designed to catch
obviously invalid addresses and addresses that, while compliant with some RFCs,
will not be accepted by most actual SMTP implementations.

Email validation is quite a minefield; see L<Data::Validate::Email> for an
alternative solution.

=item weburl => 1

Implies C<< type => 'scalar' >>. Require the input to be a C<http://> or
C<https://> URL.

=back
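
As a sketch of how a few standard validations fit together, the schema below
checks a small hash. The key names and values are invented for illustration.

  use strict;
  use warnings;
  use TUWF::Validate qw/compile/;

  my $validator = compile({}, {
    type => 'hash',
    keys => {
      colour => { enum => ['red', 'green', 'blue'] },
      count  => { uint => 1, range => [1, 100] },
      code   => { length => [2, 8], regex => qr/^[a-z]+$/ },
    },
  });

  my $ok  = $validator->validate({ colour => 'red', count => '5',   code => 'abc' });

  # 'count' is out of range, so the 'keys' validation reports an error.
  my $bad = $validator->validate({ colour => 'red', count => '500', code => 'abc' });
  my $failed_key = $bad->err->{errors}[0]{key};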


=head2 Custom validations

Custom validations can be passed to C<compile()> and C<validate()> as the
C<$validations> hashref argument.  A custom validation is, in simple terms,
either a schema or a subroutine that returns a schema.  The custom validation
can then be referenced from other schemas.

Here's a simple example that defines and uses a custom validation named
I<stringbool>, which accepts either the string I<true> or I<false>.

  my $validations = {
    stringbool => { enum => ['true', 'false'] }
  };
  my $schema = { stringbool => 1 };
  my $result = validate $validations, $schema, 'true';
  # $result->data() eq 'true'

A custom validation can also be defined as a subroutine, in which case it can
accept options. Here is an example of a I<prefix> custom validation, which
requires that the string starts with the given prefix. The subroutine returns a
schema that contains the I<func> built-in option to do the actual validation.

  my $validations = {
    prefix => sub {
      my $prefix = shift;
      return {
        func => sub { $_[0] =~ /^\Q$prefix/ }
      }
    }
  };
  my $schema = { prefix => 'Hello, ' };
  my $result = validate $validations, $schema, 'Hello, World!';
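
A variation on the same idea, as a sketch: the hypothetical I<strip_prefix>
validation below also normalizes the input by modifying C<$_[0]> in place, and
returns a hashref with an extra I<expected> field (an invented name) on
failure.

  use strict;
  use warnings;
  use TUWF::Validate qw/validate/;

  my $validations = {
    # Hypothetical validation: require the prefix, then strip it.
    strip_prefix => sub {
      my $prefix = shift;
      return {
        func => sub {
          return { expected => $prefix } if $_[0] !~ /^\Q$prefix\E/;
          $_[0] =~ s/^\Q$prefix\E//;   # normalize in place
          1;
        }
      };
    },
  };
  my $schema = { strip_prefix => 'Hello, ' };

  my $ok  = validate $validations, $schema, 'Hello, World!';
  my $bad = validate $validations, $schema, 'Goodbye!';

  # On failure, the error object is reported under the name of the
  # custom validation rather than 'func'.
  my $name = $bad->err->{validation};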

=head3 Custom validations and built-in options

Custom validations can also set built-in options, but the semantics differ a
little depending on the option. First, be aware that many of the built-in
options apply to the whole schema and not just to the custom validation.  For
example, if the top-level schema sets C<< rmwhitespace => 0 >>, then all of the
validations used in that schema may get input with whitespace around it.

All validations used in a schema need to agree upon a single I<type> option.
If a custom validation does not specify a I<type> option (and no type is
implied by another validation such as I<enum> or I<regex>), then the validation
should work with every type. It is an error to define a schema that mixes
validations of different types. For example, the following will throw an error:

  compile {}, {
    # top-level schema says we expect a hash
    type => 'hash',
    # but the 'int' validation implies that the type is a scalar
    int => 1
  };
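
Since this is an error in the schema rather than in the input, it is thrown at
compile time. A sketch of catching it with C<eval>:

  use strict;
  use warnings;
  use TUWF::Validate qw/compile/;

  # compile() dies here, so $validator stays undefined.
  my $validator = eval { compile({}, { type => 'hash', int => 1 }) };
  my $error = $@;   # message describing the type conflict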

The I<keys>, I<values> and I<func> built-in options will be validated
separately for each custom validation. So if you have multiple custom
validations that set the I<values> option, then the array elements must
validate against all the listed schemas. The same applies to I<keys>: If the
same key is listed in multiple custom validations, then the key must conform to
all schemas. With respect to the I<unknown> option, a key that is mentioned in
any of the I<keys> options is considered "known".
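
The following sketch illustrates this with two hypothetical custom validations,
I<has_id> and I<has_name>, that each contribute one key. It assumes that
I<unknown> may be set in the top-level schema while the I<hash> type is implied
by the custom validations.

  use strict;
  use warnings;
  use TUWF::Validate qw/compile/;

  # Two hypothetical custom validations, each listing one key.
  my $validations = {
    has_id   => { type => 'hash', keys => { id   => { uint => 1 } } },
    has_name => { type => 'hash', keys => { name => { maxlength => 32 } } },
  };

  # Both 'id' and 'name' should count as known keys here.
  my $validator = compile($validations, {
    has_id => 1, has_name => 1, unknown => 'reject',
  });

  my $ok  = $validator->validate({ id => '7', name => 'seven' });
  my $bad = $validator->validate({ id => '7', name => 'seven', extra => 1 });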

All other built-in options follow inheritance semantics: These options can be
set in a custom validation, and they will be inherited by the top-level schema.
If the same option is set in multiple validations, only the first one (in
alphabetic order by the name of the validation) will be inherited. The
top-level schema can always override options set by custom validations.


=head1 SEE ALSO

L<TUWF>.

TUWF::Validate has drawn inspiration from L<Brannigan>. Brannigan is very
similar, but slightly more complex and more buggy (and, unfortunately,
unmaintained). TUWF::Validate has more detailed error types and more powerful
I<custom validations>, but lacks grouping, inheritance and wildcard hash keys.

L<Sah> and L<Data::Sah> provide a more advanced interface for data validation.
I have found Sah schemas to not be terribly convenient for form validation.  I
haven't done any benchmarks, but I suspect that Sah is a bit faster than
TUWF::Validate, at the cost of higher memory usage and a large dependency tree.

L<JSON::Schema> is similar to Sah: It features more advanced data structure
validation, but the schema is not terribly convenient for form validation, and
the module has more dependencies than I'd prefer.

=head1 COPYRIGHT

Copyright (c) 2008-2018 Yoran Heling.

This module is part of the TUWF framework and is free software available under
the liberal MIT license. See the COPYING file in the TUWF distribution for the
details.


=head1 AUTHOR

Yoran Heling <projects@yorhel.nl>

=cut