RFC-7230 Sec 3.2.4 expressly forbids line-folding in header
field-names.
This change no longer allows obsolete line-folding between the
header field-name and the colon. If HTTP_PARSER_STRICT is unset,
the parser still allows space characters.
Reviewed-By: Fedor Indutny <fedor@indutny.com>
Always discard leading whitespace in a header value, even if it
is folded.
Pay attention to values of interesting headers (Connection,
Content-Length, etc.) even when they come on a continuation line.
Add a test case to check that requests and responses using only LF
to separate lines are handled correctly.
The support for folding of multi-line header values does not conform
to the specs. Given a request containing
Multi-Line-Header: foo<CRLF>
bar<CRLF>
http-parser will eliminate the whitespace breaking the header value to
yield a header value of "foobar". This is confirmed by the
LINE_FOLDING_IN_HEADER case in tests.c.
But from rfc2616, section 2.2:
A CRLF is allowed in the definition of TEXT only as part of a header
field continuation. It is expected that the folding LWS will be
replaced with a single SP before interpretation of the TEXT value.
And from draft-ietf-httpbis-p1-messaging-25, section 3.2.4:
A server that receives an obs-fold in a request message that is not
within a message/http container MUST either reject the message by
sending a 400 (Bad Request), preferably with a representation
explaining that obsolete line folding is unacceptable, or replace
each received obs-fold with one or more SP octets prior to
interpreting the field value or forwarding the message downstream.
So in the example above, the header value should be interpreted as
"foo bar", possibly with multiple spaces. The current http-parser
behaviour of eliminating the LWS altogether clearly deviates from the
specs.
For http-parser itself to confirm exactly would involve significant
changes in order to synthesize replacement SP octets. Such changes
are unlikely to be worth it to support what is an obscure and
deprecated feature. But http-parser should at least preserve some
separating whitespace when folding multi-line header values, so that
applications using http-parser can conform to the specs.
This commit is a minimal change to preserve whitespace when folding
lines. It eliminates the CRLF, but retains any trailing and leading
whitespace in the header value.
The HTTP_MAX_HEADER_SIZE check is not there to guard against
buffer overflows, it's there to protect unwitting embedders
against denial-of-service attacks.
Remove the HTTP_PARSER_DEBUG macro for two reasons:
* It changes the size of struct http_parser, resulting in spurious memory
corruption bugs if part of your application is built with HTTP_PARSER_DEBUG=1
and other parts with HTTP_PARSER_DEBUG=0.
* It's a debugging tool for maintainers. It should never have been exposed in
the API in the first place.
the v6 parsing works by adding extra states for working with the
[] notation for v6 addresses. hosts and ports cannot be 0-length
because we url parsing from ending when we expect those fields to
begin.
http_parser_parse_url gets a free check for the correctness of
CONNECT urls (they can only be host:port).
this addresses the following issues:
i was bored and had my head in this space.
Before this change it would include the last slash in the separator between the
schema and host as part of the host. we cant use the trick used for skipping the
separator before ports, query strings, and fragments because if it was a CONNECT
style url string (host:port) it would skip the first character of the hostname.
Work around this by introducing a few more states to represent these separators
in a url differently to what theyre separating. this in turn lets us simplify
the url parsing so can simply skip what it considers delimiters rather than
having to special case certain types of url parts and skip their prefixes.
Add tests for the http_parser_parse_url().
This compares the http_parser_url struct that http_parser_parse_url()
produces against one that we expect from the test. If they differ
then http_parser_parse_url() misbehaved.
'inline' is not a recognized C89 keyword, it made the build fail with strict or
older compilers (msvc 2008, gcc with -std=c89).
'inline' is also just a hint, one that gcc 4.4.3 in this particular case happily
ignored. Ergo, remove it.
Summary:
- Add http_parser_pause() API. A callback may invoke this at any time.
This will cause http_parser_parse() to return indicating that it
parsed less than the number of requested bytes and set an error to
HBE_PAUSED. A paused parser with fail with HBE_PAUSED until it is
un-paused with http_parser_pause().
- Stop using 'state', 'header_state', 'index', and 'nread' shadow
variables and then updating their http_parser fields when we're done.
Instead, update the live values as we go. This will make it possible
to return from anywhere in the parser (say, due to EPAUSED) and have
valid/expected state.
- Update state before making callbacks so that if the want to pause,
we'll know the correct state already.
- Make sure that every callback has a state that uniquely identifies the
next step so that we can resume in the right place if we were suppoed
to be paused.
- Clean and re-factor up CALLBACK() macros.
- Use CALLBACK() macros for (almost) all callbacks; on_headers_complete
is still a special case. This includes on_body which we used to invoke
manually with a long run of bytes. We now use a 'body' mark and hit
its callback just like every other data callback.
- Clean up (most) gotos and replace with real states.
- Add some unit tests.
Fixes#70
- Break EOF handling out of http_should_keep_alive() into
http_message_needs_eof(), which we now use when determining what to do
with a message of unknown length. This prevents us from falling into
the s_body_identity_eof state in the cases where we actually *do* know
the length of the message (e.g. because the response status was 204).
- Add an http_parser_parse_url() method to parse a URL into its
constituent components. This uses the same underlying parser
as http_parser_parse() and doesn't do any data copies.
- Re-add the URL components in various test.c structures; validate
them when parsing.
- Treat ' ' specially, as apparently IIS6.0 can send this in headers.
Allow this character through if we're not in strict mode.
- Move some test code around so that test indices don't break when
HTTP_PARSER_STRICT changes.
Fixes#13.
With HTTP/1.1, if neither Content-Length nor Transfer-Encoding is present,
section 4.4 of RFC 2616 suggests http-parser needs to read a response body
until the connection is closed (except the response must not include a body)
See also joyent/node#2457.
Fixes#72
- Get rid of support for these callbacks in http_parser_settings.
- Retain state transitions between different URL portions in
http_parser_execute() so that we're making the same correctness
guarantees as before.
- These are being removed because making multiple callbacks for the same
byte makes it more difficult to pause the parser.
- This was only used by CALLBACK() (which then cleared the mark anyway),
and the end of the http_parser_execute() body (after which they
go out of scope).
- Add http_errno enum w/ values for many parsing error conditions. Stash
this in http_parser.state if the 0x80 bit is set.
- Report line numbers on error generation if the (new) HTTP_PARSER_DEBUG
cpp symbol is set. Increases http_parser struct size by 8 bytes in
this case.
- Add http_errno_*() methods to help turning errno values into
human-readable messages.
- When handling upgraded bodies, http_parser_execute() used to return
one fewer bytes parsed than expected. This caused the final LF to be
interpreted by the caller as part of the body.
- Add a bunch of upgrade body unit tests.
Normal value cb is called for subsequent lines. LWS is skipped.
Note that \t whitespace character is now supported after header field name.
RFC 2616, Section 2.2
"HTTP/1.1 header field values can be folded onto multiple lines if the
continuation line begins with a space or horizontal tab. All linear
white space, including folding, has the same semantics as SP. A
recipient MAY replace any linear white space with a single SP before
interpreting the field value or forwarding the message downstream."
- Add IS_ALPHA(), IS_NUM(), IS_HOST_CHAR(), etc. macros for determining
membership in a character class. HTTP_PARSER_STRICT causes some of
these definitions to change.
- Support '_' character in hostnames in non-strict mode.
- Support leading digits in hostnames when the method is HTTP_CONNECT.
- Don't re-define HTTP_PARSER_STRICT in http_parser.h if it's already
defined.
- Tweak Makefile to run non-strict-mode unit tests. Rearrange non-strict
mode unit tests in test.c.
- Add test_fast to .gitignore.
Fixes#44
- This is non-spec behavior, but it appears that most HTTP servers
implicitly support non-ASCII characters when parsing path components.
Extend http-parser to allow this.
- Fill out slots [128, 256) in normal_url_char[] with 1 so that these
high octets are accepted in path components.
- Add unit test for paths that include such non-ASCII characters.
Fixes#37.
Check for overflow during chunk trailer by removing unnecessary check in macro PARSING_HEADER. This will force the parser to abort if the chunk trailer contains more than HTTP_MAX_HEADER_SIZE of data.
Without this change, it is possible to get an assertion to fail by
continuing to call http_parser_execute after it has returned an error.
Specifically, the parser could be called with parser->state ==
s_chunk_size_almost_done and parser->flags & F_CHUNKED set. Then,
F_CHUNKED could have been cleared, and an error could be hit. In this
case, the parser would have returned with F_CHUNKED clear, but
parser->state == s_chunk_size_almost_done, resulting in an assertion
failure on the next call.
There are alternate solutions possible, including just saving all of
the fields (state included) on error.
I didn't add a test case because this is a bit annoying to test, but I
can add one if necesssary.
acceptable_header[x] is always assigned to a variable of type char, so
the 'unsigned' is unnecessary.
The other arrays can be of type int8_t/uint8_t to save space.