
URL patterns and ReDoS attacks

Added by 0gitnick over 1 year ago

While I was checking if Haketilo could eat Libredirect, I noticed the URL pattern matching is quite limited which led me to finding out about ReDoS attacks. I didn't find any further discussion about ReDoS and URL pattern matching pertaining to Haketilo in the issues, forums, or the wiki and I think it warrants discussion.

Currently, I assume we're using wildcards only (a safe subset of RegEx) to prevent ReDoS.

There are a few options to prevent ReDoS but still allow complex RegEx for URL patterns:

  1. Shift URL pattern matching to Haketilo.
  2. Include a warning about ReDoS in Hydrilla asking repo maintainers to be careful which URL patterns they let in.
  3. Use a library to detect possibly catastrophic RegEx and preemptively blacklist it.
  4. Set a RegEx engine timeout and/or blacklist URL patterns that cause frequent timeouts.

I think 3 and 4 are good choices. In our case, RegEx engine timeout isn't a disaster. It just means the resources for that one URL pattern won't be delivered (at least until the RegEx is fixed).
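Here is a rough sketch of what option 4 could look like (this is not actual Haketilo/Hydrilla code and the function names are made up): run each match in a separate process and give up after a time budget, since a signal-based timeout cannot interrupt a regex match that is stuck inside the engine.

import multiprocessing
import re

def _match_worker(pattern, url, queue):
    # A catastrophic regex simply never puts a result on the queue.
    queue.put(bool(re.fullmatch(pattern, url)))

def match_with_timeout(pattern, url, timeout=1.0):
    # Returns True/False, or None if matching timed out (pattern is suspect).
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_match_worker, args=(pattern, url, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()   # the engine is still backtracking
        proc.join()
        return None        # caller could blacklist this pattern
    return queue.get()

if __name__ == '__main__':
    # '(a+)+$' is a classic evil regex; a long run of 'a's with a bad suffix
    # makes a backtracking engine take exponential time.
    print(match_with_timeout(r'(a+)+$', 'a' * 40 + '!'))                                # None
    print(match_with_timeout(r'https://example\.com/.*', 'https://example.com/page'))   # True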

Alternatively, we could continue limiting patterns to a safe RegEx subset and discuss which features should be included in the gradually expanding subset.

Thoughts?


Replies (3)


RE: URL patterns and ReDoS attacks - Added by koszko over 1 year ago

I was checking if Haketilo could eat Libredirect

It could and "should" (I don't mean anything socially bad, I just mean giving Haketilo the ability to do the same thing Libredirect does).

I noticed the URL pattern matching is quite limited which led me to finding out about ReDoS attacks. I didn't find any further discussion about ReDoS and URL pattern matching pertaining to Haketilo in the issues, forums, or the wiki and I think it warrants discussion.

Currently, I assume we're using wildcards only (a safe subset of RegEx) to prevent ReDoS.

The wildcards we are using right now are indeed a safe subset of RegEx with respect to the kind of attacks you are talking about. However, this is not the reason we're not using regexes. The true reason is that I was worried about the computational complexity of finding matches for URLs. Assume patterns could contain arbitrary regexes. With n patterns in the database we'd need O(n) regex evaluations to naively check a URL against each of them. A better, non-naive algorithm might be possible to design, but that could prove non-trivial. Additionally, using limited patterns made it easy to decide which match should take precedence over another.

Now I think my approach might have been a bit of a premature optimization. Right now in Haketilo the limited nature of the WebExtension APIs is forcing me to include ALL enabled patterns in a dynamic content script anyway :/ (although it is possible to optimize this as well).
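Just to illustrate the optimization that limited patterns allow (this is not Haketilo's actual code, only a sketch; the helper name and the exact set of generated candidates are made up): instead of testing a URL against every stored pattern, one can enumerate the few wildcard patterns the URL could possibly match and look each of them up in a hash table.

from urllib.parse import urlparse

def candidate_patterns(url):
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split('/') if s]
    base = parsed.scheme + '://' + parsed.netloc
    # the exact path first, then progressively shorter prefixes with a wildcard
    yield base + '/' + '/'.join(segments)
    for i in range(len(segments), -1, -1):
        yield base + '/' + '/'.join(segments[:i] + ['**'])

patterns_db = {
    'https://some.example.com/segmenta/segmentb/**': 'some mapping',
}

url = 'https://some.example.com/segmenta/segmentb/view/146'
hits = [patterns_db[p] for p in candidate_patterns(url) if p in patterns_db]
print(hits)   # ['some mapping'], found without scanning every stored pattern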

There are a few options to prevent ReDoS but still allow complex RegEx for URL patterns:

  1. Shift URL pattern matching to Haketilo.

That actually seems like a good idea. I was wondering how users could maintain privacy when accessing Hydrilla and updating resources/mappings. If we employ the strategy used by, say, APT (i.e. having the repo client download a list of all repo packages and process it client-side), it should not cause performance problems (we'd need millions of packages for it to become an issue) and it should make it easier for users to decide which packages to update and when, without sending the Hydrilla server unnecessary information about the websites being browsed at any given moment (cautious users will still need to use Tor or similar, though).
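A very rough sketch of what the client side of that could look like (the endpoint path and JSON fields here are invented for illustration and are not an actual Hydrilla API):

import json
import urllib.request

REPO_URL = 'https://hydrilla.example.org'   # placeholder repository address

def fetch_index():
    # One request for the whole package index; no per-site queries that
    # would tell the server which pages the user is visiting right now.
    with urllib.request.urlopen(REPO_URL + '/package-index.json') as response:
        return json.load(response)

def packages_to_update(installed, index):
    # 'installed' maps package identifiers to locally installed versions.
    return [name for name, version in installed.items()
            if name in index and index[name]['version'] != version]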

  2. Include a warning about ReDoS in Hydrilla asking repo maintainers to be careful which URL patterns they let in.
  3. Use a library to detect possibly catastrophic RegEx and preemptively blacklist it.
  4. Set a RegEx engine timeout and/or blacklist URL patterns that cause frequent timeouts.

I think 3 and 4 are good choices. In our case, RegEx engine timeout isn't a disaster. It just means the resources for that one URL pattern won't be delivered (at least until the RegEx is fixed).

I don't like 2 and 4. I'd rather ensure the regex engine we use doesn't employ the naive backtracking algorithm discussed in the article you linked. If we find out a given runtime is vulnerable to this, we can switch to some good regex library instead. If it turns out the Python3 implementations we're interested in have safe RegEx implementations, we could just add a warning in the documentation (for possible porters to other environments).

Alternatively, we could continue limiting patterns to a safe RegEx subset and discuss which features should be included in the gradually expanding subset.

Thoughts?

Indeed, once our current patterns schema becomes insufficient (which probably won't be too soon anyway), we can combine the 2 approaches. We could for instance allow a special URL pattern that can contain regexes:

https://some.example.com/segmenta/segmentb/view/146              <- a normal pattern
https://some.example.com/segmenta/segmentb/**                    <- another normal pattern

Now, let's assume that when a path segment starts with an
exclamation mark ('!'), everything after that '!' is considered a
regex that the rest of the URL has to match.

https://some.example.com/segmenta/segmentb/!(view|widok)/[1-3].* <- a (implicitly anchored) regex pattern
                         \_______________/ \___________________/
                            initial part        regex part
\______/\______________/ \_____________________________________/
  proto    domain part                 path part

Matches:
https://some.example.com/segmenta/segmentb/view/1
https://some.example.com/segmenta/segmentb/view/23/aaa
https://some.example.com/segmenta/segmentb/widok/32

Does NOT match:
https://some.example.com/segmenta/segmentb/view
https://some.example.com/segmenta/segmentb/widok/45

This is just a rough idea. We'd have to decide how to do the analogous thing for domains, how to express a literal '!' in patterns, whether and how to enable passing regex flags (for case-insensitive matching, etc.), and whether we really want the regexes to be implicitly anchored. We'd also have to decide how the specificity of regex patterns should be determined (although I think http://example.com/!i_am_a_regex should be considered more specific than http://example.com/* and less specific than both http://example.com/ and http://example.com/something).
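To make the idea concrete, here's a quick sketch (not a finalized design, and the helper name is made up) of how such a pattern could be matched, with the implicit anchoring done via fullmatch():

import re

def regex_pattern_matches(pattern, url):
    # plain patterns (no '!') would be handled by the existing wildcard logic
    if '!' not in pattern:
        return pattern == url
    prefix, regex = pattern.split('!', 1)
    if not url.startswith(prefix):
        return False
    # fullmatch() provides the implicit anchoring discussed above
    return re.fullmatch(regex, url[len(prefix):]) is not None

pattern = 'https://some.example.com/segmenta/segmentb/!(view|widok)/[1-3].*'
print(regex_pattern_matches(pattern, 'https://some.example.com/segmenta/segmentb/view/23/aaa'))  # True
print(regex_pattern_matches(pattern, 'https://some.example.com/segmenta/segmentb/widok/45'))     # False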

I hope you see how this enables flexibility while still allowing optimizations we're using right now :)

Btw, thank you for bringing up those issues.

RE: URL patterns and ReDoS attacks - Added by 0gitnick over 1 year ago

koszko wrote:

If we employ the strategy used by, say, APT (i.e. having the repo client download a list of all repo packages and process it client-side), it should not cause performance problems (we'd need millions of packages for it to become an issue) and it should make it easier for users to decide which packages to update and when, without sending the Hydrilla server unnecessary information about the websites being browsed at any given moment

I agree. It improves user privacy and would distribute computation to the clients, making Hydrilla instances less CPU-intensive. Importantly though, option 1 alone does not prevent ReDoS.

2. Include a warning about ReDoS in Hydrilla asking repo maintainers to be careful which URL patterns they let in.
3. Use a library to detect possibly catastrophic RegEx and preemptively blacklist it.
4. Set a RegEx engine timeout and/or blacklist URL patterns that cause frequent timeouts.

I don't like 2 and 4. I'd rather ensure the regex engine we use doesn't employ the naive backtracking algorithm discussed in the article you linked. If we find out a given runtime is vulnerable to this, we can switch to some good regex library instead. If it turns out the Python3 implementations we're interested in have safe RegEx implementations, we could just add a warning in the documentation (for possible porters to other environments).

I suggested timeouts because I was under the impression that safe RegEx implementations weren't bulletproof, but it looks like RE2 does avoid exponential time, with an O(n) worst case. I agree we should add a warning for porters. What do you think about safe regex in conjunction with client-side URL processing?
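For example (a hypothetical comparison, assuming the third-party pyre2 binding, which is meant as a mostly drop-in replacement for Python's re module):

import re
import time

try:
    import re2   # third-party 'pyre2' binding around Google's RE2
except ImportError:
    re2 = None

subject = 'a' * 25 + '!'   # near-miss input for the classic evil regex below

start = time.perf_counter()
re.match(r'(a+)+$', subject)          # stdlib engine: exponential backtracking
print('re :', time.perf_counter() - start, 's')

if re2 is not None:
    start = time.perf_counter()
    re2.match(r'(a+)+$', subject)     # RE2: linear in the input length
    print('re2:', time.perf_counter() - start, 's')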

Indeed, once our current patterns schema becomes insufficient (which probably won't be too soon anyway), we can combine the 2 approaches. We could for instance allow a special URL pattern that can contain regexes:

https://some.example.com/segmenta/segmentb/view/146              <- a normal pattern
https://some.example.com/segmenta/segmentb/**                    <- another normal pattern

Now, let's assume that when a path segment starts with an
exclamation mark ('!'), everything after that '!' is considered a
regex that the rest of the URL has to match.

https://some.example.com/segmenta/segmentb/!(view|widok)/[1-3].* <- a (implicitly anchored) regex pattern
                         \_______________/ \___________________/
                            initial part        regex part
\______/\______________/ \_____________________________________/
  proto    domain part                 path part

Matches:
https://some.example.com/segmenta/segmentb/view/1
https://some.example.com/segmenta/segmentb/view/23/aaa
https://some.example.com/segmenta/segmentb/widok/32

Does NOT match:
https://some.example.com/segmenta/segmentb/view
https://some.example.com/segmenta/segmentb/widok/45

This is just a rough idea. We'd have to decide how to do the analogous thing for domains, how to express a literal '!' in patterns, whether and how to enable passing regex flags (for case-insensitive matching, etc.), and whether we really want the regexes to be implicitly anchored. We'd also have to decide how the specificity of regex patterns should be determined (although I think http://example.com/!i_am_a_regex should be considered more specific than http://example.com/* and less specific than both http://example.com/ and http://example.com/something).

I hope you see how this enables flexibility while still allowing optimizations we're using right now :)

I agree about the specificity order. If we use only RegEx, I think it will be difficult to evaluate specificity at all, but a hybrid scheme would add extra complexity. If we expect, say, 90% of URL patterns in use to be expressible as wildcard-only patterns, and wildcard-only evaluation is much faster, a hybrid scheme might make sense. But if we use RE2, the worst case is already O(n), where n is the string size. Obviously, going RegEx-only would be a breaking change; I'm not sure a hybrid approach would be. I'm not sure what to think about this yet.

RE: URL patterns and ReDoS attacks - Added by koszko over 1 year ago

I suggested timeouts because I was under the impression that safe RegEx implementations weren't bulletproof

Interesting. After the theory of automata course I was under the impression that all real-life RegEx implementations out there either convert the NFA to a DFA for O(n) performance (with respect to input length) or use the O(k*n) algorithm (with k and n being the lengths of the regex and the input, respectively). Of course, I didn't know about backtracking, which explains everything.

Anyway, it seems we also need to be careful about what regexes we put in the code. For example, in my last update to the JSON schemas:
https://hydrilla.koszko.org/schemas/api_mapping_description-1.0.1.schema.json

The pattern there for validating the $schema property looks like it could be an evil regex, right where the schema version gets matched.

It really isn't ;) If you represent its NFA as a graph, you'll see that there are no multiple alternative paths because the grouping is always terminated by a literal ".".

Nevertheless, I checked that ECMAScript regex implementations in some browsers (Abrowser 97) are indeed vulnerable to the attack... We could look for an alternative regex library at some point (with priority staying on the Roadmap tasks, of course!).
