
Rules for Mail Filtering

Posted: Mon 28 Jul 2014 15:30
by David Gibson
In a previous topic, http://british-caving.org.uk/phpBB3/vie ... =31&t=1254, I noted some useful filter rules, including
  • Body Contains Content-Type: application/x-zip-compressed, "or" Body contains Content-Type: application/zip "or" Body matches regex name=".{1,20}\.zip to get rid of ZIP attachments (often containing nasty payloads). See note 1 and next posting
  • Any Header Matches regex \nX-mailer[^\n]+?(sourceforge|[a-z]{4,6}[^ \n][0-9]{2}|[A-Z][a-z]* v[0-9].[0-9])\n to get rid of suspect mailer programs
  • Any header matches regex (?s:(some-string.*?,.*?){3,}) to match any message with three or more occurrences of the same (specified) string (see note 2)
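As a rough illustration of the first rule above, the three ZIP conditions can be tried out in Python before being entered in the filter editor. This is only a sketch: the helper name looks_like_zip and the sample bodies are made up here, and Python's re module may not behave identically to the regex engine used by the mail server.

```python
import re

# The regex from the first rule: a name="..." parameter ending in .zip,
# with 1 to 20 characters between the opening quote and ".zip".
ZIP_NAME = re.compile(r'name=".{1,20}\.zip')


def looks_like_zip(body: str) -> bool:
    """Return True if any of the three ZIP conditions holds (hypothetical helper)."""
    return (
        "Content-Type: application/x-zip-compressed" in body
        or "Content-Type: application/zip" in body
        or ZIP_NAME.search(body) is not None
    )


# Made-up sample bodies, for illustration only.
zip_body = 'Content-Type: application/octet-stream; name="invoice.zip"\n'
pdf_body = 'Content-Type: application/pdf; name="minutes.pdf"\n'

print(looks_like_zip(zip_body))  # True
print(looks_like_zip(pdf_body))  # False
```

The third condition is what catches a ZIP file sent with a generic content type, as in the first sample.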
My latest problem is that I want to discard any message where the subject line is coded in UTF-8. Typically, UTF-8 strings begin =?UTF-8 and end in ?=. There is no legitimate reason why a subject line needs to be coded in UTF-8, and spammers often do it to escape the spam filters. The problem is that, at the time the filtering is applied to messages on the BCA server, the UTF-8 characters have already been transliterated, so the filter never sees the =?UTF-8 string. As far as I could see, the only way to achieve what I wanted was to work on another aspect of such spam messages, which is that the body is often coded in base64 (again, to avoid the filters, I assume). So I ended up with Body matches regex charset=utf-8.*Content-Transfer-Encoding: base64 which is a bit of a sledgehammer as it requires the entire message to be scanned. Note: obviously you can only apply this rule to addresses where you are not expecting any base64 encoding. Since binary attachments are usually encoded like this, it's a bit of a restriction.
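To make the subject-line problem concrete, here is a small Python sketch. It assumes the standard encoded-word layout (=?charset?encoding?text?=) and uses a made-up subject; the pattern name UTF8_ENCODED_WORD is my own. It shows that a pattern looking for =?UTF-8 finds the raw header but finds nothing once the header has been decoded, which is the situation described above.

```python
import re
from email.header import decode_header, make_header

# Encoded-word layout: =?charset?encoding?encoded-text?=
# (encoding is B for base64 or Q for quoted-printable).
UTF8_ENCODED_WORD = re.compile(r'=\?utf-8\?[bq]\?[^?]*\?=', re.IGNORECASE)

# Made-up example: the word "Hello" as a base64 encoded-word.
raw_subject = "=?UTF-8?B?SGVsbG8=?="

# What a filter would see if the header had already been decoded.
decoded_subject = str(make_header(decode_header(raw_subject)))

print(decoded_subject)                                        # Hello
print(UTF8_ENCODED_WORD.search(raw_subject) is not None)      # True
print(UTF8_ENCODED_WORD.search(decoded_subject) is not None)  # False
```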

Another useful regular expression: I wanted to discard all messages to the BCRA Trustees list where the sender was not in a .uk, .com or .net domain. (Non-members of the list are rejected anyway, but I wanted to reduce my admin burden.) Multiple negatives are difficult to do in regular expressions. In this case, what is needed is a "lookbehind" construction, but this is not always supported. Rather than mess around with an expression that I didn't know would be valid, I used a "lookahead" construction instead, viz, under "List of non-member addresses whose postings will be automatically rejected": ^.*(?!(\.uk|com|net)$)...$. This is a bit more convoluted but still pretty neat. You can (if you're familiar with the concepts) see why a lookbehind would be neater.
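The lookahead expression above can be checked in Python against a few made-up addresses. This is a sketch only: the lookbehind variant is my own guess at what the "neater" form might look like, and Python happens to accept it because the three alternatives are the same length, which other regex engines may not allow.

```python
import re

# The lookahead form from the post: matches (i.e. rejects) any address whose
# last three characters are not ".uk", "com" or "net".
REJECT_LOOKAHEAD = re.compile(r'^.*(?!(\.uk|com|net)$)...$')

# A possible lookbehind equivalent, for comparison only (an assumption,
# not an expression taken from the post).
REJECT_LOOKBEHIND = re.compile(r'(?<!\.uk|com|net)$')

# Made-up addresses for illustration.
addresses = [
    "member@example.co.uk",
    "member@example.com",
    "member@example.net",
    "stranger@example.org",
    "stranger@example.ru",
]

for address in addresses:
    ahead = REJECT_LOOKAHEAD.search(address) is not None
    behind = REJECT_LOOKBEHIND.search(address) is not None
    print(f"{address}: rejected={ahead}, lookbehind agrees={ahead == behind}")
```

The trailing ...$ forces the lookahead to be tested exactly three characters before the end of the address. Note that, as written, only the .uk alternative includes the dot; com and net are matched on the last three letters alone.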

Notes
  1. To accept ZIPs, ask your sender to include an authorisation string in the subject line, and use a rule like Subject Begins Authorised <password> Stop Processing
  2. I wanted a rule that discarded any message where the To: address list contained three or more occurrences of the same (specified) string (e.g. for spammers using lists of similar addresses). My problem was that the To: list may be split over several lines, and a 'dot' matches any character other than new-line. You cannot alter the 'mode' of the regexp because that is fixed by the software that processes the user-specified regexp, but you can switch different modes on and off in sub-expressions, so you just need to encapsulate your regexp in (?s:<your regexp>). So, for example, to discard any email where the string "d.gibson" appears three or more times you would write (?s:(d\.gibson.*?,.*?){3,})
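The effect of the (?s:...) wrapper in note 2 can be seen in a short Python sketch. The To: headers below are made up, and Python's re module (version 3.6 or later for this syntax) stands in for whatever engine the mail server uses, so treat it as an illustration rather than a test of the real filter.

```python
import re

# The same expression without and with the scoped "dot matches newline" mode.
WITHOUT_S = re.compile(r'(d\.gibson.*?,.*?){3,}')
WITH_S = re.compile(r'(?s:(d\.gibson.*?,.*?){3,})')

# Made-up To: header folded over several lines, three similar addresses.
folded_to = (
    "To: d.gibson1@example.org,\n"
    " d.gibson2@example.org,\n"
    " d.gibson3@example.org,\n"
    " someone@example.org"
)

# Same layout, but only two occurrences of the string.
two_only = (
    "To: d.gibson1@example.org,\n"
    " d.gibson2@example.org,\n"
    " someone@example.org"
)

print(WITHOUT_S.search(folded_to) is not None)  # False: dot stops at each newline
print(WITH_S.search(folded_to) is not None)     # True
print(WITH_S.search(two_only) is not None)      # False
```

Each occurrence must be followed by a comma to be counted, so an address that is last in the list, with no comma after it, does not count towards the three.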

Re: Rules for Mail Filtering

Posted: Wed 30 Jul 2014 11:20
by David Gibson
David Gibson wrote:Body matches regex name=".{1,20}\.zip
WARNING: You need to be very "diligent" when constructing regular expressions that contain a double-quote character because there is a bug - or at least a badly-written piece of code - in phpBB. When displaying your filter rule to you, it will escape the double-quote with a backslash. But when it re-interprets your edits, it gets it wrong. Instead of interpreting backslash-char as literal-char it interprets it as literal-backslash literal-char.

This is a common problem in PHP, connected with the difference between single-quoted and double-quoted strings, the magic quotes setting (now deprecated) and stripslashes(). It's not limited to phpBB - I've seen a similar problem with Gradwell's mail filter editor - viz: it inserts backslashes (which by themselves are harmless) but it mis-interprets them when re-reading your edits.

So... for example, the above expression name=".{1,20}\.zip is correct: it will work as intended. BUT... if you try to edit that expression, cPanel will bring it up with a backslash in front of the double-quotation mark. You MUST delete that backslash before confirming your edits. If you ignore it - which is easy to do if you've got several filter rules listed under the same rule name - you'll find that, each time you edit the filter rule, additional backslashes are added to your expression.
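The way the backslashes pile up can be imitated with a toy model in Python. This is not the editor's actual code, which I have not seen: the functions show_for_editing and save_edit are hypothetical, and the model simply assumes that the display step escapes each double quote while the save step stores the text without removing that escape.

```python
import re


def show_for_editing(stored: str) -> str:
    """Toy display step (assumption): put a backslash before each double quote."""
    return stored.replace('"', '\\"')


def save_edit(submitted: str) -> str:
    """Toy save step (assumption): store the text as-is, backslash included."""
    return submitted


rule = r'name=".{1,20}\.zip'
history = [rule]
for _ in range(3):
    # Open the rule for editing and confirm without deleting the backslash.
    rule = save_edit(show_for_editing(rule))
    history.append(rule)

sample = 'name="invoice.zip"'  # made-up attachment parameter
for version in history:
    print(version, "->", re.search(version, sample) is not None)
```

In this model one backslash is added per edit. In Python's regex dialect the rule stops matching the sample from the second edit onwards, because the pattern then demands a literal backslash in front of the quote; the real filter engine may behave differently.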

A very subtle problem - it is easy to break your* filter rules without realising. (* well, mine anyway).

Re: Rules for Mail Filtering

Posted: Wed 30 Jul 2014 11:41
by David Gibson
David Gibson wrote:So I ended up with Body matches regex charset=utf-8.*Content-Transfer-Encoding: base64 which is a bit of a sledgehammer as it requires the entire message to be scanned.
Actually, although that worked using the "Test filter" facility, it didn't work for real messages. You need to encapsulate the rule inside a mode-modifier, viz: (?s:charset=utf-8.*Content-Transfer-Encoding: base64) but that's a bit "extreme" as a filter rule. Something like charset=utf-8.*?\n.*?Content-Transfer-Encoding: base64 would be better ... testing it now.
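The difference between the three expressions quoted above can be seen in Python with a couple of made-up message fragments. This only shows how the patterns behave in Python's re module; it says nothing about whether the last one works on the live server.

```python
import re

PLAIN = re.compile(r'charset=utf-8.*Content-Transfer-Encoding: base64')
DOTALL = re.compile(r'(?s:charset=utf-8.*Content-Transfer-Encoding: base64)')
NEXT_LINE = re.compile(r'charset=utf-8.*?\n.*?Content-Transfer-Encoding: base64')

# Made-up fragment: the two headers are on consecutive lines.
adjacent = (
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "SGVsbG8=\n"
)

# Made-up fragment: the second string appears much later in the message.
far_apart = (
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Some ordinary text.\n"
    "Content-Transfer-Encoding: base64\n"
)

for name, pattern in [("plain", PLAIN), ("dotall", DOTALL), ("next-line", NEXT_LINE)]:
    print(
        name,
        pattern.search(adjacent) is not None,
        pattern.search(far_apart) is not None,
    )
```

The plain form never crosses a line break, the (?s:...) form matches however far apart the two strings are, and the third form allows exactly one line break between them, which is why it is the less "extreme" choice.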