ngx_http_html_sanitize_module: HTML sanitizer module

Published 2021-09-20

ngx_http_html_sanitize_module sanitizes HTML against whitelisted elements, whitelisted attributes, and whitelisted CSS properties. It is built on Google's gumbo-parser as the HTML5 parser and hackers-painters' katana-parser as the inline CSS parser.

Example

Based on the element list at https://dev.w3.org/html5/html-author/#the-elements, an example nginx configuration looks like this:

server {
    listen 8888;

    location = /sanitize {
        # Explicitly set utf-8 encoding
        add_header Content-Type "text/html; charset=UTF-8";

        client_body_buffer_size 10M;
        client_max_body_size 10M;

        html_sanitize on;

        # Check https://dev.w3.org/html5/html-author/#the-elements

        # Root Element
        html_sanitize_element html;

        # Document Metadata
        html_sanitize_element head title base link meta style;

        # Scripting
        html_sanitize_element script noscript;

        # Sections
        html_sanitize_element body section nav article aside h1 h2 h3 h4 h5 h6 header footer address;

        # Grouping Content
        html_sanitize_element p hr br pre dialog blockquote ol ul li dl dt dd;

        # Text-Level Semantics
        html_sanitize_element a q cite em strong small mark dfn abbr time progress meter code var samp kbd sub sup span i b bdo ruby rt rp;

        # Edits
        html_sanitize_element ins del;

        # Embedded Content
        html_sanitize_element figure img iframe embed object param video audio source canvas map area;

        # Tabular Data
        html_sanitize_element table caption colgroup col tbody thead tfoot tr td th;

        # Forms
        html_sanitize_element form fieldset label input button select datalist optgroup option textarea output;

        # Interactive Elements
        html_sanitize_element details command bb menu;

        # Miscellaneous Elements
        html_sanitize_element legend div;

        html_sanitize_attribute *.style;
        html_sanitize_attribute a.href a.hreflang a.name a.rel;
        html_sanitize_attribute col.span col.width colgroup.span colgroup.width;
        html_sanitize_attribute data.value del.cite del.datetime;
        html_sanitize_attribute img.align img.alt img.border img.height img.src img.width;
        html_sanitize_attribute ins.cite ins.datetime li.value ol.reversed ol.start ol.type ul.type;
        html_sanitize_attribute table.align table.bgcolor table.border table.cellpadding table.cellspacing table.frame table.rules table.sortable table.summary table.width;
        html_sanitize_attribute td.abbr td.align td.axis td.colspan td.headers td.rowspan td.valign td.width;
        html_sanitize_attribute th.abbr th.align th.axis th.colspan th.rowspan th.scope th.sorted th.valign th.width;

        html_sanitize_style_property color font-size;

        html_sanitize_url_protocol http https tel;
        html_sanitize_url_domain *.google.com google.com;

        html_sanitize_iframe_url_protocol http https;
        html_sanitize_iframe_url_domain  facebook.com *.facebook.com;
    }
}

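The following command is the recommended way to sanitize HTML5. The URL is quoted so that the shell does not interpret the & characters:

$ curl -X POST -d "<h1>Hello World </h1>" "http://127.0.0.1:8888/sanitize?element=2&attribute=1&style_property=1&style_property_value=1&url_protocol=1&url_domain=0&iframe_url_protocol=1&iframe_url_domain=0"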
<h1>Hello World </h1>

The query string element=2&attribute=1&style_property=1&style_property_value=1&url_protocol=1&url_domain=0&iframe_url_protocol=1&iframe_url_domain=0 means the following:

  • element=2: output only the elements whitelisted by html_sanitize_element
  • attribute=1: output all attributes
  • style_property=1: output all style properties
  • style_property_value=1: check style values for the url() function and the expression() function to prevent XSS injection through style property values
  • url_protocol=1: check absolute URLs against the protocols whitelisted by html_sanitize_url_protocol
  • url_domain=0: do not check the domain of absolute URLs
  • iframe_url_protocol=1: same as url_protocol, but applies only to iframe.src and uses html_sanitize_iframe_url_protocol
  • iframe_url_domain=0: same as url_domain, but applies only to iframe.src and uses html_sanitize_iframe_url_domain
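For callers that prefer a script over curl, the same request can be assembled programmatically. The following is a minimal Python sketch, not part of the module: the helper name build_sanitize_request is hypothetical, and the endpoint is assumed to be the /sanitize location from the example configuration above.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint, copied from the example configuration above.
SANITIZE_URL = "http://127.0.0.1:8888/sanitize"


def build_sanitize_request(html: str, **modes: int) -> Request:
    """Build (but do not send) a POST request for the sanitize location.

    `modes` are querystring parameters such as element=2 or url_domain=0.
    This helper is hypothetical; it only mirrors the curl command.
    """
    query = urlencode(modes)
    url = f"{SANITIZE_URL}?{query}" if query else SANITIZE_URL
    return Request(url, data=html.encode("utf-8"), method="POST")


request = build_sanitize_request(
    "<h1>Hello World </h1>",
    element=2,
    attribute=1,
    style_property=1,
    style_property_value=1,
    url_protocol=1,
    url_domain=0,
    iframe_url_protocol=1,
    iframe_url_domain=0,
)
print(request.full_url)
```

With a server running, passing the request to urllib.request.urlopen(request) and reading the response should return the sanitized HTML.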

With ngx_http_html_sanitize_module, directives and the query string together control which HTML5 elements, attributes, and inline CSS properties are output, as follows:

Whitelisted elements

Disable elements:

To output no elements at all:

$ curl -X POST -d "<h1>h1</h1>" "http://127.0.0.1:8888/sanitize?element=0"

Enable elements:

To output all elements:

$ curl -X POST -d "<h1>h1</h1><h7>h7</h7>" "http://127.0.0.1:8888/sanitize?element=1"
<h1>h1</h1><h7>h7</h7>

Enable whitelisted elements:

To output only whitelisted elements, use element=2:

$ curl -X POST -d "<h1>h1</h1><h7>h7</h7>" "http://127.0.0.1:8888/sanitize?element=2"
<h1>h1</h1>

Whitelisted attributes

Disable attributes:

To output no attributes at all:

$ curl -X POST -d "<h1 ha=\"ha\">h1</h1>" "http://127.0.0.1:8888/sanitize?element=1&attribute=0"
<h1>h1</h1>

Enable attributes:

To output all attributes:

$ curl -X POST -d "<h1 ha=\"ha\">h1</h1>" "http://127.0.0.1:8888/sanitize?element=1&attribute=1"
<h1 ha="ha">h1</h1>

Enable whitelisted attributes:

To output only whitelisted attributes:

$ curl -X POST -d "<img src=\"/\" ha=\"ha\" />" "http://127.0.0.1:8888/sanitize?element=1&attribute=2"
<img src="/" />

Whitelisted style properties

Disable style properties:

To output no style properties at all:

# No style property is output
$ curl -X POST -d "<h1 style=\"color:red;\">h1</h1>" "http://127.0.0.1:8888/sanitize?element=1&attribute=1&style_property=0"
<h1>h1</h1>

Enable style properties:

To output all style properties:

$ curl -X POST -d "<h1 style=\"color:red;text-align:center;\">h1</h1>" "http://127.0.0.1:8888/sanitize?element=1&attribute=1&style_property=1"
<h1 style="color:red;text-align:center">h1</h1>

Enable whitelisted style properties:

To output only whitelisted style properties:

$ curl -X POST -d "<h1 style=\"color:red;text-align:center;\" >h1</h1>" "http://127.0.0.1:8888/sanitize?element=1&attribute=1&style_property=2"
<h1 style="color:red;">h1</h1>

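Description

ngx_http_html_sanitize_module is currently implemented on top of gumbo-parser and katana-parser. We combine the two and run them inside nginx as a central web service maintained by dedicated security engineers, which removes the differences between language-level implementations. For higher performance (see the benchmark below), we recommend writing a language-level wrapper on top of the pure C libraries to avoid the overhead of network transfer.

Benchmark

Tested with wrk -s benchmarks/shot.lua -d 60s "http://127.0.0.1:8888" on an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 64GB of memory.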

Name                    Size    Avg Latency  QPS
hacker_news.html        30KB    9.06ms       2921.82
baidu.html              76KB    13.41ms      1815.75
arabic_newspapers.html  78KB    16.58ms      1112.70
bbc.html                115KB   17.96ms      993.12
xinhua.html             323KB   33.37ms      275.39
google.html             336KB   26.78ms      351.54
yahoo.html              430KB   29.16ms      323.04
wikipedia.html          511KB   57.62ms      160.10
html5_spec.html         7.7MB   1.63s        2.00

Todo

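  • gumbo-parser (hard): speed up string processing with SSE-4.2
  • gumbo-parser (hard): additional algorithm-level performance improvements
  • katana-parser (hard): speed up string processing with SSE-4.2
  • katana-parser (hard): additional algorithm-level performance improvements
  • directive (optional): add a mode directive for fine-grained control over HTML5 and inline CSS output
  • html_sanitize_attribute (hard): replace the current hash lookup with a new algorithm to reduce memory allocation
  • tests (easy): pass more XSS security tests
  • querystring (optional): allow external whitelists passed in the query string to control whitelisted elements, attributes, and style properties

These optimization ideas come from studying the module's On-CPU flame graph:

[Figure: On-CPU flame graph of ngx_http_html_sanitize_module]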

Directive

html_sanitize

syntax: html_sanitize on | off

default: html_sanitize on

context: location

Specifies whether the HTML sanitize handler is enabled in this location.

html_sanitize_hash_max_size

syntax: html_sanitize_hash_max_size size

default: html_sanitize_hash_max_size 2048

context: location

Sets the maximum size of the element, attribute, style_property, url_protocol, url_domain, iframe_url_protocol, and iframe_url_domain hash tables.

html_sanitize_hash_bucket_size

syntax: html_sanitize_hash_bucket_size size

default: html_sanitize_hash_bucket_size 32|64|128

context: location

Sets the bucket size for the element, attribute, style_property, url_protocol, url_domain, iframe_url_protocol, and iframe_url_domain hash tables. The default value depends on the size of the processor's cache line.

html_sanitize_element

syntax: html_sanitize_element element …

default:

context: location

Sets the whitelisted HTML5 elements. The whitelist takes effect when the element querystring parameter selects whitelist mode (element=2). For example:

html_sanitize_element html head body;

html_sanitize_attribute

syntax: html_sanitize_attribute attribute …

default:

context: location

Sets the whitelisted HTML5 attributes. The whitelist takes effect when the attribute querystring parameter selects whitelist mode (attribute=2). For example:

html_sanitize_attribute a.href h1.class;

Note: each entry must be written as element.attribute. The wildcard forms *.attribute (prefix asterisk, any element) and element.* (suffix asterisk, any attribute of that element) are also supported.
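The entry format can be illustrated with a small Python sketch. This only models the matching rule described in the note (exact entry, prefix asterisk, suffix asterisk); it is not the module's hash lookup, and the function name attribute_allowed is hypothetical.

```python
from typing import AbstractSet


def attribute_allowed(element: str, attribute: str,
                      whitelist: AbstractSet[str]) -> bool:
    """Return True if element.attribute is covered by a whitelist entry."""
    candidates = (
        f"{element}.{attribute}",  # exact entry, e.g. a.href
        f"*.{attribute}",          # prefix asterisk: this attribute on any element
        f"{element}.*",            # suffix asterisk: any attribute of this element
    )
    return any(candidate in whitelist for candidate in candidates)


# Hypothetical whitelist mixing the three entry forms.
whitelist = {"a.href", "*.style", "img.*"}

print(attribute_allowed("a", "href", whitelist))     # exact match
print(attribute_allowed("h1", "style", whitelist))   # matched by *.style
print(attribute_allowed("img", "alt", whitelist))    # matched by img.*
print(attribute_allowed("h1", "onclick", whitelist)) # not whitelisted
```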

html_sanitize_style_property

syntax: html_sanitize_style_property property …

default:

context: location

Sets the whitelisted CSS properties. The whitelist takes effect when the style_property querystring parameter selects whitelist mode (style_property=2). For example:

html_sanitize_style_property color background-color;

html_sanitize_url_protocol

syntax: html_sanitize_url_protocol [protocol] …

default:

context: location

Sets the URL protocols allowed in linkable attributes. The check applies only to absolute URLs, not relative ones, and only when the url_protocol querystring parameter enables it (url_protocol=1). For example:

html_sanitize_url_protocol http https tel;
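The rule "check absolute URLs only" can be sketched in Python as follows. This is an illustration of the described behavior under stated assumptions, not the module's C implementation: the function name protocol_allowed is hypothetical, and a URL is treated as absolute whenever Python's urlsplit finds a scheme in it.

```python
from typing import AbstractSet
from urllib.parse import urlsplit


def protocol_allowed(url: str, allowed: AbstractSet[str]) -> bool:
    """Check the scheme of an absolute URL against a protocol whitelist."""
    scheme = urlsplit(url).scheme  # urlsplit already lowercases the scheme
    if not scheme:
        # Relative URL (no scheme): the protocol check does not apply.
        return True
    return scheme in allowed


# Same protocols as the directive example above.
allowed = {"http", "https", "tel"}

print(protocol_allowed("https://example.com/page", allowed))
print(protocol_allowed("/relative/path", allowed))
print(protocol_allowed("javascript:alert(1)", allowed))
```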

html_sanitize_url_domain

syntax: html_sanitize_url_domain domain …

default:

context: location

Sets the URL domains allowed in linkable attributes. The check applies only to absolute URLs, not relative ones, and only when both the url_protocol and url_domain querystring parameters enable their checks (url_protocol=1 and url_domain=1). For example:

html_sanitize_url_domain *.google.com google.com;
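A possible reading of the wildcard syntax is sketched below in Python. It assumes that *.google.com matches subdomains only and that the bare domain needs its own entry, which is inferred from the example listing both forms; the module's actual lookup may behave differently. The function name domain_allowed is hypothetical.

```python
from typing import Iterable
from urllib.parse import urlsplit


def domain_allowed(url: str, patterns: Iterable[str]) -> bool:
    """Match the host of an absolute URL against domain patterns."""
    host = urlsplit(url).hostname  # lowercased by urlsplit, None if absent
    if host is None:
        # No host component (for example a relative URL): nothing to check.
        return True
    for pattern in patterns:
        pattern = pattern.lower()
        if pattern.startswith("*."):
            suffix = pattern[1:]  # "*.google.com" -> ".google.com"
            # Assumption: the wildcard requires at least one extra label.
            if host.endswith(suffix) and len(host) > len(suffix):
                return True
        elif host == pattern:
            return True
    return False


# Same patterns as the directive example above.
patterns = ["*.google.com", "google.com"]

print(domain_allowed("https://mail.google.com/inbox", patterns))
print(domain_allowed("https://google.com/", patterns))
print(domain_allowed("https://evilgoogle.com/", patterns))
```

Comparing against the suffix with its leading dot is what keeps a look-alike host such as evilgoogle.com from matching.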

html_sanitize_iframe_url_protocol

syntax: html_sanitize_iframe_url_protocol [protocol] …

default:

context: location

Same as html_sanitize_url_protocol, but applies only to the iframe.src attribute. For example:

html_sanitize_iframe_url_protocol http https tel;

html_sanitize_iframe_url_domain

syntax: html_sanitize_iframe_url_domain domain …

default:

context: location

Same as html_sanitize_url_domain, but applies only to the iframe.src attribute. For example:

html_sanitize_iframe_url_domain *.facebook.com facebook.com;

linkable_attribute

The linkable attributes are:

  • a.href
  • blockquote.cite
  • q.cite
  • del.cite
  • img.src
  • ins.cite
  • iframe.src
  • CSS URL function

Querystring

The query string of the request URL controls the internal behavior of ngx_http_html_sanitize_module.

document

value: 0 or 1

default: 0

context: querystring

Specifies whether to add <!DOCTYPE> to the response body.

html

value: 0 or 1

default: 0

context: querystring

Specifies whether to add <html></html> to the response body.

script

value: 0 or 1

default: 0

context: querystring

Specifies whether <script></script> is allowed.

style

value: 0 or 1

default: 0

context: querystring

Specifies whether <style></style> is allowed.

namespace

value: 0, 1, or 2

default: 0

context: querystring

Specifies the namespace used by gumbo-parser. The values are:

  • GUMBO_NAMESPACE_HTML: 0
  • GUMBO_NAMESPACE_SVG: 1
  • GUMBO_NAMESPACE_MATHML: 2

context

value: [0, 150)

default: 38(GUMBO_TAG_DIV)

context: querystring

Specifies the context tag for gumbo-parser. The value is one of the tag numbers defined in gumbo-parser's tag_enum.h.

element

value: 0, 1, or 2

default: 0

context: querystring

Specifies the output mode for elements. The values are:

  • 0: do not output elements
  • 1: output all elements
  • 2: output whitelisted elements

attribute

value: 0, 1, or 2

default: 0

context: querystring

Specifies the output mode for attributes. The values are:

  • 0: do not output attributes
  • 1: output all attributes
  • 2: output whitelisted attributes

style_property

value: 0, 1, or 2

default: 0

context: querystring

Specifies the output mode for CSS properties. The values are:

  • 0: do not output CSS properties
  • 1: output all CSS properties
  • 2: output whitelisted CSS properties
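The element, attribute, and style_property parameters share the same three modes. The decision they describe can be written as a short Python sketch; the names OutputMode and should_output are hypothetical and only model the documented values, not the module's internals.

```python
from enum import IntEnum
from typing import AbstractSet


class OutputMode(IntEnum):
    """Values accepted by element, attribute, and style_property."""
    NONE = 0       # output nothing of this kind
    ALL = 1        # output everything, the whitelist is not consulted
    WHITELIST = 2  # output only names found in the whitelist


def should_output(mode: int, name: str, whitelist: AbstractSet[str]) -> bool:
    """Decide whether an element, attribute, or CSS property is output."""
    mode = OutputMode(mode)  # raises ValueError for values other than 0, 1, 2
    if mode is OutputMode.NONE:
        return False
    if mode is OutputMode.ALL:
        return True
    return name in whitelist


# Hypothetical element whitelist; h7 is not an HTML5 element.
elements = {"h1", "p", "div"}

print(should_output(1, "h7", elements))  # mode 1 lets every element through
print(should_output(2, "h7", elements))  # mode 2 drops non-whitelisted names
print(should_output(2, "h1", elements))
```

This mirrors the earlier curl examples, where h7 survives with element=1 and is dropped in whitelist mode.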

style_property_value

value: 0 or 1

default: 0

context: querystring

Specifies whether CSS property values are checked. The values are:

  • 0: do not check the CSS property's value
  • 1: check the CSS property's value for the url() function and IE's expression() function to prevent XSS injection
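To make the target of this check concrete, here is a deliberately naive Python sketch that only detects the two functions in a property value. It is an assumption-laden illustration: the module relies on katana-parser rather than a regular expression, a regex like this does not handle CSS escapes or comments, and the name find_risky_functions is hypothetical.

```python
import re
from typing import List

# Matches url( or expression( when not preceded by an identifier character,
# so that a name such as "curl(" is not reported.
RISKY_FUNCTION = re.compile(r"(?<![\w-])(url|expression)\(", re.IGNORECASE)


def find_risky_functions(value: str) -> List[str]:
    """Return the lowercased names of url()/expression() calls in a CSS value."""
    return [name.lower() for name in RISKY_FUNCTION.findall(value)]


print(find_risky_functions("red"))
print(find_risky_functions("URL(javascript:alert(1))"))
print(find_risky_functions("expression(alert(1))"))
```

What the module does once such a function is found (for instance, applying the URL protocol check to the url() argument) is not covered by this sketch.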

url_protocol

value: 0 or 1

default: 0

context: querystring

Specifies whether the URL protocol of linkable attributes (see linkable_attribute) is checked. The values are:

  • 0: do not check the URL protocol
  • 1: output only URLs with a whitelisted protocol

url_domain

value: 0 or 1

default: 0

context: querystring

Specifies whether the URL domain of linkable attributes (see linkable_attribute) is checked. This check requires the url_protocol check to be enabled. The values are:

  • 0: do not check the URL domain
  • 1: output only URLs with a whitelisted domain

iframe_url_protocol

value: 0 or 1

default: 0

context: querystring

Same as url_protocol, but applies only to iframe.src.

iframe_url_domain

value: 0 or 1

default: 0

context: querystring

Same as url_domain, but applies only to iframe.src.

Project repository: https://github.com/youzan/ngx_http_html_sanitize_module
