语义元素与 org 的 html5 导出

通过前面的几篇文章，我们学习了 org 导出到 HTML 以及以 emacs 作为静态网站生成器的一些基本知识。本文将在这些知识的基础上对 ox-html 进行一些修改，以生成符合 HTML5 标准且可读的 HTML 文件。

本文的目标是通过修改 ox-html 生成完全达到 W3C validator 标准的 HTML5 文件，我会尽力覆盖 org 中的元素，但某些不常用的可能涉及不到。需要注意的是这些修改已经涉及到 ox-html 的实现了，等到几个月后的 emacs 29 发布，这些修改可能需要一些变化以适应新版本的 emacs 或者说是 org 9.6，不过那就是之后的事了。

本文使用的环境如下：

emacs 28.2 x86_64 on windows 11
org-mode 9.5.5

通过阅读前面的一些文章，你应该对 HTML 中的一些标签有所了解，不过就算不看的话，右键这个网页打开页面源代码也能看到 HTML 内容。就算 div 不是最多的，它也应该是最显眼的。 div 是一个通用的流内容容器，它是一个纯粹的容器，在语义上不表示任何特定类型的内容，它起的作用是将内容分组。实际上我们光凭 div 和 span 已经可以表达几乎所有的东西了。

但是我们现在有更好的选择，HTML5 提供了一堆语义元素来作为 div 的替代。所谓的语义元素就是使用更有意义的方式来标记文档中的元素，这样可以更清楚地表明页面的结构。参考 MDN，使用语义元素有如下优点：

更便于开发 — 如上所述，你可以使 HTML 更易于理解，并且可以毫不费力的获得一些功能

更适配移动端 — 语义化的 HTML 文件比非语义化的 HTML 文件更加轻便，并且更易于响应式开发

更便于 SEO 优化 — 比起使用非语义化的 <div> 标签，搜索引擎更加重视在“标题、链接等”里面的关键字，使用语义化可使网页更容易被用户搜索到

HTML 元素列表可以参考 MDN，下面我介绍一些比较感兴趣的标签：

<article> 用于表示文档、页面、应用或网站中的独立结构，其意在成为可独立分配的或可复用的结构，如在发布中，它可能是论坛帖子、杂志或新闻文章、博客、用户提交的评论、交互式组件、或者其他独立的内容项目

MDN 中说到：给定文档中可以包含多篇文章；例如，阅读器在在博客上滚动时一个接一个地显示每篇文章的文本，每个帖子将包含在 <article> 中，可能包含一个或多个 <section> 。

从这个解说来看， <article> 可能适合用于博客中包含摘要的归档页中，比如 nullprogram 的主页中显示的文章就使用了 article 标签。另外，对于一整篇的博客，我们也许可以在文章的标题前面使用一个 <article> 标签。
<aside> 表示一个和其余页面内容几乎无关的部分，被认为是独立于该内容的一部分，并且可以被单独的拆分出来而不会使整体受影响，通常表现为侧边栏或标注框

在一般的博客中我似乎很少见到侧边栏，不过 ox-html 也提供了 BEGIN_aside ，可用。如果使用的话，我可能不会以侧边栏的形式来呈现这一块的内容，而是虚化文字和边框。
<details> 提供一个挂件，仅在切换为展开状态时，它才会显示内含的信息。可以使用 <summary> 提供概要

目前似乎想不到它的用途，也许可以用来折叠比较长的但又不得不写在博文中的代码，这样就不用专门把代码放到某一文件中了。
<figure> 代表一段独立的内容，可能包含 <figcaption> 元素定义的说明元素

就像我们在上一篇文章中看到的，当启用了 ox-html 的 html5 后，图片的导出将使用 <figure> 而不是 <div> 作为块级元素。
<footer> 表示最近一个章节内容或根节点元素的页脚。一个页脚通常包含该章节作者、版权数据或者与文档相关的链接等信息

<footer> 应该就是放在 HTML 文档末尾的东西了，我们也许可以在编写 postamble 时在最外层使用 <footer> 。
<header> 用于展示介绍性内容，通常包含一组介绍性的或是辅助导航的使用元素。它可能包含一些标题元素，但也可能包含其他元素，如 Logo、搜索框，作者名称等等

这样说的话 <header> 就是页面的页眉，也可以是文章的开头。 <header> 在 HTML 中的出现没有次数限制，我们可以考虑对 preamble 使用 <header> 。
<main> 用于呈现文档或应用的主题部分。主题部分由与文档直接相关，或者扩展于文档的中心主题，应用的主要功能部分组成

对于博客的话，也许 <main> 标签应该用在 preamble 之后， </main> 在 postamble 之前，然后在 <main> 里面加上 <article> 标签。
<nav> 表示页面的一部分，器目的是在当前文档或其他文档中提供导航链接。导航部分的常见示例是菜单，目录和索引

<nav> 用于目录还挺好的，也许在 preamble 中的导航栏中也应该使用它。
<section> 表示 HTML 文档中的一个通用独立章节，它没有更具体的语义元素来表示

也许可以考虑将 org 中的一级标题以 <section> 表示，随后的次级标题再用 <div> 。
<time> 用来表示 24 小时制时间或公历日期，若表示日期则也可包含时间和时区

也许博文的创建时间和修改时间可以使用它。

以上就是 MDN 在 Semantics 中列出的语义元素的一部分，我会考虑在下一节中使用它们。

按下 F12 ，找到元素一栏，然后展开一些标签，你会看到以下内容：

可以看到，虽然我启用了 org 的 HTML5 功能，但是其中的一些还是 <div> 标签，我还需要一些调整来使用更多的语义元素。参考这篇文章，整个结构应该是这样的：

下面，让我们来修改 ox-html 以达到期望的效果吧。你在本文中打开的 F12 应该是比较符合上面这张图的。

这里再推荐几篇介绍 HTML5 语义化的文章

下面我们正式开始魔改 ox-html.el 来达到输出语义化 HTML5 文件的目的。原本我打算按照 ox-html 的内容重写一个，但是重写并不是说写完就完了，需要做的测试很费神，不如直接复用代码。

在这一节中我们将研究几乎每一个被 ox-html 导出的元素，所以二级标题可能会有一点多。这里需要一些 advice 的知识，可以阅读这一篇文章复习一下。

在上一篇文章中我说到了 org 导出的 HTML 中 <meta> 自闭合标签带 / 的问题（当然这也没问题），然后我给出了如下解决方法：

(defun ad-org-html-meta-entry (st)
  (let ((len (length st)))
    (concat (substring st nil (- len 4))
            ">\n")))

(advice-add 'org-html--build-meta-entry :filter-return 'ad-org-html-meta-entry)
;;(advice-remove 'org-html--build-meta-entry 'ad-org-html-meta-entry)

在导出到 HTML 时，对于 <meta> 标签的内容，我们有这些关键字可用： DESCRIPTION ， KEYWORDS 。 DESCRIPTION 会导出为 <meta name="description" content="......"> ， KEYWORDS 会导出为 <meta name="keywords" content="......"> 。我们也可以使用 HTML_HEAD 和 HTML_EXTRA_HEAD 添加更多的 <meta> 标签。

关于 <meta> 标签还有两点要提一下， charset 指定了文档的编码，这可以通过 org-html-codeing-system 进行修改，它默认为 utf-8 。另一个是 viewport ，使用它可以控制视口的大小和形状，ox-html 中的代码如下：

(defcustom org-html-viewport '((width "device-width")
				 (initial-scale "1")
				 (minimum-scale "")
				 (maximum-scale "")
				 (user-scalable ""))
  ...)

在介绍 org 导出的 CSS 样式时我们再来研究 viewport ，这里附上 MDN 的 viewport meta 标记。

我们可以通过 org-html-meta-tags 修改默认插入的 <meta> 标签，默认的标签包括这些：

(defun org-html-meta-tags-default (info)
  "A default value for `org-html-meta-tags'.

Generate a list items, each of which is a list of arguments that can
be passed to `org-html--build-meta-entry', to generate meta tags to be
included in the HTML head.

Use document's plist INFO to derive relevant information for the tags."
  (let ((author (and (plist-get info :with-author)
                     (let ((auth (plist-get info :author)))
                       ;; Return raw Org syntax.
                       (and auth (org-element-interpret-data auth))))))
    (list
     (when (org-string-nw-p author)
       (list "name" "author" author))
     (when (org-string-nw-p (plist-get info :description))
       (list "name" "description"
             (plist-get info :description)))
     (when (org-string-nw-p (plist-get info :keywords))
       (list "name" "keywords" (plist-get info :keywords)))
     '("name" "generator" "Org Mode"))))

如果我们不想在 <meta> 中出现作为 generator 的 Org Mode，我们也可以把它去掉（不过我是懒得这样做了）。

在 org 文件中我们可以使用 HTML_LINK_HOME 和 HTML_LINK_UP 来指定当前页面的目录和网站主页。只要我们制定了其中的任意一条，ox-html 就会导出一条如下所示的 div ：

<div id="org-div-home-and-up">
 <a accesskey="h" href="../../index.html"> UP </a>
 |
 <a accesskey="H" href="link-to-home-page"> HOME </a>
</div>

考虑到 HOME 和 UP link 起到的是导航作用，也许使用 nav 会更好一些。我们可以通过 org-html-home/up-format 来修改这个默认内容：

(defcustom org-html-home/up-format
  "<div id=\"org-div-home-and-up\">
 <a accesskey=\"h\" href=\"%s\"> UP </a>
 |
 <a accesskey=\"H\" href=\"%s\"> HOME </a>
</div>"
  "Snippet used to insert the HOME and UP links.
This is a format string, the first %s will receive the UP link,
the second the HOME link.  If both `org-html-link-up' and
`org-html-link-home' are empty, the entire snippet will be
ignored."
  :group 'org-export-html
  :type 'string)

我看了看 taingram 的博客，很显然他对这一格式进行了修改，并且赋予了 position: sticky ，这样 Blog 和 Home 会跟着页面滚动而移动：

<div id="org-div-home-and-up"><a href="https://taingram.org/blog">Blog</a> <a href="https://taingram.org/">Home</a> </div>

#org-div-home-and-up {
	  max-width: 56rem;
	  margin: 0 auto;
	  display: flex;
	  flex-direction: row-reverse;
	  justify-content: flex-end;
	  position: sticky;
	  top: 0px;
	  float: left;
}

我们可以考虑将这一格式定义成这样：

(setq org-html-home/up-format
	"<nav id=\"org-div-home-and-up\">\
<a href=\"%s\">UP</a> \
<a href=\"%s\">HOME</a>
</nav>")

如果我们完全不需要这两个链接的话，我们也不需要对它进行修改。

preamble 就是位于 home-and-up 之后和 content 之前的一块内容，postabmle 是位于 content 和 </body> 之间的一块内容。先前的文章中我们已经介绍过它们的使用方法，这里就不再赘述了。这里我只说一下确定它标签的变量 org-html-divs ：

(defcustom org-html-divs
  '((preamble  "div" "preamble")
    (content   "div" "content")
    (postamble "div" "postamble"))
  ...)

上一节中我们提到了一些语义元素，我们可以使用它们来替换一下这些 div ：

(setq org-html-divs
	'((preamble "header" "preamble")
	  (content "main" "content")
	  (postamble "footer" "postamble")))

经过修改后我们得到如下结构的 HTML：

在上一篇文章中我们分析过图片的导出，当使用 HTML5 时，图片使用 <figure> 作为标签：

[[./0.png]]

<figure id="orgf21bc05">
<img src="./0.png" alt="0.png">

</figure>

这当然没什么问题，但是从 <img> 到 </figure> 之间怎么空了一行？即使我们加上了标题也是这样：

#+CAPTION: zero
[[./0.png]]

<figure id="org0f3135b">
<img src="./0.png" alt="0.png">

<figcaption><span class="figure-number">Figure 1: </span>zero</figcaption>
</figure>

那就只能从代码找找原因了。org 中的图片是以链接形式给出的，当 org-html-link 发现 link 是图片类型时，它会调用 org-html--format-image 来获得图片的 HTML 导出：

;; org-html--format-image snippet
;; Image file.
((and (plist-get info :html-inline-images)
	(org-export-inline-image-p
	 link (plist-get info :html-inline-image-rules)))
 (org-html--format-image path attributes-plist info))

org-html--format-image 只是导出 <img> 标签的函数而已，它不是造成空行的原因。通过在 ox-html 中全局搜索我找到了 org-html-paragraph ，它在处理图片时调用了 org-html--wrap-image ，我们可以试试这个函数的效果：

(cl-letf (((symbol-function 'org-html--html5-fancy-p)
	     (lambda (x) t)))
  (org-html--wrap-image "<img hello>" nil))
=>
"
<figure>
<img hello>
</figure>"

当我们仅输入 <img> 标签时， org-html--wrap-image 函数的输出是没有空行的字符串。那我只能怀疑是 org-html-paragraph 的问题了。我们给它加个 trace 看看效果：

#+HTML_DOCTYPE: html5
#+OPTIONS: html5-fancy:t
#+OPTIONS: html-style:nil

[[./0.png]]

(trace-function 'org-html-paragraph)

1 -> (org-html-paragraph #2=(paragraph ...) #("<img src=\"./0.png\" alt=\"0.png\">
" 31 32 (:parent #2#)) (:export-options nil ...))

可见它的 contents 参数中的 <img> 标签结束后确实带了一个换行，这也就说明是 org-html-paragraph 的问题，当然这也不是什么 bug 就是了。通过 occur 命令查找 org-html--wrap-image 的调用点，发现只有 org-html-paragraph 调用了它，我们可以直接给它加上 advice：

(defun ad-org-html--wrap-image (st)
  (replace-regexp-in-string "\n\n" "\n" st))

(advice-add 'org-html--wrap-image :filter-return 'ad-org-html--wrap-image)
(advice-remove 'org-html--wrap-image 'ad-org-html--wrap-image)

现在生成的就是没有空格的 <figure> 块了（感觉有点多管闲事了）：

#+HTML_DOCTYPE: html5
#+OPTIONS: html5-fancy:t
#+OPTIONS: html-style:nil

#+CAPTION: hello
[[./0.png]]

<figure id="org64a267f">
<img src="./0.png" alt="0.png">
<figcaption><span class="figure-number">Figure 1: </span>hello</figcaption>
</figure>

最后，让我们修改一下 org-html-inline-image-rules ，让 org 在导出 HTML 时能够识别更多的图片（我发现原本的列表已经够全了，这里改下格式吧（笑））

(setq org-html-inline-image-rules
	(let ((reg (regexp-opt '(".jpeg" ".jpg" ".png" ".gif" ".svg" ".webp")))
	      (type '("file" "http" "https")))
	  (mapcar (lambda (x) (cons x reg)) type)))

我们可以使用 org-html-container-element 来设置包裹一级标题的块级标签，默认是 div ，我们可以把它改为 <section> ，各 section 对应于文中的一级标题：

#+HTML_DOCTYPE: html5
#+OPTIONS: html5-fancy:t
#+OPTIONS: html-style:nil
#+HTML_CONTAINER: section
* hello
abc
** world
def

<section id="outline-container-orgd6a0b52" class="outline-2">
<h2 id="orgd6a0b52"><span class="section-number-2">1.</span> hello</h2>
<div class="outline-text-2" id="text-1">
<p>
abc
</p>
</div>
<div id="outline-container-org8cd40d0" class="outline-3">
<h3 id="org8cd40d0"><span class="section-number-3">1.1.</span> world</h3>
<div class="outline-text-3" id="text-1-1">
<p>
def</p>
</div>
</div>
</section>

对于 <section> 以下的内容我其实感觉没必要太关系了，毕竟我想要的只是顶层标题的 <section> 而已。但标题的导出还涉及到其他的 org 元素，所以还不得不看。

在之前的文章中我们已经了解了一些可用于标题的选项，这里再结合标题的导出函数实现重新认识一下，负责标题导出的函数叫做 org-html-headline ，与之相关的还有 org-html-format-headline-function 和 org-html--container 。从 org-html-headline 的 let* 代码块我们就能看出标题上挂了多少东西：

(let* ((numberedp (org-export-numbered-headline-p headline info))
	 (numbers (org-export-get-headline-number headline info))
	 (level (+ (org-export-get-relative-level headline info)
		   (1- (plist-get info :html-toplevel-hlevel))))
	 (todo (and (plist-get info :with-todo-keywords)
		    (let ((todo (org-element-property :todo-keyword headline)))
		      (and todo (org-export-data todo info)))))
	 (todo-type (and todo (org-element-property :todo-type headline)))
	 (priority (and (plist-get info :with-priority)
			(org-element-property :priority headline)))
	 (text (org-export-data (org-element-property :title headline) info))
	 (tags (and (plist-get info :with-tags)
		    (org-export-get-tags headline info)))
	 (full-text (funcall (plist-get info :html-format-headline-function)
			     todo todo-type priority text tags info))
	 (contents (or contents ""))
	 (id (org-html--reference headline info))
	 (formatted-text
	  (if (plist-get info :html-self-link-headlines)
	      (format "<a href=\"#%s\">%s</a>" id full-text)
	    full-text)))
  ...)

numberdp 是根据标题是否具有 UNNUMBERED 属性等条件来判断标题是否应该带序号
level 就是标题的级数加上最高级别标题的基础级数，比如一级标题的级数就是 2
todo 是挂在标题前的完成状态， todo-type 和它有关
priority ，挂在标题前面的优先级状态
text 是标题的内容
tags 是挂在标题后面的标签
full-text 就是通过 format-headline-function 把上面这些整合到一起的函数
contents 是标题下面的内容
formatted-text ，如果我们使用了 org-html-self-link-headlines ，那么标题自己也会成为指向自己的链接

以下是一个完全体的标题和各部分的示意图，以及它的导出结果：（记得设置 #+OPTIONS: pri:t ）

获取 full-text 所用的函数 org-html-format-headline-function 的作用是将那些零散的东西组合成一个：

(defun org-html-format-headline-default-function
    (todo _todo-type priority text tags info)
  "Default format function for a headline.
See `org-html-format-headline-function' for details."
  (let ((todo (org-html--todo todo info))
	(priority (org-html--priority priority info))
	(tags (org-html--tags tags info)))
    (concat todo (and todo " ")
	    priority (and priority " ")
	    text
	    (and tags "&#xa0;&#xa0;&#xa0;") tags)))

如果我们不想在导出结果中显示 tags，priority 和 todo 的话，我们可以写一个只返回标题内容的函数，然后赋给 org-html-format-headline-function 。

接着再往下执行就是判断标题是否应该成为标题，这与 OPTIONS:H (org-export-headline-levels) 有关，当标题级数大于这个值时，标题将以列表的形式导出。至于是有序列表还是无序列表取决于 numberedp ，而它又取决于 UNNUMBERED 和 OPTIONS:num (org-export-with-section-numbers) ，当前标题的级数小于 section-numbers 时为有序列表导出，大于时为无序列表。由于 org-export-with-section-numbers 的默认值为 t ，默认情况下所有的标题都将带有数字。

如果标题级数足够小，那就会进入正常的标题导出流程。首先最外层包上 div 和 outline-container-N 的样式，然后开始构建使用 hN （N=2,3,…）的标题，接着就是内容等东西了。在编写 CSS 时我们根据文档列出的可用 CSS 编写即可，这里我就不贴代码了，有点长。

标题中得到的内联元素，也就是 todo，priority 和 tags 那些都是调用了对应函数得到的：

todo 对应 org-html--todo
priority 对应 org-html--priority
tags 对应 org-html--tags

在默认情况下，toc 是位于 content 的最前面的。之前我使用 org 创建网页时一直不知道怎么把图片放到 toc 之前，索性直接把图片的代码写到 <title> 里了，虽然也能正常显示，但是效果非常奇怪（笑）。这是默认情况下的 content 生成函数，可见有 :with-toc 的话 toc 位于 content 之前：

(defun org-html-inner-template (contents info)
  "Return body of document string after HTML conversion.
CONTENTS is the transcoded contents string.  INFO is a plist
holding export options."
  (concat
   ;; Table of contents.
   (let ((depth (plist-get info :with-toc)))
     (when depth (org-html-toc depth info)))
   ;; Document contents.
   contents
   ;; Footnotes section.
   (org-html-footnote-section info)))

我们可以设置 toc 为 nil ，然后在文章中想要的位置插入 #+TOC: headlines N 。如果使用 #+TOC: headlines N local 的话还能在标题下插入局部于标题的 toc。

需要说明的是，由于所有的目录都是由 org-html-toc 生成的，而其中的 id 值被写死了，为 text-table-of-contents 和 table-of-contents 。在 HTML 文件中存在不同元素具有相同 id 的情况似乎是被允许的，但调用 getElementBId 时只会返回第一个元素[1]。不过这对我们来说应该不是个问题，首先整个 HTML 页面几乎用不到对目录进行操作的 JS 代码，其次我们似乎很少需要单文件多目录。

toc 的 HTML 代码没什么好说的，非常正常的 HTML 输出。

标题中的 markup 指的是一些标记，比如行内代码，斜体，粗体，下划线和删除线。ox-html 中提供了 org-html-text-markup-list 来从 markup 导出到 html：

(defcustom org-html-text-markup-alist
  '((bold . "<b>%s</b>")
    (code . "<code>%s</code>")
    (italic . "<i>%s</i>")
    (strike-through . "<del>%s</del>")
    (underline . "<span class=\"underline\">%s</span>")
    (verbatim . "<code>%s</code>"))
  ...)

可见 bold 输出到 <b> ， code 和 verbatim 输出到 <code> ， italic 输出到 <i> ，删除线输出到 <del> ，下划线输出到 <span> 。如果代码中用到的 = 号比较多的话，我们可以使用 ~ （verbatim）来包裹代码。

除了 <b> 之外，能表示强调的标签还有 <em> ， <strong> 和 <mark> 。我从一些文章中了解到 <b> 最初只是用来表示加粗，没有什么语义，至于现在是个什么样子，我们还是看看当前的 HTML 标准吧：HTML 元素。

<b> The b element represents a span of text to which attention is being drawn for utilitarian purposes without conveying any extra importance and with no implication of an alternate voice or mood

As with the i element, authors can use the class attribute on the b element to identify why the element is being used, so that if the style of a particular use is to be changed at a later date, the author doesn't have to go through annotating each use.

The b element should be used as a last resort when no other element is more appropriate. In particular, headings should use the h1 to h6 elements, stress emphasis should use the em element, importance should be denoted with the strong element, and text marked or highlighted should use the mark element.
<em> The em element represents stress emphasis of its contents. The level of stress that a particular piece of content has is given by its number of ancestor em elements.

The em element isn't a generic "italics" element. Sometimes, text is intended to stand out from the rest of the paragraph, as if it was in a different mood or voice. For this, the i element is more appropriate.

The em element also isn't intended to convey importance; for that purpose, the strong element is more appropriate.
<strong> The strong element represents strong importance

the strong element can be used in a heading, caption, or paragraph to distinguish the part that really matters from other parts that might be more detailed, more jovial, or merely boilerplate. (This is distinct from marking up subheadings, for which the hgroup element is appropriate.)
<mark> The mark element represents a run of text in one document marked or highlighted for reference purposes

When used in a quotation or other block of text referred to from the prose, it indicates a highlight that was not originally present but which has been added to bring the reader's attention to a part of the text that might not have been considered important by the original author when the block was originally written, but which is now under previously unexpected scrutiny.

可以看到 <strong> 的表意太强了，可能不是很适合出现在 blog 中， <mark> 用来表示吸引读者注意的高亮，而且用途是引用。 <b> 的文档说明里还强调了 The b element should be used as a last resort when no other element is more appropriate 。这样看的话表示 = 的最佳元素应该是 <em> 。 <em> 的默认样式是斜体，也许我们需要修改一下它的 CSS，不过这就是之后的工作了。

关于斜体没什么好说的，用 <i> 就是了。MDN 中说它用于表现因某些原因需要区分普通文本的一系列文本。例如技术术语、外文短语或是小说中人物的思想活动等，它的内容通常以斜体显示。我一般在中文中插入英文就会用它。

你可能知道有个 <u> 元素可以用来表达下划线，但是这里 ox-html 为什么使用 <span> 呢？早期版本的 HTML 被用来设置下划线样式，但是到了 HTML5 中 <u> 具有语义，用来表示一个需要标注为 non-textual 的文本域。如果我们只是想要下划线效果的话，也许 <span> 会更好一些。

关于 <code> 和 <del> 没什么好说的，这两个标签的意义非常明显。我也不认为应该给 verbatim 添加别的标签（不过如果有需要的话可以试试）。

这里我们来了解一下除源代码块之外的其他块的导出，这里我 M-x occur block 在 ox-html.el 找到了如下的 block：

BEGIN_CENTER ，输出 <div class="org-center">
BEGIN_EXAMPLE ，输出 <pre class="example">
BEGIN_EXPORT ，如果 export 类型为 html，原样输出
dynamic-block ，直接插入内容
BEGIN_QUOTE ，插入 <blockquote>
BEGIN_VERSE ，插入 <p class="verse">

CENTER 的作用就是表示内容居中， EXAMPLE 表示内容按原格式输出， EXPORT 表示直接输出 HTML， QUOTE 表示引用， VERSE 表示保持换行。其中 VERSE ， CENTER 我基本上没用过，也就没什么使用建议好说，至于 DYNAMIC 更是闻所未闻。

除了这些外还有源代码 block 和内联代码 block，这些我们留到之后的章节。最后是一种叫做 special 的 block，它是 html5 中的一些新元素，比如 aside ， video 等等，如果 BEGIN 后面接的标签 ox-html 无法通过 org-html-html5-elements 识别的话，那就会以 div 导出， div 的 class 为标签名。

在 org 里面有三种列表，分别是无序列表，有序列表和描述性列表，我只用过前两种。通过在一行的开头使用 -, + 或 * ，我们可以表示无序列表。通过在一行的开头使用 1., 1, ，我们可以表示有序列表。如果我们在列表内容的开头写上 [@N] ，那么我们还可以控制有序列表的序号：

这是 1
这是加上了 [@20]

如果我们要使用 description list 的话，在列表的后面通过 :: 加上描述性内容即可。

ox-html 使用 org-html-plain-list 来导出列表，分别会赋予不同列表 org-ol, org-ul 和 org-dl 的 class。

前面的文章中我提了一嘴 table 的配置变量，但是直接跳过了，因为这一部分稍显复杂，现在让我们继续。说到表格，我们就不得不从两个方面来学习，一是 org 中的表格，二是 HTML 中的表格。

在 org-mode buffer 中，我们可以使用 | 加上 TAB 来创建表格，具体的操作我就不搁着好为人师了，请阅读文档：Build-in Table Editor。整个表的样子大概可以分为以下几种：

基础表格

| num | val |
|   1 |   1 |
|   2 |   2 |

带水平线表格

| title | author |
|-------+--------|
| foo   | bar    |
| wo    | ni     |

带对齐表格

| 100000 |  20 | woooo |
| <l>    | <r> |  <c>  |
| 3      |   4 |   1   |

限制行宽

| 1 2     | 3 |
| <6>     |   |
| 1234567 |   |

其中，行宽和对其可以放在一个尖括号里，也就是字符加上数字 <xN> 。

参考 3.3 Coulmn Groups 这一节，我们还可以在表中指定组，这样表格会用竖线将组与组之间分隔开，就像这样：

| N | N^2 | N^3 | N^4 | sqrt(n) | sqrt[4](N) |
|---+-----+-----+-----+---------+------------|
| / |  <  |     |  >  |       < |          > |
| 1 |  1  |  1  |  1  |       1 |          1 |
| 2 |  4  |  8  | 16  |  1.4142 |     1.1892 |
| 3 |  9  | 27  | 81  |  1.7321 |     1.3161 |
|---+-----+-----+-----+---------+------------|

关于 org 的表格计算功能这里我就不提了，实在是有些复杂而且平时用的不是很多。下图是上面这些表格的导出效果（默认 org 样式太丑了，这里用我现在用的 CSS）：

下面我们看看在 HTML 中是如何表示表格的。

在 HTML 中，我们使用 <table> 来表示表格。参考 MDN， <table> 中的剩下内容由如下元素组成：

一个可选的 <caption> 元素
零个或多个 <colgroup> 元素
一个可选的 <thead> 元素
下列任意一个：
- 零个或多个 <tbody>
- 零个或多个 <tr>
一个可选的 <tfoot> 元素

下面我们一个一个分析这些元素的作用。以下内容部分自 MDN 文档。

<caption> 展示一个表格的标题，它常常作为 <table> 的第一个子元素出现，同时显示在表格内容的最前面，但是，它同样可被 CSS 样式化，所以，它同样可以出现在相对于表格的任意位置。

colgroup 用来定义表中的一组列表。通过使用 <colgroup> 标签可以向整个列应用 CSS 样式，而不需要重复为每一个单元格或行设置样式。 <colgroup> 要配合 <col> 使用， <col> 位于 <colgroup> 内，可以使用它的 span 属性来表示该 <col> 横框的列数，默认为 1。通过设置 span 和 class，style 属性，我们可以设置表格中的列样式，下面的 <colgroup> 元素表示将表格的第二列和第三列设为黄色，将第四列设为红色：

<colgroup>
  <col>
  <col span="2" style="background-color:yellow">
  <col style="background-color:red">
</colgroup>

<thead> 定义了一组定义表格的列头的行，也就是表头部分，下面是一个表头例子：

<thead>
  <tr>
    <th>Month</th>
    <th>Budget</th>
  </tr>
</thead>

<tbody> 封装了一系列表格的行，代表了它们是表格主要内容的组成部分：

<tbody>
  <tr>
    <th scope="row">Donuts</th>
    <td>3,000</td>
  </tr>
  <tr>
    <th scope="row">Stationery</th>
    <td>18,000</td>
  </tr>
</tbody>

<tr> 是表格中的行，同一行可以出现 <td> 和 <th> 元素。其中， <th> 表示表头单元格， <td> 表示表格中的标准单元格。

最后的 <tfoot> 定义了一组表格中各列的汇总行，允许包含 0 个或多个 <tr> 。

在 org manual 13.9.8 中，文档列出了 7 个选项变量，初看时我不是很明白各变量的用意，现在学了 <table> 之后再回来看就很清楚了。这里我们结合 ox-html 中的实现说说这些变量的作用。和 table 导出相关的函数主要是这几个：

org-html-table-cell
org-html-table-row
org-html-table-first-row-data-cells
org-html-table--tabel.el-table
org-html-table

整个 table 的导出入口函数应该是 org-html-table ，这个函数稍微有点复杂，我只得知难而退（笑），下面分析一下各选项变量吧：

org-html-table-align-individual-fields ，非空时将对其的 style 属性赋给表的各 field，默认为 t

该函数用于 org-html-table-cell 中，会给各 field 加上 class="org-?" 的内容， ? 可以是 left, right 或 center

;; org-html-table-cell snippet
(let* ((table-row (org-export-get-parent table-cell))
	   (table (org-export-get-parent-table table-cell))
	   (cell-attrs
	    (if (not (plist-get info :html-table-align-individual-fields)) ""
	      (format (if (and (boundp 'org-html-format-table-no-css)
			       org-html-format-table-no-css)
			  " align=\"%s\"" " class=\"org-%s\"")
		      (org-export-table-cell-alignment table-cell info)))))
  ...)

org-html-table-caption-above ，是否把表格的标题放在表格的上方，默认为 t

在 org-html-table 中会根据这个选项来决定 caption 的 css：

;; org-html-table snippet
(if (not caption) ""
  (format (if (plist-get info :html-table-caption-above)
		  "<caption class=\"t-above\">%s</caption>"
		"<caption class=\"t-bottom\">%s</caption>")
	      ...)
  ...)

org-html-table-data-tags ，用于表格 field 的标准标记，默认为 <td> 和 </td>
org-html-table-default-attributes ，表格的默认属性，不过在 HTML5 下不用它

它的默认值是 (:border "2" :cellspacing "0" :cellpadding "6" :rules "groups" :frame "hsides")
org-html-table-header-tags ，用于 thead 的 tag，默认是 <th> 和 </th>
org-html-table-row-open-tag 和 org-html-table-row-close-tag 分别是 <tr> 和 </tr> ，相信不用解释了

org-mode manual 9.6 上还写着 org-html-table-row-tags ，可能有点过时， org-html-table-row-open-tag 中的注释告诉了你如何自定义这个选项

org-html-table-use-header-tags-for-first-column ，是否使用 header tags 来构建第一列，默认为 nil

如果我们想让列表的第一列使用 <th> 的话，我们可将它设为 t 。以下是 org-html-table-cell 片段：

((and (plist-get info :html-table-use-header-tags-for-first-column)
	  (zerop (cdr (org-export-table-cell-address table-cell info))))
 (let ((header-tags (plist-get info :html-table-header-tags)))
   (concat "\n" (format (car header-tags) "row" cell-attrs)
	       contents
	       (cdr header-tags))))

下面是一个内容比较全的 org 表格，我们看看它的 HTML 导出：

| Day | Month | Year |
|-----+-------+------|
| <l> |   <r> | <c>  |
| 30  |     1 | 2023 |
| 20  |    02 | 2020 |
|-----+-------+------|
| I   |   You |  We  |
| yy  |    yy |  yy  |

<table>
<colgroup>
  <col  class="org-left">
  <col  class="org-right">
  <col  class="org-center">
</colgroup>
<thead>
  <tr>
    <th scope="col" class="org-left">Day</th>
    <th scope="col" class="org-right">Month</th>
    <th scope="col" class="org-center">Year</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="org-left">30</td>
    <td class="org-right">1</td>
    <td class="org-center">2023</td>
  </tr>
  <tr>
    <td class="org-left">20</td>
    <td class="org-right">02</td>
    <td class="org-center">2020</td>
  </tr>
</tbody>
<tbody>
  <tr>
    <td class="org-left">I</td>
    <td class="org-right">You</td>
    <td class="org-center">We</td>
  </tr>
  <tr>
    <td class="org-left">yy</td>
    <td class="org-right">yy</td>
    <td class="org-center">yy</td>
  </tr>
</tbody>
</table>

org 中的 footnote 是使用 [fn::xxx] 表示的，具体使用方法可以参考 12.10 Creating Footnotes。我们这里主要关注的是 footnote 的导出。

在 ox-html 中负责导出 footnote 引用的是 org-html-footnote-reference ，它会将 [fn::xxx] 出现的地方导出为链接，以下是 org-html-footnote-reference 的代码片段：

(format
 (plist-get info :html-footnote-format)
 (org-html--anchor
  id n (format " class=\"footref\" href=\"#fn.%d\" role=\"doc-backlink\"" n) info))

我们可以使用 org-html-footnote-format 控制 footnote 的格式，默认是 <sup>%s</sup> 。

在文章的末尾，ox-html 还会为我们导出包含所有 footnote 链接的 section，我们可以使用 org-html-footnotes-section 来控制导出的格式，它的默认值为：

(defcustom org-html-footnotes-section "<div id=\"footnotes\">
<h2 class=\"footnotes\">%s: </h2>
<div id=\"text-footnotes\">
%s
</div>
</div>")

我们可以考虑将最外层的 div 换成 section 。负责导出 section 的函数是 org-html-footnote-section ，暂时我没想出来有什么需要修改的地方。

我觉得这大概是整个导出过程中最复杂的一部分了，所以也就把它留到了这一节的最后一小节。org manual 花了第 16 章整整一章来讲解如何使用 org 中的代码块功能，我这里也只是在导出这点皮毛上介绍点东西而已。

在 org manual 的第 13.9 节中并没有对源代码块的导出做出什么说明，以下内容皆来自 ox-html.el 的代码和注释。

源代码的导出函数主要是 org-html-inline-src-block 和 org-html-src-block ，前者导出内联的代码，后者导出 #+BEGIN_SRC 块。我们先从内联代码开始说起。

对于像是 src_<language>{<body>} 的内联代码，ox-html 会调用 org-html-inline-src-block 来将其输出为 HTML 代码，比如我们可由 src_c[:exports]{int a = 1;} 得到 <code class="src src-c"><span style="color: #18b2b2;">int</span> <span style="color: #ff8700;">a</span> = 1</code> 。在整个过程中它会将代码交给 org-html-fontify-code 处理，来得到“上色”的代码：

(defun org-html-inline-src-block (inline-src-block _contents info)
  "Transcode an INLINE-SRC-BLOCK element from Org to HTML.
CONTENTS holds the contents of the item.  INFO is a plist holding
contextual information."
  (let* ((lang (org-element-property :language inline-src-block))
	 (code (org-html-fontify-code
		(org-element-property :value inline-src-block)
		lang))
	 (label
	  (let ((lbl (org-html--reference inline-src-block info t)))
	    (if (not lbl) "" (format " id=\"%s\"" lbl)))))
    (format "<code class=\"src src-%s\"%s>%s</code>" lang label code)))

在没有安装 htmlize 的情况下，调用 org-html-fontify-code 只会得到 plain-text 的结果。如果该语言在 emacs 中没有 major-mode 也会导出 plain-text。htmlize 在过去是 org-mode 的内置包，不过现在被移除了，现在可以在 github 上找到它。

对于代码块的导出，如果代码块没有指定语言，那么 BEGIN_SRC 与 BEGIN_EXAMPLE 一样，都是使用 class 为 example 的 pre 元素。若使用的语言在 emacs 中找到，那么使用 div class=org-src-container 来作为最外层块元素。接着在里面使用 pre class=src src-lang 作为包括代码的元素。

除了说使用 org 的原生方案，我们也可以考虑在客户端渲染，即使用 highlight.js 来给代码加高亮。想要试试的同学可以参考这篇文章，这里我就不演示了。

这里也有一个关于使用 Pygmentize 进行高亮的讨论，目前我没有更换高亮方式的想法，先就着 htmlize 用吧。

在简单读了读文档和 ox-html 实现后，我现在对 ox-html 的一些基本行为有了一定的了解，现在我们可以开始写一段能用的选项和配置代码了，此处的设定主要是两个文件，一是用于通过 #+SETUPFILE: 引入的 org 配置文件，而是一段需要执行的 elisp 代码，它用来修改一些 defcustom 选项。

#+HTML_DOCTYPE: html5
#+HTML_CONTAINER: section

#+OPTIONS: toc:nil
#+OPTIONS: html-style:nil
#+OPTIONS: html5-fancy:t
#+OPTIONS: ^:nil

#+LANGUAGE: en
#+AUTHOR: include-yy

实际上我们还可以加上 CSS 链接和 JS 链接，不过这里不是很必要。如果你对某些选项的默认设置不满意的话还可以继续加。

接着是上面我们提到的一些修改的综合：

(defun ad-org-html-meta-entry (st)
  (let ((len (length st)))
    (concat (substring st nil (- len 4))
	      ">\n")))

(advice-add 'org-html--build-meta-entry :filter-return 'ad-org-html-meta-entry)
;;(advice-remove 'org-html--build-meta-entry 'ad-org-html-meta-entry)

(setq org-html-divs
	'((preamble "header" "preamble")
	  (content "main" "content")
	  (postamble "footer" "postamble")))

(defun ad-org-html--wrap-image (st)
  (replace-regexp-in-string "\n\n" "\n" st))

(advice-add 'org-html--wrap-image :filter-return 'ad-org-html--wrap-image)
;;(advice-remove 'org-html--wrap-image 'ad-org-html--wrap-image)

(setq org-html-footnotes-section "<section id=\"footnotes\">
  <h2 class=\"footnotes\">%s: </h2>
  <div id=\"text-footnotes\">
  %s
  </div>
  </section>")

这样看来似乎也没有很多需要修改的地方（笑）。我将整个文件上传到了 W3C validator 检查，结果如下：

以下是浏览器中的 F12 打开的元素显示结果：

如果有时间的话我还是挺想自己写个 HTML5 导出后端的，而且这个工作不用我从头做起，我只需要照着 ox-html 做就好了。希望以后能有这个时间吧（笑）。org 在生成 HTML 时会生成很多不必要的 id，读者可以在网上搜索解决方案，比如这个：Disable auto id generation in org-mode html export。

在完成了本文后，我距离弄个博客出来的目标又近了一步。在这一系列的下一篇文章中我会介绍如何设置适用于 org-mode 导出的 HTML 文件的 CSS。

语义元素与 org 的 html5 导出

1. 什么是语义元素

1.1. 页面的总体结构

2. make ox-html more html5-style

2.1. <meta>

2.2. home and up link

2.3. preamble and postamble

2.4. image

2.5. headline

2.6. toc

2.7. markup

2.8. block

2.9. list

2.10. table

2.10.1. org 中的表格

2.10.2. html 中的表格

2.10.3. org 表格到 html 表格

2.11. footnote

2.12. source code

3. 一个简单的 HTML5 导出设定

4. 后记

References