Java XML parsers

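The two most common ways to parse XML in Java are DOM and SAX. Other approaches exist, but these two dominate, and Tomcat uses SAX. How XML parsing works in general is outside the scope of this section, so only the parts that matter here are covered.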

Digester

The fully qualified name of the Digester class is org.apache.tomcat.util.digester.Digester. It parses XML files by extending the JDK class org.xml.sax.ext.DefaultHandler2.

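By extending that class, Digester overrides four methods, startDocument, endDocument, startElement and endElement (plus a few others that are not the focus here), to parse the XML and initialize the containers. Digester also abstracts each action taken while processing an element into a Rule, with one Rule per operation: ObjectCreateRule creates an object, SetPropertiesRule sets properties, and so on.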
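To make the four callbacks concrete, here is a minimal sketch that uses only the JDK. It extends DefaultHandler2 and records the order in which the SAX parser invokes each method. The class name CallbackOrderDemo and its trace helper are made up for illustration and are not part of Tomcat.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class: records which SAX callbacks fire, and in what order.
final class CallbackOrderDemo extends DefaultHandler2 {
    final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add("startDocument");
    }

    @Override
    public void endDocument() {
        events.add("endDocument");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + qName);
    }

    // Parses the given XML string and returns the recorded callback sequence.
    static List<String> trace(String xml) {
        CallbackOrderDemo handler = new CallbackOrderDemo();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        System.out.println(trace("<Server><Listener/></Server>"));
    }
}
```

The recorded sequence shows that the end callback of an outer element arrives only after all of its children have been started and ended.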

While studying Digester I noticed at least two design patterns, or design ideas, at work. I share them next.

FactoryFinder

The JDK source archive src.zip contains the class javax.xml.parsers.SAXParserFactory, whose newInstance method looks like this:

public static SAXParserFactory newInstance() {
    return FactoryFinder.find(
            /* The default property name according to the JAXP spec */
            SAXParserFactory.class,
            /* The fallback implementation class name */
            "com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl");
}

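What is worth learning here is that, for the sake of extensibility, a FactoryFinder class is used to instantiate the factory. Here is the implementation of its find method: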

static <T> T find(Class<T> type, String fallbackClassName)
        throws FactoryConfigurationError {
    final String factoryId = type.getName();
    dPrint("find factoryId =" + factoryId);

    // Use the system property first
    try {
        String systemProp = ss.getSystemProperty(factoryId);
        if (systemProp != null) {
            dPrint("found system property, value=" + systemProp);
            return newInstance(type, systemProp, null, true);
        }
    } catch (SecurityException se) {
        if (debug) se.printStackTrace();
    }

    // try to read from $java.home/lib/jaxp.properties
    try {
        if (firstTime) {
            synchronized (cacheProps) {
                if (firstTime) {
                    String configFile = ss.getSystemProperty("java.home") + File.separator +
                            "lib" + File.separator + "jaxp.properties";
                    File f = new File(configFile);
                    firstTime = false;
                    if (ss.doesFileExist(f)) {
                        dPrint("Read properties file " + f);
                        cacheProps.load(ss.getFileInputStream(f));
                    }
                }
            }
        }
        final String factoryClassName = cacheProps.getProperty(factoryId);

        if (factoryClassName != null) {
            dPrint("found in $java.home/jaxp.properties, value=" + factoryClassName);
            return newInstance(type, factoryClassName, null, true);
        }
    } catch (Exception ex) {
        if (debug) ex.printStackTrace();
    }

    // Try Jar Service Provider Mechanism
    T provider = findServiceProvider(type);
    if (provider != null) {
        return provider;
    }
    if (fallbackClassName == null) {
        throw new FactoryConfigurationError(
                "Provider for " + factoryId + " cannot be found");
    }

    dPrint("loaded from fallback value: " + fallbackClassName);
    return newInstance(type, fallbackClassName, null, true);
}

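As the comments show, find first tries to read a user-specified factory class from three places, and falls back to the class named by the second argument only if none of them yields one: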

  • A system property, which can be set with -D
  • The $java.home/lib/jaxp.properties file
  • Java's SPI (service provider) mechanism

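This is not a classic design pattern, but it is worth borrowing: when you define a factory and want users to be able to plug in their own implementation, this lookup chain is one way to do it.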
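The same lookup chain can be sketched for a service of your own. The version below is a simplified illustration, not the JDK code: Greeter, DefaultGreeter, CustomGreeter and GreeterFinder are hypothetical names, and the properties-file step is left out to keep it short. It checks a system property first, then ServiceLoader, then the fallback class.

```java
import java.util.Iterator;
import java.util.ServiceLoader;

// Hypothetical service interface and implementations, used only for illustration.
interface Greeter {
    String greet();
}

final class DefaultGreeter implements Greeter {
    public String greet() {
        return "default";
    }
}

final class CustomGreeter implements Greeter {
    public String greet() {
        return "custom";
    }
}

final class GreeterFinder {
    // Lookup order: system property, then ServiceLoader (SPI), then the fallback class.
    static Greeter find(String fallbackClassName) {
        String factoryId = Greeter.class.getName();

        // 1. A system property named after the interface, e.g. set with -D
        String fromProperty = System.getProperty(factoryId);
        if (fromProperty != null) {
            return newInstance(fromProperty);
        }

        // 2. Providers registered under META-INF/services
        Iterator<Greeter> providers = ServiceLoader.load(Greeter.class).iterator();
        if (providers.hasNext()) {
            return providers.next();
        }

        // 3. The built-in default
        return newInstance(fallbackClassName);
    }

    private static Greeter newInstance(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            return Greeter.class.cast(clazz.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create " + className, e);
        }
    }
}
```

Calling GreeterFinder.find(DefaultGreeter.class.getName()) returns the default implementation unless the system property named after the interface points at another class, which is the same override behavior that -D gives users of SAXParserFactory.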

Visitor pattern

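In the Digester UML diagram, the Rule class and the Digester class correspond to the Visitor and the Element of the Visitor pattern, respectively. Every Rule implementation holds a digester reference, which is set on the Rule when the Rule is initialized, so different Rules can change what the digester does. This is the classic Visitor pattern.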
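The relationship can be sketched with toy classes. Everything below (ToyDigester, ToyRule, PushRule, UpperCaseTopRule) is a made-up stand-in rather than Tomcat code; it only illustrates the idea that registering a rule hands it a reference to the digester, and that each rule then acts on the digester's state.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy stand-in for the digester: owns an object stack and a list of rules.
final class ToyDigester {
    final Deque<Object> stack = new ArrayDeque<>();
    private final List<ToyRule> rules = new ArrayList<>();

    // Registering a rule also gives it a reference back to this digester.
    void addRule(ToyRule rule) {
        rule.digester = this;
        rules.add(rule);
    }

    // Invokes every registered rule in registration order.
    void fireAll() {
        for (ToyRule rule : rules) {
            rule.begin();
        }
    }
}

// Toy stand-in for Rule: each subclass is one operation on the digester.
abstract class ToyRule {
    ToyDigester digester;

    abstract void begin();
}

// Pushes a fixed value onto the digester's stack.
final class PushRule extends ToyRule {
    private final Object value;

    PushRule(Object value) {
        this.value = value;
    }

    @Override
    void begin() {
        digester.stack.push(value);
    }
}

// Replaces the top of the stack with its upper-cased string form.
final class UpperCaseTopRule extends ToyRule {
    @Override
    void begin() {
        Object top = digester.stack.pop();
        digester.stack.push(top.toString().toUpperCase());
    }
}
```

Adding a PushRule("server") followed by an UpperCaseTopRule and calling fireAll leaves "SERVER" on top of the stack: new behavior comes from adding rules, without touching the digester class.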

createStartDigester

The previous section showed that the key part of Catalina.load can be written in pseudocode as:

public void load() {
    // JNDI naming is enabled by default unless -nonaming is passed at startup
    initNaming();
    Digester digester = createStartDigester();
    digester.parse("server.xml");
    getServer().init();
}

Now let's look at createStartDigester. Reduced to its core, it looks like this:

protected Digester createStartDigester() {
    // Create the Digester
    Digester digester = new Digester();

    // Add Rules to the digester; as discussed above, the digester relies on its configured Rules to process the XML
    digester.addObjectCreate("Server",
            "org.apache.catalina.core.StandardServer",
            "className");
    digester.addSetProperties("Server");
    digester.addSetNext("Server",
            "setServer",
            "org.apache.catalina.Server");

    digester.addObjectCreate("Server/GlobalNamingResources",
            "org.apache.catalina.deploy.NamingResourcesImpl");
    digester.addSetProperties("Server/GlobalNamingResources");
    digester.addSetNext("Server/GlobalNamingResources",
            "setGlobalNamingResources",
            "org.apache.catalina.deploy.NamingResourcesImpl");

    digester.addRule("Server/Listener",
            new ListenerCreateRule(null, "className"));
    digester.addSetProperties("Server/Listener");
    digester.addSetNext("Server/Listener",
            "addLifecycleListener",
            "org.apache.catalina.LifecycleListener");

    digester.addObjectCreate("Server/Service",
            "org.apache.catalina.core.StandardService",
            "className");
    digester.addSetProperties("Server/Service");
    digester.addSetNext("Server/Service",
            "addService",
            "org.apache.catalina.Service");

    digester.addObjectCreate("Server/Service/Listener",
            null, // MUST be specified in the element
            "className");
    digester.addSetProperties("Server/Service/Listener");
    digester.addSetNext("Server/Service/Listener",
            "addLifecycleListener",
            "org.apache.catalina.LifecycleListener");

    //Executor
    digester.addObjectCreate("Server/Service/Executor",
            "org.apache.catalina.core.StandardThreadExecutor",
            "className");
    digester.addSetProperties("Server/Service/Executor");

    digester.addSetNext("Server/Service/Executor",
            "addExecutor",
            "org.apache.catalina.Executor");


    digester.addRule("Server/Service/Connector",
            new ConnectorCreateRule());
    digester.addRule("Server/Service/Connector", new SetAllPropertiesRule(
            new String[]{"executor", "sslImplementationName", "protocol"}));
    digester.addSetNext("Server/Service/Connector",
            "addConnector",
            "org.apache.catalina.connector.Connector");

    digester.addRule("Server/Service/Connector", new AddPortOffsetRule());

    digester.addObjectCreate("Server/Service/Connector/SSLHostConfig",
            "org.apache.tomcat.util.net.SSLHostConfig");
    digester.addSetProperties("Server/Service/Connector/SSLHostConfig");
    digester.addSetNext("Server/Service/Connector/SSLHostConfig",
            "addSslHostConfig",
            "org.apache.tomcat.util.net.SSLHostConfig");

    digester.addRule("Server/Service/Connector/SSLHostConfig/Certificate",
            new CertificateCreateRule());
    digester.addRule("Server/Service/Connector/SSLHostConfig/Certificate",
            new SetAllPropertiesRule(new String[]{"type"}));
    digester.addSetNext("Server/Service/Connector/SSLHostConfig/Certificate",
            "addCertificate",
            "org.apache.tomcat.util.net.SSLHostConfigCertificate");

    digester.addObjectCreate("Server/Service/Connector/SSLHostConfig/OpenSSLConf",
            "org.apache.tomcat.util.net.openssl.OpenSSLConf");
    digester.addSetProperties("Server/Service/Connector/SSLHostConfig/OpenSSLConf");
    digester.addSetNext("Server/Service/Connector/SSLHostConfig/OpenSSLConf",
            "setOpenSslConf",
            "org.apache.tomcat.util.net.openssl.OpenSSLConf");

    digester.addObjectCreate("Server/Service/Connector/SSLHostConfig/OpenSSLConf/OpenSSLConfCmd",
            "org.apache.tomcat.util.net.openssl.OpenSSLConfCmd");
    digester.addSetProperties("Server/Service/Connector/SSLHostConfig/OpenSSLConf/OpenSSLConfCmd");
    digester.addSetNext("Server/Service/Connector/SSLHostConfig/OpenSSLConf/OpenSSLConfCmd",
            "addCmd",
            "org.apache.tomcat.util.net.openssl.OpenSSLConfCmd");

    digester.addObjectCreate("Server/Service/Connector/Listener",
            null, // MUST be specified in the element
            "className");
    digester.addSetProperties("Server/Service/Connector/Listener");
    digester.addSetNext("Server/Service/Connector/Listener",
            "addLifecycleListener",
            "org.apache.catalina.LifecycleListener");

    digester.addObjectCreate("Server/Service/Connector/UpgradeProtocol",
            null, // MUST be specified in the element
            "className");
    digester.addSetProperties("Server/Service/Connector/UpgradeProtocol");
    digester.addSetNext("Server/Service/Connector/UpgradeProtocol",
            "addUpgradeProtocol",
            "org.apache.coyote.UpgradeProtocol");

    // A RuleSet, as the name suggests, adds a batch of Rules at once
    digester.addRuleSet(new NamingRuleSet("Server/GlobalNamingResources/"));
    digester.addRuleSet(new EngineRuleSet("Server/Service/"));
    digester.addRuleSet(new HostRuleSet("Server/Service/Engine/"));
    digester.addRuleSet(new ContextRuleSet("Server/Service/Engine/Host/"));
    addClusterRuleSet(digester, "Server/Service/Engine/Host/Cluster/");
    digester.addRuleSet(new NamingRuleSet("Server/Service/Engine/Host/Context/"));

    // When the 'engine' is found, set the parentClassLoader.
    digester.addRule("Server/Service/Engine",
            new SetParentClassLoaderRule(parentClassLoader));
    addClusterRuleSet(digester, "Server/Service/Engine/Cluster/");

    return digester;
}

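Many readers will feel a bit dizzy at this point, and so did I at first, but it is enough to focus on the essentials. There are a lot of Rules here, yet not every one of them is actually used. Below we debug in IntelliJ IDEA with a sample server.xml to see how each Rule takes effect and what it does. Once a few Rules are familiar, the rest fall into place. There is also no need to understand every Rule: Cluster-related Rules, for example, are irrelevant to most people and can be ignored at first. Learn what you need when you need it.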

How Digester parses

The walkthrough below uses the following server.xml to illustrate how Digester parses:

<Server port="8005" shutdown="SHUTDOWN">
    <Listener className="org.apache.catalina.startup.VersionLoggerListener" />
    <Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on" />
    <Listener className="org.apache.catalina.core.JreMemoryLeakPreventionListener" />
    <Listener className="org.apache.catalina.mbeans.GlobalResourcesLifecycleListener" />
    <Listener className="org.apache.catalina.core.ThreadLocalLeakPreventionListener" />
    <GlobalNamingResources>
        <Resource name="UserDatabase" auth="Container" type="org.apache.catalina.UserDatabase" description="User database that can be updated and saved" factory="org.apache.catalina.users.MemoryUserDatabaseFactory" pathname="conf/tomcat-users.xml" />
    </GlobalNamingResources>
    <Service name="Catalina">
        <Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" />
        <Engine name="Catalina" defaultHost="localhost">
            <Realm className="org.apache.catalina.realm.LockOutRealm">
                <Realm className="org.apache.catalina.realm.UserDatabaseRealm" resourceName="UserDatabase" />
            </Realm>
            <Host name="localhost" appBase="/usr/local/Cellar/tomcat/9.0.31_1/libexec/webapps" unpackWARs="true" autoDeploy="true" deployOnStartup="false" deployIgnore="^(?!(manager)|(tomee)$).*">
                <Valve className="org.apache.catalina.valves.AccessLogValve" directory="logs" prefix="localhost_access_log" suffix=".txt" pattern="%h %l %u %t &quot;%r&quot; %s %b" />
            </Host>
        </Engine>
    </Service>
</Server>

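Recall Digester's four main methods: startDocument, endDocument, startElement and endElement. As a first impression: startDocument and endDocument are each called once, at the start and at the end of XML parsing, and endDocument calls the finish method of the Rules. startElement calls the begin method of the matching Rules, and endElement calls their body and end methods.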

We set breakpoints in these four methods. The first one hit is startDocument:

public void startDocument() throws SAXException {
    if (saxLog.isDebugEnabled()) {
        saxLog.debug("startDocument()");
    }

    if (locator instanceof Locator2) {
        if (root instanceof DocumentProperties.Charset) {
            String enc = ((Locator2) locator).getEncoding();
            if (enc != null) {
                try {
                    ((DocumentProperties.Charset) root).setCharset(B2CConverter.getCharset(enc));
                } catch (UnsupportedEncodingException e) {
                    log.warn(sm.getString("digester.encodingInvalid", enc), e);
                }
            }
        }
    }

    // ensure that the digester is properly configured, as
    // the digester could be used as a SAX ContentHandler
    // rather than via the parse() methods.
    configure();
}

This method only performs initialization and has nothing especially noteworthy. Next comes the more important startElement method, whose core is:

public void startElement(String namespaceURI, String localName, String qName, Attributes list)
        throws SAXException {
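    // Look up the rules whose pattern matches the current element path (match) in this namespace.
    // match and name are computed in code omitted from this excerpt.
    List<Rule> rules = getRules().match(namespaceURI, match);

    // Call begin on every matched rule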
    for (int i = 0; i < rules.size(); i++) {
        Rule rule = rules.get(i);
        rule.begin(namespaceURI, name, list);
    }
}

Let's set a breakpoint on the rule.begin line and take a look.

The first element is Server, so startElement fires for it first and triggers the begin method of three Rules: ObjectCreateRule, SetPropertiesRule and SetNextRule. Recall this part of createStartDigester shown earlier:

digester.addObjectCreate("Server",
        "org.apache.catalina.core.StandardServer",
        "className");
digester.addSetProperties("Server");
digester.addSetNext("Server",
        "setServer",
        "org.apache.catalina.Server");

The first argument of these add methods is the match pattern, so three rules match Server.
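The matching step can be imitated with a few lines of plain SAX code. The sketch below is a simplification under stated assumptions: PathMatchDemo is a hypothetical class, rules are represented only by their names, and patterns are compared by exact string equality, which is narrower than what Tomcat's rule matching supports.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical miniature of the matching step: rule names are registered per element path.
final class PathMatchDemo extends DefaultHandler {
    private final Map<String, List<String>> rulesByPattern = new HashMap<>();
    private final List<String> fired = new ArrayList<>();
    private String match = "";

    void addRule(String pattern, String ruleName) {
        rulesByPattern.computeIfAbsent(pattern, k -> new ArrayList<>()).add(ruleName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Extend the current path, e.g. "Server" becomes "Server/Listener".
        match = match.isEmpty() ? qName : match + "/" + qName;
        for (String ruleName : rulesByPattern.getOrDefault(match, Collections.emptyList())) {
            fired.add(match + " -> " + ruleName);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Drop the last path segment when the element closes.
        int slash = match.lastIndexOf('/');
        match = slash >= 0 ? match.substring(0, slash) : "";
    }

    // Registers a few rule names, parses the XML and returns which rules fired where.
    static List<String> run(String xml) {
        PathMatchDemo demo = new PathMatchDemo();
        demo.addRule("Server", "ObjectCreateRule");
        demo.addRule("Server", "SetPropertiesRule");
        demo.addRule("Server", "SetNextRule");
        demo.addRule("Server/Listener", "ListenerCreateRule");
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), demo);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return demo.fired;
    }
}
```

Three entries fire for the path Server because three rule names were registered under that pattern, which mirrors the three add calls above.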

The begin method of ObjectCreateRule:

public void begin(String namespace, String name, Attributes attributes) throws Exception {

    String realClassName = getRealClassName(attributes);

    if (realClassName == null) {
        throw new NullPointerException(sm.getString("rule.noClassName", namespace, name));
    }

    // Instantiate the new object and push it on the context stack
    Class<?> clazz = digester.getClassLoader().loadClass(realClassName);
    Object instance = clazz.getConstructor().newInstance();
    digester.push(instance);
}

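It simply uses reflection to create an instance of org.apache.catalina.core.StandardServer, the class named by the second argument of addObjectCreate. The third argument, "className", names an XML attribute that can supply a different class; for rules registered with null as the second argument, that attribute has to be present on the element.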
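The class-name resolution and the reflective instantiation can be reproduced in isolation. This is a sketch of the idea rather than Tomcat's code: ObjectCreateDemo and its create helper are hypothetical, and the attribute value is passed in directly instead of being read from SAX Attributes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class ObjectCreateDemo {
    // Hypothetical helper: the attribute value, when present, overrides the default class name.
    static Object create(String defaultClassName, String classNameAttribute) {
        String realClassName = classNameAttribute != null ? classNameAttribute : defaultClassName;
        if (realClassName == null) {
            throw new IllegalArgumentException("No class name specified");
        }
        try {
            // Load the class by name and call its public no-argument constructor.
            Class<?> clazz = Class.forName(realClassName);
            return clazz.getConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + realClassName, e);
        }
    }

    public static void main(String[] args) {
        // The created objects go onto a stack, as digester.push(instance) does.
        Deque<Object> stack = new ArrayDeque<>();
        stack.push(create("java.util.ArrayList", null));
        stack.push(create("java.util.ArrayList", "java.util.LinkedList"));
        System.out.println(stack.peek().getClass().getName());
    }
}
```

JDK collection classes stand in for StandardServer here so that the example runs without Tomcat on the classpath.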

As its name suggests, the begin method of SetPropertiesRule sets properties on the Server object, in this case the port and shutdown attributes.
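A rough idea of how attributes can be turned into setter calls is sketched below. ToyServer and SetPropertiesDemo are hypothetical, and the sketch handles only String and int setters; Tomcat's actual implementation is more general.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bean standing in for StandardServer.
final class ToyServer {
    private int port;
    private String shutdown;

    public void setPort(int port) { this.port = port; }
    public void setShutdown(String shutdown) { this.shutdown = shutdown; }
    public int getPort() { return port; }
    public String getShutdown() { return shutdown; }
}

final class SetPropertiesDemo {
    // For each attribute, find a one-argument setter named set<Name> and invoke it,
    // converting the string value to int when the setter expects one.
    static void apply(Object target, Map<String, String> attributes) {
        for (Map.Entry<String, String> attr : attributes.entrySet()) {
            String name = attr.getKey();
            String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
            for (Method method : target.getClass().getMethods()) {
                if (!method.getName().equals(setter) || method.getParameterCount() != 1) {
                    continue;
                }
                Class<?> type = method.getParameterTypes()[0];
                Object value;
                if (type == String.class) {
                    value = attr.getValue();
                } else if (type == int.class) {
                    value = Integer.parseInt(attr.getValue());
                } else {
                    continue; // other parameter types are not handled in this sketch
                }
                try {
                    method.invoke(target, value);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("Cannot call " + setter, e);
                }
                break;
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("port", "8005");
        attrs.put("shutdown", "SHUTDOWN");
        ToyServer server = new ToyServer();
        apply(server, attrs);
        System.out.println(server.getPort() + " " + server.getShutdown());
    }
}
```

Attributes without a matching setter are silently skipped in this sketch.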

Note that endElement is not called right after startElement; it is called only when the end of the element is reached. The first element to reach endElement is therefore Listener. The core of endElement is:

public void endElement(String namespaceURI, String localName, String qName)
        throws SAXException {
    List<Rule> rules = matches.pop();

    String bodyText = this.bodyText.toString().intern();
    for (int i = 0; i < rules.size(); i++) {
        Rule rule = rules.get(i);
        rule.body(namespaceURI, name, bodyText);
    }

    for (int i = 0; i < rules.size(); i++) {
        int j = (rules.size() - i) - 1;
        Rule rule = rules.get(j);
        rule.end(namespaceURI, name);
    }
}

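matches is a stack of rule lists. startElement also pushes the list of rules it matched onto this stack, so the pop here retrieves them without searching for the rules a second time.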
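The bookkeeping can be modelled without any XML at all. In the sketch below, MatchStackDemo is a hypothetical class and rules are represented by their names; it shows the push in startElement, the pop in endElement, and the fact that end is called in the reverse order of begin, as in the second loop of the excerpt above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical miniature of the begin/end bookkeeping; rules are just names here.
final class MatchStackDemo {
    private final Deque<List<String>> matches = new ArrayDeque<>();
    final List<String> log = new ArrayList<>();

    void startElement(List<String> matchedRules) {
        // Remember the matched rules so endElement can reuse them.
        matches.push(matchedRules);
        for (String rule : matchedRules) {
            log.add("begin " + rule);
        }
    }

    void endElement() {
        // The rules for the element being closed are on top of the stack.
        List<String> rules = matches.pop();
        for (int i = rules.size() - 1; i >= 0; i--) {
            log.add("end " + rules.get(i));
        }
    }

    public static void main(String[] args) {
        MatchStackDemo demo = new MatchStackDemo();
        demo.startElement(Arrays.asList("ObjectCreateRule", "SetPropertiesRule", "SetNextRule"));
        demo.startElement(Arrays.asList("ListenerCreateRule"));
        demo.endElement(); // closes the inner element
        demo.endElement(); // closes the outer element
        demo.log.forEach(System.out::println);
    }
}
```

Because elements nest, the inner element's rules are popped and ended before the outer element's rules, and a stack is the natural structure for that.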

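The rule.body call in endElement hands the element's text content to each rule. No element in this example carries meaningful text content, so body has nothing to work with here. An element that does have body text looks like this: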

<a>test</a>

Here test is the bodyText.
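In SAX, text content arrives through the characters callback, so a handler has to accumulate it until the element ends. The sketch below shows one way to do that with a stack of buffers, one per open element. BodyTextDemo is a hypothetical class, and this is not a copy of Digester's own bookkeeping.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that collects the direct text content of each element.
final class BodyTextDemo extends DefaultHandler {
    private final Deque<StringBuilder> bodyTexts = new ArrayDeque<>();
    final Map<String, String> bodies = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Each open element gets its own buffer.
        bodyTexts.push(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver text in several chunks, so append rather than assign.
        StringBuilder current = bodyTexts.peek();
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The buffer on top belongs to the element being closed.
        bodies.put(qName, bodyTexts.pop().toString());
    }

    // Parses the XML and returns the body text recorded for each element name.
    static Map<String, String> parse(String xml) {
        BodyTextDemo handler = new BodyTextDemo();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return handler.bodies;
    }
}
```

For the input <a>test</a> the recorded body of a is "test", while a self-closing element such as <b/> ends up with an empty string.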

Last comes endDocument, whose main job is:

public void endDocument() throws SAXException {
    // Call finish on every rule
    for (Rule rule : getRules().rules()) {
        rule.finish();
    }

    // Perform final cleanup
    clear();
}

Conclusion

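That wraps up Digester. In short, Digester extends DefaultHandler2 and defines many rules to process elements, mainly a series of initialization steps such as creating objects, assigning object properties and calling object methods.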