Refactored PD4ML API
New class package structure
Since PD4ML v4, the public API classes have moved to the com.pd4ml package. The fully qualified name of the main converter class is now com.pd4ml.PD4ML. For backward compatibility we created a wrapper, org.zefer.PD4ML (with accompanying utility classes), which makes it possible to use the newest pd4ml.jar in applications compiled against PD4ML v3. The wrapper class translates the old API calls to new ones (where possible).
Separated source read and target write methods
In the new API the conversion process is split into two phases: reading and parsing HTML with the pd4ml.readHTML(...) method, and writing the target format with methods such as pd4ml.renderAsImages(...). This approach lets you read a source document only once and write multiple output types. It also makes it possible to analyze the parsed document's metrics (for example, the maximum width of the HTML content) and to choose the most suitable target paper format.
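The metric-driven choice of paper format mentioned above can be sketched independently of PD4ML. This is only a minimal illustration, assuming the content width (in points) has already been obtained from the parsed document. The class name PaperChooser, the candidate format list and the margin parameter are hypothetical and are not part of the PD4ML API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the PD4ML API: picks the smallest
// portrait paper format whose printable width fits the measured content.
final class PaperChooser {

    // ISO 216 portrait widths in PDF points (1 pt = 1/72 inch), rounded.
    // Insertion order (smallest first) is preserved by LinkedHashMap.
    private static final Map<String, Integer> WIDTHS_PT = new LinkedHashMap<>();
    static {
        WIDTHS_PT.put("A5", 420);
        WIDTHS_PT.put("A4", 595);
        WIDTHS_PT.put("A3", 842);
    }

    // contentWidthPt: assumed to come from the parsed document's metrics.
    // marginPt: horizontal margin, applied on both the left and right side.
    static String choose(int contentWidthPt, int marginPt) {
        for (Map.Entry<String, Integer> e : WIDTHS_PT.entrySet()) {
            if (contentWidthPt <= e.getValue() - 2 * marginPt) {
                return e.getKey();
            }
        }
        return "A3"; // nothing fits: fall back to the largest candidate
    }

    public static void main(String[] args) {
        System.out.println(choose(500, 36));
    }
}
```

In a real application the chosen format would then be applied before the write phase, so the document still has to be read only once.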
Less dependency on Java AWT
Unfortunately, it is not yet possible to avoid Java AWT classes in PD4ML completely: AWT is used to read font metrics, to write to BufferedImage, etc. The new version reduces the Java AWT dependency in the public API, where AWT classes are replaced with more suitable custom ones.
Utility methods
PD4ML v4 includes many features that were previously available only in command-line mode or in third-party tools. Directly from your Java application you can now index a font directory, analyze PDF documents, merge PDFs, remove selected pages, apply or reset PDF security settings, update metadata, etc.
Page marks and decoration elements
In addition to the tags well known from previous versions, such as <pd4ml:page.footer>, we added new proprietary tags, such as <pd4ml:watermark>. All of these tags let you define a page header, footer, background or watermark in HTML and specify the page scope it applies to (front page, even pages, odd pages or an explicitly specified page number range). The tags can be placed directly in the source HTML, or they can be applied with the corresponding PD4ML API methods.
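To make the idea of a page scope concrete, here is a minimal sketch of how such a scope could be evaluated for a given page. The class name PageScope and the scope strings ("first", "even", "odd", "3-5") are assumptions made for this illustration only. They are not the attribute syntax or the API of PD4ML.

```java
// Hypothetical model of a page scope. The scope strings used here are
// illustrative assumptions, not PD4ML's actual scope syntax.
final class PageScope {

    // page is 1-based. Returns true if a decoration with the given scope
    // should be applied to that page.
    static boolean applies(String scope, int page) {
        switch (scope) {
            case "first":
                return page == 1;
            case "even":
                return page % 2 == 0;
            case "odd":
                return page % 2 == 1;
            default:
                // Explicit page or range: "7" or "3-5" (inclusive bounds).
                String[] bounds = scope.split("-");
                int from = Integer.parseInt(bounds[0].trim());
                int to = bounds.length > 1
                        ? Integer.parseInt(bounds[1].trim())
                        : from;
                return page >= from && page <= to;
        }
    }

    public static void main(String[] args) {
        System.out.println(applies("3-5", 4));
    }
}
```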
Plug-in interface for custom tags
With the new PD4ML it is possible to register a custom HTML tag and assign a custom Java handler class to it. The handler receives the parsed tag attributes, the tag's outer HTML as a string, and a graphics context to print to (to draw text, graphics primitives, etc.). A typical application of the interface is plugging in external SVG and MathML renderers.
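The registration and dispatch pattern described above might look roughly like the following sketch. All type and method names here (TagHandler, TagRegistry, register, dispatch) are hypothetical: they mirror the description, not PD4ML's real plug-in interface, and a StringBuilder stands in for the graphics context.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical handler contract modeled on the description above.
interface TagHandler {
    // attributes: parsed tag attributes; outerHtml: the tag's outer HTML;
    // out: a stand-in for the graphics context a real handler would draw to.
    void render(Map<String, String> attributes, String outerHtml, StringBuilder out);
}

// Hypothetical registry that maps tag names to handlers.
final class TagRegistry {
    private final Map<String, TagHandler> handlers = new HashMap<>();

    void register(String tagName, TagHandler handler) {
        // HTML tag names are case-insensitive, so normalize the key.
        handlers.put(tagName.toLowerCase(Locale.ROOT), handler);
    }

    // Returns true if a handler was registered for the tag and was invoked.
    boolean dispatch(String tagName, Map<String, String> attributes,
                     String outerHtml, StringBuilder out) {
        TagHandler handler = handlers.get(tagName.toLowerCase(Locale.ROOT));
        if (handler == null) {
            return false;
        }
        handler.render(attributes, outerHtml, out);
        return true;
    }
}
```

A renderer for a tag such as math would be registered once, and the conversion engine would call dispatch whenever it meets that tag in the parsed document.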
Web fonts support
In addition to the regular way of embedding TTFs (taken from a local preconfigured font directory or from fonts.jar), PD4ML supports referencing Web fonts using the standard CSS syntax. If Web fonts are specified, PD4ML downloads them from the provided URL (either remote or local) and uses them in the conversion process. Currently the TTF, OTF (with TrueType font outlines) and WOFF font file types are supported. WOFF2 support is planned.
Endnotes support
In addition to footnotes support we implemented a proprietary <pd4ml:endnote> tag. The tag, if present in the source HTML, is substituted with an endnote index. All nested content is moved to the end of the document and presented as an indexed list, similar to footnotes. The feature can be useful for creating a bibliography section, an index of graphs, etc.
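The substitution described above can be illustrated with a small string-level sketch. It is not how PD4ML implements endnotes (the real engine works on the parsed document, not on regular expressions). The class name EndnoteSketch and the output markup (sup and ol elements) are assumptions chosen for the example, and nested or attribute-bearing endnote tags are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: replaces each endnote tag with its index and
// appends the collected endnote bodies as a list at the end.
final class EndnoteSketch {

    private static final Pattern ENDNOTE =
            Pattern.compile("<pd4ml:endnote>(.*?)</pd4ml:endnote>", Pattern.DOTALL);

    static String process(String html) {
        List<String> notes = new ArrayList<>();
        Matcher m = ENDNOTE.matcher(html);
        StringBuffer body = new StringBuffer();
        while (m.find()) {
            notes.add(m.group(1));
            // The 1-based index of the note replaces the tag in the text flow.
            m.appendReplacement(body, "<sup>" + notes.size() + "</sup>");
        }
        m.appendTail(body);
        if (notes.isEmpty()) {
            return body.toString();
        }
        StringBuilder out = new StringBuilder(body);
        out.append("<ol>");
        for (String note : notes) {
            out.append("<li>").append(note).append("</li>");
        }
        out.append("</ol>");
        return out.toString();
    }
}
```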
Print and/or screen targeted document watermarking
You can now specify arbitrary HTML content as a watermark and apply properties such as transparency and angle to it. Additionally, the watermark can be targeted at a particular medium: for example, no watermark in screen view, but a watermarked print output.
PDF/UA support
The new PD4ML architecture has been designed with PDF/UA (International Standard ISO 14289 for accessible PDF technology) in mind. PD4ML can now output Tagged PDF that conforms to the PDF/UA and PDF/A-2a standards.
Visual refinements
The new version changes the look and feel of form widgets (now defined with scalable vector graphics), adds support for rounded borders and adds partial support for gradient fills.
HTML injection
A new PD4ML API method allows you to virtually inject an arbitrary portion of HTML code right after the opening <body> tag or just before the closing </body> tag of a source document.
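The effect of such an injection can be shown at string level. The sketch below is not the PD4ML API method itself (its name and signature are not given here). The class HtmlInjectionSketch and its two methods are hypothetical and only demonstrate where the injected fragment ends up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: shows the two injection points on a raw HTML string.
final class HtmlInjectionSketch {

    private static final Pattern BODY_OPEN =
            Pattern.compile("<body\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern BODY_CLOSE =
            Pattern.compile("</body\\s*>", Pattern.CASE_INSENSITIVE);

    // Inserts the snippet right after the first opening body tag.
    static String injectAfterBodyOpen(String html, String snippet) {
        Matcher m = BODY_OPEN.matcher(html);
        if (!m.find()) {
            return html; // no body tag: leave the document untouched
        }
        return html.substring(0, m.end()) + snippet + html.substring(m.end());
    }

    // Inserts the snippet just before the last closing body tag.
    static String injectBeforeBodyClose(String html, String snippet) {
        Matcher m = BODY_CLOSE.matcher(html);
        int start = -1;
        while (m.find()) {
            start = m.start();
        }
        if (start < 0) {
            return html;
        }
        return html.substring(0, start) + snippet + html.substring(start);
    }
}
```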
Page Number Tag
The new version adds support for a proprietary <pd4ml:page.number> tag. The tag (without attributes) is replaced with the total number of pages in the resulting document. With the "of" attribute, as in <pd4ml:page.number of="anchorName">, the tag is replaced with the number of the page that contains the referenced
<a name="anchorName"> or an element with
More HTML5/CSS3 Support
HTML5 tags support
The new HTML rendering engine is optimized for HTML5. We do not claim full support for the HTML5 specification: some features are irrelevant for PDF/RTF conversion, and we may have undervalued the usefulness of others. The most important tags and features are already there, however. Thanks to the new architecture, any missing feature can be added with small or moderate effort.
Full HTML table tags support
We completely refactored the table rendering subsystem and implemented fairly sophisticated table page break logic. It now also supports all previously ignored table-specific tags, such as <col>, and correctly implements fixed table layout (in addition to the default auto table layout). The table layout logic has also been reworked for better performance.
Selected CSS at-rule directives support
PD4ML v4 introduces support for selected CSS at-rules, such as @page. The supported at-rules are used to download TTF/WOFF fonts and register them in the font cache, and (with @page) to define a document-specific target page format, including margins.
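As an illustration, the following sketch builds a small source document that declares a Web font and a page format in a stylesheet. It assumes the standard CSS syntax (an @font-face rule for the font, and the size and margin declarations of @page). The class name AtRuleExample, the font family "ExampleSans" and the font URL are hypothetical, and which individual declarations PD4ML honors is not confirmed here.

```java
// Builds an HTML source document whose stylesheet uses CSS at-rules.
// The font family name and URL are placeholders for this example.
final class AtRuleExample {

    static String buildHtml(String fontUrl) {
        return "<html><head><style>\n"
             // Standard CSS Web font declaration (WOFF in this example).
             + "@font-face {\n"
             + "  font-family: \"ExampleSans\";\n"
             + "  src: url(\"" + fontUrl + "\") format(\"woff\");\n"
             + "}\n"
             // Standard CSS paged-media rule: page format and margins.
             + "@page { size: A4; margin: 20mm; }\n"
             + "body { font-family: \"ExampleSans\", sans-serif; }\n"
             + "</style></head>\n"
             + "<body><p>Hello</p></body></html>";
    }

    public static void main(String[] args) {
        System.out.println(buildHtml("https://example.com/fonts/example-sans.woff"));
    }
}
```

The resulting string would then be passed to the converter as the source document.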
More CSS functions supported
The improved CSS parser and cascading engine of PD4ML implements new CSS functions for color value computation (rgb(), rgba(), hsl(), hsla()), opacity control (alpha()) and general calculations (calc()). Support for more functions is planned.
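For readers unfamiliar with hsl(), the sketch below shows the standard conversion from hue, saturation and lightness to 8-bit RGB components. It illustrates what a color computation for hsl() involves and is not taken from the PD4ML source. The class name HslToRgb is an assumption for the example.

```java
// Standard HSL to RGB conversion, shown for illustration.
final class HslToRgb {

    // h: hue in degrees (any value, wrapped to [0, 360));
    // s, l: saturation and lightness as fractions in [0, 1].
    // Returns {red, green, blue}, each in the range 0..255.
    static int[] convert(double h, double s, double l) {
        h = ((h % 360) + 360) % 360;                 // normalize negative hues
        double c = (1 - Math.abs(2 * l - 1)) * s;    // chroma
        double x = c * (1 - Math.abs((h / 60) % 2 - 1));
        double m = l - c / 2;                        // lightness offset
        double r, g, b;
        if (h < 60)       { r = c; g = x; b = 0; }
        else if (h < 120) { r = x; g = c; b = 0; }
        else if (h < 180) { r = 0; g = c; b = x; }
        else if (h < 240) { r = 0; g = x; b = c; }
        else if (h < 300) { r = x; g = 0; b = c; }
        else              { r = c; g = 0; b = x; }
        return new int[] { to255(r + m), to255(g + m), to255(b + m) };
    }

    private static int to255(double v) {
        return (int) Math.round(v * 255);
    }
}
```

For example, hsl(120, 100%, 50%) corresponds to convert(120, 1.0, 0.5), which yields pure green.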
New optimized single-pass HTML parser
As a significant part of our efforts to improve performance, we developed an optimized single-pass HTML parser. The parser implicitly normalizes non-well-formed HTML and builds a DOM-like document representation in RAM. A second parsing pass is triggered only if the source document overrides the document encoding, or if the parser encounters a <style> section nested in <body>, as the style can potentially affect the already parsed part of the document.
New resource cache
All resource requests from PD4ML (to load images, stylesheets, fonts, etc.) are dispatched via the new cache engine. The cache stores frequently used items locally (in RAM or in a temp directory), tracks their expiration time (if specified via HTTP), and frees cache memory if the cache exceeds a reasonable size. The cache engine also solves the JDK issue of flooding the temp directory with locked objects, created on each
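The combination of expiration tracking and size limiting can be sketched with a small in-memory cache. This is a simplified illustration under stated assumptions, not PD4ML's cache engine: the class ResourceCacheSketch is hypothetical, the size limit is counted in entries rather than bytes, eviction is least-recently-used, and the current time is passed in explicitly to keep the example deterministic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative in-memory cache with per-item expiry and LRU eviction.
final class ResourceCacheSketch {

    private static final class CachedItem {
        final byte[] data;
        final long expiresAtMillis;

        CachedItem(byte[] data, long expiresAtMillis) {
            this.data = data;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final LinkedHashMap<String, CachedItem> entries;

    ResourceCacheSketch(final int maxEntries) {
        // accessOrder = true keeps the least recently used entry first,
        // so removeEldestEntry evicts it once the limit is exceeded.
        this.entries = new LinkedHashMap<String, CachedItem>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, CachedItem> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized void put(String url, byte[] data, long ttlMillis, long nowMillis) {
        entries.put(url, new CachedItem(data, nowMillis + ttlMillis));
    }

    // Returns the cached bytes, or null if the item is absent or expired.
    synchronized byte[] get(String url, long nowMillis) {
        CachedItem item = entries.get(url);
        if (item == null) {
            return null;
        }
        if (nowMillis >= item.expiresAtMillis) {
            entries.remove(url); // expired: drop it so it can be reloaded
            return null;
        }
        return item.data;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

A production cache would also need to spill to disk and honor HTTP cache headers, which this sketch leaves out.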
New font engine
The completely reimplemented PD4ML font engine efficiently looks up the most suitable font from the list of available ones, matching the requested font family, face and style as well as the font's ability to render a given text string. The font engine can handle multiple font folders, work with downloaded Web fonts, and auto-index and use system fonts (optionally filtered by specified criteria).
New PDF/Image output modules
PD4ML implements new PDF and raster image output subsystems, optimized for better performance and a smaller memory footprint. The RTF output module is ported from the latest PD4ML v3 with minimal changes and retains its output quality and performance.
Product distribution changes
Apache Maven
The PD4ML development process is based on Apache Maven and its project object model (POM). Maven lets us manage the project's build, testing, reporting and documentation from a central piece of information. From the customer's perspective, the major benefit of this Maven-centric approach is the instant availability of the newest versions and nightly snapshot builds in our public Maven software repository.
Of course, customers can still obtain PD4ML from the usual software download area.
Continuous delivery
The new PD4ML development infrastructure is built on continuous delivery (CD) principles. The development process is organized to produce software in short cycles, ensuring that the software can be released automatically and reliably at any time.
License API
Since PD4ML v4 we build and deliver identical binaries for all license types. License-type-specific features are switched on or off depending on the license activation code. The code can be passed directly to the PD4ML constructor as a string, or provided to PD4ML API as