%begin{latexonly} \newif\ifpdf \ifx\pdfoutput\undefined \pdffalse \else \pdfoutput=1 \pdftrue \fi % Change this as needed : % - a4paper to your paper format % - the document class to your need (book, article, ...) \ifpdf \documentclass[a4paper, 12pt, pdftex]{report} \else %end{latexonly} \documentclass[a4paper, 12pt, dvips]{report} %begin{latexonly} \fi %end{latexonly} % The packages we need \usepackage{verbatim} \usepackage{moreverb} \usepackage{url} \usepackage{tabularx} \usepackage[final]{graphicx} \usepackage[hyperindex,breaklinks=true,pdfborder={0 0 0}]{hyperref} %begin{latexonly} \ifpdf \hypersetup{colorlinks=true,linkcolor=blue,urlcolor=blue,citecolor=red} \fi %end{latexonly} \usepackage{html} \begin{htmlonly} \newcommand{\href}[2]{\htmladdnormallink{#2}{#1}} \end{htmlonly} % block style paragraphs tend to look better in technical docs \parindent=0in \parskip=10pt \begin{document} % Split Jar Specification \appendix \chapter{Split Jars and MANIFEST Extensions} The jar file specification allows archiving and packaging classes and resources, but is limited in size. To overcome these limitations with minimal changes to the jar format we create a set of jar files and add new key-value attributes to the jar MANIFEST. These attributes indicate now many jars are in the set, and which, if any, files are split across multiple jars, and which jars they are contained in. \section{Motivations and Limitations} Java's zip implementation limited to \~{}2GB. The problem is not solved by zip64 extensions (which will allow larger files), when medium limitations restrict the jar size. There must be a way to split the archive into multiple files, and indeed, the individual entries must be split across jars. A ``Split Jar'' is a set of normal jar files, one being the \emph{primary} jar, and zero or more \emph{secondary} jars. The Primary jar file has additional manifest attributes to help reconstruct the data. Entries may or may not be split across multiple jars, and need to be spliced back together upon extraction. Secondary jar names derive from the basename of the primary jar, and each segment of a split entry, shares a basename derived from the original entry name. Segments of a split entry need not be in separate jars. Thus if a jar is split to deal with media limitations, all the resulting jars may be combined into a single primary jar, as long as the Manifest is correctly updated. A major benefit to this format is that the split archive contents can be recovered manually by extracting the contents of all jar files in the set, and simply concatenating the segments of the split entries. Entry names in the primary and secondary jars must not conflict, so that together they represent a single archive. This includes the generated names of split entry segments. This ensures that each jar in a split archive may be extracted to the same location without risk of loosing data. Split file segments are then be concatenated manually or by automation to get the original data set. The manifest should always be consulted to ensure that files which look like split entry segments should actually be spliced together. It is possible that the files were intended to be part of the archive (See ``Naming Conventions'', for name conflict resolution). All segments of a split jar are given generated names so that normal jar tools will never unpack the original file. This ensures that no unsuspecting user mistakenly uses a truncated, partial file. \subsection{Warnings} Signing jar entries which have been split has not been addressed. Files can not be compressed directly into streams when there are potential name conflicts with the generated segment names. This requires that robust tools collect a list of files to be added, and determine any conflicts first to avoid the issue (See Naming Conventions: Entry Names). Adding files to existing split jars may also have problems with name conflicts. \section{Naming Conventions} A primary goal for this design is to allow split jars to be created and unpacked manually with minimal problems. This is accomplished by using a naming convention which lends to visual reconstruction. When a jar file must be split into multiple segments, there is a primary file, and multiple secondary jars with a common name. When an entry within the set of jars must be split, \emph{each} segment is given a numbered suffix. \subsection{Jar File Names} For the primary jar \texttt{\emph{basename}.jar}, the names of secondary jars must always be \texttt{\emph{basename}.split\#.jar} where \texttt{\#} is an integer \emph{secondary jar ID} starting at \texttt{1}. Left padded zeros in the ID are ignored, and encouraged to allow lexicographical sorting. The jars can be renamed, as long as the \emph{basename} is the same for all, and the suffixes (\texttt{.split\#.jar}) remain the same. All entries within the set must be unique. \subsection{Jar Entry Names} For the split entry named \texttt{\emph{basename}} (including suffixes), all segments are named using the template: \texttt{\emph{basename}}\texttt{.---\#.\~{}}, where \texttt{\#} is an integer \emph{segment ID} starting at \texttt{0}. These segments are rejoined by concatenating the segments in numeric order, to a file named \texttt{basename}. The template is recorded in the \emph{main} section of the manifest. In the rare case where an entry is split, and the name of a real entry may conflicts with a generated segment name, a non-default suffix template is used. In Our case, all of the generated segments will have '\texttt{\~{}}' characters appended, as needed, to eliminate potential conflicts. This non-default template is recorded in the \emph{per-entry} section of the manifest for the split entry. Non-default suffixes are used for all \emph{potential} conflicts even in cases where there is no actual conflict. \begin{itemize} \item When the split entry does not generate enough segments to conflict, but the suffix matches the default template. \item When the conflicting real entry must also be split, thus its actual entries use generated suffixes. \end{itemize}\ Examples are given below. Other tools implementing split jars may (though are not encouraged to) use different suffixes, though they must have numeric segment replaced by '\#' in the manifest. Tools must sort these numerically, not lexicographically as ``2'' is generally greater than ``10'' lexicographically. However, tools are encouraged to zero padding names, as needed, so that lexicographic sorting is correct. \section{Manifest Attributes} To minimize changes needed to implement the split jar, we simply add attributes to the manifest. Additional attributes are ignored by other jar tools, so the only consequences is that files split files, and files completely located in secondary jars will not be available to them. To prevent adding too much space overhead, and allow jar files to be renamed, the entries are kept minimalistic. \subsection{Main Section Attributes} Two attribute are added to indicate the number of secondary jars, and the default suffix added to the segments of split files. % TODO: make this like an html
...
...
\begin{itemize} \item \texttt{Split-Jar-Secondary-Count}: The number of secondary jars in the set. \item \texttt{Split-Jar-Secondary-Suffix}: the suffix template inserted prior to the \texttt{.jar} suffix typical of jar files, to make the names of secondary jar file in the set; typically \texttt{.split\#}. \item \texttt{Split-Entry-Suffix}: the suffix template appended to an entry name, to name each of the entries constituent parts; typically \texttt{.---\#.\~{}}. The \# char indicates the location of the numeric value. This cannot currently be changed. \end{itemize} \subsection{Per-Entry Section Attributes} Only files which are split require an attributes in the manifest. A space separated list of integers is recorded; one for each jar containing a segment of the entry. Entries which have a segment in the primary jar file, indicate this with the id \texttt{0}. No restriction is placed on the order of the entries, or the IDs of the jar in which any segment is contained. % TODO: make this like an html
...
...
\begin{itemize} \item \texttt{Split-Entry-Jar-IDs}: A space separated set of secondary jar IDs which contains the segments of the entry. Essentially a list of integers. \item \texttt{Split-Entry-Suffix}: Overrides the default Split-Entry-Suffix specified in the Main-Attributes. Needed when one (or more) '\texttt{\~{}}' chars are appended due to name conflict with real entries. This is not strictly necessary, as simply knowing the basename and unpacking all jars would allow the suffix to be determined, but is included to conserve processing. This is currently not user configurable. \end{itemize} \section{Examples} Two examples, one simple, and another cluttered with pathological cases. Notice that the jar ID number and segment of a split entry have no correlation. In most applications, there will seldom be more than two segments in a single file: the end of the last entry to the previous jar, and maybe the last entry of this jar, which is continued in the next. The examples aren't so well organized though. :-) \subsection{Basic Example} % TODO: format for tex TODO: format for TeX Jar to Create ------------- example.jar Files to Compress ----------------- movie.mpeg README song.mp3 text.txt Entries in Jars --------------- example.jar movie.mpeg.---0.\~{} README example.split1.jar movie.mpeg.---1.\~{} song.mp3.---0.\~{} example.split2.jar movie.mpeg.---2.\~{} song.mp3.---1.\~{} text.txt MANIFEST (primary jar only) --------------------------- Manifest-Version: 1.0 Created-By: 1.4.2\_04-b05 (Sun Microsystems Inc.) Built-By: IzPack 1.6.0 Main-Class: com.izforge.izpack.installer.Installer Split-Jar-Secondary-Count: 2 Split-Entry-Suffix: .---\#.\~{} movie.mpg Split-Entry-Jar-IDs: 0 1 2 song.mp3 Split-Entry-Jar-IDs: 1 2 \subsection{Name Conflicts} Pathological example showing name conflict resolution. Includes \begin{itemize} \item Direct conflict with real archive file (\texttt{foo...}). \item Indirect conflict with file by suffix template only (\texttt{bar...}). \item Conflict with real archive file that is also split. Due to both being split, there would be no name conflict amongst jar entries, however The default suffix is not used anyway (\texttt{yin...}). \item A \emph{near} conflict, just to be annoying. Normal behavior (\texttt{chi...}). \item Files which look like segments of a split file, but are not, requiring manifest to know the difference (\texttt{zig...}). \end{itemize} \begin{verbatim} Jar to Create ------------- example.jar Files to Compress ----------------- foo.dat foo.dat.---0.~{} .... Extremely unlikely that these would exist, much less need to be archived. Provided as an example. bar.dat bar.dat.---555.~{} .. Another unlikely case which would not conflict (assume bar.dat is split into only 2 segments) except for the suffix template. yin.dat yin.dat.---2.~{} .... Yet another template only conflict conflicting file needs to be split. chi.dat chi.dat.---0.~{}~{} ... No potential conflict. zig.dat.---0.~{} .... Files to be archived as they are, but not zig.dat.---1.~{} intended to be spliced back together. Entries in Jars --------------- example.jar foo.dat.---0.~{} foo.dat.---0.~{}~{} bar.dat.---555.~{} bar.dat.---0.~{}~{} yin.dat.---0.~{}~{} yin.dat.---2.~{}.---0.~{} chi.dat.---0.~{} chi.dat.---2.~{}~{} zig.dat.---0.~{} zig.dat.---1.~{} example.split1.jar foo.dat.---1.~{}~{} bar.dat.---1.~{}~{} yin.dat.---1.~{}~{} yin.dat.---2.~{}.---1.~{} chi.dat.---1.~{} MANIFEST (primary jar only) --------------------------- Manifest-Version: 1.0 IzPack-Version: X.X.X Created-By: 1.4.2_04-b05 (Sun Microsystems Inc.) Built-By: IzPack Class-Path: Main-Class: com.izforge.izpack.installer.Installer Split-Jar-Secondary-Count: 2 Split-Entry-Suffix: .---#.~{} foo.dat Split-Entry-Jar-IDs: 0 1 Split-Entry-Suffix: .---#.~{}~{} bar.dat Split-Entry-Jar-IDs: 0 1 Split-Entry-Suffix: .---#.~{}~{} fig.dat Split-Entry-Jar-IDs: 0 1 Split-Entry-Suffix: .---#.~{}~{} fig.dat.---2.~{} Split-Entry-Jar-IDs: 0 1 moa.dat Split-Entry-Jar-IDs: 0 1 \end{verbatim} \end{document}