%begin{latexonly}
\newif\ifpdf
\ifx\pdfoutput\undefined
\pdffalse
\else
\pdfoutput=1
\pdftrue
\fi
% Change this as needed :
% - a4paper to your paper format
% - the document class to your need (book, article, ...)
\ifpdf
\documentclass[a4paper, 12pt, pdftex]{report}
\else
%end{latexonly}
\documentclass[a4paper, 12pt, dvips]{report}
%begin{latexonly}
\fi
%end{latexonly}

% The packages we need
\usepackage{verbatim}
\usepackage{moreverb}
\usepackage{url}
\usepackage{tabularx}
\usepackage[final]{graphicx}
\usepackage[hyperindex,breaklinks=true,pdfborder={0 0 0}]{hyperref}
%begin{latexonly}
\ifpdf
\hypersetup{colorlinks=true,linkcolor=blue,urlcolor=blue,citecolor=red}
\fi
%end{latexonly}
\usepackage{html}
\begin{htmlonly}
\newcommand{\href}[2]{\htmladdnormallink{#2}{#1}}
\end{htmlonly}

% block style paragraphs tend to look better in technical docs
\parindent=0in
\parskip=10pt

\begin{document}

% Split Jar Specification
\appendix
\chapter{Split Jars and MANIFEST Extensions}

The jar file specification allows archiving and packaging classes and resources, but a jar file is limited in size. To overcome this limitation with minimal changes to the jar format, we create a set of jar files and add new key-value attributes to the jar MANIFEST. These attributes indicate how many jars are in the set, which files, if any, are split across multiple jars, and which jars contain them.

\section{Motivations and Limitations}

Java's zip implementation is limited to \~{}2GB. The zip64 extensions (which allow larger files) do not solve the problem when media limitations restrict the jar size. There must be a way to split the archive into multiple files and, indeed, to split individual entries across jars.

A ``Split Jar'' is a set of normal jar files: one \emph{primary} jar and zero or more \emph{secondary} jars. The primary jar file has additional manifest attributes to help reconstruct the data.
Entries may or may not be split across multiple jars; split entries need to be spliced back together upon extraction. Secondary jar names derive from the basename of the primary jar, and each segment of a split entry shares a basename derived from the original entry name.

Segments of a split entry need not be in separate jars. Thus, if a jar is split to deal with media limitations, all the resulting jars may be combined into a single primary jar, as long as the manifest is correctly updated.

A major benefit of this format is that the split archive contents can be recovered manually by extracting the contents of all jar files in the set and concatenating the segments of the split entries.

Entry names in the primary and secondary jars must not conflict, so that together they represent a single archive. This includes the generated names of split entry segments. This ensures that each jar in a split archive may be extracted to the same location without risk of losing data. Split file segments are then concatenated, manually or by automation, to recover the original data set.

The manifest should always be consulted to confirm that files which look like split entry segments should actually be spliced together. It is possible that such files were intended to be part of the archive as they are (see ``Naming Conventions'' for name conflict resolution).

All segments of a split entry are given generated names, so normal jar tools never unpack a file under the original entry name. This ensures that no unsuspecting user mistakenly uses a truncated, partial file.

\subsection{Warnings}

Signing of jar entries that have been split has not been addressed.

Files cannot be compressed directly into streams when there are potential name conflicts with the generated segment names. Robust tools must therefore first collect the list of files to be added and determine any conflicts (see ``Naming Conventions'', ``Jar Entry Names''). Adding files to existing split jars may also run into name conflicts.
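
As a sketch of the manual recovery described in this section, suppose a split jar set is laid out as shown below. The names \texttt{app.jar}, \texttt{data/big.bin} and \texttt{data/small.txt} are hypothetical and chosen only for illustration; the jar and segment names assume the default templates described under ``Naming Conventions''.

\begin{verbatim}
(hypothetical names, for illustration only)

app.jar           primary jar; its manifest carries the split attributes
    META-INF/MANIFEST.MF
    data/big.bin.---0.~
app.split1.jar    secondary jar, ID 1
    data/big.bin.---1.~
app.split2.jar    secondary jar, ID 2
    data/big.bin.---2.~
    data/small.txt
\end{verbatim}

Extracting all three jars into the same directory would yield the three segment files and \texttt{data/small.txt}, with no name collisions. After consulting the manifest to confirm that the three files really are segments of one split entry, concatenating them in the order 0, 1, 2 would reproduce \texttt{data/big.bin}; \texttt{data/small.txt} is not split and is used as extracted.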
\section{Naming Conventions}

A primary goal of this design is to allow split jars to be created and unpacked manually with minimal problems. This is accomplished by using a naming convention which lends itself to visual reconstruction.

When a jar file must be split into multiple jars, there is one primary jar and one or more secondary jars with a common basename. When an entry within the set of jars must be split, \emph{each} segment is given a numbered suffix.

\subsection{Jar File Names}

For the primary jar \texttt{\emph{basename}.jar}, the names of secondary jars must always be \texttt{\emph{basename}.split\#.jar} where \texttt{\#} is an integer \emph{secondary jar ID} starting at \texttt{1}. Left-padded zeros in the ID are ignored, and are encouraged because they allow lexicographical sorting.

The jars can be renamed, as long as the \emph{basename} is the same for all of them and the suffixes (\texttt{.split\#.jar}) remain the same. All entry names within the set must be unique.

\subsection{Jar Entry Names}

For a split entry named \texttt{\emph{basename}} (including suffixes), all segments are named using the template \texttt{\emph{basename}}\texttt{.---\#.\~{}}, where \texttt{\#} is an integer \emph{segment ID} starting at \texttt{0}. The segments are rejoined by concatenating them in numeric order into a file named \texttt{\emph{basename}}. The template is recorded in the \emph{main} section of the manifest.

In the rare case where an entry is split and the name of a real entry may conflict with a generated segment name, a non-default suffix template is used. In our case, all of the generated segments have '\texttt{\~{}}' characters appended, as needed, to eliminate potential conflicts. This non-default template is recorded in the \emph{per-entry} section of the manifest for the split entry.

Non-default suffixes are used for all \emph{potential} conflicts, even in cases where there is no actual conflict:
\begin{itemize}
\item When the split entry does not generate enough segments to conflict, but the suffix matches the default template.
\item When the conflicting real entry must also be split, so that its actual entries use generated suffixes.
\end{itemize}

Examples are given below.

Other tools implementing split jars may use different suffixes (though this is not encouraged), but the numeric segment ID must be replaced by '\#' in the template recorded in the manifest. Tools must sort segments numerically, not lexicographically, as ``2'' is generally greater than ``10'' lexicographically. However, tools are encouraged to zero-pad segment IDs, as needed, so that lexicographic sorting is also correct.

\section{Manifest Attributes}

To minimize the changes needed to implement split jars, we simply add attributes to the manifest. Additional attributes are ignored by other jar tools, so the only consequence is that split files, and files located entirely in secondary jars, will not be available to those tools. To avoid adding too much space overhead, and to allow jar files to be renamed, the entries are kept minimal.

\subsection{Main Section Attributes}

Two attributes are added to indicate the number of secondary jars and the default suffix added to the segments of split files.
% TODO: make this like an html