Development/Docs/Dependency Optimization
From Mandriva Community Wiki
RPM packages often carry many virtual dependencies generated at package creation time. Efficient dependency resolution calls for a reduced number of such relations, a goal that could be achieved by optimization at package creation time.
Contents |
[edit] Foreword
This paper presents a system that could greatly reduce the dependency resolution burden in RPM-based systems, with direct implication of performance gain due to reduced complexity of dependency chains and smaller system memory usage. Additional gain arise from reduced package base transfer time over the network for tools such as APT, URPMI and Smart, and from reduced size required for storage of the package base in the local disk.
That said, I don't really expect the system outlined in this paper to be accepted as it departs from conventional RPM wisdom and breaks the widely accepted canon of per-file symbolic dependencies. Religious issues set aside, I hope this work yields further study on dependency optimization and overall system improvement.
[edit] Introduction
[edit] Automatic dependencies
Automatic library and file dependencies are added by RPM as a convenient way to build packages without the need to manually keep track of all components needed by the package being built. The mechanism is to add a dependency on a virtual package named after the needed component. The package that carries the component provides the virtual symbol and is indirectly required by the package which uses the component.
The original purpose of this design is to locate a library or file regardless of the actual package containing the component. It allows free relocation of the component from one package to another without breaking dependencies, and also provides a primitive alternatives system when different, mutually-excluding packages containg versions of a needed component can all satisfy the requirements of the packages depending upon the virtual package name. Special handling can be done with certain libraries (e.g. libc), where the library version is appended to the name of the virtual symbol to improve compatibility.
On the other hand, the system also promotes sloppy packaging pratice regarding to dependencies, often resulting in many unnecessary relations and increased complexity. Automation is no substitute for proper planning.
[edit] Performance issues
Package managers such as RPM, APT, URPMI and Smart must compute dependency closure on each new package installed on the system. Complexity of the algorithmis are not linear, meaning that twice the number of relations will result in more (usually much more) than twice the computation effort needed to resolve the package dependencies. Informal testing has shown that APT in Conectiva Linux could be more than fifty times slower than APT in Debian in a comparable system, with Debian having four times as many available packages as Conectiva Linux (not all available repositories being used in Conectiva Linux). Part of the difference is caused by slower access to the db-format package base in RPM (Dpkg uses plaintext), but a significant slowdown can be caused by the higher number of dependency relations. Further investigation is recommended.
Table 1: APT package statistics
Distribution | Conectiva | Debian |
---|---|---|
Real packages | 12450 | 16609 |
Virtual packages | 10896 | 1478 |
Real/Virtual ratio | 1.14:1 | 11.24:1 |
Time for apt-cache stats | 14.82s (700MHz Xeon SCSI) | 0.51s (300MHz Ezra IDE) |
Table 2: Smart package statistics
Distribution | Conectiva | Debian | Mandriva |
---|---|---|---|
Real packages | 12655 | 15912 | 11129 |
Virtual packages | 10504 | 1323 | 26047 |
Real/Virtual ratio | 1.20:1 | 12.03:1 | 0.43:1 |
Time for Smart stats | 6.10s (3GHz P4 SATA-RAID) | 11.82s (2.4GHz Xeon SCSI) |
Table 2 shows results obtained with Smart, figures are similar but not the same as APT (further investigation is recommended). Smart and APT have similar performance on RPM systems, while APT is significantly faster in Debian, even in a much less powerful machine. Mandriva Linux seems to have an excessive number of virtual packages compared to real packages.
[edit] Caveat
All dependency resolution operations in the paper ignores the glibc to rpm requirement (see UnzipDependency for related information). Not ignoring this dependency causes all packages to depend on the entire core system and makes further analysis meaningless.
[edit] Package dependencies
[edit] Current situation
The current dependency graph for package bash in a Mandriva Cooker system circa the Mandriva Linux 2005 release is shown in Fig. 1. This layout is likely to be similar in other RPM-based Linux systems. All virtual dependencies in this case were automatically generated by RPM at package build time.
Fig. 1: Dependency graph of package bash
[edit] The new approach
Instead of leaving all dependency tracking to be done at package install time, optimization can be done at package creation time. In this study, the following rules are used:
- Virtual package resolution for libraries and files is done at package build time instead of package install time. The reasons are: 1) Multiple packages providing the same component is not the common case, and the alternatives system provides a better solution with a manually provided symbol; and 2) On a mature distribution, moving components from one package to another is a rare situation and can also be handled in per case basis.
- Remove all self-dependencies as they are automatically satisfied.
- Remove all double dependencies between packages since they're as good as a single dependency.
- Virtually all packages requiring ldconfig also require the libc, so ldconfig is added to the libc package.
- Remove the rpm dependency from glibc, as it ties every package to the entire system core.
The resulting dependency graph for package bash after application of these rules is shown in Fig. 2.
Fig. 2: Optimized dependency graph for bash
Note that the optimized dependency graph for bash is similar in its simplicity to the dependency graph of pdksh in Debian, shown in Fig. 3 (generated with apt-cache dotty with conflicts removed).
Fig. 3: Dependency graph for pdksh in Debian
[edit] Complexity analysis
Table 2 shows the number of requirement dependencies for an arbitrary set of packages using standard RPM dependencies with virtual library packages and optimized dependencies. The number of dependencies ratio for the current and optimized cases can be as good as 8% (in package emacs), with the worst value no greater than 31% (in urpmi). The abnormally high number of dependencies for more complex packages hint to a highly inter-dependent system core, which should be investigated and decoupled wherever possible (see Unzip Dependency). Once the core system is sanitized, the number of dependencies should be greatly reduced for all packages.
Table 2: Number of dependencies
Package | Current | Optimized | O/C Ratio |
---|---|---|---|
glibc | 6 | 1 | 0.17 |
bash | 33 | 4 | 0.12 |
bzip2 | 52 | 10 | 0.19 |
sed | 65 | 15 | 0.23 |
gawk | 73 | 15 | 0.21 |
emacs | 96 | 8 | 0.08 |
wget | 143 | 21 | 0.15 |
graphviz | 262 | 65 | 0.25 |
frozen-bubble | 729 | 111 | 0.15 |
less | 911 | 284 | 0.31 |
coreutils | 911 | 284 | 0.31 |
gcc | 967 | 304 | 0.31 |
python | 1027 | 310 | 0.30 |
urpmi | 1086 | 342 | 0.31 |
gaim | 1479 | 433 | 0.29 |
[edit] Compatibility
Virtual library and file names can be removed from package requirements without compromising compatibility with existing third party packages, as long as these symbols are still provided . By doing so, native distribution binaries will benefit from the optimized requirements chains while still allowing external packages to require virtual packages; if we assume that the entire system core and the vast majority of other installed packages are distribution native the unoptimized requirements will be of little impact in the overall system performance.
[edit] Conclusion
This paper outlines a method to dramatically reduce the number of dependencies in RPM-based systems by bypassing virtual package requirements introduced mainly by Red Hat Linux heritage. The use of intermediary virtual packages for certain components such as libraries and executables is not part of the RPM mechanism but a rule set by local packaging policy. This policy was created in a scenario lacking the Debian alternatives system and package managers such as APT, and aims to address problems that are better handled by modern solutions while creating problems for recursive dependency resolution.
The good: The method preserves compatibility with existing third-party binaries (since virtual symbols are still provided), improves system performace and saves bandwidth and disk usage.
The bad: Packages must be rebuilt to take advantage of the new dependency layout. Also some planning is needed when splitting packages or moving components from a package to another, since it would require adoption of nontrivial rules to keep packages functional (see below).
The ugly: Special handling of relations is needed in package layout modifications, as shown in Fig. 4 where package B' is spun off from package B and Fig. 5 where package B is split into packages C and D. It is assumed that all packages B, B', C and D contain components needed by package A.
In the spin-off case, package A must sill require package B and a new dependency must be added to package B' (green). This affects all package B dependendants and the new dependency is automatically added on rebuild. Three new relations must be added: 1) new package B' must obsolete old package B (blue), so installation of B' removes old B; 2) new package B' must conflict with old package B (red), so old package B can't be installed if B' is installed; and 3) new package B conflicts with old package A (red), so installation of new B forces upgrade of A.
In the split case, besides new requirements to packages C and D (added automatically on package A rebuild) four new relations must be added: 1,2) packages C and D must obsolete package B (in order to remove B on installation of either C or D); and 3,4) packages C and D must conflict with package B (to prevent installation of B when C or D are installed).
In both cases, a build system that automatically builds dependants is highly recommended to keep the package base consistent. This system, therefore, is not recommended for distributions with frequent package layout changes or without automated system consistency and upgrade tests.