Development/Docs/Dependency Optimization

From Mandriva Community Wiki

Jump to: navigation, search
RPM Dependency Optimization

RPM packages often carry many virtual dependencies generated at package creation time. Efficient dependency resolution calls for a reduced number of such relations, a goal that could be achieved by optimization at package creation time.

Contents


[edit] Foreword

This paper presents a system that could greatly reduce the dependency resolution burden in RPM-based systems, with direct implication of performance gain due to reduced complexity of dependency chains and smaller system memory usage. Additional gain arise from reduced package base transfer time over the network for tools such as APT, URPMI and Smart, and from reduced size required for storage of the package base in the local disk.

That said, I don't really expect the system outlined in this paper to be accepted as it departs from conventional RPM wisdom and breaks the widely accepted canon of per-file symbolic dependencies. Religious issues set aside, I hope this work yields further study on dependency optimization and overall system improvement.

[edit] Introduction

[edit] Automatic dependencies

Automatic library and file dependencies are added by RPM as a convenient way to build packages without the need to manually keep track of all components needed by the package being built. The mechanism is to add a dependency on a virtual package named after the needed component. The package that carries the component provides the virtual symbol and is indirectly required by the package which uses the component.

The original purpose of this design is to locate a library or file regardless of the actual package containing the component. It allows free relocation of the component from one package to another without breaking dependencies, and also provides a primitive alternatives system when different, mutually-excluding packages containg versions of a needed component can all satisfy the requirements of the packages depending upon the virtual package name. Special handling can be done with certain libraries (e.g. libc), where the library version is appended to the name of the virtual symbol to improve compatibility.

On the other hand, the system also promotes sloppy packaging pratice regarding to dependencies, often resulting in many unnecessary relations and increased complexity. Automation is no substitute for proper planning.

[edit] Performance issues

Package managers such as RPM, APT, URPMI and Smart must compute dependency closure on each new package installed on the system. Complexity of the algorithmis are not linear, meaning that twice the number of relations will result in more (usually much more) than twice the computation effort needed to resolve the package dependencies. Informal testing has shown that APT in Conectiva Linux could be more than fifty times slower than APT in Debian in a comparable system, with Debian having four times as many available packages as Conectiva Linux (not all available repositories being used in Conectiva Linux). Part of the difference is caused by slower access to the db-format package base in RPM (Dpkg uses plaintext), but a significant slowdown can be caused by the higher number of dependency relations. Further investigation is recommended.

Table 1: APT package statistics

Distribution Conectiva Debian
Real packages 12450 16609
Virtual packages 10896 1478
Real/Virtual ratio 1.14:1 11.24:1
Time for apt-cache stats 14.82s (700MHz Xeon SCSI) 0.51s (300MHz Ezra IDE)

Table 2: Smart package statistics

Distribution Conectiva Debian Mandriva
Real packages 12655 15912 11129
Virtual packages 10504 1323 26047
Real/Virtual ratio 1.20:1 12.03:1 0.43:1
Time for Smart stats 6.10s (3GHz P4 SATA-RAID) | 11.82s (2.4GHz Xeon SCSI)

Table 2 shows results obtained with Smart, figures are similar but not the same as APT (further investigation is recommended). Smart and APT have similar performance on RPM systems, while APT is significantly faster in Debian, even in a much less powerful machine. Mandriva Linux seems to have an excessive number of virtual packages compared to real packages.

[edit] Caveat

All dependency resolution operations in the paper ignores the glibc to rpm requirement (see UnzipDependency for related information). Not ignoring this dependency causes all packages to depend on the entire core system and makes further analysis meaningless.

[edit] Package dependencies

[edit] Current situation

The current dependency graph for package bash in a Mandriva Cooker system circa the Mandriva Linux 2005 release is shown in Fig. 1. This layout is likely to be similar in other RPM-based Linux systems. All virtual dependencies in this case were automatically generated by RPM at package build time.

Image:bash.png

Fig. 1: Dependency graph of package bash

[edit] The new approach

Instead of leaving all dependency tracking to be done at package install time, optimization can be done at package creation time. In this study, the following rules are used:

  • Virtual package resolution for libraries and files is done at package build time instead of package install time. The reasons are: 1) Multiple packages providing the same component is not the common case, and the alternatives system provides a better solution with a manually provided symbol; and 2) On a mature distribution, moving components from one package to another is a rare situation and can also be handled in per case basis.
  • Remove all self-dependencies as they are automatically satisfied.
  • Remove all double dependencies between packages since they're as good as a single dependency.
  • Virtually all packages requiring ldconfig also require the libc, so ldconfig is added to the libc package.
  • Remove the rpm dependency from glibc, as it ties every package to the entire system core.

The resulting dependency graph for package bash after application of these rules is shown in Fig. 2.

Image:bash-new.png

Fig. 2: Optimized dependency graph for bash

Note that the optimized dependency graph for bash is similar in its simplicity to the dependency graph of pdksh in Debian, shown in Fig. 3 (generated with apt-cache dotty with conflicts removed).

Image:pdksh-debian.png

Fig. 3: Dependency graph for pdksh in Debian

[edit] Complexity analysis

Table 2 shows the number of requirement dependencies for an arbitrary set of packages using standard RPM dependencies with virtual library packages and optimized dependencies. The number of dependencies ratio for the current and optimized cases can be as good as 8% (in package emacs), with the worst value no greater than 31% (in urpmi). The abnormally high number of dependencies for more complex packages hint to a highly inter-dependent system core, which should be investigated and decoupled wherever possible (see Unzip Dependency). Once the core system is sanitized, the number of dependencies should be greatly reduced for all packages.

Table 2: Number of dependencies

Package Current Optimized O/C Ratio
glibc 6 1 0.17
bash 33 4 0.12
bzip2 52 10 0.19
sed 65 15 0.23
gawk 73 15 0.21
emacs 96 8 0.08
wget 143 21 0.15
graphviz 262 65 0.25
frozen-bubble 729 111 0.15
less 911 284 0.31
coreutils 911 284 0.31
gcc 967 304 0.31
python 1027 310 0.30
urpmi 1086 342 0.31
gaim 1479 433 0.29

[edit] Compatibility

Virtual library and file names can be removed from package requirements without compromising compatibility with existing third party packages, as long as these symbols are still provided . By doing so, native distribution binaries will benefit from the optimized requirements chains while still allowing external packages to require virtual packages; if we assume that the entire system core and the vast majority of other installed packages are distribution native the unoptimized requirements will be of little impact in the overall system performance.

[edit] Conclusion

This paper outlines a method to dramatically reduce the number of dependencies in RPM-based systems by bypassing virtual package requirements introduced mainly by Red Hat Linux heritage. The use of intermediary virtual packages for certain components such as libraries and executables is not part of the RPM mechanism but a rule set by local packaging policy. This policy was created in a scenario lacking the Debian alternatives system and package managers such as APT, and aims to address problems that are better handled by modern solutions while creating problems for recursive dependency resolution.

The good: The method preserves compatibility with existing third-party binaries (since virtual symbols are still provided), improves system performace and saves bandwidth and disk usage.

The bad: Packages must be rebuilt to take advantage of the new dependency layout. Also some planning is needed when splitting packages or moving components from a package to another, since it would require adoption of nontrivial rules to keep packages functional (see below).

The ugly: Special handling of relations is needed in package layout modifications, as shown in Fig. 4 where package B' is spun off from package B and Fig. 5 where package B is split into packages C and D. It is assumed that all packages B, B', C and D contain components needed by package A.

Image:case1.png
Fig. 4: Package B' is spun off from B

In the spin-off case, package A must sill require package B and a new dependency must be added to package B' (green). This affects all package B dependendants and the new dependency is automatically added on rebuild. Three new relations must be added: 1) new package B' must obsolete old package B (blue), so installation of B' removes old B; 2) new package B' must conflict with old package B (red), so old package B can't be installed if B' is installed; and 3) new package B conflicts with old package A (red), so installation of new B forces upgrade of A.

Image:case2.png
Fig. 5: Package B is split into C and D

In the split case, besides new requirements to packages C and D (added automatically on package A rebuild) four new relations must be added: 1,2) packages C and D must obsolete package B (in order to remove B on installation of either C or D); and 3,4) packages C and D must conflict with package B (to prevent installation of B when C or D are installed).

In both cases, a build system that automatically builds dependants is highly recommended to keep the package base consistent. This system, therefore, is not recommended for distributions with frequent package layout changes or without automated system consistency and upgrade tests.

Personal tools