Format of synthesis.hdlist.cz index
From Mandriva Community Wiki
This page presents the format used by synthesis.hdlist.cz index, generated by genhdlist.
Note: on mirrors where hdlists/synthesis are built, there are hard links from media/media_info/hdlist_cz to media/main/media_info/hdlist.cz
So if you rebuild your hdlist, don't forget to recreate the hard link. The best way to regenerate hdlists for all media is to use gendistrib.
Parsing synthesis is an easy process. I did it in Python in half a day, with no documentation on the format and no special Python libraries. See the attachment.
The format is easy to understand, even though it is not documented at all. Most of the job is done by a Perl XS library, and the "description" can be found in the perl-URPM source code.
Here is a sample entry in the file:
@firstname.lastname@example.org()(64bit)@openvpn-down-root.so()(64bit)@openvpn[== 2.2.0-1:2011.0]
@requires@rpm-helper[*]@/bin/sh[*]@email@example.com()(64bit)@libc.so.6(GLIBC_2.2.5)(64bit)@libc.so.6(GLIBC_2.3)(64bit)@libc.so.6(GLIBC_2.3.2)(64bit)@libc.so.6(GLIBC_2.3.4)(64bit)@libc.so.6(GLIBC_2.4)(64bit)@libcrypto.so.1.0.0()(64bit)@libdl.so.2()(64bit)@libdl.so.2(GLIBC_2.2.5)(64bit)@liblzo2.so.2()(64bit)@libpam.so.0()(64bit)@libpkcs11-helper.so.1()(64bit)@libssl.so.1.0.0()(64bit)
@suggests@openvpn-auth-ldap
@summary@A Secure TCP/UDP Tunneling Daemon
The lines always appear in the same order. At the very least, the @info@ line always marks the end of the entry.
The first field is always the type of the line. So far, it can be provides, requires, obsoletes, conflicts, suggests, summary or info.
The first 5 tags (provides, requires, obsoletes, conflicts and suggests) use the same scheme. They are followed by one or more package names, sometimes with a version restriction (like package[== version]). The restriction can be <=, >= or ==, as far as I have seen. Multiple packages are separated by @.
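The scheme above can be sketched in a few lines of Python. This is only an illustration, not the code from the attachment: the function name parse_dependency_line is hypothetical, and the [*] marker seen in the sample entry is kept as-is without interpreting it.

```python
import re

# Assumed token shape: a name, optionally followed by a bracketed part
# such as [== 2.2.0-1:2011.0] or [*].
_DEP_RE = re.compile(r"^(?P<name>.*?)(?:\[(?P<flag>[^\[\]]*)\])?$")


def parse_dependency_line(line):
    """Return (tag, [(name, op, version), ...]) for one dependency line."""
    fields = line.rstrip("\n").split("@")
    # fields[0] is empty because the line starts with '@'; fields[1] is the tag.
    tag, tokens = fields[1], fields[2:]
    deps = []
    for token in tokens:
        token = token.strip()
        if not token:
            continue
        match = _DEP_RE.match(token)
        name, flag = match.group("name"), match.group("flag")
        op = version = None
        if flag:
            # "== 2.2.0" gives op "==" and version "2.2.0"; "*" gives op "*".
            parts = flag.split(None, 1)
            op = parts[0]
            version = parts[1] if len(parts) > 1 else None
        deps.append((name, op, version))
    return tag, deps
```

Returning a list of tuples rather than a dictionary keeps every restriction, even when the same package name appears several times on one line.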
The summary is simple too, since it is only followed by the summary on one line.
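The last line is info. Its fields, separated by @, are the package full name (name-version-release.arch), epoch, size, group, disttag and distepoch.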
As most names are self-explanatory, I will not explain them in detail. arch is src for src.rpm, or i586, ppc, noarch. Size is in bytes, and the group is the rpm group, as listed in Mandriva Groups.
The last tags are disttag and distepoch. Although these are also part of the package name, they cannot be reliably parsed from it, which in turn affects parsing of the other tags. To successfully parse a package NVRA with disttag/distepoch in it, you need to provide '-disttagdistepoch' in the expression. Note that a package NVRA contains -disttagdistepoch only if the package has a disttag: a package with only a distepoch and no disttag has the same NVRA as a package without distepoch, while a package with only a disttag and no distepoch has '-disttag' appended.
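As a rough sketch of the rule above, the following Python function strips the '-disttagdistepoch' suffix before splitting the NVRA. The field order it assumes (full name, epoch, size, group, disttag, distepoch), the function name parse_info_line and the sample values used with it are assumptions made for illustration, not taken from genhdlist.

```python
def parse_info_line(line):
    """Parse one @info@ line into a dict (assumed field order, see above)."""
    fields = line.rstrip("\n").split("@")
    if len(fields) < 6 or fields[1] != "info":
        raise ValueError("not an @info@ line: %r" % line)
    nvra, epoch, size, group = fields[2:6]
    # disttag and distepoch may be absent, so default to empty strings.
    disttag = fields[6] if len(fields) > 6 else ""
    distepoch = fields[7] if len(fields) > 7 else ""

    nvr, arch = nvra.rsplit(".", 1)
    # The suffix is only present in the NVRA when the package has a disttag.
    suffix = "-" + disttag + distepoch if disttag else ""
    if suffix and nvr.endswith(suffix):
        nvr = nvr[: -len(suffix)]
    name, version, release = nvr.rsplit("-", 2)

    return {
        "name": name,
        "version": version,
        "release": release,
        "arch": arch,
        "epoch": int(epoch),
        "size": int(size),
        "group": group,
        "disttag": disttag,
        "distepoch": distepoch,
    }
```

For a made-up line such as @info@openvpn-2.2.0-1-mdv2011.0.x86_64@0@1109724@Networking/Other@mdv@2011.0, the suffix -mdv2011.0 is removed first, so the release comes out as 1 rather than mdv2011.0.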
Problem with synthesis
However, one problem remains.
What if an rpm includes an @ in the name or in the description?
Right now, nothing prevents the problem: genhdlist will crash, and most of the tools using synthesis will be broken by this bug.
A possible workaround is to use a non-printable character as the delimiter (i.e. a value in the range 0x01 - 0x20 or 0x7f - 0xff). This way we are sure to avoid any character that might be used by a package in the future.
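The ambiguity and the proposed workaround can be illustrated with a small Python experiment. The encode and decode helpers are hypothetical, and the choice of 0x1f (the ASCII unit separator) as alternative delimiter is only an example: genhdlist itself uses @.

```python
AT = "@"
ALT = "\x1f"  # hypothetical replacement delimiter, not used by genhdlist


def encode(tag, values, sep):
    """Build one line in the synthesis style with the given delimiter."""
    return sep + tag + sep + sep.join(values)


def decode(line, sep):
    """Split a line back into (tag, values) using the same delimiter."""
    fields = line.split(sep)
    return fields[1], fields[2:]


summary = ["Mail user@host notifier"]

# With '@' as delimiter, the summary is cut in two at its own '@'.
tag_at, values_at = decode(encode("summary", summary, AT), AT)

# With a control character as delimiter, the round trip is lossless.
tag_alt, values_alt = decode(encode("summary", summary, ALT), ALT)
```

Here values_at no longer equals the original summary, while values_alt does.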
- synthesis.py: A sample python class of a synthesis parser
In this Python code, the function split_requires has a bug: it does not handle cases where the same package appears several times in provides/depends/requires. Provides: foo = 2, foo = 3 will not work (because the data is stored in dictionaries, so foo will exist only once).
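The effect of that bug, and one possible way around it, can be shown without the attachment itself. The snippet below does not reproduce split_requires; it only contrasts a plain dictionary with a mapping from name to a list of restrictions, using made-up data.

```python
from collections import defaultdict

# Made-up data mirroring "Provides: foo = 2, foo = 3".
provides = [("foo", "==", "2"), ("foo", "==", "3"), ("bar", None, None)]

# Plain dictionary keyed by name: the second "foo" overwrites the first.
as_dict = {}
for name, op, version in provides:
    as_dict[name] = (op, version)

# Name -> list of restrictions: both "foo" entries are kept.
as_multi = defaultdict(list)
for name, op, version in provides:
    as_multi[name].append((op, version))
```

After running this, as_dict holds only the last restriction for foo, while as_multi keeps both.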
- another_synthesys.py: Without OO and implemented as a generator
For each iteration, the next package description is returned:
>>> import synthesis
>>> for pkg in synthesis.parse(hdlist_path):
...     print "name =", pkg['name']
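A generator of this kind can be sketched as follows. This is not the code of the attached scripts: the function names are hypothetical, the values are left as raw field lists, and reading the file with gzip is an assumption about how synthesis.hdlist.cz is compressed.

```python
import gzip


def iter_packages(lines):
    """Yield one dict per package from an iterable of synthesis lines."""
    pkg = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("@"):
            continue  # skip anything that does not look like a tagged line
        fields = line.split("@")
        # fields[0] is empty; fields[1] is the tag; the rest are the values.
        tag, values = fields[1], fields[2:]
        pkg[tag] = values
        if tag == "info":
            # The info line closes the entry, so the package is complete.
            yield pkg
            pkg = {}


def iter_packages_from_file(path):
    """Same as iter_packages, reading a file assumed to be gzip-compressed."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for pkg in iter_packages(handle):
            yield pkg
```

Because the package is only yielded when the info line is seen, the generator relies on info always being the last line of an entry.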