Development/Docs/Cluster Admin
From Mandriva Community Wiki
A quick reference of various cluster admin tasks
Platform overview
The build system platform runs on multiple machines and is composed of two main layers:
- the logical "cooker" layer
- the physical layer (technical system infrastructure)
Troubleshooting the cooker layer
The logical cooker layer is made of chroots running on multiple nodes (n1..n5 for x86, seggie and deborah for x86_64). To access these machines, log in as usual, for example with "ssh n5". From there, you can use sudo to perform the administration tasks described below. If you want to help with daily platform administration, ask on the distrib-admin mailing list. To get help, log in on IRC and contact distrib-admin members who have sudo privileges on this layer.
Troubleshooting the physical layer
This layer is formed by the real systems running on the nodes; some of them run 2007, others a mix of 2006 and cooker. Normally, as a build engineer or a contributor, you do not need to access this part.
If a problem cannot be solved from the logical cooker layer, you (as a registered member of distrib-admin) can contact the IS Team for help. Under the current SLA for the physical layer, hardware problems are handled during the week as part of the IS Team's normal duties. Hardware problems occurring during the weekend are fixed on a best-effort basis only.
Access to the root shell
If you want to become root, just use sudo bash.
Updating the cluster configuration
The cluster is managed by cfengine; the configuration is in /var/lib/config/ on kenobi.
A cron task updates each node of the cluster, so managed files are refreshed every hour or so. To expedite or force an update, run the /etc/cron.hourly/config script on each managed machine.
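Forcing the update on every node by hand gets tedious, so a small loop can help. The following is only a sketch: the node list is taken from the cooker layer description above, the use of sudo over ssh is an assumption (it may need a tty depending on the sudoers setup), and the function name and the DRY_RUN switch are made up for this example. By default it only prints what it would run.

```shell
# Node names as listed in the cooker layer description (n1..n5, seggie, deborah).
NODES="n1 n2 n3 n4 n5 seggie deborah"

# Print (dry run, the default) or execute the config update script on every node.
force_config_update() {
    local node
    for node in $NODES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh -n $node sudo /etc/cron.hourly/config"
        else
            # -n keeps ssh from reading this shell's stdin
            ssh -n "$node" sudo /etc/cron.hourly/config \
                || echo "update failed on $node" >&2
        fi
    done
}
```

Call force_config_update to review the commands, then DRY_RUN=0 force_config_update to actually run them.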
Granting an svn commit right to a user who already has ssh access
- add the user to the svn group on svn.mandriva.com
- checkout svn+ssh://svn.mandriva.com/svn/admin/svn/trunk
- edit the svnperms.conf file, and commit
- update the checkout on svn.mandriva.com with "cd /svn/conf; svn up" (a checkout-on-commit feature would be nice, but there are some privilege problems at the moment, and no clean way to handle them)
Changing password expiration
The cluster uses NIS to authenticate users (for the moment). The configuration is stored in /var/yp. To change password expiration, use chage on kenobi, like this:
chage -M -1 misc
then run
( cd /var/yp ; make )
to update the NIS database.
Adding a buildhost to the upload system ACL, based on architecture
To declare another buildhost to the upload system, edit the file /etc/youri/hosts.conf, i.e. its copy under /var/lib/config on kenobi. The format is simple:
host-regexp arch-regexp
The check comes from Youri::Upload::Check::Host (/usr/local/lib/perl/Youri/Upload/Check/Host.pm)
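To make the "host-regexp arch-regexp" format more concrete, here is a rough shell approximation of such a check. It is not the code of Youri::Upload::Check::Host: the function name is made up, grep -E regular expressions differ from Perl ones in some details, and both the whole-string anchoring and the skipping of comment lines are assumptions.

```shell
# host_allowed HOST ARCH [CONF]
# Succeeds if some "host-regexp arch-regexp" line of CONF matches both values.
host_allowed() {
    local host=$1 arch=$2 conf=${3:-/etc/youri/hosts.conf}
    local host_re arch_re
    while read -r host_re arch_re || [ -n "$host_re" ]; do
        # Skip blank lines and comments (comment support is an assumption).
        case $host_re in ''|'#'*) continue ;; esac
        # -x makes each pattern match the whole string (assumption).
        if printf '%s\n' "$host" | grep -Eqx -- "$host_re" &&
           printf '%s\n' "$arch" | grep -Eqx -- "$arch_re"; then
            return 0
        fi
    done < "$conf"
    return 1
}
```

With a hypothetical file containing the lines "n[1-5] i586" and "seggie|deborah x86_64", host_allowed n3 i586 would succeed while host_allowed n3 x86_64 would fail.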
Adding full privileges to someone on bugzilla
On kenobi, just run this:
/root/bin/bugs -a erinmargault editbugs
This gives the user erinmargault the editbugs privilege.
Access to the real system outside of chroot
So that a node can be recovered in case of big problems, each cluster node uses a chroot. The real system can be accessed on port 12, like this:
ssh n5 -p 12
You can also mount the real partition (/dev/hda5) and use chroot to get out of the first chroot.
Creating an account for svn usage
OUTDATED PROCEDURE
Use this script:
/var/lib/config/bin/new-user-kenobi <account> <full name> <mail address> (<mandrake.org address>)
This will create an account only on kenobi. After that, you will need to manually do the following:
- add the user to the svn unix group
- add the user to the proper subversion group in the ACLs file (/svn/conf/svnperms.conf). For a translator this would be the po group, for a package maintainer it would be the cooker group
- add a mapping to the [users] section of /var/lib/config/etc/repsys.conf for this user
- contact lenny ( lenny at mandriva.com ) for the handling of the mail alias @mandriva.org
Cleaning up an iurt process that does not respond
If a build is too old (i.e. running for more than one day) and does nothing, iurt should be killed. Please take a look at the log file first (shown by ps aux) to see whether another method can be tried. rpm/urpmi locking problems are known; these require a kill.
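Spotting candidates can be partly automated. The sketch below lists iurt processes older than a given age, based on the elapsed time reported by ps in the [[dd-]hh:]mm:ss format. The function names and the simple match on "iurt" in the command line are assumptions of this example. It only looks at age, not at whether the process is idle, and it kills nothing: check the log before using kill on any listed PID.

```shell
# Convert a ps "etime" value ([[dd-]hh:]mm:ss) to seconds.
etime_to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        secs = $NF + $(NF - 1) * 60
        if (NF >= 3) secs += $(NF - 2) * 3600
        if (NF >= 4) secs += $(NF - 3) * 86400
        print secs
    }'
}

# List processes whose command line mentions iurt and that are older than
# MAX_AGE seconds (default: one day). Output: pid, elapsed time, command.
list_stale_iurt() {
    local max_age=${1:-86400} pid etime args
    ps -e -o pid= -o etime= -o args= | while read -r pid etime args; do
        case $args in *iurt*) ;; *) continue ;; esac
        if [ "$(etime_to_seconds "$etime")" -gt "$max_age" ]; then
            echo "$pid $etime $args"
        fi
    done
}
```

Running list_stale_iurt with no argument uses the one-day limit mentioned above.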
Problem with autofs?
In case autofs is not working, here is a quick summary of how things are set up: autofs is the one from 2006.0, because we run a 2006.0 kernel outside of the cooker chroot (there are various incompatibilities with the cooker version at the moment).
To reinstall, just run:
rpm -Uvh /mnt/BIG/dis/community/2006.0/i586/media/main/autofs-4.1.4-4.2.20060mdk.i586.rpm
Autofs is run from inside the cooker chroot. The config file is /etc/auto.home, managed by cfengine.
Upload is stopped on kenobi
Sometimes, a mail is sent to signal a problem on kenobi or ken:
Subject: [Maintainers] kenobi.mandriva.com filesystem is full Only 4878812 bytes available. Stopping upload and mirroring processes.
On kenobi, this mail is sent by the script /etc/cron.hourly/stop_if_full, which runs /home/mandrake/bin/stop_if_full. The script checks available disk space on /export/home/ and /mnt/BIG/ and stops crond when space runs out.
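To check by hand what the script sees, a free space test along these lines can be used. This is not the actual stop_if_full script: the function names are made up, the threshold is whatever you pass in, and the values are in 1K blocks as reported by df, whereas the mail reports bytes.

```shell
# available_kb PATH: free space on the filesystem holding PATH, in 1K blocks.
available_kb() {
    df -P -k "$1" | awk 'NR == 2 { print $4 }'
}

# is_low_on_space PATH MIN_KB: succeeds when fewer than MIN_KB blocks are free.
is_low_on_space() {
    local avail
    avail=$(available_kb "$1") || return 2
    [ "$avail" -lt "$2" ]
}

# Example, with a hypothetical 1 GB threshold:
#   for fs in /export/home/ /mnt/BIG/; do
#       is_low_on_space "$fs" 1048576 && echo "$fs is nearly full"
#   done
```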
If this happens, it usually means that something is taking too much space. Among the usual suspects is /mnt/BIG/dis/uploads/failure/cooker/{contrib,main}/release/, which can fill up pretty quickly. The usual solution is to use find and rm to remove the old logs.
Once this is done, crond must be started again:
service crond start
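The find and rm cleanup mentioned above could look like the sketch below. The function names and the retention period are assumptions (the page does not say how old a log must be before it can go), so list first and purge only once the list looks right.

```shell
# list_old_logs DAYS DIR...: print regular files last modified more than DAYS days ago.
list_old_logs() {
    local days=$1
    shift
    find "$@" -type f -mtime +"$days" -print
}

# purge_old_logs DAYS DIR...: remove the files that list_old_logs would print.
purge_old_logs() {
    local days=$1
    shift
    find "$@" -type f -mtime +"$days" -exec rm -f {} +
}

# Example, with a hypothetical 7 day retention:
#   list_old_logs 7 /mnt/BIG/dis/uploads/failure/cooker/contrib/release/ \
#                   /mnt/BIG/dis/uploads/failure/cooker/main/release/
```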
(re)move a package
The package repository reference machine is ken. If you have the right access to it, whatever is done there in terms of package moves, removals, etc. is reflected in the rest of the world. For example, to move a package from 2007 main/backports to 2007 contrib/backports, one could do this:
for m in /mnt/BIG/dis/2007.0/{SRPMS,*/media}; do mv $m/main/backports/*warzone* $m/contrib/backports; done
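Since anything done on ken propagates everywhere, it can be worth previewing such a move first. The following wraps the same loop in a function with a dry-run default. The function name, its parameters and the DRY_RUN switch are made up for this example.

```shell
# move_package ROOT PATTERN FROM TO
# For ROOT/SRPMS and every ROOT/*/media, move files matching *PATTERN*
# from the FROM subdirectory to the TO subdirectory.
move_package() {
    local root=$1 pattern=$2 from=$3 to=$4 m f
    for m in "$root"/SRPMS "$root"/*/media; do
        for f in "$m/$from"/*"$pattern"*; do
            [ -e "$f" ] || continue   # the glob matched nothing here
            if [ "${DRY_RUN:-1}" = "1" ]; then
                echo "mv $f $m/$to/"
            else
                mv -- "$f" "$m/$to/"
            fi
        done
    done
}
```

For the example above, move_package /mnt/BIG/dis/2007.0 warzone main/backports contrib/backports prints the planned moves, and prefixing the call with DRY_RUN=0 performs them.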
LDAP support
The cluster infrastructure currently suffers from lots of files and accounts scattered across multiple machines. We really need to centralize this and use only one source. One way to do this is LDAP.
Below are the sources of information used in the cluster machines:
- unix accounts and groups: currently NIS. Not all machines are part of it (svn.mandriva.com isn't, for example). Easily LDAPified. done!
- sudo: currently /etc/sudoers. Distributed via cfengine. Easily LDAPified. I (Andreas) have a ldif file with the current kenobi sudoers converted to LDAP. done!
- @mandriva.org email alias: unknown where it is maintained. Easily LDAPified (just have that @mandriva.org postfix server query an additional LDAP map for aliases).
- repsys: configured via /etc/repsys.conf on all cluster machines. Main need is user mapping in LDAP. Needs development to have it (#30549: already implemented in svn version, currently undergoing some testing). done and uploaded!
- svnperms: ACLs and groups for SVN access. Could benefit from LDAP support for groups. A simple solution would be to use unix groups, which would use LDAP automatically via nss_ldap. Probably no need to store ACLs in LDAP.
- autofs: used in cluster nodes and managed via cfengine. Easily LDAPified. automap imported, needs testing
- ssh key: for cluster and svn access. Can be LDAPified via an external patch to openssh, not applied by default. This matters most when the NFS-mounted home directory fails.
- urpmi: as a bonus, perhaps use LDAP to store urpmi configuration for the cluster nodes. Not urgent.
A few other enhancements we could use in this LDAP server:
- use unique overlay to prevent selected duplicate names done!
- use refint overlay to prevent dangling entries wherever possible (needs attributes with dn syntax to work) done!
- use rfc2307bis groups instead of rfc2307? done!
- use password policy overlay to centralize password control (does it work to change an expired password via ssh?) added! but no policy so far
- and of course, use more than one LDAP server with DNS round robin wherever the tool does not support multiple ldap servers (nss_ldap supports it, for instance)
Todo
- script to manage user accounts: I'm trying the ldapscripts package. It needs these changes:
- support rfc2307bis groups (easy to add)
- support cn, sn, givenName and email domain
- better support for being called in a script (i.e., needs command line parameters for some stuff)
- support for uid/gid pool instead of enumerating all users/groups in order to find out what number to use
- possibly not depend on nss_ldap configured
- alternatively, we could use smbldap-tools without the samba component
- autofs: test autofs with these maps, come up with a configuration
- fix emails: contributor vs employee (.org vs .com)
- replication: setup slave/consumer, decide on which machine(s)
- nested groups: patch nss_ldap to disable nested group support (patch done, but not applied by default, it's not "upstream quality")
- svnperms: patch svnperms.py to support groups in LDAP instead of svnperms.conf