This is the Bioinformatics section for the HOWTOs collection.
The following lists bioinformatics software available on calypso. The list below does not provide the actual commands that you would type to access the software. For usage details, visit the software documentation links provided (where available), or contact us for support. You may also want to try executing
ls /usr/local/bin/
on calypso to see the actual commands available.
Many of the programs available on our systems for bioinformatics can take a considerable amount of time to run. Some may take days or even weeks to complete their analysis. For this reason, it may be desirable to place such jobs "in the background." This is a way of running a program that allows you to continue working on other tasks (or even log out) while still keeping the program running. Furthermore, backgrounded jobs are not dependent on your session remaining open, so even if your computer crashes, your job will continue uninterrupted.
NOTE! Due to problematic interactions between OpenMosix kernels and threading used by perl/python/java, you may experience strange and random segfaults when running programs written in these languages. Please use the -g option to bgrun for such programs to prevent them from running on OpenMosix nodes.
There is one consideration to keep in mind when you plan to run a job in the background: is it interactive or non-interactive? Interactive jobs require you to provide input after you enter the command. An example of an interactive job is the passwd command: after you enter the command, you have to answer some prompts (i.e., your password). A non-interactive job is one that doesn't require any further input after entering the command. An example is the ls command: after you enter the command, it does its thing and then ends.
Non-interactive jobs are easy to run in the background and there are several ways to go about it. The best method for most cases is to use the bgrun command to preface your usual command. For example:
bgrun mendel
bgrun is a special program that takes all of its arguments as a command and submits that command to the cluster batch queue. This ensures that the cluster is utilized as efficiently as possible.
After invoking the command, you will be given some information about the job you just submitted, including the identifier and where the output of the program (if any) will go.
your job 8202 ("mendel.31187") has been submitted
Job submitted. Identifier: 31187 ; output redirected to mendel.31187.output
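If you submit jobs from a script, it can be handy to capture the job id and the output file name instead of copying them by hand. The following is a minimal sketch that assumes bgrun prints exactly the two-line message shown above; the function names are made up for illustration and are not part of bgrun.

```shell
# Print the Grid Engine job id from bgrun's message.
# Assumes a line of the form:  your job 8202 ("mendel.31187") has been submitted
job_id_from_bgrun() {
    # The id is the third word of the line that starts with "your job".
    awk '/^your job / { print $3; exit }'
}

# Print the output file name from bgrun's message.
# Assumes a line ending in:  output redirected to mendel.31187.output
output_file_from_bgrun() {
    sed -n 's/.*output redirected to //p'
}

# Sample text copied from the example above, so the functions can be tried offline.
sample='your job 8202 ("mendel.31187") has been submitted
Job submitted. Identifier: 31187 ; output redirected to mendel.31187.output'

printf '%s\n' "$sample" | job_id_from_bgrun
printf '%s\n' "$sample" | output_file_from_bgrun
```

In a real script you would pipe the output of bgrun itself into these functions, provided bgrun writes its message to standard output (not verified here).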
You can check on the status of your submitted jobs by entering the following command:
qstat -u myusername
If your command generates output normally, you can also preview that output by looking at the output file using the tail command. In the example above, you would use tail mendel.31187.output to look at the last 10 lines of that file.
When the job has completed running, you will no longer see the job listed via the qstat command. You can also cancel a job by entering qdel myjobid. The job id is the first number in the bgrun output above, i.e., 8202.
Once the job is running in the cluster queue, you do not have to remain logged in. You can log out and your job will continue to run.
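Because a finished job simply disappears from the qstat listing, a script can wait for a job by polling that listing. This is a rough sketch, assuming the job id appears as the first column of each job line in the qstat output; job_is_listed and wait_for_job are illustrative names, not cluster commands.

```shell
# Succeed (exit status 0) while job id $1 still appears in your qstat listing.
# Assumes the job id is the first column of each job line.
job_is_listed() {
    qstat -u "$(id -un)" |
        awk -v id="$1" '$1 == id { found = 1 } END { exit !found }'
}

# Poll until the job leaves the listing.
# $1 is the job id; $2 is the polling interval in seconds (default 60).
wait_for_job() {
    while job_is_listed "$1"; do
        sleep "${2:-60}"
    done
    echo "job $1 is no longer listed"
}
```

For the example above you would enter wait_for_job 8202. Note that a job also leaves the listing when it is cancelled with qdel, so this only tells you the job is gone, not that it succeeded.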
The bgrun command has a few options itself. They can be seen if you enter bgrun as a command with no arguments. They include -m which will cause bgrun to generate an email message once the job is complete (this requires an email address in your ~/.forward file); -x which will exclude the head node from running the job (it will instead select one of the backing, usually less busy, cluster nodes); and -o OUTFILE which will let you specify a name for the output file instead of the default.
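As an illustration of combining these options, the sketch below submits one job per input file and gives each job its own output file. It assumes that bgrun options are placed before the command to run, and the .dat suffix and the submit_all name are hypothetical; check the usage message printed by bgrun before relying on it.

```shell
# Submit one job per input file, keeping off the head node (-x) and
# naming each output file after its input (-o).
# Usage: submit_all PROGRAM FILE...
# Assumption: bgrun options go before the command, as in
#   bgrun -x -o OUTFILE PROGRAM ARGUMENTS
submit_all() {
    prog=$1
    shift
    for input in "$@"; do
        # Strip the directory and the (hypothetical) .dat suffix.
        name=$(basename "$input" .dat)
        bgrun -x -o "$name.output" "$prog" "$input"
    done
}
```

For programs written in perl, python, or java, you would add the -g option as described in the note above.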
Interactive jobs are harder to run in the background, since a job running in the background has no way to receive input from you. However, some interactive programs can be made to run in a non-interactive way. Some, for example, allow you to answer all prompts as a series of command-line options. If this is possible, then you should investigate the options required to do this and then run the command non-interactively.
Other commands can accept input from a file instead of from your direct input. In these cases, you would put all of the answers you would type during the interactive run into a file and then feed these answers to the program via input redirection, e.g., command < input.file. Again, in this case, you can then run the command as if it were non-interactive.
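To make the redirection technique concrete, here is a small self-contained sketch. The ask_questions function is only a stand-in for a real interactive program, and its prompts and the answers are invented for the example.

```shell
# A stand-in for an interactive program: it prompts twice and reads the
# answers from standard input. (Hypothetical; use your real program instead.)
ask_questions() {
    printf 'Input file? '
    read -r infile
    printf 'Number of iterations? '
    read -r iterations
    printf '\nrunning on %s for %s iterations\n' "$infile" "$iterations"
}

# Put the answers in a file, one per line, in the order the prompts appear.
answers=$(mktemp)
printf '%s\n' 'pedigree.dat' '1000' > "$answers"

# Redirect the file into the program so that no typing is needed.
ask_questions < "$answers"
rm -f "$answers"
```

The answers must appear in exactly the order the program asks for them, so it is worth doing one interactive run first and writing down what you typed.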
If there is no way to make a program run in a non-interactive fashion, you can essentially background your entire session using the screen command. To do so, you have to first initialize a screen session by entering screen. It will look like nothing happened. Now run your interactive job. Once you have provided all required inputs and the job has begun to run, you can background the entire session by pressing Control-A followed by D. If you do this successfully, you should see a [detached] message to indicate your session was put in the background.
At this point, you can log out.
To re-attach your session, enter screen -r. Your session should pick up right where it left off. If your job is complete, you can close out your session by entering exit. Remember to close out all screen'd sessions! You can see a list of all your screen sessions via screen -list.
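If you keep more than one screen session, naming them makes it easier to re-attach to the right one. The sketch below wraps the standard -S (name a new session) and -r (re-attach) options of screen in two small functions; the function names and the session name are made up for the example.

```shell
# Start a new screen session with a name of your choosing.
start_session() {
    screen -S "$1"
}

# Re-attach to a detached session by name.
resume_session() {
    screen -r "$1"
}
```

For example, enter start_session linkage, run your interactive job, detach with Control-A followed by D, and later enter resume_session linkage to return to it.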
Human Genetics provides computational services for bioinformatics. These take the form of Unix servers with bioinformatics software installed, configured as a cluster to handle the needs of our users. The number and specifications of the cluster nodes vary (i.e., increase) over time, but as of this writing (2007-06-15) the cluster consists of 20 dual-processor "nodes" with CPUs of 2.4 to 2.8 GHz each, for a total of 96 effective GHz.
Please note that some nodes are available only through Grid Engine. See below for details on how to use the Grid Engine.
While there may be many and variable numbers of nodes available, users connect to only one of them, called the "head node". This head node is named calypso and has the full hostname calypso.genetics.ucla.edu. This is the server to which users log in and run software.
Two clustering technologies are employed by the Human Genetics cluster. They each have their benefits and, while they function well together, they are separate entities and the ins and outs of each should be kept in mind.
The easiest clustering technology to use is called openMosix. You don't have to do anything special to use it as it is completely transparent. Just run your programs or "jobs" as if no clustering were involved and openMosix will automatically transfer each job to the best possible node and run it there. In fact, it may move your job again across nodes if another one becomes available and it might run faster there.
As openMosix is transparent to the user, your jobs can be monitored and managed via the usual Unix tools. For example, the process list command ps will list your processes as if they were running on calypso, even if they are actually running on another node. Likewise top will monitor running processes.
In addition to the regular Unix tools, openMosix includes a few extra tools that provide additional details and insight into the functioning of the openMosix cluster. mtop is a replacement for top that is openMosix-aware: in addition to the normal information, it also lists on which node a job is running, and how many times it has moved from node to node. mtop should be used on our cluster instead of top for this reason.
mosmon is another tool and it provides a pseudo-graphical monitoring display of cluster load, using a histogram. With it, you can see a visual representation of how many openMosix nodes are available and the relative load on each of them. Enter mosmon -h to see a detailed description of what mosmon can do as well as a list of keyboard commands available to alter the display while it is running.
One caveat with the openMosix cluster is that not all jobs will migrate to different nodes; some will instead "stick" on the head node. While this is unusual, it is still something to be mindful of: it would be ill-advised to run 10 jobs of this type, since they would swamp the head node rather than distributing across the cluster. Whether this will happen depends on the program you're running and the specifics of a particular job, so it's hard to know in advance if this will be the case. Please monitor your jobs using the above tools (mtop is a good choice) to see that they are distributing properly.
The other caveat with the openMosix cluster is that not all cluster nodes are openMosix members - when using openMosix as the way to distribute jobs, only a subset of computers are available to you. To gain access to the full suite of cluster nodes, you must use the other clustering technology - the grid engine.
The other clustering technology employed by our cluster is provided by the N1 Grid Engine. Unlike openMosix, the grid engine is not transparent and you must execute special commands in order to utilize the grid engine cluster. Also unlike openMosix, the grid engine doesn't run all jobs it receives simultaneously - rather it runs as many jobs as there are CPUs available and "queues" the others until a CPU becomes available. As such, the grid engine is more "friendly" to all users in that one user cannot swamp the cluster and in practice, the time to complete jobs run serially is not significantly different than if run simultaneously.
To utilize the grid engine, you must "submit" jobs to the grid engine queues. This is most easily done through the bgrun command; usage details for this command are available here.
Monitoring jobs in the grid engine system requires the use of some new commands. The most useful are qhost, which lists all members of the grid engine cluster and their current loads; qstat which lists details on the current queue (which jobs are queued or running and where they are running); and qdel which allows you to remove unwanted jobs from the queue (you can only remove your own jobs, of course).
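When many jobs are queued, a quick count of how many are running and how many are still waiting is often more useful than the full listing. The following sketch assumes the common qstat layout of two header lines followed by one line per job, with the state in the fifth column ("r" for running, "qw" for queued and waiting). The sample listing is made up for illustration and summarize_states is not a Grid Engine command.

```shell
# Count jobs by state from qstat-style output read on standard input.
# Assumes two header lines, then one job per line with the state in column 5.
summarize_states() {
    awk 'NR > 2 && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Illustrative listing (invented jobs), shaped like qstat output.
sample='job-ID  prior   name       user   state submit/start at     queue         slots
--------------------------------------------------------------------------------
   8202 0.55500 mendel.311 alice  r     06/15/2007 10:02:11 all.q@node03  1
   8203 0.55500 mendel.312 alice  r     06/15/2007 10:02:15 all.q@node07  1
   8204 0.55500 mendel.313 alice  qw    06/15/2007 10:02:20               1'

printf '%s\n' "$sample" | summarize_states
```

On the cluster you would pipe the real listing into the function instead, for example qstat -u myusername | summarize_states.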
While using the grid engine does involve an extra command, in practice it should be as easy to use as openMosix: simply preface your usual command with "bgrun" (e.g., bgrun mendel instead of just mendel). The benefits of using the grid engine are that, unlike with openMosix, you have full access to all members of the cluster and jobs are distributed more efficiently. Not only will all jobs distribute under the grid engine, they will do so in a more load-balanced manner. Furthermore, with the Grid Engine, you have a better chance of not losing your job if the head node goes down.
Genotator is a program that allows you to perform a variety of different sequence analyses[1] at once and examine the various outputs in a unified form. Additional information about Genotator and the details about its functions and usage can be found at http://www.fruitfly.org/~nomi/genotator/.
This HOWTO covers getting started with using Genotator and running a typical analysis. For details and in depth discussion on subsequent examination of the results, please consult the documentation provided at the site linked above.
Note: Genotator requires an X-Windows connection to our servers. Please see our XWin HOWTO for details.
Genotator runs in two stages - first the analysis, then the examination of the analysis. To run an analysis, prepare your sequence in fasta format. It should look something like this:
>MyDrosSeq
agttgtactgaaatactcgataaggaaatacccaaattacaaaatgttcaagcacctgctgactttgttc
gccctgtgcgcggtgtttagcacctgcctgtcggaagacgagacccgtgcccgtctcctggtctctaagc
agatcctgaacaagtacctggtggagaagagcgacctgttggtacgctacaccatcttcaacgtgggcag
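Before starting Genotator, it can save time to confirm that the file really is in fasta format. The check below is a minimal sketch, not a full validator: it only verifies that the first line is a header starting with ">" and that at least one line of sequence letters follows. The function name looks_like_fasta is made up for this example.

```shell
# Succeed if the first line of file $1 is a fasta header (starts with ">")
# and at least one later line consists only of sequence characters.
looks_like_fasta() {
    awk 'NR == 1 && !/^>/ { bad = 1; exit }
         NR > 1 && /^[A-Za-z*-]+$/ { seq = 1 }
         END { exit (bad || !seq) }' "$1"
}

# Try it on a small temporary file.
demo=$(mktemp)
printf '>MyDrosSeq\nagttgtactgaaatactcg\n' > "$demo"
if looks_like_fasta "$demo"; then
    echo "looks like fasta"
else
    echo "not fasta"
fi
rm -f "$demo"
```

A file saved with Windows line endings will fail this check, which is usually worth knowing before an analysis is started.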
To start genotator, enter genotator at the command-line. You will then be presented with the main genotator analysis window.
First, press the PRESS HERE TO SELECT button to browse to your fasta format sequence file. Select the organism from which your sequence originates, then select the analyses you wish to have performed [1] (green means include).
For the output option, make sure to select Current directory, as you will not have write privileges to the default selection and Genotator will fail.
Finally, press the Start annotation button to begin the analysis. You will be presented with a notice about the run; press OK to start.
Depending on the current load on the server, the analyses may take a few minutes. Once they have completed, the Genotator window will disappear and you will be returned to the command prompt, along with a message informing you how to view your results.
To browse your results, enter genotator-browser my_sequence.tfa, replacing my_sequence.tfa with whatever was indicated by Genotator's closing message. You do not have to type in the full paths.
A colorful window summarizing the results of the various analyses performed will appear. You can click on the color bars to get details on particular results, and in some cases double-click to view details from the originating output files.
It is beyond the scope of this HOWTO to describe completely the usage of the genotator-browser, so please consult the official documentation for further elaboration.
To quit, select Quit from the File menu.
As of 2002-11-06, the Grail component does not work as the Grail server is not available.
Promoter prediction is provided by fa2TDNNpred. tRNA scanning is performed by tRNAscan-SE.
Homology searches are the result of blastx performed against the indicated databases. Please note that the installed databases may not be current at a given time.