DISCLAIMER

Follow these instructions at your own risk. Though they have been tested by the authors, said authors and the Laboratory for Genomics and Bioinformatics take no responsibility for any damage or loss of data as a result of their use. All opinions expressed are those of the authors and not necessarily those of the University of Oklahoma Health Sciences Center. Use of this tutorial acknowledges your agreement with these terms.

 

 tutorial: setting up BLAST on OS X

This tutorial will guide you through every step necessary to set up and run BLAST searches locally on your Mac using OS X. It assumes absolutely no knowledge of unix commands; any commands you need will be explained as we go.

note: This document was meant to gently take your hand and guide you through this process, explaining as much as possible. As such, it is a bit long and will perhaps be boring for seasoned *nix-users. Don't be intimidated by its length; once you've done this a few times the entire process takes less than two or three minutes to do. If you'd like something less verbose, see the README files in the BLAST distribution.

To start, open a Darwin command-line interface (aka terminal). This can be found under

Applications -> Utilities:


Figure 01. - OS X application window showing the Terminal icon

 

 opening a terminal


Figure 02. - A newly opened Terminal window
When you double-click on the picture that says "Terminal" beneath it, a window will open that looks like Figure 02.

For those of you new to Unix-based systems, this is known as the command-line or terminal window and is where most of the work gets done. When you use the terminal, you will mostly type commands with the keyboard rather than clicking on things with your mouse.

What you will see on the last line will be different than in Figure 2. This is because "dyer-g3-jo" is the name of my computer and "posidian" is my user name. When you open this window you will see your machine name and user name in place of these.

Remember, when I describe one of the many commands that we'll use during this tutorial, you can look at the associated figure to see that command typed and how the computer responds to it.

A quick note on how to enter commands you see in this tutorial: Each command will be listed with a number in front of it. This number is not to be typed; it exists only so that we can refer to "command 7" or "commands 9-12", etc. Also, be sure to press the "return" or "enter" key (depending on which your keyboard has) after each command. This will be labeled as "<return>". Don't actually type <return>; just remember to hit the return key. Also, be sure to enter spaces between words in a command when you see them in this tutorial, such as between the words "mkdir" and "blast" in command 1 below.


 

 first commands


Figure 03. - Terminal window showing output of commands 1-3.
When you open a new terminal window, as we just did, you will be in your home directory, which in OS X is /Users/username , where you replace "username" with whatever your user name happens to be.

We are going to learn three unix commands now. They are "mkdir", "cd", and "pwd". "mkdir" stands for "make directory", cd for "change directory", and "pwd" for "print working directory".

Our first command will create a directory called "blast".

1   mkdir blast <return>

We then need to change into (move into) that directory.

2   cd blast <return>

Then we'll print out our current folder position.

3   pwd <return>


 

 making directories


Figure 04. - commands 4-7
Command 3, "pwd", is a command that doesn't actually change anything, it only lets you know where you are in your computer's file system tree. Feel free to enter it anytime to see where you are, as we will do throughout this tutorial.

Use the mkdir command again, this time to make two directories at once: "programs" and "databases". The programs directory is where we'll later put the set of BLAST programs, and the databases directory is where we'll put DNA and protein sequence files to BLAST against.

4   mkdir programs databases <return>

The "ls" command "lists" the contents of the directory you are currently in. We'll use it now to see the two folders we just created.

5   ls <return>

Now change into the programs directory.

6   cd programs <return>

Then let's see where we are.

7   pwd <return>
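
Once you are comfortable with commands 1-7, the same folder layout can be built in fewer steps. The sketch below is only an illustration: it works inside a throwaway directory created with "mktemp" (an assumption made here so that nothing in your real home directory is touched), and it uses the "-p" option of mkdir, which creates any missing parent directories along the way.

```shell
# Work in a scratch directory so the real home directory is left alone
# (the scratch location is an assumption made for this sketch).
scratch=$(mktemp -d)
cd "$scratch"

# Create blast/ together with its two subdirectories in one step.
mkdir -p blast/programs blast/databases

# Move into the programs directory and confirm where we are.
cd blast/programs
pwd

# List the parent directory to see both folders.
ls ..
```

If you follow the tutorial step by step you do not need this; it simply shows that the two mkdir commands and two cd commands can be combined.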


 

 connecting via FTP


Figure 05. - Connecting to NCBI's FTP server (commands 8-10)
We are going to download from NCBI using a program called ftp. FTP stands for "file transfer protocol" and is a very common method to transfer files between computers, especially unix-based systems.

This command connects us to NCBI's FTP server

8   ftp ftp.ncbi.nih.gov <return>

After a moment, a lot of information will flood the screen. When it stops, you will be prompted to enter a Name or User Name. Use the name "anonymous".

9   anonymous <return>

It will then ask for your password, and you should use your email address. While you type your password it will be invisible. This is just a security thing, your keyboard isn't broken!

10   whomever@wherever.com <return>

This will bring you to an ftp command prompt (ftp>) as in figure 5.


 

 listing available BLAST packages

FTP commands are much like UNIX commands.

Let's change into the directory where the BLAST programs are.

11   cd blast/executables/LATEST <return>

Then see a listing of all the ones that are available.

12   ls <return>

(note, you may have to resize your window to see the file listing properly)

The long listing of files you see is just different versions of the BLAST program. Each version is tailored for different types of computers and operating systems.


Figure 06. - The list of downloadable BLAST packages (commands 11 & 12)

 

 downloading BLAST

Now, for OS X, we want to download the file that starts with "blast-" (not "netblast-") and has powerpc-macosx in the name. The rest of the file name has version numbers that may change, so you will have to look for the name yourself and modify the command below if necessary. The aptly-named "get" command is how you download files with ftp.

13   get blast-2.2.6-powerpc-macosx.tar.gz <return>

It will take a moment to download (it's about 4MB). When you have returned to an ftp command prompt, as in figure 7, continue on to the next step.


Figure 07. - Downloading BLAST packages for OS X (command 13)

 

 looking for test sequences

We have downloaded BLAST, but we also need to download some sequences to BLAST against and test our BLAST installation. In this tutorial we will download the E. coli genome and use it for our tests.

Let's move into the directory we created to hold our database (sequence) files. The "lcd" command is only valid when you are connected to a server with the ftp program. It stands for "local change directory", and is used when you want to change to a new directory on your machine, not on the server you are connected to. We will use it now to go up one directory (represented by a "../") and then into the "databases" directory we created earlier.

14   lcd ../databases/ <return>

Now let's move to the place on NCBI's server where the E. coli genome is kept.

15   cd /genomes/Bacteria/Escherichia_coli_K12 <return>

Printing our working directory should show us that we are in "/genomes/Bacteria/Escherichia_coli_K12".

16   pwd <return>

And when we list the contents of that directory we should see several files that start with "NC_"

17   ls <return>


Figure 08. - Available files for the Escherichia coli K12 genome (commands 14 - 17)

 

 downloading the Escherichia coli K12 genome

Let's get the NC_000913.fna file first. The ".fna" extension stands for FASTA Nucleic Acid, which lets you know that this is a nucleic acid sequence in FASTA format. (You will have to wait a moment between these commands while each sequence downloads.)

18   get NC_000913.fna <return>

Now get the NC_000913.faa file. ".faa" stands for FASTA Amino Acid, which, of course, is the collection of amino acid sequences of the E. coli genome in FASTA format.

19   get NC_000913.faa <return>

And finally we are done with NCBI's FTP server, so we'll issue the "quit" command.

20   quit <return>


Figure 09. - Downloading the E. coli genome sequences. (commands 18 - 20)

 

 clearing your screen

After quitting ftp we are back in our programs directory. Let's clear the junk off our screen by typing the "clear" command. This makes everything look nicer, and you can still scroll back up and look at your previous commands if you wish.

21   clear <return>


Figure 10. - Output after a "clear" command. (command 21)

 

 de-compressing BLAST

The set of programs in the blast distribution that we downloaded is bundled together and compressed into a single file. This makes suites of programs easier to download and saves on storage space. We can uncompress them using the "tar" command with the following options:

22   tar -xzf blast-2.2.6-powerpc-macosx.tar.gz <return>

The "tar" command, when used with these options, does not produce any output on the screen. But if you list the contents of the directory you will see that many files were created.

23   ls <return>


Figure 11. - File listing after unpacking the BLAST distribution (commands 22 & 23)
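
If you would like to see what tar does before running it on the real download, here is a small practice sketch. It builds its own tiny archive in a scratch directory; the file names are placeholders invented for this example and are not the real contents of the BLAST distribution. It also shows two standard tar options not used above: "-t", which lists an archive's contents without unpacking it, and "-v", which prints each file name as it is unpacked.

```shell
# Build a small practice archive in a scratch directory; the file names
# are placeholders, not the real contents of the BLAST distribution.
scratch=$(mktemp -d)
cd "$scratch"
mkdir demo
echo "placeholder program" > demo/blastall
echo "placeholder notes" > demo/README.txt

# -c creates an archive, -z compresses it with gzip, -f names the archive file.
tar -czf demo.tar.gz demo
rm -r demo

# -t lists what is inside the archive without unpacking it.
tar -tzf demo.tar.gz

# -x extracts; adding -v prints each file name as it is unpacked.
tar -xzvf demo.tar.gz
ls demo
```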

 

 creating the NCBI initialization file

We need to create a file that BLAST needs to work properly. It must be in our home directory, called ".ncbirc", and have very specific contents.

Let's change to our home directory by using the "cd" command with no options. When you just type "cd", it assumes that you want to go to your home directory.

25   cd <return>

A quick and easy way to write to a file is by combining the "echo" command with the ">" redirector. That sounds really technical, but when you take them individually it isn't that bad. "echo" makes the computer print something on the screen. The ">" redirector, when used like this, catches anything that is supposed to be going to the screen and instead redirects it to a file. So, command 26 takes the word "[NCBI]", creates a file called ".ncbirc" (notice the period at the beginning), and puts the word into the file.

26   echo "[NCBI]" > .ncbirc <return>

We want to add one more line to the file we just created. This is done much like command 26 except we use the ">>" redirector instead of ">". This is because ">" creates or overwrites a file and ">>" just adds stuff to the end of a file. Since we want to add a second line, we use ">>".

27   echo "Data=/Users/posidian/blast/programs/data" >> .ncbirc <return>

You can then test that the file was written correctly with this command:

28   cat .ncbirc <return>

Command 28 should print out two lines, as shown near the end of figure 13.


Figure 13. - Creating and displaying the .ncbirc file (commands 25 - 28)
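
Remember that the "Data=" line must point into your own home directory, not /Users/posidian. One way to avoid typing the path by hand is to let the shell fill it in with the $HOME variable. The sketch below assumes the blast/programs layout built in this tutorial, and it writes to a practice file called "ncbirc.example" (a name made up for this example) in a scratch directory, so your real .ncbirc is not touched.

```shell
# Practice in a scratch directory rather than the real home directory.
scratch=$(mktemp -d)
cd "$scratch"

# ">" creates the file (or overwrites it) with the first line.
echo "[NCBI]" > ncbirc.example

# ">>" appends the second line; the shell replaces $HOME with your home directory.
echo "Data=$HOME/blast/programs/data" >> ncbirc.example

# Show the result: two lines, the second ending in /blast/programs/data.
cat ncbirc.example
```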

 

 preparing to run formatdb

BLAST is set up, and we are now ready to format the E. coli sequence files so that we can perform searches against them. There is a program called formatdb that you must run on any sequence files that you wish to BLAST against.

Let's start by changing into the directory where we downloaded the genome sequences.

29   cd blast/databases/ <return>

List out what's there.

30   ls <return>

You can see that these files aren't named very intuitively. Instead of the accession number, we would rather have a more generic name for our E. coli sequence. The "mv" command is used to rename or move files. It takes two arguments: the file you want to change the name or location of, and then the name or location that you want it to have. We'll rename NC_000913.faa to ecoli.faa.

31   mv NC_000913.faa ecoli.faa <return>

Now rename NC_000913.fna to ecoli.fna.

32   mv NC_000913.fna ecoli.fna <return>

We can see the results of these by listing the contents of the directory.

33   ls <return>


Figure 14. - Renaming the E. coli genome sequence files (commands 29 - 33)

 

 setting the path environmental variable

Before we can run formatdb, OS X needs to know where the blast executables are. We can make it remember by adding the path to the executables to the PATH environmental variable. If that means nothing to you, don't worry about it. It is UNIX stuff that is beyond the scope of this tutorial. Fortunately, doing it is easy:

Let's clear the screen first.

34   clear <return>

Now we'll write to a file much like we did with command 27. Note that this command may wrap around the column on this page. You should enter it as one long line (as shown in figure 15) and ignore any wrapping that might have happened in your browser.

35   echo 'set path = ( $path /Users/posidian/blast/programs )' >> ~/.cshrc <return>

The "source" command tells the computer to read a file and do different things depending on what the file says.

36   source ~/.cshrc <return>

To test that this worked, we'll make the computer tell us our PATH.

37   echo $PATH <return>

The output of that last command should be similar to that of figure 15. It is ok if it is different; just be sure that it has a path like /Users/posidian/blast/programs somewhere in it (with your own user name in place of "posidian", as usual).


Figure 15. - Modifying and displaying the PATH environmental variable (commands 34 - 37)
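
Command 35 is written in the syntax of the csh/tcsh family of shells. If your Terminal happens to run bash or zsh instead (an assumption about your setup; typing "echo $SHELL" will tell you which shell you have), the rough equivalent looks like the sketch below. It changes the PATH only for the terminal window you are in; to make it permanent, the same "export" line would go in that shell's startup file.

```shell
# bash/zsh equivalent of command 35, for the current session only.
# $HOME stands in for /Users/yourusername.
export PATH="$PATH:$HOME/blast/programs"

# Confirm that the new directory now appears at the end of the PATH.
echo "$PATH"
```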

 

 running formatdb

Since we are creating a database of the amino acid sequences, and another of the nucleic acid sequences, we will run formatdb twice.

Again, we'll clear the junk off our screen.

38   clear <return>

Now format the nucleic acid database. Notice the difference in the "-p" option between commands 39 and 40.

39   formatdb -i ecoli.fna -p F -o T <return>

Then format the amino acid database.

40   formatdb -i ecoli.faa -p T -o T <return>

See what files were created.

41   ls <return>

The options we passed to the formatdb command, and many others we didn't use, are explained in the README.formatdb file mentioned earlier. In short, the "-i" option tells formatdb which file to format. The "-p" option is T (true) if the input file contains amino acid (protein) sequences or F (false) if it contains nucleotide sequences. The "-o" option is either T or F, depending on whether we wish to use indexing, which requires the FASTA headers to be formatted as specified by NCBI. This is out of the scope of this tutorial and more information can be found in the README files. Just keep in mind that any sequences downloaded from NCBI's ftp site can be formatted with the "-o" option set to T, which is preferred.


Figure 16. - Running formatdb on both genome sequence files (commands 38 - 41)

 

 creating an example file

We now have the BLAST programs set up and databases formatted and ready to use. What we need now is a sequence file to test our setup with. We are going to pull one out of the genome; since it came from the database itself, it will certainly at least match itself. There are several ways of doing this, but a quick and easy way is to use the "tail" command to view the last few lines of the genome sequence, count how many of those last lines make up one record, and then use the "tail" command again but redirect the output to a file. Here we go:

This will show us the last 15 lines of the amino acid sequence file.

42   tail -15 ecoli.faa <return>

Here, from the bottom, you count the lines of sequence plus the line that starts with a ">" sign. In my case, the last 5 lines are what I want (as highlighted in figure 17). Because I want the last 5 lines, I will use the number 5 in the tail command and redirect the output to a file called "testseq.faa" one directory above the current one.

43   tail -5 ecoli.faa > ../testseq.faa <return>


Figure 17. - Creating an example file with one sequence to test our BLAST setup (commands 42 & 43)
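
Counting lines by eye works fine for one record, but the shell can also do the counting. The sketch below is one possible way, shown on a tiny made-up FASTA file ("toy.faa", invented for this example) rather than the real genome: "grep -n" numbers the header lines, and "tail -n +N" prints everything from line N to the end of the file.

```shell
# Build a tiny two-record FASTA file in a scratch directory
# (toy data made up for this sketch, not real E. coli sequence).
scratch=$(mktemp -d)
cd "$scratch"
printf '>seq1 first toy record\nMKT\nAAA\n>seq2 last toy record\nMSE\nQQQ\nLLL\n' > toy.faa

# Find the line number of the last header (the last line starting with ">").
last_header=$(grep -n '^>' toy.faa | tail -n 1 | cut -d: -f1)

# Print from that line to the end of the file and save it as the test sequence.
tail -n +"$last_header" toy.faa > testseq.faa
cat testseq.faa
```

On the real file you would replace "toy.faa" with "ecoli.faa" and write the result to "../testseq.faa", as in command 43.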

 

 moving back to the BLAST directory

You should now have the testseq.faa file in the original blast directory. If you have followed this tutorial to this point you can go there with these commands:

We'll clear the screen first

44   clear <return>

Then go up one directory

45   cd ../ <return>

Now remind ourselves where we are. This should appear like figure 18.

46   pwd <return>

And list the contents of the directory to make sure the testseq.faa we just created is here.

47   ls <return>


Figure 18. - Moving back to the BLAST directory (commands 44 - 47)

 

 first BLAST

Let's BLAST the file with the program "blastall". Since we added the programs to your PATH variable, you can run BLAST anytime by using the "blastall" command. You must use options to tell blastall where the database and query files you want to use are. Since we are in the same directory as our query file, we can just give the name of the file; otherwise we would have to use both the path to the file and the name of the file. So, to BLAST our amino acid test sequence against the amino acid database, we do the following:

Clear the screen again.

48   clear <return>

Now we'll do our first BLAST. The results will be returned in a tab-delimited format.

49   blastall -p blastp -i testseq.faa -d databases/ecoli.faa -e .001 -m 9 <return>


Figure 19. - Running our first local BLAST (commands 48 - 49)

 

 BLAST options

That was fast wasn't it? Great, you say, but what did we just do? The "-p" option tells BLAST which program to use. Since we wanted to compare a protein sequence against a protein database, we use "blastp". Here is the full list of available programs:

  • blastp - compares an amino acid query sequence against a protein sequence database
  • blastn - compares a nucleotide query sequence against a nucleotide sequence database
  • blastx - compares a nucleotide query sequence translated in all reading frames against a protein sequence database
  • tblastn - compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
  • tblastx - compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
The "-d" option is the path to the database you want to use. Since ours was in the databases folder (relative to where we are) and is called ecoli.faa, we used the option "-d databases/ecoli.faa".

The "-e" option is used if you want to filter the results that are returned by E value. Using "-e .001" makes blastall report only hits whose E values are 0.001 (1e-3) or lower.

Finally, the "-m" option always has a number as its argument and changes the way that blastall displays its results. Values of 8 and 9 will return nice short tabular formatted rows. Experiment with different values or omit the "-m" to see the traditional BLAST output. For example:

Clear the screen yet again.

50   clear <return>

Perform the BLAST. This time, we'll save (redirect) the output in a file.

51   blastall -p blastp -i testseq.faa -d databases/ecoli.faa -e .001 > results.txt <return>

Look at the file listing in a detailed manner.

52   ls -l <return>


Figure 20. - Running a BLAST and directing the output to a file
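
The short tabular output produced with "-m 8" is also handy for further filtering with ordinary unix tools. The sketch below assumes the usual twelve tab-separated columns of that format (query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E value, bit score). To stay self-contained it uses two invented rows in a file called "hits.example.txt" rather than real BLAST results, and it uses the "awk" command, which is not otherwise covered in this tutorial.

```shell
# Practice in a scratch directory.
scratch=$(mktemp -d)
cd "$scratch"

# Two invented rows in the 12-column tabular layout (not real BLAST results).
printf 'query1\tsubjA\t100.00\t50\t0\t0\t1\t50\t1\t50\t1e-25\t105\n' > hits.example.txt
printf 'query1\tsubjB\t35.00\t40\t26\t0\t5\t44\t10\t49\t0.5\t20.1\n' >> hits.example.txt

# Keep rows whose E value (column 11) is 0.001 or lower and print
# the subject id, percent identity, and E value.
awk -F '\t' '$11 + 0 <= 0.001 { print $2, $3, $11 }' hits.example.txt > strong_hits.txt
cat strong_hits.txt
```

Only the first invented row passes the filter, because its E value is below 0.001.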

 

 graduation

As seen in figure 20, that command doesn't produce any output on the screen. That is because we used the ">" redirector again, which redirects the output to a file that you name. You can view that file with any text editor or by using the unix "cat", "more", or "less" commands followed by the filename.

That's it! There are many other options available to BLAST that are outside the scope of this tutorial, but here you have learned the basics of setting up the BLAST executables locally. We also formatted a database and performed a test BLAST against it. Just remember that you can create your own custom BLAST databases by putting as many sequences as you like in a FASTA formatted file and then running the formatdb command. You then use the blastall command to do the actual BLASTing.

For more information, the following resources may be helpful:

  • [link] - Blast tutorial at NCBI
  • [link] - Blast - O'Reilly & Associates
  • [link] - Sequence Analysis In a Nutshell - O'Reilly & Associates
Also, the Laboratory for Microbial Genomics offers informatics support on an hourly basis. Contact Matt Carson for more information.

 

 

No longer will you rely on NCBI's webserver when it slows to a crawl during the day from overuse! Go forth, and BLAST locally.

updated: 2004-03-16

author: joshua-orvis@ouhsc.edu