Advanced Protein BLAST: PSI-BLAST

Welcome to the prelab reading for our session on Advanced Protein BLAST, focusing specifically on PSI-BLAST (Position-Specific Iterated BLAST). While standard BLASTP is excellent for finding closely related sequences, PSI-BLAST is a powerful extension designed to detect more distant evolutionary relationships among proteins. This tool is invaluable when searching for remote homologs that share subtle sequence similarities indicative of common ancestry and potentially shared function or structure.

1. What is PSI-BLAST?

PSI-BLAST is an iterative search method that builds a statistical model of related proteins to find more distantly related sequences in a database. It starts with a single protein sequence query and progressively refines its search criteria by incorporating newly found related sequences.

The key innovation in PSI-BLAST is the use of a Position-Specific Scoring Matrix (PSSM), also known as a profile. This PSSM captures the conservation patterns of amino acids at each position in a multiple sequence alignment of related proteins, making the search much more sensitive than using a generic scoring matrix like BLOSUM62 alone.

Purpose of PSI-BLAST:

To detect remote protein homologs that are not easily found with standard BLASTP.
To identify new members of a protein family.
To gather sequences for phylogenetic analysis or protein modeling, especially when initial searches yield few results.

2. The Core Concept: Iterative Searching and PSSMs

PSI-BLAST’s power comes from its two main components: an iterative search strategy and the use of PSSMs.

2.1 Iterative Search Strategy

The PSI-BLAST algorithm works in rounds or iterations:

Initial Search (Iteration 1): The query protein sequence is first compared against a protein database using a standard BLASTP-like algorithm (using a standard scoring matrix like BLOSUM62).
PSSM Construction: Significant local alignments found in the first iteration are used to construct a multiple sequence alignment. This alignment is then used to generate a PSSM. The PSSM essentially provides a score for each possible amino acid at each position in the alignment, reflecting the observed amino acid frequencies and conservation.
Iterative Searches (Subsequent Iterations): The PSSM generated is then used as the query to search the database again. This PSSM-based search is more sensitive at detecting sequences that fit the profile of the protein family, even if their similarity to the original query is low.
Refinement and Repetition: New significant hits found in an iteration can be incorporated into an updated multiple alignment and a refined PSSM. This new PSSM is then used for the next round of searching.
Convergence: The process is repeated until no new significant sequences are found, or a user-defined number of iterations is reached.

Figure 1: A simplified diagram of the iterative PSI-BLAST process. The query searches the database, significant hits are used to build a PSSM, and the PSSM is used for the next search round.

2.2 Position-Specific Scoring Matrix (PSSM)

A PSSM is a matrix of scores that reflect how likely each amino acid is to occur at each position in a protein domain or family. It is more informative than a simple consensus sequence because it captures the variability allowed at each position.

Rows: Typically represent the positions in the alignment.
Columns: Represent the 20 standard amino acids.
Values: Scores indicating how favorable a particular amino acid is at a specific position. Positive scores mean the amino acid is favored (more frequent than expected by chance or biochemically similar to conserved residues), while negative scores mean it’s disfavored.

Example of a PSSM (Conceptual):

Imagine a small alignment of 3 positions:

Query: A R T
Hit 1: A S T
Hit 2: G R T
Hit 3: A R S

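A PSSM derived from this would give its highest scores to ‘A’ at position 1, ‘R’ at position 2, and ‘T’ at position 3 (each is seen in three of the four sequences), smaller positive scores to the observed alternatives (‘G’ at position 1, ‘S’ at positions 2 and 3), and lower scores to all other amino acids. In this toy example every position is equally conserved; in a real alignment the PSSM also captures that some positions are more conserved than others.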
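To make the idea concrete, here is a minimal sketch that turns the toy alignment above into log-odds scores. It is not how PSI-BLAST computes its PSSM: the uniform background frequency (1/20) and the single flat pseudocount are simplifying assumptions, and the function name `build_pssm` is made up for this illustration. Real PSI-BLAST also weights sequences and uses more elaborate pseudocounts.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {amino_acid: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    n_seqs = len(alignment)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            # The pseudocount keeps unobserved residues from scoring -infinity.
            freq = (counts[aa] + pseudocount * background) / (n_seqs + pseudocount)
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


# Query plus the three hits from the toy alignment.
alignment = ["ART", "AST", "GRT", "ARS"]
pssm = build_pssm(alignment)

# Show the two best-scoring residues at each position.
for position, scores in enumerate(pssm, start=1):
    best = sorted(scores, key=scores.get, reverse=True)[:2]
    print(position, [(aa, round(scores[aa], 2)) for aa in best])
```

Residues seen in a column get positive scores in proportion to how often they occur, and residues never seen there get the same negative score, which is the behaviour described in the Values entry above.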

3. How PSI-BLAST Works: The Algorithm Steps

Let’s refine the iterative process into more discrete steps:
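One way to make the steps concrete is a toy version of the loop from Section 2.1. This is a sketch under strong assumptions, not the PSI-BLAST algorithm: sequences are ungapped and all the same length, a raw score cutoff (`min_score`) stands in for the E-value inclusion threshold, the first round uses a PSSM built from the query alone in place of BLOSUM62, and all function names and sequences are invented for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background


def build_pssm(seqs, pseudocount=1.0):
    """Log-odds scores per column from equal-length, ungapped sequences."""
    pssm = []
    for column in zip(*seqs):
        counts = Counter(column)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount * BACKGROUND)
                          / (len(seqs) + pseudocount) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, seq):
    """Sum the position-specific scores of seq against the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


def toy_psi_search(query, database, min_score, max_iterations=5):
    """Iterate: build PSSM, search, add new hits, stop when none are found."""
    included = [query]
    for iteration in range(1, max_iterations + 1):
        pssm = build_pssm(included)
        hits = [s for s in database if score(pssm, s) >= min_score]
        new_hits = [s for s in hits if s not in included]
        print(f"iteration {iteration}: {len(new_hits)} new hit(s)")
        if not new_hits:
            break  # converged: the profile finds nothing new
        included.extend(new_hits)
    return included


database = ["ARTKI", "ASTKI", "GSTKI", "WWWWW"]
family = toy_psi_search("ARTKV", database, min_score=10.0)
print(family)
```

In this example each round admits a sequence that scored too low against the previous profile, so the search reaches "GSTKI" only after "ARTKI" and "ASTKI" have widened the PSSM. The same mechanism explains why one wrongly included sequence can pull in further unrelated ones.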

4. When to Use PSI-BLAST?

PSI-BLAST is particularly useful in scenarios such as:

Detecting Remote Homologs: When you suspect a protein has distant relatives that are not found with a standard BLASTP search. This is common for proteins that have diverged significantly over evolutionary time but may retain similar structural folds or functions.
Expanding Protein Families: To identify new, previously uncharacterized members of a known protein family.
Functional Annotation: If your protein has no obvious close homologs, PSI-BLAST can sometimes find distant relatives with known functions, providing clues about your protein’s role.
Improving Multiple Sequence Alignments: The set of sequences found by PSI-BLAST can be a good starting point for building more comprehensive multiple sequence alignments for a protein family.

5. Advantages of PSI-BLAST

Increased Sensitivity: The primary advantage is its ability to detect sequences with much lower identity to the original query compared to BLASTP. The PSSM captures family-specific conservation patterns, which is more powerful than relying on general substitution matrices.
Discovery of Novel Relationships: It can uncover evolutionary links that are not apparent from direct pairwise comparisons.
Domain-Centric Search: PSSMs are often built around conserved domains, making PSI-BLAST effective for finding proteins that share these domains, even if the rest of the protein is different.

6. Important Parameters and Considerations in PSI-BLAST

Effectively using PSI-BLAST requires understanding key parameters:

E-value threshold for inclusion in the PSSM (on NCBI, the PSI-BLAST E-value threshold for PSSM construction): This determines which hits from an iteration are used to build or update the PSSM for the next iteration. A lower (stricter) threshold (e.g., 0.005, the NCBI default) reduces the chance of including non-homologous sequences in the PSSM, but might miss some true remote homologs. A higher (looser) threshold can increase sensitivity but also raises the risk of PSSM corruption.
Max number of iterations: Limits how many rounds of searching PSI-BLAST will perform. This prevents excessively long run times and can help avoid PSSM corruption if set appropriately.
Query Coverage and Identity: Check these values for newly included sequences. Low coverage or identity might indicate a match to only a small domain or a potentially spurious hit.
Database choice: Using comprehensive databases like nr (non-redundant protein sequences) is common. For specific tasks, curated databases like SwissProt might be considered, or a database filtered by taxonomy.
Compositional Adjustments: PSI-BLAST, like BLAST, can be affected by sequences with biased amino acid compositions. Modern versions often apply compositional adjustments to scoring to mitigate this.
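For readers who run searches locally, the parameters above map onto options of the `psiblast` program in NCBI BLAST+. The sketch below only assembles the command line; it does not run anything. The helper name `build_psiblast_command`, the file names, and the database name are placeholders, and the default values in the function signature are this example's choices rather than the program's own defaults (check `psiblast -help` for those).

```python
import shlex


def build_psiblast_command(query_fasta, database, num_iterations=3,
                           inclusion_ethresh=0.005, evalue=10.0,
                           pssm_out=None):
    """Assemble a psiblast argument list from the key PSI-BLAST parameters."""
    cmd = [
        "psiblast",
        "-query", query_fasta,                         # protein query in FASTA format
        "-db", database,                               # name of a local BLAST database
        "-num_iterations", str(num_iterations),        # max number of iterations
        "-inclusion_ethresh", str(inclusion_ethresh),  # E-value cutoff for PSSM inclusion
        "-evalue", str(evalue),                        # E-value cutoff for reporting hits
    ]
    if pssm_out:
        # Optionally save the final PSSM as a human-readable text file.
        cmd += ["-out_ascii_pssm", pssm_out]
    return cmd


# Placeholder file and database names.
cmd = build_psiblast_command("query.fasta", "swissprot", pssm_out="query.pssm")
print(shlex.join(cmd))
```

With BLAST+ installed and a formatted database available, the list could be passed to `subprocess.run(cmd, check=True)`. Unlike the web form, a command-line run does not pause between iterations for manual inspection, so a strict inclusion threshold matters even more there.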

A crucial aspect of using PSI-BLAST is to be vigilant about PSSM corruption. This occurs when non-homologous sequence is incorrectly included in the PSSM, either because an unrelated protein passes the inclusion threshold or because an alignment to a true homolog extends beyond the shared domain into unrelated flanking sequence (known as homologous over-extension). Because the PSSM is used for subsequent searches, this error can propagate and lead to the inclusion of more unrelated sequences, eventually “drifting” the search away from the original query’s family. Careful inspection of new hits in each iteration is essential.

7. Practical Guide: Performing a PSI-BLAST Search on NCBI

Let’s walk through a typical PSI-BLAST search using the NCBI website.

Navigate to NCBI BLAST: Go to the NCBI BLAST homepage (blast.ncbi.nlm.nih.gov).
Select PSI-BLAST: Under the “Protein BLAST” section, choose the “PSI-BLAST” program.
Figure 2: The NCBI BLAST homepage where you select the type of BLAST search. PSI-BLAST is an option under Protein BLAST.
Enter Query Sequence: Paste your protein sequence in FASTA format or provide its accession number into the “Enter Query Sequence” box.
Choose Search Set (Database): Select a database. Non-redundant protein sequences (nr) is a common choice for broad searches. You can limit by organism if needed.
Algorithm Parameters (Important for PSI-BLAST):
- Click on “Algorithm parameters” to expand this section.
- PSI/PHI/DELTA BLAST options:
  - PSI-BLAST E-value threshold for PSSM construction: Default is often 0.005. You might adjust this based on your needs (e.g., slightly higher for more sensitivity, but with caution).
  - Max number of iterations: Default is often 5. You can increase this if you expect very distant homologs and are willing to inspect results carefully.
- You can also adjust general parameters like the E-value cutoff for reporting hits in each iteration and scoring matrix for the first iteration (though the PSSM takes over later).
Figure 3: An illustrative example of where PSI-BLAST specific parameters like E-value threshold for PSSM inclusion and number of iterations would be set on the NCBI BLAST form.
Run BLAST: Click the “BLAST” button to start the first iteration.
Interpret First Iteration Results: The results will look similar to a standard BLASTP output. You’ll see a list of hits, their scores, E-values, etc.
Proceed to Next Iteration:
- If significant hits are found that you believe are true homologs, PSI-BLAST will automatically select those below the specified PSSM E-value threshold to build the PSSM for the next round.
- You will typically see a button like “Go” or “Run next iteration” and a message indicating how many new sequences were found and added to the PSSM.
- Crucially, inspect the new sequences. If any look suspicious (e.g., functionally unrelated, very short alignments, known contaminants), you can often manually uncheck them before proceeding to the next iteration. This is key to preventing PSSM corruption.
Analyze Subsequent Iterations and Convergence:
- After each iteration, examine the new hits.
- Note if the E-values of known family members improve (get smaller).
- The search “converges” when an iteration produces no new sequences below the E-value threshold for inclusion in the PSSM. At this point, PSI-BLAST stops automatically, or you can stop manually if the results are satisfactory or if suspicious hits start appearing.

Video Tutorial: For a visual guide to performing a PSI-BLAST search and interpreting the results, see Video 1, “PSI-BLAST explained & Tutorial from Scratch” (Source: Bioinformatics Coach on YouTube). Video content can change; search for “PSI-BLAST NCBI tutorial” for current options.

8. Interpreting PSI-BLAST Results

New Hits: The main goal is to find new, more distantly related sequences in later iterations that were not significant in the first iteration.
E-values: Pay attention to E-values. A sequence that was a borderline hit in iteration 1 might become a very strong hit in iteration 2 if it closely matches the PSSM.
Score: Bit scores also reflect alignment quality.
Descriptions: Read the descriptions of newly found proteins. Do they make biological sense in the context of your query protein family?
Convergence: When PSI-BLAST reports “No new sequences were found above the 0.005 threshold” (or your chosen threshold), it has converged. This means the current PSSM isn’t finding any more family members in that database under the current criteria.
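The link between the bit scores and E-values mentioned above can be written as E = m × n × 2^(−S), where S is the bit score, m the query length, and n the total database length. The short sketch below evaluates this relation; the lengths are made-up numbers, and BLAST itself uses edge-corrected "effective" lengths, so reported E-values will not match this approximation exactly.

```python
def evalue_from_bit_score(bit_score, query_length, database_length):
    """Approximate expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical 300-residue query against a small and a large database.
for bits in (30, 40, 50):
    small = evalue_from_bit_score(bits, 300, 10**8)
    large = evalue_from_bit_score(bits, 300, 10**10)
    print(f"{bits} bits: E = {small:.2e} (small db), {large:.2e} (large db)")
```

Two points follow from the formula: each extra bit halves the E-value, and the same alignment is less significant in a larger database. The second point is worth remembering when comparing runs against nr with runs against a smaller curated database.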

9. Potential Pitfalls and Best Practices

PSSM Corruption:
- Cause: Inclusion of non-homologous sequences (false positives) into the PSSM. This can happen if the E-value threshold for PSSM inclusion is too loose, or if a region of biased composition or a short, common motif leads to spurious hits.
- Prevention/Mitigation:
  - Use a reasonably strict E-value threshold for PSSM inclusion (e.g., 0.001-0.005).
  - Manually inspect sequences selected for PSSM generation before each new iteration. Deselect any suspicious hits.
  - Limit the number of iterations if corruption seems likely.
  - Be wary if the search starts drifting into completely unrelated protein families.
  - Use filters for low-complexity regions or transmembrane regions if appropriate.
Compositional Bias: Sequences with unusual amino acid compositions can sometimes lead to misleadingly high scores. PSI-BLAST implementations often include adjustments for this, but it’s good to be aware.
Over-iteration: Running too many iterations, especially with a loose E-value threshold, increases the risk of PSSM corruption. If no new biologically relevant hits are appearing, it’s often best to stop.
Database Choice: Ensure the database is appropriate and comprehensive enough for your search.
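Manual inspection scales poorly when an iteration adds many sequences, so a small script can help decide which hits to look at first. The sketch below flags hits that would enter the PSSM but have a borderline E-value or low query coverage. The dictionary keys, the hit identifiers, and the cutoffs (`min_coverage`, `borderline_factor`) are all assumptions chosen for illustration; they are not PSI-BLAST settings, and a flag is a prompt for inspection, not a verdict.

```python
def flag_for_review(hits, inclusion_ethresh=0.005, min_coverage=0.5,
                    borderline_factor=10.0):
    """Return (hit id, reasons) for hits that would enter the PSSM but look risky."""
    flagged = []
    for hit in hits:
        if hit["evalue"] > inclusion_ethresh:
            continue  # not selected for the PSSM, so nothing to review
        reasons = []
        # Within one order of magnitude of the cutoff counts as borderline here.
        if hit["evalue"] * borderline_factor > inclusion_ethresh:
            reasons.append("E-value near inclusion threshold")
        if hit["query_coverage"] < min_coverage:
            reasons.append("low query coverage")
        if reasons:
            flagged.append((hit["id"], reasons))
    return flagged


# Hypothetical hits from one iteration.
hits = [
    {"id": "hitA", "evalue": 1e-40, "query_coverage": 0.95},
    {"id": "hitB", "evalue": 0.003, "query_coverage": 0.90},
    {"id": "hitC", "evalue": 1e-8, "query_coverage": 0.20},
    {"id": "hitD", "evalue": 0.5, "query_coverage": 0.80},
]
for hit_id, reasons in flag_for_review(hits):
    print(hit_id, "->", ", ".join(reasons))
```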

10. Conclusion

PSI-BLAST is a sophisticated and powerful tool for protein sequence analysis, extending the capabilities of standard BLAST to uncover distant evolutionary relationships. Its iterative refinement of a PSSM allows for significantly increased sensitivity. However, this power comes with the need for careful execution and interpretation, particularly to avoid PSSM corruption. By understanding its principles, parameters, and potential pitfalls, you can effectively leverage PSI-BLAST to gain deeper insights into protein families, functions, and evolution.

This prelab reading should provide you with a solid foundation for our upcoming lab session on PSI-BLAST. Be prepared to apply these concepts in practical exercises.
