PhysBinder - The Search for TFBS

Help Document

1. Input

a. Upload a sequence of your interest

Users can paste a FASTA-formatted sequence in the text field, enter genomic regions, or upload a FASTA file by clicking on the "Upload a FASTA file" button. The FASTA file should contain only header lines (lines starting with ">") and sequence lines (which hold the actual sequence).

Tutorial: Choose a method to upload your sequences of interest.
You can paste sequences from the clipboard into the input field. These sequences should be in FASTA format.
Another way to provide sequences is to enter genomic regions. Just select the species and genome version, and enter the location in the text box.
You can also click the "Upload a FASTA file" button, then click the browse button and upload a FASTA-formatted sequence file from your computer.
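
As a rough illustration of the format requirement above, the following sketch checks that a text contains only header lines and sequence lines before it is submitted. The function name `check_fasta` and the accepted alphabet (IUPAC nucleotide codes, upper or lower case) are assumptions made for this example; they are not part of PhysBinder.

```python
# Hypothetical helper for illustration; not part of PhysBinder.
ALLOWED = set("ACGTUNRYKMSWBDHV")  # assumed alphabet: IUPAC nucleotide codes


def check_fasta(text):
    """Return a list of (header, sequence) pairs, or raise ValueError."""
    records = []
    header, chunks = None, []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            if header is None:
                raise ValueError(f"line {number}: sequence before any header")
            if not set(line.upper()) <= ALLOWED:
                raise ValueError(f"line {number}: unexpected character")
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    if not records:
        raise ValueError("no FASTA records found")
    return records
```

The case of the sequence lines is preserved, so soft-masked (lower-case) regions survive the check.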

b. Select a threshold

It is important to select a suitable threshold for your experiments. In PhysBinder we offer three precalculated thresholds ("Max. Precision", "Average", "Max. F-Measure"). All of these thresholds were calculated on an external control set. This set was not used to build the model. Alternatively, one can enter a custom threshold.

Scores and thresholds range from 1 to 1000. The score is calculated by the Random Forest algorithm and indicates the confidence the algorithm has in the result. We calculated thresholds for deciding which results are valid in three ways. The Max. Precision threshold is the minimum score at which no false positive predictions were returned on a test set. The Max. F-Measure threshold was chosen to maximize the F-Measure. The Average threshold is the average of these two thresholds. Users who want to use a custom threshold should take a look at the ROC curves on the models page to get an idea of what to expect from each score.

- The Max. Precision threshold guarantees a minimal number of false positive predictions, while assuring that the positive predictions are of top quality.

- The Max. F-Measure threshold tries to balance precision and recall (% of identified true positives). The F-Measure is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst value at 0.

- The Average threshold is the average of the Max. Precision and Max. F-Measure thresholds. It is a good starting point if you do not yet know which threshold suits your experiment best.
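
To make the three definitions above concrete, here is a minimal sketch that derives them from a small scored control set. The data, the function names, and the convention that a prediction counts as a hit when its score is greater than or equal to the threshold are all assumptions for this example; the actual PhysBinder thresholds were precalculated by the authors on their own control set.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (0.0 when there are no true positives)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def precalculated_thresholds(scored):
    """scored: list of (score, is_true_site) pairs from a control set."""
    best_f, max_f_threshold = -1.0, None
    max_precision_threshold = None
    for t in sorted({score for score, _ in scored}):
        tp = sum(1 for s, positive in scored if s >= t and positive)
        fp = sum(1 for s, positive in scored if s >= t and not positive)
        fn = sum(1 for s, positive in scored if s < t and positive)
        f = f_measure(tp, fp, fn)
        if f > best_f:
            best_f, max_f_threshold = f, t
        if fp == 0 and max_precision_threshold is None:
            max_precision_threshold = t  # lowest score with no false positives
    if max_precision_threshold is None:
        raise ValueError("the top-scoring example is a negative")
    average = (max_precision_threshold + max_f_threshold) / 2
    return {"ppv": max_precision_threshold, "f1": max_f_threshold, "avg": average}


# Made-up control set: (score, is this a true binding site?)
control = [(900, True), (800, True), (700, False), (600, True),
           (400, False), (300, False), (200, True)]
print(precalculated_thresholds(control))
```

In this toy set the Max. Precision threshold is the strictest of the three, the Max. F-Measure threshold is the most permissive, and the Average lies between them.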

Tutorial: Choose a threshold suitable for your experiments.
Just select one of the three precalculated thresholds ("Max. Precision", "Average", "Max. F-Measure") or set a custom threshold (enter a number between 1 and 1000). The default threshold is "Average."

c. Select a transcription factor binding site model

Currently we offer more than 60 different vertebrate TFBS models, and the number of models will continue to grow. All models are built from available ENCODE ChIP-Seq experiments and other sources. Models can be filtered by species name, by model type, or with a custom search term.

Model type:

- DE (Direct Evidence): models built from experimental data that clearly contain a consensus motif that has been reported in literature.

- PAF (Possibly Associated Factor): models built from ChIP-Seq data that clearly contain a sensible motif that has NOT been associated with the transcription factor yet. These are often factors associated with the transcription factor. It is also possible that the model represents the actual transcription factor with a consensus sequence that is not yet known.

Tutorial: Choose models.
You can select models in the models window. We use color codes to indicate the type of model: green means DE; yellow stands for PAF; human is blue and mouse is grey.
Filter by species name. Currently we have human and mouse models.
Filter by evidence type. Choose "DE" or "PAF"; see above for their definitions.
Enter a search term in the input field to search for a certain transcription factor. It is possible to search for aliases of the transcription factor. Just make sure you select the "Include Aliases in Search Terms" checkbox.
If you click on the grey triangle at the bottom of a model icon, all aliases will appear.

d. Optional arguments

The following additional options need some extra explanation.

- Email address: It is possible to provide an email address but this is not required. If an email address is provided, an email will be sent when the calculations are finished. If the calculations result in an error, you will get an error report. We will not use your email address for any purpose other than informing you about your calculations.

- Use as filter: For performance reasons, it is possible to pre-filter the sequences using a short PWM with mild thresholds, chosen to keep recall as high as possible. This substantially speeds up the calculations. To limit the load on our servers, this option is enabled by default. Unless you have a specific reason not to use this filter step, it is best to keep it turned on.
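
The idea behind such a pre-filter can be sketched as follows: a cheap PWM score is computed for every window on both strands, and only windows that pass a mild threshold would be handed to the more expensive model. The matrix values, the threshold, and the function names below are invented for illustration and do not reflect the PWMs or settings used by PhysBinder.

```python
# Hypothetical 4-position log-odds PWM; the values are made up for illustration.
PWM = [
    {"A": 1.2, "C": -1.0, "G": -1.0, "T": -0.5},
    {"A": -1.0, "C": 1.1, "G": -0.8, "T": -1.0},
    {"A": -1.0, "C": -0.9, "G": 1.3, "T": -1.0},
    {"A": -0.7, "C": -1.0, "G": -1.0, "T": 1.0},
]

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def pwm_score(window):
    """Sum the per-position log-odds; unknown bases get a fixed penalty."""
    return sum(column.get(base, -2.0) for column, base in zip(PWM, window))


def prefilter(sequence, mild_threshold=0.0):
    """Return (start, strand) for every window worth passing to the full model."""
    sequence = sequence.upper()
    width = len(PWM)
    candidates = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        reverse = window.translate(COMPLEMENT)[::-1]  # reverse complement
        if pwm_score(window) >= mild_threshold:
            candidates.append((start, "+"))
        if pwm_score(reverse) >= mild_threshold:
            candidates.append((start, "-"))
    return candidates
```

A mild threshold lets through many windows that the full model will later reject, which is the intended trade-off: the filter should rarely discard a true site.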

2. Output

a. Summary

By default, the summary section is hidden. When the user clicks on the green arrow, a table with some statistics about the results is shown. The summary section on the results page gives an indication of the number of hits that exceed the chosen threshold for each model. The results can be sorted according to the model or according to the input sequence.

Tutorial: Summary table.
By default, the summary table is ordered by model and lists the number of hits per sequence. You can also order the table by sequence ID. Each sequence link is clickable.
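
The kind of table described above can be sketched as a count of hits per model and sequence. The tuple layout, the function names, and the example model and sequence names are assumptions for this illustration, not the tool's internal data format.

```python
from collections import Counter


def summarize(hits, threshold):
    """hits: iterable of (model, sequence_id, score); count hits at or above threshold."""
    counts = Counter()
    for model, sequence_id, score in hits:
        if score >= threshold:
            counts[(model, sequence_id)] += 1
    return counts


def rows_by(counts, key="model"):
    """Return (model, sequence_id, n_hits) rows ordered by model or by sequence."""
    index = 0 if key == "model" else 1
    rows = [(model, seq, n) for (model, seq), n in counts.items()]
    return sorted(rows, key=lambda row: (row[index], row[1 - index]))


# Made-up hits for two models on two sequences.
example_hits = [("CTCF", "seq1", 900), ("CTCF", "seq1", 450),
                ("CTCF", "seq2", 700), ("MYC", "seq1", 650)]
for row in rows_by(summarize(example_hits, 500), key="sequence"):
    print(*row, sep="\t")
```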

b. Change thresholds

Thresholds can be changed by entering a new value in the "Change thresholds" field. You can enter any value between 1 and 1000 in this field. You can also type "avg", "ppv", or "f1" to get, respectively, the Average, the Max. Precision, or the Max. F-Measure threshold for each model.

Tutorial: Recalculate with different threshold.
Click on the green arrow to the right of the "Set Threshold" line and enter a value in the threshold field.
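
How a value typed into this field could be interpreted is sketched below. The function name, the dictionary of per-model thresholds, and the decision to accept decimal numbers are assumptions for this example; only the keywords "avg", "ppv", "f1" and the 1 to 1000 range come from the description above.

```python
def resolve_threshold(field, model_thresholds):
    """Turn the text typed by the user into a numeric threshold for one model.

    model_thresholds: dict with the keys 'avg', 'ppv' and 'f1' (hypothetical layout).
    """
    text = field.strip().lower()
    if text in ("avg", "ppv", "f1"):
        return model_thresholds[text]
    value = float(text)  # raises ValueError for non-numeric input
    if not 1 <= value <= 1000:
        raise ValueError("threshold must be between 1 and 1000")
    return value
```

Because the keywords are resolved per model, typing "ppv" can yield a different numeric threshold for each model, whereas a number applies the same cut-off everywhere.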

c. Detailed results

The results are visualized in this section. Per sequence, hits that exceed the threshold are indicated with a colored bar. The bar is shaded from a light color (low score) to a dark color (high score). An arrow indicates the orientation of the binding site (forward or reverse). More information is displayed by clicking on the arrow. Binding sites of models can be dynamically shown or hidden by clicking on the corresponding checkboxes. Nucleotides in a gray font were not scanned due to model limits. Repeats from RepeatMasker and Tandem Repeats Finder are shown in lower case; non-repeating sequence is in upper case (as used in the UCSC Genome Browser for downloadable genome data).

Tutorial: Visualizing hits.
Show/hide hits of a model by clicking on the checkboxes above each sequence. Toggling this checkbox will only show/hide the hits on this sequence. If no hits are found for a particular model, no checkbox for this model is shown.
Toggling checkboxes on the side of the screen will affect all sequences.
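
Because repeats are encoded purely by letter case, they are easy to locate in a downloaded sequence. The sketch below finds the lower-case runs and checks whether a hit overlaps one of them; both function names and the coordinate convention (0-based, end-exclusive) are assumptions for this example.

```python
import re


def repeat_intervals(sequence):
    """Return 0-based, end-exclusive (start, end) spans of lower-case bases."""
    return [match.span() for match in re.finditer(r"[a-z]+", sequence)]


def overlaps_repeat(hit_start, hit_end, intervals):
    """True if the hit [hit_start, hit_end) shares at least one base with a repeat."""
    return any(hit_start < end and start < hit_end for start, end in intervals)


masked = "ACGTacgtnnACGTac"
print(repeat_intervals(masked))
```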

d. Map to reference genome, visualize in UCSC Genome Browser, get a BED-file or integrate ENCODE data.

It is possible to map the different sequences to a human or mouse reference genome. This can be done by clicking on the "blat" button below each sequence. If the sequences were fetched from UCSC on the input page, this is not necessary because the sequence location is already known. Once the sequence location is known, either from the input page or from running blat on the sequence, some extra options become available:

- The sequence can be visualized in the UCSC Genome Browser. To do this, click on the "Map to UCSC" button.

- A BED file with the genomic regions is available for download. Just click on the "Get BED file" button.

- ENCODE data available for the region can be integrated into the results. Click on the "show ENCODE TFBS" button to fetch all ENCODE regions for this sequence.

Tutorial: Working with genomic regions.
If the sequence originates from the human or mouse genome, blat can be used to look for the genomic region.
Select the correct reference genome for your sequence. If your sequences are human, select hg19. If the sequences come from mouse, select mm10. Press the Blat button to continue. It can take a while to process the results.
When blat is executed, some other options become available. Select the blat result with the highest overlap from the drop-down list (indicated in %). It is now possible to map your sequences to UCSC, to download a BED file and to integrate ENCODE data.
Clicking the "map to UCSC" button will visualize all TFBS hits in the UCSC Genome Browser.
Binding sites are visualized as a custom track in the genome browser. The original query sequence is shown as a black bar.
Downloading the results is done by clicking the "get BED file" button.
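
For readers unfamiliar with the BED format, the sketch below shows how hits on a mapped sequence could be written as six-column BED lines (chromosome, start, end, name, score, strand). BED coordinates are 0-based with an exclusive end, and the BED score column ranges from 0 to 1000. The function name and the hit tuple layout are hypothetical, and the exact columns of the file produced by PhysBinder may differ.

```python
def hits_to_bed(chrom, region_start, hits):
    """Format hits as BED lines.

    region_start: 0-based genomic start of the query sequence.
    hits: iterable of (model, offset, length, score, strand), where offset is
          the 0-based position of the hit within the query sequence.
    """
    lines = []
    for model, offset, length, score, strand in hits:
        start = region_start + offset  # BED starts are 0-based
        end = start + length           # BED ends are exclusive
        lines.append(f"{chrom}\t{start}\t{end}\t{model}\t{int(score)}\t{strand}\n")
    return "".join(lines)


# Made-up hit: a 19 bp site, 20 bases into a query that starts at chr1:1000 (0-based).
print(hits_to_bed("chr1", 1000, [("CTCF", 20, 19, 873, "+")]), end="")
```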

Recently, an enormous number of ChIP-seq datasets from the ENCODE consortium became publicly available. We integrated the human TFBS ChIP-seq sets into the PhysBinder web tool. ENCODE data can be integrated by clicking on the "show ENCODE TFBS" button. ENCODE data will be shown, along with the PhysBinder predictions, as grey bars. This way the genomic context of the PhysBinder predictions becomes immediately clear. The different ENCODE tracks can be toggled on or off using the ENCODE checkboxes.

Tutorial: Working with ENCODE data.
Click on the "show ENCODE TFBS" button to show overlapping TFBS ChIP-seq sets on the sequence as grey bars.
When no ENCODE tracks are found in the sequence, a message is displayed below the sequence.
If ENCODE tracks for this genomic region are found, a list of checkboxes will appear next to the sequence. The ENCODE tracks are visualized as simple grey bars below the sequence. In regions with many ENCODE tracks, the grey bars will be stacked and the area becomes darker.
The different ENCODE tracks can be switched on or off by clicking the corresponding checkboxes. You can switch on or off multiple tracks by clicking the "select all" or "deselect all" option below the checkboxes. Do not forget to press the update button!

e. Download FASTA-file and feature-color-file to create publication graphics.

For each sequence in the results section we offer a download of the FASTA-file and feature-color-file with the binding sites. Both files can be used in Jalview to create custom visualizations of the results. Click on this link to get a PowerPoint presentation with a step-by-step overview on how to do this.

Tutorial: Download FASTA-file and feature color file.
Click on the download links for the FASTA file and the feature color file to save them locally. You can then use them to create graphics in Jalview.

3. Background (the algorithm)

Overview of our approach: The input from which models are built consists of the two classes of nucleotide sequences that the method should learn to separate. One class contains positive sequences known to be bound in vivo; the other contains negative or background sequences, highly unlikely to be bound in vivo. Each nucleotide sequence, from either class, is converted into multiple series of values; each series provides values for a specific DNA structural characteristic at all positions of the TFBS and its context (structural model), or simply consists of one base or two base parts of the sequence (NPD).

Basic selection of relevant features (i.e. positions) is made by statistical comparison of distributions of values for positive and negative sequences with low thresholds. Further selection is performed through wrapper-based feature selection, i.e. cross-validation performance evaluation with the Random Forest algorithm. Per characteristic, redundant features are removed by sequential backwards elimination (SBE).

Several models with one characteristic can then be merged. The final NPD model and final structural model can be merged into one integrative model. The resulting model can be used by the Random Forest classifier to predict the likelihood that a nucleotide sequence is a TFBS, after converting the sequence into series of the features contained in the model.
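
The sequential backwards elimination step mentioned above can be sketched generically: starting from all features, the feature whose removal hurts the evaluation least is dropped, and this is repeated until every further removal lowers the score. In the approach described here the evaluation would be cross-validated Random Forest performance; the sketch below substitutes a toy scoring function, and the feature names, the `tolerance` parameter, and the stopping rule are assumptions for illustration rather than the published procedure.

```python
def backward_elimination(features, evaluate, tolerance=0.0):
    """Drop features one at a time while the score stays within tolerance of the best.

    features: list of unique feature names.
    evaluate: function mapping a list of features to a score (higher is better).
    """
    selected = list(features)
    best = evaluate(selected)
    while len(selected) > 1:
        # Score every subset obtained by removing exactly one feature.
        trials = [(evaluate([f for f in selected if f != drop]), drop)
                  for drop in selected]
        score, drop = max(trials, key=lambda trial: trial[0])
        if score < best - tolerance:
            break  # every possible removal hurts too much
        selected.remove(drop)
        best = max(best, score)
    return selected


# Toy stand-in for cross-validated performance: only two positions are informative.
informative = {"pos3", "pos7"}


def toy_score(feature_subset):
    return len(informative & set(feature_subset))


print(backward_elimination(["pos1", "pos3", "pos5", "pos7"], toy_score))
```

With a real classifier the evaluation is noisy, so a small positive `tolerance` is one way to keep removing features whose contribution is within the noise.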

For more detailed information on the algorithm and a validation and comparison to other algorithms we refer you to the published Nucleic Acids Research paper here.

4. How to cite us...

We have invested considerable time and effort in developing this algorithm and designing the web tool. We would appreciate it if you took a few minutes to cite the corresponding papers. Please take a look at our citation page for the correct references.