Forum: open-discussion

RE: Idea Generator [ Reply ]
By: Samuel Brown on 2011-06-11 05:07

Rupert and Stephane:

I have just put the finishing touches on rankSlidWin which will be included in the next release of spider.

Until then though, you can access the raw data of slidWin objects simply by typing the name of the object: in Stephane's case 'RC100h'. It's a list with a bunch of elements, the names of which can be visualised using names(RC100h).

Stephane: Unfortunately, it doesn't seem that identify() can be used on plot.slidWin plots. If you want to identify points though, you can do the following....

plot(RC100h$pos_out, RC100h$win_mono_out)  # NB: monophyly plot only
identify(RC100h$pos_out, RC100h$win_mono_out)

Note the syntax of identify(): it takes the same x and y vectors that were passed to plot(), not the plot call itself.

Cheerio!

Sam

RE: Idea Generator [ Reply ]
By: Rupert Collins on 2011-06-01 04:29

[forum:4521]

Yes, if you need an objective decision for choosing a specific window, then we need access to the raw data. If you just want to eyeball the plots, then what we have done is okay. We could have a function called rank.windows, which sorts all your windows according to whatever criteria you are interested in. Selecting multiple criteria for your ranking could also be a nice option.
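
A minimal sketch of what such a ranking function might look like, in base R. Everything here is an assumption made for illustration: the function name rankWindowsSketch, the data frame layout, and the column names pos, mono and zero are hypothetical and are not the names used in spider or in the forthcoming rankSlidWin.

```r
# Hypothetical helper: rank sliding windows by one or more criteria.
# 'windows' is a data frame with one row per window. The column names
# used in the example (pos, mono, zero) are illustrative only.
rankWindowsSketch <- function(windows, criteria, decreasing = TRUE) {
  decreasing <- rep(decreasing, length.out = length(criteria))
  # Negate "higher is better" columns so that order() can sort ascending
  keys <- Map(function(col, dec) if (dec) -windows[[col]] else windows[[col]],
              criteria, decreasing)
  windows[do.call(order, unname(keys)), , drop = FALSE]
}

# Toy data, not real output
win <- data.frame(pos  = c(1, 26, 51, 76),
                  mono = c(0.80, 0.95, 0.95, 0.60),  # proportion monophyletic
                  zero = c(0.10, 0.05, 0.02, 0.20))  # proportion zero distances

# Best windows first: highest 'mono', ties broken by lowest 'zero'
ranked <- rankWindowsSketch(win, criteria = c("mono", "zero"),
                            decreasing = c(TRUE, FALSE))
print(ranked)
```

Later criteria only break ties in earlier ones, so the order in which the criteria are listed matters.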

RE: Idea Generator [ Reply ]
By: Stephane Boyer on 2011-06-01 03:25

[forum:4520]

We need to have access to the actual data, i.e. the coordinates of the points. I could probably print them if I knew what they were called. Alternatively, I thought we could use the identify function:

identify(plot.slidWin(RC100h))

It is supposed to give you the coordinates of the point you are pointing at with your mouse. But for some reason it does not work on the slidingwindow output.

Could we fix that?

Stephane.

RE: Idea Generator [ Reply ]
By: Stephane Boyer on 2011-05-31 01:41

[forum:4515]

Hi Sam,
two thoughts:
- Rather than the proportion of cells above and below a certain threshold, could we have the proportion of interspecific cells above and below a certain threshold?

- What do you think of comparing the distance matrix for the whole sequence to the distance matrix for a particular window, using some sort of similarity index? We could plot the similarity index for each window. The higher the index, the better. This may be already included in your new function (Best Match)?
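
One possible reading of the second idea, sketched in base R. The choice of similarity index is an assumption: here it is the Pearson correlation between the full-length pairwise distances and the pairwise distances within each window, using uncorrected p-distances. The function names pDistSketch and windowSimilaritySketch and the toy sequences are hypothetical and are not part of spider.

```r
# Uncorrected p-distance between all pairs of rows of a character matrix
pDistSketch <- function(aln) {
  n <- nrow(aln)
  d <- matrix(0, n, n, dimnames = list(rownames(aln), rownames(aln)))
  for (i in seq_len(n)) for (j in seq_len(n))
    d[i, j] <- mean(aln[i, ] != aln[j, ])
  as.dist(d)
}

# Similarity index (assumption): correlation between the full-length
# distances and the distances calculated within each window.
windowSimilaritySketch <- function(aln, width, step = 1) {
  full   <- as.vector(pDistSketch(aln))
  starts <- seq(1, ncol(aln) - width + 1, by = step)
  sim <- vapply(starts, function(s) {
    win <- as.vector(pDistSketch(aln[, s:(s + width - 1), drop = FALSE]))
    # A window with no variation has no defined correlation
    if (sd(win) == 0) NA_real_ else cor(full, win)
  }, numeric(1))
  data.frame(start = starts, similarity = sim)
}

# Toy alignment, one row per individual
seqs <- c(a1 = "ACGTACGTACGT",
          a2 = "ACGTACGTACGA",
          b1 = "ACGAACGTTCGA",
          b2 = "ACGAACGTTCGG")
aln <- do.call(rbind, strsplit(seqs, ""))

res <- windowSimilaritySketch(aln, width = 4, step = 4)
print(res)
```

Plotting res$start against res$similarity would give the per-window curve described above. Invariant windows are returned as NA rather than being scored.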

RE: Idea Generator [ Reply ]
By: Samuel Brown on 2011-05-23 03:38

[forum:4478]

I've been contemplating a function to download sequences from BOLD, a la read.genbank() in ape, and the modified version of it, read.GB(), in spider already. BOLD have put some pointers on how their data can be accessed remotely at their website: http://services.boldsystems.org/. I'll play around with this sometime soonishly.

RE: Idea Generator [ Reply ]
By: Jagoba Malumbres-Olarte on 2011-05-05 11:02

[forum:4390]

What about:

- Transforming or identifying sequences as haplotypes; removing the repeated sequences, so that single haplotypes are left. Something like what ALTER does.

- Species/haplotype accumulation curves, based on the haplotypes that you obtain from the previous function.

Jagoba
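
A rough base R sketch of the haplotype-collapsing idea. It assumes the sequences are already aligned and of equal length, and that any difference at all (including gaps and ambiguity codes) defines a separate haplotype, which may well be too strict for real data. The function name collapseHaplotypesSketch and the toy data are hypothetical.

```r
# Collapse identical sequences into haplotypes (base R only)
collapseHaplotypesSketch <- function(seqs) {
  haps  <- unique(unname(seqs))
  hapId <- match(seqs, haps)                  # haplotype index per individual
  list(haplotypes = haps,
       membership = split(names(seqs), hapId), # individuals sharing each one
       counts     = as.vector(table(hapId)))
}

seqs <- c(ind1 = "ACGTAC", ind2 = "ACGTAC", ind3 = "ACGTTC",
          ind4 = "ACGTAC", ind5 = "ACGTTC")
haps <- collapseHaplotypesSketch(seqs)
print(haps)

# Number of distinct haplotypes seen as individuals are added in the order
# given. A real accumulation curve would average over many random orders.
accum <- cumsum(!duplicated(seqs))
print(accum)
```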

RE: Idea Generator [ Reply ]
By: Rupert Collins on 2011-05-04 05:36

[forum:4367]

Great ideas Sam. No attachment though, I'm afraid ...

RE: Idea Generator [ Reply ]
By: Samuel Brown on 2011-05-03 22:10

[forum:4366]

I've been thinking of including some of the analyses described in the attached paper. Last night I whipped up an implementation of PAA which I will upload sometime for further testing. Good and Wake's Genetic Distance measure shouldn't be too hard to put together and will probably be my next mission.

With regard to file uploads, I suspect it's Lincoln's firewall that is preventing it, as opposed to something that's inherent in R-Forge. However, this message will test that theory as it is being posted from AgResearch.

OK, it possibly is something inherently wrong with R-Forge. I'll let the maintainers know. Here's a link to the paper in question: http://www.cell.com/trends/ecology-evolution/abstract/S0169-5347(03)00184-8

RE: Idea Generator [ Reply ]
By: Stephane Boyer on 2011-04-29 00:46

[forum:4341]

I was thinking of a tool to help tidy the DNA alignment when your sequences don't match perfectly (very often with 16S, not sure if it is that useful for COI?).

The way I normally do it is: align the sequences, then try to figure out how I can trim the 3' and 5' ends in a way that retains as many individuals as possible. Sometimes it is worth dropping one or two individuals so that the retained alignment is longer, sometimes it is worth shortening the alignment so that more individuals are used in the analysis (see left graph on the attached picture).

This is quite easy to do just by looking at the sequences when you have 10 or 20 sequences, but the more sequences, the more complicated it gets. With 50 sequences, you just cannot look at them all at once: they do not fit on the screen and it gets really complicated.

So a simple graph with:

Length of DNA fragment = f(Number of sequences)

or

Length of DNA fragment = f(Number of species (based on sequence names))

could help the decision process (see right graph on the attached picture)

Well, I am unable to upload the figure so it may be difficult to understand. I hate R-Forge already...
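
In lieu of the figure, here is a minimal base R sketch of the first relationship, Length of DNA fragment = f(Number of sequences). It assumes the sequences are already aligned to the same length, that missing ends are coded as "-", and that only leading and trailing gaps matter (internal gaps are ignored). The function name trimTradeoffSketch and the toy alignment are hypothetical.

```r
# For each number of retained sequences k, find the longest trimmed region
# that is completely covered by at least k sequences.
trimTradeoffSketch <- function(seqs) {
  chars <- strsplit(seqs, "")
  first <- vapply(chars, function(x) min(which(x != "-")), numeric(1))
  last  <- vapply(chars, function(x) max(which(x != "-")), numeric(1))
  # Candidate regions: every (start, end) pair built from the observed
  # sequence starts and ends
  cand <- expand.grid(start = unique(first), end = unique(last))
  cand <- cand[cand$end >= cand$start, ]
  cand$length <- cand$end - cand$start + 1
  # Number of sequences that span each candidate region completely
  cand$nSeq <- mapply(function(s, e) sum(first <= s & last >= e),
                      cand$start, cand$end)
  k <- seq_along(seqs)
  data.frame(nSeq = k,
             length = vapply(k, function(i) {
               len <- cand$length[cand$nSeq >= i]
               if (length(len)) max(len) else 0  # 0 if no shared region
             }, numeric(1)))
}

# Toy alignment with ragged ends
seqs <- c(s1 = "ACGTACGTAC",
          s2 = "--GTACGTAC",
          s3 = "ACGTACGT--",
          s4 = "----ACG---")
res <- trimTradeoffSketch(seqs)
print(res)
```

Calling plot(res$nSeq, res$length, type = "b") would draw the curve. The species-based version could be sketched the same way by counting unique species names among the covering sequences instead of the sequences themselves.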

RE: Idea Generator [ Reply ]
By: Rupert Collins on 2011-04-28 02:17

[forum:4308]

Ideas for functions in spider

number individuals in dataset + distributions, averages etc

number individuals in each species + distributions, averages etc

number genera + distributions, averages etc

number species in dataset + distributions, averages etc

number haplotypes per species + distributions, averages etc

average seq length, and number seqs below threshold (e.g. BOLD standard 500 bp)

pruning of BOLD and GenBank data into usable formats

removal of negative branch lengths

sliding window analysis

calculation of intraspecific vs. interspecific distances + distributions, averages etc

k-nearest neighbour analysis, and “best close match” (k-nn with threshold)

nj monophyly analysis

% threshold analyses

distances to nearest non-conspecific (smallest interspecific distance)

cumulative error analysis – type I vs. type II errors – works out optimum threshold

some other funky stuff like random forest, kernel, or CART (see Austerlitz et al., 2009: BMC Bioinformatics)?
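
To make two of the items above concrete (intraspecific vs. interspecific distances, and distance to the nearest non-conspecific), here is a small base R sketch. It assumes a symmetric distance matrix and a vector of species labels in the same order as the rows, with at least two species present. The function names and the toy distances are hypothetical and are not the spider implementation.

```r
# Smallest distance from each individual to any member of another species
nearestNonConSketch <- function(dm, spp) {
  dm <- as.matrix(dm)
  vapply(seq_len(nrow(dm)), function(i) min(dm[i, spp != spp[i]]),
         numeric(1))
}

# Split the pairwise distances into within- and between-species sets
splitDistancesSketch <- function(dm, spp) {
  dm    <- as.matrix(dm)
  same  <- outer(spp, spp, "==")
  lower <- lower.tri(dm)          # count each pair once
  list(intra = dm[lower & same], inter = dm[lower & !same])
}

# Toy data: two species, two individuals each
spp <- c("A", "A", "B", "B")
dm  <- matrix(c(0.00, 0.01, 0.08, 0.09,
                0.01, 0.00, 0.07, 0.10,
                0.08, 0.07, 0.00, 0.02,
                0.09, 0.10, 0.02, 0.00), 4, 4)

nonCon <- nearestNonConSketch(dm, spp)
parts  <- splitDistancesSketch(dm, spp)
print(nonCon)
print(parts)
```

Summaries such as range(parts$intra) and range(parts$inter), or histograms of the two sets, would then cover the "distributions, averages etc" part of the list.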

Idea Generator [ Reply ]
By: Samuel Brown on 2011-04-21 06:11

[forum:4286]

Hello everyone

Welcome to spider---an R package for the analysis of species evolution and identification methods.

If you have any views on other functions to include in the project, please join the discussion. Feel free to kick ideas around and offer code snippets to inspire yourself and others to work on them. If you read of any analyses that would be nice to have implemented in spider, please provide links to the articles.