Genomic Scan, Frequency & Distribution of SEED
Moderator: BioTeam
7 posts • Page 1 of 1
Genomic Scan, Frequency & Distribution of SEED
I want to scan a whole genome (FASTA format) for 100% exact matches to a seed sequence that is 7-mer long. BLAST (and that goes for any version, as far as I know) NEVER returns ONLY the exact matches, even with parameters adjusted for word size and e-value, because its extension and similarity methodology focuses on NEAR-exact matches (similarity).
It seems very elementary: I want to scan a FASTA-formatted file, determine its sequence length, and mark the seed start and end of every exact match, which will allow me to map a distribution along the genome and determine its frequency. However, I can't find any protocol, method or website that points me in the right direction, even though the principle is SO BASIC. And trust me, I have been trying. I'm aware that this must be a piece of cake for somebody who has the Perl/coding skills I, quite obviously, lack. I would have given up and done it with Notepad a long time ago if I didn't have so many different sequences, large genomes, ESTs etc. to go through. Any help would be greatly appreciated, though I fear that the explanation will be too simple and humiliate me :)
Re: Genomic Scan, Frequency & Distribution of SEED
I don't know what visualization system you're using, but something like this might work if I understood you correctly.
So basically this code just finds your pattern and then replaces it with period characters, which would show up if you're doing a pairwise comparison.

Living one day at a time; Enjoying one moment at a time; Accepting hardships as the pathway to peace; ~Niebuhr
Thank you for your time, mith. I apologize for any ambiguity, but we are nearly there, I think. If only we could replace the "#replace matchpattern with your query sequence" step with a print output that includes the start and end position of the seed (search sequence) in the whole sequence string.
e.g. ID(occurrence)-Seed-Start-End-Length(FullSequenceLength)

To illustrate with an example, we have a string of length 42:

T G C G T C C T A C A A G A C A A C T A T G G G G C C G C G T T T T C G C C G C C A

And let's say I'm looking for a 3-mer seed sequence, CGT:

T G.C G T.C C T A C A A G A C A A C T A T G G G G C C G.C G T.T T T C G C C G C C A

What I want to print to a file would be the following tab-delimited data, ID(occurrence)-Seed-Start-End-Length(FullSequenceLength), like:

1-CGT-3-6-42
2-CGT-29-32-42

I'm sorry to bother any of you with my lack of Perl skills, but as a European working/studying biochemistry at a Chinese university, my skills as a researcher are not that well balanced, plus IT personnel have yet to be discovered in my department.
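For readers following along, here is a minimal Python sketch of the output requested above. It is not the Perl code discussed in this thread, and the function names `find_seed` and `report` are hypothetical. It assumes the whole sequence fits in memory as one string, reports 1-based start positions, and computes End as start + seed length, which is the convention that reproduces the 3-6 and 29-32 values in the example.

```python
def find_seed(sequence, seed):
    """Yield the 1-based start of every exact match, including overlapping ones."""
    if not seed:
        raise ValueError("seed must not be empty")
    sequence = sequence.upper()
    seed = seed.upper()
    pos = sequence.find(seed)
    while pos != -1:
        yield pos + 1                          # convert 0-based index to 1-based
        pos = sequence.find(seed, pos + 1)     # step by 1 so overlaps are kept


def report(sequence, seed):
    """Return rows of (occurrence, seed, start, end, full sequence length)."""
    rows = []
    for occurrence, start in enumerate(find_seed(sequence, seed), start=1):
        end = start + len(seed)                # assumption: End = Start + seed length
        rows.append((occurrence, seed, start, end, len(sequence)))
    return rows


example = "TGCGTCCTACAAGACAACTATGGGGCCGCGTTTTCGCCGCCA"
for row in report(example, "CGT"):
    print("\t".join(str(field) for field in row))   # tab-delimited output
```

If an inclusive end coordinate is preferred (5 and 31 for the example), subtract 1 from `end`.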
I don't have Perl with me right now (on vacation), but here's how I would do it (no testing done, so it's probably buggy).
I'm not quite sure why you would want the redundancy of printing the seed info and the sequence length multiple times. The end position is also easily found as start + seed length.

The example I posted does the following things:
1. Scans through the FASTA file and gets the sequence.
2. Matches the sequence to the seed sequence. Each time it matches, it replaces those letters with periods. Make sure you change it so the number of periods added equals the length of the seed sequence. Now your sequence looks something like this: TCG...AGCTA...A
3. It breaks the sequence down into an array of chars, where each array element is a letter. It attempts to find the occurrence of ".". The period signifies the start of the seed. Then it skips forward seed length - 1 to find the next seed occurrence.
4. Each seed start position is recorded in the found array, which gets printed along with the sequence length.

On an unrelated note, why are you working in China? I thought the Europeans have great bioinformatics programs... the Germans certainly do.
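The masking approach in steps 1 to 4 above can be sketched roughly as follows. This is a Python illustration only, not the Perl that was posted; `masked_starts` is a hypothetical name, and the sketch assumes the sequence contains no "." characters to begin with.

```python
def masked_starts(sequence, seed):
    """Mask each seed match with periods, then scan for the start of each masked run."""
    # Step 2: replace every (non-overlapping) match with as many periods as the seed is long.
    masked = sequence.replace(seed, "." * len(seed))
    starts = []
    i = 0
    # Step 3: walk the characters; a period marks the start of a seed.
    while i < len(masked):
        if masked[i] == ".":
            starts.append(i + 1)   # Step 4: record the 1-based start position
            i += len(seed)         # jump past the rest of this masked seed
        else:
            i += 1
    return masked, starts


masked, starts = masked_starts("TGCGTCCTACAAGACAACTATGGGGCCGCGTTTTCGCCGCCA", "CGT")
print(masked)
print(starts, len(masked))
```

One design consequence worth noting: because replacement consumes the matched letters, overlapping occurrences are not reported separately. For the seed AA in AAAA this gives starts 1 and 3, whereas a search that advances one base at a time would also report 2.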
Re: Genomic Scan, Frequency & Distribution of SEED
Below I posted the code that I eventually managed to fabricate with some help from tutorials that elucidated Perl for me. I read the seed from a FASTA file, and the genomic database is read in line by line, buffered. The code is not beautiful but does the job. The superfluous printing is because I have many seed sequences and many databases to scan, and I don't want to get lost in my own data. The tabs in the output are for later processing.
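A minimal Python sketch of the buffered, line-by-line idea described above (not the poster's Perl script; `scan_fasta` and its variables are hypothetical names). The point it illustrates is that when a genome is read one line at a time, the last seed length - 1 bases of each line must be carried over to the next, otherwise matches that straddle a line break are missed.

```python
import io


def scan_fasta(handle, seed):
    """Yield (record_id, 1-based start) for every exact seed match in a FASTA stream."""
    seed = seed.upper()
    keep = len(seed) - 1       # bases to carry over so line-spanning matches are seen
    record_id = None
    tail = ""                  # carried-over end of the previous line(s)
    offset = 0                 # bases of the current record that precede `tail`
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New record: take the first word of the header as its ID and reset state.
            record_id = (line[1:].split() or [""])[0]
            tail, offset = "", 0
            continue
        window = tail + line.upper()
        pos = window.find(seed)
        while pos != -1:
            yield record_id, offset + pos + 1
            pos = window.find(seed, pos + 1)
        # The tail is shorter than the seed, so no match is ever reported twice.
        new_tail = window[-keep:] if keep else ""
        offset += len(window) - len(new_tail)
        tail = new_tail


# Usage: the second match (start 29) spans the break between lines 2 and 3.
demo = io.StringIO(">chr1 demo\nTGCGTCCTAC\nAAGACAACTATGGGGCCGC\nGTTTTCGCCGCCA\n")
for record_id, start in scan_fasta(demo, "CGT"):
    print(record_id, "CGT", start, start + 3, sep="\t")
```

The full sequence length is only known once a record has been read to its end, so printing it on every row would need either a second pass or holding the rows back until the next header.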
Hmm, China, where to start. I originally come from Wageningen, the Netherlands. I wanted something more than just studying with mediocre effort and getting a job in the commercial sector (like many students sadly do these days), so I came here. Why China? Because China is not 'Western'. The standards, of course, are not quite up to par just yet, but they are definitely getting there. Being here is great; it has opened my eyes as a scientist and as a person because the differences are surprisingly immense. Getting things done here takes more effort, but that has been a learning experience too. The value of having been here the last two years = priceless. I wonder what it will be like, though, if I decide to go back to Europe. Time will tell. Mith, you're a legend for helping out people during your vacation. With great appreciation, my best regards!
Hey, best of luck. Looks better than my code, but if you want to skip all the FASTA file parsing, just install the biofastastream module.