How do I search for a string in a file with different headings?
I am using Perl to search for specific strings in a file that lists different sequences under different headings. I can write the script when there is only one sequence, i.e. one heading, but I am not able to extend it to several headings.
Suppose I am required to search for some string, e.g. "FSFSD", in a given file.
I can't search when the file has the following content:
Polons
CACAGTGCTACGATCGATCGATDDASD
HCAYCHAYCHAYCAYCSDHADASDSADASD
Seliems
FJDSKLFJSLKFJKASFJLAKJDSADAK
DASDNJASDKJASDJDSDJHAJDASDASDASDSAD
Teerag
DFAKJASKDJASKDJADJLLKJ
SADSKADJALKDJSKJDLJKLK
I can search when the file has one heading, i.e.:
Terrans
FDKFJSKFJKSAFJALKFJLLJ
DKDJKASJDKSADJALKJLJKL
DJKSAFDHAKJFHAFHFJHAJJ
I need to output the result as "String xyz found under Heading abc"
The code I am using is:
print "Input the file name \n";
$protein= <STDIN>;
chomp $protein;
unless (open (protein, $protein))
{
print "cant open file \n\n";
exit;
}
@prot= <protein>;
close protein;
$newprotein=join("",@prot);
$protein=~s/\s//g;
do{
print "enter the motif to be searched \n";
$motif= <STDIN>;
chomp $motif;
if ($protein =~ /motif/)
{
print "found motif \n\n";
}
else{
print "not found \n\n";
}
}
until ($motif=~/^\s*$/);
exit;

Comments (5)

use strict;
use warnings;
use autodie qw'open';
my($filename,$motif) = @ARGV;
if( @ARGV < 1 ){
    print "Please enter file name:\n";
    $filename = <STDIN>;
    chomp $filename;
}
if( @ARGV < 2 ){
    print "Please enter motif:\n";
    $motif = <STDIN>;
    chomp $motif;
}
my %data;
# fill in %data;
{
    open my $file, '<', $filename;
    my $heading;
    while( my $line = <$file> ){
        chomp $line;
        if( $line ne uc $line ){
            $heading = $line;
            next;
        }
        if( $data{$heading} ){
            $data{$heading} .= $line;
        } else {
            $data{$heading} = $line;
        }
    }
}
{
    # protect against malicious users
    my $motif_cmp = quotemeta $motif;
    for my $heading ( keys %data ){
        my $data = $data{$heading};
        if( $data =~ /$motif_cmp/ ){
            print "String $motif found under Heading $heading\n";
            exit 0;
        }
    }
    die "String $motif not found anywhere in file $filename\n";
}

EDIT: Your posted example has no clear delimiter; you need to find a clear division between your headings and your sequences. You could use multiple line breaks or a non-alphanumeric character such as ','. Whatever you choose, let WHITESPACE in the following code be equal to your chosen delimiter. If you are stuck with the format you have, you will have to change the following grammar to disregard whitespace and delimit through capitalization (which makes it slightly more complex).
A simple way ( O(n^2)? ) is to split the file using a whitespace delimiter, giving you an array of headings and sequences (heading[i] = split_array[i*2], sequence[i] = split_array[i*2+1]). For each sequence, run your regex.
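A minimal Perl sketch of this split-based approach, assuming the file has been reshaped so that every heading is followed by exactly one whitespace-free sequence token (the posted example, with several sequence lines per heading, would not fit without a clearer delimiter). The sub name find_headings_by_split and the sample text are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the headings whose sequence contains $motif.
# Assumes tokens alternate heading, sequence, heading, sequence, ...
sub find_headings_by_split {
    my ($text, $motif) = @_;
    my @tokens = split ' ', $text;    # split on any run of whitespace
    my @found;
    for (my $i = 0; $i + 1 < @tokens; $i += 2) {
        my ($heading, $sequence) = @tokens[$i, $i + 1];
        # index() is a literal substring search, so the motif needs no escaping
        push @found, $heading if index($sequence, $motif) >= 0;
    }
    return @found;
}

# Made-up sample data in the assumed "heading sequence" layout.
my $text = "Polons CACAGTFSFSDGAT\nSeliems FJDSKLFJSLK\nTeerag DFAKFSFSDJ\n";
print "String FSFSD found under Heading $_\n"
    for find_headings_by_split($text, 'FSFSD');
```

Using index() instead of a regex sidesteps the need for quotemeta when the motif comes from user input; a regex would still be the better fit if the motif is meant to be a pattern.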
Slightly more difficult way ( O(n) ), given a BNF grammar such as:
file: block
| file block
;
block: heading sequence
heading: [A-Z][a-z]
sequence: [A-Z][a-z]
Try recursive descent parsing (pseudo-code, I don't know Perl):
GLOBAL sequenceHeading, sequenceCount
GLOBAL substringLength = 5
GLOBAL substring = "FSFSD"
FUNC file ()
    WHILE nextChar() != EOF
        block()
        printf ( "%d substrings in %s", sequenceCount, sequenceHeading )
    END WHILE
END FUNC
FUNC block ()
    heading()
    sequence()
END FUNC
FUNC heading ()
    in = popChar()
    IF in == WHITESPACE
        sequenceHeading = tempHeading
        tempHeading = ""
        RETURN
    END IF
    tempHeading &= in
END FUNC
FUNC sequence ()
    in = popChar()
    IF in == WHITESPACE
        sequenceCount = count
        count = 0
        i = 0
    END IF
    IF in == substring[i]
        i++
        IF i > substringLength
            count++
        END IF
    ELSE
        i = 0
    END IF
END FUNC
For detailed information on recursive descent parsing, check out Let's Build a Compiler or Wikipedia.

So you are saying you are able to read one line and achieve this task. But when you have more than one line in the file you are not able to do the same thing?
Just have a loop and read the file line by line.
$data_file="yourfilename.txt";
open(DAT, '<', $data_file) || die("Could not open file!");
while( my $line = <DAT>)
{
    # the same commands that you run for one 'heading' go here; $line holds one line of the file
}
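One way the loop body could be filled in, as a sketch only. It assumes, as the sample data suggests, that a heading line contains at least one lowercase letter while sequence lines are entirely uppercase. Sequence lines under the same heading are concatenated, so a motif that straddles a line break is still found. The sub name headings_containing and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: reads from an open filehandle and returns, in file
# order, every heading whose concatenated sequence contains $motif.
# Assumption: heading lines contain a lowercase letter; sequence lines do not.
sub headings_containing {
    my ($fh, $motif) = @_;
    my (@order, %sequence, $heading);
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;            # skip blank lines
        if ($line =~ /[a-z]/) {               # heading line
            ($heading = $line) =~ s/^\s+|\s+$//g;
            push @order, $heading unless exists $sequence{$heading};
            $sequence{$heading} //= '';
        }
        elsif (defined $heading) {            # sequence line under current heading
            $line =~ s/\s+//g;
            $sequence{$heading} .= $line;
        }
    }
    # index() performs a literal substring search
    return grep { index($sequence{$_}, $motif) >= 0 } @order;
}

# Usage with an in-memory file so the sketch runs on its own.
my $sample = "Polons\nCACAGFS\nFSDHAYC\nSeliems\nFJDSKLF\nTeerag\nDFSFSDK\n";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
print "String FSFSD found under Heading $_\n"
    for headings_containing($in, 'FSFSD');
close $in;
```

To run it against a real file, open that file instead of the in-memory string and pass the resulting handle to the sub.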

The main issue is how you distinguish between a header and the data. From your examples, I assume that a line is a header iff it contains a lowercase letter.
use strict;
use warnings;
print "Enter the motif to be searched \n";
my $motif = <STDIN>;
chomp($motif);
my $header;
while (<>) {
    if (/[a-z]/) {
        chomp($header = $_);   # drop the newline so the message prints cleanly
        next;
    }
    if (/$motif/o) {
        print "Found $motif under header $header\n";
        exit;
    }
}
print "$motif not found\n";

Seeing your code, I want to make a few suggestions without answering your question:
- Always, always, always use strict;. For the love of whatever higher power you may (or may not) believe in, use strict;.
- Every time you use strict;, you should use warnings; along with it.
- Also, seriously consider using some indentation.
- Also, consider using obviously different names for different variables.
- Lastly, your style is really inconsistent. Is this all your code or did you patch it together? Not trying to insult you or anything, but I recommend against copying code you don't understand - at least try before you just copy it.
Now, a much more readable version of your code, including a few fixes and a few guesses at what you may have meant to do, follows:
use strict;
use warnings;
print "Input the file name:\n";
my $filename = <STDIN>;
chomp $filename;
open FILE, "<", $filename or die "Can't open file\n\n";
my $newprotein = join "", <FILE>;
close FILE;
$newprotein =~ s/\s//g;
while(1) {
    print "enter the motif to be searched:\n";
    my $motif = <STDIN>;
    last if $motif =~ /^\s*$/;
    chomp $motif;
    # here I might even use the ternary ?: operator, but whatever
    if ($newprotein =~ /$motif/) {
        print "found motif\n\n";
    }
    else {
        print "not found\n\n";
    }
}