How do I search for a string in file with different headings?

Posted 2021-11-25 · 1449 characters · 704 views · 5 replies

I am using Perl to search for specific strings in a file that lists different sequences under different headings. I can write the script when there is only one sequence, i.e. one heading, but I am unable to extend it to several headings.
Suppose I am required to search for some string, e.g. "FSFSD", in a given file.
I can't search if the file has the following content:

Polons
CACAGTGCTACGATCGATCGATDDASD
HCAYCHAYCHAYCAYCSDHADASDSADASD
Seliems
FJDSKLFJSLKFJKASFJLAKJDSADAK
DASDNJASDKJASDJDSDJHAJDASDASDASDSAD
Teerag
DFAKJASKDJASKDJADJLLKJ
SADSKADJALKDJSKJDLJKLK

I can search when the file has only one heading, i.e.:

Terrans
FDKFJSKFJKSAFJALKFJLLJ
DKDJKASJDKSADJALKJLJKL
DJKSAFDHAKJFHAFHFJHAJJ

I need to output the result as "String xyz found under Heading abc"

The code I am using is:

print "Input the file name \n";
$protein= <STDIN>;
chomp $protein;
unless (open (protein, $protein))
{
print "cant open file \n\n";
exit;
}
@prot= <protein>;
close protein;
$newprotein=join("",@prot);
$protein=~s/\s//g;
do{
print "enter the motif to be searched \n";
$motif= <STDIN>;
chomp $motif;
if ($protein =~ /motif/)
{
print "found motif \n\n";
}
else{
print "not found \n\n";
}
}
until ($motif=~/^\s*$/);
exit;

Comments (5)

别念他 2022-06-07 reply #5

use strict;
use warnings;
use autodie qw'open';

my($filename,$motif) = @ARGV;

if( @ARGV < 1 ){
  print "Please enter file name:\n";
  $filename = <STDIN>;
  chomp $filename;
}

if( @ARGV < 2 ){
  print "Please enter motif:\n";
  $motif = <STDIN>;
  chomp $motif;
}

my %data;

# fill in %data;
{
  open my $file, '<', $filename;

  my $heading;
  while( my $line = <$file> ){
    chomp $line;
    if( $line ne uc $line ){
      $heading = $line;
      next;
    }
    if( $data{$heading} ){
      $data{$heading} .= $line;
    } else {
      $data{$heading}  = $line;
    }
  }
}

{
  # protect against malicious users
  my $motif_cmp = quotemeta $motif;

  for my $heading ( keys %data ){
    my $data = $data{$heading};

    if( $data =~ /$motif_cmp/ ){
      print "String $motif found under Heading $heading\n";
      exit 0;
    }
  }

  die "String $motif not found anywhere in file $filename\n";
}
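
The script above stops at the first match, and `keys %data` returns headings in no particular order. If every matching heading should be reported in file order, a variant along these lines might work. This is only a sketch: `headings_with_motif` is a hypothetical helper name, and it assumes (as the `uc` comparison above does) that a heading is any line containing a lowercase letter.

```perl
use strict;
use warnings;

# Hypothetical helper: reads from an open filehandle and returns, in file
# order, every heading whose joined sequence lines contain $motif literally.
# Assumption: a line is a heading if it contains a lowercase letter.
sub headings_with_motif {
    my ($fh, $motif) = @_;
    my (@order, %seq, $heading);
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;            # skip blank lines
        if ($line =~ /[a-z]/) {              # heading line
            $heading = $line;
            unless (exists $seq{$heading}) {
                push @order, $heading;       # remember first-seen order
                $seq{$heading} = '';
            }
            next;
        }
        next unless defined $heading;        # ignore sequence lines before any heading
        $seq{$heading} .= $line;             # join wrapped sequence lines
    }
    return grep { index($seq{$_}, $motif) >= 0 } @order;
}

# Usage with an in-memory file; the motif spans a line break under "Polons".
my $sample = "Polons\nCACAGFS\nFSDGATC\nSeliems\nFJDSKLFJ\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "String FSFSD found under Heading $_\n" for headings_with_motif($fh, 'FSFSD');
close $fh;
```

Joining the lines under each heading before searching means a motif that wraps across two lines is still found; `index` is used so the motif is treated as a literal string rather than a pattern.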

雪化雨蝶 2022-06-07 reply #4

EDIT: Your posted example has no clear delimiter; you need to find a clear division between your headings and your sequences. You could use multiple line breaks or a non-alphanumeric character such as ','. Whatever you choose, let WHITESPACE in the following code be equal to your chosen delimiter. If you are stuck with the format you have, you will have to change the following grammar to disregard whitespace and delimit through capitalization (which makes it slightly more complex).

The simple way ( O(n^2)? ) is to split the file on the whitespace delimiter, giving you an array of alternating headings and sequences (heading[i] = split_array[i*2], sequence[i] = split_array[i*2+1]). For each sequence, perform your regex match.
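
In Perl, the simple approach might look roughly like this. It is a minimal sketch that assumes the input strictly alternates heading and sequence tokens separated by whitespace (so each sequence is a single token, unlike the question's sample, where sequences wrap over several lines); `find_in_pairs` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: $text holds "Heading SEQUENCE Heading SEQUENCE ..."
# separated by any whitespace. Returns the headings whose sequence
# contains $motif as a literal substring.
sub find_in_pairs {
    my ($text, $motif) = @_;
    my @tokens = split ' ', $text;    # split on runs of whitespace
    my @found;
    for (my $i = 0; $i + 1 < @tokens; $i += 2) {
        my ($heading, $sequence) = @tokens[$i, $i + 1];
        push @found, $heading if index($sequence, $motif) >= 0;
    }
    return @found;
}

my $text = "Polons CACAGTFSFSDGC\nSeliems FJDSKLFJSLK\n";
print "String FSFSD found under Heading $_\n" for find_in_pairs($text, 'FSFSD');
```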

Slightly more difficult way ( O(n) ), given a BNF grammar such as:

file: block
    | file block
    ;

block: heading sequence

heading: [A-Z][a-z]

sequence: [A-Z][a-z]

Try recursive descent parsing (pseudo-code, I don't know Perl):

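GLOBAL sequenceHeading, sequenceCount
GLOBAL substringLength = 5
GLOBAL substring = "FSFSD"

FUNC file ()
    WHILE nextChar() != EOF
        block()
        printf ( "%d substrings in %s\n", sequenceCount, sequenceHeading )
    END WHILE
END FUNC

FUNC block ()
    heading()
    sequence()
END FUNC

// Consume characters up to the delimiter; they form the heading.
FUNC heading ()
    tempHeading = ""
    LOOP
        in = popChar()
        IF in == WHITESPACE OR in == EOF
            sequenceHeading = tempHeading
            RETURN
        END IF
        tempHeading &= in
    END LOOP
END FUNC

// Consume characters up to the delimiter, counting matches of substring.
// NOTE: resetting i to 0 on a mismatch can miss matches that overlap a
// failed partial match (e.g. "FSFSD" inside "FSFSFSD"); a robust matcher
// needs backtracking or a KMP-style failure table.
FUNC sequence ()
    count = 0
    i = 0
    LOOP
        in = popChar()
        IF in == WHITESPACE OR in == EOF
            sequenceCount = count
            RETURN
        END IF
        IF in == substring[i]
            i++
            IF i == substringLength
                count++
                i = 0
            END IF
        ELSE
            i = 0
        END IF
    END LOOP
END FUNC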

For detailed information on recursive descent parsing, check out Let's Build a Compiler or Wikipedia.

滥情空心 2022-06-07 reply #3

So you are saying you are able to read one line and achieve this task. But when you have more than one line in the file you are not able to do the same thing?

Just have a loop and read the file line by line.

my $data_file = "yourfilename.txt";
open(DAT, '<', $data_file) || die("Could not open file!");
while (my $line = <DAT>)
{
    # The same commands you run for one heading go here; $line holds one line of the file.
}
close(DAT);

终遇你 2022-06-07 reply #2

The main issue is how you distinguish between a header and the data. From your examples, I assume that a line is a header if and only if it contains a lowercase letter.

use strict;
use warnings;
print "Enter the motif to be searched \n";
my $motif = <STDIN>;
chomp($motif);
my $header;
while (<>) {
    chomp;    # drop the newline so the header prints cleanly
    if(/[a-z]/) {
        $header = $_;
        next;
    }
    if (/$motif/o) {
        print "Found $motif under header $header\n";
        exit;
    }
}
print "$motif not found\n";
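
Note that `/$motif/o` treats the input as a regular expression, so characters such as `.` or `(` in the motif act as regex syntax. If a literal substring search is what is wanted, `\Q...\E` or `index` avoids that. A small sketch, with `contains_literal` as a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $motif occurs in $sequence as a
# literal string, 0 otherwise.
sub contains_literal {
    my ($sequence, $motif) = @_;
    # \Q...\E escapes regex metacharacters in the interpolated motif
    return $sequence =~ /\Q$motif\E/ ? 1 : 0;
}

# "." is matched literally here, so "F.F" does not match "FSFSD".
print contains_literal('FSFSD', 'F.F') ? "match\n" : "no match\n";

# index() never applies regex rules; it returns -1 when nothing is found.
print index('AAFSFSDAA', 'FSFSD') >= 0 ? "match\n" : "no match\n";
```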

白芷 2022-06-07 reply #1

Seeing your code, I want to make a few suggestions without answering your question:

  1. Always, always, always use strict;. For the love of whatever higher power you may (or may not) believe in, use strict;.
  2. Every time you use strict;, you should use warnings; along with it.
  3. Also, seriously consider using some indentation.
  4. Also, consider using obviously different names for different variables.
  5. Lastly, your style is really inconsistent. Is this all your code or did you patch it together? Not trying to insult you or anything, but I recommend against copying code you don't understand - at least try before you just copy it.

Now, a much more readable version of your code, including a few fixes and a few guesses at what you may have meant to do, follows:

use strict;
use warnings;

print "Input the file name:\n";
my $filename = <STDIN>;
chomp $filename;
open FILE, "<", $filename or die "Can't open file\n\n";
my $newprotein = join "", <FILE>;
close FILE;
$newprotein =~ s/\s//g;
while(1) {
  print "enter the motif to be searched:\n";
  my $motif = <STDIN>;
  last if $motif =~ /^\s*$/;
  chomp $motif;
  # here I might even use the ternary ?: operator, but whatever
  if ($newprotein =~ /$motif/) {
    print "found motif\n\n";
  }
  else {
    print "not found\n\n";
  }
}