Monday, June 18, 2012

One of my most used and least understood perl snippets

# ----- this perl snippet removes any leading or trailing whitespace
# ----- runs of spaces inside the value are NOT collapsed to a single space
 
$variable =~ s/^\s*(.*\S)\s*$/$1/; # trm ld/trl whtspc

18 comments:

Anonymous said...

Wouldn't that be more readable if you made it into two calls?

abraxxa said...

The regex matches the start of the string with ^, followed by zero or more whitespace characters, captures everything up to the last non-whitespace character, then matches zero or more trailing whitespace characters and the end of the string.
$1 holds the string captured by the first pair of capturing parentheses.
Does that explain it for you?

Anonymous said...

I'm probably missing something obvious, but isn't s/^\s+|\s+$//g a simpler RE?

Anonymous said...

I can't find where I read it right now, but I remember a discussion (was it on perlmonks?) saying that it was faster to do it with two substitutions:

  $var =~ s/^\s+//;
  $var =~ s/\s+$//;

However, that discussion happened several years ago and with the recent changes that occurred in the regexp engine(s), it might be worth double checking.

Anyway, I got used to the two-step dance and when it comes to readability I find it easier to grok.
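For anyone who wants to double check the speed claim on their own perl, here is a minimal sketch using the core Benchmark module. The sample string and the sub names (trim_capture, trim_alt, trim_two) are assumptions made up for illustration, and the numbers will vary by perl version and input, so treat the output as a rough guide only.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sample input is an assumption: a short line padded with spaces and tabs.
my $sample = "  \t  some text with   inner   spaces \t  ";

# Each sub trims its own copy, so the runs do not affect one another.
sub trim_capture { my $s = shift; $s =~ s/^\s*(.*\S)\s*$/$1/;     return $s }
sub trim_alt     { my $s = shift; $s =~ s/^\s+|\s+$//g;           return $s }
sub trim_two     { my $s = shift; $s =~ s/^\s+//; $s =~ s/\s+$//; return $s }

# A negative count means "run each variant for at least that many CPU seconds",
# then print a table comparing their rates.
cmpthese(-1, {
    capture   => sub { trim_capture($sample) },
    alternate => sub { trim_alt($sample) },
    two_step  => sub { trim_two($sample) },
});
```

Trying a few different inputs (long strings, strings with no padding, all-blank strings) is worthwhile, since the ranking may not be the same for every input.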

Anonymous said...

Let's take a look:

s/^\s*(.*\S)\s*$/$1/;

Looking at the operation, this is a search and replace, so:

s/stringa/stringb/

would replace the first instance of stringa with stringb (add the /g modifier to replace every instance).

Looking at the specifics, let's break down the regular expression:

^\s*(.*\S)\s*$

^ - start of the string
\s* - zero or more whitespace characters

The brackets create a capture buffer for zero or more of any character (.*) followed by a non-whitespace character (\S). This capture buffer is then used later by the $1.

\s* - zero or more whitespace characters

$ - the end of the string

The capture buffer then allows us to use $1 in the second half of the replace to extract just the bit we want.

Hope that helps :)

Ian
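The breakdown above can be watched in action with a small script. This is only a sketch: the sample string is made up, and the match is run once without the replacement purely to show what lands in $1.

```perl
use strict;
use warnings;

my $text = "   hello   world \t ";

# Match only (no replacement) to inspect the capture group.
if ($text =~ /^\s*(.*\S)\s*$/) {
    print "captured: [$1]\n";      # prints: captured: [hello   world]
}

# s/// returns the number of substitutions made (a false value if none).
my $count = ($text =~ s/^\s*(.*\S)\s*$/$1/);
print "substitutions: $count\n";   # prints: substitutions: 1
print "result: [$text]\n";         # prints: result: [hello   world]
```

Note that the three spaces between the two words survive, because .* happily matches whitespace as long as the capture ends on a non-whitespace character.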

Anonymous said...

it's faster to use two separate statements like so: s/^\s+//; s/\s+$//;

there's plenty of benchmarks floating around the webs to confirm that.

A. Sinan Unur said...

perldoc perlfaq explains why it is better to do this differently. Namely:

$string =~ s/\A\s+//;
$string =~ s/\s+\z//;

This is both easier to read and more efficient.

Anonymous said...

try this:
s/^\s*|\s*$//g

stas said...

/^\s+|\s+$//g does the same :)

stas said...

s/^\s+|\s+$//g does the same :)

Lyle said...

Abraxxa - that does explain it, thanks!

Those of you recommending two REs - Thanks! I may start using that in the future. I found the one I posted maybe 10 years ago, and have replicated it in many of my perl scripts. The whtspc comment is how I find it with grep, and I just lazily replicate it year in, year out.

The two calls may actually fix the problem if the variable contains only spaces and more than one space...

Lyle said...

Ian - your explanation is most excellent as well. It breaks it down into small steps which appeals to my RISC brain. Thanks!

szabgab said...

The original regex does NOT remove leading (or trailing) white spaces from a string that *only* has white spaces: ' ';

Anonymous said...

How about :

$x=~s/^\s*(.+?)\s*$/$1/;

one less char (maybe)

Anonymous said...

Let's take YET ANOTHER look:

s/^\s*(.*\S)\s*$/$1/;

This expression presumes that there is at least 1 non-whitespace character in the string. That's not always a safe and reasonable assumption.

The alternation and two-substitution solutions presented above correctly handle all-blank strings such as " ", reducing them to zero-length strings.
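A quick way to see how each substitution quoted in this thread treats blank input is to run them side by side. The labels in the hash below are made up for this sketch; the regexes are the ones posted above. On my reading, the capture form leaves an all-blank string untouched and the lazy (.+?) form leaves a single space behind, while the alternation and the two-step form both empty it, but it is easy to check for yourself.

```perl
use strict;
use warnings;

# The substitutions quoted in the thread, keyed by a short (made-up) label.
my %variants = (
    'capture (.*\S)' => sub { my $s = shift; $s =~ s/^\s*(.*\S)\s*$/$1/;     $s },
    'lazy (.+?)'     => sub { my $s = shift; $s =~ s/^\s*(.+?)\s*$/$1/;      $s },
    'alternation'    => sub { my $s = shift; $s =~ s/^\s+|\s+$//g;           $s },
    'two-step'       => sub { my $s = shift; $s =~ s/^\s+//; $s =~ s/\s+$//; $s },
);

# Two all-blank inputs plus one ordinary padded string for comparison.
for my $input (" ", "   ", "  a b  ") {
    for my $name (sort keys %variants) {
        my $out = $variants{$name}->($input);
        printf "%-16s [%s] -> [%s] (length %d)\n",
            $name, $input, $out, length $out;
    }
}
```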

Unknown said...

$var =~ s/^\s+|\s+$//g;

Anonymous said...

XS module is even faster:
benchmark by Sam Graham

Lyle said...

szabgab - that's my one complaint with this perl. I've had to account for that numerous times. But I'm so lazy and it's always commented with "whtspc" so it's easy to find, copy & paste.

I think I'll start using the two line solution once I prove to myself that it covers the " " variable.
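One way to prove it is a handful of checks with the core Test::More module. The sketch below wraps the two-line solution in a helper called trim, which is a hypothetical name chosen here and not something from the original post; the whtspc comments are kept so grep still finds it.

```perl
use strict;
use warnings;
use Test::More;

# trim() is a hypothetical helper name used only for this sketch.
# It works on a copy, so the caller's variable is left unchanged.
sub trim {
    my ($s) = @_;
    $s =~ s/^\s+//;   # trm ld whtspc
    $s =~ s/\s+$//;   # trm trl whtspc
    return $s;
}

is(trim(" "),          "",           "single space becomes empty");
is(trim("   \t  "),    "",           "all-whitespace string becomes empty");
is(trim("  a  b  "),   "a  b",       "inner spaces are kept");
is(trim("no-padding"), "no-padding", "already-trimmed string is unchanged");
is(trim(""),           "",           "empty string stays empty");

done_testing();
```

Saved to a file and run with perl (or prove), each line reports ok or not ok, so a regression in the helper shows up immediately.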