Removing bytes from a file

June 25th, 2008 posted by codders

Morning,

I copied and pasted a (Ruby) script from a PDF this morning, and on executing it I got a whole pile of:

webservice.rb:33: Invalid char `\240' in expression
webservice.rb:33: Invalid char `\302' in expression

which was annoying. For reasons best known to KPDF (or oowriter, or my window manager’s cut-and-paste buffer), the spaces in the script (“ ”) had been encoded as 0xc2 0xa0 (octal \302 \240, hence the error messages). That is the UTF-8 encoding of U+00A0, the no-break space, which is a perfectly respectable character but essentially useless to me.
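For anyone wondering where those two bytes come from, here is a minimal sketch of the two-byte UTF-8 encoding rule applied to U+00A0. The helper name utf8_encode_2byte is made up for illustration, and it only handles code points in the range U+0080 to U+07FF.

```c
#include <stdio.h>

/* Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes.
   Hypothetical helper, for illustration only: no range checking. */
void utf8_encode_2byte(unsigned int cp, unsigned char out[2])
{
  out[0] = (unsigned char)(0xC0 | (cp >> 6));   /* 110xxxxx: the top 5 bits */
  out[1] = (unsigned char)(0x80 | (cp & 0x3F)); /* 10xxxxxx: the low 6 bits */
}
```

Calling utf8_encode_2byte(0xA0, buf) leaves 0xc2 in buf[0] and 0xa0 in buf[1]. In octal those are 0302 and 0240, which are exactly the two characters Ruby complained about.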

So how do you remove 200 instances of a 2-byte sequence from a file? I didn’t have a good way, but this bad way sufficed:

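cat > rm.c << EOF
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

/* Crude filter: turns each 0xc2 0xa0 pair into one ASCII space.
   It looks at single bytes, so it will also mangle any other
   character whose encoding contains 0xc2 or 0xa0. */
int main(int argc, char *argv[])
{
  unsigned char c;

  while (read(0, &c, 1)==1)
  {
    if (c != 0xc2 && c != 0xa0)
    {
      write(1, &c, 1);    /* ordinary byte: copy it through */
    }
    else if (c == 0xc2)
    {
      write(1, " ", 1);   /* first byte of the pair: emit a plain space */
    }
    /* 0xa0, the second byte of the pair, is silently dropped */
  }
  return 0;
}
EOF
make rm
cat webservice.rb | ./rm > output.rb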
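The filter above treats the two bytes independently, so something like "à" (0xc3 0xa0 in UTF-8) would lose its second byte. A slightly more careful sketch, assuming the input is UTF-8 and that only the exact pair 0xc2 0xa0 should become a space, peeks at the following byte before deciding. The function name replace_nbsp is made up for this example.

```c
#include <stdio.h>

/* Copy `in` to `out`, replacing each 0xc2 0xa0 pair with one ASCII space.
   Every other byte, including a lone 0xc2 or 0xa0, passes through unchanged.
   (replace_nbsp is a hypothetical name used only in this sketch.) */
void replace_nbsp(FILE *in, FILE *out)
{
  int c;

  while ((c = getc(in)) != EOF)
  {
    if (c == 0xc2)
    {
      int next = getc(in);
      if (next == 0xa0)
      {
        putc(' ', out);     /* complete pair: write one plain space */
        continue;
      }
      putc(c, out);         /* not a no-break space: keep the 0xc2 */
      if (next == EOF)
      {
        break;
      }
      ungetc(next, in);     /* push the peeked byte back and re-examine it */
      continue;
    }
    putc(c, out);
  }
}
```

Wrapped in a main that calls replace_nbsp(stdin, stdout), it slots into the same pipeline as before: ./rm < webservice.rb > output.rb.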

So, my dearest lazyweb… better answers?