Removing bytes from a file
June 25th, 2008 posted by
codders
Morning,
I copied and pasted a (Ruby) script from a PDF this morning, and on executing it I got a whole pile of:
webservice.rb:33: Invalid char `\240' in expression
webservice.rb:33: Invalid char `\302' in expression
which was annoying. For reasons best known to KPDF (or oowriter, or my window manager’s cut-and-paste buffer), the spaces in the script (" ") had been encoded as 0xc2 0xa0 (the \302 and \240 above, in octal), which is the UTF-8 encoding of U+00A0, a non-breaking space, rather than a plain ASCII space (0x20), and essentially useless to me.
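A quick way to confirm what actually landed in the file is to dump its bytes as hex. This is only a sketch, assuming a POSIX `od` is available; `show_bytes` is a made-up name for illustration, not part of any tool:

```shell
# show_bytes is a hypothetical helper: it dumps stdin as hex, one byte per column.
# -An drops the offset column, -tx1 prints each byte as two hex digits.
show_bytes() {
    od -An -tx1
}

# Example: an 'a', a UTF-8 non-breaking space (octal \302\240), a 'b', a newline.
printf 'a\302\240b\n' | show_bytes
```

Running `show_bytes < webservice.rb | less` on the pasted script should show `c2 a0` pairs wherever a plain `20` was expected.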
So how do you remove 200 instances of a 2-byte sequence from a file? I didn’t have a good way, but this bad way sufficed:
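cat > rm.c << EOF
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
/* Copy stdin to stdout one byte at a time, writing a space for every 0xc2
   and dropping every 0xa0. The two bytes are not matched as a pair, so any
   other character whose encoding contains 0xc2 or 0xa0 gets mangled too. */
int main(int argc, char *argv[])
{
    unsigned char c;
    while (read(0, &c, 1) == 1)
    {
        if (c != 0xc2 && c != 0xa0)
        {
            /* Ordinary byte: pass it through unchanged. */
            write(1, &c, 1);
        }
        else if (c == 0xc2)
        {
            /* First byte of the sequence: emit a plain space instead. */
            write(1, " ", 1);
        }
        /* A 0xa0 byte falls through both branches and is dropped. */
    }
    return 0;
}
EOF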
make rm
cat webservice.rb | ./rm > output.rb
So, my dearest lazyweb… better answers?
qbJim says:
July 17th, 2008 at 9:04 am
Doing it with C++ iostreams would have saved remembering the parameter list to read and write (perhaps not an issue for you). It wouldn’t have reduced the number of lines of code though, save a couple of #includes (it wouldn’t increase it either, since while (cin >> c) { } is valid).
But my first instinct would have been to try and get perl to do it. If you could figure out how to embed arbitrary characters in a match string (perhaps a C-style "\x123" would work) then using perl -pe 's/something/\n/g' would save having any other files at all.