Saturday, 31 October 2009

Two same words in in a row

The following awk script finds (more or less) words that occur twice in a row in a document. For each hit it prints the repeated word, followed by (something like) the surrounding text: up to three words before it and two after.


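BEGIN {
    # Anything that is not a letter, digit, or backslash separates words.
    FS = "[^A-Za-z0-9\\\\]+"
    # A string that should never occur in the text, so the whole file is
    # read as one record (needs an awk, such as gawk, that accepts a
    # multi-character RS).
    RS = "necodslkjfkas"
}

{
    prev = ""
    for (i = 1; i <= NF; i++) {
        if ($i != "" && $i == prev) {
            # Collect up to three words before and two after the repeat,
            # skipping indices outside 1..NF ($0 would be the whole file).
            context = ""
            for (k = i - 3; k <= i + 2; k++)
                if (k >= 1 && k <= NF) context = context (context == "" ? "" : " ") $k
            print $i "\t\t" context
        }
        prev = $i
    }
}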


The script and its explanation are this rough because I have never used awk before (I prefer grep, but it is useless here for the obvious reason) and because I have to submit my thesis in 3 days. By the way, the thesis is the reason why I wrote this script, and it actually found about 10 unwanted repeated words (together with about 1000 false positives :-))
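
A likely source of false positives (this is an assumption, not something measured on the thesis) is that the field separator treats every punctuation mark as a gap, so something like "x = x" or "1, 1" counts as a repeat. Below is a rough sketch of a stricter variant: it splits on whitespace only, ignores case, and reports a word only if it is purely alphabetic and the previous whitespace-separated token is exactly the same word. The function name find_doubles and the output format "line: word" are made up for this example.

```shell
# find_doubles [FILE...]: print "line: word" for each alphabetic word that
# directly repeats the previous token, with only whitespace in between.
# Reads standard input when no file is given.
find_doubles() {
    awk '
    {
        for (i = 1; i <= NF; i++) {
            cur = tolower($i)
            word = cur
            # Drop trailing punctuation so that "the the." is still caught.
            sub(/[.,;:!?]+$/, "", word)
            if (word ~ /^[a-z]+$/ && word == prev)
                print FNR ": " word
            # Remember the raw token: "the, the" must not match.
            # prev is kept across lines, so a repeat split over a line
            # break is found as well.
            prev = cur
        }
    }' "$@"
}
```

It would be called as `find_doubles thesis.tex`, where thesis.tex is a placeholder file name. The price of the stricter matching is that repeats separated by punctuation, or containing digits or a backslash, are no longer reported.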
