vojtův blog: Two same words in in the row

Saturday, October 31, 2009

Two same words in in the row

The following awk script finds (more or less) words that occur twice in a row in a document. It prints each such word together with (something like) its surrounding text.


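BEGIN  {
   # Words are runs of letters, digits and backslashes; anything else separates them
   FS = "[^A-Za-z0-9\\\\]+"
   # A string that should never occur, so the whole file is read as one record
   # (POSIX leaves a multi-character RS unspecified; gawk and mawk accept it)
   RS = "necodslkjfkas"
  }

{
  j = ""
  for(i = 1 ; i <= NF ; i++)
  {
   if ($i != "" && $i == j) {
     # Context: up to three words before and two after, clamped to valid fields
     ctx = ""
     for(k = i - 3 ; k <= i + 2 ; k++)
       if (k >= 1 && k <= NF) ctx = ctx (ctx == "" ? "" : " ") $k
     print $i "\t\t" ctx
   }
   j = $i
  }
}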

END { }

The quality of the script and the explanation is so poor because I have never used awk before (I prefer grep, but it is useless here for the obvious reason), and because I have to submit my thesis in 3 days. By the way, the thesis is the reason I wrote this script, and it actually found about 10 unwanted repeated words (together with about 1000 false positives :-)).
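
A possible variant, offered only as a rough sketch: it reads the file line by line, remembers the last word across line breaks, and prints the line number of the second occurrence. It assumes (this has not been checked against the thesis) that many false positives are repeated numbers or single letters, so it reports only alphabetic words longer than one character. The function name find_repeats is made up for this example, and only ASCII letters are treated as word characters.

```shell
#!/bin/sh
# Sketch only: find_repeats is a hypothetical helper, not the original script.
# Prints "<line number><TAB><word>" for each word that repeats the previous one.
find_repeats() {
  awk '
    {
      # Cut the line into runs of ASCII letters and digits.
      n = split($0, w, /[^A-Za-z0-9]+/)
      for (i = 1; i <= n; i++) {
        if (w[i] == "") continue      # empty piece at the edge of a line
        cur = tolower(w[i])
        # Report only alphabetic words longer than one character
        # (assumption: numbers and single letters are mostly noise).
        if (cur == prev && length(cur) > 1 && cur ~ /^[a-z]+$/)
          print FNR "\t" w[i]
        prev = cur                    # carried over to the next line as well
      }
    }
  ' "$@"
}
```

It could be run as find_repeats thesis.tex, where thesis.tex stands for whatever file is being checked. Unlike the script above it does not keep backslashes, so LaTeX commands are compared like ordinary words.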
