Tokenization

Tokenization separates punctuation, numbers, and syllables so that they are easier to process. It is the first step of the word segmentation system and also a preprocessing step for the training data.

The following script is adapted from UPenn's Treebank tokenization, with added handling for some special characters and single quotes. To use it, save the script as tokenizer.sed and run it with sed on Linux.

#!/bin/sed -f

# Sed script to produce Penn Treebank tokenization on arbitrary raw text.
# Yeah, sure.

# expected input: raw text with ONE SENTENCE TOKEN PER LINE

# by Robert MacIntyre, University of Pennsylvania, late 1995.

# If this wasn't such a trivial program, I'd include all that stuff about
# no warrantee, free use, etc. from the GNU General Public License.  If you
# want to be picky, assume that all of its terms apply.  Okay?

# attempt to get correct directional quotes
s=^"=`` =g
s=\([ ([{<]\)"=\1 `` =g
s=^'=` =g
s=\([ ([{<]\)'=\1 ` =g
# close quotes handled at end

s=\.\.\.= ... =g
s=[,;:@#$%&]= & =g

# Assume sentence tokenization has been done first, so split FINAL periods
# only.
s=\([^.]\)\([.]\)\([])}>"']*\)[ 	]*$=\1 \2\3 =g
# however, we may as well split ALL question marks and exclamation points,
# since they shouldn't have the abbrev.-marker ambiguity problem
s=[?!]= & =g

# parentheses, brackets, etc.
s=[][{}<>]= & =g
# Some taggers, such as Adwait Ratnaparkhi's MXPOST, use the parsed-file
# version of these symbols.
# UNCOMMENT THE FOLLOWING 6 LINES if you're using MXPOST.
# s/(/-LRB-/g
# s/)/-RRB-/g
# s/\[/-LSB-/g
# s/\]/-RSB-/g
# s/{/-LCB-/g
# s/}/-RCB-/g

s=--= -- =g

# NOTE THAT SPLIT WORDS ARE NOT MARKED.  Obviously this isn't great, since
# you might someday want to know how the words originally fit together --
# but it's too late to make a better system now, given the millions of
# words we've already done "wrong".

# First off, add a space to the beginning and end of each line, to reduce
# necessary number of regexps.
s=$= =
s=^= =

s="= '' =g
# possessive or close-single-quote
s=\([^']\)' =\1 ' =g
# as in it's, I'm, we'd
s='\([sSmMdD]\) = '\1 =g
s='ll = 'll =g
s='re = 're =g
s='ve = 've =g
s=n't = n't =g
s='LL = 'LL =g
s='RE = 'RE =g
s='VE = 'VE =g
s=N'T = N'T =g

s= \([Cc]\)annot = \1an not =g
s= \([Dd]\)'ye = \1' ye =g
s= \([Gg]\)imme = \1im me =g
s= \([Gg]\)onna = \1on na =g
s= \([Gg]\)otta = \1ot ta =g
s= \([Ll]\)emme = \1em me =g
s= \([Mm]\)ore'n = \1ore 'n =g
s= '\([Tt]\)is = '\1 is =g
s= '\([Tt]\)was = '\1 was =g
s= \([Ww]\)anna = \1an na =g
# s= \([Ww]\)haddya = \1ha dd ya =g
# s= \([Ww]\)hatcha = \1ha t cha =g

# special characters
s=“= `` =g
s=”= '' =g
s=„= '' =g
s=‘= ` =g
s=’= ' =g
s=…= ... =g

# clean out extra spaces
s=  *= =g
s=^ *==g
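
The workflow described above (save the rules to a file, then run sed on text with one sentence per line) can be tried out with the minimal sketch below. It is not the full tokenizer: it writes only a four-rule subset (punctuation splitting and space cleanup) to a temporary file so that the example runs on its own. The `tokenize` helper and the sample sentence are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical demo: a small subset of the tokenizer rules, written to a
# temporary file so the example is self-contained. In practice, use the
# full tokenizer.sed instead.
script=$(mktemp)
cat > "$script" <<'EOF'
s=[,;:@#$%&]= & =g
s=[?!]= & =g
s=  *= =g
s=^ *==g
EOF

# tokenize: reads raw text on stdin (one sentence per line) and writes
# the same text with tokens separated by single spaces.
tokenize() {
    sed -f "$script"
}

result=$(printf '%s\n' 'Xin chào, bạn khỏe không?' | tokenize)
printf '%s\n' "$result"

# Remove the temporary rule file.
rm -f "$script"
```

With the full script saved as tokenizer.sed, the equivalent call would be along the lines of `sed -f tokenizer.sed input.txt > output.txt`, where `input.txt` and `output.txt` are placeholder file names. Note that the cleanup rules strip leading spaces only, so each output line may end with a trailing space.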