2011-01-24

Levenshtein距離とか。

algorithm NLP

　某所のパズルでまさにこれを使う問題が出たので2年ぶりくらいに実装してみる。

●レーベンシュタイン距離

　２つの文字列があるとき、片方からもう片方に変換したい。
文字の置換・挿入、削除の３つの操作が可能なとき、一連の操作の最も少ない回数のこと。
　Googleの"もしかして"のような機能や、判別系モデルの特徴量なんかに使われていて、自然言語処理屋さんなら必ず知っている気がします。気のせいかもしれません。

　※ちなみにこのとき、置換操作のみを許可したものがハミング距離というやつです。

　昨今（といってももうだいぶ前から）ではバイオインフォでも使われます。
面倒な話は省きますが、謎の要請により、DNA配列同士を比較したいのだけど、
DNAには塩基の置換や挿入、欠失が起こるから困ったもんだ、という、
まことにレーベンシュタイン距離の扱う問題そのものがあったりします。

閑話休題

　しかし昨今はインターネットでこの辺の説明もコードも簡単に手に入るなぁ。
上で書いた内容なんてほとんど、むしろ断然詳しくネットに乗っているので僕が偉そうに書く必要もないのですが。

　そんな感じで虚無感に苛まれていてもあれなので、その辺に出回っているものとはほんのちょっと、wikiでは扱っていない部分を付け足したスニペットをはっときます。

　何が違うかって言うと、たとえば、置換はわりとよくあるんだけど、
挿入はほとんどないし、削除にいたっては絶望的にないよね、
という事前情報を状況に対応できるよう、コストを突っ込めるようにしただけ。

　※挿入と削除は見る方向が違うだけで、同じ処理なので、ここを変える要請があるかは不明

template< typename CostType >
int getLevenshteinDistance( const char* str1, const char* str2, CostType cost_ins, CostType cost_del, CostType cost_mut ){
	unsigned int len1 = std::strlen( str1 );
	unsigned int len2 = std::strlen( str2 );

	// alloc
	CostType **matrix = new CostType*[len1+1];
	for( unsigned int i=0; i <= len1; ++i ){ matrix[i] = new CostType[len2+1]; }

	// init dp matrix
	for( unsigned int i = 0; i <= len1; ++i ){ matrix[i][0] = i * cost_ins; }
	for( unsigned int i = 0; i <= len2; ++i ){ matrix[0][i] = i * cost_del; }
	for( unsigned int i = 1; i <= len1; ++i ){
		for( unsigned int j = 1; j <= len2; ++j ){
			CostType min = matrix[i-1][j-1] + ( ( str1[i-1] == str2[j-1] ) ? 0 : cost_mut );	// mutation 
			if( min > ( matrix[i-1][j] + cost_ins ) ){ min = matrix[i-1][j] + cost_ins; }	// insertion
			if( min > ( matrix[i][j-1] + cost_del ) ){ min = matrix[i][j-1] + cost_del; }	// deletion
			matrix[i][j] = min;

		}
	}

	for( unsigned int i = 0; i <= len1; ++i ){
		for( unsigned int j = 0; j <= len2; ++j ){
			std::cout << matrix[i][j] << " ";
		}
		std::cout << std::endl;
	}
	
	CostType result = matrix[len1][len2];

	for( unsigned int i=0; i < len1; ++i ){ delete[] matrix[i]; }
	delete[] matrix;

	return result;
}

また無駄にテンプレート。コストはintとは限らないからね！
テストしてないのでちゃんと動くのかは知りません…

●蛇足

　この辺の、文字比較の何とか距離というものは有用だからか、わりと色々な種類があって、
今回のレーベンシュタイン距離のほかにも、Jaro距離とか、Jaro-Winkler距離とか、
なんだかもう色々とわけが分からない感じになっています。

　その辺の話も、一応できるのですがまた今度。

2010-11-25

PythonでMethod Missing

Python

　聞くところによると、RubyにはMethod Missingなる機能があるとか。
いわく、オブジェクトにメソッドがない場合に、どういう処理をするかを記述するとか。

こいつぁ、われらがPythonにも欲しい機能です。

上記の文面をまともに受け取る使い方でも良いのですが、たとえば、あるオブジェクトのメソッド全てについて、
同様な処理を付加するといったラッパーが簡単に書けちゃうわけです。

というわけで調べてみた。
そのものずばりは出てこなかったけれど、おおむねこんなかんじ。
Method missing（のようなもの）を使った簡易ラッパー。

import threading
import time

class Wrappee:
    def __init__( self, count ):
        self.count = count

    def print_something( self, text ):
        for i in range( self.count ):
            print( i, " something : ", text )
            time.sleep( 0.5 )

class Wrapper:
    def __init__( self, wrapper ):
        self.wrappee= wrapper

    # Emulate ruby like " missing method ".
    def __getattr__( self, name ):
        def _method_missing( *args ):
            return args

        print( "--", name, " is called --")
        ret = getattr( self.wrappee, name, _method_missing)
        return ret


wrappee = Wrappee( 10 )
wrapper = Wrapper( wrappee )
wrapper.print_something( " foo " )

ううん、これは便利だ。
ただ、ラップされるオブジェクトに指定の関数名がない場合にエラーが出ないのが玉に傷。
どこかのサイトで関数名一覧をディクショナリでもっておいて検査するという手法も見られたけど、もっと楽にいけないものだろうか。

2010-11-09

MacPortsでHugs突っ込むときにコケる件。

Haskell

トラッカーによるとleopardでは起こらないけどsnow leopardでは起こる。

https://trac.macports.org/ticket/20950

以下を参考にすれば最短距離。
パッチ2つ当てるのみ。

http://antsomerset.co.uk/2010/10/04/installing-hugs98-haskell-on-mac-10-6-snow-leopard/

2010-11-08

すごいのみつけた

機械学習 Haskell

　以前から、Haskellの遅延評価の枠組みは確率分布のサンプリングとかやるのによさそうだなぁとか思っていたのですよ。

再帰を使うってだけなら、C++とかでも上記の式と同様になるかと思いますが、
いざサンプリングするぞって時に面倒なのが、SBPのtailの方。

僕の場合だと、必要な分だけ確率列を用意して、サンプリング中に必要になる、
そのつど確率変数に値を振っていくという形式のコードです。面倒。

んが、Haskellだと（多分）そのへん自動でやってくれる！すごい！
んじゃないのかな。実際動かしたわけじゃないのでわからないのですが。

そんなこんなで、Haskellでノンパラベイズ周辺のライブラリを書こうと意気込んだその矢先、

すごいのみつけた

しかもコードは、かのHal Daume III御大。

とりあえずオープンソースにして。コミットするから。

2010-09-28

O'REILLYどうした

雑記

　もともと翻訳どうよ、と思うものが多い出版社でしたが、
久々の出物、しかも大物、しかも２つ。

あまりに酷いので、ここで血祭りにあげようと思います。

プライムナンバーズ ―魅惑的で楽しい素数の事典 (O’Reilly math series)

作者: David Wells,伊知地宏(監訳),さかいなおみ
出版社/メーカー: オライリージャパン
発売日: 2008/10/25
メディア: 単行本
クリック: 15回
この商品を含むブログ (33件) を見る

数学を生み出す魔法のるつぼ ―実験数学への招待 (O’Reilly math series)

作者: Jonathan Borwein,Keith Devlin,伊知地宏
出版社/メーカー: オライリージャパン
発売日: 2009/12/28
メディア: 単行本（ソフトカバー）
クリック: 20回
この商品を含むブログ (11件) を見る

これはひどすぎる。読めたものではないとはまさにこれか。

分かりにくい日本語とか、そういうレベルではなく、
もはや日本語の体をなしていない文章だらけ。
誤植等も非常に多い。

久々に焚書目録に登録しました。

なお、非常に残念ですが、内容については触れることはできません。
５０ページくらいで読むのをやめてしまったので。
そのうち原書のほうを買います。

原著をなぶって遊ぶのはやめたらどうかと声を大にして言いたい。

....追記
プライムナンバーズを無理やり読んでみた。
普通に原書がほしい。

2010-09-26

はじめてのMac

雑記

　時代の流れに負けてMac book Proを買った。

とりあえずキーボードの使い方がwindowsと違いすぎて困ったが、

GUIのとてもきれいなlinuxなんだという友人のアドバイスでバリバリ使えるようになった。

　もちろんコマンドラインでしか使ってない。どうしてこうなった。

2010-09-20

Project Euler Problem 84

Python Project Euler

モノポリーで、どの升目に止まりやすいかという問題。
まじめに計算できそうな感じだけど、せっかくなのでサンプリングしてみる。
ただひたすら面倒なだけ。

import random
# monopoly simulator
cells = [ 'GO', 'A', 'CC', 'A', 'T', 
         'R', 'B', 'CH', 'B', 'B',
         'JAIL', 'C', 'U', 'C', 'C',
         'R', 'D', 'CC', 'D', 'D',
         'FP', 'E', 'CH', 'E', 'E',
         'R', 'F', 'F', 'U', 'F',
         'G2J', 'G', 'G', 'CC', 'G',
         'R', 'CH', 'H', 'T', 'H' ]
cccards = ['DC', 'DC', 'DC', 'DC', 'DC',
           'DC', 'DC', 'DC', 'DC', 'DC',
           'DC', 'DC', 'DC', 'DC', '0', '10']
ccards = ['DC', 'DC', 'DC', 'DC', 'DC',
               'DC', '0', '10', '11', '24', 
               '39', '5', 'G2NR', 'G2NR', 'G2NU',
               'GB3']
def pe84():
    ntrial = 500000
    shuffle_cards()
    visited = sampling( ntrial )
    print visited
    print get_solution( visited )
def get_solution( visited ):
    first = get_max_index( visited )
    visited[first] = 0
    second = get_max_index( visited )
    visited[second] = 0
    third = get_max_index( visited )
    return "".join( [encode_squareface( first ), encode_squareface( second ), encode_squareface( third ) ] )
def get_max_index( lis ):
    idx = 0
    max = 0
    for i in range( 0, len( lis ) ):
        if lis[i] >= max:
            idx = i
            max = lis[i]
    return idx
              
def sampling( ntrial ):
    visited = [ 0 for i in range(0, len(cells))]
    pos = 0
    tried = 0
    # should I count first GO as visited ?
    while tried < ntrial:
        nextpos = nextcell( pos)
        #print len( visited ), " ", nextpos
        visited[nextpos]+= 1
        pos = nextpos
        tried +=1
    return visited

def nextcell( current_position):
    global prev_doubles
    score1 = random.randint( 1, 4 )
    score2 = random.randint( 1, 4 )
    if score1 == score2:
        if prev_doubles == 2:
            prev_doubles = 0
            return 10 
        else:
            prev_doubles += 1
    else:
        prev_doubles = 0
    next_position = advance( score1+score2, current_position )
    #print score, next_position
    if cells[next_position] == 'CC':
        return ccjump( next_position )
    elif cells[next_position] == 'CH':
        return cjump( next_position )
    elif cells[next_position] == 'G2J':
        return 10
    else:
        return next_position
def ccjump( current_position ): 
    card = drawcc()
    if card == 'DC':
        return current_position
    else:
        return int( card )
def cjump( current_position ):
    card = drawc()
    if card == 'DC':
        return current_position
    elif card == 'G2NR':
        return g2nr( current_position ) 
    elif card == 'G2NU':
        return g2nu( current_position )
    elif card == 'GB3':
        return back( 3, current_position )
    else:
        return int( card )
def g2nr( current_position ):
    while( True ):
        current_position = advance( 1, current_position)
        if cells[current_position] == 'R':
            return current_position
def g2nu( current_position ): 
    while( True ):
        current_position = advance( 1, current_position )
        if cells[current_position]  == 'U':
            return current_position
def advance( nsteps, current_position ):
    while nsteps > 0:
        current_position+=1
        if current_position > len( cells ) - 1:
            current_position = 0
        nsteps-=1
    return current_position
def back( nsteps, current_position ):
    while nsteps > 0:
        current_position -= 1
        if current_position < 0:
            current_position = len( cells ) - 1
        nsteps -=1
    return current_position 
def shuffle_cards():
    global ccards
    global cccards
    random.shuffle( ccards ) 
    random.shuffle( cccards )
def drawc():
    global ccards
    drawed = ccards.pop( 0 )
    ccards.append( drawed )
    return drawed 
def drawcc():
    global cccards
    drawed = cccards.pop( 0 )
    cccards.append( drawed )
    return drawed
def encode_squareface( num ):
    print "n : ", num 
    if num < 10:
        return '0'+str(num )
    else:
        return str( num )

prev_doubles = 0
pe84()

なんだかとてもまじめに書いた。

2010-09-15

Project Euler Problem 125

C++ Project Euler

　単なるブルートフォースなのだが、範囲が違うけど同じ数が出うることと、それを加算してはいけないことに1時間ハマって死にたい。何この凡ミス。

typedef unsigned long long int nat;

vector<nat> gensqsq(nat lim){
	vector<nat> sqsq;
	for(nat i=1;i*i <= lim;++i)sqsq.push_back(i*i);
	return sqsq;
}

bool ispalindromic( nat n ){
	if( n >= 10 && n % 10 == 0 ) return false;
	deque<int> d;
	while( n ){
		d.push_front( n % 10 );
		n /= 10; 
	}
	for( int i=0, j=d.size()-1; i <= j; ++i, --j ){
		if( d[i] != d[j] ) return false;
	}
	return true;
}

nat pe125(){
	set<nat> s;
	nat limit = 100000000;
	nat total = 0;
	
	vector<nat> sq = gensqsq(limit);

	for( nat from = 0; from < sq.size(); ++from){
		nat seqsum = 0;
		for(nat to=from; to < sq.size(); ++to ){
			seqsum += sq[(unsigned int )to];
			if( to == from ) continue;
			if( seqsum >= limit )break;
			if( ispalindromic( seqsum ) )
				s.insert( seqsum );
		}
	}
	for( set<nat>::iterator itr = s.begin(); itr != s.end(); ++itr )
		total += *itr;
	return total;
}