y_310's diary

iPhoneでOpenCVが使えることを知ったので、ついでに画像処理の高速化について今までもやもやしていたことをはっきりさせようと思い立ち実験してみた。

題材は画像のモザイク化

githubにあったこちら
niw/iphone_opencv_test · GitHub
のサンプルコードからforkさせてもらってモザイク化処理を追加した。
y310/iphone_opencv_test · GitHub
コンパイルオプションはXcodeのデフォルトで-Os(最も高速で最小)、iPhoneOS3.1.2のReleaseビルドで実験した。

Step1 とりあえず何も考えずに実装

モザイク処理メソッドから抜粋。
元画像のmSize*mSize領域のRGB平均値を計算し、出力先画像の同領域をその平均値で埋める、という処理。

int mSize = [mosaicSize intValue];
IplImage *img = [self CreateIplImageFromUIImage:imageView.image];
IplImage *dst = cvCreateImage(cvGetSize(img), img->depth, img->nChannels);
unsigned int r, g, b;
int p, pp, c;

for (int i = 0; i < img->height; i += mSize) {
    for (int j = 0; j < img->width; j += mSize) {
        r = b = g = 0;
        p = (i * img->width + j) * 3;
        c = 0;
        for (int k = 0; k < mSize; k++) {
            if (i + k < img->height) {
                for (int l = 0; l < mSize; l++) {
                    if (j + l < img->width) {
                        pp = p + (k * img->width + l) * 3;
                        b += (unsigned char)img->imageData[pp];
                        g += (unsigned char)img->imageData[pp + 1];
                        r += (unsigned char)img->imageData[pp + 2];
                        c++;
                    }
                }
            }
        }
        r /= c;
        g /= c;
        b /= c;
        
        for (int k = 0; k < mSize; k++) {
            if (i + k < img->height) {
                for (int l = 0; l < mSize; l++) {
                    if (j + l < img->width) {
                        pp = p + (k * img->width + l) * 3;
                        dst->imageData[pp] = (char)r;
                        dst->imageData[pp + 1] = (char)g;
                        dst->imageData[pp + 2] = (char)b;
                    }
                }
            }
        }
    }
}
cvReleaseImage(&img);
imageView.image = [self UIImageFromIplImage:dst];
cvReleaseImage(&dst);

この処理でmSizeを2,4,8,16,32,64,128で順番に実行した時の結果(5回の平均、単位は秒)

	Step1
2	0.2202906
4	0.1995558
8	0.1747878
16	0.1725956
32	0.1691068
64	0.166092
128	0.1662554
average:	0.1812404

これを基準に高速化してみる。

Step2 img->widthをwidthに変換(heightも)

forループなどで使っているimg->widthが毎回imgからwidthをたどっていて効率が悪いのではないかと考え、ループの前で値を保存する用に変更。

int width = img->width;
int height = img->height;

for (int i = 0; i < height; i += mSize) {
    for (int j = 0; j < width; j += mSize) {
        r = b = g = 0;
        p = (i * width + j) * 3;
        c = 0;
        for (int k = 0; k < mSize; k++) {
            if (i + k < height) {
                for (int l = 0; l < mSize; l++) {
                    if (j + l < width) {
                        pp = p + (k * width + l) * 3;
                        b += (unsigned char)img->imageData[pp];
                        g += (unsigned char)img->imageData[pp + 1];
                        r += (unsigned char)img->imageData[pp + 2];
                        c++;
                    }
                }
            }
        }
        r /= c;
        g /= c;
        b /= c;
        
        for (int k = 0; k < mSize; k++) {
            if (i + k < height) {
                for (int l = 0; l < mSize; l++) {
                    if (j + l < width) {
                        pp = p + (k * width + l) * 3;
                        dst->imageData[pp] = (char)r;
                        dst->imageData[pp + 1] = (char)g;
                        dst->imageData[pp + 2] = (char)b;
                    }
                }
            }
        }
    }
}

実行結果

	Step1	Step2
2	0.2202906	0.2058164
4	0.1995558	0.1892952
8	0.1747878	0.1648164
16	0.1725956	0.1648778
32	0.1691068	0.163123
64	0.166092	0.1677026
128	0.1662554	0.1650716
average:	0.1812404	0.174386

ちょっと早くなった。

Step3 ifの削減

画像サイズがmSizeで割り切れない時のためにif (i + k < height)という形でループ内でifの判定を行っていた部分を、事前にkMax,lMaxにループの最大値を保存することで削除した。

int kMax, lMax;
for (int i = 0; i < height; i += mSize) {
    for (int j = 0; j < width; j += mSize) {
        r = b = g = 0;
        p = (i * width + j) * 3;
        c = 0;
        kMax = i + mSize < height ? mSize : height - i;
        lMax = j + mSize < width  ? mSize : width - j;
        for (int k = 0; k < kMax; k++) {
            for (int l = 0; l < lMax; l++) {
                pp = p + (k * width + l) * 3;
                b += (unsigned char)img->imageData[pp];
                g += (unsigned char)img->imageData[pp + 1];
                r += (unsigned char)img->imageData[pp + 2];
                c++;
            }
        }
        r /= c;
        g /= c;
        b /= c;
        
        for (int k = 0; k < kMax; k++) {
            for (int l = 0; l < lMax; l++) {
                pp = p + (k * width + l) * 3;
                dst->imageData[pp] = (char)r;
                dst->imageData[pp + 1] = (char)g;
                dst->imageData[pp + 2] = (char)b;
            }
        }
    }
}

実行結果

	Step1	Step2	Step3
2	0.2202906	0.2058164	0.1998938
4	0.1995558	0.1892952	0.1865738
8	0.1747878	0.1648164	0.1517498
16	0.1725956	0.1648778	0.1577388
32	0.1691068	0.163123	0.1486748
64	0.166092	0.1677026	0.1521254
128	0.1662554	0.1650716	0.1549348
average:	0.1812404	0.174386	0.1645274

Step2から約0.01秒早くなった。

Step4 img->imageDataも削減&c++をc=kMax*lMaxに

width,heightと同じ考え方でimg->imageDataもimgImageDataに変換。
あと対象領域の平均値計算に使っていたcを毎回ループ内でカウントするのをやめてkMax*lMaxで計算。(最初に気付けと思ったがStep1の段階ではifの判定で数が変わるため事前には計算できなかった)

char *imgImageData = img->imageData;
char *dstImageData = dst->imageData;
for (int i = 0; i < height; i += mSize) {
    for (int j = 0; j < width; j += mSize) {
        r = b = g = 0;
        p = (i * width + j) * 3;
        kMax = i + mSize < height ? mSize : height - i;
        lMax = j + mSize < width  ? mSize : width - j;
        c = kMax * lMax;
        for (int k = 0; k < kMax; k++) {
            for (int l = 0; l < lMax; l++) {
                pp = p + (k * width + l) * 3;
                b += (unsigned char)imgImageData[pp];
                g += (unsigned char)imgImageData[pp + 1];
                r += (unsigned char)imgImageData[pp + 2];
            }
        }
        r /= c;
        g /= c;
        b /= c;
        
        for (int k = 0; k < kMax; k++) {
            for (int l = 0; l < lMax; l++) {
                pp = p + (k * width + l) * 3;
                dstImageData[pp] = (char)r;
                dstImageData[pp + 1] = (char)g;
                dstImageData[pp + 2] = (char)b;
            }
        }
    }
}

実行結果

	Step1	Step2	Step3	Step4
2	0.2202906	0.2058164	0.1998938	0.1796986
4	0.1995558	0.1892952	0.1865738	0.1683434
8	0.1747878	0.1648164	0.1517498	0.148594
16	0.1725956	0.1648778	0.1577388	0.1463726
32	0.1691068	0.163123	0.1486748	0.1627986
64	0.166092	0.1677026	0.1521254	0.1537092
128	0.1662554	0.1650716	0.1549348	0.151777
average:	0.1812404	0.174386	0.1645274	0.1587564

ここまでで初期状態から約0.023秒速くなった。

Step5 ppの計算回数を削減

k-lループの内側で毎回現在の注目ピクセル座標ppを計算していたが、kループの内側で計算すればlループ内はインクリメントだけでいける。

for (int i = 0; i < height; i += mSize) {
    for (int j = 0; j < width; j += mSize) {
        r = b = g = 0;
        p = (i * width + j) * 3;
        kMax = i + mSize < height ? mSize : height - i;
        lMax = j + mSize < width  ? mSize : width - j;
        c = kMax * lMax;
        for (int k = 0; k < kMax; k++) {
            pp = p + (k * width * 3);
            for (int l = 0; l < lMax; l++) {
                b += (unsigned char)imgImageData[pp++];
                g += (unsigned char)imgImageData[pp++];
                r += (unsigned char)imgImageData[pp++];
            }
        }
        r /= c;
        g /= c;
        b /= c;
        
        for (int k = 0; k < kMax; k++) {
            pp = p + (k * width * 3);
            for (int l = 0; l < lMax; l++) {
                dstImageData[pp++] = (char)r;
                dstImageData[pp++] = (char)g;
                dstImageData[pp++] = (char)b;
            }
        }
    }
}

実行結果

	Step1	Step2	Step3	Step4	Step5
2	0.2202906	0.2058164	0.1998938	0.1796986	0.1845956
4	0.1995558	0.1892952	0.1865738	0.1683434	0.1641452
8	0.1747878	0.1648164	0.1517498	0.148594	0.1551546
16	0.1725956	0.1648778	0.1577388	0.1463726	0.1454828
32	0.1691068	0.163123	0.1486748	0.1627986	0.147119
64	0.166092	0.1677026	0.1521254	0.1537092	0.1495658
128	0.1662554	0.1650716	0.1549348	0.151777	0.1533584
average:	0.1812404	0.174386	0.1645274	0.1587564	0.15706

一応速くなった…かな？ちょっと誤差かもしれない微妙な差。

Step6 掛け算を削減

掛け算より足し算の方が速いと聞いたことがあったのでppの計算から掛け算をなくしてみる。
Step5でppは繰り返しの度にwidth*3を増やす処理と同値だったので毎回width*3を足すようにする。

for (int i = 0; i < height; i += mSize) {
    for (int j = 0; j < width; j += mSize) {
        r = b = g = 0;
        p = (i * width + j) * 3;
        kMax = i + mSize < height ? mSize : height - i;
        lMax = j + mSize < width  ? mSize : width - j;
        c = kMax * lMax;
        pp = offset = p;
        for (int k = 0; k < kMax; k++) {	
            for (int l = 0; l < lMax; l++) {
                b += (unsigned char)imgImageData[pp++];
                g += (unsigned char)imgImageData[pp++];
                r += (unsigned char)imgImageData[pp++];
            }
            offset += stride;
            pp = offset;
        }
        r /= c;
        g /= c;
        b /= c;
        pp = offset = p;
        for (int k = 0; k < kMax; k++) {
            for (int l = 0; l < lMax; l++) {
                dstImageData[pp++] = (char)r;
                dstImageData[pp++] = (char)g;
                dstImageData[pp++] = (char)b;
            }
            offset += stride;
            pp = offset;
        }
    }
}

実行結果

	Step1	Step2	Step3	Step4	Step5	Step6
2	0.2202906	0.2058164	0.1998938	0.1796986	0.1845956	0.186088
4	0.1995558	0.1892952	0.1865738	0.1683434	0.1641452	0.1658398
8	0.1747878	0.1648164	0.1517498	0.148594	0.1551546	0.1556316
16	0.1725956	0.1648778	0.1577388	0.1463726	0.1454828	0.14568
32	0.1691068	0.163123	0.1486748	0.1627986	0.147119	0.1497236
64	0.166092	0.1677026	0.1521254	0.1537092	0.1495658	0.1495576
128	0.1662554	0.1650716	0.1549348	0.151777	0.1533584	0.1559454
average:	0.1812404	0.174386	0.1645274	0.1587564	0.15706	0.1583524

遅くなった…。1行の掛け算だったのが足し算と代入の2行に別れたせいだろうか。
理由は分からないがとりあえずあまり効果はないようだ。

Step7 構造体を使って代入を減らす

出力先画像への代入処理が同じ値をRGBの3回分毎回やるのは非効率な気がするのでunsigned charのRGB値を保持する構造体を作って、1回で構造体まるごと代入するようにしてみた。
正直この方法はchar*のimageDataを無理矢理struct Rgb*にキャストしてたりするので危険だと思う。
構造体もきっちりunsigned char*3つ分になる保証はないはず。

struct Rgb *dstImageData = (struct Rgb*)dst->imageData;
for (int i = 0; i < height; i += mSize) {
    kMax = i + mSize < height ? mSize : height - i;
    for (int j = 0; j < width; j += mSize) {
        r = b = g = 0;
        p = (i * width + j) * 3;
        lMax = j + mSize < width  ? mSize : width - j;
        c = kMax * lMax;
        pp = offset = p;
        for (int k = 0; k < kMax; k++) {	
            for (int l = 0; l < lMax; l++) {
                b += (unsigned char)imgImageData[pp++];
                g += (unsigned char)imgImageData[pp++];
                r += (unsigned char)imgImageData[pp++];
            }
            offset += stride;
            pp = offset;
        }
        rgb.r = r / c;
        rgb.g = g / c;
        rgb.b = b / c;
        pp = offset = i * width + j;
        for (int k = 0; k < kMax; k++) {
            for (int l = 0; l < lMax; l++) {
                dstImageData[pp++] = rgb;
            }
            offset += width;
            pp = offset;
        }
    }
}

実行結果

	Step1	Step2	Step3	Step4	Step5	Step6	Step7
2	0.2202906	0.2058164	0.1998938	0.1796986	0.1845956	0.186088	0.2409104
4	0.1995558	0.1892952	0.1865738	0.1683434	0.1641452	0.1658398	0.203338
8	0.1747878	0.1648164	0.1517498	0.148594	0.1551546	0.1556316	0.184218
16	0.1725956	0.1648778	0.1577388	0.1463726	0.1454828	0.14568	0.1747662
32	0.1691068	0.163123	0.1486748	0.1627986	0.147119	0.1497236	0.1759188
64	0.166092	0.1677026	0.1521254	0.1537092	0.1495658	0.1495576	0.1728034
128	0.1662554	0.1650716	0.1549348	0.151777	0.1533584	0.1559454	0.1736168
average:	0.1812404	0.174386	0.1645274	0.1587564	0.15706	0.1583524	0.1893676

さらに遅くなった…。というかStep1より遅い。なんでこんなに遅くなるんだろうか。
そろそろアセンブラレベルでちゃんと見ないといけない気がしてきた。

とりあえずまとめ

今までなんとなくこう書いた方が速くなるかなぁと思っていたことが、実際効果があったり逆効果だったりすることがわかった。
結局この辺どれが本当に正しいかはマシン語レベルで見ないとわからなそう。まだまだコンパイラの最適化だけに頼ってちゃいけないんですね。

追記

通りすがりさんからアイディアをいただいたので組み込んでみた。

void mosaic(char *imgImageData, char *dstImageData, int width, int sX, int sY, int w, int h, int wmSize, int hmSize, int stride) {
	int dp = width * 3;
	int dpp = wmSize * 3;
	unsigned int r, g, b;
	int p, pp, c;
	c = wmSize * hmSize;
	
	for (int y = sY; y < h; y++) {
		p = y * dp * hmSize;
		for (int x = sX; x < w; x++, p += dpp) {
			r = b = g = 0;
			pp = p;
			for (int yy = 0; yy < hmSize; yy++) {
				for (int xx = 0; xx < wmSize; xx++) {
					b += (unsigned char)imgImageData[pp++];
					g += (unsigned char)imgImageData[pp++];
					r += (unsigned char)imgImageData[pp++];
				}
				pp += stride;
			}
			r /= c;
			g /= c;
			b /= c;
			
			pp = p;
			for (int yy = 0; yy < hmSize; yy++) {
				for (int xx = 0; xx < wmSize; xx++) {
					dstImageData[pp++] = (char)r;
					dstImageData[pp++] = (char)g;
					dstImageData[pp++] = (char)b;
				}
				pp += stride;
			}
		}
	}
}

{
...
    int wBlockCnt = width / mSize;
    int hBlockCnt = height / mSize;

    /* 割り切れる領域 */
    mosaic(imgImageData, dstImageData, width, 0, 0, wBlockCnt, hBlockCnt, mSize, mSize, (width - mSize) * 3);

    /* 割り切れない領域 */
    if ( width % mSize > 0) {
        mosaic(imgImageData, dstImageData, width, wBlockCnt, 0, wBlockCnt + 1, hBlockCnt, width % mSize, mSize, (width - (width % mSize)) * 3);
    }
    if ( height % mSize > 0) {
        mosaic(imgImageData, dstImageData, width, 0, hBlockCnt, wBlockCnt, hBlockCnt + 1, mSize, height % mSize, (width - mSize) * 3);
    }
    if ( width % mSize > 0 && height % mSize > 0) {
        mosaic(imgImageData, dstImageData, width, wBlockCnt, hBlockCnt, wBlockCnt + 1, hBlockCnt + 1, width % mSize, height % mSize, (width - (width % mSize)) * 3);
    }
...
}

	Step1	Step2	Step3	Step4	Step5	Step6	Step7	Step8
2	0.2202906	0.2058164	0.1998938	0.1796986	0.1845956	0.186088	0.2409104	0.1631222
4	0.1995558	0.1892952	0.1865738	0.1683434	0.1641452	0.1658398	0.203338	0.15106
8	0.1747878	0.1648164	0.1517498	0.148594	0.1551546	0.1556316	0.184218	0.1435718
16	0.1725956	0.1648778	0.1577388	0.1463726	0.1454828	0.14568	0.1747662	0.142803
32	0.1691068	0.163123	0.1486748	0.1627986	0.147119	0.1497236	0.1759188	0.1379126
64	0.166092	0.1677026	0.1521254	0.1537092	0.1495658	0.1495576	0.1728034	0.13246
128	0.1662554	0.1650716	0.1549348	0.151777	0.1533584	0.1559454	0.1736168	0.1457272
average:	0.1812404	0.174386	0.1645274	0.1587564	0.15706	0.1583524	0.1893676	0.1452366

最速記録更新！余計なif文がなくなったのが大きいのかな。

さらに追記

さらに改良コードを貼っていただいたので取り込んでみました。
が、Step1より遅い結果に...正確な原因はわかりませんが剰余演算とif文がループ内に増えた分と配列の要素参照が増えたところあたりがアルゴリズムの改善効果以上に効いてるのでしょうか。

	Step1	Step2	Step3	Step4	Step5	Step6	Step7	Step8	Step9
2	0.2202906	0.2058164	0.1998938	0.1796986	0.1845956	0.186088	0.2409104	0.1631222	0.297052
4	0.1995558	0.1892952	0.1865738	0.1683434	0.1641452	0.1658398	0.203338	0.15106	0.26692
8	0.1747878	0.1648164	0.1517498	0.148594	0.1551546	0.1556316	0.184218	0.1435718	0.2529658
16	0.1725956	0.1648778	0.1577388	0.1463726	0.1454828	0.14568	0.1747662	0.142803	0.2481548
32	0.1691068	0.163123	0.1486748	0.1627986	0.147119	0.1497236	0.1759188	0.1379126	0.2526098
64	0.166092	0.1677026	0.1521254	0.1537092	0.1495658	0.1495576	0.1728034	0.13246	0.248334
128	0.1662554	0.1650716	0.1549348	0.151777	0.1533584	0.1559454	0.1736168	0.1457272	0.2471866
average:	0.1812404	0.174386	0.1645274	0.1587564	0.15706	0.1583524	0.1893676	0.1452366	0.2590316

CookingPadをダウンロードしてくださった皆様、ありがとうございます。
昨日の段階でダウンロード数が2万を超え、現在無料アプリランキングの2位に入っています。
あらためてクックパッドの人気に驚いています。

しかし、ここで水を差すようなことを言って申し訳ないのですが、CookingPadへの要望の中で実現不可能なこと、やるつもりがないことがあります。
過剰な期待を抱かれないように今のうちにその辺りをはっきりさせておきたいと思います。

CookingPadは非公式アプリです

CookingPadで表示しているクックパッドのデータは、スクレイピングという手法で取得しています。
簡単にいうとPCのブラウザで見えるクックパッドのページからプログラムで必要な情報を抜き出して表示する手法です。
そのため、クックパッドがページをリニューアルするなどして構造が変わってしまった場合、何も表示できなくなってしまう可能性があります。
実際、開発中につくれぽのページに変更がありプログラムを変更しないとCookingPadで表示できない状態になりました。

今後同様のことがあった場合、私がそれに気づいてプログラムを修正し、Appleに提出して審査を通過するまで最短でも2週間程度(場合によってはもっと長期間)CookingPadが使えなくなってしまいます。
これはこのアプリの仕組み上やむを得ないことですのでご了承ください。
(中間サーバでパースする方式はサーバ維持コスト等の問題によりやりません)

ログイン機能も実装しません

クックパッドの携帯サイトモバれぴを見たところ、携帯からのログインはプレミアムユーザ限定ということがわかりました。
そのため技術的な問題もさることながら、CookingPadでログインを実現した場合クックパッドのビジネスを阻害することになります。
それは私の本意ではありませんのでログイン機能は実装しません。(クックパッドが公式APIを提供するなど状況が変化した場合は別です)

人気順検索の実装は不可能です

人気順検索はプレミアムユーザ限定なので前述の理由により実装できません。
また基本的にレシピのソートはサーバ側でしか出来ませんので非公式アプリであるCookingPadでは実装できません。

機能追加した有料版は作りません

そもそもクックパッドのデータを勝手に表示させてもらっているアプリですので有料化はしません。

最後に

というようにCookingPadは非公式アプリのため様々な制約があります。
私としてはPC版のクックパッド並の機能を有した公式のクックパッドアプリが公開されることが一番だと思っています。
CookingPadについて沢山の要望やお褒めの言葉をいただいてとてもうれしく思っていますが、このような事情のためすべてにお答えすることが出来ないことをお許しください。

y_310's diary

iPhoneで画像処理の高速化実験

題材は画像のモザイク化

Step1 とりあえず何も考えずに実装

Step2 img->widthをwidthに変換(heightも)

Step3 ifの削減

Step4 img->imageDataも削減&c++をc=kMax*lMaxに

Step5 ppの計算回数を削減

Step6 掛け算を削減

Step7 構造体を使って代入を減らす

とりあえずまとめ

追記

さらに追記

Firefoxの終了が遅い

CookingPadで出来ないこと

初めてのiPhoneアプリ公開

WidgetPadを使ってみて

HTML5のCanvas

WidgetPad