Cycles: merging features from tomato branch.

=== BVH build time optimizations ===

* BVH building is now multithreaded. Not all of the build is multithreaded:
  packing and the initial bounding/splitting are still single threaded, but
  the recursive splitting is, and that was the main bottleneck.

* Object splitting now uses binning rather than sorting all elements, using
  code from the Embree raytracer from Intel.
  http://software.intel.com/en-us/articles/embree-photo-realistic-ray-tracing-kernels/

* Other small changes to avoid allocations, pack memory more tightly, avoid
  some unnecessary operations, ...
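The binned splitting idea can be sketched in a few lines (a simplified
single-axis Python illustration with hypothetical names, not the actual
Embree/Cycles code, which bins all three dimensions at once over 3D bounds):

```python
# Sketch of binned SAH splitting along one axis: rather than sorting all
# primitive references, centroids are dropped into a fixed number of bins
# and the surface area heuristic is evaluated only at bin boundaries.

def sah_binned_split(centroids, num_bins=32):
    """Return (best_cost, split_position) for one axis, or (inf, None)
    when all centroids coincide and no split is possible."""
    lo, hi = min(centroids), max(centroids)
    if hi == lo:
        return float("inf"), None
    scale = num_bins / (hi - lo)
    counts = [0] * num_bins
    for c in centroids:
        counts[min(int((c - lo) * scale), num_bins - 1)] += 1

    best_cost, best_pos = float("inf"), None
    left, total = 0, len(centroids)
    for i in range(1, num_bins):
        left += counts[i - 1]
        right = total - left
        # 1D "area" proxy: the extent of each side along this axis;
        # the real code uses half the surface area of the 3D bounds
        l_extent = (hi - lo) * i / num_bins
        r_extent = (hi - lo) - l_extent
        cost = l_extent * left + r_extent * right
        if cost < best_cost:
            best_cost, best_pos = cost, lo + l_extent
    return best_cost, best_pos
```

Because only `num_bins` candidate positions are evaluated per axis, the cost
per split is O(n) instead of the O(n log n) of a sort-based sweep.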

These optimizations do not work yet when Spatial Splits are enabled; that
needs more work. Other optimizations are also still needed, in particular
for the case of many low poly objects, the packing step and node memory
allocation.

BVH raytracing time should remain about the same, but BVH build time should
be significantly reduced; tests here show speedups of about 5x to 10x on a
dual core and 5x to 25x on an 8-core machine, depending on the scene.

=== Threads ===

Centralized task scheduler for multithreading, which is basically the
CPU device threading code wrapped into something reusable.

Basic idea is that there is a single TaskScheduler that keeps a pool of threads,
one for each core. Other places in the code can then create a TaskPool that they
can drop Tasks in to be executed by the scheduler, and wait for them to complete
or cancel them early.
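The idea can be sketched with Python's standard library (a hypothetical
illustration only; the real scheduler is C++, and its running tasks can also
poll a cancel flag, which Python futures cannot emulate):

```python
import os
from concurrent.futures import ThreadPoolExecutor, wait

# One shared scheduler with a worker thread per core; TaskPool instances
# hand work to it and can wait on or cancel their own tasks without
# owning any threads themselves.
_scheduler = ThreadPoolExecutor(max_workers=os.cpu_count() or 1)

class TaskPool:
    def __init__(self):
        self._futures = []

    def push(self, fn, *args):
        self._futures.append(_scheduler.submit(fn, *args))

    def wait(self):
        wait(self._futures)

    def cancel(self):
        # Future.cancel() only stops tasks that have not started yet
        for f in self._futures:
            f.cancel()
```

Many independent pools can share the one scheduler, so thread creation
happens once rather than per subsystem.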

=== Normal ===

Added a Normal output to the texture coordinate node. This currently gives
the object space normal, which stays the same under object animation.

In the future this might become a "generated" normal so it's also stable for
deforming objects, but for now it's already useful for non-deforming objects.

=== Render Layers ===

Per render layer Samples control; leaving it at 0 will use the common scene
setting.
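The fallback rule is simple; a minimal Python sketch (function name
hypothetical):

```python
def effective_samples(layer_samples, scene_samples):
    # 0 means "no override": fall back to the common scene setting
    return scene_samples if layer_samples == 0 else layer_samples
```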

Environment pass will now render environment even if film is set to transparent.

"Exclude Layers" added. Scene layers (all objects that influence the render,
directly or indirectly) are shared between all render layers. However,
sometimes it's useful to leave out some object's influence for a particular
render layer. That's what this option allows you to do.
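Since layer sets are bitmasks, the exclusion amounts to a single mask
operation; a Python sketch (name hypothetical):

```python
def effective_scene_layers(scene_layers, exclude_layers):
    """Clear the excluded bits from the shared scene layer mask
    for this render layer."""
    return scene_layers & ~exclude_layers
```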

=== Filter Glossy ===

When using a value higher than 0.0, this will blur glossy reflections after
blurry bounces, to reduce noise at the cost of accuracy. 1.0 is a good
starting value to tweak.

Some light paths have a low probability of being found while contributing much
light to the pixel. As a result these light paths will be found in some pixels
and not in others, causing fireflies. An example of such a difficult path might
be a small light that is causing a small specular highlight on a sharp glossy
material, which we are seeing through a rough glossy material. With path tracing
it is difficult to find the specular highlight, but if we increase the roughness
on the material the highlight gets bigger and softer, and so easier to find.

Often this blurring will be hardly noticeable, because we are seeing it through
a blurry material anyway, but there are also cases where this will lead to a
loss of detail in lighting.
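A rough sketch of the idea in Python (a hypothetical simplification, not the
exact Cycles formula):

```python
def filtered_roughness(roughness, prev_bounce_roughness, filter_glossy):
    # After a blurry bounce, clamp the roughness of the next glossy
    # closure so the highlight stays easy to find; filter_glossy scales
    # how aggressive the clamp is, and 0.0 disables filtering entirely.
    if filter_glossy == 0.0 or prev_bounce_roughness == 0.0:
        return roughness
    min_roughness = min(filter_glossy * prev_bounce_roughness, 1.0)
    return max(roughness, min_roughness)
```

With `filter_glossy` at 0 or a sharp previous bounce, the material is left
untouched, which is why the blurring is often hardly noticeable.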
This commit is contained in:
Brecht Van Lommel 2012-04-28 08:53:59 +00:00
parent fd2439f47a
commit 07b2241fb1
48 changed files with 3808 additions and 2356 deletions


@@ -85,10 +85,10 @@ class CyclesRenderSettings(bpy.types.PropertyGroup):
description="Leave out caustics, resulting in a darker image with less noise",
default=False,
)
cls.blur_caustics = FloatProperty(
name="Blur Caustics",
description="Blur caustics to reduce noise",
min=0.0, max=1.0,
cls.blur_glossy = FloatProperty(
name="Filter Glossy",
description="Adaptively blur glossy shaders after blurry bounces, to reduce noise at the cost of accuracy",
min=0.0, max=10.0,
default=0.0,
)


@@ -87,11 +87,11 @@ class CyclesRender_PT_integrator(CyclesButtonsPanel, Panel):
sub.prop(cscene, "diffuse_bounces", text="Diffuse")
sub.prop(cscene, "glossy_bounces", text="Glossy")
sub.prop(cscene, "transmission_bounces", text="Transmission")
sub.prop(cscene, "no_caustics")
#row = col.row()
#row.prop(cscene, "blur_caustics")
#row.active = not cscene.no_caustics
col.separator()
col.prop(cscene, "no_caustics")
col.prop(cscene, "blur_glossy")
class CyclesRender_PT_film(CyclesButtonsPanel, Panel):
@@ -178,10 +178,7 @@ class CyclesRender_PT_layers(CyclesButtonsPanel, Panel):
col = split.column()
col.prop(scene, "layers", text="Scene")
col.label(text="Material:")
col.prop(rl, "material_override", text="")
col.prop(rl, "use_sky", "Use Environment")
col.prop(rl, "layers_exclude", text="Exclude")
col = split.column()
col.prop(rl, "layers", text="Layer")
@@ -190,6 +187,16 @@ class CyclesRender_PT_layers(CyclesButtonsPanel, Panel):
split = layout.split()
col = split.column()
col.label(text="Material:")
col.prop(rl, "material_override", text="")
col = split.column()
col.prop(rl, "samples")
col.prop(rl, "use_sky", "Use Environment")
split = layout.split()
col = split.column()
col.label(text="Passes:")
col.prop(rl, "use_pass_combined")


@@ -218,12 +218,13 @@ void BlenderSession::render()
scene->film->passes = passes;
scene->film->tag_update(scene);
/* update session */
session->reset(buffer_params, session_params.samples);
/* update scene */
sync->sync_data(b_v3d, b_iter->name().c_str());
/* update session */
int samples = sync->get_layer_samples();
session->reset(buffer_params, (samples == 0)? session_params.samples: samples);
/* render */
session->start();
session->wait();


@@ -153,6 +153,8 @@ void BlenderSync::sync_integrator()
integrator->transparent_shadows = get_boolean(cscene, "use_transparent_shadows");
integrator->no_caustics = get_boolean(cscene, "no_caustics");
integrator->filter_glossy = get_float(cscene, "blur_glossy");
integrator->seed = get_int(cscene, "seed");
integrator->layer_flag = render_layer.layer;
@@ -208,6 +210,7 @@ void BlenderSync::sync_render_layers(BL::SpaceView3D b_v3d, const char *layer)
render_layer.holdout_layer = 0;
render_layer.material_override = PointerRNA_NULL;
render_layer.use_background = true;
render_layer.samples = 0;
return;
}
}
@@ -220,12 +223,13 @@ void BlenderSync::sync_render_layers(BL::SpaceView3D b_v3d, const char *layer)
for(r.layers.begin(b_rlay); b_rlay != r.layers.end(); ++b_rlay) {
if((!layer && first_layer) || (layer && b_rlay->name() == layer)) {
render_layer.name = b_rlay->name();
render_layer.scene_layer = get_layer(b_scene.layers());
render_layer.scene_layer = get_layer(b_scene.layers()) & ~get_layer(b_rlay->layers_exclude());
render_layer.layer = get_layer(b_rlay->layers());
render_layer.holdout_layer = get_layer(b_rlay->layers_zmask());
render_layer.layer |= render_layer.holdout_layer;
render_layer.material_override = b_rlay->material_override();
render_layer.use_background = b_rlay->use_sky();
render_layer.samples = b_rlay->samples();
}
first_layer = false;


@@ -57,6 +57,7 @@ public:
void sync_data(BL::SpaceView3D b_v3d, const char *layer = 0);
void sync_camera(BL::Object b_override, int width, int height);
void sync_view(BL::SpaceView3D b_v3d, BL::RegionView3D b_rv3d, int width, int height);
int get_layer_samples() { return render_layer.samples; }
/* get parameters */
static SceneParams get_scene_params(BL::Scene b_scene, bool background);
@@ -108,7 +109,8 @@ private:
RenderLayerInfo()
: scene_layer(0), layer(0), holdout_layer(0),
material_override(PointerRNA_NULL),
use_background(true)
use_background(true),
samples(0)
{}
string name;
@@ -117,6 +119,7 @@ private:
uint holdout_layer;
BL::Material material_override;
bool use_background;
int samples;
} render_layer;
};


@@ -10,17 +10,21 @@ set(INC
set(SRC
bvh.cpp
bvh_binning.cpp
bvh_build.cpp
bvh_node.cpp
bvh_sort.cpp
bvh_split.cpp
)
set(SRC_HEADERS
bvh.h
bvh_binning.h
bvh_build.h
bvh_node.h
bvh_params.h
bvh_sort.h
bvh_split.h
)
include_directories(${INC})


@@ -530,7 +530,7 @@ void RegularBVH::refit_nodes()
{
assert(!params.top_level);
BoundBox bbox;
BoundBox bbox = BoundBox::empty;
uint visibility = 0;
refit_node(0, (pack.is_leaf[0])? true: false, bbox, visibility);
}
@@ -572,7 +572,7 @@ void RegularBVH::refit_node(int idx, bool leaf, BoundBox& bbox, uint& visibility
}
else {
/* refit inner node, set bbox from children */
BoundBox bbox0, bbox1;
BoundBox bbox0 = BoundBox::empty, bbox1 = BoundBox::empty;
uint visibility0 = 0, visibility1 = 0;
refit_node((c0 < 0)? -c0-1: c0, (c0 < 0), bbox0, visibility0);


@@ -0,0 +1,223 @@
/*
* Adapted from code copyright 2009-2011 Intel Corporation
* Modifications Copyright 2012, Blender Foundation.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
//#define __KERNEL_SSE__
#include <stdlib.h>
#include "bvh_binning.h"
#include "util_algorithm.h"
#include "util_boundbox.h"
#include "util_types.h"
CCL_NAMESPACE_BEGIN
/* SSE replacements */
__forceinline void prefetch_L1 (const void* ptr) { }
__forceinline void prefetch_L2 (const void* ptr) { }
__forceinline void prefetch_L3 (const void* ptr) { }
__forceinline void prefetch_NTA(const void* ptr) { }
template<size_t src> __forceinline float extract(const int4& b)
{ return b[src]; }
template<size_t dst> __forceinline const float4 insert(const float4& a, const float b)
{ float4 r = a; r[dst] = b; return r; }
__forceinline int get_best_dimension(const float4& bestSAH)
{
// return (int)__bsf(movemask(reduce_min(bestSAH) == bestSAH));
float minSAH = min(bestSAH.x, min(bestSAH.y, bestSAH.z));
if(bestSAH.x == minSAH) return 0;
else if(bestSAH.y == minSAH) return 1;
else return 2;
}
/* BVH Object Binning */
BVHObjectBinning::BVHObjectBinning(const BVHRange& job, BVHReference *prims)
: BVHRange(job), splitSAH(FLT_MAX), dim(0), pos(0)
{
/* compute number of bins to use and precompute scaling factor for binning */
num_bins = min(size_t(MAX_BINS), size_t(4.0f + 0.05f*size()));
scale = rcp(cent_bounds().size()) * make_float3((float)num_bins);
/* initialize binning counter and bounds */
BoundBox bin_bounds[MAX_BINS][4]; /* bounds for every bin in every dimension */
int4 bin_count[MAX_BINS]; /* number of primitives mapped to bin */
for(size_t i = 0; i < num_bins; i++) {
bin_count[i] = make_int4(0);
bin_bounds[i][0] = bin_bounds[i][1] = bin_bounds[i][2] = BoundBox::empty;
}
/* map geometry to bins, unrolled once */
{
ssize_t i;
for(i = 0; i < ssize_t(size()) - 1; i += 2) {
prefetch_L2(&prims[start() + i + 8]);
/* map even and odd primitive to bin */
BVHReference prim0 = prims[start() + i + 0];
BVHReference prim1 = prims[start() + i + 1];
int4 bin0 = get_bin(prim0.bounds());
int4 bin1 = get_bin(prim1.bounds());
/* increase bounds for bins for even primitive */
int b00 = extract<0>(bin0); bin_count[b00][0]++; bin_bounds[b00][0].grow(prim0.bounds());
int b01 = extract<1>(bin0); bin_count[b01][1]++; bin_bounds[b01][1].grow(prim0.bounds());
int b02 = extract<2>(bin0); bin_count[b02][2]++; bin_bounds[b02][2].grow(prim0.bounds());
/* increase bounds of bins for odd primitive */
int b10 = extract<0>(bin1); bin_count[b10][0]++; bin_bounds[b10][0].grow(prim1.bounds());
int b11 = extract<1>(bin1); bin_count[b11][1]++; bin_bounds[b11][1].grow(prim1.bounds());
int b12 = extract<2>(bin1); bin_count[b12][2]++; bin_bounds[b12][2].grow(prim1.bounds());
}
/* for uneven number of primitives */
if(i < ssize_t(size())) {
/* map primitive to bin */
BVHReference prim0 = prims[start() + i];
int4 bin0 = get_bin(prim0.bounds());
/* increase bounds of bins */
int b00 = extract<0>(bin0); bin_count[b00][0]++; bin_bounds[b00][0].grow(prim0.bounds());
int b01 = extract<1>(bin0); bin_count[b01][1]++; bin_bounds[b01][1].grow(prim0.bounds());
int b02 = extract<2>(bin0); bin_count[b02][2]++; bin_bounds[b02][2].grow(prim0.bounds());
}
}
/* sweep from right to left and compute parallel prefix of merged bounds */
float4 r_area[MAX_BINS]; /* area of bounds of primitives on the right */
float4 r_count[MAX_BINS]; /* number of primitives on the right */
int4 count = make_int4(0);
BoundBox bx = BoundBox::empty;
BoundBox by = BoundBox::empty;
BoundBox bz = BoundBox::empty;
for(size_t i = num_bins - 1; i > 0; i--) {
count = count + bin_count[i];
r_count[i] = blocks(count);
bx = merge(bx,bin_bounds[i][0]); r_area[i][0] = bx.half_area();
by = merge(by,bin_bounds[i][1]); r_area[i][1] = by.half_area();
bz = merge(bz,bin_bounds[i][2]); r_area[i][2] = bz.half_area();
}
/* sweep from left to right and compute SAH */
int4 ii = make_int4(1);
float4 bestSAH = make_float4(FLT_MAX);
int4 bestSplit = make_int4(-1);
count = make_int4(0);
bx = BoundBox::empty;
by = BoundBox::empty;
bz = BoundBox::empty;
for(size_t i = 1; i < num_bins; i++, ii += make_int4(1)) {
count = count + bin_count[i-1];
bx = merge(bx,bin_bounds[i-1][0]); float Ax = bx.half_area();
by = merge(by,bin_bounds[i-1][1]); float Ay = by.half_area();
bz = merge(bz,bin_bounds[i-1][2]); float Az = bz.half_area();
float4 lCount = blocks(count);
float4 lArea = make_float4(Ax,Ay,Az,Az);
float4 sah = lArea*lCount + r_area[i]*r_count[i];
bestSplit = select(sah < bestSAH,ii,bestSplit);
bestSAH = min(sah,bestSAH);
}
int4 mask = float3_to_float4(cent_bounds().size()) <= make_float4(0.0f);
bestSAH = insert<3>(select(mask, make_float4(FLT_MAX), bestSAH), FLT_MAX);
/* find best dimension */
dim = get_best_dimension(bestSAH);
splitSAH = bestSAH[dim];
pos = bestSplit[dim];
leafSAH = bounds().half_area() * blocks(size());
}
void BVHObjectBinning::split(BVHReference* prims, BVHObjectBinning& left_o, BVHObjectBinning& right_o) const
{
size_t N = size();
BoundBox lgeom_bounds = BoundBox::empty;
BoundBox rgeom_bounds = BoundBox::empty;
BoundBox lcent_bounds = BoundBox::empty;
BoundBox rcent_bounds = BoundBox::empty;
ssize_t l = 0, r = N-1;
while(l <= r) {
prefetch_L2(&prims[start() + l + 8]);
prefetch_L2(&prims[start() + r - 8]);
BVHReference prim = prims[start() + l];
float3 center = prim.bounds().center2();
if(get_bin(center)[dim] < pos) {
lgeom_bounds.grow(prim.bounds());
lcent_bounds.grow(center);
l++;
}
else {
rgeom_bounds.grow(prim.bounds());
rcent_bounds.grow(center);
swap(prims[start()+l],prims[start()+r]);
r--;
}
}
/* finish */
if(l != 0 && N-1-r != 0) {
right_o = BVHObjectBinning(BVHRange(rgeom_bounds, rcent_bounds, start() + l, N-1-r), prims);
left_o = BVHObjectBinning(BVHRange(lgeom_bounds, lcent_bounds, start(), l), prims);
return;
}
/* object medium split if we did not make progress, can happen when all
primitives have same centroid */
lgeom_bounds = BoundBox::empty;
rgeom_bounds = BoundBox::empty;
lcent_bounds = BoundBox::empty;
rcent_bounds = BoundBox::empty;
for(size_t i = 0; i < N/2; i++) {
lgeom_bounds.grow(prims[start()+i].bounds());
lcent_bounds.grow(prims[start()+i].bounds().center2());
}
for(size_t i = N/2; i < N; i++) {
rgeom_bounds.grow(prims[start()+i].bounds());
rcent_bounds.grow(prims[start()+i].bounds().center2());
}
right_o = BVHObjectBinning(BVHRange(rgeom_bounds, rcent_bounds, start() + N/2, N/2 + N%2), prims);
left_o = BVHObjectBinning(BVHRange(lgeom_bounds, lcent_bounds, start(), N/2), prims);
}
CCL_NAMESPACE_END


@@ -0,0 +1,86 @@
/*
* Adapted from code copyright 2009-2011 Intel Corporation
* Modifications Copyright 2012, Blender Foundation.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#ifndef __BVH_BINNING_H__
#define __BVH_BINNING_H__
#include "bvh_params.h"
#include "util_types.h"
CCL_NAMESPACE_BEGIN
/* Single threaded object binner. Finds the split with the best SAH heuristic
* by testing for each dimension multiple partitionings for regular spaced
* partition locations. A partitioning for a partition location is computed,
* by putting primitives whose centroid is on the left and right of the split
* location to different sets. The SAH is evaluated by computing the number of
* blocks occupied by the primitives in the partitions. */
class BVHObjectBinning : public BVHRange
{
public:
__forceinline BVHObjectBinning() {}
BVHObjectBinning(const BVHRange& job, BVHReference *prims);
void split(BVHReference *prims, BVHObjectBinning& left_o, BVHObjectBinning& right_o) const;
float splitSAH; /* SAH cost of the best split */
float leafSAH; /* SAH cost of creating a leaf */
protected:
int dim; /* best split dimension */
int pos; /* best split position */
size_t num_bins; /* actual number of bins to use */
float3 scale; /* scaling factor to compute bin */
enum { MAX_BINS = 32 };
enum { LOG_BLOCK_SIZE = 2 };
/* computes the bin numbers for each dimension for a box. */
__forceinline int4 get_bin(const BoundBox& box) const
{
int4 a = make_int4((box.center2() - cent_bounds().min)*scale - make_float3(0.5f));
int4 mn = make_int4(0);
int4 mx = make_int4((int)num_bins-1);
return clamp(a, mn, mx);
}
/* computes the bin numbers for each dimension for a point. */
__forceinline int4 get_bin(const float3& c) const
{
return make_int4((c - cent_bounds().min)*scale - make_float3(0.5f));
}
/* compute the number of blocks occupied for each dimension. */
__forceinline float4 blocks(const int4& a) const
{
return make_float4((a + make_int4((1 << LOG_BLOCK_SIZE)-1)) >> LOG_BLOCK_SIZE);
}
/* compute the number of blocks occupied in one dimension. */
__forceinline int blocks(size_t a) const
{
return (int)((a+((1LL << LOG_BLOCK_SIZE)-1)) >> LOG_BLOCK_SIZE);
}
};
CCL_NAMESPACE_END
#endif


@@ -15,22 +15,36 @@
* limitations under the License.
*/
#include "bvh_binning.h"
#include "bvh_build.h"
#include "bvh_node.h"
#include "bvh_params.h"
#include "bvh_sort.h"
#include "bvh_split.h"
#include "mesh.h"
#include "object.h"
#include "scene.h"
#include "util_algorithm.h"
#include "util_debug.h"
#include "util_foreach.h"
#include "util_progress.h"
#include "util_time.h"
CCL_NAMESPACE_BEGIN
/* BVH Build Task */
class BVHBuildTask : public Task {
public:
BVHBuildTask(InnerNode *node_, int child_, BVHObjectBinning& range_, int level_)
: node(node_), child(child_), level(level_), range(range_) {}
InnerNode *node;
int child;
int level;
BVHObjectBinning range;
};
/* Constructor / Destructor */
BVHBuild::BVHBuild(const vector<Object*>& objects_,
@@ -41,10 +55,10 @@ BVHBuild::BVHBuild(const vector<Object*>& objects_,
prim_object(prim_object_),
params(params_),
progress(progress_),
progress_start_time(0.0)
progress_start_time(0.0),
task_pool(function_bind(&BVHBuild::thread_build_node, this, _1, _2))
{
spatial_min_overlap = 0.0f;
progress_num_duplicates = 0;
}
BVHBuild::~BVHBuild()
@@ -53,57 +67,63 @@ BVHBuild::~BVHBuild()
/* Adding References */
void BVHBuild::add_reference_mesh(NodeSpec& root, Mesh *mesh, int i)
void BVHBuild::add_reference_mesh(BoundBox& root, BoundBox& center, Mesh *mesh, int i)
{
for(uint j = 0; j < mesh->triangles.size(); j++) {
Mesh::Triangle t = mesh->triangles[j];
Reference ref;
BoundBox bounds = BoundBox::empty;
for(int k = 0; k < 3; k++) {
float3 pt = mesh->verts[t.v[k]];
ref.bounds.grow(pt);
bounds.grow(pt);
}
if(ref.bounds.valid()) {
ref.prim_index = j;
ref.prim_object = i;
references.push_back(ref);
root.bounds.grow(ref.bounds);
if(bounds.valid()) {
references.push_back(BVHReference(bounds, j, i));
root.grow(bounds);
center.grow(bounds.center2());
}
}
}
void BVHBuild::add_reference_object(NodeSpec& root, Object *ob, int i)
void BVHBuild::add_reference_object(BoundBox& root, BoundBox& center, Object *ob, int i)
{
Reference ref;
ref.prim_index = -1;
ref.prim_object = i;
ref.bounds = ob->bounds;
references.push_back(ref);
root.bounds.grow(ref.bounds);
references.push_back(BVHReference(ob->bounds, -1, i));
root.grow(ob->bounds);
center.grow(ob->bounds.center2());
}
void BVHBuild::add_references(NodeSpec& root)
void BVHBuild::add_references(BVHRange& root)
{
/* init root spec */
root.num = 0;
root.bounds = BoundBox();
/* reserve space for references */
size_t num_alloc_references = 0;
/* add objects */
foreach(Object *ob, objects) {
if(params.top_level) {
if(ob->mesh->transform_applied)
num_alloc_references += ob->mesh->triangles.size();
else
num_alloc_references++;
}
else
num_alloc_references += ob->mesh->triangles.size();
}
references.reserve(num_alloc_references);
/* add references from objects */
BoundBox bounds = BoundBox::empty, center = BoundBox::empty;
int i = 0;
foreach(Object *ob, objects) {
if(params.top_level) {
if(ob->mesh->transform_applied)
add_reference_mesh(root, ob->mesh, i);
add_reference_mesh(bounds, center, ob->mesh, i);
else
add_reference_object(root, ob, i);
add_reference_object(bounds, center, ob, i);
}
else
add_reference_mesh(root, ob->mesh, i);
add_reference_mesh(bounds, center, ob->mesh, i);
i++;
@@ -111,129 +131,213 @@ void BVHBuild::add_references(NodeSpec& root)
}
/* happens mostly on empty meshes */
if(!root.bounds.valid())
root.bounds.grow(make_float3(0.0f, 0.0f, 0.0f));
if(!bounds.valid())
bounds.grow(make_float3(0.0f, 0.0f, 0.0f));
root.num = references.size();
root = BVHRange(bounds, center, 0, references.size());
}
/* Build */
BVHNode* BVHBuild::run()
{
NodeSpec root;
BVHRange root;
/* add references */
add_references(root);
if(progress.get_cancel()) return NULL;
if(progress.get_cancel())
return NULL;
/* init spatial splits */
if(params.top_level) /* todo: get rid of this */
params.use_spatial_split = false;
spatial_min_overlap = root.bounds.area() * params.spatial_split_alpha;
spatial_min_overlap = root.bounds().safe_area() * params.spatial_split_alpha;
spatial_right_bounds.clear();
spatial_right_bounds.resize(max(root.num, (int)BVHParams::NUM_SPATIAL_BINS) - 1);
spatial_right_bounds.resize(max(root.size(), (int)BVHParams::NUM_SPATIAL_BINS) - 1);
/* init progress updates */
progress_num_duplicates = 0;
progress_start_time = time_dt();
progress_count = 0;
progress_total = references.size();
progress_original_total = progress_total;
prim_index.resize(references.size());
prim_object.resize(references.size());
/* build recursively */
return build_node(root, 0, 0.0f, 1.0f);
BVHNode *rootnode;
if(params.use_spatial_split) {
/* singlethreaded spatial split build */
rootnode = build_node(root, 0);
}
else {
/* multithreaded binning build */
BVHObjectBinning rootbin(root, &references[0]);
rootnode = build_node(rootbin, 0);
task_pool.wait();
}
/* delete if we cancelled */
if(rootnode) {
if(progress.get_cancel()) {
rootnode->deleteSubtree();
rootnode = NULL;
}
else if(!params.use_spatial_split) {
/*rotate(rootnode, 4, 5);*/
rootnode->update_visibility();
}
}
return rootnode;
}
void BVHBuild::progress_update(float progress_start, float progress_end)
void BVHBuild::progress_update()
{
if(time_dt() - progress_start_time < 0.25f)
return;
double progress_start = (double)progress_count/(double)progress_total;
double duplicates = (double)(progress_total - progress_original_total)/(double)progress_total;
float duplicates = (float)progress_num_duplicates/(float)references.size();
string msg = string_printf("Building BVH %.0f%%, duplicates %.0f%%",
progress_start*100.0f, duplicates*100.0f);
progress.set_substatus(msg);
progress_start_time = time_dt();
progress_start_time = time_dt();
}
BVHNode* BVHBuild::build_node(const NodeSpec& spec, int level, float progress_start, float progress_end)
void BVHBuild::thread_build_node(Task *task_, int thread_id)
{
/* progress update */
progress_update(progress_start, progress_end);
if(progress.get_cancel()) return NULL;
if(progress.get_cancel())
return;
/* small enough or too deep => create leaf. */
if(spec.num <= params.min_leaf_size || level >= BVHParams::MAX_DEPTH)
return create_leaf_node(spec);
/* build nodes */
BVHBuildTask *task = (BVHBuildTask*)task_;
BVHNode *node = build_node(task->range, task->level);
/* find split candidates. */
float area = spec.bounds.area();
float leafSAH = area * params.triangle_cost(spec.num);
float nodeSAH = area * params.node_cost(2);
ObjectSplit object = find_object_split(spec, nodeSAH);
SpatialSplit spatial;
/* set child in inner node */
task->node->children[task->child] = node;
if(params.use_spatial_split && level < BVHParams::MAX_SPATIAL_DEPTH) {
BoundBox overlap = object.left_bounds;
overlap.intersect(object.right_bounds);
/* update progress */
if(task->range.size() < THREAD_TASK_SIZE) {
/*rotate(node, INT_MAX, 5);*/
if(overlap.area() >= spatial_min_overlap)
spatial = find_spatial_split(spec, nodeSAH);
thread_scoped_lock lock(build_mutex);
progress_count += task->range.size();
progress_update();
}
}
/* leaf SAH is the lowest => create leaf. */
float minSAH = min(min(leafSAH, object.sah), spatial.sah);
/* multithreaded binning builder */
BVHNode* BVHBuild::build_node(const BVHObjectBinning& range, int level)
{
size_t size = range.size();
float leafSAH = params.sah_triangle_cost * range.leafSAH;
float splitSAH = params.sah_node_cost * range.bounds().half_area() + params.sah_triangle_cost * range.splitSAH;
if(minSAH == leafSAH && spec.num <= params.max_leaf_size)
return create_leaf_node(spec);
/* make leaf node when threshold reached or SAH tells us */
if(params.small_enough_for_leaf(size, level) || (size <= params.max_leaf_size && leafSAH < splitSAH))
return create_leaf_node(range);
/* perform split. */
NodeSpec left, right;
if(params.use_spatial_split && minSAH == spatial.sah)
do_spatial_split(left, right, spec, spatial);
if(!left.num || !right.num)
do_object_split(left, right, spec, object);
/* perform split */
BVHObjectBinning left, right;
range.split(&references[0], left, right);
/* create inner node. */
progress_num_duplicates += left.num + right.num - spec.num;
InnerNode *inner;
float progress_mid = lerp(progress_start, progress_end, (float)right.num / (float)(left.num + right.num));
if(range.size() < THREAD_TASK_SIZE) {
/* local build */
BVHNode *leftnode = build_node(left, level + 1);
BVHNode *rightnode = build_node(right, level + 1);
BVHNode* rightNode = build_node(right, level + 1, progress_start, progress_mid);
if(progress.get_cancel()) {
if(rightNode) rightNode->deleteSubtree();
return NULL;
inner = new InnerNode(range.bounds(), leftnode, rightnode);
}
else {
/* threaded build */
inner = new InnerNode(range.bounds());
task_pool.push(new BVHBuildTask(inner, 0, left, level + 1), true);
task_pool.push(new BVHBuildTask(inner, 1, right, level + 1), true);
}
BVHNode* leftNode = build_node(left, level + 1, progress_mid, progress_end);
if(progress.get_cancel()) {
if(leftNode) leftNode->deleteSubtree();
return NULL;
}
return new InnerNode(spec.bounds, leftNode, rightNode);
return inner;
}
BVHNode *BVHBuild::create_object_leaf_nodes(const Reference *ref, int num)
/* single threaded spatial split builder */
BVHNode* BVHBuild::build_node(const BVHRange& range, int level)
{
/* progress update */
progress_update();
if(progress.get_cancel())
return NULL;
/* small enough or too deep => create leaf. */
if(params.small_enough_for_leaf(range.size(), level)) {
progress_count += range.size();
return create_leaf_node(range);
}
/* splitting test */
BVHMixedSplit split(this, range, level);
if(split.no_split) {
progress_count += range.size();
return create_leaf_node(range);
}
/* do split */
BVHRange left, right;
split.split(this, left, right, range);
progress_total += left.size() + right.size() - range.size();
size_t total = progress_total;
/* left node */
BVHNode *leftnode = build_node(left, level + 1);
/* right node (modify start for splits) */
right.set_start(right.start() + progress_total - total);
BVHNode *rightnode = build_node(right, level + 1);
/* inner node */
return new InnerNode(range.bounds(), leftnode, rightnode);
}
/* Create Nodes */
BVHNode *BVHBuild::create_object_leaf_nodes(const BVHReference *ref, int start, int num)
{
if(num == 0) {
BoundBox bounds;
BoundBox bounds = BoundBox::empty;
return new LeafNode(bounds, 0, 0, 0);
}
else if(num == 1) {
prim_index.push_back(ref[0].prim_index);
prim_object.push_back(ref[0].prim_object);
uint visibility = objects[ref[0].prim_object]->visibility;
return new LeafNode(ref[0].bounds, visibility, prim_index.size()-1, prim_index.size());
if(start == prim_index.size()) {
assert(params.use_spatial_split);
prim_index.push_back(ref->prim_index());
prim_object.push_back(ref->prim_object());
}
else {
prim_index[start] = ref->prim_index();
prim_object[start] = ref->prim_object();
}
uint visibility = objects[ref->prim_object()]->visibility;
return new LeafNode(ref->bounds(), visibility, start, start+1);
}
else {
int mid = num/2;
BVHNode *leaf0 = create_object_leaf_nodes(ref, mid);
BVHNode *leaf1 = create_object_leaf_nodes(ref+mid, num-mid);
BVHNode *leaf0 = create_object_leaf_nodes(ref, start, mid);
BVHNode *leaf1 = create_object_leaf_nodes(ref+mid, start+mid, num-mid);
BoundBox bounds;
BoundBox bounds = BoundBox::empty;
bounds.grow(leaf0->m_bounds);
bounds.grow(leaf1->m_bounds);
@@ -241,310 +345,136 @@ BVHNode *BVHBuild::create_object_leaf_nodes(const Reference *ref, int num)
}
}
BVHNode* BVHBuild::create_leaf_node(const NodeSpec& spec)
BVHNode* BVHBuild::create_leaf_node(const BVHRange& range)
{
vector<int>& p_index = prim_index;
vector<int>& p_object = prim_object;
BoundBox bounds;
int num = 0;
BoundBox bounds = BoundBox::empty;
int num = 0, ob_num = 0;
uint visibility = 0;
for(int i = 0; i < spec.num; i++) {
if(references.back().prim_index != -1) {
p_index.push_back(references.back().prim_index);
p_object.push_back(references.back().prim_object);
bounds.grow(references.back().bounds);
visibility |= objects[references.back().prim_object]->visibility;
references.pop_back();
for(int i = 0; i < range.size(); i++) {
BVHReference& ref = references[range.start() + i];
if(ref.prim_index() != -1) {
if(range.start() + num == prim_index.size()) {
assert(params.use_spatial_split);
p_index.push_back(ref.prim_index());
p_object.push_back(ref.prim_object());
}
else {
p_index[range.start() + num] = ref.prim_index();
p_object[range.start() + num] = ref.prim_object();
}
bounds.grow(ref.bounds());
visibility |= objects[ref.prim_object()]->visibility;
num++;
}
else {
if(ob_num < i)
references[range.start() + ob_num] = ref;
ob_num++;
}
}
BVHNode *leaf = NULL;
if(num > 0) {
leaf = new LeafNode(bounds, visibility, p_index.size() - num, p_index.size());
leaf = new LeafNode(bounds, visibility, range.start(), range.start() + num);
if(num == spec.num)
if(num == range.size())
return leaf;
}
/* while there may be multiple triangles in a leaf, for object primitives
* we want them to be the only one, so we */
int ob_num = spec.num - num;
const Reference *ref = (ob_num)? &references.back() - (ob_num - 1): NULL;
BVHNode *oleaf = create_object_leaf_nodes(ref, ob_num);
for(int i = 0; i < ob_num; i++)
references.pop_back();
* we want there to be the only one, so we keep splitting */
const BVHReference *ref = (ob_num)? &references[range.start()]: NULL;
BVHNode *oleaf = create_object_leaf_nodes(ref, range.start() + num, ob_num);
if(leaf)
return new InnerNode(spec.bounds, leaf, oleaf);
return new InnerNode(range.bounds(), leaf, oleaf);
else
return oleaf;
}
/* Object Split */
/* Tree Rotations */
BVHBuild::ObjectSplit BVHBuild::find_object_split(const NodeSpec& spec, float nodeSAH)
void BVHBuild::rotate(BVHNode *node, int max_depth, int iterations)
{
ObjectSplit split;
const Reference *ref_ptr = &references[references.size() - spec.num];
/* in tested scenes, this resulted in slightly slower raytracing, so disabled
* it for now. could be implementation bug, or depend on the scene */
if(node)
for(int i = 0; i < iterations; i++)
rotate(node, max_depth);
}
for(int dim = 0; dim < 3; dim++) {
/* sort references */
bvh_reference_sort(references.size() - spec.num, references.size(), &references[0], dim);
void BVHBuild::rotate(BVHNode *node, int max_depth)
{
/* nothing to rotate if we reached a leaf node. */
if(node->is_leaf() || max_depth < 0)
return;
InnerNode *parent = (InnerNode*)node;
/* sweep right to left and determine bounds. */
BoundBox right_bounds;
/* rotate all children first */
for(size_t c = 0; c < 2; c++)
rotate(parent->children[c], max_depth-1);
for(int i = spec.num - 1; i > 0; i--) {
right_bounds.grow(ref_ptr[i].bounds);
spatial_right_bounds[i - 1] = right_bounds;
}
/* compute current area of all children */
BoundBox bounds0 = parent->children[0]->m_bounds;
BoundBox bounds1 = parent->children[1]->m_bounds;
/* sweep left to right and select lowest SAH. */
BoundBox left_bounds;
float area0 = bounds0.half_area();
float area1 = bounds1.half_area();
float4 child_area = make_float4(area0, area1, 0.0f, 0.0f);
for(int i = 1; i < spec.num; i++) {
left_bounds.grow(ref_ptr[i - 1].bounds);
right_bounds = spatial_right_bounds[i - 1];
/* find best rotation. we pick a target child of a first child, and swap
* this with the other child. we perform the best such swap. */
float best_cost = FLT_MAX;
int best_child = -1, best_target = -1, best_other = -1;
float sah = nodeSAH +
left_bounds.area() * params.triangle_cost(i) +
right_bounds.area() * params.triangle_cost(spec.num - i);
for(size_t c = 0; c < 2; c++) {
/* ignore leaf nodes as we cannot descend into them */
if(parent->children[c]->is_leaf())
continue;
if(sah < split.sah) {
split.sah = sah;
split.dim = dim;
split.num_left = i;
split.left_bounds = left_bounds;
split.right_bounds = right_bounds;
InnerNode *child = (InnerNode*)parent->children[c];
BoundBox& other = (c == 0)? bounds1: bounds0;
/* transpose child bounds */
BoundBox target0 = child->children[0]->m_bounds;
BoundBox target1 = child->children[1]->m_bounds;
/* compute cost for both possible swaps */
float cost0 = merge(other, target1).half_area() - child_area[c];
float cost1 = merge(target0, other).half_area() - child_area[c];
if(min(cost0,cost1) < best_cost) {
best_child = (int)c;
best_other = (int)(1-c);
if(cost0 < cost1) {
best_cost = cost0;
best_target = 0;
}
else {
best_cost = cost1;
best_target = 1;
}
}
}
return split;
}
/* if we did not find a swap that improves the SAH then do nothing */
if(best_cost >= 0)
return;
void BVHBuild::do_object_split(NodeSpec& left, NodeSpec& right, const NodeSpec& spec, const ObjectSplit& split)
{
/* sort references according to split */
int start = references.size() - spec.num;
int end = references.size(); /* todo: is this right? */
/* perform the best found tree rotation */
InnerNode *child = (InnerNode*)parent->children[best_child];
bvh_reference_sort(start, end, &references[0], split.dim);
/* split node specs */
left.num = split.num_left;
left.bounds = split.left_bounds;
right.num = spec.num - split.num_left;
right.bounds = split.right_bounds;
}
/* Spatial Split */
BVHBuild::SpatialSplit BVHBuild::find_spatial_split(const NodeSpec& spec, float nodeSAH)
{
/* initialize bins. */
float3 origin = spec.bounds.min;
float3 binSize = (spec.bounds.max - origin) * (1.0f / (float)BVHParams::NUM_SPATIAL_BINS);
float3 invBinSize = 1.0f / binSize;
for(int dim = 0; dim < 3; dim++) {
for(int i = 0; i < BVHParams::NUM_SPATIAL_BINS; i++) {
SpatialBin& bin = spatial_bins[dim][i];
bin.bounds = BoundBox();
bin.enter = 0;
bin.exit = 0;
}
}
/* chop references into bins. */
for(unsigned int refIdx = references.size() - spec.num; refIdx < references.size(); refIdx++) {
const Reference& ref = references[refIdx];
float3 firstBinf = (ref.bounds.min - origin) * invBinSize;
float3 lastBinf = (ref.bounds.max - origin) * invBinSize;
int3 firstBin = make_int3((int)firstBinf.x, (int)firstBinf.y, (int)firstBinf.z);
int3 lastBin = make_int3((int)lastBinf.x, (int)lastBinf.y, (int)lastBinf.z);
firstBin = clamp(firstBin, 0, BVHParams::NUM_SPATIAL_BINS - 1);
lastBin = clamp(lastBin, firstBin, BVHParams::NUM_SPATIAL_BINS - 1);
for(int dim = 0; dim < 3; dim++) {
Reference currRef = ref;
for(int i = firstBin[dim]; i < lastBin[dim]; i++) {
Reference leftRef, rightRef;
split_reference(leftRef, rightRef, currRef, dim, origin[dim] + binSize[dim] * (float)(i + 1));
spatial_bins[dim][i].bounds.grow(leftRef.bounds);
currRef = rightRef;
}
spatial_bins[dim][lastBin[dim]].bounds.grow(currRef.bounds);
spatial_bins[dim][firstBin[dim]].enter++;
spatial_bins[dim][lastBin[dim]].exit++;
}
}
/* select best split plane. */
SpatialSplit split;
for(int dim = 0; dim < 3; dim++) {
/* sweep right to left and determine bounds. */
BoundBox right_bounds;
for(int i = BVHParams::NUM_SPATIAL_BINS - 1; i > 0; i--) {
right_bounds.grow(spatial_bins[dim][i].bounds);
spatial_right_bounds[i - 1] = right_bounds;
}
/* sweep left to right and select lowest SAH. */
BoundBox left_bounds;
int leftNum = 0;
int rightNum = spec.num;
for(int i = 1; i < BVHParams::NUM_SPATIAL_BINS; i++) {
left_bounds.grow(spatial_bins[dim][i - 1].bounds);
leftNum += spatial_bins[dim][i - 1].enter;
rightNum -= spatial_bins[dim][i - 1].exit;
float sah = nodeSAH +
left_bounds.area() * params.triangle_cost(leftNum) +
spatial_right_bounds[i - 1].area() * params.triangle_cost(rightNum);
if(sah < split.sah) {
split.sah = sah;
split.dim = dim;
split.pos = origin[dim] + binSize[dim] * (float)i;
}
}
}
return split;
}
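The plane selection above follows the classic binned-SAH pattern: one right-to-left sweep accumulates suffix bounds, then a left-to-right sweep evaluates every candidate plane in O(bins). A minimal 1D sketch of that two-sweep structure (illustrative only: interval lengths stand in for bound areas, and plain per-bin counts for the enter/exit bookkeeping):

```cpp
#include <cassert>
#include <cfloat>
#include <vector>

/* 1D stand-in for the binned SAH sweep: given per-bin primitive counts
 * over bins [0, n), return the split index in 1..n-1 that minimizes
 * left_extent * left_count + right_extent * right_count. */
static int best_binned_split(const std::vector<int>& bin_counts)
{
	int n = (int)bin_counts.size();

	/* sweep right to left and accumulate suffix counts */
	std::vector<int> right_count(n + 1, 0);
	for(int i = n - 1; i >= 0; i--)
		right_count[i] = right_count[i + 1] + bin_counts[i];

	/* sweep left to right and select lowest cost */
	float best_sah = FLT_MAX;
	int best_split = -1;
	int left_count = 0;

	for(int i = 1; i < n; i++) {
		left_count += bin_counts[i - 1];
		float sah = (float)i * left_count + (float)(n - i) * right_count[i];
		if(sah < best_sah) {
			best_sah = sah;
			best_split = i;
		}
	}

	return best_split;
}
```

The point of the precomputed suffix array is that each candidate plane costs O(1) to evaluate, so the whole selection is linear in the bin count rather than quadratic.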
void BVHBuild::do_spatial_split(NodeSpec& left, NodeSpec& right, const NodeSpec& spec, const SpatialSplit& split)
{
/* Categorize references and compute bounds.
*
* Left-hand side: [left_start, left_end[
* Uncategorized/split: [left_end, right_start[
* Right-hand side: [right_start, refs.size()[ */
vector<Reference>& refs = references;
int left_start = refs.size() - spec.num;
int left_end = left_start;
int right_start = refs.size();
left.bounds = right.bounds = BoundBox();
for(int i = left_end; i < right_start; i++) {
if(refs[i].bounds.max[split.dim] <= split.pos) {
/* entirely on the left-hand side */
left.bounds.grow(refs[i].bounds);
swap(refs[i], refs[left_end++]);
}
else if(refs[i].bounds.min[split.dim] >= split.pos) {
/* entirely on the right-hand side */
right.bounds.grow(refs[i].bounds);
swap(refs[i--], refs[--right_start]);
}
}
/* duplicate or unsplit references intersecting both sides. */
while(left_end < right_start) {
/* split reference. */
Reference lref, rref;
split_reference(lref, rref, refs[left_end], split.dim, split.pos);
/* compute SAH for duplicate/unsplit candidates. */
BoundBox lub = left.bounds; // Unsplit to left: new left-hand bounds.
BoundBox rub = right.bounds; // Unsplit to right: new right-hand bounds.
BoundBox ldb = left.bounds; // Duplicate: new left-hand bounds.
BoundBox rdb = right.bounds; // Duplicate: new right-hand bounds.
lub.grow(refs[left_end].bounds);
rub.grow(refs[left_end].bounds);
ldb.grow(lref.bounds);
rdb.grow(rref.bounds);
float lac = params.triangle_cost(left_end - left_start);
float rac = params.triangle_cost(refs.size() - right_start);
float lbc = params.triangle_cost(left_end - left_start + 1);
float rbc = params.triangle_cost(refs.size() - right_start + 1);
float unsplitLeftSAH = lub.area() * lbc + right.bounds.area() * rac;
float unsplitRightSAH = left.bounds.area() * lac + rub.area() * rbc;
float duplicateSAH = ldb.area() * lbc + rdb.area() * rbc;
float minSAH = min(min(unsplitLeftSAH, unsplitRightSAH), duplicateSAH);
if(minSAH == unsplitLeftSAH) {
/* unsplit to left */
left.bounds = lub;
left_end++;
}
else if(minSAH == unsplitRightSAH) {
/* unsplit to right */
right.bounds = rub;
swap(refs[left_end], refs[--right_start]);
}
else {
/* duplicate */
left.bounds = ldb;
right.bounds = rdb;
refs[left_end++] = lref;
refs.push_back(rref);
}
}
left.num = left_end - left_start;
right.num = refs.size() - right_start;
}
void BVHBuild::split_reference(Reference& left, Reference& right, const Reference& ref, int dim, float pos)
{
/* initialize references. */
left.prim_index = right.prim_index = ref.prim_index;
left.prim_object = right.prim_object = ref.prim_object;
left.bounds = right.bounds = BoundBox();
/* loop over vertices/edges. */
Object *ob = objects[ref.prim_object];
const Mesh *mesh = ob->mesh;
const int *inds = mesh->triangles[ref.prim_index].v;
const float3 *verts = &mesh->verts[0];
const float3* v1 = &verts[inds[2]];
for(int i = 0; i < 3; i++) {
const float3* v0 = v1;
int vindex = inds[i];
v1 = &verts[vindex];
float v0p = (*v0)[dim];
float v1p = (*v1)[dim];
/* insert vertex to the boxes it belongs to. */
if(v0p <= pos)
left.bounds.grow(*v0);
if(v0p >= pos)
right.bounds.grow(*v0);
/* edge intersects the plane => insert intersection to both boxes. */
if((v0p < pos && v1p > pos) || (v0p > pos && v1p < pos)) {
float3 t = lerp(*v0, *v1, clamp((pos - v0p) / (v1p - v0p), 0.0f, 1.0f));
left.bounds.grow(t);
right.bounds.grow(t);
}
}
/* intersect with original bounds. */
left.bounds.max[dim] = pos;
right.bounds.min[dim] = pos;
left.bounds.intersect(ref.bounds);
right.bounds.intersect(ref.bounds);
swap(parent->children[best_other], child->children[best_target]);
child->m_bounds = merge(child->children[0]->m_bounds, child->children[1]->m_bounds);
}
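The rotation step above scores each candidate swap as the child's new merged area minus its current area, and only applies a swap whose best cost is negative. A 1D sketch of that cost test, with intervals standing in for BoundBoxes and length for half-area (names are illustrative):

```cpp
#include <algorithm>
#include <cassert>

struct Interval { float lo, hi; };  /* 1D stand-in for BoundBox */

static float extent(const Interval& a) { return a.hi - a.lo; }

static Interval merge(const Interval& a, const Interval& b)
{
	return { std::min(a.lo, b.lo), std::max(a.hi, b.hi) };
}

/* Cost of swapping grandchild `target` out for the sibling `other`,
 * while grandchild `kept` stays: the child's new extent minus its
 * current extent. Negative means the rotation shrinks the node and
 * improves the SAH. */
static float swap_cost(const Interval& kept, const Interval& target,
                       const Interval& other)
{
	float current = extent(merge(kept, target));
	float after = extent(merge(kept, other));
	return after - current;
}
```

For example, swapping a far-away grandchild for a nearby sibling yields a negative cost, while the reverse swap yields a positive one, which the rotation pass rejects.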
CCL_NAMESPACE_END


@ -21,8 +21,10 @@
#include <float.h>
#include "bvh.h"
#include "bvh_binning.h"
#include "util_boundbox.h"
#include "util_task.h"
#include "util_vector.h"
CCL_NAMESPACE_BEGIN
@ -37,28 +39,7 @@ class Progress;
class BVHBuild
{
public:
struct Reference
{
int prim_index;
int prim_object;
BoundBox bounds;
Reference()
{
}
};
struct NodeSpec
{
int num;
BoundBox bounds;
NodeSpec()
{
num = 0;
}
};
/* Constructor/Destructor */
BVHBuild(
const vector<Object*>& objects,
vector<int>& prim_index,
@ -70,63 +51,37 @@ public:
BVHNode *run();
protected:
friend class BVHMixedSplit;
friend class BVHObjectSplit;
friend class BVHSpatialSplit;
/* adding references */
void add_reference_mesh(NodeSpec& root, Mesh *mesh, int i);
void add_reference_object(NodeSpec& root, Object *ob, int i);
void add_references(NodeSpec& root);
void add_reference_mesh(BoundBox& root, BoundBox& center, Mesh *mesh, int i);
void add_reference_object(BoundBox& root, BoundBox& center, Object *ob, int i);
void add_references(BVHRange& root);
/* building */
BVHNode *build_node(const NodeSpec& spec, int level, float progress_start, float progress_end);
BVHNode *create_leaf_node(const NodeSpec& spec);
BVHNode *create_object_leaf_nodes(const Reference *ref, int num);
BVHNode *build_node(const BVHRange& range, int level);
BVHNode *build_node(const BVHObjectBinning& range, int level);
BVHNode *create_leaf_node(const BVHRange& range);
BVHNode *create_object_leaf_nodes(const BVHReference *ref, int start, int num);
void progress_update(float progress_start, float progress_end);
/* threads */
enum { THREAD_TASK_SIZE = 4096 };
void thread_build_node(Task *task_, int thread_id);
thread_mutex build_mutex;
/* object splits */
struct ObjectSplit
{
float sah;
int dim;
int num_left;
BoundBox left_bounds;
BoundBox right_bounds;
/* progress */
void progress_update();
ObjectSplit()
: sah(FLT_MAX), dim(0), num_left(0)
{
}
};
ObjectSplit find_object_split(const NodeSpec& spec, float nodeSAH);
void do_object_split(NodeSpec& left, NodeSpec& right, const NodeSpec& spec, const ObjectSplit& split);
/* spatial splits */
struct SpatialSplit
{
float sah;
int dim;
float pos;
SpatialSplit()
: sah(FLT_MAX), dim(0), pos(0.0f)
{
}
};
struct SpatialBin
{
BoundBox bounds;
int enter;
int exit;
};
SpatialSplit find_spatial_split(const NodeSpec& spec, float nodeSAH);
void do_spatial_split(NodeSpec& left, NodeSpec& right, const NodeSpec& spec, const SpatialSplit& split);
void split_reference(Reference& left, Reference& right, const Reference& ref, int dim, float pos);
/* tree rotations */
void rotate(BVHNode *node, int max_depth);
void rotate(BVHNode *node, int max_depth, int iterations);
/* objects and primitive references */
vector<Object*> objects;
vector<Reference> references;
vector<BVHReference> references;
int num_original_references;
/* output primitive indexes and objects */
vector<int>& prim_index;
@ -138,12 +93,17 @@ protected:
/* progress reporting */
Progress& progress;
double progress_start_time;
int progress_num_duplicates;
size_t progress_count;
size_t progress_total;
size_t progress_original_total;
/* spatial splitting */
float spatial_min_overlap;
vector<BoundBox> spatial_right_bounds;
SpatialBin spatial_bins[3][BVHParams::NUM_SPATIAL_BINS];
BVHSpatialBin spatial_bins[3][BVHParams::NUM_SPATIAL_BINS];
/* threads */
TaskPool task_pool;
};
CCL_NAMESPACE_END


@ -24,6 +24,8 @@
CCL_NAMESPACE_BEGIN
/* BVH Node */
int BVHNode::getSubtreeSize(BVH_STAT stat) const
{
int cnt = 0;
@ -59,7 +61,8 @@ int BVHNode::getSubtreeSize(BVH_STAT stat) const
void BVHNode::deleteSubtree()
{
for(int i=0;i<num_children();i++)
get_child(i)->deleteSubtree();
if(get_child(i))
get_child(i)->deleteSubtree();
delete this;
}
@ -70,12 +73,27 @@ float BVHNode::computeSubtreeSAHCost(const BVHParams& p, float probability) cons
for(int i=0;i<num_children();i++) {
BVHNode *child = get_child(i);
SAH += child->computeSubtreeSAHCost(p, probability * child->m_bounds.area()/m_bounds.area());
SAH += child->computeSubtreeSAHCost(p, probability * child->m_bounds.safe_area()/m_bounds.safe_area());
}
return SAH;
}
uint BVHNode::update_visibility()
{
if(!is_leaf() && m_visibility == 0) {
InnerNode *inner = (InnerNode*)this;
BVHNode *child0 = inner->children[0];
BVHNode *child1 = inner->children[1];
m_visibility = child0->update_visibility()|child1->update_visibility();
}
return m_visibility;
}
/* Inner Node */
void InnerNode::print(int depth) const
{
for(int i = 0; i < depth; i++)


@ -49,8 +49,6 @@ public:
virtual int num_triangles() const { return 0; }
virtual void print(int depth = 0) const = 0;
float getArea() const { return m_bounds.area(); }
BoundBox m_bounds;
uint m_visibility;
@ -58,6 +56,8 @@ public:
int getSubtreeSize(BVH_STAT stat=BVH_STAT_NODE_COUNT) const;
float computeSubtreeSAHCost(const BVHParams& p, float probability = 1.0f) const;
void deleteSubtree();
uint update_visibility();
};
class InnerNode : public BVHNode
@ -66,9 +66,21 @@ public:
InnerNode(const BoundBox& bounds, BVHNode* child0, BVHNode* child1)
{
m_bounds = bounds;
m_visibility = child0->m_visibility|child1->m_visibility;
children[0] = child0;
children[1] = child1;
if(child0 && child1)
m_visibility = child0->m_visibility|child1->m_visibility;
else
m_visibility = 0; /* happens on build cancel */
}
InnerNode(const BoundBox& bounds)
{
m_bounds = bounds;
m_visibility = 0;
children[0] = NULL;
children[1] = NULL;
}
bool is_leaf() const { return false; }


@ -18,6 +18,8 @@
#ifndef __BVH_PARAMS_H__
#define __BVH_PARAMS_H__
#include "util_boundbox.h"
CCL_NAMESPACE_BEGIN
/* BVH Parameters */
@ -73,14 +75,97 @@ public:
}
/* SAH costs */
float cost(int num_nodes, int num_tris) const
__forceinline float cost(int num_nodes, int num_tris) const
{ return node_cost(num_nodes) + triangle_cost(num_tris); }
float triangle_cost(int n) const
__forceinline float triangle_cost(int n) const
{ return n*sah_triangle_cost; }
float node_cost(int n) const
__forceinline float node_cost(int n) const
{ return n*sah_node_cost; }
__forceinline bool small_enough_for_leaf(int size, int level)
{ return (size <= min_leaf_size || level >= MAX_DEPTH); }
};
/* BVH Reference
*
* Reference to a primitive. Primitive index and object are sneakily packed
* into BoundBox to reduce memory usage and align nicely */
class BVHReference
{
public:
__forceinline BVHReference() {}
__forceinline BVHReference(const BoundBox& bounds_, int prim_index, int prim_object)
: rbounds(bounds_)
{
rbounds.min.w = __int_as_float(prim_index);
rbounds.max.w = __int_as_float(prim_object);
}
__forceinline const BoundBox& bounds() const { return rbounds; }
__forceinline int prim_index() const { return __float_as_int(rbounds.min.w); }
__forceinline int prim_object() const { return __float_as_int(rbounds.max.w); }
protected:
BoundBox rbounds;
};
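The packing trick above stores the two integer indices in the otherwise-unused `.w` components of the 4-wide bound vectors. A standalone sketch of the underlying bit reinterpretation, using `memcpy` as a portable stand-in for the `__int_as_float`/`__float_as_int` helpers assumed to come from the Cycles utility headers:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

/* Portable stand-ins for __int_as_float / __float_as_int: reinterpret
 * the 32 bits of an int as a float and back. The value round-trips
 * losslessly because no numeric conversion takes place, only a bit
 * copy, even for values like -1 whose bit pattern is a float NaN. */
static float int_as_float(int32_t i)
{
	float f;
	std::memcpy(&f, &i, sizeof(f));
	return f;
}

static int32_t float_as_int(float f)
{
	int32_t i;
	std::memcpy(&i, &f, sizeof(i));
	return i;
}
```

Note the round trip must go through a bit copy: a plain numeric float conversion would lose precision for large indices and break the -1 sentinel used for object references.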
/* BVH Range
*
* Build range used during construction, to indicate the bounds and place in
* the reference array of a subset of primitives. Again uses trickery to pack
* integers into BoundBox for alignment purposes. */
class BVHRange
{
public:
__forceinline BVHRange()
{
rbounds.min.w = __int_as_float(0);
rbounds.max.w = __int_as_float(0);
}
__forceinline BVHRange(const BoundBox& bounds_, int start_, int size_)
: rbounds(bounds_)
{
rbounds.min.w = __int_as_float(start_);
rbounds.max.w = __int_as_float(size_);
}
__forceinline BVHRange(const BoundBox& bounds_, const BoundBox& cbounds_, int start_, int size_)
: rbounds(bounds_), cbounds(cbounds_)
{
rbounds.min.w = __int_as_float(start_);
rbounds.max.w = __int_as_float(size_);
}
__forceinline void set_start(int start_) { rbounds.min.w = __int_as_float(start_); }
__forceinline const BoundBox& bounds() const { return rbounds; }
__forceinline const BoundBox& cent_bounds() const { return cbounds; }
__forceinline int start() const { return __float_as_int(rbounds.min.w); }
__forceinline int size() const { return __float_as_int(rbounds.max.w); }
__forceinline int end() const { return start() + size(); }
protected:
BoundBox rbounds;
BoundBox cbounds;
};
/* BVH Spatial Bin */
struct BVHSpatialBin
{
BoundBox bounds;
int enter;
int exit;
__forceinline BVHSpatialBin()
{
}
};
CCL_NAMESPACE_END


@ -32,23 +32,23 @@ public:
dim = dim_;
}
bool operator()(const BVHBuild::Reference& ra, const BVHBuild::Reference& rb)
bool operator()(const BVHReference& ra, const BVHReference& rb)
{
float ca = ra.bounds.min[dim] + ra.bounds.max[dim];
float cb = rb.bounds.min[dim] + rb.bounds.max[dim];
float ca = ra.bounds().min[dim] + ra.bounds().max[dim];
float cb = rb.bounds().min[dim] + rb.bounds().max[dim];
if(ca < cb) return true;
else if(ca > cb) return false;
else if(ra.prim_object < rb.prim_object) return true;
else if(ra.prim_object > rb.prim_object) return false;
else if(ra.prim_index < rb.prim_index) return true;
else if(ra.prim_index > rb.prim_index) return false;
else if(ra.prim_object() < rb.prim_object()) return true;
else if(ra.prim_object() > rb.prim_object()) return false;
else if(ra.prim_index() < rb.prim_index()) return true;
else if(ra.prim_index() > rb.prim_index()) return false;
return false;
}
};
void bvh_reference_sort(int start, int end, BVHBuild::Reference *data, int dim)
void bvh_reference_sort(int start, int end, BVHReference *data, int dim)
{
sort(data+start, data+end, BVHReferenceCompare(dim));
}


@ -20,7 +20,7 @@
CCL_NAMESPACE_BEGIN
void bvh_reference_sort(int start, int end, BVHBuild::Reference *data, int dim);
void bvh_reference_sort(int start, int end, BVHReference *data, int dim);
CCL_NAMESPACE_END


@ -0,0 +1,293 @@
/*
* Adapted from code copyright 2009-2010 NVIDIA Corporation
* Modifications Copyright 2011, Blender Foundation.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "bvh_build.h"
#include "bvh_split.h"
#include "bvh_sort.h"
#include "mesh.h"
#include "object.h"
#include "util_algorithm.h"
CCL_NAMESPACE_BEGIN
/* Object Split */
BVHObjectSplit::BVHObjectSplit(BVHBuild *builder, const BVHRange& range, float nodeSAH)
: sah(FLT_MAX), dim(0), num_left(0), left_bounds(BoundBox::empty), right_bounds(BoundBox::empty)
{
const BVHReference *ref_ptr = &builder->references[range.start()];
float min_sah = FLT_MAX;
for(int dim = 0; dim < 3; dim++) {
/* sort references */
bvh_reference_sort(range.start(), range.end(), &builder->references[0], dim);
/* sweep right to left and determine bounds. */
BoundBox right_bounds = BoundBox::empty;
for(int i = range.size() - 1; i > 0; i--) {
right_bounds.grow(ref_ptr[i].bounds());
builder->spatial_right_bounds[i - 1] = right_bounds;
}
/* sweep left to right and select lowest SAH. */
BoundBox left_bounds = BoundBox::empty;
for(int i = 1; i < range.size(); i++) {
left_bounds.grow(ref_ptr[i - 1].bounds());
right_bounds = builder->spatial_right_bounds[i - 1];
float sah = nodeSAH +
left_bounds.safe_area() * builder->params.triangle_cost(i) +
right_bounds.safe_area() * builder->params.triangle_cost(range.size() - i);
if(sah < min_sah) {
min_sah = sah;
this->sah = sah;
this->dim = dim;
this->num_left = i;
this->left_bounds = left_bounds;
this->right_bounds = right_bounds;
}
}
}
}
void BVHObjectSplit::split(BVHBuild *builder, BVHRange& left, BVHRange& right, const BVHRange& range)
{
/* sort references according to split */
bvh_reference_sort(range.start(), range.end(), &builder->references[0], this->dim);
/* split node ranges */
left = BVHRange(this->left_bounds, range.start(), this->num_left);
right = BVHRange(this->right_bounds, left.end(), range.size() - this->num_left);
}
/* Spatial Split */
BVHSpatialSplit::BVHSpatialSplit(BVHBuild *builder, const BVHRange& range, float nodeSAH)
: sah(FLT_MAX), dim(0), pos(0.0f)
{
/* initialize bins. */
float3 origin = range.bounds().min;
float3 binSize = (range.bounds().max - origin) * (1.0f / (float)BVHParams::NUM_SPATIAL_BINS);
float3 invBinSize = 1.0f / binSize;
for(int dim = 0; dim < 3; dim++) {
for(int i = 0; i < BVHParams::NUM_SPATIAL_BINS; i++) {
BVHSpatialBin& bin = builder->spatial_bins[dim][i];
bin.bounds = BoundBox::empty;
bin.enter = 0;
bin.exit = 0;
}
}
/* chop references into bins. */
for(unsigned int refIdx = range.start(); refIdx < range.end(); refIdx++) {
const BVHReference& ref = builder->references[refIdx];
float3 firstBinf = (ref.bounds().min - origin) * invBinSize;
float3 lastBinf = (ref.bounds().max - origin) * invBinSize;
int3 firstBin = make_int3((int)firstBinf.x, (int)firstBinf.y, (int)firstBinf.z);
int3 lastBin = make_int3((int)lastBinf.x, (int)lastBinf.y, (int)lastBinf.z);
firstBin = clamp(firstBin, 0, BVHParams::NUM_SPATIAL_BINS - 1);
lastBin = clamp(lastBin, firstBin, BVHParams::NUM_SPATIAL_BINS - 1);
for(int dim = 0; dim < 3; dim++) {
BVHReference currRef = ref;
for(int i = firstBin[dim]; i < lastBin[dim]; i++) {
BVHReference leftRef, rightRef;
split_reference(builder, leftRef, rightRef, currRef, dim, origin[dim] + binSize[dim] * (float)(i + 1));
builder->spatial_bins[dim][i].bounds.grow(leftRef.bounds());
currRef = rightRef;
}
builder->spatial_bins[dim][lastBin[dim]].bounds.grow(currRef.bounds());
builder->spatial_bins[dim][firstBin[dim]].enter++;
builder->spatial_bins[dim][lastBin[dim]].exit++;
}
}
/* select best split plane. */
for(int dim = 0; dim < 3; dim++) {
/* sweep right to left and determine bounds. */
BoundBox right_bounds = BoundBox::empty;
for(int i = BVHParams::NUM_SPATIAL_BINS - 1; i > 0; i--) {
right_bounds.grow(builder->spatial_bins[dim][i].bounds);
builder->spatial_right_bounds[i - 1] = right_bounds;
}
/* sweep left to right and select lowest SAH. */
BoundBox left_bounds = BoundBox::empty;
int leftNum = 0;
int rightNum = range.size();
for(int i = 1; i < BVHParams::NUM_SPATIAL_BINS; i++) {
left_bounds.grow(builder->spatial_bins[dim][i - 1].bounds);
leftNum += builder->spatial_bins[dim][i - 1].enter;
rightNum -= builder->spatial_bins[dim][i - 1].exit;
float sah = nodeSAH +
left_bounds.safe_area() * builder->params.triangle_cost(leftNum) +
builder->spatial_right_bounds[i - 1].safe_area() * builder->params.triangle_cost(rightNum);
if(sah < this->sah) {
this->sah = sah;
this->dim = dim;
this->pos = origin[dim] + binSize[dim] * (float)i;
}
}
}
}
void BVHSpatialSplit::split(BVHBuild *builder, BVHRange& left, BVHRange& right, const BVHRange& range)
{
/* Categorize references and compute bounds.
*
* Left-hand side: [left_start, left_end[
* Uncategorized/split: [left_end, right_start[
* Right-hand side: [right_start, refs.size()[ */
vector<BVHReference>& refs = builder->references;
int left_start = range.start();
int left_end = left_start;
int right_start = range.end();
int right_end = range.end();
BoundBox left_bounds = BoundBox::empty;
BoundBox right_bounds = BoundBox::empty;
for(int i = left_end; i < right_start; i++) {
if(refs[i].bounds().max[this->dim] <= this->pos) {
/* entirely on the left-hand side */
left_bounds.grow(refs[i].bounds());
swap(refs[i], refs[left_end++]);
}
else if(refs[i].bounds().min[this->dim] >= this->pos) {
/* entirely on the right-hand side */
right_bounds.grow(refs[i].bounds());
swap(refs[i--], refs[--right_start]);
}
}
/* duplicate or unsplit references intersecting both sides. */
while(left_end < right_start) {
/* split reference. */
BVHReference lref, rref;
split_reference(builder, lref, rref, refs[left_end], this->dim, this->pos);
/* compute SAH for duplicate/unsplit candidates. */
BoundBox lub = left_bounds; // Unsplit to left: new left-hand bounds.
BoundBox rub = right_bounds; // Unsplit to right: new right-hand bounds.
BoundBox ldb = left_bounds; // Duplicate: new left-hand bounds.
BoundBox rdb = right_bounds; // Duplicate: new right-hand bounds.
lub.grow(refs[left_end].bounds());
rub.grow(refs[left_end].bounds());
ldb.grow(lref.bounds());
rdb.grow(rref.bounds());
float lac = builder->params.triangle_cost(left_end - left_start);
float rac = builder->params.triangle_cost(right_end - right_start);
float lbc = builder->params.triangle_cost(left_end - left_start + 1);
float rbc = builder->params.triangle_cost(right_end - right_start + 1);
float unsplitLeftSAH = lub.safe_area() * lbc + right_bounds.safe_area() * rac;
float unsplitRightSAH = left_bounds.safe_area() * lac + rub.safe_area() * rbc;
float duplicateSAH = ldb.safe_area() * lbc + rdb.safe_area() * rbc;
float minSAH = min(min(unsplitLeftSAH, unsplitRightSAH), duplicateSAH);
if(minSAH == unsplitLeftSAH) {
/* unsplit to left */
left_bounds = lub;
left_end++;
}
else if(minSAH == unsplitRightSAH) {
/* unsplit to right */
right_bounds = rub;
swap(refs[left_end], refs[--right_start]);
}
else {
/* duplicate */
left_bounds = ldb;
right_bounds = rdb;
refs[left_end++] = lref;
refs.insert(refs.begin() + right_end, rref);
right_end++;
}
}
left = BVHRange(left_bounds, left_start, left_end - left_start);
right = BVHRange(right_bounds, right_start, right_end - right_start);
}
void BVHSpatialSplit::split_reference(BVHBuild *builder, BVHReference& left, BVHReference& right, const BVHReference& ref, int dim, float pos)
{
/* initialize boundboxes */
BoundBox left_bounds = BoundBox::empty;
BoundBox right_bounds = BoundBox::empty;
/* loop over vertices/edges. */
Object *ob = builder->objects[ref.prim_object()];
const Mesh *mesh = ob->mesh;
const int *inds = mesh->triangles[ref.prim_index()].v;
const float3 *verts = &mesh->verts[0];
const float3* v1 = &verts[inds[2]];
for(int i = 0; i < 3; i++) {
const float3* v0 = v1;
int vindex = inds[i];
v1 = &verts[vindex];
float v0p = (*v0)[dim];
float v1p = (*v1)[dim];
/* insert vertex to the boxes it belongs to. */
if(v0p <= pos)
left_bounds.grow(*v0);
if(v0p >= pos)
right_bounds.grow(*v0);
/* edge intersects the plane => insert intersection to both boxes. */
if((v0p < pos && v1p > pos) || (v0p > pos && v1p < pos)) {
float3 t = lerp(*v0, *v1, clamp((pos - v0p) / (v1p - v0p), 0.0f, 1.0f));
left_bounds.grow(t);
right_bounds.grow(t);
}
}
/* intersect with original bounds. */
left_bounds.max[dim] = pos;
right_bounds.min[dim] = pos;
left_bounds.intersect(ref.bounds());
right_bounds.intersect(ref.bounds());
/* set references */
left = BVHReference(left_bounds, ref.prim_index(), ref.prim_object());
right = BVHReference(right_bounds, ref.prim_index(), ref.prim_object());
}
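`split_reference` finds where each triangle edge crosses the axis-aligned split plane by linear interpolation, clamping the parameter to guard against numerical drift on nearly plane-parallel edges. A standalone sketch of that single-edge computation on one axis (plain floats in place of `float3`; the helper names are illustrative, not Cycles API):

```cpp
#include <cassert>

/* Parameter t in [0, 1] at which the segment from a to b crosses the
 * split position `pos` on one axis, clamped as in the builder so a
 * degenerate or barely-crossing edge cannot produce a point outside
 * the segment. */
static float crossing_t(float a, float b, float pos)
{
	float t = (pos - a) / (b - a);
	return t < 0.0f ? 0.0f : (t > 1.0f ? 1.0f : t);
}

/* Linear interpolation along the edge; at t from crossing_t this
 * yields the intersection point that is grown into both boxes. */
static float lerp1(float a, float b, float t)
{
	return a + t * (b - a);
}
```

By construction the interpolated point lands exactly on the split position when the edge genuinely straddles the plane, so both child bounds stay tight against the plane before the final intersection with the original reference bounds.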
CCL_NAMESPACE_END


@ -0,0 +1,110 @@
/*
* Adapted from code copyright 2009-2010 NVIDIA Corporation
* Modifications Copyright 2011, Blender Foundation.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#ifndef __BVH_SPLIT_H__
#define __BVH_SPLIT_H__
#include "bvh_build.h"
#include "bvh_params.h"
CCL_NAMESPACE_BEGIN
class BVHBuild;
/* Object Split */
class BVHObjectSplit
{
public:
float sah;
int dim;
int num_left;
BoundBox left_bounds;
BoundBox right_bounds;
BVHObjectSplit() {}
BVHObjectSplit(BVHBuild *builder, const BVHRange& range, float nodeSAH);
void split(BVHBuild *builder, BVHRange& left, BVHRange& right, const BVHRange& range);
};
/* Spatial Split */
class BVHSpatialSplit
{
public:
float sah;
int dim;
float pos;
BVHSpatialSplit() : sah(FLT_MAX), dim(0), pos(0.0f) {}
BVHSpatialSplit(BVHBuild *builder, const BVHRange& range, float nodeSAH);
void split(BVHBuild *builder, BVHRange& left, BVHRange& right, const BVHRange& range);
void split_reference(BVHBuild *builder, BVHReference& left, BVHReference& right, const BVHReference& ref, int dim, float pos);
};
/* Mixed Object-Spatial Split */
class BVHMixedSplit
{
public:
BVHObjectSplit object;
BVHSpatialSplit spatial;
float leafSAH;
float nodeSAH;
float minSAH;
bool no_split;
__forceinline BVHMixedSplit(BVHBuild *builder, const BVHRange& range, int level)
{
/* find split candidates. */
float area = range.bounds().safe_area();
leafSAH = area * builder->params.triangle_cost(range.size());
nodeSAH = area * builder->params.node_cost(2);
object = BVHObjectSplit(builder, range, nodeSAH);
if(builder->params.use_spatial_split && level < BVHParams::MAX_SPATIAL_DEPTH) {
BoundBox overlap = object.left_bounds;
overlap.intersect(object.right_bounds);
if(overlap.safe_area() >= builder->spatial_min_overlap)
spatial = BVHSpatialSplit(builder, range, nodeSAH);
}
/* leaf SAH is the lowest => create leaf. */
minSAH = min(min(leafSAH, object.sah), spatial.sah);
no_split = (minSAH == leafSAH && range.size() <= builder->params.max_leaf_size);
}
__forceinline void split(BVHBuild *builder, BVHRange& left, BVHRange& right, const BVHRange& range)
{
if(builder->params.use_spatial_split && minSAH == spatial.sah)
spatial.split(builder, left, right, range);
if(!left.size() || !right.size())
object.split(builder, left, right, range);
}
};
CCL_NAMESPACE_END
#endif /* __BVH_SPLIT_H__ */


@ -58,15 +58,6 @@ void DeviceTask::split_max_size(list<DeviceTask>& tasks, int max_size)
split(tasks, num);
}
void DeviceTask::split(ThreadQueue<DeviceTask>& queue, int num)
{
list<DeviceTask> tasks;
split(tasks, num);
foreach(DeviceTask& task, tasks)
queue.push(task);
}
void DeviceTask::split(list<DeviceTask>& tasks, int num)
{
if(type == SHADER) {


@ -25,6 +25,7 @@
#include "util_list.h"
#include "util_string.h"
#include "util_task.h"
#include "util_thread.h"
#include "util_types.h"
#include "util_vector.h"
@ -66,7 +67,7 @@ public:
/* Device Task */
class DeviceTask {
class DeviceTask : public Task {
public:
typedef enum { PATH_TRACE, TONEMAP, SHADER } Type;
Type type;
@ -87,7 +88,6 @@ public:
DeviceTask(Type type = PATH_TRACE);
void split(list<DeviceTask>& tasks, int num);
void split(ThreadQueue<DeviceTask>& tasks, int num);
void split_max_size(list<DeviceTask>& tasks, int max_size);
};
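DeviceTask now derives from Task and runs through the centralized TaskPool described in the log above. A minimal sketch of that pool idea using std::thread — a fixed set of workers draining a shared queue, plus a wait for completion. This is illustrative only; the real Cycles TaskPool/TaskScheduler API and its cancellation support differ:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

/* Minimal task pool sketch: workers pull tasks from a shared queue,
 * wait_all() blocks until every pushed task has finished. */
class MiniTaskPool {
public:
	explicit MiniTaskPool(int num_threads) : pending(0), stop(false)
	{
		for(int i = 0; i < num_threads; i++)
			workers.emplace_back([this] { worker_run(); });
	}

	~MiniTaskPool()
	{
		{
			std::lock_guard<std::mutex> lock(mutex);
			stop = true;
		}
		cond.notify_all();
		for(std::thread& t : workers)
			t.join();
	}

	void push(std::function<void()> task)
	{
		{
			std::lock_guard<std::mutex> lock(mutex);
			tasks.push(std::move(task));
			pending++;
		}
		cond.notify_one();
	}

	void wait_all()
	{
		std::unique_lock<std::mutex> lock(mutex);
		done_cond.wait(lock, [this] { return pending == 0; });
	}

private:
	void worker_run()
	{
		for(;;) {
			std::function<void()> task;
			{
				std::unique_lock<std::mutex> lock(mutex);
				cond.wait(lock, [this] { return stop || !tasks.empty(); });
				if(stop && tasks.empty())
					return;
				task = std::move(tasks.front());
				tasks.pop();
			}
			task();  /* run outside the lock so workers overlap */
			{
				std::lock_guard<std::mutex> lock(mutex);
				if(--pending == 0)
					done_cond.notify_all();
			}
		}
	}

	std::vector<std::thread> workers;
	std::queue<std::function<void()> > tasks;
	std::mutex mutex;
	std::condition_variable cond, done_cond;
	int pending;
	bool stop;
};
```

The `pending` counter is only decremented after a task body returns, so wait_all() cannot return while any pushed task is still executing, which is the property the BVH builder relies on before packing nodes.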


@ -40,35 +40,21 @@ CCL_NAMESPACE_BEGIN
class CPUDevice : public Device
{
public:
vector<thread*> threads;
ThreadQueue<DeviceTask> tasks;
TaskPool task_pool;
KernelGlobals *kg;
CPUDevice(int threads_num)
: task_pool(function_bind(&CPUDevice::thread_run, this, _1, _2))
{
kg = kernel_globals_create();
/* do now to avoid thread issues */
system_cpu_support_optimized();
if(threads_num == 0)
threads_num = system_cpu_thread_count();
threads.resize(threads_num);
for(size_t i = 0; i < threads.size(); i++)
threads[i] = new thread(function_bind(&CPUDevice::thread_run, this, i));
}
~CPUDevice()
{
tasks.stop();
foreach(thread *t, threads) {
t->join();
delete t;
}
task_pool.stop();
kernel_globals_free(kg);
}
@ -127,25 +113,21 @@ public:
#endif
}
void thread_run(int t)
void thread_run(Task *task_, int thread_id)
{
DeviceTask task;
DeviceTask *task = (DeviceTask*)task_;
while(tasks.worker_wait_pop(task)) {
if(task.type == DeviceTask::PATH_TRACE)
thread_path_trace(task);
else if(task.type == DeviceTask::TONEMAP)
thread_tonemap(task);
else if(task.type == DeviceTask::SHADER)
thread_shader(task);
tasks.worker_done();
}
if(task->type == DeviceTask::PATH_TRACE)
thread_path_trace(*task);
else if(task->type == DeviceTask::TONEMAP)
thread_tonemap(*task);
else if(task->type == DeviceTask::SHADER)
thread_shader(*task);
}
void thread_path_trace(DeviceTask& task)
{
if(tasks.worker_cancel())
if(task_pool.cancelled())
return;
#ifdef WITH_OSL
@ -160,7 +142,7 @@ public:
kernel_cpu_optimized_path_trace(kg, (float*)task.buffer, (unsigned int*)task.rng_state,
task.sample, x, y, task.offset, task.stride);
if(tasks.worker_cancel())
if(task_pool.cancelled())
break;
}
}
@ -172,7 +154,7 @@ public:
kernel_cpu_path_trace(kg, (float*)task.buffer, (unsigned int*)task.rng_state,
task.sample, x, y, task.offset, task.stride);
if(tasks.worker_cancel())
if(task_pool.cancelled())
break;
}
}
@ -214,7 +196,7 @@ public:
for(int x = task.shader_x; x < task.shader_x + task.shader_w; x++) {
kernel_cpu_optimized_shader(kg, (uint4*)task.shader_input, (float4*)task.shader_output, task.shader_eval_type, x);
if(tasks.worker_cancel())
if(task_pool.cancelled())
break;
}
}
@ -224,7 +206,7 @@ public:
for(int x = task.shader_x; x < task.shader_x + task.shader_w; x++) {
kernel_cpu_shader(kg, (uint4*)task.shader_input, (float4*)task.shader_output, task.shader_eval_type, x);
if(tasks.worker_cancel())
if(task_pool.cancelled())
break;
}
}
@ -239,17 +221,22 @@ public:
{
/* split task into smaller ones, more than number of threads for uneven
workloads where some parts of the image render slower than others */
task.split(tasks, threads.size()*10);
list<DeviceTask> tasks;
task.split(tasks, TaskScheduler::num_threads()*10);
foreach(DeviceTask& task, tasks)
task_pool.push(new DeviceTask(task));
}
void task_wait()
{
tasks.wait_done();
task_pool.wait();
}
void task_cancel()
{
tasks.cancel();
task_pool.cancel();
}
};
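The rewritten `task_add` splits one render task into roughly ten subtasks per hardware thread, so threads that land on fast image regions can pick up extra work instead of idling. A sketch of that scanline slicing, assuming a hypothetical helper in place of `DeviceTask::split`:

```cpp
#include <utility>
#include <vector>

/* Split the scanline range [0, height) into num near-equal pieces,
 * in the spirit of DeviceTask::split. Empty pieces are dropped when
 * num exceeds height. */
std::vector<std::pair<int, int> > split_scanlines(int height, int num)
{
	std::vector<std::pair<int, int> > ranges;
	for(int i = 0; i < num; i++) {
		int y0 = (height * i) / num;
		int y1 = (height * (i + 1)) / num;
		if(y1 > y0)
			ranges.push_back(std::make_pair(y0, y1));
	}
	return ranges;
}
```

Oversplitting by a factor of ten is a simple load-balancing tradeoff: small enough pieces to even out uneven render cost, few enough to keep per-task scheduling overhead low.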

@ -257,13 +257,14 @@ public:
void task_add(DeviceTask& task)
{
ThreadQueue<DeviceTask> tasks;
list<DeviceTask> tasks;
task.split(tasks, devices.size());
foreach(SubDevice& sub, devices) {
DeviceTask subtask;
if(!tasks.empty()) {
DeviceTask subtask = tasks.front();
tasks.pop_front();
if(tasks.worker_wait_pop(subtask)) {
if(task.buffer) subtask.buffer = sub.ptr_map[task.buffer];
if(task.rng_state) subtask.rng_state = sub.ptr_map[task.rng_state];
if(task.rgba) subtask.rgba = sub.ptr_map[task.rgba];

@ -266,7 +266,7 @@ __device_inline void path_radiance_accum_background(PathRadiance *L, float3 thro
#endif
}
__device_inline float3 path_radiance_sum(PathRadiance *L)
__device_inline float3 path_radiance_sum(KernelGlobals *kg, PathRadiance *L)
{
#ifdef __PASSES__
if(L->use_light_pass) {
@ -283,9 +283,14 @@ __device_inline float3 path_radiance_sum(PathRadiance *L)
L->indirect_glossy *= L->indirect;
L->indirect_transmission *= L->indirect;
return L->emission + L->background
float3 L_sum = L->emission
+ L->direct_diffuse + L->direct_glossy + L->direct_transmission
+ L->indirect_diffuse + L->indirect_glossy + L->indirect_transmission;
if(!kernel_data.background.transparent)
L_sum += L->background;
return L_sum;
}
else
return L->emission;
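The change above keeps the background contribution out of the combined sum when the film is transparent, since in that case the background shows through alpha rather than the color passes. A scalar sketch of that accumulation (all inputs are hypothetical pass values):

```cpp
/* Combine radiance passes like path_radiance_sum: the background term
 * is only added when the film is not transparent. */
float radiance_sum(float emission, float background,
                   float direct, float indirect, bool transparent_film)
{
	float L_sum = emission + direct + indirect;
	if(!transparent_film)
		L_sum += background;
	return L_sum;
}
```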

@ -223,6 +223,7 @@ __device float4 kernel_path_integrate(KernelGlobals *kg, RNG *rng, int sample, R
path_radiance_init(&L, kernel_data.film.use_light_pass);
float min_ray_pdf = FLT_MAX;
float ray_pdf = 0.0f;
PathState state;
int rng_offset = PRNG_BASE_NUM;
@ -239,13 +240,17 @@ __device float4 kernel_path_integrate(KernelGlobals *kg, RNG *rng, int sample, R
/* eval background shader if nothing hit */
if(kernel_data.background.transparent && (state.flag & PATH_RAY_CAMERA)) {
L_transparent += average(throughput);
#ifdef __PASSES__
if(!(kernel_data.film.pass_flag & PASS_BACKGROUND))
#endif
break;
}
#ifdef __BACKGROUND__
else {
/* sample background shader */
float3 L_background = indirect_background(kg, &ray, state.flag, ray_pdf);
path_radiance_accum_background(&L, throughput, L_background, state.bounce);
}
/* sample background shader */
float3 L_background = indirect_background(kg, &ray, state.flag, ray_pdf);
path_radiance_accum_background(&L, throughput, L_background, state.bounce);
#endif
break;
@ -259,6 +264,18 @@ __device float4 kernel_path_integrate(KernelGlobals *kg, RNG *rng, int sample, R
kernel_write_data_passes(kg, buffer, &L, &sd, sample, state.flag, throughput);
/* blurring of bsdf after bounces, for rays that have a small likelihood
of following this particular path (diffuse, rough glossy) */
if(kernel_data.integrator.filter_glossy != FLT_MAX) {
float blur_pdf = kernel_data.integrator.filter_glossy*min_ray_pdf;
if(blur_pdf < 1.0f) {
float blur_roughness = sqrtf(1.0f - blur_pdf)*0.5f;
shader_bsdf_blur(kg, &sd, blur_roughness);
}
}
/* holdout */
#ifdef __HOLDOUT__
if((sd.flag & SD_HOLDOUT) && (state.flag & PATH_RAY_CAMERA)) {
float3 holdout_weight = shader_holdout_eval(kg, &sd);
@ -378,8 +395,10 @@ __device float4 kernel_path_integrate(KernelGlobals *kg, RNG *rng, int sample, R
path_radiance_bsdf_bounce(&L, &throughput, &bsdf_eval, bsdf_pdf, state.bounce, label);
/* set labels */
if(!(label & LABEL_TRANSPARENT))
if(!(label & LABEL_TRANSPARENT)) {
ray_pdf = bsdf_pdf;
min_ray_pdf = fminf(bsdf_pdf, min_ray_pdf);
}
/* update path state */
path_state_next(kg, &state, label);
@ -394,7 +413,7 @@ __device float4 kernel_path_integrate(KernelGlobals *kg, RNG *rng, int sample, R
#endif
}
float3 L_sum = path_radiance_sum(&L);
float3 L_sum = path_radiance_sum(kg, &L);
#ifdef __CLAMP_SAMPLE__
path_radiance_clamp(&L, &L_sum, kernel_data.integrator.sample_clamp);
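The filter-glossy hunk above blurs the BSDF on paths that are unlikely under the sampling pdf (diffuse or rough glossy bounces), trading a little bias for much less fireflies. A scalar sketch of the roughness it derives, using the same `FLT_MAX`-means-disabled convention:

```cpp
#include <cfloat>
#include <cmath>

/* Returns the roughness used to blur the BSDF, or 0.0f when no blur
 * applies, following the filter_glossy logic in the hunk above. */
float glossy_blur_roughness(float filter_glossy, float min_ray_pdf)
{
	if(filter_glossy == FLT_MAX)
		return 0.0f; /* feature disabled */

	float blur_pdf = filter_glossy * min_ray_pdf;

	if(blur_pdf >= 1.0f)
		return 0.0f; /* path likely enough, keep the BSDF sharp */

	return sqrtf(1.0f - blur_pdf) * 0.5f;
}
```

As `min_ray_pdf` drops toward zero the blur roughness rises toward 0.5, so the most improbable paths get the strongest smoothing.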

@ -516,6 +516,7 @@ typedef struct KernelIntegrator {
/* caustics */
int no_caustics;
float filter_glossy;
/* seed */
int seed;
@ -525,9 +526,6 @@ typedef struct KernelIntegrator {
/* clamp */
float sample_clamp;
/* padding */
int pad;
} KernelIntegrator;
typedef struct KernelBVH {

@ -40,6 +40,15 @@ __device void svm_node_tex_coord(KernelGlobals *kg, ShaderData *sd, float *stack
data = sd->P;
break;
}
case NODE_TEXCO_NORMAL: {
if(sd->object != ~0) {
Transform tfm = object_fetch_transform(kg, sd->object, OBJECT_INVERSE_TRANSFORM);
data = transform_direction(&tfm, sd->N);
}
else
data = sd->N;
break;
}
case NODE_TEXCO_CAMERA: {
Transform tfm = kernel_data.cam.worldtocamera;
@ -85,6 +94,15 @@ __device void svm_node_tex_coord_bump_dx(KernelGlobals *kg, ShaderData *sd, floa
data = sd->P + sd->dP.dx;
break;
}
case NODE_TEXCO_NORMAL: {
if(sd->object != ~0) {
Transform tfm = object_fetch_transform(kg, sd->object, OBJECT_INVERSE_TRANSFORM);
data = transform_direction(&tfm, sd->N);
}
else
data = sd->N;
break;
}
case NODE_TEXCO_CAMERA: {
Transform tfm = kernel_data.cam.worldtocamera;
@ -133,6 +151,15 @@ __device void svm_node_tex_coord_bump_dy(KernelGlobals *kg, ShaderData *sd, floa
data = sd->P + sd->dP.dy;
break;
}
case NODE_TEXCO_NORMAL: {
if(sd->object != ~0) {
Transform tfm = object_fetch_transform(kg, sd->object, OBJECT_INVERSE_TRANSFORM);
data = normalize(transform_direction(&tfm, sd->N));
}
else
data = sd->N;
break;
}
case NODE_TEXCO_CAMERA: {
Transform tfm = kernel_data.cam.worldtocamera;
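The new `NODE_TEXCO_NORMAL` cases map the world-space shading normal back into object space with the object's inverse transform, which is what keeps the output stable under object animation. A self-contained sketch with a pure rotation, where the inverse is just the transpose (the matrix here is a hypothetical stand-in for `object_fetch_transform(..., OBJECT_INVERSE_TRANSFORM)`):

```cpp
struct Vec3 { float x, y, z; };

/* Apply a 3x3 matrix to a direction, like transform_direction. */
Vec3 transform_direction(const float m[3][3], Vec3 d)
{
	Vec3 r;
	r.x = m[0][0]*d.x + m[0][1]*d.y + m[0][2]*d.z;
	r.y = m[1][0]*d.x + m[1][1]*d.y + m[1][2]*d.z;
	r.z = m[2][0]*d.x + m[2][1]*d.y + m[2][2]*d.z;
	return r;
}

/* Map a world-space normal into object space. The hardcoded matrix is
 * the inverse (= transpose) of a 90 degree rotation about z. */
Vec3 object_space_normal(Vec3 world_n)
{
	const float inv[3][3] = {{0, 1, 0}, {-1, 0, 0}, {0, 0, 1}};
	return transform_direction(inv, world_n);
}
```

For an object rotated 90 degrees about z, the world normal (0, 1, 0) maps back to the object-space normal (1, 0, 0), so the value a shader sees does not change as the object spins.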

@ -119,6 +119,7 @@ typedef enum NodeLightPath {
} NodeLightPath;
typedef enum NodeTexCoord {
NODE_TEXCO_NORMAL,
NODE_TEXCO_OBJECT,
NODE_TEXCO_CAMERA,
NODE_TEXCO_WINDOW,

@ -41,6 +41,7 @@ Integrator::Integrator()
transparent_shadows = false;
no_caustics = false;
filter_glossy = 0.0f;
seed = 0;
layer_flag = ~0;
sample_clamp = 0.0f;
@ -81,6 +82,8 @@ void Integrator::device_update(Device *device, DeviceScene *dscene)
kintegrator->transparent_shadows = transparent_shadows;
kintegrator->no_caustics = no_caustics;
kintegrator->filter_glossy = (filter_glossy == 0.0f)? FLT_MAX: 1.0f/filter_glossy;
kintegrator->seed = hash_int(seed);
kintegrator->layer_flag = layer_flag << PATH_RAY_LAYER_SHIFT;
@ -119,6 +122,7 @@ bool Integrator::modified(const Integrator& integrator)
transparent_probalistic == integrator.transparent_probalistic &&
transparent_shadows == integrator.transparent_shadows &&
no_caustics == integrator.no_caustics &&
filter_glossy == integrator.filter_glossy &&
layer_flag == integrator.layer_flag &&
seed == integrator.seed &&
sample_clamp == integrator.sample_clamp);

@ -41,6 +41,7 @@ public:
bool transparent_shadows;
bool no_caustics;
float filter_glossy;
int seed;
int layer_flag;

@ -43,6 +43,7 @@ Mesh::Mesh()
transform_applied = false;
transform_negative_scaled = false;
displacement_method = DISPLACE_BUMP;
bounds = BoundBox::empty;
bvh = NULL;
@ -96,7 +97,7 @@ void Mesh::add_triangle(int v0, int v1, int v2, int shader_, bool smooth_)
void Mesh::compute_bounds()
{
BoundBox bnds;
BoundBox bnds = BoundBox::empty;
size_t verts_size = verts.size();
for(size_t i = 0; i < verts_size; i++)
@ -697,6 +698,8 @@ void MeshManager::device_update(Device *device, DeviceScene *dscene, Scene *scen
progress.set_status(msg, "Building BVH");
mesh->compute_bvh(&scene->params, progress);
i++;
}
if(progress.get_cancel()) return;
@ -704,8 +707,6 @@ void MeshManager::device_update(Device *device, DeviceScene *dscene, Scene *scen
mesh->need_update = false;
mesh->need_update_rebuild = false;
}
i++;
}
foreach(Shader *shader, scene->shaders)

@ -1503,6 +1503,7 @@ TextureCoordinateNode::TextureCoordinateNode()
{
add_input("Normal", SHADER_SOCKET_NORMAL, ShaderInput::NORMAL, true);
add_output("Generated", SHADER_SOCKET_POINT);
add_output("Normal", SHADER_SOCKET_NORMAL);
add_output("UV", SHADER_SOCKET_POINT);
add_output("Object", SHADER_SOCKET_POINT);
add_output("Camera", SHADER_SOCKET_POINT);
@ -1551,6 +1552,12 @@ void TextureCoordinateNode::compile(SVMCompiler& compiler)
}
}
out = output("Normal");
if(!out->links.empty()) {
compiler.stack_assign(out);
compiler.add_node(texco_node, NODE_TEXCO_NORMAL, out->stack_offset);
}
out = output("UV");
if(!out->links.empty()) {
int attr = compiler.attribute(Attribute::STD_UV);

@ -37,6 +37,7 @@ Object::Object()
tfm = transform_identity();
visibility = ~0;
pass_id = 0;
bounds = BoundBox::empty;
}
Object::~Object()

@ -27,6 +27,7 @@
#include "util_foreach.h"
#include "util_function.h"
#include "util_task.h"
#include "util_time.h"
CCL_NAMESPACE_BEGIN
@ -37,6 +38,8 @@ Session::Session(const SessionParams& params_)
{
device_use_gl = ((params.device.type != DEVICE_CPU) && !params.background);
TaskScheduler::init(params.threads);
device = Device::create(params.device, params.background, params.threads);
buffers = new RenderBuffers(device);
display = new DisplayBuffer(device);
@ -88,6 +91,8 @@ Session::~Session()
delete display;
delete scene;
delete device;
TaskScheduler::exit();
}
void Session::start()

@ -93,7 +93,7 @@ void LinearQuadPatch::eval(float3 *P, float3 *dPdu, float3 *dPdv, float u, float
BoundBox LinearQuadPatch::bound()
{
BoundBox bbox;
BoundBox bbox = BoundBox::empty;
for(int i = 0; i < 4; i++)
bbox.grow(hull[i]);
@ -115,7 +115,7 @@ void LinearTrianglePatch::eval(float3 *P, float3 *dPdu, float3 *dPdv, float u, f
BoundBox LinearTrianglePatch::bound()
{
BoundBox bbox;
BoundBox bbox = BoundBox::empty;
for(int i = 0; i < 3; i++)
bbox.grow(hull[i]);
@ -132,7 +132,7 @@ void BicubicPatch::eval(float3 *P, float3 *dPdu, float3 *dPdv, float u, float v)
BoundBox BicubicPatch::bound()
{
BoundBox bbox;
BoundBox bbox = BoundBox::empty;
for(int i = 0; i < 16; i++)
bbox.grow(hull[i]);
@ -152,7 +152,7 @@ void BicubicTangentPatch::eval(float3 *P, float3 *dPdu, float3 *dPdv, float u, f
BoundBox BicubicTangentPatch::bound()
{
BoundBox bbox;
BoundBox bbox = BoundBox::empty;
for(int i = 0; i < 16; i++)
bbox.grow(hull[i]);
@ -205,7 +205,7 @@ void GregoryQuadPatch::eval(float3 *P, float3 *dPdu, float3 *dPdv, float u, floa
BoundBox GregoryQuadPatch::bound()
{
BoundBox bbox;
BoundBox bbox = BoundBox::empty;
for(int i = 0; i < 20; i++)
bbox.grow(hull[i]);
@ -276,7 +276,7 @@ void GregoryTrianglePatch::eval(float3 *P, float3 *dPdu, float3 *dPdv, float u,
BoundBox GregoryTrianglePatch::bound()
{
BoundBox bbox;
BoundBox bbox = BoundBox::empty;
for(int i = 0; i < 20; i++)
bbox.grow(hull[i]);

@ -15,6 +15,7 @@ set(SRC
util_path.cpp
util_string.cpp
util_system.cpp
util_task.cpp
util_time.cpp
util_transform.cpp
)
@ -50,6 +51,7 @@ set(SRC_HEADERS
util_set.h
util_string.h
util_system.h
util_task.h
util_thread.h
util_time.h
util_transform.h

@ -23,6 +23,7 @@
#include <float.h>
#include "util_math.h"
#include "util_string.h"
#include "util_transform.h"
#include "util_types.h"
@ -35,45 +36,81 @@ class BoundBox
public:
float3 min, max;
BoundBox(void)
__forceinline BoundBox()
{
min = make_float3(FLT_MAX, FLT_MAX, FLT_MAX);
max = make_float3(-FLT_MAX, -FLT_MAX, -FLT_MAX);
}
BoundBox(const float3& min_, const float3& max_)
__forceinline BoundBox(const float3& pt)
: min(pt), max(pt)
{
}
__forceinline BoundBox(const float3& min_, const float3& max_)
: min(min_), max(max_)
{
}
void grow(const float3& pt)
static struct empty_t {} empty;
__forceinline BoundBox(empty_t)
: min(make_float3(FLT_MAX, FLT_MAX, FLT_MAX)), max(make_float3(-FLT_MAX, -FLT_MAX, -FLT_MAX))
{
}
__forceinline void grow(const float3& pt)
{
min = ccl::min(min, pt);
max = ccl::max(max, pt);
}
void grow(const BoundBox& bbox)
__forceinline void grow(const BoundBox& bbox)
{
grow(bbox.min);
grow(bbox.max);
}
void intersect(const BoundBox& bbox)
__forceinline void intersect(const BoundBox& bbox)
{
min = ccl::max(min, bbox.min);
max = ccl::min(max, bbox.max);
}
float area(void) const
/* todo: avoid using this */
__forceinline float safe_area() const
{
if(!valid())
if(!((min.x <= max.x) && (min.y <= max.y) && (min.z <= max.z)))
return 0.0f;
float3 d = max - min;
return dot(d, d)*2.0f;
return area();
}
bool valid(void) const
__forceinline float area() const
{
return half_area()*2.0f;
}
__forceinline float half_area() const
{
float3 d = max - min;
return (d.x*d.z + d.y*d.z + d.x*d.y);
}
__forceinline float3 center() const
{
return 0.5f*(min + max);
}
__forceinline float3 center2() const
{
return min + max;
}
__forceinline float3 size() const
{
return max - min;
}
__forceinline bool valid() const
{
return (min.x <= max.x) && (min.y <= max.y) && (min.z <= max.z) &&
(isfinite(min.x) && isfinite(min.y) && isfinite(min.z)) &&
@ -82,7 +119,7 @@ public:
BoundBox transformed(const Transform *tfm)
{
BoundBox result;
BoundBox result = BoundBox::empty;
for(int i = 0; i < 8; i++) {
float3 p;
@ -98,6 +135,31 @@ public:
}
};
__forceinline BoundBox merge(const BoundBox& bbox, const float3& pt)
{
return BoundBox(min(bbox.min, pt), max(bbox.max, pt));
}
__forceinline BoundBox merge(const BoundBox& a, const BoundBox& b)
{
return BoundBox(min(a.min, b.min), max(a.max, b.max));
}
__forceinline BoundBox merge(const BoundBox& a, const BoundBox& b, const BoundBox& c, const BoundBox& d)
{
return merge(merge(a, b), merge(c, d));
}
__forceinline BoundBox intersect(const BoundBox& a, const BoundBox& b)
{
return BoundBox(max(a.min, b.min), min(a.max, b.max));
}
__forceinline BoundBox intersect(const BoundBox& a, const BoundBox& b, const BoundBox& c)
{
return intersect(a, intersect(b, c));
}
CCL_NAMESPACE_END
#endif /* __UTIL_BOUNDBOX_H__ */
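The new `half_area`/`area` helpers replace the old dot-product expression with the standard box surface-area formula used by SAH builders, and `safe_area` guards against the inverted min/max of an empty box. A check of that arithmetic on a concrete box (a 1 x 2 x 3 box has surface area 2*(1*2 + 2*3 + 1*3) = 22); the struct below is a simplified stand-in for `BoundBox`:

```cpp
struct Box {
	float min[3], max[3];

	/* Same formula as BoundBox::half_area: d.x*d.z + d.y*d.z + d.x*d.y. */
	float half_area() const
	{
		float dx = max[0] - min[0];
		float dy = max[1] - min[1];
		float dz = max[2] - min[2];
		return dx*dz + dy*dz + dx*dy;
	}
	float area() const { return half_area()*2.0f; }
	bool valid() const
	{
		return min[0] <= max[0] && min[1] <= max[1] && min[2] <= max[2];
	}
	/* Like safe_area: an empty (inverted) box contributes zero. */
	float safe_area() const { return valid()? area(): 0.0f; }
};

Box make_box(float x0, float y0, float z0, float x1, float y1, float z1)
{
	Box b = {{x0, y0, z0}, {x1, y1, z1}};
	return b;
}
```

Builders that only compare SAH costs can use `half_area` directly, skipping the constant factor of two.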

@ -182,93 +182,74 @@ __device_inline float average(const float2 a)
__device_inline float2 operator-(const float2 a)
{
float2 r = {-a.x, -a.y};
return r;
return make_float2(-a.x, -a.y);
}
__device_inline float2 operator*(const float2 a, const float2 b)
{
float2 r = {a.x*b.x, a.y*b.y};
return r;
return make_float2(a.x*b.x, a.y*b.y);
}
__device_inline float2 operator*(const float2 a, float f)
{
float2 r = {a.x*f, a.y*f};
return r;
return make_float2(a.x*f, a.y*f);
}
__device_inline float2 operator*(float f, const float2 a)
{
float2 r = {a.x*f, a.y*f};
return r;
return make_float2(a.x*f, a.y*f);
}
__device_inline float2 operator/(float f, const float2 a)
{
float2 r = {f/a.x, f/a.y};
return r;
return make_float2(f/a.x, f/a.y);
}
__device_inline float2 operator/(const float2 a, float f)
{
float invf = 1.0f/f;
float2 r = {a.x*invf, a.y*invf};
return r;
return make_float2(a.x*invf, a.y*invf);
}
__device_inline float2 operator/(const float2 a, const float2 b)
{
float2 r = {a.x/b.x, a.y/b.y};
return r;
return make_float2(a.x/b.x, a.y/b.y);
}
__device_inline float2 operator+(const float2 a, const float2 b)
{
float2 r = {a.x+b.x, a.y+b.y};
return r;
return make_float2(a.x+b.x, a.y+b.y);
}
__device_inline float2 operator-(const float2 a, const float2 b)
{
float2 r = {a.x-b.x, a.y-b.y};
return r;
return make_float2(a.x-b.x, a.y-b.y);
}
__device_inline float2 operator+=(float2& a, const float2 b)
{
a.x += b.x;
a.y += b.y;
return a;
return a = a + b;
}
__device_inline float2 operator*=(float2& a, const float2 b)
{
a.x *= b.x;
a.y *= b.y;
return a;
return a = a * b;
}
__device_inline float2 operator*=(float2& a, float f)
{
a.x *= f;
a.y *= f;
return a;
return a = a * f;
}
__device_inline float2 operator/=(float2& a, const float2 b)
{
a.x /= b.x;
a.y /= b.y;
return a;
return a = a / b;
}
__device_inline float2 operator/=(float2& a, float f)
{
float invf = 1.0f/f;
a.x *= invf;
a.y *= invf;
return a;
return a = a * invf;
}
@ -314,14 +295,12 @@ __device_inline bool operator!=(const float2 a, const float2 b)
__device_inline float2 min(float2 a, float2 b)
{
float2 r = {min(a.x, b.x), min(a.y, b.y)};
return r;
return make_float2(min(a.x, b.x), min(a.y, b.y));
}
__device_inline float2 max(float2 a, float2 b)
{
float2 r = {max(a.x, b.x), max(a.y, b.y)};
return r;
return make_float2(max(a.x, b.x), max(a.y, b.y));
}
__device_inline float2 clamp(float2 a, float2 mn, float2 mx)
@ -361,112 +340,78 @@ __device_inline float2 interp(float2 a, float2 b, float t)
/* Float3 Vector */
__device_inline bool is_zero(const float3 a)
{
return (a.x == 0.0f && a.y == 0.0f && a.z == 0.0f);
}
__device_inline float average(const float3 a)
{
return (a.x + a.y + a.z)*(1.0f/3.0f);
}
#ifndef __KERNEL_OPENCL__
__device_inline float3 operator-(const float3 a)
{
float3 r = make_float3(-a.x, -a.y, -a.z);
return r;
return make_float3(-a.x, -a.y, -a.z);
}
__device_inline float3 operator*(const float3 a, const float3 b)
{
float3 r = make_float3(a.x*b.x, a.y*b.y, a.z*b.z);
return r;
return make_float3(a.x*b.x, a.y*b.y, a.z*b.z);
}
__device_inline float3 operator*(const float3 a, float f)
{
float3 r = make_float3(a.x*f, a.y*f, a.z*f);
return r;
return make_float3(a.x*f, a.y*f, a.z*f);
}
__device_inline float3 operator*(float f, const float3 a)
{
float3 r = make_float3(a.x*f, a.y*f, a.z*f);
return r;
return make_float3(a.x*f, a.y*f, a.z*f);
}
__device_inline float3 operator/(float f, const float3 a)
{
float3 r = make_float3(f/a.x, f/a.y, f/a.z);
return r;
return make_float3(f/a.x, f/a.y, f/a.z);
}
__device_inline float3 operator/(const float3 a, float f)
{
float invf = 1.0f/f;
float3 r = make_float3(a.x*invf, a.y*invf, a.z*invf);
return r;
return make_float3(a.x*invf, a.y*invf, a.z*invf);
}
__device_inline float3 operator/(const float3 a, const float3 b)
{
float3 r = make_float3(a.x/b.x, a.y/b.y, a.z/b.z);
return r;
return make_float3(a.x/b.x, a.y/b.y, a.z/b.z);
}
__device_inline float3 operator+(const float3 a, const float3 b)
{
float3 r = make_float3(a.x+b.x, a.y+b.y, a.z+b.z);
return r;
return make_float3(a.x+b.x, a.y+b.y, a.z+b.z);
}
__device_inline float3 operator-(const float3 a, const float3 b)
{
float3 r = make_float3(a.x-b.x, a.y-b.y, a.z-b.z);
return r;
return make_float3(a.x-b.x, a.y-b.y, a.z-b.z);
}
__device_inline float3 operator+=(float3& a, const float3 b)
{
a.x += b.x;
a.y += b.y;
a.z += b.z;
return a;
return a = a + b;
}
__device_inline float3 operator*=(float3& a, const float3 b)
{
a.x *= b.x;
a.y *= b.y;
a.z *= b.z;
return a;
return a = a * b;
}
__device_inline float3 operator*=(float3& a, float f)
{
a.x *= f;
a.y *= f;
a.z *= f;
return a;
return a = a * f;
}
__device_inline float3 operator/=(float3& a, const float3 b)
{
a.x /= b.x;
a.y /= b.y;
a.z /= b.z;
return a;
return a = a / b;
}
__device_inline float3 operator/=(float3& a, float f)
{
float invf = 1.0f/f;
a.x *= invf;
a.y *= invf;
a.z *= invf;
return a;
return a = a * invf;
}
__device_inline float dot(const float3 a, const float3 b)
@ -506,7 +451,11 @@ __device_inline float3 normalize_len(const float3 a, float *t)
__device_inline bool operator==(const float3 a, const float3 b)
{
#ifdef __KERNEL_SSE__
return (_mm_movemask_ps(_mm_cmpeq_ps(a.m128, b.m128)) & 7) == 7;
#else
return (a.x == b.x && a.y == b.y && a.z == b.z);
#endif
}
__device_inline bool operator!=(const float3 a, const float3 b)
@ -516,14 +465,20 @@ __device_inline bool operator!=(const float3 a, const float3 b)
__device_inline float3 min(float3 a, float3 b)
{
float3 r = make_float3(min(a.x, b.x), min(a.y, b.y), min(a.z, b.z));
return r;
#ifdef __KERNEL_SSE__
return _mm_min_ps(a.m128, b.m128);
#else
return make_float3(min(a.x, b.x), min(a.y, b.y), min(a.z, b.z));
#endif
}
__device_inline float3 max(float3 a, float3 b)
{
float3 r = make_float3(max(a.x, b.x), max(a.y, b.y), max(a.z, b.z));
return r;
#ifdef __KERNEL_SSE__
return _mm_max_ps(a.m128, b.m128);
#else
return make_float3(max(a.x, b.x), max(a.y, b.y), max(a.z, b.z));
#endif
}
__device_inline float3 clamp(float3 a, float3 mn, float3 mx)
@ -533,7 +488,12 @@ __device_inline float3 clamp(float3 a, float3 mn, float3 mx)
__device_inline float3 fabs(float3 a)
{
#ifdef __KERNEL_SSE__
__m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
return _mm_and_ps(a.m128, mask);
#else
return make_float3(fabsf(a.x), fabsf(a.y), fabsf(a.z));
#endif
}
#endif
@ -555,6 +515,25 @@ __device_inline void print_float3(const char *label, const float3& a)
printf("%s: %.8f %.8f %.8f\n", label, a.x, a.y, a.z);
}
__device_inline float reduce_add(const float3& a)
{
#ifdef __KERNEL_SSE__
return (a.x + a.y + a.z);
#else
return (a.x + a.y + a.z);
#endif
}
__device_inline float3 rcp(const float3& a)
{
#ifdef __KERNEL_SSE__
float4 r = _mm_rcp_ps(a.m128);
return _mm_sub_ps(_mm_add_ps(r, r), _mm_mul_ps(_mm_mul_ps(r, r), a));
#else
return make_float3(1.0f/a.x, 1.0f/a.y, 1.0f/a.z);
#endif
}
#endif
__device_inline float3 interp(float3 a, float3 b, float t)
@ -562,122 +541,258 @@ __device_inline float3 interp(float3 a, float3 b, float t)
return a + t*(b - a);
}
__device_inline bool is_zero(const float3 a)
{
#ifdef __KERNEL_SSE__
return a == make_float3(0.0f);
#else
return (a.x == 0.0f && a.y == 0.0f && a.z == 0.0f);
#endif
}
__device_inline float average(const float3 a)
{
return reduce_add(a)*(1.0f/3.0f);
}
/* Float4 Vector */
#ifdef __KERNEL_SSE__
template<size_t index_0, size_t index_1, size_t index_2, size_t index_3> __forceinline const float4 shuffle(const float4& b)
{
return _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(b), _MM_SHUFFLE(index_3, index_2, index_1, index_0)));
}
template<> __forceinline const float4 shuffle<0, 0, 2, 2>(const float4& b)
{
return _mm_moveldup_ps(b);
}
template<> __forceinline const float4 shuffle<1, 1, 3, 3>(const float4& b)
{
return _mm_movehdup_ps(b);
}
template<> __forceinline const float4 shuffle<0, 1, 0, 1>(const float4& b)
{
return _mm_castpd_ps(_mm_movedup_pd(_mm_castps_pd(b)));
}
#endif
#ifndef __KERNEL_OPENCL__
__device_inline bool is_zero(const float4& a)
{
return (a.x == 0.0f && a.y == 0.0f && a.z == 0.0f && a.w == 0.0f);
}
__device_inline float average(const float4& a)
{
return (a.x + a.y + a.z + a.w)*(1.0f/4.0f);
}
__device_inline float4 operator-(const float4& a)
{
float4 r = {-a.x, -a.y, -a.z, -a.w};
return r;
#ifdef __KERNEL_SSE__
__m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x80000000));
return _mm_xor_ps(a.m128, mask);
#else
return make_float4(-a.x, -a.y, -a.z, -a.w);
#endif
}
__device_inline float4 operator*(const float4& a, const float4& b)
{
float4 r = {a.x*b.x, a.y*b.y, a.z*b.z, a.w*b.w};
return r;
#ifdef __KERNEL_SSE__
return _mm_mul_ps(a.m128, b.m128);
#else
return make_float4(a.x*b.x, a.y*b.y, a.z*b.z, a.w*b.w);
#endif
}
__device_inline float4 operator*(const float4& a, float f)
{
float4 r = {a.x*f, a.y*f, a.z*f, a.w*f};
return r;
#ifdef __KERNEL_SSE__
return a * make_float4(f);
#else
return make_float4(a.x*f, a.y*f, a.z*f, a.w*f);
#endif
}
__device_inline float4 operator*(float f, const float4& a)
{
float4 r = {a.x*f, a.y*f, a.z*f, a.w*f};
return r;
return a * f;
}
__device_inline float4 rcp(const float4& a)
{
#ifdef __KERNEL_SSE__
float4 r = _mm_rcp_ps(a.m128);
return _mm_sub_ps(_mm_add_ps(r, r), _mm_mul_ps(_mm_mul_ps(r, r), a));
#else
return make_float4(1.0f/a.x, 1.0f/a.y, 1.0f/a.z, 1.0f/a.w);
#endif
}
__device_inline float4 operator/(const float4& a, float f)
{
float invf = 1.0f/f;
float4 r = {a.x*invf, a.y*invf, a.z*invf, a.w*invf};
return r;
return a * (1.0f/f);
}
__device_inline float4 operator/(const float4& a, const float4& b)
{
float4 r = {a.x/b.x, a.y/b.y, a.z/b.z, a.w/b.w};
return r;
#ifdef __KERNEL_SSE__
return a * rcp(b);
#else
return make_float4(a.x/b.x, a.y/b.y, a.z/b.z, a.w/b.w);
#endif
}
__device_inline float4 operator+(const float4& a, const float4& b)
{
float4 r = {a.x+b.x, a.y+b.y, a.z+b.z, a.w+b.w};
return r;
#ifdef __KERNEL_SSE__
return _mm_add_ps(a.m128, b.m128);
#else
return make_float4(a.x+b.x, a.y+b.y, a.z+b.z, a.w+b.w);
#endif
}
__device_inline float4 operator-(const float4& a, const float4& b)
{
float4 r = {a.x-b.x, a.y-b.y, a.z-b.z, a.w-b.w};
return r;
#ifdef __KERNEL_SSE__
return _mm_sub_ps(a.m128, b.m128);
#else
return make_float4(a.x-b.x, a.y-b.y, a.z-b.z, a.w-b.w);
#endif
}
__device_inline float4 operator+=(float4& a, const float4& b)
{
a.x += b.x;
a.y += b.y;
a.z += b.z;
a.w += b.w;
return a;
return a = a + b;
}
__device_inline float4 operator*=(float4& a, const float4& b)
{
a.x *= b.x;
a.y *= b.y;
a.z *= b.z;
a.w *= b.w;
return a;
return a = a * b;
}
__device_inline float4 operator/=(float4& a, float f)
{
float invf = 1.0f/f;
a.x *= invf;
a.y *= invf;
a.z *= invf;
a.w *= invf;
return a;
return a = a / f;
}
__device_inline float dot(const float4& a, const float4& b)
__device_inline int4 operator<(const float4& a, const float4& b)
{
return a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
#ifdef __KERNEL_SSE__
return _mm_cvtps_epi32(_mm_cmplt_ps(a.m128, b.m128)); /* todo: avoid cvt */
#else
return make_int4(a.x < b.x, a.y < b.y, a.z < b.z, a.w < b.w);
#endif
}
__device_inline int4 operator>=(float4 a, float4 b)
{
#ifdef __KERNEL_SSE__
return _mm_cvtps_epi32(_mm_cmpge_ps(a.m128, b.m128)); /* todo: avoid cvt */
#else
return make_int4(a.x >= b.x, a.y >= b.y, a.z >= b.z, a.w >= b.w);
#endif
}
__device_inline int4 operator<=(const float4& a, const float4& b)
{
#ifdef __KERNEL_SSE__
return _mm_cvtps_epi32(_mm_cmple_ps(a.m128, b.m128)); /* todo: avoid cvt */
#else
return make_int4(a.x <= b.x, a.y <= b.y, a.z <= b.z, a.w <= b.w);
#endif
}
__device_inline bool operator==(const float4 a, const float4 b)
{
#ifdef __KERNEL_SSE__
return (_mm_movemask_ps(_mm_cmpeq_ps(a.m128, b.m128)) & 15) == 15;
#else
return (a.x == b.x && a.y == b.y && a.z == b.z && a.w == b.w);
#endif
}
__device_inline float4 cross(const float4& a, const float4& b)
{
float4 r = {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x, 0.0f};
return r;
#ifdef __KERNEL_SSE__
return (shuffle<1,2,0,0>(a)*shuffle<2,0,1,0>(b)) - (shuffle<2,0,1,0>(a)*shuffle<1,2,0,0>(b));
#else
return make_float4(a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x, 0.0f);
#endif
}
__device_inline float4 min(float4 a, float4 b)
{
#ifdef __KERNEL_SSE__
return _mm_min_ps(a.m128, b.m128);
#else
return make_float4(min(a.x, b.x), min(a.y, b.y), min(a.z, b.z), min(a.w, b.w));
#endif
}
__device_inline float4 max(float4 a, float4 b)
{
#ifdef __KERNEL_SSE__
return _mm_max_ps(a.m128, b.m128);
#else
return make_float4(max(a.x, b.x), max(a.y, b.y), max(a.z, b.z), max(a.w, b.w));
#endif
}
#endif
#ifndef __KERNEL_GPU__
__device_inline float4 select(const int4& mask, const float4& a, const float4& b)
{
#ifdef __KERNEL_SSE__
/* blendv is sse4, and apparently broken on vs2008 */
return _mm_or_ps(_mm_and_ps(_mm_cvtepi32_ps(mask), a), _mm_andnot_ps(_mm_cvtepi32_ps(mask), b)); /* todo: avoid cvt */
#else
return make_float4((mask.x)? a.x: b.x, (mask.y)? a.y: b.y, (mask.z)? a.z: b.z, (mask.w)? a.w: b.w);
#endif
}
__device_inline float4 reduce_min(const float4& a)
{
#ifdef __KERNEL_SSE__
float4 h = min(shuffle<1,0,3,2>(a), a);
return min(shuffle<2,3,0,1>(h), h);
#else
return make_float4(min(min(a.x, a.y), min(a.z, a.w)));
#endif
}
__device_inline float4 reduce_max(const float4& a)
{
#ifdef __KERNEL_SSE__
float4 h = max(shuffle<1,0,3,2>(a), a);
return max(shuffle<2,3,0,1>(h), h);
#else
return make_float4(max(max(a.x, a.y), max(a.z, a.w)));
#endif
}
#if 0
__device_inline float4 reduce_add(const float4& a)
{
#ifdef __KERNEL_SSE__
float4 h = shuffle<1,0,3,2>(a) + a;
return shuffle<2,3,0,1>(h) + h;
#else
return make_float4((a.x + a.y) + (a.z + a.w));
#endif
}
#endif
__device_inline float reduce_add(const float4& a)
{
#ifdef __KERNEL_SSE__
float4 h = shuffle<1,0,3,2>(a) + a;
return _mm_cvtss_f32(shuffle<2,3,0,1>(h) + h); /* todo: efficiency? */
#else
return ((a.x + a.y) + (a.z + a.w));
#endif
}
__device_inline void print_float4(const char *label, const float4& a)
{
printf("%s: %.8f %.8f %.8f %.8f\n", label, a.x, a.y, a.z, a.w);
@ -685,26 +800,67 @@ __device_inline void print_float4(const char *label, const float4& a)
#endif
#ifndef __KERNEL_OPENCL__
__device_inline bool is_zero(const float4& a)
{
#ifdef __KERNEL_SSE__
return a == make_float4(0.0f);
#else
return (a.x == 0.0f && a.y == 0.0f && a.z == 0.0f && a.w == 0.0f);
#endif
}
__device_inline float average(const float4& a)
{
return reduce_add(a) * 0.25f;
}
__device_inline float dot(const float4& a, const float4& b)
{
return reduce_add(a * b);
}
#endif
/* Int3 */
#ifndef __KERNEL_OPENCL__
__device_inline int3 min(int3 a, int3 b)
{
#ifdef __KERNEL_SSE__
return _mm_min_epi32(a.m128, b.m128);
#else
return make_int3(min(a.x, b.x), min(a.y, b.y), min(a.z, b.z));
#endif
}
__device_inline int3 max(int3 a, int3 b)
{
int3 r = {max(a.x, b.x), max(a.y, b.y), max(a.z, b.z)};
return r;
#ifdef __KERNEL_SSE__
return _mm_max_epi32(a.m128, b.m128);
#else
return make_int3(max(a.x, b.x), max(a.y, b.y), max(a.z, b.z));
#endif
}
__device_inline int3 clamp(const int3& a, int mn, int mx)
{
int3 r = {clamp(a.x, mn, mx), clamp(a.y, mn, mx), clamp(a.z, mn, mx)};
return r;
#ifdef __KERNEL_SSE__
return min(max(a, make_int3(mn)), make_int3(mx));
#else
return make_int3(clamp(a.x, mn, mx), clamp(a.y, mn, mx), clamp(a.z, mn, mx));
#endif
}
__device_inline int3 clamp(const int3& a, int3& mn, int mx)
{
int3 r = {clamp(a.x, mn.x, mx), clamp(a.y, mn.y, mx), clamp(a.z, mn.z, mx)};
return r;
#ifdef __KERNEL_SSE__
return min(max(a, mn), make_int3(mx));
#else
return make_int3(clamp(a.x, mn.x, mx), clamp(a.y, mn.y, mx), clamp(a.z, mn.z, mx));
#endif
}
#endif
@ -720,16 +876,63 @@ __device_inline void print_int3(const char *label, const int3& a)
/* Int4 */
#ifndef __KERNEL_OPENCL__
#ifndef __KERNEL_GPU__
__device_inline int4 operator>=(float4 a, float4 b)
__device_inline int4 operator+(const int4& a, const int4& b)
{
return make_int4(a.x >= b.x, a.y >= b.y, a.z >= b.z, a.w >= b.w);
#ifdef __KERNEL_SSE__
return _mm_add_epi32(a.m128, b.m128);
#else
return make_int4(a.x+b.x, a.y+b.y, a.z+b.z, a.w+b.w);
#endif
}
#endif
__device_inline int4 operator+=(int4& a, const int4& b)
{
return a = a + b;
}
#ifndef __KERNEL_GPU__
__device_inline int4 operator>>(const int4& a, int i)
{
#ifdef __KERNEL_SSE__
return _mm_srai_epi32(a.m128, i);
#else
return make_int4(a.x >> i, a.y >> i, a.z >> i, a.w >> i);
#endif
}
__device_inline int4 min(int4 a, int4 b)
{
#ifdef __KERNEL_SSE__
return _mm_min_epi32(a.m128, b.m128);
#else
return make_int4(min(a.x, b.x), min(a.y, b.y), min(a.z, b.z), min(a.w, b.w));
#endif
}
__device_inline int4 max(int4 a, int4 b)
{
#ifdef __KERNEL_SSE__
return _mm_max_epi32(a.m128, b.m128);
#else
return make_int4(max(a.x, b.x), max(a.y, b.y), max(a.z, b.z), max(a.w, b.w));
#endif
}
__device_inline int4 clamp(const int4& a, const int4& mn, const int4& mx)
{
return min(max(a, mn), mx);
}
__device_inline int4 select(const int4& mask, const int4& a, const int4& b)
{
#ifdef __KERNEL_SSE__
__m128 m = _mm_cvtepi32_ps(mask);
return _mm_castps_si128(_mm_or_ps(_mm_and_ps(m, _mm_castsi128_ps(a)), _mm_andnot_ps(m, _mm_castsi128_ps(b)))); /* todo: avoid cvt */
#else
return make_int4((mask.x)? a.x: b.x, (mask.y)? a.y: b.y, (mask.z)? a.z: b.z, (mask.w)? a.w: b.w);
#endif
}
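The SSE path of select() converts the integer mask to float just to reuse and/andnot (hence the todo about avoiding the cvt). A scalar sketch mirroring the non-SSE branch makes the intended semantics explicit; `int4s` and `select_ref` are names invented here for illustration, not Cycles code:

```cpp
#include <cassert>

struct int4s { int x, y, z, w; }; /* scalar stand-in for the int4 above */

/* Scalar reference for select(): where a mask component is non-zero the
 * result takes that component from a, otherwise from b. This mirrors the
 * non-SSE branch exactly. */
static int4s select_ref(const int4s& mask, const int4s& a, const int4s& b)
{
	int4s r;
	r.x = mask.x ? a.x : b.x;
	r.y = mask.y ? a.y : b.y;
	r.z = mask.z ? a.z : b.z;
	r.w = mask.w ? a.w : b.w;
	return r;
}
```

On SSE 4.1 hardware the conversion could plausibly be avoided with a blend instruction such as `_mm_blendv_epi8`, though that is a suggestion, not what this commit does.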
__device_inline void print_int4(const char *label, const int4& a)
{


@ -0,0 +1,223 @@
/*
* Copyright 2011, Blender Foundation.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software Foundation,
* Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
*/
#include "util_debug.h"
#include "util_foreach.h"
#include "util_system.h"
#include "util_task.h"
CCL_NAMESPACE_BEGIN
/* Task Pool */
TaskPool::TaskPool(const TaskRunFunction& run_)
{
num = 0;
num_done = 0;
do_cancel = false;
run = run_;
}
TaskPool::~TaskPool()
{
stop();
}
void TaskPool::push(Task *task, bool front)
{
TaskScheduler::Entry entry;
entry.task = task;
entry.pool = this;
TaskScheduler::push(entry, front);
}
void TaskPool::wait()
{
thread_scoped_lock lock(done_mutex);
while(num_done != num)
done_cond.wait(lock);
}
void TaskPool::cancel()
{
TaskScheduler::clear(this);
do_cancel = true;
wait();
do_cancel = false;
}
void TaskPool::stop()
{
TaskScheduler::clear(this);
assert(num_done == num);
}
bool TaskPool::cancelled()
{
return do_cancel;
}
void TaskPool::done_increase(int done)
{
done_mutex.lock();
num_done += done;
done_mutex.unlock();
assert(num_done <= num);
done_cond.notify_all();
}
/* Task Scheduler */
thread_mutex TaskScheduler::mutex;
int TaskScheduler::users = 0;
vector<thread*> TaskScheduler::threads;
volatile bool TaskScheduler::do_exit = false;
list<TaskScheduler::Entry> TaskScheduler::queue;
thread_mutex TaskScheduler::queue_mutex;
thread_condition_variable TaskScheduler::queue_cond;
void TaskScheduler::init(int num_threads)
{
thread_scoped_lock lock(mutex);
/* multiple Cycles instances can use this task scheduler, sharing the same
threads, so we keep track of the number of users. */
if(users == 0) {
do_exit = false;
/* launch threads that will be waiting for work */
if(num_threads == 0)
num_threads = system_cpu_thread_count();
threads.resize(num_threads);
for(size_t i = 0; i < threads.size(); i++)
threads[i] = new thread(function_bind(&TaskScheduler::thread_run, i));
}
users++;
}
void TaskScheduler::exit()
{
thread_scoped_lock lock(mutex);
users--;
if(users == 0) {
/* stop all waiting threads */
do_exit = true;
TaskScheduler::queue_cond.notify_all();
/* delete threads */
foreach(thread *t, threads) {
t->join();
delete t;
}
threads.clear();
}
}
bool TaskScheduler::thread_wait_pop(Entry& entry)
{
thread_scoped_lock lock(queue_mutex);
while(queue.empty() && !do_exit)
queue_cond.wait(lock);
if(queue.empty()) {
assert(do_exit);
return false;
}
entry = queue.front();
queue.pop_front();
return true;
}
void TaskScheduler::thread_run(int thread_id)
{
Entry entry;
/* todo: test affinity/denormal mask */
/* keep popping off tasks */
while(thread_wait_pop(entry)) {
/* run task */
entry.pool->run(entry.task, thread_id);
/* delete task */
delete entry.task;
/* notify pool task was done */
entry.pool->done_increase(1);
}
}
void TaskScheduler::push(Entry& entry, bool front)
{
/* add entry to queue */
TaskScheduler::queue_mutex.lock();
if(front)
TaskScheduler::queue.push_front(entry);
else
TaskScheduler::queue.push_back(entry);
entry.pool->num++;
TaskScheduler::queue_mutex.unlock();
TaskScheduler::queue_cond.notify_one();
}
void TaskScheduler::clear(TaskPool *pool)
{
thread_scoped_lock lock(TaskScheduler::queue_mutex);
/* erase all tasks from this pool from the queue */
list<TaskScheduler::Entry>::iterator it = TaskScheduler::queue.begin();
int done = 0;
while(it != TaskScheduler::queue.end()) {
TaskScheduler::Entry& entry = *it;
if(entry.pool == pool) {
done++;
delete entry.task;
it = TaskScheduler::queue.erase(it);
}
else
it++;
}
/* notify done */
pool->done_increase(done);
}
CCL_NAMESPACE_END
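The core of TaskScheduler::thread_wait_pop above is a standard condition-variable consumer loop: workers sleep until a task arrives or exit is requested, and keep draining remaining tasks even after exit is flagged. A standalone sketch of that pattern, using plain `std::thread`/`std::condition_variable` rather than Cycles' thread wrappers (`MiniScheduler` and the `int` task payload are invented here for illustration):

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>

/* Sketch of the wait-pop pattern used by TaskScheduler::thread_wait_pop:
 * workers block on a condition variable until a task is queued or exit is
 * requested; remaining queued tasks are still returned after exit(). */
struct MiniScheduler {
	std::mutex m;
	std::condition_variable cv;
	std::deque<int> queue; /* stand-in for the list of Entry */
	bool do_exit = false;

	void push(int task)
	{
		{
			std::lock_guard<std::mutex> lock(m);
			queue.push_back(task);
		}
		cv.notify_one();
	}

	bool wait_pop(int& task)
	{
		std::unique_lock<std::mutex> lock(m);
		/* equivalent to the while(queue.empty() && !do_exit) wait loop */
		cv.wait(lock, [&] { return !queue.empty() || do_exit; });
		if(queue.empty())
			return false; /* woken only to exit */
		task = queue.front();
		queue.pop_front();
		return true;
	}

	void exit()
	{
		{
			std::lock_guard<std::mutex> lock(m);
			do_exit = true;
		}
		cv.notify_all();
	}
};
```

A worker then runs `while(s.wait_pop(t)) { ... }`, matching the loop in thread_run() above; note the flag is set under the same mutex the waiters use, so no wakeup is lost.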


@ -0,0 +1,122 @@
/*
* Copyright 2011, Blender Foundation.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software Foundation,
* Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
*/
#ifndef __UTIL_TASK_H__
#define __UTIL_TASK_H__
#include "util_list.h"
#include "util_thread.h"
#include "util_vector.h"
CCL_NAMESPACE_BEGIN
class Task;
class TaskPool;
class TaskScheduler;
typedef boost::function<void(Task*,int)> TaskRunFunction;
/* Task
*
* Base class for tasks to be executed in threads. */
class Task
{
public:
Task() {};
virtual ~Task() {}
};
/* Task Pool
*
* Pool of tasks that will be executed by the central TaskScheduler. For each
* pool, we can wait for all tasks to be done, or cancel them before they are
* done.
*
* The run callback that actually executes the task may be created like this:
* function_bind(&MyClass::task_execute, this, _1, _2) */
class TaskPool
{
public:
TaskPool(const TaskRunFunction& run);
~TaskPool();
void push(Task *task, bool front = false);
void wait(); /* wait until all tasks are done */
void cancel(); /* cancel all tasks, keep worker threads running */
void stop(); /* stop all worker threads */
bool cancelled(); /* for worker threads, test if cancelled */
protected:
friend class TaskScheduler;
void done_increase(int done);
TaskRunFunction run;
thread_mutex done_mutex;
thread_condition_variable done_cond;
volatile int num, num_done;
volatile bool do_cancel;
};
/* Task Scheduler
*
* Central scheduler that holds running threads ready to execute tasks. A single
* queue holds the tasks from all pools. */
class TaskScheduler
{
public:
static void init(int num_threads = 0);
static void exit();
static int num_threads() { return threads.size(); }
protected:
friend class TaskPool;
struct Entry {
Task *task;
TaskPool *pool;
};
static thread_mutex mutex;
static int users;
static vector<thread*> threads;
static volatile bool do_exit;
static list<Entry> queue;
static thread_mutex queue_mutex;
static thread_condition_variable queue_cond;
static void thread_run(int thread_id);
static bool thread_wait_pop(Entry& entry);
static void push(Entry& entry, bool front);
static void clear(TaskPool *pool);
};
CCL_NAMESPACE_END
#endif
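The header comment above shows the run callback being built with `function_bind(&MyClass::task_execute, this, _1, _2)`. The wiring can be sketched with `std::function`/`std::bind` standing in for the boost equivalents; `MiniTask`, `MiniPool`, and `Summer` are invented names, and no threading is involved here, only the callback plumbing:

```cpp
#include <cassert>
#include <functional>

class MiniTask {
public:
	int value;
	explicit MiniTask(int v) : value(v) {}
};

/* stand-in for TaskRunFunction: called with the task and a thread id */
typedef std::function<void(MiniTask*, int)> MiniRunFunction;

class MiniPool {
public:
	explicit MiniPool(const MiniRunFunction& run_) : run(run_) {}

	/* a real pool queues for worker threads; this sketch runs synchronously,
	 * then deletes the task like TaskScheduler::thread_run does */
	void push(MiniTask *task, int thread_id)
	{
		run(task, thread_id);
		delete task;
	}

protected:
	MiniRunFunction run;
};

class Summer {
public:
	int total = 0;
	void task_execute(MiniTask *task, int /*thread_id*/) { total += task->value; }
};
```

Binding a member function this way lets any class feed its own method to a generic pool without the pool knowing the class.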


@ -69,133 +69,6 @@ protected:
bool joined;
};
/* Thread Safe Queue to pass tasks from one thread to another. Tasks should be
* pushed into the queue, while the worker thread waits to pop the next task
* off the queue. Once all tasks have been pushed into the queue, calling stop() will stop
* the worker threads from waiting for more tasks once all tasks are done. */
template<typename T> class ThreadQueue
{
public:
ThreadQueue()
{
tot = 0;
tot_done = 0;
do_stop = false;
do_cancel = false;
}
/* Main thread functions */
/* push a task to be executed */
void push(const T& value)
{
thread_scoped_lock lock(queue_mutex);
queue.push(value);
tot++;
lock.unlock();
queue_cond.notify_one();
}
/* wait until all tasks are done */
void wait_done()
{
thread_scoped_lock lock(done_mutex);
while(tot_done != tot)
done_cond.wait(lock);
}
/* stop all worker threads */
void stop()
{
clear();
do_stop = true;
queue_cond.notify_all();
}
/* cancel all tasks, but keep worker threads running */
void cancel()
{
clear();
do_cancel = true;
wait_done();
do_cancel = false;
}
/* Worker thread functions
*
* while(queue.worker_wait_pop(task)) {
* for(..) {
* ... do work ...
*
* if(queue.worker_cancel())
* break;
* }
*
* queue.worker_done();
* }
*/
bool worker_wait_pop(T& value)
{
thread_scoped_lock lock(queue_mutex);
while(queue.empty() && !do_stop)
queue_cond.wait(lock);
if(queue.empty())
return false;
value = queue.front();
queue.pop();
return true;
}
void worker_done()
{
thread_scoped_lock lock(done_mutex);
tot_done++;
lock.unlock();
assert(tot_done <= tot);
done_cond.notify_all();
}
bool worker_cancel()
{
return do_cancel;
}
protected:
void clear()
{
thread_scoped_lock lock(queue_mutex);
while(!queue.empty()) {
thread_scoped_lock done_lock(done_mutex);
tot_done++;
done_lock.unlock();
queue.pop();
}
done_cond.notify_all();
}
std::queue<T> queue;
thread_mutex queue_mutex;
thread_mutex done_mutex;
thread_condition_variable queue_cond;
thread_condition_variable done_cond;
volatile bool do_stop;
volatile bool do_cancel;
volatile int tot, tot_done;
};
/* Thread Local Storage
*
* Boost implementation is a bit slow, and Mac OS X __thread is not supported


@ -129,23 +129,26 @@ static bool transform_matrix4_gj_inverse(float R[][4], float M[][4])
Transform transform_inverse(const Transform& tfm)
{
Transform tfmR = transform_identity();
float M[4][4], R[4][4];
memcpy(R, &tfmR, sizeof(R));
memcpy(M, &tfm, sizeof(M));
if(!transform_matrix4_gj_inverse(R, M)) {
/* matrix is degenerate (e.g. 0 scale on some axis), ideally we should
never be in this situation, but try to invert it anyway with tweak */
M[0][0] += 1e-8f;
M[1][1] += 1e-8f;
M[2][2] += 1e-8f;
if(!transform_matrix4_gj_inverse(R, M))
return transform_identity();
}
memcpy(&tfmR, R, sizeof(R));
return tfmR;
}
CCL_NAMESPACE_END
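The degenerate-matrix fallback above, nudging the diagonal by 1e-8f before retrying, can be illustrated with a self-contained Gauss-Jordan inverse. `gj_inverse` here is a simplified stand-in for `transform_matrix4_gj_inverse` (no pivot row swapping), not the Cycles implementation:

```cpp
#include <cassert>
#include <cmath>
#include <cstring>

typedef float Mat4[4][4];

/* Simplified Gauss-Jordan inverse: reduce M to identity while applying the
 * same row operations to R, which starts as identity. Both arrays are
 * modified in place (array parameters decay to pointers). Fails on a zero
 * pivot, which is what a zero scale on some axis produces. */
static bool gj_inverse(Mat4 R, Mat4 M)
{
	for(int i = 0; i < 4; i++)
		for(int j = 0; j < 4; j++)
			R[i][j] = (i == j)? 1.0f: 0.0f;

	for(int i = 0; i < 4; i++) {
		float pivot = M[i][i];
		if(pivot == 0.0f)
			return false; /* degenerate: no pivot on this row */

		/* scale pivot row to make the pivot 1 */
		for(int j = 0; j < 4; j++) {
			M[i][j] /= pivot;
			R[i][j] /= pivot;
		}

		/* eliminate this column from all other rows */
		for(int k = 0; k < 4; k++) {
			if(k == i)
				continue;
			float f = M[k][i];
			for(int j = 0; j < 4; j++) {
				M[k][j] -= f*M[i][j];
				R[k][j] -= f*R[i][j];
			}
		}
	}
	return true;
}
```

With this sketch, a zero-scale matrix fails to invert, and the same 1e-8f diagonal tweak used above makes it invertible again (with correspondingly huge values in the result).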


@ -36,23 +36,37 @@
#define __shared
#define __constant
#ifdef _WIN32
#define __device_inline static __forceinline
#define __align(...) __declspec(align(__VA_ARGS__))
#else
#define __device_inline static inline __attribute__((always_inline))
#define __forceinline inline __attribute__((always_inline))
#define __align(...) __attribute__((aligned(__VA_ARGS__)))
#endif
/* Bitness */
#if defined(__ppc64__) || defined(__PPC64__) || defined(__x86_64__) || defined(__ia64__) || defined(_M_X64)
#define __KERNEL_64_BIT__
#endif
/* SIMD Types */
#ifndef __KERNEL_GPU__
/* not enabled, globally applying it just gives slowdown,
* but useful for testing. */
//#define __KERNEL_SSE__
#ifdef __KERNEL_SSE__
#include <xmmintrin.h> /* SSE 1 */
#include <emmintrin.h> /* SSE 2 */
#include <pmmintrin.h> /* SSE 3 */
#include <tmmintrin.h> /* SSSE 3 */
#include <smmintrin.h> /* SSE 4 */
#endif
#endif
#ifndef _WIN32
#ifndef __KERNEL_GPU__
@ -97,6 +111,12 @@ typedef unsigned int uint32_t;
typedef long long int64_t;
typedef unsigned long long uint64_t;
#ifdef __KERNEL_64_BIT__
typedef int64_t ssize_t;
#else
typedef int32_t ssize_t;
#endif
#endif
/* Generic Memory Pointer */
@ -108,89 +128,137 @@ typedef uint64_t device_ptr;
struct uchar2 {
uchar x, y;
__forceinline uchar operator[](int i) const { return *(&x + i); }
__forceinline uchar& operator[](int i) { return *(&x + i); }
};
struct uchar3 {
uchar x, y, z;
__forceinline uchar operator[](int i) const { return *(&x + i); }
__forceinline uchar& operator[](int i) { return *(&x + i); }
};
struct uchar4 {
uchar x, y, z, w;
__forceinline uchar operator[](int i) const { return *(&x + i); }
__forceinline uchar& operator[](int i) { return *(&x + i); }
};
struct int2 {
int x, y;
__forceinline int operator[](int i) const { return *(&x + i); }
__forceinline int& operator[](int i) { return *(&x + i); }
};
#ifdef __KERNEL_SSE__
struct __align(16) int3 {
union {
__m128i m128;
struct { int x, y, z, w; };
};
__forceinline int3() {}
__forceinline int3(const __m128i a) : m128(a) {}
__forceinline operator const __m128i&(void) const { return m128; }
__forceinline operator __m128i&(void) { return m128; }
#else
struct int3 {
int x, y, z, w;
#endif
__forceinline int operator[](int i) const { return *(&x + i); }
__forceinline int& operator[](int i) { return *(&x + i); }
};
#ifdef __KERNEL_SSE__
struct __align(16) int4 {
union {
__m128i m128;
struct { int x, y, z, w; };
};
__forceinline int4() {}
__forceinline int4(const __m128i a) : m128(a) {}
__forceinline operator const __m128i&(void) const { return m128; }
__forceinline operator __m128i&(void) { return m128; }
#else
struct int4 {
int x, y, z, w;
#endif
__forceinline int operator[](int i) const { return *(&x + i); }
__forceinline int& operator[](int i) { return *(&x + i); }
};
struct uint2 {
uint x, y;
__forceinline uint operator[](uint i) const { return *(&x + i); }
__forceinline uint& operator[](uint i) { return *(&x + i); }
};
struct uint3 {
uint x, y, z;
__forceinline uint operator[](uint i) const { return *(&x + i); }
__forceinline uint& operator[](uint i) { return *(&x + i); }
};
struct uint4 {
uint x, y, z, w;
__forceinline uint operator[](uint i) const { return *(&x + i); }
__forceinline uint& operator[](uint i) { return *(&x + i); }
};
struct float2 {
float x, y;
__forceinline float operator[](int i) const { return *(&x + i); }
__forceinline float& operator[](int i) { return *(&x + i); }
};
#ifdef __KERNEL_SSE__
struct __align(16) float3 {
union {
__m128 m128;
struct { float x, y, z, w; };
};
__forceinline float3() {}
__forceinline float3(const __m128 a) : m128(a) {}
__forceinline operator const __m128&(void) const { return m128; }
__forceinline operator __m128&(void) { return m128; }
#else
struct float3 {
float x, y, z, w;
#endif
__forceinline float operator[](int i) const { return *(&x + i); }
__forceinline float& operator[](int i) { return *(&x + i); }
};
#ifdef __KERNEL_SSE__
struct __align(16) float4 {
union {
__m128 m128;
struct { float x, y, z, w; };
};
__forceinline float4() {}
__forceinline float4(const __m128 a) : m128(a) {}
__forceinline operator const __m128&(void) const { return m128; }
__forceinline operator __m128&(void) { return m128; }
#else
struct float4 {
float x, y, z, w;
#endif
__forceinline float operator[](int i) const { return *(&x + i); }
__forceinline float& operator[](int i) { return *(&x + i); }
};
#endif
@ -201,87 +269,179 @@ struct float4 {
*
* OpenCL does not support C++ class, so we use these instead. */
__device_inline uchar2 make_uchar2(uchar x, uchar y)
{
uchar2 a = {x, y};
return a;
}
__device_inline uchar3 make_uchar3(uchar x, uchar y, uchar z)
{
uchar3 a = {x, y, z};
return a;
}
__device_inline uchar4 make_uchar4(uchar x, uchar y, uchar z, uchar w)
{
uchar4 a = {x, y, z, w};
return a;
}
__device_inline int2 make_int2(int x, int y)
{
int2 a = {x, y};
return a;
}
__device_inline int3 make_int3(int x, int y, int z)
{
#ifdef __KERNEL_SSE__
int3 a;
a.m128 = _mm_set_epi32(0, z, y, x);
#else
int3 a = {x, y, z, 0};
#endif
return a;
}
__device_inline int4 make_int4(int x, int y, int z, int w)
{
#ifdef __KERNEL_SSE__
int4 a;
a.m128 = _mm_set_epi32(w, z, y, x);
#else
int4 a = {x, y, z, w};
#endif
return a;
}
__device_inline uint2 make_uint2(uint x, uint y)
{
uint2 a = {x, y};
return a;
}
__device_inline uint3 make_uint3(uint x, uint y, uint z)
{
uint3 a = {x, y, z};
return a;
}
__device_inline uint4 make_uint4(uint x, uint y, uint z, uint w)
{
uint4 a = {x, y, z, w};
return a;
}
__device_inline float2 make_float2(float x, float y)
{
float2 a = {x, y};
return a;
}
__device_inline float3 make_float3(float x, float y, float z)
{
#ifdef __KERNEL_SSE__
float3 a;
a.m128 = _mm_set_ps(0.0f, z, y, x);
#else
float3 a = {x, y, z, 0.0f};
#endif
return a;
}
__device_inline float4 make_float4(float x, float y, float z, float w)
{
#ifdef __KERNEL_SSE__
float4 a;
a.m128 = _mm_set_ps(w, z, y, x);
#else
float4 a = {x, y, z, w};
#endif
return a;
}
__device_inline int align_up(int offset, int alignment)
{
return (offset + alignment - 1) & ~(alignment - 1);
}
__device_inline int3 make_int3(int i)
{
#ifdef __KERNEL_SSE__
int3 a;
a.m128 = _mm_set1_epi32(i);
#else
int3 a = {i, i, i, i};
#endif
return a;
}
__device_inline int4 make_int4(int i)
{
#ifdef __KERNEL_SSE__
int4 a;
a.m128 = _mm_set1_epi32(i);
#else
int4 a = {i, i, i, i};
#endif
return a;
}
__device_inline float3 make_float3(float f)
{
#ifdef __KERNEL_SSE__
float3 a;
a.m128 = _mm_set1_ps(f);
#else
float3 a = {f, f, f, f};
#endif
return a;
}
__device_inline float4 make_float4(float f)
{
#ifdef __KERNEL_SSE__
float4 a;
a.m128 = _mm_set1_ps(f);
#else
float4 a = {f, f, f, f};
#endif
return a;
}
__device_inline float4 make_float4(const int4& i)
{
#ifdef __KERNEL_SSE__
float4 a;
a.m128 = _mm_cvtepi32_ps(i.m128);
#else
float4 a = {(float)i.x, (float)i.y, (float)i.z, (float)i.w};
#endif
return a;
}
__device_inline int4 make_int4(const float3& f)
{
#ifdef __KERNEL_SSE__
int4 a;
a.m128 = _mm_cvtps_epi32(f.m128);
#else
int4 a = {(int)f.x, (int)f.y, (int)f.z, (int)f.w};
#endif
return a;
}
#endif
CCL_NAMESPACE_END
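One detail worth noting about the align_up() helper in this file: the mask `~(alignment - 1)` only clears the low bits correctly when alignment is a power of two. A standalone check (`align_up_ref` is a copy made here for testing, not a Cycles function):

```cpp
#include <cassert>

/* Standalone copy of the align_up() bit trick: round offset up to the next
 * multiple of alignment. Requires alignment to be a power of two, since
 * ~(alignment - 1) is only a contiguous low-bit mask in that case. */
static int align_up_ref(int offset, int alignment)
{
	return (offset + alignment - 1) & ~(alignment - 1);
}
```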


@ -2041,12 +2041,13 @@ void node_geometry(vec3 I, vec3 N, mat4 toworld,
backfacing = 0.0;
}
void node_tex_coord(vec3 I, vec3 N, mat4 viewinvmat, mat4 obinvmat,
vec3 attr_orco, vec3 attr_uv,
out vec3 generated, out vec3 normal, out vec3 uv, out vec3 object,
out vec3 camera, out vec3 window, out vec3 reflection)
{
generated = attr_orco;
normal = normalize((obinvmat*(viewinvmat*vec4(N, 0.0))).xyz);
uv = attr_uv;
object = I;
camera = I;

File diff suppressed because it is too large.


@ -176,14 +176,16 @@ typedef struct SceneRenderLayer {
struct Material *mat_override;
struct Group *light_override;
unsigned int lay; /* scene->lay itself has priority over this */
unsigned int lay_zmask; /* has to be after lay, this is for Z-masking */
unsigned int lay_exclude; /* not used by internal, exclude */
int layflag;
int passflag; /* pass_xor has to be after passflag */
int pass_xor;
int samples;
int pad;
} SceneRenderLayer;
/* srl->layflag */


@ -1905,6 +1905,19 @@ void rna_def_render_layer_common(StructRNA *srna, int scene)
if (scene) RNA_def_property_update(prop, NC_SCENE|ND_RENDER_OPTIONS, "rna_Scene_glsl_update");
else RNA_def_property_clear_flag(prop, PROP_EDITABLE);
prop = RNA_def_property(srna, "layers_exclude", PROP_BOOLEAN, PROP_LAYER);
RNA_def_property_boolean_sdna(prop, NULL, "lay_exclude", 1);
RNA_def_property_array(prop, 20);
RNA_def_property_ui_text(prop, "Exclude Layers", "Exclude scene layers from having any influence");
if (scene) RNA_def_property_update(prop, NC_SCENE|ND_RENDER_OPTIONS, "rna_Scene_glsl_update");
else RNA_def_property_clear_flag(prop, PROP_EDITABLE);
if(scene) {
prop = RNA_def_property(srna, "samples", PROP_INT, PROP_UNSIGNED);
RNA_def_property_ui_text(prop, "Samples", "Override number of render samples for this render layer, 0 will use the scene setting");
RNA_def_property_update(prop, NC_SCENE|ND_RENDER_OPTIONS, NULL);
}
/* layer options */
prop = RNA_def_property(srna, "use", PROP_BOOLEAN, PROP_NONE);
RNA_def_property_boolean_negative_sdna(prop, NULL, "layflag", SCE_LAY_DISABLE);


@ -82,7 +82,7 @@ typedef struct RenderLayer {
/* copy of RenderData */
char name[RE_MAXNAME];
unsigned int lay, lay_zmask, lay_exclude;
int layflag, passflag, pass_xor;
struct Material *mat_override;


@ -458,6 +458,7 @@ RenderResult *render_result_new(Render *re, rcti *partrct, int crop, int savebuf
BLI_strncpy(rl->name, srl->name, sizeof(rl->name));
rl->lay= srl->lay;
rl->lay_zmask= srl->lay_zmask;
rl->lay_exclude= srl->lay_exclude;
rl->layflag= srl->layflag;
rl->passflag= srl->passflag; // for debugging: srl->passflag|SCE_PASS_RAYHITS;
rl->pass_xor= srl->pass_xor;