tornavis/source/blender/draw/intern/draw_cache_extract_mesh.cc

/* SPDX-FileCopyrightText: 2017 Blender Authors
 *
 * SPDX-License-Identifier: GPL-2.0-or-later */

/** \file
 * \ingroup draw
 *
 * \brief Extraction of Mesh data into VBO to feed to GPU.
 */

#include "MEM_guardedalloc.h"
#include <optional>
#include "atomic_ops.h"
#include "DNA_mesh_types.h"
#include "DNA_scene_types.h"
#include "BLI_array.hh"
#include "BLI_math_bits.h"
#include "BLI_task.h"
#include "BLI_vector.hh"
#include "BKE_editmesh.hh"
#include "GPU_capabilities.h"
#include "draw_cache_extract.hh"
#include "draw_cache_inline.h"
#include "draw_subdivision.hh"
#include "mesh_extractors/extract_mesh.hh"
// #define DEBUG_TIME
#ifdef DEBUG_TIME
# include "PIL_time_utildefines.h"
#endif
namespace blender::draw {
/* ---------------------------------------------------------------------- */
/** \name Mesh Elements Extract Struct
 * \{ */
using TaskId = int;
using TaskLen = int;

struct ExtractorRunData {
  /* The extractor this run data belongs to. */
  const MeshExtract *extractor;
  /* During iteration, the VBO/IBO being built. */
  void *buffer = nullptr;
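  /* Byte offset of this extractor's private data inside the per-task data
   * block handed out by `extract_init`. */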
  uint32_t data_offset = 0;

  ExtractorRunData(const MeshExtract *extractor) : extractor(extractor) {}

#ifdef WITH_CXX_GUARDEDALLOC
  MEM_CXX_CLASS_ALLOC_FUNCS("DRAW:ExtractorRunData")
#endif
};

class ExtractorRunDatas : public Vector<ExtractorRunData> {
 public:
  void filter_into(ExtractorRunDatas &result, eMRIterType iter_type, const bool is_mesh) const
  {
    for (const ExtractorRunData &data : *this) {
      const MeshExtract *extractor = data.extractor;
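      /* The checks below rely on each `_mesh` callback member being laid out
       * directly after its `_bm` counterpart in `MeshExtract`: adding
       * `is_mesh` (0 or 1) to the address of the `_bm` member selects the
       * function pointer that matches the mesh type. */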
      if ((iter_type & MR_ITER_LOOPTRI) && *(&extractor->iter_looptri_bm + is_mesh)) {
        result.append(data);
        continue;
      }
      if ((iter_type & MR_ITER_POLY) && *(&extractor->iter_face_bm + is_mesh)) {
        result.append(data);
        continue;
      }
      if ((iter_type & MR_ITER_LOOSE_EDGE) && *(&extractor->iter_loose_edge_bm + is_mesh)) {
        result.append(data);
        continue;
      }
      if ((iter_type & MR_ITER_LOOSE_VERT) && *(&extractor->iter_loose_vert_bm + is_mesh)) {
        result.append(data);
        continue;
      }
    }
  }

  void filter_threaded_extractors_into(ExtractorRunDatas &result)
  {
    for (const ExtractorRunData &data : *this) {
      const MeshExtract *extractor = data.extractor;
      if (extractor->use_threading) {
        result.append(extractor);
      }
    }
  }

  eMRIterType iter_types() const
  {
    eMRIterType iter_type = static_cast<eMRIterType>(0);
    for (const ExtractorRunData &data : *this) {
      const MeshExtract *extractor = data.extractor;
      iter_type |= mesh_extract_iter_type(extractor);
    }
    return iter_type;
  }
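
  /* Number of distinct iteration passes required by the selected extractors. */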
  uint iter_types_len() const
  {
    const eMRIterType iter_type = iter_types();
    uint bits = uint(iter_type);
    return count_bits_i(bits);
  }

  eMRDataType data_types()
  {
    eMRDataType data_type = static_cast<eMRDataType>(0);
    for (const ExtractorRunData &data : *this) {
      const MeshExtract *extractor = data.extractor;
      data_type |= extractor->data_type;
    }
    return data_type;
  }

  size_t data_size_total()
  {
    size_t data_size = 0;
    for (const ExtractorRunData &data : *this) {
      const MeshExtract *extractor = data.extractor;
      data_size += extractor->data_size;
    }
    return data_size;
  }

#ifdef WITH_CXX_GUARDEDALLOC
  MEM_CXX_CLASS_ALLOC_FUNCS("DRAW:ExtractorRunDatas")
#endif
};

/** \} */
/* ---------------------------------------------------------------------- */
/** \name ExtractTaskData
 * \{ */
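
/* State for one extraction run: the extractors to execute, the render data
 * they read from and the buffer list they fill. Owns `extractors` (freed in
 * the destructor). */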
struct ExtractTaskData {
  const MeshRenderData *mr = nullptr;
  MeshBatchCache *cache = nullptr;
  ExtractorRunDatas *extractors = nullptr;
  MeshBufferList *mbuflist = nullptr;
  eMRIterType iter_type;
  bool use_threading = false;

  ExtractTaskData(const MeshRenderData &mr,
                  MeshBatchCache &cache,
                  ExtractorRunDatas *extractors,
                  MeshBufferList *mbuflist,
                  const bool use_threading)
      : mr(&mr),
        cache(&cache),
        extractors(extractors),
        mbuflist(mbuflist),
        use_threading(use_threading)
  {
    iter_type = extractors->iter_types();
  }

  ExtractTaskData(const ExtractTaskData &src) = default;

  ~ExtractTaskData()
  {
    delete extractors;
  }

#ifdef WITH_CXX_GUARDEDALLOC
  MEM_CXX_CLASS_ALLOC_FUNCS("DRW:ExtractTaskData")
#endif
};

static void extract_task_data_free(void *data)
{
  ExtractTaskData *task_data = static_cast<ExtractTaskData *>(data);
  delete task_data;
}

/** \} */
/* ---------------------------------------------------------------------- */
/** \name Extract Init and Finish
 * \{ */
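
/* Bind every extractor to its output VBO/IBO and hand each one a private
 * slice of `data_stack` for intermediate data; `data_offset` advances by each
 * extractor's `data_size`, so the slices never overlap. */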
BLI_INLINE void extract_init(const MeshRenderData &mr,
                             MeshBatchCache &cache,
                             ExtractorRunDatas &extractors,
                             MeshBufferList *mbuflist,
                             void *data_stack)
{
  uint32_t data_offset = 0;
  for (ExtractorRunData &run_data : extractors) {
    const MeshExtract *extractor = run_data.extractor;
    run_data.buffer = mesh_extract_buffer_get(extractor, mbuflist);
    run_data.data_offset = data_offset;
    extractor->init(mr, cache, run_data.buffer, POINTER_OFFSET(data_stack, data_offset));
    data_offset += uint32_t(extractor->data_size);
  }
}
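
/* Run the optional `finish` callback of each extractor so it can upload or
 * free the intermediate data it accumulated during iteration. */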
BLI_INLINE void extract_finish(const MeshRenderData &mr,
                               MeshBatchCache &cache,
                               const ExtractorRunDatas &extractors,
                               void *data_stack)
{
  for (const ExtractorRunData &run_data : extractors) {
    const MeshExtract *extractor = run_data.extractor;
    if (extractor->finish) {
      extractor->finish(
          mr, cache, run_data.buffer, POINTER_OFFSET(data_stack, run_data.data_offset));
    }
  }
}

/** \} */
/* ---------------------------------------------------------------------- */
/** \name Extract In Parallel Ranges
 * \{ */
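
/* Userdata for the `BLI_task_parallel_range` callbacks below: the extractors
 * to run plus the element arrays they iterate over. */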
struct ExtractorIterData {
  ExtractorRunDatas extractors;
  const MeshRenderData *mr = nullptr;
  const void *elems = nullptr;
  const int *loose_elems = nullptr;

#ifdef WITH_CXX_GUARDEDALLOC
  MEM_CXX_CLASS_ALLOC_FUNCS("DRW:MeshRenderDataUpdateTaskData")
#endif
};
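
/* Merge the thread-local data chunk of a finished range back into the
 * destination chunk through each extractor's optional `task_reduce` callback. */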
static void extract_task_reduce(const void *__restrict userdata,
                                void *__restrict chunk_to,
                                void *__restrict chunk_from)
{
  const ExtractorIterData *data = static_cast<const ExtractorIterData *>(userdata);
  for (const ExtractorRunData &run_data : data->extractors) {
    const MeshExtract *extractor = run_data.extractor;
    if (extractor->task_reduce) {
      extractor->task_reduce(POINTER_OFFSET(chunk_to, run_data.data_offset),
                             POINTER_OFFSET(chunk_from, run_data.data_offset));
    }
  }
}
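
/* Parallel-range iteration callbacks: each call processes one element and
 * dispatches it to every subscribed extractor, writing through the
 * thread-local chunk so no locking is needed. */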
static void extract_range_iter_looptri_bm(void *__restrict userdata,
                                          const int iter,
                                          const TaskParallelTLS *__restrict tls)
{
  const ExtractorIterData *data = static_cast<ExtractorIterData *>(userdata);
  void *extract_data = tls->userdata_chunk;
  const MeshRenderData &mr = *data->mr;
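  /* `elems` points at the BMesh loop-triangle array; take triangle `iter`. */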
  BMLoop **elt = ((BMLoop * (*)[3]) data->elems)[iter];
  for (const ExtractorRunData &run_data : data->extractors) {
    run_data.extractor->iter_looptri_bm(
        mr, elt, iter, POINTER_OFFSET(extract_data, run_data.data_offset));
  }
}
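
/* Callback for #BLI_task_parallel_range: runs every scheduled extractor on the
 * `iter`-th looptri of a #Mesh. Each extractor touches only its own slice of
 * the thread-local `userdata_chunk`, located at its `data_offset`. */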
static void extract_range_iter_looptri_mesh(void *__restrict userdata,
const int iter,
const TaskParallelTLS *__restrict tls)
{
void *extract_data = tls->userdata_chunk;
const ExtractorIterData *data = static_cast<ExtractorIterData *>(userdata);
const MeshRenderData &mr = *data->mr;
const MLoopTri *mlt = &((const MLoopTri *)data->elems)[iter];
for (const ExtractorRunData &run_data : data->extractors) {
run_data.extractor->iter_looptri_mesh(
mr, mlt, iter, POINTER_OFFSET(extract_data, run_data.data_offset));
}
}
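
/* Callback for #BLI_task_parallel_range: runs every scheduled extractor on the
 * `iter`-th #BMFace. */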
static void extract_range_iter_face_bm(void *__restrict userdata,
const int iter,
const TaskParallelTLS *__restrict tls)
{
void *extract_data = tls->userdata_chunk;
const ExtractorIterData *data = static_cast<ExtractorIterData *>(userdata);
const MeshRenderData &mr = *data->mr;
const BMFace *f = ((const BMFace **)data->elems)[iter];
for (const ExtractorRunData &run_data : data->extractors) {
run_data.extractor->iter_face_bm(
mr, f, iter, POINTER_OFFSET(extract_data, run_data.data_offset));
}
}
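
/* Callback for #BLI_task_parallel_range: runs every scheduled extractor on the
 * `iter`-th #Mesh face; only the face index is passed along. */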
static void extract_range_iter_face_mesh(void *__restrict userdata,
const int iter,
const TaskParallelTLS *__restrict tls)
{
void *extract_data = tls->userdata_chunk;
const ExtractorIterData *data = static_cast<ExtractorIterData *>(userdata);
const MeshRenderData &mr = *data->mr;
for (const ExtractorRunData &run_data : data->extractors) {
run_data.extractor->iter_face_mesh(
mr, iter, POINTER_OFFSET(extract_data, run_data.data_offset));
}
}
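
/* Callback for #BLI_task_parallel_range: maps `iter` through `loose_elems` to
 * the index of a loose edge, then runs every scheduled extractor on that #BMEdge. */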
static void extract_range_iter_loose_edge_bm(void *__restrict userdata,
const int iter,
const TaskParallelTLS *__restrict tls)
{
void *extract_data = tls->userdata_chunk;
const ExtractorIterData *data = static_cast<ExtractorIterData *>(userdata);
const MeshRenderData &mr = *data->mr;
const int loose_edge_i = data->loose_elems[iter];
const BMEdge *eed = ((const BMEdge **)data->elems)[loose_edge_i];
for (const ExtractorRunData &run_data : data->extractors) {
run_data.extractor->iter_loose_edge_bm(
mr, eed, iter, POINTER_OFFSET(extract_data, run_data.data_offset));
}
}
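
/* Callback for #BLI_task_parallel_range: runs every scheduled extractor on the
 * `iter`-th loose #Mesh edge, fetching its vertex indices from `elems`. */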
static void extract_range_iter_loose_edge_mesh(void *__restrict userdata,
const int iter,
const TaskParallelTLS *__restrict tls)
{
void *extract_data = tls->userdata_chunk;
const ExtractorIterData *data = static_cast<ExtractorIterData *>(userdata);
const MeshRenderData &mr = *data->mr;
const int loose_edge_i = data->loose_elems[iter];
const int2 edge = ((const int2 *)data->elems)[loose_edge_i];
for (const ExtractorRunData &run_data : data->extractors) {
run_data.extractor->iter_loose_edge_mesh(
mr, edge, iter, POINTER_OFFSET(extract_data, run_data.data_offset));
}
}
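
/* Callback for #BLI_task_parallel_range: runs every scheduled extractor on the
 * `iter`-th loose #BMVert. */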
static void extract_range_iter_loose_vert_bm(void *__restrict userdata,
const int iter,
const TaskParallelTLS *__restrict tls)
{
void *extract_data = tls->userdata_chunk;
const ExtractorIterData *data = static_cast<ExtractorIterData *>(userdata);
const MeshRenderData &mr = *data->mr;
const int loose_vert_i = data->loose_elems[iter];
const BMVert *eve = ((const BMVert **)data->elems)[loose_vert_i];
for (const ExtractorRunData &run_data : data->extractors) {
run_data.extractor->iter_loose_vert_bm(
mr, eve, iter, POINTER_OFFSET(extract_data, run_data.data_offset));
}
}
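
/* Callback for #BLI_task_parallel_range: runs every scheduled extractor on the
 * `iter`-th loose #Mesh vertex; only the vertex index is passed along. */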
static void extract_range_iter_loose_vert_mesh(void *__restrict userdata,
const int iter,
const TaskParallelTLS *__restrict tls)
{
void *extract_data = tls->userdata_chunk;
const ExtractorIterData *data = static_cast<ExtractorIterData *>(userdata);
const MeshRenderData &mr = *data->mr;
for (const ExtractorRunData &run_data : data->extractors) {
run_data.extractor->iter_loose_vert_mesh(
mr, iter, POINTER_OFFSET(extract_data, run_data.data_offset));
}
}
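/**
 * Run one iteration type through #BLI_task_parallel_range: pick the element array and
 * per-range callback matching \a iter_type and \a is_mesh, then iterate `[0, stop)`.
 */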
BLI_INLINE void extract_task_range_run_iter(const MeshRenderData &mr,
ExtractorRunDatas *extractors,
const eMRIterType iter_type,
bool is_mesh,
TaskParallelSettings *settings)
{
ExtractorIterData range_data;
range_data.mr = &mr;
TaskParallelRangeFunc func;
int stop;
switch (iter_type) {
case MR_ITER_LOOPTRI:
range_data.elems = is_mesh ? mr.looptris.data() : (void *)mr.edit_bmesh->looptris;
func = is_mesh ? extract_range_iter_looptri_mesh : extract_range_iter_looptri_bm;
stop = mr.tri_len;
break;
case MR_ITER_POLY:
range_data.elems = is_mesh ? mr.faces.data().data() : (void *)mr.bm->ftable;
func = is_mesh ? extract_range_iter_face_mesh : extract_range_iter_face_bm;
stop = mr.face_len;
break;
case MR_ITER_LOOSE_EDGE:
range_data.loose_elems = mr.loose_edges.data();
range_data.elems = is_mesh ? mr.edges.data() : (void *)mr.bm->etable;
func = is_mesh ? extract_range_iter_loose_edge_mesh : extract_range_iter_loose_edge_bm;
stop = mr.edge_loose_len;
break;
case MR_ITER_LOOSE_VERT:
range_data.loose_elems = mr.loose_verts.data();
range_data.elems = is_mesh ? mr.vert_positions.data() : (void *)mr.bm->vtable;
func = is_mesh ? extract_range_iter_loose_vert_mesh : extract_range_iter_loose_vert_bm;
stop = mr.vert_loose_len;
break;
default:
BLI_assert(false);
return;
}
extractors->filter_into(range_data.extractors, iter_type, is_mesh);
BLI_task_parallel_range(0, stop, &range_data, func, settings);
}
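/**
 * Task-graph node callback: run every extractor stored in the task data over all of the
 * iteration types it requested, using one parallel range per iteration type.
 */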
static void extract_task_range_run(void *__restrict taskdata)
{
ExtractTaskData *data = (ExtractTaskData *)taskdata;
const eMRIterType iter_type = data->iter_type;
const bool is_mesh = data->mr->extract_type != MR_EXTRACT_BMESH;
size_t userdata_chunk_size = data->extractors->data_size_total();
void *userdata_chunk = MEM_callocN(userdata_chunk_size, __func__);
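/* Each worker thread operates on its own copy of this user-data chunk (TLS);
 * #extract_task_reduce merges the per-thread copies back together. */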
TaskParallelSettings settings;
BLI_parallel_range_settings_defaults(&settings);
settings.use_threading = data->use_threading;
settings.userdata_chunk = userdata_chunk;
settings.userdata_chunk_size = userdata_chunk_size;
settings.func_reduce = extract_task_reduce;
settings.min_iter_per_thread = MIN_RANGE_LEN;
extract_init(*data->mr, *data->cache, *data->extractors, data->mbuflist, userdata_chunk);
if (iter_type & MR_ITER_LOOPTRI) {
extract_task_range_run_iter(*data->mr, data->extractors, MR_ITER_LOOPTRI, is_mesh, &settings);
}
if (iter_type & MR_ITER_POLY) {
extract_task_range_run_iter(*data->mr, data->extractors, MR_ITER_POLY, is_mesh, &settings);
}
if (iter_type & MR_ITER_LOOSE_EDGE) {
extract_task_range_run_iter(
*data->mr, data->extractors, MR_ITER_LOOSE_EDGE, is_mesh, &settings);
}
if (iter_type & MR_ITER_LOOSE_VERT) {
extract_task_range_run_iter(
*data->mr, data->extractors, MR_ITER_LOOSE_VERT, is_mesh, &settings);
}
extract_finish(*data->mr, *data->cache, *data->extractors, userdata_chunk);
MEM_freeN(userdata_chunk);
}
/** \} */
/* ---------------------------------------------------------------------- */
/** \name Extract In Parallel Ranges
* \{ */
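/**
 * Wrap an #ExtractTaskData in a task-graph node running #extract_task_range_run;
 * the graph frees the task data through #extract_task_data_free.
 */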
static TaskNode *extract_task_node_create(TaskGraph *task_graph,
const MeshRenderData &mr,
MeshBatchCache &cache,
ExtractorRunDatas *extractors,
MeshBufferList *mbuflist,
const bool use_threading)
{
ExtractTaskData *taskdata = new ExtractTaskData(mr, cache, extractors, mbuflist, use_threading);
TaskNode *task_node = BLI_task_graph_node_create(
task_graph,
extract_task_range_run,
taskdata,
(TaskGraphNodeFreeFunction)extract_task_data_free);
return task_node;
}
/** \} */
/* ---------------------------------------------------------------------- */
/** \name Task Node - Update Mesh Render Data
* \{ */
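/** Task data for the render-data update node; owns the #MeshRenderData it updates. */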
struct MeshRenderDataUpdateTaskData {
MeshRenderData *mr = nullptr;
MeshBufferCache *cache = nullptr;
eMRIterType iter_type;
eMRDataType data_flag;
MeshRenderDataUpdateTaskData(MeshRenderData &mr,
MeshBufferCache &cache,
eMRIterType iter_type,
eMRDataType data_flag)
: mr(&mr), cache(&cache), iter_type(iter_type), data_flag(data_flag)
{
}
~MeshRenderDataUpdateTaskData()
{
mesh_render_data_free(mr);
}
#ifdef WITH_CXX_GUARDEDALLOC
MEM_CXX_CLASS_ALLOC_FUNCS("DRW:MeshRenderDataUpdateTaskData")
#endif
};
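/** Free function used by the task graph to destroy a #MeshRenderDataUpdateTaskData. */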
static void mesh_render_data_update_task_data_free(void *data)
{
MeshRenderDataUpdateTaskData *taskdata = static_cast<MeshRenderDataUpdateTaskData *>(data);
BLI_assert(taskdata);
delete taskdata;
}
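/**
 * Node callback that fills/converts the mesh data required by the requested
 * extractors: normals, looptris, loose geometry and sorted faces.
 */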
static void mesh_extract_render_data_node_exec(void *__restrict task_data)
{
MeshRenderDataUpdateTaskData *update_task_data = static_cast<MeshRenderDataUpdateTaskData *>(
task_data);
MeshRenderData &mr = *update_task_data->mr;
const eMRIterType iter_type = update_task_data->iter_type;
const eMRDataType data_flag = update_task_data->data_flag;
mesh_render_data_update_normals(mr, data_flag);
mesh_render_data_update_looptris(mr, iter_type, data_flag);
mesh_render_data_update_loose_geom(mr, *update_task_data->cache, iter_type, data_flag);
mesh_render_data_update_faces_sorted(mr, *update_task_data->cache, data_flag);
}
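/**
 * Create the first node of the extraction sub-graph; it updates the #MeshRenderData
 * before any extraction node runs.
 */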
static TaskNode *mesh_extract_render_data_node_create(TaskGraph *task_graph,
MeshRenderData &mr,
MeshBufferCache &cache,
const eMRIterType iter_type,
const eMRDataType data_flag)
{
MeshRenderDataUpdateTaskData *task_data = new MeshRenderDataUpdateTaskData(
mr, cache, iter_type, data_flag);
TaskNode *task_node = BLI_task_graph_node_create(
task_graph,
mesh_extract_render_data_node_exec,
task_data,
(TaskGraphNodeFreeFunction)mesh_render_data_update_task_data_free);
return task_node;
}
/** \} */
/* ---------------------------------------------------------------------- */
/** \name Extract Loop
* \{ */
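/**
 * Schedule extraction of all requested buffers in \a mbc onto \a task_graph.
 *
 * A minimal sketch of how a caller is expected to drive this (the real call sites
 * live elsewhere in the draw module and may differ in detail):
 *
 *   TaskGraph *task_graph = BLI_task_graph_create();
 *   mesh_buffer_cache_create_requested(task_graph, cache, mbc, ...);
 *   BLI_task_graph_work_and_wait(task_graph);
 *   BLI_task_graph_free(task_graph);
 */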
void mesh_buffer_cache_create_requested(TaskGraph *task_graph,
MeshBatchCache &cache,
MeshBufferCache &mbc,
Object *object,
Mesh *me,
const bool is_editmode,
const bool is_paint_mode,
const bool is_mode_active,
const float obmat[4][4],
const bool do_final,
const bool do_uvedit,
const Scene *scene,
const ToolSettings *ts,
const bool use_hide)
{
/* For each mesh where batches need to be updated, a sub-graph will be added to the task_graph.
 * This sub-graph starts with an extract_render_data_node. This fills/converts the required
 * data from Mesh.
 *
 * Small extractions and extractions that can't be multi-threaded are grouped in a single
 * `extract_single_threaded_task_node`.
 *
 * Other extractions will create a node for each loop range exceeding 8192 items. These nodes
 * are linked to the `user_data_init_task_node`. The `user_data_init_task_node` prepares the
 * user_data needed for the extraction based on the data extracted from the mesh.
 * Counters are used to check whether the finalize of a task has to be called.
*
* Mesh extraction sub graph
*
* +----------------------+
* +-----> | extract_task1_loop_1 |
* | +----------------------+
* +------------------+ +----------------------+ +----------------------+
* | mesh_render_data | --> | | --> | extract_task1_loop_2 |
* +------------------+ | | +----------------------+
* | | | +----------------------+
* | | user_data_init | --> | extract_task2_loop_1 |
* v | | +----------------------+
* +------------------+ | | +----------------------+
* | single_threaded | | | --> | extract_task2_loop_2 |
* +------------------+ +----------------------+ +----------------------+
* | +----------------------+
* +-----> | extract_task2_loop_3 |
* +----------------------+
*/
const bool do_hq_normals = (scene->r.perf_flag & SCE_PERF_HQ_NORMALS) != 0 ||
GPU_use_hq_normals_workaround();
const bool override_single_mat = mesh_render_mat_len_get(object, me) <= 1;
/* Create an array containing all the extractors that need to be executed. */
ExtractorRunDatas extractors;
MeshBufferList *mbuflist = &mbc.buff;
#define EXTRACT_ADD_REQUESTED(type, name) \
do { \
if (DRW_##type##_requested(mbuflist->type.name)) { \
const MeshExtract *extractor = mesh_extract_override_get( \
&extract_##name, do_hq_normals, override_single_mat); \
extractors.append(extractor); \
} \
} while (0)
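/* For example, `EXTRACT_ADD_REQUESTED(vbo, pos_nor)` checks `mbuflist->vbo.pos_nor`
 * and, when that buffer is requested, appends the (possibly overridden)
 * `extract_pos_nor` extractor to the run list. */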
EXTRACT_ADD_REQUESTED(vbo, pos_nor);
EXTRACT_ADD_REQUESTED(vbo, lnor);
EXTRACT_ADD_REQUESTED(vbo, uv);
EXTRACT_ADD_REQUESTED(vbo, tan);
EXTRACT_ADD_REQUESTED(vbo, sculpt_data);
EXTRACT_ADD_REQUESTED(vbo, orco);
EXTRACT_ADD_REQUESTED(vbo, edge_fac);
EXTRACT_ADD_REQUESTED(vbo, weights);
EXTRACT_ADD_REQUESTED(vbo, edit_data);
EXTRACT_ADD_REQUESTED(vbo, edituv_data);
EXTRACT_ADD_REQUESTED(vbo, edituv_stretch_area);
EXTRACT_ADD_REQUESTED(vbo, edituv_stretch_angle);
EXTRACT_ADD_REQUESTED(vbo, mesh_analysis);
EXTRACT_ADD_REQUESTED(vbo, fdots_pos);
EXTRACT_ADD_REQUESTED(vbo, fdots_nor);
EXTRACT_ADD_REQUESTED(vbo, fdots_uv);
EXTRACT_ADD_REQUESTED(vbo, fdots_edituv_data);
EXTRACT_ADD_REQUESTED(vbo, face_idx);
EXTRACT_ADD_REQUESTED(vbo, edge_idx);
EXTRACT_ADD_REQUESTED(vbo, vert_idx);
EXTRACT_ADD_REQUESTED(vbo, fdot_idx);
EXTRACT_ADD_REQUESTED(vbo, skin_roots);
for (int i = 0; i < GPU_MAX_ATTR; i++) {
EXTRACT_ADD_REQUESTED(vbo, attr[i]);
}
EXTRACT_ADD_REQUESTED(vbo, attr_viewer);
EXTRACT_ADD_REQUESTED(ibo, tris);
if (DRW_ibo_requested(mbuflist->ibo.lines_loose)) {
/* `ibo.lines_loose` requires the `ibo.lines` buffer. */
if (mbuflist->ibo.lines == nullptr) {
DRW_ibo_request(nullptr, &mbuflist->ibo.lines);
}
const MeshExtract *extractor = DRW_ibo_requested(mbuflist->ibo.lines) ?
&extract_lines_with_lines_loose :
&extract_lines_loose_only;
extractors.append(extractor);
}
else if (DRW_ibo_requested(mbuflist->ibo.lines)) {
const MeshExtract *extractor;
if (mbuflist->ibo.lines_loose != nullptr) {
/* Update `ibo.lines_loose` as it depends on `ibo.lines`. */
extractor = &extract_lines_with_lines_loose;
}
else {
extractor = &extract_lines;
}
extractors.append(extractor);
}
EXTRACT_ADD_REQUESTED(ibo, points);
EXTRACT_ADD_REQUESTED(ibo, fdots);
EXTRACT_ADD_REQUESTED(ibo, lines_paint_mask);
EXTRACT_ADD_REQUESTED(ibo, lines_adjacency);
EXTRACT_ADD_REQUESTED(ibo, edituv_tris);
EXTRACT_ADD_REQUESTED(ibo, edituv_lines);
EXTRACT_ADD_REQUESTED(ibo, edituv_points);
EXTRACT_ADD_REQUESTED(ibo, edituv_fdots);
#undef EXTRACT_ADD_REQUESTED
if (extractors.is_empty()) {
return;
}
#ifdef DEBUG_TIME
double rdata_start = PIL_check_seconds_timer();
#endif
MeshRenderData *mr = mesh_render_data_create(
object, me, is_editmode, is_paint_mode, is_mode_active, obmat, do_final, do_uvedit, ts);
mr->use_hide = use_hide;
mr->use_subsurf_fdots = mr->me && !mr->me->runtime->subsurf_face_dot_tags.is_empty();
mr->use_final_mesh = do_final;
#ifdef DEBUG_TIME
double rdata_end = PIL_check_seconds_timer();
#endif
eMRIterType iter_type = extractors.iter_types();
eMRDataType data_flag = extractors.data_types();
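/* All extraction tasks are linked to this node with task-graph edges, so the mesh
 * render data (covering the iteration and data requirements gathered above) is built
 * before any extractor runs. */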
TaskNode *task_node_mesh_render_data = mesh_extract_render_data_node_create(
task_graph, *mr, mbc, iter_type, data_flag);
/* Simple heuristic: only spawn worker threads when the mesh has enough loops for
 * splitting the work into ranges to pay off. */
const bool use_thread = (mr->loop_len + mr->loop_loose_len) > MIN_RANGE_LEN;
if (use_thread) {
/* First run the requested extractors that do not support asynchronous ranges. */
for (const ExtractorRunData &run_data : extractors) {
const MeshExtract *extractor = run_data.extractor;
if (!extractor->use_threading) {
ExtractorRunDatas *single_threaded_extractors = new ExtractorRunDatas();
single_threaded_extractors->append(extractor);
TaskNode *task_node = extract_task_node_create(
task_graph, *mr, cache, single_threaded_extractors, mbuflist, false);
BLI_task_graph_edge_create(task_node_mesh_render_data, task_node);
}
}
/* Distribute the remaining extractors into ranges per core. */
ExtractorRunDatas *multi_threaded_extractors = new ExtractorRunDatas();
extractors.filter_threaded_extractors_into(*multi_threaded_extractors);
if (!multi_threaded_extractors->is_empty()) {
TaskNode *task_node = extract_task_node_create(
task_graph, *mr, cache, multi_threaded_extractors, mbuflist, true);
BLI_task_graph_edge_create(task_node_mesh_render_data, task_node);
}
else {
/* No tasks were created; free the extractors list here. */
delete multi_threaded_extractors;
}
}
else {
/* Run all requests on the same thread. */
ExtractorRunDatas *extractors_copy = new ExtractorRunDatas(extractors);
TaskNode *task_node = extract_task_node_create(
task_graph, *mr, cache, extractors_copy, mbuflist, false);
BLI_task_graph_edge_create(task_node_mesh_render_data, task_node);
}
/* Trigger the sub-graph for this mesh. */
BLI_task_graph_node_push_work(task_node_mesh_render_data);
#ifdef DEBUG_TIME
BLI_task_graph_work_and_wait(task_graph);
double end = PIL_check_seconds_timer();
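/* Exponentially weighted moving averages (each new sample weighted 0.05) smooth the
 * printed timings over successive frames. */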
static double avg = 0;
static double avg_fps = 0;
static double avg_rdata = 0;
static double end_prev = 0;
if (end_prev == 0) {
end_prev = end;
}
avg = avg * 0.95 + (end - rdata_end) * 0.05;
avg_fps = avg_fps * 0.95 + (end - end_prev) * 0.05;
avg_rdata = avg_rdata * 0.95 + (rdata_end - rdata_start) * 0.05;
printf(
"rdata %.0fms iter %.0fms (frame %.0fms)\n", avg_rdata * 1000, avg * 1000, avg_fps * 1000);
end_prev = end;
#endif
}
/** \} */
/* ---------------------------------------------------------------------- */
/** \name Subdivision Extract Loop
* \{ */
void mesh_buffer_cache_create_requested_subdiv(MeshBatchCache &cache,
MeshBufferCache &mbc,
DRWSubdivCache &subdiv_cache,
MeshRenderData &mr)
{
/* Create an array containing all the extractors that need to be executed. */
ExtractorRunDatas extractors;
MeshBufferList *mbuflist = &mbc.buff;
#define EXTRACT_ADD_REQUESTED(type, name) \
do { \
if (DRW_##type##_requested(mbuflist->type.name)) { \
const MeshExtract *extractor = &extract_##name; \
extractors.append(extractor); \
} \
} while (0)
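/* As a worked example, `EXTRACT_ADD_REQUESTED(ibo, tris)` expands to roughly:
 *   if (DRW_ibo_requested(mbuflist->ibo.tris)) {
 *     extractors.append(&extract_tris);
 *   }
 */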
/* The order in which extractors are added to the list matters somewhat, as some buffers are
* reused when building others. */
EXTRACT_ADD_REQUESTED(ibo, tris);
/* Orcos are extracted at the same time as positions. */
if (DRW_vbo_requested(mbuflist->vbo.pos_nor) || DRW_vbo_requested(mbuflist->vbo.orco)) {
extractors.append(&extract_pos_nor);
}
EXTRACT_ADD_REQUESTED(vbo, lnor);
for (int i = 0; i < GPU_MAX_ATTR; i++) {
EXTRACT_ADD_REQUESTED(vbo, attr[i]);
}
/* We use only one extractor for face dots, as the work is done in a single compute shader. */
if (DRW_vbo_requested(mbuflist->vbo.fdots_nor) || DRW_vbo_requested(mbuflist->vbo.fdots_pos) ||
DRW_ibo_requested(mbuflist->ibo.fdots))
{
extractors.append(&extract_fdots_pos);
}
if (DRW_ibo_requested(mbuflist->ibo.lines_loose)) {
/* `ibo.lines_loose` requires the `ibo.lines` buffer. */
if (mbuflist->ibo.lines == nullptr) {
DRW_ibo_request(nullptr, &mbuflist->ibo.lines);
}
const MeshExtract *extractor = DRW_ibo_requested(mbuflist->ibo.lines) ?
&extract_lines_with_lines_loose :
&extract_lines_loose_only;
extractors.append(extractor);
}
else if (DRW_ibo_requested(mbuflist->ibo.lines)) {
const MeshExtract *extractor;
if (mbuflist->ibo.lines_loose != nullptr) {
/* Update `ibo.lines_loose` as it depends on `ibo.lines`. */
extractor = &extract_lines_with_lines_loose;
}
else {
extractor = &extract_lines;
}
extractors.append(extractor);
}
EXTRACT_ADD_REQUESTED(ibo, edituv_points);
EXTRACT_ADD_REQUESTED(ibo, edituv_tris);
EXTRACT_ADD_REQUESTED(ibo, edituv_lines);
EXTRACT_ADD_REQUESTED(vbo, vert_idx);
EXTRACT_ADD_REQUESTED(vbo, edge_idx);
EXTRACT_ADD_REQUESTED(vbo, face_idx);
EXTRACT_ADD_REQUESTED(vbo, edge_fac);
EXTRACT_ADD_REQUESTED(ibo, points);
EXTRACT_ADD_REQUESTED(vbo, edit_data);
EXTRACT_ADD_REQUESTED(vbo, edituv_data);
/* Make sure UVs are computed before the edit-UV data. */
EXTRACT_ADD_REQUESTED(vbo, uv);
EXTRACT_ADD_REQUESTED(vbo, tan);
EXTRACT_ADD_REQUESTED(vbo, edituv_stretch_area);
EXTRACT_ADD_REQUESTED(vbo, edituv_stretch_angle);
EXTRACT_ADD_REQUESTED(ibo, lines_paint_mask);
EXTRACT_ADD_REQUESTED(ibo, lines_adjacency);
EXTRACT_ADD_REQUESTED(vbo, weights);
EXTRACT_ADD_REQUESTED(vbo, sculpt_data);
#undef EXTRACT_ADD_REQUESTED
if (extractors.is_empty()) {
return;
}
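/* Unlike the coarse-mesh path, there is no task graph here: the render data needed by
 * the extractors is updated up front and the extractors then run serially on the
 * current thread. */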
mesh_render_data_update_looptris(mr, MR_ITER_LOOPTRI, MR_DATA_LOOPTRI);
mesh_render_data_update_normals(mr, MR_DATA_TAN_LOOP_NOR);
mesh_render_data_update_loose_geom(
mr, mbc, MR_ITER_LOOSE_EDGE | MR_ITER_LOOSE_VERT, MR_DATA_LOOSE_GEOM);
DRW_subdivide_loose_geom(&subdiv_cache, &mbc);
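/* Scratch data for all extractors is carved from a single allocation instead of one
 * alloc/free per extractor. */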
void *data_stack = MEM_mallocN(extractors.data_size_total(), __func__);
uint32_t data_offset = 0;
for (const ExtractorRunData &run_data : extractors) {
const MeshExtract *extractor = run_data.extractor;
void *buffer = mesh_extract_buffer_get(extractor, mbuflist);
void *data = POINTER_OFFSET(data_stack, data_offset);
extractor->init_subdiv(subdiv_cache, mr, cache, buffer, data);
if (extractor->iter_subdiv_mesh || extractor->iter_subdiv_bm) {
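/* Each subdivided quad maps back to the coarse face it was generated from, passed
 * either as a `BMFace` pointer or as a mesh face index depending on the extraction
 * source. */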
int *subdiv_loop_face_index = subdiv_cache.subdiv_loop_face_index;
if (mr.extract_type == MR_EXTRACT_BMESH) {
for (uint i = 0; i < subdiv_cache.num_subdiv_quads; i++) {
/* Multiply by 4 to get the index of the quad's first loop, as `subdiv_loop_face_index`
* is indexed per subdivision loop. */
const int poly_origindex = subdiv_loop_face_index[i * 4];
const BMFace *efa = BM_face_at_index(mr.bm, poly_origindex);
extractor->iter_subdiv_bm(subdiv_cache, mr, data, i, efa);
}
}
else {
for (uint i = 0; i < subdiv_cache.num_subdiv_quads; i++) {
/* Multiply by 4 to get the index of the quad's first loop, as `subdiv_loop_face_index`
* is indexed per subdivision loop. */
const int poly_origindex = subdiv_loop_face_index[i * 4];
extractor->iter_subdiv_mesh(subdiv_cache, mr, data, i, poly_origindex);
}
}
}
if (extractor->iter_loose_geom_subdiv) {
extractor->iter_loose_geom_subdiv(subdiv_cache, mr, buffer, data);
}
if (extractor->finish_subdiv) {
extractor->finish_subdiv(subdiv_cache, mr, cache, buffer, data);
}
}
MEM_freeN(data_stack);
}
/** \} */
} // namespace blender::draw