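You picked a bad example, as Tudor pointed out. Spinning-disk hardware is subject to the physical constraints of moving platters and heads, so the most efficient read implementation is to read each block in order, which reduces the need to move the head or wait for the disk to rotate into position.
That said, some operating systems don't always store data contiguously on disk, and for those who remember, defragmentation could provide a disk performance boost if your OS / filesystem didn't do the job for you.
Since you mentioned wanting a program that would benefit, let me suggest a simple one: matrix addition.
Assuming you create one thread per core, you can trivially divide the rows of any two matrices being added into N groups (one group for each thread). Matrix addition, if you recall, works like this: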
A + B = C
or
[ a11, a12, a13 ]   [ b11, b12, b13 ]   [ (a11+b11), (a12+b12), (a13+b13) ]
[ a21, a22, a23 ] + [ b21, b22, b23 ] = [ (a21+b21), (a22+b22), (a23+b23) ]
[ a31, a32, a33 ]   [ b31, b32, b33 ]   [ (a31+b31), (a32+b32), (a33+b33) ]
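So, to spread this across N threads, we simply take the row number modulo the number of threads to get the "thread id" of the thread that will add that row.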
matrix with 20 rows across 3 threads
row % 3 == 0 (for rows 0, 3, 6, 9, 12, 15, and 18)
row % 3 == 1 (for rows 1, 4, 7, 10, 13, 16, and 19)
row % 3 == 2 (for rows 2, 5, 8, 11, 14, and 17)
// row 20 doesn't exist, because we number rows from 0
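The row assignment above can be reproduced in a few lines of Python. This is only a sketch: `rows_for_thread` is a hypothetical helper name chosen for illustration, not something from a library.

```python
def rows_for_thread(thread_id, num_rows, num_threads):
    """Return the row indices whose (row % num_threads) equals thread_id."""
    return [row for row in range(num_rows) if row % num_threads == thread_id]


if __name__ == "__main__":
    # 20 rows (numbered 0..19) spread across 3 threads
    for thread_id in range(3):
        print(thread_id, rows_for_thread(thread_id, 20, 3))
```

Because each row index maps to exactly one thread id, no two threads are ever handed the same row.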
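Now each thread "knows" which rows it should handle, and the "per row" results can be computed trivially, because they never cross into another thread's domain of computation.
All that is needed now is a "result" data structure that tracks when values have been computed; once the last value is set, the computation is complete. In this "fake" example of a matrix addition result with two threads, computing the answer takes roughly half the time it would with one.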
// For illustration only, the following assumes that threads are not rescheduled onto
// different cores. Real threads are scheduled across cores based on availability, and the
// scheduler tries to avoid unnecessary core migration of a running thread.
[ done, done, done ] // filled in at about the same time as row 2 (runs on core 3)
[ done, done, done ] // filled in at about the same time as row 1 (runs on core 1)
[ done, done, .... ] // filled in at about the same time as row 4 (runs on core 3)
[ done, ...., .... ] // filled in at about the same time as row 3 (runs on core 1)
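Putting the pieces together, a minimal sketch of the threaded addition might look like the following. The names `add_rows` and `threaded_matrix_add` are hypothetical, and instead of a structure that tracks each value as it is set, this version simply waits for every worker with `join()`, which is a simplification of the "result" tracking described above. Note also that in CPython the global interpreter lock prevents pure-Python arithmetic from running on several cores at once, so this shows how the work is partitioned rather than the speedup itself.

```python
import threading


def add_rows(a, b, c, thread_id, num_threads):
    """Add only the rows assigned to this thread (row % num_threads == thread_id)."""
    for row in range(thread_id, len(a), num_threads):
        c[row] = [x + y for x, y in zip(a[row], b[row])]


def threaded_matrix_add(a, b, num_threads):
    """Return a + b, splitting the rows across num_threads worker threads."""
    # The "result" structure: each slot is written by exactly one thread.
    c = [None] * len(a)
    workers = [
        threading.Thread(target=add_rows, args=(a, b, c, tid, num_threads))
        for tid in range(num_threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()  # once every worker has finished, the last value has been set
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
    b = [[10, 20, 30], [40, 50, 60], [70, 80, 90], [100, 110, 120]]
    print(threaded_matrix_add(a, b, 2))
```

No locking is needed here because the threads write to disjoint rows of `c`, which is exactly the property that makes matrix addition such an easy problem to parallelize.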
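More complex problems can be solved with multithreading, and different problems call for different techniques. I deliberately picked one of the simplest examples.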