<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/linux/find.h, branch v6.1.96</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.1.96</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.1.96'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2022-10-06T12:57:36Z</updated>
<entry>
<title>cpumask: Introduce for_each_cpu_andnot()</title>
<updated>2022-10-06T12:57:36Z</updated>
<author>
<name>Valentin Schneider</name>
<email>vschneid@redhat.com</email>
</author>
<published>2022-10-03T15:34:18Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5f75ff295c662c1c8fb9e1737e9dc3b9a1e7fb29'/>
<id>urn:sha1:5f75ff295c662c1c8fb9e1737e9dc3b9a1e7fb29</id>
<content type='text'>
for_each_cpu_and() is very convenient as it saves having to allocate a
temporary cpumask to store the result of cpumask_and(). The same issue
applies to cpumask_andnot() which doesn't actually need temporary storage
for iteration purposes.

Following what has been done for for_each_cpu_and(), introduce
for_each_cpu_andnot().

Signed-off-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
</content>
</entry>
<entry>
<title>lib/find_bit: Introduce find_next_andnot_bit()</title>
<updated>2022-10-06T12:57:36Z</updated>
<author>
<name>Valentin Schneider</name>
<email>vschneid@redhat.com</email>
</author>
<published>2022-10-03T15:34:17Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=90d482908eedd56f01a707325aa541cf9c40f936'/>
<id>urn:sha1:90d482908eedd56f01a707325aa541cf9c40f936</id>
<content type='text'>
In preparation of introducing for_each_cpu_andnot(), add a variant of
find_next_bit() that negate the bits in @addr2 when ANDing them with the
bits in @addr1.

Signed-off-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
</content>
</entry>
<entry>
<title>lib/find: optimize for_each() macros</title>
<updated>2022-10-01T17:22:58Z</updated>
<author>
<name>Yury Norov</name>
<email>yury.norov@gmail.com</email>
</author>
<published>2022-09-19T21:05:58Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=fdae96a3fc7f70eb8ff9619550d7fa604719626a'/>
<id>urn:sha1:fdae96a3fc7f70eb8ff9619550d7fa604719626a</id>
<content type='text'>
Moving an iterator of the macros inside conditional part of for-loop
helps to generate a better code. It had been first implemented in commit
7baac8b91f9871ba ("cpumask: make for_each_cpu_mask a bit smaller").

Now that cpumask for-loops are the aliases to bitmap loops, it's worth
to optimize them the same way.

Bloat-o-meter says:
add/remove: 8/12 grow/shrink: 147/592 up/down: 4876/-24416 (-19540)

Signed-off-by: Yury Norov &lt;yury.norov@gmail.com&gt;
</content>
</entry>
<entry>
<title>lib/bitmap: introduce for_each_set_bit_wrap() macro</title>
<updated>2022-10-01T17:22:57Z</updated>
<author>
<name>Yury Norov</name>
<email>yury.norov@gmail.com</email>
</author>
<published>2022-09-19T21:05:57Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4fe49b3b97c2640147c46519c2a6fdb06df34f5f'/>
<id>urn:sha1:4fe49b3b97c2640147c46519c2a6fdb06df34f5f</id>
<content type='text'>
Add for_each_set_bit_wrap() macro and use it in for_each_cpu_wrap(). The
new macro is based on __for_each_wrap() iterator, which is simpler and
smaller than cpumask_next_wrap().

Signed-off-by: Yury Norov &lt;yury.norov@gmail.com&gt;
</content>
</entry>
<entry>
<title>lib/find_bit: add find_next{,_and}_bit_wrap</title>
<updated>2022-10-01T17:22:57Z</updated>
<author>
<name>Yury Norov</name>
<email>yury.norov@gmail.com</email>
</author>
<published>2022-09-19T21:05:56Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6cc18331a987c4a29d66b9c4fd292587fba4d7bd'/>
<id>urn:sha1:6cc18331a987c4a29d66b9c4fd292587fba4d7bd</id>
<content type='text'>
The helper is better optimized for the worst case: in case of empty
cpumask, current code traverses 2 * size:

  next = cpumask_next_and(prev, src1p, src2p);
  if (next &gt;= nr_cpu_ids)
  	next = cpumask_first_and(src1p, src2p);

At bitmap level we can stop earlier after checking 'size + offset' bits.

Signed-off-by: Yury Norov &lt;yury.norov@gmail.com&gt;
</content>
</entry>
<entry>
<title>cpumask: switch for_each_cpu{,_not} to use for_each_bit()</title>
<updated>2022-10-01T17:22:57Z</updated>
<author>
<name>Yury Norov</name>
<email>yury.norov@gmail.com</email>
</author>
<published>2022-09-19T21:05:55Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=33e67710beda78aed38a2fe10be6088d4aeb1c53'/>
<id>urn:sha1:33e67710beda78aed38a2fe10be6088d4aeb1c53</id>
<content type='text'>
The difference between for_each_cpu() and for_each_set_bit()
is that the latter uses cpumask_next() instead of find_next_bit(),
and so calls cpumask_check().

This check is useless because the iterator value is not provided by
user. It generates false-positives for the very last iteration
of for_each_cpu().

Signed-off-by: Yury Norov &lt;yury.norov@gmail.com&gt;
</content>
</entry>
<entry>
<title>lib: add find_nth{,_and,_andnot}_bit()</title>
<updated>2022-09-26T19:19:12Z</updated>
<author>
<name>Yury Norov</name>
<email>yury.norov@gmail.com</email>
</author>
<published>2022-09-18T03:07:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3cea8d475327756066e2a54f0b651bb7284dd448'/>
<id>urn:sha1:3cea8d475327756066e2a54f0b651bb7284dd448</id>
<content type='text'>
Kernel lacks for a function that searches for Nth bit in a bitmap.
Usually people do it like this:
	for_each_set_bit(bit, mask, size)
		if (n-- == 0)
			return bit;

We can do it more efficiently, if we:
1. find a word containing Nth bit, using hweight(); and
2. find the bit, using a helper fns(), that works similarly to
   __ffs() and ffz().

fns() is implemented as a simple loop. For x86_64, there's PDEP instruction
to do that: ret = clz(pdep(1 &lt;&lt; idx, num)). However, for large bitmaps the
most of improvement comes from using hweight(), so I kept fns() simple.

New find_nth_bit() is ~70 times faster on x86_64/kvm in find_bit benchmark:
find_nth_bit:                  7154190 ns,  16411 iterations
for_each_bit:                505493126 ns,  16315 iterations

With all that, a family of 3 new functions is added, and used where
appropriate in the following patches.

Signed-off-by: Yury Norov &lt;yury.norov@gmail.com&gt;
</content>
</entry>
<entry>
<title>lib/find_bit: optimize find_next_bit() functions</title>
<updated>2022-09-21T19:21:32Z</updated>
<author>
<name>Yury Norov</name>
<email>yury.norov@gmail.com</email>
</author>
<published>2022-09-15T02:07:29Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=e79864f3164c573afce09ec4b72b75ebe061c14d'/>
<id>urn:sha1:e79864f3164c573afce09ec4b72b75ebe061c14d</id>
<content type='text'>
Over the past couple years, the function _find_next_bit() was extended
with parameters that modify its behavior to implement and- zero- and le-
flavors. The parameters are passed at compile time, but current design
prevents a compiler from optimizing out the conditionals.

As find_next_bit() API grows, I expect that more parameters will be added.
Current design would require more conditional code in _find_next_bit(),
which would bloat the helper even more and make it barely readable.

This patch replaces _find_next_bit() with a macro FIND_NEXT_BIT, and adds
a set of wrappers, so that the compile-time optimizations become possible.

The common logic is moved to the new macro, and all flavors may be
generated by providing a FETCH macro parameter, like in this example:

  #define FIND_NEXT_BIT(FETCH, MUNGE, size, start) ...

  find_next_xornot_and_bit(addr1, addr2, addr3, size, start)
  {
	return FIND_NEXT_BIT(addr1[idx] ^ ~addr2[idx] &amp; addr3[idx],
				/* nop */, size, start);
  }

The FETCH may be of any complexity, as soon as it only refers the bitmap(s)
and an iterator idx.

MUNGE is here to support _le code generation for BE builds. May be
empty.

I ran find_bit_benchmark 16 times on top of 6.0-rc2 and 16 times on top
of 6.0-rc2 + this series. The results for kvm/x86_64 are:

                      v6.0-rc2  Optimized       Difference  Z-score
Random dense bitmap         ns         ns        ns      %
find_next_bit:          787735     670546    117189   14.9     3.97
find_next_zero_bit:     777492     664208    113284   14.6    10.51
find_last_bit:          830925     687573    143352   17.3     2.35
find_first_bit:        3874366    3306635    567731   14.7     1.84
find_first_and_bit:   40677125   37739887   2937238    7.2     1.36
find_next_and_bit:      347865     304456     43409   12.5     1.35

Random sparse bitmap
find_next_bit:           19816      14021      5795   29.2     6.10
find_next_zero_bit:    1318901    1223794     95107    7.2     1.41
find_last_bit:           14573      13514      1059    7.3     6.92
find_first_bit:        1313321    1249024     64297    4.9     1.53
find_first_and_bit:       8921       8098       823    9.2     4.56
find_next_and_bit:        9796       7176      2620   26.7     5.39

Where the statistics is significant (z-score &gt; 3), the improvement
is ~15%.

According to the bloat-o-meter, the Image size is 10-11K less:

x86_64/defconfig:
add/remove: 32/14 grow/shrink: 61/782 up/down: 6344/-16521 (-10177)

arm64/defconfig:
add/remove: 3/2 grow/shrink: 50/714 up/down: 608/-11556 (-10948)

Suggested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Yury Norov &lt;yury.norov@gmail.com&gt;
</content>
</entry>
<entry>
<title>lib/find_bit: create find_first_zero_bit_le()</title>
<updated>2022-09-21T19:17:18Z</updated>
<author>
<name>Yury Norov</name>
<email>yury.norov@gmail.com</email>
</author>
<published>2022-09-15T02:07:28Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=14a99e130f2758bc826a7e7a8bdf6f7400b54f0f'/>
<id>urn:sha1:14a99e130f2758bc826a7e7a8bdf6f7400b54f0f</id>
<content type='text'>
find_first_zero_bit_le() is an alias to find_next_zero_bit_le(),
despite that 'next' is known to be slower than 'first' version.

Now that we have common FIND_FIRST_BIT() macro helper, it's trivial
to implement find_first_zero_bit_le() as a real function.

Reviewed-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Signed-off-by: Yury Norov &lt;yury.norov@gmail.com&gt;
</content>
</entry>
<entry>
<title>include/linux/find: Fix documentation</title>
<updated>2022-06-03T13:52:57Z</updated>
<author>
<name>Anna-Maria Behnsen</name>
<email>anna-maria@linutronix.de</email>
</author>
<published>2022-04-11T15:05:55Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6d7131bd52b3e0dd068a0067e5584b3f36cf17eb'/>
<id>urn:sha1:6d7131bd52b3e0dd068a0067e5584b3f36cf17eb</id>
<content type='text'>
The order of the arguments in function documentation doesn't fit the
implementation. Change the documentation so that it corresponds to the
code. This prevent people to get confused when reading the documentation.

Signed-off-by: Anna-Maria Behnsen &lt;anna-maria@linutronix.de&gt;
Reviewed-by: Andy Shevchenko &lt;andriy.shevchenko@linux.intel.com&gt;
Signed-off-by: Yury Norov &lt;yury.norov@gmail.com&gt;
</content>
</entry>
</feed>
